Auto-generated from vision-core TaskFactory documentation.
Do not edit manually; run python scripts/sync_supported_model_types.py.
Source: https://github.com/olibartfast/vision-core
The TaskFactory supports the following model type strings:
Object Detection:
"yolo","yolov7e2e","yolov10","yolo26","yolov4"- YOLO-based variants"yolonas"- YOLO-NAS"rtdetr"- RT-DETR family (RT-DETR v1, v2, and v4; excludes v3; includes D-FINE and DEIM v1/v2)"rtdetrul"- RT-DETR (Ultralytics implementation)"rfdetr"- RF-DETR
Instance Segmentation:
"yoloseg"- YOLOv5/YOLOv8/YOLO11"yolov10seg"- YOLOv10"yolo26seg"- YOLO26"rfdetrseg"- RF-DETR
Classification:
"torchvision-classifier"- Torchvision models (ResNet, EfficientNet, etc.)"tensorflow-classifier"- TensorFlow/Keras models"vit-classifier"- Vision Transformers
Video Classification:
"videomae"- VideoMAE"vivit"- ViViT"timesformer"- TimeSformer
Optical Flow:
"raft"- RAFT optical flow
Pose Estimation:
"yolov8pose","yolov8-pose"- YOLOv8 pose (single-stage, returns bbox + keypoints)"yolo11pose","yolo11-pose"- YOLO11 pose"yolo26pose","yolo26-pose"- YOLO26 pose"yolov5pose","yolov5-pose"- YOLOv5 pose"vitpose"- ViTPose (top-down, heatmap-based)
Depth Estimation:
"depth_anything_v2","depth-anything-v2"- Depth Anything V2
Open-Vocabulary Detection:
"owlv2"- OWLv2 open-vocabulary detection"owlvit"- OWL-ViT compatible open-vocabulary detection"groundingdino"- Grounding DINO text-conditioned detection
Open-vocabulary models use text prompts supplied at runtime through TaskConfig::text_prompts. Tokenizer assets can be passed either as file paths (tokenizer_vocab_path, tokenizer_merges_path) or preloaded text blobs (tokenizer_vocab_json, tokenizer_merges_text).
The expected ONNX contract is:
- Inputs:
pixel_values,input_ids,attention_mask - Outputs:
logits,pred_boxes, and optionalobjectness_logits
Results are returned as OpenVocabDetection entries containing bbox, score, prompt_index, and resolved label.
For export details, see export/open_vocab_detection/OWLv2.md.
Image Understanding (VLM):
"gemma4","imageunderstanding"- Vision-language model image captioning / Q&A via llama.cpp backend
Input contract: preprocess() returns two tensors — [0] UTF-8 prompt bytes, [1] raw RGB pixels with an 8-byte header [uint32 width LE][uint32 height LE][H×W×3 bytes]. When no image is provided only tensor [0] is returned (text-only mode). Output is a UTF-8 string returned as float-encoded bytes (one float per byte value).
Requires the llama.cpp LLAMACPP backend with an mmproj (vision projector) GGUF.
For model download and setup details, see export/image_understanding/ImageUnderstanding.md.
Gaussian Splatting:
"lgm","lgm-mini"- LGM (Large Gaussian Model)"grm"- GRM"gaussiansplatting", any string containing"splat"- generic alias