Supported Model Types

Auto-generated from the vision-core TaskFactory documentation. Do not edit manually; to regenerate, run python scripts/sync_supported_model_types.py.

Source: https://github.com/olibartfast/vision-core

The TaskFactory supports the following model type strings:

Object Detection:

"yolo", "yolov7e2e", "yolov10", "yolo26", "yolov4" - YOLO-based variants
"yolonas" - YOLO-NAS
"rtdetr" - RT-DETR family (RT-DETR v1, v2, and v4; excludes v3; includes D-FINE and DEIM v1/v2)
"rtdetrul" - RT-DETR (Ultralytics implementation)
"rfdetr" - RF-DETR

Instance Segmentation:

"yoloseg" - YOLOv5/YOLOv8/YOLO11
"yolov10seg" - YOLOv10
"yolo26seg" - YOLO26
"rfdetrseg" - RF-DETR

Classification:

"torchvision-classifier" - Torchvision models (ResNet, EfficientNet, etc.)
"tensorflow-classifier" - TensorFlow/Keras models
"vit-classifier" - Vision Transformers

Video Classification:

"videomae" - VideoMAE
"vivit" - ViViT
"timesformer" - TimeSformer

Optical Flow:

"raft" - RAFT optical flow

Pose Estimation:

"yolov8pose", "yolov8-pose" - YOLOv8 pose (single-stage, returns bbox + keypoints)
"yolo11pose", "yolo11-pose" - YOLO11 pose
"yolo26pose", "yolo26-pose" - YOLO26 pose
"yolov5pose", "yolov5-pose" - YOLOv5 pose
"vitpose" - ViTPose (top-down, heatmap-based)

Depth Estimation:

"depth_anything_v2", "depth-anything-v2" - Depth Anything V2

Open-Vocabulary Detection:

"owlv2" - OWLv2 open-vocabulary detection
"owlvit" - OWL-ViT compatible open-vocabulary detection
"groundingdino" - Grounding DINO text-conditioned detection

Open-vocabulary models use text prompts supplied at runtime through TaskConfig::text_prompts. Tokenizer assets can be passed either as file paths (tokenizer_vocab_path, tokenizer_merges_path) or preloaded text blobs (tokenizer_vocab_json, tokenizer_merges_text).
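The either/or choice between file paths and preloaded blobs can be illustrated with a small Python sketch. This is not vision-core code: resolve_tokenizer_assets is a hypothetical helper, and both the precedence rule (a blob wins over a path) and the merges layout (one merge per line, optional "#version" header) are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def resolve_tokenizer_assets(
    vocab_path: Optional[str] = None,
    merges_path: Optional[str] = None,
    vocab_json: Optional[str] = None,
    merges_text: Optional[str] = None,
) -> Tuple[Dict[str, int], List[str]]:
    """Return (vocab, merges) from file paths or from preloaded text blobs."""
    # Assumption: a preloaded blob takes precedence over a path if both are given.
    if vocab_json is None:
        if vocab_path is None:
            raise ValueError("need a vocab path or a preloaded vocab JSON blob")
        vocab_json = Path(vocab_path).read_text(encoding="utf-8")
    if merges_text is None:
        if merges_path is None:
            raise ValueError("need a merges path or a preloaded merges text blob")
        merges_text = Path(merges_path).read_text(encoding="utf-8")

    vocab = json.loads(vocab_json)

    # Assumption: common BPE merges layout with an optional "#version" header line.
    lines = merges_text.splitlines()
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    merges = [line for line in lines if line.strip()]
    return vocab, merges
```

Preloaded blobs are convenient when assets are bundled inside an application rather than shipped as files on disk.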

The expected ONNX contract is:

Inputs: pixel_values, input_ids, attention_mask
Outputs: logits, pred_boxes, and optional objectness_logits
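Before loading an exported model, it can help to check its tensor names against the contract above. The following is a minimal sketch; check_open_vocab_contract is a hypothetical helper and not part of vision-core. It only compares names and says nothing about shapes or dtypes, which this document does not specify.

```python
from typing import Iterable, List

REQUIRED_INPUTS = ("pixel_values", "input_ids", "attention_mask")
REQUIRED_OUTPUTS = ("logits", "pred_boxes")
# objectness_logits is optional, so its absence is not reported.


def check_open_vocab_contract(
    input_names: Iterable[str], output_names: Iterable[str]
) -> List[str]:
    """Return a list of problems; an empty list means all required names exist."""
    inputs, outputs = set(input_names), set(output_names)
    problems = []
    for name in REQUIRED_INPUTS:
        if name not in inputs:
            problems.append(f"missing input: {name}")
    for name in REQUIRED_OUTPUTS:
        if name not in outputs:
            problems.append(f"missing output: {name}")
    return problems
```

With ONNX Runtime, the names could be collected from InferenceSession.get_inputs() and get_outputs(), each of which returns objects with a name attribute.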

Results are returned as OpenVocabDetection entries containing bbox, score, prompt_index, and resolved label.
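As a rough Python mirror of that result shape, the sketch below models an entry with the four listed fields. The dataclass and resolve_labels are illustrative stand-ins, not the library's definitions. It assumes the label is resolved by indexing the runtime text prompts with prompt_index, and it treats bbox as an opaque 4-tuple because the coordinate layout is not stated here.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # layout not specified in this document


@dataclass
class OpenVocabDetection:
    bbox: BBox
    score: float
    prompt_index: int
    label: str


def resolve_labels(
    raw: Sequence[Tuple[BBox, float, int]], text_prompts: Sequence[str]
) -> List[OpenVocabDetection]:
    """Attach a label to each (bbox, score, prompt_index) triple."""
    detections = []
    for bbox, score, prompt_index in raw:
        if not 0 <= prompt_index < len(text_prompts):
            raise IndexError(f"prompt_index {prompt_index} has no matching prompt")
        # Assumption: the resolved label is the prompt text at prompt_index.
        label = text_prompts[prompt_index]
        detections.append(OpenVocabDetection(bbox, score, prompt_index, label))
    return detections
```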

For export details, see export/open_vocab_detection/OWLv2.md.

Image Understanding (VLM):

"gemma4", "imageunderstanding" - Vision-language model image captioning / Q&A via llama.cpp backend

Input contract: preprocess() returns two tensors. Tensor [0] holds the UTF-8 prompt bytes. Tensor [1] holds the raw RGB image: an 8-byte header ([uint32 width LE][uint32 height LE]) followed by H×W×3 pixel bytes. When no image is provided, only tensor [0] is returned (text-only mode). The output is a UTF-8 string returned as float-encoded bytes (one float per byte value).
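The byte layout described above can be sketched in Python with the standard struct module. The function names are hypothetical and this is not the library's preprocessing code; it only illustrates the two little-endian uint32 header fields followed by the pixel bytes, plus decoding an output in which each float carries one byte value.

```python
import struct
from typing import Iterable, Tuple


def pack_image_tensor(width: int, height: int, rgb: bytes) -> bytes:
    """Prefix raw RGB pixels with [uint32 width LE][uint32 height LE]."""
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must contain exactly height * width * 3 bytes")
    return struct.pack("<II", width, height) + rgb


def unpack_image_tensor(buf: bytes) -> Tuple[int, int, bytes]:
    """Split a packed tensor back into (width, height, pixel bytes)."""
    width, height = struct.unpack_from("<II", buf, 0)
    pixels = buf[8:]
    if len(pixels) != width * height * 3:
        raise ValueError("pixel payload does not match the header dimensions")
    return width, height, pixels


def decode_float_encoded_text(values: Iterable[float]) -> str:
    """Turn one-float-per-byte output back into a UTF-8 string."""
    return bytes(int(v) for v in values).decode("utf-8")
```

For example, a 2×1 image packs to 14 bytes: 8 header bytes plus 6 pixel bytes.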

Requires the llama.cpp backend (LLAMACPP) with an mmproj (vision projector) GGUF.

For model download and setup details, see export/image_understanding/ImageUnderstanding.md.

Gaussian Splatting:

"lgm", "lgm-mini" - LGM (Large Gaussian Model)
"grm" - GRM
"gaussiansplatting", any string containing "splat" - generic alias
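Several entries above accept more than one spelling, and the Gaussian Splatting alias matches on a substring. A lookup along those lines might be sketched as follows. task_for_model_type and the category names it returns are hypothetical, only a subset of the listed types is covered, and the precedence (exact names first, then the "splat" substring rule) is an assumption rather than documented TaskFactory behavior.

```python
from typing import Optional

POSE_TYPES = {
    "yolov8pose", "yolov8-pose", "yolo11pose", "yolo11-pose",
    "yolo26pose", "yolo26-pose", "yolov5pose", "yolov5-pose", "vitpose",
}
DEPTH_TYPES = {"depth_anything_v2", "depth-anything-v2"}
SPLAT_TYPES = {"lgm", "lgm-mini", "grm", "gaussiansplatting"}


def task_for_model_type(model_type: str) -> Optional[str]:
    """Map a model type string to an illustrative task category name."""
    # Assumption: exact names are checked before the substring rule.
    if model_type in POSE_TYPES:
        return "pose_estimation"
    if model_type in DEPTH_TYPES:
        return "depth_estimation"
    if model_type in SPLAT_TYPES or "splat" in model_type:
        return "gaussian_splatting"
    return None
```

Because the substring rule is broad, any custom model type string that happens to contain "splat" would be routed to Gaussian Splatting under this sketch.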
