This guide covers running inference with Motus models.
For evaluation on RoboTwin 2.0 benchmark:
📖 See detailed guide: RoboTwin Inference Guide
Quick Start:
cd inference/robotwin/Motus
# Single task evaluation
bash eval.sh <task_name>
# Multi-task batch evaluation
bash auto_eval.sh📖 See detailed guide: Real-World Inference Guide
We provide a minimal inference script that runs Motus on a single image without any robot environment.
With pre-encoded T5 embeddings (recommended, ~24GB VRAM):
# Step 1: Encode instruction to T5 embeddings (do this once)
python inference/real_world/Motus/encode_t5_instruction.py \
--instruction "Pour water from kettle to flowers" \
--output t5_embed.pt \
--wan_path pretrained_models
# Step 2: Run inference with pre-encoded embeddings
python inference/real_world/Motus/inference_example.py \
--model_config inference/real_world/Motus/utils/ac_one.yaml \
--ckpt_dir pretrained_models/Motus \
--wan_path pretrained_models \
--image examples/first_frame.png \
--instruction "Pour water from kettle to flowers" \
--t5_embeds t5_embed.pt \
--output examples/output_ac_one.pngWithout pre-encoded T5 (encode on-the-fly, ~41GB VRAM):
python inference/real_world/Motus/inference_example.py \
--model_config inference/real_world/Motus/utils/ac_one.yaml \
--ckpt_dir pretrained_models/Motus \
--wan_path pretrained_models \
--image examples/first_frame.png \
--instruction "Pour water from kettle to flowers" \
--use_t5 \
--output examples/output_ac_one.pngOutput:
examples/output_ac_one.png: Grid of condition frame + predicted future frames- Console: Predicted action chunk with shape
(action_chunk_size, action_dim)
| Issue | Likely Cause | Solution |
|---|---|---|
| OOM during inference (~41GB) | T5 encoder loaded at runtime | Use pre-encoded T5 embeddings (--t5_embeds) to reduce to 24~25GB |
| Poor action predictions | Checkpoint mismatch | Ensure using correct config for your checkpoint |
📖 Related Guides: