Motus Inference Guide

This guide covers running inference with Motus models.

Running Inference

1. RoboTwin 2.0 Simulation

For evaluation on RoboTwin 2.0 benchmark:

📖 See detailed guide: RoboTwin Inference Guide

Quick Start:

cd inference/robotwin/Motus

# Single task evaluation
bash eval.sh <task_name>

# Multi-task batch evaluation
bash auto_eval.sh

2. Real-World Inference (No Environment)

📖 See detailed guide: Real-World Inference Guide

We provide a minimal inference script that runs Motus on a single image without any robot environment.

⚠️ Important: Input images must be three-view concatenated (head + left/right wrist cameras). Use Multi-Camera Concatenation first if you have separate camera views.

With pre-encoded T5 embeddings (recommended, ~24GB VRAM):

# Step 1: Encode instruction to T5 embeddings (do this once)
python inference/real_world/Motus/encode_t5_instruction.py \
  --instruction "Pour water from kettle to flowers" \
  --output t5_embed.pt \
  --wan_path pretrained_models

# Step 2: Run inference with pre-encoded embeddings
python inference/real_world/Motus/inference_example.py \
  --model_config inference/real_world/Motus/utils/ac_one.yaml \
  --ckpt_dir pretrained_models/Motus \
  --wan_path pretrained_models \
  --image examples/first_frame.png \
  --instruction "Pour water from kettle to flowers" \
  --t5_embeds t5_embed.pt \
  --output examples/output_ac_one.png

Without pre-encoded T5 (encode on-the-fly, ~41GB VRAM):

python inference/real_world/Motus/inference_example.py \
  --model_config inference/real_world/Motus/utils/ac_one.yaml \
  --ckpt_dir pretrained_models/Motus \
  --wan_path pretrained_models \
  --image examples/first_frame.png \
  --instruction "Pour water from kettle to flowers" \
  --use_t5 \
  --output examples/output_ac_one.png

Output:

examples/output_ac_one.png: Grid of condition frame + predicted future frames
Console: Predicted action chunk with shape (action_chunk_size, action_dim)

Troubleshooting

Issue	Likely Cause	Solution
OOM during inference (~41GB)	T5 encoder loaded at runtime	Use pre-encoded T5 embeddings (`--t5_embeds`) to reduce to 24~25GB
Poor action predictions	Checkpoint mismatch	Ensure using correct config for your checkpoint

📖 Related Guides:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Motus Inference Guide

Running Inference

1. RoboTwin 2.0 Simulation

2. Real-World Inference (No Environment)

Troubleshooting

Uh oh!

FilesExpand file tree

INFERENCE.md

Latest commit

History

INFERENCE.md

File metadata and controls

Motus Inference Guide

Running Inference

1. RoboTwin 2.0 Simulation

2. Real-World Inference (No Environment)

Troubleshooting