This project explores the efficiency of Speculative Decoding by comparing standard Chain of Thought (CoT) reasoning against a token-optimized Chain of Draft (CoD) approach.
- Pipeline Automation: End-to-end training and benchmarking.
- Methodology Comparison: Direct comparison between CoT (verbose) and CoD (concise) reasoning styles.
- Robust Environment: Uses
uvfor dependency management.
We use uv to manage a single unified environment (env/, Python 3.11) holding both
Unsloth (training/quantization) and vLLM 0.19.1 (speculative-decoding inference).
Version pin note: vLLM is held at 0.19.1 because Unsloth 2026.4.8 requires
torch<2.11.0while vLLM 0.20 requirestorch==2.11.0. Seerequirements.txtfor the full rationale.
bash scripts/uv_setup_env.shCopy .env.example to .env and populate:
WANDB_API_KEY— used bysrc/train.pyfor logging to thepeft_cobW&B project.GEMINI_API_KEY— used bydata_generation/for Google Gemini Batch API calls.
Run the training pipeline to fine-tune the Target (Qwen3-14B) and Draft (Qwen3-0.6B) models. This script does NOT merge the models; it only saves the LoRA adapters.
bash scripts/train_pipeline.sh -t <type> -s <scenario>-t:cotorcod-s:easy(GSM8K),medium(MATH Lvl 1-2),hard(MATH Lvl 3-4)
Run the benchmark pipeline. This script handles ephemeral merging (Merge -> Benchmark -> Delete) to save space.
bash scripts/benchmark_pipeline.sh -t <type> -s <scenario> [-m <mode>] [-n <N>] [-k]-t,-s: same as above.-m:baseline,speculative, orboth(defaultboth).-n: cap the number of evaluation samples (default: all).-k: keep the temporary merged models on disk after benchmarking (default: delete).
Outputs land in outputs/: a per-sample CSV, a metrics JSON, and a comparison text file.
Generate baselines using an untrained vanilla model.
bash scripts/untrained_pipeline.sh <scenario>Run all experiments sequentially.
bash scripts/run_queue.sh > nightly_log.txt 2>&1.
├── CLAUDE.md # Auto-loaded guidance for Claude Code
├── configs/ # YAML configs (target_14b, draft_0-6b)
├── data/
│ ├── processed/ # Cleaned training data
│ ├── distilled/ # Synthetic data generated by Teacher
│ ├── tests/ # Held-out eval splits
│ └── raw/, analysis/ # Source dumps and analysis artifacts
├── data_generation/ # Gemini Batch API dataset generation
├── deployments/ # Dockerfile + docker-compose.yml
├── models/ # Saved LoRA adapters (and temp merged models)
├── outputs/ # Benchmark CSVs, metrics JSON, comparison text
├── scripts/
│ ├── train_pipeline.sh # Training pipeline (No merge)
│ ├── benchmark_pipeline.sh # Benchmark pipeline (Ephemeral merge)
│ ├── untrained_pipeline.sh # Baseline generation
│ ├── run_queue.sh # Run all 9 jobs sequentially
│ └── uv_setup_env.sh # Environment setup
├── src/
│ ├── train.py # SFTTrainer script
│ ├── distill_data.py # Data generation script (resumable)
│ ├── distill_untrained.py # Untrained-baseline data generation
│ ├── merge_adapter.py # Adapter merging script
│ └── answer_utils.py # Math answer extraction & comparison
└── tests/
└── benchmark.py # Speculative Decoding Benchmark
- Easy: GSM8K
- Medium: MATH (Levels 1-2)
- Hard: MATH (Levels 3-4)
For data generation details, see data_generation/README.md.
All benchmarks run on 1000 samples (easy/hard) and 655 samples (medium) using Qwen3-14B (target) + Qwen3-0.6B (draft), with K=5 speculative tokens.
| Scenario | CoT Speedup | CoD Speedup | CoT Acceptance | CoD Acceptance |
|---|---|---|---|---|
| Easy | 1.41x | 1.53x | 82.73% | 88.23% |
| Medium | 1.50x | 1.56x | 84.97% | 88.95% |
| Hard | 1.55x | 1.64x | 83.44% | 87.98% |
CoD consistently achieves higher acceptance rates (~88–89%) and throughput speedups than CoT (~83–85%), because the draft model more reliably predicts the concise, structured token sequences CoD produces.
| Scenario | Base (Untrained) | CoT Baseline | CoT + Spec | CoD Baseline | CoD + Spec |
|---|---|---|---|---|---|
| Easy | 24.9% | 75.5% | 75.7% | 92.2% | 92.1% |
| Medium | 21.2% | 77.6% | 77.9% | 82.7% | 83.1% |
| Hard | 11.2% | 50.5% | 50.8% | 61.9% | 61.7% |
Fine-tuning with LoRA (both CoT and CoD) produces large accuracy gains over untrained base models. CoD achieves higher accuracy than CoT across all difficulty levels, most notably on easy tasks (+16.7pp). Speculative decoding introduces negligible accuracy delta (<0.3pp) for both methods.
Fine-tuned models generate dramatically fewer tokens than the untrained base (~511 tokens), even in baseline (non-speculative) mode:
| Scenario | Base Tokens | CoT Tokens | CoD Tokens |
|---|---|---|---|
| Easy | 511 | 114 | 96 |
| Medium | 512 | 225 | 165 |
| Hard | 512 | 311 | 305 |
CoD reasoning is slightly more token-efficient than CoT (roughly 15–27% fewer tokens), and both are 4–5x more concise than the untrained model. This reduced sequence length is what allows CoD to attain a higher draft-acceptance rate and greater latency reduction (34–39% vs 29–35% for CoT).
- CoD > CoT for speculative decoding: shorter, more predictable token sequences yield higher acceptance rates, larger speedups, and better accuracy.
- Speculative decoding is accuracy-neutral: enabling draft speculation adds at most ±0.3pp accuracy change.
- Fine-tuning is essential: untrained base models plateau at ~11–25% accuracy; LoRA fine-tuning pushes that to 51–92%.
- Latency gains are substantial: CoD + spec decoding cuts TTFT by 35–39% and boosts throughput to 363–435 tokens/sec (vs 238–278 tok/s baseline).