This folder contains scripts to generate data using the gemini-3-pro-preview model on Vertex AI (Google Batch API).
The pipeline generates two datasets independently:
- Chain of Thought (CoT): Contains the full, high-level reasoning (`<thought>...</thought>`). Saved as `cot_*.jsonl`.
- Chain of Draft (CoD): Contains concise, summarized reasoning (`<draft>...</draft>`). Saved as `cod_*.jsonl`.
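To make the two tag formats concrete, here is a minimal sketch of pulling the reasoning out of a response string. It assumes the tags appear literally in the model output; `extract_reasoning` is a hypothetical helper, not a function from this repository.

```python
import re
from typing import Optional


def extract_reasoning(text: str, chain: str) -> Optional[str]:
    """Return the text inside <thought>...</thought> or <draft>...</draft>, or None."""
    if chain not in ("thought", "draft"):
        raise ValueError(f"unknown chain type: {chain!r}")
    # DOTALL lets the reasoning span several lines; the non-greedy group
    # stops at the first closing tag.
    match = re.search(rf"<{chain}>(.*?)</{chain}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


sample = "<draft>48 / 2 = 24; 48 + 24 = 72</draft>\nAnswer: 72"
print(extract_reasoning(sample, "draft"))
```

Calling the helper with the other chain type on the same sample returns `None`, which is a simple way to detect a record saved under the wrong chain.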
Ensure you have the following installed:

    pip install google-genai==1.56.0 datasets tqdm python-dotenv

Set your API key:

    export GEMINI_API_KEY="your_api_key_here"

We generally target 1000 samples per run. The `--chain` argument determines whether you are generating CoT (`thought`) or CoD (`draft`).
For the GSM8K dataset, no complex filtering is required.
Generate Chain of Thought (CoT):

    python data_generation/launch_generation.py \
        --chain thought \
        --dataset "gsm8k" \
        --file_suffix "easy" \
        --limit 1000

Generate Chain of Draft (CoD):

    python data_generation/launch_generation.py \
        --chain draft \
        --dataset "gsm8k" \
        --file_suffix "easy" \
        --limit 1000

Note: Defaults to `thought` if `--chain` is omitted.
Launch Generation Arguments:
| Argument | Description | Default |
|---|---|---|
| `--chain` | Type of chain: `thought` or `draft` | `thought` |
| `--dataset` | Hugging Face dataset name | `qwedsacf/competition_math` |
| `--filter` | Filter string (e.g., `level=Level 1,Level 2`) | None |
| `--file_suffix` | Output suffix (`cot_{suffix}.jsonl`) | None |
| `--limit` | Max number of samples to process | 1000 |
| `--dry-run` | Prepare batch file but do not submit | False |
| `--auto_fill` | Auto-select "Fill Gap" if existing < limit | False |
| `--auto_extend` | Auto-select "Extend" if existing data found | False |
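For reference, the argument table could be expressed with `argparse` roughly as follows. This is a sketch reconstructed from the table only; the real parser in `launch_generation.py` may differ. In particular, letting a repeated `--filter` accumulate into a list is an assumption based on the commands that pass `--filter` twice.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Launch a batch generation run (sketch).")
    parser.add_argument("--chain", choices=["thought", "draft"], default="thought")
    parser.add_argument("--dataset", default="qwedsacf/competition_math")
    # Assumption: each --filter occurrence is appended to a list.
    parser.add_argument("--filter", action="append", default=None)
    parser.add_argument("--file_suffix", default=None)
    parser.add_argument("--limit", type=int, default=1000)
    # argparse exposes --dry-run as args.dry_run.
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--auto_fill", action="store_true")
    parser.add_argument("--auto_extend", action="store_true")
    return parser


args = build_parser().parse_args(["--dataset", "gsm8k", "--file_suffix", "easy"])
print(args.chain, args.limit, args.dry_run)
```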
Process Results (Check Status & Download):

    python data_generation/process_results.py \
        --chain thought \
        --dataset "gsm8k" \
        --file_suffix "easy"

Or for draft:

    python data_generation/process_results.py \
        --chain draft \
        --dataset "gsm8k" \
        --file_suffix "easy"

For "Medium" difficulty, we filter the `qwedsacf/competition_math` dataset by type (Algebra, Intermediate Algebra, Precalculus) and by level. The commands below select Level 1 and Level 2.
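The `--filter` strings have the form `field=value1,value2`. One plausible reading, sketched below, is that values within one filter are alternatives (OR) and separate filters must all hold (AND). That interpretation, the helper names `parse_filters` and `matches`, and the sample records are assumptions; the actual logic lives in `launch_generation.py`.

```python
from typing import Dict, Iterable, Set


def parse_filters(filters: Iterable[str]) -> Dict[str, Set[str]]:
    """Turn ['level=Level 1,Level 2', ...] into {'level': {'Level 1', 'Level 2'}, ...}."""
    parsed: Dict[str, Set[str]] = {}
    for item in filters:
        field, _, values = item.partition("=")
        allowed = {value.strip() for value in values.split(",") if value.strip()}
        parsed.setdefault(field.strip(), set()).update(allowed)
    return parsed


def matches(record: dict, parsed: Dict[str, Set[str]]) -> bool:
    """AND across fields, OR across the values allowed for each field."""
    return all(str(record.get(field)) in allowed for field, allowed in parsed.items())


# Hypothetical records using the dataset's `level` and `type` columns.
rows = [
    {"level": "Level 1", "type": "Algebra"},
    {"level": "Level 5", "type": "Algebra"},
    {"level": "Level 2", "type": "Geometry"},
]
parsed = parse_filters(["level=Level 1,Level 2", "type=Algebra,Intermediate Algebra,Precalculus"])
kept = [row for row in rows if matches(row, parsed)]
print(len(kept))  # 1
```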
Generate CoT:

    python data_generation/launch_generation.py \
        --chain thought \
        --dataset "qwedsacf/competition_math" \
        --filter "level=Level 1,Level 2" \
        --filter "type=Algebra,Intermediate Algebra,Precalculus" \
        --file_suffix "medium" \
        --limit 1000

Generate CoD:

    python data_generation/launch_generation.py \
        --chain draft \
        --dataset "qwedsacf/competition_math" \
        --filter "level=Level 1,Level 2" \
        --filter "type=Algebra,Intermediate Algebra,Precalculus" \
        --file_suffix "medium" \
        --limit 1000

For "Hard" difficulty, the commands below filter for Level 3 and Level 4.
Generate CoT:

    python data_generation/launch_generation.py \
        --chain thought \
        --dataset "qwedsacf/competition_math" \
        --filter "level=Level 3,Level 4" \
        --filter "type=Algebra,Intermediate Algebra,Precalculus" \
        --file_suffix "hard" \
        --limit 1000

Generate CoD:

    python data_generation/launch_generation.py \
        --chain draft \
        --dataset "qwedsacf/competition_math" \
        --filter "level=Level 3,Level 4" \
        --filter "type=Algebra,Intermediate Algebra,Precalculus" \
        --file_suffix "hard" \
        --limit 1000

- `launch_generation.py`: Handles dataset loading, filtering, and submitting Batch API jobs. Creates a local state file in `tmp/`.
- `process_results.py`: Checks batch status and downloads results. The saving logic is handled here.

Output files:

- CoT: `data/cot_{dataset}_{suffix}.jsonl`
- CoD: `data/cod_{dataset}_{suffix}.jsonl`
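Since the outputs are JSONL files (one JSON object per line), a quick way to check how many samples a run produced is to count the parseable lines. The sketch below uses only the standard library; the demo file name and its `id` field are made up for illustration and say nothing about the real record schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, reporting the line number on bad JSON."""
    with path.open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


def count_records(path: Path) -> int:
    return sum(1 for _ in iter_jsonl(path))


# Demo on a throwaway file so the sketch runs without real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "cot_demo.jsonl"
    demo.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    print(count_records(demo))  # 2
```

Comparing this count against `--limit` shows whether a run came up short.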
We provide a script to verify generated data against the ground truth.
Usage:

    python data_generation/analyze_data.py \
        --dataset "qwedsacf/competition_math" \
        --suffix "medium"
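As an illustration of what checking against the ground truth can involve, here is a minimal sketch for GSM8K, whose reference solutions end with a `#### <answer>` line. It is not the logic of `analyze_data.py`: the function names are hypothetical, the predicted answer is assumed to be available as a plain string, and `competition_math` solutions (which mark the answer with `\boxed{...}`) would need a different extractor.

```python
from typing import Optional


def gsm8k_final_answer(solution: str) -> Optional[str]:
    """Return the text after the last '####' marker in a GSM8K reference solution."""
    if "####" not in solution:
        return None
    return solution.rsplit("####", 1)[1].strip()


def normalize_number(text: str) -> Optional[float]:
    """Parse a numeric answer, ignoring '$', ',' and surrounding whitespace."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_correct(predicted: str, reference_solution: str) -> bool:
    reference = gsm8k_final_answer(reference_solution)
    if reference is None:
        return False
    pred_value = normalize_number(predicted)
    ref_value = normalize_number(reference)
    if pred_value is None or ref_value is None:
        # Fall back to exact string comparison for non-numeric answers.
        return predicted.strip() == reference
    return abs(pred_value - ref_value) < 1e-6


reference = "Natalia sold 48/2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"
print(is_correct("72", reference), is_correct("71", reference))  # True False
```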