Name	Name	Last commit message	Last commit date
parent directory ..
images	images
README.md	README.md
__init__.py	__init__.py
analysis.ipynb	analysis.ipynb
analyze_data.py	analyze_data.py
batch_client.py	batch_client.py
dataset_loader.py	dataset_loader.py
fix_vocab.py	fix_vocab.py
launch_generation.py	launch_generation.py
process_results.py	process_results.py
prompts.py	prompts.py
temp_helper.py	temp_helper.py

Chain of Thought (CoT) & Chain of Draft (CoD) Data Generation

This folder contains scripts to generate data using the gemini-3-pro-preview model on Vertex AI (Google Batch API).

The pipeline generates two datasets independently:

Chain of Thought (CoT): Contains the full, high-level reasoning (<thought>...</thought>). Saved as cot_*.jsonl.
Chain of Draft (CoD): Contains concise, summarized reasoning (<draft>...</draft>). Saved as cod_*.jsonl.

Prerequisites

Ensure you have the following installed:

pip install google-genai==1.56.0 datasets tqdm python-dotenv

Set your API key:

export GEMINI_API_KEY="your_api_key_here"

Running the Pipeline

We generally target 1000 samples per run. The --chain argument determines whether you are generating CoT (thought) or CoD (draft).

1. Easy Scenario (GSM8K)

For the GSM8K dataset, no complex filtering is required.

Generate Chain of Thought (CoT):

python data_generation/launch_generation.py \
  --chain thought \
  --dataset "gsm8k" \
  --file_suffix "easy" \
  --limit 1000

Generate Chain of Draft (CoD):

python data_generation/launch_generation.py \
  --chain draft \
  --dataset "gsm8k" \
  --file_suffix "easy" \
  --limit 1000

Note: Defaults to thought if --chain is omitted.

Launch Generation Arguments:

Argument	Description	Default
`--chain`	Type of chain: `thought` or `draft`	`thought`
`--dataset`	Hugging Face dataset name	`qwedsacf/competition_math`
`--filter`	Filter string (e.g., `level=Level 1,Level 2`)	`None`
`--file_suffix`	Output suffix (`cot_{suffix}.jsonl`)	`None`
`--limit`	Max number of samples to process	`1000`
`--dry-run`	Prepare batch file but do not submit	`False`
`--auto_fill`	Auto-select "Fill Gap" if existing < limit	`False`
`--auto_extend`	Auto-select "Extend" if existing data found	`False`

Process Results (Check Status & Download):

python data_generation/process_results.py \
  --chain thought \
  --dataset "gsm8k" \
  --file_suffix "easy" \

Or for draft:

python data_generation/process_results.py \
  --chain draft \
  --dataset "gsm8k" \
  --file_suffix "easy"

2. Medium Scenario (Competition Math)

For "Medium" difficulty, we filter the qwedsacf/competition_math dataset for Algebra and Precalculus (Levels 1-3).

Generate CoT:

python data_generation/launch_generation.py \
  --chain thought \
  --dataset "qwedsacf/competition_math" \
  --filter "level=Level 1,Level 2" \
  --filter "type=Algebra,Intermediate Algebra,Precalculus" \
  --file_suffix "medium" \
  --limit 1000

Generate CoD:

python data_generation/launch_generation.py \
  --chain draft \
  --dataset "qwedsacf/competition_math" \
  --filter "level=Level 1,Level 2" \
  --filter "type=Algebra,Intermediate Algebra,Precalculus" \
  --file_suffix "medium" \
  --limit 1000

3. Hard Scenario (Competition Math)

Filter for Level 4, 5.

Generate CoT:

python data_generation/launch_generation.py \
  --chain thought \
  --dataset "qwedsacf/competition_math" \
  --filter "level=Level 3,Level 4" \
  --filter "type=Algebra,Intermediate Algebra,Precalculus" \
  --file_suffix "hard" \
  --limit 1000

Generate CoD:

python data_generation/launch_generation.py \
  --chain draft \
  --dataset "qwedsacf/competition_math" \
  --filter "level=Level 3,Level 4" \
  --filter "type=Algebra,Intermediate Algebra,Precalculus" \
  --file_suffix "hard" \
  --limit 1000

Technical Details

Scripts Structure

launch_generation.py: Handles dataset loading, filtering, and submitting Batch API jobs. Creates a local state file in tmp/.
process_results.py: Checks batch status and downloads results. Saving logic handled here.

Output Files

CoT: data/cot_{dataset}_{suffix}.jsonl
CoD: data/cod_{dataset}_{suffix}.jsonl

Data Analysis

We provide a script to verify generated data against the ground truth.

Usage:

python data_generation/analyze_data.py \
  --dataset "qwedsacf/competition_math" \
  --suffix "medium"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Chain of Thought (CoT) & Chain of Draft (CoD) Data Generation

Prerequisites

Running the Pipeline

1. Easy Scenario (GSM8K)

2. Medium Scenario (Competition Math)

3. Hard Scenario (Competition Math)

Technical Details

Scripts Structure

Output Files

Data Analysis

FilesExpand file tree

data_generation

Directory actions

More options

Directory actions

More options

Latest commit

History

data_generation

Folders and files

parent directory

README.md

Chain of Thought (CoT) & Chain of Draft (CoD) Data Generation

Prerequisites

Running the Pipeline

1. Easy Scenario (GSM8K)

2. Medium Scenario (Competition Math)

3. Hard Scenario (Competition Math)

Technical Details

Scripts Structure

Output Files

Data Analysis