A research framework designed to investigate how Large Language Models (LLMs) handle increasing amounts of irrelevant context, specifically distinguishing between the effects of raw context length and semantic interference.
"Context Collapse" occurs when a model's performance on a specific task degrades as the input prompt grows. This project aims to answer: Does the model fail because the prompt is long, or because the irrelevant context contains "distractors" that look like the target information?
- H1: Raw context length (random unrelated text) has a minimal impact on accuracy for modern LLMs.
- H2: Semantic similarity between the task and the irrelevant context drives significant performance degradation (distractor adoption).
- H3: Retrieval tasks are more susceptible to context collapse than reasoning or arithmetic tasks.
The framework evaluates models across a grid of conditions, crossing four task types with three noise conditions and scoring each trial on four metrics:
- Arithmetic: Multi-step math problems.
- Retrieval: Extracting specific facts from a cluttered context.
- Logic: Reasoning through deductive or inductive problems.
- Instruction Following: Adhering to strict formatting and behavioral constraints.
- Control: 0 tokens of noise.
- Random Unrelated: Noise blocks of 250, 1000, and 4000 words using unrelated text.
- Similar Irrelevant: Noise blocks of 250, 1000, and 4000 words using semantically similar but incorrect distractors.
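As a rough illustration of how the task types and noise conditions above combine, the grid can be enumerated as below. This is a minimal sketch, not the project's actual generator: the identifiers (`TASKS`, `NOISE_TYPES`, `build_grid`) and the string labels are hypothetical, and it assumes one control cell per task.

```python
from itertools import product

# Hypothetical labels for the task types and noise conditions listed above.
TASKS = ["arithmetic", "retrieval", "logic", "instruction_following"]
NOISE_TYPES = ["random_unrelated", "similar_irrelevant"]
NOISE_LENGTHS = [250, 1000, 4000]  # noise block sizes in words


def build_grid():
    """Return every (task, noise_type, noise_words) cell, including the control."""
    cells = []
    for task in TASKS:
        # Assumption: each task gets exactly one no-noise control cell.
        cells.append((task, "control", 0))
        for noise_type, words in product(NOISE_TYPES, NOISE_LENGTHS):
            cells.append((task, noise_type, words))
    return cells


grid = build_grid()
print(len(grid))  # 4 tasks * (1 control + 2 noise types * 3 lengths) = 28
```

Under these assumptions each task is run in seven conditions, so comparing the two noise types at a fixed length isolates semantic interference from raw length.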
- Correctness: Automated scoring of answer accuracy.
- Compliance: Binary check for adherence to output format (e.g., "answer in one word").
- Latency: Wall-clock time per trial.
- Distractor Adoption: Detects if the model's incorrect answer was sourced from the noise block.
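The four metrics above could be computed per trial roughly as follows. This is a sketch under stated assumptions rather than the framework's scorer: it assumes exact-match grading after light normalization, a "one word" format constraint, and that the distractor strings planted in the noise block are known. The names `normalize`, `score_trial`, and `timed` are hypothetical.

```python
import re
import time


def normalize(text):
    """Lowercase and drop punctuation so comparisons ignore surface form."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def score_trial(answer, expected, distractors, one_word=True):
    """Score one model answer for correctness, compliance, and distractor adoption."""
    ans = normalize(answer)
    correct = ans == normalize(expected)
    # Compliance here means only the "answer in one word" constraint.
    compliant = (len(ans.split()) == 1) if one_word else True
    # Adoption: the answer is wrong AND matches a distractor from the noise block.
    adopted = (not correct) and any(normalize(d) == ans for d in distractors)
    return {"correct": correct, "compliant": compliant, "distractor_adopted": adopted}


def timed(fn, *args):
    """Run fn(*args) and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


result, latency = timed(score_trial, "Lyon.", "Paris", ["Lyon", "Nice"])
print(result, f"{latency:.6f}s")
```

Keeping distractor adoption separate from correctness lets the analysis distinguish a wrong answer that came from the noise block from any other kind of error.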
- Clone the repository.
- Create and activate a virtual environment.
- Install dependencies:
pip install -r requirements.txt
- Configure your .env file with OPENROUTER_API_KEY.
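For reference, a standard-library-only way to read the key from a .env file might look like the sketch below. The project may well load it differently (for example through a dotenv package); the helper names `load_env` and `get_api_key` are hypothetical, and only the variable name OPENROUTER_API_KEY comes from the step above.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Variables already set in the environment are left untouched.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_api_key():
    """Return the OpenRouter key, failing loudly if it is missing."""
    load_env()
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear message avoids starting a long experimental run that would only error on the first API call.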
To run a full experimental suite:
python main.py
This generates a unique Run ID (UUID) and saves raw results to data/runs/<uuid>.jsonl.
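Because the raw results are JSONL (one JSON object per line), a run file can be inspected without the report builder. The sketch below summarizes accuracy per condition; the record fields `task`, `noise_type`, `noise_words`, and `correct` are assumptions about the schema, not confirmed names, so adjust them to match the actual output.

```python
import json
from collections import defaultdict


def accuracy_by_condition(path):
    """Return {(task, noise_type, noise_words): accuracy} for one run file."""
    totals = defaultdict(lambda: [0, 0])  # condition -> [n_correct, n_trials]
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            # Hypothetical field names; the real schema may differ.
            key = (rec["task"], rec["noise_type"], rec["noise_words"])
            totals[key][0] += int(bool(rec["correct"]))
            totals[key][1] += 1
    return {key: c / n for key, (c, n) in totals.items()}
```

Calling `accuracy_by_condition` on a file under data/runs/ would give a quick per-cell view before running the full analysis.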
To generate a full analysis report (plots, CSVs, and summary articles) for the most recent run:
python -m src.analysis.report_builder --run latest
Outputs are saved to data/analysis/<uuid>/.
- src/: Core logic for task generation, model interfacing, and evaluation.
- src/analysis/: Statistical tools and report generators.
- data/:
  - tasks/: Generated task definitions.
  - runs/: Raw JSONL experimental data.
  - analysis/: Processed metrics and visualization artifacts.
- four_phases/: Research documentation (Questions, Design, Specs, Architecture).
License: MIT