
sanderblue/ai-mechanistic-interpretability


interp — a small mechanistic-interpretability lab


Two foundational techniques for looking inside a transformer — the logit lens and causal tracing (activation patching) — built on one model-adapter interface so the same experiment runs on GPT-2 (via TransformerLens) and Qwen3-0.6B (via nnsight), fast enough to iterate on a laptop.

GPT-2: P(" oxygen") crystallises in the upper-mid layers. Qwen3: the France→Paris fact lives on the subject token, then hands off to the final token.

All four figures with interpretation: docs/results.md · the techniques and backend engineering notes: docs/methodology.md


What it does

| Technique | Question it answers | Output |
|---|---|---|
| Logit lens | At each layer, what token is the model "currently betting on"? | a probability-vs-depth curve |
| Causal tracing | Where in the network does a specific fact live? | a [layer × token] recovery heatmap |

Both run on small models (gpt2, qwen3-0.6b), are driven by YAML configs, and write a figure plus a reproducible JSON result per run.
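To make the logit-lens idea concrete, here is a minimal pure-Python sketch on toy data. It is not the repo's implementation: the function names (`logit_lens_curve`, `first_top1_layer`), the unscaled layer norm, and the 3-dimensional "model" are all assumptions chosen for illustration. Each layer's residual vector is normalised, projected through the unembedding matrix, and softmaxed to read off the target token's probability at that depth.

```python
import math

def layer_norm(x, eps=1e-5):
    """Zero-mean, unit-variance normalisation (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(resid, unembed):
    """Decode one residual vector straight to a distribution over the vocab."""
    normed = layer_norm(resid)
    logits = [sum(w * h for w, h in zip(row, normed)) for row in unembed]
    return softmax(logits)

def logit_lens_curve(residuals, unembed, target_id):
    """Probability of the target token at each layer (the probability-vs-depth curve)."""
    return [decode(resid, unembed)[target_id] for resid in residuals]

def first_top1_layer(residuals, unembed, target_id):
    """First layer at which the target is the arg-max token, or None if never."""
    for layer, resid in enumerate(residuals):
        probs = decode(resid, unembed)
        if max(range(len(probs)), key=probs.__getitem__) == target_id:
            return layer
    return None

# Toy data (hypothetical): 3-token vocab, identity unembedding, 4 layers.
# The final-position residual drifts from favouring token 0 to favouring token 1.
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
residuals = [[3.0, 1.0, 0.0], [2.0, 1.5, 0.0], [1.0, 3.0, 0.0], [0.0, 4.0, 0.0]]

curve = logit_lens_curve(residuals, unembed, target_id=1)
layer = first_top1_layer(residuals, unembed, target_id=1)
print([round(p, 3) for p in curve], layer)
```

On a real model the residuals would come from a cached forward pass and the normalisation would be the model's own final norm; the toy version only shows the shape of the computation.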

This is the "learn the method cheaply" stage of a larger project on confabulation: when a model fabricates a fact it was never given, what differs internally between a fact it has and one it lacks? The repo builds the tools on small models first, structured so they scale to a larger model later (see Roadmap).

Install

Requires Python ≥ 3.10. Backends are optional extras — install only what you need.

python -m venv .venv && source .venv/bin/activate
pip install -e ".[tl,nnsight,dev]"     # or just ".[tl]" / ".[nnsight]"

Runs on CPU, Apple-silicon MPS, or CUDA — the device is auto-detected (cuda → mps → cpu).
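The cuda → mps → cpu preference order can be sketched as below. This is an illustration of the fallback logic, not the repo's own helper: `pick_device` and `detect_device` are hypothetical names, and the sketch falls back to "cpu" when PyTorch is not installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the preference order: CUDA first, then Apple-silicon MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

def detect_device() -> str:
    """Query PyTorch for what is available; assume CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)

print(detect_device())
```

Keeping the ordering in a pure function makes it testable without any accelerator present.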

Usage

A run is fully described by its config file:

python -m interp.run <experiment> --config <config.yaml> [--model ...] [--device ...] [--tag ...]
make e01   # logit lens on GPT-2          (downloads ~500MB the first time)
make e02   # causal tracing on GPT-2
python -m interp.run logit_lens   --config configs/logit_lens_qwen3.yaml   # same, on Qwen3
python -m interp.run causal_tracing --config configs/causal_tracing_qwen3.yaml

Outputs land in outputs/<experiment>/<tag>/ as result.json, the resolved config.yaml, and a .png figure — with captured provenance (device, dtype, seed, library versions, git SHA). You can also call the library directly:

from interp import load_model, logit_lens

model = load_model("qwen3-0.6b")          # or "gpt2"
result = logit_lens(model, "Water is made of hydrogen and", " oxygen")
print(result.crossover_layer)             # layer where " oxygen" becomes top-1
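For the provenance captured alongside each run, a rough standard-library sketch of how such a record might be gathered and written is shown here. The field names, the `collect_provenance` and `write_result` helpers, and the list of package names are assumptions for illustration; the repo's actual result.json schema may differ.

```python
import json
import platform
import subprocess
import tempfile
from importlib import metadata
from pathlib import Path

def git_sha():
    """Current commit hash, or None outside a git checkout / without git."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

def library_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def collect_provenance(device, dtype, seed,
                       libraries=("torch", "transformer-lens", "nnsight")):
    # Hypothetical schema: one flat dict that is easy to diff between runs.
    return {
        "device": device,
        "dtype": dtype,
        "seed": seed,
        "python": platform.python_version(),
        "libraries": {name: library_version(name) for name in libraries},
        "git_sha": git_sha(),
    }

def write_result(out_dir, result, provenance):
    """Write result.json into outputs/<experiment>/<tag>/ style directories."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "result.json"
    path.write_text(json.dumps({"result": result, "provenance": provenance}, indent=2))
    return path

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "outputs" / "logit_lens" / "demo"
    prov = collect_provenance(device="cpu", dtype="float32", seed=0)
    path = write_result(run_dir, {"crossover_layer": 2}, prov)
    loaded = json.loads(path.read_text())

print(sorted(loaded["provenance"]))
```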

Architecture

Three layers, each talking only to the one directly below it:

 experiments/      logit_lens · causal_tracing        (registered plug-ins)
       │ uses
 interp core       lenses · patching · metrics · viz  (backend-agnostic logic)
       │ via
 ModelAdapter      run_with_cache · forward · unembed (one interface)
       ├── TransformerLensAdapter   →  GPT-2
       └── NNsightAdapter           →  Qwen3 (and, later, larger models)

The adapter is the load-bearing idea: experiment code speaks in abstract sites ("the residual stream after layer 7") and never touches a backend- or architecture-specific hook name. An integration test loads GPT-2 under both backends and checks they produce the same next-token distribution (KL ≈ 0), so results don't depend on which backend produced them. The fiddly parts of making that true on each stack are written up in docs/methodology.md.
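A sketch of what "abstract sites" and the backend-agreement check could look like follows. The `ModelAdapter` method names come from the diagram above, but their signatures here are guesses, and the `Site` enum, the two mapping functions, and `kl_divergence` are hypothetical. The TransformerLens hook name and the Hugging Face module path are the usual ones for a residual stream after a decoder block, shown only to illustrate why experiment code should not hard-code them.

```python
import math
from enum import Enum
from typing import Protocol, Sequence

class Site(Enum):
    """Backend-neutral names for places in the network (illustrative subset)."""
    RESID_POST = "resid_post"  # residual stream after a given layer

class ModelAdapter(Protocol):
    """Assumed shape of the shared interface; signatures are guesses."""
    def forward(self, prompt: str) -> Sequence[float]: ...
    def run_with_cache(self, prompt: str, sites: Sequence[tuple]) -> dict: ...
    def unembed(self, resid: Sequence[float]) -> Sequence[float]: ...

def transformer_lens_hook_name(site: Site, layer: int) -> str:
    """Translate an abstract site into a TransformerLens hook-point name."""
    if site is Site.RESID_POST:
        return f"blocks.{layer}.hook_resid_post"
    raise ValueError(f"unmapped site: {site}")

def hf_module_path(site: Site, layer: int) -> str:
    """Translate the same site into a Hugging Face module path (Qwen-style layout)."""
    if site is Site.RESID_POST:
        return f"model.layers.{layer}"  # the block's output carries the residual
    raise ValueError(f"unmapped site: {site}")

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two next-token distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# "The residual stream after layer 7", expressed once and mapped per backend.
print(transformer_lens_hook_name(Site.RESID_POST, 7))
print(hf_module_path(Site.RESID_POST, 7))

# Two backends agreeing on the next-token distribution gives KL of zero.
same = kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
different = kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(same, different)
```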

Extending it

The common cases are one-file changes:

| To add… | Do this |
|---|---|
| a new experiment | drop a @register_experiment("name") class in experiments/, add a config |
| a new model | add one line to MODEL_REGISTRY in interp/models/__init__.py |
| a new architecture | add a Layout in interp/models/layouts.py |
| a new hook site | add a Site value + its mapping in each adapter |
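The registration pattern behind a decorator like @register_experiment is commonly a name-to-class dictionary. The sketch below shows that pattern under assumptions: only the decorator name comes from this README, while the `EXPERIMENTS` dict, the `run_experiment` helper, the `run()` method, and the toy `WordCount` experiment are hypothetical.

```python
EXPERIMENTS = {}  # name -> experiment class (hypothetical registry)

def register_experiment(name):
    """Class decorator that records an experiment under a CLI-visible name."""
    def decorator(cls):
        if name in EXPERIMENTS:
            raise ValueError(f"experiment {name!r} is already registered")
        EXPERIMENTS[name] = cls
        return cls  # return the class unchanged so it stays importable as usual
    return decorator

@register_experiment("word_count")
class WordCount:
    """Toy experiment: counts whitespace-separated words in the configured prompt."""
    def __init__(self, config):
        self.config = config

    def run(self):
        return {"n_words": len(self.config["prompt"].split())}

def run_experiment(name, config):
    """Look the experiment up by name, build it from its config, and run it."""
    try:
        cls = EXPERIMENTS[name]
    except KeyError:
        raise KeyError(f"unknown experiment {name!r}; known: {sorted(EXPERIMENTS)}")
    return cls(config).run()

print(run_experiment("word_count", {"prompt": "The capital of France is"}))
```

With this shape, adding an experiment means adding one decorated class; the dispatcher never needs editing.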

Testing

make test              # fast, hermetic unit tests (no model downloads) — this is CI
make test-integration  # model-backed tests on both backends (downloads weights)
make lint              # ruff check + format

CI runs only the unit tests and lint, with neither backend installed — the green badge reflects code correctness, not network luck.

Roadmap

Scoped to the two techniques and the core they sit on. Built so the larger arc slots in without changing that core:

  • Known-vs-absent trajectories — does a fact the model has sharpen gradually, while a fabricated one only commits late?
  • An abstention direction — fit a probe separating "I never mentioned that" from confabulation, then steer it.
  • Scale up — the nnsight adapter is already the API a larger model would use.

Notes & limitations

  • Prompts are chosen to be ones the model actually gets right; GPT-2-small's modest confidence on some facts is shown, not hidden.
  • Causal tracing uses interchange (resample) corruption, not ROME's Gaussian embedding noise: it needs only the patch primitive, so it behaves the same on both backends. (Noise is supported on TransformerLens.)
  • Some MPS kernels aren't bit-deterministic, so seeds fix sampling/corruption but not every float on Apple silicon.
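Causal tracing with interchange corruption can be illustrated on a toy network. Everything in this sketch is an assumption made for illustration: the scalar "residual stream", the single hand-off layer, and the recovery normalisation (patched minus corrupted, divided by clean minus corrupted) are not taken from the repo. The corrupted run swaps the subject token for a different one; each cell of the heatmap restores one clean state at one (layer, token) and measures how much of the clean output comes back.

```python
import math

N_LAYERS = 4
HANDOFF = 2  # toy assumption: at this layer the last position reads the subject position

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run(tokens, subject_pos, patch=None):
    """Forward pass. Returns (output probability, states[layer][token]).

    patch = (layer, token, value) overwrites one state after that layer runs.
    """
    h = list(tokens)
    states = []
    for layer in range(N_LAYERS):
        new = list(h)
        if layer == HANDOFF:
            new[-1] = h[-1] + h[subject_pos]  # information moves to the final token
        if patch is not None and patch[0] == layer:
            new[patch[1]] = patch[2]
        states.append(new)
        h = new
    return sigmoid(h[-1]), states

def causal_trace(clean, corrupt, subject_pos):
    """Recovery heatmap of shape [layer][token], 0 = no recovery, 1 = full."""
    p_clean, clean_states = run(clean, subject_pos)
    p_corrupt, _ = run(corrupt, subject_pos)
    heatmap = []
    for layer in range(N_LAYERS):
        row = []
        for tok in range(len(clean)):
            patch = (layer, tok, clean_states[layer][tok])
            p_patched, _ = run(corrupt, subject_pos, patch=patch)
            row.append((p_patched - p_corrupt) / (p_clean - p_corrupt))
        heatmap.append(row)
    return heatmap

# Interchange corruption: the subject (position 1) is replaced by a different subject.
clean = [0.0, 3.0, 0.0]
corrupt = [0.0, -3.0, 0.0]
heatmap = causal_trace(clean, corrupt, subject_pos=1)
for layer, row in enumerate(heatmap):
    print(layer, [round(v, 2) for v in row])
```

In this toy, recovery sits on the subject token before the hand-off layer and on the final token from the hand-off layer onward, which is the qualitative pattern a [layer × token] heatmap is meant to expose.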

References

  • nostalgebraist (2020), interpreting GPT: the logit lens.
  • Meng, Bau, Andonian, Belinkov (2022), Locating and Editing Factual Associations in GPT (ROME).
  • TransformerLens · nnsight

License

MIT — see LICENSE.

