Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
Updated Apr 14, 2026 (Jupyter Notebook)
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"
Lightweight representation engineering dataflow operations for agent developers.
Real-time 3D visualisation of SAE feature activations inside GPT-2, token by token
Investigating whether language models encode anticipated social consequences in their activations. Uses a 2x2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.
Training and exploring linear probes on Othello-GPT (Li et al., 2022)
Open-source EU AI Act Annex IV documentation toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a structured, hash-chained evidence package.
Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.
Does Quantization Kill Interpretability? Scaling study across 5 models (124M-2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.
Evaluating how a model's ability to 'know what it knows' changes from the base version to the instruct version
Testing role-based pathways on small LLMs
Knowledge Activation Mapping & Understanding Interface (KAMUI) — A Transformer Interpretability Framework Built From Scratch in PyTorch.
A small, extensible mechanistic-interpretability lab — logit lens & activation patching on GPT-2 and Qwen3 behind a unified backend adapter. Config-driven, tested, laptop-friendly.
Mechanistic interpretability toolkit for comparing transformer activations, token shifts, and activation patching behaviour.
When does activation steering actually work? A reliability audit of steering vectors on GPT-2-small.
A Flax-based library for examining transformers, based on TransformerLens.
Reverse engineering the circuit responsible for the "greater than" capability in a language model
Probing where in Pythia's residual stream the decision to be sycophantic is already 'decided', using linear classifiers on per-layer activations against a small labeled sycophancy dataset.
Hands-on exploration of GPT-2 and transformer internals for text generation using TransformerLens — attention, mechanistic interpretability and sampling, explained step by step.
Logit Lens terminal visualizer (nostalgebraist, 2020) — decodes GPT-2's intermediate layer predictions using the unembedding matrix, built with TransformerLens and Rich.