transformerlens

Here are 25 public repositories matching this topic...

yash-srivastava19 / arrakis

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

python pypi transformer garcon interpretability explainable-ai mechanistic-interpretability anthropic transformerlens research-tooling

Updated Apr 14, 2026
Jupyter Notebook

FarnoushRJ / RelP

[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"

language-model circuit-analysis interpretability explainable-ai interpretable-machine-learning explainability llms mechanistic-interpretability transformerlens neurips-2025

Updated Nov 3, 2025
Python

krnel-ai / krnel-graph

Lightweight representation engineering dataflow operations for agent developers.

transformers pytorch dataflow parquet huggingface huggingface-transformers duckdb pylance mechanistic-interpretability lancedb transformerlens representation-engineering pragmatic-interpretability

Updated May 27, 2026
Python

09Catho / axon

Real-time 3D visualisation of SAE feature activations inside GPT-2, token by token

python threejs machine-learning deep-learning websocket 3d-visualization sparse-autoencoder fastapi gpt2 mechanistic-interpretability transformerlens llm-interpretability

Updated May 19, 2026
JavaScript

stchakwdev / Pinocchio-Vector-Test

Investigating whether language models encode anticipated social consequences in their activations. Uses a 2x2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.

language-models ai-safety interpretability deception-detection mechanistic-interpretability transformerlens

Updated Dec 18, 2025
Python

zilaeric / othello-gpt-probing

Training and exploration of linear probes into Othello-GPT by Li et al. (2022)

probe othello gpt interpretability explainability transformerlens

Updated Jun 29, 2023
Jupyter Notebook

designer-coderajay / glassbox-mech

Open-source EU AI Act Annex IV documentation toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a structured, hash-chained evidence package.

Updated Jun 15, 2026
Python

ashioyajotham / exploring_saes

Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.

sparse-autoencoders interpretability activation-functions neuron-activity wandb transformerlens mech-interp

Updated Nov 21, 2025
Python

lciric / does-quantization-kill-interpretability

Does Quantization Kill Interpretability? Scaling study across 5 models (124M-2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.

pythia quantization ai-safety sparse-autoencoder mechanistic-interpretability gptq transformerlens transformer-circuits induction-heads scaling-study

Updated Mar 11, 2026
Python

mduffster / epistemic_status

Evaluating how a model 'knowing what it knows' changes from base to instruct

pytorch llm mechanistic-interpretability transformerlens

Updated Jan 21, 2026
Python

mduffster / self-referent-test

Testing role-based pathways on small LLMs

research transformers pytorch ai-safety interpretability attention-mechanisms ai-alignment llm mechanistic-interpretability transformerlens

Updated Dec 11, 2025
Python

RithvikReddy0-0 / KAMUI

Knowledge Activation Mapping & Understanding Interface (KAMUI) — A Transformer Interpretability Framework Built From Scratch in PyTorch.

nlp deep-learning transformers pytorch artificial-intelligence gpt llm mechanistic-interpretability transformerlens

Updated Jun 8, 2026
Python

sanderblue / ai-mechanistic-interpretability

A small, extensible mechanistic-interpretability lab — logit lens & activation patching on GPT-2 and Qwen3 behind a unified backend adapter. Config-driven, tested, laptop-friendly.

pytorch interpretability gpt2 mechanistic-interpretability transformerlens qwen causal-tracing activation-patching nnsight qwen3 logit-lens

Updated Jun 19, 2026
Python

DipinDevSaji / mechinterp-probe

Mechanistic interpretability toolkit for comparing transformer activations, token shifts, and activation patching behaviour.

pytorch ai-safety gpt-2 streamlit mechanistic-interpretability transformerlens activation-patching llm-interpretability

Updated May 23, 2026
Python

azrabano23 / steering-audit

When does activation steering actually work? A reliability audit of steering vectors on GPT-2-small.

pytorch ai-safety interpretability ai-alignment gpt-2 llm mechanistic-interpretability transformerlens representation-engineering activation-steering

Updated Jun 8, 2026
Python

alexjackson1 / tx

A Flax-based library for examining transformers, based on TransformerLens.

deep-learning transformers flax jax transformerlens

Updated Feb 11, 2024
Python

ashioyajotham / greater-than-circuit

Reverse engineering the circuit responsible for the "greater than" capability in a language model

attention-mechanism ablation-studies mechanistic-interpretability transformerlens activation-patterns gpt-2-small

Updated May 7, 2026
HTML

msmichellesamson / residual-stream-sycophancy

Probing where in Pythia's residual stream the decision to be sycophantic is already 'decided', using linear classifiers on per-layer activations against a small labeled sycophancy dataset.

python scikit-learn pytorch matplotlib transformerlens interpretability-experiments

Updated May 4, 2026
Python

ydvlalit03 / Transformer--From-Scratch

Hands-on exploration of GPT-2 and transformer internals for text generation using TransformerLens — attention, mechanistic interpretability and sampling, explained step by step.

python nlp deep-learning transformers interpretability gpt-2 transformerlens

Updated Jun 4, 2026
Python

aragorn-w / logit-lens

Logit Lens terminal visualizer (nostalgebraist, 2020) — decodes GPT-2's intermediate layer predictions using the unembedding matrix, built with TransformerLens and Rich.

interpretability gpt-2 llm mechanistic-interpretability transformerlens logit-lens

Updated Mar 31, 2026
Python
