🧠 Needle in the Noise: Probing LLM Internal Robustness via KV Cache Corruption

"Can a Large Language Model 'ignore' a corrupted memory, or will it hallucinate into madness?"

📖 Abstract

Current research on LLM robustness focuses heavily on input prompts (e.g., prompt injection) or weight perturbation. However, little is known about the Internal Robustness of the model's transient memory—the KV Cache.

This project investigates the resilience of LLMs (the reported experiments use Qwen2.5; Llama-3 is a target for future support) against dynamic KV Cache corruption during decoding. By injecting noise, shuffling, or irrelevant context ("distractors") into the KV Cache at inference time, we aim to measure the "Soft-Masking" capability of the Attention mechanism.

Does the model collapse immediately, or can it filter out the "noise" and focus on the "needle" of its own reasoning?

🚀 Motivation

The "Soft-Masking" Hypothesis: We hypothesize that well-trained LLMs possess an inherent ability to suppress irrelevant or erroneous activation patterns in their KV Cache, similar to how they handle noisy input prompts.
Hardware Fault Tolerance: In scenarios like edge computing or cosmic ray bit-flips, can a model survive partial memory corruption without crashing?
Defense Against Attacks: Understanding how models react to internal noise could lead to new defense mechanisms against prompt injection or activation steering attacks.
Interpretability: By identifying which layers are most sensitive to KV corruption, we can better understand the semantic vs. syntactic roles of different transformer blocks.

🛠 Methodology

We implement a Runtime KV Cache Interceptor using PyTorch hooks. This allows us to modify the Key and Value matrices before the Attention computation step, without altering the model weights.
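As a rough illustration of the hook mechanism, here is a minimal sketch on a toy single-head attention module. `ToyAttention` and `make_zeroing_hook` are hypothetical names invented for this example, and hooking the `k_proj` / `v_proj` linear layers is an assumption about where an interceptor could attach, not a description of `corruptor.py`. It relies only on `register_forward_hook`, whose hook may return a tensor that replaces the module's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyAttention(nn.Module):
    """Single-head attention with separate q/k/v projections (toy stand-in)."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


def make_zeroing_hook(rate: float, generator: torch.Generator):
    """Build a forward hook that zeroes a random fraction of token positions."""

    def hook(module, inputs, output):
        # output has shape (batch, seq_len, dim); choose positions to corrupt.
        mask = torch.rand(output.shape[:-1], generator=generator) < rate
        corrupted = output.clone()
        corrupted[mask] = 0.0
        # Returning a tensor from a forward hook replaces the module output.
        return corrupted

    return hook


model = ToyAttention()
x = torch.randn(1, 8, 16)
gen = torch.Generator().manual_seed(0)

with torch.no_grad():
    clean = model(x)

    # Attach hooks to the key and value projections only.
    handles = [
        model.k_proj.register_forward_hook(make_zeroing_hook(0.5, gen)),
        model.v_proj.register_forward_hook(make_zeroing_hook(0.5, gen)),
    ]
    corrupted = model(x)

    # Removing the hooks restores the original behaviour; weights never changed.
    for handle in handles:
        handle.remove()
    restored = model(x)

weights_untouched = torch.equal(clean, restored)
```

Because the corruption lives in the hook rather than in the parameters, removing the handles is enough to get the clean model back between runs.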

The "Attacker" Types

We support three modes of KV Corruption:

Gaussian Noise: Replaces a percentage of KV pairs with random noise matching the layer's statistics. Tests numerical stability.
Zero Ablation (Sparsification): Zeroes out a percentage of KV pairs. Tests redundancy.
Distractor Injection (The "Hallucination" Test): Replaces a percentage of the current context's KV pairs with pre-computed KV pairs from a completely unrelated prompt (e.g., mixing a Math problem with a News article).
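The three modes above can be sketched on a plain array as follows. This is a framework-agnostic NumPy illustration, not the repository's implementation: `corrupt_kv` is a hypothetical helper, corruption is applied per token position, "the layer's statistics" is taken to mean the tensor's mean and standard deviation, and the distractor is assumed to have the same shape as the clean cache. All of these are assumptions.

```python
import numpy as np


def corrupt_kv(kv, rate, mode, rng, distractor=None):
    """Return (corrupted copy of `kv`, sorted corrupted indices).

    `kv` has shape (seq_len, dim); about `rate` of its positions are replaced.
    """
    out = kv.copy()
    seq_len, dim = kv.shape
    n_corrupt = int(round(rate * seq_len))
    idx = rng.choice(seq_len, size=n_corrupt, replace=False)

    if mode == "gaussian":
        # Assumption: noise matches the tensor's own mean and std.
        out[idx] = rng.normal(kv.mean(), kv.std(), size=(n_corrupt, dim))
    elif mode == "zeros":
        out[idx] = 0.0
    elif mode == "distractor":
        if distractor is None or distractor.shape != kv.shape:
            raise ValueError("distractor must have the same shape as kv")
        # Assumption: swap in the distractor's entries at the same positions.
        out[idx] = distractor[idx]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out, np.sort(idx)


rng = np.random.default_rng(0)
clean_kv = rng.normal(size=(20, 4))
unrelated_kv = rng.normal(size=(20, 4)) + 5.0  # stand-in for another prompt

zeroed, zero_idx = corrupt_kv(clean_kv, 0.25, "zeros", rng)
mixed, mix_idx = corrupt_kv(clean_kv, 0.5, "distractor", rng, unrelated_kv)
noisy, noise_idx = corrupt_kv(clean_kv, 0.1, "gaussian", rng)
```

Returning the corrupted indices alongside the array makes it easy to log exactly which positions were touched in each run.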

Evaluation

Task: GSM8K (Grade School Math) - Requires multi-step reasoning, making it highly sensitive to context loss.
Metric: Exact Match Accuracy & Semantic Drift (Embedding similarity between clean and corrupted outputs).
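A minimal sketch of how the two metrics could be computed, using only the standard library. The helper names are hypothetical, taking "the last number in the output" as the model's answer is an assumption, and the embedding model used for Semantic Drift is not specified here, so `cosine_similarity` simply takes two vectors that some encoder is assumed to have produced. GSM8K reference answers end with `#### <number>`, which the same extraction handles.

```python
import math
import re

# Matches integers and decimals, allowing thousands separators like 1,234.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in `text` as a normalized string, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    return str(int(value)) if value.is_integer() else str(value)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's."""
    hits = 0
    for pred, ref in zip(predictions, references):
        answer = extract_final_number(pred)
        if answer is not None and answer == extract_final_number(ref):
            hits += 1
    return hits / len(references)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


accuracy = exact_match_accuracy(
    ["The answer is 18.", "I think 7", "unsure"],
    ["#### 18", "#### 8", "#### 3"],
)
```

Semantic drift would then be reported as the similarity (or one minus it) between the embeddings of the clean and corrupted generations for the same question.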

📂 Repository Structure

.
├── experiments/          # Scripts to run ablation studies
├── src/
│   ├── corruptor.py      # Core logic: PyTorch hooks for KV manipulation
│   ├── evaluator.py      # GSM8K evaluation loop
│   └── utils.py          # Data loading and helper functions
├── results/              # JSON logs and plots (auto-generated)
├── notebook/             # Jupyter notebooks for visualization
├── requirements.txt
└── README.md

💻 Usage

Installation

git clone https://github.com/yourusername/internal-robustness.git
cd internal-robustness
pip install -r requirements.txt

Running the Experiment

To run a sweep of corruption rates on Qwen2.5-1.5B:

python experiments/run_sweep.py \
    --model_name "Qwen/Qwen2.5-1.5B-Instruct" \
    --noise_type "distractor" \
    --corruption_rates 0.1 0.3 0.5 0.7 0.9 \
    --output_dir "./results/qwen_distractor"

Visualization

python scripts/plot_results.py --result_dir "./results/qwen_distractor"

📊 Results

All experiments use GSM8K (n=100 unless noted) on Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, corrupting the KV cache across all 28 layers during decoding.
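With n=100 per cell, each accuracy carries some sampling uncertainty. The sketch below computes a Wilson score interval for a binomial proportion as one way to judge whether two cells really differ. It is a reading aid added here, not part of the reported analysis, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


baseline_low, baseline_high = wilson_interval(87, 100)  # 7B baseline, n=100
small_low, small_high = wilson_interval(9, 10)          # a 0.90 score at n=10
```

For 87 of 100 correct the interval is roughly 0.79 to 0.92, so differences of a few points between cells should be read cautiously; at n=10 the interval is far wider.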

Tier 1 & 2: Noise-Type Comparison (All Layers)

Corruption Rate	7B Gaussian	7B Zeros	7B Distractor	1.5B Gaussian	1.5B Zeros	1.5B Distractor
0.00 (baseline)	0.87	0.87	0.87	0.66	0.66	0.66
0.02	0.02	0.75	0.86	0.00	0.39	0.55
0.05	0.00	0.71	0.83	0.00	0.19	0.56
0.08	0.03	0.63	0.81	0.01	0.13	0.40
0.10	0.02	0.62	0.81	0.00	0.15	0.41
0.15	0.02	0.50	0.72	0.06	0.06	0.27
0.20	0.03	0.36	0.63	0.01	0.05	0.27
0.25	0.01	0.26	0.53	0.01	0.00	0.12
0.30	0.02	0.05	0.50	0.01	0.00	0.07
0.40	0.04	0.01	0.34	0.00	0.01	0.05
0.50	0.01	0.00	0.10	0.01	0.00	0.02
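To compare the columns on a common footing, the table above can be reduced to one number per column: the first corruption rate at which accuracy falls below half of the baseline. This threshold is a hypothetical operationalisation chosen for the sketch; it is stricter than the informal "cliff" used in Key Findings, so the two need not coincide. The values are copied from the table.

```python
RATES = [0.00, 0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50]

# Accuracy per (model, noise type), in the same order as RATES.
ACCURACY = {
    ("7B", "gaussian"): [0.87, 0.02, 0.00, 0.03, 0.02, 0.02, 0.03, 0.01, 0.02, 0.04, 0.01],
    ("7B", "zeros"): [0.87, 0.75, 0.71, 0.63, 0.62, 0.50, 0.36, 0.26, 0.05, 0.01, 0.00],
    ("7B", "distractor"): [0.87, 0.86, 0.83, 0.81, 0.81, 0.72, 0.63, 0.53, 0.50, 0.34, 0.10],
    ("1.5B", "gaussian"): [0.66, 0.00, 0.00, 0.01, 0.00, 0.06, 0.01, 0.01, 0.01, 0.00, 0.01],
    ("1.5B", "zeros"): [0.66, 0.39, 0.19, 0.13, 0.15, 0.06, 0.05, 0.00, 0.00, 0.01, 0.00],
    ("1.5B", "distractor"): [0.66, 0.55, 0.56, 0.40, 0.41, 0.27, 0.27, 0.12, 0.07, 0.05, 0.02],
}


def half_baseline_rate(accs, rates=RATES):
    """First rate where accuracy drops below half the rate-0 accuracy."""
    threshold = accs[0] / 2
    for rate, acc in zip(rates[1:], accs[1:]):
        if acc < threshold:
            return rate
    return None  # never dropped below the threshold in the tested range


summary = {key: half_baseline_rate(accs) for key, accs in ACCURACY.items()}
for (model, noise), rate in summary.items():
    print(f"{model:>4} {noise:<10} half-baseline rate: {rate}")
```

Under this definition the 7B model reaches the threshold at 0.02 for Gaussian noise, 0.20 for zeros, and 0.40 for distractors, while the 1.5B model reaches it at 0.02, 0.05, and 0.15 respectively.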

Tier 3: Layer Sensitivity (7B, Gaussian Noise)

Corruption applied only to specific layer bands (7 layers each), tested at rates around the cliff (0.01, 0.02, 0.04).

Layer Band	Rate 0.01	Rate 0.02	Rate 0.04	Role
L0-6 (early)	0.71	0.64	0.51	Moderately robust: retains >50% accuracy at 4%
L7-13 (early-mid)	0.79	0.51	0.20	Moderate — degrades gradually
L14-20 (late-mid)	0.79	0.86	0.81	Virtually immune — accuracy preserved even at 4%
L21-27 (late)	0.43	0.12	0.01	Most fragile — collapses at 2%
L0-13 (first half)	0.44	0.10	0.04	Collapses when both early bands corrupted together
L14-27 (second half)	0.34	0.04	0.01	Collapses — dominated by fragile L21-27
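The band definitions and the ranking they imply can be written down directly. In this sketch the band-to-layer mapping follows the labels in the table above (28 layers, indexed 0 to 27), and the accuracies are copied from it; `layers_to_corrupt` and `rank_bands` are hypothetical helper names.

```python
# Layer indices covered by each band label (range end is exclusive).
BANDS = {
    "L0-6": range(0, 7),
    "L7-13": range(7, 14),
    "L14-20": range(14, 21),
    "L21-27": range(21, 28),
    "L0-13": range(0, 14),
    "L14-27": range(14, 28),
}

TIER3_RATES = (0.01, 0.02, 0.04)

# Accuracy per band at each rate in TIER3_RATES.
TIER3_ACCURACY = {
    "L0-6": (0.71, 0.64, 0.51),
    "L7-13": (0.79, 0.51, 0.20),
    "L14-20": (0.79, 0.86, 0.81),
    "L21-27": (0.43, 0.12, 0.01),
    "L0-13": (0.44, 0.10, 0.04),
    "L14-27": (0.34, 0.04, 0.01),
}


def layers_to_corrupt(band):
    """List the layer indices a band label refers to."""
    return list(BANDS[band])


def rank_bands(rate):
    """Band labels sorted from most to least robust at the given rate."""
    i = TIER3_RATES.index(rate)
    return sorted(TIER3_ACCURACY, key=lambda b: TIER3_ACCURACY[b][i], reverse=True)


ranking_at_2pct = rank_bands(0.02)
```

At a 2% rate the ranking starts with L14-20 and ends with L14-27, matching the reading that the second half is dominated by the fragile L21-27 band.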

Tier 4: Qualitative Examples (7B, Gaussian, n=10)

Corruption Rate	Accuracy
0.00	0.90
0.05	0.10
0.10	0.00
0.15+	0.00

Key Findings

Answering the hypotheses:

The "Cliff": The cliff depends strongly on noise type. Gaussian noise is catastrophic at just 2% corruption (7B: 87% → 2%). Zeros degrade gradually (7B: 87% → 50% at 15%) and collapse around 30% (5%). Distractor noise is the most tolerable: the 7B model retains 50% accuracy at 30% corruption and only falls to 10% at 50%. The original hypothesis of a ~30% cliff therefore holds roughly for zero ablation, while Gaussian noise collapses far earlier and distractor noise degrades more gradually.
Layer Sensitivity: The hypothesis that "middle layers are most sensitive" is partially wrong. In fact, late layers (L21-27) are the most fragile (collapsing at 2% corruption), while late-middle layers (L14-20) are virtually immune (maintaining 86% accuracy even at 2% corruption). Early layers (L0-6) show moderate robustness.

Interpretation:

Noise type matters more than corruption rate. Gaussian noise (out-of-distribution activations) is instantly fatal, while distractor noise (real activations from unrelated text) is tolerable up to ~30%. This suggests the attention mechanism can soft-mask structured-but-irrelevant signals, but cannot handle statistically anomalous activations.
Model scale provides no protection against Gaussian noise but helps significantly against structured noise. The 7B model retains 50% accuracy at 30% distractor corruption, while the 1.5B model drops to 7% at the same rate.
Late layers (L21-27) are critical for reasoning. These layers likely encode task-specific computation (arithmetic, answer extraction). Corrupting them destroys output even at low rates, while corrupting L14-20 has almost no effect — suggesting those layers have high redundancy or perform less critical processing.
Error compounding explains the Gaussian cliff. GSM8K requires chained multi-step reasoning. Gaussian noise produces out-of-distribution attention patterns that corrupt each subsequent decoding step, while structured noise (zeros, distractors) preserves enough signal for the model to self-correct.
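To make the error-compounding argument concrete, here is a toy calculation. It assumes each decoding step independently "stays on track" with a fixed probability, which is a simplification introduced for illustration and not something measured in these experiments; the function names and the numbers are hypothetical.

```python
import math


def survival_probability(per_step_success, num_steps):
    """Chance that every one of `num_steps` independent steps succeeds."""
    return per_step_success ** num_steps


def steps_until_below(per_step_success, target):
    """Smallest step count at which the survival probability is below `target`."""
    if not 0 < per_step_success < 1 or not 0 < target < 1:
        raise ValueError("both arguments must be strictly between 0 and 1")
    # Solve p**n < target for n, then round up to a whole number of steps.
    steps = math.ceil(math.log(target) / math.log(per_step_success))
    if survival_probability(per_step_success, steps) >= target:
        steps += 1
    return steps


# Hypothetical example: a 1% chance of derailing at each of 200 generated tokens.
chance_intact = survival_probability(0.99, 200)
tokens_to_half = steps_until_below(0.99, 0.5)
```

Under these assumptions a 1% per-token failure rate leaves only about a 13% chance that a 200-token chain of reasoning stays intact, which is the flavour of collapse the Gaussian results show.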

💡 What Does This Mean?

Random noise is instantly fatal

Replacing just 2% of the KV cache with random numbers destroys the model completely (87% → 2% accuracy). This is like scrambling 2% of someone's neurons mid-thought — the brain can't recover.

Random noise produces activation patterns the model has never seen during training. The attention mechanism doesn't know how to ignore something that looks nothing like real data. Each corrupted step compounds into the next, and within a few tokens the output is pure hallucination.

But the model CAN ignore "real but irrelevant" information

Distractor noise (memories from an unrelated conversation) is far more survivable — the 7B model still gets 50% of math problems right with 30% of its memory replaced by text about something completely different.

Distractor KV pairs are still "in-distribution": they look like real activations. The attention mechanism was trained to focus on relevant context and suppress irrelevant context, so it is doing what it learned to do: filtering signal from noise. This partially confirms the "Soft-Masking" hypothesis.

Not all layers matter equally

This is perhaps the most interesting finding:

Layers 14-20 are almost insensitive to corruption at the tested rates: with up to 4% Gaussian noise in this band, accuracy stays between 79% and 86%. They seem to be doing something redundant or non-critical.
Layers 21-27 behave like the "reasoning engine": corrupting even 2% here drops accuracy to 12%. These final layers are likely where the model computes the answer.
Early layers (0-6) are moderately robust (51% accuracy at 4%), which suggests they encode basic language understanding with some redundancy.

Think of it like a factory assembly line. Messing with the final quality-control station (late layers) ruins everything. Messing with a middle packaging step (L14-20) barely matters because other steps compensate. Messing with the raw material intake (early layers) causes moderate problems.

Bigger models aren't tougher against random noise

Both the 7B and 1.5B models collapse at the same 2% Gaussian cliff. Scale doesn't buy you resilience against out-of-distribution corruption. But the 7B model is more resilient against structured noise (zeros, distractors), which suggests that more parameters provide more redundancy for in-distribution filtering.

Implications

Hardware fault tolerance: Even tiny memory errors (bit flips, cosmic rays) could be catastrophic if they produce out-of-distribution activations. But if errors zero out values instead, degradation is much more gradual: the 7B model keeps 75% accuracy with 2% of the cache zeroed.
Security: Activation-steering attacks that inject "realistic-looking" corruptions are harder for the model to detect than random perturbations, yet random perturbations are far more damaging.
Interpretability: The late layers (21-27) appear to be where the reasoning-critical computation happens in Qwen2.5-7B. Layers 14-20 appear to be a "buffer zone" with high redundancy, which makes them a potential target for future pruning or efficiency research.

🤝 Contributing

We welcome contributions! Specifically, we are looking for:

Support for more models (Llama-3, Mistral).
New noise types (e.g., Adversarial Perturbation).
Visualization tools for Attention Maps under corruption.

📜 Citation

If you use this code or ideas in your research, please cite:

@misc{internal-robustness-2024,
  author = {Your Name},
  title = {Needle in the Noise: Probing LLM Internal Robustness},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/yourusername/internal-robustness}}
}

Disclaimer: This project is for research purposes only. It involves "hacking" the internal states of LLMs and may produce unpredictable outputs.
