"Can a Large Language Model 'ignore' a corrupted memory, or will it hallucinate into madness?"
Current research on LLM robustness focuses heavily on input prompts (e.g., prompt injection) or weight perturbation. However, little is known about the Internal Robustness of the model's transient memory: the KV Cache.
This project investigates the resilience of LLMs (specifically Qwen2.5/Llama-3) against dynamic KV Cache corruption during decoding. By injecting noise, shuffling, or irrelevant context ("distractors") into the KV Cache at inference time, we aim to measure the "Soft-Masking" capability of the Attention mechanism.
Does the model collapse immediately, or can it filter out the "noise" and focus on the "needle" of its own reasoning?
- The "Soft-Masking" Hypothesis: We hypothesize that well-trained LLMs possess an inherent ability to suppress irrelevant or erroneous activation patterns in their KV Cache, similar to how they handle noisy input prompts.
- Hardware Fault Tolerance: In scenarios like edge computing or cosmic ray bit-flips, can a model survive partial memory corruption without crashing?
- Defense Against Attacks: Understanding how models react to internal noise could lead to new defense mechanisms against prompt injection or activation steering attacks.
- Interpretability: By identifying which layers are most sensitive to KV corruption, we can better understand the semantic vs. syntactic roles of different transformer blocks.
We implement a Runtime KV Cache Interceptor using PyTorch hooks. This allows us to modify the Key and Value matrices before the Attention computation step, without altering the model weights.
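As a rough illustration of the hook mechanism, the sketch below registers forward hooks on the key and value projections of a toy attention module and overwrites part of their output. `ToyAttention`, the `k_proj`/`v_proj` attribute names, and `make_zeroing_hook` are assumptions made for this example; the actual interceptor in `src/corruptor.py` may hook different modules or edit the cache object directly.

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Single-head attention stand-in (hypothetical, not a real model layer)."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ v


def make_zeroing_hook(positions):
    """Build a forward hook that zeroes the given sequence positions."""

    def hook(module, inputs, output):
        corrupted = output.clone()
        corrupted[:, positions, :] = 0.0
        return corrupted  # a non-None return value replaces the module output

    return hook


torch.manual_seed(0)
attn = ToyAttention()
x = torch.randn(1, 5, 8)  # (batch, seq_len, dim)

clean = attn(x)

# Corrupt keys and values at the same positions, before attention is computed.
handles = [
    attn.k_proj.register_forward_hook(make_zeroing_hook([1, 3])),
    attn.v_proj.register_forward_hook(make_zeroing_hook([1, 3])),
]
corrupted = attn(x)

# The weights were never modified, so removing the hooks restores clean output.
for handle in handles:
    handle.remove()
restored = attn(x)
```

In a real decoder the cached keys are typically stored after positional encoding has been applied, so a projection-level hook like this one is only an approximation of editing the cache itself.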
We support three modes of KV Corruption:
- Gaussian Noise: Replaces a percentage of KV pairs with random noise matching the layer's statistics. Tests numerical stability.
- Zero Ablation (Sparsification): Zeroes out a percentage of KV pairs. Tests redundancy.
- Distractor Injection (The "Hallucination" Test): Replaces the current context's KV Cache with pre-computed KV pairs from a completely unrelated prompt (e.g., mixing a Math problem with a News article).
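The three modes above can be sketched on a single cache tensor as follows. This is a minimal illustration, not the code in `src/corruptor.py`: NumPy arrays stand in for PyTorch tensors, `corrupt_kv` is a hypothetical helper, corruption is applied per token position, and distractor rows are swapped in position by position, all of which are assumptions.

```python
import numpy as np


def corrupt_kv(kv, rate, mode, distractor=None, rng=None):
    """Return (corrupted copy of `kv`, indices of the corrupted positions).

    kv:         array of shape (seq_len, head_dim), one row per cached token
    rate:       fraction of positions to corrupt, between 0 and 1
    mode:       "gaussian", "zeros", or "distractor"
    distractor: array with the same shape as `kv`, required for "distractor"
    """
    if rng is None:
        rng = np.random.default_rng()
    seq_len, head_dim = kv.shape
    n_corrupt = int(round(rate * seq_len))
    idx = rng.choice(seq_len, size=n_corrupt, replace=False)
    out = kv.copy()
    if mode == "gaussian":
        # Noise matched to the mean and standard deviation of this tensor.
        out[idx] = rng.normal(kv.mean(), kv.std(), size=(n_corrupt, head_dim))
    elif mode == "zeros":
        out[idx] = 0.0
    elif mode == "distractor":
        if distractor is None or distractor.shape != kv.shape:
            raise ValueError("distractor must have the same shape as kv")
        out[idx] = distractor[idx]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out, idx


rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 16))        # stand-in for one layer's cached keys
unrelated = rng.normal(size=(100, 16))   # stand-in for an unrelated prompt's keys

zeroed, zero_idx = corrupt_kv(keys, 0.10, "zeros", rng=rng)
noisy, noise_idx = corrupt_kv(keys, 0.02, "gaussian", rng=rng)
mixed, mix_idx = corrupt_kv(keys, 0.30, "distractor", distractor=unrelated, rng=rng)
```

To corrupt KV pairs rather than keys alone, the returned indices can be reused to apply the same replacement to the matching value tensor.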
- Task: GSM8K (Grade School Math) - Requires multi-step reasoning, making it highly sensitive to context loss.
- Metric: Exact Match Accuracy & Semantic Drift (Embedding similarity between clean and corrupted outputs).
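Both metrics can be sketched in a few lines of plain Python. The helper names are hypothetical, the answer is taken to be the last number in the generated text (which matches the `#### <number>` convention of GSM8K reference answers), and drift is defined here as one minus cosine similarity; the evaluator in `src/evaluator.py` may differ on each of these points.

```python
import math
import re


def extract_final_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's."""
    hits = 0
    for pred, ref in zip(predictions, references):
        answer = extract_final_number(pred)
        if answer is not None and answer == extract_final_number(ref):
            hits += 1
    return hits / len(references)


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_drift(clean_embedding, corrupted_embedding):
    """Drift as 1 - cosine similarity: 0 means the outputs point the same way."""
    return 1.0 - cosine_similarity(clean_embedding, corrupted_embedding)
```

The embeddings passed to `semantic_drift` would come from whatever sentence-embedding model is used to encode the clean and corrupted outputs; that choice is left open here.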
.
├── experiments/ # Scripts to run ablation studies
├── src/
│   ├── corruptor.py # Core logic: PyTorch hooks for KV manipulation
│   ├── evaluator.py # GSM8K evaluation loop
│   └── utils.py # Data loading and helper functions
├── results/ # JSON logs and plots (auto-generated)
├── notebook/ # Jupyter notebooks for visualization
├── requirements.txt
└── README.md
git clone https://github.com/yourusername/internal-robustness.git
cd internal-robustness
pip install -r requirements.txt
To run a sweep of corruption rates on Qwen2.5-1.5B:
python experiments/run_sweep.py \
--model_name "Qwen/Qwen2.5-1.5B-Instruct" \
--noise_type "distractor" \
--corruption_rates 0.1 0.3 0.5 0.7 0.9 \
--output_dir "./results/qwen_distractor"
python scripts/plot_results.py --result_dir "./results/qwen_distractor"
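For readers extending the sweep script, the flags used in the command above could be parsed along these lines. This is a sketch, not the contents of `experiments/run_sweep.py`: the defaults and the `gaussian` and `zeros` choice names are assumptions (only `distractor` appears in the command).

```python
import argparse


def build_parser():
    """Argument parser mirroring the flags shown in the sweep command."""
    parser = argparse.ArgumentParser(description="KV cache corruption sweep (sketch)")
    parser.add_argument("--model_name", required=True)
    parser.add_argument(
        "--noise_type",
        choices=["gaussian", "zeros", "distractor"],  # names assumed, see above
        default="gaussian",
    )
    parser.add_argument("--corruption_rates", type=float, nargs="+", default=[0.1])
    parser.add_argument("--output_dir", default="./results")
    return parser


# Parse the same arguments as the example command instead of reading sys.argv.
args = build_parser().parse_args([
    "--model_name", "Qwen/Qwen2.5-1.5B-Instruct",
    "--noise_type", "distractor",
    "--corruption_rates", "0.1", "0.3", "0.5", "0.7", "0.9",
    "--output_dir", "./results/qwen_distractor",
])
```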
All experiments use GSM8K (n=100 unless noted) on Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, corrupting the KV cache across all 28 layers during decoding.
| Corruption Rate | 7B Gaussian | 7B Zeros | 7B Distractor | 1.5B Gaussian | 1.5B Zeros | 1.5B Distractor |
|---|---|---|---|---|---|---|
| 0.00 (baseline) | 0.87 | 0.87 | 0.87 | 0.66 | 0.66 | 0.66 |
| 0.02 | 0.02 | 0.75 | 0.86 | 0.00 | 0.39 | 0.55 |
| 0.05 | 0.00 | 0.71 | 0.83 | 0.00 | 0.19 | 0.56 |
| 0.08 | 0.03 | 0.63 | 0.81 | 0.01 | 0.13 | 0.40 |
| 0.10 | 0.02 | 0.62 | 0.81 | 0.00 | 0.15 | 0.41 |
| 0.15 | 0.02 | 0.50 | 0.72 | 0.06 | 0.06 | 0.27 |
| 0.20 | 0.03 | 0.36 | 0.63 | 0.01 | 0.05 | 0.27 |
| 0.25 | 0.01 | 0.26 | 0.53 | 0.01 | 0.00 | 0.12 |
| 0.30 | 0.02 | 0.05 | 0.50 | 0.01 | 0.00 | 0.07 |
| 0.40 | 0.04 | 0.01 | 0.34 | 0.00 | 0.01 | 0.05 |
| 0.50 | 0.01 | 0.00 | 0.10 | 0.01 | 0.00 | 0.02 |
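One way to compare the columns of the table above is to find the first corruption rate at which accuracy falls below half of its baseline. The snippet below does this on the numbers copied from the table. The half-baseline threshold and the `half_baseline_rate` helper are assumptions chosen for illustration; the threshold is stricter than the informal "cliff" used in the discussion, so the two need not coincide.

```python
RATES = [0.00, 0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50]

# Accuracy per corruption rate, copied from the results table.
ACCURACY = {
    "7B gaussian":     [0.87, 0.02, 0.00, 0.03, 0.02, 0.02, 0.03, 0.01, 0.02, 0.04, 0.01],
    "7B zeros":        [0.87, 0.75, 0.71, 0.63, 0.62, 0.50, 0.36, 0.26, 0.05, 0.01, 0.00],
    "7B distractor":   [0.87, 0.86, 0.83, 0.81, 0.81, 0.72, 0.63, 0.53, 0.50, 0.34, 0.10],
    "1.5B gaussian":   [0.66, 0.00, 0.00, 0.01, 0.00, 0.06, 0.01, 0.01, 0.01, 0.00, 0.01],
    "1.5B zeros":      [0.66, 0.39, 0.19, 0.13, 0.15, 0.06, 0.05, 0.00, 0.00, 0.01, 0.00],
    "1.5B distractor": [0.66, 0.55, 0.56, 0.40, 0.41, 0.27, 0.27, 0.12, 0.07, 0.05, 0.02],
}


def half_baseline_rate(rates, accuracies):
    """First rate where accuracy drops below 50% of the rate-0 baseline."""
    threshold = 0.5 * accuracies[0]
    for rate, acc in zip(rates[1:], accuracies[1:]):
        if acc < threshold:
            return rate
    return None  # never dropped below the threshold in the tested range


summary = {name: half_baseline_rate(RATES, acc) for name, acc in ACCURACY.items()}
for name, rate in summary.items():
    print(f"{name:16s} half-baseline rate: {rate}")
```

Under this criterion both Gaussian columns cross the threshold at the first tested rate (0.02), while the 7B distractor column does not cross it until 0.40.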
Corruption applied only to specific layer bands (7 layers each), tested at rates around the cliff (0.01, 0.02, 0.04).
| Layer Band | Rate 0.01 | Rate 0.02 | Rate 0.04 | Role |
|---|---|---|---|---|
| L0-6 (early) | 0.71 | 0.64 | 0.51 | Fairly robust: tolerates up to 4% with >50% accuracy |
| L7-13 (early-mid) | 0.79 | 0.51 | 0.20 | Moderate: degrades gradually |
| L14-20 (late-mid) | 0.79 | 0.86 | 0.81 | Virtually immune: accuracy preserved even at 4% |
| L21-27 (late) | 0.43 | 0.12 | 0.01 | Most fragile: collapses at 2% |
| L0-13 (first half) | 0.44 | 0.10 | 0.04 | Collapses when both early bands are corrupted together |
| L14-27 (second half) | 0.34 | 0.04 | 0.01 | Collapses, dominated by the fragile L21-27 band |
| Corruption Rate | Accuracy |
|---|---|
| 0.00 | 0.90 |
| 0.05 | 0.10 |
| 0.10 | 0.00 |
| 0.15+ | 0.00 |
Answering the hypotheses:
- The "Cliff": The cliff depends dramatically on noise type. Gaussian noise is catastrophic at just 2% corruption. Zeros degrade gradually (7B: 87% → 50% at 15%, cliff at ~30%). Distractor noise is the most tolerable: the 7B model retains 50% accuracy even at 30% corruption. The original hypothesis of a ~30% cliff holds only for structured noise types, not random noise.
- Layer Sensitivity: The hypothesis that "middle layers are most sensitive" is partially wrong. In fact, late layers (L21-27) are the most fragile (collapsing at 2% corruption), while late-middle layers (L14-20) are virtually immune (maintaining 86% accuracy even at 2% corruption). Early layers (L0-6) show moderate robustness.
Interpretation:
- Noise type matters more than corruption rate. Gaussian noise (out-of-distribution activations) is instantly fatal, while distractor noise (real activations from unrelated text) is tolerable up to ~30%. This suggests the attention mechanism can soft-mask structured-but-irrelevant signals, but cannot handle statistically anomalous activations.
- Model scale provides no protection against Gaussian noise but helps significantly against structured noise. The 7B model retains 50% accuracy at 30% distractor corruption, while the 1.5B model drops to 7% at the same rate.
- Late layers (L21-27) are critical for reasoning. These layers likely encode task-specific computation (arithmetic, answer extraction). Corrupting them destroys output even at low rates, while corrupting L14-20 has almost no effect, suggesting those layers have high redundancy or perform less critical processing.
- Error compounding explains the Gaussian cliff. GSM8K requires chained multi-step reasoning. Gaussian noise produces out-of-distribution attention patterns that corrupt each subsequent decoding step, while structured noise (zeros, distractors) preserves enough signal for the model to self-correct.
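The compounding argument can be made concrete with a toy calculation. It assumes each decoding step independently derails the answer with a fixed probability, which is a simplification introduced here and not something measured in these experiments; the step counts and the per-step probability are arbitrary illustrative values.

```python
def survival_probability(per_step_error, n_steps):
    """Chance that no step derails the answer, assuming independent steps."""
    return (1.0 - per_step_error) ** n_steps


# Illustrative values only: a 2% per-step failure chance over growing outputs.
for n_steps in (10, 50, 200):
    p = survival_probability(0.02, n_steps)
    print(f"{n_steps:4d} steps -> survival probability {p:.3f}")
```

Even a small per-step failure chance shrinks the probability of a fully correct long output geometrically, which is consistent with multi-step tasks such as GSM8K being hit hardest.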
Replacing just 2% of the KV cache with random numbers destroys the model completely (87% → 2% accuracy). This is like scrambling 2% of someone's neurons mid-thought: the brain can't recover.
Random noise produces activation patterns the model has never seen during training. The attention mechanism doesn't know how to ignore something that looks nothing like real data. Each corrupted step compounds into the next, and within a few tokens the output is pure hallucination.
Distractor noise (memories from an unrelated conversation) is far more survivable: the 7B model still gets 50% of math problems right with 30% of its memory replaced by text about something completely different.
Distractor KV pairs are still "in-distribution": they look like real activations. The attention mechanism was trained to focus on relevant context and suppress irrelevant context. It's doing exactly what it learned to do: filtering signal from noise. This is the "Soft-Masking" hypothesis partially confirmed.
This is perhaps the most interesting finding:
- Layers 14-20 are almost insensitive to corruption: at the tested rates (up to 4%) you can corrupt them and accuracy barely changes. They seem to be doing something redundant or non-critical.
- Layers 21-27 are the "reasoning engine": corrupting even 2% here kills performance. These final layers appear to be where the model actually computes the answer.
- Early layers (0-6) are moderately robust: they encode basic language understanding that has some redundancy.
Think of it like a factory assembly line. Messing with the final quality-control station (late layers) ruins everything. Messing with a middle packaging step (L14-20) barely matters because other steps compensate. Messing with the raw material intake (early layers) causes moderate problems.
Both the 7B and 1.5B models collapse at the same 2% Gaussian cliff. Scale doesn't buy you resilience against out-of-distribution corruption. But the 7B model is more resilient against structured noise (zeros, distractors): more parameters means more redundancy for in-distribution filtering.
- Hardware fault tolerance: Even tiny memory errors (bit flips, cosmic rays) could be catastrophic if they produce out-of-distribution activations. But if errors zero out values instead, degradation is far more gradual, especially for the larger model.
- Security: Activation-steering attacks that inject "realistic-looking" corruptions are harder for the model to detect than random perturbations, but ironically, random perturbations are far more damaging.
- Interpretability: The late layers (21-27) are where reasoning actually happens in Qwen2.5-7B. Layers 14-20 appear to be a "buffer zone" with high redundancy, making them a potential target for future pruning or efficiency research.
We welcome contributions! Specifically, we are looking for:
- Support for more models (Llama-3, Mistral).
- New noise types (e.g., Adversarial Perturbation).
- Visualization tools for Attention Maps under corruption.
If you use this code or ideas in your research, please cite:
@misc{internal-robustness-2024,
author = {Your Name},
title = {Needle in the Noise: Probing LLM Internal Robustness},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/yourusername/internal-robustness}}
}
Disclaimer: This project is for research purposes only. It involves "hacking" the internal states of LLMs and may produce unpredictable outputs.











