The idea
When a language model reads a sentence, it builds up a rich internal state at every layer. At each token position, that state is a vector of thousands of numbers encoding everything the model has inferred so far — grammatical role, likely continuations, semantic context, positional information.
Most tools for understanding these states require you to work with the numbers directly, which is impractical for most purposes. TinyNLA takes a different approach, inspired by Anthropic's Natural Language Autoencoder work: train a system that can describe that internal state in plain English (the verbalizer), and then reconstruct the original numbers from just that description (the reconstructor). If the reconstruction is close to the original, the description captured the essential information.
The practical implication: if this works well, you could read what a model was "thinking" at any point, edit that description, and observe the effect on behavior — without ever touching the raw numbers.
The test: take a vector, convert it to English, convert the English back to a vector, and measure how close you got. Low error means the description preserved the essential structure. TinyNLA targets GPT-2 and Qwen2.5 models up to 7B parameters.
Try it: the round-trip in action (interactive)
Choose a sentence below. The demo walks through what TinyNLA actually does at each step: read the model's internal state at the final token, translate it to a plain English description, then reconstruct the state from that description and compare what the model would predict in each case.
How it works (built)
TinyNLA is a two-component system. It follows the same architecture as Anthropic's NLA paper, pairing an Activation Verbalizer (AV) with an Activation Reconstructor (AR):
- Activation Verbalizer — a distilgpt2-based model fine-tuned to take a residual stream activation vector and produce a plain English description. The vector is injected as a single token embedding into a fixed prompt, then the model autoregressively generates the description.
- Activation Reconstructor — takes the plain English description produced by the verbalizer and produces a reconstructed activation vector. Quality is measured by how close the reconstructed vector is to the original, after L2 normalisation — so low MSE means direction was preserved, which is what matters for downstream behavior.
Because of that normalisation, the round-trip MSE depends only on the angle between the original and reconstructed vectors, not on their magnitudes. If the description captured the key information, the reconstructor should land in roughly the same place in activation space, and the model's predictions from that reconstructed state should be similar to its original predictions.
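To make the link between normalised MSE and direction concrete: for two unit vectors of dimension d, the MSE equals (2 - 2·cos θ) / d, where θ is the angle between them. The sketch below checks this on made-up 4-dimensional vectors. The function names are illustrative assumptions, not TinyNLA's actual API, and a real implementation would more likely use a tensor library than plain lists.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def normalised_mse(original, reconstructed):
    """Mean squared error between two vectors after L2 normalisation."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same dimension")
    a = l2_normalise(original)
    b = l2_normalise(reconstructed)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    a = l2_normalise(u)
    b = l2_normalise(v)
    return sum(x * y for x, y in zip(a, b))

# Toy "activations" (made-up numbers, not real model output).
original = [1.0, 2.0, 0.5, -1.0]
rescaled = [10.0 * x for x in original]   # same direction, larger magnitude
different = [2.0, -1.0, 0.0, 0.5]         # a different direction

mse_same_direction = normalised_mse(original, rescaled)
mse_different = normalised_mse(original, different)

# Identity for unit vectors: MSE = (2 - 2 * cos) / d
predicted = (2 - 2 * cosine_similarity(original, different)) / len(original)

print(mse_same_direction, mse_different, predicted)
```

Rescaling a vector leaves the normalised MSE at zero (up to floating-point error), while a change of direction raises it by exactly the amount the cosine predicts.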
Results (layer 1 only)
Three compounding architecture failures kept early results uninterpretable. Fixing them reduced the perplexity shift on layer 1 of GPT-2 from roughly 4,600x to 57x. A 57x shift means the reconstructed state still degrades model performance substantially, but it is far from random noise: a meaningful part of the encoding's direction is being recovered.
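The text does not spell out how the perplexity shift is computed. A minimal sketch under one plausible reading, which is an assumption: the shift is the model's perplexity with the reconstructed state patched in, divided by its perplexity with the original state. The function names and log-probabilities below are made up for illustration.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-probability (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def perplexity_shift(original_log_probs, patched_log_probs):
    """ASSUMED definition: perplexity with the reconstructed state patched in,
    divided by perplexity with the original state. 1.0 means no change."""
    return perplexity(patched_log_probs) / perplexity(original_log_probs)

# Made-up log-probabilities the model assigns to the same three target tokens.
original_lp = [math.log(0.5), math.log(0.25), math.log(0.5)]
# After patching, every target token is ten times less likely.
patched_lp = [math.log(0.05), math.log(0.025), math.log(0.05)]

shift = perplexity_shift(original_lp, patched_lp)
print(f"perplexity shift: {shift:.1f}x")
```

Under this assumed definition, a shift of k means the geometric-mean probability the model assigns to the correct tokens has dropped by a factor of k.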
| Version | Token processing | Encoder | Perplexity shift (layer 1) |
|---|---|---|---|
| v1 | Averaged across tokens (loses position) | DistilBERT (bidirectional encoder, not causal) | ~4,600x |
| v2 | Averaged across tokens (loses position) | DistilBERT (bidirectional encoder, not causal) | ~2,000x |
| v3 | Per-token (preserves position) | distilgpt2 (causally compatible) | 57x |
Round-trip perplexity shift — lower is better
These numbers are from layer 1 of GPT-2 only. Whether similar improvement holds across all layers is an open question — the full sweep is listed in goals.
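Why averaging across tokens loses position can be shown with a toy example: a mean is unchanged when the sequence is reordered. The vectors below are made-up, context-free stand-ins. Real per-token activations depend on context, so this illustrates only the permutation invariance of averaging, not the behaviour of any TinyNLA version.

```python
def mean_pool(token_vectors):
    """Average a sequence of equal-length vectors into a single vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Made-up 2-dimensional vectors standing in for per-token activations.
dog = [1.0, 0.0]
bites = [0.0, 1.0]
man = [0.5, 0.5]

sentence_a = [dog, bites, man]   # "dog bites man"
sentence_b = [man, bites, dog]   # "man bites dog"

pooled_a = mean_pool(sentence_a)
pooled_b = mean_pool(sentence_b)

# The per-token sequences differ, but the pooled vectors do not.
print(sentence_a != sentence_b, pooled_a == pooled_b)
```

Keeping one vector per token, as in the v3 row of the table, avoids this collapse because the two orderings remain distinguishable.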
Finding
Using a plain English description to condition the reconstruction outperforms hand-crafted feature labels. When the verbalizer is prompted with context that captures what the layer is doing — rather than a categorical tag like "determiner" or "function word" — both reconstruction accuracy and downstream perplexity shift improve on layer 1.
This agrees in direction with the finding from Anthropic's NLA paper on larger models. So far it has been checked only on layer 1 of GPT-2, so the size of the improvement for models in the ≤7B range cannot be pinned down without the full layer sweep.
Why this matters for debugging: if a model's internal state can be reliably described in plain English, a practitioner can read those descriptions, identify what changed at the point a hallucination occurred, and intervene without directly manipulating activation space. That's the target. Current results show the direction is viable at one layer on one model.
Goals
- Full layer sweep — run the AV/AR round-trip across all layers of GPT-2 and Qwen2.5 to find where description quality is highest and where it degrades.
- Unfreeze reconstructor layers — test whether additional trainable capacity in the AR improves direction recovery beyond what the v3 architecture fix produced.
- CKA metrics — add Centered Kernel Alignment alongside MSE for a more principled measure of representational alignment between original and reconstructed states.
- Connect to the eval platform — use verbalizer output as explanations for failures caught by the behavioral probe suites, closing the loop between behavioral detection and mechanistic explanation.
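For the CKA goal above, here is a minimal sketch of linear CKA in pure Python. It uses the standard linear-kernel formula, ‖YᵀX‖²_F / (‖XᵀX‖_F · ‖YᵀY‖_F) on column-centred matrices. The data and function names are illustrative assumptions, not part of TinyNLA. Unlike per-vector MSE, CKA compares a whole batch: each row is one example's activation.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]

def cross_frobenius_sq(x, y):
    """Squared Frobenius norm of X^T Y for row-major X (n x p) and Y (n x q)."""
    total = 0.0
    for i in range(len(x[0])):
        for j in range(len(y[0])):
            dot = sum(xr[i] * yr[j] for xr, yr in zip(x, y))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two sets of representations of the same n examples."""
    if len(x) != len(y):
        raise ValueError("both inputs need one row per example")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    if denom == 0.0:
        raise ValueError("representations must not be constant")
    return cross_frobenius_sq(xc, yc) / denom

# Made-up batch: 4 examples with 2-dimensional "activations".
original = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
# Rotated by 90 degrees and scaled by 3: same geometry, different coordinates.
rotated_scaled = [[-3.0 * b, 3.0 * a] for a, b in original]
# A 1-dimensional representation with little relation to the original.
unrelated = [[1.0], [-1.0], [-1.0], [1.0]]

cka_rotated = linear_cka(original, rotated_scaled)
cka_unrelated = linear_cka(original, unrelated)
print(cka_rotated, cka_unrelated)
```

Linear CKA is invariant to rotation and uniform scaling, so it scores the rotated copy as fully aligned even though a coordinate-wise MSE would not. The two metrics therefore answer different questions, which is a reason to report them side by side.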