TinyNLA — Natural Language Autoencoder

4,600x: initial round-trip error

57x: after architecture fixes

0: architecture rewrites

≤7B: target model size

The idea

When a language model reads a sentence, it builds up a rich internal state at every layer. At each token position, that state is a vector of thousands of numbers encoding everything the model has inferred so far — grammatical role, likely continuations, semantic context, positional information.

Most tools for understanding these states require you to work with the numbers directly, which is impractical for most purposes. TinyNLA takes a different approach, inspired by Anthropic's Natural Language Autoencoder work: train a system that can describe that internal state in plain English (the verbalizer), and then reconstruct the original numbers from just that description (the reconstructor). If the reconstruction is close to the original, the description captured the essential information.

The practical implication: if this works well, you could read what a model was "thinking" at any point, edit that description, and observe the effect on behavior — without ever touching the raw numbers.

The test: take a vector, convert it to English, convert the English back to a vector, and measure how close you got. Low error means the description preserved the essential structure. TinyNLA targets GPT-2 and Qwen2.5 models up to 7B parameters.

Try it: the round-trip in action (interactive)

Choose a sentence below. The demo walks through what TinyNLA actually does at each step: read the model's internal state at the final token, translate it to a plain English description, then reconstruct the state from that description and compare what the model would predict in each case.

TinyNLA verbalizer / reconstructor, layer 1, GPT-2 (representative, based on real runs)

01 Model reads the sentence. The highlighted word is where we read the internal state. Internal state at this position: 3,072 numbers. This is what TinyNLA reads.

02 Verbalizer (numbers to English). TinyNLA translates the internal state to plain English.

03 Reconstructor (English back to numbers). Compare original vs reconstructed predictions: the original, from the model's actual state, against the round-trip, from the description alone. If the descriptions were good, these should look similar.

Verbalizations and predictions are representative examples derived from real GPT-2 layer-1 experiments. Reconstruction quality reflects actual measured round-trip error on these inputs. Numbers are not from a live model call.

How it works (built)

TinyNLA is a two-component system, following the same AV/AR architecture as Anthropic's NLA paper:

Activation Verbalizer — a distilgpt2-based model fine-tuned to take a residual stream activation vector and produce a plain English description. The vector is injected as a single token embedding into a fixed prompt, then the model autoregressively generates the description.
Activation Reconstructor — takes the plain English description produced by the verbalizer and produces a reconstructed activation vector. Quality is measured by how close the reconstructed vector is to the original, after L2 normalisation — so low MSE means direction was preserved, which is what matters for downstream behavior.

Both vectors are L2-normalised before comparison, so the round-trip MSE ignores magnitude and depends only on the angle between them. If the description captured the key information, the reconstructor should land in roughly the same direction in activation space, and the model's predictions from that reconstructed state should be similar to its original predictions.
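The link between normalised MSE and direction can be made concrete. For two unit vectors in d dimensions, the squared distance equals 2 − 2·cos θ, so the MSE is (2 − 2·cos θ) / d. The sketch below checks that identity on toy vectors; the function names and the example vectors are illustrative assumptions, not TinyNLA's actual code.

```python
import math

def l2_normalise(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (computed on the raw vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def round_trip_mse(original, reconstructed):
    """Mean squared error between the two vectors after L2 normalisation."""
    a = l2_normalise(original)
    b = l2_normalise(reconstructed)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy 4-dimensional vectors (hypothetical values, far smaller than a real activation).
original = [3.0, 4.0, 0.0, 0.0]
reconstructed = [6.0, 8.0, 1.0, 0.0]

mse = round_trip_mse(original, reconstructed)
cos = cosine_similarity(original, reconstructed)
identity_rhs = (2.0 - 2.0 * cos) / len(original)  # should match mse up to rounding
```

Because of the normalisation, a reconstruction that is a rescaled copy of the original scores an MSE of zero, which is why the metric says nothing about whether magnitude was recovered.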

Input: prompt tokens → source model (GPT-2 / Qwen2.5)

Extract: residual stream per token per layer → L2 normalise

Verbalize: inject vector as token embedding → autoregress description (distilgpt2)

Reconstruct: encode description → linear head → reconstructed vector

Measure: round-trip MSE, perplexity shift, cosine similarity
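The "inject vector as token embedding" step can be pictured as overwriting one position in the embedded prompt. The following is a minimal sketch using plain lists and a toy 4-dimensional embedding. It assumes the activation already has the same width as the verbalizer's embeddings (a projection would be needed otherwise, and the source does not describe one); `inject_activation` and the slot position are hypothetical names chosen for illustration.

```python
def inject_activation(prompt_embeddings, slot, activation):
    """Return a copy of the embedded prompt with position `slot` replaced by `activation`."""
    width = len(prompt_embeddings[0])
    if len(activation) != width:
        raise ValueError("activation width must match the embedding width")
    if not 0 <= slot < len(prompt_embeddings):
        raise IndexError("slot is outside the prompt")
    injected = [list(row) for row in prompt_embeddings]  # copy so the input stays intact
    injected[slot] = list(activation)
    return injected

# Toy embedded prompt: 3 tokens, 4 dimensions each; position 1 is the placeholder slot.
prompt = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.6, 0.7, 0.8],
]
activation = [0.5, -0.5, 0.5, -0.5]  # stands in for an L2-normalised residual vector

inputs = inject_activation(prompt, slot=1, activation=activation)
```

With a Hugging Face GPT-2 style model, a sequence like `inputs` would typically be passed through the `inputs_embeds` argument instead of `input_ids`. Whether TinyNLA does exactly this is not stated here.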

Results (layer 1 only)

Three compounding architecture failures kept early results uninterpretable. Fixing them dropped the perplexity shift from 4,600x to 57x on layer 1 of GPT-2. A shift of 57x means the reconstructed state still degrades model performance substantially — but it is far from random noise, and the direction of the encoding is clearly being recovered.

Version	Token processing	Encoder	Perplexity shift (layer 1)
v1	Averaged across tokens — loses position	DistilBERT (wrong architecture type)	~4,600x
v2	Averaged across tokens — loses position	DistilBERT (wrong architecture type)	~2,000x
v3	Per-token — preserves position	distilgpt2 (causally compatible)	57x
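The "loses position" note in the table follows from a basic property of averaging: the mean of a set of token vectors is the same whatever order they arrive in. The sketch below shows this with toy 2-dimensional vectors. The values are made up, and real activations also depend on context, but the mean still discards which position each vector came from.

```python
def mean_pool(token_vectors):
    """Average a list of equal-width token vectors into one vector."""
    count = len(token_vectors)
    width = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(width)]

# Three toy token vectors in one order, then the same vectors reversed.
forward_order = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
reverse_order = list(reversed(forward_order))

pooled_forward = mean_pool(forward_order)
pooled_reverse = mean_pool(reverse_order)

# Per-token processing keeps the sequences distinct; pooling does not.
sequences_differ = forward_order != reverse_order
pooled_match = pooled_forward == pooled_reverse
```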

Figure: round-trip perplexity shift, lower is better. v1 baseline: ~4,600x; v3 layer 1: 57x.

These numbers are from layer 1 of GPT-2 only. Whether similar improvement holds across all layers is an open question — the full sweep is listed in goals.
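The page does not define "perplexity shift" formally. One plausible reading, assumed here, is the ratio between the model's perplexity when running from the reconstructed state and its perplexity from the original state. The sketch below computes that ratio from per-token log-probabilities; the function names and the probabilities are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def perplexity_shift(original_log_probs, round_trip_log_probs):
    """Assumed definition: round-trip perplexity divided by original perplexity."""
    return perplexity(round_trip_log_probs) / perplexity(original_log_probs)

# Toy example: after the round trip, every correct token is 10x less likely.
original = [math.log(0.5), math.log(0.25), math.log(0.5)]
round_trip = [math.log(0.05), math.log(0.025), math.log(0.05)]

shift = perplexity_shift(original, round_trip)  # approximately 10.0
```

Under this reading, a shift of 1x would mean the reconstructed state predicts as well as the original, and 57x means the correct tokens are, in the geometric-mean sense, 57 times less likely.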

Finding

Using a plain English description to condition the reconstruction outperforms hand-crafted feature labels. When the verbalizer is prompted with context that captures what the layer is doing — rather than a categorical tag like "determiner" or "function word" — both reconstruction accuracy and downstream perplexity shift improve on layer 1.

This aligns with the finding from Anthropic's NLA paper on larger models. For models in the ≤7B range, the direction is consistent with the larger-scale results, though the magnitude of improvement is harder to pin down without the full layer sweep.

Why this matters for debugging: if a model can reliably describe its own internal state in plain English, a practitioner can read those descriptions, identify what changed at the point a hallucination occurred, and intervene without directly manipulating activation space. That's the target. Current results show the direction is viable at one layer on one model.

Goals

Full layer sweep — run the AV/AR round-trip across all layers of GPT-2 and Qwen2.5 to find where description quality is highest and where it degrades.
Unfreeze reconstructor layers — test whether additional trainable capacity in the AR improves direction recovery beyond what the v3 architecture fix produced.
CKA metrics — add Centered Kernel Alignment alongside MSE for a more principled measure of representational alignment between original and reconstructed states.
Connect to the eval platform — use verbalizer output as explanations for failures caught by the behavioral probe suites, closing the loop between behavioral detection and mechanistic explanation.
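For the CKA goal, a minimal sketch of linear CKA is given below, assuming the standard formulation: centre each feature column, then divide the squared Frobenius norm of the cross-covariance by the product of the Frobenius norms of the two self-covariances. Rows are samples (for example token positions) and columns are features. The matrices are toy values, and this is not taken from the TinyNLA codebase.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(matrix)
    width = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(width)]
    return [[row[j] - means[j] for j in range(width)] for row in matrix]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two matrices with the same row count."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            entry = sum(row_a[i] * row_b[j] for row_a, row_b in zip(a, b))
            total += entry * entry
    return total

def linear_cka(x, y):
    """Linear CKA between two sample-by-feature matrices."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denominator == 0.0:
        raise ValueError("CKA is undefined when a matrix has no variance")
    return numerator / denominator

# Toy "original" states and a 90-degree rotated, rescaled copy standing in for reconstructions.
original_states = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rotated_scaled = [[-2.0 * b, 2.0 * a] for a, b in original_states]

cka_same_geometry = linear_cka(original_states, rotated_scaled)
```

Unlike per-vector MSE, linear CKA compares the geometry of a whole batch of states and is unchanged by rotation or uniform rescaling, so it complements the direction-based metric instead of replacing it.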