
TinyNLA: Natural Language Autoencoder for SLMs

A language model's internal state at any point in processing is a long sequence of numbers. TinyNLA translates those numbers into plain English, then tries to reconstruct them from that description alone — testing how much of the model's "thinking" survives the round-trip through language.

PyTorch · CUDA · GPT-2 · Qwen2.5 · distilgpt2 · HuggingFace

4,600x: Initial round-trip error
57x: After architecture fixes
0: Architecture rewrites
≤7B: Target model size

The idea

When a language model reads a sentence, it builds up a rich internal state at every layer. At each token position, that state is a vector of thousands of numbers encoding everything the model has inferred so far — grammatical role, likely continuations, semantic context, positional information.

Most tools for understanding these states require you to work with the numbers directly, which is impractical for most purposes. TinyNLA takes a different approach, inspired by Anthropic's Natural Language Autoencoder work: train a system that can describe that internal state in plain English (the verbalizer), and then reconstruct the original numbers from just that description (the reconstructor). If the reconstruction is close to the original, the description captured the essential information.

The practical implication: if this works well, you could read what a model was "thinking" at any point, edit that description, and observe the effect on behavior — without ever touching the raw numbers.

The test: take a vector, convert it to English, convert the English back to a vector, and measure how close you got. Low error means the description preserved the essential structure. TinyNLA targets GPT-2 and Qwen2.5 models up to 7B parameters.

Try it: the round-trip in action (interactive)

Choose a sentence below. The demo walks through what TinyNLA actually does at each step: read the model's internal state at the final token, translate it to a plain English description, then reconstruct the state from that description and compare what the model would predict in each case.

TinyNLA verbalizer / reconstructor demo: layer 1, GPT-2 (representative, based on real runs)
01 Model reads the sentence. The highlighted word is where the internal state is read. The internal state at this position is 3,072 numbers, and this is what TinyNLA reads.
02 Verbalizer (numbers to English). TinyNLA translates the internal state into a plain English description.
03 Reconstructor (English back to numbers). The demo compares the original predictions, from the model's actual state, with the predictions after the round-trip, from the description alone, and reports round-trip quality. If the description is good, the two sets of predictions should look similar.
Verbalizations and predictions are representative examples derived from real GPT-2 layer-1 experiments. Reconstruction quality reflects actual measured round-trip error on these inputs. Numbers are not from a live model call.

How it works (built)

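TinyNLA is a two-component system, following the same AV/AR architecture as Anthropic's NLA paper: the verbalizer (AV), which turns an activation vector into an English description, and the reconstructor (AR), which maps that description back to a vector.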

Both vectors are L2-normalised before comparison, so the round-trip MSE directly measures direction agreement. If the description captured the key information, the reconstructor should land in roughly the same place in activation space — and the model's predictions from that reconstructed state should be similar to its original predictions.
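To make the link between MSE and direction concrete: for two unit vectors of dimension d, the MSE equals (2 - 2·cos θ) / d, where θ is the angle between them, so a lower MSE always means a higher cosine similarity. The sketch below checks this numerically in plain Python. The function names and the toy vectors are illustrative assumptions, not taken from the TinyNLA code.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for an original activation and its reconstruction.
original = [3.0, 4.0, 0.0, 1.0]
reconstructed = [2.5, 4.5, 0.5, 0.0]

a, b = l2_normalise(original), l2_normalise(reconstructed)
round_trip_mse = mse(a, b)

# For unit vectors: MSE = (2 - 2 * cos) / d
from_cosine = (2.0 - 2.0 * cosine(original, reconstructed)) / len(a)
```

Because of this identity, reporting MSE on normalised vectors and reporting cosine similarity carry the same information, and any difference in vector length is deliberately ignored.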

Input: prompt tokens → source model (GPT-2 / Qwen2.5)
Extract: residual stream per token per layer → L2 normalise
AV: inject vector as token embedding → autoregress description (distilgpt2)
AR: encode description → linear head → reconstructed vector
Measure: round-trip MSE, perplexity shift, cosine similarity
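The AV and AR steps can be sketched in PyTorch as follows. This is a toy illustration under stated assumptions, not the TinyNLA implementation: `inject_as_prefix`, `ToyReconstructor`, and the projection layer are hypothetical names, random tensors stand in for real embeddings and encoder states, and the widths (3,072 for the activation, 768 for distilgpt2's hidden size) are assumed from the demo description and the published model configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def inject_as_prefix(activation, token_embeds, proj):
    """Hypothetical AV input step: project the normalised activation to the
    embedding width and prepend it as a single soft token."""
    soft = proj(F.normalize(activation, dim=-1)).unsqueeze(1)  # (batch, 1, emb_dim)
    return torch.cat([soft, token_embeds], dim=1)              # (batch, 1 + seq, emb_dim)


class ToyReconstructor(nn.Module):
    """Hypothetical AR step: take the encoder state at the last position of
    the description and map it to activation space with a linear head."""

    def __init__(self, enc_dim: int, act_dim: int):
        super().__init__()
        self.head = nn.Linear(enc_dim, act_dim)

    def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
        # enc_states: (batch, seq_len, enc_dim). With a causal encoder, the
        # last position has attended to the whole description.
        last = enc_states[:, -1, :]
        return F.normalize(self.head(last), dim=-1)


torch.manual_seed(0)
batch, act_dim, emb_dim, seq_len = 2, 3072, 768, 5

activation = torch.randn(batch, act_dim)          # stand-in for an extracted state
prompt_embeds = torch.randn(batch, seq_len, emb_dim)
proj = nn.Linear(act_dim, emb_dim)

av_inputs = inject_as_prefix(activation, prompt_embeds, proj)

ar = ToyReconstructor(emb_dim, act_dim)
description_states = torch.randn(batch, 8, emb_dim)  # stand-in for encoder output
recon = ar(description_states)
```

In a real setup, a tensor like `av_inputs` would typically be passed to a Hugging Face causal language model through its `inputs_embeds` argument instead of token ids, and `description_states` would come from the encoder's hidden states for the generated description.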

Results (layer 1 only)

Three compounding architecture failures kept early results uninterpretable. Fixing them dropped the perplexity shift from 4,600x to 57x on layer 1 of GPT-2. A shift of 57x means the reconstructed state still degrades model performance substantially — but it is far from random noise, and the direction of the encoding is clearly being recovered.

Version | Token processing | Encoder | Perplexity shift (layer 1)
v1 | Averaged across tokens (loses position) | DistilBERT (wrong architecture type) | ~4,600x
v2 | Averaged across tokens (loses position) | DistilBERT (wrong architecture type) | ~2,000x
v3 | Per-token (preserves position) | distilgpt2 (causally compatible) | 57x
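One way to see why averaging across tokens loses position: the mean of a set of vectors does not change when the tokens are reordered, so the pooling step alone cannot tell two orderings apart. The toy sketch below uses made-up 2-d vectors and fixed per-word vectors. Real activations are context dependent, so this only illustrates the pooling operation itself, and all names are hypothetical.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Made-up per-word vectors, for illustration only.
toy = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

sent_a = [toy[w] for w in ["dog", "bites", "man"]]
sent_b = [toy[w] for w in ["man", "bites", "dog"]]

pooled_a = mean_pool(sent_a)
pooled_b = mean_pool(sent_b)

# The per-token sequences differ, but the pooled vectors are identical.
same_after_pooling = pooled_a == pooled_b
differ_per_token = sent_a != sent_b
```

Keeping one vector per token, as in v3, avoids this collapse because the sequence order is still present in the input to the verbalizer.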

Round-trip perplexity shift (lower is better): v1 baseline ~4,600x; v3 layer 1 57x.

These numbers are from layer 1 of GPT-2 only. Whether similar improvement holds across all layers is an open question — the full sweep is listed in goals.
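The write-up does not spell out how the perplexity shift is computed. One plausible reading, assumed here, is the ratio of the source model's perplexity when it continues from the reconstructed state to its perplexity from the original state, with perplexity taken as the exponential of the mean negative log-likelihood per token. The sketch below uses made-up log-probabilities, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


def perplexity_shift(original_logprobs, reconstructed_logprobs):
    """Ratio of perplexities; 1.0 means the round-trip changed nothing."""
    return perplexity(reconstructed_logprobs) / perplexity(original_logprobs)


# Made-up log-probabilities of the same target tokens under each condition.
original_lp = [-1.2, -0.7, -2.1, -0.9]
patched_lp = [-4.8, -5.5, -6.0, -4.1]

shift = perplexity_shift(original_lp, patched_lp)
no_change = perplexity_shift(original_lp, original_lp)
```

Under this reading, a shift of 57x corresponds to the mean per-token log-likelihood dropping by ln(57), roughly 4.0 nats.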

Finding

Using a plain English description to condition the reconstruction outperforms hand-crafted feature labels. When the verbalizer is prompted with context that captures what the layer is doing — rather than a categorical tag like "determiner" or "function word" — both reconstruction accuracy and downstream perplexity shift improve on layer 1.

This aligns with the finding from Anthropic's NLA paper on larger models. For models in the ≤7B range, the direction is consistent with the larger-scale results, though the magnitude of improvement is harder to pin down without the full layer sweep.

Why this matters for debugging: if a model can reliably describe its own internal state in plain English, a practitioner can read those descriptions, identify what changed at the point a hallucination occurred, and intervene without directly manipulating activation space. That's the target. Current results show the direction is viable at one layer on one model.

Goals