Problem
A model can pass every accuracy benchmark and still fail in production: agreeing with a user it should correct, hallucinating a tool call that doesn't exist, or miscalibrating its own confidence in a multi-step plan. These failures are behavioral, not statistical, and they appear exactly when a system has enough autonomy to matter.
The platform addresses this at two levels. The behavioral layer catches failures externally through structured probing. The interpretability layer explains them internally through residual stream analysis.
Try a probe (Interactive)
In the probe explorer, select a probe suite to see the prompt structure it uses, a representative verdict from an evaluation run, and the score.
Architecture (Built)
The platform runs as a single FastAPI service with SSE streaming for live evaluation progress. The original two-process design communicated through the filesystem and produced race conditions under load. Consolidating into a single async process eliminated those races and cut evaluation latency by roughly 60%, measured with the same probe suite on the same hardware before and after the change.
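As a rough illustration of the single-process design, the sketch below runs a probe suite inside one asyncio event loop and formats each result as a Server-Sent Events frame. It uses only the standard library. `run_probe`, `stream_suite`, `collect`, and the probe names are hypothetical stand-ins, not the platform's actual code. In a FastAPI service, an async generator like `stream_suite` would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.

```python
import asyncio
import json
from typing import AsyncIterator


def sse_frame(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def run_probe(name: str) -> dict:
    """Hypothetical stand-in for a single probe evaluation."""
    await asyncio.sleep(0)  # yield to the event loop, as a real model call would
    return {"probe": name, "passed": True}


async def stream_suite(probes: list[str]) -> AsyncIterator[str]:
    """Run probes in-process and yield one progress frame per finished probe."""
    total = len(probes)
    for done, name in enumerate(probes, start=1):
        result = await run_probe(name)
        yield sse_frame("progress", {**result, "done": done, "total": total})
    yield sse_frame("complete", {"total": total})


async def collect(probes: list[str]) -> list[str]:
    """Drain the stream into a list (handy for tests and local debugging)."""
    return [frame async for frame in stream_suite(probes)]


if __name__ == "__main__":
    # Probe names are illustrative only.
    for frame in asyncio.run(collect(["sycophancy", "tool_hallucination"])):
        print(frame, end="")
```

Because results pass through the generator rather than through files on disk, there is no shared file for two processes to race on, which is the property the consolidation relied on.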
[Figure: Evaluation latency, two-process vs single-process. The reduction comes from removing file-IPC round-trips and switching to async SSE orchestration. Same probe suite, same hardware, before and after.]
Interpretability layer (Early-stage)
Behavioral probes tell you that a model failed; the interpretability layer tells you why. It extracts per-token residual stream trajectories, captures attention metrics, and computes cross-model geometric similarity. This layer shares technology with TinyNLA. Current coverage is limited to single-layer measurements on GPT-2.
[Figure: Two functionally distinct behaviors (green vs teal) show cosine similarity 0.995 at GPT-2 layer 5, while the KL divergence between their output distributions is approximately 7. One layer, one model; the full sweep is listed under Goals.]
Finding
At GPT-2 layer 5, activation vectors with cosine similarity 0.995 produced output distributions with KL divergence around 7. Geometrically near-identical, functionally separated. Distance-based probing methods that assume geometric proximity predicts behavioral similarity miss exactly this failure class.
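The decoupling is easy to reproduce in a toy setting. The sketch below is a hand-built two-dimensional example, not the GPT-2 measurement: the vectors `H_A` and `H_B` and the readout matrix `W` are assumptions chosen so that the activations share one large component while the readout attends only to the small component where they differ. The numbers are picked to echo the reported magnitudes, nothing more.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def readout(weights, h):
    """Linear map from an activation vector to output logits."""
    return [sum(w * x for w, x in zip(row, h)) for row in weights]


# Toy activations: a large shared component, a small opposing one.
H_A = [10.0, 0.5]
H_B = [10.0, -0.5]

# Toy readout that ignores the shared direction and amplifies the small one.
W = [[0.0, 7.0], [0.0, -7.0]]

if __name__ == "__main__":
    p = softmax(readout(W, H_A))
    q = softmax(readout(W, H_B))
    print("cosine similarity:", round(cosine_similarity(H_A, H_B), 4))
    print("KL(p || q) in nats:", round(kl_divergence(p, q), 3))
```

The cosine similarity is dominated by the shared first coordinate, while the output distributions depend only on the second, so the two measures are free to disagree.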
Goals
- Layer-wise sweep — map the geometry/function decoupling pattern across all layers, not just layer 5.
- CKA metrics — add Centered Kernel Alignment alongside cosine similarity.
- Multi-turn traces — extend probe suites from single-turn to multi-step agent traces where failures compound.
- Live endpoint — host a constrained eval server so the probe explorer runs against a real model.
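For the CKA goal, a minimal pure-Python sketch of linear CKA is shown here as one possible starting point. It assumes the common linear formulation, the squared Frobenius norm of Y^T X divided by the product of the Frobenius norms of X^T X and Y^T Y after centering each column. Which variant the platform will adopt is not stated, and the function names and sample matrices are illustrative only.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram(a, b):
    """Return a^T b for row-major matrices with the same number of rows."""
    return [[sum(x * y for x, y in zip(ca, cb)) for cb in zip(*b)]
            for ca in zip(*a)]


def frobenius(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in m for v in row))


def linear_cka(x, y):
    """Linear CKA between two (samples x features) activation matrices."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = frobenius(cross_gram(yc, xc)) ** 2
    denominator = frobenius(cross_gram(xc, xc)) * frobenius(cross_gram(yc, yc))
    return numerator / denominator


# Illustrative activations: 4 samples, 2 features.
X = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]]
# The same points rotated by 90 degrees and scaled by 3.
X_ROTATED = [[-3.0 * b, 3.0 * a] for a, b in X]
# An unrelated single-feature representation of the same 4 samples.
Y = [[1.0], [-1.0], [1.0], [-1.0]]

if __name__ == "__main__":
    print("CKA(X, rotated and scaled X):", linear_cka(X, X_ROTATED))
    print("CKA(X, Y):", linear_cka(X, Y))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, and it tolerates different feature widths, which is what makes it a candidate for cross-model comparison alongside cosine similarity.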