← All work

LLM Infrastructure &
Interpretability Platform

Nine behavioral probe suites catch how an open-weight LLM fails. Residual stream analysis explains why. Both layers run on a single FastAPI service targeting Qwen2.5 and GPT-2 on constrained hardware.

FastAPISSEPyTorch CUDAHuggingFaceMLflowDocker
0Probe suites
~60%Latency reduction
0Model families
4 GBVRAM budget

Problem

A model can pass every accuracy benchmark and still fail in production: agreeing with a user it should correct, hallucinating a tool call that doesn't exist, or miscalibrating its own confidence in a multi-step plan. These failures are behavioral, not statistical, and they appear exactly when a system has enough autonomy to matter.

The platform addresses this at two levels. The behavioral layer catches failures externally through structured probing. The interpretability layer explains them internally through residual stream analysis.

Try a probe Interactive

Select a probe suite to see the prompt structure it uses, a representative verdict from an evaluation run, and the score.

Probe Suite Explorer 9 / 9 implemented
Prompts and verdicts are representative examples from real probe runs, fixed here rather than calling a live model endpoint.

Architecture Built

The platform runs as a single FastAPI service with SSE streaming for live evaluation progress. The original two-process design communicated through the filesystem and produced race conditions under load. Consolidating into a single async process eliminated the races and cut latency by roughly 60%, measured on the same probe suite before and after on the same hardware.

Evaluation latency — two-process vs single-process

Two-process
baseline
Single-process
~60% faster

Reduction from removing file-IPC round-trips and switching to async SSE orchestration. Same probe suite, same hardware, before and after.

Client
SSE streamLive dashboard
API
FastAPIAsync orchestratorMLflow logger
Behavioral
Probe runner9 probe suitesScorer
Interpretability
TinyNLAResidual streamGeometry metrics
Models
Qwen2.5-3BQwen2.5-7BGPT-2

Interpretability layer Early-stage

Behavioral probes tell you that a model failed. The interpretability layer tells you why. It extracts per-token residual stream trajectories, captures attention metrics, and computes cross-model geometric similarity. This layer shares technology with TinyNLA. Current coverage is single-layer measurements on GPT-2.

function A function B cosine similarity: 0.995 PCA dim 0

Two functionally distinct behaviors (green vs teal) show cosine similarity 0.995 at GPT-2 layer 5, but KL divergence over output distributions is approximately 7. One layer, one model — the full sweep is a goal below.

Finding

At GPT-2 layer 5, activation vectors with cosine similarity 0.995 produced output distributions with KL divergence around 7. Geometrically near-identical, functionally separated. Distance-based probing methods that assume geometric proximity predicts behavioral similarity miss exactly this failure class.

Goals