LLM Infrastructure & Interpretability Platform

0Probe suites

~60%Latency reduction

0Model families

4 GBVRAM budget

Problem

A model can pass every accuracy benchmark and still fail in production: agreeing with a user it should correct, hallucinating a tool call that doesn't exist, or miscalibrating its own confidence in a multi-step plan. These failures are behavioral, not statistical, and they appear exactly when a system has enough autonomy to matter.

The platform addresses this at two levels. The behavioral layer catches failures externally through structured probing. The interpretability layer explains them internally through residual stream analysis.

Try a probe Interactive

Select a probe suite to see the prompt structure it uses, a representative verdict from an evaluation run, and the score.

Probe Suite Explorer 9 / 9 implemented

Prompts and verdicts are representative examples from real probe runs, fixed here rather than calling a live model endpoint.

Architecture Built

The platform runs as a single FastAPI service with SSE streaming for live evaluation progress. The original two-process design communicated through the filesystem and produced race conditions under load. Consolidating into a single async process eliminated the races and cut latency by roughly 60%, measured on the same probe suite before and after on the same hardware.

Evaluation latency — two-process vs single-process

Two-process

baseline

Single-process

~60% faster

Reduction from removing file-IPC round-trips and switching to async SSE orchestration. Same probe suite, same hardware, before and after.

Client

SSE stream←Live dashboard

API

FastAPI→Async orchestrator→MLflow logger

Behavioral

Probe runner→9 probe suites→Scorer

Interpretability

TinyNLA→Residual stream→Geometry metrics

Models

Qwen2.5-3BQwen2.5-7BGPT-2

Interpretability layer Early-stage

Behavioral probes tell you that a model failed. The interpretability layer tells you why. It extracts per-token residual stream trajectories, captures attention metrics, and computes cross-model geometric similarity. This layer shares technology with TinyNLA. Current coverage is single-layer measurements on GPT-2.

Two functionally distinct behaviors (green vs teal) show cosine similarity 0.995 at GPT-2 layer 5, but KL divergence over output distributions is approximately 7. One layer, one model — the full sweep is a goal below.

Finding

At GPT-2 layer 5, activation vectors with cosine similarity 0.995 produced output distributions with KL divergence around 7. Geometrically near-identical, functionally separated. Distance-based probing methods that assume geometric proximity predicts behavioral similarity miss exactly this failure class.

Goals

Layer-wise sweep — map the geometry/function decoupling pattern across all layers, not just layer 5.
CKA metrics — add Centered Kernel Alignment alongside cosine similarity.
Multi-turn traces — extend probe suites from single-turn to multi-step agent traces where failures compound.
Live endpoint — host a constrained eval server so the probe explorer runs against a real model.