LLM Control System — Runtime Reliability for Generative Models
Active · 2026–Present


Built a full-stack system to monitor and control LLM generation at the token level, with adaptive instability detection and real-time intervention.

Python · MLOps · LLM · Control Systems · System Design

Key Insight

"Detect and correct unstable token generation in real time using entropy-based control."

System Capabilities

  • Token-by-token observability with probability and entropy extraction
  • Real-time stability detection (entropy collapse, repetition loops, lock-in)
  • Adaptive policy control with mid-generation intervention
  • Confidence scoring with audit trail
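The first capability, per-token probability and entropy extraction, can be sketched as follows. This is a minimal illustration, assuming the decoding loop exposes the raw next-token logits at each step; function names here are illustrative, not the system's actual API.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution.
    Near-zero entropy means the model is effectively deterministic;
    a sudden drop is the raw signal behind collapse/lock-in detection."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)

# A sharply peaked distribution has near-zero entropy;
# a uniform one over n tokens has entropy log(n).
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
flat = token_entropy([1.0, 1.0, 1.0, 1.0])
```

Recording `token_entropy` at every decode step yields the per-token trace that the stability detectors consume.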


The system tracks entropy during decoding, detects instability patterns such as repetition loops and entropy collapse, and applies corrective actions (temperature adjustment, regeneration).
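The three detector categories named later (entropy_collapse, repetition_loop, low_entropy_lock) could be implemented roughly like this. A sketch only: the thresholds, window sizes, and n-gram length are assumptions, not the system's calibrated values.

```python
# Illustrative thresholds — the real system's calibrated values are not given here.
ENTROPY_COLLAPSE_DROP = 0.8   # fractional drop vs. the recent average
LOW_ENTROPY_FLOOR = 0.1       # nats; sustained values below this = lock-in
LOOP_NGRAM = 4                # length of a repeated n-gram that counts as a loop

def detect_instability(entropies, tokens):
    """Return the list of instability signals firing at the current step.
    entropies: per-token entropy trace; tokens: generated token ids so far."""
    signals = []
    # entropy_collapse: sharp drop relative to the recent window
    if len(entropies) >= 6:
        recent = sum(entropies[-6:-1]) / 5
        if recent > 0 and entropies[-1] < recent * (1 - ENTROPY_COLLAPSE_DROP):
            signals.append("entropy_collapse")
    # low_entropy_lock: entropy pinned near zero for several steps
    if len(entropies) >= 4 and max(entropies[-4:]) < LOW_ENTROPY_FLOOR:
        signals.append("low_entropy_lock")
    # repetition_loop: the last n-gram exactly repeats the one before it
    if len(tokens) >= 2 * LOOP_NGRAM:
        if tokens[-LOOP_NGRAM:] == tokens[-2 * LOOP_NGRAM:-LOOP_NGRAM]:
            signals.append("repetition_loop")
    return signals
```

In practice each detector trades sensitivity against false positives, which is exactly the calibration challenge noted below.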

Includes:

  • Adaptive control loop (signal → decision → intervention)
  • Token-level observability (entropy + instability traces)
  • Confidence scoring based on generation stability
  • FastAPI backend + interactive UI dashboard
  • Support for 7B models (Mistral / Qwen via API)

Result: Prevents degenerate outputs and improves generation stability in real time.
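The decision step of the signal → decision → intervention loop can be sketched as an explicit policy table. The rules and parameters below are hypothetical, chosen to illustrate the structure of a formal, inspectable policy rather than the system's actual rules.

```python
# Hypothetical rules mapping each instability signal to an intervention.
POLICY = {
    "entropy_collapse": {"action": "raise_temperature", "delta": 0.2},
    "low_entropy_lock": {"action": "raise_temperature", "delta": 0.3},
    "repetition_loop":  {"action": "regenerate", "backtrack": 8},
}

MAX_TEMPERATURE = 1.5  # illustrative cap

def decide(signal, temperature):
    """Map a detected signal to an intervention, mid-generation.
    Returns a decision dict that the decoding loop applies before
    the next step — no restart required for temperature changes."""
    rule = POLICY.get(signal)
    if rule is None:
        return {"action": "continue", "temperature": temperature}
    if rule["action"] == "raise_temperature":
        return {"action": "raise_temperature",
                "temperature": min(temperature + rule["delta"], MAX_TEMPERATURE)}
    # regenerate: back up a few tokens and resample that span
    return {"action": "regenerate", "backtrack": rule["backtrack"],
            "temperature": temperature}
```

Because the policy is a plain table rather than a learned model, every intervention in the audit trail can be traced back to the rule that produced it.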

Key Highlights

  • Token-level observability via manual decoding loop with per-token probability and entropy extraction
  • Three-category stability detector system: entropy_collapse, repetition_loop, low_entropy_lock
  • Formal policy controller with explicit intervention rules mapped to instability signals
  • Adaptive regeneration and temperature adjustment — mid-generation intervention without restart
  • Confidence scoring weighted by entropy (50%), instability events (35%), regenerations (15%)
  • 82% instability reduction and +0.07 average confidence improvement on adversarial prompts
  • +0.06 to +0.10 confidence uplift in compare mode on typical workloads
  • Run history API and JSON trace persistence with full decision audit trail
  • MPS/CUDA acceleration with Mistral 7B FP16 (~14GB VRAM required)
  • Interactive Next.js dashboard for baseline vs adaptive comparison with live metrics
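The 50/35/15 confidence weighting above can be expressed as a simple weighted score. The weights come from the highlights; the normalization constants (entropy scale and event caps) are illustrative assumptions.

```python
def confidence_score(mean_entropy, n_instability, n_regens,
                     max_entropy=3.0, event_cap=5, regen_cap=3):
    """Weighted confidence in [0, 1]: entropy 50%, instability events 35%,
    regenerations 15%. Normalization constants are illustrative — in this
    toy version, lower mean entropy and fewer events mean higher confidence."""
    entropy_term = max(0.0, 1.0 - mean_entropy / max_entropy)
    stability_term = max(0.0, 1.0 - n_instability / event_cap)
    regen_term = max(0.0, 1.0 - n_regens / regen_cap)
    return 0.50 * entropy_term + 0.35 * stability_term + 0.15 * regen_term
```

As the learnings below note, a score like this is only useful once its weights are validated against observed downstream failures.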

Tech Stack

Core ML

Token-level Control · Entropy Analysis · Logit Extraction · Mistral 7B

Control Policy

Stability Detectors · Adaptive Regeneration · Temperature Adjustment

Backend

FastAPI · Python · Uvicorn

Frontend

Next.js · Real-time Telemetry · Comparison Dashboard

Infrastructure

MPS/CUDA · Model Serving · JSON Trace Logging

Challenges

  • Designing stability detectors that reliably trigger on actual failures without excessive false positives
  • Tuning the mid-generation intervention policy so corrections do not degrade output quality
  • Computing token-level entropy efficiently during generation without performance bottlenecks
  • Calibrating confidence metric weights to align with user perception of output reliability
  • Handling model instability across diverse prompt domains without overfitting to training patterns

Key Learnings

  • Token-level observability enables reliable failure detection — aggregate metrics alone miss early collapse signals
  • Control policy specification must be explicit and formal — heuristic rules are more debuggable than learned policies for critical systems
  • Confidence metrics are only useful if they accurately predict downstream failure — weighted scoring requires empirical validation
  • Mid-generation intervention works but introduces latency trade-offs — cold-start regeneration may be more cost-effective than adaptive adjustment
  • Interactive dashboards for compare mode are essential for understanding control system behavior — summary metrics alone don't build trust