LLM Reliability Evaluation Platform
Active · 2026-Present


Production-grade system for measuring LLM behavioral consistency via Monte Carlo sampling. Benchmarked Qwen 2.5-7B across 240 inference calls; the v2 grid sweep found temperature is a lever, but top-p dominates with 3.1× the effect magnitude.

Python · FastAPI · Next.js · LLM · NLP · Research

Standard benchmarks evaluate accuracy on a single inference pass. This platform evaluates something more deployment-relevant: behavioral consistency — how stable a model's outputs are across independent inference runs on identical inputs.

Built after recognising that open-source LLM evaluation tooling optimises for leaderboard scores but gives almost no signal about whether a model will behave predictably in production. The gap is especially sharp for 7B-class models being deployed in high-stakes domains.

The platform runs Monte Carlo sampling (n=5 per prompt) across a full temperature × top-p grid sweep and computes instability, confidence, semantic dispersion, and entropy for each inference. A DBSCAN clustering pass groups semantically similar outputs to identify whether a model converges or diverges on a given prompt.
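As a minimal sketch of the per-prompt metrics (the formulas below are illustrative assumptions over cluster labels, not the platform's exact definitions), confidence, instability, and entropy can all be derived from how the n=5 samples distribute across semantic clusters:

```python
from collections import Counter
import math

def reliability_metrics(labels):
    """Illustrative per-prompt reliability metrics computed from the
    cluster labels of n Monte Carlo samples (assumed formulas)."""
    n = len(labels)
    counts = Counter(labels)
    dominant = counts.most_common(1)[0][1]
    confidence = dominant / n       # share of samples in the dominant cluster
    instability = 1.0 - confidence  # how often the model diverges from it
    # Shannon entropy (bits) over the cluster-size distribution
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"confidence": confidence, "instability": instability, "entropy": entropy}

# e.g. 4 of 5 samples converge to one semantic cluster
m = reliability_metrics([0, 0, 0, 0, 1])
```

A fully convergent prompt (all five samples in one cluster) yields instability 0 and entropy 0; maximal divergence (five singleton clusters) yields instability 0.8 and entropy log2(5).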

v2 ran 240 inference calls (48 prompts × 5 MC samples) and significantly revised the v1 finding: temperature is not irrelevant; it interacts with top-p, but top-p dominates reliability at 3.1× the effect magnitude of temperature. The optimal configuration for Qwen 2.5-7B is T=0.1, top-p=0.5. v2 also achieved 0% null generation, compared to 11.8% in v1, through improved retry logic and failure classification.
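Structurally, a grid sweep like this expands to one inference call per (prompt, temperature, top-p, sample) combination. A sketch, with hypothetical axis values since the write-up doesn't list the exact grid:

```python
from itertools import product

MC_SAMPLES = 5  # n=5 independent passes per prompt, per config

def expand_grid(prompts, temperatures, top_ps, n=MC_SAMPLES):
    """Enumerate every inference call in a temperature × top-p sweep:
    one entry per (prompt, T, top_p, sample) combination."""
    return [
        {"prompt": pr, "temperature": t, "top_p": p, "sample": i}
        for pr, t, p, i in product(prompts, temperatures, top_ps, range(n))
    ]

# Hypothetical axes for illustration only
calls = expand_grid(["Explain DBSCAN."], [0.1, 0.7], [0.5, 0.9])
# 1 prompt × 2 temperatures × 2 top-p values × 5 samples = 20 calls
```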

Key Highlights

  • Monte Carlo sampler: n=5 independent inference passes per prompt across full temperature × top-p grid
  • Semantic clustering via sentence-transformers + DBSCAN to group divergent outputs
  • Reliability metrics: instability, confidence, entropy, semantic dispersion, cluster dominance
  • v2: 240 inference calls (48 prompts × 5 MC samples) — full grid sweep across T and top-p
  • v2 finding: top-p dominates reliability at 3.1× the effect magnitude of temperature
  • v2 finding: optimal config for Qwen 2.5-7B is T=0.1, top-p=0.5 (coding instability 0.163)
  • v2 finding: 86.5% escalation rate — philosophical prompts show highest instability
  • v2 finding: 0% null generation (vs 11.8% in v1) after retry logic and failure classification
  • v1 finding: temperature appeared negligible — revised in v2 after full grid sweep
  • Live Next.js dashboard with real-time telemetry, escalation badges, and offline JSON reports
  • Published benchmark dataset and research analysis notebook on Kaggle (v1 + v2)
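The clustering step in the highlights above can be sketched with scikit-learn's DBSCAN over cosine distance. The real pipeline embeds each model output with sentence-transformers; here tiny hand-made vectors stand in for those embeddings, and `eps`/`min_samples` are illustrative values, not the platform's tuned settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_outputs(embeddings, eps=0.2, min_samples=2):
    """Group semantically similar outputs by cosine distance.
    Label -1 marks DBSCAN noise: an output that converges with
    no cluster, i.e. a divergent generation."""
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="cosine").fit_predict(embeddings)

# Stand-in embeddings; in the real pipeline these come from
# sentence-transformers encoding each of the n=5 outputs.
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
labels = cluster_outputs(emb)
```

The first two vectors are nearly parallel (cosine distance ≈ 0.005), so they share a cluster; the orthogonal third vector is flagged as noise.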

Tech Stack

Frontend

Next.js · TypeScript · Vercel

Backend

FastAPI · Python · ngrok

ML / Eval

sentence-transformers · DBSCAN · Qwen 2.5-7B-Instruct

Infrastructure

Kaggle T4 GPU · Docker

Challenges

  • Designing a benchmark that stress-tests failure modes without being trivially gameable
  • 11.8% of v1 generations were null and failed silently — fixing this required retry logic and failure classification
  • A frontend entropy-mapping bug zeroed out entropy values in the first dataset — required a clean rerun
  • Keeping the FastAPI backend alive across a 2.5-hour T4 session with retry and heartbeat logic
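The null-generation challenge above comes down to never letting an empty output pass through silently. A minimal sketch of a retry wrapper with failure classification (the platform's actual classification is richer; `call_model` is a hypothetical inference callable):

```python
import time

def generate_with_retry(call_model, prompt, max_retries=3, backoff=2.0):
    """Retry wrapper that classifies empty/null generations as explicit
    failures instead of passing them through silently (sketch)."""
    for attempt in range(max_retries):
        try:
            text = call_model(prompt)
        except Exception:
            text = None  # transport/backend error: treat like a null generation
        if text and text.strip():
            return {"status": "ok", "text": text, "attempts": attempt + 1}
        time.sleep(backoff * (attempt + 1))  # linear backoff before retrying
    return {"status": "null_generation", "text": None, "attempts": max_retries}
```

Classifying exhausted retries as `null_generation` (rather than returning an empty string) is what lets the benchmark report the 0% vs 11.8% figure at all.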

Key Learnings

  • Behavioral consistency is a more deployment-relevant signal than single-pass accuracy
  • Temperature and top-p interact — top-p dominates reliability at 3.1×; single-axis sweeps miss this
  • v1 findings can be wrong — the grid sweep in v2 meaningfully revised the temperature conclusion
  • Silent failure modes (null generation) are more dangerous than noisy ones — retry logic is non-optional
  • Evaluation infrastructure is itself a research contribution, not just scaffolding