LLM Reliability Evaluation Platform
Production-grade system for measuring LLM behavioral consistency via Monte Carlo sampling. Benchmarked Qwen 2.5-7B across 240 inference calls; the v2 grid sweep found that temperature is a real lever, but top-p dominates reliability at 3.1× the effect magnitude.
Standard benchmarks evaluate accuracy on a single inference pass. This platform evaluates something more deployment-relevant: behavioral consistency — how stable a model's outputs are across independent inference runs on identical inputs.
Built after recognising that open-source LLM evaluation tooling optimises for leaderboard scores but gives almost no signal about whether a model will behave predictably in production. The gap is especially sharp for 7B-class models being deployed in high-stakes domains.
The platform runs Monte Carlo sampling (n=5 per prompt) across a full temperature × top-p grid sweep and computes instability, confidence, semantic dispersion, and entropy for each inference. A DBSCAN clustering pass groups semantically similar outputs to identify whether a model converges or diverges on a given prompt.
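The clustering pass can be sketched as follows: a minimal illustration using scikit-learn's DBSCAN over cosine distance, assuming sentence-transformers-style embeddings have already been computed. The toy vectors, `eps`, and `min_samples` values here are hypothetical, not the platform's actual settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_generations(embeddings: np.ndarray,
                        eps: float = 0.3,
                        min_samples: int = 2) -> np.ndarray:
    """Group semantically similar generations by cosine distance.

    Points labelled -1 are outliers: generations that agree with no
    other sample, one signal of divergence on a prompt.
    """
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="cosine").fit_predict(embeddings)

# Toy stand-ins for sentence-transformer embeddings of 5 MC samples:
# two near-identical pairs plus one divergent generation.
samples = np.array([
    [1.00, 0.00, 0.00],
    [0.98, 0.05, 0.00],   # ~same meaning as sample 0
    [0.00, 1.00, 0.00],
    [0.00, 0.97, 0.10],   # ~same meaning as sample 2
    [0.50, 0.50, 0.70],   # divergent outlier
])
labels = cluster_generations(samples)
print(labels)  # [ 0  0  1  1 -1]: two answer clusters plus one outlier
```

A converged prompt yields one dominant cluster; a diverging prompt splits into several clusters or noise points.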
v2 ran 240 inference calls (48 prompts × 5 MC samples) and significantly revised the v1 finding: temperature is not irrelevant; it interacts with top-p, though top-p dominates reliability at 3.1× the effect magnitude. The optimal configuration for Qwen 2.5-7B is T=0.1, top-p=0.5. v2 also achieved 0% null generation, compared to 11.8% in v1, through improved retry logic and failure classification.
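Enumerating a temperature × top-p sweep is straightforward; the sketch below shows the idea with hypothetical grid values (the actual v2 grid points and how the 48-prompt, 240-call budget was distributed across cells are not restated here).

```python
from itertools import product

# Hypothetical sweep values; the real v2 grid points are not listed in this README.
TEMPERATURES = [0.1, 0.4, 0.7, 1.0]
TOP_PS = [0.5, 0.7, 0.9, 1.0]
MC_SAMPLES = 5  # independent inference passes per (prompt, config) cell

def sweep_plan(prompts):
    """Yield every inference call in a full temperature × top-p grid sweep."""
    for temperature, top_p in product(TEMPERATURES, TOP_PS):
        for prompt in prompts:
            for sample_idx in range(MC_SAMPLES):
                yield {"prompt": prompt, "temperature": temperature,
                       "top_p": top_p, "sample": sample_idx}

calls = list(sweep_plan(["What is 2 + 2?"]))
print(len(calls))  # 4 temps × 4 top-p values × 1 prompt × 5 samples = 80
```

Sweeping both axes jointly is what exposes the temperature/top-p interaction; a single-axis sweep, as in v1, cannot.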
Key Highlights
- Monte Carlo sampler: n=5 independent inference passes per prompt across full temperature × top-p grid
- Semantic clustering via sentence-transformers + DBSCAN to group divergent outputs
- Reliability metrics: instability, confidence, entropy, semantic dispersion, cluster dominance
- v2: 240 inference calls (48 prompts × 5 MC samples) — full grid sweep across T and top-p
- v2 finding: top-p dominates reliability at 3.1× the effect magnitude of temperature
- v2 finding: optimal config for Qwen 2.5-7B is T=0.1, top-p=0.5 (coding instability 0.163)
- v2 finding: 86.5% escalation rate — philosophical prompts show highest instability
- v2 finding: 0% null generation (vs 11.8% in v1) after retry logic and failure classification
- v1 finding: temperature appeared negligible — revised in v2 after full grid sweep
- Live Next.js dashboard with real-time telemetry, escalation badges, and offline JSON reports
- Published benchmark dataset and research analysis notebook on Kaggle (v1 + v2)
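The reliability metrics above can be computed from the per-prompt cluster labels. The definitions below are plausible illustrations only (the README does not spell out the exact formulas): dominance as the largest-cluster fraction, entropy as Shannon entropy of the cluster distribution, and instability as 1 − dominance.

```python
import math
from collections import Counter

def reliability_metrics(labels):
    """Per-prompt metrics over the cluster labels of n MC samples.

    Illustrative definitions, not the platform's exact formulas:
    dominance   = fraction of samples in the largest cluster
    entropy     = Shannon entropy (bits) of the cluster distribution
    instability = 1 - dominance (assumed)
    DBSCAN noise labels (-1) are counted as singleton clusters here.
    """
    n = len(labels)
    counts = Counter(labels)
    dominance = max(counts.values()) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"dominance": dominance,
            "entropy": entropy,
            "instability": 1 - dominance}

print(reliability_metrics([0, 0, 0, 0, 0]))   # converged: entropy 0.0, instability 0.0
print(reliability_metrics([0, 0, 1, 1, -1]))  # divergent: entropy ~1.52, instability 0.6
```

Under these definitions, a perfectly consistent model scores zero on both entropy and instability, and any split across clusters pushes both upward.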
Tech Stack
Frontend
Backend
ML / Eval
Infrastructure
Challenges
- Designing a benchmark that stress-tests failure modes without being trivially gameable
- Handling an 11.8% null-generation rate that failed silently in v1, which required retry logic and failure classification
- Frontend entropy mapping bug zeroed out entropy values in first dataset — required clean rerun
- Keeping the FastAPI backend alive across a 2.5-hour T4 session with retry and heartbeat logic
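The retry-and-classify pattern from the challenges above can be sketched as follows. The failure taxonomy and function names here are hypothetical, and `call` stands in for any prompt-to-text client (e.g. a wrapper around the FastAPI backend); the real platform's classes and backoff schedule may differ.

```python
import time

# Hypothetical failure taxonomy; the platform's actual classes are not listed here.
NULL, TRUNCATED, OK = "null_generation", "truncated", "ok"

def classify(text, min_len: int = 1) -> str:
    """Label a generation so that failures are recorded, never silent."""
    if text is None or not text.strip():
        return NULL
    if len(text) < min_len:
        return TRUNCATED
    return OK

def generate_with_retry(call, prompt, max_retries: int = 3, backoff_s: float = 1.0):
    """Retry null generations instead of silently logging them as data.

    `call` is any callable prompt -> text. Returns (text, failure_class, attempts)
    so the benchmark can report failure rates explicitly.
    """
    for attempt in range(1, max_retries + 1):
        text = call(prompt)
        status = classify(text)
        if status == OK:
            return text, status, attempt
        time.sleep(backoff_s * attempt)  # linear backoff between retries
    return text, status, max_retries
```

Returning the failure class alongside the text is what turns a silent failure mode into a measurable one, which is how v1's 11.8% null rate became visible and v2's 0% became verifiable.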
Key Learnings
- Behavioral consistency is a more deployment-relevant signal than single-pass accuracy
- Temperature and top-p interact — top-p dominates reliability at 3.1×; single-axis sweeps miss this
- v1 findings can be wrong — the grid sweep in v2 meaningfully revised the temperature conclusion
- Silent failure modes (null generation) are more dangerous than noisy ones — retry logic is non-optional
- Evaluation infrastructure is itself a research contribution, not just scaffolding
