Building an Indian Multilingual LLM: Why I Had to Solve Reliability First
The original plan was straightforward: build a multilingual LLM fine-tuned for Indian conversational patterns — Hindi, English, and the code-switched hybrid that most Indians actually speak in.
The gap was real. Most open-source models handle formal Hindi reasonably well. They fall apart on the casual, culturally-grounded conversation that's representative of how 1.4 billion people actually communicate. I built the full pipeline — dataset curation from three complementary sources, tokenisation, LoRA adapter initialisation, fine-tuning on a Kaggle T4 GPU, inference evaluation, deployment packaging. Six notebooks, clean end-to-end.
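For context, the fine-tuning stage was standard LoRA on top of a frozen base model. Here's a minimal sketch using the Hugging Face peft library, with an illustrative checkpoint and hyperparameters (my actual base model, rank, and target modules differed):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint; the actual base model and hyperparameters differed.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapters train; the base stays frozen
```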
The problem I ran into wasn't the training pipeline. It was evaluation.
While debugging my fine-tuned model's outputs, I kept noticing something that bothered me. Run the same prompt at the same temperature setting and the answers would differ subtly from run to run. Sometimes meaningfully. Run it five times and you'd get five outputs that were individually plausible but semantically inconsistent with each other.
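A simple repeated-sampling harness makes this drift measurable. A sketch, where generate_once is a hypothetical wrapper around the model's inference call with decoding settings held fixed, and surface similarity via difflib stands in for the stronger semantic-embedding measures you'd use in practice:

```python
import itertools
from difflib import SequenceMatcher

def consistency_score(prompt, generate_once, n=5):
    """Sample the same prompt n times with fixed decoding settings and
    score pairwise agreement between the outputs (1.0 = identical)."""
    outputs = [generate_once(prompt) for _ in range(n)]
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in itertools.combinations(outputs, 2)]
    return sum(sims) / len(sims), outputs
```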
This wasn't a failure that showed up in standard benchmarks. MMLU, ROUGE scores, HumanEval — all of these evaluate accuracy on a single inference pass. They tell you whether a model can produce the right answer. They don't tell you whether it reliably produces consistent answers.
For a multilingual model targeting real conversational deployment — where the same user asks the same kind of question repeatedly and expects stable, predictable responses — that distinction matters enormously. I couldn't in good conscience ship something as a reliable inference layer until I understood this failure mode properly.
So before continuing to fine-tune, I built the evaluation infrastructure first.
That infrastructure grew into the broader LLM Inference Systems project. What started as simple Monte Carlo sampling evolved into a production-grade safety net: a 9-category guardrail classifier that automatically intercepts self-harm, jailbreaks, and extremism, among other violation classes, with severity-scaled escalation logic.
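The escalation side is conceptually just tiered thresholds. A sketch of the shape, with illustrative category names, threshold values, and policy choices rather than the production ones:

```python
from enum import Enum

class Category(Enum):
    SELF_HARM = "self_harm"
    JAILBREAK = "jailbreak"
    EXTREMISM = "extremism"
    # ...six further categories in the real classifier

HIGH_STAKES = {Category.SELF_HARM}  # illustrative policy: some categories never de-escalate

def escalate(category: Category, severity: float) -> str:
    """Severity-scaled escalation: higher scores trigger harder interventions."""
    if category in HIGH_STAKES:
        severity = max(severity, 0.6)  # floor high-stakes categories at refusal level
    if severity >= 0.9:
        return "block_and_flag"        # block and route for human review
    if severity >= 0.6:
        return "refuse"                # decline the request outright
    if severity >= 0.3:
        return "log_and_answer"        # answer, but record the event for audit
    return "answer"
```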
But catching explicit violations wasn't enough; the model's stochastic instability persisted. So I built an empirical reliability guard. Grid sweeps over the sampling parameters established that temperature dominates output spread while top-p fine-tunes it. Now, when the system detects high-entropy inference (above an empirical threshold of 0.25) that indicates a confused or "hallucinating" generation path, it automatically triggers a grid-sweep-optimized fallback: forcing the model into a tightly constrained probability space (T=0.1, top_p=0.5) to recover stability without failing the request.
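In sketch form, the guard looks like this. The 0.25 threshold and the (T=0.1, top_p=0.5) fallback come from the sweeps above, but the generate wrapper and the per-step entropy calculation are simplified stand-ins:

```python
import math

ENTROPY_THRESHOLD = 0.25                       # empirical trip point from the grid sweeps
FALLBACK = {"temperature": 0.1, "top_p": 0.5}  # constrained recovery settings

def mean_token_entropy(step_probs):
    """Average Shannon entropy across the per-step next-token distributions."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in step_probs]
    return sum(entropies) / max(1, len(entropies))

def guarded_generate(prompt, generate, **params):
    """generate(prompt, **params) -> (text, step_probs) is a hypothetical
    inference wrapper that also returns each step's token distribution."""
    text, step_probs = generate(prompt, **params)
    if mean_token_entropy(step_probs) > ENTROPY_THRESHOLD:
        # High-entropy generation path: retry once in a tightly
        # constrained probability space rather than failing the request.
        text, _ = generate(prompt, **FALLBACK)
    return text
```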
To make all this truly deployable, I stopped treating model weights as the only artifact that matters. I implemented a strict Pydantic-powered deployment registry that enforces full provenance. Every release is gated by a Go/No-Go evaluation verdict across 55+ automated regression tests. The registry binds model weights, explicit policy configurations, and "behavioral DNA snapshots" (frozen telemetry traces recording exact entropy patterns) with unbroken SHA-256 fingerprint chains.
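A simplified sketch of what one registry entry looks like, with illustrative field names and a toy fingerprint-chaining helper:

```python
import hashlib
from pydantic import BaseModel, ConfigDict

def fingerprint(data: bytes, parent: str = "") -> str:
    """Chain each artifact's SHA-256 to its parent's, so tampering with any
    upstream artifact invalidates every downstream fingerprint."""
    return hashlib.sha256(parent.encode() + data).hexdigest()

class ReleaseRecord(BaseModel):
    model_config = ConfigDict(frozen=True, extra="forbid")  # immutable, no stray fields

    release_id: str
    weights_sha256: str     # fingerprint of the model weights
    policy_sha256: str      # fingerprint of the guardrail/policy configuration
    snapshot_sha256: str    # fingerprint of the frozen telemetry trace
    chain_sha256: str       # fingerprint binding all three together
    go_verdict: bool        # Go/No-Go gate from the regression suite
    tests_passed: int       # count from the automated regression run
```

The frozen=True plus extra="forbid" combination is what makes the record strict: an entry can't be mutated or quietly extended after its Go/No-Go verdict is recorded.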
The multilingual LLM is still the goal. What changed is the order of operations. Building invariant tests, adversarial safeguard classifiers, and strict snapshotting wasn't a detour; it was the prerequisite for shifting AI from a research curiosity into a production asset. I just had to build the tools to see that clearly.