Building an Indian Multilingual LLM: Why I Had to Solve Reliability First
The original plan was straightforward: build a multilingual LLM fine-tuned for Indian conversational patterns — Hindi, English, and the code-switched hybrid that most Indians actually speak in.
The gap was real. Most open-source models handle formal Hindi reasonably well. They fall apart on the casual, culturally-grounded conversation that's representative of how 1.4 billion people actually communicate. I built the full pipeline — dataset curation from three complementary sources, tokenisation, LoRA adapter initialisation, fine-tuning on a Kaggle T4 GPU, inference evaluation, deployment packaging. Six notebooks, clean end-to-end.
The problem I ran into wasn't the training pipeline. It was evaluation.
While debugging my fine-tuned model's outputs, I kept noticing something that bothered me. Run the same prompt at the same temperature setting and the answer would be subtly different. Sometimes meaningfully different. Run it five times and you'd get five outputs that were individually plausible but semantically inconsistent with each other.
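That repeated-run check can be sketched with stdlib tools alone. The `pairwise_divergence` helper below is illustrative rather than the project's actual code, and it uses lexical overlap (`difflib`) as a cheap stand-in for real semantic comparison:

```python
import difflib
from itertools import combinations

def pairwise_divergence(outputs: list[str]) -> float:
    """Mean lexical divergence (1 - SequenceMatcher ratio)
    over all pairs of outputs sampled from the same prompt.
    0.0 = every run identical; values near 1.0 = runs share
    almost nothing."""
    pairs = list(combinations(outputs, 2))
    return sum(
        1 - difflib.SequenceMatcher(None, a, b).ratio()
        for a, b in pairs
    ) / len(pairs)

# Five hypothetical samples of the "same" answer: individually
# plausible, collectively inconsistent.
samples = [
    "Diwali is the festival of lights.",
    "Diwali, the festival of lights, is widely celebrated.",
    "It marks the victory of light over darkness.",
    "Diwali is the festival of lights.",
    "The festival celebrates light triumphing over darkness.",
]
print(round(pairwise_divergence(samples), 3))
```

A divergence score like this is what turns "the outputs feel different" into a number you can track across runs.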
This wasn't a failure that showed up in standard evaluations. Benchmarks like MMLU and HumanEval, and metrics like ROUGE — all of these measure quality on a single inference pass. They tell you whether a model can produce the right answer. They don't tell you whether it reliably produces consistent answers.
For a multilingual model targeting real conversational deployment — where the same user asks the same kind of question repeatedly and expects stable, predictable responses — that distinction matters enormously. I couldn't in good conscience ship something as a reliable inference layer until I understood this failure mode properly.
So before continuing to fine-tune, I built the evaluation infrastructure first.
That infrastructure became the LLM Reliability Evaluation Platform. Monte Carlo sampling across a temperature × top-p grid, semantic clustering via DBSCAN, instability and entropy metrics per prompt. 240 inference calls on Qwen 2.5-7B across two benchmark runs.
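The core loop behind those metrics can be sketched in a few functions. This is a simplified, DBSCAN-flavoured threshold clustering rather than sklearn's actual DBSCAN (no `min_samples`, greedy single pass), and the `similarity` callback stands in for whatever embedding comparison the platform uses — all names here are illustrative:

```python
import math
from collections import Counter

def cluster_outputs(outputs, similarity, eps=0.8):
    """Greedy density-style grouping: outputs whose pairwise
    similarity meets `eps` share a cluster label."""
    labels = [-1] * len(outputs)
    next_label = 0
    for i, out in enumerate(outputs):
        if labels[i] != -1:
            continue  # already absorbed into an earlier cluster
        labels[i] = next_label
        for j in range(i + 1, len(outputs)):
            if labels[j] == -1 and similarity(out, outputs[j]) >= eps:
                labels[j] = next_label
        next_label += 1
    return labels

def instability(labels):
    """Fraction of samples falling outside the dominant cluster."""
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)

def cluster_entropy(labels):
    """Shannon entropy (bits) of the cluster-size distribution:
    0.0 when every sample agrees, higher as answers fragment."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())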
What it found changed the direction of the multilingual work in ways I didn't expect. The v1 finding — that temperature has negligible effect on output consistency — turned out to be wrong. A full grid sweep in v2 revealed that top-p dominates reliability at 3.1× the effect magnitude of temperature. The right parameter configuration matters. And the failure modes aren't uniformly distributed — philosophical and abstract prompts produce the highest instability, which has direct implications for a conversational AI handling culturally nuanced Indian dialogue.
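One way to make "effect magnitude" concrete: compare the spread of mean instability along each axis of the temperature × top-p grid. The helper below is a hypothetical illustration of that kind of marginal comparison, not the platform's actual analysis:

```python
def effect_magnitude(grid):
    """Given {(temperature, top_p): instability}, return the
    max-minus-min spread of mean instability along each axis.
    A larger spread means that parameter moves reliability more."""
    def marginal(axis):
        buckets = {}
        for key, value in grid.items():
            buckets.setdefault(key[axis], []).append(value)
        means = [sum(vs) / len(vs) for vs in buckets.values()]
        return max(means) - min(means)
    return {"temperature": marginal(0), "top_p": marginal(1)}

# Toy grid (made-up numbers): instability barely shifts with
# temperature but jumps with top-p.
grid = {
    (0.2, 0.8): 0.1, (0.2, 1.0): 0.4,
    (0.8, 0.8): 0.2, (0.8, 1.0): 0.5,
}
print(effect_magnitude(grid))
```

Dividing the two spreads gives a single dominance ratio — the same shape of comparison as the 3.1× figure above.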
The multilingual LLM is still the goal. What's changed is the order of operations.
Before fine-tuning for cultural and linguistic alignment, I need to understand the base model's behavioral surface — where it's stable, where it diverges, how parameter choices interact with prompt complexity. The reliability evaluation work is Phase 1. The alignment fine-tuning is Phase 2.
Building safety and reliability infrastructure before scaling inference isn't a detour. It's the right order of operations. I just had to build the tools to see that clearly.
The benchmark dataset and research analysis are published on Kaggle. The multilingual pipeline notebooks are on GitHub.