Indian Desi Multilingual LLM — Training Pipeline
End-to-end multilingual LLM training pipeline targeting Hindi/English code-switching. Dataset curation, LoRA fine-tuning, inference evaluation, and deployment packaging across 6 Kaggle notebooks.
Built a complete training pipeline for an Indian multilingual LLM — from raw data curation through LoRA fine-tuning to inference evaluation and deployment packaging.
The goal was to address a real gap: most open-source LLMs handle formal Hindi reasonably well but break down on the casual, code-switched Hindi-English that represents how most Indians actually communicate. The project focuses on building the infrastructure that makes multilingual AI reliable and safe for production.
Recognised mid-project that Sarvam AI had advanced significantly in the same space with more compute and a dedicated research team. Rather than continuing a follower project, pivoted toward the more interesting problem hiding inside the work: building rigorous behavioral reliability evaluation infrastructure for open-source LLMs. That pivot became the LLM Reliability Evaluation Platform.
Key Highlights
- Canonical dataset curated from 3 complementary sources — chatbot dataset for tone, large-scale conversation corpus for diversity, sentence-pair dataset for structural grounding
- Unified schema normalising all sources into a clean, consistent format
- LoRA adapter initialisation on a multilingual encoder-decoder base model
- 6-notebook pipeline: tokenisation → model setup → LoRA init → fine-tuning → inference eval → deployment packaging
- Handles Hindi-English code-switching and Devanagari / Romanised Hinglish script diversity
- Persona safety CI: checkpoint testing against adversarial prompts before deployment
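The unified-schema step in the highlights above could look roughly like the sketch below. Field names (`prompt`, `response`, `source`) and the shapes of the source records are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of the schema-normalisation step: each of the three
# source formats gets its own adapter that maps raw records into one
# canonical row format before tokenisation.

def normalise_chatbot(rec):
    """Chatbot-style record, assumed shape: {"user": ..., "bot": ...}."""
    return {"prompt": rec["user"].strip(),
            "response": rec["bot"].strip(),
            "source": "chatbot"}

def normalise_pairs(rec):
    """Sentence-pair record, assumed shape: {"en": ..., "hi": ...}."""
    return {"prompt": rec["en"].strip(),
            "response": rec["hi"].strip(),
            "source": "sentence_pairs"}

def build_canonical(records, normaliser):
    """Normalise a batch of raw records, dropping empty rows early."""
    out = []
    for rec in records:
        row = normaliser(rec)
        # Enforce consistency before anything reaches downstream notebooks
        if row["prompt"] and row["response"]:
            out.append(row)
    return out
```

The key design choice is that every source gets its own adapter function and everything downstream sees one format, so adding a fourth corpus never touches the training code.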
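The LoRA adapter initialisation bullet can be illustrated with a minimal numeric sketch of the underlying idea, with NumPy standing in for the real training framework. Dimensions, rank, and the `alpha` scale are illustrative, not the project's settings: a frozen base weight `W` is augmented with a low-rank update `B @ A`, and `B` is zero-initialised so the adapted model starts out identical to the base model.

```python
import numpy as np

d, k, r = 64, 64, 8  # layer dims and LoRA rank (all illustrative)

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))  # frozen pretrained weight, never updated

# Standard LoRA init: A small random, B zero, so the adapter is a no-op
A = rng.normal(scale=0.01, size=(r, k))
B = np.zeros((d, r))
alpha = 16  # scaling hyperparameter

def adapted_forward(x):
    # Base path plus scaled low-rank update: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# At init the adapter contributes nothing: output matches the base model
assert np.allclose(adapted_forward(x), W @ x)
```

Only `A` and `B` are trained (here 1,024 parameters versus 4,096 in `W`), which is what makes fine-tuning feasible within Kaggle notebook compute limits.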
Tech Stack
- Core
- Fine-tuning
- Evaluation
- Infrastructure
Challenges
- Code-switching detection — Indian speakers mix languages mid-sentence unpredictably
- Script diversity — the same Hindi sentence written in Devanagari vs Romanised Hinglish is treated as two different inputs by the tokeniser
- Evaluation subjectivity — BLEU scores don't capture cultural nuance or conversational naturalness
- Recognising when to pivot: continued investment in the model itself offered little differentiation
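The script-diversity challenge above reduces to a classification problem before any modelling happens. A naive sketch (not the project's actual detector; the 0.8/0.2 thresholds are illustrative) uses the fact that Devanagari occupies the Unicode block U+0900–U+097F:

```python
def classify_script(text):
    """Classify a Hindi/Hinglish string by script — a naive sketch.

    Devanagari occupies U+0900-U+097F; ASCII alphabetic characters are
    treated as Romanised. Real code-switched text mixes both scripts,
    so the ratio matters more than a hard label.
    """
    deva = sum(1 for ch in text if '\u0900' <= ch <= '\u097f')
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = deva + latin
    if total == 0:
        return "unknown"
    ratio = deva / total
    if ratio > 0.8:
        return "devanagari"
    if ratio < 0.2:
        return "romanised"
    return "mixed"
```

A per-utterance label like this lets evaluation be sliced by script, which is how you discover that a model fine on Devanagari input falls apart on Romanised Hinglish.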
Key Learnings
- Building safety layers before scaling inference is the right order of operations
- Pivoting is a research decision, not a failure — finding the more interesting problem matters
- Cultural context matters more than raw accuracy metrics for conversational AI
- Dataset quality and schema consistency have more leverage than model architecture choices at this scale
