Indian Desi Multilingual LLM — Training Pipeline
back to projects
Completed2025

Indian Desi Multilingual LLM — Training Pipeline

End-to-end multilingual LLM training pipeline targeting Hindi/English code-switching. Dataset curation, LoRA fine-tuning, inference evaluation, and deployment packaging across 6 Kaggle notebooks.

PythonLLMNLPLoRAHuggingFaceTransformers

Built a complete training pipeline for an Indian multilingual LLM — from raw data curation through LoRA fine-tuning to inference evaluation and deployment packaging.

The goal was to address a real gap: most open-source LLMs handle formal Hindi reasonably well but break down on the casual, code-switched Hindi-English that represents how most Indians actually communicate. The project focuses on building the infrastructure that makes multilingual AI reliable and safe for production.

Recognised mid-project that Sarvam AI had advanced significantly in the same space with more compute and a dedicated research team. Rather than continuing a follower project, pivoted toward the more interesting problem hiding inside the work: building rigorous behavioral reliability evaluation infrastructure for open-source LLMs. That pivot became the LLM Reliability Evaluation Platform.

Key Highlights

  • Canonical dataset curated from 3 complementary sources — chatbot dataset for tone, large-scale conversation corpus for diversity, sentence-pair dataset for structural grounding
  • Unified schema normalising all sources into a clean, consistent format
  • LoRA adapter initialisation on a multilingual encoder-decoder base model
  • 6-notebook pipeline: tokenisation → model setup → LoRA init → fine-tuning → inference eval → deployment packaging
  • Handles Hindi-English code-switching and Devanagari / Romanised Hinglish script diversity
  • Persona safety CI: checkpoint testing against adversarial prompts before deployment

Tech Stack

Core

PythonPyTorchHuggingFace Transformers

Fine-tuning

LoRAQLoRAPEFT

Evaluation

sentence-transformersCustom benchmark suite

Infrastructure

Kaggle T4 GPUDockerGitHub Actions

Challenges

  • Code-switching detection — Indian speakers mix languages mid-sentence unpredictably
  • Script diversity — Hindi written in Devanagari vs Romanised Hinglish are treated as different inputs
  • Evaluation subjectivity — BLEU scores don't capture cultural nuance or conversational naturalness
  • Recognising when to pivot: continued investment in the model itself was low-differentiation

Key Learnings

  • Building safety layers before scaling inference is the right order of operations
  • Pivoting is a research decision, not a failure — finding the more interesting problem matters
  • Cultural context matters more than raw accuracy metrics for conversational AI
  • Dataset quality and schema consistency have more leverage than model architecture choices at this scale