Conservative Auto-Regeneration Policy for AI-Generated Financial Narratives
Production quality-gating system for AI-generated outputs in financial services — threshold optimisation, conservative AND escalation policy, priority-scored human review queue, and audit-grade provenance.
Generative AI in financial services isn't a research problem — it's a compliance and reliability problem. The question isn't whether a model can produce a good output. The question is whether you can prove it, consistently, to an auditor.
This project implements a closed-loop quality assurance pipeline for AI-generated narratives in a financial services context. The system evaluates each generated output against two independent quality signals — embedding similarity and ROUGE-L — and applies a conservative AND policy: auto-regeneration is only triggered when both metrics fall below their calibrated thresholds simultaneously. Single-metric failure routes to human review rather than auto-regen, erring on the side of caution.
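The decision logic can be sketched as a small pure function. This is a minimal illustration of the conservative AND policy described above, assuming illustrative names (`apply_and_policy`, `Decision`) rather than the project's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    action: str   # "pass", "auto_regen", or "human_review"
    reason: str

def apply_and_policy(embed_sim: Optional[float], rouge_l: Optional[float],
                     embed_thr: float, rouge_thr: float) -> Decision:
    """Conservative AND policy: auto-regen only when BOTH metrics fail."""
    # A missing modality means we cannot confirm both failures,
    # so the conservative choice is to escalate to human review.
    if embed_sim is None or rouge_l is None:
        return Decision("human_review", "missing metric")
    embed_fail = embed_sim < embed_thr
    rouge_fail = rouge_l < rouge_thr
    if embed_fail and rouge_fail:
        return Decision("auto_regen", "both metrics below threshold")
    if embed_fail or rouge_fail:
        return Decision("human_review", "single-metric failure")
    return Decision("pass", "both metrics at or above threshold")
```

Note that missing metrics fall through to human review rather than auto-regen, which keeps the policy from silently regenerating on incomplete evidence.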
Thresholds are not hand-tuned. They are computed via a bootstrap percentile sweep across a labeled validation set, optimising for precision-recall trade-offs appropriate to a regulated environment. Every artifact — thresholds, policy decisions, enriched validation data — is written with provenance metadata (timestamp, config snapshot, pipeline stage) so the full decision trail is reconstructable.
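The percentile-bootstrap step can be sketched as follows. This is a simplified illustration (function names and the choice of resampling the q-th percentile of known-good scores are assumptions, not the project's exact sweep, which also optimises for precision-recall):

```python
import random
import statistics

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100])."""
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def bootstrap_threshold(good_scores, q=5.0, n_boot=1000, seed=42):
    """Estimate a quality threshold as the mean of the bootstrap
    distribution of the q-th percentile over known-good scores."""
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        resample = [rng.choice(good_scores) for _ in good_scores]
        boots.append(percentile(resample, q))
    return statistics.mean(boots)
```

Averaging over bootstrap resamples is what makes the threshold a distributional estimate rather than a single point estimate on a small labeled set.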
The human review queue is not a flat list. Cases are ranked by a priority score derived from both metric deficits, so reviewers work the highest-risk outputs first. PII sanitization is applied before any data leaves the pipeline for external storage.
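The priority score can be illustrated as a normalised combined deficit. A minimal sketch, assuming deficits are normalised by their thresholds and summed (the actual weighting is a project detail not restated here):

```python
def priority_score(embed_sim, rouge_l, embed_thr, rouge_thr):
    """Combined metric deficit: how far each metric falls below its
    threshold, normalised by the threshold, summed across metrics."""
    embed_deficit = max(0.0, embed_thr - embed_sim) / embed_thr
    rouge_deficit = max(0.0, rouge_thr - rouge_l) / rouge_thr
    return embed_deficit + rouge_deficit

# Illustrative cases: the review queue is sorted worst-first.
EMBED_THR, ROUGE_THR = 0.72, 0.35
cases = [
    {"id": "a", "embed_sim": 0.70, "rouge_l": 0.30},
    {"id": "b", "embed_sim": 0.40, "rouge_l": 0.10},
    {"id": "c", "embed_sim": 0.65, "rouge_l": 0.45},
]
queue = sorted(
    cases,
    key=lambda c: priority_score(c["embed_sim"], c["rouge_l"],
                                 EMBED_THR, ROUGE_THR),
    reverse=True,
)
```

With these numbers, case "b" (failing both metrics badly) surfaces first, while "c" (failing only one, marginally) sinks to the bottom.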
Key Highlights
- Conservative AND policy: auto-regen only when both embedding similarity AND ROUGE-L fall below threshold — single-metric failure routes to human review
- Threshold calibration via bootstrap percentile sweep across labeled validation set
- Priority-scored human review queue ranked by combined metric deficit — highest-risk cases surface first
- Audit-grade provenance: every artifact written with timestamp, config snapshot, and pipeline stage metadata
- PII sanitization applied before public artifact export
- Full pipeline: enrichment → threshold sweep → policy application → queue ranking → provenance write
- Published on Kaggle with sanitized validation dataset
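The PII sanitization step mentioned above can be sketched with a handful of regex rules. These patterns and the `sanitize` helper are illustrative only; a production deployment would rely on a vetted PII-detection library rather than hand-rolled regexes:

```python
import re

# Illustrative patterns, not an exhaustive or production-grade PII rule set.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN shape
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email address
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),        # card-like digit run
]

def sanitize(text: str) -> str:
    """Replace PII-shaped substrings with placeholder tokens
    before an artifact leaves the pipeline."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```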
Tech Stack
- Core
- NLP / Eval
- Pipeline
Challenges
- Calibrating thresholds that are conservative enough for a regulated environment without making auto-regen so rare that it adds no value
- Designing an AND policy that handles missing modalities correctly without collapsing to always-escalate
- Provenance tracking that is lightweight enough not to slow the pipeline but complete enough to satisfy audit requirements
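One lightweight approach to the provenance challenge is to wrap every artifact in a small metadata envelope at write time. A minimal sketch, assuming illustrative names (`with_provenance`, `write_artifact`) and the three metadata fields named earlier (timestamp, config snapshot, pipeline stage):

```python
import json
import time
from pathlib import Path

def with_provenance(payload: dict, stage: str, config: dict) -> dict:
    """Wrap an artifact payload in a provenance envelope."""
    return {
        "artifact": payload,
        "provenance": {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "pipeline_stage": stage,
            "config_snapshot": config,
        },
    }

def write_artifact(payload: dict, stage: str, config: dict, path: Path) -> None:
    """Serialise the wrapped artifact; one JSON write per pipeline stage."""
    path.write_text(json.dumps(with_provenance(payload, stage, config), indent=2))
```

Because the envelope is built in memory and written in the same call as the artifact itself, the overhead is a single dict wrap per stage, yet the decision trail stays reconstructable from the files alone.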
Key Learnings
- Conservative AND policy outperforms OR policy for regulated contexts — false positives (unnecessary human review) are far cheaper than false negatives (bad output reaching downstream)
- Bootstrap percentile thresholds are more robust than point estimates when labeled data is limited
- Priority-scored queues change reviewer behaviour — flat queues get triaged manually and inconsistently