Conservative Auto-Regeneration Policy for AI-Generated Financial Narratives
Production quality-gating system for AI-generated outputs in financial services — threshold optimisation, conservative AND escalation policy, priority-scored human review queue, and audit-grade provenance.
Generative AI in financial services isn't a research problem — it's a compliance and reliability problem. The question isn't whether a model can produce a good output. The question is whether you can prove it, consistently, to an auditor.
This project implements a closed-loop quality assurance pipeline for AI-generated narratives in a financial services context. The system evaluates each generated output against two independent quality signals — embedding similarity and ROUGE-L — and applies a conservative AND policy: auto-regeneration is only triggered when both metrics fall below their calibrated thresholds simultaneously. Single-metric failure routes to human review rather than auto-regen, erring on the side of caution.
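The decision logic can be sketched as a small pure function. This is a minimal illustration of the conservative AND policy described above, assuming illustrative names (`apply_and_policy`, `Decision`) rather than the project's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    action: str   # "pass", "auto_regen", or "human_review"
    reason: str

def apply_and_policy(embed_sim: Optional[float], rouge_l: Optional[float],
                     embed_thr: float, rouge_thr: float) -> Decision:
    """Conservative AND policy: auto-regen only when BOTH metrics fail."""
    # A missing modality means we cannot confirm both failures,
    # so the conservative choice is to escalate to human review.
    if embed_sim is None or rouge_l is None:
        return Decision("human_review", "missing metric")
    embed_fail = embed_sim < embed_thr
    rouge_fail = rouge_l < rouge_thr
    if embed_fail and rouge_fail:
        return Decision("auto_regen", "both metrics below threshold")
    if embed_fail or rouge_fail:
        return Decision("human_review", "single-metric failure")
    return Decision("pass", "both metrics at or above threshold")
```

Note that missing metrics fall through to human review rather than auto-regen, which keeps the policy from silently regenerating on incomplete evidence.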
Thresholds are not hand-tuned. They are computed via a bootstrap percentile sweep across a labeled validation set, optimising for precision-recall trade-offs appropriate to a regulated environment. Every artifact — thresholds, policy decisions, enriched validation data — is written with provenance metadata (timestamp, config snapshot, pipeline stage) so the full decision trail is reconstructable.
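The percentile-bootstrap step can be sketched as follows. This is a simplified illustration (function names and the choice of resampling the q-th percentile of known-good scores are assumptions, not the project's exact sweep, which also optimises for precision-recall):

```python
import random
import statistics

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100])."""
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def bootstrap_threshold(good_scores, q=5.0, n_boot=1000, seed=42):
    """Estimate a quality threshold as the mean of the bootstrap
    distribution of the q-th percentile over known-good scores."""
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        resample = [rng.choice(good_scores) for _ in good_scores]
        boots.append(percentile(resample, q))
    return statistics.mean(boots)
```

Averaging over bootstrap resamples is what makes the threshold a distributional estimate rather than a single point estimate on a small labeled set.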
The human review queue is not a flat list. Cases are ranked by a priority score derived from both metric deficits, so reviewers work the highest-risk outputs first. PII sanitization is applied before any data leaves the pipeline for external storage.
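The priority score can be illustrated as a normalised combined deficit. A minimal sketch, assuming deficits are normalised by their thresholds and summed (the actual weighting is a project detail not restated here):

```python
def priority_score(embed_sim, rouge_l, embed_thr, rouge_thr):
    """Combined metric deficit: how far each metric falls below its
    threshold, normalised by the threshold, summed across metrics."""
    embed_deficit = max(0.0, embed_thr - embed_sim) / embed_thr
    rouge_deficit = max(0.0, rouge_thr - rouge_l) / rouge_thr
    return embed_deficit + rouge_deficit

# Illustrative cases: the review queue is sorted worst-first.
EMBED_THR, ROUGE_THR = 0.72, 0.35
cases = [
    {"id": "a", "embed_sim": 0.70, "rouge_l": 0.30},
    {"id": "b", "embed_sim": 0.40, "rouge_l": 0.10},
    {"id": "c", "embed_sim": 0.65, "rouge_l": 0.45},
]
queue = sorted(
    cases,
    key=lambda c: priority_score(c["embed_sim"], c["rouge_l"],
                                 EMBED_THR, ROUGE_THR),
    reverse=True,
)
```

With these numbers, case "b" (failing both metrics badly) surfaces first, while "c" (failing only one, marginally) sinks to the bottom.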
Key Highlights
- Conservative AND policy: auto-regen only when both embedding similarity AND ROUGE-L fall below threshold — single-metric failure routes to human review
- Threshold calibration via bootstrap percentile sweep across labeled validation set
- Priority-scored human review queue ranked by combined metric deficit — highest-risk cases surface first
- Audit-grade provenance: every artifact written with timestamp, config snapshot, and pipeline stage metadata
- PII sanitization applied before public artifact export
- Full pipeline: enrichment → threshold sweep → policy application → queue ranking → provenance write
- Published on Kaggle with sanitized validation dataset
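The PII sanitization step mentioned above can be sketched with a handful of regex rules. These patterns and the `sanitize` helper are illustrative only; a production deployment would rely on a vetted PII-detection library rather than hand-rolled regexes:

```python
import re

# Illustrative patterns, not an exhaustive or production-grade PII rule set.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN shape
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email address
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),        # card-like digit run
]

def sanitize(text: str) -> str:
    """Replace PII-shaped substrings with placeholder tokens
    before an artifact leaves the pipeline."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```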
Tech Stack
- Core
- NLP / Eval
- Pipeline
Challenges
- Calibrating thresholds that are conservative enough for a regulated environment without making auto-regen so rare that it adds no value
- Designing an AND policy that handles missing modalities correctly without collapsing to always-escalate
- Provenance tracking that is lightweight enough not to slow the pipeline but complete enough to satisfy audit requirements
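One lightweight approach to the provenance challenge is to wrap every artifact in a small metadata envelope at write time. A minimal sketch, assuming illustrative names (`with_provenance`, `write_artifact`) and the three metadata fields named earlier (timestamp, config snapshot, pipeline stage):

```python
import json
import time
from pathlib import Path

def with_provenance(payload: dict, stage: str, config: dict) -> dict:
    """Wrap an artifact payload in a provenance envelope."""
    return {
        "artifact": payload,
        "provenance": {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "pipeline_stage": stage,
            "config_snapshot": config,
        },
    }

def write_artifact(payload: dict, stage: str, config: dict, path: Path) -> None:
    """Serialise the wrapped artifact; one JSON write per pipeline stage."""
    path.write_text(json.dumps(with_provenance(payload, stage, config), indent=2))
```

Because the envelope is built in memory and written in the same call as the artifact itself, the overhead is a single dict wrap per stage, yet the decision trail stays reconstructable from the files alone.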
Key Learnings
- Conservative AND policy outperforms OR policy for regulated contexts — false positives (unnecessary human review) are far cheaper than false negatives (bad output reaching downstream)
- Bootstrap percentile thresholds are more robust than point estimates when labeled data is limited
- Priority-scored queues change reviewer behaviour — flat queues get triaged manually and inconsistently