Conservative Auto-Regeneration Policy for AI-Generated Financial Narratives
Completed · 2025

Production quality-gating system for AI-generated outputs in financial services — threshold optimisation, conservative AND escalation policy, priority-scored human review queue, and audit-grade provenance.

Python · NLP · FinTech · ML · System Design

Generative AI in financial services isn't a research problem — it's a compliance and reliability problem. The question isn't whether a model can produce a good output. The question is whether you can prove it, consistently, to an auditor.

This project implements a closed-loop quality assurance pipeline for AI-generated narratives in a financial services context. The system evaluates each generated output against two independent quality signals — embedding similarity and ROUGE-L — and applies a conservative AND policy: auto-regeneration is only triggered when both metrics fall below their calibrated thresholds simultaneously. Single-metric failure routes to human review rather than auto-regen, erring on the side of caution.
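The routing logic above can be sketched in a few lines. The threshold values here are placeholders for illustration; in the real pipeline they come out of the calibration step described next.

```python
# Hypothetical threshold values for illustration only; the actual
# thresholds are produced by the bootstrap percentile sweep.
EMBED_THRESHOLD = 0.78
ROUGE_THRESHOLD = 0.42

def route(embed_sim: float, rouge_l: float) -> str:
    """Conservative AND policy: auto-regen only on joint failure."""
    embed_fail = embed_sim < EMBED_THRESHOLD
    rouge_fail = rouge_l < ROUGE_THRESHOLD
    if embed_fail and rouge_fail:
        return "auto_regen"    # both independent signals agree the output is bad
    if embed_fail or rouge_fail:
        return "human_review"  # single-metric failure: escalate, never regen
    return "accept"
```

The asymmetry is deliberate: regeneration discards work and can loop, so it is only permitted when both signals agree; disagreement is treated as uncertainty and handed to a human.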

Thresholds are not hand-tuned. They are computed via a bootstrap percentile sweep across a labeled validation set, optimising for precision-recall trade-offs appropriate to a regulated environment. Every artifact — thresholds, policy decisions, enriched validation data — is written with provenance metadata (timestamp, config snapshot, pipeline stage) so the full decision trail is reconstructable.
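One minimal form of the bootstrap percentile sweep looks like the sketch below. The function name, the choice of the lower-tail percentile, and averaging the bootstrap replicates are all illustrative assumptions, not the project's published implementation.

```python
import numpy as np

def bootstrap_percentile_threshold(scores, percentile=5.0, n_boot=1000, seed=0):
    """Estimate a lower-tail threshold as the mean of bootstrap percentiles.

    `scores` are metric values for validated-good outputs; the returned
    threshold is the bootstrapped `percentile`-th percentile, so roughly
    that fraction of known-good outputs would fall below it.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boots = np.empty(n_boot)
    for i in range(n_boot):
        # Resample the validation scores with replacement and record
        # the percentile of each replicate.
        sample = rng.choice(scores, size=scores.size, replace=True)
        boots[i] = np.percentile(sample, percentile)
    return float(boots.mean())
```

Averaging over resampled percentiles smooths out the sensitivity of a single point-estimate percentile to a small labeled set, which matters when validation data is limited.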

The human review queue is not a flat list. Cases are ranked by a priority score derived from both metric deficits, so reviewers work the highest-risk outputs first. PII sanitization is applied before any data leaves the pipeline for external storage.
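A priority score "derived from both metric deficits" could take the shape below; the normalisation by each threshold and the specific threshold values are assumptions made so the two signals contribute on a comparable scale.

```python
# Hypothetical thresholds and cases for illustration only.
EMBED_THR, ROUGE_THR = 0.78, 0.42

def priority_score(embed_sim: float, rouge_l: float) -> float:
    """Combined metric deficit: how far below threshold each signal sits,
    normalised by its threshold so the two scales are comparable."""
    embed_deficit = max(0.0, (EMBED_THR - embed_sim) / EMBED_THR)
    rouge_deficit = max(0.0, (ROUGE_THR - rouge_l) / ROUGE_THR)
    return embed_deficit + rouge_deficit

cases = [
    {"id": "a", "embed": 0.75, "rouge": 0.40},  # marginal on both metrics
    {"id": "b", "embed": 0.50, "rouge": 0.10},  # far below on both metrics
    {"id": "c", "embed": 0.80, "rouge": 0.30},  # ROUGE-only shortfall
]
# Highest combined deficit (highest risk) first.
queue = sorted(cases, key=lambda c: priority_score(c["embed"], c["rouge"]),
               reverse=True)
```

Here case `b` surfaces first because it misses both thresholds by a wide margin, while the marginal case `a` sinks to the bottom of the queue.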

Key Highlights

  • Conservative AND policy: auto-regen only when both embedding similarity AND ROUGE-L fall below threshold — single-metric failure routes to human review
  • Threshold calibration via bootstrap percentile sweep across labeled validation set
  • Priority-scored human review queue ranked by combined metric deficit — highest-risk cases surface first
  • Audit-grade provenance: every artifact written with timestamp, config snapshot, and pipeline stage metadata
  • PII sanitization applied before public artifact export
  • Full pipeline: enrichment → threshold sweep → policy application → queue ranking → provenance write
  • Published on Kaggle with sanitized validation dataset

Tech Stack

Core

Python · Pandas · NumPy · Scikit-learn

NLP / Eval

ROUGE-L · Sentence Embeddings · Embedding Similarity

Pipeline

Bootstrap Percentile Sweep · Provenance Tracking · PII Sanitization

Challenges

  • Calibrating thresholds that are conservative enough for a regulated environment without making auto-regen so rare that it adds no value
  • Designing an AND policy that handles missing modalities correctly without collapsing to always-escalate
  • Provenance tracking that is lightweight enough not to slow the pipeline but complete enough to satisfy audit requirements
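One way to handle the missing-modality challenge, sketched here as an assumption rather than the project's actual rule: the AND gate can only fire when both signals are present and both fail, and when exactly one signal is available, routing falls back to that signal alone so the policy does not degenerate into always-escalate.

```python
from typing import Optional

def route_with_missing(embed_sim: Optional[float], rouge_l: Optional[float],
                       embed_thr: float, rouge_thr: float) -> str:
    """AND policy with explicit missing-metric handling (illustrative)."""
    if embed_sim is None and rouge_l is None:
        return "human_review"  # no evidence at all: escalate
    if embed_sim is None:
        # Fall back to the single available metric instead of escalating
        # unconditionally; auto-regen is never allowed on one signal.
        return "human_review" if rouge_l < rouge_thr else "accept"
    if rouge_l is None:
        return "human_review" if embed_sim < embed_thr else "accept"
    if embed_sim < embed_thr and rouge_l < rouge_thr:
        return "auto_regen"
    if embed_sim < embed_thr or rouge_l < rouge_thr:
        return "human_review"
    return "accept"
```

The key property is that a missing metric can never satisfy the AND condition, so auto-regeneration stays gated on full evidence, while healthy single-signal outputs still pass through without a human in the loop.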

Key Learnings

  • Conservative AND policy outperforms OR policy for regulated contexts — false positives (unnecessary human review) are far cheaper than false negatives (bad output reaching downstream)
  • Bootstrap percentile thresholds are more robust than point estimates when labeled data is limited
  • Priority-scored queues change reviewer behaviour — flat queues get triaged manually and inconsistently