Production eval harness for a safety-critical LLM application — Case study

Methodology context — this case study uses Zhianrui's provenance-first methodology, with §6 covering how we engineer AI systems specifically. Start there if the eval-harness vocabulary is unfamiliar.

Eval harness architecture: ground truth, multi-judge, deploy gate

The brief

Sector: Regulated health-tech.
Need: The client had built an LLM-driven triage assistant that helped clinical staff route patient queries to the appropriate care path. The system was working, but every time the upstream model provider released an update, the team spent two weeks revalidating it manually before allowing the new model into production. They needed an evaluation harness that could decide automatically whether a model upgrade was safe to ship.

Workstream decomposition

WS-1: Failure-mode taxonomy. Catalogue the specific failure modes the regulated domain cannot tolerate — under-triage of high-acuity cases, demographic-specific calibration drift, hallucinated drug interactions, off-topic redirections that delay urgent care.
WS-2: Eval dataset construction. Build a 1,400-case ground-truth dataset combining: (a) historical adjudicated cases from the client's own logs, (b) synthetic edge cases authored by clinicians to probe specific failure modes, (c) adversarial prompts designed to surface known LLM weak spots.
WS-3: Multi-judge scoring. Move past single-LLM-as-judge. Three independent judges per case — two LLM judges from different model families, one human-in-the-loop sample — with disagreement-triggered escalation.
WS-4: Deploy gate + regression detection. Wire the eval harness into the client's deployment pipeline. A new model version cannot reach production unless it (a) matches or beats the incumbent on the aggregate score, (b) does not regress on any single failure-mode subset by more than the configured tolerance, (c) passes the demographic-calibration check.
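The disagreement-triggered escalation in WS-3 can be sketched roughly as follows. This is a minimal illustration, not the client's implementation: the `JudgeVerdicts` type, the `resolve` function, the binary pass/fail verdicts, and the rule that any disagreement escalates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class JudgeVerdicts:
    """Verdicts for one eval case (hypothetical shape, binary pass/fail)."""
    case_id: str
    llm_judge_a: bool              # verdict from the first LLM judge
    llm_judge_b: bool              # verdict from a judge of a different model family
    human: Optional[bool] = None   # set only when the case is in the human-reviewed sample


def resolve(v: JudgeVerdicts) -> Tuple[Optional[bool], bool]:
    """Return (final_verdict, needs_escalation).

    Assumed rule: unanimous judges settle the case; any disagreement
    leaves the verdict open and sends the case to the escalation queue.
    """
    votes = [v.llm_judge_a, v.llm_judge_b]
    if v.human is not None:
        votes.append(v.human)
    if all(votes) or not any(votes):
        return votes[0], False
    return None, True


# Build the escalation queue from a small batch of illustrative verdicts.
batch = [
    JudgeVerdicts("case-001", True, True),
    JudgeVerdicts("case-002", True, False),
    JudgeVerdicts("case-003", True, True, human=False),
]
escalation_queue = [v.case_id for v in batch if resolve(v)[1]]
```

In this sketch a human verdict that contradicts two agreeing LLM judges also escalates, so a case is never settled by majority vote alone.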

Method highlight

The most consequential design decision was to score models on failure-mode subsets rather than on the aggregate score alone. A new model that improves the average score by 3% but regresses by 12% on the under-triage subset is worse in this domain, and an aggregate-only eval misses that. The harness reports per-subset deltas as a matter of course; the deploy gate's tolerance is configured per subset.
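A gate of this kind might look like the sketch below, which combines the three WS-4 conditions with per-subset tolerances. The function name `deploy_gate`, the score dictionaries, the subset names, and every number are hypothetical. Scores are assumed to lie in [0, 1] with higher being better, and the 3% and 12% figures are read as absolute percentage points for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GateResult:
    passed: bool
    reasons: List[str] = field(default_factory=list)


def deploy_gate(incumbent: dict, candidate: dict,
                tolerances: Dict[str, float],
                calibration_ok: bool) -> GateResult:
    """Apply the three gate conditions; collect every reason for rejection."""
    reasons = []
    # (a) candidate must match or beat the incumbent on the aggregate score
    if candidate["aggregate"] < incumbent["aggregate"]:
        reasons.append("aggregate score below incumbent")
    # (b) no failure-mode subset may regress by more than its own tolerance
    for name, tol in tolerances.items():
        delta = candidate["subsets"][name] - incumbent["subsets"][name]
        if delta < -tol:
            reasons.append(
                f"{name}: regressed by {-delta:.3f} (tolerance {tol:.3f})"
            )
    # (c) demographic-calibration check, computed elsewhere
    if not calibration_ok:
        reasons.append("demographic-calibration check failed")
    return GateResult(passed=not reasons, reasons=reasons)


# Illustrative numbers only: aggregate up 3 points, under-triage down 12.
incumbent = {"aggregate": 0.80,
             "subsets": {"under_triage": 0.90, "drug_interactions": 0.85}}
candidate = {"aggregate": 0.83,
             "subsets": {"under_triage": 0.78, "drug_interactions": 0.90}}
tolerances = {"under_triage": 0.02, "drug_interactions": 0.05}

result = deploy_gate(incumbent, candidate, tolerances, calibration_ok=True)
print(result.passed, result.reasons)
```

With these inputs the candidate passes the aggregate comparison and is still rejected, because the under-triage regression exceeds that subset's tolerance.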

The second consequential decision was to refuse synthetic-only eval data. Synthetic cases are useful for probing known failure modes, but they overfit to what the team already knows to look for. The harness requires a non-trivial fraction of every release evaluation to come from recent production logs that have been adjudicated since the last release — keeping the eval honest about live distribution drift.
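One way to enforce that requirement is a pre-flight check on the composition of each release evaluation set. The sketch below is an assumption about how such a check could work: the case fields `source` and `adjudicated_on`, the function names, and the `min_fraction` value are hypothetical, and the case study does not state the actual threshold.

```python
from datetime import date
from typing import List


def fresh_production_fraction(cases: List[dict], last_release: date) -> float:
    """Share of cases drawn from production logs and adjudicated after the last release."""
    if not cases:
        return 0.0
    fresh = sum(
        1 for c in cases
        if c["source"] == "production" and c["adjudicated_on"] > last_release
    )
    return fresh / len(cases)


def check_eval_set(cases: List[dict], last_release: date,
                   min_fraction: float) -> float:
    """Refuse to run an evaluation whose fresh-production share is too low."""
    fraction = fresh_production_fraction(cases, last_release)
    if fraction < min_fraction:
        raise ValueError(
            f"only {fraction:.0%} of cases are freshly adjudicated production "
            f"cases; at least {min_fraction:.0%} required"
        )
    return fraction


# Illustrative eval set: one of four cases is fresh production data.
last_release = date(2024, 1, 15)
cases = [
    {"source": "production", "adjudicated_on": date(2024, 2, 1)},
    {"source": "production", "adjudicated_on": date(2023, 11, 3)},
    {"source": "synthetic", "adjudicated_on": date(2024, 2, 10)},
    {"source": "adversarial", "adjudicated_on": date(2024, 2, 12)},
]
fraction = check_eval_set(cases, last_release, min_fraction=0.2)
```

Raising an error instead of logging a warning means a synthetic-only evaluation cannot quietly produce a passing report.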

Deliverable shape

Eval harness as an installable service, runnable in CI and locally.
1,400-case ground-truth dataset with versioning, source attribution, and adjudication notes.
Multi-judge scoring infrastructure with disagreement-triggered escalation queue.
Deploy-gate integration into the client's CD pipeline; configurable per-subset thresholds.
Documentation: how to add a failure mode, how to refresh the dataset from production logs, how to interpret a regression report.
Research dossier: 38 source rows documenting failure-mode selection rationale.

Outcomes

Time from upstream model release to client production deployment dropped from ~14 days to ~2 days, with the team's confidence in the deploy decision increasing rather than decreasing. Three would-have-shipped model upgrades have been caught and rolled back by the harness in the year since deployment: two for under-triage regressions, one for a calibration drift on a specific demographic subset. None of the three would have been caught by aggregate-score evaluation.

The harness is now used as the reference architecture for evaluating two further LLM-driven systems the client has since deployed.
