Building eval harnesses that survive contact with production

A benchmark score is what a model achieved on a frozen dataset. An eval harness is what you trust to decide whether a new model can ship. The two are not the same thing — and treating them as the same is why most production LLM teams revalidate every model upgrade by hand for two weeks.

A few principles, drawn from harnesses we have built and watched succeed and fail in production.

Score on subsets, not aggregates

A new model that improves the average score by 3% but regresses by 12% on a single failure-mode subset is worse in any production setting that cannot tolerate that failure mode. Aggregate scores hide this completely. The harness should report per-subset deltas by default, and the deploy gate's tolerance should be configured per subset. If you cannot articulate the subsets, you have not yet built the harness; you have built a benchmark.

The subsets should be derived from your actual failure modes, not borrowed from generic taxonomies. "Hallucination" is not a useful subset. "Hallucinated drug interactions in the medication-routing prompt set" is.
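A minimal sketch of per-subset reporting, assuming scores are plain floats keyed by subset name. The subset names, tolerances, and the `SubsetDelta` type are illustrative assumptions, not part of any particular harness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SubsetDelta:
    subset: str
    baseline: float
    candidate: float
    tolerance: float  # largest acceptable drop for this subset (hypothetical value)

    @property
    def delta(self) -> float:
        return self.candidate - self.baseline

    @property
    def regressed(self) -> bool:
        # A regression is a drop larger than the subset's own tolerance.
        return self.delta < -self.tolerance


def per_subset_deltas(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerances: Dict[str, float],
    default_tolerance: float = 0.0,
) -> List[SubsetDelta]:
    # candidate[name] raises KeyError if a subset was dropped from the run,
    # so a missing subset fails loudly instead of vanishing from the report.
    return [
        SubsetDelta(name, baseline[name], candidate[name],
                    tolerances.get(name, default_tolerance))
        for name in sorted(baseline)
    ]


# Illustrative numbers only: the aggregate improves while one subset regresses.
baseline = {"general_qa": 0.80, "medication_routing_drug_interactions": 0.90}
candidate = {"general_qa": 0.86, "medication_routing_drug_interactions": 0.78}
tolerances = {"general_qa": 0.02, "medication_routing_drug_interactions": 0.01}

deltas = per_subset_deltas(baseline, candidate, tolerances)
for d in deltas:
    status = "REGRESSED" if d.regressed else "ok"
    print(f"{d.subset}: {d.delta:+.2f} (tolerance {d.tolerance:.2f}) {status}")
```

Keeping the tolerance on each row, rather than in a global constant, is what lets the report show which subsets the team has decided it cannot afford to lose.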

Refuse synthetic-only data

Synthetic eval cases are useful for probing failure modes you already know to look for. They overfit to exactly that — the failure modes you knew about when you authored them. A harness whose evaluation data is entirely synthetic will tell you, with high confidence, that your model is good at the failures you already understood.

A useful harness requires a non-trivial fraction of every release evaluation to come from recent production logs that have been adjudicated since the last release. This keeps the evaluation honest about live distribution drift — the user behaviour and corpus content shifts that synthetic cases will not anticipate.
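One way to enforce this is a composition check that runs before any scoring. The sketch below assumes each eval case records its source and adjudication date; the `EvalCase` fields and the 20% floor are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    source: str  # "production" or "synthetic" (assumed labels)
    adjudicated_on: Optional[date] = None


def fresh_production_fraction(cases: List[EvalCase], last_release: date) -> float:
    """Share of cases drawn from production logs adjudicated after the last release."""
    if not cases:
        return 0.0
    fresh = sum(
        1
        for c in cases
        if c.source == "production"
        and c.adjudicated_on is not None
        and c.adjudicated_on > last_release
    )
    return fresh / len(cases)


def composition_ok(cases: List[EvalCase], last_release: date,
                   min_fraction: float = 0.2) -> bool:
    # min_fraction is a placeholder; the right floor depends on the product.
    return fresh_production_fraction(cases, last_release) >= min_fraction


last_release = date(2024, 3, 1)
cases = [
    EvalCase("s1", "synthetic"),
    EvalCase("s2", "synthetic"),
    EvalCase("p1", "production", date(2024, 1, 15)),  # adjudicated before the release
    EvalCase("p2", "production", date(2024, 4, 2)),   # adjudicated since the release
]
fraction = fresh_production_fraction(cases, last_release)
print(f"fresh production fraction: {fraction:.2f}")
```

Production cases adjudicated before the last release do not count toward the floor, since they say nothing about drift since then.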

Multi-judge over single-judge

LLM-as-judge with a single judge model has a known failure mode: the judge's weaknesses are correlated with those of the system being evaluated. A response the judge could not itself produce, it cannot reliably evaluate. The fix is judges whose errors are less correlated: at minimum, two LLM judges from different model families, plus a human-in-the-loop sample queue triggered by judge disagreement.

The disagreement queue does double duty. It surfaces the cases the eval is least confident about, which are usually the cases worth a human looking at. It also flags judge drift over time — if the same disagreements keep recurring, the judges themselves need to be updated.
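The routing logic can be sketched as follows. The two judges here are trivial stand-in functions, not real model calls; in practice each would wrap a call to a model from a different family. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A judge takes (prompt, response) and returns pass/fail.
Judge = Callable[[str, str], bool]


@dataclass
class Verdict:
    case_id: str
    votes: Dict[str, bool]

    @property
    def unanimous(self) -> bool:
        return len(set(self.votes.values())) == 1


def judge_cases(
    cases: List[Tuple[str, str, str]],  # (case_id, prompt, response)
    judges: Dict[str, Judge],
) -> Tuple[List[Verdict], List[Verdict]]:
    """Split verdicts into those the judges agree on and those needing a human."""
    agreed: List[Verdict] = []
    human_queue: List[Verdict] = []
    for case_id, prompt, response in cases:
        votes = {name: judge(prompt, response) for name, judge in judges.items()}
        verdict = Verdict(case_id, votes)
        (agreed if verdict.unanimous else human_queue).append(verdict)
    return agreed, human_queue


# Stand-in judges for illustration only.
def judge_a(prompt: str, response: str) -> bool:
    return "consult" in response.lower()


def judge_b(prompt: str, response: str) -> bool:
    return len(response) < 80


cases = [
    ("c1", "Can I combine these drugs?", "Consult a pharmacist before combining these."),
    ("c2", "Can I combine these drugs?", "These two drugs are always safe together."),
]
agreed, human_queue = judge_cases(cases, {"judge_a": judge_a, "judge_b": judge_b})
print("needs human review:", [v.case_id for v in human_queue])
```

Persisting the queued verdicts across releases makes recurring disagreements visible, which is the signal for judge drift described above.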

Wire it to a deploy gate, not a dashboard

A harness that produces a report on a dashboard is a harness whose recommendations are optional. A harness wired to the deploy pipeline as a gate — block release on aggregate regression, block release on per-subset regression beyond tolerance, block release on calibration check failure — is a harness whose recommendations are non-optional. The cost of optionality is the two-week manual revalidation that most teams have learned to expect.
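A minimal sketch of such a gate, assuming the harness already emits an aggregate delta, per-subset deltas, and a single calibration error figure. The report shape, the thresholds, and the choice of calibration metric are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateConfig:
    aggregate_tolerance: float          # largest acceptable aggregate drop
    subset_tolerances: Dict[str, float]  # largest acceptable drop per subset
    max_calibration_error: float        # hypothetical calibration ceiling


@dataclass
class EvalReport:
    aggregate_delta: float
    subset_deltas: Dict[str, float]
    calibration_error: float


def gate_failures(report: EvalReport, config: GateConfig) -> List[str]:
    """Return the reasons to block the release; an empty list means ship."""
    failures: List[str] = []
    if report.aggregate_delta < -config.aggregate_tolerance:
        failures.append(f"aggregate regressed by {-report.aggregate_delta:.2f}")
    for name, tolerance in sorted(config.subset_tolerances.items()):
        if name not in report.subset_deltas:
            # A gated subset that was not evaluated counts as a failure.
            failures.append(f"subset {name}: missing from report")
        elif report.subset_deltas[name] < -tolerance:
            failures.append(
                f"subset {name}: regressed by {-report.subset_deltas[name]:.2f}"
            )
    if report.calibration_error > config.max_calibration_error:
        failures.append(f"calibration error {report.calibration_error:.2f} too high")
    return failures


config = GateConfig(
    aggregate_tolerance=0.0,
    subset_tolerances={"medication_routing_drug_interactions": 0.01},
    max_calibration_error=0.05,
)
report = EvalReport(
    aggregate_delta=0.03,
    subset_deltas={"medication_routing_drug_interactions": -0.12},
    calibration_error=0.04,
)
failures = gate_failures(report, config)
exit_code = 1 if failures else 0  # a CI step would pass this to sys.exit
print(failures, exit_code)
```

Returning the reasons rather than a bare boolean keeps the gate auditable: a blocked release says which tolerance it broke.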

The deploy gate's tolerance is a product decision, not an engineering one. It is where the team makes explicit how much regression on each axis is acceptable for the throughput gain a new model brings. That conversation is uncomfortable. Avoiding it is the actual reason most teams do not have a deploy gate.

Rotate the ground truth, not just the model

Ground-truth eval data ages. The world moves; user behaviour shifts; the documents in your retrieval corpus change. A harness whose ground truth is two years old is measuring against a world that no longer exists. Build a refresh cycle into the harness from day one — not because you will need it now, but because by the time you notice you need it, you will have shipped a regression.
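A refresh cycle can start as a staleness check over the date each case was last verified. The sketch below assumes that date is tracked per case; the one-year limit and the batch size are placeholder values.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_case_ids(last_verified: Dict[str, date], today: date,
                   max_age: timedelta) -> List[str]:
    """Cases whose ground truth has not been re-verified within max_age."""
    return sorted(
        case_id for case_id, verified in last_verified.items()
        if today - verified > max_age
    )


def refresh_batch(last_verified: Dict[str, date], batch_size: int) -> List[str]:
    """The oldest cases first, to send for re-adjudication this cycle."""
    oldest_first = sorted(last_verified, key=lambda cid: (last_verified[cid], cid))
    return oldest_first[:batch_size]


# Illustrative dates only.
last_verified = {
    "a": date(2022, 1, 10),
    "b": date(2024, 5, 1),
    "c": date(2023, 2, 1),
}
today = date(2024, 6, 1)

stale = stale_case_ids(last_verified, today, max_age=timedelta(days=365))
batch = refresh_batch(last_verified, batch_size=2)
print("stale:", stale)
print("re-adjudicate next:", batch)
```

Running the check on every release, even when nothing is stale yet, is what puts the refresh cycle in place before it is needed.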

A harness with stale ground truth produces high-confidence wrong answers. So does an LLM with weak provenance. The structural problem is the same.
