Field Notes
When Your Eval Set Stops Telling the Truth
Your eval set is supposed to be the truth. It can quietly stop being it. Four ways evals lie, the quarterly audit that catches them, and the discipline of killing examples.
"Build the eval set before the feature" is the advice. We stand by it. But six months in, a different problem appears: the eval set passes, the score climbs, and users start complaining anyway.
The eval set has stopped telling the truth. It's still answering the question you asked it; the question just no longer matches the product.
Four ways an eval set lies
1. Optimized to death
The team has iterated against the same 50 examples for nine months. Score is now 97%. Production quality is flat. You're not measuring quality anymore — you're measuring "how well does the model do on these specific 50 things." That's overfitting to the eval, dressed up as progress.
2. Stale
The eval set was built from production inputs from Q3 last year. Since then, the product surface area has grown, you've added two languages, and mobile users have shifted the input distribution. The eval still reflects the old world. Regressions pass it cleanly; your real users hit inputs it has never seen.
3. Wrong granularity
The eval checks "does the output mention shipping costs?" The real bug is "the shipping cost amount is wrong by a factor of 10." The eval passes a wrong answer with a correct-shaped output. The granularity is too coarse to catch the failure mode you actually care about.
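To make that concrete, here is a sketch (the product detail, the expected amount, and the output format are invented for illustration): the coarse check passes an answer that is off by 10x, while a check at the right granularity fails it.

```python
import re

def coarse_check(output: str) -> bool:
    # Passes any answer that mentions shipping, even one that is off by 10x.
    return "shipping" in output.lower()

def exact_check(output: str, expected: float = 4.99) -> bool:
    # Requires the shipping amount itself to be right, not just present.
    m = re.search(r"shipping[^$]*\$(\d+(?:\.\d{1,2})?)", output.lower())
    return m is not None and abs(float(m.group(1)) - expected) < 0.01

print(coarse_check("Shipping costs $49.90"))  # True  -> the eval passes a wrong answer
print(exact_check("Shipping costs $49.90"))   # False -> the bug is caught
```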
4. Biased sampling
The eval was built from the inputs that were easiest to label — short, English, happy-path. Long inputs, PT-BR, edge cases, adversarial prompts: all under-represented. The eval is honest about the slice it covers and silent about the slice it doesn't.
The audit — four checks
Run these quarterly. Each should take under an hour with logs at hand.
1. Hold-out drift
You kept 20% of examples as a test set the team doesn't iterate against, right? Compare train score vs. test score over the last six months. If train climbed and test stalled — overfitting. Refresh the train set with new examples.
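A minimal sketch of this check, assuming you record one train/hold-out score pair per month; the field names and the 5-point gap threshold are illustrative, not a standard.

```python
# Monthly (train, hold-out) scores; only the first and last points matter here.
monthly = [
    {"month": "2024-01", "train": 0.89, "holdout": 0.86},
    {"month": "2024-06", "train": 0.97, "holdout": 0.87},
]

def holdout_drift(history, min_gap=0.05):
    """Flag overfitting: the train score climbed while the hold-out stalled."""
    first, last = history[0], history[-1]
    train_gain = last["train"] - first["train"]
    holdout_gain = last["holdout"] - first["holdout"]
    # Overfitting signature: most of the "improvement" exists only on the
    # examples the team iterates against.
    return train_gain - holdout_gain > min_gap

if holdout_drift(monthly):
    print("Train climbed, hold-out stalled: refresh the train examples.")
```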
2. Tag-pass-rate distribution
Bucket eval results by tag (language, input length, category, customer segment). A 92% average pass rate might be 99% on "EN, short" and 71% on "PT-BR, long." If any tag is more than 10 points below average and has real production volume, your eval is hiding a regression in that slice.
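A sketch of the bucketing, assuming each eval result carries its tags and you know each slice's share of production traffic; the field names and thresholds are illustrative.

```python
from collections import defaultdict

results = [
    {"tags": ("EN", "short"), "passed": True},
    {"tags": ("PT-BR", "long"), "passed": False},
    # ... the rest of the eval run
]
production_volume = {("EN", "short"): 0.55, ("PT-BR", "long"): 0.12}

def hidden_regressions(results, volume, gap=0.10, min_volume=0.05):
    """Return slices well below the average pass rate that carry real traffic."""
    by_tag = defaultdict(list)
    for r in results:
        by_tag[r["tags"]].append(r["passed"])
    overall = sum(r["passed"] for r in results) / len(results)
    flagged = []
    for tags, passes in by_tag.items():
        rate = sum(passes) / len(passes)
        if overall - rate > gap and volume.get(tags, 0) >= min_volume:
            flagged.append((tags, round(rate, 2)))
    return flagged

print(hidden_regressions(results, production_volume))
```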
3. Production-vs-eval divergence
Plot eval pass rate next to one production outcome metric (thumbs-up rate, escalation rate, refund rate) on a six-month chart. If the two trend together — good, eval is honest. If they diverge for more than two weeks — eval is lying. Trust production.
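A sketch of the divergence check on weekly numbers; the series and the trailing-streak heuristic are invented for illustration. Plotting the two series is usually enough; this just automates the alarm.

```python
# One data point per week for each series (values invented for illustration).
eval_pass_rate = [0.91, 0.92, 0.93, 0.94, 0.95, 0.95]
thumbs_up_rate = [0.80, 0.81, 0.80, 0.77, 0.74, 0.72]

def diverging_weeks(eval_series, prod_series):
    """Count consecutive trailing weeks where the eval improved but production got worse."""
    streak = 0
    for i in range(len(eval_series) - 1, 0, -1):
        eval_up = eval_series[i] >= eval_series[i - 1]
        prod_down = prod_series[i] < prod_series[i - 1]
        if eval_up and prod_down:
            streak += 1
        else:
            break
    return streak

if diverging_weeks(eval_pass_rate, thumbs_up_rate) > 2:
    print("Eval and production have diverged for more than two weeks: trust production.")
```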
4. Adversarial freshness
Take five real complaints from the last 30 days where users said the output was wrong. Try to express each as a new eval case. If your existing eval already catches all five — great. If even one slips through, the eval missed a class of failure your users found. Add it; investigate whether there's a category behind it.
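What "express it as a new eval case" can look like; the schema, field names, and the check itself are illustrative, not a prescribed format.

```python
# One recent complaint turned into a concrete eval case.
new_case = {
    "id": "2025-q2-shipping-amount-off-by-10x",
    "source": "support ticket, last 30 days",
    "tags": ("PT-BR", "long", "shipping"),
    "input": "<the user's actual input, copied verbatim and redacted>",
    # The check encodes the exact failure the user hit, not just the output shape.
    "check": lambda output: "R$ 4,99" in output,
}
```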
Killing examples, not just adding
Every team adds examples to their eval set over time. Almost no team removes them. The result: a sprawling eval that's slow to run, biased toward whatever the team thought was interesting in month two, and harder to keep meaningful.
Once per quarter, delete:
- Examples your current model passes 100 times in a row. They've stopped being a signal. Keep one or two as canaries; drop the rest.
- Examples for product surfaces you killed. If the feature is gone, its evals shouldn't haunt the suite.
- Duplicates of duplicates. Near-identical inputs that all test the same property. Keep one well-labeled.
The goal is not a bigger eval set. It is an eval set that takes under five minutes to run end-to-end and surfaces real signal. A 200-example set the team actually runs on every change beats a 2,000-example set they avoid because it's slow.
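A minimal sketch of that kill pass. The field names ("surface", "pass_streak", "input") and the near-duplicate threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher

def kill_pass(examples, dead_surfaces, canaries_per_surface=2, streak_cutoff=100):
    kept, canaries = [], {}
    for ex in examples:
        # 1. The product surface is gone -> its evals go with it.
        if ex["surface"] in dead_surfaces:
            continue
        # 2. Near-duplicate of something already kept -> keep one well-labeled copy.
        if any(SequenceMatcher(None, ex["input"], k["input"]).ratio() > 0.9 for k in kept):
            continue
        # 3. Passes every run -> keep one or two canaries, drop the rest.
        if ex["pass_streak"] >= streak_cutoff:
            if canaries.get(ex["surface"], 0) >= canaries_per_surface:
                continue
            canaries[ex["surface"]] = canaries.get(ex["surface"], 0) + 1
        kept.append(ex)
    return kept
```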
The quarterly refresh ritual
Block 90 minutes once a quarter. Two engineers, the PM, real production logs. Agenda:
- Run the four audit checks above. Write the result down.
- Sample 30 new inputs from the last 30 days of production logs and label them (a sampling sketch follows this list).
- Drop 15 examples from the existing set per the kill criteria.
- Re-run the full eval against the current production prompt. Record the new baseline.
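For the sampling step, a sketch that stratifies by tag so the refresh doesn't recreate the biased-sampling problem above; the log record shape is an assumption.

```python
import random
from collections import defaultdict

def sample_new_inputs(logs, n=30, seed=0):
    """Pull ~n recent production inputs, spread across tag slices rather than at random."""
    random.seed(seed)
    by_tag = defaultdict(list)
    for entry in logs:
        by_tag[entry["tags"]].append(entry)  # each log entry carries its tags (assumed)
    per_slice = max(1, n // max(1, len(by_tag)))
    sample = []
    for entries in by_tag.values():
        sample.extend(random.sample(entries, min(per_slice, len(entries))))
    return sample[:n]
```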
That's it. Quarterly is enough — more often turns into ceremony, less often lets staleness compound.
Close
Eval sets are not write-once artifacts. They are living test suites that need pruning, refreshing, and honest audits — the same as the production code they describe.
The teams that ship AI features that stay good aren't the ones with the biggest eval sets. They're the ones whose evals still tell the truth in week 52.