Field Notes

Evals Before Features

The unit-test playbook for LLM systems. How to build your first 50-example eval set in a week — and why every team that skips this step pays for it later.

Every team building with LLMs eventually has the same realization: "I can't tell if my changes are helping or hurting." Prompt tweaks feel like an improvement until a user surfaces an output worse than what shipped last month. The model version changes and silently degrades a third of your responses. A vendor swap fixes one bug and introduces three.

The fix is not better prompts. The fix is evals.

An eval set is the LLM equivalent of a test suite: a fixed, versioned collection of inputs and expected outputs (or pass/fail rules) you can run on every change. Without one, you are guessing. With one, you are engineering.

The shortcut that costs you

Most teams skip evals because building them feels slower than just shipping. It is — for the first three weeks. After that, the math inverts. Teams without evals spend more and more time on regression hunts, "did the model get worse?" debates, and emergency rollbacks. Teams with evals ship faster because every change is a measurable change.

If you've ever asked the question "do you think the new prompt is better?", you needed an eval set last month.

What an eval set actually is

At a minimum: a CSV, JSON file, or table with three columns.

  • Input — the prompt or user message your system receives.
  • Expected output — the answer, or the rule the answer must satisfy.
  • Tags — categories so you can spot which kinds of input are regressing (e.g. "PT-BR", "edge-case", "tool-use").

And a script that runs your system against every row, compares the output to the expected output, and reports pass / fail / score. That's it. The first version can be a Python notebook. It does not need a UI.
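A minimal runner fits on one screen. Here is a sketch, assuming the eval set lives in `evals.json` with `input`, `expected`, and `tags` fields, and that `run_system` wraps whatever your LLM call is; both names are placeholders, not a real API:

```python
import json

def run_system(user_input: str) -> str:
    """Placeholder: call your actual LLM pipeline here."""
    raise NotImplementedError

def run_evals(path: str = "evals.json") -> float:
    with open(path) as f:
        rows = json.load(f)  # [{"input": ..., "expected": ..., "tags": [...]}, ...]
    passed = 0
    for row in rows:
        output = run_system(row["input"])
        ok = output.strip() == row["expected"].strip()  # exact match; swap in your grader
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  tags={row['tags']}")
    score = passed / len(rows)
    print(f"{passed}/{len(rows)} passed ({score:.0%})")
    return score
```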

Three eval flavors

You'll usually need a mix. Different tasks call for different graders.

1. Ground truth

You know the exact right answer. Classification, extraction, structured output — these have ground truth. Grading is a string or schema match. Cheap, deterministic, fast.

Use when: the task has one correct answer (sentiment, category, JSON extraction, math).
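The graders for this flavor are a few lines each. A sketch of two, one for plain strings and one for JSON extraction:

```python
import json

def grade_exact(output: str, expected: str) -> bool:
    # Normalize case and whitespace so formatting noise doesn't fail the eval.
    return output.strip().lower() == expected.strip().lower()

def grade_json(output: str, expected: dict) -> bool:
    # For extraction tasks: parse first, compare structure, never raw strings.
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False
```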

2. Rubric-based

You don't have one right answer but you have rules. "The output must include a greeting, must be under 200 words, must not promise a discount." Run each rule as a check.

Use when: the task is generative but constrained (drafts, summaries, customer replies).
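Those three rules, as code. A sketch; the greeting pattern is an illustrative guess at what counts as a greeting in your product:

```python
import re

RUBRIC = [
    ("has_greeting",    lambda out: bool(re.match(r"\s*(hi|hello|dear)\b", out, re.I))),
    ("under_200_words", lambda out: len(out.split()) < 200),
    ("no_discount",     lambda out: "discount" not in out.lower()),
]

def grade_rubric(output: str) -> dict:
    # Each rule passes or fails independently, so you can see *which* constraint broke.
    return {name: check(output) for name, check in RUBRIC}
```

Reporting per-rule results, rather than one combined pass/fail, is what makes regressions debuggable.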

3. Model-graded

A separate, often stronger model scores the output against a rubric you write. Useful for fluent generation tasks where you can describe quality but not enumerate it. Treat it as noisier than the first two — and always spot-check it against human judgment on a sample.

Use when: the task is open-ended (long answers, explanations, brainstorms). Don't use it as your only eval — pair with at least one rubric check.
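The judge loop itself is small. A sketch, where `call_judge` is a placeholder for a request to your grading model (no specific API assumed):

```python
JUDGE_PROMPT = """You are grading an assistant's answer against a rubric.

Rubric: {rubric}

Answer: {answer}

Reply with a single integer from 1 to 5, where 5 fully satisfies the rubric."""

def call_judge(prompt: str) -> str:
    """Placeholder: call your stronger grading model and return its text reply."""
    raise NotImplementedError

def grade_with_model(answer: str, rubric: str, threshold: int = 4) -> bool:
    reply = call_judge(JUDGE_PROMPT.format(rubric=rubric, answer=answer))
    try:
        return int(reply.strip()) >= threshold
    except ValueError:
        return False  # an unparseable grade counts as a failure, never a pass
```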

Building your first 50 examples

You do not need 5,000. You need 50 — well chosen.

  1. Steal from production. Pull 200 real inputs from logs (last 30 days). If you don't have logs yet, log first, then come back.
  2. Sample for diversity. Don't pick the 50 easiest. Mix happy paths, ambiguous inputs, malformed requests, long inputs, short inputs, and at least 5 things that look adversarial (a sampling sketch follows this list).
  3. Sanitize. Strip PII. Replace names, emails, account numbers with realistic-looking placeholders. Don't commit raw customer data to a repo.
  4. Label by hand. The first 50 must be labeled by a human who knows the domain. This is the painful, irreplaceable part. Budget two focused hours.
  5. Tag aggressively. "Returns question", "PT-BR input", "multi-turn", "tool-use required". Tags are how you'll later say "our PT-BR pass rate dropped 12% this week."
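A minimal stratified-sampling sketch for step 2, assuming each logged row carries a rough `category` field (an assumption about your logs, not a given):

```python
import random
from collections import defaultdict

def sample_diverse(candidates: list[dict], n: int = 50, seed: int = 0) -> list[dict]:
    """Round-robin across rough categories so no one bucket dominates the set."""
    random.seed(seed)
    buckets = defaultdict(list)
    for row in candidates:
        buckets[row["category"]].append(row)
    for bucket in buckets.values():
        random.shuffle(bucket)
    picked = []
    while len(picked) < n and any(buckets.values()):
        for bucket in buckets.values():
            if bucket and len(picked) < n:
                picked.append(bucket.pop())
    return picked
```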

After this, you have an eval set. Run your current production system against it. Whatever score you get is the floor — any future change must beat it or you don't merge.

The cadence

Run evals on:

  • Every prompt change. No exceptions. A fast feedback loop only works if it's fast: keep the eval set small enough that a full run takes under five minutes.
  • Every model swap. Including minor version bumps. "Same model, new snapshot" has burned more launches than we can count.
  • Every retrieval-index update. If your system grabs context from a vector store, changes to that store can silently change behavior.
  • Weekly, as a freshness check. Even with nothing changed on your end, drift happens. Schedule a Friday-morning run.
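Wiring the floor rule into CI is one small script. A sketch, assuming the `run_evals` runner from earlier lives in an `evals` module and a `floor.txt` file stores the last accepted pass rate (both names are illustrative):

```python
import sys
from pathlib import Path

from evals import run_evals  # the runner sketched in "What an eval set actually is"

FLOOR = Path("floor.txt")    # last accepted pass rate, written as plain text

def gate() -> None:
    score = run_evals("evals.json")
    floor = float(FLOOR.read_text())
    if score < floor:
        print(f"FAIL: {score:.0%} is below the floor of {floor:.0%}, do not merge")
        sys.exit(1)
    FLOOR.write_text(str(score))  # ratchet: a passing run becomes the new floor
    print(f"OK: {score:.0%} >= floor {floor:.0%}")

if __name__ == "__main__":
    gate()
```

The ratchet is a choice, not a requirement: if you'd rather keep a fixed baseline, drop the write-back line.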

The trap: when your eval improves but your users don't

The most expensive failure mode of an eval set is when the team starts optimizing the eval instead of the product. Score goes up, users complain more.

Two safeguards:

  • Hold out a slice. Keep 20% of your examples in a "test set" the team can't iterate against. Check it monthly. If train score climbs but test score stalls, you're overfitting to the eval (a split sketch follows this list).
  • Pair eval scores with one production metric. User thumbs-up rate, escalation rate, refund rate — something with real-world signal. If eval and production diverge for more than two weeks, your eval set is wrong; fix it.
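The split itself is trivial; the part that matters is the fixed seed, so the same slice stays held out across runs. A sketch:

```python
import random

def split_holdout(rows: list[dict], holdout_frac: float = 0.2, seed: int = 7):
    # Fixed seed: the same 20% stays held out run after run.
    random.seed(seed)
    shuffled = rows[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    return shuffled[cut:], shuffled[:cut]  # (dev set, holdout set)
```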

Close

Evals aren't glamorous. They're not in the demo. They're not what gets the launch tweet. They are what makes the system you launched still good in week 12.

If you only do one thing this week before writing more prompt logic, write the eval set. The rest of the system gets easier from there.