Alejandro Rioja.
AI Agents Operations

The Eval Harness I Use to Ship AI Agents Without Fear

Alejandro Rioja
6 min read
TL;DR

Shipping agents without fear comes from one thing: an eval harness. A fixed set of graded test cases, scored automatically (assertions plus an LLM judge), run before every prompt or model change. If the score holds, ship. The test set is built from real production failures.

Start with a test set built from real failures

The harness is only as good as its test cases, and the best test cases come from production, not your imagination. Every time an agent fails in the wild, I capture the exact input (I log every run with a trace ID — see how to debug an agent in production) and turn it into an eval case:

typescript
interface EvalCase {
  id: string;
  input: AgentInput;        // the exact production input
  expected?: string;        // ground truth, when there is one
  assertions: Assertion[];  // hard checks that must pass
  rubric?: string;          // for the LLM judge, when output is open-ended
}

Two practices matter here. Pull from production, so your evals test what actually breaks, not what you guessed might. And cover the spread — happy path, edge cases, adversarial inputs, and the empty/malformed inputs that cause silent failures. A test set of 30-50 well-chosen cases catches far more than 500 lazy ones. I’d rather have 40 cases that each represent a real failure mode than a thousand that all test the same easy path.
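One way to make "pull from production" mechanical is a small helper that turns a logged failure into a case, and refuses to create it unless the recorded bad output actually fails one of the assertions. This is only a sketch: `FailedTrace`, `caseFromTrace`, and the shapes given to `AgentInput` and `Assertion` are assumptions, since none of them are defined in this article.

```typescript
// Assumed shapes; the article does not define AgentInput, Assertion, or a trace format.
type AgentInput = { message: string };
type Assertion = (output: string) => boolean;

interface EvalCase {
  id: string;
  input: AgentInput;
  expected?: string;
  assertions: Assertion[];
  rubric?: string;
}

// Hypothetical record of one failed production run.
interface FailedTrace {
  traceId: string;
  input: AgentInput;
  output: string; // what the agent actually returned
}

function caseFromTrace(trace: FailedTrace, assertions: Assertion[]): EvalCase {
  // If every assertion passes on the known-bad output, the case would not
  // have caught the original failure, so it is rejected.
  if (assertions.every((check) => check(trace.output))) {
    throw new Error(`Assertions do not catch the failure in trace ${trace.traceId}`);
  }
  return { id: `prod-${trace.traceId}`, input: trace.input, assertions };
}

// Example: the agent returned an empty reply to a whitespace-only message.
const emptyReplyCase = caseFromTrace(
  { traceId: "abc123", input: { message: "   " }, output: "" },
  [(out) => out.trim().length > 0],
);
```

Checking the assertions against the recorded bad output is a cheap guard against adding a case that would have passed during the original incident.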

Score with assertions first, an LLM judge second

Not every output needs a model to grade it. I reach for the cheapest scorer that works.

Hard assertions for anything structured. Does the output parse as valid JSON? Does it contain the required field? Is the extracted date in range? Did it call the right tool with the right arguments? These are deterministic, free, and unambiguous — write as many as you can.

typescript
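const assertions: Assertion[] = [
  (out) => isValidJSON(out),
  // assumes ALLOWED_CATEGORIES is an array; `in` would test its indices, not its values
  (out) => ALLOWED_CATEGORIES.includes(parse(out).category),
  (out) => {
    const { confidence } = parse(out); // parse once, then range-check
    return confidence >= 0 && confidence <= 1;
  },
];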
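The tool-call check mentioned above needs the run's tool calls rather than its text output. A minimal sketch, assuming the harness records each call as a `{ tool, args }` object; that shape, the `calledTool` helper, and the sample tool names are all hypothetical:

```typescript
// Hypothetical record of one tool invocation captured during a run.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// True if some call used the named tool and every expected argument matches.
// Extra arguments on the actual call are ignored; values are compared by
// their JSON form, which is enough for plain strings, numbers, and booleans.
function calledTool(
  calls: ToolCall[],
  tool: string,
  expectedArgs: Record<string, unknown>,
): boolean {
  return calls.some(
    (call) =>
      call.tool === tool &&
      Object.entries(expectedArgs).every(
        ([key, value]) => JSON.stringify(call.args[key]) === JSON.stringify(value),
      ),
  );
}

// Illustrative data only.
const recordedCalls: ToolCall[] = [
  { tool: "search_flights", args: { from: "MAD", to: "LIS" } },
  { tool: "book_flight", args: { flightId: "F42", seats: 2 } },
];
```

With this data, `calledTool(recordedCalls, "book_flight", { seats: 2 })` holds, while asking for `seats: 3` does not.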

An LLM judge for the open-ended rest — tone, helpfulness, “did this actually answer the question.” Here you give a model the input, the output, and a rubric, and ask it to score. Two rules keep the judge honest: make the rubric specific (a 1-5 scale with described anchors beats “rate the quality”), and use a strong model as the judge — judging is a reasoning task, so this is a place I happily pay for Sonnet even when the agent itself runs on Haiku per the cost math. A vague rubric or a weak judge gives you noise that looks like signal.
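To make the judge step concrete, here is a sketch of the two pieces that do not depend on any particular model API: building the prompt from input, output, and rubric, and strictly parsing the reply. The names (`JudgeRequest`, `buildJudgePrompt`, `parseJudgeReply`), the rubric wording, and the JSON reply format are assumptions for illustration; the actual model call is left out on purpose.

```typescript
interface JudgeRequest {
  input: string;
  output: string;
  rubric: string;
}

interface JudgeVerdict {
  score: number; // integer from 1 to 5
  reason: string;
}

// Example rubric with described anchors (illustrative wording).
const HELPFULNESS_RUBRIC = [
  "5: fully answers the question and needs no follow-up",
  "3: partially answers; the user would have to ask again",
  "1: does not address the question",
].join("\n");

function buildJudgePrompt(req: JudgeRequest): string {
  return [
    "You are grading an AI agent's response.",
    `Rubric:\n${req.rubric}`,
    `Input:\n${req.input}`,
    `Response:\n${req.output}`,
    'Reply with JSON only: {"score": <integer 1-5>, "reason": "<one sentence>"}',
  ].join("\n\n");
}

// Returns null for anything that is not a well-formed verdict, so a rambling
// or malformed judge reply is never silently counted as a score.
function parseJudgeReply(reply: string): JudgeVerdict | null {
  try {
    const parsed: unknown = JSON.parse(reply);
    if (typeof parsed !== "object" || parsed === null) return null;
    const { score, reason } = parsed as { score?: unknown; reason?: unknown };
    if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 5) {
      return null;
    }
    return { score, reason: typeof reason === "string" ? reason : "" };
  } catch {
    return null;
  }
}
```

Treating an unparseable reply as "no verdict" rather than a low score keeps judge failures visible instead of mixing them into the results.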

Run the harness before every change

The harness exists to answer one question: did this change make the agent better or worse? So I run it before every prompt edit, model swap, or tool change.

bash
# baseline on main
# --silent keeps npm's own script banner out of the redirected JSON
npm run --silent eval -- --suite=booking-agent > baseline.json

# make the change, then re-run
npm run --silent eval -- --suite=booking-agent > candidate.json

# compare
npm run eval:diff -- baseline.json candidate.json

The diff shows aggregate score, per-case pass/fail, and — crucially — which specific cases regressed. An aggregate that ticks up while three cases silently break is not an improvement; it’s a trade I want to see and approve, not one that sneaks through. Watching the per-case diff is how you avoid “fixed one thing, broke two others,” the failure mode that makes people afraid of their own prompts.
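The per-case comparison can be sketched in a few lines. This assumes each run reduces to a list of `{ id, passed }` results; that format and the names `CaseResult`, `EvalDiff`, and `diffRuns` are hypothetical, since the article does not show what `eval:diff` reads or prints.

```typescript
interface CaseResult {
  id: string;
  passed: boolean;
}

interface EvalDiff {
  baselinePassRate: number;
  candidatePassRate: number;
  regressed: string[]; // passed on baseline, fails on candidate
  fixed: string[];     // failed on baseline, passes on candidate
}

function passRate(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

function diffRuns(baseline: CaseResult[], candidate: CaseResult[]): EvalDiff {
  const before = new Map<string, boolean>(
    baseline.map((r): [string, boolean] => [r.id, r.passed]),
  );
  const regressed: string[] = [];
  const fixed: string[] = [];
  for (const result of candidate) {
    const was = before.get(result.id); // undefined for cases new in the candidate run
    if (was === true && !result.passed) regressed.push(result.id);
    if (was === false && result.passed) fixed.push(result.id);
  }
  return {
    baselinePassRate: passRate(baseline),
    candidatePassRate: passRate(candidate),
    regressed,
    fixed,
  };
}
```

Because `regressed` is reported separately from the pass rates, a run whose aggregate improves still surfaces every case that went from passing to failing.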

Set a regression gate and let it block

Once you trust the harness, wire it into the path to production as a gate. My rule is blunt: a change that drops the pass rate below the baseline, or below a fixed minimum threshold, doesn’t ship. Not “I’ll look into it later”: it’s blocked, same as a failing CI test.

typescript
const PASS_THRESHOLD = 0.90; // 90% of cases must pass
// the bar is whichever is higher: the fixed threshold or the current baseline
const requiredPassRate = Math.max(PASS_THRESHOLD, baseline.passRate);
if (candidate.passRate < requiredPassRate) {
  throw new Error(`Eval regression: ${candidate.passRate} < ${requiredPassRate}`);
}

This is what converts evals from a nice-to-have into the thing that lets you move fast. The gate is what makes “ship without fear” literally true: the worst case for a bad change is a red eval run, not a production incident. And because the test set grows every time something breaks, the gate gets stricter and more protective over time on its own.

Account for non-determinism in scoring

A subtlety that trips people up: the same input can score differently across runs because the model samples differently. If you run each case once, you’ll see phantom regressions — a case “broke” that’s really just sampling noise.

Two mitigations. Run evals at temperature: 0 to shrink variance (it won’t fully eliminate it). And for cases you’ve seen flicker, run them N times and take the pass rate, not a single pass/fail. A case that passes 9/10 is in better shape than one that passes 5/10 even though both can show a green single run. This is the same volume-over-anecdote principle I use when debugging intermittent failures — one run is an opinion, fifty runs are data.
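The run-it-N-times mitigation might look like the sketch below. `RunAgent`, `summarize`, and `runRepeated` are hypothetical names, and the agent is modeled as any async function from an input string to an output string.

```typescript
type Assertion = (output: string) => boolean;
type RunAgent = (input: string) => Promise<string>; // stand-in for the real agent call

interface RepeatSummary {
  passes: number;
  runs: number;
  passRate: number; // passes / runs, or 0 when there were no runs
}

function summarize(outcomes: boolean[]): RepeatSummary {
  const passes = outcomes.filter(Boolean).length;
  const runs = outcomes.length;
  return { passes, runs, passRate: runs === 0 ? 0 : passes / runs };
}

// Runs one case n times in sequence; a run passes only if every assertion holds.
async function runRepeated(
  runAgent: RunAgent,
  input: string,
  assertions: Assertion[],
  n: number,
): Promise<RepeatSummary> {
  const outcomes: boolean[] = [];
  for (let i = 0; i < n; i++) {
    const output = await runAgent(input);
    outcomes.push(assertions.every((check) => check(output)));
  }
  return summarize(outcomes);
}
```

Comparing `passRate` between baseline and candidate for a flickering case, rather than a single boolean, is what separates a real regression from sampling noise.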

Close the loop with production monitoring

The eval harness tests against known cases. Production throws novel ones. So the loop is: monitor live behavior, catch a new failure mode, turn it into an eval case, fix it, and now it’s permanently guarded. The monitoring side — tracking success rate, output validity, and cost per run on live traffic — is what I cover in how I measure whether an AI agent is actually working. Evals and monitoring are two halves of the same system: monitoring finds the bugs, evals make sure they stay dead.

That feedback loop is the real product. Any single eval set goes stale; a process that converts every production failure into a permanent test gets stronger every week. That’s how an agent goes from “scary to touch” to something I’ll refactor on a Friday afternoon without flinching.

FAQ

What goes into an AI agent eval set?

Real production inputs turned into graded cases — happy path, edge cases, adversarial and malformed inputs — each with hard assertions and, for open-ended outputs, an LLM-judge rubric. 30-50 cases drawn from actual failures beat hundreds of synthetic ones that all test the easy path.

Should I use an LLM to grade agent outputs?

Use hard assertions wherever the output is structured (valid JSON, correct field, right tool call) — they’re free and deterministic. Reserve an LLM judge for open-ended qualities like tone and helpfulness, with a specific rubric and a strong judge model so you get signal, not noise.

How do I stop a prompt change from silently breaking production?

Run the eval harness before every change and diff against a baseline, watching per-case regressions, not just the aggregate score. Then gate deploys on the result so any change that drops below the baseline, or below a fixed pass threshold, is blocked like a failing test.

How do I handle non-determinism in evals?

Run at temperature 0 to reduce variance, and for cases that flicker, run them multiple times and score the pass rate instead of a single run. A case that passes 9 of 10 times is healthier than one that passes 5 of 10, even if a single run shows both green.
