How I Measure Whether an AI Agent Is Actually Working

Alejandro Rioja

June 17, 2026 · 7 min read

TL;DR

Most operators skip evals entirely and just assume their agents work. My framework: build a golden set of 5–10 known-good inputs with expected outputs, define pass/fail criteria in plain English, and spot-check logs weekly. Don't build an elaborate eval system before you have 10 real runs — that's the trap that kills momentum.

Table of contents

The problem no one talks about: agents drift silently
What an eval actually is (for operators)
Building your golden set
Defining pass/fail criteria
Using Claude as a judge
Where to look: Cloudflare Worker logs
The spreadsheet eval: ugly but it works
What NOT to do
The operator’s bottom line

The problem no one talks about: agents drift silently

When a human employee starts doing their job wrong, you usually notice. When an AI agent starts producing garbage, it keeps producing garbage — quietly, at scale, until something breaks badly enough that a human finally looks.

I’ve had a content agent that started appending “As an AI language model” disclaimers after a model update. I’ve had an event promoter agent that stopped including ticket links because a prompt variable name changed. Neither failed loudly. Both just degraded.

The fix isn’t building a NASA-grade monitoring system. It’s having a simple, repeatable check that catches drift before it compounds.

What an eval actually is (for operators)

Engineers use the word “eval” to mean running a benchmark on a model. For operators, I mean something simpler: a repeatable test that tells you whether your agent is still doing what you built it to do.

Three components:

Golden set — 5–10 real inputs you’ve seen before, with expected outputs you already know are good
Pass/fail criteria — plain-English rules for what counts as passing
A scheduled check — you or your assistant actually runs the test on a cadence

That’s it. You don’t need a framework. You need discipline.

Building your golden set

Pull from your production logs. Find 5–10 real inputs where you already know what a good output looks like. These are your ground truth.

For my content pipeline agent, the golden set is 5 published posts that passed my voice checklist when I wrote them manually. For my Pickleland event promoter, it’s 5 past Facebook posts that got above-average engagement (comments + shares, not just likes).

Rules for a good golden set:

Real inputs, not hypotheticals you made up
Include at least one edge case (a tricky input, a short input, an input with unusual formatting)
Keep expected outputs documented — a screenshot, a text file, a row in a spreadsheet
Never delete from the golden set; only add

Each time you confirm the agent is working, write down exactly what “good” looked like. That record becomes your expected output.
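The golden-set rules above can be sketched as a small append-only file. This is a minimal illustration, not a required tool: the `GoldenCase` fields, the JSON Lines layout, and the function names are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass(frozen=True)
class GoldenCase:
    case_id: str            # hypothetical identifier, e.g. "post-001"
    input_text: str         # a real input pulled from production logs
    expected_summary: str   # what "good" looked like when last confirmed
    is_edge_case: bool = False


def load_cases(path: Path) -> list[GoldenCase]:
    """Read every case from the JSON Lines file (empty list if no file yet)."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [GoldenCase(**json.loads(line)) for line in f if line.strip()]


def append_case(path: Path, case: GoldenCase) -> None:
    """Add one case. Existing lines are never rewritten or removed."""
    if case.case_id in {c.case_id for c in load_cases(path)}:
        raise ValueError(f"case {case.case_id!r} already exists; use a new id")
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")


def validate(cases: list[GoldenCase]) -> list[str]:
    """Return the golden-set rules that the current set does not meet."""
    problems = []
    if len(cases) < 5:
        problems.append("fewer than 5 cases")
    if not any(c.is_edge_case for c in cases):
        problems.append("no edge case")
    if any(not c.expected_summary.strip() for c in cases):
        problems.append("missing expected output")
    return problems
```

Opening the file in append mode is what enforces "only add": there is deliberately no delete or update function.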

Defining pass/fail criteria

Vague criteria are useless. “The output should be good” will pass every time because you’ll rationalize it.

Write your criteria as checklist items that a non-expert could evaluate. Here are the actual criteria I use for my content pipeline agent:

Content agent pass/fail checklist:

Post has a TL;DR in the first 100 words
No phrases like “in today’s fast-paced world” or “As an AI”
At least one concrete number or statistic
Word count is between 800 and 2000
All internal links resolve (no 404s)

For the Pickleland event promoter:

Event promoter pass/fail checklist:

Event name matches the source calendar
Date and time are correct
Ticket link is present and not broken
Copy is under 280 words
Post doesn’t use generic filler phrases (“Come join us for a fun time!”)

If at least 4 of the 5 checklist items pass, the run is a pass. If 3 or fewer pass, it’s a fail and I investigate before the next run.
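Several items on the content checklist are mechanical enough to check in code. The sketch below is one possible way to do that, under stated assumptions: the function names are hypothetical, "concrete number" is approximated as "contains a digit", and the link check is passed in as `links_ok` because resolving URLs needs network access.

```python
import re

# Banned phrases from the checklist, as lowercase regex patterns.
BANNED_PATTERNS = (r"in today's fast-paced world", r"\bas an ai\b")


def normalize(text: str) -> str:
    """Lowercase and straighten curly apostrophes so phrase matching is stable."""
    return text.replace("\u2019", "'").lower()


def check_post(post: str, links_ok: bool) -> dict[str, bool]:
    """Evaluate the five content-agent checklist items for one post."""
    words = post.split()
    first_100 = normalize(" ".join(words[:100]))
    lowered = normalize(post)
    return {
        "tldr_in_first_100_words": "tl;dr" in first_100,
        "no_banned_phrases": not any(re.search(p, lowered) for p in BANNED_PATTERNS),
        "has_concrete_number": re.search(r"\d", post) is not None,  # rough heuristic
        "word_count_800_to_2000": 800 <= len(words) <= 2000,
        "internal_links_resolve": links_ok,  # supplied by the caller
    }


def verdict(results: dict[str, bool], min_passes: int = 4) -> str:
    """Apply the 4-of-5 rule: PASS if at least `min_passes` items are True."""
    return "PASS" if sum(results.values()) >= min_passes else "FAIL"
```

A run would then look like `verdict(check_post(draft_text, links_ok=True))`, where `draft_text` is whatever the agent produced.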

Using Claude as a judge

For agents where outputs are long or complex, I use Claude Sonnet as an automated judge. This is faster than manual review and catches things I’d skim past.

Here’s the judge prompt I use for the content agent:

You are evaluating a blog post written by an AI agent. Your job is to check whether it meets the operator's standards.

Evaluate the following post against these criteria:
1. Starts with a direct answer or TL;DR in the first 100 words (YES/NO)
2. Contains at least one concrete number or specific example (YES/NO)
3. Free of AI-speak filler ("As an AI", "in today's fast-paced world", "delve", "it's worth noting") (YES/NO)
4. Word count is between 800 and 2000 words (YES/NO)
5. Tone matches the reference: direct, first-person, opinionated, no fluff (YES/NO)

For each criterion, respond YES or NO with one sentence of explanation.
At the end, output PASS if 4 or 5 criteria are YES, FAIL otherwise.

Post to evaluate:
---
{{post_content}}
---

I run this as a Cloudflare Worker that pulls the latest draft, fires this prompt, and writes the result to a Google Sheet. The whole thing takes 8 seconds and costs about $0.003 per run.
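Before a judge result goes into the sheet, it can be worth recounting the YES answers rather than trusting the final PASS/FAIL line. The sketch below assumes the judge replies with numbered lines such as `1. YES - explanation` followed by a verdict; that reply format and the `parse_judge_reply` name are assumptions, not something the prompt guarantees.

```python
import re

# Matches lines like "1. YES - ..." or "3) no: ..." at the start of a line.
ANSWER_RE = re.compile(r"^\s*(\d+)[.):]\s*\**\s*(YES|NO)\b", re.IGNORECASE | re.MULTILINE)
# The verdict is expected in capitals, as the prompt requests.
VERDICT_RE = re.compile(r"\b(PASS|FAIL)\b")


def parse_judge_reply(reply: str, expected_items: int = 5, min_yes: int = 4) -> dict:
    """Extract per-criterion answers and cross-check the declared verdict."""
    answers = {int(n): a.upper() == "YES" for n, a in ANSWER_RE.findall(reply)}
    if sorted(answers) != list(range(1, expected_items + 1)):
        raise ValueError(f"expected answers 1..{expected_items}, got {sorted(answers)}")

    computed = "PASS" if sum(answers.values()) >= min_yes else "FAIL"
    declared_all = VERDICT_RE.findall(reply)
    declared = declared_all[-1] if declared_all else None  # last mention wins
    return {
        "answers": answers,
        "computed": computed,
        "declared": declared,
        "consistent": declared == computed,
    }
```

If `consistent` comes back False, the judge miscounted or ignored the threshold, and the row deserves a manual look. For the stricter all-five rule, pass `min_yes=5`.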

For the event promoter, the judge prompt is simpler:

You are checking an AI-generated Facebook event post for accuracy and quality.

Source data:
- Event name: {{event_name}}
- Date: {{event_date}}
- Time: {{event_time}}
- Ticket URL: {{ticket_url}}

Generated post:
---
{{generated_post}}
---

Check:
1. Does the post correctly state the event name? (YES/NO)
2. Does the post correctly state the date and time? (YES/NO)
3. Does the post include the exact ticket URL? (YES/NO)
4. Is the post under 280 words? (YES/NO)
5. Is the tone inviting without using generic filler phrases? (YES/NO)

Output PASS if all 5 are YES, FAIL if any are NO. List which items failed.
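Because a renamed prompt variable can silently drop a field, it may help to fill the `{{...}}` placeholders with a function that refuses to run when a value is missing. This is a sketch under assumptions: `render` and `precheck_post` are hypothetical helpers, and the substring test for the ticket URL is a simple heuristic that would also accept a longer URL beginning with the same text.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Fill {{name}} placeholders, failing loudly on missing or blank values."""
    needed = set(PLACEHOLDER_RE.findall(template))
    missing = needed - set(values)
    if missing:
        raise KeyError(f"no value for placeholder(s): {sorted(missing)}")
    blank = sorted(k for k in needed if not values[k].strip())
    if blank:
        raise ValueError(f"blank value for placeholder(s): {blank}")
    return PLACEHOLDER_RE.sub(lambda m: values[m.group(1)], template)


def precheck_post(generated_post: str, ticket_url: str, max_words: int = 280) -> dict[str, bool]:
    """Cheap deterministic checks to run before (or alongside) the judge."""
    return {
        "exact_ticket_url_present": ticket_url in generated_post,
        "under_word_limit": len(generated_post.split()) < max_words,
    }
```

With this in place, a variable rename raises an exception on the next run instead of producing a post without a ticket link.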

Where to look: Cloudflare Worker logs

If you’re running agents on Cloudflare Workers (which I do for most of my lightweight ones), the built-in log tail is your best friend. You don’t need a third-party logging service to start.

What I check in weekly spot-reviews:

Errors and exceptions — anything that crashed or timed out
Token counts — if a run suddenly uses 3x the normal tokens, something changed
Latency spikes — a sudden slowdown usually means the prompt got longer or the model is struggling
Output length drift — if average output went from 600 words to 200 words, the agent changed behavior

I spend 15 minutes every Monday morning on this. I have a simple Notion checklist: open logs for each agent, note anything anomalous, compare token usage against last week’s baseline. That’s the entire process.
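The weekly comparison against a baseline can be sketched as a small function. The `RunLog` fields and `weekly_flags` name are hypothetical, and only the 3x token threshold comes from the list above; the latency and output-length ratios are placeholder assumptions to tune per agent.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunLog:
    ok: bool            # False if the run crashed or timed out
    tokens: int         # total tokens used by the run
    latency_s: float    # wall-clock seconds
    output_words: int   # length of the agent's output


def weekly_flags(
    this_week: list[RunLog],
    baseline: list[RunLog],
    token_ratio: float = 3.0,       # from the checklist: 3x normal tokens
    latency_ratio: float = 2.0,     # assumed threshold
    min_length_ratio: float = 0.5,  # assumed threshold
) -> list[str]:
    """Compare this week's runs with last week's baseline and list anomalies."""
    if not this_week or not baseline:
        raise ValueError("need at least one run in each window")
    flags = []

    failures = sum(1 for r in this_week if not r.ok)
    if failures:
        flags.append(f"errors: {failures} run(s) crashed or timed out")

    base_tokens = mean(r.tokens for r in baseline)
    spikes = [r for r in this_week if r.tokens >= token_ratio * base_tokens]
    if spikes:
        flags.append(f"tokens: {len(spikes)} run(s) at {token_ratio:g}x baseline or more")

    if mean(r.latency_s for r in this_week) >= latency_ratio * mean(r.latency_s for r in baseline):
        flags.append("latency: average run time is well above baseline")

    if mean(r.output_words for r in this_week) <= min_length_ratio * mean(r.output_words for r in baseline):
        flags.append("length: average output is much shorter than baseline")

    return flags
```

An empty list means nothing stood out; anything else goes into the Monday notes.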

The spreadsheet eval: ugly but it works

Before I had any automation, I ran evals in a Google Sheet. I still use this for new agents in the first 4 weeks.

Structure:

Run date	Input	Expected output (summary)	Actual output (summary)	Pass/Fail	Notes
2026-05-01	“Write a post about AI agents”	Direct, opinionated, 1000+ words, TL;DR present	950 words, TL;DR present, strong voice	Pass	Slightly short
2026-05-08	Same	Same	400 words, generic, no TL;DR	Fail	Model drift after update

Five rows a week. Takes 10 minutes. If you have two fails in a row, you stop the agent and fix the prompt before continuing.
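The two-fails-in-a-row rule is easy to apply to a CSV export of the sheet. This sketch assumes the export keeps the `Run date` and `Pass/Fail` column headers shown above, that dates are in ISO format so they sort as text, and that the rows passed in belong to a single golden input; the function names are hypothetical.

```python
import csv
import io


def latest_results(csv_text: str) -> list[str]:
    """Return the Pass/Fail column in date order, lowercased."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: r["Run date"])  # ISO dates sort correctly as strings
    return [r["Pass/Fail"].strip().lower() for r in rows]


def should_stop(results: list[str], streak: int = 2) -> bool:
    """True when the most recent `streak` runs all failed."""
    return len(results) >= streak and all(r == "fail" for r in results[-streak:])
```

For example, `should_stop(latest_results(exported_text))` returns True only once the two newest rows are both fails, which is the point to pause the agent and fix the prompt.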

This is embarrassingly low-tech. It’s also how I caught three prompt regressions before they went to production.

What NOT to do

Don’t build the eval system before you have 10 real runs. I’ve seen founders spend two weeks building a sophisticated eval pipeline for an agent they’ve only run twice. You don’t know enough about what “good” looks like until you have real production data.

Don’t eval on synthetic inputs you made up. Synthetic test cases miss the weird edge cases that production throws at you. Always start with real logs.

Don’t eval everything. Pick the 3–5 agents where failure would actually hurt — customer-facing outputs, anything that posts publicly, anything that triggers a payment. Skip the internal utility agents until you have headspace.

Don’t automate too early. A spreadsheet you actually use beats a Datadog dashboard you forget to check. Start manual, automate after you’ve run the check 10 times and know what you’re actually looking for.

The operator’s bottom line

Evals don’t have to be engineering-grade to be useful. A golden set of 5–10 real inputs, a checklist of pass/fail criteria, and 15 minutes of log-checking every Monday will catch 80% of agent drift before it compounds. Start there. If you’re still running agents without any eval, you’re flying blind, and eventually something will fail publicly enough that you’ll wish you’d spent those 15 minutes.

Want help setting up evals for your agents? Get in touch — I run production agent audits for operator teams.
