How I Measure Whether an AI Agent Is Actually Working
Most operators skip evals entirely and just assume their agents work. My framework: build a golden set of 5–10 known-good inputs with expected outputs, define pass/fail criteria in plain English, and spot-check logs weekly. Don't build an elaborate eval system before you have 10 real runs — that's the trap that kills momentum.
Every Wednesday. 28,400+ operators. Zero fluff.
The problem no one talks about: agents drift silently
When a human employee starts doing their job wrong, you usually notice. When an AI agent starts producing garbage, it keeps producing garbage — quietly, at scale, until something breaks badly enough that a human finally looks.
I’ve had a content agent that started appending “As an AI language model” disclaimers after a model update. I’ve had an event promoter agent that stopped including ticket links because a prompt variable name changed. Neither failed loudly. Both just degraded.
The fix isn’t building a NASA-grade monitoring system. It’s having a simple, repeatable check that catches drift before it compounds.
What an eval actually is (for operators)
Engineers use the word “eval” to mean running a benchmark on a model. For operators, I mean something simpler: a repeatable test that tells you whether your agent is still doing what you built it to do.
Three components:
- Golden set — 5–10 real inputs you’ve seen before, with expected outputs you already know are good
- Pass/fail criteria — plain-English rules for what counts as passing
- A scheduled check — you or your assistant actually runs the test on a cadence
That’s it. You don’t need a framework. You need discipline.
Building your golden set
Pull from your production logs. Find 5–10 real inputs where you already know what a good output looks like. These are your ground truth.
For my content pipeline agent, the golden set is 5 published posts that passed my voice checklist when I wrote them manually. For my Pickleland event promoter, it’s 5 past Facebook posts that got above-average engagement (comments + shares, not just likes).
Rules for a good golden set:
- Real inputs, not hypotheticals you made up
- Include at least one edge case (a tricky input, a short input, an input with unusual formatting)
- Keep expected outputs documented — a screenshot, a text file, a row in a spreadsheet
- Never delete from the golden set; only add
Each time you confirm the agent is working, write down exactly what “good” looked like. That record becomes your expected output.
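A golden set that follows the rules above can live in a very small data structure. The sketch below is one possible shape, not a prescribed format: the class names (`GoldenCase`, `GoldenSet`) and field names (`case_id`, `input_text`, `expected_summary`, `is_edge_case`) are assumptions made up for illustration. It encodes three of the rules: at least 5 cases, at least one edge case, and add-only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenCase:
    case_id: str            # hypothetical identifier, e.g. a post slug
    input_text: str         # a real input pulled from production logs
    expected_summary: str   # what "good" looked like when last confirmed
    is_edge_case: bool = False


@dataclass
class GoldenSet:
    cases: list[GoldenCase] = field(default_factory=list)

    def add(self, case: GoldenCase) -> None:
        # Add-only: there is deliberately no remove method.
        if any(c.case_id == case.case_id for c in self.cases):
            raise ValueError(f"duplicate case_id: {case.case_id}")
        self.cases.append(case)

    def problems(self) -> list[str]:
        """Return rule violations; an empty list means the set is usable."""
        issues = []
        if len(self.cases) < 5:
            issues.append(f"expected at least 5 cases, found {len(self.cases)}")
        if not any(c.is_edge_case for c in self.cases):
            issues.append("no edge case included")
        return issues


# Placeholder data; real cases would come from your own logs.
golden = GoldenSet()
for i in range(5):
    golden.add(GoldenCase(f"post-{i}", f"input {i}",
                          "direct, opinionated, TL;DR present",
                          is_edge_case=(i == 4)))
print(golden.problems())  # []
```

A spreadsheet row per case does the same job; the point of the structure is only that the rules get checked instead of remembered.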
Defining pass/fail criteria
Vague criteria are useless. “The output should be good” will pass every time because you’ll rationalize it.
Write your criteria as checklist items that a non-expert could evaluate. Here are the actual criteria I use for my content pipeline agent:
Content agent pass/fail checklist:
- Post has a TL;DR in the first 100 words
- No phrases like “in today’s fast-paced world” or “As an AI”
- At least one concrete number or statistic
- Word count is between 800 and 2000
- All internal links resolve (no 404s)
For the Pickleland event promoter:
Event promoter pass/fail checklist:
- Event name matches the source calendar
- Date and time are correct
- Ticket link is present and not broken
- Copy is under 280 words
- Post doesn’t use generic filler phrases (“Come join us for a fun time!”)
If at least 4 of 5 checklist items pass, the run is a pass. If 3 or fewer pass, it’s a fail and I investigate before the next run.
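Most of the content checklist is mechanical enough to script. What follows is a rough sketch under some loud assumptions: it treats a literal "TL;DR" label as the TL;DR, counts any digit as a "concrete number", and leaves link checking to the caller (passed in as `links_ok`) because resolving links needs network access. The function names are hypothetical.

```python
import re

# Phrases taken from the checklist above; word boundaries avoid false hits
# such as "was an airplane" matching "as an ai".
BANNED_PATTERNS = (
    re.compile(r"\bin today's fast-paced world\b"),
    re.compile(r"\bas an ai\b"),
)


def normalize(text: str) -> str:
    # Treat curly apostrophes like straight ones so phrase matching is robust.
    return text.replace("\u2019", "'").lower()


def check_post(post: str) -> dict[str, bool]:
    words = post.split()
    first_100 = normalize(" ".join(words[:100]))
    body = normalize(post)
    return {
        "tldr_in_first_100_words": "tl;dr" in first_100,
        "no_banned_phrases": not any(p.search(body) for p in BANNED_PATTERNS),
        "has_concrete_number": re.search(r"\d", post) is not None,  # crude proxy
        "word_count_800_to_2000": 800 <= len(words) <= 2000,
    }


def verdict(results: dict[str, bool], links_ok: bool) -> str:
    # Four scripted checks plus the caller-supplied link check make 5 items.
    passed = sum(results.values()) + int(links_ok)
    return "PASS" if passed >= 4 else "FAIL"


# Made-up post: a TL;DR label, 908 words, one number, no banned phrases.
sample = "TL;DR: evals catch drift. " + "word " * 900 + "I run 30 agents."
print(verdict(check_post(sample), links_ok=True))  # PASS
```

Anything the script cannot judge, such as voice, still needs a human or a model to look at it.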
Using Claude as a judge
For agents where outputs are long or complex, I use Claude Sonnet as an automated judge. This is faster than manual review and catches things I’d skim past.
Here’s the judge prompt I use for the content agent:
You are evaluating a blog post written by an AI agent. Your job is to check whether it meets the operator's standards.
Evaluate the following post against these criteria:
1. Starts with a direct answer or TL;DR in the first 100 words (YES/NO)
2. Contains at least one concrete number or specific example (YES/NO)
3. Free of AI-speak filler ("As an AI", "in today's fast-paced world", "delve", "it's worth noting") (YES/NO)
4. Word count is between 800 and 2000 words (YES/NO)
5. Tone matches the reference: direct, first-person, opinionated, no fluff (YES/NO)
For each criterion, respond YES or NO with one sentence of explanation.
At the end, output PASS if 4 or 5 criteria are YES, FAIL otherwise.
Post to evaluate:
---
{{post_content}}
---
I run this as a Cloudflare Worker that pulls the latest draft, fires this prompt, and writes the result to a Google Sheet. The whole thing takes 8 seconds and costs about $0.003 per run.
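The Worker itself is not shown in this post, but the two steps around the model call are easy to sketch: filling the `{{post_content}}` slot and parsing the reply. The example below assumes the judge answers with numbered lines such as "1. YES - explanation" followed by a final PASS or FAIL; real replies may not be that tidy, which is why the sketch recomputes the verdict from the YES count instead of trusting the last line. The model call is left out, and all function names are hypothetical.

```python
import re

# Tail of the judge prompt above; the full prompt would precede it.
JUDGE_TEMPLATE = "Post to evaluate:\n---\n{{post_content}}\n---"


def render(template: str, post_content: str) -> str:
    # Plain string replacement; no templating library assumed.
    return template.replace("{{post_content}}", post_content)


def parse_judge_reply(reply: str, min_yes: int = 4) -> dict:
    # Assumed reply shape: "1. YES - explanation" lines, then PASS or FAIL.
    answers = re.findall(r"^\s*(\d)[.)]\s*(YES|NO)\b", reply,
                         flags=re.MULTILINE | re.IGNORECASE)
    yes_count = sum(1 for _, answer in answers if answer.upper() == "YES")
    stated = re.findall(r"\b(PASS|FAIL)\b", reply)
    stated_verdict = stated[-1] if stated else None
    # Require all 5 criteria to be answered before a PASS is possible.
    computed = "PASS" if len(answers) == 5 and yes_count >= min_yes else "FAIL"
    return {
        "yes_count": yes_count,
        "computed": computed,
        "stated": stated_verdict,
        "consistent": stated_verdict == computed,
    }


# Made-up reply for illustration only.
SAMPLE_REPLY = """1. YES - Opens with a TL;DR.
2. YES - Gives a specific cost figure.
3. NO - Uses "delve" twice.
4. YES - Roughly 1,100 words.
5. YES - Direct and first-person.
PASS"""

print(parse_judge_reply(SAMPLE_REPLY)["computed"])  # PASS
```

Logging the `consistent` flag alongside the verdict is a cheap way to notice when the judge miscounts its own answers.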
For the event promoter, the judge prompt is simpler and stricter (all 5 checks must pass, not 4 of 5):
You are checking an AI-generated Facebook event post for accuracy and quality.
Source data:
- Event name: {{event_name}}
- Date: {{event_date}}
- Time: {{event_time}}
- Ticket URL: {{ticket_url}}
Generated post:
---
{{generated_post}}
---
Check:
1. Does the post correctly state the event name? (YES/NO)
2. Does the post correctly state the date and time? (YES/NO)
3. Does the post include the exact ticket URL? (YES/NO)
4. Is the post under 280 words? (YES/NO)
5. Is the tone inviting without using generic filler phrases? (YES/NO)
Output PASS if all 5 are YES, FAIL if any are NO. List which items failed.
Where to look: Cloudflare Worker logs
If you’re running agents on Cloudflare Workers (which I do for most of my lightweight ones), the built-in log tail is your best friend. You don’t need a third-party logging service to start.
What I check in weekly spot-reviews:
- Errors and exceptions — anything that crashed or timed out
- Token counts — if a run suddenly uses 3x the normal tokens, something changed
- Latency spikes — a sudden slowdown usually means the prompt got longer or the model is struggling
- Output length drift — if average output went from 600 words to 200 words, the agent changed behavior
I spend 15 minutes every Monday morning on this. I have a simple Notion checklist: open logs for each agent, note anything anomalous, compare token usage against last week’s baseline. That’s the entire process.
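The "compare against last week’s baseline" step can be reduced to a few lines once the numbers are out of the logs. This is a minimal sketch, assuming you have already copied per-run token counts, latencies, and output lengths into lists by hand or by export. The 3x threshold comes from the token example above; applying the same ratio to latency and output length, in both directions, is an assumption, and the metric names and sample numbers are made up.

```python
from statistics import mean


def flag_drift(baseline: list[float], current: list[float],
               ratio: float = 3.0) -> bool:
    """True when the current average moved by `ratio` or more, up or down."""
    base_avg, cur_avg = mean(baseline), mean(current)
    if base_avg == 0 or cur_avg == 0:
        return base_avg != cur_avg
    change = cur_avg / base_avg
    return change >= ratio or change <= 1 / ratio


def weekly_review(baseline: dict[str, list[float]],
                  current: dict[str, list[float]]) -> list[str]:
    # Return the names of metrics that drifted since the baseline week.
    return [name for name in baseline
            if name in current and flag_drift(baseline[name], current[name])]


# Made-up numbers: output length collapses from about 600 words to about 190.
last_week = {
    "tokens": [1200, 1100, 1300],
    "latency_ms": [800, 900, 850],
    "output_words": [610, 590, 600],
}
this_week = {
    "tokens": [1250, 1150, 1280],
    "latency_ms": [820, 880, 900],
    "output_words": [180, 200, 190],
}
print(weekly_review(last_week, this_week))  # ['output_words']
```

A ratio this coarse will miss slow drift, so it complements the manual look at the logs but does not replace it.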
The spreadsheet eval: ugly but it works
Before I had any automation, I ran evals in a Google Sheet. I still use this for new agents in the first 4 weeks.
Structure:
| Run date | Input | Expected output (summary) | Actual output (summary) | Pass/Fail | Notes |
|---|---|---|---|---|---|
| 2026-05-01 | “Write a post about AI agents” | Direct, opinionated, 1000+ words, TL;DR present | 950 words, TL;DR present, strong voice | Pass | Slightly short |
| 2026-05-08 | Same | Same | 400 words, generic, no TL;DR | Fail | Model drift after update |
Five rows a week. Takes 10 minutes. If you have two fails in a row, you stop the agent and fix the prompt before continuing.
This is embarrassingly low-tech. It’s also how I caught three prompt regressions before they went to production.
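If the sheet is exported as CSV, the "two fails in a row" rule can be checked mechanically. The sketch below assumes a trimmed export with hypothetical column names `run_date` and `result`, and it reads "in a row" as the two most recent runs. The third row is made up to show the rule firing.

```python
import csv
import io

# Trimmed, partly made-up export of the sheet above.
SHEET_CSV = """run_date,result
2026-05-01,Pass
2026-05-08,Fail
2026-05-15,Fail
"""


def should_stop(rows: list[dict[str, str]], streak: int = 2) -> bool:
    """True when the most recent `streak` runs all failed."""
    ordered = sorted(rows, key=lambda r: r["run_date"])  # ISO dates sort as text
    recent = ordered[-streak:]
    return len(recent) == streak and all(
        r["result"].strip().lower() == "fail" for r in recent
    )


rows = list(csv.DictReader(io.StringIO(SHEET_CSV)))
print(should_stop(rows))  # True
```

With only the first two rows (one pass, one fail) the function returns False, which matches the rule: a single fail means investigate, two in a row means stop.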
What NOT to do
Don’t build the eval system before you have 10 real runs. I’ve seen founders spend two weeks building a sophisticated eval pipeline for an agent they’ve only run twice. You don’t know enough about what “good” looks like until you have real production data.
Don’t eval on synthetic inputs you made up. Synthetic test cases miss the weird edge cases that production throws at you. Always start with real logs.
Don’t eval everything. Pick the 3–5 agents where failure would actually hurt — customer-facing outputs, anything that posts publicly, anything that triggers a payment. Skip the internal utility agents until you have headspace.
Don’t automate too early. A spreadsheet you actually use beats a Datadog dashboard you forget to check. Start manual, automate after you’ve run the check 10 times and know what you’re actually looking for.
The operator’s bottom line
Evals don’t have to be engineering-grade to be useful. A golden set of 5–10 real inputs, a checklist of pass/fail criteria, and 15 minutes of log-checking every Monday will catch 80% of agent drift before it compounds. Start there. If you’re still running agents without any eval, you’re flying blind — and eventually something will fail publicly enough that you’ll wish you’d spent the 20 minutes.
Related: The agent stack I use to run 30+ production agents · Event-triggered vs scheduled agents: which pattern for which job · The cheapest way to run a content agent on Cloudflare
Want help setting up evals for your agents? Get in touch — I run production agent audits for operator teams.