How I Measure Whether an AI Agent Is Actually Working

Alejandro Rioja
7 min read
TL;DR

Most operators skip evals entirely and just assume their agents work. My framework: build a golden set of 5–10 known-good inputs with expected outputs, define pass/fail criteria in plain English, and spot-check logs weekly. Don't build an elaborate eval system before you have 10 real runs — that's the trap that kills momentum.

Free newsletter

Every Wednesday. 28,400+ operators. Zero fluff.

The problem no one talks about: agents drift silently

When a human employee starts doing their job wrong, you usually notice. When an AI agent starts producing garbage, it keeps producing garbage — quietly, at scale, until something breaks badly enough that a human finally looks.

I’ve had a content agent that started appending “As an AI language model” disclaimers after a model update. I’ve had an event promoter agent that stopped including ticket links because a prompt variable name changed. Neither failed loudly. Both just degraded.

The fix isn’t building a NASA-grade monitoring system. It’s having a simple, repeatable check that catches drift before it compounds.

What an eval actually is (for operators)

Engineers use the word “eval” to mean running a benchmark on a model. For operators, I mean something simpler: a repeatable test that tells you whether your agent is still doing what you built it to do.

Three components:

  1. Golden set — 5–10 real inputs you’ve seen before, with expected outputs you already know are good
  2. Pass/fail criteria — plain-English rules for what counts as passing
  3. A scheduled check — you or your assistant actually runs the test on a cadence

That’s it. You don’t need a framework. You need discipline.

Building your golden set

Pull from your production logs. Find 5–10 real inputs where you already know what a good output looks like. These are your ground truth.

For my content pipeline agent, the golden set is 5 published posts that passed my voice checklist when I wrote them manually. For my Pickleland event promoter, it’s 5 past Facebook posts that got above-average engagement (comments + shares, not just likes).

Rules for a good golden set:

When the agent was last confirmed working, write down exactly what “good” looked like. That becomes your expected output.
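A golden set can live in something as small as a list of records. Here is a minimal sketch, assuming hypothetical field names (`case_id`, `input_text`, `expected_summary`) and the 5–10 size range from above; none of this is a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One real input paired with a description of a known-good output."""
    case_id: str           # hypothetical identifier, e.g. a log or post ID
    input_text: str        # the real production input, copied from logs
    expected_summary: str  # what "good" looked like when the agent was confirmed working


def validate_golden_set(cases: list[GoldenCase], min_size: int = 5, max_size: int = 10) -> list[str]:
    """Return a list of problems; an empty list means the set is usable."""
    problems = []
    if not (min_size <= len(cases) <= max_size):
        problems.append(f"expected {min_size}-{max_size} cases, got {len(cases)}")
    seen = set()
    for case in cases:
        if case.case_id in seen:
            problems.append(f"duplicate case_id: {case.case_id}")
        seen.add(case.case_id)
        if not case.expected_summary.strip():
            problems.append(f"{case.case_id}: missing expected output")
    return problems
```

Running `validate_golden_set` before each eval is a cheap way to notice that a case lost its expected output or that the set has shrunk below five.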

Defining pass/fail criteria

Vague criteria are useless. “The output should be good” will pass every time because you’ll rationalize it.

Write your criteria as checklist items that a non-expert could evaluate. Here are the actual criteria I use for my content pipeline agent:

Content agent pass/fail checklist:

For the Pickleland event promoter:

Event promoter pass/fail checklist:

If 4 of 5 checklist items pass, the run is a pass. If 3 or fewer pass, it’s a fail and I investigate before the next run.
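The 4-of-5 rule is simple enough to write down as code, which removes any temptation to round a 3 up to a pass. This is a sketch only: the item names in the usage example are placeholders, not the actual checklist.

```python
def run_passes(checklist: dict[str, bool], min_passing: int = 4) -> bool:
    """A run passes when at least `min_passing` checklist items are True."""
    return sum(checklist.values()) >= min_passing


def failed_items(checklist: dict[str, bool]) -> list[str]:
    """Names of the items that did not pass, to guide the investigation."""
    return [name for name, ok in checklist.items() if not ok]
```

For example, `run_passes({"item_1": True, "item_2": True, "item_3": False, "item_4": True, "item_5": True})` returns `True`, and `failed_items` on the same input returns `["item_3"]`.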

Using Claude as a judge

For agents where outputs are long or complex, I use Claude Sonnet as an automated judge. This is faster than manual review and catches things I’d skim past.

Here’s the judge prompt I use for the content agent:

You are evaluating a blog post written by an AI agent. Your job is to check whether it meets the operator's standards.

Evaluate the following post against these criteria:
1. Starts with a direct answer or TL;DR in the first 100 words (YES/NO)
2. Contains at least one concrete number or specific example (YES/NO)
3. Free of AI-speak filler ("As an AI", "in today's fast-paced world", "delve", "it's worth noting") (YES/NO)
4. Word count is between 800 and 2000 words (YES/NO)
5. Tone matches the reference: direct, first-person, opinionated, no fluff (YES/NO)

For each criterion, respond YES or NO with one sentence of explanation.
At the end, output PASS if 4 or 5 criteria are YES, FAIL otherwise.

Post to evaluate:
---
{{post_content}}
---

I run this as a Cloudflare Worker that pulls the latest draft, fires this prompt, and writes the result to a Google Sheet. The whole thing takes 8 seconds and costs about $0.003 per run.
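Two parts of that pipeline are worth sketching because they fail quietly: filling the `{{post_content}}` placeholder and reading the verdict. The sketch below is Python rather than Worker code and makes no model call. It assumes the judge replies with one numbered line per criterion beginning with YES or NO, which is an assumption about the reply format, not a guarantee. It recounts the YES answers itself instead of trusting the judge's final PASS/FAIL line.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
# Assumed reply format: "1. YES ..." or "3) NO ..." at the start of a line.
ANSWER = re.compile(r"^\s*(\d+)[.)]\s*(YES|NO)\b", re.IGNORECASE | re.MULTILINE)


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Fill {{name}} placeholders; raise if the template names a value we lack."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            # A renamed variable should fail loudly, not render an empty prompt.
            raise KeyError(f"missing template variable: {key}")
        return values[key]
    return PLACEHOLDER.sub(substitute, template)


def judge_passes(reply: str, total: int = 5, min_yes: int = 4) -> bool:
    """Count YES answers in the judge's reply and apply the pass threshold."""
    answers = {int(num): ans.upper() == "YES" for num, ans in ANSWER.findall(reply)}
    if len(answers) != total:
        raise ValueError(f"expected {total} answers, found {len(answers)}")
    return sum(answers.values()) >= min_yes
```

Raising on a missing variable guards against the failure described earlier, where a renamed prompt variable silently dropped content from the output.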

For the event promoter, the judge prompt is simpler:

You are checking an AI-generated Facebook event post for accuracy and quality.

Source data:
- Event name: {{event_name}}
- Date: {{event_date}}
- Time: {{event_time}}
- Ticket URL: {{ticket_url}}

Generated post:
---
{{generated_post}}
---

Check:
1. Does the post correctly state the event name? (YES/NO)
2. Does the post correctly state the date and time? (YES/NO)
3. Does the post include the exact ticket URL? (YES/NO)
4. Is the post under 280 words? (YES/NO)
5. Is the tone inviting without using generic filler phrases? (YES/NO)

Output PASS if all 5 are YES, FAIL if any are NO. List which items failed.
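Several checks in the event promoter prompt above are mechanical, so they can also run as plain code before or alongside the judge. This is a rough sketch, not part of the setup described here: the function name and sample values are hypothetical, the name check is case-insensitive by assumption, and date/time matching is left to the judge because date formats vary.

```python
def mechanical_checks(post: str, event_name: str, ticket_url: str, max_words: int = 280) -> dict[str, bool]:
    """Deterministic versions of checks 1, 3 and 4 from the judge prompt."""
    return {
        "event_name_present": event_name.lower() in post.lower(),
        "ticket_url_present": ticket_url in post,          # exact substring match
        "under_word_limit": len(post.split()) < max_words,  # "under 280 words"
    }
```

Note that the URL test is a substring match, so a longer URL that merely begins with the expected one would still pass. If that matters, compare against the URLs extracted from the post instead.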

Where to look: Cloudflare Worker logs

If you’re running agents on Cloudflare Workers (which I do for most of my lightweight ones), the built-in log tail is your best friend. You don’t need a third-party logging service to start.

What I check in weekly spot-reviews:

I spend 15 minutes every Monday morning on this. I have a simple Notion checklist: open logs for each agent, note anything anomalous, compare token usage against last week’s baseline. That’s the entire process.
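The token-usage comparison can be reduced to a few lines once the weekly totals are written down. A minimal sketch, assuming per-agent weekly token totals are already collected by hand or exported; the 25% tolerance and the agent names are illustrative assumptions, not figures from this process.

```python
def token_usage_anomalies(
    this_week: dict[str, int],
    last_week: dict[str, int],
    tolerance: float = 0.25,
) -> dict[str, float]:
    """Return {agent: fractional change} for agents whose usage moved more than `tolerance`."""
    flagged = {}
    for agent, current in this_week.items():
        baseline = last_week.get(agent)
        if not baseline:
            continue  # no baseline yet (new agent or zero usage), nothing to compare
        change = (current - baseline) / baseline
        if abs(change) > tolerance:
            flagged[agent] = round(change, 2)
    return flagged
```

A sharp drop is as interesting as a spike: an agent that suddenly uses half the tokens may be producing truncated or generic output.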

The spreadsheet eval: ugly but it works

Before I had any automation, I ran evals in a Google Sheet. I still use this for new agents in the first 4 weeks.

Structure:

| Run date | Input | Expected output (summary) | Actual output (summary) | Pass/Fail | Notes |
| --- | --- | --- | --- | --- | --- |
| 2026-05-01 | “Write a post about AI agents” | Direct, opinionated, 1000+ words, TL;DR present | 950 words, TL;DR present, strong voice | Pass | Slightly short |
| 2026-05-08 | Same | Same | 400 words, generic, no TL;DR | Fail | Model drift after update |

Five rows a week. Takes 10 minutes. If you have two fails in a row, you stop the agent and fix the prompt before continuing.
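The stop rule can be checked against a CSV export of the sheet. In this sketch the column names `Run date` and `Pass/Fail` follow the structure above, and "two fails in a row" is interpreted as the two most recent rows both failing, which is one reading of the rule. Dates are assumed to be ISO formatted (YYYY-MM-DD) so they sort correctly as text.

```python
import csv
import io


def should_stop_agent(sheet_csv: str) -> bool:
    """True when the two most recent rows in the exported sheet are both fails."""
    rows = sorted(csv.DictReader(io.StringIO(sheet_csv)), key=lambda r: r["Run date"])
    latest = [r["Pass/Fail"].strip().lower() for r in rows[-2:]]
    return latest == ["fail", "fail"]
```

With fewer than two rows the function returns `False`, since there is not yet enough history to apply the rule.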

This is embarrassingly low-tech. It’s also how I caught three prompt regressions before they went to production.

What NOT to do

Don’t build the eval system before you have 10 real runs. I’ve seen founders spend two weeks building a sophisticated eval pipeline for an agent they’ve only run twice. You don’t know enough about what “good” looks like until you have real production data.

Don’t eval on synthetic inputs you made up. Synthetic test cases miss the weird edge cases that production throws at you. Always start with real logs.

Don’t eval everything. Pick the 3–5 agents where failure would actually hurt — customer-facing outputs, anything that posts publicly, anything that triggers a payment. Skip the internal utility agents until you have headspace.

Don’t automate too early. A spreadsheet you actually use beats a Datadog dashboard you forget to check. Start manual, automate after you’ve run the check 10 times and know what you’re actually looking for.

The operator’s bottom line

Evals don’t have to be engineering-grade to be useful. A golden set of 5–10 real inputs, a checklist of pass/fail criteria, and 15 minutes of log-checking every Monday will catch 80% of agent drift before it compounds. Start there. If you’re still running agents without any eval, you’re flying blind — and eventually something will fail publicly enough that you’ll wish you’d spent the 20 minutes.


Related: The agent stack I use to run 30+ production agents · Event-triggered vs scheduled agents: which pattern for which job · The cheapest way to run a content agent on Cloudflare

Want help setting up evals for your agents? Get in touch — I run production agent audits for operator teams.
