Alejandro Rioja.
AI Agents Operations

How to Debug an AI Agent in Production (A Field Guide)

Alejandro Rioja
Alejandro Rioja
7 min read
TL;DR

Debugging a production AI agent is mostly about isolating which layer failed — prompt, tool, model, or orchestration. I log every step with a trace ID, replay the exact inputs, and bisect. In my agents, ~70% of 'AI bugs' turn out to be plumbing bugs, not model bugs.

Free newsletter

Every Wednesday. 28,400+ operators. Zero fluff.

Table of contents

Open Table of contents

Make every run traceable before you debug anything

You cannot debug what you cannot see. The single highest-leverage thing you can do — before any specific bug shows up — is attach a trace ID to every agent run and log every step it takes.

A “step” is anything that crosses a boundary: the inbound trigger, each model call (with the full messages array), each tool call (with arguments), each tool result, and the final output. Log them as structured JSON keyed by the trace ID.

typescript
function logStep(traceId: string, step: string, payload: unknown) {
  console.log(JSON.stringify({
    traceId,
    step,            // "trigger" | "model_call" | "tool_call" | "tool_result" | "output"
    ts: Date.now(),
    payload,
  }));
}

On Cloudflare Workers I ship these to a queue and into a table; locally they go to stdout. The rule is absolute: if a step isn’t logged, it didn’t happen as far as debugging is concerned. This mirrors the instrumentation I describe in the agent stack I use — the trace ID is the spine everything else hangs off.

Isolate the layer: prompt, tool, model, or orchestration

Once you have a trace, debugging becomes a bisection. There are four layers and the bug lives in exactly one of them most of the time.

1. The input layer (the most common culprit)

Pull the exact messages array that went into the failing model call. Not a reconstruction — the literal payload from the log. Then read it like a stranger would. Half my “the model ignored the instructions” bugs are actually:

If the input is wrong, the model did its job perfectly on garbage. Fix the plumbing.

2. The tool layer

If the input looks clean, check whether a tool returned an error the agent treated as success. A classic: an API returns 200 with a body of { "error": "rate limited" }, your tool wrapper doesn’t check the body, and the agent confidently acts on an error message. Log tool results raw and assert their shape.

3. The model layer

Only after ruling out 1 and 2 do I suspect the model. Even then, “model bug” usually means “my prompt is ambiguous.” Take the exact failing input, drop it into a one-off script against the same model and temperature, and see if it reproduces. If it does, the fix is prompt work or a tighter eval, not a frantic model swap.

4. The orchestration layer

If a single step is fine in isolation but the multi-step run fails, the bug is in the handoff — state lost between steps, a race condition, a retry that re-ran a non-idempotent action. These are the nastiest and I cover the patterns in multi-agent orchestration patterns.

Reproduce non-determinism instead of fighting it

The thing that makes agents feel un-debuggable is non-determinism: the same input produces different output across runs. You can tame it.

First, pin what you can. Set temperature: 0 while debugging. It won’t make Claude fully deterministic, but it sharply narrows the variance so you can tell a real bug from sampling noise.

Second, run it N times. If a failure reproduces 1 in 20 runs, loop the exact input 50 times and capture every output. Now you have a sample, not an anecdote. A bug that fires 5% of the time is a real bug — you just need volume to see it.

bash
for i in $(seq 1 50); do
  node replay.mjs --trace=abc123 >> runs.jsonl
done
# then count failures
grep -c '"status":"fail"' runs.jsonl

Third, diff the passing and failing runs. With temperature pinned and the same input, a difference in output means a difference in input you haven’t spotted yet — a timestamp in the prompt, a tool result that varies, a retrieved doc that changed.

Build a replay harness so you stop debugging in production

Debugging by re-triggering the live agent is slow and risky — it sends real emails, books real courts. Instead, capture the trace and replay it offline.

The replay harness loads a logged trace, reconstructs the exact inputs to any step, and re-runs just that step against the model. Because you logged the full messages array, you don’t need the upstream system at all. This turns a 10-minute production round-trip into a 2-second local loop, and it’s the single biggest speedup in my debugging workflow.

A good replay harness also lets you mutate and re-run: change one line of the system prompt, replay the same 50 failing traces, and see how many now pass. That’s the bridge from debugging to eval — once you have a corpus of failing traces, you have the start of a regression suite.

Watch the metrics that actually predict breakage

Some failures never throw an exception. The agent runs, returns something plausible, and quietly does the wrong thing. To catch those you watch behavioral metrics, not just error rates:

I track these the same way I track everything else — see how I measure whether an AI agent is actually working. The metric that catches a silent failure is worth ten that catch loud ones.

The 5-minute triage checklist

When an agent breaks and I’m on the clock, I run this in order:

  1. Get the trace ID for the failing run.
  2. Read the exact input to the failing step. Is it well-formed? (Solves ~50% of cases here.)
  3. Check the tool results in that trace for errors-disguised-as-success.
  4. Replay the step offline at temperature: 0. Does it reproduce?
  5. If it reproduces, it’s a prompt/model issue — fix and re-run the trace corpus. If it doesn’t, it’s non-determinism or a state/orchestration bug — loop it 50× to characterize.

Disciplined isolation beats clever prompting every time. The model is rarely the problem; the system around it usually is.

FAQ

How do I debug an AI agent that fails only sometimes?

Capture the exact input from a logged trace and replay it 50+ times at temperature 0. Intermittent failures are real bugs with low fire-rates — volume turns the anecdote into a reproducible sample you can diff and fix.

Is the bug usually in the model or in my code?

In my production agents, roughly 70% of apparent “AI bugs” are plumbing: malformed tool results, truncated inputs, swallowed exceptions, or lost state between steps. Rule out the input and tool layers before you suspect the model.

What’s the minimum logging I need to debug agents?

A trace ID on every run, plus structured logs of the trigger, every model call (full messages array), every tool call and its raw result, and the final output. If a step isn’t logged, you can’t debug it.

How do I stop debugging against live production?

Build a replay harness that loads a logged trace and re-runs any single step offline using the captured inputs. It turns a slow, risky production round-trip into a fast local loop and becomes the seed of your regression suite.

Keep reading

Get the AI playbook in your inbox

Every Wednesday. 28,400+ operators. Zero fluff.

↵ to see all results esc esc to close