Why Your AI Agent Keeps Failing in Production (And How to Fix It)
Most production agent failures come from five causes: brittle prompts that don't handle edge cases, missing retry logic for transient API errors, no observability so you can't see what's breaking, runaway loops with no exit condition, and tool definitions that are ambiguous enough that the model picks the wrong one. All five are fixable without changing models or frameworks.
Every Wednesday. 28,400+ operators. Zero fluff.
Table of contents
- Failure 1: Brittle prompts that break on edge-case inputs
- Failure 2: No retry logic for transient API errors
- Failure 3: No observability — you can’t see what’s breaking
- Failure 4: Runaway loops with no exit condition
- Failure 5: Ambiguous tool definitions the model resolves wrong
- One more thing: test your agents on bad inputs
- The operator’s bottom line
Failure 1: Brittle prompts that break on edge-case inputs
A prompt that works on your test cases will fail on inputs you didn’t anticipate. That’s not a model limitation — it’s an instruction-writing problem.
Symptoms: The agent produces nonsense output, calls the wrong tool, or outputs malformed JSON when the input is slightly different from what you tested.
Root cause: Your system prompt describes the happy path only. It doesn’t tell the model what to do when data is missing, malformed, or ambiguous.
Fix: Add explicit edge-case handling to your system prompt:
If the input data is missing a required field, return:
{ "status": "error", "reason": "missing_field", "field": "<fieldname>" }
Do NOT attempt to infer or hallucinate missing values.
If you are uncertain which tool to call, call no tool and return:
{ "status": "clarification_needed", "question": "..." }
The model follows explicit instructions for edge cases reliably. The mistake is assuming it will generalize the happy-path instructions to handle the messy cases.
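The prompt instructions above only pay off if the calling code also checks what comes back. Here is a minimal sketch of that check, assuming the model replies with a single JSON object. The `ok` variant, its `result` field, and the extra reason codes (`malformed_json`, `not_an_object`, `unexpected_shape`) are illustrative assumptions, not part of any SDK.

```typescript
// The error and clarification shapes mirror the prompt instructions.
// The "ok" variant and its `result` field are assumptions for this sketch.
type AgentReply =
  | { status: "ok"; result: unknown }
  | { status: "error"; reason: string; field?: string }
  | { status: "clarification_needed"; question: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parse raw model output into a known reply shape. Never throws: anything
// unparseable or unrecognized becomes an error reply with a reason code.
function parseAgentReply(raw: string): AgentReply {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: "error", reason: "malformed_json" };
  }
  if (!isRecord(parsed)) {
    return { status: "error", reason: "not_an_object" };
  }
  switch (parsed.status) {
    case "ok":
      return { status: "ok", result: parsed.result };
    case "error":
      if (typeof parsed.reason !== "string") break;
      return {
        status: "error",
        reason: parsed.reason,
        ...(typeof parsed.field === "string" ? { field: parsed.field } : {}),
      };
    case "clarification_needed":
      if (typeof parsed.question !== "string") break;
      return { status: "clarification_needed", question: parsed.question };
  }
  return { status: "error", reason: "unexpected_shape" };
}
```

Downstream code can then branch on `status` and never has to touch raw model text.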
Failure 2: No retry logic for transient API errors
Every external API your agent calls will fail at some point. Claude’s API, the Meta Graph API, your database: all of them will occasionally return 5xx errors, time out, or rate-limit you. If your agent has no retry logic, one transient error kills the whole run.
Symptoms: Agent runs fail randomly at different steps. The logs show a 503 or 429 with no follow-up attempt.
Fix: Wrap every external call in an exponential-backoff retry:
async function withRetry<T>(fn: () => Promise<T>, retries = 3, baseDelayMs = 500): Promise<T> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Retry only rate limits, server errors, and dropped connections
      const isTransient = err.status === 429 || err.status >= 500 || err.code === "ECONNRESET";
      if (!isTransient || attempt === retries) throw err;
      // Exponential backoff (500 ms, 1 s, 2 s, ...) plus up to 100 ms of jitter
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 100;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("unreachable");
}
// Usage
const result = await withRetry(() => client.messages.create({ ... }));
Three retries with exponential backoff handle ~99% of transient failures. Add this to every external call and half your random failures disappear.
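It helps to know how long `withRetry` can stall a run before it gives up. The sketch below reuses the same delay formula with the jitter made injectable so the schedule is deterministic. The helper names `backoffDelayMs` and `backoffSchedule` are made up for this illustration.

```typescript
// Same formula as withRetry: base * 2^attempt + jitter.
// `random` is injectable so the schedule can be inspected deterministically.
function backoffDelayMs(
  attempt: number,
  baseDelayMs = 500,
  maxJitterMs = 100,
  random: () => number = Math.random,
): number {
  return baseDelayMs * Math.pow(2, attempt) + random() * maxJitterMs;
}

// Delay before each retry, with jitter zeroed out for readability.
function backoffSchedule(retries = 3, baseDelayMs = 500): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(backoffDelayMs(attempt, baseDelayMs, 0, () => 0));
  }
  return delays;
}

const schedule = backoffSchedule(); // [500, 1000, 2000]
const totalWaitMs = schedule.reduce((sum, d) => sum + d, 0); // 3500
```

With the defaults, that is four attempts in total and at most about 3.8 seconds of added waiting once jitter is included. If a step has a tighter latency budget than that, lower `retries` or `baseDelayMs` for that call.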
Failure 3: No observability — you can’t see what’s breaking
This is the most common failure mode in production and the one that costs the most time to debug: the agent fails silently or produces wrong output, and you have no idea where in the chain it went wrong.
Symptoms: You know something is wrong but can’t identify the step. You add console.log statements and re-run manually trying to reproduce.
Fix: Structured logging on every step, with a run ID that traces the entire execution:
function createLogger(runId: string, agentName: string) {
  return {
    step: (step: string, data: object) =>
      console.log(JSON.stringify({ runId, agent: agentName, step, ts: new Date().toISOString(), ...data })),
    // Error objects are logged by message; plain objects are kept as structured data
    // (String(err) would turn them into "[object Object]")
    error: (step: string, err: unknown) =>
      console.error(JSON.stringify({ runId, agent: agentName, step, error: err instanceof Error ? err.message : err, ts: new Date().toISOString() })),
  };
}

const log = createLogger(crypto.randomUUID(), "newsletter-agent");
log.step("fetch_topic", { topicId: topic.id, topic: topic.name });
// ... do work ...
log.step("draft_complete", { subject: draft.subject, wordCount: draft.body.trim().split(/\s+/).length });
If you’re on Cloudflare Workers, these logs go to Logpush or Workers Tail. If you’re running locally or on a VPS, pipe them to a log aggregator. The structured JSON means you can filter by runId to see exactly what happened in a single run.
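If you don't have an aggregator yet, filtering by `runId` takes only a few lines. This is a rough sketch, assuming the logs were captured as newline-delimited JSON in the shape `createLogger` emits. `entriesForRun` and the `LogEntry` type are hypothetical names for this example.

```typescript
// Minimal shape of the entries emitted by createLogger; extra fields are allowed.
interface LogEntry {
  runId: string;
  step: string;
  ts: string;
  [extra: string]: unknown;
}

function isLogEntry(value: unknown): value is LogEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.runId === "string" && typeof v.step === "string" && typeof v.ts === "string";
}

// Return one run's entries in timestamp order. Lines that are not valid
// JSON log entries (stack traces, library noise) are skipped.
function entriesForRun(ndjson: string, runId: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (isLogEntry(parsed) && parsed.runId === runId) entries.push(parsed);
    } catch {
      // Not JSON: ignore.
    }
  }
  // ISO-8601 UTC timestamps from toISOString() sort correctly as plain strings.
  return entries.sort((a, b) => (a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0));
}
```

Reading the `step` values of the result in order shows how far a failed run got before it went wrong.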
Failure 4: Runaway loops with no exit condition
Agentic loops — where the model calls tools and iterates until a condition is met — can run forever if that condition is never met or the model misidentifies it.
Symptoms: Agent spends hundreds of dollars in API costs before timing out. Or it runs the same tool call over and over without making progress.
Fix: Always have a hard iteration cap and a progress check:
const MAX_ITERATIONS = 10;
let iterations = 0;
let lastToolCallName = "";
let sameToolCallCount = 0;

while (true) {
  iterations++;
  if (iterations > MAX_ITERATIONS) {
    log.error("loop", { reason: "exceeded_max_iterations" });
    break;
  }

  const response = await client.messages.create({ ... });

  // Detect stuck loops: same tool called 3x in a row
  const toolCall = response.content.find(b => b.type === "tool_use");
  if (toolCall && toolCall.name === lastToolCallName) {
    sameToolCallCount++;
    if (sameToolCallCount >= 3) {
      log.error("loop", { reason: "stuck_loop", tool: toolCall.name });
      break;
    }
  } else {
    // A new tool starts its streak at 1, so the third identical call in a row trips the check
    sameToolCallCount = toolCall ? 1 : 0;
    lastToolCallName = toolCall?.name ?? "";
  }

  if (response.stop_reason === "end_turn") break;
}
This catches both “ran too long” and “spun in place” failure modes. The cap should be generous enough for the happy path but tight enough to limit blast radius.
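One caveat with the stuck-loop check: it compares tool names only, so an agent that is legitimately paging through results with the same tool can trip it. A stricter variant compares the tool input as well. The sketch below is one way to do that. `RepeatDetector` is a hypothetical helper and the threshold of 3 is an assumed default.

```typescript
// Counts consecutive identical tool calls, where "identical" means the
// same tool name AND the same serialized input.
class RepeatDetector {
  private lastKey = "";
  private count = 0;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Record a call. Returns true once the same call has been seen
  // `threshold` times in a row.
  record(toolName: string, input: unknown): boolean {
    // JSON.stringify is key-order sensitive, so { a, b } and { b, a }
    // count as different inputs. Good enough for a sketch.
    const key = `${toolName}:${JSON.stringify(input)}`;
    if (key === this.lastKey) {
      this.count++;
    } else {
      this.lastKey = key;
      this.count = 1;
    }
    return this.count >= this.threshold;
  }
}
```

Inside the loop you would call `record(toolCall.name, toolCall.input)` for each tool call and break when it returns `true`. The hard iteration cap still matters, because a model can alternate between two calls forever without ever repeating one three times in a row.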
Failure 5: Ambiguous tool definitions the model resolves wrong
If you give the model two tools with overlapping descriptions, it will sometimes call the wrong one. This is especially common with tools like search_database vs get_record or send_email vs create_draft.
Symptoms: The model calls the right category of tool but picks the wrong specific one. Or it calls a tool in the wrong context (using a write tool when only reading was appropriate).
Fix: Make tool descriptions mutually exclusive and add explicit “when NOT to use this”:
const tools = [
  {
    name: "get_subscriber",
    description: "Fetch a single subscriber record by email. Use ONLY when you have a specific email address. Do NOT use for searching or listing subscribers.",
    input_schema: { ... }
  },
  {
    name: "search_subscribers",
    description: "Search subscribers by tag, segment, or status. Use when you need to find subscribers matching criteria, NOT when you have a specific email address.",
    input_schema: { ... }
  }
];
The “do NOT use when X” clause is the part most people skip. It’s the most important part. Models are better at following explicit negative constraints than inferring them from positive descriptions.
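Because that clause is so easy to skip, a cheap check in CI can flag tools that lack one. What follows is only a heuristic sketch: the phrase list in `NEGATIVE_CLAUSE` is an assumption you would tune to your own wording, and `toolsMissingNegativeClause` is a made-up helper name.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
}

// Heuristic only: looks for explicit exclusion phrasing in the description.
// The phrase list is an assumption. Extend it to match how you write.
const NEGATIVE_CLAUSE = /\b(do not|don't|never|not when|not for)\b/i;

// Return the names of tools whose description has no exclusion clause.
function toolsMissingNegativeClause(tools: ToolDefinition[]): string[] {
  return tools
    .filter((tool) => !NEGATIVE_CLAUSE.test(tool.description))
    .map((tool) => tool.name);
}
```

A pattern match can't tell whether the clause is the right one, so treat a clean result as a prompt to review the descriptions, not as proof that they are mutually exclusive.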
One more thing: test your agents on bad inputs
Most agents are tested only on clean, happy-path inputs. Production has dirty inputs: empty strings, null fields, Unicode edge cases, API responses that return 200 but with an unexpected schema.
Add a test suite that explicitly exercises:
- Empty or null inputs
- Inputs at the maximum length you’d expect
- Inputs with special characters or non-ASCII text
- External APIs returning unexpected response shapes
If your agent breaks on any of these, fix it before it goes live. The production environment will find every assumption you made.
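A table of bad inputs is usually enough to get started. In the sketch below, `prepareSubscriberLookup` is a hypothetical stand-in for whatever code validates a payload before it reaches your agent, and the 254-character limit is an assumed value. The point is the shape of the test table, not the specific function.

```typescript
// Hypothetical pre-agent validation step: returns a structured error
// instead of throwing or guessing at missing values.
type PrepResult =
  | { status: "ok"; email: string }
  | { status: "error"; reason: string };

const MAX_EMAIL_LENGTH = 254; // assumed limit for this sketch

function prepareSubscriberLookup(payload: unknown): PrepResult {
  if (typeof payload !== "object" || payload === null) {
    return { status: "error", reason: "unexpected_shape" };
  }
  const email = (payload as { email?: unknown }).email;
  if (typeof email !== "string" || email.trim() === "") {
    return { status: "error", reason: "missing_field" };
  }
  if (email.length > MAX_EMAIL_LENGTH) {
    return { status: "error", reason: "too_long" };
  }
  return { status: "ok", email: email.trim() };
}

// One row per category from the list above.
const badInputs: Array<{ label: string; payload: unknown; expected: PrepResult["status"] }> = [
  { label: "null payload", payload: null, expected: "error" },
  { label: "empty string", payload: { email: "" }, expected: "error" },
  { label: "null field", payload: { email: null }, expected: "error" },
  { label: "over max length", payload: { email: "a".repeat(300) }, expected: "error" },
  { label: "non-ASCII text", payload: { email: "jösé@exämple.com" }, expected: "ok" },
  { label: "unexpected shape", payload: { data: [{ mail: "a@b.co" }] }, expected: "error" },
];

// Labels of the cases whose outcome did not match the expectation.
const failures = badInputs
  .filter((c) => prepareSubscriberLookup(c.payload).status !== c.expected)
  .map((c) => c.label);
```

Fail the build whenever `failures` is non-empty, and add a row every time production surprises you.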
The operator’s bottom line
Most agent failures in production are infrastructure problems masquerading as model problems. Before you switch models, add retries, structured logging, loop caps, and explicit edge-case handling to your prompts. Fix the ambiguous tool definitions. Then test on bad inputs. Do all of that before blaming the model — in my experience, the model is usually the last thing that needs to change.