AI Agent Cost Math: When Haiku Beats Sonnet (and When It Doesn't)
Picking Claude Haiku over Sonnet can cut per-call cost dramatically, but only when the task tolerates a lower success rate. The real metric isn't cost per call — it's cost per successful outcome, including retries and human cleanup. I route by task, not by default.
The token economics, stated plainly
Anthropic prices Claude per million tokens, input and output billed separately, with output costing several times more than input. The exact numbers move over time, so check Anthropic’s current pricing — but the structure is what drives the decision:
- Haiku is the cheap, fast tier — by far the lowest per-token cost in the family.
- Sonnet sits in the middle — markedly more expensive than Haiku, markedly cheaper than Opus.
- Opus is the premium tier for the hardest reasoning.
Two things follow. First, output tokens dominate cost on generative tasks, so a model that’s verbose costs more even at the same per-token rate. Second, the per-token gap between Haiku and Sonnet is large enough that on a high-volume step it absolutely shows up on the bill. That’s the case for Haiku. Now the case against.
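To make the "output dominates" point concrete, here is a minimal sketch of per-call cost. The prices are placeholders I made up for illustration (output at 5x input, the larger tier at 10x the smaller), not Anthropic's actual rates, and the names `TierPricing`, `CallUsage`, and `callCost` are hypothetical.

```typescript
// Prices are USD per million tokens. All numbers below are illustrative
// placeholders, not real Anthropic pricing.
interface TierPricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface CallUsage {
  inputTokens: number;
  outputTokens: number;
}

// Cost of one call: input and output are billed separately.
function callCost(pricing: TierPricing, usage: CallUsage): number {
  const inputCost = (usage.inputTokens / 1_000_000) * pricing.inputPerMTok;
  const outputCost = (usage.outputTokens / 1_000_000) * pricing.outputPerMTok;
  return inputCost + outputCost;
}

// Hypothetical tiers standing in for a cheap and a mid-priced model.
const cheapTier: TierPricing = { inputPerMTok: 1, outputPerMTok: 5 };
const midTier: TierPricing = { inputPerMTok: 10, outputPerMTok: 50 };

// A generative call: 2,000 tokens in, 1,000 tokens out.
const usage: CallUsage = { inputTokens: 2_000, outputTokens: 1_000 };

console.log("cheap tier per call:", callCost(cheapTier, usage));
console.log("mid tier per call:", callCost(midTier, usage));
```

With these made-up rates, output is a third of the tokens but well over half of the bill, which is why a verbose model costs more at the same per-token price.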
The metric that actually matters: cost per successful outcome
Per-call cost is a vanity number. Here’s the formula I actually use:
cost_per_success = ((call_cost × attempts) + cleanup_cost) ÷ success_rate
Where attempts accounts for retries, and cleanup_cost is the expected cost of a human fixing the failures that slip through. Watch what this does to the comparison.
Suppose Haiku costs roughly a tenth of Sonnet per call. If Haiku succeeds 80% of the time on a task and Sonnet succeeds 98%, the per-call savings look enormous. But if each Haiku failure triggers one retry and 1-in-10 still needs a human who costs real money, the cleanup term can swamp the token savings. On a low-stakes, high-volume task the math favors Haiku overwhelmingly. On a task where a failure emails the wrong customer, it can invert completely.
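Here is that comparison as a rough sketch. It rests on assumptions of mine that the text leaves open: one retry per failure (so expected attempts are 1 plus the failure rate), one in ten failures reaching a human, and invented dollar figures for the call and the human fix. The names `costPerSuccess` and `scenario` are hypothetical.

```typescript
interface OutcomeInputs {
  callCost: number;    // USD per model call
  attempts: number;    // expected calls per task, retries included
  cleanupCost: number; // expected human cleanup cost per task, USD
  successRate: number; // fraction of tasks that end in a good outcome
}

// (call_cost x attempts + cleanup_cost) / success_rate
function costPerSuccess(x: OutcomeInputs): number {
  if (x.successRate <= 0 || x.successRate > 1) {
    throw new Error("successRate must be in (0, 1]");
  }
  return (x.callCost * x.attempts + x.cleanupCost) / x.successRate;
}

// Assumption: each failure triggers one retry, and 1 in 10 failures
// ends up needing a human fix that costs humanFixCost.
function scenario(
  callCost: number,
  successRate: number,
  humanFixCost: number,
): OutcomeInputs {
  const failureRate = 1 - successRate;
  return {
    callCost,
    attempts: 1 + failureRate,
    cleanupCost: failureRate * 0.1 * humanFixCost,
    successRate,
  };
}

// Made-up call costs: the cheap model at a tenth of the stronger one.
const CHEAP_CALL = 0.001;
const STRONG_CALL = 0.01;

// Low stakes: a miss lands in a review queue (fix costs about $0.05).
const lowCheap = costPerSuccess(scenario(CHEAP_CALL, 0.8, 0.05));
const lowStrong = costPerSuccess(scenario(STRONG_CALL, 0.98, 0.05));

// High stakes: a miss needs real human time (fix costs about $2).
const highCheap = costPerSuccess(scenario(CHEAP_CALL, 0.8, 2));
const highStrong = costPerSuccess(scenario(STRONG_CALL, 0.98, 2));

console.log({ lowCheap, lowStrong, highCheap, highStrong });
```

Under these invented numbers the cheap model is a few times cheaper per success on the low-stakes task and a few times more expensive on the high-stakes one. The only input that changed between the two cases is the cost of a human fix.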
You can’t make this call without measuring success rate per model — which is exactly what an eval harness gives you. Run the same eval set against both models and read the success rates off the same yardstick.
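A bare-bones version of that yardstick might look like the sketch below. It assumes exact-match grading, which only suits tasks like classification, and it takes the model as an injected function so no particular SDK is implied. `EvalCase`, `ModelFn`, `scoreOutputs`, and `runEval` are hypothetical names, and the stub model stands in for a real API call.

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

// Any function that sends input to a model and returns its text output.
type ModelFn = (input: string) => Promise<string>;

// Fraction of cases where the trimmed output matches the expected label.
function scoreOutputs(outputs: string[], cases: EvalCase[]): number {
  if (cases.length === 0 || outputs.length !== cases.length) {
    throw new Error("need at least one case and exactly one output per case");
  }
  let passed = 0;
  cases.forEach((c, i) => {
    if ((outputs[i] ?? "").trim() === c.expected) passed += 1;
  });
  return passed / cases.length;
}

// Run every case through one model, then score against the same eval set.
async function runEval(model: ModelFn, cases: EvalCase[]): Promise<number> {
  const outputs: string[] = [];
  for (const c of cases) {
    outputs.push(await model(c.input));
  }
  return scoreOutputs(outputs, cases);
}

// Stub standing in for a real model call, for illustration only.
const stubModel: ModelFn = async (input) =>
  input.includes("refund") ? "complaint" : "booking";

const evalSet: EvalCase[] = [
  { input: "I want a refund", expected: "complaint" },
  { input: "Table for two on Friday", expected: "booking" },
  { input: "You won a prize, click here", expected: "spam" },
];

runEval(stubModel, evalSet).then((rate) => console.log("success rate:", rate));
```

Running `runEval` once per model against the same `evalSet` gives two success rates that are directly comparable.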
Where Haiku wins decisively
Haiku is the right call when the task is narrow, structured, and verifiable:
- Classification and routing — “is this inbound a booking, a complaint, or spam?” Three buckets, easy to verify, runs constantly. Haiku all day.
- Extraction with a schema — pulling a date, a name, an amount out of text, validated with Zod. If the output parses, it’s almost certainly right.
- Short rewrites and formatting — tone tweaks, summarizing a known-good input, normalizing data.
- First-pass filtering — Haiku triages, and only the ambiguous cases get escalated to Sonnet. This is the highest-leverage pattern.
The common thread: the cost of a Haiku mistake is low and the mistake is cheap to catch. When verification is cheap and stakes are low, the cheap model wins.
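To show what "cheap to catch" means for the extraction case, here is a dependency-free sketch of a schema check. I use Zod for this in practice; the hand-written guard below is a stand-in so the example runs on its own, and the `Extraction` fields and `parseExtraction` name are assumptions for illustration. Note that a check like this proves the shape of the output, not that the values are correct.

```typescript
// Assumed extraction target: a name, an ISO date, and an amount.
interface Extraction {
  name: string;
  date: string; // expected as YYYY-MM-DD
  amount: number;
}

// Returns the parsed extraction, or null if the model output is not
// valid JSON or does not match the expected shape.
function parseExtraction(raw: string): Extraction | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const { name, date, amount } = data as Record<string, unknown>;
  if (typeof name !== "string" || name.length === 0) return null;
  if (typeof date !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(date)) return null;
  if (typeof amount !== "number" || !Number.isFinite(amount)) return null;

  return { name, date, amount };
}

const good = parseExtraction('{"name":"Ada","date":"2024-05-01","amount":120.5}');
const bad = parseExtraction('{"name":"Ada","date":"next Tuesday","amount":"120"}');

console.log(good); // parsed object
console.log(bad);  // null, so retry or escalate
```

A `null` result is the cheap failure signal: it costs one function call to detect, and the caller can retry or hand the item to a stronger model.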
Where Sonnet earns its price
Sonnet (and sometimes Opus) is worth it when the task is open-ended, multi-step, or expensive to get wrong:
- Multi-tool agent loops where one wrong tool call cascades. Higher reasoning reliability compounds across steps — the orchestration patterns I cover in multi-agent orchestration lean on the model not losing the plot.
- Customer-facing generation where a bad output costs trust, not just a retry.
- Anything where verification is itself hard. If you can’t cheaply tell whether the output is right, you can’t afford a model that’s frequently wrong.
A failure here doesn’t cost one retry — it costs a refund, a churned customer, or my time. Against that, the per-token premium is rounding error.
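The compounding claim is easy to check with a little arithmetic. The sketch below assumes each step succeeds independently with the same probability, which is a simplification, and the per-step rates are invented rather than measured. `chainSuccess` is a hypothetical helper.

```typescript
// Probability that every step in a chain succeeds, assuming steps are
// independent and share the same per-step success rate.
function chainSuccess(perStepSuccess: number, steps: number): number {
  return Math.pow(perStepSuccess, steps);
}

// Invented per-step rates for a weaker and a stronger model.
const weakerTenSteps = chainSuccess(0.95, 10);
const strongerTenSteps = chainSuccess(0.99, 10);

console.log("weaker model, 10 steps:", weakerTenSteps.toFixed(3));
console.log("stronger model, 10 steps:", strongerTenSteps.toFixed(3));
```

Under these assumptions, a four-point gap per step turns into roughly 60% versus 90% end-to-end success over ten steps. That is the sense in which reliability compounds.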
The routing rule I actually ship
I don’t pick one model per agent. I route per task inside the agent, usually with a cheap classifier deciding which downstream model handles the work:
// Assumed shape of a task; adjust the fields to match your own agent.
interface Task {
  type: string;
  customerFacing: boolean;
  steps: number;
}

// Model names are placeholders; substitute the exact model IDs you deploy.
function pickModel(task: Task): string {
  // Cheap, verifiable, high-volume → Haiku
  if (task.type === "classify" || task.type === "extract") {
    return "claude-haiku";
  }
  // Open-ended or customer-facing → Sonnet
  if (task.customerFacing || task.steps > 2) {
    return "claude-sonnet";
  }
  return "claude-sonnet"; // default to the safe choice
}

Two principles are encoded here. Default to the safe model, not the cheap one: you optimize cost down from a working baseline, never reliability up from a broken one. And escalate, don’t gamble: let Haiku handle the easy 80% and hand the hard 20% to Sonnet. That hybrid almost always beats running everything on either model alone.
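The "escalate, don't gamble" half can be sketched as a two-stage call. This assumes the cheap classifier returns some confidence signal (for example a self-reported score, which is itself an imperfect measure) and that 0.8 is a sensible cutoff; both are assumptions to tune against your own eval set. `Triage`, `Classifier`, `shouldEscalate`, and `triageWithEscalation` are hypothetical names, and the classifiers are injected rather than tied to any SDK.

```typescript
type Label = "booking" | "complaint" | "spam";

interface Triage {
  label: Label;
  confidence: number; // 0..1, however your classifier reports it
}

// Any function that classifies text with some model.
type Classifier = (text: string) => Promise<Triage>;

// Escalate when the cheap model is not confident enough.
function shouldEscalate(result: Triage, threshold: number): boolean {
  return result.confidence < threshold;
}

// Try the cheap model first; only pay for the strong one on ambiguous cases.
async function triageWithEscalation(
  text: string,
  cheap: Classifier,
  strong: Classifier,
  threshold = 0.8,
): Promise<Triage & { escalated: boolean }> {
  const first = await cheap(text);
  if (!shouldEscalate(first, threshold)) {
    return { ...first, escalated: false };
  }
  const second = await strong(text);
  return { ...second, escalated: true };
}
```

Tracking how often `escalated` comes back `true` tells you what share of traffic is actually paying the higher rate.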
There’s also prompt caching to layer on top: if your system prompt is large and reused, caching cuts input cost substantially regardless of tier, which sometimes makes Sonnet cheap enough that the Haiku question is moot.
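A rough way to size the caching effect is sketched below. The multipliers (cache writes at 1.25x the base input price, cache reads at 0.1x) are assumptions to verify against Anthropic's current prompt caching documentation, the base price is a placeholder, and the model assumes every call after the first hits the cache before it expires. The function names are hypothetical.

```typescript
interface CacheMultipliers {
  write: number; // price multiplier for writing the prefix to cache
  read: number;  // price multiplier for reading the cached prefix
}

// Assumed multipliers; confirm against current pricing before relying on them.
const ASSUMED_MULTIPLIERS: CacheMultipliers = { write: 1.25, read: 0.1 };

// Input cost when the shared prefix is re-sent at full price on every call.
function inputCostUncached(
  basePerMTok: number,
  prefixTokens: number,
  freshTokens: number,
  calls: number,
): number {
  return (basePerMTok / 1_000_000) * (prefixTokens + freshTokens) * calls;
}

// Input cost when the first call writes the prefix to cache and every
// later call reads it (assumes all calls land within the cache lifetime).
function inputCostCached(
  basePerMTok: number,
  prefixTokens: number,
  freshTokens: number,
  calls: number,
  m: CacheMultipliers = ASSUMED_MULTIPLIERS,
): number {
  if (calls <= 0) return 0;
  const firstCall = prefixTokens * m.write + freshTokens;
  const laterCalls = (calls - 1) * (prefixTokens * m.read + freshTokens);
  return (basePerMTok / 1_000_000) * (firstCall + laterCalls);
}

// Placeholder numbers: $10 per million input tokens, a 10,000-token
// system prompt, 500 fresh tokens per call, 100 calls.
const uncached = inputCostUncached(10, 10_000, 500, 100);
const cached = inputCostCached(10, 10_000, 500, 100);

console.log({ uncached, cached });
```

With a large reused prefix the cached input bill is a small fraction of the uncached one under these assumptions. Output tokens are unaffected, so the saving matters most on input-heavy steps.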
A worked example from my own stack
Take a high-volume inbound triage step. It runs thousands of times, the task is three-way classification, and a miss just means the item lands in a review queue — cheap to catch, low stakes. That’s a textbook Haiku task, and moving it off Sonnet meaningfully cut the cost of that step with no measurable hit to the outcome that mattered.
Now take the step that drafts the actual reply to a customer. Lower volume, open-ended, and a bad draft going out costs trust. That stays on Sonnet. Same agent, two models, routed by stakes. I watch the cost-per-run and success metrics for both, the way I describe in how I measure whether an AI agent is actually working — and I only push a step down a tier after the eval says the cheaper model holds the success rate.
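The "only after the eval says so" rule can be written down as a small gate. The tolerance of one percentage point and the minimum of 100 eval cases are assumptions I picked for the sketch, not thresholds with any special standing. `EvalResult` and `canDemote` are hypothetical names.

```typescript
interface EvalResult {
  passed: number;
  total: number;
}

function successRate(r: EvalResult): number {
  if (r.total <= 0) throw new Error("eval result has no cases");
  return r.passed / r.total;
}

// Allow moving a step to the cheaper model only if the eval set is large
// enough and the cheaper model's success rate stays within maxDrop of
// the current model's rate.
function canDemote(
  current: EvalResult,
  candidate: EvalResult,
  maxDrop = 0.01,
  minCases = 100,
): boolean {
  if (candidate.total < minCases) return false;
  return successRate(candidate) >= successRate(current) - maxDrop;
}

// Illustrative results, not measurements.
console.log(canDemote({ passed: 980, total: 1000 }, { passed: 978, total: 1000 }));
console.log(canDemote({ passed: 98, total: 100 }, { passed: 95, total: 100 }));
```

A fixed tolerance ignores sampling noise, so on small eval sets treat a pass as a reason to collect more cases rather than as proof.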
FAQ
Is Claude Haiku always cheaper than Sonnet in practice?
Per token, yes — by a wide margin. Per successful outcome, not always. If Haiku’s lower success rate triggers retries and human cleanup, the total cost can exceed Sonnet’s on tasks where mistakes are expensive to catch or fix.
How do I decide between Haiku and Sonnet for a given task?
Score the task on two axes: how verifiable the output is and how costly a mistake is. Cheap-to-verify, low-stakes, high-volume work goes to Haiku; open-ended, customer-facing, or hard-to-verify work goes to Sonnet. Route per task, not per agent.
What’s the single cost metric I should track?
Cost per successful outcome — call cost times attempts plus expected cleanup cost, divided by success rate. Per-call price alone hides retries and human time, which is where cheap models quietly get expensive.
Can I use both models in one agent?
Yes, and you usually should. The strongest pattern is a cheap first pass (Haiku classifies or filters) that escalates only ambiguous cases to Sonnet. That hybrid typically beats running everything on a single tier.