AI Agent Cost Math: When Haiku Beats Sonnet (and When It Doesn't)

Alejandro Rioja
6 min read
TL;DR

Picking Claude Haiku over Sonnet can cut per-call cost dramatically, but only when the task tolerates a lower success rate. The real metric isn't cost per call — it's cost per successful outcome, including retries and human cleanup. I route by task, not by default.

The token economics, stated plainly

Anthropic prices Claude per million tokens, input and output billed separately, with output costing several times more than input. The exact numbers move over time, so check Anthropic’s current pricing — but the structure is what drives the decision:

Two things follow. First, output tokens dominate cost on generative tasks, so a model that’s verbose costs more even at the same per-token rate. Second, the per-token gap between Haiku and Sonnet is large enough that on a high-volume step it absolutely shows up on the bill. That’s the case for Haiku. Now the case against.
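To make the structure concrete, here is a minimal sketch of the per-call arithmetic. The prices and tier names are hypothetical placeholders chosen only to show the shape (output priced several times higher than input); they are not Anthropic's actual rates.

```typescript
// Hypothetical per-million-token prices in USD. Placeholders, not real rates.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of one call: input and output tokens are billed separately.
function callCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1e6;
}

const cheapTier: Pricing = { inputPerMTok: 1, outputPerMTok: 5 };
const premiumTier: Pricing = { inputPerMTok: 3, outputPerMTok: 15 };

// Same 2,000-token prompt; only the length of the answer changes.
const terse = callCost(cheapTier, 2000, 200);
const verbose = callCost(cheapTier, 2000, 1000);
// Same terse call on the pricier tier.
const premiumTerse = callCost(premiumTier, 2000, 200);
```

With these made-up numbers the terse call costs $0.003, the verbose call on the same tier costs $0.007, and the terse call on the premium tier costs $0.009. Verbosity and tier both move the bill, which is why both show up in the decision.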

The metric that actually matters: cost per successful outcome

Per-call cost is a vanity number. Here’s the formula I actually use:

code
cost_per_success = ((call_cost × attempts) + cleanup_cost) ÷ success_rate

Where attempts accounts for retries, and cleanup_cost is the expected cost of a human fixing the failures that slip through. Watch what this does to the comparison.

Suppose Haiku costs roughly a tenth of Sonnet per call. If Haiku succeeds 80% of the time on a task and Sonnet succeeds 98%, the per-call savings look enormous. But if each Haiku failure triggers one retry and 1-in-10 still needs a human who costs real money, the cleanup term can swamp the token savings. On a low-stakes, high-volume task the math favors Haiku overwhelmingly. On a task where a failure emails the wrong customer, it can invert completely.
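The inversion is easier to see with numbers. The sketch below plugs the scenario above into the cost-per-success formula. The dollar amounts are hypothetical, and it assumes each failure triggers exactly one retry and that 1 in 10 failures ends up with a human; change those assumptions and the crossover point moves.

```typescript
interface Scenario {
  callCost: number;    // USD per call
  attempts: number;    // expected calls per task, including retries
  cleanupCost: number; // expected human cleanup cost per task, USD
  successRate: number; // fraction of tasks that succeed, between 0 and 1
}

// (call cost times attempts, plus cleanup) divided by success rate.
function costPerSuccess(s: Scenario): number {
  return (s.callCost * s.attempts + s.cleanupCost) / s.successRate;
}

// Assumptions: one retry per failure, 1 in 10 failures needs a human.
function scenario(callCost: number, successRate: number, humanCost: number): Scenario {
  const failureRate = 1 - successRate;
  return {
    callCost: callCost,
    attempts: 1 + failureRate,
    cleanupCost: failureRate * 0.1 * humanCost,
    successRate: successRate,
  };
}

// Hypothetical call costs: the cheap model at a tenth of the premium one.
const HAIKU_CALL = 0.001;
const SONNET_CALL = 0.01;

// Low stakes: a human fix costs $0.05 (a glance at a review queue).
const lowStakes = {
  haiku: costPerSuccess(scenario(HAIKU_CALL, 0.8, 0.05)),
  sonnet: costPerSuccess(scenario(SONNET_CALL, 0.98, 0.05)),
};

// High stakes: a human fix costs $5 (an apology email, a refund review).
const highStakes = {
  haiku: costPerSuccess(scenario(HAIKU_CALL, 0.8, 5)),
  sonnet: costPerSuccess(scenario(SONNET_CALL, 0.98, 5)),
};
```

Under these assumptions the low-stakes task costs about $0.0028 per success on Haiku versus about $0.0105 on Sonnet. Raise the human cost to $5 and it flips: about $0.1265 on Haiku versus about $0.0206 on Sonnet.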

You can’t make this call without measuring success rate per model — which is exactly what an eval harness gives you. Run the same eval set against both models and read the success rates off the same yardstick.
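As a rough illustration of "same yardstick", the sketch below scores two models against one shared eval set. Everything in it is hypothetical: the `EvalCase` shape, the exact-match grading, and the two stub functions that stand in for real model calls (which would be asynchronous API requests in practice).

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

// Stand-in for a model call. A real one would be async and hit an API.
type ModelFn = (input: string) => string;

// Fraction of cases where the model's output exactly matches the expected label.
function successRate(model: ModelFn, cases: EvalCase[]): number {
  if (cases.length === 0) {
    throw new Error("eval set is empty");
  }
  const passed = cases.filter((c) => model(c.input) === c.expected).length;
  return passed / cases.length;
}

const evalSet: EvalCase[] = [
  { input: "refund please", expected: "billing" },
  { input: "app crashes on login", expected: "bug" },
  { input: "love the product", expected: "feedback" },
  { input: "charged twice", expected: "billing" },
];

// Stub "cheap" model: misses the billing case that never says "refund".
const stubCheap: ModelFn = (input) =>
  /refund/.test(input) ? "billing" : /crash/.test(input) ? "bug" : "feedback";

// Stub "strong" model: also recognizes "charged".
const stubStrong: ModelFn = (input) =>
  /refund|charged/.test(input) ? "billing" : /crash/.test(input) ? "bug" : "feedback";

const cheapRate = successRate(stubCheap, evalSet);
const strongRate = successRate(stubStrong, evalSet);
```

Here the cheap stub scores 0.75 and the strong stub scores 1. The point is only that both rates come from the same cases and the same grader, so they can be dropped straight into the cost-per-success comparison.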

Where Haiku wins decisively

Haiku is the right call when the task is narrow, structured, and verifiable:

The common thread: the cost of a Haiku mistake is low and the mistake is cheap to catch. When verification is cheap and stakes are low, the cheap model wins.

Where Sonnet earns its price

Sonnet (and sometimes Opus) is worth it when the task is open-ended, multi-step, or expensive to get wrong:

A failure here doesn’t cost one retry — it costs a refund, a churned customer, or my time. Against that, the per-token premium is rounding error.

The routing rule I actually ship

I don’t pick one model per agent. I route per task inside the agent, usually with a cheap classifier deciding which downstream model handles the work:

typescript
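// Minimal task shape the router needs (fields inferred from how they are used below).
interface Task {
  type: string;
  customerFacing: boolean;
  steps: number;
}

// Model names are shorthand; substitute the current model IDs.
function pickModel(task: Task): string {
  // Cheap, verifiable, high-volume → Haiku
  if (task.type === "classify" || task.type === "extract") {
    return "claude-haiku";
  }
  // Open-ended or customer-facing → Sonnet
  if (task.customerFacing || task.steps > 2) {
    return "claude-sonnet";
  }
  // Anything unrecognized also defaults to the safe choice
  return "claude-sonnet";
}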

Two principles encoded here. Default to the safe model, not the cheap one — you optimize cost down from a working baseline, never reliability up from a broken one. And escalate, don’t gamble: let Haiku handle the easy 80% and hand the hard 20% to Sonnet. That hybrid almost always beats running everything on either model alone.
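The "escalate, don't gamble" half of that can be sketched as a cheap first pass with a verifier in front of the expensive model. This is an illustration under assumptions, not the production router: the function names, the allowed-label check, and the stub models are all hypothetical, and real model calls would be async.

```typescript
type ModelFn = (input: string) => string;
type Verifier = (output: string) => boolean;

interface Routed {
  output: string;
  model: "cheap" | "strong";
}

// Try the cheap model first; escalate only when its output fails verification.
function runWithEscalation(
  input: string,
  cheap: ModelFn,
  strong: ModelFn,
  verify: Verifier
): Routed {
  const first = cheap(input);
  if (verify(first)) {
    return { output: first, model: "cheap" };
  }
  return { output: strong(input), model: "strong" };
}

// Hypothetical verifier: the output must be one of the known labels.
const allowedLabels = ["billing", "bug", "feedback"];
const isAllowedLabel: Verifier = (out) => allowedLabels.indexOf(out) !== -1;

// Stubs standing in for real model calls.
const stubCheap: ModelFn = (input) => (/refund/.test(input) ? "billing" : "unsure");
const stubStrong: ModelFn = () => "bug";

const easy = runWithEscalation("refund please", stubCheap, stubStrong, isAllowedLabel);
const hard = runWithEscalation("it broke somehow", stubCheap, stubStrong, isAllowedLabel);
```

The easy input is answered by the cheap stub; the hard one fails verification and is handed to the strong stub. This only works when verification is cheap, which is the same condition that makes a task a good Haiku candidate in the first place.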

There’s also prompt caching to layer on top: if your system prompt is large and reused, caching cuts input cost substantially regardless of tier, which sometimes makes Sonnet cheap enough that the Haiku question is moot.
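A back-of-the-envelope sketch of why caching can change the answer. The base prices and the cached-read multiplier below are assumptions for illustration; check the current pricing for real values. The sketch also ignores any extra charge for writing to the cache, and the cheaper tier can use caching too, so treat it as the shape of the effect rather than a fair head-to-head.

```typescript
interface CacheParams {
  inputPerMTok: number;        // hypothetical base input price, USD per million tokens
  cacheReadMultiplier: number; // assumed fraction of the base price charged for cached reads
}

// Input cost of one call when part of the prompt is served from cache.
function inputCost(p: CacheParams, cachedTokens: number, freshTokens: number): number {
  const cached = cachedTokens * p.inputPerMTok * p.cacheReadMultiplier;
  const fresh = freshTokens * p.inputPerMTok;
  return (cached + fresh) / 1e6;
}

const premium: CacheParams = { inputPerMTok: 3, cacheReadMultiplier: 0.1 };
const cheap: CacheParams = { inputPerMTok: 1, cacheReadMultiplier: 0.1 };

// A 10,000-token system prompt plus a 200-token user message.
const premiumUncached = inputCost(premium, 0, 10200);
const premiumCached = inputCost(premium, 10000, 200);
const cheapUncached = inputCost(cheap, 0, 10200);
```

With these placeholder numbers the premium tier's input cost drops from $0.0306 to $0.0036 per call once the system prompt is cached, which is below the cheap tier's uncached $0.0102.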

A worked example from my own stack

Take a high-volume inbound triage step. It runs thousands of times, the task is three-way classification, and a miss just means the item lands in a review queue — cheap to catch, low stakes. That’s a textbook Haiku task, and moving it off Sonnet meaningfully cut the cost of that step with no measurable hit to the outcome that mattered.

Now take the step that drafts the actual reply to a customer. Lower volume, open-ended, and a bad draft going out costs trust. That stays on Sonnet. Same agent, two models, routed by stakes. I watch the cost-per-run and success metrics for both, the way I describe in how I measure whether an AI agent is actually working — and I only push a step down a tier after the eval says the cheaper model holds the success rate.
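That last rule can be written down as a small gate. A minimal sketch, assuming each step has an eval result with a success rate and a cost per run; the `EvalResult` shape, the `maxDrop` tolerance, and the sample numbers are hypothetical, not measurements from this stack.

```typescript
interface EvalResult {
  successRate: number; // between 0 and 1, measured on the shared eval set
  costPerRun: number;  // USD
}

// Push a step down a tier only if the cheaper model holds the success rate
// (within a small tolerance) and is actually cheaper.
function shouldDowngrade(
  current: EvalResult,
  candidate: EvalResult,
  maxDrop: number = 0.01
): boolean {
  const holdsSuccess = candidate.successRate >= current.successRate - maxDrop;
  const isCheaper = candidate.costPerRun < current.costPerRun;
  return holdsSuccess && isCheaper;
}

// Made-up numbers shaped like the two steps above.
const triageOk = shouldDowngrade(
  { successRate: 0.97, costPerRun: 0.01 },
  { successRate: 0.965, costPerRun: 0.001 }
);
const replyOk = shouldDowngrade(
  { successRate: 0.98, costPerRun: 0.01 },
  { successRate: 0.8, costPerRun: 0.001 }
);
```

With these numbers the triage step clears the gate and the reply-drafting step does not, matching the split described above.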

FAQ

Is Claude Haiku always cheaper than Sonnet in practice?

Per token, yes — by a wide margin. Per successful outcome, not always. If Haiku’s lower success rate triggers retries and human cleanup, the total cost can exceed Sonnet’s on tasks where mistakes are expensive to catch or fix.

How do I decide between Haiku and Sonnet for a given task?

Score the task on two axes: how verifiable the output is and how costly a mistake is. Cheap-to-verify, low-stakes, high-volume work goes to Haiku; open-ended, customer-facing, or hard-to-verify work goes to Sonnet. Route per task, not per agent.

What’s the single cost metric I should track?

Cost per successful outcome — call cost times attempts plus expected cleanup cost, divided by success rate. Per-call price alone hides retries and human time, which is where cheap models quietly get expensive.

Can I use both models in one agent?

Yes, and you usually should. The strongest pattern is a cheap first pass (Haiku classifies or filters) that escalates only ambiguous cases to Sonnet. That hybrid typically beats running everything on a single tier.
