Alejandro Rioja.
AI Agents Operations

Prompt Caching with the Claude API: Cut Your Input Costs Without Switching Models

Alejandro Rioja
8 min read
TL;DR

Prompt caching cuts the cost of large, stable inputs — your system prompt, tool definitions, few-shot examples — to roughly 10% of normal input pricing on repeat requests. The mechanism is a prefix match: put a cache_control marker at the end of your stable content and keep everything volatile after it. The mistake that kills cache hit rates is letting a timestamp or UUID float into the prefix.


What prompt caching actually does

Every call to the Claude API sends tokens. Without caching, every token in your request — system prompt, tool definitions, few-shot examples, and the user message — gets priced at the normal input rate. With caching, a prefix of those tokens gets stored on Anthropic’s servers after the first request. On subsequent requests that share that exact prefix, you pay a cache read price instead of re-processing them from scratch.

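The cost difference is real: at the default 5-minute TTL, a cache write costs about 1.25× the base input price, and a cache read costs about 10% of it.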

Once you’re past break-even — which happens fast on any agent running more than a few times a day — every additional cache hit is a ~90% discount on those tokens.

The prefix-match invariant

This is the one rule everything else follows: the cache key is a prefix match of your rendered prompt.

Anthropic’s servers store the rendered content from the start of your prompt up to the cache_control marker. For a cache hit to occur on the next request, every token from the start of the prompt up to that marker must be identical — byte for byte.

The render order for prefix matching is: tools → system → messages. So your tools array is hashed first, then the system block, then messages in order.

What this means in practice: stable content must come first. If your system prompt references anything dynamic — a current date, a user ID, a request trace ID — and it appears before the cache_control marker, the cache will miss on every request because the prefix keeps changing.
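A rough way to see why ordering matters is to compare two rendered prompts character by character. The sketch below is only an illustration: `sharedPrefixLength` is a hypothetical helper, and the real cache compares tokens on Anthropic's side, not characters in your process. The principle is the same, though: whatever differs first ends the shared prefix.

```typescript
// Hypothetical helper: length of the leading run two rendered prompts share.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const rules = "You are an agent. Follow the house style guide.";

// Volatile value first: the two renders diverge almost immediately.
const badA = `Request ID: a1\n${rules}`;
const badB = `Request ID: b2\n${rules}`;

// Stable rules first: the whole rules block is shared between requests.
const goodA = `${rules}\nRequest ID: a1`;
const goodB = `${rules}\nRequest ID: b2`;

console.log(sharedPrefixLength(badA, badB));   // only the "Request ID: " label matches
console.log(sharedPrefixLength(goodA, goodB)); // covers all of `rules` and more
```

In the first pair nothing useful is shared, so there is nothing to cache. In the second pair the entire rules block sits inside the shared prefix.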

What to put a cache marker on

The highest-leverage targets are:

1. Your system prompt

System prompts are usually the largest stable block. A detailed agent persona, a list of behavioral rules, a set of output format instructions — all of this is identical across every invocation of the same agent. Mark it:

typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: `You are a content operations agent for alejandrorioja.com.
Your job is to draft blog posts in Alejandro's voice: direct, practitioner, 
first-person, numbered lists, honest caveats. No hedging. No filler. 
Every section must earn its place.

[... 2000 more tokens of stable instructions ...]`,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: "Draft a post about prompt caching.",
    },
  ],
});

The cache_control: { type: "ephemeral" } on the system block tells the API to cache everything up to and including that block, which also covers any tools because they render first. The messages array is volatile (different on each request) and stays outside the cache boundary.

2. Tool definitions

If your agent uses tools, those definitions can be substantial. A well-documented tool schema with description, parameter names, and enum values can run 500–1,000 tokens per tool. With 5 tools, that’s up to 5,000 tokens you’re paying to re-process on every call:

typescript
const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  tools: [
    {
      name: "search_airtable",
      description: "Search the Airtable content queue...",
      input_schema: { type: "object", properties: { query: { type: "string" } } },
    },
    // ... more tools ...
    {
      name: "post_to_kit",
      description: "Schedule a broadcast via the Kit API...",
      input_schema: { /* ... */ },
      // Mark the last tool to cache the entire tools array
      cache_control: { type: "ephemeral" },
    } as Anthropic.Tool & { cache_control: { type: "ephemeral" } },
  ],
  system: "...",
  messages: [/* ... */],
});

Mark the last tool in the array. The cached prefix then covers the full tools array, up to and including that tool.

3. Few-shot examples in messages

If you pass static few-shot examples as early messages in the messages array, those can be cached too. Structure them as the first N messages and mark the last example turn:

typescript
const messages: Anthropic.MessageParam[] = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: "Here are examples of posts in my voice:\n\n[Example 1...]\n\n[Example 2...]",
        cache_control: { type: "ephemeral" },
      } as Anthropic.TextBlockParam & { cache_control: { type: "ephemeral" } },
    ],
  },
  {
    role: "assistant",
    content: "Understood. I'll follow that voice.",
  },
  // The actual user turn follows — this is volatile, no cache marker
  {
    role: "user",
    content: actualUserRequest,
  },
];

What NOT to cache (silent invalidators)

These are the things that look stable but aren’t — and they’ll kill your hit rate silently. The API won’t warn you. You’ll just see cache_creation_input_tokens on every request and wonder why.

Timestamps in the system prompt. The single most common mistake:

typescript
// This invalidates the cache on every request
const system = `You are an agent. Current time: ${new Date().toISOString()}`;

Move timestamps to the user message where they belong:

typescript
// Stable system prompt — cacheable
const system = `You are an agent. Use the current time provided by the user.`;

// Volatile user message — not cached
const userMessage = `Current time: ${new Date().toISOString()}. Run the daily brief.`;

Random UUIDs and trace IDs. Same problem. If you inject a trace ID into the system block for logging, every request gets a fresh prefix.

Non-deterministic JSON serialization. If you serialize an object into the system prompt and the key order isn’t guaranteed, the rendered string can differ even when the underlying data is the same. Serialize with a stable key order or use a template string.

Dynamic few-shot selection. If you’re choosing few-shot examples based on the current query and putting them in the cached prefix, you’ve made the “stable” prefix query-dependent. Either commit to fixed examples for the cache layer, or move dynamic examples to the uncached message turn.
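For the serialization problem above, one option is to sort keys at every level before the data reaches the prompt. This is a minimal sketch, not a library API: `stableStringify` and the `Json` type are hypothetical names introduced here, and the sketch only handles plain JSON values.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical helper: JSON.stringify with object keys sorted at every level,
// so the same data always renders to the same string.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
  return `{${entries.join(",")}}`;
}

// Same data, different insertion order: both render identically.
const first = stableStringify({ tone: "direct", rules: { lists: true, filler: false } });
const second = stableStringify({ rules: { filler: false, lists: true }, tone: "direct" });
console.log(first === second);
```

Array order is preserved on purpose. Only object key order is normalized, because that is the part that can vary between runs without the data changing.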

Verifying your cache hit rate

Every response includes usage metadata. Check it:

typescript
const response = await client.messages.create({ /* ... */ });

console.log({
  inputTokens: response.usage.input_tokens,
  cacheRead: response.usage.cache_read_input_tokens,
  cacheWrite: response.usage.cache_creation_input_tokens,
  outputTokens: response.usage.output_tokens,
});

On the first request: cache_creation_input_tokens will be non-zero, cache_read_input_tokens will be 0. That’s the write.

On a cache hit: cache_read_input_tokens will be non-zero, cache_creation_input_tokens will be 0. That’s the read.

If you’re seeing cache_creation_input_tokens on every request, your prefix is changing. Add a log statement that prints the first 200 characters of your rendered system prompt before each call — a floating timestamp will jump out immediately.
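If you want a single number to track, the usage fields can be folded into a cache-read share. The sketch below assumes that input_tokens counts only the uncached part of the request, so total input is the sum of all three fields. `CacheUsage` and `cacheReadShare` are hypothetical names, not part of the SDK.

```typescript
// Minimal shape of the usage fields named above. The cache fields are
// treated as optional here in case a response omits them.
interface CacheUsage {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// Hypothetical helper: fraction of all input tokens that were read from cache.
function cacheReadShare(usage: CacheUsage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const total = usage.input_tokens + read + written;
  return total === 0 ? 0 : read / total;
}

// First request: everything is written, nothing is read.
const coldShare = cacheReadShare({ input_tokens: 100, cache_read_input_tokens: 0, cache_creation_input_tokens: 300 });
// Later request: the stable prefix is read back from cache.
const warmShare = cacheReadShare({ input_tokens: 100, cache_read_input_tokens: 300, cache_creation_input_tokens: 0 });
console.log({ coldShare, warmShare });
```

Logged per request, a share that stays at zero is the signal that the prefix is changing between calls.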

The 1-hour TTL: when it’s worth the extra write cost

The default TTL is 5 minutes, refreshed each time the cache is read. If your agent runs less often than once every 5 minutes, the entry expires between calls and you pay the write price on most requests without ever getting a read. You can opt into a 1-hour TTL instead:

typescript
// Opt into a 1-hour TTL
cache_control: { type: "ephemeral", ttl: "1h" }

The 1-hour write costs about 2× the base input price instead of 1.25×. The math: if requests arrive more than 5 minutes apart but at least 3 of them land inside the hour, the 1-hour TTL beats not caching at all (one 2× write plus two 0.1× reads is 2.2×, against 3× uncached). If your agent runs once a day (like my daily brief), even the 1-hour TTL won’t help: every request is a cache write, so marking the prompt costs more than leaving it unmarked, however large the system prompt is.

My daily brief agent has a 3,000-token system prompt but runs once daily. Caching doesn’t help. My newsletter agent runs dozens of times per session while drafting — caching saves substantially.
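The break-even is easy to check with the multipliers quoted in this post (1.25× for a 5-minute write, 2× for a 1-hour write, 0.1× for a read). The sketch below is a simplification under stated assumptions: all requests fall inside one hour, they are spaced more than 5 minutes apart, and costs are in units of one uncached pass over the prefix. `hourlyPrefixCost` is a hypothetical helper.

```typescript
// Multipliers relative to the base input price, as quoted in this post.
const WRITE_5M = 1.25;
const WRITE_1H = 2.0;
const READ = 0.1;

// Cost of `n` requests inside one hour, each more than 5 minutes after the
// previous one, measured in "uncached passes over the prefix".
function hourlyPrefixCost(n: number) {
  return {
    noCache: n,                                       // full price every time
    fiveMinute: n * WRITE_5M,                         // entry expires, so every call rewrites
    oneHour: n === 0 ? 0 : WRITE_1H + (n - 1) * READ, // one write, then reads
  };
}

for (const n of [1, 2, 3, 6]) {
  console.log(n, hourlyPrefixCost(n));
}
```

At one or two requests per hour the 1-hour write is not recovered. From three requests onward it is cheaper than both alternatives. If your requests arrive less than 5 minutes apart, the default TTL never expires between calls and this comparison does not apply.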

Pre-warming: making the first request cheap

If you have a known traffic spike coming — a batch job, an API launch — you can pre-warm the cache with a low-cost dummy request:

typescript
// Pre-warm: write the cache at near-zero output cost
await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1, // minimal output
  system: [{ type: "text", text: stableSystemPrompt, cache_control: { type: "ephemeral" } }],
  messages: [{ role: "user", content: "ping" }],
});

// Now the real requests read from cache

This is mostly useful for batch processing where you’re spinning up many parallel requests and want every one to hit a warm cache rather than racing to write it.

Prompt caching in agentic loops

In a multi-turn agentic loop, the conversation history grows on every turn. The cache handles this with a lookback: from a cache_control marker, the API checks earlier content block boundaries (roughly the last 20 blocks) and uses the longest prefix it has already cached.

The practical implication: keep your stable content (system prompt, tool definitions) anchored at the top. The growing conversation history at the end of the messages array won’t break the prefix match for the stable blocks — they’re before the volatile content, and the prefix match starts from the top.

In practice, my agents structure turns like this:

code
Tools (cached) → System (cached) → Few-shot (cached) → Turn 1 → Turn 2 → ... → Current turn

The cache covers everything up to the few-shot marker. The growing turn history after it gets re-processed each time, but that’s fine — those tokens are session-specific and small relative to the stable prefix.
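The layout above can be sketched as a small message builder. This is an assumption-laden illustration, not my production code: the types are simplified stand-ins for the SDK's message types, and `FEW_SHOT` and `buildMessages` are hypothetical names. The point is that the static turns are always emitted first and unchanged, and only the tail grows.

```typescript
type Role = "user" | "assistant";
interface TextBlock { type: "text"; text: string; cache_control?: { type: "ephemeral" } }
interface Message { role: Role; content: TextBlock[] }

// Static few-shot turns. The cache marker sits on the examples block,
// mirroring the few-shot example earlier in this post.
const FEW_SHOT: Message[] = [
  {
    role: "user",
    content: [{ type: "text", text: "Here are examples of posts in my voice: ...", cache_control: { type: "ephemeral" } }],
  },
  { role: "assistant", content: [{ type: "text", text: "Understood. I'll follow that voice." }] },
];

// Stable turns first, then the session history, then the current request.
function buildMessages(history: Message[], currentTurn: string): Message[] {
  return [...FEW_SHOT, ...history, { role: "user", content: [{ type: "text", text: currentTurn }] }];
}

const turn1 = buildMessages([], "Draft the intro.");
const turn2 = buildMessages(
  [...turn1.slice(FEW_SHOT.length), { role: "assistant", content: [{ type: "text", text: "Intro draft..." }] }],
  "Now tighten it.",
);
```

Both calls start with the same few-shot turns, so the marked block renders identically no matter how long the session gets.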

What it looks like on the bill

Take a high-frequency agent: 100 calls per day, 4,000-token system prompt, Sonnet pricing.

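Without caching: 100 × 4,000 = 400,000 system-prompt tokens per day, all billed at the full input rate.

With caching (5-min TTL, assuming 50 calls/hour at peak): only the first call of each burst pays the 1.25× write price. Every other call reads the prompt from cache at about 10% of the input rate.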

That’s roughly a 90% reduction on those input tokens. At scale — 1,000 calls per day — the difference compounds further. And this is on top of any model-routing savings from the Haiku vs Sonnet math: caching works at every tier.
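To put rough dollar figures on that scenario, here is a small calculator. Everything numeric in it beyond the post's own multipliers is an assumption to replace with your own values: the $3 per million input tokens base price is a placeholder for Sonnet, and the count of 2 cache writes per day is a guess at how many times the cache goes cold. `dailyPrefixCostUSD` is a hypothetical helper.

```typescript
// ASSUMPTION: placeholder base input price in USD per million tokens.
// Check the current pricing page before relying on it.
const BASE_PRICE_PER_MTOK = 3.0;

// Daily cost of the stable prefix only (user messages and output excluded).
function dailyPrefixCostUSD(calls: number, prefixTokens: number, cacheWrites: number) {
  const perPass = (prefixTokens / 1e6) * BASE_PRICE_PER_MTOK;
  const uncached = calls * perPass;
  const cached =
    cacheWrites * perPass * 1.25 +          // 5-minute writes at 1.25x
    (calls - cacheWrites) * perPass * 0.1;  // every other call is a 0.1x read
  return { uncached, cached, savings: 1 - cached / uncached };
}

// 100 calls/day, 4,000-token prompt, ASSUMING the cache goes cold twice a day.
const bill = dailyPrefixCostUSD(100, 4000, 2);
console.log(bill);
```

Under these assumptions the prompt costs $1.20 a day uncached and about $0.15 cached. The savings figure moves with the number of cold starts, which is why hit rate matters more than any single price.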

The operator’s bottom line

Prompt caching is the easiest cost optimization in the Claude API: one additional field on the content blocks you’re already writing. The constraint is discipline around prefix stability — nothing dynamic before the cache marker. If you can keep your system prompt, tools, and any static examples free of volatile content, you’ll pay ~10% of normal input cost on every cache hit. For high-frequency agents with large stable prompts, this is a bigger lever than switching model tiers.


Related: AI Agent Cost Math: When Haiku Beats Sonnet · Event-Triggered vs Scheduled Agents · The 5 AI Tools I Actually Use to Run My Business
