# Alejandro Rioja

> Alejandro Rioja — AI agent systems for founders. Plus posts on growth, marketing, sales, ops, and business from inside live P&Ls.

Site: https://alejandrorioja.com
Author: Alejandro Rioja

---

## AI Agent Cost Math: When Haiku Beats Sonnet (and When It Doesn't)

Source: https://alejandrorioja.com/ai-agent-cost-math-when-haiku-beats-sonnet/
Published: 2026-06-08
Tags: AI Agents, Operations
TL;DR: Picking Claude Haiku over Sonnet can cut per-call cost dramatically, but only when the task tolerates a lower success rate. The real metric isn't cost per call — it's cost per successful outcome, including retries and human cleanup. I route by task, not by default.

_Updated June 2026._

**TL;DR:** Choosing Claude Haiku over Sonnet can cut per-call cost by an order of magnitude, but only when the task tolerates Haiku's lower success rate. The metric that matters is **cost per successful outcome** — call cost plus retries plus human cleanup — not the sticker price per token. I route per task, and a meaningful share of my high-volume steps run on Haiku while the judgment calls stay on Sonnet.

**Operator's read:** I run 100+ agents, and inference is a real line item. But I've watched teams "save money" by forcing everything onto the cheapest model and then eat the cost in retries, escalations, and angry customers. Cost math only works when you measure the whole funnel.

The cheapest model is not the one with the lowest per-token price. It's the one with the lowest total cost to get the job done right. Those are different numbers, and the gap between them is where most agent cost decisions go wrong.

## The token economics, stated plainly

Anthropic prices Claude per million tokens, input and output billed separately, with output costing several times more than input.
The exact numbers move over time, so check Anthropic's current pricing — but the **structure** is what drives the decision:

- **Haiku** is the cheap, fast tier — by far the lowest per-token cost in the family.
- **Sonnet** sits in the middle — markedly more expensive than Haiku, markedly cheaper than Opus.
- **Opus** is the premium tier for the hardest reasoning.

Two things follow. First, output tokens dominate cost on generative tasks, so a model that's verbose costs more even at the same per-token rate. Second, the per-token gap between Haiku and Sonnet is large enough that on a high-volume step it absolutely shows up on the bill.

That's the case *for* Haiku. Now the case against.

## The metric that actually matters: cost per successful outcome

Per-call cost is a vanity number. Here's the formula I actually use:

```
cost_per_success = ((call_cost × attempts) + cleanup_cost) ÷ success_rate
```

Where `attempts` is the average number of calls per task, retries included, and `cleanup_cost` is the expected cost of a human fixing the failures that slip through. The whole numerator is divided by `success_rate`, so every failed outcome raises the price of the successful ones.

Watch what this does to the comparison. Suppose Haiku costs roughly a tenth of Sonnet per call. If Haiku succeeds 80% of the time on a task and Sonnet succeeds 98%, the per-call savings look enormous. But if each Haiku failure triggers one retry and 1-in-10 still needs a human who costs real money, the cleanup term can swamp the token savings. On a low-stakes, high-volume task the math favors Haiku overwhelmingly. On a task where a failure emails the wrong customer, it can invert completely.

You can't make this call without measuring success rate per model — which is exactly what an [eval harness](/the-eval-harness-i-use-to-ship-ai-agents/) gives you. Run the same eval set against both models and read the success rates off the same yardstick.

## Where Haiku wins decisively

Haiku is the right call when the task is **narrow, structured, and verifiable**:

- **Classification and routing** — "is this inbound a booking, a complaint, or spam?"
  Three buckets, easy to verify, runs constantly. Haiku all day.
- **Extraction with a schema** — pulling a date, a name, an amount out of text, validated with Zod. If the output parses, it's almost certainly right.
- **Short rewrites and formatting** — tone tweaks, summarizing a known-good input, normalizing data.
- **First-pass filtering** — Haiku triages, and only the ambiguous cases get escalated to Sonnet. This is the highest-leverage pattern.

The common thread: the cost of a Haiku mistake is low and the mistake is cheap to catch. When verification is cheap and stakes are low, the cheap model wins.

## Where Sonnet earns its price

Sonnet (and sometimes Opus) is worth it when the task is **open-ended, multi-step, or expensive to get wrong**:

- **Multi-tool agent loops** where one wrong tool call cascades. Higher reasoning reliability compounds across steps — the orchestration patterns I cover in [multi-agent orchestration](/multi-agent-orchestration-patterns-queues-state-handoffs/) lean on the model not losing the plot.
- **Customer-facing generation** where a bad output costs trust, not just a retry.
- **Anything where verification is itself hard.** If you can't cheaply tell whether the output is right, you can't afford a model that's frequently wrong.

A failure here doesn't cost one retry — it costs a refund, a churned customer, or my time. Against that, the per-token premium is rounding error.

## The routing rule I actually ship

I don't pick one model per agent. I route per **task** inside the agent, usually with a cheap classifier deciding which downstream model handles the work:

```typescript
// Assumed minimal task shape: only the fields the router reads.
interface Task {
  type: string;
  customerFacing: boolean;
  steps: number;
}

function pickModel(task: Task): string {
  // Cheap, verifiable, high-volume → Haiku
  if (task.type === "classify" || task.type === "extract") {
    return "claude-haiku";
  }
  // Open-ended or customer-facing → Sonnet
  if (task.customerFacing || task.steps > 2) {
    return "claude-sonnet";
  }
  return "claude-sonnet"; // default to the safe choice
}
```

Two principles encoded here.
**Default to the safe model**, not the cheap one — you optimize cost *down* from a working baseline, never reliability *up* from a broken one. And **escalate, don't gamble**: let Haiku handle the easy 80% and hand the hard 20% to Sonnet. That hybrid almost always beats running everything on either model alone.

There's also prompt caching to layer on top: if your system prompt is large and reused, caching cuts input cost substantially regardless of tier, which sometimes makes Sonnet cheap enough that the Haiku question is moot.

## A worked example from my own stack

Take a high-volume inbound triage step. It runs thousands of times, the task is three-way classification, and a miss just means the item lands in a review queue — cheap to catch, low stakes. That's a textbook Haiku task, and moving it off Sonnet meaningfully cut the cost of that step with no measurable hit to the outcome that mattered.

Now take the step that drafts the actual reply to a customer. Lower volume, open-ended, and a bad draft going out costs trust. That stays on Sonnet. Same agent, two models, routed by stakes.

I watch the cost-per-run and success metrics for both, the way I describe in [how I measure whether an AI agent is actually working](/how-i-measure-whether-an-ai-agent-is-actually-working/) — and I only push a step down a tier after the eval says the cheaper model holds the success rate.

## FAQ

### Is Claude Haiku always cheaper than Sonnet in practice?

Per token, yes — by a wide margin. Per successful outcome, not always. If Haiku's lower success rate triggers retries and human cleanup, the total cost can exceed Sonnet's on tasks where mistakes are expensive to catch or fix.

### How do I decide between Haiku and Sonnet for a given task?

Score the task on two axes: how verifiable the output is and how costly a mistake is. Cheap-to-verify, low-stakes, high-volume work goes to Haiku; open-ended, customer-facing, or hard-to-verify work goes to Sonnet. Route per task, not per agent.
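To make that scoring concrete, here's a minimal sketch of the cost-per-successful-outcome comparison in TypeScript. Every number in it is a made-up assumption for illustration (the call costs, retry rates, and cleanup costs are not Anthropic prices or measurements from my stack), and `ModelProfile` and `costPerSuccess` are hypothetical names. Swap in the success rates your own evals report.

```typescript
// Hypothetical profile of one model on one task. All fields are assumptions.
interface ModelProfile {
  callCost: number;    // dollars per call
  attempts: number;    // average calls per task, retries included
  cleanupCost: number; // expected human cleanup cost per task, in dollars
  successRate: number; // fraction of tasks that end in a correct outcome
}

// (call cost x attempts + expected cleanup cost) / success rate
function costPerSuccess(m: ModelProfile): number {
  if (m.successRate <= 0 || m.successRate > 1) {
    throw new RangeError("successRate must be in (0, 1]");
  }
  return (m.callCost * m.attempts + m.cleanupCost) / m.successRate;
}

// Assumed: 2% of Haiku tasks and 0.2% of Sonnet tasks end up needing a human.
const HAIKU_HUMAN_RATE = 0.02;
const SONNET_HUMAN_RATE = 0.002;

// Low-stakes triage: a human fix is assumed to cost $0.10.
const haikuTriage: ModelProfile = {
  callCost: 0.001, attempts: 1.2, cleanupCost: HAIKU_HUMAN_RATE * 0.1, successRate: 0.8,
};
const sonnetTriage: ModelProfile = {
  callCost: 0.01, attempts: 1.02, cleanupCost: SONNET_HUMAN_RATE * 0.1, successRate: 0.98,
};

// High-stakes customer reply: a human fix is assumed to cost $20.
const haikuReply: ModelProfile = { ...haikuTriage, cleanupCost: HAIKU_HUMAN_RATE * 20 };
const sonnetReply: ModelProfile = { ...sonnetTriage, cleanupCost: SONNET_HUMAN_RATE * 20 };

console.log("triage:", costPerSuccess(haikuTriage), "vs", costPerSuccess(sonnetTriage));
console.log("reply: ", costPerSuccess(haikuReply), "vs", costPerSuccess(sonnetReply));
```

With these assumed inputs, Haiku comes out cheaper on the triage step and Sonnet comes out cheaper on the customer reply, because the cleanup term dominates once a failure is expensive. The shape of the comparison is the point, not the specific figures.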
### What's the single cost metric I should track?

Cost per successful outcome — call cost times attempts, plus expected cleanup cost, all divided by success rate. Per-call price alone hides retries and human time, which is where cheap models quietly get expensive.

### Can I use both models in one agent?

Yes, and you usually should. The strongest pattern is a cheap first pass (Haiku classifies or filters) that escalates only ambiguous cases to Sonnet. That hybrid typically beats running everything on a single tier.

---

## How to Debug an AI Agent in Production (A Field Guide)

Source: https://alejandrorioja.com/how-to-debug-an-ai-agent-in-production/
Published: 2026-06-08
Tags: AI Agents, Operations
TL;DR: Debugging a production AI agent is mostly about isolating which layer failed — prompt, tool, model, or orchestration. I log every step with a trace ID, replay the exact inputs, and bisect. In my agents, ~70% of 'AI bugs' turn out to be plumbing bugs, not model bugs.

_Updated June 2026._

**TL;DR:** Debugging a production AI agent is mostly about isolating which layer failed — prompt, tool call, model output, or orchestration. I log every step with a trace ID, replay the exact inputs, and bisect from there. In my agents, roughly 70% of what looks like an "AI bug" turns out to be plumbing: a malformed tool result, a truncated input, a silently swallowed exception.

**Operator's read:** I run 100+ production agents — booking flows for Pickleland, content pipelines, inbox triagers. They break the way all software breaks, plus a few new ways. This is the field guide I wish I'd had: how to find the failing layer without staring at a wall of tokens.

When an agent misbehaves in production, the instinct is to blame the model. "Claude hallucinated." Sometimes true. Usually not. The model is one layer in a stack of five or six, and the bug is far more often in the layer you wrote than the one Anthropic shipped. This post is the systematic way I find it.
## Make every run traceable before you debug anything

You cannot debug what you cannot see. The single highest-leverage thing you can do — before any specific bug shows up — is attach a trace ID to every agent run and log every step it takes.

A "step" is anything that crosses a boundary: the inbound trigger, each model call (with the full messages array), each tool call (with arguments), each tool result, and the final output. Log them as structured JSON keyed by the trace ID.

```typescript
function logStep(traceId: string, step: string, payload: unknown) {
  console.log(JSON.stringify({
    traceId,
    step, // "trigger" | "model_call" | "tool_call" | "tool_result" | "output"
    ts: Date.now(),
    payload,
  }));
}
```

On Cloudflare Workers I ship these to a queue and into a table; locally they go to stdout. The rule is absolute: if a step isn't logged, it didn't happen as far as debugging is concerned. This mirrors the instrumentation I describe in [the agent stack I use](/the-agent-stack-i-use-to-run-30-production-agents-no-python/) — the trace ID is the spine everything else hangs off.

## Isolate the layer: prompt, tool, model, or orchestration

Once you have a trace, debugging becomes a bisection. There are four layers and the bug lives in exactly one of them most of the time.

### 1. The input layer (the most common culprit)

Pull the exact `messages` array that went into the failing model call. Not a reconstruction — the literal payload from the log. Then read it like a stranger would. Half my "the model ignored the instructions" bugs are actually:

- A tool result that came back as `"[object Object]"` because something got stringified wrong.
- An input truncated mid-sentence because it blew the context window and a naive slice cut it.
- A variable that interpolated as `undefined` and quietly poisoned the prompt.

If the input is wrong, the model did its job perfectly on garbage. Fix the plumbing.

### 2. The tool layer

If the input looks clean, check whether a tool returned an error the agent treated as success. A classic: an API returns `200` with a body of `{ "error": "rate limited" }`, your tool wrapper doesn't check the body, and the agent confidently acts on an error message. Log tool results raw and assert their shape.

### 3. The model layer

Only after ruling out 1 and 2 do I suspect the model. Even then, "model bug" usually means "my prompt is ambiguous." Take the exact failing input, drop it into a one-off script against the same model and temperature, and see if it reproduces. If it does, the fix is prompt work or a [tighter eval](/the-eval-harness-i-use-to-ship-ai-agents/), not a frantic model swap.

### 4. The orchestration layer

If a single step is fine in isolation but the multi-step run fails, the bug is in the handoff — state lost between steps, a race condition, a retry that re-ran a non-idempotent action. These are the nastiest and I cover the patterns in [multi-agent orchestration patterns](/multi-agent-orchestration-patterns-queues-state-handoffs/).

## Reproduce non-determinism instead of fighting it

The thing that makes agents feel un-debuggable is non-determinism: the same input produces different output across runs. You can tame it.

First, **pin what you can.** Set `temperature: 0` while debugging. It won't make Claude fully deterministic, but it sharply narrows the variance so you can tell a real bug from sampling noise.

Second, **run it N times.** If a failure reproduces 1 in 20 runs, loop the exact input 50 times and capture every output. Now you have a sample, not an anecdote. A bug that fires 5% of the time is a real bug — you just need volume to see it.
```bash
for i in $(seq 1 50); do
  node replay.mjs --trace=abc123 >> runs.jsonl
done
# then count failures
grep -c '"status":"fail"' runs.jsonl
```

Third, **diff the passing and failing runs.** With temperature pinned and the same input, a difference in output usually points to a difference in input you haven't spotted yet — a timestamp in the prompt, a tool result that varies, a retrieved doc that changed. Rule those out before you put it down to residual sampling noise.

## Build a replay harness so you stop debugging in production

Debugging by re-triggering the live agent is slow and risky — it sends real emails, books real courts. Instead, capture the trace and replay it offline.

The replay harness loads a logged trace, reconstructs the exact inputs to any step, and re-runs just that step against the model. Because you logged the full `messages` array, you don't need the upstream system at all. This turns a 10-minute production round-trip into a 2-second local loop, and it's the single biggest speedup in my debugging workflow.

A good replay harness also lets you **mutate and re-run**: change one line of the system prompt, replay the same 50 failing traces, and see how many now pass. That's the bridge from debugging to eval — once you have a corpus of failing traces, you have the start of a regression suite.

## Watch the metrics that actually predict breakage

Some failures never throw an exception. The agent runs, returns something plausible, and quietly does the wrong thing. To catch those you watch behavioral metrics, not just error rates:

- **Tool-call success rate** per tool. A drop here often precedes a visible failure.
- **Output schema validity** — what % of outputs parse against the expected structure. I validate every output with Zod and alert when validity dips.
- **Loop length** — average number of steps per run. A sudden spike usually means the agent is stuck retrying.
- **Cost per run** — a runaway loop shows up as a cost spike before it shows up as a complaint.
  (When cost matters, the [Haiku vs Sonnet math](/ai-agent-cost-math-when-haiku-beats-sonnet) is worth knowing.)

I track these the same way I track everything else — see [how I measure whether an AI agent is actually working](/how-i-measure-whether-an-ai-agent-is-actually-working/). The metric that catches a silent failure is worth ten that catch loud ones.

## The 5-minute triage checklist

When an agent breaks and I'm on the clock, I run this in order:

1. **Get the trace ID** for the failing run.
2. **Read the exact input** to the failing step. Is it well-formed? (Solves ~50% of cases here.)
3. **Check the tool results** in that trace for errors-disguised-as-success.
4. **Replay the step offline** at `temperature: 0`. Does it reproduce?
5. **If it reproduces,** it's a prompt/model issue — fix and re-run the trace corpus. **If it doesn't,** it's non-determinism or a state/orchestration bug — loop it 50× to characterize.

Disciplined isolation beats clever prompting every time. The model is rarely the problem; the system around it usually is.

## FAQ

### How do I debug an AI agent that fails only sometimes?

Capture the exact input from a logged trace and replay it 50+ times at temperature 0. Intermittent failures are real bugs with low fire-rates — volume turns the anecdote into a reproducible sample you can diff and fix.

### Is the bug usually in the model or in my code?

In my production agents, roughly 70% of apparent "AI bugs" are plumbing: malformed tool results, truncated inputs, swallowed exceptions, or lost state between steps. Rule out the input and tool layers before you suspect the model.

### What's the minimum logging I need to debug agents?

A trace ID on every run, plus structured logs of the trigger, every model call (full messages array), every tool call and its raw result, and the final output. If a step isn't logged, you can't debug it.

### How do I stop debugging against live production?
Build a replay harness that loads a logged trace and re-runs any single step offline using the captured inputs. It turns a slow, risky production round-trip into a fast local loop and becomes the seed of your regression suite.

---

## How to Measure Whether AI Search Is Actually Sending You Traffic

Source: https://alejandrorioja.com/how-to-measure-ai-search-traffic/
Published: 2026-06-08
Tags: GEO, Analytics
TL;DR: Most AI-search traffic shows up as a trickle of referrals from chatgpt.com, perplexity.ai, and claude.ai — but the bigger effect is dark: people read the AI's answer and never click. I measure both, using referrers for the clicks and brand-search lift for the influence.

_Updated June 2026._

**TL;DR:** Most AI-search traffic arrives as a thin stream of referrals from `chatgpt.com`, `perplexity.ai`, and `claude.ai` — easy to count once you know where to look. But the larger effect is **dark**: people read the AI's answer, absorb your brand, and never click. I track the clicks with referrer segments and the influence with brand-search lift, direct-traffic shifts, and citation monitoring. Counting only clicks badly undersells AI search.

**Operator's read:** I run a content engine and watch its analytics daily. The "is AI search sending traffic?" question has a frustrating answer: yes, but most of the value doesn't appear in your sessions report. Here's how I measure the part that does and infer the part that doesn't.

Everyone wants one number: "how much traffic is ChatGPT sending me?" The honest answer is that AI search produces two very different effects, and you need two different measurements. Conflate them and you'll either panic (the clicks look tiny) or fool yourself (you'll miss the real impact).

## Effect 1: Direct referrals — countable, and smaller than you'd hope

When someone clicks a citation inside ChatGPT, Perplexity, or a Claude answer, your analytics records a referrer. These are real, attributable sessions.
In GA4 or any analytics tool, build a segment that catches the AI engines:

```
session source matches any of:
  chatgpt.com
  chat.openai.com
  perplexity.ai
  claude.ai
  gemini.google.com
  copilot.microsoft.com
```

Save that as an "AI Search" channel and watch it over time. A few caveats that bite people:

- **Referrers leak.** Some AI surfaces strip or mangle the referrer, so a chunk of genuine AI clicks land in "Direct" instead. Your referral count is a floor, not the truth.
- **Volume is low relative to the answer impressions.** AI engines answer the question on the page; only the curious minority clicks through. A handful of daily referrals can correspond to far more people who saw you cited.

So the referral segment is necessary but insufficient. It tells you AI search is sending *some* traffic. It badly undercounts the influence.

## Effect 2: Dark influence — the bigger, harder-to-see half

The real action is zero-click. Someone asks ChatGPT a question, your brand appears in the answer as a recommended source, and they never click — they just remember you. That shows up later as a **branded search** or a **direct visit**, attributed to nothing. This is the same dynamic that made featured snippets frustrating to measure, amplified.

You can't measure dark influence directly, but you can triangulate it:

1. **Branded search volume.** Track searches for your name/brand in Google Search Console over time. If you start getting cited by AI engines and your branded impressions rise without a matching campaign, that lift is a fingerprint of AI influence.
2. **Direct-traffic trend.** A sustained rise in "Direct" sessions that doesn't track any campaign often reflects AI referrals stripped of their referrer plus people typing you in after an AI mention.
3. **Assisted conversions.** Look at whether AI-search sessions, even when rare, show up as the *first* touch in converting journeys. A channel that's tiny by last-click can be meaningful by first-touch.

None of these is a clean number. Together they tell you whether the dark half is moving.

## Track citations, not just clicks

Here's the metric I care about most for AI search, and it isn't in your analytics at all: **am I being cited, and for which queries?**

Maintain a list of the 20-40 queries that matter for your business and run them through ChatGPT, Perplexity, and Claude on a schedule — weekly is plenty. Log, for each query and engine: are you cited, and in what position? This is the GEO equivalent of rank tracking, and it's the leading indicator. Citations move *before* the downstream traffic and brand lift do, so this is where you see whether your [GEO work for local business](/geo-for-local-business-getting-a-brick-and-mortar-cited-by-ai-search/) is landing.

I built a small agent that runs these checks and logs the results — the kind of thing that's trivial once you have an agent stack. If you'd rather do it by hand, a spreadsheet and a weekly 30-minute pass works fine to start. The methodology mirrors my [ChatGPT vs Google citation test](/chatgpt-search-vs-google-50-term-test/), just run continuously instead of once.

## Build the dashboard: four numbers, weekly

I don't drown in metrics. For AI search I watch four things and review them weekly:

1. **AI referral sessions** — the countable clicks from the referrer segment. Trend, not absolute.
2. **Citation coverage** — % of my tracked queries where I'm cited across the three engines. The leading indicator.
3. **Branded search impressions** — from Search Console, as the dark-influence proxy.
4. **AI-sourced conversions** — even if small, whether AI sessions ever start a converting journey.

If citation coverage is rising while referral sessions stay flat, that's *not* a failure — it usually means the dark half is growing and the branded-search number should follow. If citation coverage is falling, that's an early warning to act on before any traffic number moves.
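To pin down what the citation-coverage number means, here's a rough sketch of how it could be computed from the weekly log. The `CitationCheck` shape, the `citationCoverage` function, and the sample rows are all hypothetical, and I'm assuming one particular definition: a query counts as covered if at least one engine in scope cites you. You may prefer a stricter definition.

```typescript
type Engine = "chatgpt" | "perplexity" | "claude";

// One logged check: did this engine cite you for this query, and where?
interface CitationCheck {
  query: string;
  engine: Engine;
  cited: boolean;
  position?: number; // 1-based position among cited sources, when cited
}

// Percentage of tracked queries cited by at least one engine in scope.
// Pass an engine to scope the calculation to that engine only.
function citationCoverage(checks: CitationCheck[], engine?: Engine): number {
  const scoped = engine ? checks.filter((c) => c.engine === engine) : checks;
  const tracked = new Set(scoped.map((c) => c.query));
  if (tracked.size === 0) return 0;
  const covered = new Set(scoped.filter((c) => c.cited).map((c) => c.query));
  return (covered.size / tracked.size) * 100;
}

// Placeholder rows for illustration only, not real results.
const sampleChecks: CitationCheck[] = [
  { query: "query A", engine: "chatgpt", cited: true, position: 2 },
  { query: "query A", engine: "perplexity", cited: false },
  { query: "query B", engine: "chatgpt", cited: false },
  { query: "query B", engine: "perplexity", cited: false },
  { query: "query C", engine: "chatgpt", cited: false },
  { query: "query C", engine: "perplexity", cited: true, position: 1 },
  { query: "query D", engine: "chatgpt", cited: false },
  { query: "query D", engine: "perplexity", cited: false },
];

const overallCoverage = citationCoverage(sampleChecks);
const chatgptCoverage = citationCoverage(sampleChecks, "chatgpt");
console.log({ overallCoverage, chatgptCoverage });
```

Scoping by engine matters because engines diverge on sources, so the overall figure can hide an engine where you're never cited. Tracking the trend week over week is more useful than any single reading.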
This is the same "measure the leading indicator" discipline I apply to agents in [how I measure whether an AI agent is actually working](/how-i-measure-whether-an-ai-agent-is-actually-working/).

## What to do with the numbers

Measurement is only useful if it changes what you do. The playbook:

- **Citation coverage low for a query you care about?** That's a content + [schema](/schema-markup-for-ai-engines-the-types-that-punch-above-their-weight/) problem. The page either doesn't exist, isn't structured for extraction, or isn't authoritative enough to get pulled into the answer.
- **Cited but no referral traffic?** Expected and fine — AI search is doing brand work, not click work. Don't "fix" it by chasing clicks; lean into being the cited source.
- **Referrals from one engine but not others?** Engines diverge hard on sources (I measured ~40% overlap between ChatGPT and Google). Being cited by one doesn't get you the others — work each engine's coverage separately.

## A note on attribution honesty

Resist the urge to claim precision you don't have. AI-search measurement in 2026 is triangulation, not attribution. Anyone selling you a clean "ChatGPT sent you X dollars" number is overstating what's knowable, because the referrers leak and the biggest effect is zero-click by design.

The right posture: count what you can count, watch the proxies for what you can't, and make decisions on the trend. The trend is trustworthy even when the absolute number isn't.

## FAQ

### How do I see traffic from ChatGPT or Perplexity in GA4?

Build a channel/segment matching the AI engine domains — chatgpt.com, chat.openai.com, perplexity.ai, claude.ai, gemini.google.com, copilot.microsoft.com — as session source. That captures the click-through referrals, though some are stripped to "Direct," so treat the count as a floor.

### Why is my AI-search referral traffic so low?

Because AI search is mostly zero-click — the engine answers on the page and only a minority clicks through.
Low referral counts often coincide with much larger citation impressions. Measure citations and branded-search lift to see the part referrals miss.

### What's the best leading indicator for AI search?

Citation coverage: the percentage of your tracked business-critical queries where you're cited across ChatGPT, Perplexity, and Claude. It moves before traffic and brand lift do, so it tells you early whether your GEO work is landing.

### Can I get exact revenue attribution from AI search?

No, not reliably in 2026. Referrers leak into Direct and most of the impact is zero-click by design. Treat AI-search measurement as triangulation — count clicks, watch branded-search and direct-traffic proxies, and decide on the trend, not a false-precise dollar figure.

---

## Multi-Agent Orchestration Patterns: Queues, State, and Handoffs

Source: https://alejandrorioja.com/multi-agent-orchestration-patterns-queues-state-handoffs/
Published: 2026-06-08
Tags: AI Agents, Operations
TL;DR: Reliable multi-agent systems aren't about clever prompts — they're about boring distributed-systems discipline: durable queues between agents, state held outside the model, and idempotent handoffs that survive retries. The model is the worker; the queue is the backbone.

_Updated June 2026._

**TL;DR:** Reliable multi-agent systems aren't won with clever prompts — they're won with boring distributed-systems discipline. Put a durable **queue** between agents, hold **state outside the model**, and make every **handoff idempotent** so a retry can't double-act. The model is the worker; the queue is the backbone. Get those three right and orchestration stops being scary.

**Operator's read:** Most of my 100+ agents are single-step. The ones that aren't — the pipelines that classify, then enrich, then act — only became reliable once I stopped thinking "prompt chain" and started thinking "job queue with LLM workers." This is the architecture, not the prompt engineering.

"Multi-agent" sounds like the agents talk to each other. In practice the reliable version is the opposite: agents don't talk directly at all. They drop messages on a queue and pick up work from a queue, and the orchestration lives in the plumbing between them. Here are the patterns that hold up in production.

## Pattern 1: Put a durable queue between every agent

The first instinct is to call agent B directly from inside agent A. Don't. Direct calls couple the two: if B is slow, A blocks; if B fails, A's work is lost; if you need to scale B, you can't without touching A.

Instead, A finishes its work and **enqueues a message** for B. B is a separate worker that drains the queue at its own pace.

```typescript
// Agent A finishes, hands off via the queue — no direct call to B
await env.ENRICH_QUEUE.send({
  traceId,
  type: "enrich",
  payload: classifierResult,
});
// A's job is done. B will pick this up independently.
```

On Cloudflare I use Workers Queues for exactly this — the same primitives behind [the agent stack I use](/the-agent-stack-i-use-to-run-30-production-agents-no-python/). The queue gives you four things for free: **buffering** (B can be down without losing work), **retries** (failed messages redeliver), **backpressure** (a spike queues instead of crashing), and **decoupling** (scale or redeploy B without touching A). Every one of those is something you'd otherwise have to build by hand and get wrong.

## Pattern 2: Hold state outside the model, always

The most common multi-agent bug is assuming the model remembers anything between steps. It doesn't. Each model call is stateless; the only memory is what you put in the prompt. So the source of truth for "where is this job in the pipeline" must live in a database, not in a conversation.
I keep a single job record that every agent reads and updates:

```typescript
interface JobState {
  traceId: string;
  stage: "classified" | "enriched" | "acted" | "done" | "failed";
  data: Record<string, unknown>; // per-job payload; key and value types assumed
  attempts: number;
  updatedAt: number;
}
```

Each agent does the same loop: **read** the job state, do its work, **write** the new state, enqueue the next stage. The model never holds the state — it receives the relevant slice as input and returns a result. This is what makes the system restartable: if a worker dies mid-job, the state record still says exactly where things stood, and the redelivered queue message picks up from there.

It also makes debugging tractable, because the state table is a queryable record of every job's journey — the same instrumentation mindset from [how I measure whether an agent is working](/how-i-measure-whether-an-ai-agent-is-actually-working/).

## Pattern 3: Make every handoff idempotent

Queues guarantee *at-least-once* delivery, not exactly-once. That means a message can be delivered twice — network blips, retries, redeploys. If your agent's action isn't idempotent, a double-delivery double-acts: two confirmation emails, two bookings, two charges. This is the single nastiest class of orchestration bug, and it's the one teams discover in production.

The fix is to make actions idempotent with a key:

```typescript
async function handleEnrich(msg: QueueMessage, env: Env) {
  const job = await getJob(env, msg.traceId);
  if (job.stage !== "classified") {
    // Already processed past this stage — this is a duplicate delivery. Skip.
    return;
  }
  const result = await enrich(job.data);
  await advanceJob(env, msg.traceId, "enriched", result);
  await env.ACT_QUEUE.send({ traceId: msg.traceId, type: "act" });
}
```

The stage check makes the operation safe to run twice: the second delivery sees the job has already advanced and no-ops. For external side effects (sending an email, charging a card), pass an idempotency key to the downstream API so *it* deduplicates too.
Assume every message will be delivered twice and design so that's harmless — because eventually it will be.

## Pattern 4: Orchestrator vs choreography — pick deliberately

There are two ways to wire the flow, and the right choice depends on complexity.

**Choreography** (what I default to): each agent knows only the next step and enqueues it. The flow emerges from the chain. Simple, decentralized, easy to extend — add a stage by inserting a queue. The downside is that no single place describes the whole flow, so a complex pipeline can get hard to reason about.

**Orchestration** (a central coordinator): one orchestrator owns the flow, calls each agent in turn, and decides what's next based on results. The whole flow lives in one readable place and branching logic is explicit. The cost is a central component that must itself be durable — if the orchestrator's own state isn't externalized (Pattern 2), it becomes the single point of failure.

My rule: **choreography until branching gets complex, then a durable orchestrator.** A linear three-stage pipeline is choreography. A flow with conditional routing, parallel fan-out, and joins wants an orchestrator whose state lives in the database so it can resume after a crash.

## Pattern 5: Fan-out, fan-in without losing pieces

When one job spawns N parallel sub-tasks (enrich 50 records, summarize 20 docs) and you need to wait for all of them before continuing, you need a **join**. The trick is a counter in the job state:

1. Parent enqueues N child messages and writes `expected: N, completed: 0` to the job record.
2. Each child does its work and **atomically increments** `completed`.
3. The child that bumps `completed` to equal `expected` enqueues the next stage.

The atomic increment is load-bearing — without it, two children finishing simultaneously can both think they're not the last, and the join never fires. Use a counter the datastore can increment atomically, or a transaction.
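The three steps above can be sketched in a few lines. This is an in-memory illustration, not my production code: `InMemoryJobStore`, `onChildDone`, and `nextStageQueue` are hypothetical stand-ins for a real datastore and queue. The in-memory increment is atomic only because JavaScript runs it synchronously; a real datastore would need its own atomic increment or a transaction.

```typescript
interface JoinState {
  expected: number;
  completed: number;
}

// Hypothetical in-memory stand-in for the job table.
class InMemoryJobStore {
  private joins = new Map<string, JoinState>();

  // Step 1: the parent records how many children it fanned out to.
  startFanOut(traceId: string, expected: number): void {
    this.joins.set(traceId, { expected, completed: 0 });
  }

  // Step 2: increment and read back as a single operation, so no two
  // children can ever observe the same `completed` value.
  incrementCompleted(traceId: string): JoinState {
    const join = this.joins.get(traceId);
    if (!join) throw new Error("unknown job " + traceId);
    join.completed += 1;
    return { expected: join.expected, completed: join.completed };
  }
}

// Stand-in for the next stage's queue.
const nextStageQueue: string[] = [];

// Step 3: only the child whose increment reaches `expected` fires the join.
function onChildDone(store: InMemoryJobStore, traceId: string): void {
  const { expected, completed } = store.incrementCompleted(traceId);
  if (completed === expected) {
    nextStageQueue.push(traceId);
  }
}

// Simulate three children finishing for one parent job.
const jobStore = new InMemoryJobStore();
jobStore.startFanOut("job-1", 3);
for (let i = 0; i < 3; i++) {
  onChildDone(jobStore, "job-1");
}
console.log(nextStageQueue);
```

One caveat on this sketch: because queues deliver at least once, a child message that arrives twice would bump the counter twice. Each child still needs its own duplicate check (Pattern 3) before it increments.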
This pattern lets you parallelize the expensive middle of a pipeline (often Haiku-cheap work — see the [Haiku vs Sonnet cost math](/ai-agent-cost-math-when-haiku-beats-sonnet)) while keeping a clean join at the end. ## What I'd skip You don't need a heavyweight agent framework to do any of this. Queues, a state table, and idempotency keys are primitives every platform already has. I've watched teams reach for elaborate multi-agent frameworks to get features a queue gives you for free, and inherit a black box that's harder to debug than the plumbing it replaced. Start with the boring primitives. Reach for a framework only when you've felt a specific pain it solves. The summary: agents are stateless workers, queues are the durable backbone, state lives in a database, and every handoff is safe to run twice. That's the whole game. ## FAQ ### Should agents call each other directly or go through a queue? Through a queue. Direct calls couple agents — one's failure or slowness propagates to the other, and you can't scale or redeploy independently. A durable queue gives you buffering, retries, backpressure, and decoupling for free. ### Where should multi-agent state live? Outside the model, in a database, as a job record each agent reads and updates. Model calls are stateless, so the source of truth for pipeline progress must be external — that's what makes the system restartable after a crash. ### How do I prevent an agent from acting twice on the same job? Make handoffs idempotent. Check the job's stage before acting and no-op if it's already advanced, and pass idempotency keys to external APIs. Queues deliver at-least-once, so assume every message can arrive twice and design so duplicates are harmless. ### Do I need a multi-agent framework? Usually no. Durable queues, a state table, and idempotency keys cover most production needs with primitives your platform already provides. Adopt a framework only when you hit a concrete problem it uniquely solves, not by default. 
--- ## The Eval Harness I Use to Ship AI Agents Without Fear Source: https://alejandrorioja.com/the-eval-harness-i-use-to-ship-ai-agents/ Published: 2026-06-08 Tags: AI Agents, Operations TL;DR: Shipping agents without fear comes from one thing: an eval harness. A fixed set of graded test cases, scored automatically (assertions plus an LLM judge), run before every prompt or model change. If the score holds, ship. The test set is built from real production failures. ## Table of contents _Updated June 2026._ **TL;DR:** The reason I can change a prompt or swap a model on a live agent without holding my breath is one thing: an **eval harness**. A fixed set of graded test cases, scored automatically — hard assertions where I can write them, an LLM judge where I can't — run before every change. Score holds, I ship. Score drops, I don't. The test set isn't synthetic; it's built from real production failures, so every bug becomes a permanent regression test. **Operator's read:** Across 100+ agents, the difference between the ones I touch confidently and the ones I'm scared of is whether they have evals. No eval harness means every prompt tweak is a gamble. An eval harness turns "I think this is better" into "this is measurably 4 points better and broke nothing." That's the whole unlock. You wouldn't ship code without tests. People ship agents without evals constantly, then wonder why a "tiny prompt tweak" broke production. An eval harness is the test suite for non-deterministic software. Here's the one I actually run. ## Start with a test set built from real failures The harness is only as good as its test cases, and the best test cases come from production, not your imagination. 
Every time an agent fails in the wild, I capture the exact input (I log every run with a trace ID; see [how to debug an agent in production](/how-to-debug-an-ai-agent-in-production)) and turn it into an eval case:

```typescript
interface EvalCase {
  id: string;
  input: AgentInput;        // the exact production input
  expected?: string;        // ground truth, when there is one
  assertions: Assertion[];  // hard checks that must pass
  rubric?: string;          // for the LLM judge, when output is open-ended
}
```

Two practices matter here. **Pull from production**, so your evals test what actually breaks, not what you guessed might. And **cover the spread**: happy path, edge cases, adversarial inputs, and the empty/malformed inputs that cause silent failures.

A test set of 30-50 well-chosen cases catches far more than 500 lazy ones. I'd rather have 40 cases that each represent a real failure mode than a thousand that all test the same easy path.

## Score with assertions first, an LLM judge second

Not every output needs a model to grade it. I reach for the cheapest scorer that works.

**Hard assertions** for anything structured. Does the output parse as valid JSON? Does it contain the required field? Is the extracted date in range? Did it call the right tool with the right arguments? These are deterministic, free, and unambiguous, so write as many as you can.

```typescript
const assertions: Assertion[] = [
  (out) => isValidJSON(out),
  // `in` checks keys, so this assumes ALLOWED_CATEGORIES is an object keyed by category name
  (out) => parse(out).category in ALLOWED_CATEGORIES,
  (out) => parse(out).confidence >= 0 && parse(out).confidence <= 1,
];
```

**An LLM judge** for the open-ended rest: tone, helpfulness, "did this actually answer the question." Here you give a model the input, the output, and a rubric, and ask it to score.
Two rules keep the judge honest: make the rubric **specific** (a 1-5 scale with described anchors beats "rate the quality"), and use a **strong model as the judge**. Judging is a reasoning task, so this is a place I happily pay for Sonnet even when the agent itself runs on Haiku per the [cost math](/ai-agent-cost-math-when-haiku-beats-sonnet). A vague rubric or a weak judge gives you noise that looks like signal.

## Run the harness before every change

The harness exists to answer one question: *did this change make the agent better or worse?* So I run it before every prompt edit, model swap, or tool change.

```bash
# baseline on main
npm run eval -- --suite=booking-agent > baseline.json

# make the change, then re-run
npm run eval -- --suite=booking-agent > candidate.json

# compare
npm run eval:diff baseline.json candidate.json
```

The diff shows aggregate score, per-case pass/fail, and, crucially, **which specific cases regressed.** An aggregate that ticks up while three cases silently break is not an improvement; it's a trade I want to see and approve, not one that sneaks through. Watching the per-case diff is how you avoid "fixed one thing, broke two others," the failure mode that makes people afraid of their own prompts.

## Set a regression gate and let it block

Once you trust the harness, wire it into the path to production as a gate. My rule is blunt: **a change that drops the pass rate below the threshold, or below the current baseline, doesn't ship.** Not "I'll look into it later": it's blocked, same as a failing CI test.

```typescript
const PASS_THRESHOLD = 0.90; // 90% of cases must pass

// Block if the candidate misses the absolute threshold or does worse than the baseline.
if (candidate.passRate < PASS_THRESHOLD || candidate.passRate < baseline.passRate) {
  throw new Error(
    `Eval regression: candidate ${candidate.passRate}, baseline ${baseline.passRate}, threshold ${PASS_THRESHOLD}`
  );
}
```

This is what converts evals from a nice-to-have into the thing that lets you move fast. The gate is what makes "ship without fear" literally true: the worst case for a bad change is a red eval run, not a production incident.
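The per-case view that the diff step relies on can be sketched in a few lines. This is an assumption-laden illustration, not the actual `eval:diff` script: the `CaseResult` shape and the `regressedCases` name are hypothetical, and cases missing from the candidate run are simply ignored here.

```typescript
// Hypothetical result shape: one entry per eval case in a run.
interface CaseResult {
  id: string;
  passed: boolean;
}

// Returns the ids of cases that passed in the baseline but fail in the candidate.
function regressedCases(baseline: CaseResult[], candidate: CaseResult[]): string[] {
  const candidateById = new Map<string, boolean>(
    candidate.map((c): [string, boolean] => [c.id, c.passed])
  );
  return baseline
    .filter((b) => b.passed && candidateById.get(b.id) === false)
    .map((b) => b.id);
}
```

A non-empty result is worth blocking on even when the aggregate pass rate went up, since that is exactly the silent trade the aggregate hides.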
And because the test set grows every time something breaks, the gate gets stricter and more protective over time on its own. ## Account for non-determinism in scoring A subtlety that trips people up: the same input can score differently across runs because the model samples differently. If you run each case once, you'll see phantom regressions — a case "broke" that's really just sampling noise. Two mitigations. Run evals at **`temperature: 0`** to shrink variance (it won't fully eliminate it). And for cases you've seen flicker, **run them N times and take the pass rate**, not a single pass/fail. A case that passes 9/10 is in better shape than one that passes 5/10 even though both can show a green single run. This is the same volume-over-anecdote principle I use when [debugging intermittent failures](/how-to-debug-an-ai-agent-in-production) — one run is an opinion, fifty runs are data. ## Close the loop with production monitoring The eval harness tests against known cases. Production throws novel ones. So the loop is: monitor live behavior, catch a new failure mode, turn it into an eval case, fix it, and now it's permanently guarded. The monitoring side — tracking success rate, output validity, and cost per run on live traffic — is what I cover in [how I measure whether an AI agent is actually working](/how-i-measure-whether-an-ai-agent-is-actually-working/). Evals and monitoring are two halves of the same system: monitoring finds the bugs, evals make sure they stay dead. That feedback loop is the real product. Any single eval set goes stale; a *process* that converts every production failure into a permanent test gets stronger every week. That's how an agent goes from "scary to touch" to something I'll refactor on a Friday afternoon without flinching. ## FAQ ### What goes into an AI agent eval set? 
Real production inputs turned into graded cases — happy path, edge cases, adversarial and malformed inputs — each with hard assertions and, for open-ended outputs, an LLM-judge rubric. 30-50 cases drawn from actual failures beat hundreds of synthetic ones that all test the easy path. ### Should I use an LLM to grade agent outputs? Use hard assertions wherever the output is structured (valid JSON, correct field, right tool call) — they're free and deterministic. Reserve an LLM judge for open-ended qualities like tone and helpfulness, with a specific rubric and a strong judge model so you get signal, not noise. ### How do I stop a prompt change from silently breaking production? Run the eval harness before every change and diff against a baseline, watching per-case regressions, not just the aggregate score. Then gate deploys on the result so any change that drops below the baseline threshold is blocked like a failing test. ### How do I handle non-determinism in evals? Run at temperature 0 to reduce variance, and for cases that flicker, run them multiple times and score the pass rate instead of a single run. A case that passes 9 of 10 times is healthier than one that passes 5 of 10, even if a single run shows both green. --- ## How to Automate Your Newsletter With an AI Agent Source: https://alejandrorioja.com/how-to-automate-your-newsletter-with-an-ai-agent/ Published: 2026-06-06 Updated: 2026-06-06 Tags: AI Agents, Growth TL;DR: A Claude agent reads my content queue, picks the strongest angle for the week, drafts a newsletter in my voice, segments the list by engagement tier, and schedules the send via the Kit API — all without me opening a composer. I review a rendered preview and hit approve. The hard creative work is mine; the mechanical execution is the agent's. 
## Table of contents

_Updated June 2026._

**TL;DR:** A Claude agent reads my content queue, picks the strongest angle for the week, drafts a newsletter in my voice, segments the list by engagement tier, and schedules the send via the Kit API, all without me opening a composer. I review a rendered preview and hit approve. The hard creative work is mine; the mechanical execution is the agent's.

**[Operator's read]** A newsletter that sends consistently beats one that's "better" but ships when inspiration strikes. The constraint was execution overhead, not ideas. I had ideas; I didn't have the bandwidth to format, schedule, and segment them every week. The agent eliminated that gap.

## The actual bottleneck in most newsletter workflows

Most newsletter automation advice focuses on the wrong thing: welcome sequences, automations, tagging logic. Those are fine, but they don't solve the week-to-week creation problem.

The real drag is this: you know what you want to say, but sitting down to format it, write the subject line variants, pick the right segment, and schedule it at the right time costs 2–3 hours of context-switching per week. Multiply by 52 weeks and that's 104 to 156 hours a year, roughly three to four 40-hour work weeks, spent just *sending* newsletters.

The agent handles every step after "I know what this week's angle is."

## The stack I'm using

- **[Kit](/recommends/convertkit)** (formerly ConvertKit): the email platform. Excellent API, solid subscriber tagging, clean analytics. The agent-friendly API is what sold me.
- **Claude (Anthropic SDK)**: the generation layer
- **Cloudflare Workers**: scheduled trigger (runs every Tuesday at 8am CT)
- **Airtable**: content queue and approval inbox

If you're not on Kit, the same pattern works with any platform that has a REST API for creating and scheduling broadcasts.

## Step 1: The content queue

The agent needs a source of truth for "what are we writing about."
Mine is an [Airtable](/recommends/airtable) table with columns:

- `Topic`: the angle or question
- `Status`: Queue / Approved / Sent
- `Tier`: whether this is for all subscribers or engaged-only
- `Notes`: any constraints (avoid this tone, include this link, etc.)

Each week, I spend 10 minutes adding 2–3 topics to the queue. That's my creative input. The rest is the agent's job.

## Step 2: The draft agent

```typescript
// workers/newsletter-agent/index.ts
import Anthropic from "@anthropic-ai/sdk";
import Airtable from "airtable";

// Bindings this Worker expects (configured as secrets/vars on the Worker).
interface Env {
  ANTHROPIC_API_KEY: string;
  AIRTABLE_API_KEY: string;
  AIRTABLE_BASE_ID: string;
  KIT_API_KEY: string;
  KIT_ENGAGED_SEGMENT_ID?: string;
}

const VOICE_SYSTEM = `You are writing a weekly newsletter for Alejandro Rioja's subscribers.
His audience: founders and operators interested in AI agents, SEO, and growing a one-person business.
Voice: direct, first-person, practitioner. No hype, no "exciting times," no excessive bullet lists.

Structure every newsletter as:
1. One-sentence hook (the problem or observation)
2. The core insight (3–5 paragraphs, no headers, conversational)
3. One concrete action the reader can take this week
4. A short sign-off (2 sentences max)

Subject line: specific, outcome-oriented, under 50 chars. No clickbait.

Return JSON: { "subject": "...", "preheader": "...", "body": "..." }`;

// Oldest queued topic first. The sort relies on a "Created" field existing in the table.
async function getNextTopic(): Promise<{ id: string; topic: string; notes: string; tier: string }> {
  const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY }).base(process.env.AIRTABLE_BASE_ID!);
  const records = await base("Newsletter Queue")
    .select({
      filterByFormula: "{Status} = 'Queue'",
      sort: [{ field: "Created", direction: "asc" }],
      maxRecords: 1,
    })
    .firstPage();
  if (!records.length) throw new Error("Queue is empty. Add topics.");
  const r = records[0];
  return {
    id: r.id,
    topic: r.get("Topic") as string,
    notes: (r.get("Notes") as string) ?? "",
    tier: (r.get("Tier") as string) ?? "all",
  };
}

async function draftNewsletter(topic: string, notes: string): Promise<{ subject: string; preheader: string; body: string }> {
  // Created per call (not at module scope) so it sees the env vars injected in scheduled().
  const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
  const msg = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: VOICE_SYSTEM,
    messages: [{
      role: "user",
      content: `Write this week's newsletter on: "${topic}". Additional notes: ${notes || "none"}`,
    }],
  });
  // Strip a Markdown code fence if the model wrapped its JSON in one.
  const text = (msg.content[0] as any).text.replace(/```json\n?/, "").replace(/```/, "").trim();
  return JSON.parse(text);
}

async function scheduleWithKit(draft: { subject: string; preheader: string; body: string }, tier: string): Promise<string> {
  const segmentId = tier === "engaged" ? process.env.KIT_ENGAGED_SEGMENT_ID : null;

  // Workers run in UTC, so compute the send time in UTC explicitly.
  const sendAt = new Date();
  sendAt.setUTCDate(sendAt.getUTCDate() + ((4 - sendAt.getUTCDay() + 7) % 7)); // next Thursday (today, if it is Thursday)
  sendAt.setUTCHours(14, 0, 0, 0); // 9am Central during daylight time (UTC-5); use 15 in standard time

  const payload: any = {
    broadcast: {
      subject: draft.subject,
      content: draft.body,
      description: draft.preheader,
      send_at: sendAt.toISOString(),
      email_layout_template: "minimal",
    },
  };
  if (segmentId) payload.broadcast.segment_id = segmentId;

  const res = await fetch("https://api.kit.com/v4/broadcasts", {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-Kit-Api-Key": process.env.KIT_API_KEY! },
    body: JSON.stringify(payload),
  });
  // Fail loudly instead of silently recording an empty broadcast id.
  if (!res.ok) throw new Error(`Kit API error: ${res.status}`);
  const data = await res.json();
  return data.broadcast?.id ?? "";
}

export default {
  async scheduled(_event: ScheduledEvent, env: Env) {
    // Copy Worker bindings into process.env (needs the nodejs_compat flag)
    Object.assign(process.env, env);

    const { id, topic, notes, tier } = await getNextTopic();
    const draft = await draftNewsletter(topic, notes);
    const broadcastId = await scheduleWithKit(draft, tier);

    // Mark as Approved in Airtable (not Sent; a human reviews the Kit preview before confirming).
    // KitBroadcastId must exist as a field in the table.
    const base = new Airtable({ apiKey: env.AIRTABLE_API_KEY }).base(env.AIRTABLE_BASE_ID);
    await base("Newsletter Queue").update(id, { Status: "Approved", KitBroadcastId: broadcastId });

    console.log(`Scheduled broadcast ${broadcastId} for topic: ${topic}`);
  },
};
```

## Step 3: The approval step

The agent creates the broadcast in Kit's draft state and marks the Airtable record as "Approved." Kit sends me a notification with a preview link. I click it, read it, and if it looks right, I confirm the send. If I want changes, I edit directly in Kit.

This is the gate that keeps the agent from going fully autonomous on outbound email. I trust the drafts about 90% of the time. The 10% I catch in review (a tone that's slightly off, a stat I want to verify, a link I want to add) is worth the 3-minute review.

## What the agent handles that I never want to do again

- Writing subject line variants and picking the best one
- Formatting the preheader text
- Computing the right send time (my audience opens Thursday mornings; the agent knows this)
- Segmenting correctly based on the topic's tier
- Logging everything to Airtable so I have a record

## What I still own

The *idea*. The topic in the queue is mine. The angle is mine. The agent is a great executor of a clear brief; it's not a strategy layer. If I put a bad topic in the queue, I get a well-written newsletter about a bad topic.

Also: the first-review gate. Every single send gets my eyes on it before it goes out. That's not going to change.
## The operator's bottom line If you're spending more than an hour a week on newsletter mechanics — formatting, scheduling, segmenting — you should automate it. The Kit API is clean, the Worker cron trigger is rock-solid, and the Claude draft quality is high enough that I approve ~90% of first drafts unchanged. Build the queue in Airtable, wire the Worker, and get back to creating ideas instead of executing sends. --- ## How to Rank in AI Search Without Writing a New Blog Post Source: https://alejandrorioja.com/how-to-rank-in-ai-search-without-writing-a-single-new-blog-post/ Published: 2026-06-06 Updated: 2026-06-06 Tags: GEO, SEO TL;DR: AI engines cite content that answers questions directly, claims clear authorship, and structures knowledge in a way that makes retrieval easy. Most existing blog posts can be retrofitted to meet all three criteria with edits, not rewrites. The playbook: add a direct TL;DR, tighten entity signals, add FAQ schema, and submit to llms.txt. New content is optional; restructuring is not. ## Table of contents _Updated June 2026._ **TL;DR:** AI engines cite content that answers questions directly, claims clear authorship, and structures knowledge in a way that makes retrieval easy. Most existing blog posts can be retrofitted to meet all three criteria with edits, not rewrites. The playbook: add a direct TL;DR, tighten entity signals, add FAQ schema, and submit to llms.txt. New content is optional; restructuring is not. **[Operator's read]** I ran this process on 341 existing posts before writing a single new GEO-targeted article. Citations in ChatGPT and Perplexity went up. New content accelerated gains — but the existing-content audit was where I started, and it paid off faster than I expected. ## Why AI engines aren't citing your existing content Before you write anything new, ask: why isn't what I already have getting cited? The answer is almost never "the content doesn't exist." It's usually one of these: 1. 
**No direct answer at the top** — the post buries the answer in paragraph 6 2. **Weak authorship signals** — no clear author entity, no credentials in the content 3. **Structural noise** — long intros, irrelevant sections, no clear heading hierarchy 4. **No machine-readable Q&A** — AI engines like structured question-answer pairs; most blog posts don't have them 5. **Not in any AI-readable index** — no llms.txt, no sitemaps the crawlers find All five are fixable on existing content. None require a new post. ## The four-step retrofit process ### Step 1: Add a direct TL;DR in the first 100 words AI engines do something analogous to what you do when you're skimming — they look for the direct answer before going deeper. If your post starts with a story, a question, or context-setting, the model may never read far enough to find your actual answer. Fix: Add a **TL;DR** block in the first 100 words. Format: takeaway → why → constraint or caveat. Two to four sentences. No fluff. Example before: > *Have you ever wondered why some businesses seem to dominate Google's search results? In this post, we'll explore the strategies that top-ranking sites use...* Example after: > **TL;DR:** Three things move the needle for local SEO in 2026: Google Business Profile completeness, citation consistency across directories, and structured schema for your NAP data. Tactics like "post every day" and "get 100 reviews fast" are secondary to those three. The ceiling is your GBP accuracy — fix that first. The rewrite isn't longer. It's just front-loaded. ### Step 2: Tighten your entity signals AI engines build a knowledge graph. They want to know: who wrote this, what is it about, and is the author credible on this topic? 
For author entity: make sure your About page is linked from every post, your author schema includes `sameAs` links to LinkedIn and Twitter, and your author bio on each post mentions specific credentials (not "marketing professional" but "ran SEO for three SaaS companies from 0 to 100K monthly visitors").

For topic entity: use the exact terms your audience searches for. If you're covering "GEO" (generative engine optimization), say "generative engine optimization" somewhere, not just the abbreviation. Models use term co-occurrence to classify content.

### Step 3: Add FAQ schema to every post that answers questions

FAQPage schema is the highest-leverage schema type for GEO citation because it explicitly maps question to answer in a format models can parse directly.

Take the 3–5 questions your post implicitly answers and make them explicit:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How long does it take to rank in AI search?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Most sites see initial citation improvements within 4–8 weeks of restructuring existing content for direct answers and adding FAQ schema. Brand-new domains take longer; expect 3–6 months before consistent citations appear."
      }
    }
  ]
}
```

Add this to your post's `<head>` inside a `<script type="application/ld+json">` tag, or via your CMS's schema field. Every major AI engine crawls and parses this.

### Step 4: Submit to llms.txt and your platform's AI index

`llms.txt` is an emerging standard: a plain-text file at `yoursite.com/llms.txt` that tells AI crawlers which content is high-quality and how to prioritize it. It's analogous to `robots.txt` but for LLMs.
A basic llms.txt: ``` # llms.txt # alejandrorioja.com — AI agents and GEO for operators ## Priority content - /blog/geo-for-local-business (definitive guide, updated monthly) - /blog/schema-markup-for-ai-engines (technical reference) - /blog/how-to-get-cited-by-chatgpt (step-by-step) ## Author Alejandro Rioja — operator, AI agent builder, GEO practitioner. LinkedIn: https://linkedin.com/in/alejandrorioja ``` Pair this with a clean sitemap that includes `lastmod` timestamps. AI crawlers deprioritize content that looks stale. ## How to prioritize which posts to retrofit Not every post is worth retrofitting. Focus your first pass on: 1. **Posts that already rank on page 1 for a question-format keyword** — these are closest to being cited; they just need the structure fix 2. **Posts on topics you're verifiably credible on** — AI engines weight authorship heavily; a post where your credentials are relevant gets a citation lift from entity signals 3. **Posts that directly answer a question vs. posts that inform** — "How to do X" and "What is X" retrofit better than listicles or opinion pieces Use your Search Console data: filter for queries that are questions (how, what, why, best way to). Posts ranking 5–15 for those queries are your best retrofit candidates — they're relevant but not yet close enough to the top to get cited. ## The mistake most people make They write a new post optimized for AI search before retrofitting their existing archive. New content helps, but the existing posts have age, backlinks, and crawl history on their side. A well-structured three-year-old post will outperform a new post on the same topic for months. Do the retrofit first. Write new content where there are genuine gaps — questions your existing posts don't answer at all. That's when new is better than old. ## The operator's bottom line If you have more than 20 existing blog posts, your GEO work starts with audit and retrofit, not a content calendar. 
Add TL;DRs, tighten entity signals, add FAQ schema, and submit to llms.txt. Do that on your top 20 posts before writing anything new. You'll see citation improvements in weeks, not months — and you'll have a cleaner baseline for measuring whether new content actually moves the needle. --- ## I Built a Claude Skill That Runs My Facebook Ads — Here's the Code Source: https://alejandrorioja.com/i-built-a-claude-skill-that-runs-my-facebook-ads-heres-the-code/ Published: 2026-06-06 Updated: 2026-06-06 Tags: AI Agents TL;DR: I built a Claude skill that reads my Meta Ads account via the Graph API, identifies underperformers, rewrites ad copy in my brand voice, and creates new ad sets without me touching Ads Manager. The whole thing is under 300 lines of TypeScript. The ROI was immediate: I cut weekly ads-management time from ~3 hours to about 20 minutes. ## Table of contents _Updated June 2026._ **TL;DR:** I built a Claude skill that reads my Meta Ads account via the Graph API, identifies underperformers, rewrites ad copy in my brand voice, and creates new ad sets without me touching Ads Manager. The whole thing is under 300 lines of TypeScript. The ROI was immediate: I cut weekly ads-management time from ~3 hours to about 20 minutes. **[Operator's read]** I run ads for Pickleland and for my consulting brand. Two accounts, different audiences, constant creative fatigue. I was spending Sunday afternoons in Ads Manager doing things a model should be doing. So I automated it. ## Why I stopped managing Facebook ads manually The actual work of running Facebook ads breaks into three jobs: 1. **Monitoring** — checking which ad sets are burning money vs. printing it 2. **Diagnosing** — figuring out *why* something is underperforming (creative fatigue? bad targeting? landing page?) 3. **Iterating** — writing new copy, creating new ad sets, adjusting budgets Job 1 is mechanical. Job 3 is mostly mechanical (with a voice constraint). 
Job 2 needs judgment — and it's the only one that benefits from a human being in the loop. A Claude skill can do 1 and 3. I review job 2 outputs before anything ships. That's the architecture I landed on. ## The Meta Graph API setup (this is the annoying part) Before any code: you need a Meta Business account, a System User, and a permanent access token. Facebook's dev portal is hostile but the path is: 1. Create a **Meta App** at developers.facebook.com (type: Business) 2. Add the **Marketing API** product 3. Under your Business Portfolio → Settings → Users → System Users, create a system user and give it `ADVERTISER` role on your ad account 4. Generate a token with these permissions: `ads_read`, `ads_management`, `business_management` Store the token as `META_ACCESS_TOKEN` and your ad account ID (format: `act_XXXXXXXX`) as `META_AD_ACCOUNT_ID` in your `.env`. ## The skill file structure ``` .claude/skills/fb-ads/ SKILL.md ← instructions Claude reads index.ts ← the actual tool implementation types.ts ← shared types ``` The `SKILL.md` is what tells Claude when and how to use the skill. Mine says: ```markdown # Facebook Ads Manager Skill Use this skill when the user says "check my ads", "run ads report", "pause underperformers", or "write new ad copy". Never run this without explicit user instruction — it touches live ad spend. ## What it can do - Pull performance data for all active ad sets (last 7 or 30 days) - Flag ad sets with ROAS < 1.5 or CTR < 0.8% as underperformers - Rewrite ad copy for flagged creatives in Ale's voice - Create new ad sets with revised copy (PAUSED by default — you approve before activating) ## What it will NOT do - Change budgets on live ad sets without explicit confirmation - Activate new ad sets automatically - Delete anything ``` The "never activate automatically" constraint is non-negotiable. This skill creates things in PAUSED state. I review and activate manually. Anything touching live spend needs a human checkpoint. 
## The core TypeScript code

```typescript
// .claude/skills/fb-ads/index.ts
import Anthropic from "@anthropic-ai/sdk";

const BASE = "https://graph.facebook.com/v20.0";
const TOKEN = process.env.META_ACCESS_TOKEN!;
const ACCOUNT = process.env.META_AD_ACCOUNT_ID!;

interface AdSetPerformance {
  id: string;
  name: string;
  status: string;
  spend: number;
  impressions: number;
  clicks: number;
  conversions: number;
  roas: number;
  ctr: number;
  cpc: number;
}

// Small GET helper so API failures surface instead of parsing as empty data.
async function graphGet(url: string): Promise<any> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Graph API request failed: ${res.status}`);
  return res.json();
}

// `days` must match a Graph API date_preset window such as last_7d or last_30d.
async function getAdSetPerformance(days = 7): Promise<AdSetPerformance[]> {
  const fields = [
    "id",
    "name",
    "status",
    `insights.date_preset(last_${days}d){spend,impressions,clicks,actions,action_values}`,
  ].join(",");

  const url = `${BASE}/${ACCOUNT}/adsets?fields=${encodeURIComponent(fields)}&access_token=${TOKEN}&limit=100`;
  const data = await graphGet(url);

  return (data.data ?? []).map((adset: any) => {
    const ins = adset.insights?.data?.[0] ?? {};
    const spend = parseFloat(ins.spend ?? "0");
    const impressions = parseInt(ins.impressions ?? "0", 10);
    const clicks = parseInt(ins.clicks ?? "0", 10);

    // Sum purchase revenue and purchase count from the insights action arrays.
    const purchaseValue = (ins.action_values ?? [])
      .filter((a: any) => a.action_type === "purchase")
      .reduce((s: number, a: any) => s + parseFloat(a.value), 0);
    const purchases = (ins.actions ?? [])
      .filter((a: any) => a.action_type === "purchase")
      .reduce((s: number, a: any) => s + parseInt(a.value, 10), 0);

    return {
      id: adset.id,
      name: adset.name,
      status: adset.status,
      spend,
      impressions,
      clicks,
      conversions: purchases,
      roas: spend > 0 ? purchaseValue / spend : 0,
      ctr: impressions > 0 ? (clicks / impressions) * 100 : 0, // percent
      cpc: clicks > 0 ? spend / clicks : 0,
    };
  });
}

async function getAdCreatives(adsetId: string): Promise<{ id: string; body: string; title: string }[]> {
  const url = `${BASE}/${adsetId}/ads?fields=creative{body,title}&access_token=${TOKEN}`;
  const data = await graphGet(url);
  return (data.data ?? []).map((ad: any) => ({
    id: ad.id,
    body: ad.creative?.body ?? "",
    title: ad.creative?.title ?? "",
  }));
}

async function rewriteCopy(original: { body: string; title: string }, context: string): Promise<{ body: string; title: string }> {
  const client = new Anthropic();
  const msg = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    messages: [{
      role: "user",
      content: `You are rewriting a Facebook ad in Alejandro Rioja's voice: direct, operator-focused, no hype, results-first.

The ad is underperforming. Context: ${context}

Original title: ${original.title}
Original body: ${original.body}

Rewrite it. Keep it under 90 words for the body. Make the headline a specific outcome or number.

Return JSON: {"title": "...", "body": "..."}`,
    }],
  });
  // Strip a Markdown code fence if the model wrapped its JSON in one.
  const text = (msg.content[0] as any).text.replace(/```json\n?/, "").replace(/```/, "").trim();
  return JSON.parse(text);
}

export async function runAdsReport(days = 7) {
  const adsets = await getAdSetPerformance(days);
  const active = adsets.filter(a => a.status === "ACTIVE");
  // Same thresholds as SKILL.md: ROAS < 1.5 or CTR < 0.8% flags an underperformer.
  const underperformers = active.filter(a => a.roas < 1.5 || a.ctr < 0.8);
  const winners = active.filter(a => a.roas >= 1.5 && a.ctr >= 0.8);
  return { adsets: active, underperformers, winners, days };
}

export async function rewriteUnderperformers(report: Awaited<ReturnType<typeof runAdsReport>>) {
  const rewrites = [];
  for (const adset of report.underperformers) {
    const creatives = await getAdCreatives(adset.id);
    for (const creative of creatives) {
      const context = `ROAS ${adset.roas.toFixed(2)}, CTR ${adset.ctr.toFixed(2)}%, spend $${adset.spend.toFixed(0)} over ${report.days} days`;
      const newCopy = await rewriteCopy(creative, context);
      rewrites.push({ adsetId: adset.id, adsetName: adset.name, original: creative, rewritten: newCopy });
    }
  }
  return rewrites;
}
```

## How I use it day-to-day

The skill is invoked from Claude Code (my daily driver). A typical Monday morning session:

```
> check my ads from the last 7 days
```

Claude runs `runAdsReport(7)`, formats the results as a table, flags underperformers, and asks if I want rewrites. I say yes.
It generates new copy, shows me both versions side by side, and creates PAUSED ad sets with the new creative. I review them in Ads Manager, activate the ones I like, and archive the losers. Total time: 20 minutes. Zero Sunday afternoons in Ads Manager. ## What this doesn't replace The skill can't tell me whether a product-market fit problem is masquerading as a copy problem. If ROAS is bad across the board, that's a funnel or offer issue, not a headline issue. Claude will faithfully rewrite copy on a broken funnel — and the rewrites won't save it. The diagnostic step is still mine. I read the report, look at the funnel data, and decide whether we're iterating creative or solving something upstream. The agent is fast at everything *except* that judgment call. ## The operator's bottom line If you're running ads manually and touching Ads Manager more than twice a week, you're doing ops that a script should do. The Graph API is well-documented and the Meta permissions flow, while annoying, is a one-time setup. Build the skill in an afternoon. The payback in reclaimed time shows up in week one. --- ## The 5 AI Tools I Actually Use to Run My Business (2026) Source: https://alejandrorioja.com/the-5-ai-tools-i-actually-use-to-run-my-business-2026-operator-stack/ Published: 2026-06-06 Updated: 2026-06-06 Tags: AI Agents, Growth TL;DR: Five tools: Claude (operator layer + coding), Cursor (TypeScript development), Airtable (data backbone for all agents), Kit (newsletter + email automation), and Cloudflare Workers (agent hosting). Everything else I've tried has been replaced by one of these or cut entirely. This is the stack I'd rebuild if I had to start over today. ## Table of contents _Updated June 2026._ **TL;DR:** Five tools: Claude (operator layer + coding), Cursor (TypeScript development), [Airtable](/recommends/airtable) (data backbone for all agents), [Kit](/recommends/convertkit) (newsletter + email automation), and Cloudflare Workers (agent hosting). 
Everything else I've tried has been replaced by one of these or cut entirely. This is the stack I'd rebuild if I had to start over today. **[Operator's read]** I run two businesses: a personal AI-consulting brand (alejandrorioja.com) and Pickleland, a pickleball facility in Pflugerville, TX. Different contexts, different audiences, different ops. These five tools run both. I'm not listing them because they're trendy; I'm listing them because I've deleted their replacements. ## 1. Claude — the operator layer Claude (via Claude Code and the Anthropic SDK) is the brain of everything that moves. I use it in three modes: **Claude Code** is my daily driver for development. I write TypeScript, build agents, debug infrastructure issues, and manage content — all from the Claude Code interface. It's not just autocomplete; it's a collaborator that can read a 500-line file, understand intent, and propose a refactor I hadn't considered. **The Anthropic SDK** powers every agent I've built. My newsletter agent, my Facebook ads skill, my content pipeline, my OG card generator — all Claude on the backend. The model quality is high enough that I trust first drafts about 85% of the time. **Claude's voice and brand** judgment is underrated. When I'm writing something that needs to sound like me, I've found Claude + a detailed system prompt outperforms every other model I've tested. The trick is a specific, opinionated system prompt — not "write in a casual tone" but "write like Alejandro: direct, practitioner, no hype, numbered, first-person, with honest caveats." I pay for Claude Max. It's the most-used subscription I have, and the ROI is not close. ## 2. Cursor — where the TypeScript gets written Cursor is the IDE. I switched from VS Code about a year ago and haven't looked back. The tab completion is fast enough that it genuinely changes how I write code — I think at a higher altitude and let Cursor handle the syntactic boilerplate. The diff view for AI suggestions is clean. 
The multi-file context window means I can ask it to update a function and it updates the callers too. I don't use Cursor for architecture decisions. I still sketch those on paper or in Claude. But once the design is clear, Cursor is the fastest path from design to running TypeScript. The biggest unlock: Cursor + Claude Code in parallel. I use Claude Code for high-level planning and agent orchestration; I use Cursor for the implementation detail work. They don't conflict — they cover different altitudes. ## 3. Airtable — the data backbone Every AI agent I run needs a place to read from and write to. That place is [Airtable](/recommends/airtable). Here's what I use it for across both businesses: - **Content queue** — posts and newsletter topics in progress, with status tracking - **Booking records** — Pickleland court reservations synced from the booking system - **Affiliate link catalog** — 105+ slugs with metadata the content agent reads at generation time - **Agent audit log** — what ran, when, what it produced, any errors The API is clean and fast. Airtable is not a database for high-throughput workloads — but for agent side-tables, review queues, and human-in-the-loop approval workflows, it's exactly the right tool. The visual interface means I can inspect any table without writing a query. The alternative I tried: Notion databases. The Notion API is slower and the data model is clunkier for agent reads. Airtable wins for agent-adjacent data. ## 4. Kit — newsletter and email automation I switched to [Kit](/recommends/convertkit) (formerly ConvertKit) for one reason: the API is actually good. Most email platforms treat their API as an afterthought. Kit treats it as a first-class product. I can create broadcasts, schedule sends, segment by tag, and read analytics — all programmatically. My newsletter agent does all of this without me touching the composer. 
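To make "create broadcasts, schedule sends" concrete, here is a minimal sketch that builds the HTTP request for a scheduled broadcast without sending it. The endpoint path (`/v4/broadcasts`), the `X-Kit-Api-Key` header, and the body fields (`subject`, `content`, `send_at`) are assumptions to verify against Kit's current API reference. `buildBroadcastRequest` and `KIT_API_KEY` are hypothetical names, not the newsletter agent's actual code.

```typescript
// Sketch only: builds (but does not send) a request that schedules a Kit broadcast.
// Assumption: endpoint, header name, and body fields follow Kit's v4 API; verify before use.
const KIT_API_BASE = "https://api.kit.com/v4";

interface BroadcastDraft {
  subject: string;
  content: string; // HTML body of the email
  sendAt: Date;    // when the broadcast should go out
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildBroadcastRequest(apiKey: string, draft: BroadcastDraft): BuiltRequest {
  return {
    url: `${KIT_API_BASE}/broadcasts`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Kit-Api-Key": apiKey,
      },
      body: JSON.stringify({
        subject: draft.subject,
        content: draft.content,
        send_at: draft.sendAt.toISOString(),
      }),
    },
  };
}

// Hypothetical usage: an agent would pass req.url and req.init to fetch().
const req = buildBroadcastRequest(process.env.KIT_API_KEY ?? "test-key", {
  subject: "Weekly operator notes",
  content: "<p>Draft body goes here.</p>",
  sendAt: new Date("2026-06-15T14:00:00Z"),
});
console.log(req.init.method, req.url);
```

Keeping request construction separate from the network call makes the scheduling logic easy to test without touching a live list.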
Kit-specific things I use: - **Broadcasts API** — my agent creates scheduled broadcasts programmatically every week - **Subscriber tagging** — I tag subscribers by behavior (opened last 5 sends = "engaged"; hasn't opened in 60 days = "at-risk") and my agent targets segments accordingly - **Forms + landing pages** — clean, fast-loading, no-code. I don't mess with these programmatically; they just work. If you're on Mailchimp or a legacy platform: the migration is worth it. Mailchimp's API requires three extra calls to do what Kit does in one. ## 5. Cloudflare Workers — where the agents live Every scheduled agent runs on Cloudflare Workers. The pitch: global edge deployment, zero cold starts on the free tier, and a cron trigger system that actually works. My agents don't need a server. They need a scheduled function that runs reliably, can make external API calls, and costs close to nothing at my scale. Workers is the answer. What I have running on Workers: - **Content pipeline** — generates EN post, fans out to 12 translations, generates OG card - **Newsletter agent** — drafts and schedules the weekly send - **Facebook ads monitor** — reads performance, flags underperformers, notifies me - **Pickleland occupancy reporter** — reads booking data, sends me a daily summary Total monthly cost for all of this: ~$5. That's the paid Workers plan. The agents run reliably on the cron schedule; I've had one failure in six months (a DNS issue on Meta's side, not mine). ## What I cut and why **Zapier** — replaced by Workers + the respective platform APIs directly. Zapier adds latency, costs more at scale, and has a ceiling that Workers doesn't. **ChatGPT** — Claude's context window, tool use, and system prompt quality are better for the operator use case. I keep a ChatGPT tab for quick web searches but don't build on it. **Webflow** — moved my site to Astro + Cloudflare Pages. More control, better performance, build process I can script against. 
**Grammarly** — Claude does everything Grammarly does and keeps my voice better. ## The operator's bottom line The five tools above are not the newest or the most-discussed. They're the ones that held up under daily production use across two different businesses. Before adding a new tool to your stack, ask: which of these five could do this job? You'll be surprised how often the answer is "one of them already can." --- ## Why Your AI Agent Keeps Failing in Production (And How to Fix It) Source: https://alejandrorioja.com/why-your-ai-agent-keeps-failing-in-production-and-how-to-fix-it/ Published: 2026-06-06 Updated: 2026-06-06 Tags: AI Agents TL;DR: Most production agent failures come from five causes: brittle prompts that don't handle edge cases, missing retry logic for transient API errors, no observability so you can't see what's breaking, runaway loops with no exit condition, and tool definitions that are ambiguous enough that the model picks the wrong one. All five are fixable without changing models or frameworks. ## Table of contents _Updated June 2026._ **TL;DR:** Most production agent failures come from five causes: brittle prompts that don't handle edge cases, missing retry logic for transient API errors, no observability so you can't see what's breaking, runaway loops with no exit condition, and tool definitions that are ambiguous enough that the model picks the wrong one. All five are fixable without changing models or frameworks. **[Operator's read]** I run 30+ agents in production. I've had all of these failures. The ones that burned the most time weren't the exotic ones — they were the boring infrastructure failures I thought I'd handled. ## Failure 1: Brittle prompts that break on edge-case inputs A prompt that works on your test cases will fail on inputs you didn't anticipate. That's not a model limitation — it's an instruction-writing problem. 
**Symptoms:** The agent produces nonsense output, calls the wrong tool, or outputs malformed JSON when the input is slightly different from what you tested.

**Root cause:** Your system prompt describes the happy path only. It doesn't tell the model what to do when data is missing, malformed, or ambiguous.

**Fix:** Add explicit edge-case handling to your system prompt:

```
If the input data is missing a required field, return:
{ "status": "error", "reason": "missing_field", "field": "" }
Do NOT attempt to infer or hallucinate missing values.

If you are uncertain which tool to call, call no tool and return:
{ "status": "clarification_needed", "question": "..." }
```

The model follows explicit instructions for edge cases reliably. The mistake is assuming it will generalize the happy-path instructions to handle the messy cases.

## Failure 2: No retry logic for transient API errors

Every external API your agent calls will fail at some point. Claude's API, the Meta Graph API, your database: all of them return 5xx errors, time out, or rate-limit you. If your agent has no retry logic, one transient error kills the whole run.

**Symptoms:** Agent runs fail randomly at different steps. The logs show a 503 or 429 with no follow-up attempt.

**Fix:** Wrap every external call in an exponential-backoff retry:

```typescript
// Retry `fn` on transient failures (429, 5xx, connection reset) with exponential backoff plus jitter.
async function withRetry<T>(fn: () => Promise<T>, retries = 3, baseDelayMs = 500): Promise<T> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isTransient = err.status === 429 || err.status >= 500 || err.code === "ECONNRESET";
      // Give up immediately on non-transient errors, or once the retries are used up.
      if (!isTransient || attempt === retries) throw err;
      // With the defaults: roughly 500ms, 1s, 2s, plus up to 100ms of jitter.
      const delay = baseDelayMs * Math.pow(2, attempt) + Math.random() * 100;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("unreachable");
}

// Usage
const result = await withRetry(() => client.messages.create({ ... }));
```

Three retries with exponential backoff handle ~99% of transient failures.
Add this to every external call and half your random failures disappear.

## Failure 3: No observability, so you can't see what's breaking

This is the most common failure mode in production and the one that costs the most time to debug: the agent fails silently or produces wrong output, and you have no idea where in the chain it went wrong.

**Symptoms:** You know something is wrong but can't identify the step. You add `console.log` statements and re-run manually, trying to reproduce.

**Fix:** Structured logging on every step, with a run ID that traces the entire execution:

```typescript
function createLogger(runId: string, agentName: string) {
  return {
    step: (step: string, data: object) =>
      console.log(JSON.stringify({ runId, agent: agentName, step, ts: new Date().toISOString(), ...data })),
    error: (step: string, err: unknown) =>
      console.error(JSON.stringify({ runId, agent: agentName, step, error: String(err), ts: new Date().toISOString() })),
  };
}

const log = createLogger(crypto.randomUUID(), "newsletter-agent");

log.step("fetch_topic", { topicId: topic.id, topic: topic.name });
// ... do work ...
log.step("draft_complete", { subject: draft.subject, wordCount: draft.body.split(" ").length });
```

If you're on Cloudflare Workers, these logs go to Logpush or Workers Tail. If you're running locally or on a VPS, pipe them to a log aggregator. The structured JSON means you can filter by `runId` to see exactly what happened in a single run.

## Failure 4: Runaway loops with no exit condition

Agentic loops (where the model calls tools and iterates until a condition is met) can run forever if that condition is never met or the model misidentifies it.

**Symptoms:** Agent spends hundreds of dollars in API costs before timing out. Or it runs the same tool call over and over without making progress.
**Fix:** Always have a hard iteration cap and a progress check:

```typescript
const MAX_ITERATIONS = 10;
let iterations = 0;
let lastToolCallName = "";
let sameToolCallCount = 0;

while (true) {
  iterations++;
  if (iterations > MAX_ITERATIONS) {
    // Pass an Error so the logger's String(err) produces a readable message.
    log.error("loop", new Error("exceeded_max_iterations"));
    break;
  }

  const response = await client.messages.create({ ... });

  // Detect stuck loops: same tool called 3x in a row
  const toolCall = response.content.find(b => b.type === "tool_use");
  const toolName = toolCall?.name ?? "";
  if (toolName !== "" && toolName === lastToolCallName) {
    sameToolCallCount++;
  } else {
    // A different tool (or no tool) resets the streak; a tool call counts as the first of a new one.
    sameToolCallCount = toolName === "" ? 0 : 1;
    lastToolCallName = toolName;
  }
  if (sameToolCallCount >= 3) {
    log.error("loop", new Error(`stuck_loop: ${toolName}`));
    break;
  }

  if (response.stop_reason === "end_turn") break;

  // ... execute the tool call and append its result to the messages before the next iteration ...
}
```

This catches both "ran too long" and "spun in place" failure modes. The cap should be generous enough for the happy path but tight enough to limit blast radius.

## Failure 5: Ambiguous tool definitions the model resolves wrong

If you give the model two tools with overlapping descriptions, it will sometimes call the wrong one. This is especially common with tools like `search_database` vs `get_record` or `send_email` vs `create_draft`.

**Symptoms:** The model calls the right category of tool but picks the wrong specific one. Or it calls a tool in the wrong context (using a write tool when only reading was appropriate).

**Fix:** Make tool descriptions mutually exclusive and add explicit "when NOT to use this":

```typescript
const tools = [
  {
    name: "get_subscriber",
    description: "Fetch a single subscriber record by email. Use ONLY when you have a specific email address. Do NOT use for searching or listing subscribers.",
    input_schema: { ... }
  },
  {
    name: "search_subscribers",
    description: "Search subscribers by tag, segment, or status. Use when you need to find subscribers matching given criteria. Do NOT use when you already have a specific email address.",
    input_schema: { ... }
  }
];
```

The "do NOT use when X" clause is the part most people skip. It's the most important part. Models are better at following explicit negative constraints than inferring them from positive descriptions.

## One more thing: test your agents on bad inputs

Most agents are tested only on clean, happy-path inputs. Production has dirty inputs: empty strings, null fields, Unicode edge cases, API responses that return 200 but with an unexpected schema.

Add a test suite that explicitly exercises:

- Empty or null inputs
- Inputs at the maximum length you'd expect
- Inputs with special characters or non-ASCII text
- External APIs returning unexpected response shapes

If your agent breaks on any of these, fix it before it goes live. The production environment will find every assumption you made.

## The operator's bottom line

Most agent failures in production are infrastructure problems masquerading as model problems. Before you switch models, add retries, structured logging, loop caps, and explicit edge-case handling to your prompts. Fix the ambiguous tool definitions. Then test on bad inputs.

Do all of that before blaming the model. In my experience, the model is usually the last thing that needs to change.

---

## How to Build Your First AI Agent in 15 Minutes

Source: https://alejandrorioja.com/how-to-build-your-first-ai-agent-in-15-minutes/
Published: 2026-06-02
Updated: 2026-06-02
Tags: AI Agents
TL;DR: You don't need a framework, a course, or a PhD. You need Node.js, the Anthropic SDK, and about 30 lines of TypeScript. This tutorial builds a real, working agent: a structured content summarizer you can deploy to Cloudflare in the same session. The only prerequisite is a free API key.

## Table of contents

_Updated June 2026._

**TL;DR:** You don't need a framework, a course, or a PhD. You need Node.js, the Anthropic SDK, and about 30 lines of TypeScript. This tutorial builds a real, working agent: a structured content summarizer you can deploy to Cloudflare in the same session.
The only prerequisite is a free API key. **[Operator's read]** The most common thing I hear from founders who want to automate with AI is "I need to learn more first." You don't. The agent pattern is simple, and the fastest way to understand it is to build one. Here's the exact path I'd take if I were starting from zero today. ## Why most "build an AI agent" tutorials fail you They either use Python (fine for ML engineers, friction for everyone else), hide the real code behind a framework like LangChain, or build something too abstract to connect to your actual work. This tutorial does three things differently: 1. **TypeScript only** — if you've ever written JavaScript, you can follow this 2. **No framework** — you'll see every line of code that touches the model 3. **A useful output** — you'll build a structured summarizer you can actually use on customer emails, reviews, or meeting notes ## What you're building A **content summarizer agent**: paste any block of text, get back a structured summary in a consistent format. One HTTP request in, one clean summary out. Why this as a first project: the pattern — system prompt + user input → structured output — is the foundation of every agent I run. Swap the system prompt and you have a question-answerer, a tone rewriter, a classifier, or a draft generator. Learn this once and you've learned 80% of what production agents actually do. ## Prerequisites (2 minutes) - **Node.js 18+** — check with `node --version`. Install from nodejs.org if needed. - **An Anthropic API key** — sign up at [Claude](/recommends/claude), grab a key from the console. The free tier works. - A terminal and a text editor. No Docker. No virtual environment. No `pip install` anything. 
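If you want to confirm both prerequisites before writing any agent code, a small preflight check along these lines would do it. This is a sketch and not part of the tutorial's agent: `checkPrereqs` and the file name `preflight.ts` are hypothetical, and it assumes the key is exported as `ANTHROPIC_API_KEY`, the variable the run command in Step 3 uses.

```typescript
// preflight.ts (hypothetical file name): checks the two prerequisites listed above.
const MIN_NODE_MAJOR = 18;

// Returns a list of human-readable problems; an empty list means everything is in place.
function checkPrereqs(nodeVersion: string, apiKey: string | undefined): string[] {
  const problems: string[] = [];

  // nodeVersion looks like "20.11.1" (no leading "v").
  const major = Number.parseInt(nodeVersion.split(".")[0], 10);
  if (Number.isNaN(major) || major < MIN_NODE_MAJOR) {
    problems.push(`Node.js ${MIN_NODE_MAJOR}+ required, found ${nodeVersion}`);
  }

  if (!apiKey || apiKey.trim() === "") {
    problems.push("ANTHROPIC_API_KEY is not set");
  }

  return problems;
}

const problems = checkPrereqs(process.versions.node, process.env.ANTHROPIC_API_KEY);
console.log(problems.length === 0 ? "Prerequisites look good." : problems.join("\n"));
```

One way to run it without installing anything first is `npx tsx preflight.ts`. The check is a pure function so it can be reused in a deploy script later.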
## Step 1: Create the project (2 minutes)

```bash
mkdir my-first-agent && cd my-first-agent
npm init -y
npm install @anthropic-ai/sdk
npm install -D tsx typescript
```

Add a script to `package.json` so you can run the agent easily:

```json
{
  "scripts": {
    "agent": "tsx agent.ts"
  }
}
```

## Step 2: Write the agent (5 minutes)

Create `agent.ts` and paste this:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const SYSTEM_PROMPT = `You are a precise content summarizer. When given any block of text, return a structured summary in this exact format:

**One-line summary:**

**Key points:**
-
-
-

**Action item (if any):**

Be specific. No filler. Under 150 words total.`;

async function summarize(text: string): Promise<string> {
  const message = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 512,
    system: SYSTEM_PROMPT,
    messages: [{ role: "user", content: text }],
  });

  const block = message.content[0];
  if (block.type !== "text") throw new Error("Unexpected response type");
  return block.text;
}

const sample = `
Hey team, following up on the Q2 review meeting. We agreed to push the launch
to July 15th instead of June 30th due to the payment integration delay.
Marketing needs the new landing page copy by June 20th or we can't start
the email campaign. Budget for the launch campaign is confirmed at $8,000.
Please confirm receipt.
`;

// Wrapped in main() so the file runs whether or not the project is an ES module:
// top-level await fails under tsx when package.json is CommonJS, the `npm init -y` default.
async function main() {
  const result = await summarize(sample);
  console.log(result);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

## Step 3: Run it (1 minute)

```bash
ANTHROPIC_API_KEY=sk-ant-... npm run agent
```

Expected output:

```
**One-line summary:** Launch pushed to July 15th due to payment delay; landing page copy needed by June 20th to unblock email campaign.

**Key points:**
- Launch date moved from June 30th to July 15th
- Landing page copy deadline: June 20th (blocks email campaign)
- Campaign budget confirmed at $8,000

**Action item (if any):** Confirm receipt and deliver landing page copy by June 20th.
```

That's a working AI agent. Real input, custom system prompt, structured output. The whole thing is about 30 lines of code.

## Step 4: Customize it for your use case

The system prompt is the only thing that makes this agent yours. Here are three drop-in alternatives:

**Customer review classifier:** ```text Classify this customer review as POSITIVE, NEGATIVE, or MIXED. Then extract the main complaint or praise in one sentence. Format: SENTIMENT: