04 est. 45 min

Memory, tools, evals

Module 04 · est. 45 min · You’ll build: a memory file your agent reads and writes, a tool layer that fails safe instead of breaking, and a tiny eval harness that tells you when your agent is wrong — before your customers do.

TL;DR: Three things turn a clever script into an agent you trust. Memory: a place to store what it learned so it doesn’t repeat itself or forget context across runs (for most agents this is a file or a row in a table — not a vector database). Tools: the agent’s hands, built to fail safe and validate their inputs. Evals: a way to know when the agent is wrong, run automatically, so you find out before reality does. Skip evals and you’re flying blind; most “agent failures” are actually “I had no way to tell it was failing.”

[Operator’s read] I learned all three the expensive way. My early agents forgot everything between runs and re-did work. My tools threw raw errors that the model “helpfully” hallucinated around. And I had no evals, so I found out my review-reply agent was off-tone from a customer, not a test. Don’t repeat my tuition payments.


Memory: a brain that doesn’t forget

An agent without memory is Groundhog Day. It wakes up, does its job, forgets everything, and wakes up blank tomorrow. For some agents that’s fine (a pure summarizer doesn’t need to remember). For most, it’s the difference between useful and maddening.

First, kill the vector-database reflex. The internet will tell you “agent memory = embeddings + a vector DB.” For 90% of business agents, that’s wildly overbuilt. Vector search is for “find the semantically-similar needle in 100,000 documents.” You probably have memory needs like “what did I already reply to” and “what’s this customer’s history” — and those are a key-value lookup, not a similarity search. Reach for the simplest thing that works:

The memory ladder, cheapest first:

  1. A checkpoint — one value: “last item I processed.” (The polling marker from Module 03.) Solves 40% of memory needs alone.
  2. A JSON file or markdown file — append what the agent learns, read it at the start of the next run. Lives in the repo or in object storage. This is most of my memory.
  3. A row in a table (Airtable, a database, Cloudflare KV/D1) — when memory is structured and you query it: “this customer, this status, last touched when.”
  4. A vector store — only when you genuinely need semantic search over a large corpus. If you can’t articulate why a JSON file fails, you don’t need this yet.
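To make rung 1 concrete, here is a minimal sketch of a checkpoint. It assumes your items carry a sortable numeric `id`. The `Item` shape, the function names, and the file path are all illustrative, not something from Module 03.

```typescript
import { existsSync, readFileSync, writeFileSync } from 'node:fs';

// Hypothetical item shape: anything with an id that only ever increases.
type Item = { id: number; text: string };

// Read the last processed id; no file (or a malformed one) means "start from zero".
function loadCheckpoint(path: string): number {
  if (!existsSync(path)) return 0;
  const saved = JSON.parse(readFileSync(path, 'utf-8'));
  return typeof saved.lastId === 'number' ? saved.lastId : 0;
}

// Persist the marker so the next run picks up where this one stopped.
function saveCheckpoint(path: string, lastId: number): void {
  writeFileSync(path, JSON.stringify({ lastId }));
}

// Keep only items newer than the checkpoint, oldest first.
function unprocessed(items: Item[], lastId: number): Item[] {
  return items.filter((item) => item.id > lastId).sort((a, b) => a.id - b.id);
}
```

A run would call `loadCheckpoint`, process whatever `unprocessed` returns, then call `saveCheckpoint` with the highest id it handled. Saving after the work (not before) means a crash causes a repeat rather than a skipped item; pick whichever failure is cheaper for your agent.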

Memory as a file the agent reads and writes:

typescript
// memory.ts — dead simple, survives restarts, debuggable by eyeball
import { readFileSync, writeFileSync, existsSync } from 'fs';

const MEM = './memory.json';

export function loadMemory(): Record<string, any> {
  return existsSync(MEM) ? JSON.parse(readFileSync(MEM, 'utf-8')) : {};
}

export function remember(key: string, value: any) {
  const mem = loadMemory();
  mem[key] = value;
  writeFileSync(MEM, JSON.stringify(mem, null, 2)); // pretty-printed so YOU can read it
}

Then you give the agent memory as context at the top of its run, and a tool to write to it:

typescript
// In run.ts, inject memory into the first message:
const memory = loadMemory();
const messages = [{
  role: 'user',
  content: `Here is what you remember from past runs:\n${JSON.stringify(memory, null, 2)}\n\nNow run your job.`,
}];

// And expose a 'remember' tool so the agent can write back.
// Add this object to the tools array you pass to the model:
const rememberTool = {
  name: 'remember',
  description: 'Store a fact for future runs. Use for things you should not repeat or forget.',
  input_schema: {
    type: 'object',
    properties: { key: { type: 'string' }, value: {} }, // {} accepts any JSON value
    required: ['key', 'value'],
  },
};
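The tool definition only tells the model the tool exists; your loop still has to execute the call. Here is one way that could look. The `ToolCall` shape is an assumption (adapt it to whatever your SDK hands you), and `store` is injected so you can pass in `remember` from memory.ts.

```typescript
// Hypothetical shape of a tool call as your loop sees it.
type ToolCall = { name: string; input: unknown };

// `store` is whatever persists a key/value pair, e.g. remember() from memory.ts.
function makeToolHandler(store: (key: string, value: unknown) => void) {
  return function handleToolCall(call: ToolCall): string {
    if (call.name !== 'remember') {
      return `ERROR: unknown tool "${call.name}".`;
    }
    // Treat the input as untrusted: check the fields before writing anything.
    const input = call.input as { key?: unknown; value?: unknown } | null;
    if (!input || typeof input.key !== 'string' || input.key.trim() === '') {
      return 'ERROR: "key" must be a non-empty string.';
    }
    if (input.value === undefined) {
      return 'ERROR: "value" is required.';
    }
    store(input.key, input.value);
    return `OK: stored "${input.key}".`;
  };
}
```

The returned string goes back to the model as the tool result, so it learns whether the write landed.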

The pro move: prune memory. Memory that only grows becomes a token bomb: it costs more, and the model drowns in irrelevant history. Decide what’s worth keeping. My Instagram-growth agent remembers who it followed and when (so it can unfollow non-followers later) but forgets the reasoning, because it doesn’t need it. Memory is curation, not hoarding. A good memory file is short and high-signal.
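One way to sketch pruning: drop entries past a maximum age, then cap what's left to the newest N. This assumes each entry records when it was saved, which the simple memory.json above does not do yet; the `Entry` shape and `prune` name are illustrative.

```typescript
// Hypothetical entry shape: each remembered value carries its write time.
type Entry = { value: unknown; savedAt: number }; // savedAt = ms since epoch
type Memory = Record<string, Entry>;

// Keep entries no older than maxAgeMs, then keep only the newest maxEntries.
function prune(mem: Memory, now: number, maxAgeMs: number, maxEntries: number): Memory {
  const kept = Object.entries(mem)
    .filter(([, entry]) => now - entry.savedAt <= maxAgeMs)
    .sort(([, a], [, b]) => b.savedAt - a.savedAt) // newest first
    .slice(0, maxEntries);
  return Object.fromEntries(kept);
}
```

Age and count are the bluntest rules available. For an agent like the follow/unfollow one, you would more likely prune by meaning (delete the entry once the unfollow has happened), but a blunt cap still protects you from the token bomb.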

Tools: hands that don’t break

Your agent is only as reliable as its worst tool. A model can recover from a fuzzy prompt; it cannot recover from a tool that lies, throws cryptically, or does something irreversible by surprise. Four rules I follow for every tool.

Rule 1 — Validate inputs at the boundary. The model will, occasionally, call your tool with garbage — a malformed date, a missing field, a wildly out-of-range number. Validate before you act. Never trust the model’s input like it’s a typed function call; treat it like untrusted user input, because effectively it is.

typescript
async function sendInvoice(input: any): Promise<string> {
  // Validate FIRST. Return a useful error the MODEL can read and recover from.
  // Number.isFinite also rejects NaN, Infinity, and non-numbers such as "100".
  if (!Number.isFinite(input?.amount) || input.amount <= 0) {
    return `ERROR: amount must be a positive number, got ${JSON.stringify(input?.amount)}. Fix and retry.`;
  }
  if (!/^\S+@\S+\.\S+$/.test(input.email ?? '')) {
    return `ERROR: invalid email "${input.email}". Provide a valid address.`;
  }
  // ... only now do the real thing
}

Rule 2 — Errors are messages to the model, not exceptions to the void. When a tool fails, don’t throw and crash the loop. Return the error as a string the model can read. The model is smart — tell it “the API returned 429, rate-limited, wait and retry” and it often will. A thrown exception kills the agent; a returned error string lets it adapt. This is the single highest-leverage tool habit.
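You can enforce this habit in one place instead of remembering it in every tool. A minimal sketch, assuming your tools are async functions that return a string; the `Tool` type and `failSafe` name are mine, not from any SDK.

```typescript
// Assumed tool signature: takes whatever the model sent, returns text for the model.
type Tool = (input: unknown) => Promise<string>;

// Wrap a tool so a thrown exception becomes a readable result instead of killing the loop.
function failSafe(name: string, tool: Tool): Tool {
  return async (input) => {
    try {
      return await tool(input);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      return `ERROR in ${name}: ${reason}. Decide whether to retry, change the input, or stop.`;
    }
  };
}
```

Wrap every tool once when you register it, for example `failSafe('sendInvoice', sendInvoice)`, and the loop never sees a raw exception. The wrapper is a backstop: a hand-written message like "rate-limited, wait and retry" still beats a generic one, so keep returning specific errors from inside the tool where you can.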

Rule 3 — Make destructive tools loud and gated. Any tool that spends money, sends to customers, or deletes something gets a guardrail. The simplest: a dryRun flag that defaults to true.

typescript
async function sendEmail(input: any): Promise<string> {
  if (process.env.DRY_RUN !== 'false') {
    console.log(`[DRY RUN] would send to ${input.to}: ${input.subject}`);
    return 'DRY RUN: email composed but not sent. Looks good.';
  }
  return await actuallySend(input);
}

You develop and test with DRY_RUN on. You flip it off only when the agent has earned trust. Nearly every customer-facing agent I run started its life in permanent dry-run, printing what it would do, for days, before I let it touch a real send.

Rule 4 — One tool, one job, well-described. The description field is not documentation — it’s prompt. The model picks tools based on those descriptions. Vague description, wrong tool. “Sends an email” is weak. “Sends the finished digest to the operator’s Slack. Call exactly once, only after the digest is complete” tells the model when and when not to use it. Spend real effort here; it’s some of the highest-ROI writing in the whole agent.
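Here is roughly what that stronger description looks like as a full tool definition, in the same shape as the `remember` tool earlier. The tool name, the `digest` parameter, and the last sentence of the description are illustrative additions.

```typescript
// Hypothetical tool definition; the description is written as instructions to the model.
const postDigest = {
  name: 'post_digest',
  description:
    "Sends the finished digest to the operator's Slack. " +
    'Call exactly once, only after the digest is complete. ' +
    'Do not use it for drafts, questions, or status updates.',
  input_schema: {
    type: 'object',
    properties: {
      // Parameter descriptions are prompt too: say what a good value looks like.
      digest: { type: 'string', description: 'The complete digest text, ready to read as-is.' },
    },
    required: ['digest'],
  },
};
```

Notice the description covers three things: what the tool does, when to call it, and when not to.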

Evals: a way to know when it’s wrong

This is the part everyone skips and the part that separates pros from hobbyists. An eval is a test for non-deterministic output. Your agent won’t produce identical text every run, so you can’t assert equals. Instead you check properties: did it do the right kind of thing? Did it avoid the known failure modes?

You do not need a fancy eval framework. You need a folder of test cases and a script that runs your agent against each and checks properties. Here’s a real, minimal eval harness:

typescript
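// evals/run.ts: your agent's safety net
import { runAgent } from '../run'; // assumption: point this at wherever your agent's entry function is exported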
type Case = {
  name: string;
  input: string; // the source the agent will see
  checks: ((output: string) => boolean)[]; // properties that must hold
};

const cases: Case[] = [
  {
    name: 'empty source → must NOT invent content',
    input: '',
    checks: [
      (out) => /nothing|no items|empty|quiet/i.test(out), // acknowledges emptiness
      (out) => !/meeting|deadline|event/i.test(out), // did NOT hallucinate items
    ],
  },
  {
    name: 'normal day → leads with most important item',
    input: 'Minor: update the wiki. URGENT: payroll deadline today 5pm. FYI: new follower.',
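    checks: [
      (out) => {
        const text = out.toLowerCase();
        const payroll = text.indexOf('payroll');
        const wiki = text.indexOf('wiki');
        // payroll must appear, and before wiki if wiki is mentioned at all
        // (a bare indexOf comparison passes when payroll is missing, because -1 < anything)
        return payroll !== -1 && (wiki === -1 || payroll < wiki);
      },
    ],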
  },
  {
    name: 'off-tone trap → no profanity, stays on-brand',
    input: 'Customer review: "this place SUCKS, total ripoff!!!"',
    checks: [
      // doesn't fight back; a bare "wrong" would also flag "sorry something went wrong"
      (out) => !/\b(suck\w*|idiot\w*|stupid|you(['’]re| are) wrong)\b/i.test(out),
      (out) => /sorry|apolog|understand|reach out/i.test(out), // de-escalates
    ],
  },
];

let failed = 0;
for (const c of cases) {
  const output = await runAgent(c.input); // call your real agent
  const passed = c.checks.every((check) => check(output));
  console.log(`${passed ? 'PASS' : 'FAIL'}  ${c.name}`);
  if (!passed) {
    failed++;
    console.log(`  got: ${output}\n`);
  }
}
console.log(`\n${cases.length - failed}/${cases.length} passed`);
process.exit(failed > 0 ? 1 : 0);

Run npx tsx evals/run.ts and you get a pass/fail report. (The runner uses top-level await, so the file has to run as an ES module: set "type": "module" in package.json.) Wire it into CI (Module 07) and it runs on every change. Now when you tweak the prompt and accidentally break the empty-source handling, the eval catches it before the agent ships garbage to a customer.

The two checks every agent’s evals need:

  1. The hallucination check — feed it empty/thin input, assert it does NOT invent. This catches the #1 production failure.
  2. The known-failure check — every time the agent screws up in real life, add that exact case to the eval suite. Your evals grow from real failures. After six months, your eval folder is an encyclopedia of every way this agent has ever been wrong, and it can never regress into them again.
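Adding a known-failure case is easier to keep up when it means dropping a file in a folder rather than editing the runner. A sketch under assumptions: one JSON file per case, with regex sources for what the output must and must not contain. The file format and function names are hypothetical; the `Case` type matches the harness above.

```typescript
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

// Hypothetical on-disk format for one case.
type StoredCase = {
  name: string;
  input: string;
  mustMatch?: string[]; // regex sources the output must match
  mustNotMatch?: string[]; // regex sources the output must not match
};

type Case = { name: string; input: string; checks: ((output: string) => boolean)[] };

// Turn the declarative patterns into the check functions the runner expects.
function toCase(stored: StoredCase): Case {
  const must = (stored.mustMatch ?? []).map(
    (src) => (out: string) => new RegExp(src, 'i').test(out),
  );
  const mustNot = (stored.mustNotMatch ?? []).map(
    (src) => (out: string) => !new RegExp(src, 'i').test(out),
  );
  return { name: stored.name, input: stored.input, checks: [...must, ...mustNot] };
}

// Load every .json file in a folder as a case, in filename order.
function loadCases(dir: string): Case[] {
  return readdirSync(dir)
    .filter((file) => file.endsWith('.json'))
    .sort()
    .map((file) => toCase(JSON.parse(readFileSync(join(dir, file), 'utf-8'))));
}
```

When the agent fails in real life, paste the exact input into a new file, write down what the output should and should not contain, and the runner picks it up on the next run. Checks that need real logic (like the ordering check) can stay in code alongside the loaded ones.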

Use a model as the grader for fuzzy criteria. Some properties (“is this reply on-brand?”) are too fuzzy for a regex. For those, call Claude as a judge: feed it the output and a rubric, ask for pass/fail plus a reason. It’s another API call, but it lets you eval taste, not just keywords. Keep deterministic checks where you can (they’re free and fast) and use the model judge only for the genuinely subjective stuff.
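A model judge can be a small function. This sketch deliberately does not assume any particular SDK: `callModel` is a stand-in you supply (prompt in, reply text out), and the prompt wording and `Verdict` shape are illustrative.

```typescript
// Stand-in for your model call: takes a prompt, resolves to the reply text.
type CallModel = (prompt: string) => Promise<string>;

type Verdict = { pass: boolean; reason: string };

async function judge(output: string, rubric: string, callModel: CallModel): Promise<Verdict> {
  const prompt = [
    'You are grading the output of an automated agent against a rubric.',
    `Rubric:\n${rubric}`,
    `Output to grade:\n${output}`,
    'Reply with PASS or FAIL on the first line, then a one-sentence reason on the second line.',
  ].join('\n\n');

  const reply = (await callModel(prompt)).trim();
  const [first = '', ...rest] = reply.split('\n');
  // Anything that is not an explicit PASS counts as a failure, including malformed replies.
  const pass = /^PASS\b/.test(first.trim().toUpperCase());
  return { pass, reason: rest.join(' ').trim() || first.trim() };
}
```

Because the judge is async, the runner would need to await it instead of calling a plain boolean check. Treating a malformed reply as a failure is a deliberate choice here: a judge you can't parse shouldn't quietly turn the suite green.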

How the three work together

Memory, tools, evals aren’t three separate chores. They’re a system:

  • Tools define what the agent can do.
  • Memory defines what it knows across runs.
  • Evals define how you trust the combination.

When something goes wrong, the diagnosis is almost always one of the three: it acted twice (memory/checkpoint bug), it broke on weird input (tool validation bug), or it produced subtly-wrong output you didn’t catch (missing eval). Internalize that triage and debugging agents stops being mysterious.

Hands-on lab

Take the agent you built in Module 02 and harden it.

Step 1 — Add memory. Give it a memory.json and a remember tool. Make it store one useful thing across runs — for a digest agent, “the last thing I flagged, so I don’t re-flag it.” Run it twice and confirm the second run sees the first run’s memory.

Step 2 — Harden one tool. Pick your agent’s most dangerous tool (the one that sends or spends). Add input validation that returns a readable error string, and a DRY_RUN flag defaulting to safe. Deliberately feed it bad input and confirm it returns an error the model can act on instead of crashing.

Step 3 — Write three evals. A folder, a runner, three cases: (1) the empty/thin-input hallucination check, (2) one “it must lead with the important thing” check, (3) one check specific to your agent’s worst plausible failure. Run them. Make them pass.

Step 4 — Break it on purpose. Edit your prompt to introduce a regression (e.g. remove the “never invent” clause). Run the evals. Watch them go red. That red is the whole point — it’s proof you’ll know next time, before a customer does. Then fix it back to green.

Deliverable: your Module 02 agent, now with persistent memory, one fail-safe tool, and a green eval suite that includes a hallucination check. You can now trust this thing a little. Next module: why the prompt you wrote will still fail in production, and the structure that fixes it.
