
Prompting for production

Module 05 · est. 40 min · You’ll walk away with: the prompt skeleton I use for every production agent, and the ability to spot why a prompt that works in ChatGPT will quietly fail when it runs unattended.

TL;DR: ChatGPT prompts and production-agent prompts are different jobs. A ChatGPT prompt optimizes for one good answer with you sitting there to course-correct. A production prompt optimizes for correct behavior across thousands of inputs with nobody watching. The skeleton that survives production: define the role and the unattended context, give an explicit “done,” handle the empty/ambiguous case, forbid invention, specify output format exactly, and phrase rules as constraints, not suggestions. Vague prompts invite hallucination; over-stuffed prompts bury the rules that matter. The discipline is saying exactly enough.

[Operator’s read] I have prompts that have run thousands of times without me reading the output. The only reason that’s safe is that the prompt anticipated the weird inputs I never saw. Every clause in the skeleton below is scar tissue from an agent that did something dumb because I didn’t tell it not to.

Why your ChatGPT prompts will fail

You’re good at ChatGPT. You ask, it answers, if it’s off you say “no, more like this,” and you converge. That conversational repair is doing way more work than you realize. In production, there is no repair turn. The agent runs at 6am, on input you’ve never seen, and whatever it produces is the final answer. It ships. No “actually, try again.”

That changes everything about how you prompt. Three specific failures show up the moment you remove the human:

Failure 1 — The agent assumes there’s always something to do. You prompt “summarize today’s events.” In ChatGPT, there’s always input because you pasted it. In production, sometimes the calendar is empty — and an under-specified agent will invent events because the prompt implied events exist. The empty case doesn’t occur to you when you’re testing with real data. It occurs constantly in production.

Failure 2 — The agent never stops, or stops too early. Agents loop (Module 02). A ChatGPT prompt doesn’t need a stop condition because you read the answer and close the tab. A production prompt without an explicit “done” will re-call tools, second-guess itself, or trail off mid-task. You have to define the finish line.

Failure 3 — The output format drifts. In ChatGPT you eyeball the answer and copy the part you want. In production, something downstream consumes the output — a Slack message, an email body, a database write. If the format drifts (“Sure! Here’s your digest: …” one day, a markdown table the next), the downstream breaks. You need the format pinned, exactly.

The production prompt skeleton

Here’s the structure I use for every agent. Not every section every time, but this is the checklist. I’ll build it up clause by clause and explain why each one earns its place.

code

## ROLE & CONTEXT
You are {specific role} for {specific business}. You run {trigger: e.g.
"on a schedule, unattended, every morning at 6am"}. No human reviews your
output before it ships. Get it right the first time.

## INPUTS
You will receive:
- {input 1}: {what it is, what it might look like, including the empty case}
- {input 2}: ...
Inputs may be empty, partial, or malformed. Handle that explicitly (see RULES).

## JOB
Do exactly this, in order:
1. {step}
2. {step}
3. {step — usually the action via a tool}

## RULES (hard constraints, not suggestions)
- Never include anything you cannot support from the inputs. No guessing,
  no filling gaps with plausible-sounding content.
- If inputs are empty or unusable, {exact fallback behavior}. Do NOT proceed
  as if there were data.
- {domain rule}
- {safety rule, e.g. "never mention competitors / never quote a price"}

## OUTPUT FORMAT
{Exact format. Show a literal example. Specify what NOT to include —
no preamble, no "here's your...", no markdown if downstream is plain text.}

## DONE MEANS
You have {specific completion condition, usually "called {tool} exactly once"}
and have nothing left to do. Then stop.
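
A minimal sketch of filling this skeleton from code, assuming you keep each agent's details in a plain dictionary. The names `SKELETON`, `build_prompt`, and `digest_spec`, and the sports-club details, are hypothetical illustrations rather than part of the course material. The one design choice worth noting: `string.Template.substitute` raises `KeyError` when a placeholder has no value, so a prompt with a forgotten section fails at build time instead of shipping.

```python
from string import Template

# Condensed form of the six-section skeleton. $-placeholders are used instead
# of {braces} so literal braces in an output example need no escaping.
SKELETON = Template("""\
## ROLE & CONTEXT
You are $role for $business. You run $trigger. No human reviews your
output before it ships. Get it right the first time.

## INPUTS
You will receive:
$inputs
Inputs may be empty, partial, or malformed. Handle that explicitly (see RULES).

## JOB
Do exactly this, in order:
$steps

## RULES (hard constraints, not suggestions)
$rules

## OUTPUT FORMAT
$output_format

## DONE MEANS
You have $done_condition and have nothing left to do. Then stop.
""")


def bulleted(items):
    return "\n".join(f"- {item}" for item in items)


def numbered(items):
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))


def build_prompt(spec: dict) -> str:
    """Render the skeleton; raises KeyError if any section is missing."""
    fields = dict(spec)
    fields["inputs"] = bulleted(spec["inputs"])
    fields["steps"] = numbered(spec["steps"])
    fields["rules"] = bulleted(spec["rules"])
    return SKELETON.substitute(fields)


# Hypothetical agent spec, for illustration only.
digest_spec = {
    "role": "the morning digest writer",
    "business": "a sports club",
    "trigger": "on a schedule, unattended, every morning at 6am",
    "inputs": ["calendar: today's events; may be empty on weekends"],
    "steps": ["Read the calendar.", "Write the digest.", "Call send_digest."],
    "rules": [
        "Never include anything you cannot support from the inputs.",
        "If the calendar is empty, send exactly: No events today.",
    ],
    "output_format": "Plain text only. No preamble. No markdown.",
    "done_condition": "called send_digest exactly once",
}

prompt = build_prompt(digest_spec)
```

Printing `prompt` shows the rendered six-section text; deleting any key from `digest_spec` makes `build_prompt` raise before anything is sent to a model.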

Now the reasoning behind each section.

ROLE & CONTEXT — the “unattended” line is load-bearing. In my experience, telling the model “no human reviews this” changes how it behaves. It stops asking clarifying questions (useless when nobody’s there to answer), stops hedging with “you may want to consider,” and commits to a decision. A model that thinks it’s chatting behaves differently from one that knows it’s the last line of defense. State the trigger too: “you run every morning” frames the whole task.

INPUTS — describe the weird ones. Beginners describe the happy-path input. Pros describe the empty, partial, and malformed inputs, because those are what break things. “You’ll receive today’s calendar — which may be empty on weekends, may contain all-day events with no time, and occasionally contains a malformed entry from the import.” Every weird case you name is a failure you prevent.
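
The same idea applies to how the input is rendered before it reaches the model. A minimal sketch, assuming calendar entries arrive as dictionaries with optional `title` and `start` keys (that shape and the function name `render_calendar` are assumptions for illustration): the empty and malformed cases are spelled out in words, so the model never sees a blank section it might feel invited to fill.

```python
def render_calendar(events: list[dict]) -> str:
    """Turn raw calendar entries into the text placed under INPUTS."""
    if not events:
        # State emptiness explicitly instead of passing an empty string.
        return "CALENDAR: EMPTY (no events today)"
    lines = []
    for event in events:
        title = event.get("title")
        if not title:
            # Name the malformed entry so the model can skip it knowingly.
            lines.append("- MALFORMED ENTRY (no title); ignore it")
            continue
        # All-day events have no start time.
        start = event.get("start") or "all day"
        lines.append(f"- {start}: {title}")
    return "CALENDAR:\n" + "\n".join(lines)


print(render_calendar([]))
print(render_calendar([
    {"title": "Open play", "start": "09:00"},
    {"title": "Maintenance"},
    {"start": "10:00"},
]))
```

This does not replace the empty-case rule in the prompt; it only makes sure the rule has something unambiguous to trigger on.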

JOB — ordered steps, ending in the action. Number them. Models follow numbered steps more reliably than prose. The last step is almost always “call the tool that does the real thing,” which connects the thinking to the consequence.

RULES — constraints, phrased as hard limits. This is where you encode everything that should never happen. Three rules belong in nearly every agent:

The anti-hallucination rule (“never include what you can’t support from inputs”). Your single most important sentence.
The empty-case rule (“if inputs are empty, do X — do not proceed as if there were data”). Prevents Failure 1.
A safety rule specific to the domain (never quote prices, never name competitors, never send to more than N people, never spend over $X).

Phrase rules as absolutes. “Try to avoid X” gets ignored under pressure. “Never X” holds. The model treats imperative constraints differently from polite suggestions — use the imperative.

OUTPUT FORMAT — show, don’t describe, and say what to omit. Don’t write “respond professionally.” Show a literal example of the exact output you want, and explicitly list what must NOT appear: “No preamble. Do not start with ‘Here is.’ No markdown headers. Plain text only, because this goes straight into an SMS.” The “what not to include” list prevents Failure 3.
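
A pinned format is also something you can check mechanically before the output reaches the downstream. A minimal sketch, assuming a plain-text target such as the SMS example; the function name `format_violations` and the list of forbidden openers are illustrative assumptions, not a complete filter.

```python
import re

# Openers that signal chat-style preamble. Both straight and curly
# apostrophes are listed because models emit either.
FORBIDDEN_OPENERS = ("sure", "certainly", "here is", "here's", "here\u2019s")


def format_violations(text: str) -> list[str]:
    """Return a list of format problems; an empty list means the text passes."""
    problems = []
    stripped = text.strip()
    if not stripped:
        problems.append("empty output")
    if stripped.lower().startswith(FORBIDDEN_OPENERS):
        problems.append("starts with a preamble")
    if re.search(r"^#{1,6}\s", stripped, flags=re.MULTILINE):
        problems.append("contains a markdown header")
    if re.search(r"^\s*\|.*\|\s*$", stripped, flags=re.MULTILINE):
        problems.append("contains a markdown table row")
    return problems


print(format_violations("Sure! Here's your digest:\n# Today"))
print(format_violations("3 events today. First: 9:00 Open play."))
```

What to do on a violation (retry, fall back to a fixed message, or alert a human) is a separate decision; the sketch only detects the drift.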

DONE MEANS — the stop condition. “You have called send_digest exactly once and have nothing left to do.” This single line is what keeps the agent from looping or re-sending. Without it, you’re trusting the model to know when to quit, and it won’t do that reliably.
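
The stop condition can be mirrored in the loop that drives the agent, as a backstop for the prompt. A minimal sketch under stated assumptions: `next_action` stands in for whatever call asks the model for its next tool use, the action shape `{"tool": ..., "args": ...}` is hypothetical, and `MAX_STEPS` is an arbitrary cap.

```python
from typing import Callable

MAX_STEPS = 8  # arbitrary safety cap for this sketch


def run_agent(
    next_action: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., str]],
    final_tool: str = "send_digest",
) -> list[dict]:
    """Run tool calls until the final tool fires once, or the cap is hit."""
    history: list[dict] = []
    for _ in range(MAX_STEPS):
        action = next_action(history)  # stand-in for the model call
        result = tools[action["tool"]](**action["args"])
        history.append({"tool": action["tool"], "result": result})
        if action["tool"] == final_tool:
            # Done: the final tool has been called exactly once. Stop here,
            # even if the model would have kept going.
            return history
    raise RuntimeError(f"no {final_tool} call within {MAX_STEPS} steps")
```

Returning immediately after the final tool means a second send cannot happen inside this loop, whatever the model intended, and the cap turns an endless loop into an error you can alert on.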

The “few-shot” upgrade: show, don’t tell

When behavior is fuzzy and rules aren’t enough, give examples. Two or three input→output pairs teach tone and judgment faster than a paragraph of description. This is how I get review replies in the right voice:

code

## EXAMPLES (match this tone exactly)

Review: "Courts were great but the AC was struggling on Saturday."
Reply: "Really glad you enjoyed the courts! You're right about Saturday —
we've already got the AC issue scheduled for service this week. Thanks for
the heads up, and hope to see you back on court soon."

Review: "Booking system is confusing."
Reply: "Thanks for flagging this — booking should be the easy part. Mind
sending us a quick note about where you got stuck? We're actively improving
it and your specifics genuinely help."

Now reply to the review provided, matching this tone: warm, specific,
accountable, never defensive, never generic.

Two examples did more than three paragraphs of “be warm and professional” ever could. The model pattern-matches the register — sentence length, the accountability move, the specific callback. For any agent where tone or judgment matters, few-shot beats instructions. Pull your examples from your actual best work.
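
If your best replies live in a spreadsheet or a list, the examples block can be generated rather than hand-pasted. A minimal sketch, assuming examples are stored as (review, reply) pairs; the function name `few_shot_block` is hypothetical, and the layout simply reproduces the block shown above.

```python
def few_shot_block(examples: list[tuple[str, str]], tone: str) -> str:
    """Render (review, reply) pairs into an EXAMPLES section for the prompt."""
    parts = ["## EXAMPLES (match this tone exactly)", ""]
    for review, reply in examples:
        parts.append(f'Review: "{review}"')
        parts.append(f'Reply: "{reply}"')
        parts.append("")  # blank line between examples
    parts.append(f"Now reply to the review provided, matching this tone: {tone}.")
    return "\n".join(parts)


examples = [
    (
        "Courts were great but the AC was struggling on Saturday.",
        "Really glad you enjoyed the courts! You're right about Saturday. "
        "We've already got the AC issue scheduled for service this week.",
    ),
    (
        "Booking system is confusing.",
        "Thanks for flagging this. Mind sending us a quick note about "
        "where you got stuck?",
    ),
]

block = few_shot_block(
    examples, "warm, specific, accountable, never defensive, never generic"
)
print(block)
```

Keeping the pairs as data makes it easy to swap in better examples as you collect them, without touching the rest of the prompt.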

The over-stuffing trap (the opposite failure)

There’s a failure mode on the other side: cramming so much into the prompt that the model loses the thread. In a 2,000-word prompt with forty rules, some of them will get dropped or underweighted, and you won’t know which ones until production tells you. Signs you’ve over-stuffed:

Rules that contradict each other (the model picks one, and you can’t predict which).
Edge cases for inputs that can’t actually occur.
Re-explaining the same constraint three ways.

The fix: every clause must earn its place by preventing a real failure you’ve seen or can concretely predict. If you can’t name the failure a rule prevents, cut it. My best prompts are shorter than my early ones — not because I got lazy, but because I learned which clauses actually matter. Tight beats comprehensive.
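
One of the signs above, the same constraint restated several ways, can be roughly flagged in code. A minimal sketch, assuming your rules are kept as a list of strings; the word-overlap measure and the 0.6 threshold are arbitrary choices for illustration, and this will not catch contradictions or paraphrases that share few words.

```python
import re


def words(rule: str) -> set[str]:
    """Lowercased word set of a rule, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", rule.lower()))


def near_duplicates(rules: list[str], threshold: float = 0.6) -> list[tuple[int, int]]:
    """Return 1-based index pairs of rules whose word sets mostly overlap."""
    pairs = []
    for i in range(len(rules)):
        for j in range(i + 1, len(rules)):
            a, b = words(rules[i]), words(rules[j])
            # Jaccard similarity: shared words divided by all distinct words.
            if a and b and len(a & b) / len(a | b) >= threshold:
                pairs.append((i + 1, j + 1))
    return pairs


rules = [
    "Never quote a price.",
    "Never quote a price to customers.",
    "Never name competitors.",
]
print(near_duplicates(rules))
```

A flagged pair is a prompt to reread both rules and keep one, not proof that either is wrong.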

Prompt like you’re managing, not coding

The mental shift that makes you good at this: you’re not programming, you’re delegating to a fast, literal, capable employee who takes everything you say at face value and assumes nothing you didn’t say. It won’t read your mind. It won’t infer “obviously don’t email all 5,000 customers” unless you said so. But tell it clearly and it does the job the same way at 3am as it does while you’re watching.

So write prompts the way you’d brief a sharp new hire on a task you’re about to leave them alone with: the goal, the steps, the hard limits, what “done” looks like, and the two or three ways it could go wrong that they’d never guess. That brief is the prompt.

Hands-on lab

Rewrite your Module 02/04 agent’s prompt into the production skeleton.

Step 1 — Audit your current prompt against the skeleton. Which of the six sections do you have? Almost certainly you’re missing an explicit DONE, a real empty-case RULE, or a pinned OUTPUT FORMAT. Mark the gaps.

Step 2 — Add the three universal rules. The anti-hallucination rule, the empty-case rule, and one domain safety rule specific to your business. Phrase all three as “Never…” absolutes.

Step 3 — Pin the output. Replace any “respond professionally” vagueness with a literal example of the exact output, plus a “do not include” list. Make it match what the downstream actually needs.

Step 4 — Run your evals from Module 04. Your hardened prompt should pass them. Then write one new eval: feed the agent deliberately ambiguous input (a half-sentence, a contradictory instruction) and assert that it handles it gracefully (falls back, or flags the uncertainty in its output) instead of confidently making something up. For an unattended agent, asking a clarifying question is not a pass: nobody is there to answer.

Step 5 — The unattended test. Read your final prompt and ask: “If I handed this to a literal-minded stranger and left the room, would they do the right thing on a weird input?” If you hesitate, there’s a missing clause. Add it.
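
The new eval from Step 4 might look like the sketch below. It assumes your agent can be called as a function from input text to output text and that your empty-case rule specifies one exact fallback message; `FALLBACK`, `stub_agent`, and `failing_inputs` are hypothetical names, and the stub only stands in for your real agent call so the example runs on its own. Adapt the shape to whatever eval format you used in Module 04.

```python
from typing import Callable

# The exact fallback text your empty-case rule specifies (assumed here).
FALLBACK = "No digest today: input was empty or unusable."

# Deliberately empty or ambiguous inputs the agent must not "complete".
AMBIGUOUS_CASES = ["", "   ", "meeting at"]


def stub_agent(raw_input: str) -> str:
    """Stand-in for the real agent call; replace with your own."""
    text = raw_input.strip()
    if len(text.split()) < 3:
        return FALLBACK
    return f"Digest: {text}"


def failing_inputs(agent: Callable[[str], str], cases: list[str]) -> list[str]:
    """Return the cases where the agent did anything other than fall back."""
    return [case for case in cases if agent(case) != FALLBACK]


failures = failing_inputs(stub_agent, AMBIGUOUS_CASES)
assert failures == [], f"agent invented output for: {failures}"
```

Comparing against one exact fallback string is strict on purpose: any extra sentence the agent adds on an unusable input is content it could not have supported from the inputs.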

Deliverable: your agent’s prompt, restructured into the six-section skeleton, with the three universal rules, a pinned output format, and a new eval proving it handles ambiguity without hallucinating. Next module: the three things that kill agents before features ever do — cost, latency, and security — and where to put the human gate so none of them bankrupt or embarrass you.