Field Notes
Prompt Versioning Without the Hairball
How to keep a prompt from becoming 30 untracked variants in 30 places. A four-rule discipline that scales from one prompt to a hundred.
Six months into building an AI feature, a normal team has somewhere between five and forty prompts. They live in: a Notion doc, three Python files, a "/prompts" folder, a few PR descriptions, several Slack threads, and the prompt-management UI of a vendor someone trialed in March.
Nobody knows which one is in production. That is the hairball. Untangling it after the fact takes a week. Preventing it up front takes four rules.
Rule 1: Prompts live in code
Not in a docstring. Not in Notion. Not in a SaaS dashboard. In your repo, in a file, on a branch, behind code review.
The repo is the only place where you get diffs, history, blame, and atomic deploys. Every other location loses at least one of those, and the one it loses is the one you'll need at 11pm in week six.
A flat prompts/ directory with one file per prompt is fine. So is a single TypeScript or Python module of constants. The form matters less than the rule.
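For the module-of-constants form, a minimal sketch (the {{slot}} placeholders and the versioned shape are assumptions of this sketch, not requirements; the version field is explained under Rule 4):

// prompts.ts: the single canonical home for every prompt.
import { createHash } from "node:crypto";

// Pair each prompt's text with a content-derived version (see Rule 4).
const versioned = (text: string) => ({
  text,
  version: createHash("sha256").update(text).digest("hex").slice(0, 8),
});

export const prompts = {
  draftReply: versioned(`You are a support agent.
{{customerTone}}
Reply to: {{message}}`),
};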
Rule 2: One canonical version per name
At any moment, in production, prompts.draftReply resolves to exactly one string. Not "v3 for free users, v4 for paid, v5 for the experiment Alice is running."
If you need experiments, run them as named experiment branches at the call site:
const variant = chooseVariant(user, ["draftReply", "draftReplyTighter"]);
const prompt = prompts[variant];
Now there's still only one draftReply, and there's a clearly-named experimental sibling. The experiment ends when one of them is renamed back to draftReply or deleted.
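chooseVariant isn't defined in this post; one workable sketch buckets deterministically by user id, so each user sees a stable variant for the life of the experiment (the function name and shape are assumptions):

import { createHash } from "node:crypto";

// Deterministic bucketing: the same user id always maps to the same
// variant, so cohorts stay stable across requests and deploys.
function chooseVariant(user: { id: string }, variants: string[]): string {
  const byte = createHash("sha256").update(user.id).digest()[0];
  return variants[byte % variants.length];
}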
Why this matters: when your eval pass rate or your support tickets change, you need to be able to ask "which prompt was running?" and get a single answer. Variant-by-user without naming is the same hairball, just newer.
Rule 3: Build at compile time, not request time
The prompt your model sees should be assembled before the request fires — and the assembled string should be logged.
Anti-pattern:
const response = llm.complete(`You are a helpful assistant.
${getCustomerTone(user)}
Reply to: ${message}`);
Six months later, with five conditionals around tone, three retrieval injections, and a stale memo from a customer in April still being spliced in, no one can reconstruct what the model actually saw on a specific call. Not even with the request ID.
Pattern:
const prompt = buildPrompt({
  template: prompts.draftReply,
  customerTone: getCustomerTone(user),
  message,
});
log.promptVersion = prompts.draftReply.version;
log.renderedPrompt = prompt;
const response = llm.complete(prompt);
The build step is testable. The rendered prompt is logged. The version is logged. Future-you can answer "what did the model see?" with one query.
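buildPrompt is left undefined in the snippet above; a minimal version, assuming the {{slot}} placeholder syntax from the Rule 1 sketch and a template object that carries both text and version:

type PromptEntry = { text: string; version: string };

// Substitute {{slot}} placeholders with the provided values.
// A missing value throws, so a broken call site fails in tests,
// not silently in production.
function buildPrompt({
  template,
  ...values
}: { template: PromptEntry } & Record<string, unknown>): string {
  return template.text.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    const value = values[key];
    if (typeof value !== "string") {
      throw new Error(`missing prompt value: ${key}`);
    }
    return value;
  });
}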
Rule 4: Log the version, always
Every model call's log record includes the prompt name and version hash. (This is the same field the logging post calls "prompt version" — non-negotiable.)
Two reasons:
- Regression diagnosis. "Outputs got worse on Tuesday" → group logs by prompt version → "version a8f2 went live Tuesday at 10am." Now you have a suspect.
- Eval correlation. Your eval set runs against a specific prompt version. If your prod logs don't record the version, you can't say "this eval result applies to what users are seeing right now."
Use a content hash (the first 8 characters of the SHA-256 of the prompt string) or a semver you bump manually. Both work. The content hash is more honest: semver requires discipline humans don't have.
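In Node, the content-hash option is two lines (the same derivation the Rule 1 sketch uses):

import { createHash } from "node:crypto";

// Derived from the prompt text itself, so it can't drift out of date:
// edit the prompt and the version changes with it.
const promptVersion = (text: string): string =>
  createHash("sha256").update(text).digest("hex").slice(0, 8);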
The customer-fork trap
One pattern that consistently degenerates into a hairball: forking prompts per customer.
Customer A asks for "more formal tone." Customer B asks for "include their company logo description in every reply." Customer C wants "never mention competitor names." Six months in, you have 23 forks of draftReply with names like draftReply_acme, draftReply_globex_v2_final.
The fix is to push the variation into data, not into prompts:
prompts.draftReply                  // one canonical template
  + customerConfig[user.tenantId]   // tone, banned words, forbidden topics
One prompt + structured per-customer config you can audit, diff, and back-test. Twenty-three forks is a maintenance disaster waiting for its first bug.
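A sketch of what that config might look like (field names are hypothetical):

type CustomerConfig = {
  tone?: "formal" | "neutral" | "casual";
  bannedWords?: string[];      // e.g. competitor names
  forbiddenTopics?: string[];
};

// Keyed by tenant: auditable in review, diffable in git,
// back-testable against any prompt version in your logs.
const customerConfig: Record<string, CustomerConfig> = {
  acme: { tone: "formal" },
  globex: { bannedWords: ["CompetitorCo"] },
};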
Close
Prompt versioning is unglamorous. The four rules look like ceremony when you have one prompt. By the time you have ten, they are the difference between a team that can iterate confidently and a team that's afraid to touch what's working.
Set the rules now. They cost nothing on day one and save weeks at month six.