Field Notes · 7 min read
Cost Ceilings for AI Features
Most AI features die for cost, not quality. Set the unit-economics ceiling before you ship, watch the four cost vectors, and know the three levers when you're over budget.
Quality kills few AI features. Cost kills most of them.
The pattern looks the same every time: a feature ships, usage grows, the monthly bill triples, finance forwards the invoice with a question mark, and the team scrambles to make cuts under pressure — usually cuts that also hurt quality. The fix isn't shipping cheaper from day one. The fix is having a number you can't cross, and watching the levers that move you toward it.
The unit-economics ceiling
Before you ship, write down one number: maximum cost per user action. Not per month, not per request — per the action your user takes.
Examples:
- "Drafting a customer reply: max $0.04 per draft."
- "Summarizing a support ticket: max $0.008 per ticket."
- "Generating a product description: max $0.12 per product."
Two rules. The ceiling has to be set by someone who knows the unit economics of the product (often a PM or finance partner, not the engineer building the feature). And the ceiling has to be written down — Slack messages disappear, spreadsheet cells get overwritten.
If you can't say what the ceiling is, you do not know whether to ship.
The four cost vectors
Cost per user action breaks down into four multiplied terms. Move any one and you change the bill. Get any one wrong in your estimate and your projection is off by 2-10×.
1. Per-call cost
Token-in price × tokens-in + token-out price × tokens-out, plus any tool-call or retrieval costs. Compute this for a representative input — not a one-line "hello" prompt. Real prompts include system instructions, examples, and retrieved context; they are often 10× longer than your demo prompt.
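As a quick sketch, the per-call arithmetic looks like this. Every price and token count below is a hypothetical placeholder, not a real provider rate:

```python
def per_call_cost(tokens_in: int, tokens_out: int,
                  price_in_per_1k: float, price_out_per_1k: float,
                  extra_costs: float = 0.0) -> float:
    """One model call: input tokens + output tokens + tool/retrieval extras."""
    return (tokens_in / 1000) * price_in_per_1k \
        + (tokens_out / 1000) * price_out_per_1k \
        + extra_costs

# A representative prompt: system instructions + few-shot examples + retrieved
# context -- far longer than the one-line demo prompt.
cost = per_call_cost(tokens_in=3500, tokens_out=400,
                     price_in_per_1k=0.003, price_out_per_1k=0.015,
                     extra_costs=0.001)  # 0.0105 + 0.006 + 0.001 = $0.0175
```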
2. Calls per user action
How many model calls does one user-facing action trigger? An agentic flow can be 3-15 calls deep; a single-call feature is 1. Get this wrong and your entire projection is off by the same factor.
3. Retries and reruns
What's your retry rate on transient failures? What's your re-run rate when output fails validation and you ask again? In production these often run 5-20%. The PoC's 1% is not real.
4. Fallback / escalation
What fraction of calls escalate to a more expensive model or to a human review queue? If 12% of calls escalate to a model that costs 8× more, your blended cost is roughly 2× your nominal — a huge miss if you didn't model it.
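That blended-cost arithmetic is easy to check. A minimal sketch, using the example's figures:

```python
def blended_multiplier(escalation_rate: float, escalation_cost_ratio: float) -> float:
    """Average cost per call, relative to the nominal (cheap-path) cost."""
    return (1 - escalation_rate) * 1.0 + escalation_rate * escalation_cost_ratio

# 12% of calls escalate to a model that costs 8x more:
m = blended_multiplier(0.12, 8)  # 0.88 + 0.96 = 1.84, roughly 2x nominal
```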
Multiply the four. That's your projected cost per user action. Hold it next to the ceiling.
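A sketch of the full multiplication, held against the ceiling. Every number below is illustrative, not a benchmark:

```python
def projected_cost_per_action(per_call: float,
                              calls_per_action: float,
                              retry_rate: float,
                              escalation_rate: float,
                              escalation_cost_ratio: float) -> float:
    """The four vectors multiplied: projected cost per user action."""
    retry_factor = 1 + retry_rate  # retries/reruns on failed validation
    blended = (1 - escalation_rate) + escalation_rate * escalation_cost_ratio
    return per_call * calls_per_action * retry_factor * blended

CEILING = 0.04  # max $ per draft, set by the PM/finance partner (example)

projected = projected_cost_per_action(
    per_call=0.0045, calls_per_action=3,
    retry_rate=0.10, escalation_rate=0.05, escalation_cost_ratio=8,
)  # 0.0135 * 1.10 * 1.35 = $0.0200 per action
over_budget = projected > CEILING  # if True, don't ship yet
```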
The budget alarm
Even with a careful estimate, prod will surprise you. Set an automated daily spend alarm before launch. Two thresholds:
- Warning: daily spend hits 120% of forecast for two consecutive days. Someone gets paged, no automated action.
- Hard cap: daily spend hits 200% of forecast. Feature degrades to a cheaper model or returns a "temporarily unavailable" — automatically, without waiting for a person.
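The two thresholds can be sketched as a single check; the function and its wiring are assumptions — in practice this runs daily off your billing data:

```python
from enum import Enum

class BudgetAction(Enum):
    OK = "ok"
    WARN = "page a human"        # warning threshold: no automated action
    DEGRADE = "degrade feature"  # hard cap: cheaper model or "unavailable"

def budget_check(daily_spend: float, forecast: float,
                 prev_day_over_warning: bool) -> BudgetAction:
    """Warn at 120% of forecast for two consecutive days; hard-cap at 200%."""
    if daily_spend >= 2.0 * forecast:
        return BudgetAction.DEGRADE  # automatic, no human in the loop
    if daily_spend >= 1.2 * forecast and prev_day_over_warning:
        return BudgetAction.WARN
    return BudgetAction.OK
```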
The hard cap is the unglamorous engineering work nobody wants to build. It is also the thing standing between you and a $30k Saturday.
Three levers when you're over budget
1. Model size
Demote where you can. Most teams ship on a top-tier model because it worked in the demo and never go back to test a smaller one. Re-run your eval set against the next size down. If pass rate stays within 2-3 points, ship the smaller model and keep the bigger one as a fallback for the cases that fail.
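The demotion decision is a comparison over the same eval set. A sketch, assuming a harness that yields per-case pass/fail results:

```python
def small_model_is_good_enough(big_results: list,
                               small_results: list,
                               tolerance_pts: float = 3.0) -> bool:
    """True if the smaller model's pass rate is within `tolerance_pts`
    percentage points of the top-tier model's, on the same eval set."""
    pass_pct = lambda results: 100.0 * sum(results) / len(results)
    return pass_pct(big_results) - pass_pct(small_results) <= tolerance_pts

# 90% vs 88%: within 3 points, so ship the smaller model and keep
# the bigger one as a fallback for the cases that fail.
demote = small_model_is_good_enough([True] * 45 + [False] * 5,
                                    [True] * 44 + [False] * 6)
```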
2. Caching
Two flavors:
- Prompt prefix caching — provider-side, free or near-free, kicks in when your system prompt is stable. Make sure your prompt is structured so the long, stable part comes first.
- Output caching — yours to build. For identical or normalized inputs, return the previous output. Works best for tasks with repeated inputs (FAQ answers, common queries, product categorization).
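The output-caching flavor can be as small as this in-process sketch. A production version would live in Redis or similar with a TTL, and `call_model` is a hypothetical stand-in for your client:

```python
import hashlib

class OutputCache:
    """Return a previous output for identical or normalized inputs."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Normalize so trivially different inputs hit the same entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_call(self, prompt: str, call_model) -> str:
        key = self._key(prompt)
        if key not in self._store:
            self._store[key] = call_model(prompt)  # only pay on a miss
        return self._store[key]

cache = OutputCache()
model_calls = []
fake_model = lambda p: (model_calls.append(p), "Our refund window is 30 days.")[1]

cache.get_or_call("What is your refund policy?", fake_model)
cache.get_or_call("what is your  REFUND policy?", fake_model)  # normalized hit
# len(model_calls) == 1: the second request never reached the model
```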
3. Prompt trim
The fastest free win. Most prompts have 30-50% room: redundant instructions, stale examples, over-long retrieval contexts that hurt accuracy as much as cost. Strip until the eval score drops, then add back the last cut.
The retry trap
One pattern that quietly kills budgets: a validation step that retries on failure, with no cap.
Code that says "if the JSON doesn't parse, ask again" is fine — until a prompt change pushes parse failure from 1% to 18%. Now 18% of calls fire 2-5 times. Your bill triples overnight.
Cap retries at two, hard. Log the retry rate as a first-class metric. Set an alarm on it. A spike in retries is a spike in spend and almost always a spike in quality regressions too — your evals (you have evals, right?) should catch the same change.
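A sketch of the capped loop. `call_model` and the `metrics` dict are hypothetical stand-ins for your client and metrics pipeline; the failure counter is what the alarm watches:

```python
import json
import logging

logger = logging.getLogger("llm.validation")
MAX_RETRIES = 2  # hard cap: at most two re-asks, never unbounded

def call_with_validation(call_model, prompt: str, metrics: dict):
    """Ask the model; re-ask on unparseable JSON, but only up to MAX_RETRIES."""
    for attempt in range(1 + MAX_RETRIES):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # First-class metric: a spike here is a spike in spend.
            metrics["parse_failures"] = metrics.get("parse_failures", 0) + 1
            logger.warning("JSON parse failure on attempt %d", attempt + 1)
    return None  # give up; route to a fallback instead of retrying forever
```

When it returns `None`, send the request to a fallback (human queue, cached answer, explicit error state) rather than looping again.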
Close
Cost discipline does not mean cheap models everywhere. It means knowing your ceiling, watching the four vectors, and having levers you've already tested. The teams that survive the second year of an AI feature are the ones who treated cost as a product constraint from day one — not a finance problem to handle later.