
The AI Ambition Gap

Why most AI projects stall between demo and production — and a three-question diagnostic to get unstuck.

Almost every team we meet has shipped an AI demo. Almost no team we meet has shipped an AI feature their users rely on every day.

The distance between those two states is where most AI projects die. We call it the ambition gap: the space between "AI works in a meeting" and "AI works in production."

What the gap looks like

  • Demo passes, prod fails. The model handles the twelve happy-path examples in the demo deck and breaks on the thirteenth input from a real user.
  • No one trusts the output. The feature is built, but the team won't let it run unattended because they have no way to know if it's right.
  • Costs are unbounded. The PoC cost $200 of API credits. The production version would cost $40,000 a month, and nobody can forecast it to within ±60% (see the back-of-envelope sketch after this list).
  • The feature regresses silently. Six weeks after launch, output quality drops 20% — and the team finds out from a support ticket.
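
A back-of-envelope model is usually enough to bound that last cost problem. Everything numeric in the sketch below is an illustrative assumption, not real traffic or real provider pricing; the point is that the arithmetic fits in a dozen lines:

    # Back-of-envelope monthly cost estimate for an LLM feature.
    # Every number here is an assumed placeholder. Substitute your
    # own traffic figures and your provider's current pricing.
    REQUESTS_PER_DAY = 50_000           # assumed production traffic
    INPUT_TOKENS_PER_REQUEST = 1_200    # prompt plus retrieved context
    OUTPUT_TOKENS_PER_REQUEST = 300
    PRICE_PER_1M_INPUT_TOKENS = 3.00    # USD, hypothetical rate
    PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # USD, hypothetical rate

    cost_per_request = (
        INPUT_TOKENS_PER_REQUEST / 1e6 * PRICE_PER_1M_INPUT_TOKENS
        + OUTPUT_TOKENS_PER_REQUEST / 1e6 * PRICE_PER_1M_OUTPUT_TOKENS
    )
    monthly_cost = cost_per_request * REQUESTS_PER_DAY * 30

    print(f"${cost_per_request:.4f}/request, ~${monthly_cost:,.0f}/month")
    # At these assumed rates: $0.0081/request, ~$12,150/month.

If the estimate swings wildly when you vary the assumptions, that is the ±60% problem showing up before launch instead of after.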

The diagnostic — three questions

Before you build the next AI feature, answer these. If you can't, fix that before writing more code.

1. What does success look like, measured?

Not "users will love it." A number. Latency, accuracy on a held-out set, cost per request, escalation rate, refund rate, time-to-resolution. Pick one or two and write them down.

If you can't put a number on success, you can't tell when the model regresses. And it will regress — every time you change the prompt, swap the model, or alter the upstream data.
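
To make that concrete, here is what "a number, written down" can look like. The feature, metric names, and thresholds below are hypothetical placeholders; the point is that the criteria live somewhere a script can check on every change:

    # Success criteria for a hypothetical support-triage feature,
    # written down as code so a CI job can check them on every change.
    # All metric names and thresholds are illustrative assumptions.
    SUCCESS_CRITERIA = {
        "accuracy_on_holdout": 0.92,   # min fraction correct on the eval set
        "p95_latency_seconds": 2.0,    # max 95th-percentile latency
        "cost_per_request_usd": 0.01,  # max average cost per request
    }

    def check_release(measured: dict[str, float]) -> list[str]:
        """Return the criteria a release fails; an empty list means ship."""
        failures = []
        if measured["accuracy_on_holdout"] < SUCCESS_CRITERIA["accuracy_on_holdout"]:
            failures.append("accuracy below target")
        if measured["p95_latency_seconds"] > SUCCESS_CRITERIA["p95_latency_seconds"]:
            failures.append("latency above target")
        if measured["cost_per_request_usd"] > SUCCESS_CRITERIA["cost_per_request_usd"]:
            failures.append("cost above target")
        return failures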

2. What's your eval pipeline?

You need at least 50 real examples (drawn from production-like inputs, not invented ones) with expected outputs or pass/fail rules, and a way to run the model against them on every change. Without this, you are flying blind.

The eval pipeline is the unit test suite of LLM systems. We have never seen a working production AI feature that didn't have one. We have seen many failed ones that didn't.
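
A minimal version of that pipeline is small. The sketch below assumes a JSONL file of recorded production examples and a `model_fn` callable you supply; both names are placeholders, and exact match is the simplest possible pass/fail rule (real pipelines often use rubric checks or a grader model instead):

    import json

    def run_evals(model_fn, eval_path="evals.jsonl", threshold=0.90):
        """Run model_fn against every recorded example and fail loudly
        if the pass rate drops below the threshold.

        Each JSONL line is assumed to look like:
          {"input": "...", "expected": "..."}
        drawn from real production traffic, not invented examples.
        """
        with open(eval_path) as f:
            examples = [json.loads(line) for line in f if line.strip()]
        failures = []
        for ex in examples:
            output = model_fn(ex["input"])
            # Simplest possible rule: exact match on normalized text.
            if output.strip() != ex["expected"].strip():
                failures.append((ex["input"], ex["expected"], output))
        rate = 1 - len(failures) / len(examples)
        print(f"{len(examples) - len(failures)}/{len(examples)} passed ({rate:.0%})")
        assert rate >= threshold, f"pass rate {rate:.0%} is below {threshold:.0%}"
        return failures

Run it on every prompt edit, model swap, and upstream data change, the same way you would run unit tests.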

3. What does the failure mode look like?

When — not if — the model gives a wrong answer, what happens? Does it block the user? Cost money? Send a bad email to a customer? Trigger a refund?

The cheapest AI feature to operate is one where being wrong is recoverable: a draft a user reviews, a suggestion a user accepts, a tag a user can correct. The most expensive AI feature to operate is one that takes an irreversible action on the user's behalf.

If your design has the model taking irreversible actions, you'd better have answered questions 1 and 2 very well.
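
One way to enforce that distinction is to make reversibility an explicit property of every action the model can take. This is a hypothetical sketch, not a prescribed design; the `Action` type and the review queue are stand-ins for whatever your system actually does:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Action:
        name: str
        execute: Callable[[], None]
        reversible: bool  # can a human undo this after the fact?

    def dispatch(action: Action, queue_for_review: Callable[[Action], None]) -> None:
        """Run reversible actions directly; route irreversible ones
        (refunds, outbound emails) to a human review queue instead."""
        if action.reversible:
            action.execute()          # e.g. apply a tag the user can correct
        else:
            queue_for_review(action)  # a human approves before it runs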

Closing the gap

Closing the ambition gap is not a model problem. It is a systems problem: evals, observability, cost controls, human-in-the-loop where it matters, clear failure modes.

The teams that ship are the ones who treat AI like any other piece of production software — with tests, monitoring, and a plan for when it breaks. The teams that stall are the ones who treat AI like a magic trick that worked once on stage.

If you're somewhere in that gap right now, the three questions above are a reasonable place to start. If you can answer all three honestly and the answers are good, you're probably ready to ship. If you can't, that's the work.