Field Notes
The Demo Trap
Most AI demos optimize for convincing the room. The ones that translate to production optimize for telling the truth. Four sins of demos that lie, and the test that catches them.
Three months after a demo lands, a normal team is locked into shipping what they showed — and discovering it doesn't work on the inputs that didn't make the highlight reel. The AI ambition gap describes the cost of that mismatch. This post is about not building the mismatch in the first place.
Demos are tools. A demo that tells the truth saves you a quarter of rework. A demo that lies costs the same quarter and a chunk of the team's confidence in what's actually shippable.
Three demo flavors
Mixing them up is most of the problem. They look identical from the outside; their purposes are different.
- Sales demo. Goal: get a meeting, close a deal, raise a round. Optimized to convince. Lies are the point — not malicious, but the inputs are curated, the model is the most capable one available, and the latency is whatever it is.
- Exec demo. Goal: show that AI is "working" inside the company. Lower stakes than sales, but the same incentives — the room wants to see the wow.
- Engineering proof-of-concept. Goal: tell you whether the system can actually be built and shipped. Optimized for truth. Failures here are useful; failures in the others are embarrassing.
Only the third one tells you anything about production. The first two should never be confused with it, even when they look the same.
Four sins of demos that lie
A demo that's about to mislead your team almost always commits one of these.
1. Cherry-picked inputs
The deck has twelve example inputs. They were chosen — explicitly or by survivorship — because the model handled them well. Real users will send the thirteenth input on day one.
Fix: sample inputs at random from a real (or realistic) traffic distribution. If you have no traffic yet, write the 50 most likely user messages without filtering and don't drop the ones the model fails on.
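One minimal way to do the sampling, as a sketch; the traffic log, the seed, and the count of twelve are illustrative, not from the post:

```python
import random

def sample_demo_inputs(traffic_log, k=12, seed=0):
    """Pick demo inputs uniformly at random from real traffic.

    `traffic_log` is a list of raw user messages. No filtering,
    and no dropping the ones the model fails on.
    """
    rng = random.Random(seed)  # fixed seed so the demo deck is reproducible
    return rng.sample(traffic_log, k)

# Hypothetical usage: twelve inputs drawn from last week's messages.
traffic = [f"user message {i}" for i in range(500)]
inputs = sample_demo_inputs(traffic, k=12)
```

The fixed seed matters: it lets anyone re-run the draw and confirm the deck was not quietly re-rolled until the model looked good.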
2. No latency budget
The demo's "thinking" gets played off as suspense. The model is on the most capable tier, with a 4-second time-to-first-token. In production with mobile users in São Paulo? Different story.
Fix: show the actual latency on screen during the demo. If it's over your interactive bar, name that, and decide before you ship whether you live with it or change the architecture.
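A sketch of putting time-to-first-token on screen, assuming a streamed response; the 1.5-second interactive bar and the stub stream are hypothetical stand-ins, not anything the post specifies:

```python
import time

INTERACTIVE_BUDGET_S = 1.5  # assumed bar; set your own before the demo

def timed_first_token(stream):
    """Yield tokens from `stream`, printing time-to-first-token once."""
    start = time.monotonic()
    first = True
    for token in stream:
        if first:
            ttft = time.monotonic() - start
            flag = " (OVER BUDGET)" if ttft > INTERACTIVE_BUDGET_S else ""
            print(f"time to first token: {ttft:.2f}s{flag}")
            first = False
        yield token

# Stub standing in for a real streaming model call.
def fake_stream():
    time.sleep(0.05)
    yield from ["Hello", ",", " world"]

out = "".join(timed_first_token(fake_stream()))
```

The point is that the number prints during the demo, in front of the room, rather than living in a log nobody opens.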
3. No cost meter
The PoC ran 38 model calls and cost $4.20, about $0.11 per call. Nobody mentioned that. Scale to 200k user actions a day at one call per action and the bill is roughly $22,000 a day, and someone is going to ask why.
Fix: write the per-action cost on the demo. If it doesn't fit under the ceiling, you don't have a shippable system; you have a science fair project.
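The arithmetic is worth writing on the slide itself. A sketch using the PoC figures above, where the one-call-per-action mapping and the one-cent ceiling are assumptions for illustration:

```python
# Per-call cost from the PoC numbers.
poc_calls = 38
poc_cost_usd = 4.20
per_call = poc_cost_usd / poc_calls      # ~ $0.11 per call

# Hypothetical scale-up: 200k user actions a day, one call per action.
actions_per_day = 200_000
daily_bill = actions_per_day * per_call  # ~ $22,000 a day
monthly_bill = daily_bill * 30           # ~ $660,000 a month

# A hypothetical per-action ceiling the product can absorb.
ceiling_usd = 0.01
shippable = per_call <= ceiling_usd      # False: science fair territory
```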
4. Hidden human-in-the-loop
The demo output looks great because a person reviewed and lightly edited it before the meeting. In production, that person doesn't exist on every call.
Fix: name what humans did and didn't touch. If a person curated the demo output, the production system needs the same human in the loop or a much higher tolerance for the unedited result.
Designing a demo that translates
Reverse each sin. Specifically:
- Inputs: randomly sampled from production-like traffic. Include the inputs that fail.
- Model: the model and the prompt you'd ship, not the most capable tier.
- Latency: shown live, not edited out.
- Cost: per-action number visible, plotted against the ceiling.
- Failure rate: show the share that needs retry, escalation, or human review.
A demo that does these five things is less impressive. It is also load-bearing — what it shows is roughly what you'll ship.
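One hypothetical way to make the five numbers hard to omit is to force every demo through a single structure the slide is generated from; the field names and thresholds here are illustrative, not prescribed by the post:

```python
from dataclasses import dataclass

@dataclass
class DemoScorecard:
    """The five truth-telling numbers, required before a demo counts as load-bearing."""
    sampled_inputs: int      # inputs drawn at random, failures included
    production_model: str    # the model and prompt you'd actually ship
    p50_ttft_s: float        # latency shown live, not edited out
    cost_per_action: float   # dollars per user action
    cost_ceiling: float      # what the product can absorb per action
    failure_rate: float      # share needing retry, escalation, or review

    def load_bearing(self) -> bool:
        # 5% threshold borrowed from the launch test below; tune to taste.
        return (self.cost_per_action <= self.cost_ceiling
                and self.failure_rate <= 0.05)

card = DemoScorecard(
    sampled_inputs=12,
    production_model="the-model-you-ship",
    p50_ttft_s=0.8,
    cost_per_action=0.004,
    cost_ceiling=0.01,
    failure_rate=0.03,
)
```

Filling this in honestly is most of the work; the dataclass just makes a blank field conspicuous.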
The "would you bet your launch on it?" test
Before the demo, write down: if I shipped this exact system to 1,000 users on Monday, what fraction would have a bad experience this week?
If the honest answer is "I don't know" or "more than 5%" — this is a demo for funding, hiring, or alignment. It is not a demo for shipping. Don't let it turn into one by drift.
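Reduced to a crude gate, with `None` standing in for "I don't know" (the function name and return strings are illustrative):

```python
def ship_or_demo(est_bad_fraction):
    """The bet-your-launch test: est_bad_fraction is your honest estimate
    of the share of 1,000 Monday users who'd have a bad week-one experience.
    None means "I don't know"."""
    if est_bad_fraction is None or est_bad_fraction > 0.05:
        return "demo for funding, hiring, or alignment; not for shipping"
    return "candidate for shipping"
```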
Some teams find this question uncomfortable and skip it. Those are the teams that ship in three months and then spend six months unlaunching what they shipped.
Close
Demos are great when they tell the truth. The truth-telling version of a demo is rarely the one you'd put in a sales deck — and that is the point.
If your team can't tell whether a given demo is convincing or load-bearing, label it. Both kinds are valid. Confusing them is what gets you stuck on the wrong side of the ambition gap.