The Demo Trap
Most AI demos optimize for convincing the room. The ones that translate to production are built to reveal whether the system is ready. Four common failure modes, and the test that catches them.
Field notes
Brief notes on AI systems, product decisions, and what we learn in practice. Written when there is something specific to say.
Evaluation sets are supposed to stay representative. They can quietly drift. Four ways that happens, the quarterly audit that catches it, and the discipline of retiring examples.
Cost gets a ceiling. Latency rarely does, until users churn. The interactive bar, the budget hierarchy, and the four levers when you're over the line.
How to get a model to return clean JSON without paying the retry tax. Schema design, the validation pattern, and the three schema failures that quietly leak budget.
Smaller models do most of the work when given the right work. A five-step process for sizing models to tasks, and the three signals you've picked wrong.
Most "we need an agent" problems are tool-use problems, and most tool-use problems are prompt problems. The hierarchy of complexity, and the cost of skipping a rung.
Most RAG isn't worth it. A four-question test for when to add retrieval, the three failure modes that turn it into a debugging burden, and what to try first.
How to keep a prompt from becoming 30 untracked variants in 30 places. A four-rule discipline that scales from one prompt to a hundred.
Most AI features die for cost, not quality. Set the unit-economics ceiling before you ship, watch the four cost vectors, and know the three levers when you're over budget.
What to capture before you regret not capturing it. The minimum log schema, and the three questions it should let you answer in under five minutes.
Every AI feature has humans somewhere. Most teams put them in the wrong place. Four placement modes, and a four-question test for picking the right one.
The unit-test playbook for LLM systems. How to build your first 50-example eval set in a week, and why every team that skips this step pays for it later.
Almost every team has shipped an AI demo. Almost none have shipped an AI feature their users rely on every day. A diagnostic, and a three-question filter to get unstuck.
Get in touch
We help founders and teams turn AI ambition into systems that operate in production. If that is the work ahead, send a short note with context.
Talk to us