The Demo Trap
Most AI demos optimize for convincing the room. The ones that translate to production optimize for telling the truth. Four sins of demos that lie, and the test that catches them.
Field notes
Short pieces on AI, building, and what we're noticing along the way. Written when we have something specific to say.
Your eval set is supposed to be the truth. It can quietly stop being it. Four ways evals lie, the quarterly audit that catches them, and the discipline of killing examples.
Cost gets a ceiling. Latency rarely does — until users churn. The interactive bar, the budget hierarchy, and the four levers when you're over the line.
How to get a model to return clean JSON without paying the retry tax. Schema design, the validation pattern, and the three schema failures that quietly leak budget.
Smaller models do most of the work — when given the right work. A five-step process for sizing models to tasks, and the three signals you've picked wrong.
Most "we need an agent" problems are tool-use problems, and most tool-use problems are prompt problems. The hierarchy of complexity — and the cost of skipping a rung.
Most RAG isn't worth it. A four-question test for when to add retrieval, the three failure modes that turn it into a debugging burden, and what to try first.
How to keep a prompt from becoming 30 untracked variants in 30 places. A four-rule discipline that scales from one prompt to a hundred.
Most AI features die for cost, not quality. Set the unit-economics ceiling before you ship, watch the four cost vectors, and know the three levers when you're over budget.
What to capture before you regret not capturing it. The minimum log schema — and the three questions it should let you answer in under five minutes.
Every AI feature has humans somewhere. Most teams put them in the wrong place. Four placement modes — and a four-question test for picking the right one.
The unit-test playbook for LLM systems. How to build your first 50-example eval set in a week — and why every team that skips this step pays for it later.
Almost every team has shipped an AI demo. Almost none have shipped an AI feature their users rely on every day. A diagnostic — and a three-question filter to get unstuck.
Get in touch
We help founders and teams turn AI ambition into systems that ship and stay shipped. If that's you, write to us — short and direct is fine.
Talk to us