Field Notes · 7 min read
Latency Budgets for AI Features
Cost gets a ceiling. Latency rarely does — until users churn. The interactive bar, where the time goes in an LLM call, and the four levers when you're over the line.
Teams track cost. Teams track quality. Teams routinely fail to track latency until a quarterly user-research session reveals that the AI feature has 40% lower engagement than the non-AI equivalent — and the comment everyone gives is "it's too slow."
Latency is the budget no one writes down. It is also the one that decides whether your feature is loved or quietly abandoned.
The interactive bar
These thresholds are decades old, and still right:
- < 100ms — instant. Users don't perceive a wait.
- 100ms – 1s — feels responsive. Users stay in flow.
- 1s – 3s — perceptible delay. Acceptable for "the system is working on it."
- 3s – 10s — borderline. Needs a progress signal or users will think it broke.
- > 10s — non-interactive. Reframe as a background job, an email, or a notification.
If your AI feature lives in an interactive context, the latency budget is 3 seconds, max — and that's stretched. Most "AI tab in a chat tool" features need to be under 1.5s end-to-end to feel right.
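If you want this bar to show up in monitoring rather than in a quarterly user-research session, one option is to tag every logged request with its tier. A minimal sketch; the function name is hypothetical and the cutoffs are just the list above turned into code:

```python
def interactivity_tier(latency_ms: float) -> str:
    # Hypothetical helper: cutoffs mirror the thresholds listed above.
    if latency_ms < 100:
        return "instant"
    if latency_ms < 1_000:
        return "responsive"
    if latency_ms < 3_000:
        return "perceptible delay"
    if latency_ms < 10_000:
        return "borderline: needs a progress signal"
    return "non-interactive: make it a background job"

print(interactivity_tier(1_400))  # "perceptible delay"
```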
Where time goes in an LLM call
A single LLM call breaks down into parts you can usually measure:
- Network round-trip to your backend — 20-80ms typical.
- Your code: auth, validation, prompt assembly, retrieval, etc. — 10-300ms. The piece teams most often forget to measure.
- Network to provider + queue wait — 50-300ms. Spikes during outages.
- Time-to-first-token (TTFT) — 200ms-3s. Depends heavily on model tier and prompt size.
- Tokens-per-second generation — 30-150 tok/s typical, varies by tier.
- Output post-processing: parse, validate, repair — 5-100ms.
- Network back to user — same as inbound.
Add these up honestly for your top user action. The number is almost always larger than people guessed before measuring.
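A back-of-the-envelope version of that exercise: sum the measured (or estimated) pieces for one action and compare against the budget. A minimal sketch; every number below is a placeholder to be replaced with your own traces:

```python
# Per-component latencies for one user action, in ms.
# Every number here is a placeholder -- substitute your own measurements.
components_ms = {
    "network_to_backend": 60,
    "backend_work": 150,            # auth, validation, prompt assembly, retrieval
    "provider_network_and_queue": 120,
    "time_to_first_token": 900,
    "generation": 150 / 80 * 1000,  # 150 output tokens at 80 tok/s
    "post_processing": 40,
    "network_to_user": 60,
}

budget_ms = 3_000  # the interactive ceiling for this action
total_ms = sum(components_ms.values())

print(f"total {total_ms:.0f} ms against a {budget_ms} ms budget")
for name, ms in sorted(components_ms.items(), key=lambda kv: -kv[1]):
    print(f"  {name:28s} {ms:6.0f} ms  ({ms / total_ms:.0%})")
```

Even with modest placeholder numbers, the total lands over 3 seconds, which is exactly the kind of surprise this exercise tends to produce.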
Setting the latency budget
Like the cost ceiling, the latency budget is a number you write down per user action — and enforce.
Two numbers, not one:
- P50 target — what the median user experiences. Aim for the middle of your interactivity tier.
- P95 ceiling — the worst 5% of users. Set it at 2-3× your P50 target and enforce it with timeouts. A 30s tail kills the feature even if the median is fine.
If you don't have a P50/P95 dashboard for your AI feature today, that's the first thing to build (see Logging for LLM Systems — the latency field is non-negotiable).
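Once those latencies are in the logs, the two numbers are one percentile computation away. A minimal sketch using only the standard library; the latencies and thresholds are placeholders:

```python
from statistics import quantiles

# Per-request end-to-end latencies in ms for one user action,
# pulled from your request logs (placeholder values).
latencies_ms = [820, 980, 1100, 1250, 1350, 1400, 1450, 1600, 2100, 17500]

# quantiles(n=100) returns the 99 percentile cut points; index 49 is P50, 94 is P95.
cuts = quantiles(latencies_ms, n=100)
p50, p95 = cuts[49], cuts[94]

P50_TARGET_MS, P95_CEILING_MS = 1_500, 4_500  # the numbers you wrote down
print(f"P50 = {p50:.0f} ms (target {P50_TARGET_MS}), P95 = {p95:.0f} ms (ceiling {P95_CEILING_MS})")
if p95 > P95_CEILING_MS:
    print("P95 ceiling blown -- the tail, not the median, is the problem")
```

With these placeholder values the median looks healthy while the single 17.5s outlier blows the P95 ceiling, which is the tail problem described below.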
Four levers when you're over budget
1. Smaller model
The single biggest lever. Smaller tiers are 2-5× faster on TTFT and tokens-per-second. If you haven't tried sizing down recently, do it before anything else. Often the latency drop alone justifies the swap.
2. Streaming + skeleton UI
Stream tokens to the user as they arrive. Time-to-first-token becomes the perceived latency, not total-completion-time. Combine with a skeleton or "thinking" indicator from the moment of click.
Users will tolerate a 6-second total response if the first words appear in 400ms. They will not tolerate a blank screen for 6 seconds even if the response is identical at the end.
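A minimal streaming sketch, assuming the OpenAI Python SDK purely for illustration (any provider that streams works the same way). The point is to measure TTFT separately from total time and relay tokens as they arrive instead of buffering:

```python
import time
from openai import OpenAI  # assumption: OpenAI Python SDK; swap in your provider's client

client = OpenAI()

def stream_reply(prompt: str):
    start = time.monotonic()
    ttft = None
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if not delta:
            continue
        if ttft is None:
            ttft = time.monotonic() - start  # perceived latency: first visible token
        yield delta  # push straight to the UI instead of buffering
    print(f"TTFT {ttft:.2f}s, total {time.monotonic() - start:.2f}s")
```

Wiring that generator to a server-sent-events or websocket response is what turns TTFT, rather than total completion time, into the latency the user actually feels.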
3. Parallelize calls
If your feature makes multiple LLM or tool calls per user action, check what can run concurrently. A workflow that fires three independent calls in series for 3s each takes 9s; the same calls in parallel take ~3s.
Most multi-step systems have at least one accidental serialization. Audit your code, not your design doc.
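A minimal sketch of the fix with asyncio.gather; the helper below is a stand-in for real LLM or tool calls and just sleeps to simulate provider latency:

```python
import asyncio
import time

async def fake_llm_call(name: str, seconds: float = 3.0) -> str:
    """Stand-in for an independent LLM or tool call (hypothetical helper)."""
    await asyncio.sleep(seconds)
    return f"{name} result"

async def handle_action(doc: str):
    start = time.monotonic()
    # Three independent calls: in series this is ~9s; gathered, ~3s (the slowest call).
    summary, entities, label = await asyncio.gather(
        fake_llm_call("summarize"),
        fake_llm_call("extract_entities"),
        fake_llm_call("classify"),
    )
    print(f"elapsed {time.monotonic() - start:.1f}s")
    return summary, entities, label

asyncio.run(handle_action("..."))
```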
4. Push to async
If the user action genuinely needs more than 10 seconds of model work, stop pretending it's interactive. Show "we'll have this ready in a minute," return immediately, and deliver via notification, email, or polling.
This is harder than it sounds because it requires a UX commitment — but it's almost always the right call when the work is genuinely deep (a long-form draft, a multi-source research task, a deep agent run).
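A minimal sketch of the shape this takes on the backend, with an in-process executor and a dict standing in for whatever queue and job store you already run (Celery, SQS, a jobs table):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

JOBS: dict[str, dict] = {}                   # stand-in for a real job store
WORKERS = ThreadPoolExecutor(max_workers=4)  # stand-in for a real queue/worker pool

def run_deep_work(job_id: str, request: str) -> None:
    """The >10s model work, run off the request path (body elided)."""
    JOBS[job_id]["status"] = "done"
    JOBS[job_id]["result"] = f"long-form draft for: {request[:40]}"

def start_draft(request: str) -> dict:
    """Handler for the user action: enqueue and return immediately."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "pending", "result": None}
    WORKERS.submit(run_deep_work, job_id, request)
    return {"job_id": job_id, "status": "pending"}  # UI: "we'll have this ready in a minute"

def poll_draft(job_id: str) -> dict:
    """Polling endpoint; a push notification or email works the same way."""
    return JOBS.get(job_id, {"status": "unknown"})
```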
The tail problem (P50 isn't enough)
One of the most common latency mistakes: tuning to P50 and ignoring P95.
P50 is 1.4s. Looks great. P95 is 18s. The slowest 5% of users are abandoning the feature entirely — and their support tickets dominate your inbox. Median-only tuning hides this.
Things that drive P95 spikes:
- Provider rate limits and queue backups. These spike when traffic is bursty.
- Long-tail inputs. A user pastes a 30-page document into a feature you designed for paragraphs. Your prompt explodes from 800 to 30,000 tokens.
- Retries. A single retry doubles latency. Two retries triple it. (See the retry trap.)
- Cold starts on serverless. The 2-second function spin-up nobody remembered.
Each of those has a different mitigation. None of them surface if you only watch P50.
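As one example of a mitigation, the long-tail-input spike can be capped before prompt assembly. A rough sketch; the characters-per-token heuristic and the cutoff are assumptions, and a real tokenizer or a reroute to the async path may fit better:

```python
MAX_PROMPT_TOKENS = 4_000  # what the feature was designed for
CHARS_PER_TOKEN = 4        # rough heuristic; use a real tokenizer in production

def guard_long_tail(user_text: str) -> str:
    """Cap pasted input so a 30-page paste can't blow the prompt (and the P95)."""
    approx_tokens = len(user_text) // CHARS_PER_TOKEN
    if approx_tokens > MAX_PROMPT_TOKENS:
        # Alternatives: summarize first, route to the async path, or ask the user to trim.
        return user_text[: MAX_PROMPT_TOKENS * CHARS_PER_TOKEN]
    return user_text
```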
The timeout discipline
Two timeouts every LLM call needs:
- Connect timeout — 5s. If the provider is unreachable, fail fast and either fail over to a fallback or return a graceful error.
- Total timeout — 2-3× your P95 target. Beyond this, kill the call. Long-running calls are worse than a clean failure: they hold connections, stack queue depth, and make incidents harder to debug.
Always combine timeouts with a structured fallback: a cheaper model, a static cached response, or a clear "we couldn't do that — try again." Silently hanging is the worst available outcome.
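A minimal sketch of the total timeout plus structured fallback, using asyncio.wait_for around a stand-in provider call; the connect timeout usually lives in the HTTP client configuration (shown as a comment, assuming httpx):

```python
import asyncio

P95_TARGET_S = 4.5
TOTAL_TIMEOUT_S = 2.5 * P95_TARGET_S  # kill anything past ~2-3x the P95 target

# The connect timeout belongs in the HTTP client, e.g. with httpx:
#   client = httpx.AsyncClient(timeout=httpx.Timeout(TOTAL_TIMEOUT_S, connect=5.0))

async def call_model(prompt: str) -> str:
    """Stand-in for the real provider call."""
    await asyncio.sleep(1.0)
    return "model answer"

async def call_with_fallback(prompt: str) -> str:
    try:
        return await asyncio.wait_for(call_model(prompt), timeout=TOTAL_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Structured fallback: a cheaper model, a cached response, or a clear error.
        # Anything but a silent hang.
        return "We couldn't do that -- please try again."

print(asyncio.run(call_with_fallback("...")))
```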
Close
Latency is not a polish item. It is the single biggest predictor of whether an AI feature gets used after launch.
Write down the budget per user action. Watch P50 and P95 from day one. Have the four levers ready before you need them. Most teams discover their latency problem the same way they discover their cost problem — too late, from users. The work to avoid that is small. The cost of not doing it shows up as churn.