Field Notes · 7 min read
Picking a Model Size for a Given Task
Smaller models do most of the work — when given the right work. A five-step process for sizing models to tasks, and the three signals you've picked wrong.
The most common model-selection decision goes like this: a team prototypes on the top-tier model, the demo works, they ship it, and the bill at the end of month two leads to a panicked re-test on smaller models. Two-thirds of the time, a smaller model would have worked from day one — they just never tried.
Model size is not a question of "which is best." It's a question of "which is the smallest that's good enough for this specific task." Those are very different questions and produce very different answers.
What "model size" actually decides
Picking a tier (small / medium / large / frontier) trades off along five axes:
- Quality on hard tasks. Larger models reason better, handle ambiguity better, recover from confusing prompts better. The gap shrinks every year, but it's real.
- Cost. Often an order of magnitude per token between tiers, and it compounds with every retry, every call in a workflow, every escalation.
- Latency. Smaller models are typically 2-5× faster to first token and end to end. Streaming throughput (tokens per second) also differs between tiers.
- Instruction-following on structured outputs. Smaller models drift more on long, constrained schemas. They can be brought up to spec with better prompting, but the floor is lower.
- Multilingual quality. Especially relevant for PT-BR, Spanish, Arabic, Hindi: tier gaps are larger outside the model's primary training language.
Anyone who has answered "use the best" without weighing those five is making a default-and-hope decision, not a model-selection decision.
The five-step sizing process
1. Write the eval set first
You cannot size a model without an eval. The whole question — "is this model good enough?" — is unanswerable without a yardstick. If you don't have one yet, build it before reading the rest of this post.
2. Run the eval against the smallest plausible tier
Start at the cheapest model your provider offers that has the capabilities you need (tool use, structured output, your target language). Run your eval. Record pass rate, latency, and cost per call.
If pass rate is ≥ 95% of where it needs to be: ship it. You're done. Don't go bigger to feel safer.
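Concretely, the eval run from step 2 can be a few dozen lines. A minimal Python sketch, where `call_model` is a hypothetical stand-in for whatever provider SDK you use, and the eval cases, model names, and prices are placeholders rather than recommendations:

```python
import time

# Hypothetical stand-in for your provider's client call.
# Returns (output_text, prompt_tokens, completion_tokens).
def call_model(model: str, prompt: str) -> tuple[str, int, int]:
    raise NotImplementedError("wire this to your provider's SDK")

# Tiny illustrative eval set: an input plus a rule that decides pass/fail.
EVAL_CASES = [
    {"prompt": "Classify the sentiment of: 'great product'",
     "check": lambda out: "positive" in out.lower()},
    {"prompt": "Classify the sentiment of: 'arrived broken'",
     "check": lambda out: "negative" in out.lower()},
]

# Illustrative (input, output) prices in USD per 1M tokens; substitute your provider's.
PRICE = {"small-model": (0.15, 0.60), "medium-model": (1.00, 4.00)}

def run_eval(model: str) -> dict:
    passed, latencies, cost = 0, [], 0.0
    for case in EVAL_CASES:
        start = time.monotonic()
        output, in_tok, out_tok = call_model(model, case["prompt"])
        latencies.append(time.monotonic() - start)
        in_price, out_price = PRICE[model]
        cost += (in_tok * in_price + out_tok * out_price) / 1_000_000
        passed += case["check"](output)
    return {
        "model": model,
        "pass_rate": passed / len(EVAL_CASES),
        "avg_latency_s": sum(latencies) / len(latencies),
        "cost_per_call_usd": cost / len(EVAL_CASES),
    }
```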
3. If it fails, prompt the small model harder before swapping tiers
Smaller models often pass once you add 2-3 few-shot examples, sharpen the schema, or split the task into two simpler steps. Spend half a day on this before reaching for the next tier.
The leverage: a small model with a good prompt usually beats a big model with a mediocre one — at a fraction of the cost.
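What "prompt harder" can look like in practice, sketched on a made-up ticket-extraction task. The schema and few-shot examples below are illustrative, not a recommended format:

```python
# Before: a bare instruction the small model tends to drift on.
BARE_PROMPT = "Extract the product name and issue from this support ticket as JSON."

# After: an explicit schema, an output-only rule, and two few-shot examples.
SHARPENED_PROMPT = """\
Extract fields from the support ticket. Reply with JSON only, matching exactly:
{"product": string, "issue": string, "refund_requested": boolean}

Ticket: "My KettlePro 2 leaks from the lid. I want my money back."
{"product": "KettlePro 2", "issue": "leaks from the lid", "refund_requested": true}

Ticket: "The mobile app logs me out every morning."
{"product": "mobile app", "issue": "logs out every morning", "refund_requested": false}

Ticket: "{ticket}"
"""

def build_prompt(ticket: str) -> str:
    # str.replace instead of str.format, because the JSON examples contain literal braces.
    return SHARPENED_PROMPT.replace("{ticket}", ticket)
```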
4. Step up one tier at a time, never two
If you've genuinely exhausted the small tier and still miss the bar, step up to medium. Run the same eval. Compare pass rate and cost.
The interesting decision is rarely "small vs. frontier" — it's "small vs. medium" or "medium vs. large." Skipping tiers wastes money you didn't need to spend.
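Picking up the `run_eval` sketch from step 2, the step-up loop is just an ordered walk over tiers. The tier names and the 0.95 bar are placeholders:

```python
from typing import Callable

# Illustrative tier names, cheapest first; run_eval is the sketch from step 2.
TIERS = ["small-model", "medium-model", "large-model"]

def smallest_passing_tier(run_eval: Callable[[str], dict], bar: float = 0.95) -> str | None:
    """Walk up one tier at a time; stop at the first model that clears the bar."""
    for model in TIERS:
        result = run_eval(model)
        print(f"{model}: pass_rate={result['pass_rate']:.2%}, "
              f"cost/call=${result['cost_per_call_usd']:.5f}")
        if result["pass_rate"] >= bar:
            return model  # never skip ahead: one tier up was enough
    return None  # nothing passed: revisit the task split or the eval itself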
5. Set a floor model for fallback
Whichever tier you settle on becomes your default. Configure one tier down as your fallback for when the primary is unavailable, rate-limited, or getting too expensive. The fallback won't be as good, but it should be good enough to keep the product up.
Without this, an outage at your provider is an outage of your feature.
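A minimal sketch of the fallback, using the same hypothetical `call_model` stand-in as above. In real code you'd catch your SDK's specific rate-limit and availability errors rather than a bare `Exception`:

```python
import logging

logger = logging.getLogger(__name__)

PRIMARY = "medium-model"   # the tier your eval settled on (illustrative name)
FALLBACK = "small-model"   # one tier down: worse, but keeps the feature up

def call_with_fallback(call_model, prompt: str) -> str:
    """call_model(model, prompt) is a stand-in for your provider client."""
    try:
        return call_model(PRIMARY, prompt)
    except Exception as exc:  # in practice: your SDK's rate-limit / unavailable errors
        logger.warning("primary %s failed (%s); falling back to %s", PRIMARY, exc, FALLBACK)
        return call_model(FALLBACK, prompt)
```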
Three signals you picked the wrong tier
1. The model passes but the bill scales linearly with bad inputs
If 30% of your traffic is users typing one-word queries and you're spending 8k tokens of context plus a frontier model on each one, you didn't pick the wrong model, you picked the wrong routing. Split the traffic: small model for easy inputs, larger model only for the cases that need it.
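That split can start as a crude heuristic and still capture most of the savings. The sketch below is illustrative; the three-word threshold and model names are placeholders, not recommendations:

```python
def route(query: str) -> str:
    """Crude illustrative router: cheap inputs go to the cheap model."""
    # Trivial inputs (one-word queries, short lookups) don't need a frontier
    # model or 8k tokens of retrieved context.
    if len(query.split()) <= 3:
        return "small-model"
    return "large-model"
```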
2. Pass rate is 99% but tail latency kills the UX
P50 looks great. P95 is 14 seconds. Users abandon. The model is too big for an interactive feature. Step down a tier or split the task so the user-facing call is small and any heavier work happens async.
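If your eval harness records per-call latencies (as in the step 2 sketch), checking the tail takes a few lines. The sample numbers below are invented to mirror the scenario above:

```python
import statistics

def latency_percentiles(latencies_s: list[float]) -> dict:
    """P50 hides the tail; P95/P99 is what interactive users actually feel."""
    q = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_s), "p95": q[94], "p99": q[98]}

# A model that looks fine at the median but is unusable at the tail.
sample = [1.2] * 90 + [14.0] * 10
print(latency_percentiles(sample))  # p50 ~1.2s, p95 ~14s
```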
3. Pass rate dropped silently after a model "version" bump
Same model name, new snapshot. Quality moved 5-12% without warning. This is normal — providers update their checkpoints. Without an eval pipeline + version logging (covered here), you find out from users.
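The version-logging part can be as small as appending one JSON line per eval run. A sketch, assuming your provider reports the resolved snapshot id back alongside the response (most do, but check yours):

```python
import datetime
import json

def log_eval_run(model_requested: str, model_returned: str, pass_rate: float,
                 path: str = "eval_runs.jsonl") -> None:
    """Append one record per eval run so a silent snapshot change shows up in the log."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_requested": model_requested,  # e.g. an alias like "small-model-latest"
        "model_returned": model_returned,    # the pinned snapshot actually served
        "pass_rate": pass_rate,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```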
The "escalate on failure" pattern
One of the most underused patterns in production AI:
- Run the small model first.
- If the output fails a structured check (schema invalid, confidence low, classifier flags it as "unsure"), then retry with a bigger model.
- Log how often you escalate. That number is your real cost.
For many tasks, the small model handles 80-95% of inputs correctly. Escalating only the failures keeps your blended cost close to the small-model price with quality close to the big-model price. The trick is having a reliable signal for "this output is wrong" — which is what your eval rules and validation step are for.
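A sketch of the pattern, again with the hypothetical `call_model` stand-in. The JSON-shape check stands in for whatever "this output is wrong" signal your eval rules and validation step give you:

```python
import json

def is_valid(output: str) -> bool:
    """Illustrative check: output must be JSON with the fields we expect."""
    try:
        data = json.loads(output)
        return isinstance(data, dict) and "product" in data and "issue" in data
    except json.JSONDecodeError:
        return False

ESCALATIONS = {"total": 0, "escalated": 0}

def answer(call_model, prompt: str) -> str:
    """Small model first; retry on the bigger tier only when the check fails."""
    ESCALATIONS["total"] += 1
    output = call_model("small-model", prompt)
    if is_valid(output):
        return output
    ESCALATIONS["escalated"] += 1  # this ratio is what drives your blended cost
    return call_model("medium-model", prompt)
```

To see why the blended cost stays low, take illustrative prices of 1 unit per small call and 10 per big call with a 10% escalation rate: you pay 0.9 × 1 + 0.1 × (1 + 10) = 2 units per input on average, against 10 for sending everything to the big model.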
Don't escalate blindly to a bigger model on every call. That's not escalation, that's "the big model with extra steps."
Close
The teams that get model sizing right share three habits: they build an eval before they pick, they always start small and step up, and they treat their default tier as the floor — not the ceiling.
"Which model is best?" is the wrong question. "Which is the smallest that's good enough for this specific task?" is the right one. The answers are usually surprising, almost always cheaper, and easier to keep good over time.