Field Notes · 8 min read

Tool Use vs. Agents: Knowing When to Add Steps

Most "we need an agent" problems are tool-use problems, and most tool-use problems are prompt problems. The hierarchy of complexity — and the cost of skipping a rung.

"We're building an agent." Every team. Every kickoff. The word "agent" has expanded to mean any system where the model gets to decide something — and that expansion has caused real damage to real budgets, real timelines, and real ability to debug what shipped.

Before you write a planner, a ReAct loop, a tool-selection graph, or anything else that lets the model take more than one step, climb the ladder. Most of the time you will stop two rungs below the one you planned to start at.

The hierarchy

From simplest to most complex. Always start at the top.

Rung 1: Single prompt, single response

One LLM call. Input goes in, output comes out. No tools, no loops, no branching.

Use when: the task is "transform this input into that output" — classification, extraction, summarization, drafting, rewriting, translation. The vast majority of LLM features in production live here, and they should.

Debuggability: trivial. Cost: minimal. Failure modes: well understood.
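
A minimal sketch of rung 1, assuming a generic call_llm helper wired to whatever provider you use; the helper name and the ticket-classification prompt are illustrative, not prescriptive:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: wire this to your model provider's chat/completions client."""
    raise NotImplementedError

def classify_ticket(ticket_text: str) -> str:
    # One call, one transformation: text in, label out. Nothing to loop over or trace beyond the prompt.
    prompt = (
        "Classify the support ticket below as one of: billing, bug, feature_request, other.\n"
        "Respond with the label only.\n\n"
        f"Ticket:\n{ticket_text}"
    )
    return call_llm(prompt).strip()
```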

Rung 2: Single prompt + tool use (one round)

One LLM call that may invoke one or two tools (lookup, calculator, search), then a final response from the same model with the tool output. Tool list is small, often 1-3.

Use when: the answer needs facts the model doesn't have but you do — current prices, inventory, a user's account state, today's date for relative time. The model formats and reasons; the tool fetches.

Debuggability: still good. You can log the tool call payload and the model's reasoning around it.
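
A sketch of rung 2's single tool round, reusing the hypothetical call_llm placeholder plus an illustrative get_price lookup. The model is asked once whether it needs the tool, your code runs it, and one follow-up call composes the answer:

```python
import json

def call_llm(prompt: str) -> str:  # placeholder: wire to your provider's client
    raise NotImplementedError

def get_price(sku: str) -> float:  # illustrative tool: your own pricing/inventory lookup
    raise NotImplementedError

def answer_with_price(question: str) -> str:
    # First call: the model either answers directly or requests the one tool it may use.
    decision = json.loads(call_llm(
        'You may request a price lookup. Respond as JSON, either '
        '{"tool": "get_price", "sku": "..."} or {"answer": "..."}.\n\n'
        f"Question: {question}"
    ))
    if "answer" in decision:
        return decision["answer"]
    price = get_price(decision["sku"])  # the tool runs in your code; log this payload
    # Second call: same model, same task, now with the tool output in context.
    return call_llm(
        f"Question: {question}\nLookup result: {decision['sku']} costs ${price}\nAnswer the question."
    )
```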

Rung 3: Constrained workflow (chain of pre-defined steps)

You — not the model — decide the steps. The flow is: call A → call B → optional call C → response. Each step is its own LLM call with a focused prompt. Sequence is encoded in your code, not in the model's decisions.

Use when: the task has identifiable sub-tasks that benefit from separate prompts (classify, then route, then answer in domain language). Or when a single big prompt is too long, confused, or expensive.

Debuggability: per-step traceable. Each step can have its own eval set. This is where most "complex" LLM features should live.
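
A sketch of a rung 3 workflow under the same call_llm assumption. The classify → route → answer sequence is encoded in your code; each step has its own short prompt and can carry its own eval set:

```python
def call_llm(prompt: str) -> str:  # placeholder: wire to your provider's client
    raise NotImplementedError

BILLING_PROMPT = "You are a billing support specialist. Answer concisely and reference the relevant invoice field."
TECHNICAL_PROMPT = "You are a technical support engineer. Ask for logs only if strictly necessary."

def handle_request(user_text: str) -> str:
    # Step A: classify. Small prompt, small eval set, easy to swap out.
    category = call_llm(
        f"Classify this message as 'billing' or 'technical'. Respond with the label only.\n\n{user_text}"
    ).strip()

    # Step B: route. The sequence and the branches live here, in code, not in the model's decisions.
    domain_prompt = BILLING_PROMPT if category == "billing" else TECHNICAL_PROMPT

    # Step C: answer in domain language.
    return call_llm(f"{domain_prompt}\n\nCustomer message:\n{user_text}")
```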

Rung 4: Agent (the model picks the next step)

The model decides what to do, calls a tool, looks at the result, decides what to do next, calls another tool, and so on until it decides it's done — or hits a step limit. Tool list can be large.

Use when: the path through the problem genuinely cannot be enumerated ahead of time. The user asks "find me a flight that gets there before my meeting and pairs well with my hotel preferences and is under $700" and the search trajectory depends on what's available.

Debuggability: hard. Cost: high and unpredictable. Failure modes: many and weird.
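
A sketch of the rung 4 loop with the safeguards the ladder later in this piece calls for: a hard step limit, a log of every tool call, and an explicit fallback when the agent gives up. The call_llm placeholder, the tool registry, and the JSON step protocol are all illustrative assumptions:

```python
import json

def call_llm(prompt: str) -> str:  # placeholder: wire to your provider's client
    raise NotImplementedError

def search_flights(**query) -> list:  # illustrative tool
    raise NotImplementedError

TOOLS = {"search_flights": search_flights}  # register whatever tools the agent may pick from
MAX_STEPS = 8                               # hard step limit: the loop cannot run away

def run_agent(goal: str) -> str:
    trajectory = []                         # log every tool call for debugging and evals
    for _ in range(MAX_STEPS):
        step = json.loads(call_llm(
            'Decide the next step. Respond as JSON, either '
            '{"tool": "...", "args": {}} or {"done": true, "answer": "..."}.\n'
            f"Goal: {goal}\nSteps so far: {json.dumps(trajectory)}\nTools: {list(TOOLS)}"
        ))
        if step.get("done"):
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])  # the model picked the tool; your code runs it
        trajectory.append({"tool": step["tool"], "args": step["args"], "result": str(result)})
    return "I couldn't complete this request."        # fallback behavior for "agent gave up"
```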

The agent tax

Reaching for rung 4 when rung 3 would do is one of the most expensive architectural choices in AI right now. Concretely:

  • Cost. An agent runs 3-15 LLM calls per user action where a workflow runs 1-3. That's roughly a 3-5× cost multiplier to manage (see Cost Ceilings); a back-of-envelope sketch follows this list.
  • Latency. Sequential model calls add up. A 4-step agent at 1.5s/call is 6 seconds for a single user action. Interactive features die at that latency.
  • Evaluation surface. A workflow has one eval per step. An agent has an eval per trajectory, and the model decides the trajectory. Building an eval set that covers "all the paths the agent might take" is dramatically harder than "the path I designed."
  • Failure modes. Agents loop. Agents get stuck. Agents pick the wrong tool. Agents declare success prematurely. Each is its own incident category to detect, alert on, and recover from.
  • Monitoring. A workflow log is linear. An agent log is a tree. Reasoning about "what happened on this call?" is meaningfully harder.
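
A back-of-envelope sketch of the first two bullets. The per-call latency and cost figures are illustrative placeholders, not measurements:

```python
def action_estimate(calls: int, latency_s: float = 1.5, cost_usd: float = 0.01) -> tuple[float, float]:
    """Rough latency (seconds) and cost (USD) for one user action made of sequential LLM calls."""
    return calls * latency_s, calls * cost_usd

workflow = action_estimate(calls=2)  # e.g. classify, then answer
agent = action_estimate(calls=8)     # e.g. a plan/act loop in the 3-15 call range
# workflow -> (3.0 s, $0.02); agent -> (12.0 s, $0.08): a 4x multiplier on both axes
```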

None of this means agents are wrong. It means the cost is real, and you should pay it only when the alternative cannot do the job.

The ladder before you reach for agents

Climb each rung honestly. Don't skip.

  1. Try a single prompt. Spend half a day on prompt design. Add few-shot examples. Run it against 30 representative inputs. If it works, ship.
  2. Add one tool. If the model needs a fact, give it the fact via a tool. Don't add three tools to the same prompt yet — start with the one that closes the most gaps.
  3. Split into a workflow. If the prompt is doing too much (classify + route + draft + format), break it into named steps. Each step is shorter, more debuggable, and more swappable.
  4. Constrain the workflow with the model picking branches. The model can decide between paths A, B, or C — but the paths themselves are not invented at runtime. This is "agent-lite" and covers most real branching needs (see the sketch after this list).
  5. Now, if (1)-(4) genuinely cannot solve it, agent. And when you do, set a hard step limit, log every tool call, and design fallback behavior for "agent gave up."
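
A minimal sketch of step 4's "agent-lite" branching, again assuming the call_llm placeholder; the model picks among branches you enumerated at design time, and each branch is an ordinary workflow function (the three shown are hypothetical):

```python
def call_llm(prompt: str) -> str:  # placeholder: wire to your provider's client
    raise NotImplementedError

# Hypothetical rung 3 workflows; the paths exist before runtime.
def run_refund_workflow(text: str) -> str:
    raise NotImplementedError

def run_troubleshoot_workflow(text: str) -> str:
    raise NotImplementedError

def escalate_to_human(text: str) -> str:
    raise NotImplementedError

BRANCHES = {
    "refund": run_refund_workflow,
    "troubleshoot": run_troubleshoot_workflow,
    "handoff": escalate_to_human,
}

def handle(user_text: str) -> str:
    # The only runtime decision the model makes is which pre-built path to take.
    choice = call_llm(
        f"Pick exactly one of {sorted(BRANCHES)} for the message below. Respond with the label only.\n\n{user_text}"
    ).strip()
    return BRANCHES.get(choice, escalate_to_human)(user_text)  # unknown label falls back to a safe branch
```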

The two questions that flag premature agent design

If you're proposing an agent in a kickoff, ask:

  1. Can I draw the happy path on a whiteboard? If yes — that's a workflow, not an agent. Encode the path you drew. If users only ever take 3-4 distinct flows through your system, you have 3-4 workflows, not an "agent."
  2. Do I need the model to choose which tool to call, or only what to put in it? If the latter, you don't need an agent — you need a workflow that calls the tool with model-supplied arguments.

"Yes" on either question takes the agent off the table, and the engineering gets dramatically easier.

When agents earn their tax

All of these tend to be true:

  • The user's request implies search through a large space of possible solutions.
  • The shape of the answer depends on intermediate results in a way you cannot enumerate at design time.
  • You can afford 3-10× latency and cost on this action.
  • You have eval infrastructure that can score trajectories, not just final outputs.
  • You have monitoring sophisticated enough to catch loops and tool misuse in production (a minimal loop check is sketched after this list).
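
One concrete piece of that monitoring, sketched under the assumption that each agent step is logged as a (tool, serialized-args) pair: a check that flags the agent re-issuing the same call, which is the most common looping signature:

```python
from collections import Counter

def flag_repeated_calls(trajectory: list[tuple[str, str]], max_repeats: int = 2) -> list[tuple[str, str]]:
    """Return the (tool, args) pairs issued more than max_repeats times within a single run."""
    counts = Counter(trajectory)
    return [call for call, n in counts.items() if n > max_repeats]

# The agent searched flights with identical arguments three times: flag it and alert.
run = [
    ("search_flights", '{"max_price": 700}'),
    ("search_flights", '{"max_price": 700}'),
    ("search_flights", '{"max_price": 700}'),
]
assert flag_repeated_calls(run) == [("search_flights", '{"max_price": 700}')]
```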

That's a real set of constraints — and a small set of products. Most teams don't meet them, build the agent anyway, and spend six months fighting it.

Close

Agents are not bad. Agents are expensive. The hierarchy exists so you pay for complexity only when you need it — and so the things you actually shipped are debuggable when they break.

Climb the ladder. Most days, you stop on rung 2 or 3. The teams that ship are not the ones with the cleverest agent; they're the ones who solved a real problem at the simplest rung that worked.