# Retrieval That Earns Its Keep

> Most RAG isn't worth it. A four-question test for when to add retrieval, the three failure modes that turn it into a debugging burden, and what to try first.

**Published:** 2026-02-02
**Reading time:** 7 minutes
**Author:** Bernardo Campos (Founder, 21xVentures)
**Canonical:** https://21xventures.com/blog/retrieval-that-earns-its-keep/

---

"We need RAG" is the second most reflexive sentence in AI projects, right behind "we need an agent." Most teams who say it would be better served by a longer prompt with a static knowledge document and a Friday afternoon for review.

Retrieval is a real tool. It is also a major new failure surface: an index that drifts, embeddings that don't match what users ask, chunking choices that hide the answer two paragraphs over, latency added to every call. If you add retrieval without earning it, you've turned a 200-line system into a 2,000-line system that's harder to debug.

## Four questions before you add retrieval

### 1. Does the answer live in a document you control?

If yes — say, a policy doc, a product spec, an internal FAQ — and the corpus is under 50,000 tokens, paste it into the prompt. Done. No vector store, no embedding pipeline, no chunking. Prompt caching makes this nearly free on the input side.

People skip this step because it feels unsophisticated. It is also faster, cheaper, and more debuggable than any retrieval system you'll build.
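
For concreteness, a minimal sketch of what that looks like; the file path, the prompt wording, and the `call_model` stub are placeholders for your own corpus and client:

```python
from pathlib import Path

# Hypothetical static corpus: a single policy document kept in the repo.
POLICY_DOC = Path("docs/policies.md").read_text()

SYSTEM_PROMPT = (
    "You answer questions about our plans and policies.\n"
    "Answer only from the document below. If the answer is not there, say so.\n\n"
    f"<policy_document>\n{POLICY_DOC}\n</policy_document>"
)

def call_model(system: str, user: str) -> str:
    """Stand-in for whatever chat-completion client you already use."""
    raise NotImplementedError("wire this to your model provider")

def answer(question: str) -> str:
    # The large, unchanging document sits in the system prompt, where
    # provider-side prompt caching can reuse it across calls.
    return call_model(system=SYSTEM_PROMPT, user=question)
```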

### 2. Does the corpus change faster than your deploy cycle?

If your knowledge base updates a few times a week, you can ship it as a static file in your repo with a daily build. If it updates dozens of times an hour (live inventory, support tickets, user data), retrieval starts earning its keep.

### 3. Is the corpus too big to fit in context?

Modern long-context models handle 200k-1M tokens. "Too big" used to mean 8k. It now means "actually big": entire codebases, multi-year transcripts, full document repositories. If you're well under that, paying for the extra context tokens is cheaper than building and running a retrieval system.

### 4. Do queries actually need targeted facts, or general knowledge?

Retrieval shines when the answer is a specific fact in a specific document ("what's the cancellation policy for Plan B?"). It is a poor fit for tasks that require reasoning over the whole corpus, where you'd need to retrieve so much that chunk-level relevance stops mattering.

If you answered "yes" to questions 2, 3, and 4 — retrieval is on the table. If not, you have a long-prompt problem, not a retrieval problem.

## Three failure modes

### 1. The right document was indexed but the retriever missed it

By far the most common. User asks "how do I downgrade my plan?" — the document is titled "Plan changes and cancellations" and never used the word "downgrade." Top-k returns five irrelevant chunks; the model confabulates.

Mitigations: maintain hand-written query rewrites for known synonyms, run lexical search (BM25) alongside vector search and merge the rankings with reciprocal rank fusion, and instrument retrieval recall as a first-class metric.
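
A minimal sketch of the fusion step, assuming you already have two ranked lists of document ids, one from BM25 and one from vector search:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    result_lists: e.g. [bm25_ids, vector_ids], each ordered best-first.
    k: the standard RRF damping constant; 60 is the commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage with ids from a BM25 index and a vector index:
bm25_hits = ["plan-changes", "billing-faq", "refunds"]
vector_hits = ["refunds", "plan-changes", "enterprise-tiers"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits])[:3])
```

A document that ranks moderately well in both lists beats one that ranks first in only one, which is the behaviour you want when the lexical and vector retrievers disagree.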

### 2. The chunk has the answer split across boundaries

The cancellation policy is one sentence. The fee table is in the next chunk. Top-k returns one, missing the other, and the model gives a confidently wrong answer.

Mitigations: overlap chunks (10-20% is usually enough), retrieve larger windows than feel necessary, and expand around hits before passing to the model.
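
A sketch of the first two mitigations, with sizes in characters purely for illustration; swap in token counts from whatever tokenizer you use:

```python
def chunk_with_overlap(text: str, chunk_size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into fixed-size chunks where neighbours share `overlap` characters.

    120/800 is roughly the 10-20% overlap suggested above.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def expand_hit(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Return the retrieved chunk plus `window` neighbours on each side."""
    lo = max(hit_index - window, 0)
    hi = min(hit_index + window + 1, len(chunks))
    return "\n".join(chunks[lo:hi])
```

With overlap plus neighbour expansion, a policy sentence and its fee table only get separated if they sit more than a full chunk apart.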

### 3. The index is stale

Policy was updated three weeks ago. The index still has the old version. The model reports the old policy. No alarm fires.

Mitigations: log the timestamp of indexed documents, fire an alarm when the freshest doc is older than your SLA, and re-embed nightly even if the source hasn't visibly changed.
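
A minimal freshness check, assuming you already log an `indexed_at` timestamp for each document when it's embedded; the one-day SLA and the alerting call are placeholders:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(days=1)  # hypothetical: tune to your own SLA

def index_is_stale(indexed_at: list[datetime], now: datetime | None = None) -> bool:
    """True if the most recently indexed document is older than the SLA."""
    if not indexed_at:
        return True  # an empty index is as stale as it gets
    now = now or datetime.now(timezone.utc)
    return now - max(indexed_at) > FRESHNESS_SLA

# Wire into whatever scheduled job you already run, e.g.:
# if index_is_stale(load_indexed_timestamps()):
#     alert("retrieval index has not been updated within SLA")
```

The point is not the code; it's that staleness never announces itself, so something has to check for it on a schedule.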

## What to try first (often, you don't need retrieval)

In rough order, before reaching for vector search:

1. **Put the doc in the prompt.** For corpora under ~50k tokens. With prompt caching, the marginal input cost is near zero.
2. **Static lookups in code.** If the user is asking about Plan B specifically and you know the user's plan, look up the Plan B policy in a dictionary and pass that specific section.
3. **Keyword filter + small corpus.** Filter docs by simple metadata (account type, language, region) to under 20 candidates, then pass all of them. This is "retrieval" without a vector store, and it's frequently enough.
4. **Two-pass: classify, then look up.** First call: classify the question into one of N categories. Second call: answer using just the docs for that category. This often beats general-purpose RAG on both accuracy and cost; a sketch follows this list.
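
A sketch of that two-pass flow; the category names, the documents, and the `call_model` stub are placeholders for your own taxonomy and client:

```python
# Hypothetical category -> document mapping, small enough to keep in the repo.
DOCS_BY_CATEGORY = {
    "billing": "(billing and plan-change policy text)",
    "shipping": "(shipping and returns policy text)",
    "account": "(account and security policy text)",
}

def call_model(system: str, user: str) -> str:
    """Stand-in for whatever chat-completion client you already use."""
    raise NotImplementedError("wire this to your model provider")

def answer(question: str) -> str:
    # Pass 1: a cheap classification call, constrained to known labels.
    labels = ", ".join(DOCS_BY_CATEGORY)
    category = call_model(
        system=f"Classify the question into one of: {labels}. Reply with the label only.",
        user=question,
    ).strip().lower()

    # Fall back to all docs if the label comes back malformed.
    docs = DOCS_BY_CATEGORY.get(category, "\n\n".join(DOCS_BY_CATEGORY.values()))

    # Pass 2: answer using only the docs for that category.
    return call_model(
        system=f"Answer only from this document:\n\n{docs}",
        user=question,
    )
```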

## When retrieval is the right answer

All of these tend to be true:

- The corpus is large enough that you can't fit a useful slice in context.
- The corpus changes more often than your deploy cycle.
- Queries are fact-finding ("what's our policy on X?"), not corpus-reasoning ("summarize what we learned across last year's tickets").
- You can measure retrieval recall (did the right chunk get returned at all?) and answer quality separately.
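
Measuring recall separately can be as small as this; the eval cases and the `retrieve` callable are assumptions standing in for your labelled queries and your retriever:

```python
# Hand-labelled eval set: each query paired with the id of the chunk
# that actually contains the answer. (Hypothetical examples.)
EVAL_SET = [
    {"query": "how do I downgrade my plan?", "expected_chunk": "plan-changes"},
    {"query": "what is the late-payment fee?", "expected_chunk": "billing-fees"},
]

def recall_at_k(retrieve, k: int = 5) -> float:
    """Fraction of eval queries whose expected chunk appears in the top-k results."""
    hits = sum(
        case["expected_chunk"] in retrieve(case["query"], k)
        for case in EVAL_SET
    )
    return hits / len(EVAL_SET)
```

If recall drops, no amount of prompt work downstream will recover the answer; tracking it apart from answer quality tells you which layer broke.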

And one practical condition: you have someone who will own the index, keep it fresh, watch the alarms, and run quarterly relevance reviews. An unowned retrieval system rots in months.

## Operational gotchas

- **Embed model lock-in.** Switching embedding models means re-embedding the whole corpus. Pick one whose provider you'd be willing to depend on for 18 months.
- **Latency budget.** Retrieval adds 100-400ms per call. If you're already at the edge of your latency budget, that's a feature-killer.
- **Sensitive content.** Whatever's in your index is one prompt injection away from being exfiltrated. Don't index secrets, and assume any indexed text can leak.

## Close

Retrieval is the right tool for a real set of problems, but that set is smaller than the industry treats it as. The cost of adding it when you don't need it: latency, dollars, debugging time, and a maintenance liability that compounds.

The four-question test takes ten minutes. Run it before you add the vector store. Many days, the right answer is a longer prompt and a static file.

---

**Related**

- [Tool Use vs. Agents: Knowing When to Add Steps](https://21xventures.com/blog/tool-use-vs-agents/) — same logic, applied to multi-step systems
- [Picking a Model Size for a Given Task](https://21xventures.com/blog/picking-a-model-size/) — the "smallest tool that works" principle, applied to model selection
- [Cost Ceilings for AI Features](https://21xventures.com/blog/cost-ceilings/) — RAG is a cost vector before it's a quality decision
- [Latency Budgets for AI Features](https://21xventures.com/blog/latency-budgets/) — retrieval adds 100-400ms; budget says whether you can afford it

**Contact:** hello@21xventures.com
