# Logging for LLM Systems

> What to capture before you regret not capturing it. The minimum log schema for any AI feature in production — and the three questions it should let you answer in under five minutes.

**Published:** 2025-11-10
**Reading time:** 7 minutes
**Author:** Bernardo Campos (Founder, 21xVentures)
**Canonical:** https://21xventures.com/blog/logging-for-llm-systems/

---

Three weeks after launch, a customer complains that your AI feature gave them a wrong answer last Tuesday. Can you reproduce the exact response they saw? Can you tell which prompt version was live at the time? Can you tell whether it was an outlier or one of many similar errors that day?

If you can't answer those three questions in five minutes, you don't have logs. You have noise.

## The minimum log schema

Every LLM call in production should record, at a minimum:

- **Request ID** — a UUID per call, surfaced to the user (e.g. in the response footer or error page).
- **Timestamp** — UTC. Always UTC.
- **User / tenant ID** — who triggered this, in the form your eval set understands.
- **Session ID** — so you can reconstruct multi-turn conversations.
- **Prompt version** — a hash or semver of the prompt template. Without this, "the prompt was different last Tuesday" is unanswerable.
- **Model + version** — provider, model name, and the snapshot or version ID. "gpt-4" is not enough; "gpt-4-2026-04-15" is.
- **Inputs** — the rendered prompt sent to the model, plus the user message that triggered it.
- **Outputs** — the full response from the model, including any structured fields and tool calls.
- **Tokens in / out** — usage numbers, for cost reconstruction.
- **Latency** — total wall-clock, plus model time if your provider gives it.
- **Cost** — computed at log time, not later. Models get re-priced.
- **Outcome / downstream signal** — did the user accept the output? Click through? Escalate? Most teams forget this field. It is the most important field.

That's twelve fields. None are optional. A row missing any of them is a row you can't debug from.
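
For reference, here is one way to express the record as a Python dataclass. A minimal sketch, not a standard; field names are illustrative and should match whatever your store actually uses.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LLMCallRecord:
    """One record per model call: the twelve fields above.
    Only `outcome` lags; it is set once the downstream signal arrives."""
    request_id: str             # UUID per call, surfaced to the user
    timestamp: datetime         # always UTC
    user_id: str                # user or tenant ID your eval set understands
    session_id: str             # reconstructs multi-turn conversations
    prompt_version: str         # hash or semver of the prompt template
    model: str                  # provider + exact snapshot/version ID
    inputs: str                 # rendered prompt + triggering user message
    outputs: str                # full response, incl. structured fields / tool calls
    tokens_in: int              # usage numbers, for cost reconstruction
    tokens_out: int
    latency_ms: float           # total wall-clock; add model time if available
    cost_usd: float             # computed at log time; models get re-priced
    outcome: str | None = None  # downstream signal: accepted, escalated, ...
```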

## What NOT to log

Logs are an asset and a liability. Three things to keep out, or to capture with care:

- **Raw PII when you can avoid it.** Hash or tokenize emails, phone numbers, customer IDs *before* they hit the log store. You need to be able to ask "did this user see a bad output?" — you do not need their name in your warehouse.
- **Secrets that ended up in inputs.** Users paste API keys, passwords, SSNs into chat boxes. Scrub these at write time with a deny-list of high-confidence patterns (a sketch follows this list). Cheap insurance.
- **Full model output for high-volume, low-value calls.** If you're tagging 5M items a day, store outputs for a sampled slice (1-5%) plus 100% of failures. Full capture of everything is a storage tax for no extra signal.
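
A minimal sketch of write-time scrubbing, assuming a regex deny-list. The patterns and the `pseudonymize` helper are illustrative starters, not a vetted rule set; tune them to what your users actually paste.

```python
import hashlib
import re

# High-confidence patterns only; false positives here cost you debuggability.
DENY_LIST = [
    (re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"), "[REDACTED_API_KEY]"),  # OpenAI-style keys
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),        # US SSNs
]

def scrub(text: str) -> str:
    """Drop secrets at write time, before anything reaches the log store."""
    for pattern, replacement in DENY_LIST:
        text = pattern.sub(replacement, text)
    return text

def pseudonymize(value: str) -> str:
    """Stable token for PII (emails, customer IDs): lets you ask 'did this
    user see a bad output?' without storing the raw value. Use a keyed hash
    (HMAC) in production so the mapping can't be brute-forced."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]
```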

## The three questions logs should answer in five minutes

### 1. "What did this exact user see?"

Given a request ID, you should be able to load the input, output, prompt version, and model version in one query. If a support engineer needs to ping you to pull this, your logs aren't done.
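In practice, "one query" can be as literal as this sketch, which assumes psycopg and a hypothetical `llm_calls` Postgres table (the schema appears in the "Where to send the logs" section below):

```python
import psycopg
from psycopg.rows import dict_row

def what_did_this_user_see(dsn: str, request_id: str) -> dict | None:
    """From request ID to everything needed to reproduce the response, one query."""
    with psycopg.connect(dsn, row_factory=dict_row) as conn:
        return conn.execute(
            """
            SELECT inputs, outputs, prompt_version, model, ts
            FROM llm_calls
            WHERE request_id = %s
            """,
            (request_id,),
        ).fetchone()
```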

### 2. "Is this error a one-off or a pattern?"

You need to slice by prompt version, model version, user tenant, input length, language, and outcome — at minimum. A single bad output is noise. The same kind of bad output in 8% of last Tuesday's calls is a regression.
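One illustrative slice, again against the hypothetical `llm_calls` table: failure rate per prompt version on a given day, which is exactly the query that separates "one-off" from "8% of Tuesday's calls."

```python
# Parameterize %(day)s with a date; the same shape works for any slice
# dimension (model, tenant, language) by swapping the GROUP BY column.
FAILURE_RATE_BY_PROMPT_VERSION = """
SELECT prompt_version,
       count(*) AS calls,
       count(*) FILTER (WHERE outcome = 'failure')::float / count(*) AS failure_rate
FROM llm_calls
WHERE ts >= %(day)s AND ts < %(day)s::date + interval '1 day'
GROUP BY prompt_version
ORDER BY failure_rate DESC;
"""
```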

### 3. "What changed?"

Logs should show deploys and config changes as overlays on the time series. The most common cause of "the model got worse on Tuesday" is "we deployed on Tuesday." Time-correlation lets you stop guessing.
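The cheapest version is a hypothetical `deploy_events` companion table; a sketch:

```python
# Write one row per deploy or config change, from CI or the deploy script.
RECORD_DEPLOY_EVENT = """
INSERT INTO deploy_events (ts, kind, detail)
VALUES (now(), %(kind)s, %(detail)s);
"""
# At dashboard time, select deploy_events in the same window as the
# error-rate series and render them as vertical markers over the chart.
```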

## Where to send the logs

Two reasonable starting points:

- **A Postgres table** with the schema above plus a JSONB column for everything else. Cheap, queryable, scales further than you think. Right answer for <1M calls a day and a team that already runs Postgres (DDL sketch after this list).
- **A purpose-built LLM observability tool** (Langfuse, Helicone, Arize, etc.). Better visualizations, built-in trace views for tool calls, often integrates with eval frameworks. Right answer when you need traces across many model calls per request, or when your team can't afford to maintain the schema.
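
For the Postgres option, a sketch of the DDL, matching the field names used in the earlier examples (all illustrative):

```python
# The twelve fields plus the JSONB overflow column.
LLM_CALLS_DDL = """
CREATE TABLE IF NOT EXISTS llm_calls (
    request_id     UUID PRIMARY KEY,
    ts             TIMESTAMPTZ NOT NULL,        -- always UTC
    user_id        TEXT NOT NULL,               -- user or tenant ID
    session_id     TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    model          TEXT NOT NULL,               -- exact snapshot/version ID
    inputs         TEXT NOT NULL,
    outputs        TEXT NOT NULL,
    tokens_in      INTEGER NOT NULL,
    tokens_out     INTEGER NOT NULL,
    latency_ms     DOUBLE PRECISION NOT NULL,
    cost_usd       NUMERIC(12, 6) NOT NULL,     -- computed at log time
    outcome        TEXT,                        -- filled when the signal arrives
    extra          JSONB NOT NULL DEFAULT '{}'  -- everything else
);
CREATE INDEX IF NOT EXISTS llm_calls_ts_idx ON llm_calls (ts);
"""
```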

Whichever you pick, put it in front of every model call from day one. Adding observability after launch is twice the work, and you'll be missing the data you most need to debug the first incident.
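
"In front of every model call" can be a thin wrapper that assembles the record around whatever client you use. A sketch, reusing the `LLMCallRecord` and `scrub` sketches above; `client.complete` and `price()` are stand-ins, not any particular provider SDK.

```python
import time
import uuid
from datetime import datetime, timezone

def logged_call(client, rendered_prompt: str, *, user_id: str, session_id: str,
                prompt_version: str, model: str, write_log):
    """Write the twelve-field record on every call, on every code path."""
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    response = client.complete(model=model, prompt=rendered_prompt)  # stand-in API
    write_log(LLMCallRecord(
        request_id=request_id,
        timestamp=datetime.now(timezone.utc),
        user_id=user_id,
        session_id=session_id,
        prompt_version=prompt_version,
        model=model,
        inputs=scrub(rendered_prompt),
        outputs=scrub(response.text),
        tokens_in=response.usage.input_tokens,
        tokens_out=response.usage.output_tokens,
        latency_ms=(time.monotonic() - start) * 1000,
        cost_usd=price(model, response.usage),  # hypothetical price-table lookup
    ))
    # Error path elided for brevity: a failed call must also write a record,
    # with outcome="error" and whatever partial fields are available.
    return response
```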

## Sampling and retention

Two settings teams routinely get wrong.

- **Sampling.** 100% of failures, always. 100% of high-value calls (paid features, irreversible actions). 1-5% of everything else, randomized. Sample at write time, not query time (sketch after this list).
- **Retention.** 90 days hot for debugging and weekly analysis. 12-18 months cold (object storage, archival format) for long-tail compliance and trend work. Anything else is either too expensive or too short to investigate slow regressions.
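
The sampling rule fits in a few lines. A sketch; the `failed` and `high_value` predicates are yours to define.

```python
import random

OUTPUT_SAMPLE_RATE = 0.02  # somewhere in the 1-5% band; pick one and record it

def should_store_full_output(*, failed: bool, high_value: bool) -> bool:
    """Write-time decision; never sample at query time."""
    if failed or high_value:
        return True  # 100% of failures and high-value calls, always
    return random.random() < OUTPUT_SAMPLE_RATE  # randomized slice of the rest
```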

## Close

Logging feels like back-of-house work that can wait. It cannot. The first incident in production will eat a week if you can't answer the three questions. With the schema above, it eats an afternoon — and the second incident eats fifteen minutes, because by then you're searching for patterns instead of starting from scratch.

---

**Related**

- [Evals Before Features](https://21xventures.com/blog/evals-before-features/) — the eval pipeline that pairs with these logs
- [Prompt Versioning Without the Hairball](https://21xventures.com/blog/prompt-versioning/) — how the "prompt version" log field becomes load-bearing
- [Cost Ceilings for AI Features](https://21xventures.com/blog/cost-ceilings/) — the cost log field is what makes the ceiling enforceable
- [Latency Budgets for AI Features](https://21xventures.com/blog/latency-budgets/) — the latency field is what makes P50/P95 dashboards possible

**Contact:** hello@21xventures.com
