
You can't fix what you can't trace: observability for multi-agent systems

When an orchestrator hands off to a sub-agent that calls a tool that calls a model, “it gave a bad answer” isn’t a bug report. Here’s the instrumentation that turns it into one.

A single-prompt chatbot is easy to debug: there’s one call, you read it, you see what happened. A multi-agent system is not that. A request hits an orchestrator, which routes to a specialised sub-agent, which calls a tool, which hits an external API, which feeds a second model call that produces the answer. When that answer is wrong, “the agent messed up” tells you nothing. Which agent, on which hop, with what in its context?

You can’t answer that by reading logs after the fact. You have to instrument for it up front. And no single tool covers it — the honest answer is a small stack of them, each doing the one thing it’s good at.

Four questions, four layers

The mistake is looking for one dashboard that does everything. Observability for agents splits into four distinct questions, and they want different tools.

“What was the call graph and where did the time go?” This is distributed tracing. You want an automatic span hierarchy — invocation → agent run → model call → tool execution — so you can see the whole trajectory as a tree and spot the slow hop. Most agent frameworks can emit this to a standard tracing backend with little custom code. Standardise on OpenTelemetry-style spans so you’re not locked in.
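To make the span hierarchy concrete, here is a minimal sketch in plain Python. It is not OpenTelemetry and not any framework's API: the `Tracer` and `Span` classes, the span names, and the sleep durations are all hypothetical stand-ins, and a real setup would export spans to a tracing backend rather than keep them in memory. It only shows the shape: nested spans that share one trace id, each with a parent and a duration.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float = 0.0
    end: float = 0.0
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Tracer:
    """Toy in-memory tracer (hypothetical); real systems export to a backend."""

    def __init__(self) -> None:
        self.roots: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        # Child spans inherit the trace id; a root span mints a new one.
        trace_id = parent.trace_id if parent is not None else uuid.uuid4().hex
        parent_id = parent.span_id if parent is not None else None
        current = Span(name, trace_id, uuid.uuid4().hex[:16], parent_id)
        (parent.children if parent is not None else self.roots).append(current)
        self._stack.append(current)
        current.start = time.perf_counter()
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def slowest_leaf(span: Span) -> Span:
    """Return the leaf span with the longest duration under `span`."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)


def render(span: Span, depth: int = 0) -> List[str]:
    """Render the span tree as indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.1f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("invocation") as root:
    with tracer.span("agent_run"):
        with tracer.span("model_call"):
            time.sleep(0.001)   # stand-in for a fast model call
        with tracer.span("tool_execution"):
            time.sleep(0.05)    # stand-in for a slow external API

print("\n".join(render(root)))
print("slowest hop:", slowest_leaf(root).name)
```

The useful property is that the slow hop falls out of the data structure: once every step is a span with a parent, "where did the time go?" is a tree walk.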

“What is this costing, and how do real users experience it?” This is product analytics, and tracing won’t give it to you. You want per-generation and per-user cost, token counts, p50/p95 latency, and — the part that’s easy to miss — the ability to tie all of that back to actual user sessions and funnels. That means emitting your own events (model.run, tool.call, agent.error) with a consistent schema: trace id, user id, latency, tokens in/out, cost, model, status. Now “is the agent worth what it costs?” is a query, not a guess.
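A consistent schema is easier to enforce when it is a type rather than a convention. The sketch below is one possible shape for the event described above, using only the standard library. The field names, the `example-model` name, and the prices are assumptions for illustration, not any provider's real pricing or any analytics tool's required format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Literal

# Hypothetical prices in USD per million tokens; real prices vary by provider.
PRICE_PER_MTOK = {"example-model": {"in": 1.00, "out": 4.00}}


@dataclass(frozen=True)
class AgentEvent:
    event: Literal["model.run", "tool.call", "agent.error"]
    trace_id: str
    user_id: str
    latency_ms: float
    tokens_in: int
    tokens_out: int
    cost_usd: float
    model: str
    status: Literal["ok", "error"]

    def to_json(self) -> str:
        """Serialise with stable key order so downstream parsing is predictable."""
        return json.dumps(asdict(self), sort_keys=True)


def model_run_event(trace_id: str, user_id: str, model: str,
                    tokens_in: int, tokens_out: int,
                    latency_ms: float, status: str = "ok") -> AgentEvent:
    """Build a model.run event, deriving cost from token counts."""
    price = PRICE_PER_MTOK[model]
    cost = (tokens_in * price["in"] + tokens_out * price["out"]) / 1_000_000
    return AgentEvent("model.run", trace_id, user_id, latency_ms,
                      tokens_in, tokens_out, round(cost, 6), model, status)


evt = model_run_event("trace-abc", "user-42", "example-model",
                      tokens_in=1200, tokens_out=300, latency_ms=840.0)
print(evt.to_json())  # in a real system this would go to your analytics sink
```

Computing cost at emit time, next to the token counts, means the number in the dashboard and the number in the warehouse come from the same place.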

“Is it actually correct?” Neither tracing nor analytics answers this — they tell you the system ran and what it spent, not whether the output was right. That’s structured evaluation: a set of cases run against the agent that score both the final response and the trajectory (did it call the right tools in the right order?). This is the layer that lets you say a change was an improvement instead of just different.
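As a rough illustration of scoring both things at once, here is a small sketch. The scoring rules are deliberately simple assumptions: the response score is a substring match, and the trajectory score is the share of expected tool calls that appear in the right order (a longest-common-subsequence match). The case, the tool names, and the `AgentResult` shape are all hypothetical; a real eval would likely use richer graders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalCase:
    prompt: str
    expected_answer: str
    expected_tools: List[str]


@dataclass
class AgentResult:
    answer: str
    tools_called: List[str]


def response_score(case: EvalCase, result: AgentResult) -> float:
    """1.0 if the expected answer appears in the response (case-insensitive)."""
    return 1.0 if case.expected_answer.lower() in result.answer.lower() else 0.0


def trajectory_score(case: EvalCase, result: AgentResult) -> float:
    """Fraction of expected tool calls matched in order (LCS / expected length)."""
    expected, got = case.expected_tools, result.tools_called
    if not expected:
        return 1.0 if not got else 0.0
    prev = [0] * (len(got) + 1)  # LCS row for the previous expected tool
    for exp_tool in expected:
        cur = [0]
        for j, got_tool in enumerate(got, start=1):
            if exp_tool == got_tool:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(expected)


case = EvalCase(
    prompt="Refund order 123",
    expected_answer="refund issued",
    expected_tools=["search_orders", "issue_refund"],
)
good = AgentResult("Refund issued for order 123.",
                   ["search_orders", "lookup_policy", "issue_refund"])
wrong_order = AgentResult("Refund issued for order 123.",
                          ["issue_refund", "search_orders"])

print(response_score(case, good), trajectory_score(case, good))
print(response_score(case, wrong_order), trajectory_score(case, wrong_order))
```

Note that the second result has an acceptable final answer but a poor trajectory, which is exactly the kind of failure a response-only check would miss.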

“What happened across the last three months?” Live trace UIs are built for the recent past, not cohort analysis. Export your events to a warehouse, and historical trends, regressions, and per-cohort behaviour become ordinary SQL queries.
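Here is what "ordinary SQL" might look like, sketched with Python's built-in `sqlite3` as a stand-in for a real warehouse. The `agent_events` table, its columns, and the sample rows are invented for illustration; your export will have its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute("""
    CREATE TABLE agent_events (
        ts TEXT, trace_id TEXT, user_id TEXT, event TEXT,
        latency_ms REAL, cost_usd REAL, status TEXT
    )
""")
# Hypothetical sample rows, one per model.run event.
rows = [
    ("2024-01-05", "t1", "u1", "model.run", 800.0, 0.010, "ok"),
    ("2024-01-20", "t2", "u2", "model.run", 900.0, 0.012, "ok"),
    ("2024-02-03", "t3", "u1", "model.run", 1500.0, 0.030, "error"),
    ("2024-02-17", "t4", "u3", "model.run", 1300.0, 0.026, "ok"),
    ("2024-03-09", "t5", "u2", "model.run", 700.0, 0.009, "ok"),
]
conn.executemany("INSERT INTO agent_events VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Month-by-month trend: volume, average latency, total cost, error rate.
MONTHLY_TREND = """
    SELECT substr(ts, 1, 7)              AS month,
           COUNT(*)                      AS calls,
           ROUND(AVG(latency_ms), 1)     AS avg_latency_ms,
           ROUND(SUM(cost_usd), 4)       AS total_cost_usd,
           ROUND(AVG(status = 'error'), 2) AS error_rate
    FROM agent_events
    WHERE event = 'model.run'
    GROUP BY month
    ORDER BY month
"""
trend = conn.execute(MONTHLY_TREND).fetchall()
for row in trend:
    print(row)
```

In this made-up data the February regression (higher latency, higher cost, a nonzero error rate) is visible in a single grouped query, which is the point of exporting events at all.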

The thread that ties it together

The stack only works if the layers share a key. The single most valuable instrumentation decision is making sure every layer stamps the same trace id.

When your analytics event for a costly turn carries the same trace id as the tracing backend’s span tree, debugging stops being archaeology. You see a spike in cost or latency in the dashboard, grab the trace id, and pull up the exact span tree — which sub-agent, which tool call, how long each hop took. Without that shared key you have two disconnected stories about the same event and a lot of guessing in between.
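The workflow in the paragraph above can be sketched in a few lines. Everything here is hypothetical: the events stand in for rows from your analytics layer, the spans stand in for what a tracing backend would return, and names like `agent_run:billing` and `fetch_invoices` are made up. The only thing that matters is that both sides carry the same `trace_id`.

```python
from collections import defaultdict

# Hypothetical analytics events (one per turn) and tracing spans.
events = [
    {"trace_id": "t1", "event": "model.run", "cost_usd": 0.004, "latency_ms": 600},
    {"trace_id": "t2", "event": "model.run", "cost_usd": 0.210, "latency_ms": 9100},
]
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None,
     "name": "invocation", "duration_ms": 650},
    {"trace_id": "t2", "span_id": "b", "parent_id": None,
     "name": "invocation", "duration_ms": 9200},
    {"trace_id": "t2", "span_id": "c", "parent_id": "b",
     "name": "agent_run:billing", "duration_ms": 9100},
    {"trace_id": "t2", "span_id": "d", "parent_id": "c",
     "name": "tool_execution:fetch_invoices", "duration_ms": 8400},
    {"trace_id": "t2", "span_id": "e", "parent_id": "c",
     "name": "model_call", "duration_ms": 600},
]

# Index spans by the shared key so a lookup from the dashboard is one step.
spans_by_trace = defaultdict(list)
for s in spans:
    spans_by_trace[s["trace_id"]].append(s)


def costly_traces(events, cost_threshold_usd):
    """Trace ids of events at or above the cost threshold."""
    return [e["trace_id"] for e in events if e["cost_usd"] >= cost_threshold_usd]


def span_tree(trace_id):
    """Indented lines describing the span tree for one trace."""
    by_parent = defaultdict(list)
    for s in spans_by_trace[trace_id]:
        by_parent[s["parent_id"]].append(s)
    lines = []

    def walk(parent_id, depth):
        for s in by_parent[parent_id]:
            lines.append(f"{'  ' * depth}{s['name']} ({s['duration_ms']} ms)")
            walk(s["span_id"], depth + 1)

    walk(None, 0)
    return lines


for trace_id in costly_traces(events, cost_threshold_usd=0.10):
    print(trace_id)
    print("\n".join(span_tree(trace_id)))
```

With the shared key in place, the jump from "this turn was expensive" to "the billing sub-agent's tool call took most of the time" is a dictionary lookup rather than a search through two unrelated systems.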

Evals belong in CI

The piece teams skip is wiring structured evaluation into continuous integration. It’s tempting to treat evals as a thing you run by hand when you’re worried. That defeats the purpose.

The point of an eval set is repeatability — catching the regression the moment a prompt tweak or model swap introduces it, not three weeks later when a user reports it. Start small: three to five of your most important flows, scored on the outcome that actually matters. Run them on every change. The first time an eval catches a regression a human would have shipped, it pays for the entire effort.
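One way to wire this up is a script that CI runs on every change and that fails the build when the pass rate drops. The sketch below assumes a hypothetical `run_agent` function with canned answers in place of your real agent, three invented flows, and a simple substring check as the score. Treat it as a shape to adapt, not a finished harness.

```python
from typing import Callable, Dict, List, Tuple


def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for the real agent entry point."""
    canned = {
        "refund order 123": "Refund issued for order 123.",
        "where is my package?": "Your package is out for delivery.",
        "cancel my subscription": "Your subscription has been cancelled.",
    }
    return canned.get(prompt, "")


# A small eval set: (prompt, text the answer must contain). Invented examples.
CASES: List[Tuple[str, str]] = [
    ("refund order 123", "refund issued"),
    ("where is my package?", "out for delivery"),
    ("cancel my subscription", "cancelled"),
]


def run_evals(agent: Callable[[str], str],
              cases: List[Tuple[str, str]]) -> Dict[str, bool]:
    """Run every case and record whether the expected text appears."""
    return {prompt: expected.lower() in agent(prompt).lower()
            for prompt, expected in cases}


def gate(results: Dict[str, bool], min_pass_rate: float = 1.0) -> int:
    """Print per-case results; return a process exit code (0 = pass)."""
    for prompt, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}  {prompt}")
    pass_rate = sum(results.values()) / len(results)
    print(f"pass rate: {pass_rate:.0%} (required: {min_pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


exit_code = gate(run_evals(run_agent, CASES))
# In CI, finish with sys.exit(exit_code) so a failing eval fails the build.
```

Returning an exit code rather than raising keeps the gate easy to reuse: the same function works in a CI job, a pre-merge hook, or a local run before you push a prompt change.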

Why bother with all of it

It looks like a lot of plumbing for something that “already works in the demo.” But these four layers are what let you do the things that make an agent a real product rather than a fragile prototype: debug a bottleneck by looking instead of guessing, adopt a new model in an afternoon because your evals will catch any regression, and tell a stakeholder — with a number, not a feeling — exactly how reliable and how expensive the system is.

That’s not overhead on top of the product. For anything agentic in production, it is the product.