Keeping agents fast: lean state, direct tools, and the data-card trick
An agent that's correct but slow loses to one that's slightly worse and snappy. Most latency in agent systems is self-inflicted — here's where it hides.
Correctness gets all the attention in agent work. But users feel latency first, and an agent that takes nine seconds to answer feels broken even when the answer is perfect. The good news is that most of the slowness in agent systems isn’t the model being slow — it’s architecture decisions quietly adding hops, tokens, and round-trips. Here’s where I keep finding it.
Don’t wrap a function call in an agent
The most common self-inflicted wound: turning every capability into a sub-agent. It feels clean and uniform — everything’s an agent, everything routes the same way. In practice, a sub-agent whose entire job is “call this API and return the result” adds a whole extra model call to decide to do the thing it was always going to do. Sometimes it even returns an empty response and you’ve burned a round-trip for nothing.
The rule I settled on: if a capability is “call an API and return some structured values,” it’s a function tool, and the main agent calls it directly. Reserve sub-agents for flows that genuinely need their own prompt, routing, and multi-step reasoning. The test is simple — does this thing need to think, or just fetch? Fetching doesn’t deserve a model call of its own.
Keep session state lean
The second trap is letting session state become a junk drawer. It’s tempting to stash whatever might be useful — recent results, fetched lists, big objects — into the conversation’s persistent state. Every one of those makes turns slower (more to load and write each cycle) and behaviour weirder (stale data lingers and “sticks” across turns in ways nobody intended).
The discipline: session state holds small scalars only — ids, booleans, tiny strings. Anything that’s only relevant to the current request goes in a clearly-marked ephemeral key that gets dropped when the run ends. Large results don’t belong in state at all; they go to the client out-of-band or as a paged result, with at most a compact summary kept around. Lean state means faster turns and far less “why is it behaving like that?” debugging.
Fetch live data on demand, don’t inline it
Related, and worth stating on its own: live business data should be pulled at the moment of decision, not preloaded into the prompt or parked in state. Inlining a full list of options or activity into every prompt bloats the context, slows the call, and serves stale data the moment the underlying source changes. Fetch when needed; keep only short hints — a count, an id, a timestamp — between turns.
The data-card trick
This one made the biggest difference to both speed and quality, and it’s almost embarrassingly simple.
When you need to give the model structured data — a user’s recent activity, their profile, a summary of past behaviour — the obvious move is to serialise the raw JSON into the prompt. It works, but it’s noisy: the model spends attention parsing braces and quotes, and you spend tokens on syntax that carries no meaning.
Instead, render the data in code into a short, human-readable card before it goes in the prompt:
<recent_activity_card>
Last action: opted into "Spring offer" — $25 — Oct 6
Redemptions: $10 gift card — May 19
</recent_activity_card>
The card is generated deterministically in code, wrapped in a simple tag so the model knows where it begins and ends, and dropped into the prompt as plain text. It’s a fraction of the tokens of the equivalent JSON, it’s far easier for the model to read correctly, and because it’s built in code it’s consistent — the model isn’t re-deciding how to interpret the structure every turn. Lower token cost and better comprehension from the same underlying data.
Stop when the work is done
Last one: don’t make the model narrate. A common pattern has the agent call a tool, get everything the interface needs, and then take an extra turn to explain what it just did. That explanation is pure latency if nothing consumes it. If a tool result is definitive and the UI can render it directly, emit it and finish. Add a wording pass only when there’s actually user-facing copy to write.
None of these are clever. They’re just the difference between an agent that responds while the user is still paying attention and one that doesn’t. Push reliability into structure, and push latency out of it — fewer hops, leaner state, data shaped for the model instead of dumped at it. Speed, almost always, is something you design rather than optimise later.