Designing LLM Systems That Stay Fast Under Load
- LLM Systems
- Latency
- Inference
- Observability
Most LLM demos feel fast until real traffic arrives. Once multiple users, larger prompts, and upstream tool calls enter the loop, latency stops being a model problem and starts becoming a systems problem.
My default approach is to treat every request like a budget allocation exercise. Instead of only asking which model is smartest, I map where time is spent across retrieval, context building, inference, and response shaping, then decide which parts deserve optimization first.
Start with a latency budget, not a model benchmark
The first mistake teams make is optimizing inference in isolation. In production, the model is just one segment in a longer path that includes vector search, database lookups, prompt assembly, safety checks, and formatting. If the total response budget is two seconds, I prefer assigning hard caps to each stage before changing infrastructure or models.
This changes the conversation immediately. A pipeline that spends 900ms on retrieval and 700ms on model generation does not need a more expensive frontier model. It needs better chunking, more selective context, and fewer blocking calls. The useful metric is not raw tokens per second. It is end-to-end time for a good answer.
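The budget idea can be sketched as a simple per-stage cap check. The stage names and millisecond caps below are illustrative assumptions, not a prescription; the point is that the 900ms-retrieval pipeline from the example fails on retrieval, not inference.

```python
# A minimal sketch of a per-stage latency budget for a 2-second
# end-to-end target. Stage names and caps are illustrative.
STAGE_BUDGETS_MS = {
    "retrieval": 400,
    "prompt_assembly": 100,
    "inference": 1200,
    "safety_and_formatting": 300,
}

def check_budget(observed_ms: dict[str, float]) -> list[str]:
    """Return the stages that exceeded their hard cap."""
    return [
        stage for stage, cap in STAGE_BUDGETS_MS.items()
        if observed_ms.get(stage, 0.0) > cap
    ]

# The pipeline from the example: retrieval blows its cap even though
# the end-to-end total is under two seconds.
observed = {"retrieval": 900, "prompt_assembly": 80,
            "inference": 700, "safety_and_formatting": 120}
print(check_budget(observed))  # -> ['retrieval']
```

Framing it this way makes the optimization target explicit: fix the stage that is over budget before touching the model.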
Use parallelism carefully and cache only what repeats
Parallel tool execution is one of the cleanest wins when the calls have no dependencies on each other. If a system needs documentation retrieval, user profile lookup, and policy loading, those operations should not queue behind each other. Small changes in orchestration often recover more latency than model tuning.
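In Python, those three independent lookups can be dispatched together with `asyncio.gather`. The coroutines below are hypothetical stand-ins for real I/O calls; the simulated sleeps just make the timing behavior visible.

```python
import asyncio

# Hypothetical stand-ins for independent I/O-bound lookups.
async def fetch_docs() -> str:
    await asyncio.sleep(0.3)   # simulated retrieval latency
    return "docs"

async def fetch_profile() -> str:
    await asyncio.sleep(0.2)   # simulated database lookup
    return "profile"

async def load_policy() -> str:
    await asyncio.sleep(0.25)  # simulated policy load
    return "policy"

async def gather_context() -> list[str]:
    # Runs all three concurrently: wall time is roughly the slowest
    # call (~0.3s), not the ~0.75s sum of a sequential pipeline.
    return await asyncio.gather(fetch_docs(), fetch_profile(), load_policy())

if __name__ == "__main__":
    print(asyncio.run(gather_context()))  # -> ['docs', 'profile', 'policy']
```

The same shape applies with real clients, as long as each call is genuinely independent of the others' results.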
Caching only works when the repeated unit is well-defined. I usually cache stable prompt prefixes, frequently accessed retrieval results, and deterministic post-processing layers. I avoid caching entire final responses unless the task is obviously repetitive, because stale answers are usually more damaging than slightly slower ones.
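A minimal sketch of that policy, assuming a retrieval layer keyed by query string: cache the well-defined repeated unit with a TTL so stale entries expire, and never wrap the final response. The class and names are illustrative, not a specific library's API.

```python
import time
from typing import Any, Callable

class TTLCache:
    """A tiny TTL cache for repeatable units like retrieval results.

    Illustrative sketch: cache stable prefixes or query->chunks lookups,
    not final answers, so staleness is bounded by the TTL.
    """

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]          # fresh entry: skip the expensive call
        value = compute()          # miss or expired: recompute and store
        self._store[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=60)
chunks = cache.get_or_compute("billing policy", lambda: ["chunk-1", "chunk-2"])
```

The TTL is the knob that expresses the staleness trade-off the paragraph describes: a short TTL on retrieval results buys latency without risking long-lived stale answers.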
Observability decides whether the system is actually improving
Latency work without instrumentation turns into guesswork very quickly. I want traces for every stage, request IDs across services, and clear percentiles instead of average-only dashboards. P95 usually tells a more honest story than the median, especially when one slow dependency drags the whole experience down.
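The median-versus-P95 point is easy to demonstrate with synthetic numbers: one slow dependency barely moves the median but dominates the tail. The nearest-rank percentile below is a simplified sketch, and the latency samples are invented for illustration.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p * len(ordered)))
    return ordered[idx]

# Nine healthy requests and one stalled dependency (values invented).
latencies_ms = [210, 225, 230, 240, 250, 255, 260, 270, 280, 2400]

print(percentile(latencies_ms, 0.50))  # -> 255: the median looks fine
print(percentile(latencies_ms, 0.95))  # -> 2400: the tail exposes the stall
```

An average-only dashboard would blend that stall into a mildly elevated mean; the P95 line makes it impossible to ignore.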
The practical goal is consistency. Users forgive systems that are slightly slower but predictable. They lose trust in systems that are fast on one request and stall on the next. For that reason, I often choose a smaller, better-behaved model with stronger orchestration over a larger model that performs well only under ideal conditions.