Most of the discourse around large language models focuses on the models themselves — capabilities, benchmarks, training approaches. The discourse around running those models in production at scale is thinner, despite being where most of the operational complexity lives. If you're building infrastructure to serve LLM inference — whether you're at an AI company, a cloud provider, or an enterprise deploying models internally — the reliability and performance challenges are distinct from what you've seen with conventional web services.

This is a deep dive into the infrastructure layer: how inference serving actually works, what makes it hard, and the SRE practices that apply.


How LLM Inference Differs from Conventional Serving

Understanding the operational challenges requires understanding the computational model. Conventional web service requests are roughly uniform: a request arrives, some computation happens (usually bounded and predictable), a response is returned. Load balancers and autoscalers are designed around this assumption.

LLM inference is fundamentally different in two ways:

Variable compute per request. A request to generate a 50-token response and a request to generate a 2,000-token response are not the same cost. Decoding cost scales with the number of tokens generated, and prefill cost scales with the length of the prompt. Your p99 latency can be an order of magnitude higher than your p50 latency for the same model and similar prompts, simply because some requests generate more output. Load balancers that distribute based on request count are balancing the wrong thing.
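One way to balance on work rather than request count is to track the estimated tokens in flight on each replica and route to the least-loaded one. The following is a minimal sketch, not a production balancer: `Replica`, `pick_replica`, and `dispatch` are hypothetical names, and using the request's output cap as the cost estimate is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    outstanding_tokens: int = 0  # estimated tokens still to be processed


def pick_replica(replicas):
    """Route to the replica with the least estimated token work in flight."""
    return min(replicas, key=lambda r: r.outstanding_tokens)


def dispatch(replicas, prompt_tokens, max_new_tokens):
    # Output length is unknown up front, so the request's cap serves as an estimate.
    estimated_tokens = prompt_tokens + max_new_tokens
    target = pick_replica(replicas)
    target.outstanding_tokens += estimated_tokens
    return target


replicas = [Replica("a"), Replica("b")]
first = dispatch(replicas, prompt_tokens=200, max_new_tokens=2000)  # "a" (tie, first wins)
second = dispatch(replicas, prompt_tokens=100, max_new_tokens=50)   # "b" is empty
third = dispatch(replicas, prompt_tokens=100, max_new_tokens=50)    # "b" again: 150 < 2200
```

A request-count balancer would have sent the third request to "a". A real implementation would also subtract tokens as requests stream output and complete.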

Memory is the binding constraint. LLM inference is usually limited by GPU memory (capacity and bandwidth) rather than raw compute, especially during token-by-token decoding. The model weights live in GPU VRAM. The KV cache, which holds the attention keys and values for tokens already processed so the model does not have to recompute them, also lives in GPU VRAM. When VRAM is full, you either drop requests or degrade to slower serving. This is fundamentally different from CPU-bound web services where you can overprovision CPU by scaling horizontally and pay a predictable cost.


The KV Cache: The Variable You Need to Understand

The KV (key-value) cache is the most important concept for understanding LLM inference performance, and it's often underexplained.

When a model processes a sequence of tokens, it produces key and value tensors for each attention head at each layer. For autoregressive generation (where the model generates one token at a time), re-computing these tensors for the full context at every step would be prohibitively expensive. The KV cache stores these tensors so they can be reused.

The cache size scales with:

  • Sequence length (longer context → larger cache)
  • Model size (more layers, more key/value attention heads, larger head dimension → larger cache)
  • Batch size (more concurrent requests → more cache entries)

This creates the core capacity planning challenge: the memory required to serve a request depends on the context length of that request, which varies across your request population. A request with a 100K-token context (not unusual for document analysis use cases) requires dramatically more GPU memory than a request with a 1K-token context.
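The scaling factors above can be turned into a back-of-the-envelope estimate. This sketch assumes a standard transformer with multi-head or grouped-query attention, where each token stores one key and one value vector per KV head per layer. The model shape below is hypothetical, chosen only to make the arithmetic concrete.

```python
def kv_cache_bytes(seq_len, batch_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Keys and values (the factor of 2) per layer, per KV head, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
short = kv_cache_bytes(seq_len=1_000, batch_size=1, **shape)
long = kv_cache_bytes(seq_len=100_000, batch_size=1, **shape)

print(f"1K context:   {short / 2**30:.2f} GiB")   # 0.12 GiB
print(f"100K context: {long / 2**30:.2f} GiB")    # 12.21 GiB
```

With this shape, each token costs 128 KiB of cache, so a single 100K-token request consumes as much KV memory as a hundred 1K-token requests.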

For SRE purposes, this means your GPU utilization metrics don't tell the full story. A GPU at 60% compute utilization but 95% memory utilization is effectively at capacity — you cannot accept more long-context requests without OOMing. Your dashboards need to surface both dimensions.
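A simple way to express this on a dashboard or in an admission check is to treat a replica as being as full as its most constrained resource. This is a sketch; the function names and the 10% headroom default are illustrative assumptions.

```python
def effective_utilization(compute_util, memory_util):
    """A replica is as full as its most constrained resource."""
    return max(compute_util, memory_util)


def can_admit(compute_util, memory_util, headroom=0.10):
    """True only if both dimensions leave at least `headroom` free."""
    return effective_utilization(compute_util, memory_util) <= 1.0 - headroom


# The example from the text: 60% compute, 95% memory.
at_capacity = not can_admit(compute_util=0.60, memory_util=0.95)
```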


Continuous Batching: Why It Matters for Throughput

Early LLM inference serving used static batching: assemble a batch of N requests, run them all through the model together, return results, repeat. This is GPU-efficient in theory but terrible in practice because requests complete at different times. If one request in a batch of 32 generates 10x more tokens than the others, the entire batch waits for that request to finish.

Continuous batching (sometimes called iteration-level scheduling) solves this. The scheduler revisits the batch at every decoding iteration: as soon as a request finishes generating, its slot is freed and a waiting request is slotted in at the next step. The GPU no longer sits partly idle waiting for the slowest request in a batch. Frameworks like vLLM and TGI (Text Generation Inference) now implement continuous batching by default.
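The difference is easy to see in a toy simulation that counts decode steps under each policy. This is a simplified model, not how any particular framework schedules: it ignores prefill cost and memory limits, and assumes one decode step produces one token for every active request.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Freed slots are refilled from the queue before the next decode step."""
    queue = deque(output_lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step: one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Two long requests mixed in with short ones, four slots.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)          # 200
continuous_steps = continuous_batching_steps(lengths, batch_size=4)  # 110
```

Under static batching each batch is held hostage by its 100-token request. Under continuous batching the second long request starts as soon as the first short requests finish.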

The operational implication: throughput metrics should be measured in tokens/second, not requests/second. A continuous batching system that's processing a mix of short and long requests will show high token throughput even if request concurrency looks lower than you'd expect.


The Reliability Patterns

Graceful degradation on memory pressure. When your KV cache fills, you have choices: queue incoming requests and wait, reject them with a 503, or shed context by truncating or refusing long prompts. Prefill admission control, which rejects requests above a context-length threshold when memory pressure is high, is a common pattern. Your load shedding policy should be explicit and documented, not an emergent behavior.
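Writing the policy down as code is one way to keep it explicit. The sketch below shows a possible admission decision. The thresholds and the `Decision` and `admit` names are illustrative placeholders, not recommendations.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    QUEUE = "queue"
    REJECT = "reject"


def admit(prompt_tokens, kv_cache_utilization, queue_depth,
          pressure_threshold=0.85, max_prompt_under_pressure=8_000,
          max_queue_depth=64):
    """Explicit load-shedding policy; every threshold here is an assumed value."""
    if kv_cache_utilization < pressure_threshold:
        return Decision.ADMIT
    if prompt_tokens > max_prompt_under_pressure:
        return Decision.REJECT  # for example, an HTTP 503 with Retry-After
    if queue_depth < max_queue_depth:
        return Decision.QUEUE
    return Decision.REJECT
```

Under memory pressure, long prompts are rejected first, short prompts are queued, and everything is rejected once the queue is full.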

Model warming. Cold starts for LLM inference are expensive — loading model weights into GPU VRAM takes seconds to minutes depending on model size. In production, you want your model fully loaded before traffic hits. This means:

  • Health checks that verify the model is loaded, not just that the service process is up
  • Prewarming replicas before they're added to the load balancer rotation
  • Startup probe timeouts that account for model load time
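The distinction between "process is up" and "model is ready" can be sketched as separate liveness and readiness checks. `ModelServer` and its callables are hypothetical. In a real deployment these methods would back the HTTP endpoints your orchestrator probes.

```python
import threading


class ModelServer:
    """Hypothetical server wrapper that separates liveness from readiness."""

    def __init__(self, load_weights, warmup):
        self._load_weights = load_weights  # callable that loads weights into VRAM
        self._warmup = warmup              # callable that runs one tiny inference
        self._ready = threading.Event()

    def start(self):
        self._load_weights()
        self._warmup()  # fail here rather than on the first user request
        self._ready.set()

    def liveness(self):
        return 200  # the process is up

    def readiness(self):
        # Only report ready once weights are loaded and warmup has succeeded.
        return 200 if self._ready.is_set() else 503


server = ModelServer(load_weights=lambda: None, warmup=lambda: None)
before = server.readiness()  # 503: alive but not ready for traffic
server.start()
after = server.readiness()   # 200: safe to add to the rotation
```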

Prompt caching for reliability and cost. Major hosted inference providers now offer prompt caching: if the prefix of a request matches a cached prefix, the KV computation for that prefix is served from cache rather than recomputed. For SRE use cases where system prompts are long and mostly static (your incident response agent's system prompt doesn't change between invocations), prompt caching can reduce both latency and cost substantially. Monitor cache hit rates as a first-class metric.
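A token-weighted hit rate is usually more informative than a per-request one, because it reflects how much prefill work was actually avoided. This minimal sketch assumes your provider or serving framework reports, per request, how many prompt tokens were served from cache. The `CacheHitTracker` name and its fields are hypothetical.

```python
class CacheHitTracker:
    """Accumulates a token-weighted prompt cache hit rate."""

    def __init__(self):
        self.prompt_tokens = 0
        self.cached_tokens = 0

    def record(self, prompt_tokens, cached_tokens):
        self.prompt_tokens += prompt_tokens
        self.cached_tokens += cached_tokens

    @property
    def hit_rate(self):
        if self.prompt_tokens == 0:
            return 0.0
        return self.cached_tokens / self.prompt_tokens


tracker = CacheHitTracker()
tracker.record(prompt_tokens=1_000, cached_tokens=0)    # first call warms the cache
tracker.record(prompt_tokens=1_000, cached_tokens=900)  # static prefix reused
```

A sudden drop in this number often means something upstream started changing the prefix, such as a timestamp injected into the system prompt.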

Speculative decoding for latency reduction. Speculative decoding uses a small "draft" model to generate candidate tokens quickly, which the large model then verifies in parallel. Verifying several draft tokens takes a single forward pass of the large model, whereas generating them would take one pass per token, so if the draft model is frequently correct, you get significant latency improvements. The failure mode is a draft model that is consistently wrong: you pay the speculative overhead and get no benefit. Monitor acceptance rate (how often the large model accepts the draft model's tokens) to detect this.
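To see why acceptance rate matters so much, it helps to estimate how many tokens each large-model pass yields. The sketch below uses the idealized assumption that every draft token is accepted independently with the same probability. That assumption is not from this article and real acceptance is correlated, so treat the result as a rough guide. The function names are hypothetical.

```python
def acceptance_rate(accepted_tokens, drafted_tokens):
    """Fraction of draft tokens the large model accepted."""
    return accepted_tokens / drafted_tokens if drafted_tokens else 0.0


def expected_tokens_per_pass(alpha, draft_len):
    """Expected tokens emitted per large-model pass, assuming each draft token
    is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)


good = expected_tokens_per_pass(alpha=0.8, draft_len=4)  # about 3.36 tokens per pass
bad = expected_tokens_per_pass(alpha=0.1, draft_len=4)   # about 1.11 tokens per pass
```

At a low acceptance rate, each pass yields barely more than the single token plain decoding would produce, while you still pay for running the draft model.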


Capacity Planning for LLM Inference

Capacity planning for conventional services: measure requests/second and latency at current traffic, model growth, provision to a utilization target.

Capacity planning for LLM inference requires understanding your request distribution:

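```python
# What you need to measure. histogram() and timeseries() are minimal stand-ins
# for your metrics library; record fields such as m.timestamp are assumed names.
import numpy as np

def histogram(records, value):
    return np.histogram([value(r) for r in records], bins=20)

def timeseries(records, value):
    return [(m.timestamp, value(m)) for m in records]

def collect_metrics(requests, time_windows, measurements):
    return {
        "input_token_distribution": histogram(requests, lambda r: r.input_tokens),
        "output_token_distribution": histogram(requests, lambda r: r.output_tokens),
        "concurrent_request_distribution": histogram(time_windows, lambda w: w.concurrent_requests),
        "kv_cache_utilization": timeseries(measurements, lambda m: m.kv_cache_percent_full),
        "token_throughput": timeseries(measurements, lambda m: m.tokens_per_second),
        "time_to_first_token": histogram(requests, lambda r: r.ttft_ms),  # latency until the first token
        "time_per_output_token": histogram(requests, lambda r: r.tpot_ms),  # generation speed per token
    }
```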

Your capacity is constrained by the combination of peak concurrent requests AND the context lengths of those requests. A cluster sized for 100 concurrent 1K-context requests will OOM if those 100 requests have 50K-context prompts. Model your 95th and 99th percentile request sizes, not just the average.
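The gap between sizing for the average and sizing for the tail can be made concrete with a small calculation. This sketch uses a made-up traffic sample, an assumed KV budget of 40 GiB left after weights, and an assumed footprint of 128 KiB per token. Substitute your own measurements.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def max_concurrent_requests(kv_budget_bytes, bytes_per_token, context_tokens):
    """How many requests of a given context length fit in the KV budget."""
    return kv_budget_bytes // (bytes_per_token * context_tokens)


context_lengths = [1_000] * 90 + [50_000] * 10  # hypothetical traffic sample
kv_budget = 40 * 2**30                           # assumed VRAM left after weights
bytes_per_token = 128 * 2**10                    # assumed KV footprint per token

mean_ctx = sum(context_lengths) // len(context_lengths)
p99_ctx = percentile(context_lengths, 99)
at_mean = max_concurrent_requests(kv_budget, bytes_per_token, mean_ctx)  # 55
at_p99 = max_concurrent_requests(kv_budget, bytes_per_token, p99_ctx)    # 6
```

Planning on the mean suggests 55 concurrent requests fit. If the concurrent set is dominated by p99-sized requests, only 6 do.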


The Autoscaling Problem

LLM inference autoscaling is hard for several interconnected reasons:

GPU provisioning is slow. Spinning up a new GPU instance and loading a large model can take 5-10 minutes. Conventional autoscaling strategies that react to current utilization cannot add capacity fast enough given that lead time. You need predictive scaling based on traffic forecasts, not reactive scaling based on current load.
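The core of predictive scaling is sizing for the load you expect when new replicas become ready, not the load you see now. In this minimal sketch, `forecast` is a hypothetical callable returning expected tokens/second at a given minute of the day. The per-replica throughput and the 70% utilization target are assumed values.

```python
import math


def replicas_needed(forecast_tokens_per_s, tokens_per_s_per_replica,
                    target_utilization=0.7):
    """Replicas required to serve a load at the target utilization."""
    usable = tokens_per_s_per_replica * target_utilization
    return math.ceil(forecast_tokens_per_s / usable)


def replicas_to_start(forecast, now_min, lead_time_min, current_replicas,
                      tokens_per_s_per_replica, min_replicas=2):
    """Replicas to launch now so they are ready when the forecast load arrives."""
    expected_load = forecast(now_min + lead_time_min)
    wanted = max(min_replicas,
                 replicas_needed(expected_load, tokens_per_s_per_replica))
    return max(0, wanted - current_replicas)


def forecast(minute):
    # Hypothetical forecast: traffic jumps at 09:00 (minute 540 of the day).
    return 10_000 if minute >= 540 else 2_000


# At 08:50, with a 10-minute lead time, scale for the 09:00 load.
to_start = replicas_to_start(forecast, now_min=530, lead_time_min=10,
                             current_replicas=2, tokens_per_s_per_replica=2_500)
```

A reactive scaler looking only at the 08:50 load would see two replicas as sufficient and start nothing until the spike had already arrived.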

Fractional GPU serving. Small models can run on fractional GPUs. Large models require multiple GPUs. Your autoscaling logic needs to understand the hardware requirements of the specific models you're serving and provision the right GPU configuration, not just "more instances."
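For models too large for one GPU, a first-pass estimate of GPUs per replica follows from weight size and usable VRAM. This sketch assumes a fixed fraction of VRAM is reserved for KV cache and activations, and rounds up to a power of two because tensor-parallel setups commonly use 1, 2, 4, or 8 GPUs. Both are assumptions, and fractional-GPU sharing for small models is not covered.

```python
import math


def gpus_per_replica(weight_bytes, vram_bytes_per_gpu, kv_reserve_fraction=0.3):
    """Smallest power-of-two GPU count whose combined VRAM holds the weights
    while leaving a reserved fraction for KV cache and activations."""
    usable = vram_bytes_per_gpu * (1 - kv_reserve_fraction)
    needed = max(1, math.ceil(weight_bytes / usable))
    return 1 << (needed - 1).bit_length()


# Hypothetical models at 2 bytes per parameter on 80 GB GPUs.
small = gpus_per_replica(weight_bytes=7e9 * 2, vram_bytes_per_gpu=80e9)
large = gpus_per_replica(weight_bytes=70e9 * 2, vram_bytes_per_gpu=80e9)
```

The unit of scaling for the larger model is then a group of GPUs rather than a single instance, which is what the autoscaler has to provision.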

Cost is highly non-linear. An H100 can cost roughly 10x an A10, depending on vendor and pricing model. If a traffic spike pushes you across a threshold where you need more powerful GPUs rather than more of the same ones, your infrastructure cost doesn't scale linearly with traffic.

The practical advice: maintain a minimum warm replica count that exceeds your typical off-peak traffic by a meaningful margin, use predictive scaling for anticipated traffic patterns (business hours, marketing campaigns), and have a documented runbook for manual scaling when automated systems can't respond fast enough.


What This Means for Your Stack

If you're building SRE tooling on top of hosted inference APIs (Anthropic, OpenAI, Bedrock), most of this is the provider's problem. Your reliability concerns are different: rate limit handling, fallback providers, latency budgeting, and cost management.
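On the client side, rate limit handling and provider fallback can be combined in one small wrapper. In this sketch, each provider is a hypothetical callable that raises `RateLimited` when throttled. These are stand-ins, not any vendor's SDK, and the retry counts and delays are illustrative.

```python
import random
import time


class RateLimited(Exception):
    """Raised by a provider callable when it receives a rate-limit response."""


def call_with_fallback(providers, prompt, max_attempts=3, base_delay_s=0.5,
                       sleep=time.sleep):
    """Try providers in order; retry rate limits with jittered exponential backoff."""
    last_error = None
    for call in providers:
        for attempt in range(max_attempts):
            try:
                return call(prompt)
            except RateLimited as err:
                last_error = err
                if attempt < max_attempts - 1:  # no sleep before switching provider
                    sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise RuntimeError("all providers exhausted") from last_error


def always_limited(prompt):
    raise RateLimited("429")


def healthy(prompt):
    return f"echo: {prompt}"


# The primary is throttled, so the call falls through to the secondary.
result = call_with_fallback([always_limited, healthy], "status?",
                            sleep=lambda _: None)
```

The injectable `sleep` keeps the wrapper testable. Budget the worst-case total delay against your own latency target before choosing the retry parameters.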

If you're running inference infrastructure — either open-weight models on your own GPUs or serving models for others — everything above is your problem. The SRE practices are different from what you're used to, but they're learnable. The teams doing this well are applying rigorous measurement, understanding the memory-bound nature of the workload, and building degradation strategies that keep the service usable under pressure.

The infrastructure behind LLM inference is genuinely interesting engineering. It's also increasingly the infrastructure that the rest of your organization depends on.


*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation and data platform reliability. Find him at zakhassan.com or on LinkedIn.*
