*By Zak Hassan — Staff SRE | May 2026*
HTTP/1.1 has trained us to think of load balancing as a solved problem — throw a round-robin L4 balancer in front of the fleet and requests distribute evenly. gRPC breaks that mental model completely. Because gRPC runs over HTTP/2, which multiplexes many RPCs over a single long-lived TCP connection, an L4 load balancer sees one connection per client and routes all of that client's traffic to a single backend. In a service mesh with dozens of gRPC clients, you end up with hot pods, idle pods, and cascading failures that look like capacity problems but are really routing problems. Getting gRPC right in production-like lab environments means understanding the protocol's connection model, building observability on top of its status code system, and treating deadlines as a first-class concern — not an afterthought.
Why L4 Load Balancing Fails gRPC
Traditional TCP load balancers (AWS NLB, kube-proxy in IPVS mode, most hardware appliances) work at the connection level. They pick a backend when a connection is established and stick with it. For HTTP/1.1, this is fine: connections are short-lived, and a new request usually means a new connection. For gRPC over HTTP/2, a single connection carries hundreds of concurrent streams. A client that opens one gRPC channel and makes 500 RPC calls per second sends all 500 to the same backend, regardless of what the load balancer says.
The implication is concrete: if you deploy 10 gRPC service pods and have 10 client pods, you may have 10 connections — one per client-server pair — and no balancing at all. Adding pods doesn't help because existing connections don't migrate. You have to either force connections to redistribute (by restarting clients, which is not a strategy) or solve the problem properly.
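The effect is easy to reproduce with a small simulation. The sketch below is illustrative only: the function names, the random backend choice at connect time, and the 10-client, 10-backend, 500-RPC figures are assumptions chosen to mirror the example above, not measurements from a real balancer.

```python
from collections import Counter
from itertools import cycle
import random


def connection_level(num_clients, num_backends, rpcs_per_client, seed=0):
    """L4 behaviour: each client is pinned to one backend at connect time."""
    rng = random.Random(seed)
    load = Counter({b: 0 for b in range(num_backends)})
    for _ in range(num_clients):
        backend = rng.randrange(num_backends)  # chosen once per connection
        load[backend] += rpcs_per_client       # every RPC follows the connection
    return load


def per_rpc_round_robin(num_clients, num_backends, rpcs_per_client):
    """Per-RPC behaviour: each client rotates through all backends."""
    load = Counter({b: 0 for b in range(num_backends)})
    for _ in range(num_clients):
        backends = cycle(range(num_backends))
        for _ in range(rpcs_per_client):
            load[next(backends)] += 1
    return load


l4 = connection_level(10, 10, 500)
l7 = per_rpc_round_robin(10, 10, 500)
print("connection-level:", sorted(l4.values()))
print("per-RPC:         ", sorted(l7.values()))
```

In the connection-level case every backend's load is a multiple of the per-client volume, so any backend that was not picked at connect time receives nothing at all. In the per-RPC case the load is even by construction.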
Load Balancing Strategies That Actually Work
Client-side load balancing is the purest solution. The client fetches all backend addresses from a service discovery system, opens connections to each, and picks a backend per RPC using a policy like round-robin or least-connections. gRPC's built-in round_robin policy does this:
```python
import grpc

channel = grpc.insecure_channel(
    # DNS-based discovery returns all pod IPs when using a headless service
    "dns:///my-grpc-service.default.svc.cluster.local:50051",
    options=[
        ("grpc.lb_policy_name", "round_robin"),
        ("grpc.service_config", '{"loadBalancingPolicy": "round_robin"}'),
    ],
)
```

For this to work in Kubernetes, the Service must be headless (clusterIP: None). A normal ClusterIP service returns a single virtual IP, so DNS-based discovery still only sees one address. A headless service returns A records for every pod IP, giving the client-side balancer the full backend set.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service
spec:
  clusterIP: None  # headless: returns pod IPs directly
  selector:
    app: my-grpc-service
  ports:
    - port: 50051
      targetPort: 50051
```

L7 proxy load balancing with Envoy or Linkerd is the production-grade alternative. Envoy understands HTTP/2 framing and can balance at the stream level, not the connection level. A minimal Envoy configuration for gRPC load balancing:
```yaml
static_resources:
  listeners:
    - name: grpc_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 9000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: grpc_ingress
                codec_type: AUTO
                route_config:
                  name: grpc_route
                  virtual_hosts:
                    - name: grpc_backend
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/"
                            grpc: {}  # match only gRPC traffic
                          route:
                            cluster: grpc_service
                            timeout: 0s  # 0s disables the route timeout; per-RPC deadlines govern
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: grpc_service
      connect_timeout: 1s
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST  # stream-level balancing
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}  # force HTTP/2 to backends
      load_assignment:
        cluster_name: grpc_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: my-grpc-service, port_value: 50051 }
```

The grpc: {} route matcher is important: it restricts the route to requests whose content-type marks them as gRPC, so generic HTTP/2 traffic that reaches this listener is not forwarded to the gRPC cluster. Note that STRICT_DNS only sees every pod if my-grpc-service resolves to pod IPs, which again means a headless Service.
Deadlines and Timeouts
Every gRPC call must carry a deadline. Without one, a slow downstream service holds your goroutine or thread indefinitely, your connection pool saturates, and the slowness fans out upstream. This is not theoretical — it is one of the most common causes of cascading failures in gRPC-heavy architectures.
The gRPC deadline model differs from a simple socket timeout. A deadline is an absolute point in time attached to a Context. When you receive an inbound RPC with a deadline, you should pass that same deadline (or a shorter one) to any outbound RPCs you make. This is deadline propagation, and it ensures that if a client gives up, the entire call chain gives up rather than continuing to consume backend resources.
```python
import grpc
from concurrent.futures import CancelledError


def call_with_deadline(stub, request, timeout_s=5.0):
    # grpc-python takes a relative timeout in seconds and derives the
    # call's deadline from it.
    try:
        return stub.MyMethod(request, timeout=timeout_s)
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            # The deadline expired: this service or something downstream was too slow
            raise TimeoutError(f"RPC deadline exceeded: {e.details()}") from e
        elif e.code() == grpc.StatusCode.CANCELLED:
            # The caller cancelled us; propagate cancellation upward
            raise CancelledError("RPC was cancelled by caller") from e
        raise
```

DEADLINE_EXCEEDED and CANCELLED look similar but are operationally distinct. DEADLINE_EXCEEDED means time ran out. Alert on it, because it indicates latency problems or insufficient deadline budgets. CANCELLED means a client explicitly cancelled the call. This is often benign (the user navigated away, or a retry policy triggered) and should not alert at the same severity.
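Deadline propagation comes down to simple arithmetic: the timeout given to a downstream call is whatever remains of the inbound budget, minus a small reserve for local work. The helper below is a minimal sketch of that calculation. The name outbound_timeout and the 50 ms default reserve are hypothetical, and the inbound deadline is assumed to be expressed on the time.monotonic() clock.

```python
import time


def outbound_timeout(inbound_deadline, reserve=0.05, now=None):
    """Seconds to grant a downstream call, or None if the budget is spent.

    inbound_deadline: absolute time.monotonic() value at which our caller gives up.
    reserve: seconds held back for our own work after the downstream call returns.
    now: injectable clock reading, mainly for tests.
    """
    now = time.monotonic() if now is None else now
    remaining = inbound_deadline - now - reserve
    return remaining if remaining > 0 else None
```

In a grpc-python servicer the remaining inbound budget is available from context.time_remaining(), so the inbound deadline can be taken as time.monotonic() plus that value when the caller set one. When the helper returns None, failing fast with DEADLINE_EXCEEDED is usually cheaper than starting a downstream call that cannot finish in time.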
Health Checking with the gRPC Health Protocol
Kubernetes readiness probes are often configured as TCP checks, which only verify that a port is open, not that the gRPC server is actually serving. The gRPC Health Checking Protocol provides a standard grpc.health.v1.Health/Check RPC for this purpose.
Implementing the health server:
```python
from concurrent import futures

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # Register the service
    # my_pb2_grpc.add_MyServiceServicer_to_server(MyServicer(), server)

    # Register health service
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

    # Mark service as serving; update this dynamically as your app warms up
    health_servicer.set("my.package.MyService", health_pb2.HealthCheckResponse.SERVING)
    # Empty string = overall server health
    health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)

    server.add_insecure_port("[::]:50051")
    server.start()
    return server, health_servicer
```

In the Kubernetes pod spec, use the grpc probe type (enabled by default since Kubernetes 1.24):
```yaml
readinessProbe:
  grpc:
    port: 50051
    service: my.package.MyService
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  grpc:
    port: 50051
    service: ""  # overall server health for liveness
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
```

During a rolling deploy, set the health status to NOT_SERVING before the process starts shutting down: ideally from a preStop hook, which runs before SIGTERM is delivered, or at the latest at the top of the SIGTERM handler. This takes the pod out of rotation so in-flight requests can drain before the process exits rather than being dropped mid-stream.
Observability: Interceptors, Status Codes, and Prometheus Metrics
gRPC status codes — not HTTP status codes — are the primary error signal for gRPC services. OK, UNAVAILABLE, INTERNAL, INVALID_ARGUMENT, NOT_FOUND, RESOURCE_EXHAUSTED each carry specific semantics. Alerting on error rate means alerting on non-OK status codes, not on anything resembling HTTP 4xx/5xx.
The de facto Prometheus metric names for gRPC come from the grpc_server_* and grpc_client_* convention popularized by the grpc-ecosystem Prometheus interceptors (OpenTelemetry's semantic conventions use different, rpc.*-prefixed names):

- grpc_server_started_total: RPCs started, labeled by method
- grpc_server_handled_total: RPCs completed, labeled by method and grpc_code
- grpc_server_handling_seconds: RPC duration histogram
- grpc_client_started_total, grpc_client_handled_total, grpc_client_handling_seconds: the same on the client side
A Python server interceptor that emits both Prometheus metrics and OpenTelemetry traces:
```python
import time

import grpc
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from prometheus_client import Counter, Histogram

GRPC_REQUESTS = Counter(
    "grpc_server_handled_total",
    "Total gRPC calls handled",
    ["grpc_method", "grpc_code"],
)
GRPC_LATENCY = Histogram(
    "grpc_server_handling_seconds",
    "gRPC call duration in seconds",
    ["grpc_method"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)

tracer = trace.get_tracer(__name__)
propagator = TraceContextTextMapPropagator()


class ObservabilityInterceptor(grpc.ServerInterceptor):
    def intercept_service(self, continuation, handler_call_details):
        method = handler_call_details.method  # e.g. /my.package.MyService/MyMethod
        handler = continuation(handler_call_details)
        # Only unary-unary handlers are wrapped; streaming handlers pass through untouched
        if handler is None or handler.unary_unary is None:
            return handler

        def wrapper(request, servicer_context):
            # Extract trace context from gRPC metadata
            metadata = dict(servicer_context.invocation_metadata())
            ctx = propagator.extract(carrier=metadata)
            with tracer.start_as_current_span(
                method,
                context=ctx,
                kind=trace.SpanKind.SERVER,
            ) as span:
                span.set_attribute("rpc.system", "grpc")
                span.set_attribute("rpc.method", method.split("/")[-1])
                start = time.monotonic()
                grpc_code = "OK"
                try:
                    # Invoke the real handler (the continuation only resolves it)
                    return handler.unary_unary(request, servicer_context)
                except Exception as e:
                    # abort() and unhandled servicer exceptions both surface here;
                    # gRPC reports an unhandled exception to the client as UNKNOWN
                    grpc_code = "UNKNOWN"
                    span.record_exception(e)
                    raise
                finally:
                    # Prefer the status the servicer set via abort() or set_code().
                    # ServicerContext.code() requires a reasonably recent grpcio.
                    code = servicer_context.code()
                    if code is not None:
                        grpc_code = code.name
                    span.set_attribute("rpc.grpc.status_code", grpc_code)
                    duration = time.monotonic() - start
                    GRPC_REQUESTS.labels(grpc_method=method, grpc_code=grpc_code).inc()
                    GRPC_LATENCY.labels(grpc_method=method).observe(duration)

        return grpc.unary_unary_rpc_method_handler(
            wrapper,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )
```

Wire it into your server at startup: grpc.server(..., interceptors=[ObservabilityInterceptor()]).
A useful alert to start with: error rate per method, excluding NOT_FOUND, INVALID_ARGUMENT, and CANCELLED (which reflect caller behavior, not server errors):

```promql
sum by (grpc_method) (
  rate(grpc_server_handled_total{grpc_code!~"OK|NOT_FOUND|INVALID_ARGUMENT|CANCELLED"}[5m])
)
/
sum by (grpc_method) (
  rate(grpc_server_handled_total[5m])
)
> 0.01
```

Streaming RPCs in Production-Like Lab Environments
Streaming RPCs — server streaming, client streaming, and bidirectional streaming — are operationally more complex than unary calls. The connection stays open for the stream's lifetime, which can be minutes or hours. This has implications for deploys, health checking, and error handling.
During a rolling deploy, a pod receiving SIGTERM may have active streams. If it closes immediately, clients see UNAVAILABLE and must reconnect. The right pattern is: stop accepting new connections, let existing streams drain, then exit. In practice, set a drain timeout (30-60 seconds is common) and close the server gracefully:
```python
import signal

from grpc_health.v1 import health_pb2

server, health_servicer = serve()


def handle_sigterm(*args):
    # Signal not serving so the load balancer stops sending new connections.
    # The readiness probe checks the named service, so flip that as well.
    health_servicer.set("my.package.MyService", health_pb2.HealthCheckResponse.NOT_SERVING)
    health_servicer.set("", health_pb2.HealthCheckResponse.NOT_SERVING)
    # Reject new RPCs and wait for in-flight RPCs to complete, up to 30 seconds
    server.stop(grace=30)


signal.signal(signal.SIGTERM, handle_sigterm)
server.wait_for_termination()
```

For long-lived streams, design in explicit keepalives and heartbeats at the application layer. gRPC's HTTP/2 PING frames detect dead connections, but a stream that is alive-but-stuck (upstream blocked, no data flowing) sits on a connection that still answers PINGs, so transport keepalive never flags it. A server-side keepalive message or a client-side deadline on the stream itself (watch pattern with periodic deadline renewal) is more reliable.
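One way to act on application-level heartbeats is a small watchdog on the receiving side that records when the last message arrived and reports the stream as stalled after too much silence. The class below is a sketch under that assumption. StreamWatchdog, touch, and is_stalled are hypothetical names, not part of gRPC.

```python
import time


class StreamWatchdog:
    """Tracks the time since the last message (data or heartbeat) on a stream."""

    def __init__(self, max_silence, clock=time.monotonic):
        self._max_silence = max_silence  # seconds of silence tolerated
        self._clock = clock              # injectable for tests
        self._last_seen = clock()

    def touch(self):
        """Call on every received message, including heartbeats."""
        self._last_seen = self._clock()

    def is_stalled(self):
        """True once the stream has been silent for longer than max_silence."""
        return self._clock() - self._last_seen > self._max_silence
```

A client would call touch() inside its receive loop and check is_stalled() from a timer. On a stall it cancels the call and reconnects, which turns a silent hang into an explicit, observable reconnect.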
Error Handling and Status Code Mapping
The gRPC status code model has 17 codes: OK plus 16 error codes. Understanding which are retryable is essential for building resilient clients:
| Status Code | Meaning | Retryable? |
|---|---|---|
| UNAVAILABLE | Server temporarily unavailable | Yes, with backoff |
| RESOURCE_EXHAUSTED | Rate limited or overloaded | Yes, with backoff |
| DEADLINE_EXCEEDED | Deadline expired | Depends: retry only if idempotent |
| INTERNAL | Server-side bug | No |
| INVALID_ARGUMENT | Bad request | No |
| NOT_FOUND | Resource doesn't exist | No |
| ALREADY_EXISTS | Conflict | No |
| PERMISSION_DENIED | Authorization failure | No |
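The table translates fairly directly into a retry wrapper. The sketch below uses only the standard library. RpcFailure is a hypothetical stand-in for grpc.RpcError that carries the status code name as a string, and the attempt count and delay parameters are illustrative defaults rather than recommendations.

```python
import random
import time

RETRYABLE = {"UNAVAILABLE", "RESOURCE_EXHAUSTED"}
RETRYABLE_IF_IDEMPOTENT = {"DEADLINE_EXCEEDED"}


class RpcFailure(Exception):
    """Hypothetical stand-in for grpc.RpcError carrying a status code name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def should_retry(code, idempotent):
    """Apply the retryability table above to a status code name."""
    return code in RETRYABLE or (idempotent and code in RETRYABLE_IF_IDEMPOTENT)


def call_with_retry(call, idempotent=False, max_attempts=4,
                    base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Invoke call(), retrying retryable failures with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RpcFailure as exc:
            if attempt == max_attempts or not should_retry(exc.code, idempotent):
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The jitter matters as much as the backoff: without it, every client that saw the same UNAVAILABLE retries at the same instant and recreates the spike that caused the failure.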
For REST clients consuming gRPC services via a transcoding gateway (like Envoy's gRPC-JSON transcoder or gRPC-Gateway), the standard HTTP mapping is: OK→200, NOT_FOUND→404, INVALID_ARGUMENT→400, INTERNAL→500, UNAVAILABLE→503, RESOURCE_EXHAUSTED→429, PERMISSION_DENIED→403, UNAUTHENTICATED→401.
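Expressed as a lookup table, the mapping above looks like the sketch below. It covers only the codes listed in this article. The fallback to 500 for anything unlisted is an assumption made for illustration, and http_status_for is a hypothetical helper name.

```python
# gRPC status code name -> HTTP status, limited to the codes listed above
GRPC_TO_HTTP = {
    "OK": 200,
    "INVALID_ARGUMENT": 400,
    "UNAUTHENTICATED": 401,
    "PERMISSION_DENIED": 403,
    "NOT_FOUND": 404,
    "RESOURCE_EXHAUSTED": 429,
    "INTERNAL": 500,
    "UNAVAILABLE": 503,
}


def http_status_for(code_name):
    """Return the HTTP status for a gRPC code name, defaulting to 500 (assumption)."""
    return GRPC_TO_HTTP.get(code_name, 500)
```

Transcoding gateways ship their own complete tables, so a helper like this is mainly useful in tests or in custom edge code that has to agree with the gateway.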
When returning errors from a gRPC service, use the Status type with google.rpc.ErrorInfo details to provide machine-readable context beyond the string message:
```python
import grpc
from google.protobuf import any_pb2
from google.rpc import error_details_pb2, status_pb2
from grpc_status import rpc_status


def raise_structured_error(servicer_context):
    detail = error_details_pb2.ErrorInfo(
        reason="QUOTA_EXCEEDED",
        domain="my-service.example.com",
        metadata={"quota_limit": "1000", "quota_period": "1m"},
    )
    # Status.details is a repeated google.protobuf.Any, so the ErrorInfo
    # message has to be packed before it can be attached
    packed_detail = any_pb2.Any()
    packed_detail.Pack(detail)
    status = status_pb2.Status(
        code=grpc.StatusCode.RESOURCE_EXHAUSTED.value[0],
        message="API quota exceeded",
        details=[packed_detail],
    )
    servicer_context.abort_with_status(rpc_status.to_status(status))
```

Structured error details let clients inspect the error programmatically rather than parsing human-readable strings, which becomes a necessity at scale when dozens of services translate errors for their callers.
gRPC is a mature protocol with strong operational primitives, but it demands more deliberate infrastructure than HTTP/1.1. The teams that run it well have invested in the details: headless services or L7 proxies for real load distribution, deadlines on every call without exception, health checking that reflects actual readiness, and observability grounded in status codes rather than retrofitted HTTP conventions. The teams that struggle are usually the ones that lifted their HTTP/1.1 mental model wholesale and wondered why the metrics looked fine while half the fleet sat idle.
*Zak Hassan is a Staff SRE specializing in distributed systems, observability, and cloud-native infrastructure. Find him at zakhassan.com or on LinkedIn.*