*By Zak Hassan — Staff SRE | May 2026*


HTTP/1.1 has trained us to think of load balancing as a solved problem — throw a round-robin L4 balancer in front of the fleet and requests distribute evenly. gRPC breaks that mental model completely. Because gRPC runs over HTTP/2, which multiplexes many RPCs over a single long-lived TCP connection, an L4 load balancer sees one connection per client and routes all of that client's traffic to a single backend. In a service mesh with dozens of gRPC clients, you end up with hot pods, idle pods, and cascading failures that look like capacity problems but are really routing problems. Getting gRPC right in production-like lab environments means understanding the protocol's connection model, building observability on top of its status code system, and treating deadlines as a first-class concern — not an afterthought.

Why L4 Load Balancing Fails gRPC

Traditional TCP load balancers (AWS NLB, kube-proxy in IPVS mode, most hardware appliances) work at the connection level. They pick a backend when a connection is established and stick with it. For HTTP/1.1, this is fine: connections are short-lived, and a new request usually means a new connection. For gRPC over HTTP/2, a single connection carries hundreds of concurrent streams. A client that opens one gRPC channel and makes 500 RPC calls per second sends all 500 to the same backend, regardless of what the load balancer says.

The implication is concrete: if you deploy 10 gRPC service pods and have 10 client pods, each client typically opens a single connection, so you have 10 connections in total. Each one is pinned to whichever backend was chosen at connect time, so some pods can carry several clients while others sit idle, and nothing rebalances afterwards. Adding pods doesn't help because existing connections don't migrate. You have to either force connections to redistribute (by restarting clients, which is not a strategy) or solve the problem properly.
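The effect is easy to reproduce with a toy simulation. The sketch below is not gRPC code; `connection_level` and `per_rpc_round_robin` are hypothetical names, and the random backend choice at connect time stands in for whatever an L4 balancer would pick.

```python
from collections import Counter
from itertools import cycle
import random


def connection_level(num_clients, num_backends, rpcs_per_client, seed=0):
    """L4-style: each client is pinned to one backend for all of its RPCs."""
    rng = random.Random(seed)
    load = Counter({backend: 0 for backend in range(num_backends)})
    for _ in range(num_clients):
        backend = rng.randrange(num_backends)  # chosen once, at connect time
        load[backend] += rpcs_per_client
    return load


def per_rpc_round_robin(num_clients, num_backends, rpcs_per_client):
    """Client-side or L7-style: each RPC is assigned to the next backend in turn."""
    load = Counter({backend: 0 for backend in range(num_backends)})
    for _ in range(num_clients):
        backends = cycle(range(num_backends))
        for _ in range(rpcs_per_client):
            load[next(backends)] += 1
    return load


if __name__ == "__main__":
    print("connection-level:", sorted(connection_level(10, 10, 500).values()))
    print("per-RPC:         ", sorted(per_rpc_round_robin(10, 10, 500).values()))
```

With connection-level assignment, every backend's load is a multiple of one client's full traffic, so any backend that picks up two clients carries double while another may carry nothing. Per-RPC assignment spreads the same traffic evenly.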

Load Balancing Strategies That Actually Work

Client-side load balancing is the purest solution. The client fetches all backend addresses from a service discovery system, opens connections to each, and picks a backend per RPC using a policy like round-robin or least-connections. gRPC's built-in round_robin policy does this:

python
import grpc

channel = grpc.insecure_channel(
    "dns:///my-grpc-service.default.svc.cluster.local:50051",
    options=[
        ("grpc.lb_policy_name", "round_robin"),
        # DNS-based discovery returns all pod IPs when using a headless service
        ("grpc.service_config", '{"loadBalancingPolicy": "round_robin"}'),
    ],
)

For this to work in Kubernetes, the Service must be headless (clusterIP: None). A normal ClusterIP service returns a single virtual IP, so DNS-based discovery still only sees one address. A headless service returns A records for every pod IP, giving the client-side balancer the full backend set.

yaml
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-service
spec:
  clusterIP: None          # headless — returns pod IPs directly
  selector:
    app: my-grpc-service
  ports:
    - port: 50051
      targetPort: 50051
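One way to check what a DNS-based client-side balancer will see is to resolve the service name yourself. The sketch below is illustrative only: `resolve_backends` is a hypothetical helper, not part of gRPC, and it uses the standard library resolver rather than gRPC's own DNS resolver.

```python
import socket


def resolve_backends(host, port, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses that DNS reports for host:port."""
    infos = resolver(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({sockaddr[0] for _family, _type, _proto, _canon, sockaddr in infos})


if __name__ == "__main__":
    # Run from inside the cluster; the name only resolves there.
    print(resolve_backends("my-grpc-service.default.svc.cluster.local", 50051))
```

Against a headless service this should list one address per ready pod. Against a normal ClusterIP service it should list a single virtual IP, which is the signal that round_robin has nothing to balance across.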

L7 proxy load balancing with Envoy or Linkerd is the production-grade alternative. Envoy understands HTTP/2 framing and can balance at the stream level, not the connection level. A minimal Envoy configuration for gRPC load balancing:

yaml
static_resources:
  listeners:
    - name: grpc_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 9000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: grpc_ingress
                codec_type: AUTO
                route_config:
                  name: grpc_route
                  virtual_hosts:
                    - name: grpc_backend
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/"
                            grpc: {}         # match only gRPC traffic
                          route:
                            cluster: grpc_service
                            timeout: 0s      # 0 = defer to per-RPC deadlines
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: grpc_service
      connect_timeout: 1s
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST   # stream-level balancing
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}       # force HTTP/2 to backends
      load_assignment:
        cluster_name: grpc_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: my-grpc-service, port_value: 50051 }

The grpc: {} route matcher is important: it restricts this route to requests whose content-type is application/grpc, so any non-gRPC HTTP traffic that reaches the listener is not forwarded to the gRPC cluster.

Deadlines and Timeouts

Every gRPC call must carry a deadline. Without one, a slow downstream service holds your goroutine or thread indefinitely, your connection pool saturates, and the slowness fans out upstream. This is not theoretical — it is one of the most common causes of cascading failures in gRPC-heavy architectures.

The gRPC deadline model differs from a simple socket timeout. A deadline is an absolute point in time attached to a Context. When you receive an inbound RPC with a deadline, you should pass that same deadline (or a shorter one) to any outbound RPCs you make. This is deadline propagation, and it ensures that if a client gives up, the entire call chain gives up rather than continuing to consume backend resources.

python
import grpc
from concurrent.futures import CancelledError

def call_with_deadline(stub, request, timeout_s=5.0):
    # grpcio takes a relative timeout in seconds and turns it into the
    # deadline for this call; the server sees the remaining budget.
    try:
        return stub.MyMethod(request, timeout=timeout_s)
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            # The deadline expired: this service or something downstream was too slow
            raise TimeoutError(f"RPC deadline exceeded: {e.details()}") from e
        elif e.code() == grpc.StatusCode.CANCELLED:
            # The caller cancelled us, so propagate cancellation upward
            raise CancelledError("RPC was cancelled by caller") from e
        raise

DEADLINE_EXCEEDED and CANCELLED look similar but are operationally distinct. DEADLINE_EXCEEDED means time ran out — alert on this, it indicates latency problems or insufficient deadline budgets. CANCELLED means a client explicitly cancelled the call — this is often benign (user navigated away, a retry policy triggered) and should not alert at the same severity.
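Deadline propagation, as described earlier in this section, comes down to deriving each outbound timeout from the budget left on the inbound call. A minimal sketch, assuming the handler reads that budget from `grpc.ServicerContext.time_remaining()` (which returns None when the caller set no deadline); the `outbound_timeout` helper, the 50 ms reserve, and the 5 second default are illustrative assumptions:

```python
def outbound_timeout(time_remaining, reserve=0.05, default=5.0):
    """Derive the timeout for a downstream RPC from the inbound call's budget.

    time_remaining: seconds left on the inbound RPC, or None if the caller
    set no deadline.
    reserve: time held back so this service can still build its own response.
    default: upper bound applied to every downstream call.
    """
    if time_remaining is None:
        return default
    budget = time_remaining - reserve
    if budget <= 0:
        # Not worth calling downstream: the caller will have given up already.
        raise TimeoutError("no deadline budget left for downstream call")
    return min(budget, default)
```

Inside a servicer method this might be used as `stub.MyMethod(request, timeout=outbound_timeout(context.time_remaining()))`, so a downstream call never outlives the request that triggered it.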

Health Checking with the gRPC Health Protocol

A TCP readiness probe, the usual fallback for non-HTTP services in Kubernetes, only verifies that a port is open, not that the gRPC server is actually serving. The gRPC Health Checking Protocol provides a standard grpc.health.v1.Health/Check RPC for this purpose.

Implementing the health server:

python
from concurrent import futures
import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    
    # Register the service
    # my_pb2_grpc.add_MyServiceServicer_to_server(MyServicer(), server)
    
    # Register health service
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
    
    # Mark service as serving — update this dynamically as your app warms up
    health_servicer.set("my.package.MyService", health_pb2.HealthCheckResponse.SERVING)
    # Empty string = overall server health
    health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)
    
    server.add_insecure_port("[::]:50051")
    server.start()
    return server, health_servicer

In the Kubernetes pod spec, use the grpc probe type (enabled by default since Kubernetes 1.24):

yaml
readinessProbe:
  grpc:
    port: 50051
    service: my.package.MyService
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  grpc:
    port: 50051
    service: ""      # overall server health for liveness
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3

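During a rolling deploy, flip the health status to NOT_SERVING as soon as shutdown begins (on SIGTERM, or earlier from a preStop hook) and keep the process alive while in-flight requests finish. The readiness probe then fails and new traffic stops arriving, so requests drain instead of being dropped mid-stream.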

Observability: Interceptors, Status Codes, and Prometheus Metrics

gRPC status codes — not HTTP status codes — are the primary error signal for gRPC services. OK, UNAVAILABLE, INTERNAL, INVALID_ARGUMENT, NOT_FOUND, RESOURCE_EXHAUSTED each carry specific semantics. Alerting on error rate means alerting on non-OK status codes, not on anything resembling HTTP 4xx/5xx.

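The most widely used Prometheus metric names for gRPC follow the grpc_server_* and grpc_client_* convention popularized by the go-grpc-prometheus library (OpenTelemetry's RPC semantic conventions define their own, differently named metrics):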

  • grpc_server_started_total — RPCs started, labeled by method
  • grpc_server_handled_total — RPCs completed, labeled by method and grpc_code
  • grpc_server_handling_seconds — RPC duration histogram
  • grpc_client_started_total, grpc_client_handled_total, grpc_client_handling_seconds — same on the client side

A Python server interceptor that emits both Prometheus metrics and OpenTelemetry traces:

python
import time
import grpc
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from prometheus_client import Counter, Histogram

GRPC_REQUESTS = Counter(
    "grpc_server_handled_total",
    "Total gRPC calls handled",
    ["grpc_method", "grpc_code"],
)
GRPC_LATENCY = Histogram(
    "grpc_server_handling_seconds",
    "gRPC call duration in seconds",
    ["grpc_method"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)

tracer = trace.get_tracer(__name__)
propagator = TraceContextTextMapPropagator()


class ObservabilityInterceptor(grpc.ServerInterceptor):
    def intercept_service(self, continuation, handler_call_details):
        method = handler_call_details.method  # e.g. /my.package.MyService/MyMethod
        handler = continuation(handler_call_details)
        # Unknown methods and streaming RPCs pass through untouched;
        # this interceptor only instruments unary-unary handlers.
        if handler is None or handler.request_streaming or handler.response_streaming:
            return handler

        def wrapper(request, servicer_context):
            # Extract trace context from gRPC metadata
            metadata = dict(servicer_context.invocation_metadata())
            ctx = propagator.extract(carrier=metadata)

            with tracer.start_as_current_span(
                method,
                context=ctx,
                kind=trace.SpanKind.SERVER,
            ) as span:
                span.set_attribute("rpc.system", "grpc")
                span.set_attribute("rpc.method", method.split("/")[-1])

                start = time.monotonic()
                grpc_code = "UNKNOWN"
                try:
                    response = handler.unary_unary(request, servicer_context)
                    # ServicerContext.code() is an experimental grpcio API (assumes a
                    # recent release); it is None unless the handler called set_code()
                    code = servicer_context.code()
                    grpc_code = code.name if code is not None else "OK"
                    return response
                except Exception as e:
                    # abort() sets the code and then raises; any other unhandled
                    # exception reaches the client as UNKNOWN
                    code = servicer_context.code()
                    grpc_code = code.name if code is not None else "UNKNOWN"
                    span.record_exception(e)
                    raise
                finally:
                    duration = time.monotonic() - start
                    span.set_attribute("rpc.grpc.status_code", grpc_code)
                    GRPC_REQUESTS.labels(grpc_method=method, grpc_code=grpc_code).inc()
                    GRPC_LATENCY.labels(grpc_method=method).observe(duration)

        return grpc.unary_unary_rpc_method_handler(
            wrapper,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )

Wire it into your server at startup: grpc.server(..., interceptors=[ObservabilityInterceptor()]).

A useful alert to start with: error rate per method, excluding NOT_FOUND and INVALID_ARGUMENT (which are caller errors, not server errors) and CANCELLED (which is usually the client giving up):

promql
sum by (grpc_method) (
  rate(grpc_server_handled_total{grpc_code!~"OK|NOT_FOUND|INVALID_ARGUMENT|CANCELLED"}[5m])
)
/
sum by (grpc_method) (
  rate(grpc_server_handled_total[5m])
)
> 0.01
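The arithmetic behind that expression can be checked by hand. The sketch below recomputes the same ratio from a dictionary of per-code counts for one method; `server_error_ratio` and the sample counts are made up for illustration.

```python
EXCLUDED_CODES = {"OK", "NOT_FOUND", "INVALID_ARGUMENT", "CANCELLED"}


def server_error_ratio(counts_by_code, excluded=EXCLUDED_CODES):
    """Fraction of handled RPCs whose status code counts as a server-side error."""
    total = sum(counts_by_code.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in counts_by_code.items() if code not in excluded)
    return errors / total


# Hypothetical five-minute window for one method
window = {"OK": 970, "NOT_FOUND": 20, "UNAVAILABLE": 8, "INTERNAL": 2}
ratio = server_error_ratio(window)  # (8 + 2) / 1000
```

Note that the denominator includes every handled RPC, excluded codes included, which matches the unfiltered second `rate()` in the query.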

Streaming RPCs in Production-Like Lab Environments

Streaming RPCs — server streaming, client streaming, and bidirectional streaming — are operationally more complex than unary calls. The connection stays open for the stream's lifetime, which can be minutes or hours. This has implications for deploys, health checking, and error handling.

During a rolling deploy, a pod receiving SIGTERM may have active streams. If it closes immediately, clients see UNAVAILABLE and must reconnect. The right pattern is: stop accepting new connections, let existing streams drain, then exit. In practice, set a drain timeout (30-60 seconds is common) and close the server gracefully:

python
import signal
from grpc_health.v1 import health_pb2

server, health_servicer = serve()

def handle_sigterm(*args):
    # Report NOT_SERVING for every registered name so readiness probes and
    # load balancers stop sending new traffic
    for service in ("", "my.package.MyService"):
        health_servicer.set(service, health_pb2.HealthCheckResponse.NOT_SERVING)
    # Reject new RPCs and give in-flight ones up to 30 seconds to complete
    server.stop(grace=30)

signal.signal(signal.SIGTERM, handle_sigterm)
server.wait_for_termination()

For long-lived streams, design explicit heartbeats into the application layer. gRPC's HTTP/2 keepalive PING frames detect dead connections, but they cannot detect a stream that is alive but stuck (upstream blocked, no data flowing), because the connection underneath still answers PINGs. A periodic server-side heartbeat message, or a client-side deadline on the stream itself (a watch pattern in which the client re-establishes the stream with a fresh deadline), is more reliable.
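A server-side heartbeat can be as small as a generator that falls back to a keepalive message whenever its data source goes quiet. The following is a sketch under stated assumptions: events arrive on a `queue.Queue`, a `None` item marks the end of the stream, `HEARTBEAT` stands in for a real protobuf message, and `is_active` would be `context.is_active` in a real servicer. All of these names are hypothetical.

```python
import queue

HEARTBEAT = object()  # placeholder; a real service would yield a protobuf message


def stream_with_heartbeats(events, interval, is_active=lambda: True):
    """Yield items from `events`, emitting HEARTBEAT after `interval` idle seconds.

    The loop ends when a None item is received or is_active() returns False.
    """
    while is_active():
        try:
            item = events.get(timeout=interval)
        except queue.Empty:
            # No data arrived in time: tell the client the stream is still alive.
            yield HEARTBEAT
            continue
        if item is None:
            return
        yield item
```

A client that expects a message at least every `interval` seconds can then treat a longer silence as a stuck stream and reconnect, which PING frames alone would not reveal.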

Error Handling and Status Code Mapping

The gRPC status code model defines 17 codes: OK plus 16 error codes. Understanding which are retryable is essential for building resilient clients:

| Status Code | Meaning | Retryable? |
|---|---|---|
| UNAVAILABLE | Server temporarily unavailable | Yes, with backoff |
| RESOURCE_EXHAUSTED | Rate limited or overloaded | Yes, with backoff |
| DEADLINE_EXCEEDED | Deadline expired | Depends; retry only if idempotent |
| INTERNAL | Server-side bug | No |
| INVALID_ARGUMENT | Bad request | No |
| NOT_FOUND | Resource doesn't exist | No |
| ALREADY_EXISTS | Conflict | No |
| PERMISSION_DENIED | Authorization failure | No |
For REST clients consuming gRPC services via a transcoding gateway (like Envoy's gRPC-JSON transcoder or gRPC-Gateway), the standard HTTP mapping is: OK→200, NOT_FOUND→404, INVALID_ARGUMENT→400, INTERNAL→500, UNAVAILABLE→503, RESOURCE_EXHAUSTED→429, PERMISSION_DENIED→403, UNAUTHENTICATED→401.
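Expressed as a lookup table, the mapping above looks like the following. The table contains only the codes listed in the text; `http_status_for` is a hypothetical helper, and falling back to 500 for anything unlisted is an assumption of this sketch, not something a particular gateway guarantees.

```python
GRPC_TO_HTTP = {
    "OK": 200,
    "INVALID_ARGUMENT": 400,
    "UNAUTHENTICATED": 401,
    "PERMISSION_DENIED": 403,
    "NOT_FOUND": 404,
    "RESOURCE_EXHAUSTED": 429,
    "INTERNAL": 500,
    "UNAVAILABLE": 503,
}


def http_status_for(grpc_code_name, default=500):
    """Translate a gRPC status code name into an HTTP status for a REST caller."""
    return GRPC_TO_HTTP.get(grpc_code_name, default)
```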

When returning errors from a gRPC service, use the Status type with google.rpc.ErrorInfo details to provide machine-readable context beyond the string message:

python
from google.protobuf import any_pb2
from google.rpc import status_pb2, error_details_pb2
from grpc_status import rpc_status
import grpc

def raise_structured_error(servicer_context):
    info = error_details_pb2.ErrorInfo(
        reason="QUOTA_EXCEEDED",
        domain="my-service.example.com",
        metadata={"quota_limit": "1000", "quota_period": "1m"},
    )
    # Status.details is a repeated google.protobuf.Any, so pack the detail first
    detail = any_pb2.Any()
    detail.Pack(info)
    status = status_pb2.Status(
        code=grpc.StatusCode.RESOURCE_EXHAUSTED.value[0],
        message="API quota exceeded",
        details=[detail],
    )
    servicer_context.abort_with_status(rpc_status.to_status(status))

Structured error details let clients inspect the error programmatically rather than parsing human-readable strings — a necessity at scale when you have dozens of services translating errors for their callers.

gRPC is a mature protocol with strong operational primitives, but it demands more deliberate infrastructure than HTTP/1.1. The teams that run it well have invested in the details: headless services or L7 proxies for real load distribution, deadlines on every call without exception, health checking that reflects actual readiness, and observability grounded in status codes rather than retrofitted HTTP conventions. The teams that struggle are usually the ones that lifted their HTTP/1.1 mental model wholesale and wondered why the metrics looked fine while half the fleet sat idle.


*Zak Hassan is a Staff SRE specializing in distributed systems, observability, and cloud-native infrastructure. Find him at zakhassan.com or on LinkedIn.*
