Zero-Downtime Deployments: Rolling Updates, Blue-Green, and Traffic Shifting

*By Zak Hassan — Staff SRE | May 2026*

Every engineering team eventually ships the "we have zero-downtime deployments" slide in their reliability review. Then they get paged at 2am because a rolling update dropped three percent of requests during a high-traffic window, and the slide quietly disappears. Achieving real zero downtime is deceptively hard. The mechanisms are not obscure; the difficulty is that at least four independent failure modes have to be addressed simultaneously: in-flight requests hitting pods mid-termination, readiness probes that lie about pod health, propagation lag between the Kubernetes control plane and the iptables rules kube-proxy programs on each node, and schema migrations that are incompatible with the version of code still running on surviving pods. Get three of the four right and users still see errors.

Why Zero Downtime Is Harder Than It Sounds

The naive mental model is: spin up new pod, old pod exits, traffic moves over, done. The actual sequence involves more actors. When Kubernetes decides to terminate a pod, it simultaneously sends SIGTERM to the container *and* begins removing the pod's IP from the Endpoints object backing the Service (EndpointSlices, in current clusters). These two operations race. kube-proxy, which programs iptables rules on each node, processes those updates asynchronously. There is a window, typically one to three seconds on a healthy cluster and longer under load, where the old pod is already handling its SIGTERM and shutting down while iptables rules on some nodes still route new connections to it. Any connection arriving in that window goes to a pod that may have already stopped accepting work.
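A toy timing model makes the race concrete. Assume, purely for illustration, that each node's kube-proxy converges after a fixed delay measured from the moment the pod starts terminating, and that the pod stops accepting connections at a single instant; real propagation is noisier, and the node names and delays below are made up. Any node whose delay exceeds the pod's time-to-stop keeps routing new connections to a pod that no longer accepts them, for the difference.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    # Seconds after termination starts until kube-proxy on this node stops
    # routing new connections to the terminating pod (illustrative value).
    propagation_delay: float


def misrouting_window(nodes: list, stops_accepting_at: float) -> dict:
    """Return, per node, how many seconds new connections can still be routed
    to a pod that has already stopped accepting work.

    Both timelines start at t=0, when the pod is marked as terminating: the
    endpoint removal reaches each node after its propagation delay, while the
    container stops accepting connections at `stops_accepting_at` seconds.
    Nodes that converge before the pod stops are omitted.
    """
    windows = {}
    for node in nodes:
        gap = node.propagation_delay - stops_accepting_at
        if gap > 0:
            windows[node.name] = round(gap, 3)
    return windows


if __name__ == "__main__":
    # Hypothetical cluster: two fast nodes and one lagging under load.
    cluster = [Node("node-a", 0.8), Node("node-b", 1.5), Node("node-c", 4.0)]
    # The app stops accepting almost immediately after SIGTERM...
    print(misrouting_window(cluster, stops_accepting_at=0.2))
    # ...versus an app whose shutdown is delayed by ten seconds.
    print(misrouting_window(cluster, stops_accepting_at=10.0))
```

With the pod stopping at 0.2 seconds every node reports a nonzero window; delaying the stop to ten seconds, which is what a preStop sleep does, returns an empty dict.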

Readiness probes add another wrinkle. A new pod that passes its readiness probe is added to the Endpoints object, but that addition also has to propagate through kube-proxy before traffic actually reaches it. There is a symmetric gap on pod startup: the pod is "ready" from Kubernetes' perspective before iptables on every node reflects that fact. Under a rolling deploy, this gap is usually fine — old pods are still running — but in aggressive configurations with maxUnavailable set high, you can momentarily have neither the old pod nor the new pod reliably reachable.
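How "high" maxUnavailable gets depends on how Kubernetes turns the rolling update fields into pod counts. Both maxUnavailable and maxSurge accept an absolute number or a percentage of the desired replicas; percentages are rounded down for maxUnavailable and up for maxSurge, and the default for each is 25%. This sketch reproduces that arithmetic. The function names are ours, and the both-zero fallback mirrors the deployment controller's behaviour of allowing one unavailable pod so a rollout can still progress.

```python
from __future__ import annotations


def resolve(value: int | str, replicas: int, round_up: bool) -> int:
    """Convert an absolute count or a percentage string such as "25%" to pods."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas
    # Integer arithmetic avoids float surprises: ceil for surge, floor for unavailable.
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(replicas: int, max_unavailable: int | str = "25%",
                   max_surge: int | str = "25%") -> tuple[int, int]:
    """Return (minimum pods kept available, maximum pods in existence) during a rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    if unavailable == 0 and surge == 0:
        # Both rounded to zero: allow one unavailable pod so the rollout can proceed.
        unavailable = 1
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    print(rollout_bounds(6, 1, 2))            # maxUnavailable 1, maxSurge 2: (5, 8)
    print(rollout_bounds(6, "50%", "25%"))    # aggressive: (3, 8)
```

At 50% unavailable, half of a six-pod Deployment can be gone at once, which is exactly when the startup propagation gap stops being harmless.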

Kubernetes Rolling Update Mechanics

Rolling updates are controlled by two fields in the Deployment strategy: maxUnavailable (how many pods below the desired count may be unavailable at once) and maxSurge (how many pods above the desired count may exist at once). The readiness probe acts as the gate: once the maxUnavailable budget is used up, Kubernetes will not proceed to terminate another old pod until a replacement has passed its readiness check. This means a poorly tuned readiness probe, one that returns 200 before the application has warmed up its connection pool, loaded its caches, or established downstream connections, both lets the rollout remove old pods too early and routes traffic to a pod that is technically alive but not ready to serve.
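One way to keep /healthz/ready honest is to return 503 until every warm-up step has finished, while /healthz/live answers 200 as long as the process is up. This sketch uses only the Python standard library; the step names (db_pool, cache) and the warm_up function are hypothetical placeholders, and in a real service these endpoints would live inside the application server rather than a separate probe-only server.

```python
from __future__ import annotations

import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ReadinessState:
    """Tracks warm-up steps; the pod reports ready only once every step is done."""

    def __init__(self, required: set[str]) -> None:
        self._pending = set(required)
        self._lock = threading.Lock()

    def mark_done(self, step: str) -> None:
        with self._lock:
            self._pending.discard(step)

    def is_ready(self) -> bool:
        with self._lock:
            return not self._pending


def make_handler(state: ReadinessState) -> type[BaseHTTPRequestHandler]:
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path == "/healthz/live":
                status = 200  # process is up; says nothing about dependencies
            elif self.path == "/healthz/ready":
                status = 200 if state.is_ready() else 503
            else:
                status = 404
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format: str, *args) -> None:
            pass  # keep probe traffic out of the logs

    return ProbeHandler


def warm_up(state: ReadinessState) -> None:
    # Hypothetical placeholders for opening the DB pool and loading caches.
    time.sleep(1)
    state.mark_done("db_pool")
    time.sleep(1)
    state.mark_done("cache")


if __name__ == "__main__":
    state = ReadinessState({"db_pool", "cache"})
    threading.Thread(target=warm_up, args=(state,), daemon=True).start()
    ThreadingHTTPServer(("0.0.0.0", 8080), make_handler(state)).serve_forever()
```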

A reasonable starting configuration for a production rolling update looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api-server
          image: myregistry/api-server:v2.4.1
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
            successThreshold: 1
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"

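The terminationGracePeriodSeconds: 60 gives the pod sixty seconds from the moment termination begins until the kubelet sends SIGKILL. The preStop hook runs before SIGTERM is delivered and its duration counts against that budget, so with sleep 10 the application has roughly fifty seconds after SIGTERM to finish draining. An exec hook like this also requires /bin/sh and sleep inside the image; distroless images do not ship them.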
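The same arithmetic can be checked mechanically against a manifest. This sketch takes a pod spec already parsed into a dict (for example, from the YAML above) and reports how many seconds each container has between SIGTERM and SIGKILL. It only recognizes preStop exec hooks whose command contains sleep N, and it applies the Kubernetes default of 30 seconds when terminationGracePeriodSeconds is unset; anything fancier is out of scope.

```python
import re

# Kubernetes default when terminationGracePeriodSeconds is not set.
DEFAULT_GRACE_SECONDS = 30


def prestop_sleep_seconds(container: dict) -> int:
    """Best-effort parse of a preStop exec hook that runs `sleep N`; 0 if absent."""
    command = (
        container.get("lifecycle", {})
        .get("preStop", {})
        .get("exec", {})
        .get("command", [])
    )
    match = re.search(r"\bsleep\s+(\d+)\b", " ".join(command))
    return int(match.group(1)) if match else 0


def drain_budget(pod_spec: dict) -> dict:
    """Seconds each container has between receiving SIGTERM and being SIGKILLed."""
    grace = pod_spec.get("terminationGracePeriodSeconds", DEFAULT_GRACE_SECONDS)
    return {c["name"]: grace - prestop_sleep_seconds(c) for c in pod_spec["containers"]}


if __name__ == "__main__":
    # The pod spec from the Deployment above, trimmed to the relevant fields.
    spec = {
        "terminationGracePeriodSeconds": 60,
        "containers": [{
            "name": "api-server",
            "lifecycle": {"preStop": {"exec": {"command": ["/bin/sh", "-c", "sleep 10"]}}},
        }],
    }
    for name, seconds in drain_budget(spec).items():
        print(f"{name}: {seconds}s to drain after SIGTERM")
        if seconds <= 0:
            print(f"{name}: the preStop sleep consumes the whole grace period")
```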

The preStop Hook Pattern

The preStop hook running sleep 10 looks like cargo-culted boilerplate until you understand why it exists. When Kubernetes terminates a pod, it sends SIGTERM to PID 1 in the container. A well-written application catches this signal and begins a graceful shutdown — draining in-flight requests, closing database connections, flushing buffers. The application's graceful shutdown is correct, but it races against the iptables propagation described above.
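A graceful shutdown along those lines can be sketched with Python's standard-library HTTP server. It is a minimal illustration, not a production server: the handler, port 8080, and class names are our own. Two details do the work. Request threads are non-daemon with block_on_close set, so server_close() waits for in-flight requests. And shutdown() runs on a helper thread, because Python signal handlers execute on the main thread, which here is the thread inside serve_forever(); calling shutdown() there directly would deadlock. An explicit handler matters even more when the process runs as PID 1, since the kernel does not apply default signal actions to PID 1 in its namespace and an unhandled SIGTERM is simply ignored until SIGKILL arrives.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


class DrainingServer(ThreadingHTTPServer):
    # Non-daemon request threads plus block_on_close make server_close()
    # join every in-flight request instead of abandoning it at exit.
    daemon_threads = False
    block_on_close = True


def install_sigterm_handler(server: DrainingServer) -> None:
    def on_sigterm(signum, frame) -> None:
        # shutdown() blocks until serve_forever() returns, so run it on
        # another thread rather than the one serve_forever() occupies.
        threading.Thread(target=server.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, on_sigterm)


def main(port: int = 8080) -> None:
    server = DrainingServer(("0.0.0.0", port), Handler)
    install_sigterm_handler(server)
    server.serve_forever()  # returns once shutdown() has been called
    server.server_close()   # closes the listener, then joins in-flight requests


if __name__ == "__main__":
    main()
```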

The sleep 10 in the preStop hook introduces a deliberate pause *before* SIGTERM reaches the application. Ten seconds comfortably exceeds the one-to-three-second propagation window of a healthy cluster, so by the end of the pause kube-proxy on every node has almost certainly removed the pod's IP from its iptables rules; on clusters where propagation lags further under load, size the sleep to the lag you actually observe. By the time the application begins shutting down, new connections have stopped arriving, and the pod drains only the requests that were routed to it before the rules converged: a bounded, finite set.

Without the sleep, your graceful shutdown handles existing in-flight requests correctly, but new requests continue arriving from nodes where iptables has not yet converged, and those requests fail. The sleep is not a workaround for a bad application — it is the correct response to a distributed systems timing problem that exists at the infrastructure layer.

Blue-Green Deployments

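Blue-green sidesteps most of the rolling update complexity by maintaining two complete environments. The "blue" environment serves all production traffic. You deploy your new version to the idle "green" environment, run verification, then cut over by changing the Kubernetes Service selector in a single API update. That update is atomic, but kube-proxy on each node still picks up the new endpoints asynchronously, and connections already open to blue pods stay there, so keep blue running until its traffic has drained. If anything is wrong, the rollback is another selector change.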

# Live Service — points at blue initially
apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: production
spec:
  selector:
    app: api-server
    slot: blue          # change to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080

---
# Blue Deployment (currently live)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-blue
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-server
      slot: blue
  template:
    metadata:
      labels:
        app: api-server
        slot: blue
    spec:
      containers:
        - name: api-server
          image: myregistry/api-server:v2.3.0

---
# Green Deployment (new version, pre-warmed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-green
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-server
      slot: green
  template:
    metadata:
      labels:
        app: api-server
        slot: green
    spec:
      containers:
        - name: api-server
          image: myregistry/api-server:v2.4.1

The cutover command is a single kubectl patch:

kubectl patch service api-server -n production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/selector/slot", "value": "green"}]'

Rollback is the same command with "value": "blue". The trade-off is resource cost: you run double the pods for as long as both environments exist. For services that depend on in-memory caches or connection pools, you also need to ensure the green environment has warmed up before the swap; a cold green environment will spike your p99 latency immediately after cutover.
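Since a half-rolled-out green environment is the main cutover risk, it can help to gate the selector change on green's rollout status. This sketch shells out to kubectl, applies the same JSON patch as the command above, and refuses to switch unless the target Deployment reports every replica updated, ready, and available. It assumes Deployments named api-server-blue and api-server-green as in the manifests above. Readiness is only as meaningful as the readiness probe, so a passing check does not prove caches are warm.

```python
import json
import subprocess
import sys

NAMESPACE = "production"
SERVICE = "api-server"  # Deployments are assumed to be named api-server-<slot>


def is_fully_ready(deployment: dict) -> bool:
    """True when the controller has observed the latest spec and every replica is ready."""
    replicas = deployment["spec"].get("replicas", 1)
    status = deployment.get("status", {})
    return (
        status.get("observedGeneration", 0) >= deployment["metadata"].get("generation", 0)
        and status.get("updatedReplicas", 0) == replicas
        and status.get("readyReplicas", 0) == replicas
        and status.get("availableReplicas", 0) == replicas
    )


def selector_patch(slot: str) -> str:
    """JSON patch that repoints the Service selector at the given slot."""
    if slot not in ("blue", "green"):
        raise ValueError(f"unknown slot: {slot!r}")
    return json.dumps([{"op": "replace", "path": "/spec/selector/slot", "value": slot}])


def get_json(*args: str) -> dict:
    result = subprocess.run(["kubectl", *args, "-o", "json"],
                            check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def cut_over(slot: str) -> None:
    deployment = get_json("get", "deployment", f"{SERVICE}-{slot}", "-n", NAMESPACE)
    if not is_fully_ready(deployment):
        sys.exit(f"{SERVICE}-{slot} is not fully ready; leaving the Service selector alone.")
    subprocess.run(["kubectl", "patch", "service", SERVICE, "-n", NAMESPACE,
                    "--type=json", "-p", selector_patch(slot)], check=True)
    print(f"Service {SERVICE} now selects slot={slot}.")


if __name__ == "__main__":
    cut_over(sys.argv[1] if len(sys.argv) > 1 else "green")
```

Running it with blue as the argument performs the rollback under the same check.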

Canary Deployments with Traffic Weighting

Canary deployments shift a small percentage of traffic to the new version, measure error rates and latency, and only proceed when metrics meet defined thresholds. Argo Rollouts makes this first-class:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 6
  strategy:
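    canary:
      # Service that Argo Rollouts points at canary pods only; it must already exist.
      canaryService: api-server-canary
      steps: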
        - setWeight: 5
        - pause:
            duration: 5m
        - setWeight: 20
        - pause:
            duration: 10m
        - setWeight: 50
        - pause:
            duration: 10m
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: api-server-canary
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api-server
          image: myregistry/api-server:v2.4.1
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

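If more than three measurements (failureLimit: 3) fail the 99% success condition, the analysis fails and Argo Rollouts aborts the rollout, scaling the canary down and sending all traffic back to the stable version with no human intervention. Failures are counted across the whole analysis, not consecutively. Note also that this Rollout has no trafficRouting block, so Argo Rollouts approximates each weight by scaling ReplicaSets: with six replicas the smallest canary is a single pod, well above 5% of traffic. Precise percentages need a traffic router such as an ingress controller or service mesh integration.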
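To make the counting rule concrete, here is a simplified model of how we read the analysis semantics: each measurement either satisfies successCondition or counts as a failure, failures accumulate across the analysis, and the metric fails once the count exceeds failureLimit. It is a sketch, not Argo Rollouts' implementation, and the measurement values are invented; check the AnalysisTemplate documentation for your version before relying on the exact boundary.

```python
def failing_measurement(success_rates: list, threshold: float = 0.99,
                        failure_limit: int = 3):
    """Return the 1-based index of the measurement that fails the analysis, or None.

    Simplified model of the AnalysisTemplate above: a measurement fails when it
    does not satisfy `result[0] >= threshold`, failures are counted cumulatively
    rather than consecutively, and the metric fails once failures exceed
    failure_limit.
    """
    failures = 0
    for index, rate in enumerate(success_rates, start=1):
        if not rate >= threshold:
            failures += 1
            if failures > failure_limit:
                return index
    return None


if __name__ == "__main__":
    # Hypothetical one-minute measurements: four misses, none back to back.
    rates = [0.995, 0.98, 0.999, 0.97, 0.995, 0.985, 0.996, 0.96]
    print(failing_measurement(rates))  # 8
```

Four scattered misses abort the canary just as surely as four in a row, which is usually what you want from a success-rate gate.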

For teams not running Argo Rollouts, NGINX Ingress provides a simpler weight-based approach using annotations:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-server-canary
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-server-canary
                port:
                  number: 80

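The canary Ingress only takes effect alongside a primary (non-canary) Ingress for the same host and path that routes to the stable Service. With that in place, setting canary-weight to 10 sends roughly ten percent of requests to the canary Service. Incrementing the weight over time achieves the same progressive delivery pattern as Argo Rollouts, but without automated analysis: you have to watch the metrics yourself and update the annotation by hand.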
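Even the manual process can be scripted so that each step is deliberate. This sketch steps the canary-weight annotation with kubectl annotate --overwrite and asks an operator to confirm health at each step, keeping the human judgement in the loop; the weights, the five-minute soak, and the function names are illustrative assumptions. If a check fails, it sets the weight to 0, which sends all requests back to the primary Ingress's backend.

```python
import subprocess
import time
from typing import Callable, Iterable

INGRESS = "api-server-canary"
NAMESPACE = "production"
WEIGHT_ANNOTATION = "nginx.ingress.kubernetes.io/canary-weight"


def annotate_command(weight: int) -> list:
    if not 0 <= weight <= 100:
        raise ValueError(f"canary weight must be between 0 and 100, got {weight}")
    return ["kubectl", "annotate", "ingress", INGRESS, "-n", NAMESPACE,
            f"{WEIGHT_ANNOTATION}={weight}", "--overwrite"]


def promote(weights: Iterable[int], set_weight: Callable[[int], None],
            is_healthy: Callable[[], bool], soak_seconds: float,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Step through weights, soaking at each; drop the canary to 0 on the first bad check."""
    for weight in weights:
        set_weight(weight)
        sleep(soak_seconds)
        if not is_healthy():
            set_weight(0)  # weight 0 routes every request to the primary Ingress backend
            return False
    return True


def kubectl_set_weight(weight: int) -> None:
    subprocess.run(annotate_command(weight), check=True)


def operator_says_healthy() -> bool:
    # Placeholder for "watch the metrics yourself": a human confirms each step.
    return input("Canary metrics look healthy? [y/N] ").strip().lower() == "y"


if __name__ == "__main__":
    ok = promote([10, 25, 50, 100], kubectl_set_weight, operator_says_healthy, soak_seconds=300)
    raise SystemExit(0 if ok else 1)
```

Reaching 100 still leaves the canary Ingress carrying all traffic; the usual final step is to point the primary at the new version and then zero or delete the canary, which this sketch leaves out.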

Database Migrations and Deployments

Rolling updates are the scenario where database migrations become genuinely dangerous, and canary or blue-green deployments that share a database face the same problem. During the rollout, v2 and v1 pods run simultaneously against the same database. Any migration that removes a column, renames a column, or changes a column's type will break the still-running v1 pods on the first query that touches that column. The expand-contract pattern, described in depth in the migration post in this series, is the only correct approach here.

The sequence is: first deploy a migration that only *adds* new structure (the expand phase) without removing anything the old code depends on. Then deploy the new application code, which can now use both the old and the new schema. Once all v1 pods are gone and you have confirmed the old schema elements are unused, deploy a final migration that removes them (the contract phase). Every schema change therefore spans at least two separate deployment cycles. It is slower, but it is the only way to ensure that still-running v1 pods never see a schema their code cannot handle.
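A compact way to see why the ordering matters is to run the phases against an in-memory SQLite database. The table, the rename of name to full_name, and the v1/v2 insert functions are hypothetical, and SQLite only stands in for a production database. The point is that v1's queries keep working after the expand step, and only the contract step removes what v1 depends on.

```python
import sqlite3


def create_v1_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")


def expand(conn: sqlite3.Connection) -> None:
    # Additive only: v1 code never mentions full_name, so its queries keep working.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def v1_insert(conn: sqlite3.Connection, name: str) -> None:
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def v2_insert(conn: sqlite3.Connection, name: str) -> None:
    # Dual-write while v1 pods may still be reading `name`.
    conn.execute("INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name))


def backfill(conn: sqlite3.Connection) -> None:
    # Fill full_name for rows written by v1 pods before and during the rollout.
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def contract(conn: sqlite3.Connection) -> None:
    # Only after every v1 pod is gone and a later release has stopped writing `name`.
    # DROP COLUMN requires SQLite 3.35.0 or newer.
    conn.execute("ALTER TABLE users DROP COLUMN name")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_v1_schema(conn)
    v1_insert(conn, "Ada")
    expand(conn)
    v1_insert(conn, "Grace")   # a v1 pod still running mid-rollout
    v2_insert(conn, "Linus")   # a v2 pod
    backfill(conn)
    print(conn.execute("SELECT name, full_name FROM users ORDER BY id").fetchall())
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        contract(conn)
        print(conn.execute("SELECT full_name FROM users ORDER BY id").fetchall())
```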

Deployment Verification

Readiness probes confirm that a pod can handle traffic. They do not confirm that the deployment as a whole is behaving correctly. A pod can pass its /healthz/ready endpoint while its downstream dependency is returning errors on 30% of requests, or while a subtle regression is inflating latency on a specific endpoint. Deployment verification — running smoke tests and synthetic checks immediately after a deploy completes — closes this gap.

#!/usr/bin/env python3
"""Deployment gate: runs smoke tests after a rolling update and triggers
rollback via the Kubernetes API if thresholds are not met."""

import time
import sys
import requests
from kubernetes import client, config

DEPLOY_NAME = "api-server"
NAMESPACE = "production"
SMOKE_ENDPOINTS = [
    ("/api/v1/health", 200),
    ("/api/v1/users?limit=1", 200),
    ("/api/v1/products?limit=1", 200),
]
BASE_URL = "https://api.example.com"
ERROR_THRESHOLD = 0.02   # 2% error rate triggers rollback
WINDOW_SECONDS = 120     # observe for 2 minutes
PROMETHEUS_URL = "http://prometheus.monitoring:9090"


def check_error_rate(service: str, window: str = "2m") -> float:
    query = (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[{window}])) / '
        f'sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return 0.0
    return float(result[0]["value"][1])


def run_smoke_tests() -> bool:
    passed = True
    for path, expected_status in SMOKE_ENDPOINTS:
        try:
            resp = requests.get(f"{BASE_URL}{path}", timeout=10)
            if resp.status_code != expected_status:
                print(f"FAIL {path}: expected {expected_status}, got {resp.status_code}")
                passed = False
            else:
                print(f"OK   {path}: {resp.status_code}")
        except requests.RequestException as exc:
            print(f"FAIL {path}: {exc}")
            passed = False
    return passed


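def rollback_deployment() -> None:
    """Roll back to the previous revision the way `kubectl rollout undo` does:
    copy the previous ReplicaSet's pod template back onto the Deployment.

    Rewriting the deployment.kubernetes.io/revision annotation does not trigger
    a rollback (the controller owns that annotation), and apps/v1 has no
    rollbackTo field. Requires RBAC to get and patch deployments and to list
    replicasets in NAMESPACE.
    """
    config.load_incluster_config()
    apps_v1 = client.AppsV1Api()
    deploy = apps_v1.read_namespaced_deployment(DEPLOY_NAME, NAMESPACE)

    # ReplicaSets owned by this Deployment, newest revision first.
    selector = ",".join(f"{k}={v}" for k, v in deploy.spec.selector.match_labels.items())
    replica_sets = apps_v1.list_namespaced_replica_set(NAMESPACE, label_selector=selector).items
    owned = [
        rs for rs in replica_sets
        if any(ref.uid == deploy.metadata.uid for ref in (rs.metadata.owner_references or []))
    ]

    def revision(rs) -> int:
        annotations = rs.metadata.annotations or {}
        return int(annotations.get("deployment.kubernetes.io/revision", "0"))

    owned.sort(key=revision, reverse=True)
    if len(owned) < 2:
        print("No previous revision found; cannot roll back.")
        return

    # owned[0] is the revision just rolled out; owned[1] is the one before it.
    previous = owned[1]
    template = previous.spec.template
    # Strip the controller-added pod-template-hash label, as kubectl does.
    if template.metadata and template.metadata.labels:
        template.metadata.labels.pop("pod-template-hash", None)
    patch = [{
        "op": "replace",
        "path": "/spec/template",
        "value": apps_v1.api_client.sanitize_for_serialization(template),
    }]
    # A list body is sent as a JSON patch, replacing the template wholesale.
    apps_v1.patch_namespaced_deployment(DEPLOY_NAME, NAMESPACE, patch)
    print(f"Rollback initiated to revision {revision(previous)}.")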


def main() -> None:
    print("Running smoke tests...")
    if not run_smoke_tests():
        print("Smoke tests failed — initiating rollback.")
        rollback_deployment()
        sys.exit(1)

    print(f"Smoke tests passed. Observing error rate for {WINDOW_SECONDS}s...")
    deadline = time.time() + WINDOW_SECONDS
    while time.time() < deadline:
        rate = check_error_rate(DEPLOY_NAME)
        remaining = int(deadline - time.time())
        print(f"  error rate: {rate:.4%}  ({remaining}s remaining)")
        if rate > ERROR_THRESHOLD:
            print(f"Error rate {rate:.4%} exceeds threshold {ERROR_THRESHOLD:.4%} — rolling back.")
            rollback_deployment()
            sys.exit(1)
        time.sleep(15)

    print("Deployment verification passed.")
    sys.exit(0)


if __name__ == "__main__":
    main()

This script is intended to run as a post-deploy Job in the same cluster, once the rollout has completed (for example, after kubectl rollout status returns successfully), with in-cluster RBAC permissions to read and patch the Deployment. It combines smoke tests (deterministic, immediate) with metric-based observation (probabilistic, sustained) into a single gate. If either fails, the rollback fires before anyone has to be paged in.

The deployment gate pattern — making every deploy block on verification before proceeding — is what separates teams that discover failures in post-incident reviews from teams that catch them in the deployment pipeline. Zero downtime is not just about the mechanics of pod replacement; it is about knowing the moment something goes wrong and having the automation to reverse it faster than your users notice.

*Zak Hassan is a Staff SRE specializing in distributed systems reliability, progressive delivery, and platform engineering. Find him at zakhassan.com or on LinkedIn.*
