Chaos Engineering in Practice: From GameDays to Continuous Verification

*By Zak Hassan — Staff SRE | May 2026*

Chaos engineering is the practice of deliberately introducing failures into a system to verify that it behaves correctly when those failures occur in production. The goal isn't to break things for the sake of breaking them; it's to answer a specific question: does the system actually tolerate the failures teams think it can tolerate?

Most reliability claims are untested. "We have circuit breakers" doesn't mean the circuit breakers work correctly. "We have multi-region failover" doesn't mean failover has been executed recently enough to trust. The gap between documented reliability and actual reliability is where real outages happen, and chaos engineering is the practice of closing that gap before a real outage exposes it for you.

The Hypothesis-Driven Approach

The difference between chaos engineering and random destruction is the hypothesis. Every chaos experiment starts with a specific, falsifiable prediction about system behavior:

Hypothesis template:
"When [failure condition], [system component] will [expected behavior], 
resulting in [observable outcome], and [business metric] will [not degrade / 
degrade by less than X%]."

Example hypotheses:

"When the payment-service database replica fails, the application will fail over 
to the primary within 30 seconds, resulting in a brief error spike of less than 
5%, and checkout success rate will recover to baseline within 60 seconds."

"When the experiment injects 200ms of latency into calls from checkout-service to 
inventory-service, the checkout flow will complete within 2 seconds due to 
the 1-second timeout and circuit breaker, and inventory calls will fall back 
to cached values."

"When the experiment kills 50% of the recommendation-service pods simultaneously, 
Kubernetes will start replacement pods within 3 minutes, and the product 
page will degrade gracefully to showing popular items instead of 
personalized recommendations."

A hypothesis that proves false isn't a failure — it's the most valuable output of chaos engineering. You've found a gap between assumed and actual reliability, before a production incident found it first.
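
One way to keep hypotheses falsifiable is to record the numeric parts of the template as data and check them against what was measured. The following is a minimal sketch under that assumption; the `Hypothesis` and `Observation` classes and the `evaluate` function are hypothetical names, and the observed numbers are illustrative, not real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One falsifiable prediction, following the template above."""
    failure_condition: str        # e.g. "payment-service database replica fails"
    expected_behavior: str        # e.g. "fails over to the primary within 30 seconds"
    max_error_rate: float         # allowed error spike as a fraction (0.05 = 5%)
    max_recovery_seconds: float   # time allowed to return to baseline


@dataclass(frozen=True)
class Observation:
    """What was measured during the experiment (hypothetical shape)."""
    peak_error_rate: float
    recovery_seconds: float


def evaluate(hypothesis: Hypothesis, observed: Observation) -> list[str]:
    """Return the violated predictions; an empty list means the hypothesis held."""
    violations = []
    if observed.peak_error_rate > hypothesis.max_error_rate:
        violations.append(
            f"error rate peaked at {observed.peak_error_rate:.1%}, "
            f"expected at most {hypothesis.max_error_rate:.1%}"
        )
    if observed.recovery_seconds > hypothesis.max_recovery_seconds:
        violations.append(
            f"recovery took {observed.recovery_seconds:.0f}s, "
            f"expected at most {hypothesis.max_recovery_seconds:.0f}s"
        )
    return violations


replica_failover = Hypothesis(
    failure_condition="payment-service database replica fails",
    expected_behavior="application fails over to the primary within 30 seconds",
    max_error_rate=0.05,
    max_recovery_seconds=60,
)

# Illustrative numbers only.
print(evaluate(replica_failover, Observation(peak_error_rate=0.08, recovery_seconds=45)))
```

A non-empty result is the "hypothesis proved false" case: each entry names a specific gap between assumed and actual behavior.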

The Blast Radius Framework

Chaos experiments must be bounded. Running an experiment in production without understanding the maximum possible impact is reckless, not brave.

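# Blast radius assessment before any experiment
# Note: get_dependency_graph, get_current_rps, and get_revenue_attribution are
# placeholders for your own service-catalog and metrics integrations.
from dataclasses import dataclass

@dataclass
class BlastRadius:
    affected_services: list[str]
    max_affected_rps: float
    revenue_exposure_per_minute: float
    is_reversible: bool
    recommendation: str

class BlastRadiusAssessment: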
    def __init__(self, target_service: str, experiment_type: str):
        self.target = target_service
        self.experiment = experiment_type
    
    def assess(self) -> BlastRadius:
        # What services depend on the target?
        downstream_services = get_dependency_graph(self.target, direction="downstream")
        
        # What is the traffic volume affected?
        requests_per_second = get_current_rps(self.target)
        
        # What is the revenue exposure per minute of impact?
        revenue_per_minute = get_revenue_attribution(self.target)
        
        # Is this reversible immediately?
        is_immediately_reversible = self.experiment in [
            "latency_injection",   # Stop injecting — latency returns to normal
            "error_injection",     # Stop injecting — errors stop
            "cpu_stress",          # Kill the stress process
        ]
        
        return BlastRadius(
            affected_services=downstream_services,
            max_affected_rps=requests_per_second,
            revenue_exposure_per_minute=revenue_per_minute,
            is_reversible=is_immediately_reversible,
            recommendation=self._recommend_environment(revenue_per_minute)
        )
    
    def _recommend_environment(self, revenue_per_minute: float) -> str:
        if revenue_per_minute > 10000:
            return "staging_only"    # >$10K/minute risk: staging only
        elif revenue_per_minute > 1000:
            return "production_off_peak"  # Run during low traffic
        else:
            return "production_anytime"   # Low risk: run in production at any time

The chaos maturity ladder: start in development, move to staging, then production during off-peak, then production during normal traffic. Each step requires proving the previous step's experiments are stable and instrumented.
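
The ladder can be expressed as a simple gate: a stage unlocks only once the stage before it has a track record of clean runs. This is a rough sketch; the stage names, the `next_allowed_stage` function, and the default of three clean runs are assumptions chosen for illustration, not a recommended threshold.

```python
LADDER = [
    "development",
    "staging",
    "production_off_peak",
    "production_normal_traffic",
]


def next_allowed_stage(clean_runs: dict[str, int], required_runs: int = 3) -> str:
    """Return the highest stage a team may run experiments in.

    clean_runs maps a stage name to the number of stable, fully
    instrumented experiment runs completed there. A stage unlocks only
    when every earlier stage has at least required_runs clean runs.
    """
    allowed = LADDER[0]
    for current, following in zip(LADDER, LADDER[1:]):
        if clean_runs.get(current, 0) < required_runs:
            break  # The previous step is not yet proven; stop climbing
        allowed = following
    return allowed


print(next_allowed_stage({"development": 4, "staging": 1}))
```

Note that clean runs in a later stage do not count if an earlier stage has not been proven, which matches the "each step requires the previous step" rule.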

Tooling: Chaos Mesh on Kubernetes

Chaos Mesh is a production-grade chaos platform for Kubernetes. It provides declarative chaos experiments via Kubernetes custom resources.

# Network partition: isolate checkout-service from payment-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-payment-partition
  namespace: production
spec:
  action: partition
  mode: fixed-percent
  value: "50"      # Affect 50% of checkout-service pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: checkout-service
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    mode: all
  duration: "5m"   # Run for 5 minutes then auto-terminate

---
# Pod failure: kill 30% of pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: inventory-pod-failure
  namespace: production
spec:
  action: pod-kill
  mode: fixed-percent
  value: "30"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: inventory-service
  duration: "3m"

---
# Latency injection: add 300ms to all outbound calls from search-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: search-dependency-latency
  namespace: staging
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: search-service
  delay:
    latency: "300ms"
    correlation: "25"   # 25% correlation between successive packets
    jitter: "50ms"
  duration: "10m"

---
# Stress test: CPU pressure on API gateway
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-gateway-cpu-stress
  namespace: staging
spec:
  mode: fixed
  value: "2"
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: api-gateway
  stressors:
    cpu:
      workers: 2        # 2 CPU stress workers
      load: 80          # Each at 80% CPU
  duration: "5m"
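
Teams that run the same experiment against many services sometimes generate these manifests rather than hand-writing them. The sketch below builds the latency-injection manifest shown above as a Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. The `network_delay_manifest` function is a hypothetical helper, and it only sets fields that already appear in the examples above.

```python
import json


def network_delay_manifest(name: str, namespace: str, app: str,
                           latency: str, duration: str,
                           jitter: str = "0ms") -> dict:
    """Build a NetworkChaos delay experiment targeting all pods of one app."""
    if not duration:
        # Refuse to build an experiment that would not auto-terminate
        raise ValueError("duration is required so the experiment ends on its own")
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app},
            },
            "delay": {"latency": latency, "jitter": jitter},
            "duration": duration,
        },
    }


manifest = network_delay_manifest(
    name="search-dependency-latency",
    namespace="staging",
    app="search-service",
    latency="300ms",
    duration="10m",
    jitter="50ms",
)
print(json.dumps(manifest, indent=2))
```

Making `duration` mandatory in the helper is a design choice: it keeps every generated experiment bounded in time even if nobody remembers to delete it.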

GameDay Design: The Human-in-the-Loop Exercise

A GameDay is a structured chaos experiment where on-call engineers practice responding to failures in a controlled setting. Unlike automated chaos experiments (which verify automation), GameDays test human response: Can the team detect the failure? How quickly? Can they diagnose it? Can they execute the remediation?

GameDay structure:

Pre-GameDay (1 week before):
- Define the scenario: what failure will be injected?
- Define success criteria: what does a successful response look like?
- Identify the observers (different from responders)
- Ensure monitoring is in place to observe the system during the experiment
- Brief responders that a GameDay is happening this week (not which day/time)

GameDay Day:
1. Observers convene (separate channel from responders)
2. Inject the failure at an unannounced time
3. Observe and record:
   - Time to first alert: did monitoring detect the failure?
   - Time to detection by engineer: how long until someone noticed?
   - Diagnostic actions taken: what did the engineer look at?
   - Time to correct diagnosis: how long until root cause was identified?
   - Remediation actions: were they correct?
   - Time to resolution: how long until the system recovered?
4. Terminate the experiment (observers have a kill switch)

Post-GameDay (same day):
1. Hot debrief: what happened? (30 minutes, immediate impressions)
2. Observers share their notes
3. Identify gaps: monitoring gaps, runbook gaps, response gaps
4. Create action items with owners and deadlines
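
The timings observers record on the day are easier to compare across GameDays if they are captured as timestamps and reduced to durations afterwards. A minimal sketch, assuming observers note a wall-clock time for each milestone; the `GameDayTimeline` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class GameDayTimeline:
    """Timestamps recorded by observers; None means the milestone never happened."""
    injected_at: datetime
    first_alert_at: Optional[datetime] = None
    engineer_noticed_at: Optional[datetime] = None
    diagnosed_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None

    def _since_injection(self, moment: Optional[datetime]) -> Optional[timedelta]:
        return None if moment is None else moment - self.injected_at

    def summary(self) -> dict[str, Optional[timedelta]]:
        """Durations measured from the moment the failure was injected."""
        return {
            "time_to_first_alert": self._since_injection(self.first_alert_at),
            "time_to_detection": self._since_injection(self.engineer_noticed_at),
            "time_to_correct_diagnosis": self._since_injection(self.diagnosed_at),
            "time_to_resolution": self._since_injection(self.resolved_at),
        }


# Illustrative timeline, not data from a real exercise.
start = datetime(2026, 5, 1, 10, 0)
timeline = GameDayTimeline(
    injected_at=start,
    first_alert_at=start + timedelta(minutes=2),
    engineer_noticed_at=start + timedelta(minutes=5),
    resolved_at=start + timedelta(minutes=30),
)
for milestone, elapsed in timeline.summary().items():
    print(milestone, elapsed if elapsed is not None else "not reached")
```

A milestone left as `None`, such as a diagnosis that was never reached before the kill switch was used, is itself a finding for the debrief.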

Sample GameDay scenarios by maturity level:

Stage 1 (beginner): kill a single pod. Does the on-call engineer notice? Does it auto-recover? How long does it take?

Stage 2 (intermediate): kill all pods for a non-critical service. Does the dependent service degrade gracefully or fail completely? Does the team correctly diagnose the root cause?

Stage 3 (advanced): inject latency into a critical dependency to simulate a slow but not-failing dependency. This is the scenario that most teams handle worst — it causes timeouts that are harder to diagnose than outright failures.

Stage 4 (expert): multi-failure injection. A deployment happens at the same time as a database replica fails. This tests the team's ability to manage multiple simultaneous incidents without jumping to the wrong root cause.

Continuous Verification: Chaos as a Pipeline Stage

The highest-maturity chaos practice is integrating experiments into the deployment pipeline — every major deploy triggers a set of chaos experiments to verify that the new version maintains resilience properties.

# chaos_verification.py — runs as a pipeline step post-deploy
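import subprocess
import time
from dataclasses import dataclass

import requests

# get_current_metrics(service) and send_alert(message) are assumed to be
# defined elsewhere, for example as wrappers around your metrics and paging APIs.

@dataclass
class ExperimentResult:
    experiment: str
    metrics_during: list
    recovered: bool
    max_error_rate: float
    max_p99_latency: float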

STEADY_STATE_METRICS = {
    "error_rate": 0.01,           # Max 1% error rate
    "p99_latency_ms": 500,        # Max 500ms p99 latency
    "checkout_success_rate": 0.99  # Min 99% checkout success
}

def verify_steady_state(service: str) -> bool:
    """Verify the system is in a healthy baseline before injecting chaos."""
    metrics = get_current_metrics(service)
    
    for metric, threshold in STEADY_STATE_METRICS.items():
        actual = metrics.get(metric)
        if actual is None:
            return False  # A missing metric means health cannot be confirmed
        if metric == "checkout_success_rate":
            if actual < threshold:  # Success rate is a minimum
                return False
        elif actual > threshold:    # Error rate and latency are maximums
            return False
    
    return True

def run_chaos_experiment(experiment_name: str, duration_seconds: int = 300) -> ExperimentResult:
    manifest = f"chaos/{experiment_name}.yaml"

    # Apply the chaos experiment via kubectl; raise if the apply fails
    subprocess.run(["kubectl", "apply", "-f", manifest], check=True)
    
    start_time = time.time()
    metrics_during = []
    
    try:
        while time.time() - start_time < duration_seconds:
            metrics_during.append(get_current_metrics("checkout-service"))
            time.sleep(10)
    finally:
        # Always terminate the experiment, even if metric collection raises
        subprocess.run(["kubectl", "delete", "-f", manifest], check=True)
    
    # Wait for recovery
    time.sleep(60)
    
    return ExperimentResult(
        experiment=experiment_name,
        metrics_during=metrics_during,
        recovered=verify_steady_state("checkout-service"),
        # default= guards against an empty sample list (e.g. a zero-length run)
        max_error_rate=max((m["error_rate"] for m in metrics_during), default=0.0),
        max_p99_latency=max((m["p99_latency_ms"] for m in metrics_during), default=0.0)
    )

# Pipeline integration
def chaos_verification_stage(deployed_version: str) -> bool:
    print(f"Running chaos verification for version {deployed_version}")
    
    if not verify_steady_state("checkout-service"):
        print("System not in steady state — skipping chaos verification")
        return False   # Don't add chaos to an already-degraded system
    
    results = []
    
    # Run each experiment sequentially with recovery time between
    for experiment in ["pod-kill-20pct", "latency-300ms", "dependency-partition"]:
        result = run_chaos_experiment(experiment, duration_seconds=180)
        results.append(result)
        
        if not result.recovered:
            print(f"System did not recover after {experiment} — failing pipeline")
            send_alert(f"Chaos verification FAILED: {experiment} in {deployed_version}")
            return False
        
        time.sleep(60)  # Recovery time between experiments
    
    print(f"Chaos verification PASSED for {deployed_version}")
    return True

Continuous verification catches regressions in resilience properties — a code change that accidentally removed a circuit breaker, a configuration change that broke a retry policy. Without continuous verification, these regressions sit silently until a production failure discovers them.
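
A recovery check alone can miss a slow erosion of resilience, where the system still recovers but degrades more than it used to. One possible extension, sketched here as an assumption rather than part of the pipeline above, is to compare each experiment's peak error rate against the value recorded for the previous release. The `find_regressions` function, the 10% tolerance, and the sample numbers are all hypothetical.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.10) -> list[str]:
    """Return experiments whose peak error rate grew by more than `tolerance`.

    Both arguments map an experiment name to the peak error rate observed
    while that experiment was running. Experiments with no baseline are
    skipped, since there is nothing to compare them against.
    """
    regressions = []
    for experiment, current_peak in current.items():
        previous_peak = baseline.get(experiment)
        if previous_peak is None:
            continue
        if current_peak > previous_peak * (1 + tolerance):
            regressions.append(experiment)
    return sorted(regressions)


# Illustrative numbers only.
previous_release = {"pod-kill-20pct": 0.020, "latency-300ms": 0.010}
this_release = {
    "pod-kill-20pct": 0.021,        # within tolerance
    "latency-300ms": 0.040,         # four times worse than before
    "dependency-partition": 0.030,  # new experiment, no baseline yet
}
print(find_regressions(previous_release, this_release))
```

With a baseline of exactly zero, any nonzero peak is flagged, which may or may not be the behavior a given team wants.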

What to Do When an Experiment Fails

When an experiment reveals that the system doesn't behave as expected, the response is not to patch the chaos experiment — it's to fix the system.

The process: document the finding (what was expected vs. what actually happened), create a ticket with high priority (this is a real reliability gap), implement the fix, and re-run the experiment to confirm the fix works. Then update the hypothesis to reflect the new expected behavior.

The organizational learning is as important as the technical fix. An experiment that reveals a gap should trigger a conversation about whether similar gaps exist elsewhere — the same missing timeout configuration might be in ten services, not just the one that failed the experiment.
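
Checking for the same gap elsewhere can often be automated once service configuration is available in one place. A minimal sketch, assuming each service's settings can be loaded into a dictionary; the `services_missing_setting` function, the service names other than checkout-service and inventory-service, and the `timeout_ms` key are hypothetical.

```python
def services_missing_setting(configs: dict[str, dict], setting: str) -> list[str]:
    """Return the services whose configuration lacks a value for `setting`."""
    return sorted(
        name for name, config in configs.items()
        if config.get(setting) is None
    )


# Hypothetical configuration snapshot, keyed by service name.
service_configs = {
    "checkout-service": {"timeout_ms": 1000, "retries": 2},
    "inventory-service": {"retries": 3},                       # no timeout set
    "search-service": {"timeout_ms": None, "retries": 1},      # explicitly unset
}
print(services_missing_setting(service_configs, "timeout_ms"))
```

Each service in the result is a candidate for the same fix, and for its own experiment to confirm the fix holds.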

*Zak Hassan is a Staff SRE specializing in reliability engineering, chaos experimentation, and resilience verification. Find him at zakhassan.com or on LinkedIn.*
