*By Zak Hassan — Staff SRE | May 2026*
Chaos engineering is the practice of deliberately introducing failures into a system to verify that it behaves correctly when those failures occur in production. The goal isn't to break things for the sake of breaking them. It's to answer a specific question: does the system actually tolerate the failures your team believes it can tolerate?
Most reliability claims are untested. "We have circuit breakers" doesn't mean the circuit breakers work correctly. "We have multi-region failover" doesn't mean failover has been executed recently enough to trust. The gap between documented reliability and actual reliability is where real outages happen, and chaos engineering is the practice of closing that gap before a real outage exposes it for you.
The Hypothesis-Driven Approach
The difference between chaos engineering and random destruction is the hypothesis. Every chaos experiment starts with a specific, falsifiable prediction about system behavior:
Hypothesis template:
"When [failure condition], [system component] will [expected behavior],
resulting in [observable outcome], and [business metric] will [not degrade /
degrade by less than X%]."
Example hypotheses:
"When the payment-service database replica fails, the application will fail over
to the primary within 30 seconds, resulting in a brief error spike of less than
5%, and checkout success rate will recover to baseline within 60 seconds."
"When the experiment injects 200ms of latency into calls from checkout-service to
inventory-service, the checkout flow will complete within 2 seconds due to
the 1-second timeout and circuit breaker, and inventory calls will fall back
to cached values."
"When the experiment kills 50% of the recommendation-service pods simultaneously,
the Deployment's ReplicaSet will create replacement pods within 3 minutes, and the product
page will degrade gracefully to showing popular items instead of
personalized recommendations."

A hypothesis that proves false isn't a failure. It's the most valuable output of chaos engineering: you've found a gap between assumed and actual reliability before a production incident found it for you.
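One way to keep hypotheses falsifiable is to record them as data and check them mechanically after each run. The following is a minimal sketch, not a prescribed format: the `Hypothesis` class, its field names, and the 5% tolerance are assumptions made for illustration, loosely following the template fields above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One falsifiable prediction (hypothetical structure mirroring the template)."""
    failure_condition: str
    expected_behavior: str
    metric: str                 # name of the business metric being watched
    baseline: float             # metric value before the failure is injected
    max_degradation_pct: float  # tolerated drop, as a percentage of baseline

    def holds(self, observed: float) -> bool:
        """Return True if the observed metric stayed within the tolerated drop."""
        if self.baseline <= 0:
            raise ValueError("baseline must be positive")
        degradation_pct = (self.baseline - observed) / self.baseline * 100
        return degradation_pct <= self.max_degradation_pct


# Illustrative values only; the tolerance is an assumption, not a recommendation.
replica_failover = Hypothesis(
    failure_condition="payment-service database replica fails",
    expected_behavior="application fails over to the primary within 30 seconds",
    metric="checkout_success_rate",
    baseline=0.99,
    max_degradation_pct=5.0,
)

print(replica_failover.holds(observed=0.97))  # dip of about 2%: hypothesis holds
print(replica_failover.holds(observed=0.80))  # drop of about 19%: hypothesis falsified
```

Storing the tolerated degradation next to the prediction means a run can only pass or fail; there is no room to reinterpret the result afterwards.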
The Blast Radius Framework
Chaos experiments must be bounded. Running an experiment in production without understanding the maximum possible impact is reckless, not brave.
# Blast radius assessment before any experiment.
# get_dependency_graph, get_current_rps, and get_revenue_attribution are assumed
# to come from your own service catalog and metrics tooling; they are not defined here.
from dataclasses import dataclass


@dataclass
class BlastRadius:
    affected_services: list
    max_affected_rps: float
    revenue_exposure_per_minute: float
    is_reversible: bool
    recommendation: str


class BlastRadiusAssessment:
    def __init__(self, target_service: str, experiment_type: str):
        self.target = target_service
        self.experiment = experiment_type

    def assess(self) -> BlastRadius:
        # What services depend on the target?
        downstream_services = get_dependency_graph(self.target, direction="downstream")

        # What is the traffic volume affected?
        requests_per_second = get_current_rps(self.target)

        # What is the revenue exposure per minute of impact?
        revenue_per_minute = get_revenue_attribution(self.target)

        # Is this reversible immediately?
        is_immediately_reversible = self.experiment in [
            "latency_injection",  # Stop injecting: latency returns to normal
            "error_injection",    # Stop injecting: errors stop
            "cpu_stress",         # Kill the stress process
        ]

        return BlastRadius(
            affected_services=downstream_services,
            max_affected_rps=requests_per_second,
            revenue_exposure_per_minute=revenue_per_minute,
            is_reversible=is_immediately_reversible,
            recommendation=self._recommend_environment(revenue_per_minute),
        )

    def _recommend_environment(self, revenue_per_minute: float) -> str:
        if revenue_per_minute > 10000:
            return "staging_only"         # >$10K/minute risk: staging only
        elif revenue_per_minute > 1000:
            return "production_off_peak"  # Run during low traffic
        else:
            return "production_anytime"   # Low risk: run in production at any time

The chaos maturity ladder: start in development, move to staging, then production during off-peak, then production during normal traffic. Each step requires proving the previous step's experiments are stable and instrumented.
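The ladder can be treated as an ordered set of stages with an explicit promotion rule. Here is a small sketch under that assumption; `ChaosStage` and `next_stage` are hypothetical names, and how "stable" and "instrumented" are decided is left to the team.

```python
from enum import IntEnum


class ChaosStage(IntEnum):
    """Rungs of the maturity ladder, in the order described above."""
    DEVELOPMENT = 1
    STAGING = 2
    PRODUCTION_OFF_PEAK = 3
    PRODUCTION_NORMAL_TRAFFIC = 4


def next_stage(current: ChaosStage, stable: bool, instrumented: bool) -> ChaosStage:
    """Advance one rung only if the current stage's experiments are stable and instrumented."""
    if not (stable and instrumented):
        return current  # stay put until both conditions are proven
    if current is ChaosStage.PRODUCTION_NORMAL_TRAFFIC:
        return current  # already at the top of the ladder
    return ChaosStage(current + 1)


print(next_stage(ChaosStage.STAGING, stable=True, instrumented=True).name)
print(next_stage(ChaosStage.STAGING, stable=True, instrumented=False).name)
```

The rule never skips a rung: a team with stable staging experiments moves to off-peak production, not straight to normal-traffic production.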
Tooling: Chaos Mesh on Kubernetes
Chaos Mesh is a production-grade chaos platform for Kubernetes. It provides declarative chaos experiments via Kubernetes custom resources.
# Network partition: isolate checkout-service from payment-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-payment-partition
  namespace: production
spec:
  action: partition
  mode: fixed-percent
  value: "50"            # Affect 50% of checkout-service pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: checkout-service
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    mode: all
  duration: "5m"         # Run for 5 minutes then auto-terminate
---
# Pod failure: kill 30% of pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: inventory-pod-failure
  namespace: production
spec:
  action: pod-kill
  mode: fixed-percent
  value: "30"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: inventory-service
  duration: "3m"
---
# Latency injection: add 300ms to all outbound calls from search-service
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: search-dependency-latency
  namespace: staging
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: search-service
  delay:
    latency: "300ms"
    correlation: "25"    # 25% correlation between successive packets
    jitter: "50ms"
  duration: "10m"
---
# Stress test: CPU pressure on API gateway
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-gateway-cpu-stress
  namespace: staging
spec:
  mode: fixed
  value: "2"             # Target 2 api-gateway pods
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: api-gateway
  stressors:
    cpu:
      workers: 2         # 2 CPU stress workers
      load: 80           # Each at 80% CPU
  duration: "5m"

GameDay Design: The Human-in-the-Loop Exercise
A GameDay is a structured chaos experiment where on-call engineers practice responding to failures in a controlled setting. Unlike automated chaos experiments (which verify automation), GameDays test human response: Can the team detect the failure? How quickly? Can they diagnose it? Can they execute the remediation?
GameDay structure:
Pre-GameDay (1 week before):
- Define the scenario: what failure will be injected?
- Define success criteria: what does a successful response look like?
- Identify the observers (different from responders)
- Ensure monitoring is in place to observe the system during the experiment
- Brief responders that a GameDay is happening this week (not which day/time)
GameDay Day:
1. Observers convene (separate channel from responders)
2. Inject the failure at an unannounced time
3. Observe and record:
- Time to first alert: did monitoring detect the failure?
- Time to detection by engineer: how long until someone noticed?
- Diagnostic actions taken: what did the engineer look at?
- Time to correct diagnosis: how long until root cause was identified?
- Remediation actions: were they correct?
- Time to resolution: how long until the system recovered?
4. Terminate the experiment (observers have a kill switch)
Post-GameDay (same day):
1. Hot debrief: what happened? (30 minutes, immediate impressions)
2. Observers share their notes
3. Identify gaps: monitoring gaps, runbook gaps, response gaps
4. Create action items with owners and deadlines

Sample GameDay scenarios by maturity level:
Stage 1 (beginner): kill a single pod. Does the on-call engineer notice? Does it auto-recover? How long does it take?
Stage 2 (intermediate): kill all pods for a non-critical service. Does the dependent service degrade gracefully or fail completely? Does the team correctly diagnose the root cause?
Stage 3 (advanced): inject latency into a critical dependency to simulate a slow but not-failing dependency. This is the scenario that most teams handle worst — it causes timeouts that are harder to diagnose than outright failures.
Stage 4 (expert): multi-failure injection. A deployment happens at the same time as a database replica fails. This tests the team's ability to manage multiple simultaneous incidents without jumping to the wrong root cause.
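The timings observers record during a GameDay (time to first alert, detection, diagnosis, and resolution) are easier to compare across exercises when they are derived from raw timestamps. A minimal sketch follows, assuming a hypothetical `GameDayTimeline` record that an observer fills in; the field names and sample times are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class GameDayTimeline:
    """Timestamps an observer records during a GameDay (field names are illustrative)."""
    injected_at: datetime
    first_alert_at: Optional[datetime] = None       # None: monitoring never fired
    engineer_noticed_at: Optional[datetime] = None  # None: nobody noticed
    diagnosed_at: Optional[datetime] = None         # None: root cause never identified
    resolved_at: Optional[datetime] = None          # None: not resolved before the kill switch

    def _elapsed(self, moment: Optional[datetime]) -> Optional[timedelta]:
        """Time since injection, or None if the event never happened."""
        return None if moment is None else moment - self.injected_at

    def summary(self) -> Dict[str, Optional[timedelta]]:
        return {
            "time_to_first_alert": self._elapsed(self.first_alert_at),
            "time_to_detection": self._elapsed(self.engineer_noticed_at),
            "time_to_diagnosis": self._elapsed(self.diagnosed_at),
            "time_to_resolution": self._elapsed(self.resolved_at),
        }


# Made-up example: the team recovered the system without pinning down the root cause.
start = datetime(2026, 5, 1, 14, 0, 0)
timeline = GameDayTimeline(
    injected_at=start,
    first_alert_at=start + timedelta(minutes=2),
    engineer_noticed_at=start + timedelta(minutes=5),
    resolved_at=start + timedelta(minutes=21),
)
for name, elapsed in timeline.summary().items():
    print(f"{name}: {elapsed if elapsed is not None else 'never happened'}")
```

Keeping "never happened" distinct from a long duration matters in the debrief: a missing alert is a monitoring gap, while a slow diagnosis is a response gap.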
Continuous Verification: Chaos as a Pipeline Stage
The highest-maturity chaos practice is integrating experiments into the deployment pipeline — every major deploy triggers a set of chaos experiments to verify that the new version maintains resilience properties.
# chaos_verification.py: runs as a pipeline step post-deploy.
# get_current_metrics and send_alert are assumed to come from your own
# monitoring and alerting clients; they are not defined here.
import subprocess
import time
from dataclasses import dataclass

STEADY_STATE_METRICS = {
    "error_rate": 0.01,             # Max 1% error rate
    "p99_latency_ms": 500,          # Max 500ms p99 latency
    "checkout_success_rate": 0.99,  # Min 99% checkout success
}

# Metrics where a higher value is better; all others must stay at or below the threshold
HIGHER_IS_BETTER = {"checkout_success_rate"}


@dataclass
class ExperimentResult:
    experiment: str
    metrics_during: list
    recovered: bool
    max_error_rate: float
    max_p99_latency: float


def verify_steady_state(service: str) -> bool:
    """Verify the system is in a healthy baseline before injecting chaos."""
    metrics = get_current_metrics(service)
    for metric, threshold in STEADY_STATE_METRICS.items():
        actual = metrics.get(metric)
        if actual is None:
            return False  # A missing metric cannot prove steady state
        if metric in HIGHER_IS_BETTER:
            if actual < threshold:
                return False
        elif actual > threshold:
            return False
    return True


def run_chaos_experiment(experiment_name: str, duration_seconds: int = 300) -> ExperimentResult:
    manifest = f"chaos/{experiment_name}.yaml"
    metrics_during = []

    # Apply the chaos experiment via kubectl; stop here if the apply itself fails
    subprocess.run(["kubectl", "apply", "-f", manifest], check=True)
    try:
        start_time = time.monotonic()
        while time.monotonic() - start_time < duration_seconds:
            metrics_during.append(get_current_metrics("checkout-service"))
            time.sleep(10)
    finally:
        # Always terminate the experiment, even if metric collection raised
        subprocess.run(["kubectl", "delete", "-f", manifest], check=True)

    # Wait for recovery
    time.sleep(60)

    return ExperimentResult(
        experiment=experiment_name,
        metrics_during=metrics_during,
        recovered=verify_steady_state("checkout-service"),
        max_error_rate=max((m["error_rate"] for m in metrics_during), default=0.0),
        max_p99_latency=max((m["p99_latency_ms"] for m in metrics_during), default=0.0),
    )


# Pipeline integration
def chaos_verification_stage(deployed_version: str) -> bool:
    print(f"Running chaos verification for version {deployed_version}")

    if not verify_steady_state("checkout-service"):
        print("System not in steady state, skipping chaos verification")
        return False  # Don't add chaos to an already-degraded system

    results = []
    # Run each experiment sequentially with recovery time between
    for experiment in ["pod-kill-20pct", "latency-300ms", "dependency-partition"]:
        result = run_chaos_experiment(experiment, duration_seconds=180)
        results.append(result)

        if not result.recovered:
            print(f"System did not recover after {experiment}, failing pipeline")
            send_alert(f"Chaos verification FAILED: {experiment} in {deployed_version}")
            return False

        time.sleep(60)  # Recovery time between experiments

    print(f"Chaos verification PASSED for {deployed_version}")
    return True

Continuous verification catches regressions in resilience properties: a code change that accidentally removed a circuit breaker, a configuration change that broke a retry policy. Without continuous verification, these regressions sit silently until a production failure discovers them.
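Catching a regression implies comparing a new release against something. One possible approach, sketched here under assumptions, is to keep a small snapshot of each release's experiment results and flag any metric that got meaningfully worse. `ResilienceSnapshot`, `find_regressions`, and the 10% tolerance are hypothetical and not part of the pipeline script above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ResilienceSnapshot:
    """Peak values observed during one experiment for one release (illustrative)."""
    version: str
    recovered: bool
    max_error_rate: float
    max_p99_latency_ms: float


def find_regressions(
    previous: ResilienceSnapshot,
    current: ResilienceSnapshot,
    tolerance: float = 0.10,  # assumed: ignore changes smaller than 10%
) -> List[str]:
    """List the resilience properties that got worse between two releases."""
    findings = []
    if previous.recovered and not current.recovered:
        findings.append("no longer recovers after the experiment")
    if current.max_error_rate > previous.max_error_rate * (1 + tolerance):
        findings.append("peak error rate increased")
    if current.max_p99_latency_ms > previous.max_p99_latency_ms * (1 + tolerance):
        findings.append("peak p99 latency increased")
    return findings


# Made-up numbers for two releases running the same experiment.
previous = ResilienceSnapshot("v1", recovered=True, max_error_rate=0.02, max_p99_latency_ms=400.0)
current = ResilienceSnapshot("v2", recovered=True, max_error_rate=0.05, max_p99_latency_ms=410.0)
print(find_regressions(previous, current))
```

A relative comparison like this complements the absolute steady-state thresholds: a release can stay under the absolute limits while still being noticeably less resilient than its predecessor.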
What to Do When an Experiment Fails
When an experiment reveals that the system doesn't behave as expected, the response is not to patch the chaos experiment — it's to fix the system.
The process: document the finding (what was expected vs. what actually happened), create a ticket with high priority (this is a real reliability gap), implement the fix, and re-run the experiment to confirm the fix works. Then update the hypothesis to reflect the new expected behavior.
The organizational learning is as important as the technical fix. An experiment that reveals a gap should trigger a conversation about whether similar gaps exist elsewhere — the same missing timeout configuration might be in ten services, not just the one that failed the experiment.
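Checking whether the same gap exists elsewhere can often be partly automated. As a rough sketch, assuming each service's resilience settings are available as a dictionary, the helper below lists services where a given setting is absent or unset. The `timeout_ms` and `retries` keys and the sample configs are invented for illustration.

```python
from typing import List, Mapping


def services_missing_setting(
    configs: Mapping[str, Mapping[str, object]], setting: str
) -> List[str]:
    """Return the names of services whose config lacks the setting or leaves it unset."""
    return sorted(
        name for name, config in configs.items() if config.get(setting) is None
    )


# Hypothetical configs; in practice these would be loaded from wherever
# your services keep their client settings.
configs = {
    "checkout-service": {"timeout_ms": 1000, "retries": 2},
    "search-service": {"retries": 3},
    "inventory-service": {"timeout_ms": None},
}

print(services_missing_setting(configs, "timeout_ms"))
```

A scan like this only finds gaps that show up in configuration. It does not replace re-running the experiment against the other services.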
*Zak Hassan is a Staff SRE specializing in reliability engineering, chaos experimentation, and resilience verification. Find him at zakhassan.com or on LinkedIn.*