Testing Your Infrastructure Before It Fails: Chaos Engineering, Game Days, and IaC Validation

Most engineering organizations test their application code. Fewer test their infrastructure. And fewer still test their reliability — the system's ability to behave correctly under adverse conditions that don't happen in the normal development workflow. The gap between "this works in a test environment" and "this survives the things that actually happen in production" is where most incidents are born.

This article covers three testing disciplines that narrow that gap: chaos engineering, game days, and infrastructure code validation.

What Infrastructure Testing Actually Means

There are three distinct things to test in the infrastructure domain:

Infrastructure code correctness. Does your Terraform actually create what you think it creates? Does the security group allow the right traffic? Is the IAM policy least-privilege? Does the RDS instance have encryption enabled? These are properties of the infrastructure definition that can be validated before anything is deployed.

Infrastructure behavior under normal conditions. Does the system work when everything is behaving correctly? This is your standard integration testing applied to infrastructure: can the service reach the database? Do health checks pass? Is the autoscaler configured correctly?

Infrastructure behavior under adverse conditions. Does the system behave correctly when things fail? What happens when an instance is terminated? When the network partitions? When the database connection pool is exhausted? When the disk fills up? This is chaos engineering territory.

Most teams do the first two inconsistently. Almost no teams systematically do the third until after an incident reveals the gaps.

Testing Infrastructure Code Before Deployment

Terraform Static Analysis

Every Terraform module should pass static analysis before being applied:

# Install tools
brew install tflint checkov

# tflint: Terraform linting with provider-specific rules
tflint --init
tflint --recursive

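# checkov: Security and compliance scanning
# --check takes a comma-separated list of check IDs; a comment after a trailing
# backslash would break the line continuation, so the notes live up here.
# Intended coverage: EBS volume encryption, S3 bucket policy not public,
# RDS instance encrypted. Confirm each ID against the checkov policy index
# for your version before relying on it.
checkov -d . --framework terraform \
  --check CKV_AWS_8,CKV_AWS_111,CKV_AWS_135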

These tools catch the category of infrastructure misconfiguration that's obvious in retrospect but easy to miss in a PR review: S3 buckets with public access enabled, security groups open to 0.0.0.0/0, RDS instances without Multi-AZ, CloudTrail logging disabled.

Policy-as-code with OPA (Open Policy Agent). For organization-specific policies that off-the-shelf tools don't cover, OPA lets you write custom validation rules:

# policy/require_tags.rego
package terraform

# rego.v1 syntax: required by OPA 1.0+, accepted by recent 0.x releases
import rego.v1

deny contains msg if {
    some resource in input.resource_changes
    resource.type == "aws_instance"
    # change.after is null for resources being destroyed; skip those
    resource.change.after != null
    not resource.change.after.tags.team
    msg := sprintf("EC2 instance '%s' is missing required 'team' tag", [resource.address])
}

deny contains msg if {
    some resource in input.resource_changes
    resource.type == "aws_s3_bucket"
    resource.change.after != null
    not resource.change.after.tags.data_classification
    msg := sprintf("S3 bucket '%s' is missing required 'data_classification' tag", [resource.address])
}

Run OPA against the Terraform plan output in CI:

terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
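# Query deny[_] rather than deny: the deny set is always defined (it is just empty
# when nothing is violated), so --fail-defined on the bare set would fail every run.
opa eval --input tfplan.json --data policy/ "data.terraform.deny[_]" --fail-defined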

If any deny rules trigger, the CI job fails before a terraform apply happens.
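To make the shape of the plan JSON concrete, here is a minimal Go sketch of the same two tag rules. It is an illustration of what the policy reads from `resource_changes`, not a replacement for OPA. The function name `missingTags`, the `requiredTags` table, and the inline sample plan are hypothetical, and the sketch treats an empty tag value as missing, which is slightly stricter than the Rego rules.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan mirrors only the fields of `terraform show -json` output that the tag rules read.
type plan struct {
	ResourceChanges []resourceChange `json:"resource_changes"`
}

type resourceChange struct {
	Address string `json:"address"`
	Type    string `json:"type"`
	Change  struct {
		// After is null in the plan JSON when the resource is being destroyed.
		After *struct {
			Tags map[string]string `json:"tags"`
		} `json:"after"`
	} `json:"change"`
}

// requiredTags maps a resource type to the tag it must carry (hypothetical table
// holding the same two rules as the Rego policy).
var requiredTags = map[string]string{
	"aws_instance":  "team",
	"aws_s3_bucket": "data_classification",
}

// missingTags returns one message per planned resource that lacks its required tag.
func missingTags(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, fmt.Errorf("parse plan JSON: %w", err)
	}
	var violations []string
	for _, rc := range p.ResourceChanges {
		tag, checked := requiredTags[rc.Type]
		if !checked || rc.Change.After == nil {
			continue // type has no rule, or the resource is being destroyed
		}
		if rc.Change.After.Tags[tag] == "" {
			violations = append(violations,
				fmt.Sprintf("%s is missing required %q tag", rc.Address, tag))
		}
	}
	return violations, nil
}

func main() {
	// Hypothetical two-resource plan: a tagged instance and an untagged bucket.
	sample := []byte(`{"resource_changes":[
	  {"address":"aws_instance.web","type":"aws_instance","change":{"after":{"tags":{"team":"payments"}}}},
	  {"address":"aws_s3_bucket.logs","type":"aws_s3_bucket","change":{"after":{"tags":null}}}
	]}`)
	violations, err := missingTags(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, v := range violations {
		fmt.Println(v)
	}
}
```

For anything beyond a couple of rules, the Rego policy is likely the better home: it keeps policy out of application code and lets the same rules run in CI and at review time.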

Integration Testing with Terratest

Terratest (from Gruntwork) runs Go tests that deploy infrastructure, validate it, and tear it down. For critical infrastructure modules, this is the highest-confidence validation:

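package test

import (
    "fmt"
    "testing"
    "time"

    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestAlbModule(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/alb",
        Vars: map[string]interface{}{
            // Unique name so parallel runs don't collide
            "name": fmt.Sprintf("test-alb-%s", random.UniqueId()),
            // Placeholder IDs: substitute a VPC and subnets from your test account
            "vpc_id":     "vpc-12345",
            "subnet_ids": []string{"subnet-a", "subnet-b"},
        },
    }

    // Clean up at end of test, even if an assertion fails
    defer terraform.Destroy(t, terraformOptions)

    // Apply the module
    terraform.InitAndApply(t, terraformOptions)

    // Read the outputs
    albDnsName := terraform.Output(t, terraformOptions, "alb_dns_name")

    // Wait for ALB to be healthy and test it:
    // expect HTTP 200 with body "OK", retrying up to 10 times, 30s apart
    http_helper.HttpGetWithRetry(
        t,
        fmt.Sprintf("http://%s/health", albDnsName),
        nil,
        200,
        "OK",
        10,
        30*time.Second,
    )

    // Validate security configuration.
    // validateAlbNotPublic and validateAlbHasDeletionProtection are project-local
    // helpers (not part of Terratest) defined elsewhere in this test package.
    albArn := terraform.Output(t, terraformOptions, "alb_arn")
    validateAlbNotPublic(t, albArn)
    validateAlbHasDeletionProtection(t, albArn)
}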

Terratest tests are slow (they're doing real infrastructure operations) but give you genuine confidence that the module works end-to-end in a real AWS account.

Chaos Engineering: Systematically Testing Failure

Chaos engineering is the practice of intentionally introducing failures into your system to verify that it handles them correctly. The goal is to find failures in controlled conditions rather than discovering them during production incidents.

Starting Small: AWS Fault Injection Service

AWS Fault Injection Service (FIS) lets you run chaos experiments against AWS infrastructure without managing chaos agent software:

{
  "description": "Test payment-service resilience to EC2 instance failure",
  "targets": {
    "PaymentInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"Service": "payment-service", "Environment": "staging"},
      "selectionMode": "PERCENT(25)"
    }
  },
  "actions": {
    "TerminateInstances": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": {"Instances": "PaymentInstances"}
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:...:alarm:payment-service-error-rate-critical"
    }
  ]
}

The stopConditions are critical: if the experiment causes a real problem (the alarm fires), FIS stops the experiment automatically. You're not doing uncontrolled chaos — you're doing bounded experiments with automatic stop conditions.

Common FIS experiments for SRE teams:

Terminate 25% of application instances — verifies autoscaling recovers correctly and load balancer health checks remove unhealthy instances.

Inject CPU stress on database instances — verifies connection pool behavior under degraded database performance.

Network latency injection (100ms added to all outbound traffic from a service) — reveals synchronous dependencies that aren't meeting their timeout budgets.

Packet loss injection — uncovers retry logic gaps and timeout misconfigurations.
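The latency experiment is easier to reason about if you do the timeout arithmetic before running it. The sketch below flags which synchronous dependencies would blow their client-side timeout once 100ms is added. The dependency names, latencies, and timeouts are hypothetical numbers chosen for illustration; in practice they would come from your own dashboards and client configuration.

```go
package main

import (
	"fmt"
	"time"
)

// dependency describes one synchronous downstream call (all values hypothetical).
type dependency struct {
	Name    string
	P99     time.Duration // observed p99 latency without fault injection
	Timeout time.Duration // client-side timeout configured for this call
}

// overBudget returns the dependencies whose p99 plus the injected latency
// would exceed their configured timeout.
func overBudget(deps []dependency, injected time.Duration) []string {
	var out []string
	for _, d := range deps {
		if d.P99+injected > d.Timeout {
			out = append(out, d.Name)
		}
	}
	return out
}

func main() {
	deps := []dependency{
		{Name: "fraud-check", P99: 180 * time.Millisecond, Timeout: 250 * time.Millisecond},
		{Name: "ledger", P99: 40 * time.Millisecond, Timeout: 500 * time.Millisecond},
	}
	// 180ms + 100ms exceeds fraud-check's 250ms timeout; ledger has headroom.
	fmt.Println(overBudget(deps, 100*time.Millisecond)) // prints [fraud-check]
}
```

Anything this check flags is a prediction worth writing into the experiment hypothesis; the experiment then tells you whether the caller degrades gracefully or cascades.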

The Chaos Hypothesis Pattern

Don't run chaos experiments without a documented hypothesis. A hypothesis-driven experiment:

Experiment: Payment service instance termination

Hypothesis: When 25% of payment-service instances are terminated simultaneously,
the ELB health checks will remove unhealthy instances within 30 seconds,
auto-scaling will replace them within 5 minutes, and the request error rate will
not exceed 0.5% during the recovery window.

Expected behavior during experiment:
  - Error rate: < 0.5% (some requests to terminated instances will fail)
  - Latency: < 500ms P95 (fewer instances handling same traffic)
  - Full recovery: < 5 minutes (ASG provisions replacement instances)

Stop conditions:
  - Error rate > 2% (unexpected impact)
  - Latency P95 > 2000ms (unexpected impact)

Rollback: Stopping the FIS experiment halts any further fault injection.
Terminated instances are not restored by FIS; the ASG replaces them.

When the experiment runs, you're validating the hypothesis against observed behavior. If the behavior matches: confirmed. If it doesn't: you found a gap, and you can fix it in staging rather than during a real incident.
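One way to keep that comparison honest is to encode the hypothesis as thresholds before the experiment and evaluate the observed numbers mechanically afterwards. This is a minimal sketch using the thresholds from the example hypothesis above; the type and function names (`observation`, `evaluate`, and so on) are hypothetical, and the observed values would come from your own metrics.

```go
package main

import "fmt"

// thresholds holds an error-rate and latency limit (values taken from the example hypothesis).
type thresholds struct {
	ErrorRatePct float64
	P95LatencyMs float64
}

// observation is what was actually measured during the experiment.
type observation struct {
	ErrorRatePct float64
	P95LatencyMs float64
	RecoveryMin  float64
}

type verdict string

const (
	confirmed verdict = "hypothesis confirmed"
	gapFound  verdict = "gap found"
	aborted   verdict = "stop condition hit"
)

// evaluate compares an observation against the expected behavior and the stop conditions.
func evaluate(o observation) verdict {
	expected := thresholds{ErrorRatePct: 0.5, P95LatencyMs: 500}
	stop := thresholds{ErrorRatePct: 2.0, P95LatencyMs: 2000}
	const maxRecoveryMin = 5.0

	switch {
	case o.ErrorRatePct > stop.ErrorRatePct || o.P95LatencyMs > stop.P95LatencyMs:
		return aborted
	case o.ErrorRatePct >= expected.ErrorRatePct ||
		o.P95LatencyMs >= expected.P95LatencyMs ||
		o.RecoveryMin >= maxRecoveryMin:
		return gapFound
	default:
		return confirmed
	}
}

func main() {
	// Hypothetical run: error rate and latency held, but recovery took 7 minutes.
	fmt.Println(evaluate(observation{ErrorRatePct: 0.3, P95LatencyMs: 420, RecoveryMin: 7}))
}
```

Writing the thresholds down first matters more than the code itself: it prevents the hypothesis from quietly shifting to match whatever the experiment happened to show.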

Game Days: Testing Human Processes

Chaos engineering tests systems. Game days test humans and processes. The distinction matters because many incident response failures are not technical; they're procedural. The wrong person is on call, the runbook is outdated, the escalation path is unclear, the communication template doesn't exist.

A game day is a structured exercise where a team deliberately simulates an incident scenario and works through their response. The scenario is designed, the "incident" is fake, but the response is real — using the actual tools, the actual communication channels, and the actual decision-making process.

Game day design principles:

Start with a realistic but achievable scenario. "The payment service is down" is too vague. "The payment service is throwing 503s because the RDS connection pool is exhausted — your task is to diagnose and remediate" is concrete enough to run.

Run it without warning for maximum realism, or with warning for maximum learning. Warning-based game days let participants prepare, which reduces the realism but increases the learning. Surprise game days test cold recall and resilience to being caught unprepared. Both are valuable; alternating is ideal.

Include a dedicated observer. Someone not participating in the response should be watching, taking notes, and timing key events. How long did it take to identify the incident? When was the incident channel opened? How long to first diagnosis? When was the right person looped in? This data drives the postmortem.

Post-game-day postmortem template:

Scenario: [Description]
Duration: [Start to resolution]
Team: [Participants + observer]

Timeline:
  T+0:00 - Scenario triggered
  T+X:XX - Incident channel opened
  T+X:XX - Root cause identified
  T+X:XX - Remediation started
  T+X:XX - Service restored

Process gaps identified:
  - [Specific procedural gaps that slowed the response]

Tooling gaps identified:
  - [Missing dashboards, runbooks, or automation]

Documentation updates needed:
  - [Runbook updates, escalation path updates]

What went well:
  - [Things the team did correctly under pressure]

Game days generate the most specific, actionable improvements to incident response processes of any reliability engineering practice.
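The observer's timestamps become more useful once they are turned into per-phase durations, because the longest phase usually points at the gap to fix first. Here is a small sketch of that arithmetic, following the timeline fields in the template above. The sample offsets in `sampleTimeline` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// timeline holds the observer's timestamps as offsets from T+0:00 (scenario triggered).
type timeline struct {
	ChannelOpened      time.Duration
	RootCause          time.Duration
	RemediationStarted time.Duration
	ServiceRestored    time.Duration
}

// phase is the elapsed time between two consecutive timeline events.
type phase struct {
	Name string
	Took time.Duration
}

// phases breaks the total response time into the gaps between consecutive events.
func phases(t timeline) []phase {
	return []phase{
		{"trigger to incident channel opened", t.ChannelOpened},
		{"channel opened to root cause identified", t.RootCause - t.ChannelOpened},
		{"root cause to remediation started", t.RemediationStarted - t.RootCause},
		{"remediation started to service restored", t.ServiceRestored - t.RemediationStarted},
	}
}

// sampleTimeline returns hypothetical observer notes from one game day.
func sampleTimeline() timeline {
	return timeline{
		ChannelOpened:      4 * time.Minute,
		RootCause:          12 * time.Minute,
		RemediationStarted: 15 * time.Minute,
		ServiceRestored:    21 * time.Minute,
	}
}

func main() {
	for _, p := range phases(sampleTimeline()) {
		fmt.Printf("%-42s %v\n", p.Name, p.Took)
	}
}
```

Tracking the same four durations across successive game days gives you a trend line for the response process, not just a list of anecdotes.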

Continuous Validation: The Missing Layer

Between chaos experiments (periodic) and game days (quarterly), most teams have no systematic validation that their reliability assumptions are still true. Services change, dependencies change, and the properties that made you confident last quarter may not hold today.

Continuous reliability validation runs automated checks against production-like and staging environments on a schedule:

Synthetic transactions through critical user journeys (every 5 minutes)
Dependency health checks (every minute)
Failover drills in staging (weekly — automatically terminate a primary database, verify replica promotion)
Certificate expiry monitoring (alert 30 days before expiry, never after)
Backup restoration validation (monthly — actually restore a database from backup to a test cluster and verify data integrity)
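As one example from the list above, the certificate expiry rule reduces to a single date comparison. This sketch shows only that comparison; `needsAlert` and the `date` helper are hypothetical names, and the dates in `main` are made up. In a real check, the expiry would typically come from the `NotAfter` field of the certificate presented by the endpoint.

```go
package main

import (
	"fmt"
	"time"
)

// alertWindow is how far ahead of expiry the alert should start firing (30 days, per the list above).
const alertWindow = 30 * 24 * time.Hour

// needsAlert reports whether a certificate expiring at notAfter should be alerting at time now.
// It stays true after expiry, so an already-expired certificate keeps alerting.
func needsAlert(notAfter, now time.Time) bool {
	return !now.Before(notAfter.Add(-alertWindow))
}

// date builds a UTC midnight timestamp; a small helper for the examples below.
func date(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

func main() {
	expiry := date(2025, 3, 1) // hypothetical certificate expiry
	fmt.Println(needsAlert(expiry, date(2025, 1, 1)))  // 59 days out: false
	fmt.Println(needsAlert(expiry, date(2025, 2, 15))) // 14 days out: true
}
```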

The backup restoration check is the one I see skipped most often. Teams have automated backups but have never verified that those backups actually restore to a working database. The backup that can't be restored is not a backup — it's a false sense of security. Validate it on a schedule.
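What "verify data integrity" means will vary by system, but even a crude check beats none. The sketch below compares per-table row counts recorded when the backup was taken against counts from the restored copy. It assumes you capture those counts at backup time; the function name `mismatches` and the table names are hypothetical, and row counts are a weak signal that you would likely supplement with checksums or application-level queries.

```go
package main

import (
	"fmt"
	"sort"
)

// mismatches lists tables whose row count in the restored copy differs from the
// count recorded at backup time, including tables missing from the restore.
func mismatches(atBackup, restored map[string]int64) []string {
	var out []string
	for table, want := range atBackup {
		got, ok := restored[table]
		switch {
		case !ok:
			out = append(out, fmt.Sprintf("%s: missing from restore", table))
		case got != want:
			out = append(out, fmt.Sprintf("%s: expected %d rows, restored %d", table, want, got))
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable reports
	return out
}

func main() {
	// Hypothetical counts: one table restored short, one missing entirely.
	atBackup := map[string]int64{"payments": 1200, "refunds": 85, "customers": 430}
	restored := map[string]int64{"payments": 1200, "refunds": 80}
	for _, m := range mismatches(atBackup, restored) {
		fmt.Println(m)
	}
}
```

An empty result is the pass condition for the scheduled job; anything else should page someone while the backup is still recent enough to retake.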

*Zak Hassan is a Staff SRE specializing in infrastructure reliability, chaos engineering, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*
