*By Zak Hassan — Staff SRE | May 2026*
Most teams get Terraform right at the beginning: a handful of .tf files, a single state file, maybe one environment. Then the org grows, and what started as clean infrastructure-as-code quietly becomes a liability. State files balloon to thousands of resources. A botched apply in a shared module takes down prod and staging simultaneously. Engineers stop trusting terraform plan because the output is a wall of noise from five teams' worth of changes. The tooling is sound; the design choices are what fail, accumulating silently until they no longer hold. This post is about what those design choices should have been from the start, and how to recover when they weren't.
The Monorepo vs. Per-Team Repo Problem
The monorepo instinct makes sense early on: one place for all infrastructure, easy cross-team sharing, a single CI pipeline. The problem is that Terraform's state model does not compose cleanly across teams at the filesystem level. A single root module that provisions networking, databases, compute, and IAM for multiple teams means a single terraform apply touches all of it. The blast radius of a misconfigured count expression or a renamed resource is the entire organization's infrastructure.
The per-team repo pattern avoids blast radius but creates its own failure mode: module duplication. Team A builds an RDS module, Team B builds a slightly different one, and six months later you have four incompatible abstractions for the same resource type. Nobody knows which is canonical, nobody wants to migrate to the "right" one, and the SRE team is debugging all four.
The durable solution is a hybrid: a shared internal module registry (a dedicated repository that publishes versioned modules via git tags), and per-team root module repositories that consume those versioned modules. Teams own their state; they share abstractions through the registry. Cross-stack dependencies are resolved via terraform_remote_state data sources, not by stitching modules together in the same root.
State File Design
The single biggest structural decision in any Terraform deployment is state granularity. A state file that covers everything — VPC, RDS clusters, ECS services, IAM roles, CloudFront distributions — for a production environment is an operational time bomb. Large state files slow every plan and apply operation proportionally. They also mean that every engineer who touches any resource in that environment holds an exclusive lock on the entire state. At more than about 200 resources in a single state file, contention and plan time become a daily friction point.
The right granularity is roughly: one state file per bounded domain per environment. Networking (VPCs, subnets, NAT gateways) is its own state. Each major service — "payments-service", "auth-service" — has its own state per environment. Shared platform resources (ECS cluster, load balancers, DNS zones) live in a separate "platform" state. This way a broken deployment of the payments service does not lock the auth team out of their own apply.
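To make the size threshold concrete, here is a minimal sketch that counts managed resource instances in a state document, assuming the version 4 state JSON layout (a top-level resources list whose entries carry a mode and an instances list), as produced by terraform state pull. The function names and the 200 limit are illustrative choices, not part of any Terraform tooling.

```python
# Rule-of-thumb ceiling discussed above; an assumption to tune per team.
MAX_RESOURCES = 200


def count_managed_resources(state: dict) -> int:
    """Count managed resource instances in a parsed v4 state document."""
    total = 0
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # data sources are read-only lookups, not managed resources
        total += len(resource.get("instances", []))
    return total


def is_oversized(state: dict, limit: int = MAX_RESOURCES) -> bool:
    """True when the state holds more managed instances than the limit."""
    return count_managed_resources(state) > limit


# Hand-written sample standing in for the JSON from `terraform state pull`.
sample_state = {
    "version": 4,
    "resources": [
        {"mode": "managed", "type": "aws_vpc", "name": "main", "instances": [{}]},
        {"mode": "managed", "type": "aws_subnet", "name": "private", "instances": [{}, {}, {}]},
        {"mode": "data", "type": "aws_caller_identity", "name": "current", "instances": [{}]},
    ],
}

print(count_managed_resources(sample_state))  # 4
print(is_oversized(sample_state))             # False
```

Run against each state on a schedule, a check like this flags the states that are due for a split before contention becomes a daily problem.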
Remote state in S3 with DynamoDB locking is the AWS-native standard. A properly configured backend looks like this:
# backend.tf
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state-prod"
    key            = "services/payments/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123"
    dynamodb_table = "acme-terraform-locks"
  }
}

The DynamoDB table needs a partition key named LockID of type String. That single table can serve all your state files: the lock ID is the bucket name plus the state key, so there is no collision risk. Terraform 1.10 and later can also lock natively in S3 with use_lockfile = true; the DynamoDB table remains the option for older versions such as the 1.8 release pinned later in this post. Encryption at rest via KMS is non-negotiable; state files hold secrets in plaintext, in resource attributes as well as outputs, more often than teams realize.
For multi-region deployments, add the region to the key path: services/payments/us-west-2/terraform.tfstate. Never share a state file across regions — resource IDs, ARNs, and availability zone names are region-scoped, and cross-region state creates confusing plan diffs when you least want them.
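One way to keep the key layout consistent across teams is to generate it rather than type it. The sketch below is a hypothetical helper (state_key and its argument names are mine, not a Terraform feature) that builds keys in the domain/name/region shape used above and rejects segments that would produce awkward paths.

```python
import re
from typing import Optional

# Lowercase letters, digits, and hyphens only; an assumed naming convention.
_SEGMENT = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def state_key(domain: str, name: str, region: Optional[str] = None) -> str:
    """Build an S3 state key such as services/payments/terraform.tfstate."""
    segments = [domain, name]
    if region is not None:
        segments.append(region)
    for segment in segments:
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid state key segment: {segment!r}")
    return "/".join(segments + ["terraform.tfstate"])


print(state_key("services", "payments"))
print(state_key("services", "payments", region="us-west-2"))
```

The first call reproduces the key from the backend block above and the second the multi-region form. The environment is not part of the key because the example bucket name already carries it.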
Module Design Principles
A reusable module is not a thin wrapper around a resource. It is an opinionated abstraction that encodes the organization's operational requirements — tagging standards, encryption defaults, deletion protection — and exposes only the knobs that legitimately vary between deployments.
Variable validation blocks are the first line of defense against misconfiguration. They run at plan time before any API calls are made, which means they fail fast and fail clearly:
# modules/rds-postgres/variables.tf
variable "instance_class" {
  type        = string
  description = "RDS instance class. Must be db.t3 or larger for production."

  validation {
    condition     = can(regex("^db\\.(t3|m5|m6g|r5|r6g)\\.", var.instance_class))
    error_message = "instance_class must be a supported db.t3, m5, m6g, r5, or r6g type."
  }
}

variable "environment" {
  type        = string
  description = "Deployment environment."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

variable "allocated_storage_gb" {
  type        = number
  description = "Allocated storage in gigabytes."

  validation {
    condition     = var.allocated_storage_gb >= 20 && var.allocated_storage_gb <= 65536
    error_message = "allocated_storage_gb must be between 20 and 65536."
  }
}

Outputs should be structured for downstream consumption. Any module whose values other stacks read via terraform_remote_state should expose well-typed, stable outputs, and the consuming stack's upstream root module has to re-export them, because remote state only exposes root-level outputs:
# modules/rds-postgres/outputs.tf
output "endpoint" {
  description = "The connection endpoint for the RDS instance."
  value       = aws_db_instance.this.endpoint
  sensitive   = false
}

output "db_name" {
  description = "The name of the database."
  value       = aws_db_instance.this.db_name
}

output "security_group_id" {
  description = "Security group ID to reference in consumer service rules."
  value       = aws_security_group.rds.id
}

output "secret_arn" {
  description = "ARN of the Secrets Manager secret containing credentials."
  value       = aws_secretsmanager_secret.db_creds.arn
  sensitive   = true
}

Version modules with git tags following semver. A consuming root module pins to a tag, not a branch:
module "payments_db" {
  source = "git::https://github.com/acme/terraform-modules.git//modules/rds-postgres?ref=v2.4.1"

  environment          = "prod"
  instance_class       = "db.m6g.large"
  allocated_storage_gb = 100
  db_name              = "payments"
}

Major version bumps in the module registry should be accompanied by a migration guide. Breaking a module's interface without a major version bump is how you erode trust in the internal registry.
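The pin-to-a-tag rule is easy to enforce mechanically. Below is a rough sketch of a CI lint that scans Terraform source text for git module sources not pinned to a vX.Y.Z tag. It uses a regular expression rather than a real HCL parser, so treat it as an approximation; the function name and the strict tag pattern (which would also flag pre-release tags such as v2.4.1-rc1) are my assumptions.

```python
import re

# Captures the quoted value of any `source = "..."` line.
_SOURCE = re.compile(r'^\s*source\s*=\s*"(?P<src>[^"]+)"', re.MULTILINE)
# A ref query parameter that ends in a plain vMAJOR.MINOR.PATCH tag.
_SEMVER_REF = re.compile(r"[?&]ref=v\d+\.\d+\.\d+$")


def unpinned_sources(hcl_text: str) -> list[str]:
    """Return git module sources that are not pinned to a vX.Y.Z tag."""
    unpinned = []
    for match in _SOURCE.finditer(hcl_text):
        src = match.group("src")
        if not src.startswith("git::"):
            continue  # local paths and registry sources are out of scope here
        if not _SEMVER_REF.search(src):
            unpinned.append(src)
    return unpinned


sample_hcl = '''
module "payments_db" {
  source = "git::https://github.com/acme/terraform-modules.git//modules/rds-postgres?ref=v2.4.1"
}

module "auth_db" {
  source = "git::https://github.com/acme/terraform-modules.git//modules/rds-postgres?ref=main"
}

module "local_helper" {
  source = "./modules/helper"
}
'''

for src in unpinned_sources(sample_hcl):
    print(f"unpinned module source: {src}")
```

Only the auth_db source is reported here, since it tracks a branch. Wiring a check like this into the PR pipeline turns the convention into a gate rather than a review comment.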
Terraform CI/CD with Atlantis
Atlantis is one of the most operationally sound Terraform automation layers available. It runs as a server that listens for GitHub (or GitLab, or Bitbucket) webhooks and executes terraform plan on pull requests, posting the output as a PR comment. The apply runs when someone comments atlantis apply on the pull request, normally after approval, and Atlantis can be configured to merge the PR automatically once the apply succeeds. The workflow enforces the social contract: no infrastructure change lands without a reviewed plan.
A minimal GitHub Actions workflow handles the cases Atlantis doesn't cover (formatting checks, validation, and security scanning) while Atlantis owns the plan/apply lifecycle. The workflow below also carries a PR-only plan job, which is useful for repositories that are not behind Atlantis; drop it where Atlantis already posts the plan:
# .github/workflows/terraform-ci.yml
name: Terraform CI

on:
  pull_request:
    paths:
      - '**.tf'
      - '**.tfvars'
  push:
    branches:
      - main

env:
  TF_VERSION: "1.8.4"

jobs:
  validate:
    name: Format and Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      - name: Terraform Init
        run: terraform init -backend=false
        working-directory: ./environments/prod
      - name: Terraform Validate
        run: terraform validate
        working-directory: ./environments/prod

  security-scan:
    name: Policy and Security Scan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # needed by the SARIF upload step
    steps:
      - uses: actions/checkout@v4
      - name: Run Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          framework: terraform
          soft_fail: false
          output_format: sarif
          output_file_path: checkov-results.sarif
      - name: Upload SARIF
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: checkov-results.sarif

  plan:
    name: Terraform Plan (PR only)
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    needs: [validate, security-scan]
    permissions:
      id-token: write
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci-role
          aws-region: us-east-1
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Terraform Init
        run: terraform init
        working-directory: ./environments/prod
      - name: Terraform Plan
        id: plan
        # The pipe through tee hides terraform's exit status from the step,
        # so record it explicitly and act on it after the comment is posted.
        run: |
          terraform plan -no-color -out=tfplan 2>&1 | tee plan-output.txt
          echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        working-directory: ./environments/prod
      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('./environments/prod/plan-output.txt', 'utf8');
            const truncated = plan.length > 60000 ? plan.slice(0, 60000) + '\n... (truncated)' : plan;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `#### Terraform Plan \`prod\`\n\`\`\`\n${truncated}\n\`\`\``
            });
      - name: Fail if plan errored
        if: steps.plan.outputs.exitcode != '0'
        run: exit 1

Atlantis handles workspace-per-environment by mapping directory paths to workspace names in its atlantis.yaml configuration. Each environment directory is its own project with its own state key, plan, and apply. An atlantis apply in environments/staging cannot accidentally touch environments/prod.
Drift Detection
Drift is the gap between what Terraform believes is deployed and what is actually running. It accumulates silently: a security group rule added manually during an incident, an instance type resized through the console, a tag modified by a compliance tool. If left unaddressed, drift makes plans noisy and eventually makes them wrong.
The detection mechanism is straightforward: run terraform plan on a schedule, treat a non-empty plan as a signal, and route that signal somewhere an engineer will see it. The challenge is operationalizing the output. A raw plan output is long and hard to parse in an alert. The following Python script extracts a summary and posts it to Slack:
#!/usr/bin/env python3
"""
drift-detector.py: Run terraform plan and alert on drift.
Intended for use in a scheduled CI job (e.g., cron via GitHub Actions).
"""
import json
import os
import subprocess
import sys
import urllib.request

SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]
ENVIRONMENT = os.environ.get("TF_ENVIRONMENT", "prod")
WORKING_DIR = os.environ.get("TF_WORKING_DIR", "./environments/prod")


def run_plan() -> tuple[int, str]:
    """Run terraform plan and return (exit code, machine-readable output)."""
    result = subprocess.run(
        # -input=false keeps an unattended job from hanging on a prompt.
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color", "-json"],
        capture_output=True,
        text=True,
        cwd=WORKING_DIR,
    )
    return result.returncode, result.stdout


def parse_changes(plan_json_output: str) -> dict:
    """Extract the change counts and affected addresses from the JSON lines."""
    changes = {"add": 0, "change": 0, "destroy": 0, "resources": []}
    for line in plan_json_output.splitlines():
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON message
        if msg.get("type") == "change_summary":
            c = msg.get("changes", {})
            changes["add"] = c.get("add", 0)
            changes["change"] = c.get("change", 0)
            changes["destroy"] = c.get("remove", 0)
        if msg.get("type") == "planned_change":
            addr = msg.get("change", {}).get("resource", {}).get("addr", "")
            action = msg.get("change", {}).get("action", "")
            if addr:
                changes["resources"].append(f"{action}: {addr}")
    return changes


def post_slack_alert(changes: dict) -> None:
    """Post a summary of the drift to the Slack incoming webhook."""
    resource_list = "\n".join(changes["resources"][:20])
    if len(changes["resources"]) > 20:
        resource_list += f"\n... and {len(changes['resources']) - 20} more"
    payload = {
        "text": f":warning: *Terraform drift detected in `{ENVIRONMENT}`*",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": (
                        f":warning: *Drift detected* in `{ENVIRONMENT}`\n"
                        f"*+{changes['add']}* to add, "
                        f"*~{changes['change']}* to change, "
                        f"*-{changes['destroy']}* to destroy\n\n"
                        f"```{resource_list}```"
                    ),
                },
            }
        ],
    }
    data = json.dumps(payload).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    # Close the response explicitly; urlopen raises on HTTP errors.
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


def main():
    exitcode, output = run_plan()
    # terraform plan -detailed-exitcode exit codes:
    # 0 = no changes, 1 = error, 2 = changes present
    if exitcode == 0:
        print(f"[{ENVIRONMENT}] No drift detected.")
        sys.exit(0)
    elif exitcode == 1:
        print(f"[{ENVIRONMENT}] terraform plan failed. Check CI logs.", file=sys.stderr)
        sys.exit(1)
    else:
        changes = parse_changes(output)
        print(f"[{ENVIRONMENT}] Drift detected: {changes}")
        post_slack_alert(changes)
        sys.exit(0)  # Don't fail the job; alerting is enough


if __name__ == "__main__":
    main()

Resources modified outside Terraform are the hardest case. AWS Config with the TERRAFORM_MANAGED tag convention helps: tag every Terraform-managed resource at creation, and use a Config rule to flag configuration changes on tagged resources. AWS Config has no view of Terraform state, so treat each flagged change as a prompt to check whether it came from an apply or from the console. This does not replace drift detection but gives you a second signal layer for resources where terraform plan is expensive or slow.
Import and the Legacy Infrastructure Problem
Every organization has infrastructure that predates their Terraform adoption. Bringing those resources under management is one of the highest-leverage reliability investments available — it means they are documented, reviewable, and reproducible — but the import workflow is not as smooth as it should be.
The workflow is: write the HCL for the resource as you want it to look, then import the existing resource's ID into the state file. Terraform will then reconcile the live resource against your HCL on the next plan. If there are differences, the plan will show them. The dangerous step is applying a plan that shows changes on a resource you just imported — verify carefully that those changes are intentional normalization (adding a required tag) and not accidental destructive changes (a security group rule being removed).
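That verification step can be partly automated. The sketch below reads the JSON plan representation (the output of terraform show -json on a saved plan file, which lists each resource under resource_changes with a change.actions array) and separates deletions and replacements from in-place updates. The function name and report shape are mine. It only narrows the review: an in-place update can still drop something important, such as an inline security group rule, so those entries need a human read.

```python
def review_plan(plan: dict) -> dict:
    """Split planned changes into destructive and in-place buckets."""
    report = {"destructive": [], "in_place": []}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        line = f"{'/'.join(actions)}: {rc.get('address', '<unknown>')}"
        if "delete" in actions:
            # Covers plain deletes and replacements (delete/create in either order).
            report["destructive"].append(line)
        elif "update" in actions:
            report["in_place"].append(line)
    return report


# Hand-written sample standing in for `terraform show -json tfplan` output.
sample_plan = {
    "resource_changes": [
        {"address": "module.payments_db.aws_db_instance.this", "change": {"actions": ["update"]}},
        {"address": "aws_security_group_rule.legacy", "change": {"actions": ["delete"]}},
        {"address": "aws_db_subnet_group.this", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_kms_key.db", "change": {"actions": ["no-op"]}},
    ]
}

report = review_plan(sample_plan)
for line in report["destructive"]:
    print(f"BLOCK  {line}")
for line in report["in_place"]:
    print(f"REVIEW {line}")
```

A reasonable policy for import pull requests is to fail CI when the destructive list is non-empty and require an explicit override to proceed.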
Since Terraform 1.5, the import block makes the import step itself repeatable and reviewable in code:
# Import an existing RDS instance into the payments module
import {
  to = module.payments_db.aws_db_instance.this
  id = "payments-prod-postgres"
}

Run terraform plan with this block present and Terraform will generate a plan that includes the import; the import itself happens on apply. No state manipulation via CLI, and no risk of running the import from a laptop and leaving no record of it in version control. For bulk imports, such as bringing in 50 S3 buckets from a legacy account, script the aws CLI to list the resources and generate the import blocks programmatically.
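For the bulk case, a small generator is usually enough. This is a sketch under a few assumptions: the bucket names have already been collected (for example from aws s3api list-buckets), the target resource type is aws_s3_bucket, whose import ID is the bucket name, and the helper names resource_label and import_blocks are hypothetical. The matching resource blocks still have to be written or generated separately.

```python
import re


def resource_label(name: str) -> str:
    """Turn a bucket name into a valid Terraform resource label."""
    # Dots and hyphens become underscores; distinct names such as a-b and a.b
    # would collide, so check the generated labels for duplicates.
    label = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not label or label[0].isdigit():
        label = "_" + label  # identifiers must not start with a digit
    return label


def import_blocks(bucket_names: list[str], resource_type: str = "aws_s3_bucket") -> str:
    """Render one import block per bucket, sorted for stable diffs."""
    blocks = []
    for name in sorted(bucket_names):
        blocks.append(
            "import {\n"
            f"  to = {resource_type}.{resource_label(name)}\n"
            f'  id = "{name}"\n'
            "}\n"
        )
    return "\n".join(blocks)


print(import_blocks(["acme-logs", "acme.assets.prod"]))
```

Writing the result to a file such as imports.tf and opening a pull request keeps the whole import reviewable in one place, and the blocks can be deleted once the apply has gone through.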
Testing Infrastructure Code
Infrastructure code is not exempt from testing. Untested modules have a failure mode that application code does not: they provision real infrastructure when broken, and the cost of that is time, money, and sometimes an outage.
Checkov is the fastest feedback layer. It runs statically against HCL before any cloud API calls and catches common misconfigurations — S3 buckets without versioning, security groups with 0.0.0.0/0 ingress, RDS instances without deletion protection. It integrates into the CI workflow shown above in under a minute of configuration.
Terratest provides integration-level testing. It provisions real infrastructure in a test account, runs assertions against it via AWS APIs or HTTP endpoints, and destroys it when done. A minimal test for the RDS module looks like this:
// modules/rds-postgres/test/rds_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestRdsPostgresModule(t *testing.T) {
	t.Parallel()

	opts := &terraform.Options{
		TerraformDir: "../examples/basic",
		Vars: map[string]interface{}{
			"environment":          "dev",
			"instance_class":       "db.t3.micro",
			"allocated_storage_gb": 20,
			"db_name":              "testdb",
		},
	}

	// With deletion protection enabled, this destroy fails unless the example
	// switches protection off first; handle that in the example or a cleanup step.
	defer terraform.Destroy(t, opts)
	terraform.InitAndApply(t, opts)

	// examples/basic must expose the instance identifier as db_instance_id.
	instanceID := terraform.Output(t, opts, "db_instance_id")
	region := "us-east-1"

	// The E variant returns the instance and an error; stop here if the lookup fails.
	db, err := aws.GetRdsInstanceDetailsE(t, instanceID, region)
	require.NoError(t, err)

	assert.True(t, *db.StorageEncrypted, "RDS instance should have encryption enabled")
	assert.True(t, *db.DeletionProtection, "RDS instance should have deletion protection enabled")
	assert.Equal(t, "postgres", *db.Engine)
}

Terratest tests are slow (a full provision-test-destroy cycle for an RDS instance is 10-15 minutes) and they cost real money. Run them in a dedicated test AWS account, gate them behind a manual trigger or a label on pull requests rather than running on every commit, and set a strict budget alert on the test account. The feedback they provide, that your module actually provisions what it claims to in the actual cloud environment, is not available any other way.
The combination of Checkov in every PR (fast, free, catches most misconfigurations) and Terratest on a weekly schedule or pre-release (slow, thorough, catches behavioral regressions) gives you meaningful confidence without making CI unusable.
*Zak Hassan is a Staff SRE specializing in infrastructure automation, platform reliability, and developer tooling. Find him at zakhassan.com or on LinkedIn.*