Terraform is the de facto standard for infrastructure as code. Most engineering organizations use it. Fewer use it well. The gap between "teams have Terraform" and "Terraform is maintainable, testable, and safe to run in production" is where most of the operational pain lives.
This is the guide for teams that have outgrown the basic Terraform tutorials and need patterns that work at scale.
State Management: The Foundation Everything Else Depends On
Terraform state is the record of what resources Terraform manages and their current configuration. Everything in Terraform depends on state being accurate, consistent, and accessible.
Remote state is non-negotiable for teams. A local state file cannot be locked across machines, keeps no history, and cannot be shared between team members. Use S3 (with DynamoDB locking) or Terraform Cloud for all environments beyond a personal sandbox.
# backend.tf: S3 remote state with locking
terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "prod/us-east-1/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
    # Use a unique key per workspace/environment/component
    # Never share state files across components
  }
}
State key design is critical. A single state file for all your infrastructure is a time bomb: any operation locks the entire state, a corruption event affects everything, and terraform plan takes 10 minutes. The right structure: one state file per logical unit of infrastructure that changes independently.
A practical key hierarchy:
{account}/{region}/{component}/terraform.tfstate
Examples:
prod/us-east-1/networking/terraform.tfstate
prod/us-east-1/data-platform/terraform.tfstate
prod/us-east-1/applications/payment-service/terraform.tfstate
staging/us-east-1/networking/terraform.tfstate
State drift is an ongoing concern. Drift occurs when real infrastructure diverges from Terraform state: someone modified a resource by hand, AWS automation changed it, or it was deleted outside Terraform. Scheduled drift detection runs (terraform plan with no apply; with -detailed-exitcode, an exit status of 2 means changes are pending) surface drift before it causes problems.
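Not every divergence deserves an alert. When a change made by AWS automation is expected, such as a scaling policy adjusting capacity, one option is to tell Terraform to ignore that attribute so it does not appear as drift in every plan. The sketch below shows the idea; the resource name, the sizes, and the two input variables are assumptions made for illustration.

```hcl
# Hypothetical inputs; in a real configuration these would come from
# other resources or from the calling module.
variable "launch_template_id" {
  type        = string
  description = "ID of the launch template used by the group."
}

variable "private_subnet_ids" {
  type        = list(string)
  description = "Subnets the instances are launched into."
}

resource "aws_autoscaling_group" "workers" {
  name                = "workers"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change desired_capacity outside Terraform.
    # Ignoring it keeps that expected change out of drift reports.
    ignore_changes = [desired_capacity]
  }
}
```

Use this sparingly: an ignored attribute is also invisible to drift detection, so it is best reserved for fields that automation legitimately owns.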
Module Architecture for Scale
A module library that multiple teams can use requires thoughtful design. The patterns that scale:
Three-layer module structure:
modules/
  aws/                  # Low-level AWS resource wrappers
    rds-postgres/
    eks-cluster/
    alb/
  patterns/             # Opinionated combinations of resources
    web-service/        # ALB + ECS + target group + security groups
    data-pipeline/      # Kinesis + Lambda + S3 + IAM
  applications/         # Team-specific application infrastructure
    payment-service/
    user-service/
The aws/ modules are thin wrappers around AWS resources that enforce organizational standards (required tags, encryption defaults, logging). The patterns/ modules combine aws/ modules into common infrastructure patterns. The applications/ modules use patterns/ to build team-specific infrastructure.
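To make the "thin wrapper" idea concrete, here is a reduced sketch of what an aws/alb module might contain. The file path, the variable names, the required tag keys (owner, cost_center), and the log bucket input are all assumptions for illustration; listeners, certificates, and security groups are left out to keep it short.

```hcl
# modules/aws/alb/main.tf (hypothetical layout and names)

variable "name" {
  type        = string
  description = "Name of the load balancer."
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets the load balancer attaches to."
}

variable "access_log_bucket" {
  type        = string
  description = "S3 bucket that receives access logs."
}

variable "tags" {
  type        = map(string)
  description = "Tags supplied by the caller."

  # Reject callers that omit the tags the organization requires.
  # The required keys listed here are placeholders.
  validation {
    condition     = alltrue([for key in ["owner", "cost_center"] : contains(keys(var.tags), key)])
    error_message = "The tags map must include the owner and cost_center keys."
  }
}

resource "aws_lb" "this" {
  name               = var.name
  load_balancer_type = "application"
  subnets            = var.subnet_ids

  # Organizational defaults that callers cannot turn off.
  drop_invalid_header_fields = true

  access_logs {
    bucket  = var.access_log_bucket
    enabled = true
  }

  tags = merge(var.tags, { managed_by = "terraform" })
}

output "arn" {
  value       = aws_lb.this.arn
  description = "ARN of the load balancer."
}
```

The design choice worth noting is that standards live in the wrapper rather than in each caller: a team cannot forget access logging because the module gives them no way to disable it.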
Module versioning with semver:
# Reference a specific module version; never float to "latest"
module "payment_service_alb" {
  source          = "git::https://github.com/your-org/terraform-modules//aws/alb?ref=v2.3.1"
  name            = "payment-service"
  vpc_id          = var.vpc_id
  subnet_ids      = var.public_subnet_ids
  certificate_arn = var.acm_certificate_arn
}
Module consumers should pin to a specific version. Module producers should use semantic versioning: breaking changes increment the major version, new features increment minor, bug fixes increment patch. Consumers upgrade intentionally, not automatically.
Avoid deeply nested modules. Modules that call other modules that call other modules become impossible to debug. Two levels of nesting is a comfortable maximum. If you need more layers, reconsider whether you're using modules for the right purpose.
Workspaces vs. Directory Separation
Terraform workspaces allow multiple state files from the same configuration. They're intended for managing multiple environments (dev, staging, prod) from a single configuration.
My recommendation: use directory separation over workspaces for environment management. The reasons:
Workspaces make it easy to accidentally run terraform apply against production when you meant staging (same directory, different workspace). The blast radius of a workspace mistake is high.
Directory separation makes the intended environment explicit in the path: /prod/payment-service vs. /staging/payment-service. It's harder to accidentally apply staging config to production when you have to cd to the right directory.
Shared configuration between environments can be handled with Terraform modules, or with terraform_remote_state data sources that read outputs from another configuration's state.
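As a rough sketch of the remote state approach, the configuration below reads outputs from a networking state file and passes them to a module. It assumes the networking configuration declares outputs named vpc_id and public_subnet_ids, which this article does not define; the bucket and key reuse the example values from the backend section.

```hcl
# environments/prod/us-east-1/payment-service/main.tf (illustrative)

variable "acm_certificate_arn" {
  type        = string
  description = "Certificate used by the load balancer listener."
}

# Read the outputs published by the networking component's state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "your-org-terraform-state"
    key    = "prod/us-east-1/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

module "payment_service_alb" {
  source = "git::https://github.com/your-org/terraform-modules//aws/alb?ref=v2.3.1"

  name = "payment-service"

  # Assumed output names; adjust to whatever the networking
  # configuration actually exports.
  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.public_subnet_ids

  certificate_arn = var.acm_certificate_arn
}
```

Only values declared as outputs in the source configuration are readable this way, which keeps the contract between components explicit.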
environments/
  prod/
    us-east-1/
      payment-service/
        main.tf       # Prod-specific values
        backend.tf
  staging/
    us-east-1/
      payment-service/
        main.tf       # Staging-specific values
        backend.tf
CI/CD for Terraform
Manual terraform apply runs in production are an organizational anti-pattern. Every production infrastructure change should go through a pipeline:
- PR opened → terraform fmt check, tflint, checkov security scan
- PR open → terraform plan runs, plan output posted as PR comment
- PR approved → merge to main
- Main merged → terraform apply runs automatically (or with one-click approval)
The plan-as-comment pattern is particularly valuable: reviewers see exactly what resources will be created, modified, or destroyed before approving the PR.
# GitHub Actions: Terraform plan on PR
name: Terraform Plan
on:
  pull_request:
    paths:
      - 'infrastructure/**'
permissions:
  contents: read
  pull-requests: write # Needed to post the plan as a PR comment
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.0"
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/prod/us-east-1/payment-service
      - name: Terraform Plan
        id: plan
        run: |
          # Without pipefail, tee's exit status would hide a failed plan
          set -o pipefail
          terraform plan -no-color -out=tfplan 2>&1 | tee plan_output.txt
          echo "plan_output<<EOF" >> "$GITHUB_OUTPUT"
          cat plan_output.txt >> "$GITHUB_OUTPUT"
          echo "EOF" >> "$GITHUB_OUTPUT"
        working-directory: infrastructure/prod/us-east-1/payment-service
      - name: Post Plan to PR
        uses: actions/github-script@v7
        env:
          # Pass the plan through the environment so its text is never
          # interpolated into the script source
          PLAN: ${{ steps.plan.outputs.plan_output }}
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan\n\`\`\`\n${process.env.PLAN}\n\`\`\``
            })
Protecting against destructive changes: Some Terraform changes are destructive: aws_instance replacement, aws_rds_cluster recreation, security group deletion. Add a check that fails the CI job if the plan contains resource deletions or replacements for critical resources:
# Fail CI if the plan deletes or replaces a production database.
# A replacement shows up as a delete paired with a create.
plan_json=$(terraform show -json tfplan) || exit 1 # Fail closed if the plan cannot be read
if printf '%s\n' "$plan_json" | jq -e '
  any(.resource_changes[]?;
      .type == "aws_rds_cluster" and (.change.actions | index("delete") != null))
' > /dev/null; then
  echo "ERROR: Plan contains RDS cluster deletion. Manual review required."
  exit 1
fi
Import and Drift Remediation
The terraform import command brings existing resources under Terraform management. It's the tool for migrating manually created infrastructure into IaC, a common situation for organizations maturing their infrastructure practices.
The workflow:
# 1. Write the Terraform resource block for the existing resource
resource "aws_security_group" "payment_service" {
  # (fill in based on existing resource configuration)
}
# 2. Import the existing resource into state
terraform import aws_security_group.payment_service sg-0abc123def456
# 3. Run terraform plan; it should show no changes if the config matches reality
terraform plan # Expected: "No changes. Your infrastructure matches the configuration."
# 4. If plan shows differences, update the config to match reality
# Then re-plan until clean
For large-scale imports (migrating hundreds of manually created resources), the import block in Terraform 1.5+ enables declaring imports in configuration files rather than running imperative commands:
import {
  to = aws_security_group.payment_service
  id = "sg-0abc123def456"
}
This makes imports reproducible and reviewable in the same PR workflow as other Terraform changes.
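When there really are hundreds of resources, writing one import block per resource gets tedious. Terraform 1.7 and later accept for_each inside an import block, so a single block can be driven by a map. The sketch below assumes that version; the local name, the map keys, and the security group IDs are placeholders, not real resources.

```hcl
# Requires Terraform 1.7 or later for for_each in import blocks.
# The map keys and security group IDs below are placeholders.
locals {
  legacy_security_groups = {
    payment_service = "sg-0abc123def456"
    user_service    = "sg-0123456789abcdef0"
  }
}

# One import per map entry, each targeting the matching resource instance.
import {
  for_each = local.legacy_security_groups
  to       = aws_security_group.legacy[each.key]
  id       = each.value
}

resource "aws_security_group" "legacy" {
  for_each = local.legacy_security_groups

  # (fill in based on existing resource configuration)
}
```

The same plan-until-clean loop applies here: after the first plan, adjust the resource arguments until Terraform reports imports only and no modifications.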
*Zak Hassan is a Staff SRE specializing in infrastructure automation, GitOps, and data platform reliability. Find him at zakhassan.com or on LinkedIn.*