Terraform at Scale: State Management, Module Patterns, and Avoiding the Common Traps

Terraform is the de facto standard for infrastructure as code. Most engineering organizations use it. Fewer use it well. The gap between "teams have Terraform" and "Terraform is maintainable, testable, and safe to run in production" is where most of the operational pain lives.

This is a guide for teams that have outgrown the basic Terraform tutorials and need patterns that work at scale.

State Management: The Foundation Everything Else Depends On

Terraform state is the record of what resources Terraform manages and their current configuration. Everything in Terraform depends on state being accurate, consistent, and accessible.

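Remote state is non-negotiable for teams. A local state file lives on one machine, so it offers no shared locking, no version history, and no way for teammates to work from the same state. Use S3 (with DynamoDB locking) or Terraform Cloud (now HCP Terraform) for every environment beyond a personal sandbox. On Terraform 1.10 and later, the S3 backend can also lock natively with use_lockfile, without a DynamoDB table.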

# backend.tf: S3 remote state with locking
terraform {
  backend "s3" {
    # Use a unique key per environment/region/component.
    # Never share a state file across components.
    bucket         = "your-org-terraform-state"
    key            = "prod/us-east-1/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}
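
The backend block above assumes the bucket and lock table already exist. The following is a minimal sketch of bootstrapping them, typically applied once from a separate root module. The resource labels are illustrative, and the versioning, encryption, and public-access settings are suggested defaults rather than backend requirements. The one hard requirement is that the DynamoDB table has a string partition key named LockID.

```hcl
# bootstrap/main.tf (illustrative; names match the backend block above)

resource "aws_s3_bucket" "terraform_state" {
  bucket = "your-org-terraform-state"
}

# Versioning keeps earlier copies of each state file for recovery.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest; state files often contain secrets.
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string partition key named LockID.
resource "aws_dynamodb_table" "terraform_state_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Because this configuration creates the backend itself, its own state has to start out local. One common approach is to migrate it into the new bucket afterwards with terraform init -migrate-state.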

State key design is critical. A single state file for all your infrastructure is a time bomb: any operation locks the entire state, a corruption event affects everything, and terraform plan can take ten minutes or more because every run refreshes every resource. The right structure is one state file per logical unit of infrastructure that changes independently.

A practical key hierarchy:

{environment}/{region}/{component}/terraform.tfstate

Examples:
prod/us-east-1/networking/terraform.tfstate
prod/us-east-1/data-platform/terraform.tfstate
prod/us-east-1/applications/payment-service/terraform.tfstate
staging/us-east-1/networking/terraform.tfstate

State drift is an ongoing concern. Drift occurs when real infrastructure diverges from Terraform state: someone modified a resource by hand, AWS automation changed it, or it was deleted outside Terraform. Scheduled drift-detection runs surface drift before it causes problems. Run terraform plan without applying, ideally with -detailed-exitcode, which exits 0 when nothing has changed, 2 when the plan contains changes, and 1 on error, and alert on exit code 2.

Module Architecture for Scale

A module library that multiple teams can use requires thoughtful design. The patterns that scale:

Three-layer module structure:

modules/
  aws/                    # Low-level AWS resource wrappers
    rds-postgres/
    eks-cluster/
    alb/
  patterns/               # Opinionated combinations of resources
    web-service/          # ALB + ECS + target group + security groups
    data-pipeline/        # Kinesis + Lambda + S3 + IAM
  applications/           # Team-specific application infrastructure
    payment-service/
    user-service/

The aws/ modules are thin wrappers around AWS resources that enforce organizational standards (required tags, encryption defaults, logging). The patterns/ modules combine aws/ modules into common infrastructure patterns. The applications/ modules use patterns/ to build team-specific infrastructure.
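
To make "thin wrapper" concrete, here is a minimal sketch of what an aws/rds-postgres module might look like. It is not a complete module. The variable names and the required tag keys (team, cost-center, environment) are assumptions chosen for illustration, so substitute your organization's own standards.

```hcl
# modules/aws/rds-postgres/main.tf (illustrative sketch, not a complete module)

variable "identifier" {
  type        = string
  description = "Name of the database instance."
}

variable "instance_class" {
  type        = string
  description = "RDS instance class, for example db.t4g.medium."
}

variable "allocated_storage" {
  type        = number
  description = "Storage size in GiB."
}

variable "master_username" {
  type        = string
  description = "Master user name for the instance."
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the instance. Must include the required keys."

  # Hypothetical organizational standard: every resource carries these tag keys.
  validation {
    condition = alltrue([
      for key in ["team", "cost-center", "environment"] : contains(keys(var.tags), key)
    ])
    error_message = "The tags map must include team, cost-center, and environment."
  }
}

resource "aws_db_instance" "this" {
  identifier        = var.identifier
  engine            = "postgres"
  instance_class    = var.instance_class
  allocated_storage = var.allocated_storage
  username          = var.master_username

  # Let RDS generate the master password and store it in Secrets Manager.
  manage_master_user_password = true

  # Standards enforced by the wrapper: callers cannot turn these off.
  storage_encrypted               = true
  deletion_protection             = true
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
  final_snapshot_identifier       = "${var.identifier}-final"

  tags = var.tags
}
```

The design choice is that encryption, deletion protection, and logging are hard-coded rather than exposed as variables, so a consuming team gets the standard by default and cannot opt out without a module change.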

Module versioning with semver:

# Reference a specific module version — never float to "latest"
module "payment_service_alb" {
  source  = "git::https://github.com/your-org/terraform-modules//aws/alb?ref=v2.3.1"
  
  name            = "payment-service"
  vpc_id          = var.vpc_id
  subnet_ids      = var.public_subnet_ids
  certificate_arn = var.acm_certificate_arn
}

Module consumers should pin to a specific version. Module producers should use semantic versioning: breaking changes increment the major version, new features increment minor, bug fixes increment patch. Consumers upgrade intentionally, not automatically.

Avoid deeply nested modules. Modules that call other modules that call other modules become impossible to debug. Two levels of nesting is a comfortable maximum. If you need more layers, reconsider whether you're using modules for the right purpose.

Workspaces vs. Directory Separation

Terraform workspaces let a single configuration maintain multiple state files. They are often used to manage multiple environments (dev, staging, prod) from one configuration.

My recommendation: use directory separation over workspaces for environment management. The reasons:

Workspaces make it easy to accidentally run terraform apply against production when you meant staging (same directory, different workspace). The blast radius of a workspace mistake is high.

Directory separation makes the intended environment explicit in the path: /prod/payment-service vs. /staging/payment-service. It's harder to accidentally apply staging config to production when you have to cd to the right directory.

Shared configuration between environments can be handled with Terraform modules, and values owned by another component (a VPC ID from networking, for example) can be read with data sources that query that component's remote state.

infrastructure/
  prod/
    us-east-1/
      payment-service/
        main.tf       # Prod-specific values
        backend.tf
  staging/
    us-east-1/
      payment-service/
        main.tf       # Staging-specific values
        backend.tf
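
The remote-state approach to sharing values mentioned above can be sketched as follows, for the prod payment-service directory. It assumes the networking component declares outputs named vpc_id and public_subnet_ids. Those output names are hypothetical, and the bucket and key follow the earlier backend example.

```hcl
# payment-service/main.tf in the prod directory (illustrative)

variable "acm_certificate_arn" {
  type        = string
  description = "ARN of the ACM certificate for the load balancer."
}

# Read outputs published by the networking component's state file.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "your-org-terraform-state"
    key    = "prod/us-east-1/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

module "payment_service_alb" {
  source = "git::https://github.com/your-org/terraform-modules//aws/alb?ref=v2.3.1"

  name = "payment-service"

  # vpc_id and public_subnet_ids are assumed outputs of the networking component.
  vpc_id          = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids      = data.terraform_remote_state.networking.outputs.public_subnet_ids
  certificate_arn = var.acm_certificate_arn
}
```

The staging copy of this file would differ only in the state key and any environment-specific values, which keeps the target environment visible in both the path and the configuration.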

CI/CD for Terraform

Manual terraform apply runs in production are an organizational anti-pattern. Every production infrastructure change should go through a pipeline:

PR opened → terraform fmt -check, tflint, checkov security scan
PR opened or updated → terraform plan runs, plan output posted as a PR comment
PR approved → merge to main
Merged to main → terraform apply runs automatically (or with one-click approval)

The plan-as-comment pattern is particularly valuable: reviewers see exactly what resources will be created, modified, or destroyed before approving the PR.

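# GitHub Actions: Terraform plan on PR
name: Terraform Plan

on:
  pull_request:
    paths:
      - 'infrastructure/**'

# Posting the plan as a comment requires write access to pull requests
permissions:
  contents: read
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.0"

      # terraform init needs credentials for the S3 backend;
      # a credentials step is assumed here and not shown
      - name: Terraform Init
        run: terraform init -input=false
        working-directory: infrastructure/prod/us-east-1/payment-service

      - name: Terraform Plan
        id: plan
        run: |
          # Without pipefail, the exit code of tee would mask a failed plan
          set -o pipefail
          terraform plan -input=false -no-color -out=tfplan 2>&1 | tee plan_output.txt
          echo "plan_output<<EOF" >> "$GITHUB_OUTPUT"
          cat plan_output.txt >> "$GITHUB_OUTPUT"
          echo "EOF" >> "$GITHUB_OUTPUT"
        working-directory: infrastructure/prod/us-east-1/payment-service

      - name: Post Plan to PR
        uses: actions/github-script@v7
        env:
          # Pass the plan through the environment so that backticks or
          # template syntax in the output cannot break the script
          PLAN: ${{ steps.plan.outputs.plan_output }}
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan\n\`\`\`\n${process.env.PLAN}\n\`\`\``
            })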

Protecting against destructive changes. Some Terraform changes are destructive: aws_instance replacement, aws_rds_cluster recreation, security group deletion. Add a check that fails the CI job if the plan deletes or replaces a critical resource:

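# Fail CI if the plan deletes or replaces an RDS cluster.
# A replacement shows up in the plan JSON as both "delete" and "create" actions.
set -euo pipefail

# Export first so a failure in terraform show stops the job instead of passing silently
terraform show -json tfplan > tfplan.json

if jq -e '
  any(.resource_changes[]?;
      .type == "aws_rds_cluster" and any(.change.actions[]; . == "delete"))
' tfplan.json > /dev/null; then
  echo "ERROR: Plan deletes or replaces an RDS cluster. Manual review required."
  exit 1
fi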
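
A complementary guard can live in the configuration itself. The sketch below is one way to do it, not a replacement for the CI check. It marks a cluster with prevent_destroy, so Terraform refuses to produce any plan that would destroy it. The resource label, cluster identifier, and variable are hypothetical.

```hcl
variable "master_username" {
  type        = string
  description = "Master user name for the cluster."
}

resource "aws_rds_cluster" "payments" {
  cluster_identifier = "payments" # hypothetical name
  engine             = "aurora-postgresql"
  master_username    = var.master_username

  # Let RDS generate the master password and store it in Secrets Manager.
  manage_master_user_password = true

  # API-level guard: AWS itself rejects delete calls while this is set.
  deletion_protection = true

  lifecycle {
    # Terraform errors out on any plan that would destroy this resource,
    # including a replacement caused by a changed immutable argument.
    prevent_destroy = true
  }
}
```

One limitation to keep in mind: prevent_destroy only applies while the resource block exists. Removing the block from the configuration removes the guard with it, which is why the plan-level check in CI is still worth keeping.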

Import and Drift Remediation

The terraform import command brings existing resources under Terraform management. It's the tool for migrating manually-created infrastructure into IaC — a common situation for organizations maturing their infrastructure practices.

The workflow:

# 1. Write the Terraform resource block for the existing resource
resource "aws_security_group" "payment_service" {
  # (fill in based on existing resource configuration)
}

# 2. Import the existing resource into state
terraform import aws_security_group.payment_service sg-0abc123def456

# 3. Run terraform plan — it should show no changes if the config matches reality
terraform plan  # Expected: "No changes. Your infrastructure matches the configuration."

# 4. If plan shows differences, update the config to match reality
# Then re-plan until clean

For large-scale imports (migrating hundreds of manually created resources), the import block introduced in Terraform 1.5 lets you declare imports in configuration files rather than running imperative commands:

import {
  to = aws_security_group.payment_service
  id = "sg-0abc123def456"
}

This makes imports reproducible and reviewable in the same PR workflow as other Terraform changes.
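
For many resources of the same type, Terraform 1.7 and later also accept for_each on import blocks. The sketch below assumes a hand-maintained map of keys to security group IDs. The second ID and the resource label are made up for illustration.

```hcl
# Hypothetical map of resource keys to existing security group IDs.
locals {
  existing_security_groups = {
    payment_service = "sg-0abc123def456"
    user_service    = "sg-0123456789abcdef0"
  }
}

# One import per map entry; requires Terraform 1.7 or later.
import {
  for_each = local.existing_security_groups
  to       = aws_security_group.imported[each.key]
  id       = each.value
}

resource "aws_security_group" "imported" {
  for_each = local.existing_security_groups

  # Fill in name, description, vpc_id, and rules to match each existing group.
  # A mismatched name or description forces replacement, so re-plan until the
  # plan shows only the imports and no other changes.
}
```

Once the imports have been applied, the import block can be deleted in a follow-up PR; the resources remain in state.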

*Zak Hassan is a Staff SRE specializing in infrastructure automation, GitOps, and data platform reliability. Find him at zakhassan.com or on LinkedIn.*

