Database Migration Safety at Scale: Zero-Downtime Schema Changes

*By Zak Hassan — Staff SRE | May 2026*

Database schema migrations are among the riskiest operations in reliability engineering. A migration that takes a table lock on a 500-million-row table will block all reads and writes on that table for the duration — minutes or hours. A migration that drops a column still referenced by running application code will cause immediate errors. A migration that changes a column type without a compatible transformation will corrupt data.

These failures are common because the development environment where migrations are tested is nothing like production: smaller tables, fewer concurrent connections, no long-running transactions holding locks, no read replicas that need to stay synchronized.

Zero-downtime migration is a discipline, not a tool. The tools (pt-online-schema-change, gh-ost, pglogical) help, but the discipline — understanding lock acquisition, understanding deployment sequencing, using the expand-contract pattern — is what actually prevents incidents.

Why Naive Migrations Cause Outages

PostgreSQL's ALTER TABLE is the primary culprit. Understanding what it does explains why naive migrations fail in production-like lab environments.

ALTER TABLE users ADD COLUMN last_seen_at TIMESTAMP acquires an ACCESS EXCLUSIVE lock — the strongest lock, which blocks all concurrent reads and writes. On a small table, this is instantaneous. On a table with 100 million rows, acquiring this lock requires waiting for every open transaction on the table to complete. In a busy production database with long-running OLAP queries, this can mean minutes of waiting — during which all attempts to read or write the table queue up and time out.

-- What's waiting for your migration lock?
SELECT
    pid,
    now() - query_start AS duration,
    state,
    wait_event_type,
    wait_event,
    left(query, 100) as query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY duration DESC;

-- What locks are currently held that block ALTER TABLE?
SELECT
    locktype,
    relation::regclass,
    mode,
    granted,
    pid,
    left(query, 60)
FROM pg_locks
JOIN pg_stat_activity USING (pid)
WHERE relation = 'users'::regclass
ORDER BY granted DESC, mode;

A migration that starts and then waits for a lock is not paused — it's holding a queue of requests behind it. Every application request that needs to touch the users table is now waiting, and they'll continue waiting until either the migration acquires the lock or times out. Setting lock_timeout prevents indefinite queue buildup:

-- Set a timeout so the migration fails fast rather than blocking production
SET lock_timeout = '3s';
ALTER TABLE users ADD COLUMN last_seen_at TIMESTAMP;
-- If this doesn't acquire the lock within 3 seconds, it fails
-- Retry when the blocking transaction completes

The Expand-Contract Pattern

The expand-contract pattern (also called parallel change) is the fundamental technique for zero-downtime migrations. It decomposes any breaking schema change into a sequence of backward-compatible steps, each safe to deploy independently.

The pattern for renaming a column user_name → display_name:

Step 1 (Expand): Add the new column
  ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
  -- Old code still uses user_name; new column is null for now

Step 2 (Populate): Backfill the new column
  UPDATE users SET display_name = user_name WHERE display_name IS NULL;
  -- Done in batches to avoid long-running transactions

Step 3 (Sync): Keep columns in sync with a trigger
  CREATE TRIGGER sync_display_name
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION sync_user_name_display_name();

Step 4 (Deploy): Update application code to write and read from new column
  -- Deploy application version that reads display_name with user_name fallback
  -- e.g.: COALESCE(display_name, user_name)

Step 5 (Contract): Remove the old column
  ALTER TABLE users DROP COLUMN user_name;
  DROP TRIGGER sync_display_name ON users;
  -- Now safe: all application code uses display_name

Each step is a separate deployment, often days apart. The slowness is the point — each step is independently safe, and any step can be rolled back without affecting the others.

Online Schema Change Tools

For large tables, even the ADD COLUMN step must be done with an online schema change tool to avoid lock contention.

pg_repack for PostgreSQL (repacks the table online, avoiding exclusive locks):

# Install pg_repack
apt-get install postgresql-14-repack

# Add a column online — no table-level lock
# pg_repack creates a new table, migrates data, swaps atomically
pg_repack \
  --host=postgres-primary \
  --dbname=production \
  --table=users \
  --jobs=4            # Use 4 parallel workers for the copy phase
  
# For Postgres-specific: adding a column with a default in PG11+ is instant
# (PG11 added deferred default filling — no rewrite needed)
ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT NOW();
-- Instant in PG11+! The default is stored in catalog, applied on read.

gh-ost for MySQL — runs a shadow copy approach:

# gh-ost: safe online ALTER for MySQL
gh-ost \
  --user="ghost_user" \
  --password="$GHOST_PW" \
  --host=mysql-primary \
  --database=production \
  --table=users \
  --alter="ADD COLUMN display_name VARCHAR(255) DEFAULT NULL" \
  --allow-on-master \
  --execute

# gh-ost creates _users_gho (shadow table), replicates changes via binlog,
# then does a quick atomic rename swap. The table is never locked.

Batched Backfills: Avoiding Long-Running Transactions

The "Populate" step — updating all existing rows to fill a new column — is dangerous if done naively:

-- DANGEROUS: locks all rows for the duration, blocks production traffic
UPDATE users SET display_name = user_name;
-- On 50M rows: takes 10+ minutes, holds row locks throughout

The safe approach processes rows in small batches with explicit pauses between batches:

import time
import psycopg2

def backfill_display_name(batch_size: int = 10_000, sleep_between_batches: float = 0.1):
    """
    Backfill users.display_name from users.user_name in small batches.
    Each batch is a separate transaction — no long-running locks.
    """
    conn = psycopg2.connect(DATABASE_URL)
    
    total_updated = 0
    last_id = 0
    
    while True:
        with conn.cursor() as cur:
            cur.execute("""
                UPDATE users
                SET display_name = user_name
                WHERE id > %s
                  AND display_name IS NULL
                  AND id IN (
                    SELECT id FROM users
                    WHERE id > %s AND display_name IS NULL
                    ORDER BY id
                    LIMIT %s
                  )
                RETURNING id
            """, (last_id, last_id, batch_size))
            
            updated_ids = cur.fetchall()
            conn.commit()
            
            if not updated_ids:
                break  # Done
            
            last_id = max(row[0] for row in updated_ids)
            total_updated += len(updated_ids)
            
            print(f"Backfilled {total_updated} rows (last ID: {last_id})")
            
            # Pause to let production traffic breathe
            time.sleep(sleep_between_batches)
    
    print(f"Backfill complete: {total_updated} rows updated")
    conn.close()

Monitor replication lag during the backfill — a batch backfill generates WAL that replicas must apply. If replica lag starts growing, increase the sleep between batches.

Migration Sequencing with Application Deployments

The most dangerous migrations are ones where the application code and schema change are deployed simultaneously. The timing is almost never exactly right — either the new schema is live before the code, or the new code is live before the schema.

The safe deployment sequence:

Phase 1: Deploy backward-compatible schema changes
  - Add new columns (nullable, or with defaults)
  - Add new tables
  - Add new indexes (CONCURRENTLY — no locks)
  - DO NOT: drop columns, rename columns, change types, drop tables

Phase 2: Deploy application code that handles both old and new schema
  - New code reads new column if present, falls back to old column
  - New code writes to both old and new columns
  - This version must be compatible with both Phase 1 and Phase 3 schema

Phase 3: Deploy contract schema changes
  - Drop old columns
  - Remove backward compatibility triggers
  - These changes are now safe because no running application code references them

Phase 4: Deploy application code cleanup
  - Remove backward compatibility code paths
  - This is cosmetic — the system is already fully migrated

The time between phases can be minutes (for low-risk changes with fast deployment pipelines) or days (for high-risk changes where you want monitoring time between steps).

Migration Testing: Closing the Development-Production Gap

The gap between development and production is the root cause of most migration incidents. Production has things development doesn't: long-running transactions, hundreds of concurrent connections, replicas with replication lag.

Testing migrations against production-scale data:

# Create a production-scale test environment using logical replication
# 1. Take a recent full-scale backup
pg_dump --no-owner --no-acl --format=directory \
  --jobs=8 production_db > /backup/prod_backup

# 2. Restore to a test cluster
pg_restore --dbname=migration_test \
  --jobs=8 /backup/prod_backup

# 3. Run the migration against the full dataset
psql -h migration-test-cluster -d migration_test \
  -c "SET lock_timeout = '3s';" \
  -f migrations/20260508_add_display_name.sql

# 4. Measure duration, lock wait time, replication lag

Automated migration linting:

# Check migrations for common safety problems before release
DANGEROUS_PATTERNS = [
    (r"ALTER TABLE \w+ ADD COLUMN \w+ \w+ NOT NULL(?! DEFAULT)", 
     "NOT NULL without DEFAULT requires table rewrite — use nullable first"),
    (r"DROP COLUMN", 
     "Dropping columns is irreversible — ensure no code references this column"),
    (r"ALTER TABLE \w+ ALTER COLUMN \w+ TYPE",
     "Type changes require table rewrite — verify compatibility with running code"),
    (r"CREATE INDEX (?!CONCURRENTLY)",
     "INDEX without CONCURRENTLY will lock the table — use CREATE INDEX CONCURRENTLY"),
    (r"TRUNCATE TABLE",
     "TRUNCATE acquires ACCESS EXCLUSIVE lock — blocked by any open transaction"),
]

def lint_migration(migration_sql: str) -> list[str]:
    warnings = []
    for pattern, message in DANGEROUS_PATTERNS:
        if re.search(pattern, migration_sql, re.IGNORECASE):
            warnings.append(message)
    return warnings

Running migration linting in CI catches the most common problems before they reach the migration review stage, when the pressure to ship may have already increased.

*Zak Hassan is a Staff SRE specializing in database reliability, data platform engineering, and zero-downtime operations. Find him at zakhassan.com or on LinkedIn.*

Topic Paths

SRE and Reliability Kubernetes and Platform Engineering Observability and Incident Learning Cloud Cost and Capacity

About the Author

Zak Hassan writes about reliability engineering under real scale constraints.

Staff-level SRE and platform engineer focused on identity reliability, Kubernetes, observability, cloud architecture, AI infrastructure, and reducing operational uncertainty.

Connect on LinkedIn