PostgreSQL at Scale: The SRE Guide to Operating Postgres in Production

PostgreSQL is the database of choice for a huge fraction of production systems. It's reliable, feature-rich, and well-understood. It's also a system that can quietly accumulate operational debt — autovacuum lag, bloated tables, connection pool exhaustion, replication lag — that surfaces catastrophically during high-traffic events or after months of gradual accumulation.

This is the SRE operating guide for Postgres: the metrics that matter, the failure modes that will get you, and the tooling that helps.

Connection Management: The First Crisis You'll Hit

Postgres has a hard connection limit configured by max_connections. Default is 100 in many managed deployments; even with custom configuration it's rarely above 1,000. A modern application with 50 service instances, each maintaining a connection pool of 10, already has 500 connections. Add another 50 instances during a traffic spike, and you're trying to open connections that Postgres cannot accept.

The symptoms of connection exhaustion: new connections fail with "sorry, too many clients already," existing transactions time out waiting for connections, and the application layer starts cascading — services that can't get a database connection start returning errors, health checks fail, and what started as a database connection limit becomes a service outage.

PgBouncer is the solution. PgBouncer is a connection pooler that sits between your application and Postgres. Applications connect to PgBouncer (which can accept thousands of connections), and PgBouncer maintains a smaller pool of actual Postgres connections. In transaction pooling mode, a Postgres connection is held only for the duration of a transaction — a single Postgres connection can serve hundreds of application connections.

# pgbouncer.ini — production configuration
[databases]
mydb = host=postgres-primary port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

pool_mode = transaction        # Transaction pooling: most efficient
max_client_conn = 10000        # PgBouncer accepts up to 10k app connections
default_pool_size = 50         # 50 actual Postgres connections per database/user pair
min_pool_size = 10             # Keep 10 connections warm
reserve_pool_size = 10         # Extra connections for burst

# Timeouts
server_connect_timeout = 15
query_timeout = 0              # No timeout on queries (application should set this)
client_idle_timeout = 600      # Disconnect idle clients after 10 minutes

# Monitoring
stats_period = 60              # Emit stats every 60 seconds

Monitor PgBouncer stats:

-- Connect to PgBouncer admin interface
SHOW POOLS;        -- Connection pool status per database/user
SHOW STATS;        -- Throughput, latency stats
SHOW CLIENTS;      -- Current client connections
SHOW SERVERS;      -- Current Postgres connections

Alert on cl_waiting (clients waiting for a connection) — any non-zero value sustained for more than 30 seconds indicates the pool is too small for current load.

Autovacuum: The Background Process That Prevents Catastrophe

Postgres uses Multi-Version Concurrency Control (MVCC) — rather than updating rows in place, it writes new versions while keeping old versions visible to in-progress transactions. This enables non-blocking reads but creates "dead tuples" — old row versions no longer visible to any transaction that need to be cleaned up.

Autovacuum is the background process that cleans up dead tuples. If autovacuum falls behind — because it's too aggressive on the database load, or because it's been misconfigured, or because a long-running transaction is blocking cleanup — dead tuples accumulate. The consequences:

Table bloat. The table grows in disk footprint as dead tuples accumulate. Queries read more pages than necessary, slowing everything down.

Index bloat. Indexes also accumulate dead entries, slowing index scans.

Transaction ID wraparound. This is the catastrophic failure. Postgres uses 32-bit transaction IDs. When the counter approaches its limit (2^32 ≈ 4 billion), Postgres enters a safety mode that prevents all writes until autovacuum can advance the horizon. If autovacuum has been blocked by a long-running transaction for an extended period, you can approach this limit faster than expected.

Monitoring autovacuum health:

-- Tables that haven't been vacuumed recently (dead tuple accumulation risk)
SELECT 
    schemaname,
    relname as table_name,
    n_dead_tup as dead_tuples,
    n_live_tup as live_tuples,
    ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC
LIMIT 20;

-- Transaction ID age (distance from wraparound — alert if > 1 billion)
SELECT 
    datname,
    age(datfrozenxid) as xid_age,
    2000000000 - age(datfrozenxid) as transactions_until_wraparound
FROM pg_database
ORDER BY xid_age DESC;

Alert at 500 million transactions until wraparound. At that distance, you have time to investigate and fix the root cause. Waiting until 100 million is a crisis.

Slow Query Analysis with pg_stat_statements

pg_stat_statements is the essential extension for understanding where your database time is going. It aggregates statistics for every distinct query type:

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 queries by total execution time
SELECT 
    LEFT(query, 100) as query_preview,
    calls,
    ROUND(total_exec_time::numeric, 2) as total_ms,
    ROUND(mean_exec_time::numeric, 2) as avg_ms,
    ROUND(stddev_exec_time::numeric, 2) as stddev_ms,
    rows,
    100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0) AS cache_hit_pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Queries with high cache miss (reading from disk, not memory)
SELECT 
    LEFT(query, 100) as query_preview,
    calls,
    shared_blks_hit,
    shared_blks_read,
    100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0) AS cache_hit_pct
FROM pg_stat_statements
WHERE shared_blks_read > 1000
ORDER BY cache_hit_pct ASC
LIMIT 10;

A cache hit percentage below 95% for any significant query indicates the working set doesn't fit in shared_buffers (Postgres's buffer cache). Either increase shared_buffers (typically set to 25% of system RAM) or investigate whether the query is reading more data than necessary (missing index, poor query structure).

Replication Lag: Monitoring and Alerting

Postgres streaming replication propagates WAL (write-ahead log) from primary to replicas. Replication lag — the delay between a write being committed on the primary and becoming visible on the replica — is the primary replica health metric.

-- On the primary: check replica lag
SELECT 
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
    pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
    pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
    pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- On the replica: check its lag from primary
SELECT 
    now() - pg_last_xact_replay_timestamp() AS replication_lag_time,
    pg_is_in_recovery() AS is_replica;

Alert thresholds depend on your use case. For read replicas serving analytics: 30-second lag is acceptable. For replicas that might need to be promoted in a failover: lag above 10 seconds is concerning because failover will replay that lag before the replica can serve writes.

Long-running transactions block replication cleanup. A transaction open on the primary for hours prevents WAL segments from being cleaned up and can cause replica lag to grow even when replication is healthy. Monitor for long-running transactions:

-- Transactions older than 10 minutes
SELECT 
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state,
    wait_event_type,
    wait_event
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '10 minutes'
  AND state != 'idle'
ORDER BY duration DESC;

Index Management: When Indexes Help and Hurt

Indexes speed up reads but slow down writes (every write must update all relevant indexes) and consume disk space. The right indexing strategy requires understanding your read/write ratio and query patterns.

Missing indexes show up as sequential scans on large tables in EXPLAIN ANALYZE output:

-- Check for sequential scans on large tables (potential missing index)
SELECT 
    schemaname,
    relname as table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch
FROM pg_stat_user_tables
WHERE seq_scan > 1000
  AND n_live_tup > 100000  -- Only care about large tables
ORDER BY seq_scan DESC
LIMIT 20;

Unused indexes waste write overhead and disk space:

-- Indexes that are never used
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0
  AND NOT indisprimary
  AND NOT indisunique
ORDER BY pg_relation_size(indexrelid) DESC;

Before dropping any "unused" index, verify it's been unused since pg_stat_reset() was last called, and verify it's not being used for constraint enforcement or occasional maintenance queries that run infrequently.

*Zak Hassan is a Staff SRE specializing in database reliability, data platform engineering, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*

Topic Paths

SRE and Reliability Kubernetes and Platform Engineering Observability and Incident Learning Cloud Cost and Capacity

About the Author

Zak Hassan writes about reliability engineering under real scale constraints.

Staff-level SRE and platform engineer focused on identity reliability, Kubernetes, observability, cloud architecture, AI infrastructure, and reducing operational uncertainty.

Connect on LinkedIn