The database choice is the most consequential reliability decision in most system designs, and it's often made early — sometimes too early, before the actual access patterns and consistency requirements are well understood. AWS's database portfolio has expanded to the point where the selection decision is genuinely complex, and getting it wrong costs far more than the effort of getting it right.

This is an SRE-perspective guide to AWS's core database options: what their reliability profiles look like, what failure modes you should plan for, and when each is the right choice.


Aurora: The Workhorse

Amazon Aurora (PostgreSQL and MySQL compatible) is the default choice for most relational workloads on AWS, and for good reason. Aurora separates compute from storage, replicates data 6 ways across 3 AZs automatically, and provides automatic failover without data loss.

The Reliability Profile

Storage durability: 6-way replication means Aurora can lose 2 copies without data loss and continue operating. The storage layer is effectively independent of the compute layer — provisioning a new writer instance against existing storage is fast.

Failover time: With at least one replica in another AZ, Aurora failover typically completes in under 30 seconds, and in under 10 seconds in many cases. Without a replica, Aurora has to create a replacement instance, which takes minutes. Compare this to RDS Multi-AZ (1-2 minutes) or a self-managed database (manual intervention).

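What Aurora does not protect against: Application-level errors. A DELETE FROM table WHERE 1=1 executed by a runaway migration is replicated to all 6 storage copies immediately. Your protection against this category is Aurora's recovery tooling: Backtrack (Aurora MySQL only), which rewinds the cluster in place to an earlier point within a configured window; point-in-time restore from continuous backups, which creates a new cluster as of any moment within the backup retention period; and automated snapshots.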
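As a rough sketch of the recovery path for this failure category, the helper below builds the arguments for a point-in-time restore. The parameter names follow boto3's `restore_db_cluster_to_point_in_time`; the `-restored` suffix, the five-minute safety margin, and the function name itself are illustrative assumptions, not a prescribed procedure.

```python
from datetime import datetime, timedelta, timezone


def build_pitr_request(source_cluster_id: str, bad_change_at: datetime,
                       safety_margin: timedelta = timedelta(minutes=5)) -> dict:
    """Build kwargs for rds.restore_db_cluster_to_point_in_time.

    Targets a moment shortly before the destructive statement ran.
    The target identifier and safety margin are hypothetical choices.
    """
    if bad_change_at.tzinfo is None:
        raise ValueError("bad_change_at must be timezone-aware")
    restore_to = bad_change_at.astimezone(timezone.utc) - safety_margin
    return {
        "SourceDBClusterIdentifier": source_cluster_id,
        # Point-in-time restore always creates a new cluster.
        "DBClusterIdentifier": f"{source_cluster_id}-restored",
        "RestoreType": "full-copy",
        "RestoreToTime": restore_to,
    }


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   rds = boto3.client("rds")
#   rds.restore_db_cluster_to_point_in_time(**build_pitr_request(...))
```

The restored cluster comes up without instances, so a real runbook would also create a writer instance in it and then copy the lost rows back or repoint the application.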

Operational Patterns

Read replicas for query isolation. Run analytics queries against Aurora read replicas, not the writer. A slow analytical query on the writer competes for resources with OLTP workloads. Aurora's replica lag is typically well under a second, but it can grow under sustained write-heavy load; monitor it and alert if it exceeds acceptable thresholds for your use case.
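One way to turn "alert if it exceeds acceptable thresholds" into something less noisy is to require several consecutive breaches. The sketch below assumes the samples are recent values of the `AuroraReplicaLag` CloudWatch metric in milliseconds, oldest first; the function name and the default of three samples are hypothetical.

```python
from typing import Sequence


def replica_lag_breached(lag_ms: Sequence[float], threshold_ms: float,
                         min_consecutive: int = 3) -> bool:
    """Return True if the newest `min_consecutive` samples all exceed the threshold.

    Requiring consecutive breaches avoids paging on a single lag spike.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(lag_ms) < min_consecutive:
        return False  # not enough data to decide
    return all(sample > threshold_ms for sample in lag_ms[-min_consecutive:])
```

With a 1,000 ms threshold, `[20, 1500, 1800, 2100]` would fire while `[20, 1500, 30, 2100]` would not.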

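Aurora Global Database for multi-region. Aurora Global Database is a single-writer design: one primary region accepts writes and replicates to up to 5 secondary regions with typical lag under 1 second. That makes it a fit for active-passive configurations (with reads served locally in each region) rather than true active-active. A planned switchover (when you want to promote a secondary) typically takes under 1 minute. Unplanned failover (the primary region is unavailable) takes longer and is not triggered automatically: you or your own automation must initiate it, so rehearse the runbook.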

Connection pooling is not optional at scale. Aurora's connection limit is a function of instance memory. A db.r7g.large allows on the order of 1,000 to 2,000 connections by default, depending on the engine and parameter settings. If your application opens 10 connections per pod and you run 300 pods, that is 3,000 connections, more than the instance will accept. RDS Proxy or PgBouncer (for PostgreSQL) is required for applications that scale horizontally.

python
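import boto3

# Connection pool monitoring: alert before exhaustion.
# get_metric, get_aurora_max_connections, and alert are application helpers
# defined elsewhere. max_connections is a per-instance limit, so compare it
# against the connection count of the same instance (usually the writer).
def check_aurora_connections(cluster_id: str, threshold_pct: float = 0.8):
    cw = boto3.client('cloudwatch')

    current = get_metric(cw, cluster_id, 'DatabaseConnections')
    max_connections = get_aurora_max_connections(cluster_id)
    if not max_connections:
        return  # limit unknown; nothing to compare against

    utilization = current / max_connections
    if utilization > threshold_pct:
        alert(f"Aurora connection pool at {utilization:.0%}; consider scaling instance or adding proxy")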
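The pod arithmetic from the connection-pooling paragraph can also be checked before deployment rather than discovered in production. This is a minimal sketch; the function name and the 50 connections reserved for admin and monitoring sessions are assumptions.

```python
def connection_budget(max_connections: int, pods: int, pool_size_per_pod: int,
                      reserved: int = 50) -> dict:
    """Compare worst-case connection demand against the instance limit."""
    if pods < 1:
        raise ValueError("pods must be at least 1")
    usable = max_connections - reserved      # keep headroom for admin sessions
    demand = pods * pool_size_per_pod        # every pod with a full pool
    return {
        "demand": demand,
        "usable": usable,
        "fits": demand <= usable,
        # Largest per-pod pool that still fits without a proxy.
        "max_pool_size_per_pod": max(usable // pods, 0),
    }
```

For the example above (a 2,000-connection limit, 300 pods, 10 connections each), demand is 3,000 against 1,950 usable, and the largest pool that fits is 6 per pod. That is usually the point where a proxy becomes the simpler answer.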

Cluster parameter groups are shared config. A change to a cluster parameter group affects all instances in the cluster. Test parameter changes in non-production before applying to production, and prefer dynamic parameters (take effect immediately) over static parameters (require reboot) for production clusters.
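To make the dynamic-versus-static distinction concrete, the sketch below plans a parameter change from the metadata RDS returns. It assumes `described` is the `Parameters` list from `describe_db_cluster_parameters` and produces the `Parameters` argument for `modify_db_cluster_parameter_group`; the function name is hypothetical and the sample values in the usage note are illustrative only.

```python
from __future__ import annotations


def plan_parameter_changes(described: list[dict], desired: dict[str, str]) -> list[dict]:
    """Choose an ApplyMethod per parameter based on its ApplyType.

    Dynamic parameters can be applied immediately; static ones only take
    effect after a reboot, so they are queued as pending-reboot.
    """
    by_name = {p["ParameterName"]: p for p in described}
    plan = []
    for name, value in desired.items():
        meta = by_name.get(name)
        if meta is None or not meta.get("IsModifiable", False):
            raise ValueError(f"{name} is unknown or not modifiable")
        method = "immediate" if meta.get("ApplyType") == "dynamic" else "pending-reboot"
        plan.append({"ParameterName": name, "ParameterValue": value, "ApplyMethod": method})
    return plan
```

Reviewing the plan before applying it shows which changes will need a reboot window, which is the part that tends to surprise people on a production cluster.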


DynamoDB: The Different Beast

DynamoDB is not a relational database. It's a key-value and document store designed for single-digit millisecond latency at any scale. Its reliability profile is fundamentally different from Aurora's, and using it like a relational database produces both reliability and performance problems.

The Reliability Profile

Effectively unlimited scale. DynamoDB's managed sharding means you can scale to millions of requests per second by provisioning capacity. The database infrastructure never becomes the bottleneck if you're provisioning correctly.

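Throughput mode matters: On-demand mode absorbs most traffic patterns without capacity planning but costs more per request; it can still throttle if traffic spikes far beyond the table's previous peak or concentrates on a single partition. Provisioned mode is cheaper at predictable load but throttles requests that exceed provisioned capacity. Throttled requests return a ProvisionedThroughputExceededException. The AWS SDKs retry these a limited number of times; beyond that, your application must handle throttling with exponential backoff, and your monitoring must surface throttling events.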
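For code paths where you control retries yourself, exponential backoff with full jitter might look like the sketch below. `ThrottledError` is a stand-in for the SDK's throttling exception, and the base delay, cap, and attempt count are assumptions to tune; boto3 already retries throttled calls by default, so this would sit on top of or in place of that behavior.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Stand-in for a throttling error such as ProvisionedThroughputExceededException."""


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying on ThrottledError with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            sleep(backoff_delay(attempt))
```

The jitter matters as much as the exponent: without it, every throttled client retries at the same instant and recreates the spike.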

Hot partitions are the silent reliability killer. DynamoDB distributes data across internal partitions based on the partition key. A single partition serves at most roughly 3,000 read capacity units and 1,000 write capacity units per second, so if 90% of your traffic hits one partition key, that key is throttled at the partition limit regardless of total provisioned capacity. A typical example: a table keyed by user ID where one "user" is actually a shared test account generating millions of requests. That is a hot partition waiting to happen.
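A common mitigation is write sharding: spreading one logical key across several physical partition keys. The sketch below is one way to do it, assuming a `#` separator and ten shards (both arbitrary choices) and deriving the shard from an item ID so that readers and writers agree on it.

```python
from __future__ import annotations

import hashlib


def sharded_partition_key(base_key: str, item_id: str, shard_count: int = 10) -> str:
    """Map one logical key onto one of `shard_count` physical partition keys."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{base_key}#{shard}"


def all_shard_keys(base_key: str, shard_count: int = 10) -> list[str]:
    """Every physical key to query when reading the whole logical key."""
    return [f"{base_key}#{n}" for n in range(shard_count)]
```

The trade-off is on the read side: fetching everything for the logical key now means one query per shard, so this fits write-heavy keys better than read-heavy ones.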

The Access Pattern Constraint

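DynamoDB requires you to define your access patterns upfront. The key insight: you cannot do arbitrary queries against DynamoDB. You query by partition key (exact match), optionally narrowed by a sort key condition such as a range. Secondary indexes extend this, but each one has to be designed for a specific access pattern: local secondary indexes can only be defined at table creation, while global secondary indexes can be added later at the cost of a backfill. Ad hoc queries that work against Aurora are not possible against DynamoDB without a full table scan (expensive and slow at scale).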
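To show what "partition key plus sort key range" looks like in practice, here is a sketch that builds the arguments for a low-level `Query` call. The table layout is hypothetical: attribute names `pk` and `sk`, with `USER#` and `ORDER#` prefixes encoding the entity type.

```python
def build_orders_query(table: str, user_id: str, start_iso: str, end_iso: str) -> dict:
    """Build kwargs for dynamodb_client.query: one user's orders in a date range."""
    return {
        "TableName": table,
        # Exact match on the partition key, range condition on the sort key.
        "KeyConditionExpression": "#pk = :pk AND #sk BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#pk": "pk", "#sk": "sk"},
        "ExpressionAttributeValues": {
            ":pk": {"S": f"USER#{user_id}"},
            ":start": {"S": f"ORDER#{start_iso}"},
            ":end": {"S": f"ORDER#{end_iso}"},
        },
    }
```

Anything this key shape cannot express, such as "all orders over $100 across all users", needs its own index or a scan, which is the constraint the paragraph above describes.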

This constraint is why DynamoDB and Aurora often coexist: DynamoDB for high-volume, predictable access pattern workloads (user sessions, feature flags, product catalog); Aurora for workloads requiring flexible querying, joins, or transactions across multiple entities.

DynamoDB Streams for Change Data Capture

DynamoDB Streams captures every change to a table as an ordered stream of events. This is the integration point for making DynamoDB changes trigger downstream processing — updating search indexes, publishing events to Kafka, invalidating caches:

python
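from boto3.dynamodb.types import TypeDeserializer

_deserializer = TypeDeserializer()

def deserialize(image: dict) -> dict:
    """Convert a DynamoDB-typed image ({'S': ...}) into plain Python values."""
    return {key: _deserializer.deserialize(value) for key, value in image.items()}

# Lambda triggered by DynamoDB Stream. Assumes the stream view type is
# NEW_AND_OLD_IMAGES; publish_to_kafka and invalidate_cache are application
# helpers defined elsewhere.
def handle_dynamodb_stream(event, context):
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            new_image = deserialize(record['dynamodb']['NewImage'])
            publish_to_kafka('user-events', new_image)

        elif record['eventName'] == 'MODIFY':
            new_image = deserialize(record['dynamodb']['NewImage'])
            invalidate_cache(new_image['user_id'])

        elif record['eventName'] == 'REMOVE':
            old_image = deserialize(record['dynamodb']['OldImage'])
            publish_to_kafka('user-deletions', old_image)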

Stream records are ordered per item (all changes to the same primary key arrive in order) but not across items. If strict global ordering matters, route the stream into an ordered queue (SQS FIFO, Kafka with appropriate partitioning) and establish the order there, rather than assuming stream event order.
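When forwarding stream records to Kafka or SQS FIFO, per-item order survives only if every change to the same item lands in the same partition or message group. One way to get that, sketched here with a hypothetical helper name, is to derive the message key from the record's `Keys` field.

```python
import hashlib
import json


def ordering_key(record: dict) -> str:
    """Derive a stable message key from the item's primary key.

    Use the result as the Kafka message key or the SQS FIFO MessageGroupId
    so that all changes to one item stay in order downstream.
    """
    keys = record["dynamodb"]["Keys"]  # e.g. {"user_id": {"S": "u-1"}}
    canonical = json.dumps(keys, sort_keys=True, separators=(",", ":"))
    # Hashing keeps the key short and free of awkward characters.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

This preserves order within an item only; it does not create a global order across items.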


ElastiCache: The Supporting Cast

ElastiCache (Redis or Valkey) is the caching layer that makes Aurora and DynamoDB work at high request rates. Rather than treating caching as an afterthought, build it into your reliability model from the start.

Reliability considerations for the cache-aside pattern:

Cache misses under load are a reliability risk. If your cache has a high miss rate (cold start after a deployment, cache evictions under memory pressure, TTL expiry at scale), your database absorbs traffic that the cache was shielding. A cache stampede — where many requests simultaneously miss the cache for the same key and all hit the database — can take down an undersized database.

Mitigations: probabilistic early expiration (begin refreshing cache entries shortly before they expire, while still serving the cached value), circuit breakers that detect high miss rates and shed load, and request coalescing, where a single request refreshes a missing key while concurrent requests for the same key wait for its result rather than stampeding the database.
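Probabilistic early expiration is easier to reason about with the decision written out. The sketch below follows one common formulation (often called XFetch): each reader refreshes early with a probability that rises as expiry approaches and as the recompute cost grows. The function name, parameter names, and `beta` default are assumptions.

```python
import math
import random
import time
from typing import Callable


def should_refresh_early(expires_at: float, recompute_seconds: float, beta: float = 1.0,
                         now: Callable[[], float] = time.time,
                         rng: Callable[[], float] = random.random) -> bool:
    """Decide whether this request should refresh the entry before it expires.

    recompute_seconds: how long the last recomputation took.
    beta: values above 1 refresh earlier, values below 1 refresh later.
    """
    # 1 - rng() lies in (0, 1], so the log is defined and never positive.
    head_start = -recompute_seconds * beta * math.log(1.0 - rng())
    return now() + head_start >= expires_at
```

Because each request draws its own random head start, only a few readers refresh early instead of all of them missing at the same instant when the TTL lapses.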

ElastiCache Global Datastore replicates Redis data to secondary regions for read access. Writes go to the primary region; secondary region clusters serve reads with replication lag typically under 1 second. For global applications where session state or user preferences need to be readable near the user, Global Datastore eliminates cross-region latency for reads.


The Decision Framework

For a new workload, the questions that determine the right database choice:

Do you need arbitrary query flexibility? Yes → Aurora/RDS. Complex joins, aggregations, and ad hoc queries require a relational model.

Is your access pattern key-based with known query shapes? Yes → Consider DynamoDB. If you always query by user ID, always get a session by session token, and never need joins — DynamoDB is faster and simpler at scale.

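Is 99.999% availability a hard requirement? → DynamoDB global tables (multi-region, active-active), which carry a 99.999% availability SLA. Aurora Global Database with a rehearsed failover procedure improves regional resilience, but the Aurora SLA for Multi-AZ clusters is 99.99%, so treat five nines on Aurora as something you have to engineer and prove yourself.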

Will this see >10,000 requests per second on a single table? → DynamoDB is the safer choice; Aurora can handle this but requires careful connection pooling and instance sizing.

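Do you need strong ACID transactions across multiple entities? → Aurora. DynamoDB transactions exist but are more limited (up to 100 items per transaction) and more expensive per operation (transactional reads and writes consume twice the capacity of standard ones).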
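The questions above can be written down as a first-pass checklist. This is only a sketch: the field names are hypothetical, the order in which the questions are evaluated is my assumption rather than a rule, and the availability question is left out because it depends on multi-region design more than on engine choice.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_ad_hoc_queries: bool
    key_based_access: bool
    needs_multi_entity_transactions: bool
    peak_rps_single_table: int


def suggest_database(w: Workload) -> str:
    """Return a starting point for discussion, not a verdict."""
    # Relational requirements win first: they are the hardest to retrofit.
    if w.needs_ad_hoc_queries or w.needs_multi_entity_transactions:
        return "Aurora"
    # Known query shapes, or very high request rates, favor DynamoDB.
    if w.key_based_access or w.peak_rps_single_table > 10_000:
        return "DynamoDB"
    return "Aurora"  # the default for most relational workloads
```

A session store with key lookups at 50,000 requests per second comes out as DynamoDB; a reporting backend with joins comes out as Aurora regardless of request rate.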

The failure I see most often: teams start with Aurora for everything (because SQL is familiar), hit scale limits, and migrate to DynamoDB under pressure — a painful migration that would have been avoided by making the choice correctly earlier. The other failure: teams use DynamoDB for workloads that genuinely need relational capabilities, then build increasingly complex application-layer logic to compensate for missing SQL features.

Match the database to the access pattern. Both Aurora and DynamoDB are excellent at what they're designed for and problematic when used outside their design envelope.


*Zak Hassan is a Staff SRE specializing in distributed systems reliability, AWS infrastructure, and data platform engineering. Find him at zakhassan.com or on LinkedIn.*
