Redis at Homelab Scale: The SRE Guide to Operating Redis at Scale

*By Zak Hassan — Staff SRE | May 2026*

Redis is deceptively easy to get started with and deceptively hard to operate well. You spin up a single instance, point the application at it, and everything feels fast. Then traffic grows, a node restarts, a hot key melts a single CPU core, or Redis hits its memory limit and starts silently dropping data the application assumes is there. This guide covers what actually matters when you're responsible for Redis in production-like lab environments: topology decisions, persistence trade-offs, memory management, failure modes, and the documented recovery steps you reach for when things go wrong.

Topology: Standalone, Sentinel, and Cluster

Standalone is one Redis process. It is appropriate for development, for low-traffic caches where downtime is acceptable, or for small datasets where a single node's RAM is sufficient. Do not use standalone for anything where an unplanned restart would cause a user-visible incident. A standalone instance restarts cold; the application must warm its own cache.

Sentinel adds high availability without horizontal scaling. Three or more Sentinel processes monitor the primary. When the primary fails, the Sentinels elect a leader among themselves, and that leader promotes a replica, preferring a lower replica-priority value and then the most advanced replication offset. Applications connect to Sentinel addresses, not the primary directly; the Sentinel API resolves the current primary on demand.

# sentinel.conf
sentinel monitor mymaster 10.0.1.10 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

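The quorum of 2 means two Sentinels must agree the primary is down before a failover can be triggered; the failover itself must then be authorized by a majority of all Sentinels. A two-Sentinel setup is dangerous: if one Sentinel fails, the survivor can reach neither the quorum nor a majority, so no failover ever happens. Run at minimum three Sentinels on separate failure domains.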
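
The sizing arithmetic can be sketched in a few lines. This is a reasoning aid, not Sentinel's implementation: the function name and parameters are hypothetical. It encodes two rules: the configured quorum is how many Sentinels must agree the primary is down, and the failover must additionally be authorized by a majority of all known Sentinels.

```python
def can_fail_over(total_sentinels: int, quorum: int, sentinels_alive: int) -> bool:
    """Return True if the surviving Sentinels can detect AND authorize a failover.

    Hypothetical helper for capacity reasoning; not part of any Redis API.
    """
    majority = total_sentinels // 2 + 1          # votes needed to authorize the failover
    can_detect = sentinels_alive >= quorum       # enough agreement to mark the primary down
    can_authorize = sentinels_alive >= majority  # enough votes to start the failover
    return can_detect and can_authorize


# Two Sentinels, one lost: the survivor cannot reach a majority of 2.
print(can_fail_over(total_sentinels=2, quorum=2, sentinels_alive=1))  # False
# Three Sentinels, one lost: two survivors satisfy both quorum and majority.
print(can_fail_over(total_sentinels=3, quorum=2, sentinels_alive=2))  # True
```

Note that a low quorum on a large deployment does not help on its own: with five Sentinels and quorum 2, two survivors can detect the failure but still cannot authorize the failover.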

Sentinel failover takes 5–30 seconds in practice. During that window, writes are rejected and reads may serve stale data from replicas (if replica-serve-stale-data yes is set). Design the application to handle short write unavailability.

Redis Cluster shards data across multiple primary nodes using 16,384 hash slots. Each key is assigned to a slot via CRC16(key) % 16384. Cluster is the right choice when your dataset exceeds a single node's RAM, or when write throughput saturates a single CPU core.

# redis.conf for a cluster node
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 15000
cluster-announce-ip 10.0.1.10
cluster-announce-port 6379

Cluster introduces operational complexity: multi-key operations only work if all keys hash to the same slot (use hash tags: {user:1234}:session and {user:1234}:cart share a slot), and Lua scripts must operate within a single slot. Assess whether your access patterns require cross-key atomicity before committing to Cluster.
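
To check in advance whether two keys will land in the same slot, you can compute the slot yourself. The sketch below follows the documented rule (CRC16 in its XMODEM variant, modulo 16384, hashing only the hash tag when one is present). The function names are my own, and on a live cluster CLUSTER KEYSLOT gives the authoritative answer.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the variant Redis Cluster uses)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Return the cluster slot for a key, honoring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc16_xmodem(data) % 16384


# Both keys hash only "user:1234", so they share a slot.
print(hash_slot("{user:1234}:session") == hash_slot("{user:1234}:cart"))  # True
```

Running this over a sample of real key names before a migration is a cheap way to find multi-key operations that would fail with CROSSSLOT errors.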

Persistence: RDB, AOF, and the Case for Not Skipping It

The most common mistake teams make is disabling persistence because "it's just a cache." Redis persistence is not only about surviving restarts; it is also about controlling your cache warm-up time and the application's cold-start behavior after a failure.

RDB takes point-in-time snapshots. It is compact, fast to load on restart, and has minimal runtime overhead. The trade-off: you lose writes that happened since the last snapshot. A 5-minute snapshot interval means up to 5 minutes of data loss.

# redis.conf
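# Redis config files do not allow trailing comments, so each note sits on its own line.
# snapshot if at least 1 write happened in 900s
save 900 1
# snapshot if at least 10 writes happened in 300s
save 300 10
# snapshot if at least 10000 writes happened in 60s
save 60 10000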
rdbcompression yes
rdbfilename dump.rdb
dir /var/lib/redis
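
The save rules are OR-ed together: a snapshot is triggered as soon as any one rule is satisfied. A minimal sketch of that evaluation, assuming the three rules above (the function and constant names are hypothetical, and this approximates the documented semantics rather than reproducing Redis's scheduler):

```python
# (seconds, changes) pairs mirroring the save lines in the config above.
SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]


def snapshot_due(seconds_since_last_save: float,
                 changes_since_last_save: int,
                 rules=SAVE_RULES) -> bool:
    """Return True if any save rule is satisfied."""
    return any(
        seconds_since_last_save > seconds and changes_since_last_save >= changes
        for seconds, changes in rules
    )


# A burst of writes triggers the 60-second rule.
print(snapshot_due(61, 10000))  # True
# A trickle of writes waits for the 300-second rule.
print(snapshot_due(61, 500))    # False
print(snapshot_due(301, 500))   # True
```

The practical consequence: on a quiet instance the worst-case loss window is set by the slowest rule (here 15 minutes), not the fastest one.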

AOF (Append-Only File) logs every write command. On restart, Redis replays the AOF to reconstruct state. The appendfsync everysec setting offers a reasonable balance: roughly one second of data loss in the typical failure case, with a minor write amplification cost.

appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-rewrite-incremental-fsync yes

For most production-like lab systems, the correct answer is RDB + AOF hybrid mode (aof-use-rdb-preamble yes, available since Redis 4.0 and the default since Redis 5.0). Redis writes an RDB snapshot as the AOF base and appends only the delta since the last rewrite. Restart is fast; data loss window is small.

"This is only ephemeral cache" is a statement that ages poorly at 3 a.m. when a node restarts and 40 million keys need to be re-fetched from the database.

Memory Management and Eviction Policies

Redis is an in-memory store with a configurable ceiling. When maxmemory is reached, Redis applies an eviction policy. Choosing the wrong policy is a class of bug that is invisible until it causes a cache stampede or silent data loss.

maxmemory 12gb
maxmemory-policy allkeys-lru
maxmemory-samples 10

The key policies and when to use them:

| Policy | Behavior | Use When |
|---|---|---|
| `noeviction` | Reject writes, return OOM errors | Session store or other data you cannot afford to lose |
| `allkeys-lru` | Evict least-recently-used across all keys | General cache; all keys are expendable |
| `volatile-lru` | LRU only among keys with a TTL set | Mixed store: some persistent, some cache |
| `allkeys-lfu` | Evict least-frequently-used (Redis 4.0+) | Hot/cold access patterns where frequency matters more than recency |
| `volatile-ttl` | Evict keys with shortest remaining TTL first | You want the TTLs you set to decide eviction order |

noeviction is correct when Redis holds data that cannot be regenerated — session tokens, rate-limit counters, queued jobs. When Redis hits maxmemory under noeviction, writes fail immediately. Your application must handle this gracefully or your users will see errors.

allkeys-lru is correct for pure caching. Redis silently drops the coldest keys to make room. The risk is evicting keys the application assumes are present, causing a cache miss spike that overwhelms the database.
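
It helps to know that Redis's LRU is approximate: rather than keeping a global ordering of every key, it samples a handful of keys (maxmemory-samples) and evicts the best candidate among them. The sketch below is a simplified model of that idea only. The names are hypothetical, and real Redis also carries a pool of good candidates between rounds.

```python
import random


def pick_eviction_candidate(last_access: dict, samples: int, rng: random.Random) -> str:
    """Sample up to `samples` keys and return the least recently used among them.

    `last_access` maps key -> last access timestamp (smaller means older).
    """
    keys = list(last_access)
    sampled = rng.sample(keys, min(samples, len(keys)))
    return min(sampled, key=lambda k: last_access[k])


last_access = {"a": 100.0, "b": 5.0, "c": 50.0, "d": 75.0}
rng = random.Random(42)

# With a sample covering every key, the choice is the true LRU key.
print(pick_eviction_candidate(last_access, samples=10, rng=rng))  # b
# With a small sample, the victim is only the oldest key that happened to be sampled.
print(pick_eviction_candidate(last_access, samples=2, rng=rng))
```

This is why raising maxmemory-samples brings eviction closer to true LRU at the cost of a little CPU per eviction.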

The Hot Key Problem

A hot key is a single Redis key that receives a disproportionate share of traffic — enough that the requests to that key saturate the single CPU thread handling its slot. In Cluster mode this is especially dangerous because that one slot's primary bears all the load regardless of how many other nodes exist.

Detection is straightforward:

# Redis 4.0+ built-in hot key detection (requires maxmemory-policy LFU)
redis-cli --hotkeys -h 10.0.1.10 -p 6379

# Or use MONITOR for a short sample (high overhead, use sparingly).
# Field 4 of each MONITOR line is the command; field 5 is its first argument, usually the key.
redis-cli -h 10.0.1.10 monitor | head -n 5000 | awk '{print $5}' | sort | uniq -c | sort -rn | head -20

For ongoing monitoring, parse the command stats in your metrics pipeline:

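import redis
from prometheus_client import Gauge

r = redis.Redis(host="10.0.1.10", port=6379)

# Track cumulative GET calls as a coarse proxy for hot key load.
# commandstats is per command, not per key: a jump here tells you to go
# looking with --hotkeys, not which key is hot.
GET_OPS = Gauge("redis_get_ops_total", "Total GET operations")

def collect_command_calls():
    """Return {cmdstat_<name>: cumulative call count} from INFO commandstats."""
    info = r.info("commandstats")
    return {cmd: stats["calls"] for cmd, stats in info.items()}

def scrape():
    """Refresh the gauge from the latest commandstats sample."""
    calls = collect_command_calls()
    if "cmdstat_get" in calls:
        GET_OPS.set(calls["cmdstat_get"])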

Mitigation patterns:

Local in-process cache: Cache the hot key's value in application memory for 1–5 seconds. Redis then sees roughly one read per application instance per TTL window for that key, instead of one per request. Use a library like Caffeine (JVM) or cachetools (Python).

Key sharding: Split one key into N shards — popular_item:0 through popular_item:15. On read, pick a shard at random. On write, update all shards. Effective but adds write complexity.

Read replicas with client-side routing: Direct reads for known hot keys to replicas. Works with Sentinel; in Cluster, requires READONLY mode on replica connections.
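
The first two patterns are small enough to sketch without a library. Everything below is illustrative: TTLCache, shard_keys, and read_shard are hypothetical names, and loader stands in for whatever function performs the actual Redis read in your application.

```python
import random
import time


class TTLCache:
    """Tiny in-process cache; an entry is reloaded once it is `ttl` seconds old."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}  # key -> (stored_at, value)

    def get(self, key, loader):
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]          # served locally, no Redis round trip
        value = loader(key)          # at most one load per TTL window
        self._data[key] = (now, value)
        return value


def shard_keys(base: str, shards: int) -> list:
    """All shard names for a key; writes must update every one of them."""
    return [f"{base}:{i}" for i in range(shards)]


def read_shard(base: str, shards: int, rng=random) -> str:
    """Pick one shard at random for a read."""
    return f"{base}:{rng.randrange(shards)}"


print(shard_keys("popular_item", 4))  # ['popular_item:0', 'popular_item:1', 'popular_item:2', 'popular_item:3']
```

The local cache trades up to `ttl` seconds of staleness for load reduction, so it suits values where a few seconds of lag is acceptable.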

Eviction Monitoring: Bug vs. Expected Behavior

Eviction is not always a problem. Under allkeys-lru, eviction is the intended steady-state for a well-sized cache with more data than memory. The signal to watch is eviction rate relative to your hit rate.

import redis

r = redis.Redis(host="10.0.1.10", port=6379)

def cache_health():
    info = r.info("stats")
    hits = info["keyspace_hits"]
    misses = info["keyspace_misses"]
    evictions = info["evicted_keys"]
    
    total = hits + misses
    hit_rate = hits / total if total > 0 else 0
    
    print(f"Hit rate:       {hit_rate:.2%}")
    print(f"Evicted keys:   {evictions}")
    print(f"Eviction rate:  (scrape delta over time for rate)")
    
    # Alert threshold: hit rate < 85% suggests undersizing
    if hit_rate < 0.85:
        print("WARNING: hit rate below threshold — consider increasing maxmemory")

cache_health()

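Eviction or memory pressure is a bug when: you are using noeviction and writes are failing with OOM errors (nothing is evicted under that policy, so the failure shows up on the write path); you are evicting session or rate-limit keys that must be durable; or your hit rate is collapsing because recently written keys are evicted almost immediately.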

Eviction is expected when: you have a fixed-size LRU cache that holds working-set data; eviction rate is stable and hit rate is above your SLO threshold.
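
Since evicted_keys is a cumulative counter, the rate has to come from the difference between two scrapes. A minimal sketch of that calculation plus a first-pass verdict follows. The function names and verdict strings are hypothetical, and the 0.85 default is only the example threshold used above, not a recommendation.

```python
def eviction_rate(prev_evicted: int, curr_evicted: int, interval_seconds: float) -> float:
    """Keys evicted per second between two scrapes of evicted_keys."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_evicted - prev_evicted
    if delta < 0:
        # Counter went backwards: the server restarted or stats were reset.
        delta = curr_evicted
    return delta / interval_seconds


def eviction_verdict(policy: str, rate: float, hit_rate: float,
                     hit_rate_slo: float = 0.85) -> str:
    """Rough triage of eviction behavior; tune the rules to your own SLOs."""
    if policy == "noeviction":
        return "no eviction possible: watch for OOM write errors instead"
    if rate > 0 and hit_rate < hit_rate_slo:
        return "investigate: evicting while hit rate is below SLO"
    return "expected"


rate = eviction_rate(prev_evicted=1000, curr_evicted=1600, interval_seconds=60)
print(rate)                                                  # 10.0
print(eviction_verdict("allkeys-lru", rate, hit_rate=0.93))  # expected
```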

Key Operational Metrics

import redis

r = redis.Redis(host="10.0.1.10", port=6379, decode_responses=True)

def redis_metrics():
    mem = r.info("memory")
    stats = r.info("stats")
    clients = r.info("clients")
    
    # RSS vs used_memory: fragmentation indicator
    used = mem["used_memory"]
    rss = mem["used_memory_rss"]
    frag_ratio = rss / used
    
    print(f"used_memory:              {used / 1e9:.2f} GB")
    print(f"used_memory_rss:          {rss / 1e9:.2f} GB")
    print(f"Fragmentation ratio:      {frag_ratio:.2f}  (>1.5 = concern)")
    print(f"connected_clients:        {clients['connected_clients']}")
    print(f"instantaneous_ops_per_sec:{stats['instantaneous_ops_per_sec']}")
    
    hits = stats["keyspace_hits"]
    misses = stats["keyspace_misses"]
    total = hits + misses
    print(f"keyspace_hit_rate:        {hits/total:.2%}" if total else "no ops yet")

redis_metrics()

used_memory_rss vs used_memory: used_memory is what Redis's allocator reports as allocated. used_memory_rss is what the OS reports as resident (includes allocator fragmentation). A fragmentation ratio above 1.5 indicates significant fragmentation. Consider enabling active defragmentation (activedefrag yes, Redis 4.0+ with jemalloc), running MEMORY PURGE during a low-traffic window to release unused allocator pages, or planning a rolling restart.

connected_clients: Sudden spikes indicate a connection leak. A value near your maxclients limit (default 10,000) will cause new connections to be refused. Ensure the application uses a connection pool, not per-request connections.

keyspace_hits / keyspace_misses: The primary cache efficiency signal. Track the hit rate as a ratio over time, not as raw counts. A downward trend in hit rate while traffic is stable indicates either eviction pressure or TTL misconfiguration.
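
keyspace_hits and keyspace_misses are lifetime counters, so a ratio computed from raw values hides recent changes behind weeks of history. One way to get a windowed rate is to difference two samples. The sketch below assumes each sample is a dict shaped like the stats section of INFO; the function name is hypothetical.

```python
from typing import Optional


def windowed_hit_rate(prev: dict, curr: dict) -> Optional[float]:
    """Hit rate for the interval between two INFO stats samples.

    Returns None when there was no traffic or the counters were reset.
    """
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if hits < 0 or misses < 0:
        return None  # restart or CONFIG RESETSTAT between samples
    total = hits + misses
    return hits / total if total else None


prev = {"keyspace_hits": 1_000_000, "keyspace_misses": 50_000}
curr = {"keyspace_hits": 1_000_900, "keyspace_misses": 50_100}
print(windowed_hit_rate(prev, curr))  # 0.9
```

In this example the lifetime ratio is still above 95% while the last interval ran at 90%, which is exactly the kind of drift the raw counters hide.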

Redis Cluster Failure Modes

In Redis Cluster, each shard has one primary and one or more replicas. When a primary fails, its replicas compete to be promoted, and the winner needs votes from a majority of all primary nodes (more than half, counting the failed one). A partition that contains only half of the primaries or fewer cannot complete a failover.

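Single node failure: The primary is marked as failed once a majority of primaries have found it unreachable for cluster-node-timeout (default 15 seconds). A replica in the same shard is then promoted. Until the promotion completes, requests for the failed primary's slots fail with connection errors or timeouts, and with cluster-require-full-coverage yes (the default) the cluster answers with CLUSTERDOWN errors once the node is marked failed. Design the application to retry with exponential backoff.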
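
A minimal sketch of that retry loop, with full jitter so that many clients do not retry in lockstep. The helper name and defaults are hypothetical. The built-in ConnectionError and TimeoutError are used only to keep the example self-contained; pass your client library's own exception classes through retryable.

```python
import random
import time


def retry_with_backoff(operation, retryable=(ConnectionError, TimeoutError),
                       attempts=5, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(); on a retryable error wait base_delay * 2**attempt (capped), then retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())  # full jitter: wait a random fraction of the delay


# Usage with a stand-in operation that fails twice before succeeding.
outcomes = iter([ConnectionError("down"), ConnectionError("down"), "value"])

def flaky_read():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result

print(retry_with_backoff(flaky_read, sleep=lambda s: None))  # value
```

Size attempts and max_delay so the total retry budget comfortably exceeds cluster-node-timeout plus the promotion time, otherwise the loop gives up just before the shard recovers.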

Split-brain: If a network partition isolates a minority of primaries, those nodes stop accepting writes after cluster-node-timeout elapses. The majority partition promotes new primaries. When the partition heals, the minority-side nodes become replicas and discard any writes they accepted before recognizing the partition. This is why you should not enable cluster-allow-reads-when-down yes on primaries without understanding the consistency implications.
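
When planning node placement across failure domains, it is worth checking which side of each plausible partition could still promote replicas. A small sketch of the majority rule, with a hypothetical function name:

```python
def side_can_promote(total_primaries: int, primaries_on_side: int) -> bool:
    """A partition side can authorize failovers only with a strict majority of all primaries."""
    return primaries_on_side > total_primaries // 2


# Five primaries split 3/2: only the larger side keeps working.
print(side_can_promote(5, 3), side_can_promote(5, 2))  # True False
# Six primaries split evenly: neither side has a majority.
print(side_can_promote(6, 3), side_can_promote(6, 3))  # False False
```

The even split is the case to design out: an odd number of primaries, or an uneven spread across zones, guarantees that one side can always continue.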

Slot migration: When rebalancing the cluster (adding nodes, removing nodes), slots are migrated using CLUSTER SETSLOT / MIGRATE. During migration, requests for keys that have already moved to the target node return ASK redirects. Ensure your Redis client handles ASK redirections. Most do, but older client versions may not.
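
If you need to verify client behavior, or read redirects out of logs, the error payload is easy to parse. The sketch below handles the two redirect forms, `ASK <slot> <host>:<port>` and `MOVED <slot> <host>:<port>`. The Redirect type and function name are hypothetical.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str   # "ASK" or "MOVED"
    slot: int
    host: str
    port: int


def parse_redirect(error_message: str) -> Optional[Redirect]:
    """Parse a cluster redirect error; return None for any other message."""
    parts = error_message.split()
    if len(parts) != 3 or parts[0] not in ("ASK", "MOVED"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


print(parse_redirect("ASK 3999 127.0.0.1:6381"))
# Redirect(kind='ASK', slot=3999, host='127.0.0.1', port=6381)
```

The two kinds call for different handling: MOVED means the slot has a new owner and the client should update its slot map, while ASK applies to the next request only, which must be sent to the target node preceded by ASKING.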

# Check cluster health
redis-cli -h 10.0.1.10 -p 6379 cluster info | grep cluster_state
redis-cli -h 10.0.1.10 -p 6379 cluster nodes | awk '{print $1, $3, $8}'

A healthy cluster returns cluster_state:ok. cluster_state:fail means the node you queried sees at least one slot without an available primary, or cannot reach a majority of primaries.
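
For automated checks, the CLUSTER INFO text can be parsed into fields rather than grepped. A minimal sketch, assuming the raw reply is available as a string of `name:value` lines (the function names are hypothetical; the field names are the ones CLUSTER INFO reports):

```python
def parse_cluster_info(raw: str) -> dict:
    """Turn CLUSTER INFO output into a {field: value} dict of strings."""
    fields = {}
    for line in raw.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep:
            fields[name] = value
    return fields


def cluster_summary(raw: str) -> dict:
    """Extract the fields most useful for a health check."""
    info = parse_cluster_info(raw)
    return {
        "healthy": info.get("cluster_state") == "ok",
        "slots_assigned": int(info.get("cluster_slots_assigned", 0)),
        "slots_fail": int(info.get("cluster_slots_fail", 0)),
    }


sample = "cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_slots_fail:0\r\n"
print(cluster_summary(sample))
# {'healthy': True, 'slots_assigned': 16384, 'slots_fail': 0}
```

Because each node reports its own view, run the check against several nodes; a single node saying ok does not prove the others agree.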

Runbook: Responding to High Memory Usage

Symptom: used_memory approaching maxmemory; eviction rate spiking; application reporting cache misses or OOM write errors.

Step 1 — Confirm scope.

redis-cli -h $HOST info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"
redis-cli -h $HOST info stats | grep evicted_keys

Step 2 — Check fragmentation. If mem_fragmentation_ratio > 1.5, fragmentation may be the cause, not actual data growth.

redis-cli -h $HOST memory purge   # release unused allocator pages to the OS (Redis 4.0+, jemalloc only)

Step 3 — Identify large key contributors.

redis-cli -h $HOST --bigkeys
# Or measure one suspect key directly:
redis-cli -h $HOST memory usage <suspicious_key>

Step 4 — Check TTL hygiene. Unexpired keys from a bug or config change can fill memory silently.

redis-cli -h $HOST info keyspace
# Look for db0:keys=<N>,expires=<M> — if M << N, most keys have no TTL set
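
The same check can run continuously in a metrics job. The sketch below assumes the keyspace section has already been parsed into per-database dicts with keys and expires counts, which is the shape I would expect a client such as redis-py to return for r.info("keyspace"); treat that shape and the function name as assumptions.

```python
def ttl_coverage(keyspace: dict) -> dict:
    """Map each database name to the fraction of its keys that have a TTL set."""
    coverage = {}
    for db, stats in keyspace.items():
        keys = stats["keys"]
        # An empty database has nothing missing a TTL, so report full coverage.
        coverage[db] = stats["expires"] / keys if keys else 1.0
    return coverage


keyspace = {"db0": {"keys": 2_000_000, "expires": 500_000, "avg_ttl": 3_600_000}}
print(ttl_coverage(keyspace))  # {'db0': 0.25}
```

A coverage figure that drops after a deploy is a strong hint that a new code path is writing keys without an expiry.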

Step 5 — Immediate relief options (in order of risk):

Increase maxmemory if headroom exists on the host (CONFIG SET maxmemory 16gb).
Switch policy temporarily to allkeys-lru if currently noeviction, after confirming data loss is acceptable.
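Do not count on BGREWRITEAOF for memory relief: it compacts the AOF on disk and does not lower used_memory, and the fork it requires can temporarily raise memory use through copy-on-write. Run it only if disk space is also under pressure and the host has headroom.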

Step 6 — Durable fix. Investigate root cause: data growth (add a node or increase memory), TTL misconfiguration (add TTLs to key patterns missing them), or a hot write path creating unbounded keys. Set a memory usage alert at 75% of maxmemory so you have lead time next cycle.
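
The 75% alert is simple to express once used_memory and maxmemory are in hand (both appear in INFO memory). A minimal sketch with a hypothetical function name; note that maxmemory reads 0 when no limit is configured, in which case a ratio-based alert cannot fire and you need a host-level alert instead.

```python
def memory_alert(used_memory: int, maxmemory: int, threshold: float = 0.75) -> bool:
    """Return True when used_memory has reached `threshold` of maxmemory."""
    if maxmemory <= 0:
        return False  # no limit configured: nothing to compare against
    return used_memory / maxmemory >= threshold


GIB = 1024 ** 3
print(memory_alert(used_memory=9 * GIB, maxmemory=12 * GIB))  # True
print(memory_alert(used_memory=6 * GIB, maxmemory=12 * GIB))  # False
```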

Closing Thoughts

Redis rewards operators who understand it below the API level. The difference between a cache that silently degrades under load and one that fails predictably and recovers cleanly is almost always in the configuration decisions made before the incident: the right persistence mode for your durability requirements, an eviction policy that matches your data semantics, and a topology that reflects your actual HA and scaling needs. Instrument early, size conservatively, and treat your first production incident as a configuration audit opportunity.

*Zak Hassan is a Staff SRE specializing in distributed systems, caching infrastructure, and production reliability. Find him at zakhassan.com or on LinkedIn.*
