*By Zak Hassan — Staff SRE | May 2026*
Most teams treat Elasticsearch like a black box until something breaks. You stand up a cluster, point the application at it, and things work — until they don't. Then you're staring at a red cluster at 2 AM, unsure whether to restart nodes, force-allocate shards, or call it a data loss event. The problem isn't that Elasticsearch is fragile; it's that the defaults are built for development, not production, and the path from "it works in staging" to "it survives production load" involves a set of operational decisions most teams never make deliberately. This post covers the ones that matter most.
Cluster Health: What Green, Yellow, and Red Actually Mean
The cluster health API is the first thing you check, but the color codes mislead more than they guide if you don't understand what sits underneath them.
curl -s "http://localhost:9200/_cluster/health?pretty"
{
  "cluster_name": "production-search",
  "status": "yellow",
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 45,
  "active_shards": 45,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 45,
  "delayed_unassigned_shards": 0
}
Yellow means all primary shards are assigned and operational, but one or more replica shards are unassigned. On a single-node development cluster this is expected: you cannot place a replica on the same node as its primary, so replicas sit unassigned. On a three-node production cluster, yellow is a warning that your data has no redundancy somewhere. If you lose a node right now, you may lose data or face temporary unavailability.
Red means at least one primary shard is unassigned. Queries against indices with unassigned primaries return partial or no results. This is the state that pages the on-call.
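The color rules above can be sketched as a small function. This is a simplified illustration, not Elasticsearch's own implementation: `cluster_color` is a hypothetical helper, and the input is assumed to be a list of records shaped like the `prirep` and `state` columns of `_cat/shards`.

```python
# Shard states that count as "active" for health purposes (assumption for this sketch).
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def cluster_color(shards):
    """Derive a health color from shard records with 'prirep' ('p'/'r') and 'state'."""
    inactive = [s for s in shards if s["state"] not in ACTIVE_STATES]
    if any(s["prirep"] == "p" for s in inactive):
        return "red"      # at least one primary is not serving
    if inactive:
        return "yellow"   # all primaries fine, some replica is not
    return "green"


sample = [
    {"index": "logs-2026.05", "shard": 0, "prirep": "p", "state": "STARTED"},
    {"index": "logs-2026.05", "shard": 0, "prirep": "r", "state": "UNASSIGNED"},
]
print(cluster_color(sample))
```

The sample mirrors the health output earlier in this section: every primary is started and a replica is unassigned, so the result is yellow.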
The distinction between primary and replica shards matters operationally. A primary shard is the authoritative copy; writes go there first. Replicas serve two purposes: redundancy for node failure, and read scalability (queries can be routed to any copy). When you create an index with number_of_replicas: 1, you double your storage footprint in exchange for fault tolerance and additional read capacity.
# See shard assignment details for unassigned shards
curl -s "http://localhost:9200/_cluster/allocation/explain?pretty" \
  -H "Content-Type: application/json" \
  -d '{"index": "logs-2026.05", "shard": 0, "primary": false}'
The allocation explain API tells you exactly why a shard is unassigned: disk watermarks exceeded, no eligible node, or a node filter mismatch. Read it before you do anything else.
The Shard Sizing Problem
Elasticsearch's most common misconfiguration is shard count. The old default of five primary shards per index (it has been one since version 7.0) was chosen arbitrarily and codified into muscle memory. Teams end up with hundreds of tiny indices, each with five shards, totaling thousands of shards in a cluster that has no business holding that many.
Each shard is a Lucene index. Each Lucene index holds file handles, in-memory segment metadata, and a portion of JVM heap. A cluster with 10,000 shards on three nodes places roughly 3,300 shards on each node. At approximately 1.5KB of heap per shard, that's about 5MB of overhead per node, which is manageable. But each segment within those shards also consumes heap, and an index that is never force-merged accumulates segments aggressively. The real heap pressure comes from the segment count, not the shard count alone.
The operational rule of thumb is to target 10–50GB per shard, with 20–30GB being the sweet spot for most search workloads. You can inspect current shard sizes with:
curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,store&s=store:desc" | head -30
For time-series data, the right approach is rolling daily or weekly indices with a fixed shard count per index, sized so that each shard lands in the 10–50GB range when the index rolls over. If your data volume is 5GB per day, a single primary shard per index is correct. If it's 200GB per day, five shards is reasonable.
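The sizing arithmetic can be sketched in a few lines. Both function names are hypothetical, and the 30GB default target is an assumption taken from the sweet spot quoted above.

```python
import math


def primary_shards_needed(daily_gb, target_shard_gb=30):
    """Smallest primary shard count that keeps each shard at or under the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))


def shard_size_gb(daily_gb, primary_shards):
    """Resulting size of each primary shard for a daily index."""
    return daily_gb / primary_shards


print(primary_shards_needed(5))      # 5GB/day fits in one shard
print(shard_size_gb(200, 5))         # 200GB/day over five shards
print(primary_shards_needed(200))    # shard count for a strict 30GB target
```

With 200GB per day, five shards gives 40GB each, which is inside the 10–50GB band. Holding strictly to a 30GB target would call for seven.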
Index Lifecycle Management
ILM is how you automate the entire lifecycle of time-series indices — from hot ingest to cold archive to deletion — without manual intervention. The policy below captures the common production pattern.
curl -X PUT "http://localhost:9200/_ilm/policy/logs-policy" \
  -H "Content-Type: application/json" \
  -d '{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "30gb",
            "max_age": "1d"
          },
          "set_priority": {"priority": 100}
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "shrink": {"number_of_shards": 1},
          "forcemerge": {"max_num_segments": 1},
          "set_priority": {"priority": 50},
          "allocate": {"require": {"data": "warm"}}
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "s3-snapshots"
          },
          "set_priority": {"priority": 0}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}'
The hot phase handles active ingest. Rollover triggers as soon as either condition is met, whichever comes first. In warm, the index is shrunk to a single shard (reducing overhead) and force-merged to a single segment, which minimizes heap usage and speeds up read-only query performance significantly. In cold, a searchable snapshot mounts the index directly from object storage; the data is off local disk entirely but still queryable, at the cost of higher query latency. This is the right trade-off for logs older than 30 days that are rarely searched.
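To make the timeline concrete, here is a minimal sketch that maps an index age to the phase this policy would place it in. `phase_for_age` is a hypothetical helper, and the thresholds are copied from the `min_age` values above. For rolled-over indices, ILM measures age from the rollover time rather than from index creation.

```python
# (phase, min_age in days), in the order ILM walks through them.
PHASE_MIN_AGE_DAYS = [("hot", 0), ("warm", 2), ("cold", 30), ("delete", 90)]


def phase_for_age(age_days):
    """Return the latest phase whose min_age the index has reached."""
    current = "hot"
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if age_days >= min_age:
            current = phase
    return current


for age in (1, 2, 45, 90):
    print(age, phase_for_age(age))
```

This ignores the time each phase's actions take to run, so a real index may lag slightly behind what the function reports.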
Apply the policy by referencing it in your index template, then verify ILM is moving indices as expected:
curl -s "http://localhost:9200/logs-*/_ilm/explain?pretty" | \
  python3 -c "
import sys, json
data = json.load(sys.stdin)
# One line per index: current phase, current step, and age
for idx, info in data['indices'].items():
    print(f\"{idx}: phase={info.get('phase','?')} step={info.get('step','?')} age={info.get('age','?')}\")"
Slow Query Analysis
Slow queries degrade cluster health insidiously. Unlike a disk watermark breach, they don't produce a clear alert — they manifest as gradual p99 latency creep and thread pool saturation. Enable slow logs at the index level with thresholds tuned to your SLO:
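curl -X PUT "http://localhost:9200/logs-*/_settings" \
  -H "Content-Type: application/json" \
  -d '{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.level": "info"
}'
On 8.x clusters, omit index.search.slowlog.level: the setting was removed in 8.0, and the thresholds alone decide what gets logged. When a slow query appears in the logs, rerunning it with "profile": true in the search body gives a timing breakdown of the execution path, and the Explain API shows how a specific document is scored against the query: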
curl -X GET "http://localhost:9200/logs-current/_explain/doc_id_here" \
  -H "Content-Type: application/json" \
  -d '{
  "query": {
    "match": {"message": "connection refused"}
  }
}'
Three query patterns consistently cause performance problems. Deep pagination using from and size forces Elasticsearch to fetch and discard enormous result sets: at from: 10000, size: 10, every shard must collect its top 10,010 hits and the coordinating node merges all of them to return ten. Use search_after with a PIT (point in time) instead. Leading-wildcard queries like *error cannot use the inverted index efficiently and degrade to scanning every term in the field; reindex with an ngram tokenizer if substring matching is genuinely needed. High-cardinality terms aggregations (collecting per-user or per-session metrics across millions of unique values) balloon memory usage; cap them with shard_size and consider pre-aggregating at write time.
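As a rough sketch of the search_after pattern, the function below builds the request body for each page. `next_page_body` is a hypothetical helper and `@timestamp` is an assumed sort field. The PIT id would come from `POST /logs-*/_pit?keep_alive=1m`, and each body is sent to `/_search` with no index in the path, because the PIT already pins the indices.

```python
import json


def next_page_body(pit_id, last_sort=None, size=10, keep_alive="1m"):
    """Build a search body that pages with search_after instead of from/size."""
    body = {
        "size": size,
        "query": {"match": {"message": "connection refused"}},
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        # _shard_doc is a tiebreaker so that the sort order is total.
        "sort": [{"@timestamp": "asc"}, {"_shard_doc": "asc"}],
    }
    if last_sort is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = last_sort
    return body


first_page = next_page_body("example-pit-id")
second_page = next_page_body("example-pit-id", last_sort=[1746057600000, 42])
print(json.dumps(second_page, indent=2))
```

Each shard only has to collect `size` hits past the cursor, so the cost per page stays flat no matter how deep the client pages.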
JVM and Memory Tuning
Elasticsearch's memory model has two parts: the JVM heap and the OS filesystem cache. Both matter. The heap holds segment metadata, filter caches, query caches, and field data. The filesystem cache holds the actual Lucene segment bytes and is what makes repeated queries fast. If you give Elasticsearch too much heap, you starve the filesystem cache and performance degrades.
The rules are firm: set the JVM heap to no more than 50% of available RAM, and never exceed 32GB. The 32GB ceiling exists because the JVM uses compressed ordinary object pointers (compressed OOPs) below that threshold, which reduces memory overhead by 30–40%. Crossing it causes a sudden jump in heap consumption for the same data. Set heap via jvm.options:
-Xms16g
-Xmx16g
Always set Xms and Xmx to the same value to avoid heap resizing at runtime. Monitor GC pressure through the nodes stats API:
curl -s "http://localhost:9200/_nodes/stats/jvm?pretty" | \
  python3 -c "
import sys, json
data = json.load(sys.stdin)
# One line per node: heap usage and cumulative old-generation GC time
for node, info in data['nodes'].items():
    name = info['name']
    heap_pct = info['jvm']['mem']['heap_used_percent']
    gc_old = info['jvm']['gc']['collectors']['old']['collection_time_in_millis']
    print(f'{name}: heap={heap_pct}% old_gc_ms={gc_old}')"
The GC time reported here is cumulative since node start, so compare successive samples. Old-generation GC time growing by more than 10 seconds per minute is a sign of heap pressure. The fix is usually a combination of reducing shard count, force-merging old indices, and reviewing field data usage on text fields with high cardinality.
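Turning two readings of the old-generation counter into a per-minute figure might look like the sketch below. `old_gc_seconds_per_minute` is a hypothetical helper, and treating a decreasing counter as a node restart is an assumption of this sketch.

```python
def old_gc_seconds_per_minute(prev_ms, curr_ms, interval_s):
    """Convert two cumulative collection_time_in_millis samples into seconds of GC per minute."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_ms = curr_ms - prev_ms
    if delta_ms < 0:
        # Counter went backwards: assume the node restarted and the count began again from zero.
        delta_ms = curr_ms
    return (delta_ms / 1000.0) * (60.0 / interval_s)


# Two samples taken 60 seconds apart.
rate = old_gc_seconds_per_minute(120_000, 135_000, 60)
print(rate, "heap pressure" if rate > 10 else "ok")
```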
Monitoring Key Metrics
A minimal but complete Elasticsearch monitoring setup watches six metrics. Here is a Python poller that emits structured output suitable for ingestion into any observability platform:
import json
import time

import requests

ES_HOST = "http://localhost:9200"
STATUS_CODES = {"green": 0, "yellow": 1, "red": 2}


def collect_metrics():
    """Collect six cluster-level metrics from the health and node stats APIs."""
    metrics = {}
    health = requests.get(f"{ES_HOST}/_cluster/health", timeout=10).json()
    metrics["cluster_status"] = STATUS_CODES[health["status"]]
    metrics["unassigned_shards"] = health["unassigned_shards"]

    stats = requests.get(f"{ES_HOST}/_nodes/stats/jvm,indices,thread_pool", timeout=10).json()
    heap_pcts, query_times, rejections, segments = [], [], [], []
    for node in stats["nodes"].values():
        heap_pcts.append(node["jvm"]["mem"]["heap_used_percent"])
        query_times.append(node["indices"]["search"]["query_time_in_millis"])
        rejections.append(node["thread_pool"]["search"]["rejected"])
        segments.append(node["indices"]["segments"]["count"])

    # Heap is reported for the worst node; the other values are summed across nodes.
    metrics["jvm_heap_used_percent_max"] = max(heap_pcts)
    metrics["search_query_time_ms_total"] = sum(query_times)
    metrics["search_thread_pool_rejected_total"] = sum(rejections)
    metrics["segment_count_total"] = sum(segments)
    return metrics


while True:
    print(json.dumps({"timestamp": time.time(), **collect_metrics()}))
    time.sleep(60)
Alert thresholds that reflect production reality rather than arbitrary defaults:
# Prometheus alerting rules (adapt for your platform)
groups:
  - name: elasticsearch
    rules:
      - alert: ESClusterRed
        expr: cluster_status == 2
        for: 1m
        labels:
          severity: critical
      - alert: ESClusterYellowExtended
        expr: cluster_status == 1
        for: 15m
        labels:
          severity: warning
      - alert: ESHeapPressure
        expr: jvm_heap_used_percent_max > 85
        for: 5m
        labels:
          severity: warning
      - alert: ESSearchRejections
        expr: rate(search_thread_pool_rejected_total[5m]) > 0
        for: 2m
        labels:
          severity: critical
      - alert: ESSegmentCountHigh
        expr: segment_count_total > 10000
        for: 30m
        labels:
          severity: warning
Search thread pool rejections are the most critical of these. A non-zero rejection rate means queries are being dropped. The cluster is resource-saturated and clients are receiving errors right now.
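The rules above assume the poller's metrics reach Prometheus under the same names. One possible bridge, sketched here under that assumption, is to render the metrics dictionary in the Prometheus text exposition format and serve it from an HTTP endpoint or drop it into a file for a textfile collector. `to_prometheus_text` and the `COUNTERS` set are hypothetical names.

```python
# Metrics that only ever increase; everything else is treated as a gauge.
COUNTERS = {"search_query_time_ms_total", "search_thread_pool_rejected_total"}


def to_prometheus_text(metrics):
    """Render a flat {name: number} dict in the Prometheus text exposition format."""
    lines = []
    for name in sorted(metrics):
        kind = "counter" if name in COUNTERS else "gauge"
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"


sample = {"cluster_status": 1, "search_thread_pool_rejected_total": 4}
print(to_prometheus_text(sample), end="")
```

Declaring the rejection total as a counter matters because the `rate()` expression in the ESSearchRejections rule only makes sense on a monotonically increasing series.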
Recovering from Red Cluster State
When you hit red, work through this checklist in order. Do not skip steps and do not force-allocate shards until you have exhausted every other option — forced allocation can overwrite newer data with older copies.
First, establish what is unassigned and why:
curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state" | grep UNASSIGNED
curl -s "http://localhost:9200/_cluster/allocation/explain?pretty"
Check disk usage. Elasticsearch stops allocating shards to a node once it passes the low watermark (85% by default) and starts moving shards off it at the high watermark (90% by default):
curl -s "http://localhost:9200/_cat/allocation?v&h=node,disk.percent,shards"
If disk is the issue, free space first. Temporarily raising the watermark buys time but does not fix the underlying problem. If a node is missing entirely, verify it is running and network-reachable before touching any allocation settings.
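The disk check can be automated with a short sketch like the one below. It assumes the rows come from `_cat/allocation?format=json`, where values are strings and the pseudo-row for unassigned shards carries no disk figures. `nodes_over_watermark` is a hypothetical helper, and the 90% default mirrors the high watermark.

```python
def nodes_over_watermark(rows, high_pct=90):
    """Return (node, disk_percent) pairs for nodes above the given disk usage."""
    flagged = []
    for row in rows:
        pct = row.get("disk.percent")
        if pct is None:
            # The UNASSIGNED pseudo-row has no disk numbers; skip it.
            continue
        if int(pct) > high_pct:
            flagged.append((row["node"], int(pct)))
    return flagged


rows = [
    {"node": "es-node-1", "disk.percent": "92", "shards": "30"},
    {"node": "es-node-2", "disk.percent": "71", "shards": "30"},
    {"node": "UNASSIGNED", "disk.percent": None, "shards": "45"},
]
print(nodes_over_watermark(rows))
```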
If the cluster has lost a node that is not coming back and you need to recover from available copies:
# Retry allocation of all unassigned shards (uses existing copies)
curl -X POST "http://localhost:9200/_cluster/reroute?retry_failed=true"
Force allocation is the nuclear option: it allows assigning a primary from a potentially stale copy, acknowledging possible data loss:
curl -X POST "http://localhost:9200/_cluster/reroute" \
  -H "Content-Type: application/json" \
  -d '{
  "commands": [{
    "allocate_stale_primary": {
      "index": "logs-2026.05.01",
      "shard": 0,
      "node": "es-node-1",
      "accept_data_loss": true
    }
  }]
}'
Use this only when the alternative is permanent unavailability and you have confirmed the node containing the authoritative copy is gone. After the cluster returns to green, audit your ILM policies and snapshot schedule to understand how you got here and close the gap.
The clusters that survive production are not the ones that never have problems — they are the ones where operators made deliberate decisions about shard sizing, ILM, and monitoring before the incident, so when things go wrong, the path back is already understood.
*Zak Hassan is a Staff SRE specializing in distributed search systems, observability, and reliability engineering. Find him at zakhassan.com or on LinkedIn.*