Kafka Operations: Consumer Lag, Partition Management, and Reliability

*By Zak Hassan — Staff SRE | May 2026*

Most teams that adopt Kafka treat it like a faster, distributed message queue. They monitor queue depth, they alert on "is the queue empty," and they celebrate when consumer throughput catches up to producer throughput. This mental model works fine until it doesn't — until a consumer group silently falls hours behind, until a partition rebalance causes cascading timeouts, until a broker failure triggers an under-replicated partition storm that takes down a downstream service. The problem is that Kafka is not a queue. It is a distributed commit log, and operating it well requires internalizing that distinction at every level of your monitoring, alerting, and incident response practice.

The Log Abstraction and What It Changes

In a traditional message queue, consuming a message removes it. The queue's size is the primary operational signal. Kafka inverts this: messages are written to an immutable, ordered log and retained for a configured period regardless of whether any consumer has read them. Consumers track their own position in the log using an offset — a monotonically increasing integer per partition. Multiple consumer groups can read the same topic independently, each with its own offset pointer.

This changes what "caught up" means operationally. A consumer group is current when its committed offset equals the log end offset for every partition it owns. The gap between those two numbers is consumer lag — the single most important metric in Kafka operations.

Consumer group offsets are committed back to Kafka itself (in the internal __consumer_offsets topic), which means offset management is part of your reliability surface. An application that crashes before committing offsets will reprocess messages on restart. An application that commits offsets too eagerly — before processing is confirmed — will silently drop messages on crash. Getting this right requires deliberate design.

Consumer Lag: Your Primary Operational Metric

Consumer lag quantifies how far behind a consumer group is relative to the latest produced message, measured in number of messages per partition. A lag of zero means the consumer is processing in real time. A lag of ten million means the consumer is hours or days behind, depending on throughput.

Lag grows when the consumer processing rate is lower than the producer write rate. The causes are numerous: slow downstream dependencies (a database that's under load, an API that's rate limiting), GC pauses, network saturation, insufficient consumer instances, or a consumer that crashed and left its partitions unassigned during rebalance.

Measuring lag from the command line is straightforward:

# Check lag for a specific consumer group
kafka-consumer-groups.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --describe \
  --group my-service-consumer-group

# Output: TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, CONSUMER-ID, HOST, CLIENT-ID

# List all consumer groups
kafka-consumer-groups.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --list

# Reset offsets to earliest (useful for reprocessing)
kafka-consumer-groups.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --group my-service-consumer-group \
  --topic my-topic \
  --reset-offsets \
  --to-earliest \
  --execute

For production alerting, you want lag exposed to Prometheus. The Confluent JMX exporter or kafka_exporter (by Daniel Czerwonk) exposes consumer group lag as a gauge. A Prometheus alert rule for lag looks like this:

groups:
  - name: kafka_consumer_alerts
    rules:
      - alert: KafkaConsumerGroupLagHigh
        expr: |
          kafka_consumergroup_lag{consumergroup="my-service-consumer-group"} > 100000
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lag is high"
          description: >
            Consumer group {{ $labels.consumergroup }} on topic {{ $labels.topic }}
            partition {{ $labels.partition }} has lag of {{ $value }} messages.
            Current value has been above threshold for 5 minutes.

      - alert: KafkaConsumerGroupLagCritical
        expr: |
          kafka_consumergroup_lag{consumergroup="my-service-consumer-group"} > 1000000
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lag is critical"
          description: >
            Consumer group {{ $labels.consumergroup }} lag has exceeded 1M messages.
            Immediate investigation required.

Alert thresholds need to be set relative to your topic's write rate and your SLA on processing latency. A lag of 100,000 messages on a topic that receives 10,000 messages per second represents ten seconds of latency — acceptable for many workloads. The same lag on a topic that receives 100 messages per minute represents nearly seventeen hours of backlog.

Partition Count Decisions

Partitions are the unit of parallelism in Kafka. Each partition is an ordered, immutable sequence of records assigned to exactly one consumer instance per consumer group at any given time. More partitions means more parallelism — up to the number of consumer instances you can run.

The challenge is that partition count is effectively permanent. Kafka allows you to increase partitions after topic creation, but it does not rearrange existing messages — messages already written to partitions 0-7 stay there when you expand to 16 partitions. For topics using key-based partitioning (where message routing depends on hash(key) % partition_count), adding partitions breaks the key-to-partition mapping for all future writes. Consumers expecting all messages for a given key to land in the same partition will see split history. In practice, this means you almost never increase partition count on a production topic with key-based partitioning without a coordinated migration.

Sizing partitions for throughput: a single partition on commodity hardware can sustain roughly 10-50 MB/s of write throughput, depending on message size, replication factor, and whether compression is enabled. Start with a partition count that gives you 2-3x your current peak throughput headroom, and make sure it is divisible by your expected consumer count so partitions distribute evenly.

# Create a topic with appropriate partition count and replication
kafka-topics.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --create \
  --topic my-high-throughput-topic \
  --partitions 24 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config min.insync.replicas=2

# Describe a topic to check its configuration
kafka-topics.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --describe \
  --topic my-high-throughput-topic

Too many partitions creates its own problems. Each partition is a file handle on the broker. At 100,000 partitions per cluster, leader election times balloon, controller failover slows, and JVM garbage collection pressure increases. Keep total partition count per broker under 4,000 as a practical ceiling.

Consumer Group Rebalancing

A rebalance is the process by which Kafka redistributes partition assignments across the members of a consumer group. It is triggered by any membership change: a new consumer joining, an existing consumer leaving (gracefully or via timeout), or a change in the topic's partition count. During a classic eager rebalance, all consumers in the group stop processing, revoke all their partition assignments, and wait for the group coordinator to issue new assignments. This stop-the-world pause can last seconds to tens of seconds depending on group size and coordinator load.

In a high-throughput service, a rebalance triggered by a rolling deployment can cause a visible spike in consumer lag. Deployments that cycle through thirty consumer instances will trigger thirty sequential rebalances if using the default eager protocol.

Modern Kafka clients (Kafka 2.4+ with CooperativeStickyAssignor) address this with cooperative incremental rebalancing. Instead of revoking all assignments, the group coordinator issues a diff: consumers only revoke the specific partitions being moved. Consumers that retain their assignments never pause.

from confluent_kafka import Consumer, KafkaError, KafkaException
import logging

logger = logging.getLogger(__name__)

def create_consumer(bootstrap_servers: str, group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "auto.offset.reset": "earliest",
        # Disable auto-commit — offsets are managed manually
        "enable.auto.commit": False,
        # Use cooperative sticky rebalancing to minimize stop-the-world pauses
        "partition.assignment.strategy": "cooperative-sticky",
        # Heartbeat and session timeout tuning
        "session.timeout.ms": 30000,
        "heartbeat.interval.ms": 3000,
        "max.poll.interval.ms": 300000,
    })

def process_messages(consumer: Consumer, topic: str):
    consumer.subscribe([topic])
    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    # Reached end of partition — not an error
                    continue
                raise KafkaException(msg.error())

            try:
                # Your processing logic here
                handle_message(msg)
                # Commit after successful processing — at-least-once semantics
                consumer.commit(message=msg, asynchronous=False)
            except Exception as e:
                logger.error(
                    "Failed to process message offset=%d partition=%d: %s",
                    msg.offset(), msg.partition(), e
                )
                # Do not commit — message will be redelivered on restart
                # Depending on your error handling strategy, you may want to
                # send to a dead-letter topic instead of blocking here
                raise
    finally:
        consumer.close()

def handle_message(msg):
    value = msg.value().decode("utf-8")
    logger.info("Processing key=%s partition=%d offset=%d",
                msg.key(), msg.partition(), msg.offset())
    # ... actual business logic

Topic Management at Scale

Retention policy is a first-class operational concern. Kafka supports two retention dimensions: time (retention.ms) and size (retention.bytes). Whichever limit is hit first triggers log segment deletion. For event-driven pipelines where consumers are expected to stay current, time-based retention of seven to fourteen days is typical. For audit logs or compliance data, you may need thirty to ninety days, which has direct implications for broker disk provisioning.

Compacted topics change the semantics entirely. With cleanup.policy=compact, Kafka guarantees that the log will retain at least the most recent value for each key. Old values for a key are eligible for deletion during compaction. This makes compacted topics suitable for changelog streams and state materialization — the topic functions as a durable, compacted key-value store that can be fully replayed to reconstruct current state.

# Create a compacted topic for a changelog stream
kafka-topics.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --create \
  --topic user-profile-changelog \
  --partitions 12 \
  --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.1 \
  --config segment.ms=3600000

# Alter retention on an existing topic
kafka-configs.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --alter \
  --entity-type topics \
  --entity-name my-topic \
  --add-config retention.ms=259200000

Monitoring Kafka with the Prometheus JMX Exporter

Kafka exposes extensive metrics over JMX. The Prometheus JMX exporter runs as a Java agent alongside each broker and translates JMX MBeans into Prometheus-scrapable metrics. The metrics that matter most split into broker health and consumer health.

Key broker metrics:

kafka_server_replicamanager_underreplicatedpartitions — should be zero at all times. Any non-zero value means at least one partition's replica set is incomplete.
kafka_controller_controllerstats_leaderelectionrateandtimems_rate — leader elections per second. Sporadic is acceptable; sustained elevation indicates broker instability.
kafka_network_requestmetrics_requesthandleravgidlepercent — broker request handler thread idle percentage. Below 30% indicates the broker is CPU-bound on request processing.

groups:
  - name: kafka_broker_alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Kafka broker {{ $labels.instance }} has under-replicated partitions"
          description: >
            {{ $value }} partitions are under-replicated on {{ $labels.instance }}.
            Data durability is compromised. Investigate broker health immediately.

      - alert: KafkaBrokerRequestHandlerIdleLow
        expr: kafka_network_requestmetrics_requesthandleravgidlepercent < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka broker request handler threads near saturation"
          description: >
            Request handler idle percent on {{ $labels.instance }} is {{ $value | humanizePercentage }}.
            Broker may be approaching throughput limits.

Failure Modes: Broker Failures, Leader Election, and ISR Shrinkage

When a broker fails, Kafka must elect new leaders for every partition that broker led. With the KRaft controller (Kafka 3.x+), this process is faster than the old ZooKeeper-based election, but it is not instantaneous. During election, producers configured with acks=all will receive NOT_LEADER_FOR_PARTITION errors and retry. Consumers will see fetch errors and reconnect. The Kafka client libraries handle this automatically — your concern is whether your retry configuration and timeout settings give clients enough time to recover without surfacing errors to end users.

ISR (In-Sync Replica) shrinkage is a leading indicator of trouble. A replica falls out of the ISR when it cannot keep up with the leader — typically due to network saturation, disk I/O pressure, or GC pauses. When the ISR drops below min.insync.replicas, producers configured with acks=all begin receiving errors. This is the configuration that protects you from silent data loss: do not lower min.insync.replicas under operational pressure without understanding the durability tradeoff.

Operationally responding to ISR shrinkage:

# Check which partitions are out of sync
kafka-topics.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --describe \
  --under-replicated-partitions

# Check which topics have offline partitions
kafka-topics.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --describe \
  --unavailable-partitions

# Trigger preferred replica election after a broker returns
kafka-leader-election.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --election-type preferred \
  --all-topic-partitions

After a broker recovers from a failure, it rejoins the cluster as a follower and begins catching up with partition leaders. Once it is fully caught up, it re-enters the ISR. Only after ISR membership is restored does the cluster return to full durability. Monitor under-replicated-partitions and do not declare an incident resolved until it reads zero.

The recurring pattern across all of these operational areas — lag management, partition sizing, rebalance behavior, retention policy, failure response — is that Kafka's reliability properties emerge from the intersection of configuration choices made at topic creation, client configuration at deployment time, and real-time operational response. There is no single knob to turn. Building robust Kafka operations means encoding these relationships into your blog posts, your alerting thresholds, and your deployment practices before an incident forces the question.

*Zak Hassan is a Staff SRE specializing in distributed systems reliability, observability, and streaming infrastructure. Find him at zakhassan.com or on LinkedIn.*

Topic Paths

SRE and Reliability Kubernetes and Platform Engineering Observability and Incident Learning Cloud Cost and Capacity

About the Author

Zak Hassan writes about reliability engineering under real scale constraints.

Staff-level SRE and platform engineer focused on identity reliability, Kubernetes, observability, cloud architecture, AI infrastructure, and reducing operational uncertainty.

Connect on LinkedIn