Kafka Reliability at Scale: The Operator's Field Guide

Apache Kafka is the backbone of modern event-driven architecture. It's also one of the more operationally demanding systems in the data infrastructure stack. Kafka can handle extraordinary throughput reliably — but only if you understand its failure modes, tune it correctly, and have monitoring that tells you when something is going wrong before your consumers notice.

This is the operational field guide: the decisions that matter, the failure modes you'll encounter, and the monitoring you should have in place before you need it.

The Replication Factor Decision

The most consequential configuration decision for Kafka reliability is the replication factor, which determines how many copies of each partition exist, each on a different broker. A replication factor of 1 means one copy: lose that broker, lose that data. A replication factor of 3 means each partition is stored on three brokers. Reads stay available as long as one in-sync replica survives; whether writes stay available depends on your min.insync.replicas setting.

The interplay between replication.factor, min.insync.replicas, and the producer acks setting determines your actual durability and availability guarantees:

# The production-safe configuration:
replication.factor = 3         # Data on 3 brokers
min.insync.replicas = 2        # Producer write requires 2 brokers to acknowledge
acks = all                     # Producer waits for all ISR replicas to acknowledge

# What this means:
# - Can tolerate 1 broker failure without data loss
# - Can tolerate 1 broker failure without write availability loss
# - Write latency increases slightly (waiting for every in-sync replica, at least 2, instead of the leader alone)

# The dangerous configuration teams accidentally use:
replication.factor = 3
min.insync.replicas = 1        # Only 1 broker needs to ack
acks = all

# What this means:
# - "Replication factor 3" provides false confidence
# - acks=all waits only for replicas currently in the ISR, and with
#   min.insync.replicas=1 the ISR may shrink to the leader alone
# - If the leader is the only in-sync replica and fails after the ack, data is lost

For critical topics (payment events, audit logs, anything requiring at-least-once delivery guarantees), use replication.factor=3, min.insync.replicas=2, and acks=all. Accept the small latency overhead; it's the right tradeoff.
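To make the interaction concrete, here is a minimal sketch that derives two numbers from a topic's settings when producers use acks=all: how many broker failures a partition can absorb before writes are rejected, and how many copies a write is guaranteed to have when it is acknowledged. The class and function names are illustrative and not part of any Kafka API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurabilityProfile:
    # Broker failures a partition can absorb while acks=all writes still succeed.
    failures_before_writes_blocked: int
    # Minimum number of replicas holding a write at the moment it is acknowledged.
    guaranteed_copies_at_ack: int


def durability_profile(replication_factor: int, min_insync_replicas: int) -> DurabilityProfile:
    """Summarize the guarantees of a topic written with acks=all."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and replication.factor")
    return DurabilityProfile(
        failures_before_writes_blocked=replication_factor - min_insync_replicas,
        guaranteed_copies_at_ack=min_insync_replicas,
    )


safe = durability_profile(replication_factor=3, min_insync_replicas=2)
risky = durability_profile(replication_factor=3, min_insync_replicas=1)
```

With these inputs, `safe` tolerates one failure and guarantees two copies per acknowledged write, while `risky` keeps accepting writes through two failures but guarantees only a single copy, which is the gap that makes data loss possible.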

Consumer Lag: The Most Important Metric

Consumer lag — the number of messages in a partition that haven't been consumed yet — is the primary signal of consumer health. A consumer that's keeping up has near-zero lag. A consumer that's falling behind has growing lag. A consumer that's completely stopped has lag equal to the number of unconsumed messages since it stopped.

Lag is measured per consumer group, per topic, per partition. The aggregate (sum across all partitions) is the top-level health indicator; the per-partition breakdown tells you whether the lag is uniform or concentrated in specific partitions (which points to different root causes).

# Kafka consumer lag monitoring via kafka-python
from kafka import KafkaConsumer, TopicPartition
import boto3

def get_consumer_lag(bootstrap_servers: str, group_id: str, topic: str) -> dict:
    # Read-only use of the group: never commit offsets from the monitoring process
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        enable_auto_commit=False,
    )
    try:
        # partitions_for_topic returns None when the topic is unknown
        partitions = consumer.partitions_for_topic(topic)
        if not partitions:
            raise ValueError(f"Topic not found or has no partitions: {topic}")
        topic_partitions = [TopicPartition(topic, p) for p in partitions]

        # Get current consumer offsets; a partition with no committed
        # offset is treated as offset 0
        committed_offsets = {
            tp: consumer.committed(tp) or 0
            for tp in topic_partitions
        }

        # Get end offsets (latest available messages)
        end_offsets = consumer.end_offsets(topic_partitions)
    finally:
        consumer.close()

    # Calculate lag per partition
    lag_by_partition = {
        tp.partition: end_offsets[tp] - committed_offsets[tp]
        for tp in topic_partitions
    }

    return {
        "total_lag": sum(lag_by_partition.values()),
        "by_partition": lag_by_partition,
        "max_partition_lag": max(lag_by_partition.values()),
        "lagging_partitions": [p for p, lag in lag_by_partition.items() if lag > 1000]
    }

# Emit to CloudWatch for alerting
def publish_lag_metrics(lag_data: dict, group_id: str, topic: str):
    cw = boto3.client('cloudwatch')
    cw.put_metric_data(
        Namespace='Kafka/ConsumerGroups',
        MetricData=[
            {
                'MetricName': 'ConsumerLag',
                'Dimensions': [
                    {'Name': 'ConsumerGroup', 'Value': group_id},
                    {'Name': 'Topic', 'Value': topic}
                ],
                'Value': lag_data['total_lag'],
                'Unit': 'Count'
            }
        ]
    )

Alert thresholds for consumer lag should be calibrated to your message production rate. An alert at 10,000 messages lag is meaningless if you produce 50,000 messages/second (it's 0.2 seconds of lag). An alert at 1,000 messages lag is critical if you produce 100 messages/second and your SLA requires processing within 5 seconds, because that backlog is already 10 seconds of production.

Alert on lag rate of change, not just absolute lag. A consumer that's currently at 50,000 messages lag but catching up is different from one at 5,000 messages lag and growing at 500 messages/second.
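Both ideas, lag expressed in time and lag trend, reduce to simple arithmetic. The sketch below is one way to compute them from periodic lag samples; the function names and the (timestamp, lag) sample format are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def lag_seconds(lag_messages: int, produce_rate_per_sec: float) -> float:
    """Express a backlog as seconds of production at the current rate."""
    if produce_rate_per_sec <= 0:
        raise ValueError("produce_rate_per_sec must be positive")
    return lag_messages / produce_rate_per_sec


def lag_growth_per_sec(samples: Sequence[Tuple[float, int]]) -> float:
    """Average lag change per second between the first and last sample.

    Each sample is (timestamp_in_seconds, total_lag). A positive result means
    the consumer is falling behind; a negative one means it is catching up.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be increasing")
    return (lag1 - lag0) / (t1 - t0)


# The two scenarios from the text, sampled 10 seconds apart.
catching_up = lag_growth_per_sec([(0.0, 55_000), (10.0, 50_000)])
falling_behind = lag_growth_per_sec([(0.0, 0), (10.0, 5_000)])
```

Here `catching_up` is negative and `falling_behind` is positive, even though the first consumer has ten times the absolute lag. An alert rule can combine a time-based threshold from `lag_seconds` with a sustained positive growth rate.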

Partition Design: Get It Right at Creation

Kafka partitions are the unit of parallelism. More partitions = higher potential throughput and more consumer parallelism. But partitions are also the unit of replication overhead, leader election, and metadata management.

The rules:

You can increase partitions; you cannot decrease them. Adding partitions to an existing topic is supported. Removing them is not. For keyed topics, though, adding partitions changes which partition each key maps to, which breaks per-key ordering across the change. This means underprovisioning partitions is a worse mistake than overprovisioning: a keyed topic that needs more partitions requires either a new topic (with data migration) or accepting the throughput ceiling.

Consumer parallelism is bounded by partition count. A consumer group cannot have more active consumers than partitions. If you create a topic with 10 partitions and try to scale your consumer group to 20 instances, 10 of those instances will sit idle. Design partition count with your maximum consumer parallelism in mind.

Key-based ordering is within a partition, not across. If your consumers need to see all messages for a given key in order (all events for a given user ID, all operations for a given order), those messages must land in the same partition. The default partitioner hashes the message key (murmur2 in the Java client) and takes the result modulo the partition count, so the same key always goes to the same partition. This holds only as long as you don't change the partition count; changing it remaps many keys to different partitions.
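The effect of a partition count change on key placement can be seen with a small simulation. This sketch models the partitioner as hash(key) modulo partition count and uses zlib.crc32 as a stand-in hash, which is an assumption for illustration; it is not the hash Kafka clients use, so the partition numbers will not match a real cluster.

```python
import zlib
from typing import Iterable


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition as hash(key) mod N (crc32 is a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


def remapped_fraction(keys: Iterable[bytes], old_count: int, new_count: int) -> float:
    """Fraction of keys that land on a different partition after a resize."""
    keys = list(keys)
    moved = sum(
        1 for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    )
    return moved / len(keys)


user_keys = [f"user-{i}".encode() for i in range(1000)]
unchanged = remapped_fraction(user_keys, 10, 10)   # same count: nothing moves
after_resize = remapped_fraction(user_keys, 10, 20)
```

With the count unchanged, no key moves. After the resize a substantial share of keys map to a new partition, so later messages for those keys are no longer ordered relative to the earlier ones.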

Recommended starting point: For new topics where you don't know the volume yet, start with partitions = (expected peak messages/second) / (your consumer throughput per instance). Then multiply by 2 for growth headroom.
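As a worked example of that starting point, the following sketch rounds the base count up to a whole number of consumers and then applies the growth multiplier. The function name and the sample rates are hypothetical.

```python
import math


def starting_partition_count(
    peak_msgs_per_sec: float,
    per_consumer_msgs_per_sec: float,
    headroom: float = 2.0,
) -> int:
    """Partitions = ceil(peak rate / per-consumer rate), times growth headroom."""
    if peak_msgs_per_sec <= 0 or per_consumer_msgs_per_sec <= 0:
        raise ValueError("rates must be positive")
    if headroom < 1:
        raise ValueError("headroom must be at least 1")
    base = math.ceil(peak_msgs_per_sec / per_consumer_msgs_per_sec)
    return math.ceil(base * headroom)


# 20,000 msg/s at peak, each consumer instance handling 2,500 msg/s:
# 8 consumers needed at peak, doubled for growth.
partitions = starting_partition_count(20_000, 2_500)
```

The result also sets the ceiling on consumer parallelism: a group reading this topic cannot usefully run more instances than `partitions`.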

Exactly-Once Semantics (EOS): The Full Picture

Kafka's exactly-once semantics guarantee that each message is delivered and processed exactly once: no duplicates, no losses. The guarantee covers reads, processing, and writes that stay inside Kafka; side effects in external systems are not included. EOS requires coordination across producers, brokers, and consumers, and it comes with specific configuration requirements and performance tradeoffs.

Producer side: Enable idempotent producers and transactions.

from kafka import KafkaProducer

# Requires a kafka-python release with transactional producer support
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    enable_idempotence=True,         # Prevents duplicate delivery on retries
    acks='all',                      # All ISR replicas must acknowledge
    retries=5,
    max_in_flight_requests_per_connection=5,  # Must be 5 or fewer with idempotence
    transactional_id='my-service-producer-1'  # Required for transactions
)

producer.init_transactions()

try:
    producer.begin_transaction()
    producer.send('output-topic', key=b'key', value=b'value')
    # offsets and group_metadata are placeholders: they must come from the
    # consumer that read the input records for this transaction
    producer.send_offsets_to_transaction(
        offsets,
        group_metadata
    )
    producer.commit_transaction()
except Exception:
    # Abort so read_committed consumers never see the partial output
    producer.abort_transaction()
    raise

Consumer side: Set isolation.level=read_committed to only read messages from committed transactions.

The EOS tradeoff: ~20-30% throughput reduction compared to at-least-once, and increased producer complexity. For most streaming workloads, idempotent downstream processing (deduplicate at the consumer) achieves equivalent practical semantics at lower cost. Reserve true EOS for financial or audit-critical pipelines where the correctness guarantee is worth the overhead.
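Consumer-side deduplication can be sketched as follows. This minimal example assumes every message carries a unique ID (an assumption not stated above) and keeps recently seen IDs in memory. The class name is hypothetical, and a production version would typically persist the IDs or make the downstream write itself idempotent, since this state is lost on restart.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class Deduplicator:
    """Remembers the most recent `capacity` message IDs and skips repeats."""

    def __init__(self, capacity: int = 100_000) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._seen: "OrderedDict[Hashable, None]" = OrderedDict()
        self._capacity = capacity

    def process_once(self, message_id: Hashable, handler: Callable[[], None]) -> bool:
        """Run handler unless message_id was already processed.

        Returns True if the handler ran, False if the message was a duplicate.
        """
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep hot IDs from being evicted
            return False
        handler()
        # Record the ID only after the handler succeeds, so a failed
        # attempt is retried on redelivery.
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True


processed = []
dedup = Deduplicator(capacity=1000)
first = dedup.process_once("order-17", lambda: processed.append("order-17"))
second = dedup.process_once("order-17", lambda: processed.append("order-17"))
```

The second call is a redelivery of the same ID, so the handler is skipped and `processed` contains the order only once. The bounded capacity is a design tradeoff: IDs older than the window can be processed again.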

Broker Health Monitoring

Beyond consumer lag, the broker-level metrics to watch:

Under-replicated partitions: Any value above zero means at least one partition doesn't have its full complement of in-sync replicas. Under-replication can be caused by broker failures, network issues between brokers, or brokers that are too slow to keep up with replication throughput.

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

Alert: any value > 0 sustained for more than 2 minutes.

Active controller count: Exactly one broker in the cluster should be the active controller at any time. Each broker reports 1 if it is the active controller and 0 otherwise, so the sum across the cluster should be exactly 1. A sum of zero means no leader election is possible. More than one indicates a split-brain condition.

kafka.controller:type=KafkaController,name=ActiveControllerCount

Alert: cluster-wide sum other than 1.

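Request handler idle ratio: The fraction of time the broker's request handler threads are idle. If it drops below 0.3 (threads busy more than 70% of the time), the broker is approaching saturation. Add brokers or increase num.io.threads.

kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent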

Log flush latency: How long it takes to flush messages to disk. Spikes indicate disk I/O contention, which degrades both throughput and latency.
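The alert rules above can be expressed as a small evaluation function. This is a sketch under stated assumptions: the samples are collected elsewhere (for example by a JMX exporter) and arrive in timestamp order, the controller count is already summed across brokers, and the class, field, and alert names are hypothetical. Log flush latency is left out because no threshold is given for it.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class BrokerSample:
    timestamp: float                    # seconds, increasing across samples
    under_replicated_partitions: int
    active_controller_count: int        # summed across all brokers
    request_handler_idle_ratio: float   # 0.0 (saturated) to 1.0 (idle)


def evaluate_broker_health(
    samples: Sequence[BrokerSample],
    urp_sustain_sec: float = 120.0,
    min_idle_ratio: float = 0.3,
) -> List[str]:
    """Return the names of the alerts that should fire for the latest sample."""
    if not samples:
        return []
    latest = samples[-1]
    alerts: List[str] = []

    # Walk backwards to find when the current run of under-replication began.
    urp_since = None
    for sample in reversed(samples):
        if sample.under_replicated_partitions <= 0:
            break
        urp_since = sample.timestamp
    if urp_since is not None and latest.timestamp - urp_since > urp_sustain_sec:
        alerts.append("under_replicated_partitions")

    if latest.active_controller_count != 1:
        alerts.append("active_controller_count")

    if latest.request_handler_idle_ratio < min_idle_ratio:
        alerts.append("request_handler_saturation")

    return alerts


history = [
    BrokerSample(0.0, 2, 1, 0.6),
    BrokerSample(60.0, 2, 1, 0.5),
    BrokerSample(130.0, 1, 1, 0.2),
]
alerts = evaluate_broker_health(history)
```

In this history the under-replication has lasted 130 seconds and the idle ratio has dropped to 0.2, so both of those alerts fire while the controller check passes.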

Kafka on Kubernetes vs. Managed (MSK/Confluent)

A frequent architectural decision: self-manage Kafka on Kubernetes (using Strimzi or similar) or use a managed service (AWS MSK, Confluent Cloud)?

Self-managed on Kubernetes with Strimzi:

Full control over configuration and tuning
No vendor lock-in
Significant operational overhead: broker scaling, ZooKeeper/KRaft management, upgrade procedures, certificate rotation
Your platform team needs deep Kafka expertise

AWS MSK:

Managed broker provisioning, patching, and replacement
Native AWS IAM authentication
Limited configuration control — some tuning parameters are not exposed
Higher cost than self-managed at equivalent capacity

Confluent Cloud:

Fully managed, serverless pricing model
Best-in-class management UI and tooling
Most expensive option at scale
Strong enterprise support SLAs

The right choice depends on your platform team's Kafka expertise, your operational bandwidth, and your cost constraints. For teams without dedicated Kafka expertise, MSK or Confluent removes significant operational risk. For teams with expert Kafka operators who need fine-grained tuning, self-managed with Strimzi gives you the control.

What I'd avoid: self-managed Kafka on Kubernetes operated by a team that doesn't have deep Kafka experience. It's a demanding system that punishes operational gaps.

*Zak Hassan is a Staff SRE specializing in data platform reliability, streaming infrastructure, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*
