Batch pipelines and real-time streaming pipelines look similar on paper — data moves from source to sink, transformations happen in between — but they fail differently, they scale differently, and operating them reliably requires a different mindset. I've operated both at scale, and Apache Flink is the streaming engine where the gap between "getting it working" and "operating it reliably" is widest.
This is the production operating guide I wish I'd had earlier.
Why Flink Is Operationally Complex
Flink's programming model is elegant: define a streaming DAG, choose your state backend, configure checkpointing, and let the runtime handle fault tolerance and distributed execution. The complexity is in the details of that runtime behavior, which surfaces in ways that aren't obvious from the API.
State is the distinguishing factor. A stateless Flink job — read, transform, write, no intermediate state — is relatively easy to operate. The interesting Flink jobs are stateful: windowed aggregations, join operations, deduplication, session processing. State lives in memory on TaskManagers, spills to disk when it exceeds memory limits, and is checkpointed to durable storage for fault tolerance. When state grows unexpectedly, when checkpoints slow down, or when state becomes corrupted, your job is in trouble in ways that are hard to observe from the outside.
Backpressure cascades. Flink's reactive model propagates backpressure upstream when a downstream operator can't keep up. This is the correct behavior, but it means a slow sink (a database write that's getting slower as the table grows) will slow your entire pipeline. The backpressure is visible in the Flink UI but invisible to standard infrastructure monitoring.
Checkpoint failures are silent until they matter. Your job appears healthy — it's processing records, it's not throwing errors — but its checkpoints have been failing for the last hour. Then a TaskManager fails, and there's no recent checkpoint to recover from, so your job restarts from a much older checkpoint and has to reprocess hours of data. You may not discover this until recovery takes 10x longer than expected.
The Monitoring Stack for Flink
Standard Flink deployments expose metrics via the Prometheus reporter. The metrics that matter:
Checkpoint metrics — the most important category:
# What to alert on
flink_jobmanager_job_lastCheckpointDuration:
alert_threshold: 60000 # ms — checkpoint taking over 1 minute is a warning sign
severity: warning
flink_jobmanager_job_lastCheckpointSize:
# Track over time; rapid growth means your state is growing unexpectedly
flink_jobmanager_job_numberOfFailedCheckpoints:
alert_threshold: 3 # Any failed checkpoints should alert
severity: critical
# A job with failed checkpoints has reduced fault tolerance
flink_jobmanager_job_numberOfCompletedCheckpoints:
# Confirm checkpoints are completing; alert if rate drops to 0Backpressure metrics:
flink_taskmanager_job_task_backPressuredTimeMsPerSecond:
alert_threshold: 100 # More than 100ms/s of backpressure is notable
# High backpressure = downstream bottleneck; investigate the slowest operator
flink_taskmanager_job_task_busyTimeMsPerSecond:
# Near 1000ms/s = operator is at capacity; may need parallelism increaseLag metrics (for Kafka sources):
flink_taskmanager_job_task_operator_KafkaSourceReader_KafkaConsumer_records_lag_max:
alert_threshold: 10000 # Records behind; calibrate to your use case
# Growing lag = pipeline can't keep up with source; needs scaling or optimizationState Management Best Practices
Choose the right state backend. Flink's default MemoryStateBackend stores state in TaskManager JVM heap — fast but limited to whatever memory you've allocated. For production jobs with significant state, use RocksDB (EmbeddedRocksDBStateBackend). RocksDB spills state to disk, allowing far larger state than fits in memory, at the cost of higher read latency. The tradeoff is correct for most production stateful jobs.
// Configure RocksDB state backend with incremental checkpointing
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
EmbeddedRocksDBStateBackend rocksDbBackend = new EmbeddedRocksDBStateBackend(
true // incremental checkpointing — only checkpoint state changes, not full state
);
env.setStateBackend(rocksDbBackend);
// Configure checkpoint storage (use S3 or GCS, not local disk)
env.getCheckpointConfig().setCheckpointStorage("s3://your-bucket/flink-checkpoints/");
// Checkpoint interval: balance between recovery point and checkpoint overhead
env.enableCheckpointing(60_000); // Every 60 seconds
// If a checkpoint takes longer than 10 minutes, abort it
env.getCheckpointConfig().setCheckpointTimeout(600_000);
// Allow only one checkpoint at a time (default; prevents overlap)
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);Incremental checkpointing is essential at scale. Full checkpoints write all state on every checkpoint interval — with large state, this can take minutes and generate significant S3 traffic. Incremental checkpointing writes only the state changes since the last checkpoint. The performance difference at scale (GB of state) is dramatic.
State expiry (TTL) for bounded state growth. Any stateful operation on unbounded data will eventually run out of memory without explicit TTL. If you're deduplicating events using a keyed state store, and you're never removing old keys, your state grows without bound. Use Flink's state TTL to expire entries that haven't been updated recently:
// State TTL configuration: expire dedup keys after 24 hours of inactivity
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.hours(24))
.setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
ValueStateDescriptor<Boolean> seenDescriptor =
new ValueStateDescriptor<>("seen-events", Boolean.class);
seenDescriptor.enableTimeToLive(ttlConfig);Exactly-Once Semantics: What It Costs
Flink supports exactly-once processing semantics end-to-end when paired with a compatible source (Kafka) and sink (Kafka, JDBC with transactions, or any sink that supports the two-phase commit protocol). This is a meaningful reliability guarantee for financial and audit-critical pipelines.
The cost: exactly-once requires transactions, and transactions add latency. The Kafka sink with exactly-once commits transactions on each checkpoint interval — records written between checkpoints are not visible to downstream consumers until the checkpoint completes and the transaction commits. This means your downstream consumers see records in batches corresponding to your checkpoint interval, not in real time.
For low-latency use cases, at-least-once with idempotent downstream writes (deduplicate at the sink) achieves better latency with equivalent effective semantics for most workloads.
Recovery and Savepoints
Savepoints vs. checkpoints: Checkpoints are automated, managed by Flink, and used for automatic recovery. Savepoints are manually triggered, stored separately, and used for intentional recovery scenarios: upgrading your job, changing parallelism, migrating to a new cluster.
Before any significant change to a production Flink job, take a savepoint:
# Trigger a savepoint before making changes
flink savepoint <job-id> s3://your-bucket/flink-savepoints/pre-upgrade-$(date +%Y%m%d)
# Restore from savepoint after making changes
flink run \
--detached \
-s s3://your-bucket/flink-savepoints/pre-upgrade-20260501 \
my-job.jarState schema evolution. Flink savepoints are tied to your state schema. If you change the structure of your state (add a field to a state class, change a type), you may not be able to restore from a previous savepoint. Use Flink's state schema migration support for changes, or plan your rollouts so the state schema is backward compatible.
The Operational Blog post for Common Flink Failures
Checkpoint timeout / failure:
- Check checkpoint metrics — how long are checkpoints taking? What's the state size?
- Check for backpressure — a slow operator can delay checkpoints
- Check S3/GCS write latency — checkpoint storage must be responsive
- Immediate action: if checkpoints are failing but the job is processing, watch the trend. If failures persist >30 minutes, restart the job from the last successful checkpoint.
Growing Kafka lag:
- Check job parallelism vs. Kafka partition count — you cannot have more Flink subtasks than Kafka partitions for a source
- Check TaskManager CPU and memory — saturation will slow processing
- Check for backpressure in the Flink UI — identify the bottleneck operator
- Scale up: increase job parallelism (may require a savepoint + restart) or optimize the slow operator
TaskManager OOM:
- Check state size metrics — unbounded state growth is the most common cause
- Check for state TTL configuration — is state being expired?
- Check for data skew — one key generating disproportionate state
- Immediate action: if the job is still running, take a savepoint; increase TaskManager memory and restart
*Zak Hassan is a Staff SRE specializing in real-time data platform reliability and AI-powered infrastructure automation. Find him at zakhassan.com or on LinkedIn.*
Topic Paths