Apache Flink in Production: Building Reliable Real-Time Data Pipelines

Batch pipelines and real-time streaming pipelines look similar on paper — data moves from source to sink, transformations happen in between — but they fail differently, they scale differently, and operating them reliably requires a different mindset. I've operated both at scale, and Apache Flink is the streaming engine where the gap between "getting it working" and "operating it reliably" is widest.

This is the production operating guide I wish I'd had earlier.

Why Flink Is Operationally Complex

Flink's programming model is elegant: define a streaming DAG, choose your state backend, configure checkpointing, and let the runtime handle fault tolerance and distributed execution. The complexity is in the details of that runtime behavior, which surfaces in ways that aren't obvious from the API.

State is the distinguishing factor. A stateless Flink job — read, transform, write, no intermediate state — is relatively easy to operate. The interesting Flink jobs are stateful: windowed aggregations, join operations, deduplication, session processing. State lives in memory on TaskManagers, spills to disk when it exceeds memory limits, and is checkpointed to durable storage for fault tolerance. When state grows unexpectedly, when checkpoints slow down, or when state becomes corrupted, your job is in trouble in ways that are hard to observe from the outside.

Backpressure cascades. Flink's credit-based flow control propagates backpressure upstream when a downstream operator can't keep up. This is the correct behavior, but it means a slow sink (a database write that's getting slower as the table grows) will slow your entire pipeline. The backpressure is visible in the Flink UI but invisible to standard infrastructure monitoring.

Checkpoint failures are silent until they matter. Your job appears healthy — it's processing records, it's not throwing errors — but its checkpoints have been failing for the last hour. Then a TaskManager fails, and there's no recent checkpoint to recover from, so your job restarts from a much older checkpoint and has to reprocess hours of data. You may not discover this until recovery takes 10x longer than expected.

The Monitoring Stack for Flink

Flink exposes its metrics to Prometheus once the Prometheus metrics reporter is configured; the metric names below follow that reporter's naming. The metrics that matter:

Checkpoint metrics — the most important category:

# What to alert on
flink_jobmanager_job_lastCheckpointDuration:
  alert_threshold: 60000  # ms — checkpoint taking over 1 minute is a warning sign
  severity: warning

flink_jobmanager_job_lastCheckpointSize:
  # Track over time; rapid growth means your state is growing unexpectedly

flink_jobmanager_job_numberOfFailedCheckpoints:
  alert_threshold: 1  # Any failed checkpoint should alert (the counter is cumulative, so alert on any increase)
  severity: critical
  # A job with failed checkpoints has reduced fault tolerance

flink_jobmanager_job_numberOfCompletedCheckpoints:
  # Confirm checkpoints are completing; alert if rate drops to 0
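
The checkpoint rules above can be sketched as a small evaluator over two successive scrapes of the cumulative counters. This is plain Java with no Flink dependency; the class name `CheckpointHealth`, the `Sample` type, and the alert wording are all hypothetical, and the sketch assumes the scrape interval is longer than the checkpoint interval, so that "no new completed checkpoint" really means checkpoints have stalled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns two successive metric scrapes into alert messages. */
final class CheckpointHealth {

    /** One scrape of the JobManager checkpoint metrics (counters are cumulative). */
    static final class Sample {
        final long completed;      // numberOfCompletedCheckpoints
        final long failed;         // numberOfFailedCheckpoints
        final long lastDurationMs; // lastCheckpointDuration

        Sample(long completed, long failed, long lastDurationMs) {
            this.completed = completed;
            this.failed = failed;
            this.lastDurationMs = lastDurationMs;
        }
    }

    /** Compares the current scrape with the previous one and returns any alerts. */
    static List<String> evaluate(Sample previous, Sample current, long durationWarnMs) {
        List<String> alerts = new ArrayList<>();
        if (current.failed > previous.failed) {
            alerts.add("critical: " + (current.failed - previous.failed)
                    + " checkpoint(s) failed since the last scrape");
        }
        if (current.completed == previous.completed) {
            alerts.add("critical: no checkpoint completed since the last scrape");
        }
        if (current.lastDurationMs > durationWarnMs) {
            alerts.add("warning: last checkpoint took " + current.lastDurationMs + " ms");
        }
        return alerts;
    }
}
```

In practice you would express the same comparisons as alerting rules in your monitoring system; the point is that the failed and completed counters are only meaningful as deltas between scrapes.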

Backpressure metrics:

flink_taskmanager_job_task_backPressuredTimeMsPerSecond:
  alert_threshold: 100  # More than 100ms/s of backpressure is notable
  # High backpressure = downstream bottleneck; investigate the slowest operator

flink_taskmanager_job_task_busyTimeMsPerSecond:
  # Near 1000ms/s = operator is at capacity; may need parallelism increase
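
One way to read these two metrics together: walking from source to sink, the likely bottleneck is the first task that is not backpressured itself but sits directly downstream of backpressured tasks. The sketch below assumes a simple linear pipeline; `BottleneckFinder` and `TaskMetrics` are hypothetical names, not Flink classes.

```java
import java.util.List;

/** Hypothetical helper for a linear pipeline; real job graphs can branch. */
final class BottleneckFinder {

    static final class TaskMetrics {
        final String name;
        final double backPressuredMsPerSecond;
        final double busyMsPerSecond;

        TaskMetrics(String name, double backPressuredMsPerSecond, double busyMsPerSecond) {
            this.name = name;
            this.backPressuredMsPerSecond = backPressuredMsPerSecond;
            this.busyMsPerSecond = busyMsPerSecond;
        }
    }

    /**
     * Tasks must be ordered source to sink. Returns the first task that is not
     * backpressured but follows a backpressured one, or null if none qualifies.
     */
    static TaskMetrics find(List<TaskMetrics> sourceToSink, double backPressureThreshold) {
        boolean upstreamBackPressured = false;
        for (TaskMetrics task : sourceToSink) {
            if (task.backPressuredMsPerSecond > backPressureThreshold) {
                upstreamBackPressured = true;
            } else if (upstreamBackPressured) {
                return task;
            }
        }
        return null;
    }
}
```

If the task it returns also shows busyTimeMsPerSecond close to 1000, that is a strong sign it is the operator to scale or optimize.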

Lag metrics (for Kafka sources):

flink_taskmanager_job_task_operator_KafkaSourceReader_KafkaConsumer_records_lag_max:
  alert_threshold: 10000  # Records behind; calibrate to your use case
  # Growing lag = pipeline can't keep up with source; needs scaling or optimization
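
A raw lag number is easier to act on when converted into time. As a rough sketch, assuming steady ingest and processing rates that you measure yourself (`LagEstimate` and its parameter names are hypothetical), the time to drain a backlog is the lag divided by the spare throughput:

```java
/** Hypothetical helper: converts a record lag into an estimated catch-up time. */
final class LagEstimate {

    /**
     * Returns the seconds needed to drain the backlog, or positive infinity
     * when the job processes no faster than records arrive.
     */
    static double secondsToCatchUp(long lagRecords, double ingestPerSecond, double processPerSecond) {
        double headroom = processPerSecond - ingestPerSecond;
        if (headroom <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return lagRecords / headroom;
    }
}
```

An infinite result is the "needs scaling or optimization" case: the lag will keep growing until throughput exceeds the ingest rate.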

State Management Best Practices

Choose the right state backend. Flink's default HashMapStateBackend (which replaced the deprecated MemoryStateBackend in Flink 1.13) stores state as objects on the TaskManager JVM heap: fast, but limited to whatever memory you've allocated. For production jobs with significant state, use RocksDB (EmbeddedRocksDBStateBackend). RocksDB keeps state on local disk, allowing far more state than fits in memory, at the cost of slower state access because every read and write goes through serialization. That tradeoff is the right one for most production stateful jobs.

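// Configure RocksDB state backend with incremental checkpointing
// (package names as of Flink 1.13+ on the 1.x line; requires the flink-statebackend-rocksdb dependency)
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();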

EmbeddedRocksDBStateBackend rocksDbBackend = new EmbeddedRocksDBStateBackend(
    true  // incremental checkpointing — only checkpoint state changes, not full state
);
env.setStateBackend(rocksDbBackend);

// Configure checkpoint storage (use S3 or GCS, not local disk)
env.getCheckpointConfig().setCheckpointStorage("s3://your-bucket/flink-checkpoints/");

// Checkpoint interval: balance between recovery point and checkpoint overhead
env.enableCheckpointing(60_000);  // Every 60 seconds

// If a checkpoint takes longer than 10 minutes, abort it
env.getCheckpointConfig().setCheckpointTimeout(600_000);

// Allow only one checkpoint at a time (default; prevents overlap)
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

Incremental checkpointing is essential at scale. Full checkpoints write all state on every checkpoint interval — with large state, this can take minutes and generate significant S3 traffic. Incremental checkpointing writes only the state changes since the last checkpoint. The performance difference at scale (GB of state) is dramatic.

State expiry (TTL) for bounded state growth. Any stateful operation on unbounded data will eventually exhaust memory (or local disk, with RocksDB) unless old state is cleaned up explicitly. If you're deduplicating events using keyed state and never removing old keys, your state grows without bound. Use Flink's state TTL to expire entries that haven't been read or written recently:

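// State TTL configuration: expire dedup keys after 24 hours of inactivity
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.hours(24))
    // Reads as well as writes reset the 24-hour timer
    .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
    // Expired entries are never returned, even if not yet physically removed
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();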

ValueStateDescriptor<Boolean> seenDescriptor = 
    new ValueStateDescriptor<>("seen-events", Boolean.class);
seenDescriptor.enableTimeToLive(ttlConfig);
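
To make the semantics of that configuration concrete, here is a plain-Java model of TTL-based deduplication. It is not how Flink implements state TTL; `TtlDeduplicator` is a hypothetical class that only mimics the two settings above: every access refreshes the timestamp (OnReadAndWrite), and an entry past its TTL is treated as absent (NeverReturnExpired).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of TTL-based dedup state; time is passed in explicitly. */
final class TtlDeduplicator {
    private final long ttlMs;
    private final Map<String, Long> lastAccessMs = new HashMap<>();

    TtlDeduplicator(long ttlMs) {
        this.ttlMs = ttlMs;
    }

    /** Returns true if the event ID has no live (unexpired) entry yet. */
    boolean isFirstSeen(String eventId, long nowMs) {
        Long last = lastAccessMs.get(eventId);
        boolean live = last != null && nowMs - last < ttlMs;
        lastAccessMs.put(eventId, nowMs); // every access refreshes the timestamp
        return !live;
    }

    /** Physically removes expired entries and returns the number of keys left. */
    int evictExpired(long nowMs) {
        lastAccessMs.values().removeIf(last -> nowMs - last >= ttlMs);
        return lastAccessMs.size();
    }
}
```

The design consequence: a key that keeps arriving never expires, because each duplicate refreshes its timer. State size is bounded by the number of distinct keys seen within one TTL window, not by the total history of the stream.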

Exactly-Once Semantics: What It Costs

Flink supports exactly-once processing semantics end-to-end when paired with a compatible source (Kafka) and sink (Kafka, JDBC with transactions, or any sink that supports the two-phase commit protocol). This is a meaningful reliability guarantee for financial and audit-critical pipelines.

The cost: exactly-once requires transactions, and transactions add latency. The Kafka sink with exactly-once commits its transaction when each checkpoint completes, so records written between checkpoints stay invisible to downstream consumers reading with isolation.level=read_committed until the checkpoint completes and the transaction commits. (Consumers left on the default read_uncommitted see records immediately, but they also see records from transactions that are later aborted.) This means your downstream consumers see records in batches corresponding to your checkpoint interval, not in real time.
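
The latency cost can be put in numbers. As a back-of-the-envelope sketch (`ExactlyOnceLatency` is a hypothetical helper, and the average assumes records arrive uniformly across the interval), a record written right after a checkpoint starts waits a full interval for the next one, plus the time that checkpoint takes to complete:

```java
/** Hypothetical helper: estimates the visibility delay added by transactional commits. */
final class ExactlyOnceLatency {

    /** A record written just after a checkpoint is triggered waits a full interval plus the checkpoint duration. */
    static long worstCaseVisibilityMs(long checkpointIntervalMs, long checkpointDurationMs) {
        return checkpointIntervalMs + checkpointDurationMs;
    }

    /** With uniform arrivals, a record waits half an interval on average plus the checkpoint duration. */
    static long averageVisibilityMs(long checkpointIntervalMs, long checkpointDurationMs) {
        return checkpointIntervalMs / 2 + checkpointDurationMs;
    }
}
```

With a 60-second interval and 5-second checkpoints, that is up to 65 seconds before a record becomes visible. Shortening the checkpoint interval reduces the delay at the price of more checkpoint overhead.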

For low-latency use cases, at-least-once with idempotent downstream writes (deduplicate at the sink) achieves better latency with equivalent effective semantics for most workloads.
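
A minimal sketch of what "idempotent downstream writes" means, assuming every event carries a stable unique ID (an assumption; the original pipeline may key differently). `IdempotentSink` is a hypothetical in-memory stand-in for a table with a primary key on the event ID, where replaying a record after a restart changes nothing:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a sink table keyed by event ID. */
final class IdempotentSink {
    private final Map<String, Long> amountByEventId = new HashMap<>();

    /** Writes the row only if the event ID is new; returns false for a replayed event. */
    boolean write(String eventId, long amount) {
        return amountByEventId.putIfAbsent(eventId, amount) == null;
    }

    int rowCount() {
        return amountByEventId.size();
    }

    long total() {
        long sum = 0;
        for (long amount : amountByEventId.values()) {
            sum += amount;
        }
        return sum;
    }
}
```

After a failure, an at-least-once job replays everything since the last checkpoint. Because the replayed writes are no-ops, the sink ends up in the same state it would have reached with exactly-once delivery.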

Recovery and Savepoints

Savepoints vs. checkpoints: Checkpoints are automated, managed by Flink, and used for automatic recovery. Savepoints are manually triggered, stored separately, and used for intentional recovery scenarios: upgrading your job, changing parallelism, migrating to a new cluster.

Before any significant change to a production Flink job, take a savepoint:

# Trigger a savepoint before making changes
flink savepoint <job-id> s3://your-bucket/flink-savepoints/pre-upgrade-$(date +%Y%m%d)

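# Restore from savepoint after making changes
# (Flink writes the savepoint into a savepoint-... subdirectory of the target;
#  pass the full path that the savepoint command printed)
flink run \
  --detached \
  -s s3://your-bucket/flink-savepoints/pre-upgrade-20260501/<savepoint-dir> \
  my-job.jar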

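State schema evolution. Flink savepoints are tied to your state schema. If you change the structure of your state (add a field to a state class, change a type), you may not be able to restore from a previous savepoint. Flink supports schema evolution out of the box for POJO and Avro state types; for anything else, plan your rollouts so the state schema stays backward compatible. Assign an explicit uid() to every stateful operator as well, so Flink can map savepoint state back to the right operator after the job graph changes.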

The Operational Runbook for Common Flink Failures

Checkpoint timeout / failure:

Check checkpoint metrics — how long are checkpoints taking? What's the state size?
Check for backpressure — a slow operator can delay checkpoints
Check S3/GCS write latency — checkpoint storage must be responsive
Immediate action: if checkpoints are failing but the job is still processing, watch the trend. If failures persist for more than 30 minutes, restart the job from the last successful checkpoint.

Growing Kafka lag:

Check job parallelism vs. Kafka partition count: source subtasks beyond the partition count receive no partitions and sit idle, so raising source parallelism past the partition count adds no read throughput
Check TaskManager CPU and memory — saturation will slow processing
Check for backpressure in the Flink UI — identify the bottleneck operator
Scale up: increase job parallelism (may require a savepoint + restart) or optimize the slow operator
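
The parallelism check in this list is simple arithmetic. A sketch, assuming partitions are spread as evenly as possible across source subtasks (`SourceParallelism` is a hypothetical helper, not a Flink API):

```java
/** Hypothetical helper: relates Kafka partition count to source parallelism. */
final class SourceParallelism {

    /** Source subtasks that would receive no partition at all. */
    static int idleSubtasks(int partitions, int parallelism) {
        return Math.max(0, parallelism - partitions);
    }

    /** Partitions read by the busiest subtask under an even spread (ceiling division). */
    static int maxPartitionsPerSubtask(int partitions, int parallelism) {
        return (partitions + parallelism - 1) / parallelism;
    }
}
```

For a 12-partition topic, parallelism 16 leaves 4 source subtasks idle, while parallelism 8 has some subtasks reading 2 partitions and others 1. If the source is the bottleneck and parallelism already matches the partition count, the next step is adding partitions, not subtasks.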

TaskManager OOM:

Check state size metrics — unbounded state growth is the most common cause
Check for state TTL configuration — is state being expired?
Check for data skew — one key generating disproportionate state
Immediate action: if the job is still running, take a savepoint; increase TaskManager memory and restart
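
For the data-skew check, one rough measure is the ratio between the hottest key's record count and the mean across all keys. The sketch below assumes you can obtain per-key counts from a sample of the stream (how you collect them is up to you); `KeySkew` is a hypothetical helper.

```java
import java.util.Map;

/** Hypothetical helper: measures how far the hottest key is above the average. */
final class KeySkew {

    /** Returns max count divided by mean count; 1.0 means perfectly even, 0 for no keys. */
    static double skewRatio(Map<String, Long> recordsPerKey) {
        if (recordsPerKey.isEmpty()) {
            return 0.0;
        }
        long max = 0;
        long total = 0;
        for (long count : recordsPerKey.values()) {
            max = Math.max(max, count);
            total += count;
        }
        if (total == 0) {
            return 0.0;
        }
        double mean = (double) total / recordsPerKey.size();
        return max / mean;
    }
}
```

A ratio well above 1 means a single key, and therefore a single subtask, carries a disproportionate share of the records and state. Adding TaskManager memory only postpones that kind of OOM.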

*Zak Hassan is a Staff SRE specializing in real-time data platform reliability and AI-powered infrastructure automation. Find him at zakhassan.com or on LinkedIn.*
