Running Apache Spark in production-style environments is a different discipline from running web services. The reliability concerns are different, the failure modes are different, and the tooling your platform team already uses for service reliability often doesn't transfer directly. In homelab and cloud-lab exercises, the most useful Spark reliability lessons come from modeling petabyte-scale constraints, distributed clusters, and the failure modes that appear when batch systems grow.


Why Spark Is Hard to Operate

Spark's programming model is elegant: express your data transformation as a DAG of operations, and the runtime figures out how to execute it across your cluster. The operational reality is more complex.

Failures are opaque. When a Spark job fails, the error message is frequently in the executor logs of a worker that may no longer exist, wrapped in multiple layers of Java exception chains, pointing to a line in generated bytecode. Getting to the actual cause — a null pointer from unexpected data, a shuffle that ran out of disk space, a skewed partition that caused an executor OOM — requires experience navigating Spark's logging infrastructure.

Performance is data-dependent. A Spark job that runs in 20 minutes with typical data might take 4 hours with data that has different distribution characteristics. Skewed keys in a join or aggregation can cause one executor to process 100x more data than others while the rest wait. Your SLAs are effectively conditional on your data distribution, which is not a property of your infrastructure.

Resource management interacts with correctness. If you allocate too little executor memory, jobs fail with OOM errors. Too much, and you're wasting cluster capacity. Dynamic allocation helps, but getting the memory configuration right for a job that processes variable-sized partitions is genuinely hard.

Cluster state degrades over time. Long-running Spark clusters accumulate state: cached datasets that are no longer needed, shuffle files from completed stages, driver memory that has grown from accumulated metadata. Restarting clusters periodically is operational maintenance, not a sign that something is wrong.


The Monitoring Stack That Works

Standard infrastructure monitoring (CPU, memory, disk) is necessary but not sufficient for Spark. You need application-level observability.

Spark History Server. This is your primary debugging tool for completed jobs. The History Server stores Spark event logs and provides a UI showing stage timelines, task distributions, and executor resource usage. For production reliability, you need the History Server running against a durable event log store (S3 or GCS, not local disk), and you need a retention policy that keeps job history for at least 30 days.

SparkMeasure. This is an underused tool. SparkMeasure instruments Spark jobs to collect stage and task-level metrics — bytes read/written, shuffle data, GC time, task duration distribution — and publishes them to a time-series store. For long-running production pipelines, correlating SparkMeasure metrics with your infrastructure metrics gives you visibility into the performance characteristics of each pipeline, not just "did it succeed or fail."

Custom metrics via SparkListener. For pipelines where you need business-level metrics (records processed, data quality check results, transformation accuracy), implement a SparkListener that publishes custom metrics to CloudWatch or Prometheus at the stage level:

scala
class ProductionMetricsListener(metricsClient: MetricsClient) 
    extends SparkListener {
  
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    val taskMetrics = info.taskMetrics
    
    val tags = Map(
      "pipeline" -> info.name,
      "stage_id" -> info.stageId.toString,
      "attempt" -> info.attemptNumber.toString
    )
    
    metricsClient.gauge("spark.stage.duration_ms", 
      info.completionTime.getOrElse(0L) - info.submissionTime.getOrElse(0L), tags)
    metricsClient.gauge("spark.stage.task_count", 
      info.numTasks, tags)
    metricsClient.gauge("spark.stage.shuffle_read_bytes", 
      taskMetrics.shuffleReadMetrics.totalBytesRead, tags)
    metricsClient.gauge("spark.stage.shuffle_write_bytes", 
      taskMetrics.shuffleWriteMetrics.bytesWritten, tags)
    metricsClient.gauge("spark.stage.gc_time_ms", 
      taskMetrics.jvmGCTime, tags)
    
    // Alert on high GC — sign of memory pressure
    val durationMs = info.completionTime.getOrElse(0L) - info.submissionTime.getOrElse(0L)
    val gcPercent = taskMetrics.jvmGCTime.toDouble / durationMs * 100
    if (gcPercent > 20) {
      metricsClient.increment("spark.stage.high_gc_alert", tags)
    }
  }
}

The Data Skew Problem (And How to Detect It)

Data skew is the most common performance killer in production Spark jobs. It happens when one or a few partition keys appear disproportionately often in your data — the key "unknown" in a join column, a customer ID that represents a disproportionate share of your transactions, a date that corresponds to a large historical import.

Detecting skew: Look at the task duration distribution in the Spark UI for any stage involving a join or aggregation. If you see tasks that took 10x longer than the median, you have skew. SparkMeasure's task-level duration distribution makes this easy to quantify.

Remediating skew:

python
# Salting: Add a random suffix to skewed keys to distribute them
# across multiple partitions, then aggregate the results

from pyspark.sql import functions as F

SALT_FACTOR = 50  # Number of sub-partitions for skewed keys

def salt_dataframe(df, key_column, skewed_values):
    """Add salt to known skewed values to distribute their load."""
    return df.withColumn(
        f"{key_column}_salted",
        F.when(
            F.col(key_column).isin(skewed_values),
            F.concat(
                F.col(key_column),
                F.lit("_"),
                (F.rand() * SALT_FACTOR).cast("int").cast("string")
            )
        ).otherwise(F.col(key_column))
    )

# Use AQE (Adaptive Query Execution) to handle skew automatically
spark = SparkSession.builder \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256mb") \
    .getOrCreate()

Adaptive Query Execution (AQE) in Spark 3.x handles many skew cases automatically by splitting large partitions and coalescing small ones at runtime. Enable it by default for all production jobs. The performance improvement for skewed workloads is substantial.


SLAs for Data Pipelines

Data pipeline SLAs are different from service SLAs. A web service SLA is about latency and availability: "p99 response time < 200ms, 99.9% uptime." A data pipeline SLA is about freshness and completeness: "data for day T is fully available in the warehouse by 08:00 UTC on day T+1, with no more than 0.01% missing records."

SLA definition principles:

Freshness over duration. Saying "the pipeline runs in under 4 hours" is less useful than "data is available by 08:00 UTC." The former is a metric; the latter is a business commitment. Monitor against the commitment.

Completeness is mandatory. A pipeline that runs in 2 hours but drops 5% of records is worse than one that runs in 6 hours with 100% completeness. Build data quality checks into your pipeline and fail loudly when completeness drops below threshold.

Define what constitutes a failure. A late pipeline is different from an incomplete pipeline, which is different from a pipeline with incorrect transformations. Each has different severity and remediation paths.

python
# Data quality checkpoint pattern
class PipelineQualityCheck:
    def __init__(self, pipeline_name: str, expected_record_count_tolerance: float = 0.01):
        self.pipeline_name = pipeline_name
        self.tolerance = expected_record_count_tolerance
    
    def check_completeness(
        self, 
        df: DataFrame, 
        expected_count: int,
        partition_date: str
    ) -> None:
        actual_count = df.count()
        variance = abs(actual_count - expected_count) / expected_count
        
        if variance > self.tolerance:
            raise DataQualityError(
                f"Pipeline {self.pipeline_name}: Record count variance {variance:.2%} "
                f"exceeds tolerance {self.tolerance:.2%}. "
                f"Expected ~{expected_count}, got {actual_count} for {partition_date}"
            )
        
        logger.info(f"Completeness check passed: {actual_count} records ({variance:.2%} variance)")
    
    def check_null_rate(self, df: DataFrame, column: str, max_null_rate: float = 0.001) -> None:
        null_count = df.filter(F.col(column).isNull()).count()
        total_count = df.count()
        null_rate = null_count / total_count
        
        if null_rate > max_null_rate:
            raise DataQualityError(
                f"Column {column} null rate {null_rate:.2%} exceeds max {max_null_rate:.2%}"
            )

The Cluster Lifecycle: Don't Fight Entropy

Spark clusters accumulate state. Driver memory grows as job metadata accumulates. Cached datasets from completed jobs sit in memory. Shuffle files from stages that completed hours ago occupy disk.

The instinct is to try to clean this up in place. The better practice is to embrace cluster lifecycle management:

Ephemeral clusters for batch jobs. For batch pipelines that run on a schedule (daily, hourly), use ephemeral clusters: start the cluster, run the job, terminate the cluster. The cluster starts clean every time. AWS EMR, Databricks Jobs clusters, and GCP Dataproc all support this pattern. The startup cost (2-5 minutes for cluster initialization) is negligible compared to a multi-hour batch job.

Long-lived clusters for interactive workloads. Data scientists using notebooks, interactive query workloads, and development environments benefit from persistent clusters that are already running. But these clusters need a maintenance window — a weekly restart at minimum — to reclaim accumulated state.

Preemptible/Spot instances for scale-out. A cluster with a few on-demand core nodes and a large pool of Spot nodes for task scaling is the economical choice for most batch workloads. Spark handles task re-execution when a Spot node is preempted. The only thing you need to ensure is that your shuffle data (which lives on worker nodes) is handled gracefully — use External Shuffle Service on your persistent nodes so Spark can recover shuffle data even if the node that wrote it is gone.


Iceberg as the Storage Layer

Apache Iceberg as the table format for your Spark pipelines is rapidly becoming the default choice for good reason. The ACID guarantees, schema evolution, and partition pruning benefits that Iceberg provides (covered in detail in my earlier post on the Iceberg + Athena observability stack) are equally valuable for data platform reliability:

  • Multiple Spark jobs can write to the same table concurrently without corruption
  • Failed jobs don't leave partial data visible to readers (the data is written but not committed)
  • Schema changes don't require rewriting historical data
  • Table maintenance (compaction, snapshot cleanup) can run concurrently with reads

For production data platforms, run Spark jobs against Iceberg tables on S3 or GCS. The operational experience is meaningfully better than raw Parquet with Hive metastore.


*Zak Hassan is a Staff SRE specializing in data platform reliability and AI-powered infrastructure automation. Find him at zakhassan.com or on LinkedIn.*

Topic Paths

About the Author

Zak Hassan writes about reliability engineering under real scale constraints.

Staff-level SRE and platform engineer focused on identity reliability, Kubernetes, observability, cloud architecture, AI infrastructure, and reducing operational uncertainty.

Connect on LinkedIn