CI/CD Pipeline Reliability: Flaky Tests, Build Observability, and Deployment Gates

*By Zak Hassan — Staff SRE | May 2026*

Most engineering teams invest heavily in making user-facing services reliable — SLOs, blog posts, on-call rotations, incident reviews. Then they let the system that deploys those services run as an unmonitored, undocumented mess of shell scripts and third-party actions pinned to @main. The CI/CD pipeline is a production system. It has users (every engineer on the team), uptime requirements (a pipeline that won't run at 2pm on a Tuesday is an outage), and failure modes that compound quietly until a release grinds to a halt. Treating it as an afterthought doesn't just slow deployments — it erodes trust in the entire delivery process, causes engineers to skip tests to "unblock" themselves, and turns what should be a five-minute feedback loop into a multi-hour ordeal.

The Pipeline Is a Service — Treat It Like One

A CI/CD pipeline has the same reliability properties as any other distributed system. It has throughput (jobs per hour), latency (p50 and p99 build times), error rate (pipeline failure percentage), and dependencies (external registries, artifact stores, test environments). The difference is that most teams define no SLOs for it and measure nothing.

A reasonable starting set of pipeline SLOs looks like this: p99 build time under 12 minutes, pipeline success rate above 95% on the main branch, mean time to detect a broken build under 3 minutes. These aren't arbitrary — they map directly to developer productivity. A build that takes 20 minutes at p99 means engineers context-switch away, lose focus, and start batching commits to avoid the wait. Each of these SLOs needs a measurement strategy, which is where pipeline observability comes in.
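As a rough illustration of how two of these targets could be checked, the sketch below computes p99 build time and success rate from a list of build records. The `Build` record shape, its field names, and the nearest-rank percentile method are assumptions made for the example, not the export format of any particular CI provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    duration_s: float   # wall-clock build time in seconds
    succeeded: bool     # final pipeline status
    branch: str = "main"


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def check_slos(builds, p99_limit_s=12 * 60, min_success_rate=0.95):
    """Return {slo_name: (observed_value, met)} for main-branch builds."""
    main = [b for b in builds if b.branch == "main"]
    if not main:
        raise ValueError("no main-branch builds in the window")
    observed_p99 = p99([b.duration_s for b in main])
    success_rate = sum(b.succeeded for b in main) / len(main)
    return {
        "p99_build_time_s": (observed_p99, observed_p99 < p99_limit_s),
        "success_rate": (success_rate, success_rate > min_success_rate),
    }


# 97 fast green builds and 3 slow red ones: success rate holds, p99 does not.
builds = [Build(300, True)] * 97 + [Build(900, False)] * 3
report = check_slos(builds)
```

In this sample the success-rate SLO is met while the p99 SLO is missed, which is the kind of split an average build time would hide.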

Flaky Tests: The Reliability Cancer

A flaky test is one that produces inconsistent results across runs without any change to the code under test. Left unaddressed, flaky tests are catastrophic to pipeline reliability. They cause engineers to re-run CI on a red commit hoping for a green result. They mask real failures. They breed a culture of "just retry it" that eventually applies to actual breakage.

The root causes fall into a handful of categories. Time dependencies occur when tests assert against wall-clock time or use sleep() as synchronization. Shared state appears when tests rely on global singletons, database rows, or filesystem paths that bleed between test cases. External dependencies bite when tests call real APIs or DNS names that can be slow or unavailable. Race conditions surface in concurrent code where tests assert on timing rather than observable state.
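To make the first category concrete, here is a minimal sketch of removing a wall-clock dependency by injecting the clock. `Session`, `FakeClock`, and the 30-minute TTL are hypothetical names and values invented for the illustration.

```python
import time


class Session:
    """Expires a fixed number of seconds after creation."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._clock = clock                  # injectable so tests control time
        self._expires_at = clock() + ttl_s

    def expired(self):
        return self._clock() >= self._expires_at


class FakeClock:
    """Deterministic clock for tests: time moves only when advance() is called."""

    def __init__(self, now=0.0):
        self.now = now

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def test_session_expires_after_ttl():
    clock = FakeClock()
    session = Session(ttl_s=1800, clock=clock)
    clock.advance(1799)
    assert not session.expired()   # one second before the deadline
    clock.advance(1)
    assert session.expired()       # exactly at the deadline, no sleep() needed
```

The test runs in microseconds and gives the same answer on a loaded CI runner as on a laptop, because nothing in it depends on how fast real time passes.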

Detection requires statistical analysis across multiple runs. The following script queries CI history data to compute a flakiness score per test:

#!/usr/bin/env python3
"""
detect_flaky_tests.py — Compute flakiness scores from CI run history.
Expects a JSONL file where each line is:
  {"test_name": "...", "run_id": "...", "passed": true, "commit_sha": "..."}
"""

import json
import sys
from collections import defaultdict
from dataclasses import dataclass, field

FLAKINESS_THRESHOLD = 0.15  # flag tests with >15% inconsistency rate


@dataclass
class TestHistory:
    name: str
    results_by_commit: dict = field(default_factory=lambda: defaultdict(list))

    def flakiness_score(self) -> float:
        """
        A test is flaky if the same commit SHA produces both pass and fail.
        Score = fraction of commits with more than one run that produced
        mixed results. Commits that ran only once cannot show flakiness,
        so they are left out of the denominator.
        """
        rerun = [r for r in self.results_by_commit.values() if len(r) > 1]
        mixed = sum(1 for results in rerun if len(set(results)) > 1)
        return mixed / len(rerun) if rerun else 0.0

    def failure_rate(self) -> float:
        all_results = [r for runs in self.results_by_commit.values() for r in runs]
        return all_results.count(False) / len(all_results) if all_results else 0.0


def load_history(path: str) -> dict[str, TestHistory]:
    history: dict[str, TestHistory] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            record = json.loads(line)
            name = record["test_name"]
            if name not in history:
                history[name] = TestHistory(name=name)
            history[name].results_by_commit[record["commit_sha"]].append(record["passed"])
    return history


def main():
    path = sys.argv[1] if len(sys.argv) > 1 else "ci_history.jsonl"
    history = load_history(path)

    flaky = [
        (name, t.flakiness_score(), t.failure_rate())
        for name, t in history.items()
        if t.flakiness_score() >= FLAKINESS_THRESHOLD
    ]
    flaky.sort(key=lambda x: x[1], reverse=True)

    print(f"{'Test Name':<60} {'Flakiness':>10} {'Fail Rate':>10}")
    print("-" * 82)
    for name, score, fail_rate in flaky:
        # Column widths match the header row (60, 10, 10).
        print(f"{name:<60} {score:>10.1%} {fail_rate:>10.1%}")

    print(f"\n{len(flaky)} flaky tests detected (threshold: {FLAKINESS_THRESHOLD:.0%})")


if __name__ == "__main__":
    main()

Once identified, flaky tests belong in quarantine — a separate suite that runs in parallel but whose failures do not block the main pipeline. Quarantine is not retirement. Every quarantined test gets a GitHub issue, an owner, and a two-week deadline before it is deleted. A test that cannot be fixed is worse than no test: it costs CI time while providing false confidence.
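One way to keep the two-week deadline honest is to track quarantine entries as data and report the overdue ones on a schedule. The sketch below assumes a simple in-memory record; `QuarantinedTest`, its fields, and the example.com issue URLs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

QUARANTINE_WINDOW = timedelta(days=14)  # the two-week deadline


@dataclass(frozen=True)
class QuarantinedTest:
    test_name: str
    issue_url: str        # tracking issue (placeholder URLs below)
    owner: str
    quarantined_on: date

    @property
    def deadline(self):
        return self.quarantined_on + QUARANTINE_WINDOW


def overdue(entries, today):
    """Return entries whose deadline has passed, oldest first."""
    late = [e for e in entries if today > e.deadline]
    return sorted(late, key=lambda e: e.quarantined_on)


entries = [
    QuarantinedTest("test_checkout_total", "https://example.com/issues/1",
                    "alice", date(2026, 5, 1)),
    QuarantinedTest("test_login_redirect", "https://example.com/issues/2",
                    "bob", date(2026, 5, 10)),
]
late = overdue(entries, today=date(2026, 5, 20))
```

Here only the first entry is overdue; a scheduled job could post that list to the owning team or open the deletion PR directly.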

Build Caching at Scale

Build time is dominated by work that was already done. Effective caching is the highest-leverage optimization available for pipeline latency. In Docker builds, layer ordering is the foundation — place COPY instructions for dependency manifests before application source so that package installation layers are reused across commits that don't change dependencies.

GitHub Actions provides a cache action that saves and restores directories under a key you supply. Cache keys should include the OS, language runtime version, and a hash of the relevant lock file:

- name: Cache Python dependencies
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-py${{ matrix.python-version }}-${{ hashFiles('**/requirements.lock') }}
    restore-keys: |
      ${{ runner.os }}-py${{ matrix.python-version }}-

For monorepos or large builds, remote caching via Bazel or Buildkite's remote cache provides cross-machine cache sharing. Cache hit rate is a first-class metric — target above 80% for dependency caches and above 60% for build artifact caches. Instrument it explicitly rather than inferring it from build times.
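Instrumenting hit rate can be as simple as recording one event per cache lookup and aggregating by cache kind. The sketch below assumes each event is a `(cache_kind, hit)` pair; the event shape and the kind names are invented for the example. In GitHub Actions, the `cache-hit` output of `actions/cache` is one possible source for the boolean.

```python
from collections import Counter

TARGETS = {"dependency": 0.80, "artifact": 0.60}  # targets quoted above


def hit_rates(events):
    """events: iterable of (cache_kind, hit) pairs, one per cache lookup."""
    lookups, hits = Counter(), Counter()
    for kind, hit in events:
        lookups[kind] += 1
        hits[kind] += bool(hit)
    return {kind: hits[kind] / lookups[kind] for kind in lookups}


def below_target(rates, targets=TARGETS):
    """Return cache kinds whose hit rate is not above its target."""
    return sorted(k for k, r in rates.items() if k in targets and r <= targets[k])


events = (
    [("dependency", True)] * 9 + [("dependency", False)]
    + [("artifact", True)] * 5 + [("artifact", False)] * 5
)
rates = hit_rates(events)
misses = below_target(rates)
```

With these sample events the dependency cache sits at 90% and passes, while the artifact cache sits at 50% and is reported.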

Pipeline Observability with OpenTelemetry

If you can't measure the pipeline, you can't improve it. OpenTelemetry provides a vendor-neutral way to emit traces and metrics from CI jobs. The following pattern instruments a GitHub Actions workflow step by posting spans to an OTel collector over OTLP/HTTP:

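# otel_ci_span.py: wrap a CI step in an OTel span (OTLP/HTTP, JSON encoding)
import hashlib
import os
import secrets
import sys

import requests

OTEL_ENDPOINT = os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"]
PIPELINE = os.environ.get("GITHUB_WORKFLOW", "unknown")
JOB = os.environ.get("GITHUB_JOB", "unknown")
RUN_ID = os.environ.get("GITHUB_RUN_ID", "0")

# OTLP requires a 16-byte trace ID (32 hex chars). Deriving it from the run ID
# groups every span from one workflow run into the same trace.
TRACE_ID = hashlib.sha256(RUN_ID.encode()).hexdigest()[:32]


def emit_span(name: str, start_ns: int, end_ns: int, status: str, attrs: dict):
    payload = {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "ci.pipeline", "value": {"stringValue": PIPELINE}},
                    {"key": "ci.job", "value": {"stringValue": JOB}},
                    {"key": "ci.run_id", "value": {"stringValue": RUN_ID}},
                ]
            },
            "scopeSpans": [{
                "spans": [{
                    "traceId": TRACE_ID,
                    "spanId": secrets.token_hex(8),  # 8-byte span ID, hex-encoded
                    "name": name,
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                    # 1 = STATUS_CODE_OK, 2 = STATUS_CODE_ERROR
                    "status": {"code": 1 if status == "ok" else 2},
                    "attributes": [
                        {"key": k, "value": {"stringValue": str(v)}}
                        for k, v in attrs.items()
                    ],
                }]
            }]
        }]
    }
    try:
        response = requests.post(f"{OTEL_ENDPOINT}/v1/traces", json=payload, timeout=5)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Telemetry is best-effort: report the failure but never fail the build over it.
        print(f"warning: could not export span {name!r}: {exc}", file=sys.stderr)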
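A span emitter is easier to use consistently when the timing and status bookkeeping live in one place. The following sketch wraps a block of work in a context manager and reports it through any callable with the same argument order as `emit_span`; `timed_step` is a hypothetical helper, and the example uses a list as a stand-in for the real exporter.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name, emit, **attrs):
    """Time the wrapped block and call emit(name, start_ns, end_ns, status, attrs)."""
    start_ns = time.time_ns()
    status = "ok"
    try:
        yield
    except BaseException:
        status = "error"
        raise                      # never swallow the step's own failure
    finally:
        emit(name, start_ns, time.time_ns(), status, attrs)


# Stand-in exporter that records calls; in a pipeline, pass emit_span instead.
recorded = []


def record(*args):
    recorded.append(args)


with timed_step("unit-tests", record, suite="fast"):
    pass  # the real step would run here, e.g. a subprocess call

try:
    with timed_step("lint", record):
        raise RuntimeError("lint failed")
except RuntimeError:
    pass
```

The failing step still produces a span, marked as an error, and the original exception propagates so the CI job fails as it normally would.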

Pipeline duration trends stored in a time-series backend (Prometheus, Grafana Mimir, Datadog) enable the build time SLO query. In PromQL:

# p99 build time for the main branch deploy pipeline, 7-day window
histogram_quantile(
  0.99,
  sum by (le, pipeline) (
    rate(ci_pipeline_duration_seconds_bucket{
      branch="main",
      pipeline="deploy"
    }[7d])
  )
)

Alert when this value exceeds the SLO threshold for two consecutive evaluation periods, not just once — pipeline spikes from a single slow job should not page on-call.
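The "two consecutive periods" rule is small enough to state as code. This is a sketch of the decision only, assuming the caller supplies the p99 value from each evaluation in chronological order; `should_page` is a hypothetical name. In Prometheus itself, the `for` clause of an alerting rule serves a similar purpose.

```python
def should_page(samples, threshold_s, required_consecutive=2):
    """True if the most recent `required_consecutive` evaluations all exceed the threshold."""
    if len(samples) < required_consecutive:
        return False
    recent = samples[-required_consecutive:]
    return all(value > threshold_s for value in recent)


SLO_P99_S = 12 * 60  # 12-minute p99 target

single_spike = should_page([600, 900, 650], SLO_P99_S)   # one breach, then recovery
sustained = should_page([600, 800, 900], SLO_P99_S)      # two breaches in a row
```

A single slow evaluation does not page; two in a row do.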

Deployment Gates

A deployment gate is an automated check that blocks promotion of a build artifact to the next environment unless specific conditions hold. Gates exist because the cost of finding a defect tends to grow by roughly an order of magnitude with each environment boundary crossed. Finding a memory regression in CI costs minutes. Finding it in production costs an incident.

The following GitHub Actions workflow demonstrates a complete gate sequence from build through production deploy with automated rollback:

name: Deploy with Gates

on:
  push:
    branches: [main]

jobs:
  build:
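    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # needed to push the image to ghcr.io
    outputs:
      # build-push-action exposes the pushed image digest, not a "tags" output
      image_digest: ${{ steps.meta.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}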

      - name: Build and push image
        id: meta
        uses: docker/build-push-action@v6
        with:
          push: true
          cache-from: type=gha
          cache-to: type=gha,mode=max
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}

  gate-security:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Scan image for CVEs
        uses: aquasecurity/trivy-action@0.24.0
        with:
          image-ref: ghcr.io/${{ github.repository }}:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1

  gate-integration:
    needs: build
    runs-on: ubuntu-latest
    steps:
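      - uses: actions/checkout@v4
      - name: Run integration test suite
        # "tests" is an assumed service name; use the service whose exit code carries the test result.
        run: docker compose -f docker-compose.test.yml up --exit-code-from tests
      - name: Tear down test environment
        if: always()
        run: docker compose -f docker-compose.test.yml down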

  gate-error-budget:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Check error budget remaining
        env:
          PROM_URL: ${{ secrets.PROMETHEUS_URL }}
        run: |
          python3 scripts/check_error_budget.py \
            --service api \
            --min-budget-percent 10 \
            --window 7d

  deploy-staging:
    needs: [gate-security, gate-integration, gate-error-budget]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: scripts/deploy.sh staging ${{ github.sha }}

      - name: Run smoke tests against staging
        run: pytest tests/smoke/ --base-url=${{ vars.STAGING_URL }}

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # the rollback step resolves HEAD~1, which the default depth-1 clone lacks

      - name: Deploy to production
        id: deploy
        run: scripts/deploy.sh production ${{ github.sha }}

      - name: Wait and verify deployment health
        id: health_check
        run: |
          sleep 90
          python3 scripts/verify_deployment.py \
            --sha ${{ github.sha }} \
            --error-rate-threshold 0.01 \
            --latency-p99-threshold 500

      - name: Rollback on failure
        if: failure() && steps.deploy.outcome == 'success'
        run: |
          echo "Deployment health check failed — initiating rollback"
          PREVIOUS_SHA=$(git rev-parse HEAD~1)
          scripts/deploy.sh production "${PREVIOUS_SHA}"
          scripts/notify_incident.py \
            --severity high \
            --message "Auto-rolled back production to ${PREVIOUS_SHA}"

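The rollback step only fires when the deploy step succeeded but a later step, here the health check, did not. This avoids a rollback attempt on a deployment that never reached the cluster. It also assumes HEAD~1 is the version that was running before this deploy, which holds only if the parent commit was itself deployed successfully; recording the last known-good SHA at deploy time is more robust.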
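The workflow calls `scripts/check_error_budget.py` without showing it. The sketch below is a guess at the core calculation such a script might perform, assuming a request-based availability SLO; the function names and the idea of passing raw request counts are assumptions, and a real script would first fetch those counts from Prometheus.

```python
def error_budget_remaining(total_requests, failed_requests, slo_target):
    """Fraction of the error budget left in the window (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def gate_passes(total_requests, failed_requests, slo_target, min_budget_percent=10):
    """Mirror of --min-budget-percent: block deploys when too little budget is left."""
    remaining = error_budget_remaining(total_requests, failed_requests, slo_target)
    return remaining * 100 >= min_budget_percent


# 99.9% SLO over 1,000,000 requests allows about 1,000 failures in the window.
healthy = gate_passes(1_000_000, 400, 0.999)    # roughly 60% of budget left
exhausted = gate_passes(1_000_000, 950, 0.999)  # roughly 5% of budget left
```

The design intent is that a service already burning through its budget stops receiving new changes until it recovers, regardless of how clean the incoming build looks.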

Rollback Automation and the Decision Tree

Automated rollback sounds straightforward until you consider the cases where it makes things worse: a bad database migration that has already run, a rollback that itself fails, or a configuration change where the previous version is also broken. The decision tree for rollback automation should distinguish severity by signal type.

Automatic rollback — no human required — is appropriate when the health signal is unambiguous and reversible: error rate on the new version exceeds a fixed threshold within the first two minutes post-deploy, with no concurrent infrastructure changes detected. Manual gate — page on-call but do not roll back — is appropriate when signals are ambiguous (elevated latency but error rate is flat), when a migration is in progress, or when the previous deployment is less than an hour old (suggesting the prior version may also be problematic). The verify_deployment.py script above should encode this logic explicitly rather than leaving it to the operator to interpret raw metrics at 3am.
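The decision tree described above can be written down as a small pure function, which makes it reviewable and testable before anyone depends on it at 3am. This is a sketch under stated assumptions: `DeploySignals` and its fields are hypothetical, the default thresholds reuse the values passed to `verify_deployment.py` in the workflow, and "previous deployment less than an hour old" is interpreted as how long the prior version had been live.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    AUTO_ROLLBACK = "auto_rollback"
    PAGE_ONCALL = "page_oncall"


@dataclass(frozen=True)
class DeploySignals:
    error_rate: float               # fraction of failed requests on the new version
    latency_p99_ms: float
    seconds_since_deploy: float
    migration_in_progress: bool
    infra_change_detected: bool
    previous_deploy_age_s: float    # how long the prior version had been live


def decide(s, error_rate_threshold=0.01, latency_p99_threshold_ms=500):
    errors_high = s.error_rate > error_rate_threshold
    latency_high = s.latency_p99_ms > latency_p99_threshold_ms
    if not (errors_high or latency_high):
        return Action.NONE
    # Anything that makes rollback risky or the signal ambiguous goes to a human.
    risky = (
        s.migration_in_progress
        or s.infra_change_detected
        or s.previous_deploy_age_s < 3600
    )
    if errors_high and s.seconds_since_deploy <= 120 and not risky:
        return Action.AUTO_ROLLBACK
    return Action.PAGE_ONCALL


clean_failure = DeploySignals(0.05, 300, 90, False, False, 86_400)
latency_only = DeploySignals(0.001, 900, 90, False, False, 86_400)
mid_migration = DeploySignals(0.05, 300, 90, True, False, 86_400)
```

Only the first case rolls back automatically; elevated latency alone and failures during a migration both page on-call instead.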

Managing Pipeline Dependencies

An unpinned CI dependency is a supply chain vulnerability waiting to materialize. Referencing actions/checkout@main means any commit pushed upstream, including a malicious one, runs inside your deployment pipeline on the next build. Every dependency in a pipeline should be pinned: actions to a full commit SHA, CLI tools to an exact version.

# Pinned — safe
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
- uses: docker/build-push-action@ca052bb54ab0790a636c9b5f226502c73d547a25  # v6.2.0

# Unpinned — dangerous
- uses: actions/checkout@main
- uses: docker/build-push-action@latest
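Pinning discipline is easier to keep when a check enforces it. The sketch below scans workflow text for `uses:` references that are not pinned to a 40-character commit SHA. It is a line-based regex check rather than a YAML parser, so quoted values and unusual formatting are not handled; `unpinned_actions` is a hypothetical helper.

```python
import re

USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return `uses:` references that are not pinned to a full commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images follow other rules
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.match(version):
            findings.append(ref)
    return findings


sample = """
steps:
  - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
  - uses: actions/checkout@main
  - name: Build
    uses: docker/build-push-action@v6
  - uses: ./.github/actions/local-step
"""
findings = unpinned_actions(sample)
```

Run over every file in `.github/workflows/` as a CI step, a non-empty result can fail the build before an unpinned reference reaches main.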

Renovate or Dependabot can automate the update cycle without sacrificing pinning discipline. Configure Renovate to open PRs that update the SHA pin along with the friendly version comment, run the full pipeline on the PR, and auto-merge on green. This gives you the security of pinning with the freshness of automatic updates.

// renovate.json — pin and auto-merge GitHub Actions
{
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "matchManagers": ["github-actions"],
      "pinDigests": true,
      "automerge": true,
      "automergeType": "pr"
    }
  ]
}

The compounding effect of these practices (quarantined flaky tests, aggressive caching, OTel instrumentation, hard deployment gates, and pinned dependencies) is a pipeline that behaves like the production service it is. The team stops dreading CI. Deployment becomes boring. Boring deployments are the goal.

*Zak Hassan is a Staff SRE specializing in deployment reliability, observability, and developer productivity. Find him at zakhassan.com or on LinkedIn.*
