*By Zak Hassan — Staff SRE | May 2026*


The most common deployment pipeline at mid-sized engineering organizations looks like this: a CI job runs tests, builds a container image, and then calls kubectl set image or helm upgrade against a live cluster. It works, right up until it doesn't. Drift accumulates silently as engineers apply hotfixes directly. Nobody can answer "what is actually running in production right now?" without SSHing into a node and checking. An audit asks for the deployment history of a service and you realize the only record is the CI job logs, which rolled off S3 two weeks ago.

The push model feels fast, but it trades away every property that makes infrastructure manageable at scale: reconciliation, auditability, and a single authoritative source of truth. GitOps inverts this entirely. Git is the source of truth. The cluster pulls toward that truth continuously. Humans (and CI) only ever write to Git.

Why Push-Based Deployment Breaks at Scale

In a push model, the CI pipeline holds credentials that can write to production. Every engineer who touches the pipeline inherits that blast radius. More critically, the pipeline only runs when triggered — it has no knowledge of what the cluster looks like between deployments. When a developer runs kubectl scale deployment api --replicas=0 to debug a production incident and forgets to revert it, the cluster stays in that broken state indefinitely. No alarm fires. The next deploy might overwrite it, or might not touch replicas at all, depending on the Helm chart values. You find out during the next incident when you wonder why traffic is falling over.

GitOps solves this through continuous reconciliation. An in-cluster controller watches a Git repository and constantly compares the desired state expressed in Git against the observed state in the cluster. Any deviation is corrected automatically. The audit trail is your Git commit history — you know exactly what changed, when, and who approved it, because nothing reaches the cluster without going through Git first.

ArgoCD: Application CRDs and Sync Policies

ArgoCD runs as a set of controllers inside the cluster. You define applications declaratively using its Application CRD, and ArgoCD takes care of pulling from Git and applying resources. A minimal but production-realistic Application looks like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/acme/k8s-manifests
    targetRevision: main
    path: apps/payments-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
```

The two flags under automated are where the real GitOps behavior lives. selfHeal: true means that if someone manually patches a resource in the cluster, ArgoCD will detect the drift and revert it within a few seconds. prune: true means resources that are removed from Git are also removed from the cluster — no orphaned ConfigMaps accumulating over time. Without prune, GitOps quickly becomes GitOps-except-for-deletions, which is a meaningful gap.
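
The Application above points at a project named production, which has to exist before ArgoCD will sync it. A minimal sketch of a matching AppProject follows. The description text and the exact allow-lists are assumptions chosen to fit the example, not settings taken from a real installation; the Namespace entry is there because CreateNamespace=true needs permission to create that cluster-scoped resource.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production workloads (illustrative)
  # Only manifests from this repository may be deployed through the project.
  sourceRepos:
    - https://github.com/acme/k8s-manifests
  # Only the in-cluster payments namespace is a valid target.
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments
  # Allow the project to create Namespaces, needed by CreateNamespace=true.
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
```

Keeping the allow-lists this narrow means a typo in an Application's destination is rejected instead of deploying to the wrong namespace.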

ArgoCD's health assessment system is worth understanding. It ships with health checks for all core Kubernetes resource types and knows, for example, that a Deployment is healthy only when its desired and ready replica counts match. For custom resources, you can write Lua health checks directly in the ArgoCD ConfigMap. This means your ArgoCD dashboard isn't just showing sync status. It also shows whether the workloads have actually rolled out and report themselves as healthy.
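
As an illustration of a custom Lua health check, here is a sketch for a made-up custom resource, Widget in the API group example.com. The resource and its Ready condition are hypothetical; the key format in argocd-cm and the hs.status values are the parts that carry over to a real CRD.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  # Key format: resource.customizations.health.<group>_<kind>
  # example.com_Widget is a hypothetical CRD used only for illustration.
  resource.customizations.health.example.com_Widget: |
    -- Derive health from the resource's Ready condition.
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for Ready condition"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" then
          if condition.status == "True" then
            hs.status = "Healthy"
          else
            hs.status = "Degraded"
          end
          hs.message = condition.message
        end
      end
    end
    return hs
```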

Flux: Kustomization and HelmRelease CRDs

Flux takes a more composable, Kubernetes-native approach than ArgoCD. Rather than a single monolithic Application concept, Flux separates concerns across multiple CRDs: GitRepository sources, Kustomization appliers, and HelmRelease for Helm-based workloads. This lets you build dependency graphs between components, which is critical for infrastructure-layer bootstrapping.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 10m
  retryInterval: 2m
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  path: ./infrastructure/cert-manager
  prune: true
  # wait: true health-checks every resource this Kustomization applies.
  # An explicit healthChecks list is ignored when wait is enabled, so none is set.
  wait: true
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 10m
  dependsOn:
    - name: cert-manager
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  path: ./infrastructure/ingress-nginx
  prune: true
  wait: true
```

The dependsOn field is what allows Flux to install cert-manager before ingress-nginx, and to skip applying ingress-nginx if cert-manager is unhealthy. This ordering guarantee is something teams frequently implement badly with ad-hoc sleep loops in CI scripts. Flux makes it declarative.
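
Both Kustomizations reference a GitRepository named fleet-infra, which is the source object Flux polls for new commits. A minimal sketch of that source is shown here. The repository URL, branch, and credentials Secret name are assumptions, since the article does not specify them.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-infra
  namespace: flux-system
spec:
  # How often Flux checks the repository for new commits.
  interval: 1m
  # Hypothetical URL; substitute your own fleet repository.
  url: https://github.com/acme/fleet-infra
  ref:
    branch: main
  # Assumed name of a Secret holding Git credentials for a private repository.
  secretRef:
    name: flux-system
```

The source interval controls how quickly new commits are noticed, while each Kustomization's own interval controls how often the applied state is re-checked for drift.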

For Helm-managed applications, the HelmRelease CRD gives you continuous reconciliation over Helm charts with fine-grained control over upgrade behavior:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: redis
  namespace: cache
spec:
  interval: 15m
  chart:
    spec:
      chart: redis
      version: ">=19.0.0 <20.0.0"
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: flux-system
  values:
    auth:
      enabled: true
      existingSecret: redis-auth
    replica:
      replicaCount: 3
  upgrade:
    remediation:
      retries: 3
      strategy: rollback
  rollback:
    timeout: 5m
    cleanupOnFail: true
```

Drift Detection and Remediation in Practice

Drift detection is only valuable if you act on it. Both ArgoCD and Flux surface drift in their status conditions, but you need that signal plumbed into your alerting stack. ArgoCD exposes Prometheus metrics including argocd_app_info with a sync_status label. The expression below selects out-of-sync apps. Used in an alerting rule with a for: 5m clause, it catches apps that have been out of sync for more than five minutes:

```promql
argocd_app_info{sync_status="OutOfSync"} == 1
```

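For Flux, the equivalent is watching the gotk_reconcile_condition metric. Recent Flux releases deprecate this controller metric in favor of gotk_resource_info exported through kube-state-metrics, so confirm which one your installation exposes:

```promql
gotk_reconcile_condition{type="Ready", status="False"} == 1
```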

When self-heal is enabled, most drift resolves within the reconciliation interval without human intervention. The cases that don't self-heal — usually because the drift involves a resource type that the controller doesn't manage, or because the manual change introduced a conflict — are exactly the ones you want paged on.
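
To turn the ArgoCD expression into something that pages, it can be wrapped in a Prometheus alerting rule. This is a sketch of a rule file. The group name, alert name, severity label, and annotation wording are all assumptions to adapt to your own conventions.

```yaml
groups:
  - name: gitops-drift          # assumed group name
    rules:
      - alert: ArgoCDAppOutOfSync
        # Fires only if the app stays OutOfSync for the whole 5 minute window,
        # which filters out drift that self-heal corrects on its own.
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 5m
        labels:
          severity: page        # assumed routing label
        annotations:
          summary: "ArgoCD application {{ $labels.name }} has been OutOfSync for more than 5 minutes"
```

The for clause is what separates transient drift from the stuck cases described above.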

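The following script queries ArgoCD's API once and prints a summary of any application that is out of sync or unhealthy. It exits non-zero when it finds one, which makes it usable as a pre-deployment gate or a cron-driven Slack notification:

```bash
#!/usr/bin/env bash
# Lists ArgoCD applications that are out of sync or unhealthy.
# Exits non-zero when any are found, so it can act as a pre-deployment gate.
set -euo pipefail

ARGOCD_SERVER="${ARGOCD_SERVER:-argocd.internal.example.com}"
TOKEN="${ARGOCD_TOKEN:?ARGOCD_TOKEN must be set}"

# Fetch all applications and keep only those not Synced or not Healthy.
degraded="$(
  curl -sSf \
    -H "Authorization: Bearer ${TOKEN}" \
    "https://${ARGOCD_SERVER}/api/v1/applications" \
  | jq -r '
    .items[]?
    | select(
        .status.sync.status != "Synced"
        or .status.health.status != "Healthy"
      )
    | [.metadata.name, .status.sync.status, .status.health.status]
    | @tsv
  '
)"

# Nothing degraded: report success and let the caller proceed.
if [[ -z "${degraded}" ]]; then
  echo "All applications are Synced and Healthy"
  exit 0
fi

# Print one aligned row per degraded application, then fail the gate.
column -t -s $'\t' <<<"${degraded}" | sed 's/^/DEGRADED: /'
exit 1
```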

Progressive Delivery with Flagger

Knowing that the cluster matches Git is necessary but not sufficient for safe deployments. You also need to know that the new version of the application is actually working before it receives 100% of production traffic. Flagger integrates with ArgoCD and Flux to provide canary analysis driven by real metrics.

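Flagger watches the target Deployment. When the Canary is first created, Flagger makes a primary copy of the Deployment (named with a -primary suffix) that serves live traffic. When the target's spec later changes, Flagger treats the new version as the canary and routes a configurable percentage of traffic to it. It queries your metrics backend (Prometheus, Datadog, or others) and advances the canary only if the metrics pass defined thresholds. If they don't, it rolls back automatically.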

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-api
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  progressDeadlineSeconds: 600
  service:
    port: 8080
    targetPort: 8080
    gateways:
      - istio-system/public-gateway
    hosts:
      - payments.example.com
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 https://payments.example.com/healthz"
```

With this configuration, Flagger shifts 10% of traffic to the canary, waits one minute, checks that the success rate is at least 99% and p99 latency is at most 500ms, then advances to 20%, and so on up to 50%, after which it promotes the new version. Five failed checks trigger an automatic rollback. This analysis happens entirely within the cluster, with no CI involvement: the rollback fires faster than any human could respond, and it fires based on real production signal rather than synthetic checks.
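
The two metrics in the Canary are Flagger built-ins. When the signal you care about is more specific, a MetricTemplate lets the analysis run an arbitrary query. The sketch below computes a 5xx percentage from Istio request metrics. The template name, the Prometheus address, and the 1% threshold are assumptions, and the query assumes Istio's standard istio_requests_total metric is being scraped.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: payments-5xx-rate        # hypothetical name
  namespace: payments
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090   # assumed Prometheus endpoint
  # Flagger fills in namespace, target and interval at analysis time.
  query: |
    100 * sum(rate(istio_requests_total{
      reporter="destination",
      destination_workload_namespace="{{ namespace }}",
      destination_workload="{{ target }}",
      response_code=~"5.."
    }[{{ interval }}]))
    /
    sum(rate(istio_requests_total{
      reporter="destination",
      destination_workload_namespace="{{ namespace }}",
      destination_workload="{{ target }}"
    }[{{ interval }}]))
---
# Fragment: entry to add under spec.analysis.metrics in the Canary.
- name: payments-5xx-rate
  templateRef:
    name: payments-5xx-rate
    namespace: payments
  thresholdRange:
    max: 1                       # fail the check above 1% server errors
  interval: 1m
```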

GitOps for Infrastructure: Beyond Applications

GitOps principles apply equally to infrastructure, not just application workloads. Terraform via Atlantis brings a GitOps-style workflow to infrastructure code: opening a pull request triggers terraform plan, with the output posted as a PR comment, and the plan is applied from the approved PR with an atlantis apply comment, usually just before merge. The blast radius of any infrastructure change is visible in the PR before it lands, and the PR with its merge commit is the audit record.
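
A repo-level atlantis.yaml is where this workflow is usually pinned down. The sketch below assumes a hypothetical terraform/networking directory and a shared modules directory. It also assumes the Atlantis server configuration allows repositories to override apply_requirements, which is not the default.

```yaml
version: 3
projects:
  - name: networking                 # hypothetical project name
    dir: terraform/networking        # hypothetical path
    autoplan:
      enabled: true
      # Re-plan when this project's files or the shared modules change.
      when_modified: ["*.tf", "../modules/**/*.tf"]
    # Block apply until the PR is approved and mergeable.
    apply_requirements: [approved, mergeable]
```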

The Terraform controller for Flux (now maintained as tofu-controller) goes further, running the Terraform reconciliation loop inside the cluster on an interval, the same way Flux reconciles application manifests. This means your VPC route tables, RDS parameter groups, and IAM roles are subject to the same drift detection as your Deployments. With automatic plan approval enabled, a manual change to a security group gets reverted on the next reconciliation cycle.
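
A sketch of what such a resource can look like follows, assuming the controller's Terraform CRD at the v1alpha2 API version and a hypothetical ./terraform/dns directory in the fleet-infra repository. Check the API version and field names against the controller release you install.

```yaml
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: dns-records                # hypothetical name
  namespace: flux-system
spec:
  # Re-plan on this interval; a non-empty plan indicates drift or a new commit.
  interval: 10m
  # auto applies every plan without review; omit it to require manual approval.
  approvePlan: auto
  path: ./terraform/dns            # hypothetical path
  sourceRef:
    kind: GitRepository
    name: fleet-infra
    namespace: flux-system
```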

The practical recommendation for most teams is Atlantis for infrastructure that requires careful human review (networking, IAM, databases) and the Flux Terraform controller for infrastructure that should stay locked to a declared state (DNS records, monitoring rules, cost allocation tags).

Secrets Handling: The One Thing You Cannot Put in Git

GitOps has one uncomfortable tension: if Git is the source of truth, what do you do with secrets? Putting raw secrets in Git — even a private repository — is a non-starter. Rotation becomes a nightmare, and any developer with repository access has production credentials.

The two patterns that actually work at scale are Sealed Secrets and SOPS. Sealed Secrets uses asymmetric encryption: a controller running in the cluster holds a private key, and you encrypt secrets with the corresponding public key before committing. Only the in-cluster controller can decrypt them. The committed SealedSecret resource is safe to store in Git. The limitation is that the encrypted value is tied to a specific cluster's key, which complicates multi-cluster setups.
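
For reference, a committed SealedSecret has roughly the shape below. The ciphertext is a placeholder, not real kubeseal output, and the redis-password key is an assumption about what the Redis chart expects in its existing Secret.

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: redis-auth
  namespace: cache
spec:
  encryptedData:
    # Placeholder; the real value is a long base64 string produced by kubeseal.
    redis-password: AgB...placeholder...
  template:
    # The controller decrypts into a regular Secret with this metadata.
    metadata:
      name: redis-auth
      namespace: cache
    type: Opaque
```

By default the ciphertext is bound to the Secret's name and namespace as well as the cluster key, so renaming or moving the resource requires re-sealing.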

SOPS (Secrets OPerationS) encrypts secret values with a key held in a KMS (AWS KMS, GCP KMS, or Azure Key Vault) or with an age or PGP key, and stores the encrypted file in Git. The plaintext is never committed. It exists only in memory while the controller decrypts it. Flux has native SOPS integration, while ArgoCD relies on a plugin such as KSOPS or helm-secrets. Flux decrypts at apply time, using a secretRef that names a Kubernetes Secret holding the decryption credentials or keys:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: payments-api-secrets
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  path: ./secrets/payments
  # prune is a required field on a Flux Kustomization.
  prune: true
  decryption:
    provider: sops
    secretRef:
      name: sops-aws-creds
```
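
On the authoring side, a .sops.yaml file at the repository root tells the sops CLI which key to use for which files. This is a sketch only: the KMS key ARN is a made-up placeholder, and the path pattern assumes secrets live under a secrets/ directory as in the Kustomization above.

```yaml
creation_rules:
  # Applies to any YAML file under secrets/.
  - path_regex: secrets/.*\.yaml$
    # Encrypt only the Secret payload, leaving metadata readable in diffs.
    encrypted_regex: ^(data|stringData)$
    # Placeholder ARN; replace with your own KMS key.
    kms: arn:aws:kms:us-east-1:111122223333:key/00000000-0000-0000-0000-000000000000
```

Restricting encryption to data and stringData keeps pull request diffs reviewable, because resource names and namespaces stay in plaintext.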

The third pattern — and often the right one for mature platforms — is to keep no secret values in Git at all and instead use an external secrets operator that pulls from Vault or AWS Secrets Manager at runtime. This separates secret lifecycle management from the GitOps reconciliation loop entirely, which makes rotation straightforward and audit trails cleaner.
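
With the External Secrets Operator, what lives in Git is only a pointer to the secret, never its value. The sketch below assumes a ClusterSecretStore named aws-secrets-manager has already been configured and that the secret is stored under a hypothetical prod/payments-api/db path. The apiVersion also depends on the operator release you run.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-api-db            # hypothetical name
  namespace: payments
spec:
  # Re-read the upstream secret hourly, so rotations propagate without a commit.
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # assumed, pre-existing store
  target:
    # Name of the Kubernetes Secret the operator creates and owns.
    name: payments-api-db
    creationPolicy: Owner
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments-api/db  # hypothetical path in AWS Secrets Manager
        property: password
```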

Putting It Together

GitOps is not a tool choice — it is a discipline. ArgoCD and Flux are both excellent implementations; the decision between them usually comes down to whether you prefer a unified UI-first experience (ArgoCD) or a composable, CRD-first architecture (Flux). What matters more than the tool is the cultural commitment: nothing reaches the cluster except through Git, drift is always remediated, and progressive delivery gates every significant change. That combination gives you the auditability of change management without the friction, the safety of staged rollouts without the manual toil, and a cluster that reliably reflects what the team intended — not what someone ran at 2am during an incident.


*Zak Hassan is a Staff SRE specializing in Kubernetes platform engineering, GitOps, and progressive delivery. Find him at zakhassan.com or on LinkedIn.*
