*By Zak Hassan — Staff SRE | May 2026*
The most common deployment pipeline at mid-sized engineering organizations looks like this: a CI job runs tests, builds a container image, and then calls `kubectl set image` or `helm upgrade` against a live cluster. It works, right up until it doesn't. Drift accumulates silently as engineers apply hotfixes directly. Nobody can answer "what is actually running in production right now?" without SSHing into a node and checking. An audit asks for the deployment history of a service and you realize the only source of truth is the CI job logs, which rolled off S3 two weeks ago. The push model feels fast, but it trades away every property that makes infrastructure manageable at scale: reconciliation, auditability, and a single authoritative source of truth.

GitOps inverts this entirely. Git is the source of truth. The cluster pulls toward that truth continuously. Humans (and CI) only ever write to Git.
Why Push-Based Deployment Breaks at Scale
In a push model, the CI pipeline holds credentials that can write to production. Every engineer who touches the pipeline inherits that blast radius. More critically, the pipeline only runs when triggered, so it has no knowledge of what the cluster looks like between deployments. When a developer runs `kubectl scale deployment api --replicas=0` to debug a production incident and forgets to revert it, the cluster stays in that broken state indefinitely. No alarm fires. The next deploy might overwrite it, or might not touch replicas at all, depending on the Helm chart values. You find out during the next incident when you wonder why traffic is falling over.
GitOps solves this through continuous reconciliation. An in-cluster controller watches a Git repository and constantly compares the desired state expressed in Git against the observed state in the cluster. Any deviation is corrected automatically. The audit trail is your Git commit history — you know exactly what changed, when, and who approved it, because nothing reaches the cluster without going through Git first.
ArgoCD: Application CRDs and Sync Policies
ArgoCD runs as a set of controllers inside the cluster. You define applications declaratively using its Application CRD, and ArgoCD takes care of pulling from Git and applying resources. A minimal but production-realistic Application looks like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/acme/k8s-manifests
    targetRevision: main
    path: apps/payments-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
```

The two flags under `automated` are where the real GitOps behavior lives. `selfHeal: true` means that if someone manually patches a resource in the cluster, ArgoCD will detect the drift and revert it within a few seconds. `prune: true` means resources that are removed from Git are also removed from the cluster, so no orphaned ConfigMaps accumulate over time. Without `prune`, GitOps quickly becomes GitOps-except-for-deletions, which is a meaningful gap.
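The Application above points at `project: production`, which has to exist as an AppProject before the Application can sync. The following is a minimal sketch of such a project, assuming it should only accept manifests from the repository and namespace used in the example. The description text and the exact whitelist are illustrative assumptions, not something the original setup specifies.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production workloads (illustrative)
  # Only manifests from this repository may be deployed through the project.
  sourceRepos:
    - https://github.com/acme/k8s-manifests
  # Only this cluster/namespace pair is a valid destination.
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments
  # Assumption: Namespace is allowed so that CreateNamespace=true can work.
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
```

Scoping the project this way limits what a compromised or mistaken commit can touch: an Application in this project cannot deploy from another repository or into another namespace.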
ArgoCD's health assessment system is worth understanding. It ships with health checks for the standard Kubernetes workload types and knows, for example, that a Deployment is healthy only when its desired and ready replica counts match. For custom resources, you can write Lua health checks directly in the `argocd-cm` ConfigMap. This means your ArgoCD dashboard isn't just showing sync status. It also shows whether the resources behind each application have actually become ready.
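As a rough illustration of a custom health check, the sketch below registers a Lua script for a hypothetical `Widget` resource in the `example.com` API group. It assumes the resource reports a `Ready` condition in its status. The resource kind, group, and messages are invented for the example; only the `resource.customizations.health.<group>_<kind>` key format comes from ArgoCD.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Key format: resource.customizations.health.<group>_<kind>
  # "example.com_Widget" is a hypothetical custom resource.
  resource.customizations.health.example.com_Widget: |
    hs = {}
    -- Default until the resource reports a Ready condition.
    hs.status = "Progressing"
    hs.message = "Waiting for Ready condition"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" then
          if condition.status == "True" then
            hs.status = "Healthy"
          elseif condition.status == "False" then
            hs.status = "Degraded"
          end
          hs.message = condition.message
        end
      end
    end
    return hs
```

The script returns one of ArgoCD's health statuses, so a `Widget` that never becomes ready shows up as Progressing or Degraded rather than silently appearing fine once it is synced.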
Flux: Kustomization and HelmRelease CRDs
Flux takes a more composable, Kubernetes-native approach than ArgoCD. Rather than a single monolithic Application concept, Flux separates concerns across multiple CRDs: GitRepository sources, Kustomization appliers, and HelmRelease for Helm-based workloads. This lets you build dependency graphs between components, which is critical for infrastructure-layer bootstrapping.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 10m
  retryInterval: 2m
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  path: ./infrastructure/cert-manager
  prune: true
  # With wait: true, Flux health-checks every applied resource and
  # ignores the explicit healthChecks list below.
  wait: true
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: cert-manager
      namespace: cert-manager
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 10m
  dependsOn:
    - name: cert-manager
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  path: ./infrastructure/ingress-nginx
  prune: true
  wait: true
```

The `dependsOn` field is what allows Flux to install cert-manager before ingress-nginx, and to skip applying ingress-nginx if cert-manager is unhealthy. This ordering guarantee is something teams frequently implement badly with ad-hoc sleep loops in CI scripts. Flux makes it declarative.
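Both Kustomizations above reference a GitRepository named `fleet-infra` through `sourceRef`. A minimal sketch of that source object follows. The repository URL, branch, and polling interval are assumptions made for illustration, since the original does not state them.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-infra
  namespace: flux-system
spec:
  # How often Flux checks the repository for new commits (assumed value).
  interval: 1m
  # Hypothetical repository URL.
  url: https://github.com/acme/fleet-infra
  ref:
    branch: main
```

Separating the source from the Kustomizations means one fetched revision of the repository can feed many appliers, each with its own path, interval, and dependencies.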
For Helm-managed applications, the HelmRelease CRD gives you continuous reconciliation over Helm charts with fine-grained control over upgrade behavior:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: redis
  namespace: cache
spec:
  interval: 15m
  chart:
    spec:
      chart: redis
      version: ">=19.0.0 <20.0.0"
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: flux-system
  values:
    auth:
      enabled: true
      existingSecret: redis-auth
    replica:
      replicaCount: 3
  upgrade:
    remediation:
      retries: 3
      strategy: rollback
  rollback:
    timeout: 5m
    cleanupOnFail: true
```

Drift Detection and Remediation in Practice
Drift detection is only valuable if you act on it. Both ArgoCD and Flux will surface drift in their status conditions, but you need observability plumbed into your alerting stack. ArgoCD exposes Prometheus metrics including `argocd_app_info` with a `sync_status` label. Used as the expression of an alert rule with a five-minute `for` duration, the following query catches apps that have been out of sync for more than five minutes:

```promql
argocd_app_info{sync_status="OutOfSync"} == 1
```

For Flux, the equivalent is watching the `gotk_reconcile_condition` metric on Flux versions that still export it. More recent releases report resource state through kube-state-metrics instead, so check which metrics your installation exposes:

```promql
gotk_reconcile_condition{type="Ready", status="False"} == 1
```

When self-heal is enabled, most drift resolves within the reconciliation interval without human intervention. The cases that don't self-heal, usually because the drift involves a resource type that the controller doesn't manage or because the manual change introduced a conflict, are exactly the ones you want paged on.
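To turn the two expressions above into alerts, they need to sit inside a Prometheus rule file with a `for` duration. The sketch below is one way to do that. The group name, alert names, severity label, and the ten-minute window for Flux are assumptions, and the Flux rule only works where `gotk_reconcile_condition` is actually exported.

```yaml
groups:
  - name: gitops-drift
    rules:
      - alert: ArgoCDAppOutOfSync
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        # The expression must hold continuously for 5 minutes before firing.
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "ArgoCD application {{ $labels.name }} has been OutOfSync for 5 minutes"
      - alert: FluxReconciliationFailing
        # Metric availability depends on the Flux version in use.
        expr: gotk_reconcile_condition{type="Ready", status="False"} == 1
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Flux {{ $labels.kind }}/{{ $labels.name }} is not Ready"
```

The `for` clause is what filters out drift that self-heal corrects on its own, so only persistent deviations reach a human.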
The following script polls ArgoCD's API and prints a summary of any application that is out of sync or unhealthy, useful as a pre-deployment gate or a cron-driven Slack notification:
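
```bash
#!/usr/bin/env bash
# Lists ArgoCD applications that are out of sync or unhealthy.
# Exits non-zero when any are found, so it can act as a pre-deployment gate.
set -euo pipefail

ARGOCD_SERVER="${ARGOCD_SERVER:-argocd.internal.example.com}"
TOKEN="${ARGOCD_TOKEN:?ARGOCD_TOKEN must be set}"

# One tab-separated line per degraded app: name, sync status, health status.
# ".items[]?" tolerates an empty application list.
degraded="$(
  curl -sSf \
    -H "Authorization: Bearer ${TOKEN}" \
    "https://${ARGOCD_SERVER}/api/v1/applications" \
  | jq -r '
      .items[]?
      | select(
          .status.sync.status != "Synced"
          or .status.health.status != "Healthy"
        )
      | [.metadata.name, .status.sync.status, .status.health.status]
      | @tsv
    '
)"

if [[ -n "${degraded}" ]]; then
  column -t -s $'\t' <<<"${degraded}" | sed 's/^/DEGRADED: /'
  exit 1
fi
```

Progressive Delivery with Flagger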
Knowing that the cluster matches Git is necessary but not sufficient for safe deployments. You also need to know that the new version of the application is actually working before it receives 100% of production traffic. Flagger integrates with ArgoCD and Flux to provide canary analysis driven by real metrics.
Flagger takes over a Deployment by creating a primary copy of it, which serves live traffic, and treating the original Deployment as the canary. When the canary's pod spec changes, Flagger scales it up, routes a configurable percentage of traffic to it, and queries your metrics backend (Prometheus, Datadog, or others). It advances the canary only if the metrics pass defined thresholds. If they don't, it rolls back automatically.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-api
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  progressDeadlineSeconds: 600
  service:
    port: 8080
    targetPort: 8080
    gateways:
      - istio-system/public-gateway
    hosts:
      - payments.example.com
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 https://payments.example.com/healthz"
```

With this configuration, Flagger will shift 10% of traffic to the canary, wait one minute, check that success rate is above 99% and p99 latency is below 500ms, then advance to 20%, and so on up to 50%, after which it promotes the new version. Five failed checks trigger an automatic rollback. This analysis happens entirely within the cluster, with no CI involvement. The rollback fires faster than any human could respond, and it fires based on real production signal rather than synthetic checks.
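The two metrics in the Canary above are Flagger built-ins. When those are not enough, a custom query can be defined with a MetricTemplate. The sketch below computes the percentage of 5xx responses from Istio telemetry, assuming Prometheus is reachable at a hypothetical `http://prometheus.monitoring:9090`. The template name, address, and 1% threshold are illustrative choices.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-percentage
  namespace: payments
spec:
  provider:
    type: prometheus
    # Hypothetical Prometheus address.
    address: http://prometheus.monitoring:9090
  # 100 minus the percentage of non-5xx responses = percentage of 5xx.
  # {{ namespace }}, {{ target }} and {{ interval }} are filled in by Flagger.
  query: |
    100 - sum(
      rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace="{{ namespace }}",
          destination_workload="{{ target }}",
          response_code!~"5.."
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace="{{ namespace }}",
          destination_workload="{{ target }}"
        }[{{ interval }}]
      )
    ) * 100
# Referenced from the Canary under spec.analysis.metrics, for example:
#   - name: error-percentage
#     templateRef:
#       name: error-percentage
#       namespace: payments
#     thresholdRange:
#       max: 1
#     interval: 1m
```

Keeping the query in its own resource lets several Canary objects share one definition and keeps thresholds, which differ per service, next to the service they apply to.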
GitOps for Infrastructure: Beyond Applications
GitOps principles apply equally to infrastructure, not just application workloads. Atlantis brings a GitOps-style workflow to Terraform: opening a pull request triggers `terraform plan`, with the output posted as a PR comment, and in the default workflow an `atlantis apply` comment on the approved pull request runs `terraform apply` before the merge. The blast radius of any infrastructure change is visible in the PR before it lands, and the pull request, with its plan output, approvals, and apply result, is the audit record.
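A repository-level `atlantis.yaml` is where this workflow is usually pinned down. The following is a minimal sketch, assuming a single Terraform project under a hypothetical `terraform/networking` directory. The project name, directory, and file patterns are illustrative.

```yaml
version: 3
# Merge the pull request automatically once every plan has been applied.
automerge: true
projects:
  - name: networking
    # Hypothetical directory holding the Terraform root module.
    dir: terraform/networking
    autoplan:
      enabled: true
      # Re-plan whenever Terraform sources or variable files change.
      when_modified: ["*.tf", "*.tfvars"]
    # Require an approving review and a mergeable PR before apply.
    # Assumption: the server-side config allows repos to override this key.
    apply_requirements: [approved, mergeable]
```

With `apply_requirements` set, nobody can apply a plan that has not been reviewed, which is what makes the pull request a trustworthy audit record.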
The Terraform controller for Flux (tf-controller, continued as tofu-controller) is a community controller in the Flux ecosystem that goes further, running the Terraform reconciliation loop inside the cluster on a configurable interval, just as Flux does for application manifests. This means your VPC route tables, RDS parameter groups, and IAM roles are subject to the same drift detection as your Deployments. When plans are approved automatically, a manual change to a security group gets reverted on the next reconciliation cycle.
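As a rough sketch of what this looks like, the controller is driven by a `Terraform` custom resource that points at a path in a Flux source. The example assumes the `infra.contrib.fluxcd.io/v1alpha2` API version, which may differ in the release you run, and uses a hypothetical `./terraform/dns` path and the `fleet-infra` source from earlier.

```yaml
# API version is an assumption; check the CRDs installed in your cluster.
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: dns-records
  namespace: flux-system
spec:
  # How often the controller plans and checks for drift.
  interval: 10m
  # "auto" applies every plan without manual approval.
  approvePlan: auto
  # Hypothetical path to the Terraform root module in the repository.
  path: ./terraform/dns
  sourceRef:
    kind: GitRepository
    name: fleet-infra
    namespace: flux-system
```

Automatic approval is what makes drift correction hands-off, and it is also why this mode suits low-risk resources better than networking or IAM.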
The practical recommendation for most teams is Atlantis for infrastructure that requires careful human review (networking, IAM, databases) and the Flux Terraform controller for infrastructure that should stay locked to a declared state (DNS records, monitoring rules, cost allocation tags).
Secrets Handling: The One Thing You Cannot Put in Git
GitOps has one uncomfortable tension: if Git is the source of truth, what do you do with secrets? Putting raw secrets in Git — even a private repository — is a non-starter. Rotation becomes a nightmare, and any developer with repository access has production credentials.
The two patterns that actually work at scale are Sealed Secrets and SOPS. Sealed Secrets uses asymmetric encryption: a controller running in the cluster holds a private key, and you encrypt secrets with the corresponding public key before committing. Only the in-cluster controller can decrypt them. The committed SealedSecret resource is safe to store in Git. The limitation is that the encrypted value is tied to a specific cluster's key, which complicates multi-cluster setups.
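For orientation, a committed SealedSecret has roughly the shape below. This sketch reuses the `redis-auth` name from the HelmRelease example. The key name `redis-password` is an assumption, and the ciphertext is a placeholder rather than real `kubeseal` output.

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: redis-auth
  namespace: cache
spec:
  encryptedData:
    # Placeholder: the real value is the ciphertext produced by kubeseal
    # using the cluster controller's public key.
    redis-password: "<ciphertext produced by kubeseal>"
  # Template for the plain Secret the controller creates after decryption.
  template:
    metadata:
      name: redis-auth
      namespace: cache
    type: Opaque
```

By default the ciphertext is bound to the Secret's name and namespace as well as the cluster key, so a sealed value cannot be copied into a different namespace and still decrypt.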
SOPS (Secrets OPerationS) encrypts secret values with a key from a cloud KMS (AWS KMS, GCP KMS, or Azure Key Vault), or with age or PGP keys, and stores the encrypted file in Git. The plaintext is never written to disk. It exists only in memory during decryption. Flux has native SOPS integration, while ArgoCD relies on a plugin such as KSOPS or helm-secrets. Flux decrypts at apply time, using a `secretRef` that points to a Kubernetes Secret holding the decryption key or the credentials for the KMS:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: payments-api-secrets
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: fleet-infra
  path: ./secrets/payments
  # prune is a required field of the Kustomization spec.
  prune: true
  decryption:
    provider: sops
    secretRef:
      name: sops-aws-creds
```

The third pattern, and often the right one for mature platforms, is to keep no secret values in Git at all and instead use an external secrets operator that pulls from Vault or AWS Secrets Manager at runtime. This separates secret lifecycle management from the GitOps reconciliation loop entirely, which makes rotation straightforward and audit trails cleaner.
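With the External Secrets Operator, what lives in Git is only a pointer to the secret, not its value. The sketch below assumes a ClusterSecretStore named `aws-secrets-manager` already exists and that the password is stored under a hypothetical `prod/cache/redis` entry. The API version shown is `v1beta1`; newer operator releases may serve a different version.

```yaml
# API version is an assumption; match it to the operator release installed.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: redis-auth
  namespace: cache
spec:
  # How often the operator re-reads the value from the backing store.
  refreshInterval: 1h
  secretStoreRef:
    # Hypothetical store configured separately with cloud credentials.
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    # Name of the Kubernetes Secret the operator creates and keeps updated.
    name: redis-auth
  data:
    - secretKey: redis-password
      remoteRef:
        # Hypothetical entry and property in AWS Secrets Manager.
        key: prod/cache/redis
        property: password
```

Rotation then happens in the secret store, and the operator picks up the new value on its next refresh without any commit.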
Putting It Together
GitOps is not a tool choice — it is a discipline. ArgoCD and Flux are both excellent implementations; the decision between them usually comes down to whether you prefer a unified UI-first experience (ArgoCD) or a composable, CRD-first architecture (Flux). What matters more than the tool is the cultural commitment: nothing reaches the cluster except through Git, drift is always remediated, and progressive delivery gates every significant change. That combination gives you the auditability of change management without the friction, the safety of staged rollouts without the manual toil, and a cluster that reliably reflects what the team intended — not what someone ran at 2am during an incident.
*Zak Hassan is a Staff SRE specializing in Kubernetes platform engineering, GitOps, and progressive delivery. Find him at zakhassan.com or on LinkedIn.*