Secrets Management at Scale: From Environment Variables to Zero-Trust

*By Zak Hassan — Staff SRE | May 2026*

Every credential-compromise scenario I model in security reviews has the same basic failure mode: the secret was somewhere it shouldn't have been. A database password in a .env file committed to a private (now public) repo. An API key printed to stdout by a debug log line that nobody cleaned up. A Slack webhook URL hardcoded in a Helm values file because "teams will fix it later." Zero-trust is a philosophy, but secrets management is where it either lives or dies in practice. If you're still handing credentials to applications through environment variables and crossing your fingers, this post is for you.

Why Environment Variables Aren't Good Enough

The appeal of env vars is obvious: they're universally supported, easy to set, and keep secrets out of source code. That's where the good news ends.

Plaintext in logs. Crash reporters, health check endpoints, and debugging tools love to dump the environment. A single printenv left in a container's CMD or entrypoint script, an unguarded console.log(process.env), or a framework that logs all configuration on startup will write your credentials to stdout and from there into your log aggregation platform, indexed and searchable.
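As a small illustration of guarding against the "dump the environment" failure, here is a minimal Python sketch that masks sensitive-looking variables before anything is logged. The name patterns and the redact_env helper are my own assumptions for illustration, not part of any framework; a name-based filter is a safety net, not a substitute for keeping secrets out of the environment in the first place.

```python
import re

# Hypothetical name patterns; tune them to your own naming conventions.
SENSITIVE_NAME = re.compile(
    r"(PASSWORD|SECRET|TOKEN|API_KEY|PRIVATE_KEY|DATABASE_URL)", re.IGNORECASE
)


def redact_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env that is safer to log: sensitive values are masked."""
    return {
        name: "***REDACTED***" if SENSITIVE_NAME.search(name) else value
        for name, value in env.items()
    }
```

Calling redact_env(dict(os.environ)) before a startup log line keeps ordinary settings visible while hiding anything whose name matches the patterns.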

No rotation story. An env var is static. Rotating a database password means redeploying every service that uses it, usually during a maintenance window, usually manually. In practice, rotation doesn't happen, which means a credential exposed six months ago is still valid today.

No audit trail. Who read that secret? When? From which IP? With env vars, you have no idea. You can't answer this question for a compliance audit, and you can't answer it during an incident.

Environment drift. The secret in staging is different from production, which is different from what's in the developer's .env.local, which hasn't been updated since Q3. By the time you're debugging a production-only issue, you've lost confidence in what value each environment is actually using.

Config files on disk (even mounted as Kubernetes secrets) share most of these problems. Kubernetes Secret objects are only base64-encoded, are not encrypted at rest in etcd by default, and are delivered to every node that runs a pod needing them. Without envelope encryption and strict RBAC, kubectl get secret -o yaml is all an attacker needs.
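To make the "base64 is not encryption" point concrete, this short Python sketch decodes the data field of a Secret exactly as anyone with read access could. The secret_data values are made-up examples, not real credentials.

```python
import base64

# Hypothetical Secret data, as it would appear in `kubectl get secret -o yaml`.
secret_data = {
    "DB_USER": "cGF5bWVudHM=",
    "DB_PASSWORD": "aHVudGVyMg==",
}


def decode_secret_data(data: dict[str, str]) -> dict[str, str]:
    """Reverse the base64 encoding; no key or password is required."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


plaintext = decode_secret_data(secret_data)
```

No key material is involved at any step, which is why RBAC and encryption at rest have to carry the protection.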

HashiCorp Vault Architecture

Vault solves this by treating secrets as a service, not a file. Every secret has an owner, an expiry, and an access log.

The core concepts:

Auth methods — how principals prove their identity to Vault. For Kubernetes workloads, the Kubernetes auth method exchanges a pod's service account JWT for a Vault token scoped to a specific policy.
Secret engines — plugins that store or generate secrets. kv-v2 stores versioned key/value pairs. The database engine connects to Postgres, MySQL, or MongoDB and generates short-lived credentials on demand.
Leases — every dynamic secret has a TTL. When the lease expires, Vault revokes the credentials at the source. Applications must renew leases or request new credentials before expiry.
Policies — HCL documents that define which paths a token can read, write, or list.

Enabling Kubernetes auth looks like this:

vault auth enable kubernetes

vault write auth/kubernetes/config \
  kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443" \
  token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

vault write auth/kubernetes/role/payments-api \
  bound_service_account_names=payments-api \
  bound_service_account_namespaces=production \
  policies=payments-api-policy \
  ttl=1h
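For readers who want to see what the pod-side exchange looks like, here is a hedged Python sketch of the login call against the role configured above. It only builds the request and parses a response; the Vault address is a placeholder, and the helper names (build_login_request, extract_client_token) are mine. In practice the Vault Agent or a client library does this for you.

```python
import json
import urllib.request

# Default path where Kubernetes mounts the pod's service account token.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_login_request(vault_addr: str, role: str, jwt: str,
                        mount: str = "kubernetes") -> urllib.request.Request:
    """Build (but do not send) the POST that trades a service account JWT for a Vault token."""
    body = json.dumps({"role": role, "jwt": jwt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_client_token(login_response: dict) -> tuple[str, int]:
    """Pull the token and its TTL in seconds out of a parsed login response."""
    auth = login_response["auth"]
    return auth["client_token"], auth["lease_duration"]
```

Inside a pod you would read the JWT from SA_TOKEN_PATH, send the request with urllib.request.urlopen, and pass the decoded JSON body to extract_client_token.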

The policy itself:

# payments-api-policy.hcl
path "secret/data/production/payments/*" {
  capabilities = ["read"]
}

path "database/creds/payments-db-role" {
  capabilities = ["read"]
}

path "auth/token/renew-self" {
  capabilities = ["update"]
}
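It helps to reason about what this policy does and does not allow before attaching it to a role. The Python sketch below is a deliberately simplified model of path matching that I use for that kind of review: an exact path wins, otherwise the longest trailing-glob prefix applies. It is an assumption-laden approximation, not Vault's implementation; it ignores the + segment wildcard, deny capabilities, and merging across multiple policies.

```python
def path_allowed(policy: dict[str, list[str]], path: str, capability: str) -> bool:
    """Simplified check: exact match first, then the longest pattern ending in '*'."""
    if path in policy:
        return capability in policy[path]
    best = None
    for pattern in policy:
        if pattern.endswith("*") and path.startswith(pattern[:-1]):
            if best is None or len(pattern) > len(best):
                best = pattern
    return best is not None and capability in policy[best]


# The three stanzas from payments-api-policy.hcl, expressed as a dict.
PAYMENTS_POLICY = {
    "secret/data/production/payments/*": ["read"],
    "database/creds/payments-db-role": ["read"],
    "auth/token/renew-self": ["update"],
}
```

Under this model the payments token can read secret/data/production/payments/stripe but nothing under another team's prefix, and it cannot write anywhere in KV.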

For the database engine, Vault connects with a privileged account and creates ephemeral users:

vault secrets enable database

vault write database/config/payments-postgres \
  plugin_name=postgresql-database-plugin \
  allowed_roles="payments-db-role" \
  connection_url="postgresql://{{username}}:{{password}}@postgres.production.svc:5432/payments" \
  username="vault-admin" \
  password="$VAULT_ADMIN_PG_PASSWORD"

vault write database/roles/payments-db-role \
  db_name=payments-postgres \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  revocation_statements="REVOKE ALL PRIVILEGES ON ALL TABLES IN SCHEMA public FROM \"{{name}}\"; DROP ROLE IF EXISTS \"{{name}}\";" \
  default_ttl="1h" \
  max_ttl="24h"

Every time the payments API requests credentials (typically at startup), it gets a unique Postgres user whose lease lasts one hour and can be renewed up to the 24-hour max_ttl. When the lease expires or is revoked, Vault drops the user: no manual rotation, no shared credentials.
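The interaction between default_ttl and max_ttl is easier to see with numbers. This Python sketch models a client that renews each lease once a fixed fraction of it has elapsed, with every renewal capped at max_ttl from the original issue time. The two-thirds renewal point and the renewal_plan helper are my own assumptions; the timing your agent or client library actually uses may differ.

```python
from datetime import datetime, timedelta, timezone


def renewal_plan(issued_at: datetime, ttl: timedelta, max_ttl: timedelta,
                 renew_fraction: float = 2 / 3) -> tuple[list[datetime], datetime]:
    """Return (renewal_times, hard_expiry) for one dynamic credential.

    Each renewal extends the lease by ttl from the moment of renewal, but never
    past issued_at + max_ttl. After hard_expiry a new credential is required.
    """
    hard_expiry = issued_at + max_ttl
    renewals = []
    lease_start, lease_end = issued_at, issued_at + ttl
    while lease_end < hard_expiry:
        renew_at = lease_start + (lease_end - lease_start) * renew_fraction
        renewals.append(renew_at)
        lease_start = renew_at
        lease_end = min(renew_at + ttl, hard_expiry)
    return renewals, hard_expiry


issued = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
renewals, hard_expiry = renewal_plan(issued, timedelta(hours=1), timedelta(hours=24))
```

With the role above, the first renewal falls 40 minutes after issue, and no amount of renewing keeps the same Postgres user alive past the 24-hour mark.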

The Vault Agent Sidecar Pattern

The cleanest way to get secrets into pods without changing application code is the Vault Agent sidecar, injected via the Vault Agent Injector (a mutating admission webhook).

Your application pod gets two containers added automatically: an init container that fetches secrets before your app starts, and a sidecar that renews leases and rewrites secret files as they approach expiry. The application reads files from a shared in-memory volume — it never calls Vault directly.

Annotations on the pod spec drive everything:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "payments-api"
        vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/payments-db-role"
        vault.hashicorp.com/agent-inject-template-db-creds: |
          {{- with secret "database/creds/payments-db-role" -}}
          DATABASE_URL=postgresql://{{ .Data.username }}:{{ .Data.password }}@postgres.production.svc:5432/payments
          {{- end }}
        vault.hashicorp.com/agent-inject-secret-api-key: "secret/data/production/payments/stripe"
        vault.hashicorp.com/agent-inject-template-api-key: |
          {{- with secret "secret/data/production/payments/stripe" -}}
          STRIPE_API_KEY={{ .Data.data.api_key }}
          {{- end }}
        vault.hashicorp.com/agent-pre-populate-only: "false"
        vault.hashicorp.com/agent-revoke-on-shutdown: "true"
    spec:
      serviceAccountName: payments-api
      containers:
        - name: payments-api
          image: payments-api:latest
          command: ["/bin/sh", "-c"]
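          # "." is the POSIX form of "source"; set -a exports the sourced variables so the exec'd process inherits them
          args: ["set -a && . /vault/secrets/db-creds && . /vault/secrets/api-key && set +a && exec ./payments-api"]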

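The files land at /vault/secrets/ in the pod. The sidecar watches lease TTLs and rewrites the files before the credentials expire. A long-running process only benefits if it re-reads those files: the example above loads the values into environment variables once at startup, so that process keeps its original credentials until it restarts or reloads them.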
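One way for a long-running process to pick up rewritten files is to re-parse them on a timer or before opening new connections. The Python sketch below assumes the KEY=VALUE format produced by the templates above; the ReloadingSecretFile class and parse_env_file helper are hypothetical names of mine, not part of the Vault Agent.

```python
from pathlib import Path


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments; split on the first '='."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


class ReloadingSecretFile:
    """Re-reads a rendered secret file and reports whether its contents changed."""

    def __init__(self, path: str):
        self.path = Path(path)
        self.values: dict[str, str] = {}
        self._last_text = None

    def refresh(self) -> bool:
        """Return True if the file content differs from the last read."""
        text = self.path.read_text()
        if text == self._last_text:
            return False
        self._last_text = text
        self.values = parse_env_file(text)
        return True
```

An application could call refresh() on /vault/secrets/db-creds every minute and rebuild its connection pool whenever it returns True. Comparing content instead of modification times costs a small read but avoids missing a rewrite on filesystems with coarse timestamps.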

External Secrets Operator

The Vault Agent pattern is powerful but couples your deployment manifests to Vault annotations. External Secrets Operator (ESO) is a cleaner abstraction for teams that want Vault (or AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) to back native Kubernetes Secret objects, without any annotation overhead in application deployments.

The two ESO CRDs used here are ClusterSecretStore (cluster-wide backend config) and ExternalSecret (per-namespace secret mapping).

# cluster-secret-store.yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "https://vault.internal.example.com:8200"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "eso-reader"
          serviceAccountRef:
            name: "external-secrets-sa"
            namespace: "external-secrets"

# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-stripe-key
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: payments-stripe-secret
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        STRIPE_API_KEY: "{{ .stripe_api_key }}"
        STRIPE_WEBHOOK_SECRET: "{{ .stripe_webhook_secret }}"
  data:
    - secretKey: stripe_api_key
      remoteRef:
        key: production/payments/stripe
        property: api_key
    - secretKey: stripe_webhook_secret
      remoteRef:
        key: production/payments/stripe
        property: webhook_secret

ESO reconciles on the refreshInterval, updating the Kubernetes secret when the remote value changes. Combined with Reloader, pods restart automatically when the secret is refreshed — giving you rotation that propagates without manual intervention.

Secret Rotation Without Downtime

Dynamic secrets from Vault's database engine give you rotation for free: every credential is already short-lived. But for static secrets — API keys, OAuth client secrets, TLS certificates — you need an explicit rotation strategy.

The pattern that works: blue/green credential rotation.

Generate the new credential at the provider (Stripe dashboard, AWS IAM, etc.)
Write both old and new values into Vault KV v2 under the same path. Your application reads a list of API keys and tries them in order.
Deploy the new application version that accepts either credential.
Verify traffic is succeeding with the new key.
Revoke the old credential at the provider.
Remove the old key from Vault.

This avoids the race condition where you revoke the old key before all pods have restarted with the new one.
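The "tries them in order" step can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions: the provider call is passed in as a function, an authentication failure is signalled by PermissionError, and the names call_with_key_fallback and AllKeysRejected are hypothetical. A real client would map the provider's 401/403 responses onto that check.

```python
from typing import Callable, Sequence


class AllKeysRejected(Exception):
    """Raised when every configured key fails authentication."""


def call_with_key_fallback(
    keys: Sequence[str],
    call: Callable[[str], object],
    is_auth_error: Callable[[Exception], bool] = lambda exc: isinstance(exc, PermissionError),
) -> tuple[int, object]:
    """Try each key in order; return (index_of_key_that_worked, result).

    Only authentication failures trigger a fallback. Any other error is
    re-raised immediately so outages are not mistaken for bad keys.
    """
    last_error = None
    for index, key in enumerate(keys):
        try:
            return index, call(key)
        except Exception as exc:
            if not is_auth_error(exc):
                raise
            last_error = exc
    raise AllKeysRejected(f"all {len(keys)} keys were rejected") from last_error
```

Logging the returned index gives you the verification signal for step four: once every pod reports index 0 for the new key, the old one is safe to revoke.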

Audit Logging

Vault's audit log is a newline-delimited JSON stream of every request and response. Enable it:

vault audit enable file file_path=/var/log/vault/audit.log

Each entry includes the request path, the token accessor (the token itself never appears in plaintext, since Vault HMAC-hashes sensitive fields by default), any error returned, and the remote address. What you should alert on:

High error rates from a single token accessor — credential stuffing or a misconfigured application hitting Vault in a tight retry loop.
Access to paths outside normal operating patterns — a payments service reading secrets for the auth service.
Token creation with root policy — should essentially never happen outside break-glass procedures.
Lease revocations in bulk — could indicate a legitimate rotation or an attacker revoking credentials to cause a denial of service.

Ship the audit log to your SIEM. The accessor-to-entity mapping lets you correlate Vault activity with Kubernetes workload identity even after tokens rotate.
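Before the SIEM rules exist, even a small script over the raw file can surface the first alert in that list. The Python sketch below counts failed responses per token accessor in a newline-delimited JSON stream. The field names (type, error, auth.accessor) are modeled on the audit entry layout as I understand it; treat them as assumptions and check them against a real entry from your own Vault version.

```python
import json
from collections import Counter
from typing import Iterable


def noisy_accessors(lines: Iterable[str], threshold: int = 5) -> dict[str, int]:
    """Count error responses per accessor and return those at or above threshold."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Only response entries carry the outcome of the request.
        if entry.get("type") != "response" or not entry.get("error"):
            continue
        accessor = (entry.get("auth") or {}).get("accessor", "unknown")
        counts[accessor] += 1
    return {accessor: n for accessor, n in counts.items() if n >= threshold}
```

Feeding it open("/var/log/vault/audit.log") in a periodic job gives a crude tight-retry-loop detector; the threshold is arbitrary and should be tuned per time window.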

The Secret Sprawl Problem

In most organizations that haven't done a secrets audit, the same credential exists in six places simultaneously: Vault, AWS Secrets Manager (because one team preferred it), a GitHub Actions secret, a CircleCI environment variable, a .env file on a shared developer machine, and a comment in a Slack thread from 2023.

This is secret sprawl, and it means rotation is never truly complete.

The following Python script inventories Vault paths and flags secrets that haven't been updated in more than 90 days, and checks for dynamic secret leases approaching expiry:

#!/usr/bin/env python3
"""
vault_audit.py — Inventory Vault KV v2 secrets and flag staleness.
Requires: pip install hvac python-dateutil
Usage: VAULT_ADDR=https://vault.example.com VAULT_TOKEN=... python3 vault_audit.py
"""

import os
from datetime import datetime, timezone
import hvac
from dateutil import parser as dateutil_parser

VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]
STALE_THRESHOLD_DAYS = 90
LEASE_EXPIRY_WARNING_HOURS = 4  # keep this below your lease TTL, or every lease is flagged

client = hvac.Client(url=VAULT_ADDR, token=VAULT_TOKEN)

def list_kv_paths(mount: str, path: str = "") -> list[str]:
    """Recursively list all KV v2 secret paths under mount/path."""
    try:
        result = client.secrets.kv.v2.list_secrets(path=path or "/", mount_point=mount)
        keys = result["data"]["keys"]
    except hvac.exceptions.InvalidPath:
        return []

    paths = []
    for key in keys:
        full_path = f"{path}/{key}".lstrip("/")
        if key.endswith("/"):
            paths.extend(list_kv_paths(mount, full_path.rstrip("/")))
        else:
            paths.append(full_path)
    return paths

def check_secret_staleness(mount: str, path: str) -> dict:
    """Return metadata for a KV v2 secret, flagging if stale."""
    meta = client.secrets.kv.v2.read_secret_metadata(path=path, mount_point=mount)
    versions = meta["data"]["versions"]
    latest_version = str(meta["data"]["current_version"])
    created_time_str = versions[latest_version]["created_time"]
    created_time = dateutil_parser.isoparse(created_time_str)

    age_days = (datetime.now(timezone.utc) - created_time).days
    is_stale = age_days > STALE_THRESHOLD_DAYS
    is_destroyed = versions[latest_version]["destroyed"]

    return {
        "path": f"{mount}/{path}",
        "current_version": latest_version,
        "last_updated": created_time.strftime("%Y-%m-%d"),
        "age_days": age_days,
        "stale": is_stale,
        "destroyed": is_destroyed,
    }

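def list_lease_ids(prefix: str) -> list[str]:
    """Recursively expand a lease prefix (ending in "/") into full lease IDs."""
    try:
        keys = client.sys.list_leases(prefix=prefix)["data"]["keys"]
    except hvac.exceptions.InvalidPath:
        return []  # no leases under this prefix
    lease_ids = []
    for key in keys:
        if key.endswith("/"):
            lease_ids.extend(list_lease_ids(f"{prefix}{key}"))
        else:
            lease_ids.append(f"{prefix}{key}")
    return lease_ids

def check_expiring_leases() -> list[dict]:
    """List dynamic secret leases expiring within the warning window."""
    warning = []
    try:
        # list_leases returns key suffixes (role names, then lease suffixes),
        # so full lease IDs are rebuilt before each lookup.
        for lease_id in list_lease_ids("database/creds/"):
            info = client.sys.read_lease(lease_id=lease_id)
            expire_time = dateutil_parser.isoparse(info["data"]["expire_time"])
            hours_remaining = (expire_time - datetime.now(timezone.utc)).total_seconds() / 3600
            if hours_remaining < LEASE_EXPIRY_WARNING_HOURS:
                warning.append({
                    "lease_id": lease_id,
                    "expire_time": expire_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
                    "hours_remaining": round(hours_remaining, 2),
                })
    except Exception as e:
        print(f"[warn] Could not enumerate leases: {e}")
    return warning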

def main():
    mounts_to_audit = ["secret"]  # adjust to your KV v2 mount names

    print(f"=== Stale Static Secrets (>{STALE_THRESHOLD_DAYS} days) ===")
    for mount in mounts_to_audit:
        paths = list_kv_paths(mount)
        for path in paths:
            result = check_secret_staleness(mount, path)
            if result["stale"]:
                print(
                    f"  STALE  {result['path']}  "
                    f"(v{result['current_version']}, last rotated {result['last_updated']}, "
                    f"{result['age_days']}d ago)"
                )

    print("\n=== Dynamic Leases Expiring Soon ===")
    expiring = check_expiring_leases()
    if expiring:
        for lease in expiring:
            print(
                f"  EXPIRING  {lease['lease_id']}  "
                f"in {lease['hours_remaining']}h  (expires {lease['expire_time']})"
            )
    else:
        print("  No leases approaching expiry.")

if __name__ == "__main__":
    main()

Run this as a Kubernetes CronJob and pipe output to PagerDuty or Slack. Stale secrets become visible and actionable rather than accumulating silently.

To tackle sprawl structurally: run a one-time inventory across every system that might hold secrets — AWS Secrets Manager, GitHub Actions, CircleCI, Doppler, parameter stores, .env files in repos (use trufflehog or gitleaks for this). For each secret found outside Vault, either migrate it or document why it lives where it does. The goal isn't a single system of record for religious reasons — it's that every secret has exactly one authoritative source and every other reference is a cache that gets invalidated on rotation.
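Once the inventory exists, the useful question is which credentials live in more than one place. The Python sketch below groups inventory rows by a keyed fingerprint of the secret value, so the report never contains the values themselves. The row shape (system, name, value), the pepper, and the function names are assumptions of mine for illustration; the pepper must itself be treated as a secret.

```python
import hashlib
import hmac
from collections import defaultdict


def fingerprint(value: str, pepper: bytes) -> str:
    """Keyed hash of a secret value, truncated for readable reports."""
    return hmac.new(pepper, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def find_sprawl(inventory: list[tuple[str, str, str]], pepper: bytes) -> dict[str, list[str]]:
    """Map fingerprint -> locations, keeping only values found in more than one place.

    Each inventory row is (system, name, value).
    """
    locations = defaultdict(list)
    for system, name, value in inventory:
        locations[fingerprint(value, pepper)].append(f"{system}:{name}")
    return {fp: places for fp, places in locations.items() if len(places) > 1}
```

Each group in the result is a rotation checklist: one entry is the authoritative source, and every other location has to be updated or deleted when the credential rotates.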

Getting There from Here

You don't have to boil the ocean. A practical migration path:

Deploy Vault (or use HCP Vault) and enable the Kubernetes auth method.
Migrate your highest-risk secrets first — database credentials, payment provider keys, OAuth client secrets.
Enable ESO and start replacing Secret manifests with ExternalSecret manifests for new services.
Retrofit Vault Agent annotations to existing services that are due for a meaningful change anyway — don't rewrite everything at once.
Enable audit logging on day one, even before you have alerting wired up. The log is invaluable retroactively.
Run the sprawl audit quarterly. Secrets have a way of reappearing in unexpected places.

Zero-trust is a journey. Secrets management is the part you can actually measure: how many credentials are static, how many are dynamic, what percentage of your estate is audited. Move those numbers and you're moving in the right direction.

*Zak Hassan is a Staff SRE specializing in platform security, distributed systems, and developer infrastructure. Find him at zakhassan.com or on LinkedIn.*
