Feature Flags and Progressive Delivery: Separating Deployment from Release

*By Zak Hassan — Staff SRE | May 2026*

Every engineer has lived through the same painful scenario: a feature goes out in a release, something breaks in production, and the only remediation path is a full rollback or a hotfix deploy. Both options take time, require coordination, and carry their own risk. The core problem is that deployment and release are treated as a single atomic event: code ships, users see it, and consequences follow immediately.

Feature flags break this coupling. When code deployment and feature activation are independent decisions, the risk profile of shipping changes entirely. You can deploy on a Tuesday, dark-launch to internal users on Wednesday, roll out to one percent of traffic on Thursday, and flip the switch fully only when you have confidence. This is progressive delivery, and it is one of the most impactful reliability practices an SRE team can adopt.

Why Decoupling Changes the Risk Equation

A deployment without a release is just moving bytes around. The code sits inert in production, evaluated but never activated for end users. This pattern, the dark launch, lets you verify that instrumentation is wired correctly, that database migrations completed cleanly, and that downstream service calls are functioning, all without exposing a single user to the new behavior.

Percentage rollouts extend this further. Instead of binary on/off, you release to a sliding fraction of traffic:

1% → 5% → 20% → 50% → 100%

At each stage you pause, watch your error rate, latency distributions, and business metrics. If something goes wrong at five percent, you flip the flag off and five percent of users had a degraded experience for minutes — not the entire user base for the duration of a rollback deploy.
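The pause-and-check step at each stage can be sketched as a small gate function. This is a minimal illustration, not a prescribed controller: the `StageHealth` shape and the threshold values are assumptions chosen for the example, and a real gate would likely also look at business metrics.

```python
from dataclasses import dataclass

# Rollout stages from the sequence above, as percentages of traffic.
STAGES = [1, 5, 20, 50, 100]


@dataclass
class StageHealth:
    """Metrics observed while the flag sat at the current stage (hypothetical shape)."""
    error_rate: float        # fraction of failed requests, e.g. 0.002
    p99_latency_ms: float    # 99th percentile latency in milliseconds


def next_rollout_percent(
    current_percent: int,
    health: StageHealth,
    max_error_rate: float = 0.01,       # assumed threshold
    max_p99_latency_ms: float = 800.0,  # assumed threshold
) -> int:
    """Return the next rollout percentage, or 0 to switch the flag off."""
    if health.error_rate > max_error_rate or health.p99_latency_ms > max_p99_latency_ms:
        return 0  # unhealthy: turn the flag off rather than holding at this stage
    later_stages = [s for s in STAGES if s > current_percent]
    # Already at the last stage: stay where we are.
    return later_stages[0] if later_stages else current_percent
```

Returning 0 on a failed check mirrors the "flip the flag off" response described above; holding at the current stage would be an equally reasonable policy for softer signals.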

Targeted releases add another dimension. You release to internal employees first, then beta users, then a specific geographic region, then all users. Each ring provides signal before the next is activated.
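One way to model these rings is as an ordered list, where a user sees the feature once the rollout has reached the earliest ring they belong to. The attribute names (`internal`, `beta`, `region`) and the `launch_region` parameter are assumptions for illustration only.

```python
# Ring order from the paragraph above; the attribute names are assumptions.
RINGS = ["internal", "beta", "region", "everyone"]


def user_ring(user: dict, launch_region: str) -> str:
    """Return the earliest ring a user belongs to."""
    if user.get("internal"):
        return "internal"
    if user.get("beta"):
        return "beta"
    if user.get("region") == launch_region:
        return "region"
    return "everyone"


def is_released_to(user: dict, active_ring: str, launch_region: str) -> bool:
    """A user sees the feature once the rollout has reached their ring."""
    return RINGS.index(user_ring(user, launch_region)) <= RINGS.index(active_ring)


# Rollout currently at the "beta" ring, with eu-west-1 as the first region.
print(is_released_to({"internal": True}, "beta", "eu-west-1"))
print(is_released_to({"region": "us-east-1"}, "beta", "eu-west-1"))
```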

Flag Types and Why Mixing Them Creates Debt

Not all flags serve the same purpose, and treating them as interchangeable is where flag debt begins. There are four distinct categories:

Release flags are temporary. They wrap a new feature and exist only to enable progressive rollout. They should have an expiry date baked into the ticket that created them. Once fully rolled out, the flag is deleted and the code path is unconditional.

Experiment flags drive A/B tests. They assign users deterministically to cohorts and expose variant behavior. These are owned by the product or data science team, have a defined test duration, and produce a winner/loser determination that should trigger removal.

Ops flags are kill switches: permanent (or long-lived) controls that let operators disable expensive, risky, or degraded functionality at runtime without a deploy. These belong in runbooks and incident response playbooks.

Permission flags gate features by entitlement: paying customers see feature X, free tier users do not. These are long-lived and owned by the billing or entitlements system.

When you mix these — using a release flag as a permanent ops switch, or letting an experiment flag live for eighteen months — you create a codebase where nobody knows which flags are safe to clean up, which are load-bearing, and which are just forgotten. The audit becomes expensive. Establish the taxonomy early and enforce it at flag creation time.
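Enforcing the taxonomy at creation time can start with making the flag type and owner mandatory fields of the flag definition itself. The sketch below is one possible shape; `FlagDefinition` and its fields are hypothetical names, not part of any particular flag platform.

```python
from dataclasses import dataclass
from enum import Enum


class FlagType(Enum):
    RELEASE = "release"
    EXPERIMENT = "experiment"
    OPS = "ops"
    PERMISSION = "permission"


# Categories that are expected to be removed once they have served their purpose.
TEMPORARY_TYPES = {FlagType.RELEASE, FlagType.EXPERIMENT}


@dataclass(frozen=True)
class FlagDefinition:
    key: str
    flag_type: FlagType
    owner: str  # team accountable for the flag

    def __post_init__(self) -> None:
        # Reject ownerless flags up front so every flag has someone to ask.
        if not self.owner.strip():
            raise ValueError(f"flag {self.key!r} must have an owner")

    @property
    def is_temporary(self) -> bool:
        return self.flag_type in TEMPORARY_TYPES
```

Because `FlagType("release")` raises `ValueError` for unknown strings, parsing the type from an API request rejects uncategorized flags without extra code.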

OpenFeature: The Vendor-Neutral SDK Standard

The OpenFeature project (CNCF) defines a standard SDK interface for feature flag evaluation, decoupling the application code from any specific vendor's SDK. You write against the OpenFeature API; a provider implementation handles the backend — LaunchDarkly, Flagd, Unleash, or your own service.

# pip install openfeature-sdk

from openfeature import api
from openfeature.evaluation_context import EvaluationContext

# Register a provider (here: a custom one, shown below)
from myapp.flags import HttpFlagProvider

api.set_provider(HttpFlagProvider(base_url="https://flags.internal.example.com"))

client = api.get_client("payments-service")

def handle_checkout(request):
    # Evaluate a boolean flag with targeting context.
    # Values are hard-coded here; in practice they come from the request.
    ctx = EvaluationContext(
        targeting_key="user-8821",
        attributes={
            "org_id": "org-441",
            "region": "us-east-1",
            "plan": "enterprise",
            "internal": False,
        },
    )

    is_enabled = client.get_boolean_value(
        flag_key="new-checkout-flow",
        default_value=False,
        evaluation_context=ctx,
    )

    # new_checkout_handler and legacy_checkout_handler are the application's own handlers
    if is_enabled:
        return new_checkout_handler(request)
    return legacy_checkout_handler(request)

The application code never imports LaunchDarkly or Flagd directly. If you swap providers, only the provider registration changes.

Implementing a Custom Provider

from openfeature.provider.provider import AbstractProvider
from openfeature.flag_evaluation import FlagResolutionDetails, Reason
from openfeature.evaluation_context import EvaluationContext
import httpx

class HttpFlagProvider(AbstractProvider):
    def __init__(self, base_url: str):
        self.base_url = base_url
        self._client = httpx.Client(timeout=0.5)  # tight timeout — flags must be fast

    @property
    def name(self) -> str:
        return "HttpFlagProvider"

    def resolve_boolean_details(
        self,
        flag_key: str,
        default_value: bool,
        evaluation_context: EvaluationContext | None = None,
    ) -> FlagResolutionDetails[bool]:
        try:
            payload = {
                "flag": flag_key,
                "targeting_key": evaluation_context.targeting_key if evaluation_context else None,
                "attributes": evaluation_context.attributes if evaluation_context else {},
            }
            response = self._client.post(f"{self.base_url}/evaluate", json=payload)
            response.raise_for_status()
            data = response.json()
            return FlagResolutionDetails(
                value=data["value"],
                reason=Reason.TARGETING_MATCH if data.get("matched_rule") else Reason.DEFAULT,
                variant=data.get("variant"),
            )
        except Exception:
            # On any error: fail safe, return default
            return FlagResolutionDetails(value=default_value, reason=Reason.ERROR)

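The critical design decisions here are a tight HTTP timeout (500 ms, because flag evaluation must never become a latency bottleneck) and fail-safe behavior that returns the caller-supplied default on any error. Flags must never be in the critical path of availability. Note that the class shows only the boolean resolver; a complete provider would also need to implement the SDK's remaining abstract methods (provider metadata plus the string, integer, float, and object resolvers).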

Progressive Rollout and Consistent Hashing

For percentage rollouts to be useful, they must be *sticky*. A user who sees the new feature at two percent should continue seeing it at five percent. If assignments are random per-request, users experience a flickering, inconsistent product — and your metrics are contaminated by users who saw both variants.

Consistent hashing solves this:

import hashlib

def get_flag_bucket(targeting_key: str, flag_key: str, salt: str = "") -> int:
    """Return a stable bucket 0-99 for a given user/flag combination."""
    raw = f"{flag_key}{salt}{targeting_key}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    # Take the first 8 hex chars → integer → mod 100
    return int(digest[:8], 16) % 100

def evaluate_rollout(targeting_key: str, flag_key: str, rollout_percent: int) -> bool:
    bucket = get_flag_bucket(targeting_key, flag_key)
    return bucket < rollout_percent

# Example: 20% rollout
targeting_key = "user-8821"
if evaluate_rollout(targeting_key, "new-checkout-flow", rollout_percent=20):
    # user is in the 20%
    ...
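
The stickiness property can be checked directly. The sketch below repeats the bucketing helper so it runs on its own, then confirms that every user enabled at two percent is still enabled at five percent; the user IDs are synthetic and the population size is arbitrary.

```python
import hashlib


def get_flag_bucket(targeting_key: str, flag_key: str, salt: str = "") -> int:
    """Same bucketing as above, repeated so this block runs on its own."""
    raw = f"{flag_key}{salt}{targeting_key}".encode("utf-8")
    return int(hashlib.sha256(raw).hexdigest()[:8], 16) % 100


def enabled_users(users: list[str], flag_key: str, rollout_percent: int) -> set[str]:
    """Users whose bucket falls below the rollout percentage."""
    return {u for u in users if get_flag_bucket(u, flag_key) < rollout_percent}


users = [f"user-{i}" for i in range(10_000)]  # synthetic IDs for illustration
at_2 = enabled_users(users, "new-checkout-flow", 2)
at_5 = enabled_users(users, "new-checkout-flow", 5)

# Stickiness: everyone enabled at 2% is still enabled at 5%.
assert at_2 <= at_5
# The share of enabled users should land near the target percentage.
print(f"2% stage: {len(at_2) / len(users):.1%}, 5% stage: {len(at_5) / len(users):.1%}")
```

Because the flag key is part of the hash input, a user's bucket differs from flag to flag, so the same users are not always first in line for every rollout.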

A targeting context for real-world use goes beyond user ID. Org ID lets you roll out to entire organizations atomically — useful when features are priced at the org level. Region lets you stage geographically. Build targeting rules that compose these dimensions:

def evaluate_with_targeting(
    flag_key: str,
    ctx: dict,
    rules: list[dict],
    default_rollout_percent: int,
) -> bool:
    for rule in rules:
        # Rule: {"match": {"plan": "enterprise"}, "rollout_percent": 100}
        if all(ctx.get(k) == v for k, v in rule["match"].items()):
            return evaluate_rollout(ctx["targeting_key"], flag_key, rule["rollout_percent"])
    # Fall through to default rollout
    return evaluate_rollout(ctx["targeting_key"], flag_key, default_rollout_percent)
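
Rolling out to whole organizations atomically comes down to which key you hash. A minimal sketch, assuming the context carries an optional `org_id`: bucket on the org when it is present and fall back to the user otherwise. The `bucket` and `evaluate_org_rollout` names are illustrative.

```python
import hashlib


def bucket(key: str, flag_key: str) -> int:
    """Stable bucket 0-99, same scheme as get_flag_bucket above."""
    return int(hashlib.sha256(f"{flag_key}{key}".encode("utf-8")).hexdigest()[:8], 16) % 100


def evaluate_org_rollout(ctx: dict, flag_key: str, rollout_percent: int) -> bool:
    """Bucket on org_id when present so a whole organization flips together."""
    bucketing_key = ctx.get("org_id") or ctx["targeting_key"]
    return bucket(bucketing_key, flag_key) < rollout_percent


alice = {"targeting_key": "user-8821", "org_id": "org-441"}
bob = {"targeting_key": "user-1203", "org_id": "org-441"}

# Same org, so both users get the same answer at every percentage.
assert all(
    evaluate_org_rollout(alice, "new-checkout-flow", p)
    == evaluate_org_rollout(bob, "new-checkout-flow", p)
    for p in range(0, 101)
)
```

The trade-off is granularity: with org-level bucketing, a one percent rollout means roughly one percent of organizations, which may be far more or less than one percent of users.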

Kill Switches as Reliability Tools

Ops flags are the SRE's most direct lever for runtime control. When a feature is expensive — say, a real-time recommendation engine that makes synchronous calls to an ML inference service — you want a kill switch that disables it under load and returns a cheap fallback (cached recommendations, most-popular items) without a deploy.

# In the runbook: set recommendations_engine_enabled=false in the flag backend
# This code runs on every request with no changes needed

client = api.get_client("product-service")

def get_recommendations(user_id: str, product_id: str) -> list[dict]:
    ctx = EvaluationContext(targeting_key=user_id)
    engine_enabled = client.get_boolean_value(
        "recommendations_engine_enabled",
        default_value=True,  # fail-open: default ON, kill switch is explicit OFF
        evaluation_context=ctx,
    )

    if not engine_enabled:
        metrics.increment("recommendations.kill_switch_active")
        return get_cached_popular_items(product_id)

    return call_ml_inference_service(user_id, product_id)

Kill switches belong in runbooks as first-line mitigation patterns, not last resorts. When you write a runbook for an incident involving service X, ask: is there a flag we could flip to shed load or disable the expensive path? If the answer is no, create one before the next on-call rotation. The relationship to load shedding is direct: kill switches are application-level load shedding, complementing infrastructure-level techniques like rate limiting and circuit breakers.
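
The kill-switch pattern generalizes beyond one endpoint. As a rough sketch, a small helper can wrap any expensive path with a flag check and a cheap fallback; `with_kill_switch` and the dictionary standing in for the flag backend are hypothetical, not part of OpenFeature.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_kill_switch(
    is_enabled: Callable[[], bool],
    expensive: Callable[[], T],
    fallback: Callable[[], T],
) -> T:
    """Run the expensive path unless the kill switch is explicitly off."""
    try:
        enabled = is_enabled()
    except Exception:
        # Flag lookup failed: keep the default (ON), as in the example above.
        enabled = True
    return expensive() if enabled else fallback()


flags = {"recommendations_engine_enabled": False}  # stand-in for the flag backend
result = with_kill_switch(
    is_enabled=lambda: flags["recommendations_engine_enabled"],
    expensive=lambda: ["personalized-1", "personalized-2"],
    fallback=lambda: ["popular-1", "popular-2"],
)
print(result)
```

Defaulting to ON when the lookup fails matches the fail-open choice in the example above; for a path that is dangerous rather than merely expensive, defaulting to OFF may be the safer assumption.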

Flag Debt: Auditing Age and Usage

Flags accumulate. A codebase with three years of history might have hundreds of flags, most of them release flags that were never cleaned up. The danger is twofold: the code complexity of dead branches, and the operational risk of someone accidentally toggling a flag whose behavior is no longer obvious.

The remediation is automated auditing. Pull flag metadata from your backend, cross-reference with creation date and last evaluation timestamp, and surface anything overdue for review:

#!/usr/bin/env python3
"""audit_flags.py — surface stale feature flags for cleanup."""

import sys
from datetime import datetime, timezone

import httpx

FLAGS_API = "https://flags.internal.example.com/api/v1/flags"
RELEASE_FLAG_TTL_DAYS = 30
EXPERIMENT_FLAG_TTL_DAYS = 90
# Ops and permission flags are long-lived — skip TTL enforcement for them

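def fetch_flags() -> list[dict]:
    # Read the API token from a mounted secret; the with-block closes the file.
    with open("/run/secrets/flags-token") as token_file:
        token = token_file.read().strip()
    response = httpx.get(FLAGS_API, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    return response.json()["flags"]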

def is_stale(flag: dict, now: datetime) -> tuple[bool, str]:
    flag_type = flag.get("type", "release")
    if flag_type in ("ops", "permission"):
        return False, ""

    created_at = datetime.fromisoformat(flag["created_at"])
    if created_at.tzinfo is None:
        # Assume naive timestamps are UTC; aware ones keep their own offset
        created_at = created_at.replace(tzinfo=timezone.utc)
    last_evaluated = flag.get("last_evaluated_at")

    ttl = RELEASE_FLAG_TTL_DAYS if flag_type == "release" else EXPERIMENT_FLAG_TTL_DAYS
    age_days = (now - created_at).days

    if age_days > ttl:
        return True, f"age={age_days}d exceeds TTL={ttl}d for type={flag_type}"

    if last_evaluated is None and age_days > 7:
        return True, f"never evaluated after {age_days}d"

    if last_evaluated:
        last_eval_dt = datetime.fromisoformat(last_evaluated)
        if last_eval_dt.tzinfo is None:
            last_eval_dt = last_eval_dt.replace(tzinfo=timezone.utc)
        days_since_eval = (now - last_eval_dt).days
        if days_since_eval > 14:
            return True, f"last evaluated {days_since_eval}d ago"

    return False, ""

def main():
    now = datetime.now(timezone.utc)
    flags = fetch_flags()
    stale = []

    for flag in flags:
        stale_flag, reason = is_stale(flag, now)
        if stale_flag:
            stale.append({
                "key": flag["key"],
                "type": flag.get("type", "release"),
                "owner": flag.get("owner", "unknown"),
                "reason": reason,
            })

    if not stale:
        print("No stale flags found.")
        sys.exit(0)

    print(f"Found {len(stale)} stale flag(s):\n")
    for f in stale:
        print(f"  [{f['type']}] {f['key']} — owner: {f['owner']} — {f['reason']}")

    # In CI/CD: exit non-zero to block deploys if stale count exceeds threshold
    if len(stale) > 20:
        sys.exit(1)

if __name__ == "__main__":
    main()

Run this script in CI and post results to a Slack channel weekly. Better: gate new flag creation on clearing the backlog above some threshold. Enforce TTLs at flag creation by requiring an expires_at field for release and experiment flags — your flag management UI or API should reject flags without it.
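
Rejecting flags without an expiry can be sketched as a validation step in the flag creation API. The request shape (`type`, `expires_at`) and the function name are assumptions; the TTL values mirror the constants in audit_flags.py.

```python
from datetime import datetime, timedelta, timezone

# TTLs mirror the constants in audit_flags.py above.
MAX_TTL_DAYS = {"release": 30, "experiment": 90}
KNOWN_TYPES = ("release", "experiment", "ops", "permission")


def validate_new_flag(request: dict, now: datetime) -> list[str]:
    """Return validation errors; an empty list means the flag may be created."""
    flag_type = request.get("type")
    if flag_type not in KNOWN_TYPES:
        return [f"unknown flag type: {flag_type!r}"]
    if flag_type not in MAX_TTL_DAYS:
        return []  # ops and permission flags are long-lived, no expiry required

    expires_at = request.get("expires_at")
    if expires_at is None:
        return [f"{flag_type} flags require expires_at"]

    expiry = datetime.fromisoformat(expires_at)
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    if expiry <= now:
        return ["expires_at is in the past"]
    if expiry > now + timedelta(days=MAX_TTL_DAYS[flag_type]):
        return [f"expires_at exceeds the {MAX_TTL_DAYS[flag_type]}-day TTL"]
    return []


now = datetime(2026, 5, 1, tzinfo=timezone.utc)
print(validate_new_flag({"type": "release"}, now))
```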

Observability for Feature Flags

Flags that are invisible in your observability platform are dangerous. When an incident occurs, the first question should be: did any flags change in the window before the incident started? This is only answerable if flag evaluation and flag changes are instrumented.

Track evaluation counts as metrics, broken down by flag key and variant:

# Wrap OpenFeature evaluation with metrics emission
def evaluate_flag_with_metrics(
    client, flag_key: str, default: bool, ctx: EvaluationContext
) -> bool:
    details = client.get_boolean_details(flag_key, default, ctx)
    metrics.increment(
        "feature_flag.evaluation",
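        tags={
            "flag": flag_key,
            # Lowercase so the tag reads "true"/"false", matching the query below
            "variant": str(details.value).lower(),
            # reason may be a Reason enum member or a plain string
            "reason": str(getattr(details.reason, "value", details.reason or "unknown")),
        },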
    )
    return details.value

With this in place, you can plot feature_flag.evaluation{flag=new-checkout-flow, variant=true} alongside the checkout error rate. A correlation between a flag rollout step and a metric change is immediately visible in your dashboards.

Emit a change event to your event store whenever a flag is toggled. Most flag platforms provide webhook callbacks for this; forward them into your observability platform as deployment markers. In Datadog, a webhook to the Events API annotates your graphs with a vertical line at the moment of the flag change — invaluable for postmortem analysis.
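
Once change events are stored, the "did any flags change?" question becomes a simple query. The sketch below assumes each event is a dictionary with a `flag` name and a timezone-aware `changed_at` timestamp; both field names and the 30-minute lookback are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def flags_changed_before(
    incident_start: datetime,
    change_events: list[dict],
    lookback: timedelta = timedelta(minutes=30),
) -> list[dict]:
    """Return flag changes in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [e for e in change_events if window_start <= e["changed_at"] <= incident_start]
    return sorted(recent, key=lambda e: e["changed_at"], reverse=True)


events = [
    {"flag": "new-checkout-flow",
     "changed_at": datetime(2026, 5, 4, 13, 50, tzinfo=timezone.utc)},
    {"flag": "recommendations_engine_enabled",
     "changed_at": datetime(2026, 5, 4, 9, 0, tzinfo=timezone.utc)},
]
incident_start = datetime(2026, 5, 4, 14, 5, tzinfo=timezone.utc)
suspects = flags_changed_before(incident_start, events)
print([e["flag"] for e in suspects])
```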

Flag-split metrics — the ratio of true to false evaluations — let you confirm that your rollout percentage is behaving as expected. If you set a flag to ten percent and see sixty percent true evaluations, something is wrong with your targeting logic or consistent hashing.
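
This check is easy to automate. A minimal sketch, assuming you can read true and false evaluation counts for a flag from your metrics backend; the function name and the two-point tolerance are illustrative.

```python
def rollout_split_ok(
    true_count: int,
    false_count: int,
    expected_percent: float,
    tolerance_points: float = 2.0,  # assumed tolerance, in percentage points
) -> bool:
    """Check that the observed share of true evaluations is near the target."""
    total = true_count + false_count
    if total == 0:
        return True  # no traffic yet, nothing to compare
    observed_percent = 100.0 * true_count / total
    return abs(observed_percent - expected_percent) <= tolerance_points


print(rollout_split_ok(1_000, 9_000, expected_percent=10))
print(rollout_split_ok(6_000, 4_000, expected_percent=10))
```

One caveat: evaluation counts weight heavy users more than light ones, so the per-evaluation ratio can drift from the per-user percentage even when bucketing is correct. Counting distinct targeting keys gives a cleaner comparison where the metrics backend supports it.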

Closing Thoughts

Progressive delivery via feature flags is not primarily a product experimentation tool — it is a reliability practice. The ability to separate the moment code is deployed from the moment users encounter it gives the team a safety margin that no amount of pre-release testing can match. Ops flags as documented kill switches make on-call incidents more tractable. Consistent hashing makes rollouts trustworthy. Automated flag debt auditing prevents the practice from becoming a liability. And OpenFeature ensures that the organizational investment in this infrastructure is not held hostage to a single vendor.

The discipline is in the process: flag types with clear ownership, TTLs enforced at creation, kill-switch patterns documented in runbooks before incidents happen, and observability wired from day one. Build the scaffolding correctly and progressive delivery becomes one of the most powerful tools in the SRE toolkit.

*Zak Hassan is a Staff SRE specializing in reliability engineering, progressive delivery, and production observability. Find him at zakhassan.com or on LinkedIn.*

