*By Zak Hassan — Staff SRE | May 2026*


Most alerting setups are broken in the same way. Teams set thresholds on individual metrics — CPU > 80%, error rate > 1%, latency > 500ms — and get paged whenever those thresholds are crossed. The result is a pile of alerts that fire too often when nothing is actually wrong, and occasionally miss the things that are. On-call engineers learn to ignore alert noise, which means real incidents get missed too.

Burn rate alerting, derived from SLOs, addresses this. Instead of alerting on individual metrics in isolation, you alert on the rate at which you're consuming your error budget: the amount of unreliability your SLO permits. This approach pages you when the current trajectory would exhaust the budget before the SLO window ends, and it tells you how urgent the problem is by how fast the budget is burning.


The Error Budget Foundation

An SLO defines a reliability target: "99.9% of requests will succeed over a 30-day rolling window." The error budget is the complement: you're allowed to fail 0.1% of requests, which becomes a concrete number once you know the traffic volume.

If the service handles 10 million requests per day and has a 99.9% SLO, your monthly error budget is:

text
Monthly request volume = 10M × 30 = 300M requests
Error budget = 300M × 0.001 = 300,000 failed requests
Error budget per day = 10,000 failed requests
Error budget per hour = ~417 failed requests

A burn rate is how fast you're consuming this budget relative to the normal rate. A burn rate of 1 means you're exactly on track — you'll use up exactly 100% of your error budget over the SLO window. A burn rate of 2 means you're using budget twice as fast as allowed — you'll exhaust it in 15 days instead of 30.
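The arithmetic above is easy to check in a few lines of Python. This is a minimal sketch that assumes constant traffic across the window; the function names `error_budget`, `burn_rate`, and `hours_to_exhaustion` are illustrative and not part of any library.

```python
def error_budget(requests_per_day: float, slo_target: float, window_days: int = 30) -> float:
    """Number of requests allowed to fail over the SLO window."""
    return requests_per_day * window_days * (1 - slo_target)


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


budget = error_budget(10_000_000, 0.999)
print(round(budget))            # 300000 failed requests per 30 days
print(round(budget / 30 / 24))  # 417 failed requests per hour
print(burn_rate(0.002, 0.999))  # roughly 2: 0.2% errors against a 0.1% allowance
print(hours_to_exhaustion(14))  # roughly 51.4 hours at a 14x burn
```

The last function is where the "exhausted in N hours" figures attached to burn rate thresholds come from: divide the length of the SLO window by the burn rate.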


Multi-Window Burn Rate Alerts

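The Google SRE workbook's recommendation for burn rate alerting uses two time windows for each alert, and the alert fires only when both exceed the threshold. The long window shows that the burn is sustained rather than a brief spike; the short window shows that it is still happening right now, which also lets the alert clear quickly after recovery. The workbook's own example pairs each long window with a short window one twelfth its length (1h with 5m at a 14.4x burn rate); the rules in this post use longer pairs.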

promql
# Error ratio: the base metric
# Aggregate both sides with sum() before dividing. Without it, each 5xx series
# is matched only against itself and the ratio comes out as 1.
sum(rate(http_requests_total{job="my-service",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="my-service"}[5m]))

# Burn rate at a given window
# burn_rate = current_error_rate / slo_error_rate
# slo_error_rate = 1 - slo_target = 0.001 for 99.9% SLO

# 1-hour burn rate
(
  sum(rate(http_requests_total{job="my-service",status=~"5.."}[1h]))
  / sum(rate(http_requests_total{job="my-service"}[1h]))
) / 0.001  # Divide by SLO error rate

# 5-minute burn rate
(
  sum(rate(http_requests_total{job="my-service",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="my-service"}[5m]))
) / 0.001

The multi-window alert rule:

yaml
groups:
  - name: slo-burn-rate-alerts
    rules:
      # CRITICAL: burning at 14x, budget exhausted in ~51 hours (720h / 14)
      # Use dual window: short confirms it's happening, long confirms it's sustained
      - alert: ErrorBudgetBurnRateCritical
        expr: |
          (
            # Short window: fast burn detection
            # sum by (job) on both sides: dividing unaggregated series would
            # match each 5xx series only against itself
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[1h]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[1h]))
            ) / 0.001 > 14
          )
          and
          (
            # Long window: sustained burn confirmation
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[6h]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[6h]))
            ) / 0.001 > 14
          )
        for: 2m
        labels:
          severity: page
          urgency: critical
        annotations:
          summary: "Error budget burning at 14x, exhausted in ~51h"
          description: |
            Service {{ $labels.job }} is burning error budget at 14x the normal rate.
            At this rate, the monthly error budget will be exhausted in approximately 51 hours.
          blog_post: "https://zakhassan.com/blog/error-budget-critical"

      # HIGH: burning at 6x — budget exhausted in ~5 days
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            # sum by (job) on both sides so the 5xx series divide by all requests
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[6h]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[6h]))
            ) / 0.001 > 6
          )
          and
          (
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[1d]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[1d]))
            ) / 0.001 > 6
          )
        for: 5m
        labels:
          severity: page
          urgency: high
        annotations:
          summary: "Error budget burning at 6x — exhausted in ~5 days"

      # MEDIUM: burning at 3x — budget exhausted in ~10 days
      - alert: ErrorBudgetBurnRateMedium
        expr: |
          (
            # sum by (job) on both sides so the 5xx series divide by all requests
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[1d]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[1d]))
            ) / 0.001 > 3
          )
          and
          (
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[3d]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[3d]))
            ) / 0.001 > 3
          )
        for: 10m
        labels:
          severity: ticket   # Not a page — ticket to investigate during business hours
          urgency: medium
        annotations:
          summary: "Error budget burning at 3x — exhausted in ~10 days"

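Why multi-window: requiring both windows filters out noise in both directions. The long window confirms the burn is sustained enough to matter: a 1-minute spike in errors can push the 1h burn rate over 14 but cannot move the 6h window that far, so the 1h/6h alert stays quiet. The short window confirms the problem is still happening, so the alert clears soon after recovery instead of firing until the long window drains. The result is far fewer false positives than single-window alerting.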
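A small simulation makes the dual-window behaviour concrete. This is a sketch under simplifying assumptions: one error-ratio sample per minute, constant traffic (so a window's error ratio is the plain mean of its samples), and hypothetical helper names. It illustrates the logic of the 1h/6h rule and does not reproduce how Prometheus evaluates `rate()`.

```python
SLO_ERROR_RATE = 0.001  # 99.9% SLO
THRESHOLD = 14          # burn rate threshold of the critical alert


def window_burn_rate(error_ratios: list[float], minutes: int) -> float:
    """Burn rate over the trailing `minutes` samples (one sample per minute)."""
    window = error_ratios[-minutes:]
    return (sum(window) / len(window)) / SLO_ERROR_RATE


def critical_fires(error_ratios: list[float]) -> bool:
    """True when both the 1h and the 6h window exceed the threshold."""
    return (window_burn_rate(error_ratios, 60) > THRESHOLD
            and window_burn_rate(error_ratios, 360) > THRESHOLD)


healthy = [0.0] * 360
spike = healthy[:-1] + [1.0]             # one minute of total failure
sustained = healthy[:-30] + [0.20] * 30  # thirty minutes at 20% errors

# The spike trips the 1h window (about 16.7x) but not the 6h window (about 2.8x).
print(window_burn_rate(spike, 60), window_burn_rate(spike, 360))
print(critical_fires(spike))      # False
print(critical_fires(sustained))  # True
```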


Latency SLOs and Burn Rate

Error rate SLOs are the most common, but latency SLOs work the same way. Define the latency budget, measure its consumption rate, alert on burn rate.

yaml
# Latency SLO: 99% of requests under 500ms, measured over 30 days
# Budget: 1% of requests can exceed 500ms

groups:
  - name: latency-slo
    rules:
      - alert: LatencyBudgetBurnRateCritical
        expr: |
          (
            # Fraction of requests slower than the 500ms threshold in the window:
            # 1 - (requests within threshold / all requests).
            # sum by (job) drops the "le" label so the bucket and count series match.
            1 - (
              sum by (job) (rate(http_request_duration_seconds_bucket{job="my-service",le="0.5"}[1h]))
              / sum by (job) (rate(http_request_duration_seconds_count{job="my-service"}[1h]))
            )
          ) / 0.01 > 14  # Divide by the SLO error rate (1%), compare to a 14x burn rate
        for: 2m
        labels:
          severity: page

Multi-SLO burn rate aggregation: for services with both error rate and latency SLOs, aggregate across both:

promql
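# Combined burn rate: max of error rate burn and latency burn
# max() aggregates a single vector, so tag each burn rate with an "slo" label,
# union the two with "or", and take the max over the result.
max(
  label_replace(
    (
      sum(rate(http_requests_total{job="my-service",status=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="my-service"}[1h]))
    ) / 0.001,  # Error rate burn rate
    "slo", "errors", "", ""
  )
  or
  label_replace(
    (
      1 - (
        sum(rate(http_request_duration_seconds_bucket{job="my-service",le="0.5"}[1h]))
        / sum(rate(http_request_duration_seconds_count{job="my-service"}[1h]))
      )
    ) / 0.01,   # Latency burn rate
    "slo", "latency", "", ""
  )
)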
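The same calculation can be checked offline from raw counts. The sketch below assumes you already have counter increases for one window (for example the `le="0.5"` bucket and the request count); the function names and sample numbers are hypothetical.

```python
def latency_burn_rate(within_threshold: float, total: float, slo_target: float = 0.99) -> float:
    """Burn rate for a latency SLO from two counter increases over one window.

    within_threshold: requests that finished within the latency threshold
    total:            all requests in the same window
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    slow_fraction = 1 - within_threshold / total
    return slow_fraction / (1 - slo_target)


def error_burn_rate(errors: float, total: float, slo_target: float = 0.999) -> float:
    """Burn rate for an availability SLO from error and total counts."""
    if total == 0:
        return 0.0
    return (errors / total) / (1 - slo_target)


# Hypothetical one-hour counts: 100,000 requests, 80,000 under 500 ms, 200 errors.
latency = latency_burn_rate(80_000, 100_000)  # about 20x
errors = error_burn_rate(200, 100_000)        # about 2x
worst = max(latency, errors)                  # the SLO that is burning fastest
print(latency, errors, worst)
```

Taking the maximum means the service is judged by whichever SLO is in the most trouble, which is the intent of the combined query.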

Error Budget Dashboards: Making Budget Visible

The burn rate alert is for paging. The error budget dashboard is for weekly team review — a tool for the conversation about whether reliability investment is warranted.

python
# Error budget remaining calculation for dashboard
from datetime import datetime, timedelta, timezone

# query_prometheus_range(query, start, end) is not defined in this snippet;
# it is assumed to run the query and return a single float.


def compute_error_budget_status(service: str, slo_target: float = 0.999, window_days: int = 30) -> dict:
    """
    Compute current error budget status for a service.
    """
    # Query Prometheus for error count and total count over the window
    end = datetime.now(timezone.utc)  # datetime.utcnow() is deprecated as of Python 3.12
    start = end - timedelta(days=window_days)

    total_requests = query_prometheus_range(
        f'sum(increase(http_requests_total{{job="{service}"}}[{window_days}d]))',
        start, end
    )

    error_requests = query_prometheus_range(
        f'sum(increase(http_requests_total{{job="{service}",status=~"5.."}}[{window_days}d]))',
        start, end
    )

    slo_error_rate = 1 - slo_target
    budget_total = total_requests * slo_error_rate
    budget_remaining = budget_total - error_requests
    # Guard the divisions: a service with no traffic has no budget to consume
    budget_consumed_pct = (error_requests / budget_total) * 100 if budget_total else 0.0
    actual_error_rate = error_requests / total_requests if total_requests else 0.0

    # Average share of the budget consumed per day across the rolling window
    budget_consumed_pct_per_day = budget_consumed_pct / window_days

    if budget_consumed_pct < 50:
        status = "healthy"
    elif budget_consumed_pct < 80:
        status = "warning"
    else:
        status = "critical"

    return {
        "service": service,
        "slo_target": slo_target,
        "window_days": window_days,
        "total_requests": total_requests,
        "error_requests": error_requests,
        "actual_error_rate": actual_error_rate,
        "budget_total_requests": budget_total,
        "budget_remaining_requests": budget_remaining,
        "budget_consumed_pct": budget_consumed_pct,
        "budget_consumed_pct_per_day": budget_consumed_pct_per_day,
        "budget_remaining_pct": 100 - budget_consumed_pct,
        "status": status,
    }

The weekly error budget review: once per week, the engineering team reviews the error budget dashboard for all services. Services burning budget faster than normal get explicit discussion: what caused it? Is an action item tracking the fix? If the budget is consistently exhausted, that's a signal that the SLO target is wrong (too aggressive) or reliability investment is inadequate. If the budget is never touched, that's a signal the SLO may be too conservative and reliability work could be redirected elsewhere.


Alert-to-SLO Mapping: Auditing Your Alert Set

Once you've implemented burn rate alerts, audit your existing alert set to identify redundancy and gaps:

python
from dataclasses import dataclass


# AuditReport is a minimal assumed shape; adapt it to your own reporting type.
@dataclass
class AuditReport:
    unsupported_alerts: list[dict]
    unmonitored_slos: list[dict]
    noisy_alerts: list[dict]
    total_alerts: int
    total_slos: int


def audit_alert_set(alerts: list[dict], slos: list[dict]) -> AuditReport:
    """
    Identify:
    1. Alerts that don't correspond to any SLO (likely noise or redundant)
    2. SLOs that have no corresponding burn rate alert (coverage gap)
    3. Alerts acted on less than 70% of the time (noisy)

    Alerts with no recorded actionability_rate default to 1.0 and are not flagged.
    """
    slo_metrics = {s['metric'] for s in slos}

    # Alerts with no SLO backing: candidates for removal
    unsupported_alerts = [a for a in alerts if a.get('metric') not in slo_metrics]

    # SLOs with no burn rate alert: coverage gaps
    unmonitored_slos = [s for s in slos if not any(
        a.get('type') == 'burn_rate' and a.get('slo_id') == s['id']
        for a in alerts
    )]

    # Alerts with actionability below 70%
    noisy_alerts = [a for a in alerts
                    if a.get('actionability_rate', 1.0) < 0.70]

    return AuditReport(
        unsupported_alerts=unsupported_alerts,
        unmonitored_slos=unmonitored_slos,
        noisy_alerts=noisy_alerts,
        total_alerts=len(alerts),
        total_slos=len(slos),
    )

In practice, most alert audits reveal that 30-50% of existing alerts don't correspond to any defined SLO and should be retired, converted to tickets, or converted to logging. The audit also surfaces SLOs that exist in documentation but have no operational monitoring — reliability targets that aren't being measured aren't being maintained.


Alertmanager Routing: The Last Mile

Well-designed alerts still fail if they're routed incorrectly — pages that go to the wrong team, critical alerts buried in low-priority channels, or routing logic so complex that nobody knows what goes where.

yaml
# Alertmanager routing configuration
route:
  group_by: ['alertname', 'service', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  
  routes:
    # Critical burn rate alerts: immediate page
    - matchers:  # "match" is deprecated in current Alertmanager releases
        - severity="page"
        - urgency="critical"
      receiver: pagerduty-critical
      continue: false  # Stop routing after this match
    
    # High burn rate alerts: page with slight delay
    - matchers:
        - severity="page"
        - urgency="high"
      receiver: pagerduty-high
      group_wait: 1m  # Wait 1 minute to group related alerts
      continue: false
    
    # Ticket-severity: Slack notification only
    - matchers:
        - severity="ticket"
      receiver: slack-tickets
      repeat_interval: 24h  # Don't spam the channel
      continue: false
    
    # Team-specific routing by label
    # Reached only by alerts that matched none of the routes above
    - matchers:
        - team="data"
      receiver: data-team-pagerduty
      continue: false

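receivers:
  # Every receiver referenced by a route must be defined, or the config fails to load.
  # Alertmanager does not expand ${VAR} itself; substitute these placeholders at deploy time.
  # PAGERDUTY_HIGH_KEY and PAGERDUTY_DATA_KEY are placeholder names.
  - name: default  # No integrations configured here: point this at a real channel
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "${PAGERDUTY_CRITICAL_KEY}"
        severity: critical
  - name: pagerduty-high
    pagerduty_configs:
      - service_key: "${PAGERDUTY_HIGH_KEY}"
  - name: data-team-pagerduty
    pagerduty_configs:
      - service_key: "${PAGERDUTY_DATA_KEY}"
  - name: slack-tickets
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: '#reliability-tickets'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'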

The routing configuration is a reliability system itself — it should be version-controlled, reviewed when changed, and tested with amtool before deployment:

bash
# Test routing configuration before applying
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname="ErrorBudgetBurnRateCritical" \
  severity="page" \
  urgency="critical" \
  team="backend"

# Should show: pagerduty-critical
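Route order matters as much as the matchers themselves. The sketch below is a deliberately simplified model of first-match routing, not Alertmanager's implementation: it handles only exact label equality and `continue: false`, and the `ROUTES` table is a hand-copied stand-in for the route list above. It is still enough to show which receiver wins when an alert could match more than one route.

```python
# Simplified stand-in for the route list: (matchers, receiver), in config order.
ROUTES = [
    ({"severity": "page", "urgency": "critical"}, "pagerduty-critical"),
    ({"severity": "page", "urgency": "high"}, "pagerduty-high"),
    ({"severity": "ticket"}, "slack-tickets"),
    ({"team": "data"}, "data-team-pagerduty"),
]


def resolve_receiver(labels: dict[str, str], default: str = "default") -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels.

    Models `continue: false` only: evaluation stops at the first match.
    """
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default


# A critical page from the data team matches the severity route first.
print(resolve_receiver({"severity": "page", "urgency": "critical", "team": "data"}))
# pagerduty-critical
print(resolve_receiver({"severity": "warning", "team": "data"}))
# data-team-pagerduty
print(resolve_receiver({"severity": "info"}))
# default
```

With this ordering the team-specific route only sees alerts that no severity route claimed. If the data team should receive its own critical pages, the team route has to come first or be nested under the severity routes.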

*Zak Hassan is a Staff SRE specializing in SLO engineering, observability, and reliability culture. Find him at zakhassan.com or on LinkedIn.*
