*By Zak Hassan — Staff SRE | May 2026*
Most alerting setups are broken in the same way. Teams set thresholds on individual metrics — CPU > 80%, error rate > 1%, latency > 500ms — and get paged whenever those thresholds are crossed. The result is a pile of alerts that fire too often when nothing is actually wrong, and occasionally miss the things that are. On-call engineers learn to ignore alert noise, which means real incidents get missed too.
Burn rate alerting, derived from SLOs, solves this. Instead of alerting on individual metrics in isolation, you alert on the rate at which you're consuming your error budget: the amount of unreliability your SLO allows. This approach pages you only when the current trajectory would exhaust your budget before the SLO window ends, and it tells you how urgent the problem is by how fast the budget is burning.
The Error Budget Foundation
An SLO defines a reliability target: "99.9% of requests will succeed over a 30-day rolling window." The error budget is the inverse: you're allowed to fail 0.1% of requests — a specific number based on traffic volume.
If your service handles 10 million requests per day and has a 99.9% SLO, your monthly error budget is:
Monthly request volume = 10M × 30 = 300M requests
Error budget = 300M × 0.001 = 300,000 failed requests
Error budget per day = 300,000 / 30 = 10,000 failed requests
Error budget per hour = 10,000 / 24 ≈ 417 failed requests
A burn rate is how fast you're consuming this budget relative to the sustainable rate. A burn rate of 1 means you're exactly on track: you'll use up exactly 100% of your error budget over the SLO window. A burn rate of 2 means you're using budget twice as fast as allowed, so you'll exhaust it in 15 days instead of 30.
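The arithmetic above is easy to get wrong by hand, so here is a minimal sketch of it in Python. The function names `error_budget` and `days_to_exhaustion` are illustrative and not part of any library; the sketch assumes steady traffic across the window.

```python
def error_budget(requests_per_day: float, slo_target: float, window_days: int = 30) -> float:
    """Number of failed requests the SLO allows over the whole window."""
    return requests_per_day * window_days * (1 - slo_target)


def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    if burn_rate <= 0:
        return float("inf")  # not burning any budget
    return window_days / burn_rate


if __name__ == "__main__":
    print(round(error_budget(10_000_000, 0.999)))  # 300000
    print(days_to_exhaustion(1))  # 30.0
    print(days_to_exhaustion(2))  # 15.0
    print(days_to_exhaustion(6))  # 5.0
```

Dividing the window length by the burn rate is all the "exhausted in N days" figures in the alert rules amount to.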
Multi-Window Burn Rate Alerts
The Google SRE workbook's recommendation for burn rate alerting uses two time windows for each alert: a short window (to detect fast-burning incidents quickly) and a long window (to confirm the burn rate is sustained, not a brief spike).
# Error rate calculation: the base metric
# sum() drops the status label so both sides of the division match;
# without it, each 5xx series would only be divided by itself.
sum(rate(http_requests_total{job="my-service",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="my-service"}[5m]))
# Burn rate at a given window
# burn_rate = current_error_rate / slo_error_rate
# slo_error_rate = 1 - slo_target = 0.001 for 99.9% SLO
# 1-hour burn rate
(
  sum(rate(http_requests_total{job="my-service",status=~"5.."}[1h]))
  / sum(rate(http_requests_total{job="my-service"}[1h]))
) / 0.001 # Divide by SLO error rate
# 5-minute burn rate
(
  sum(rate(http_requests_total{job="my-service",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="my-service"}[5m]))
) / 0.001
The multi-window alert rule:
groups:
  - name: slo-burn-rate-alerts
    rules:
      # CRITICAL: burning at 14x, budget exhausted in ~51 hours (720h / 14)
      # Use dual window: short confirms it's happening, long confirms it's sustained
      # sum by (job) keeps the job label for the annotations and lets the
      # 5xx series divide against total traffic.
      - alert: ErrorBudgetBurnRateCritical
        expr: |
          (
            # Short window: fast burn detection
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[1h]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[1h]))
            ) / 0.001 > 14
          )
          and
          (
            # Long window: sustained burn confirmation
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[6h]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[6h]))
            ) / 0.001 > 14
          )
        for: 2m
        labels:
          severity: page
          urgency: critical
        annotations:
          summary: "Error budget burning at 14x, exhausted in ~51h"
          description: |
            Service {{ $labels.job }} is burning error budget at 14x the normal rate.
            At this rate, the monthly error budget will be exhausted in approximately 51 hours.
          blog_post: "https://zakhassan.com/blog/error-budget-critical"
      # HIGH: burning at 6x, budget exhausted in ~5 days
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[6h]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[6h]))
            ) / 0.001 > 6
          )
          and
          (
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[1d]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[1d]))
            ) / 0.001 > 6
          )
        for: 5m
        labels:
          severity: page
          urgency: high
        annotations:
          summary: "Error budget burning at 6x, exhausted in ~5 days"
      # MEDIUM: burning at 3x, budget exhausted in ~10 days
      - alert: ErrorBudgetBurnRateMedium
        expr: |
          (
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[1d]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[1d]))
            ) / 0.001 > 3
          )
          and
          (
            (
              sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[3d]))
              / sum by (job) (rate(http_requests_total{job="my-service"}[3d]))
            ) / 0.001 > 3
          )
        for: 10m
        labels:
          severity: ticket # Not a page: ticket to investigate during business hours
          urgency: medium
        annotations:
          summary: "Error budget burning at 3x, exhausted in ~10 days"
Why multi-window: the short window detects sudden spikes quickly. The long window confirms the burn rate is genuine and sustained. A 1-minute spike in errors, even a total outage, doesn't satisfy the dual-window condition for the 1h/6h alert, because it cannot push the 6-hour burn rate above 14. This sharply reduces false positives compared to single-window alerting.
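To see what each tier costs you, here is a minimal sketch that models the three tiers from the rules above. The `BurnRateTier` class and helper names are illustrative assumptions, not part of Prometheus. It computes how much budget is gone if the burn sits exactly at the threshold for the whole long window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateTier:
    name: str
    threshold: float          # burn rate multiple that triggers the alert
    short_window_hours: float
    long_window_hours: float


# Thresholds and windows copied from the alert rules above.
TIERS = [
    BurnRateTier("critical", 14, 1, 6),
    BurnRateTier("high", 6, 6, 24),
    BurnRateTier("medium", 3, 24, 72),
]


def should_fire(tier: BurnRateTier, short_burn: float, long_burn: float) -> bool:
    """Dual-window condition: both windows must exceed the threshold."""
    return short_burn > tier.threshold and long_burn > tier.threshold


def budget_spent_pct(tier: BurnRateTier, window_days: int = 30) -> float:
    """Percent of the budget consumed by a burn at the threshold over the long window."""
    return tier.threshold * tier.long_window_hours * 100 / (window_days * 24)


if __name__ == "__main__":
    critical = TIERS[0]
    # One minute of 100% errors: about 16.7x over 1h but only about 2.8x over 6h.
    print(should_fire(critical, short_burn=16.7, long_burn=2.8))   # False
    print(should_fire(critical, short_burn=20.0, long_burn=16.0))  # True
    for tier in TIERS:
        print(tier.name, round(budget_spent_pct(tier), 1))
```

With these windows, a burn held at the threshold consumes roughly 12%, 20%, and 30% of the monthly budget over the critical, high, and medium long windows respectively. That is a useful sanity check when you tune thresholds for your own service.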
Latency SLOs and Burn Rate
Error rate SLOs are the most common, but latency SLOs work the same way. Define the latency budget, measure its consumption rate, alert on burn rate.
# Latency SLO: 99% of requests under 500ms, measured over 30 days
# Budget: 1% of requests can exceed 500ms
groups:
  - name: latency-slo
    rules:
      - alert: LatencyBudgetBurnRateCritical
        expr: |
          (
            # Fraction of requests EXCEEDING the 500ms threshold in the window
            # (1 - fraction within SLO). sum by (job) drops the le label so
            # the bucket and count series can be divided.
            1 - (
              sum by (job) (rate(http_request_duration_seconds_bucket{job="my-service",le="0.5"}[1h]))
              / sum by (job) (rate(http_request_duration_seconds_count{job="my-service"}[1h]))
            )
          ) / 0.01 > 14 # Divide by SLO error rate (1%), compare to burn rate 14x
        for: 2m
        labels:
          severity: page
Multi-SLO burn rate aggregation: for services with both error rate and latency SLOs, take the worse of the two burn rates:
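# Combined burn rate: max of error rate burn and latency burn.
# PromQL's max() aggregates over series and does not take two arguments, so
# each burn rate gets a distinguishing "slo" label before the two are combined.
max by (job) (
  label_replace(
    (
      sum by (job) (rate(http_requests_total{job="my-service",status=~"5.."}[1h]))
      / sum by (job) (rate(http_requests_total{job="my-service"}[1h]))
    ) / 0.001, # Error rate burn rate
    "slo", "errors", "", ""
  )
  or
  label_replace(
    (
      1 - (
        sum by (job) (rate(http_request_duration_seconds_bucket{job="my-service",le="0.5"}[1h]))
        / sum by (job) (rate(http_request_duration_seconds_count{job="my-service"}[1h]))
      )
    ) / 0.01, # Latency burn rate
    "slo", "latency", "", ""
  )
)
Error Budget Dashboards: Making Budget Visible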
The burn rate alert is for paging. The error budget dashboard is for weekly team review — a tool for the conversation about whether reliability investment is warranted.
# Error budget remaining calculation for dashboard
from datetime import datetime, timedelta, timezone


def compute_error_budget_status(service: str, slo_target: float = 0.999, window_days: int = 30) -> dict:
    """
    Compute current error budget status for a service.

    Assumes query_prometheus_range(query, start, end) is a helper defined
    elsewhere that returns the query result as a single float.
    """
    # Query Prometheus for error count and total count over the window
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=window_days)
    total_requests = query_prometheus_range(
        f'sum(increase(http_requests_total{{job="{service}"}}[{window_days}d]))',
        start, end
    )
    error_requests = query_prometheus_range(
        f'sum(increase(http_requests_total{{job="{service}",status=~"5.."}}[{window_days}d]))',
        start, end
    )
    # Without traffic there is no budget to compute (and we would divide by zero)
    if total_requests <= 0:
        raise ValueError(f"no requests recorded for {service} in the last {window_days} days")
    slo_error_rate = 1 - slo_target
    budget_total = total_requests * slo_error_rate
    budget_remaining = budget_total - error_requests
    budget_consumed_pct = (error_requests / budget_total) * 100
    # Average share of the budget consumed per day across the full window
    budget_consumed_pct_per_day = budget_consumed_pct / window_days
    return {
        "service": service,
        "slo_target": slo_target,
        "window_days": window_days,
        "total_requests": total_requests,
        "error_requests": error_requests,
        "actual_error_rate": error_requests / total_requests,
        "budget_total_requests": budget_total,
        "budget_remaining_requests": budget_remaining,
        "budget_consumed_pct": budget_consumed_pct,
        "budget_consumed_pct_per_day": budget_consumed_pct_per_day,
        "budget_remaining_pct": 100 - budget_consumed_pct,
        "status": (
            "healthy" if budget_consumed_pct < 50
            else "warning" if budget_consumed_pct < 80
            else "critical"
        ),
    }
The weekly error budget review: once per week, the engineering team reviews the error budget dashboard for all services. Services burning budget faster than normal get explicit discussion: what caused it? Is an action item tracking the fix? If the budget is consistently exhausted, that's a signal that the SLO target is wrong (too aggressive) or reliability investment is inadequate. If the budget is never touched, that's a signal the SLO may be too conservative and reliability work could be redirected elsewhere.
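The dashboard function above relies on a query helper that isn't shown. As one possible shape for it, here is a minimal sketch built on the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The names `parse_instant_scalar` and `query_prometheus_instant` are hypothetical, and the sketch assumes the query returns a single-series vector, as a `sum(increase(...))` expression does.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime


def parse_instant_scalar(payload: dict) -> float:
    """Extract one number from an instant-query response with a vector result."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return 0.0  # no matching series: treat as zero requests
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}
    return float(result[0]["value"][1])


def query_prometheus_instant(base_url: str, query: str, at: datetime) -> float:
    """Evaluate `query` at time `at` and return the single resulting value."""
    params = urllib.parse.urlencode({"query": query, "time": at.timestamp()})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=10) as resp:
        return parse_instant_scalar(json.load(resp))


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {}, "value": [1767225600, "1234"]}],
        },
    }
    print(parse_instant_scalar(sample))  # 1234.0
```

Because `increase(...[30d])` already covers the whole window, a single instant query evaluated at the end of the window is enough. No range query is needed.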
Alert-to-SLO Mapping: Auditing Your Alert Set
Once you've implemented burn rate alerts, audit your existing alert set to identify redundancy and gaps:
from dataclasses import dataclass


# Minimal result container; the fields mirror the keyword arguments used below.
@dataclass
class AuditReport:
    unsupported_alerts: list[dict]
    unmonitored_slos: list[dict]
    noisy_alerts: list[dict]
    total_alerts: int
    total_slos: int


def audit_alert_set(alerts: list[dict], slos: list[dict]) -> AuditReport:
    """
    Identify:
    1. Alerts that don't correspond to any SLO (likely noise or redundant)
    2. SLOs that have no corresponding burn rate alert (coverage gap)
    3. Alerts with a low actionability rate (mostly noise when they fire)

    Alerts without actionability data are not flagged as noisy.
    """
    slo_metrics = {s['metric'] for s in slos}
    # Alerts with no SLO backing: candidates for removal
    unsupported_alerts = [a for a in alerts if a.get('metric') not in slo_metrics]
    # SLOs with no burn rate alert: coverage gaps
    unmonitored_slos = [s for s in slos if not any(
        a.get('type') == 'burn_rate' and a.get('slo_id') == s['id']
        for a in alerts
    )]
    # Alerts with actionability below 70%
    noisy_alerts = [a for a in alerts
                    if a.get('actionability_rate', 1.0) < 0.70]
    return AuditReport(
        unsupported_alerts=unsupported_alerts,
        unmonitored_slos=unmonitored_slos,
        noisy_alerts=noisy_alerts,
        total_alerts=len(alerts),
        total_slos=len(slos)
    )
In practice, most alert audits reveal that 30-50% of existing alerts don't correspond to any defined SLO and should be retired, converted to tickets, or converted to logging. The audit also surfaces SLOs that exist in documentation but have no operational monitoring. Reliability targets that aren't being measured aren't being maintained.
Alertmanager Routing: The Last Mile
Well-designed alerts still fail if they're routed incorrectly — pages that go to the wrong team, critical alerts buried in low-priority channels, or routing logic so complex that nobody knows what goes where.
# Alertmanager routing configuration
route:
  group_by: ['alertname', 'service', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical burn rate alerts: immediate page
    - match:
        severity: page
        urgency: critical
      receiver: pagerduty-critical
      continue: false # Stop routing after this match
    # High burn rate alerts: page with slight delay
    - match:
        severity: page
        urgency: high
      receiver: pagerduty-high
      group_wait: 1m # Wait 1 minute to group related alerts
      continue: false
    # Ticket-severity: Slack notification only
    - match:
        severity: ticket
      receiver: slack-tickets
      repeat_interval: 24h # Don't spam the channel
      continue: false
    # Team-specific routing by label.
    # Only reached by alerts that matched none of the severity routes above.
    - match:
        team: data
      receiver: data-team-pagerduty
      continue: false
# Every receiver named in a route must be defined, or Alertmanager rejects the config.
# Alertmanager does not expand ${...} itself; render these placeholders at deploy time.
receivers:
  - name: default # No integrations: unmatched alerts are not delivered anywhere
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "${PAGERDUTY_CRITICAL_KEY}"
        severity: critical
  - name: pagerduty-high
    pagerduty_configs:
      - service_key: "${PAGERDUTY_HIGH_KEY}" # Placeholder name, adjust to your setup
  - name: data-team-pagerduty
    pagerduty_configs:
      - service_key: "${PAGERDUTY_DATA_TEAM_KEY}" # Placeholder name, adjust to your setup
  - name: slack-tickets
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: '#reliability-tickets'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'
The routing configuration is a reliability system itself: it should be version-controlled, reviewed when changed, and tested with amtool before deployment:
# Test routing configuration before applying
amtool config routes test \
--config.file=alertmanager.yml \
alertname="ErrorBudgetBurnRateCritical" \
severity="page" \
urgency="critical" \
team="backend"
# Should show: pagerduty-critical
*Zak Hassan is a Staff SRE specializing in SLO engineering, observability, and reliability culture. Find him at zakhassan.com or on LinkedIn.*