*By Zak Hassan — Staff SRE | May 2026*
On-call is where reliability theory meets operational reality. The best-designed system still has a human being paged at 2am when something goes wrong, and that human being's ability to respond effectively — quickly, calmly, without making things worse — determines whether an incident is a minor blip or a multi-hour outage.
Most on-call improvements focus on the wrong things: better tooling, faster escalation, clearer blog posts. These help at the margins. The bigger problems are structural: too much alert noise drowning out real incidents, stale notes that no longer match reality, and rotation structures that burn engineers out until they leave. Fixing on-call requires addressing the structural problems first.
The On-Call Toil Audit
Toil is manual, repetitive operational work that doesn't produce lasting value. Responding to the same alert for the sixth time this month is toil. Manually restarting a service that should auto-recover is toil. Toil directly competes with reliability improvements — engineers doing toil aren't building the automation that would eliminate it.
The audit starts with measurement. For 30 days, track every on-call action:
# On-call toil tracking schema
# Ideally this feeds into a dashboard; minimally, it's a shared spreadsheet
oncall_log_fields = {
    "timestamp": "datetime",
    "engineer": "string",
    "alert_name": "string",
    "service": "string",
    "action_taken": "string",        # What did the engineer actually do?
    "time_to_resolve_minutes": "int",
    "was_automated": "bool",         # Could this have been automated?
    "required_code_change": "bool",  # Fix required a code/config change?
    "recurrence": "int",             # How many times this week/month?
    "actionable": "bool",            # Did the alert require action, or was it noise?
    "blog_post_exists": "bool",
    "blog_post_was_helpful": "bool",
}
After 30 days, the audit reveals the pattern:
- Which alerts fire most often and are noise (low actionable rate)?
- Which alerts require the same manual action every time (automation candidate)?
- Which services generate disproportionate on-call volume?
The audit output is a prioritized list of improvements, ordered by on-call hour reduction per unit of engineering effort.
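As a rough sketch of that prioritization, the logged entries can be summed per alert and divided by an effort estimate. The field names come from the tracking schema above; the `effort_days` mapping is a hypothetical input (someone on the team has to estimate it), and time spent during the audit window is only a proxy for the hours a fix would save.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def prioritize(log_entries: Iterable[dict],
               effort_days: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank alerts by on-call hours spent per day of estimated fix effort."""
    minutes_spent = defaultdict(int)
    for entry in log_entries:
        minutes_spent[entry["alert_name"]] += entry["time_to_resolve_minutes"]

    ranked = []
    for alert_name, minutes in minutes_spent.items():
        effort = effort_days.get(alert_name)
        if not effort:
            continue  # no estimate yet, so it cannot be ranked
        ranked.append((alert_name, (minutes / 60) / effort))

    # Highest payoff per unit of effort first
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Illustrative data only
sample_log = [
    {"alert_name": "DiskUsageHigh", "time_to_resolve_minutes": 10},
    {"alert_name": "DiskUsageHigh", "time_to_resolve_minutes": 10},
    {"alert_name": "DiskUsageHigh", "time_to_resolve_minutes": 10},
    {"alert_name": "ErrorRateCritical", "time_to_resolve_minutes": 60},
]
sample_effort = {"DiskUsageHigh": 1, "ErrorRateCritical": 5}

for name, score in prioritize(sample_log, sample_effort):
    print(f"{name}: {score:.2f} on-call hours per effort day")
```

A cheap fix for a frequent, short alert can outrank an expensive fix for a rare, long one, which is the point of dividing by effort rather than sorting by volume alone.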
Alert Quality: The Primary Lever
Ninety percent of on-call toil reduction comes from alert hygiene. An on-call rotation with 50 alerts per week where 40 are noise is fundamentally different from one with 15 alerts per week where 12 require action.
Alert classification:
Every alert should have an explicit classification:
- Page-worthy: Customer impact is occurring or imminent. Requires immediate response at any hour.
- Ticket: Something needs attention within business hours. No immediate response required.
- Log only: Informational. No action required; exists only for post-incident debugging.
Most alert setups have one category (page) doing the job of three. Everything pages, so everything is treated as noise, so real pages get missed.
# Prometheus alerting rules with explicit severity
groups:
  - name: my-service.alerts
    rules:
      # PAGE: customers are experiencing errors right now
      - alert: ErrorRateCritical
        expr: |
          sum(rate(http_requests_total{service="my-service",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{service="my-service"}[5m])) > 0.05
        for: 2m
        labels:
          severity: page
          team: backend
        annotations:
          summary: "Error rate >5% for 2 minutes"
          blog_post: "https://zakhassan.com/blog/my-service-high-error-rate"
      # TICKET: degraded but not broken; fix before it becomes a page
      - alert: ErrorRateElevated
        expr: |
          sum(rate(http_requests_total{service="my-service",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{service="my-service"}[5m])) > 0.01
        for: 10m
        labels:
          severity: ticket
          team: backend
        annotations:
          summary: "Error rate elevated (1-5%) for 10 minutes"
      # LOG: useful for debugging, not worth interrupting anyone
      - alert: SlowDatabaseQuery
        expr: histogram_quantile(0.99, rate(db_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: info
          team: backend
Reducing false positive rates: for each alert that fires, track whether it required action. An alert with less than 70% actionability (fires without requiring action more than 30% of the time) should be deleted or converted to a ticket. The bar sounds low, yet most production alert sets fail it badly.
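The 70% rule is easy to check mechanically once firings are logged. The sketch below assumes each log entry carries the `alert_name` and `actionable` fields from the toil tracking schema; the function name and report shape are illustrative, not part of any monitoring tool.

```python
from collections import Counter
from typing import Dict, Iterable

ACTIONABILITY_THRESHOLD = 0.70  # the bar stated in the text


def actionability_report(log_entries: Iterable[dict],
                         threshold: float = ACTIONABILITY_THRESHOLD) -> Dict[str, dict]:
    """Compute per-alert actionability and flag alerts below the threshold."""
    fired = Counter()
    acted_on = Counter()
    for entry in log_entries:
        fired[entry["alert_name"]] += 1
        if entry["actionable"]:
            acted_on[entry["alert_name"]] += 1

    report = {}
    for alert_name, count in fired.items():
        rate = acted_on[alert_name] / count
        report[alert_name] = {
            "fired": count,
            "actionability": rate,
            # Below the bar: delete the alert or convert it to a ticket
            "keep_paging": rate >= threshold,
        }
    return report
```

With only a handful of firings the rate is noisy, so it is probably worth requiring a minimum number of firings before acting on the flag.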
Blog Posts That Actually Work
A blog post is only useful if an engineer on their third hour of a 2am incident can follow it without previous knowledge of the system. Most blog posts fail this test because they were written by the person who knows the system best, for an imagined reader who also knows the system.
Blog post structure that works:
# [Service Name] — [Alert Name] Blog Post
## What this alert means
The error rate on checkout-service has exceeded 5% for 2+ minutes.
This means customers are experiencing failures on the payment flow.
## Immediate impact assessment
1. Check current error rate: [link to dashboard]
2. Check if orders are processing: run `curl https://api.example.com/health/orders`
3. If orders are completely failing: this is P1 — escalate to engineering leadership NOW
4. If orders are degraded (some failing): this is P2 — continue debugging
## Common causes and fixes (in order of likelihood)
### 1. Recent deployment (most common — check first)
- Check: `kubectl rollout history deployment/checkout-service -n production`
- Fix if recent: `kubectl rollout undo deployment/checkout-service -n production`
- Wait 2 minutes and verify error rate drops
### 2. Database connection exhaustion
- Check: [link to PgBouncer dashboard] — look for cl_waiting > 0
- Fix: `kubectl scale deployment/checkout-service --replicas=X -n production` (reduce replicas to reduce DB connections)
- If PgBouncer itself is failing: page database on-call
### 3. Downstream payment processor degraded
- Check: https://status.stripe.com (the payment processor)
- If Stripe is degraded: this is an upstream dependency issue — notify customer support to expect tickets
- No fix needed from us; monitor until Stripe recovers
### 4. Memory leak causing OOM restarts
- Check: look for frequent pod restarts in [link to pod dashboard]
- Fix: temporarily raise the memory limit while the team investigates the root cause:
  `kubectl set resources deployment/checkout-service --limits=memory=2Gi -n production`
## Escalation
- Still failing after 30 minutes: page backend engineering lead
- Customer data at risk: page security and engineering leadership
- Payment processor account issues: page VP Engineering
## Post-incident
- File incident report in [link to incident tracker]
- Update this blog post if you found a new cause
The key elements: explicit impact assessment up front (engineers need to know severity before debugging), ordered likely causes (not an exhaustive list), exact commands (not "investigate the database" but the actual command), and escalation contacts for when debugging fails.
Rotation Design: Preventing Burnout
Rotation design is often an afterthought — whoever is available goes on-call. The structural decisions about rotation design have a bigger impact on engineer health and retention than any tooling improvement.
Rotation principles:
Minimum rotation size is six engineers. Fewer than six means engineers are on-call more than one week in six, which becomes a recurring life disruption rather than an occasional inconvenience. If the team is smaller than six, options are: cross-train adjacent teams, hire, or accept that the on-call burden will cause attrition.
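The arithmetic behind the six-engineer floor is simple enough to sketch. This assumes week-long shifts shared equally and ignores vacations and shadow weeks, so real numbers will be somewhat worse.

```python
def oncall_weeks_per_year(team_size: int, weeks_per_year: int = 52) -> float:
    """Weeks each engineer spends on-call per year with equal, week-long shifts."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return weeks_per_year / team_size


for size in (3, 4, 6, 8):
    print(f"{size} engineers: on-call 1 week in {size}, "
          f"about {oncall_weeks_per_year(size):.1f} weeks per year")
```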
Shadow rotations for new engineers: a new team member should shadow on-call (receive pages but not be expected to resolve independently) for at least one full rotation before going on-call independently. Being expected to resolve unfamiliar production incidents solo is a reliable path to on-call PTSD and departure.
Follow-the-sun for global teams: if the team spans timezones, route pages to whoever is in business hours. Paging a North American engineer at 3am for something a European colleague could handle during their working day is unnecessary suffering. The tooling to implement this (PagerDuty schedules, escalation policies) is straightforward.
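The routing decision itself can be sketched in a few lines. This is an illustration of the idea, not how PagerDuty implements schedules: the roster, the engineer names, and the 09:00 to 17:00 window are all assumptions, and fixed UTC offsets are used for simplicity, so daylight saving time is ignored.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Hypothetical roster: engineer -> fixed UTC offset in hours (DST ignored)
ROSTER = {
    "engineer-toronto": -5,
    "engineer-berlin": 1,
    "engineer-singapore": 8,
}


def in_business_hours(now_utc: datetime, utc_offset_hours: int,
                      start_hour: int = 9, end_hour: int = 17) -> bool:
    """True if the local time at the given offset is a weekday within the window."""
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.weekday() < 5 and start_hour <= local.hour < end_hour


def pick_responder(now_utc: datetime,
                   roster: Dict[str, int] = ROSTER) -> Optional[str]:
    """Return the first engineer currently in business hours, or None if nobody is."""
    for engineer, offset in roster.items():
        if in_business_hours(now_utc, offset):
            return engineer
    return None  # coverage gap: fall back to the regular on-call rotation
```

The `None` case matters: three locations with eight-hour windows rarely cover all 24 hours, so the normal rotation still has to back the schedule.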
Compensation: on-call should be compensated — either through explicit on-call pay, compensatory time off after heavy on-call weeks, or both. Teams that treat on-call as just part of the job without compensation will see who leaves first.
The Handoff Process
Information lost at on-call handoffs compounds over time. The incoming engineer doesn't know which alerts were noisy this week, which fixes are temporary and need follow-up, or what known issues to watch for.
A structured handoff takes 20 minutes and prevents hours of repeated debugging:
# On-Call Handoff — Week of [DATE]
## Outgoing engineer: [NAME]
## Incoming engineer: [NAME]
## Active incidents / monitoring situations
- [INCIDENT-123] Elevated error rate on search-service: temporarily mitigated by scaling to 20 replicas.
Root cause still unknown. Check error rate Monday morning — if still elevated, this becomes a P1.
[Link to incident]
## Temporary changes that need to be reverted
- search-service replica count was increased to 20 (normal is 8).
Revert when error rate returns to baseline:
`kubectl scale deployment/search-service --replicas=8 -n production`
## Known noisy alerts this week
- DiskUsageHigh on worker-node-03 fires every 6 hours but auto-remediates.
Engineering tracking at [link to ticket]. Safe to acknowledge and ignore.
## Upcoming deployments / risky changes
- auth-service deployment Tuesday afternoon. Auth is always risky.
Watch for login error spike 15 minutes post-deploy.
## New blog posts / documentation added this week
- Added blog post for payment-service timeout errors: [link]
## What I wish I'd known before this rotation
- The monitoring dashboard for search-service has incorrect thresholds —
  the real p99 latency baseline is 800ms, not 200ms. Ticket filed to fix.
The handoff document lives in the incident management system (not Slack, which is ephemeral) and is updated throughout the rotation, not written from scratch at the end.
Automation: Closing the Loop on Toil
The toil audit identifies candidates; automation closes the loop. The highest-value automations are repeated steps that are already explained in blog posts and executed manually every time an alert fires.
# Automated remediation for pod crash-loop
# Triggered by the CrashLoopBackOff alert before paging the engineer
import subprocess
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# get_last_deployment_time, log_remediation_action, wait_for_stable,
# send_notification, and page_oncall are team-specific helpers defined elsewhere.


@dataclass
class RemediationResult:
    # Minimal result type; shape assumed from how it is used below
    success: bool
    action: str


def auto_remediate_crashloop(namespace: str, deployment: str) -> RemediationResult:
    """
    Attempt automated recovery from CrashLoopBackOff.
    Log the action; page if remediation fails.
    """
    # 1. Check if this is a recent deployment (most common cause).
    #    The helper is assumed to return a timezone-aware UTC datetime.
    last_deploy_time = get_last_deployment_time(namespace, deployment)
    time_since_deploy = datetime.now(timezone.utc) - last_deploy_time
    if time_since_deploy < timedelta(minutes=30):
        # Recent deploy is likely the cause: roll back automatically
        log_remediation_action(
            deployment=deployment,
            action="auto_rollback",
            reason=f"CrashLoop {int(time_since_deploy.total_seconds())}s after deployment",
        )
        result = subprocess.run(
            ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            # Monitor for 5 minutes before declaring success
            if wait_for_stable(namespace, deployment, timeout_seconds=300):
                send_notification(
                    channel="incidents",
                    message=f"✅ Auto-remediated CrashLoopBackOff on {deployment} via rollback. No engineer action needed.",
                )
                return RemediationResult(success=True, action="rollback")
    # Remediation failed or cause unknown: escalate to on-call
    page_oncall(
        alert="CrashLoopBackOff",
        service=deployment,
        context=f"Auto-remediation failed. Last deploy: {last_deploy_time}",
    )
    return RemediationResult(success=False, action="escalated")
The principle: attempt automated remediation for known, safe actions (rollback, restart, scale) before paging a human. Log every automated action. Page the human when automation fails or when the cause is unknown. The goal is not to eliminate on-call; it is to ensure that when a human is paged, it is because human judgment is genuinely required.
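The same principle generalizes beyond crash loops. A minimal sketch, assuming a registry that maps alert names to remediation functions and a `page` callable supplied by the caller; both are hypothetical and would wrap whatever paging and tooling the team already has.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("auto_remediation")

# Registry: alert name -> remediation callable that returns True on success
REMEDIATIONS: Dict[str, Callable[[str], bool]] = {}


def remediation(alert_name: str):
    """Decorator that registers a known, safe remediation for an alert."""
    def register(func: Callable[[str], bool]) -> Callable[[str], bool]:
        REMEDIATIONS[alert_name] = func
        return func
    return register


def handle_alert(alert_name: str, service: str,
                 page: Callable[[str, str, str], None]) -> str:
    """Try the registered remediation first; page a human if none exists or it fails."""
    action = REMEDIATIONS.get(alert_name)
    if action is None:
        # Unknown cause: human judgment is required
        page(alert_name, service, "no known remediation")
        return "paged"

    logger.info("attempting %s for %s on %s", action.__name__, alert_name, service)
    try:
        succeeded = action(service)
    except Exception:
        logger.exception("remediation %s raised for %s", action.__name__, service)
        succeeded = False

    if succeeded:
        logger.info("remediated %s on %s via %s", alert_name, service, action.__name__)
        return "remediated"

    page(alert_name, service, f"auto-remediation {action.__name__} failed")
    return "paged"
```

Keeping the registry explicit makes the set of automated actions reviewable: anything not registered always reaches a human.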
*Zak Hassan is a Staff SRE specializing in incident management, on-call operations, and reliability culture. Find him at zakhassan.com or on LinkedIn.*