I've been on-call in some form for most of my career. Early on, that meant being woken up by a PagerDuty alert at 3am, fumbling for my laptop, opening a blog post in one tab and a dozen monitoring dashboards in another, and spending the next hour trying to piece together what happened while running on adrenaline and insufficient sleep.

The job is meaningfully different now. Not because the systems have gotten simpler — they haven't — but because the support structure around the on-call engineer has changed. AI tooling that was experimental two years ago is now part of the standard toolkit, and its effect on the on-call experience is real and measurable.

This is a personal account of what modern on-call looks like and where the remaining friction is.


The Page That Wakes You Up

The alert itself is the same. PagerDuty, OpsGenie, or your organization's alerting platform sends a notification at 2:47am. You're awake.

The first difference: what arrives alongside the alert. In a mature AI-augmented on-call setup, by the time you've opened your laptop, there's already a message in the incident Slack channel from the investigation agent. It started working the moment the alert fired — querying logs, checking recent deployments, correlating metrics — and it has a first take ready:

🔍 Auto-Investigation: payment-service elevated error rate and latency (P95 >2s)

LIKELY CAUSE (High Confidence):
Deployment of payment-service v3.2.1 at 02:31 UTC introduced a database 
query regression. The /v2/charges endpoint is executing a full table scan 
on the transactions table for queries with >90 day lookback windows.

EVIDENCE:
• Error spike began 02:33 — 2 minutes after deployment at 02:31
• Affected endpoint: POST /v2/charges (3.2% error rate, up from 0.1%)
• Slow query log: SELECT * FROM transactions WHERE customer_id=? AND 
  created_at > ? — missing index on (customer_id, created_at) composite
• Previous version: same query executed in ~8ms
• Current version: same query executing in 2,400ms for customers with 
  >90 day transaction history (~18% of users)

RECOMMENDED ACTION:
Option A (Fast): Roll back to v3.2.0 — 5 minutes, zero risk
Option B (Fix Forward): Add index — requires DB maintenance window

Investigation time: 1m 47s | Confidence: High

You read this in 45 seconds. You approve the rollback. Total time from page to resolution: under 7 minutes. You go back to sleep.

This is not a hypothetical. This is what roughly 60% of P2 incidents look like for teams that have built and deployed AI investigation agents. The agent doesn't eliminate the human: you're still reviewing and approving the recommendation. But your role is judgment, not investigation.
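To make the missing-index finding from the sample report concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. The scenario never says which database payment-service runs on, so SQLite, the columns other than customer_id and created_at, and the index name idx_tx_customer_created are all assumptions for illustration. The sketch asks the query planner how it would run the report's query before and after the composite index exists.

```python
import sqlite3

def query_plan(conn, sql, params):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

# In-memory database standing in for the real one (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " created_at TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)

# The query quoted in the report's slow query log.
SQL = "SELECT * FROM transactions WHERE customer_id = ? AND created_at > ?"
PARAMS = (42, "2024-01-01")  # hypothetical customer and lookback date

plan_before = query_plan(conn, SQL, PARAMS)

# Option B from the report: add the composite index (name is hypothetical).
conn.execute(
    "CREATE INDEX idx_tx_customer_created"
    " ON transactions (customer_id, created_at)"
)
plan_after = query_plan(conn, SQL, PARAMS)

print("before:", plan_before)
print("after: ", plan_after)
```

On SQLite the first plan is a scan of the whole table and the second is a search through the new index. Other engines word their plans differently, but the before-and-after comparison is the same kind of evidence the agent is citing.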


When the Agent Is Wrong

The other 40% of incidents are why humans are still on-call.

The agent's failure modes are specific and learnable. It tends to miss when:

  • The root cause shows up only as a correlation, with no direct causal chain the agent can trace
  • The failure involves a new service or integration it has no historical context for
  • Multiple contributing factors are each necessary but not sufficient alone
  • The issue is environmental rather than code-driven (network partition, AWS service degradation, certificate expiry)

When the agent escalates to you with low confidence or an "inconclusive" assessment, you're starting your investigation with the agent's work already done — the data gathering phase is complete, the hypotheses are listed, and the agent has often already ruled out several likely causes. You're starting at step 3 of a 6-step investigation, not step 1.

The difference is significant. The cognitive load of starting an investigation from a blank screen at 3am — figuring out what to look at, where to start, which direction to follow — is substantial. Arriving with a structured report that says "ruled out recent deployment (no deploys in 4 hours), ruled out dependency health (all dependencies healthy), most likely cause is the cache eviction event at 02:40 — here's the evidence" changes the mental load from "where do I start?" to "does this reasoning hold up?"
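A structured report like the one quoted above can be modeled as a list of hypotheses, each marked as ruled out or still open. The sketch below is one possible shape, not the format of any particular agent: the Finding and TriageReport classes, the two check functions, the dependency names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Finding:
    hypothesis: str
    ruled_out: bool
    evidence: str

@dataclass
class TriageReport:
    findings: list = field(default_factory=list)

    def ruled_out(self):
        return [f for f in self.findings if f.ruled_out]

    def open_leads(self):
        return [f for f in self.findings if not f.ruled_out]

def check_recent_deploy(deploy_times, alert_time, window_hours=4):
    """Rule out a deploy-related cause if nothing shipped inside the window."""
    window = timedelta(hours=window_hours)
    recent = [t for t in deploy_times
              if timedelta(0) <= alert_time - t <= window]
    if recent:
        return Finding("recent deployment", False,
                       f"{len(recent)} deploy(s) in the last {window_hours} hours")
    return Finding("recent deployment", True,
                   f"no deploys in {window_hours} hours")

def check_dependencies(health):
    """Rule out dependency trouble if every dependency reports healthy."""
    unhealthy = sorted(name for name, ok in health.items() if not ok)
    if unhealthy:
        return Finding("dependency health", False,
                       "unhealthy: " + ", ".join(unhealthy))
    return Finding("dependency health", True, "all dependencies healthy")

# Illustrative inputs only; the last deploy is well outside the 4-hour window.
alert_time = datetime(2024, 1, 1, 2, 47)
report = TriageReport([
    check_recent_deploy([datetime(2023, 12, 31, 18, 5)], alert_time),
    check_dependencies({"postgres": True, "redis": True}),
    Finding("cache eviction event at 02:40", False,
            "eviction shortly before the alert"),
])

for f in report.findings:
    status = "ruled out" if f.ruled_out else "open"
    print(f"[{status}] {f.hypothesis}: {f.evidence}")
```

The point of the structure is that the human reads the open leads first and spot-checks the ruled-out ones, instead of rebuilding the list from nothing.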


The Tooling That Changed My Day-to-Day

Beyond the investigation agent, a few tools have changed the texture of on-call:

Blog post search via natural language. Rather than Ctrl+F-ing through a pile of notes, the on-call setup I model in my homelab includes a semantic search tool over public blog posts, lab notes, and past simulated incident summaries. "How should Redis connection pool exhaustion be handled?" finds the relevant post in 2 seconds, including links to similar experiments and what actually fixed them.

Automated deployment context. When I open an incident channel, the context panel automatically shows: last 5 deploys to the affected service (with git diff links), the current on-call for each upstream dependency, and any open incidents in related services. What used to take 10 minutes of checking each system separately is presented automatically.
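As a rough sketch of how such a context panel might be assembled, the function below joins three data sources that are assumed to already exist: a deploy log, a dependency map, and an on-call schedule. The Deploy class, build_context, the service names other than payment-service, and the example.com URLs are hypothetical, and "related services" is interpreted here as upstream dependencies only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deploy:
    service: str
    version: str
    deployed_at: str  # ISO 8601 UTC, so string order matches time order
    diff_url: str

def build_context(service, deploys, dependencies, oncall, open_incidents, limit=5):
    """Collect what the on-call engineer would otherwise look up by hand."""
    recent = sorted(
        (d for d in deploys if d.service == service),
        key=lambda d: d.deployed_at,
        reverse=True,
    )[:limit]
    upstream = dependencies.get(service, [])
    return {
        "recent_deploys": recent,
        "upstream_oncall": {dep: oncall.get(dep, "unknown") for dep in upstream},
        "related_incidents": [i for i in open_incidents if i["service"] in upstream],
    }

# Illustrative data only.
deploys = [
    Deploy("payment-service", "v3.2.0", "2024-01-01T18:00:00Z",
           "https://git.example.com/diff/1"),
    Deploy("payment-service", "v3.2.1", "2024-01-02T02:31:00Z",
           "https://git.example.com/diff/2"),
    Deploy("ledger-service", "v1.9.0", "2024-01-02T01:00:00Z",
           "https://git.example.com/diff/3"),
]
panel = build_context(
    "payment-service",
    deploys,
    dependencies={"payment-service": ["ledger-service", "auth-service"]},
    oncall={"ledger-service": "alice"},
    open_incidents=[
        {"id": "INC-1", "service": "auth-service"},
        {"id": "INC-2", "service": "search-service"},
    ],
)
for deploy in panel["recent_deploys"]:
    print(deploy.version, deploy.deployed_at, deploy.diff_url)
```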

Post-incident summarization. After an incident is resolved, the AI generates a draft postmortem: timeline of events, what changed, what happened, what fixed it, preliminary impact estimate. You edit it; you don't write it from scratch. The cognitive overhead of postmortem writing when you're tired after a 2am incident was significant. Editing a reasonable draft is much less draining.
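The draft has a predictable skeleton, which is part of why editing it is cheap. The sketch below only renders that skeleton from structured inputs and makes no model call; draft_postmortem and its parameters are hypothetical names, and the sample events reuse the timestamps from the payment-service example earlier in this post.

```python
def draft_postmortem(title, events, what_changed, fix):
    """Render a plain-text postmortem skeleton.

    events is a list of (HH:MM, description) pairs from a single day,
    so sorting the pairs puts them in chronological order.
    """
    lines = [f"Postmortem draft: {title}", "", "Timeline (UTC):"]
    for timestamp, description in sorted(events):
        lines.append(f"  {timestamp}  {description}")
    lines += [
        "",
        f"What changed: {what_changed}",
        f"What fixed it: {fix}",
        "Impact estimate: TODO, confirm before publishing",
    ]
    return "\n".join(lines)

draft = draft_postmortem(
    title="payment-service elevated error rate",
    events=[
        ("02:33", "Error spike began on POST /v2/charges"),
        ("02:31", "Deployed payment-service v3.2.1"),
    ],
    what_changed="payment-service v3.2.1 introduced a query regression",
    fix="Rolled back to v3.2.0",
)
print(draft)
```

Leaving the impact line as an explicit TODO is a deliberate choice in this sketch: it is the field most likely to need a human's numbers.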


What Hasn't Changed (And Shouldn't)

The judgment calls are still human. Some examples from recent months:

"The agent recommends rolling back, but the deployment fixed a security vulnerability. Should the team accept 30 more minutes of degraded service to apply a targeted hotfix instead of rolling back to the vulnerable version?" — This requires weighing reliability against security. Not an AI call.

"Three separate services are having issues simultaneously. The agent is investigating each independently. But something feels connected — the timing is too precise for coincidence. Let me look at the common infrastructure." — Pattern recognition across incidents that the agent is treating as separate. Human intuition, validated by investigation.

"The evidence points to a vendor API being slow. But I know this vendor just had a major incident last week and they're on shaky footing. This might be intermittent rather than resolved. Let me add monitoring before declaring the incident closed." — Domain knowledge about vendor reliability history. Not encoded in the agent.

These are the incidents that are worth getting woken up for. The agent does the mechanical work; the human brings judgment, historical context, and pattern recognition.


The On-Call Load Numbers

For context, here is the on-call experience in rough numbers on a team where the AI investigation tooling is mature:

  • 60% of P2 incidents: agent produces high-confidence diagnosis, human approves resolution, back to sleep in under 10 minutes
  • 25% of P2 incidents: agent produces partial investigation, human takes it from there with context already gathered, 20-40 minutes to resolve
  • 15% of P2 incidents: agent is inconclusive or low confidence, human leads the investigation using the data the agent gathered, 45+ minutes
  • Near-zero P3 incidents that require human attention at night: the agent handles these autonomously or defers them to morning

The reduction in "wake up, investigate for an hour, find that it was a flaky dependency that resolved itself, go back to sleep at 4am" incidents is substantial. That specific category — the false positive that still costs you an hour of sleep — is largely eliminated for the failure modes in the agent's playbook.
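To turn the P2 breakdown above into a single figure, here is a back-of-the-envelope calculation. The representative durations are my assumptions, not measurements: 10 minutes for the first bucket (its stated upper bound), 30 minutes for the second (the midpoint of 20-40), and 45 minutes for the third (its stated lower bound).

```python
# Each bucket: (share of P2 incidents in percent, assumed minutes to resolve).
BUCKETS = {
    "high-confidence diagnosis": (60, 10),
    "partial investigation": (25, 30),
    "inconclusive or low confidence": (15, 45),
}

def expected_minutes(buckets):
    """Weighted average time per incident across all buckets."""
    total_share = sum(share for share, _ in buckets.values())
    if total_share != 100:
        raise ValueError(f"shares must sum to 100, got {total_share}")
    return sum(share * minutes for share, minutes in buckets.values()) / 100

average = expected_minutes(BUCKETS)
print(f"Expected time per P2 incident: {average} minutes")  # 20.25
```

Because the third bucket is open-ended, the true average is probably somewhat higher than this figure, but it gives a sense of scale against the hour-long investigations described at the start of this post.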


The Organizational Trust-Building Arc

Getting from "the team has an incident investigation agent" to "I actually trust this enough to approve rollbacks at 2am without double-checking" takes time. It can take a team about three months of running an agent in observe-only mode (it produces reports, but you still do the investigation) before trust is high enough for the agent's diagnosis to meaningfully accelerate the human investigation, and another two months before the high-confidence cases are acted on directly.

The trust-building arc is sequential:

  1. Agent observes and reports (you validate against your own findings)
  2. Agent reports and you use it as a starting point (you extend, not replace)
  3. Agent produces high-confidence recommendations you act on directly
  4. Agent handles low-complexity incidents autonomously (you review in the morning)

Don't try to jump to step 4 without going through steps 1-3. The trust has to be earned through demonstrated accuracy, not assumed.
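One way to keep the arc honest is to encode it, so that a stage can only be reached from the one directly before it and only once the agent's track record supports it. This is a minimal sketch under stated assumptions: the TrustStage names, the advance function, and the 0.9 accuracy threshold are hypothetical and would need to be set by each team.

```python
from enum import IntEnum

class TrustStage(IntEnum):
    OBSERVE_ONLY = 1               # agent reports, human validates independently
    STARTING_POINT = 2             # human builds on the agent's report
    ACT_ON_HIGH_CONFIDENCE = 3     # human acts directly on high-confidence calls
    AUTONOMOUS_LOW_COMPLEXITY = 4  # agent handles simple incidents, reviewed later

def advance(current, target, observed_accuracy, required_accuracy=0.9):
    """Move to the next stage only if it is adjacent and accuracy is proven.

    Returns the new stage, or the current stage if accuracy falls short.
    Raises ValueError if the target is not exactly one stage ahead.
    """
    if target != current + 1:
        raise ValueError("stages must be earned one at a time")
    if observed_accuracy < required_accuracy:
        return current
    return TrustStage(target)

stage = TrustStage.OBSERVE_ONLY
stage = advance(stage, TrustStage.STARTING_POINT, observed_accuracy=0.95)
print(stage.name)
```

Attempting to go from OBSERVE_ONLY straight to AUTONOMOUS_LOW_COMPLEXITY raises an error in this sketch, which mirrors the advice above.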


*Zak Hassan is a Staff SRE specializing in AI-powered operations and large-scale distributed systems reliability. Find him at zakhassan.com or on LinkedIn.*
