I've spent years turning infrastructure experiments into technical notes and blog posts: carefully structured markdown that explains how to diagnose a degraded service, restart a stuck job, or identify which upstream dependency quietly died at 2am. And for a long time, I was proud of that format.
Then I built an AI agent in my lab that changed how I think about that format in about three weeks.
This isn't a hype piece. I want to tell you exactly what I prototyped in a lab, what surprised me, what failed, and what I'd do differently. Because there's a lot of AI-in-SRE content out there right now that reads like a press release, and you deserve something more honest than that.
The Problem This Prototype Was Solving
The lab incident response workflow had a familiar shape: alert fires, the simulated on-call flow opens the relevant blog post, reads the steps, starts manually pulling logs from CloudWatch, cross-references with metrics in Grafana, checks the deployment history, and after 20-40 minutes arrives at a diagnosis that should have taken 5.
The blog post was never the bottleneck. The bottleneck was data gathering and correlation. A good on-call engineer wasn't slow because they didn't know what to do — they were slow because gathering the context to make a decision took time. Fetch the logs. Parse the timestamps. Find the correlated metric spike. Check what deployed when.
That's exactly the kind of tedious, multi-step, data-gathering work that LLMs with tool use are genuinely good at.
The Architecture
I prototyped a Claude-powered incident response agent using the Anthropic SDK with a tool-use pattern. The core architecture was straightforward:
Alert (CloudWatch / PagerDuty)
↓
Lambda Trigger
↓
Incident Agent (Claude + Tools)
├── Tool: fetch_cloudwatch_logs(service, time_window)
├── Tool: get_recent_deployments(service, hours=24)
├── Tool: query_metrics(service, metric_name, start, end)
├── Tool: get_dependency_health(upstream_services)
└── Tool: search_past_incidents(error_pattern)
↓
Structured Output: Root Cause Summary + Recommended Actions
↓
Posted to Slack incident channelThe agent receives the alert payload, then autonomously decides which tools to call and in what order. It doesn't just run a fixed sequence — it adapts based on what it finds. If the first log query shows a database timeout error, it goes deeper into database metrics. If it looks like a deployment issue, it pulls the git diff and compares it against the error timeline.
The system prompt matters enormously. I spent more time on the prompt than on the code. It defines the agent's reasoning framework, tells it what to look for, how to structure its output, and crucially — what to do when it's uncertain. Teams have explicit instructions telling the agent to say "I don't know" and escalate rather than hallucinate a confident but wrong diagnosis.
What Surprised Me
It was better at pattern matching than humans under pressure. At 3am, a tired on-call engineer misses things. The agent doesn't. It will consistently check all the things it's been told to check, in the right order, without skipping steps because it wants to go back to sleep.
It remembered things humans forget. I added a tool that searches past incident summaries stored in S3. The agent started surfacing "this looks like the incident from November where the Redis connection pool was exhausted during batch job execution" — context that no human would have at their fingertips at 2am.
The 20% of cases where it was wrong were really wrong. This is the part the blog posts don't tell you. When Claude got it right, it was impressively right. When it got it wrong, it could be confidently wrong in a way that sent an engineer down the wrong path for 15 minutes. I addressed this by requiring the agent to express uncertainty explicitly and by flagging low-confidence diagnoses for immediate human review.
Token costs were non-trivial at scale. Pulling 10,000 lines of logs into context for every alert adds up. I introduced pre-filtering — a smaller, cheaper model first culls the logs to the 500 most relevant lines before the main agent runs. This cut costs by about 70% with minimal accuracy loss.
The Results
After running this through several weeks of reliability testing in a cloud lab:
- Average MTTR dropped from 34 minutes to 11 minutes for incidents where the agent reached a confident diagnosis
- On-call engineers reported spending less time on "data gathering" and more time on "actual decision making"
- Roughly 60% of P2 incidents were fully diagnosed by the agent before any human looked at them
- The other 40% got a partially completed investigation that the human then finished — which was still faster than starting from scratch
The prototype did not eliminate on-call. It made the investigation loop significantly less miserable.
What I'd Do Differently
Start with a narrower scope. I initially tried to build a general-purpose incident agent. That was too broad. The prototype got much better results when the scope narrowed to a single service category first, tuned the prompts and tools for that domain, then expanded. General agents are hard. Domain-specific agents are achievable.
Build evals before you build the agent. I should have defined what "good" looks like — what a correct diagnosis looks like for a set of test incidents — before the prototype work started. Instead, I prototyped first and evaluated by vibes, which made it hard to know if prompt changes were actually improvements.
Log everything the agent does. Every tool call, every intermediate reasoning step, every final output. You will need this for debugging, for trust building with your platform team, and for demonstrating to leadership that the system is working. The lab version uses OpenTelemetry with custom spans for agent tool calls.
The Honest Bottom Line
AI incident response agents are real, they work, and if your operational learning still lives only in static blog posts while humans manually gather every bit of context during incidents, you're leaving significant time savings on the table.
But they're not magic. They require careful prompt engineering, good tooling design, robust evaluation, and a team willing to trust-but-verify their outputs. The teams that treat them as autonomous replacements for human judgment will get burned. The teams that treat them as highly capable, tireless first responders that hand off to humans for final decisions will see the results I saw in testing.
The blog posts are not obsolete. They are the seed material for better tools.
*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation. Find him at zakhassan.com or on LinkedIn.*
Topic Paths