In March 2026, AWS quietly dropped one of the most consequential launches for SRE teams in years: the general availability of the AWS DevOps Agent. If you missed it amid the usual re:Invent noise and the quarterly avalanche of AWS releases, you're not alone. But this one deserves your attention.

The DevOps Agent is AWS's answer to a question every SRE team has been wrestling with for the past 18 months: *should we build our own AI operations tooling, or wait for a vendor to solve it for us?*

Having built a lab version of this before AWS shipped theirs, I have some opinions.


What AWS DevOps Agent Actually Does

At its core, the DevOps Agent is an AI system that connects to your existing AWS tooling and autonomously investigates incidents. When a CloudWatch alarm fires, a PagerDuty alert triggers, or a ServiceNow ticket opens, the agent starts working without human prompting.

According to AWS, it:

  • Correlates signals across CloudWatch metrics, logs, and traces
  • Queries AWS Config and CloudTrail for recent infrastructure changes
  • Checks deployment history via CodeDeploy and ECS
  • Analyzes X-Ray distributed traces
  • Generates a root cause summary and recommended remediation steps
  • Posts findings to your notification channel of choice
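For comparison with a DIY build, the flow above can be sketched as a pipeline of independent evidence-gathering steps whose results are merged into one report. This is a minimal sketch and makes no claim about how AWS implements its agent. The step functions are hypothetical stubs standing in for real CloudWatch, CloudTrail, and deployment-history queries, and the root-cause step is a naive placeholder for what would be an LLM call in practice.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Alert:
    source: str   # e.g. "cloudwatch", "pagerduty", "servicenow"
    service: str
    summary: str


@dataclass
class Finding:
    step: str
    detail: str


@dataclass
class Report:
    alert: Alert
    findings: list[Finding] = field(default_factory=list)
    root_cause: str = "undetermined"


# Hypothetical evidence-gathering steps. Real versions would call
# observability and change-tracking APIs; these return canned strings.
def check_metrics(alert: Alert) -> str:
    return f"error rate elevated for {alert.service}"


def check_recent_changes(alert: Alert) -> str:
    return f"1 config change to {alert.service} in the last hour"


def check_deployments(alert: Alert) -> str:
    return f"deployment to {alert.service} 20 minutes before alert"


STEPS: list[tuple[str, Callable[[Alert], str]]] = [
    ("metrics", check_metrics),
    ("changes", check_recent_changes),
    ("deployments", check_deployments),
]


def investigate(alert: Alert) -> Report:
    report = Report(alert=alert)
    for name, step in STEPS:
        try:
            report.findings.append(Finding(name, step(alert)))
        except Exception as exc:  # one failing data source shouldn't abort the run
            report.findings.append(Finding(name, f"step failed: {exc}"))
    # Placeholder reasoning step: naively treat a recent deploy as the suspect.
    for f in report.findings:
        if f.step == "deployments" and "deployment" in f.detail:
            report.root_cause = f"suspected bad deploy: {f.detail}"
            break
    return report
```

Usage: `investigate(Alert("cloudwatch", "checkout", "5xx spike"))` returns a `Report` that the notification step can format and post.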

The numbers AWS is reporting are striking: up to 75% lower MTTR and 94% root cause accuracy during the preview. One customer (WGU) cut a two-hour resolution to 28 minutes, a 77% improvement.
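The 77% figure checks out: going from 120 minutes to 28 is a reduction of 92/120 ≈ 0.767. If you want to track the same metric for your own incidents, the arithmetic is one line (the function name is just illustrative):

```python
def mttr_reduction(before_minutes: float, after_minutes: float) -> float:
    """Fractional reduction in time to resolve: (before - after) / before."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes


# WGU example: two hours (120 min) down to 28 min.
print(f"{mttr_reduction(120, 28):.0%}")  # prints "77%"
```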

New in the GA release (not available in preview): support for Azure and on-premises environments rather than AWS only, custom agent skills that extend what the agent knows about your environment, and custom charts and reports in the investigation output.


The Case For Using AWS DevOps Agent

Zero infrastructure to manage. You're not running a Lambda, maintaining a prompt library, managing SDK versions, or handling the operational overhead of running your own AI service. AWS manages all of that. For teams with limited platform bandwidth, this matters.

Deep AWS service integration. The agent has native, pre-built connectors into CloudWatch, Config, CloudTrail, X-Ray, CodeDeploy, ECS, and more. Replicating that surface area in a DIY agent takes weeks of work, and every new AWS service you adopt means building another tool.
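To give a sense of what "building a tool" means in a DIY agent, here is a minimal sketch of just the CloudTrail piece. It assumes a boto3 CloudTrail client and uses its `lookup_events` call with the `ResourceName` lookup attribute. The function name `recent_changes` and the fields it keeps are my own choices. The client is passed in rather than created inside the function, so any object with a compatible `lookup_events` method works, which keeps the tool testable offline.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def recent_changes(cloudtrail_client: Any, resource_name: str,
                   lookback: timedelta = timedelta(hours=1),
                   now: Optional[datetime] = None) -> list[dict]:
    """Return recent CloudTrail events that touched `resource_name`.

    `cloudtrail_client` is expected to be boto3.client("cloudtrail").
    """
    end = now or datetime.now(timezone.utc)
    resp = cloudtrail_client.lookup_events(
        LookupAttributes=[{"AttributeKey": "ResourceName",
                           "AttributeValue": resource_name}],
        StartTime=end - lookback,
        EndTime=end,
        MaxResults=50,  # one page only; NextToken pagination omitted for brevity
    )
    # Keep only the fields an investigation summary typically needs.
    return [
        {"time": e["EventTime"], "event": e["EventName"],
         "user": e.get("Username", "unknown")}
        for e in resp.get("Events", [])
    ]
```

Multiply this by every service you care about (Config, CodeDeploy, ECS, X-Ray), plus pagination, retries, and permissions, and the "weeks of work" estimate looks realistic.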

Compliance and audit trail baked in. If you're in a regulated environment (financial services, healthcare, or anything else that needs an audit log of who or what took which action), AWS DevOps Agent's CloudTrail integration gives you that out of the box. With a DIY agent, you have to build it yourself.
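If you go the DIY route, here is one common starting point, sketched as an assumption rather than a compliance recipe. It writes an append-only JSON-lines log where each record carries the hash of the previous one, so edits to earlier entries are detectable. The record fields and function names are hypothetical; a real deployment would also need durable, access-controlled storage.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_audit_record(actor: str, action: str, target: str,
                      prev_hash: str = "") -> tuple[str, str]:
    """Build one JSON-lines audit record; return (line, sha256 of line).

    Each record embeds the previous record's hash, so altering or deleting
    an earlier line breaks the chain (tamper-evident, not tamper-proof).
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,      # e.g. "diy-ops-agent" vs. a human username
        "action": action,
        "target": target,
        "prev": prev_hash,
    }
    line = json.dumps(record, sort_keys=True)
    return line, hashlib.sha256(line.encode()).hexdigest()


def verify_chain(lines: list[str]) -> bool:
    """Check that every record's `prev` matches the hash of the line before it."""
    prev = ""
    for line in lines:
        if json.loads(line)["prev"] != prev:
            return False
        prev = hashlib.sha256(line.encode()).hexdigest()
    return True
```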

It's improving fast. AWS is iterating quickly, and the custom skills capability lets you teach the agent about your proprietary services and tooling, closing the biggest gap from the preview.


The Case For Building Your Own

You're not AWS-only. If you run on GCP, Azure, or OCI in addition to AWS (as most serious production environments do), the DevOps Agent's multi-cloud support is still nascent. Azure and on-prem support arrived at GA, but "added" is different from "deeply integrated." A custom agent can be built from day one to query GCP Cloud Monitoring, Azure Monitor, and AWS CloudWatch in a single investigation pass.
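A single multi-cloud investigation pass mostly comes down to fanning out queries concurrently and tolerating partial failure. This is a minimal sketch using the standard library's `concurrent.futures`. The per-cloud query functions are hypothetical stubs; real ones would wrap each provider's monitoring SDK.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


# Hypothetical per-cloud query stubs returning canned data.
def query_aws(service: str) -> dict:
    return {"cloud": "aws", "service": service, "p99_ms": 420}


def query_gcp(service: str) -> dict:
    return {"cloud": "gcp", "service": service, "p99_ms": 95}


def query_azure(service: str) -> dict:
    raise TimeoutError("Azure Monitor query timed out")


def gather_signals(service: str,
                   providers: dict[str, Callable[[str], dict]],
                   timeout_s: float = 10.0) -> tuple[list[dict], dict[str, str]]:
    """Query every provider concurrently; return (results, errors by provider).

    timeout_s bounds the whole pass: as_completed raises TimeoutError if
    any query is still running when it expires.
    """
    results, errors = [], {}
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        futures = {pool.submit(fn, service): name for name, fn in providers.items()}
        for fut in as_completed(futures, timeout=timeout_s):
            name = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:  # one cloud failing shouldn't sink the pass
                errors[name] = str(exc)
    # Sort so the output order doesn't depend on which call finished first.
    results.sort(key=lambda r: r["cloud"])
    return results, errors
```

Returning errors alongside results lets the reasoning step say "Azure data unavailable" instead of silently investigating with a blind spot.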

Your observability stack isn't AWS-native. If your primary observability platform is Datadog, Grafana Cloud, Honeycomb, or Observe by Snowflake (as it is for many mature engineering organizations), the DevOps Agent's value proposition weakens significantly: it's optimized for the AWS observability surface, not your third-party stack.

You need domain-specific intelligence. AWS DevOps Agent is general-purpose. It knows how AWS services behave, but it doesn't know that your particular service has a known memory leak that manifests every 72 hours under specific traffic conditions, or that a spike in your payment service latency is almost always caused by a downstream fraud detection API. A custom agent with access to your incident history and public blog archive can be dramatically smarter about your specific systems.
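One simple way to give a DIY agent that kind of memory is to retrieve the most similar past incidents and put them into the prompt context. Here is a minimal sketch under that assumption. Word-overlap scoring is a deliberately crude stand-in for embedding-based retrieval, and the `PastIncident` record, the same-service bonus, and the function names are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class PastIncident:
    service: str
    symptoms: str
    root_cause: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similar_incidents(service: str, symptoms: str,
                      history: list[PastIncident], top_k: int = 3) -> list[PastIncident]:
    """Rank past incidents by shared symptom words, favoring the same service."""
    query = _tokens(symptoms)
    scored = []
    for inc in history:
        score = len(query & _tokens(inc.symptoms))  # shared words
        if inc.service == service:
            score += 2  # same-service bonus (arbitrary weight)
        if score:
            scored.append((score, inc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [inc for _, inc in scored[:top_k]]


def build_context(service: str, symptoms: str, history: list[PastIncident]) -> str:
    """Format matching history as extra context for the agent's prompt."""
    lines = [f"Current incident on {service}: {symptoms}", "Similar past incidents:"]
    for inc in similar_incidents(service, symptoms, history):
        lines.append(f"- {inc.service}: {inc.symptoms} -> cause: {inc.root_cause}")
    return "\n".join(lines)
```

With a history entry like "payments: p99 latency spike on checkout calls, caused by a slow fraud-detection API", a new payments latency alert pulls that explanation straight into the agent's context.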

You want full control over the reasoning. When the AWS agent makes a wrong call (and it will), you can't crack it open to see why. With a DIY agent built on Claude or Bedrock, you control the system prompt, the tools, and the reasoning chain, and you can explicitly log every intermediate step. Debugging and improving a black box is painful.
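"Explicit logging of intermediate steps" can be as simple as wrapping every tool the agent can call. This is a minimal sketch using the standard `logging` module. The decorator name, the logger name, and the `get_error_rate` tool are hypothetical. Call `logging.basicConfig(level=logging.INFO)` (or attach your own handler) to see the output.

```python
import functools
import json
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("diy_agent.trace")


def traced_tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Log each tool call's arguments, result (or error), and duration as JSON."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        entry = {"tool": fn.__name__, "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            entry["error"] = repr(exc)
            logger.info(json.dumps(entry))
            raise
        entry["result"] = repr(result)[:500]  # truncate large payloads
        entry["ms"] = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps(entry))
        return result
    return wrapper


@traced_tool
def get_error_rate(service: str) -> float:  # hypothetical tool with a canned value
    return 0.042
```

When the agent reaches a wrong conclusion, these records show which tool returned what, in what order, before the model reasoned over it.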


My Honest Take

Use the AWS DevOps Agent if you're primarily AWS-native, don't have an existing AI operations investment, and want to move fast. It's genuinely impressive and will improve your MTTR without requiring a platform team to build and maintain tooling.

Build your own if you're multi-cloud, if your observability stack lives outside AWS, if you need domain-specific knowledge baked in, or if you have the platform engineering capacity to build something that will be a genuine competitive advantage in how you operate.

The two are not mutually exclusive, either. Several teams I know run the AWS DevOps Agent as a first pass, which handles roughly 60% of their incidents, and hand the hard cases it can't crack to a custom Claude-based agent with deeper domain knowledge.
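Here is a minimal sketch of that first-pass/escalation split. It assumes you can reduce each agent's output to a root cause plus a confidence score. I'm not claiming the AWS agent exposes such a score; you might derive one from your own review of its findings. `Diagnosis`, `route`, and the 0.7 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Diagnosis:
    source: str
    root_cause: Optional[str]
    confidence: float  # 0.0 to 1.0; how you derive this is up to you


def route(incident_id: str,
          first_pass: Callable[[str], Diagnosis],
          escalate: Callable[[str], Diagnosis],
          threshold: float = 0.7) -> Diagnosis:
    """Use the first-pass answer unless it is missing or below the threshold."""
    diag = first_pass(incident_id)
    if diag.root_cause and diag.confidence >= threshold:
        return diag
    return escalate(incident_id)
```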

Whatever you choose, the era of humans being the only thing standing between an alert and a diagnosis is over. Build your strategy around that reality.


Quick Reference: AWS DevOps Agent vs DIY

Factor | AWS DevOps Agent | DIY (Claude / Bedrock)
--- | --- | ---
Setup time | Hours | Weeks
AWS integration depth | Native | Build-it-yourself
Multi-cloud support | Limited | Full control
Custom knowledge | Via Skills (new) | Full control
Observability into reasoning | Limited | Full
Cost model | Per-investigation | Per-token
Compliance/audit trail | CloudTrail native | Build-it-yourself
3rd-party stack support | Limited | Full
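The cost-model row is easier to reason about with your own numbers plugged in. This is a minimal sketch for comparing the two models. Every price in it is a placeholder you supply; none are actual AWS or Anthropic rates. The `diy_fixed_usd` term is my own assumption, added because hosting and engineering time usually dominate DIY cost.

```python
def diy_cost_per_investigation(input_tokens: int, output_tokens: int,
                               usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token spend for one DIY investigation, given per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def monthly_costs(investigations: int, managed_usd_each: float, diy_usd_each: float,
                  diy_fixed_usd: float = 0.0) -> tuple[float, float]:
    """Return (managed, DIY) monthly cost; diy_fixed_usd covers hosting/upkeep."""
    managed = investigations * managed_usd_each
    diy = investigations * diy_usd_each + diy_fixed_usd
    return managed, diy


# Placeholder inputs only: 200k input + 10k output tokens per investigation.
per_run = diy_cost_per_investigation(200_000, 10_000, 3.0, 15.0)
print(per_run)                                   # 0.75
print(monthly_costs(400, 2.0, per_run, 500.0))   # (800.0, 800.0)
```

With these made-up inputs, the two models break even at 400 investigations a month. The point is the shape of the comparison, not the numbers.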

*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation. Find him at zakhassan.com or on LinkedIn.*


About the Author

Zak Hassan writes about reliability engineering under real scale constraints.

Staff-level SRE and platform engineer focused on identity reliability, Kubernetes, observability, cloud architecture, AI infrastructure, and reducing operational uncertainty.
