MCP for SREs: The Protocol Quietly Changing How We Automate Operations

If you've been following the AI tooling space closely, you've heard about the Model Context Protocol. If you're an SRE who mainly cares about keeping systems up, you may have filed it as "developer tooling" and moved on. That would be a mistake.

MCP is becoming the connective tissue of operational AI. Here's what it actually is, why it matters for infrastructure teams specifically, and what building with it looks like in practice.

What MCP Actually Is (In One Paragraph)

MCP is a standardized protocol — a specification, really — for how AI models connect to external tools and data sources. Before MCP, every "AI with tools" integration was bespoke: your agent could call your custom tools using whatever schema you built, but there was no standard way to package tools for reuse, no standard way to discover them, and no standard way for a tool to expose its capabilities to a model. MCP solves the packaging and discovery problem. An MCP server exposes a set of tools, resources, and prompts using a standard schema. Any MCP-compatible client — Claude, other models, IDEs, custom agents — can connect and use those tools without custom integration code.
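To make "standard schema" concrete: MCP messages are JSON-RPC 2.0, and tool discovery and invocation go through the `tools/list` and `tools/call` methods. Here's a rough sketch of that exchange as plain Python dicts. The `get_pod_status` tool, its schema, and the `build_call_request` helper are invented for illustration and don't come from any real server.

```python
import json

# Request a client sends to discover what a server offers.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Illustrative response: the tool name and schema are made up for this sketch.
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_pod_status",
                "description": "Return the status of a Kubernetes pod",
                "inputSchema": {
                    "type": "object",
                    "properties": {"pod": {"type": "string"}},
                    "required": ["pod"],
                },
            }
        ]
    },
}


def build_call_request(request_id, tool, arguments):
    """Build a tools/call request, checking required arguments first."""
    required = tool["inputSchema"].get("required", [])
    missing = [key for key in required if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    }


tool = list_response["result"]["tools"][0]
call_request = build_call_request(2, tool, {"pod": "checkout-7f9c"})
print(json.dumps(call_request, indent=2))
```

The point is that the client never needs tool-specific code: everything it needs to build a valid call comes back from the discovery response.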

There are now over 500 public MCP servers covering everything from GitHub to Postgres to Kubernetes. The ecosystem is growing fast.

Why SREs Should Pay Attention

The promise of MCP for operations teams is composability. Today, when you build an incident response agent, you hand-write tools for:

Querying CloudWatch
Fetching PagerDuty alert details
Listing recent deployments from CodeDeploy
Searching your public blog posts and lab notes
Querying your Iceberg data lake

Each of these is custom code you own, maintain, and debug. When the CloudWatch API changes, you update your tool. When you hire a new engineer and they want to build their own agent for a different use case, they rewrite those same tools.

MCP changes the economics here. Write a tool once, expose it through an MCP server, and any agent in your organization (incident response, cost optimization, capacity planning, security review) can use it. The Kubernetes MCP server your platform team built can be used by the Claude integration in your IDE, the incident agent you built, and the cost analysis agent your FinOps team is building.

The SRE MCP Server Pattern

Here's what a minimal SRE-purpose MCP server looks like. This one wraps CloudWatch Logs and exposes it to any MCP client:

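# sre_mcp_server.py
import asyncio
import json
import time
from datetime import datetime, timedelta, timezone

import boto3
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

server = Server("sre-observability")
cw_logs = boto3.client('logs')
cloudwatch = boto3.client('cloudwatch')

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_cloudwatch_logs",
            description="""Search CloudWatch Logs for a service within a time range.
            Use this to investigate errors, find log patterns, or correlate events.""",
            inputSchema={
                "type": "object",
                "properties": {
                    "log_group": {
                        "type": "string",
                        "description": "CloudWatch log group name (e.g., /aws/lambda/my-function)"
                    },
                    "query": {
                        "type": "string",
                        "description": "CloudWatch Logs Insights query string"
                    },
                    "start_time_minutes_ago": {
                        "type": "integer",
                        "description": "How far back to search, in minutes",
                        "default": 60
                    }
                },
                "required": ["log_group", "query"]
            }
        ),
        Tool(
            name="get_metric_statistics",
            description="Fetch CloudWatch metric statistics for a given service and metric",
            inputSchema={
                "type": "object",
                "properties": {
                    "namespace": {"type": "string"},
                    "metric_name": {"type": "string"},
                    "dimensions": {
                        "type": "object",
                        "description": "Key-value pairs for metric dimensions"
                    },
                    "period_minutes": {"type": "integer", "default": 5},
                    "lookback_hours": {"type": "integer", "default": 2}
                },
                "required": ["namespace", "metric_name"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "query_cloudwatch_logs":
        end = int(time.time())
        start = end - arguments.get('start_time_minutes_ago', 60) * 60

        # boto3 calls block, so run them in a worker thread to keep
        # the event loop free for other MCP messages.
        response = await asyncio.to_thread(
            cw_logs.start_query,
            logGroupName=arguments['log_group'],
            startTime=start,
            endTime=end,
            queryString=arguments['query'],
            limit=100,
        )
        query_id = response['queryId']

        # Poll for results, giving up after roughly 20 seconds
        for _ in range(20):
            result = await asyncio.to_thread(
                cw_logs.get_query_results, queryId=query_id
            )
            status = result['status']
            if status == 'Complete':
                return [TextContent(
                    type="text",
                    text=json.dumps(result['results'], indent=2)
                )]
            # Stop polling early if the query can no longer succeed
            if status in ('Failed', 'Cancelled', 'Timeout'):
                return [TextContent(type="text", text=f"Query ended with status: {status}")]
            await asyncio.sleep(1)

        return [TextContent(type="text", text="Query timed out")]

    elif name == "get_metric_statistics":
        end_time = datetime.now(timezone.utc)
        start_time = end_time - timedelta(hours=arguments.get('lookback_hours', 2))
        response = await asyncio.to_thread(
            cloudwatch.get_metric_statistics,
            Namespace=arguments['namespace'],
            MetricName=arguments['metric_name'],
            Dimensions=[
                {'Name': key, 'Value': value}
                for key, value in arguments.get('dimensions', {}).items()
            ],
            StartTime=start_time,
            EndTime=end_time,
            Period=arguments.get('period_minutes', 5) * 60,
            # Assumed choice of statistics; adjust to what your agents need
            Statistics=['Average', 'Maximum'],
        )
        datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
        # default=str serializes the datetime timestamps boto3 returns
        return [TextContent(
            type="text",
            text=json.dumps(datapoints, indent=2, default=str)
        )]

    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown tool: {name}")

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())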

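Because this server uses the stdio transport, each MCP client launches it as a subprocess rather than connecting to a long-running service. Any client configured with the launch command can use it: your local Claude Desktop for ad-hoc queries, your incident response agent, your capacity planning agent. The tool is written once and reused everywhere.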
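For a stdio server like this one, "connecting" usually means telling the client how to launch the script. Claude Desktop, for example, reads an `mcpServers` map from its JSON config file. Here's a minimal sketch that generates such an entry. The install path, the AWS profile name, and the `mcp_server_entry` helper are assumptions for illustration, so substitute your own.

```python
import json


def mcp_server_entry(command, args, env=None):
    """Build one entry for a client's mcpServers map."""
    entry = {"command": command, "args": list(args)}
    if env:
        entry["env"] = dict(env)
    return entry


config = {
    "mcpServers": {
        "sre-observability": mcp_server_entry(
            "python",
            ["/opt/sre/sre_mcp_server.py"],       # hypothetical install path
            env={"AWS_PROFILE": "sre-readonly"},  # hypothetical read-only profile
        )
    }
}

print(json.dumps(config, indent=2))
```

Pointing the server at a read-only AWS profile is a deliberate choice in this sketch: the agent can investigate, but it can't change anything through these tools.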

The Async Tasks Primitive (What's Coming)

The current MCP spec is synchronous: a client calls a tool and waits for a response. This works for most operational queries, but it creates problems for long-running operations: "run a database migration and tell me when it's done," "wait for this deployment to complete and check the error rate," "monitor this metric for the next hour and alert me if it spikes."

The MCP roadmap for 2026 includes an async Tasks primitive: a standardized way for an MCP server to return a task handle that the client can poll or use to receive callbacks. For SREs, this is significant. It means your MCP server can expose operations like:

start_deployment_and_monitor(service, version) → task_id
get_task_status(task_id) → {status, progress, result}

And an agent orchestrator can manage multiple long-running operations simultaneously, without blocking on each one. This is the pattern you need to build truly autonomous operational tooling — the kind that can run a canary deployment, watch the error rate for 20 minutes, and roll back automatically if thresholds are breached.
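Until a Tasks primitive is available in the spec you're targeting, you can approximate the pattern with two ordinary tools. The sketch below is not the MCP Tasks API. It's an in-memory stand-in using asyncio, with the two operation names from above, and the error-rate samples, threshold, and service names are invented so the example runs on its own.

```python
import asyncio
import uuid

# In-memory task table; a real server would need durable storage.
TASKS = {}


async def _monitor(task_id, error_rates, threshold):
    """Walk through error-rate samples and record the outcome on the task."""
    task = TASKS[task_id]
    for i, rate in enumerate(error_rates, start=1):
        await asyncio.sleep(0)  # stand-in for waiting on the next metric sample
        task["progress"] = i / len(error_rates)
        if rate > threshold:
            task.update(status="failed", result=f"rolled back: error rate {rate:.2%}")
            return
    task.update(status="completed", result="deployment healthy")


def start_deployment_and_monitor(service, version, error_rates, threshold=0.05):
    """Start monitoring in the background and return a task handle immediately."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "running", "progress": 0.0, "result": None,
                      "service": service, "version": version}
    # Keep a reference so the background task is not garbage-collected.
    TASKS[task_id]["_handle"] = asyncio.create_task(
        _monitor(task_id, error_rates, threshold))
    return task_id


def get_task_status(task_id):
    """Return only the fields a client should see."""
    task = TASKS[task_id]
    return {key: task[key] for key in ("status", "progress", "result")}


async def demo():
    # Two deployments monitored concurrently; neither blocks the other.
    ok = start_deployment_and_monitor("checkout", "v42", [0.01, 0.02, 0.01])
    bad = start_deployment_and_monitor("search", "v7", [0.01, 0.09, 0.01])
    while any(get_task_status(t)["status"] == "running" for t in (ok, bad)):
        await asyncio.sleep(0.01)
    return get_task_status(ok), get_task_status(bad)


ok_status, bad_status = asyncio.run(demo())
print(ok_status)
print(bad_status)
```

The start call returns before any monitoring happens, which is what lets an orchestrator juggle several operations at once and check back on each handle when it chooses.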

Building Your SRE MCP Library

If you're starting today, I'd prioritize these tools for your first MCP server:

Observability tools:

CloudWatch Logs query
CloudWatch Metrics fetch
X-Ray trace lookup
Datadog/Grafana query (if applicable)

Infrastructure tools:

EC2/ECS describe instance/task
Recent deployment history (CodeDeploy, ECS deploy events)
RDS instance status and metrics
EKS pod and node status

Knowledge tools:

Blog post search (embed your public blog posts and lab notes in a vector store, expose a semantic search tool)
Past incident search (query your incident history)
Service dependency map (which services does X depend on?)

Communication tools:

Slack message posting
PagerDuty alert management
Jira ticket creation

Each of these becomes a building block. Once they're in your MCP server, any agent — whether you built it or someone else did — can use them to reason about your infrastructure.
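As the list grows, a single if/elif chain in the call handler gets unwieldy. One way to keep each tool self-contained is a small registry that stores the schema next to the handler. This is a sketch of that idea in plain Python, not something the MCP SDK provides. The `tool` decorator, the `dispatch` function, and both example tools are hypothetical, and the handlers return canned data instead of calling Slack or an incident database.

```python
import asyncio

# Registry mapping tool name -> description, input schema, and handler.
TOOLS = {}


def tool(name, description, schema):
    """Register an async handler together with its MCP-style definition."""
    def register(handler):
        TOOLS[name] = {"description": description,
                       "inputSchema": schema,
                       "handler": handler}
        return handler
    return register


@tool("post_slack_message", "Post a message to a Slack channel",
      {"type": "object",
       "properties": {"channel": {"type": "string"}, "text": {"type": "string"}},
       "required": ["channel", "text"]})
async def post_slack_message(channel, text):
    # Placeholder body: a real handler would call the Slack API here.
    return f"(dry run) would post to {channel}: {text}"


@tool("search_past_incidents", "Search incident history by keyword",
      {"type": "object",
       "properties": {"query": {"type": "string"}},
       "required": ["query"]})
async def search_past_incidents(query):
    # Made-up sample data standing in for a real incident store.
    incidents = ["checkout latency spike", "search index corruption"]
    return [title for title in incidents if query.lower() in title]


def list_tool_definitions():
    """Definitions in the shape a list_tools handler would return."""
    return [{"name": name,
             "description": spec["description"],
             "inputSchema": spec["inputSchema"]}
            for name, spec in sorted(TOOLS.items())]


async def dispatch(name, arguments):
    """Look up a tool by name, check required arguments, and run it."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    spec = TOOLS[name]
    missing = [key for key in spec["inputSchema"].get("required", [])
               if key not in arguments]
    if missing:
        raise ValueError(f"{name} is missing arguments: {missing}")
    return await spec["handler"](**arguments)


result = asyncio.run(dispatch(
    "post_slack_message",
    {"channel": "#incidents", "text": "Rollback complete"}))
print(result)
```

Adding a tool then means adding one decorated function, and both the tool listing and the call handler pick it up without further edits.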

The Bigger Picture

MCP represents a shift in how operational AI gets built. The current paradigm is vertically integrated: you build an agent, you build its tools, you own the whole stack. The MCP paradigm is horizontal: tools are a shared layer, agents are built on top, and the same underlying tooling powers many different agents built by different teams for different purposes.

For SRE teams, this means the investment in building a high-quality operational MCP server has compounding returns. You build it once. Your on-call agent uses it. Your cost optimization agent uses it. The IDE integrations your developers use to query production metrics use it. The audit and compliance tooling uses it.

The teams that invest in this layer now will have a significant operational advantage as LLM-powered tooling matures. The tool layer is the moat.

*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation. Find him at zakhassan.com or on LinkedIn.*

