I Modeled a 6x Cloud Cost Reduction with an LLM Agent

Cloud cost optimization is one of those problems that's theoretically easy and practically miserable. Everyone knows the levers: right-size instances, delete unused resources, use Spot where possible, move cold data to cheaper storage tiers. The problem isn't knowing what to do — it's the relentless operational grind of actually doing it across hundreds of services, dozens of teams, and an AWS account that never stops changing.

I modeled that grind with an LLM agent in a cloud-lab environment. The simulated optimization path showed how spend can drop dramatically when idle capacity, stale services, and storage lifecycle gaps are found systematically. Here's the complete architecture, the honest results, and the things that almost went wrong.

The Cost Problem at Scale

Before getting to the solution, let me characterize the problem. When you are modeling a mature production-style environment with multiple services and teams, cloud cost optimization has these properties:

It's not one problem, it's fifty. EC2 rightsizing is different from RDS rightsizing, which is different from Kinesis shard optimization, which is different from S3 lifecycle policies, which is different from NAT gateway cross-AZ traffic costs. Each domain requires different data, different analysis, and different remediation paths.

The signal is buried. AWS Cost Explorer gives you data, but the signal-to-noise ratio is brutal. A 15% month-over-month cost increase in a specific service could be normal growth, wasteful behavior, or a billing anomaly. Figuring out which one requires pulling together usage metrics, deployment history, and team context.
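
As a minimal sketch of the very first triage step, the function below flags services whose month-over-month spend grew past a threshold. The function name, the input shape (service name mapped to monthly dollars), and the 15% default are assumptions for illustration. A flag here only means "investigate": it cannot tell normal growth from waste or a billing anomaly, which is the part that needs usage metrics, deployment history, and team context.

```python
def flag_cost_changes(previous, current, threshold=0.15):
    """Return services whose spend grew by more than `threshold` month over month.

    `previous` and `current` map a service name to its monthly cost in dollars.
    (Hypothetical input shape, not a Cost Explorer response.)
    """
    flagged = []
    for service, cost_now in current.items():
        cost_before = previous.get(service)
        if not cost_before:
            # New service or zero baseline: no ratio to compute, so flag for review.
            flagged.append({"service": service, "change": None, "cost": cost_now})
            continue
        change = (cost_now - cost_before) / cost_before
        if change > threshold:
            flagged.append({"service": service, "change": round(change, 3), "cost": cost_now})
    return flagged


if __name__ == "__main__":
    last_month = {"EC2": 1000.0, "RDS": 500.0}
    this_month = {"EC2": 1200.0, "RDS": 510.0, "Kinesis": 50.0}
    for item in flag_cost_changes(last_month, this_month):
        print(item)
```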

Remediation needs human approval — but analysis shouldn't. You should not have an autonomous agent resize databases without a human in the loop. But you absolutely can have an agent do all the investigative work and produce a ready-to-execute recommendation. The human's job should be to approve, not to discover.

Teams are busy. The engineers closest to the code who could make optimization decisions are the least available to spend time on it. Recommendations without context get ignored. Recommendations with clear business impact get acted on.

The Architecture

┌─────────────────────────────────────────────────────────┐
│                   Trigger Layer                         │
│  Weekly scheduler + Cost anomaly detector               │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                 Orchestrator Agent                       │
│  Claude Sonnet + Tool Use                               │
│                                                         │
│  Tools:                                                 │
│  ├── get_cost_explorer_data(service, period, groupby)   │
│  ├── get_cloudwatch_metrics(namespace, metric, dims)    │
│  ├── list_ec2_instances(filters)                        │
│  ├── get_rds_instance_metrics(db_id, period)           │
│  ├── get_s3_bucket_analytics(bucket)                    │
│  ├── get_deployment_history(service, days)              │
│  ├── get_rightsizing_recommendations()                  │
│  └── get_compute_optimizer_findings()                   │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│              Analysis & Recommendation Layer            │
│                                                         │
│  Per-finding structured output:                         │
│  - Resource identifier                                  │
│  - Current cost ($)                                     │
│  - Projected savings ($)                                │
│  - Confidence level (high/medium/low)                   │
│  - Effort level (low/medium/high)                       │
│  - Remediation steps (executable)                       │
│  - Risk flags                                           │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                  Delivery Layer                         │
│                                                         │
│  ├── Team Slack digest (weekly)                         │
│  ├── Jira tickets (auto-created, assigned to team)      │
│  ├── S3 report archive (90-day retention)               │
│  └── CloudWatch dashboard (trend tracking)              │
└─────────────────────────────────────────────────────────┘
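
The per-finding structured output in the diagram maps naturally onto a small typed record. Here is one way to sketch it in Python, assuming a dataclass with basic validation. The field names and the sanity checks are my own choices for illustration, not a fixed schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List

LEVELS = ("low", "medium", "high")


@dataclass
class Finding:
    """One recommendation, mirroring the fields listed in the diagram."""
    resource_id: str                  # ARN or other identifier
    current_monthly_cost: float       # dollars
    projected_monthly_savings: float  # dollars
    confidence: str                   # low / medium / high
    effort: str                       # low / medium / high
    remediation_steps: List[str]
    risk_flags: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.confidence not in LEVELS:
            raise ValueError(f"confidence must be one of {LEVELS}")
        if self.effort not in LEVELS:
            raise ValueError(f"effort must be one of {LEVELS}")
        # Assumed sanity check: a resource cannot save more than it costs.
        if not 0 <= self.projected_monthly_savings <= self.current_monthly_cost:
            raise ValueError("projected savings must be between 0 and current cost")


if __name__ == "__main__":
    finding = Finding(
        resource_id="i-0123456789abcdef0",  # placeholder instance ID
        current_monthly_cost=400.0,
        projected_monthly_savings=250.0,
        confidence="high",
        effort="low",
        remediation_steps=["Confirm no scheduled jobs", "Downsize one instance size"],
    )
    print(asdict(finding))
```

Validating at construction time means a malformed agent response fails loudly before it reaches Slack or Jira.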

The Agent Prompt Strategy

The system prompt is where the real engineering is. Here's the structure I use in lab examples:

You are a cloud cost optimization analyst for a production AWS environment.
Your job is to investigate cost data and produce specific, actionable 
recommendations that engineers can act on.

INVESTIGATION APPROACH:
1. Start with the highest-cost services (by absolute spend, last 30 days)
2. For each service in the top 10: pull utilization metrics alongside cost data
3. Compare current resource configurations against actual utilization
4. Check deployment history for cost inflection points
5. Cross-reference with AWS Compute Optimizer and Cost Explorer recommendations

FOR EACH FINDING, you must provide:
- The specific resource (ARN or identifier)
- Current monthly cost
- Projected monthly savings
- Your confidence (high/medium/low) with reasoning
- The exact remediation steps (CLI commands where possible)
- Risk flags: anything that could go wrong

DO NOT:
- Recommend changes to production databases without noting the maintenance window requirement
- Recommend Spot instances for stateful workloads
- Recommend rightsizing that would bring utilization above 70% at P99

WHEN YOU'RE UNCERTAIN: Say so. A medium-confidence $50K recommendation is more 
valuable than a confident-sounding wrong one.

The explicit DO NOT section evolved from incidents. Early versions of the agent recommended Spot for workloads that absolutely should not run on Spot, and once recommended a database downsize that would have created risk at peak traffic. Adding hard constraints to the prompt eliminated these categories of error.
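
Prompt constraints can also be backed by a check in code that runs on the agent's output. The sketch below is an addition of mine, not something the setup above depends on. The recommendation keys (`action`, `stateful`, `p99_utilization_pct`, and so on) are hypothetical, and the projection assumes load stays constant and utilization scales linearly with capacity, which is a simplification.

```python
def projected_p99_utilization(current_p99_pct, current_capacity, new_capacity):
    """Scale observed P99 utilization to a different size, assuming unchanged load."""
    return current_p99_pct * current_capacity / new_capacity


def constraint_violations(rec, max_p99_pct=70.0):
    """Return the names of the DO NOT rules a recommendation dict breaks."""
    violations = []

    # Rule: no Spot for stateful workloads.
    if rec["action"] == "move_to_spot" and rec.get("stateful", False):
        violations.append("spot_for_stateful_workload")

    # Rule: rightsizing must not push P99 utilization above the limit.
    if rec["action"] == "rightsize":
        projected = projected_p99_utilization(
            rec["p99_utilization_pct"], rec["current_vcpus"], rec["target_vcpus"]
        )
        if projected > max_p99_pct:
            violations.append("p99_above_limit")

    # Rule: production database changes must mention the maintenance window.
    if rec.get("resource_type") == "production_database" and not rec.get(
        "maintenance_window_noted", False
    ):
        violations.append("missing_maintenance_window_note")

    return violations


if __name__ == "__main__":
    rec = {"action": "rightsize", "p99_utilization_pct": 40.0,
           "current_vcpus": 8, "target_vcpus": 4}
    print(constraint_violations(rec))  # halving capacity projects 80% at P99
```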

The Results, Honestly

After extended production-like lab testing:

What the agent found (that humans had missed):

14 EC2 instances running at under 5% CPU utilization, all candidates for downsizing or termination. Three of them were Jenkins workers from a deprecated pipeline — no one knew they were still running.
A Kinesis stream provisioned at 20 shards "for safety" during a load test 8 months prior. Average throughput: 0.3 shards. Cost: ~$4,200/month in unnecessary capacity.
6 RDS read replicas across three environments. Two were in dev environments and read zero rows per day.
Missing S3 lifecycle policies on buckets that were created before I had a policy standard. 40TB of log data had never been moved to Glacier.
Cross-AZ data transfer costs in a service that was doing unnecessary cross-zone calls in a hot path. $8K/month, fixed with one routing change.

The 6x number: This is the ratio of the modeled infrastructure spend at the start to the final spend after all the agent-identified optimizations were implemented. No single fix produced it; many medium-sized wins compounded.
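
For readers who think in percentages rather than multiples, the conversion is simple arithmetic. In this sketch the dollar figures are placeholders chosen to give a 6x ratio, not numbers from the lab.

```python
def reduction_factor(spend_before, spend_after):
    """How many times smaller the final spend is than the starting spend."""
    return spend_before / spend_after


def percent_saved(spend_before, spend_after):
    """Share of the starting spend that was eliminated, as a percentage."""
    return 100.0 * (spend_before - spend_after) / spend_before


if __name__ == "__main__":
    before, after = 600_000, 100_000  # placeholder monthly spend, in dollars
    print(f"{reduction_factor(before, after):.1f}x reduction")
    print(f"{percent_saved(before, after):.1f}% saved")
```

A 6x reduction means the final spend is one sixth of the starting spend, so about 83% of the original bill is gone.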

What the agent got wrong:

Twice it recommended downsizing instances whose bursty traffic patterns weren't visible in the P95 metrics the lab fed it. I added P99 and max metrics to the tool output after this.
Once it flagged a "zombie" EC2 instance that was actually running a monthly scheduled job, so CPU showed 0% for 29 of 30 days. I added a check for scheduled jobs before flagging instances as candidates for termination.
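
Both mistakes come down to which statistic you look at. The sketch below, with hypothetical function names and labels, classifies an instance from its CPU samples using the average, P99, and max together, plus a caller-supplied flag for known scheduled jobs. It assumes one sample per day and a nearest-rank percentile; real CloudWatch data would be finer-grained.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_instance(cpu_samples, has_scheduled_job, idle_threshold=5.0):
    """Label an instance from CPU utilization samples (percent)."""
    stats = {
        "avg": sum(cpu_samples) / len(cpu_samples),
        "p99": percentile(cpu_samples, 99),
        "max": max(cpu_samples),
    }
    if has_scheduled_job:
        label = "keep_scheduled_job"      # never auto-flag known periodic work
    elif stats["max"] < idle_threshold:
        label = "termination_candidate"   # idle even at its peak
    elif stats["avg"] < idle_threshold:
        label = "review_bursty"           # low average, but real spikes
    else:
        label = "active"
    return {"label": label, **stats}


if __name__ == "__main__":
    monthly_job = [0.0] * 29 + [85.0]     # idle 29 days, busy on one
    print(percentile(monthly_job, 95))    # the spike is invisible at P95
    print(classify_instance(monthly_job, has_scheduled_job=False))
```

With 30 daily samples, the single busy day sits above the P95 rank, which is how a monthly job can look like a zombie. The max (or P99) catches it even when nobody has recorded the scheduled job.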

The Savings Attribution Problem

One thing worth addressing: proving causality in cloud cost reduction is genuinely hard. Spend goes down for many reasons — traffic drops, a service gets rewritten, a team changes their testing behavior. I handle this in the lab by:

Having the agent tag each recommendation with a ticket ID at creation time
Tracking cost in Cost Explorer by resource tag before and after the change
Running a monthly retrospective that compares predicted savings vs actual savings

This attribution rigor matters for two reasons: it helps you improve the agent (findings that consistently don't save what they predict indicate a model issue), and it lets you demonstrate business value to leadership with confidence.
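
The monthly retrospective can start as a very small comparison. This sketch assumes you have already pulled before and after cost per ticket (for example, from tag-filtered Cost Explorer data). The record keys and ticket IDs are made up for illustration.

```python
def savings_retrospective(records):
    """Compare predicted and actual monthly savings for each ticket.

    Each record is a dict with: ticket_id, predicted_savings,
    cost_before, cost_after (all costs in dollars per month).
    """
    report = []
    for rec in records:
        actual = rec["cost_before"] - rec["cost_after"]
        predicted = rec["predicted_savings"]
        report.append({
            "ticket_id": rec["ticket_id"],
            "predicted": predicted,
            "actual": actual,
            # A ratio well below 1.0 means the agent over-promised on this finding.
            "realized_ratio": actual / predicted if predicted else None,
        })
    return report


if __name__ == "__main__":
    records = [
        {"ticket_id": "COST-101", "predicted_savings": 1000.0,
         "cost_before": 5000.0, "cost_after": 4200.0},
        {"ticket_id": "COST-102", "predicted_savings": 400.0,
         "cost_before": 900.0, "cost_after": 450.0},
    ]
    for row in savings_retrospective(records):
        print(row)
```

This only measures the tagged resource's cost delta. It does not control for traffic changes or rewrites, so treat the ratio as a signal to look closer rather than proof of causality.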

Starting Your Own

The core loop is simpler than the architecture diagram suggests:

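def cost_optimization_run():
    """One pass of the weekly loop.

    The helper functions called here stand in for the data tools and
    delivery integrations; they are placeholders, not library calls.
    """
    # 1. Pull top cost services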
    top_services = get_cost_explorer_data(
        period='last_30_days',
        group_by='SERVICE',
        limit=15
    )
    
    # 2. For each service, get utilization context
    enriched = []
    for service in top_services:
        metrics = get_cloudwatch_metrics(service)
        resources = list_resources(service)
        enriched.append({
            'service': service,
            'cost': service['cost'],
            'metrics': metrics,
            'resources': resources
        })
    
    # 3. Agent analyzes and produces recommendations
    recommendations = claude_agent.analyze(
        system_prompt=COST_OPTIMIZER_PROMPT,
        data=enriched,
        tools=COST_TOOLS
    )
    
    # 4. Ticket and notify only for high-confidence findings above $1,000 in projected savings
    for rec in recommendations:
        if rec['confidence'] == 'high' and rec['savings'] > 1000:
            create_jira_ticket(rec)
            notify_team_slack(rec)

The hard part isn't the code — it's the tooling (getting the right data into the agent's context), the prompt (getting the agent to reason correctly about utilization), and the organizational trust-building (getting teams to act on the recommendations).
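
To make the tooling part concrete, here is a sketch of what a top-services tool could look like on top of the Cost Explorer `get_cost_and_usage` API. The function name, its parameters, and the returned shape are my assumptions. The client is passed in so the function can be exercised with a stub, and pagination (`NextPageToken`) is ignored for brevity.

```python
from datetime import date, timedelta


def get_top_services(ce_client, days=30, limit=15, today=None):
    """Return the `limit` most expensive services over the last `days` days.

    `ce_client` is expected to behave like boto3.client("ce").
    """
    today = today or date.today()
    start = today - timedelta(days=days)
    response = ce_client.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": today.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    # A 30-day window usually spans two calendar months, so sum across periods.
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            name = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[name] = totals.get(name, 0.0) + amount

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [{"service": name, "cost": round(cost, 2)} for name, cost in ranked[:limit]]
```

In a real run you would call it as `get_top_services(boto3.client("ce"))` with read-only Cost Explorer permissions. Returning a compact list of name and cost pairs, rather than the raw API response, keeps the agent's context small.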

Start with read-only analysis before you build any remediation automation. The goal for month one is producing accurate, trustworthy reports. Automation comes later, after the team believes the recommendations are sound.

*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation and cloud cost optimization. Find him at zakhassan.com or on LinkedIn.*
