*By Zak Hassan — Staff SRE | May 2026*
Cloud costs are reliability's shadow metric. A team that over-provisions for reliability headroom wastes money; a team that under-provisions to save money creates reliability risk. The SRE who understands cost engineering can make reliability investments intelligently — buying the right amount of resilience for the actual risk — instead of either hoarding resources out of fear or cutting corners that will matter at 3am.
This is the operational side of cloud cost engineering: attribution, rightsizing, purchasing strategies, and the tools that make cost management a continuous discipline rather than a quarterly panic.
The Attribution Problem
The fundamental challenge in cloud cost management is attribution — knowing which team, service, or product is responsible for which cost. Without attribution, cost conversations are impossible: you can see the total bill going up, but you can't hold anyone accountable or make intelligent decisions about where to cut.
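As a rough illustration of what attribution buys you, the sketch below rolls billing line items up by an owner tag and reports how much spend has no owner at all. The line-item shape (`cost`, `tags`) and the `UNATTRIBUTED` bucket name are assumptions made for this example, not a real billing export format.

```python
from collections import defaultdict

UNATTRIBUTED = "UNATTRIBUTED"  # hypothetical bucket name for untagged spend


def attribute_costs(line_items, tag_key="team"):
    """Sum cost per value of tag_key; untagged items go to UNATTRIBUTED."""
    totals = defaultdict(float)
    for item in line_items:
        tags = item.get("tags") or {}
        owner = tags.get(tag_key) or UNATTRIBUTED
        totals[owner] += item["cost"]
    return dict(totals)


def unattributed_share(totals):
    """Fraction of total spend that nobody owns (0.0 when there is no spend)."""
    total = sum(totals.values())
    return totals.get(UNATTRIBUTED, 0.0) / total if total else 0.0
```

Tracking the unattributed share over time is a simple way to tell whether tag enforcement is actually working.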
Tagging strategy: every resource in AWS/GCP/Azure should carry a minimum set of tags:
import json

# Enforced tag set via AWS Config Rule or GCP Organization Policy
REQUIRED_TAGS = {
    "team": "The engineering team responsible (e.g., 'backend', 'data')",
    "service": "The application or service (e.g., 'checkout-service', 'ml-pipeline')",
    "environment": "production | staging | development",
    "cost_center": "Finance cost center code for chargeback",
}

# AWS Config managed rule to flag resources missing any required tag
aws_config_rule = {
    "ConfigRuleName": "required-tags",
    "Source": {
        "Owner": "AWS",
        "SourceIdentifier": "REQUIRED_TAGS",
    },
    # InputParameters must be a JSON string, not a dict
    "InputParameters": json.dumps({
        "tag1Key": "team",
        "tag2Key": "service",
        "tag3Key": "environment",
        "tag4Key": "cost_center",
    }),
}

Kubernetes workload costs require a different approach, because tagging cloud instances doesn't tell you which Kubernetes service consumed which resources on a shared node:
# OpenCost / Kubecost: namespace-level cost attribution
import requests

# Query the OpenCost API for cost by namespace.
# The in-cluster service address below is deployment-specific.
def get_namespace_costs(window: str = "7d") -> dict:
    response = requests.get(
        "http://opencost.monitoring:9003/allocation/compute",
        params={
            "window": window,
            "aggregate": "namespace",
            "accumulate": "true",  # collapse the window into a single allocation set
        },
        timeout=10,
    )
    response.raise_for_status()
    allocations = response.json()["data"][0]
    costs_by_namespace = {}
    for namespace, data in allocations.items():
        costs_by_namespace[namespace] = {
            "total_cost": data["totalCost"],
            "cpu_cost": data["cpuCost"],
            "memory_cost": data["ramCost"],
            "storage_cost": data["pvCost"],
            "network_cost": data["networkCost"],
            "efficiency": data["totalEfficiency"],  # ratio of used to requested resources
        }
    # Most expensive namespaces first
    return dict(sorted(costs_by_namespace.items(), key=lambda x: x[1]["total_cost"], reverse=True))

Rightsizing: The Biggest Lever
Rightsizing — matching instance sizes to actual workload — is consistently the highest-ROI cloud cost reduction activity. Most production environments are significantly over-provisioned because developers over-specify resource requirements to avoid OOM kills, and the over-specification is never revisited.
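One common way to turn observed usage into a request size is to take a high percentile of usage and add headroom. The sketch below assumes that approach; the 95th percentile and 20% headroom defaults are illustrative choices, not values from any particular tool, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(usage_samples, pct=95, headroom=0.20):
    """Suggested request: the pct-th percentile of usage plus headroom."""
    return percentile(usage_samples, pct) * (1 + headroom)


def overprovision_ratio(current_request, usage_samples, pct=95, headroom=0.20):
    """How many times larger the current request is than the suggestion."""
    return current_request / recommend_request(usage_samples, pct, headroom)
```

A ratio well above 1 points at a request that was set defensively and never revisited; a ratio below 1 suggests the workload is at risk of throttling or OOM kills.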
AWS Compute Optimizer integration:
import boto3

def get_rightsizing_recommendations() -> list[dict]:
    optimizer = boto3.client('compute-optimizer', region_name='us-west-2')
    # Get EC2 instance recommendations (first page only; follow nextToken for large fleets)
    ec2_recs = optimizer.get_ec2_instance_recommendations(
        filters=[{
            "name": "Finding",
            "values": ["Overprovisioned"]  # Only get downsizing opportunities
        }]
    )
    recommendations = []
    for rec in ec2_recs['instanceRecommendations']:
        options = rec.get('recommendationOptions') or []
        if not options:
            continue
        # Options carry a rank; rank 1 is the top recommendation
        best = min(options, key=lambda o: o.get('rank', 1))
        # Savings are nested under savingsOpportunity in the API response
        savings = (best.get('savingsOpportunity') or {}).get('estimatedMonthlySavings', {})
        recommendations.append({
            "instance_id": rec['instanceArn'].split('/')[-1],
            "current_type": rec['currentInstanceType'],
            "recommended_type": best['instanceType'],
            "finding": rec['finding'],
            "estimated_monthly_savings": savings.get('value', 0),
            # Peak CPU over the lookback period (the maximum statistic, not a p99)
            "utilization_max_cpu": next(
                (m['value'] for m in rec['utilizationMetrics']
                 if m['name'].upper() == 'CPU' and m['statistic'].upper() == 'MAXIMUM'),
                None),
        })
    return sorted(recommendations, key=lambda x: x['estimated_monthly_savings'], reverse=True)

Kubernetes resource rightsizing (from VPA recommendations, covered in the Kubernetes post):
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in a pod
k8s_client = client.CustomObjectsApi()

# get_current_requests, parse_cpu, and parse_memory_gb are helpers assumed to be
# defined elsewhere; they are not shown in this post.
def generate_rightsizing_report(namespace: str) -> list[dict]:
    """Generate a rightsizing report comparing current requests vs VPA recommendations."""
    vpas = k8s_client.list_namespaced_custom_object(
        group="autoscaling.k8s.io",
        version="v1",
        namespace=namespace,
        plural="verticalpodautoscalers"
    )
    report = []
    for vpa in vpas['items']:
        deployment = vpa['spec']['targetRef']['name']
        # Skip VPAs that have not produced a recommendation yet
        if 'recommendation' not in vpa.get('status', {}):
            continue
        for container in vpa['status']['recommendation']['containerRecommendations']:
            current = get_current_requests(namespace, deployment, container['containerName'])
            recommended = container['target']
            # Calculate cost delta (approximate); positive means the request can shrink
            cpu_delta_cores = (parse_cpu(current.get('cpu', '0')) -
                               parse_cpu(recommended.get('cpu', '0')))
            mem_delta_gb = (parse_memory_gb(current.get('memory', '0')) -
                            parse_memory_gb(recommended.get('memory', '0')))
            # AWS us-east-1 approximate rates
            monthly_savings = (cpu_delta_cores * 30 * 24 * 0.048 +  # $0.048/vCPU-hour
                               mem_delta_gb * 30 * 24 * 0.006)      # $0.006/GB-hour
            if monthly_savings > 10:  # Only flag significant savings
                report.append({
                    "deployment": deployment,
                    "container": container['containerName'],
                    "current_cpu": current.get('cpu'),
                    "recommended_cpu": recommended.get('cpu'),
                    "current_memory": current.get('memory'),
                    "recommended_memory": recommended.get('memory'),
                    "estimated_monthly_savings": monthly_savings
                })
    return sorted(report, key=lambda x: x['estimated_monthly_savings'], reverse=True)

Spot and Preemptible Instances: The 70% Discount
Spot instances (AWS) and preemptible instances (GCP) offer the same compute as on-demand instances at a 60-90% discount. The trade-off is that the provider can reclaim them on short notice: 2 minutes on AWS, about 30 seconds on GCP. Most workloads can be engineered to tolerate spot interruption.
Workloads appropriate for spot:
- Batch processing, ML training, data pipelines
- Kubernetes worker nodes running stateless services (with correct PodDisruptionBudgets)
- CI/CD runners
- Development and staging environments
Workloads NOT appropriate for spot:
- Primary database instances
- Single-replica stateful services
- Anything requiring guaranteed availability during business hours
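The two lists above can be encoded as a simple pre-flight check before a workload is moved to spot capacity. This is a minimal sketch: the `Workload` fields and function names are hypothetical, and a real check would pull these facts from your service catalog or cluster state.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    replicas: int
    stateful: bool
    is_primary_database: bool = False
    needs_guaranteed_availability: bool = False


def spot_blockers(workload: Workload) -> list[str]:
    """Return the reasons a workload should stay on on-demand capacity."""
    blockers = []
    if workload.is_primary_database:
        blockers.append("primary database instance")
    if workload.stateful and workload.replicas <= 1:
        blockers.append("single-replica stateful service")
    if workload.needs_guaranteed_availability:
        blockers.append("requires guaranteed availability")
    return blockers


def is_spot_candidate(workload: Workload) -> bool:
    """A workload is a spot candidate when nothing blocks it."""
    return not spot_blockers(workload)
```

Returning the list of blockers rather than a bare boolean makes the result easier to act on: a single-replica stateful service may become a candidate once it is replicated.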
# Kubernetes: prefer spot nodes but fall back to on-demand
# (selector, pod labels, and containers omitted for brevity)
# The taint and label keys below are examples; use the keys your node provisioner sets.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  template:
    spec:
      # Tolerate spot node interruption taints
      tolerations:
        - key: "node.kubernetes.io/spot"
          operator: "Exists"
          effect: "NoSchedule"
      # Prefer spot, fall back to on-demand
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: "node.kubernetes.io/capacity-type"
                    operator: In
                    values: ["spot", "preemptible"]
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "node.kubernetes.io/capacity-type"
                    operator: In
                    values: ["spot", "preemptible", "on-demand"]  # Fall back to on-demand

Handling spot interruption gracefully:
# Spot interruption handler, run on every node as a DaemonSet
import requests
import subprocess
import time

IMDS = "http://169.254.169.254/latest"

def watch_for_interruption():
    """
    AWS provides a 2-minute warning before spot interruption via instance metadata.
    Use it to drain the node gracefully.
    """
    while True:
        try:
            # IMDSv2 requires a session token on every metadata request
            token = requests.put(
                f"{IMDS}/api/token",
                headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
                timeout=1,
            ).text
            # instance-action returns 404 until an interruption is scheduled
            response = requests.get(
                f"{IMDS}/meta-data/spot/instance-action",
                headers={"X-aws-ec2-metadata-token": token},
                timeout=1,
            )
            if response.status_code == 200:
                # Interruption is coming: drain this node
                node_name = get_current_node_name()  # helper assumed to be defined elsewhere
                subprocess.run([
                    "kubectl", "drain", node_name,
                    "--ignore-daemonsets",
                    "--delete-emptydir-data",
                    "--grace-period=90",  # 90 seconds to finish current work
                    "--timeout=100s"
                ])
                break  # Node will be terminated, no need to continue
        except requests.exceptions.RequestException:
            pass  # Metadata service unreachable on this poll; try again next cycle
        time.sleep(5)  # Poll every 5 seconds

Reserved Capacity and Savings Plans
For predictable baseline workloads, committed-use discounts offer 40-60% savings over on-demand pricing. The decision framework:
Compute Savings Plans (AWS): commit to a dollar/hour spend on compute. The flexibility to change instance types, regions, and operating systems makes this lower risk than Reserved Instances.
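from dataclasses import dataclass
from datetime import datetime, timedelta

import boto3

@dataclass
class SavingsPlanRecommendation:
    hourly_commitment: float
    annual_savings: float
    commitment_years: int
    confidence: str

def analyze_savings_plan_opportunity(
    lookback_days: int = 90,
    commitment_years: int = 1
) -> SavingsPlanRecommendation:
    """
    Analyze historical compute spend to recommend Savings Plan commitment.
    The filter below selects all on-demand spend; add a SERVICE dimension
    to restrict it to compute services.
    """
    ce = boto3.client('ce')  # Cost Explorer's boto3 service name is 'ce'
    # Get historical on-demand spend.
    # Confirm the exact PURCHASE_TYPE string for your account with
    # ce.get_dimension_values(); it may be reported as 'On Demand Instances'.
    historical_spend = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.today() - timedelta(days=lookback_days)).strftime('%Y-%m-%d'),
            'End': datetime.today().strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Filter={'Dimensions': {'Key': 'PURCHASE_TYPE', 'Values': ['On Demand']}},
        Metrics=['UnblendedCost']
    )
    daily_costs = [float(day['Total']['UnblendedCost']['Amount'])
                   for day in historical_spend['ResultsByTime']]
    if not daily_costs:
        raise ValueError("no on-demand spend returned for the lookback window")
    # Conservative: use p10 of daily on-demand spend (baseline, not peak)
    baseline_daily = sorted(daily_costs)[len(daily_costs) // 10]
    baseline_hourly_on_demand = baseline_daily / 24
    # Approximate discounts: ~40% for 1-year, ~60% for 3-year
    discount = 0.40 if commitment_years == 1 else 0.60
    # The commitment is billed at the discounted rate, so covering the
    # on-demand baseline takes a commitment of baseline * (1 - discount)
    hourly_commitment = baseline_hourly_on_demand * (1 - discount)
    annual_savings = baseline_hourly_on_demand * 8760 * discount
    return SavingsPlanRecommendation(
        hourly_commitment=hourly_commitment,
        annual_savings=annual_savings,
        commitment_years=commitment_years,
        confidence="conservative"  # p10 baseline minimizes risk of underutilization
    )

The commitment level matters: committing to baseline utilization (the level the system reaches or exceeds about 90% of the time) is low risk. Committing to peak utilization wastes money during off-peak periods, when you are paying for committed capacity you are not using.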
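The baseline-versus-peak trade-off is easy to check with arithmetic. The sketch below is a simplified model, not a billing calculator: it assumes a single flat discount, measures usage in on-demand dollars per hour, and bills any usage above the covered level at the on-demand rate. The function name and parameters are hypothetical.

```python
def total_cost_with_commitment(hourly_on_demand_usage, covered_on_demand, discount):
    """
    Total spend over the period for a commitment that covers
    `covered_on_demand` dollars/hour of on-demand-equivalent usage.

    The commitment is paid every hour at the discounted rate, used or not.
    Usage above the covered level is billed at the on-demand rate.
    """
    committed_rate = covered_on_demand * (1 - discount)
    total = 0.0
    for usage in hourly_on_demand_usage:
        overflow = max(0.0, usage - covered_on_demand)
        total += committed_rate + overflow
    return total


# Illustrative profile: nine quiet hours at $10/hour and one peak hour at $30/hour
usage = [10.0] * 9 + [30.0]
no_commitment = total_cost_with_commitment(usage, 0.0, 0.40)
baseline_commitment = total_cost_with_commitment(usage, 10.0, 0.40)
peak_commitment = total_cost_with_commitment(usage, 30.0, 0.40)
print(no_commitment, baseline_commitment, peak_commitment)
```

In this toy profile the baseline commitment comes in well under pure on-demand, while the peak commitment costs more than having no commitment at all, because the discount does not make up for nine hours of mostly idle committed capacity.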
Cost Anomaly Detection
Unexpected cost spikes — a forgotten load test left running, a misconfigured autoscaler, a data pipeline gone infinite loop — can add thousands of dollars to the monthly bill before anyone notices. Automated anomaly detection catches these before they compound.
# Cost anomaly detection using AWS Cost Anomaly Detection
import boto3

def setup_cost_anomaly_alerts():
    ce = boto3.client('ce')  # Cost Explorer's boto3 service name is 'ce'
    # Create a monitor that tracks spending per AWS service
    monitor = ce.create_anomaly_monitor(
        AnomalyMonitor={
            'MonitorName': 'AllServicesMonitor',
            'MonitorType': 'DIMENSIONAL',
            'MonitorDimension': 'SERVICE'
        }
    )
    # Email subscribers require DAILY or WEEKLY frequency and SNS subscribers
    # require IMMEDIATE, so each subscriber type gets its own subscription.
    # 'Threshold' is deprecated in favor of 'ThresholdExpression' in newer API versions.
    ce.create_anomaly_subscription(
        AnomalySubscription={
            'MonitorArnList': [monitor['MonitorArn']],
            'Subscribers': [
                {'Address': 'sre-team@example.com', 'Type': 'EMAIL'}
            ],
            'Threshold': 500,  # Alert when total impact > $500
            'Frequency': 'DAILY',
            'SubscriptionName': 'SRECostAlertsEmail'
        }
    )
    ce.create_anomaly_subscription(
        AnomalySubscription={
            'MonitorArnList': [monitor['MonitorArn']],
            'Subscribers': [
                {'Address': 'arn:aws:sns:us-east-1:123456789012:cost-alerts', 'Type': 'SNS'}
            ],
            'Threshold': 500,  # Alert when total impact > $500
            'Frequency': 'IMMEDIATE',
            'SubscriptionName': 'SRECostAlertsSNS'
        }
    )

# For GCP: use Cloud Billing budget alerts
# For Azure: use Azure Cost Management budgets and alerts

Cost anomaly detection doesn't replace tagging; it's the safety net for costs that slip through. A well-tagged environment with per-service cost dashboards catches most surprises through regular review, and anomaly detection catches the ones that reviewers miss.
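For per-service cost series you already export yourself, a rolling z-score gives a rough version of the same safety net. This is a deliberately simple stand-in and not the algorithm the managed cloud services use; the window size, z-score threshold, and minimum dollar impact are assumptions to tune against your own data.

```python
from statistics import mean, stdev


def find_cost_anomalies(daily_costs, window=7, z_threshold=3.0, min_impact=100.0):
    """
    Flag days whose cost is far above the trailing window's average.

    Returns a list of (day_index, dollar_impact) tuples. A day is flagged
    only if it exceeds the trailing mean by at least min_impact dollars
    and by at least z_threshold standard deviations.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        impact = daily_costs[i] - baseline
        if impact < min_impact:
            continue  # ignore small increases and all decreases
        # A perfectly flat history has zero spread; treat any large jump as anomalous
        z_score = impact / spread if spread > 0 else float("inf")
        if z_score >= z_threshold:
            anomalies.append((i, impact))
    return anomalies
```

The minimum-impact floor matters as much as the z-score: without it, a very stable low-cost service would page someone over a few dollars of drift.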
*Zak Hassan is a Staff SRE specializing in FinOps, cloud infrastructure cost optimization, and reliability engineering. Find him at zakhassan.com or on LinkedIn.*