AI-Driven Capacity Planning: Moving from Reactive Scaling to Predictive Infrastructure

Capacity planning has always been part science, part art, and part educated guessing. Traditional approaches — observe historical traffic, apply a growth factor, add a safety buffer, provision that — work reasonably well for traffic that behaves predictably. They fail when growth is non-linear, when business events create sharp spikes, or when the relationship between traffic and resource consumption changes because the workload changed.

Machine learning improves on this in three specific ways: it can incorporate signals that humans miss, it can model non-linear relationships, and it can learn from its own prediction errors. Here's what AI-driven capacity planning looks like in practice.

The Limits of Threshold-Based Autoscaling

Standard autoscaling works reactively: when CPU or memory exceeds a threshold, add capacity; when it drops below another threshold, remove it. The fundamental problem is that reactive scaling always lags reality. By the time a metric exceeds the scale-out threshold, some requests are already experiencing degraded service.

The lag compounds with the provisioning time of the resource being scaled. An EC2 Auto Scaling group typically adds capacity in 2-5 minutes. An EKS node group typically takes 3-8 minutes. Aurora Serverless v2 scales faster but still has a ramp period. During that window, your current capacity is handling more traffic than it was sized for.
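To get a feel for what that window costs, it can help to estimate how far traffic outruns capacity before the new instances are serving. The sketch below is illustrative only: the function name, the linear-growth assumption, and all of the numbers are hypothetical rather than measurements from a real fleet.

```python
def overload_during_lag(
    current_rps: float,
    growth_rps_per_min: float,
    provisioning_lag_min: float,
    capacity_rps: float,
) -> float:
    """Return how far peak RPS exceeds capacity by the end of the lag window.

    Assumes traffic grows linearly for the whole provisioning window,
    which is a simplification.
    """
    peak_rps = current_rps + growth_rps_per_min * provisioning_lag_min
    return max(0.0, peak_rps - capacity_rps)


# Hypothetical fleet: sized for 1,000 RPS, currently serving 900 RPS,
# with traffic climbing by 50 RPS every minute.
for lag_min in (2, 5, 8):
    excess = overload_during_lag(900, 50, lag_min, 1000)
    print(f"{lag_min} min lag: {excess:.0f} RPS over capacity when new instances arrive")
```

Under these assumptions, the same traffic ramp that a 2-minute lag absorbs leaves an 8-minute lag well over capacity, which is why the provisioning time of the slowest resource tends to set the headroom you need.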

For workloads with predictable traffic patterns, the right answer is predictive scaling: scale up before the traffic arrives, not after. For workloads with unpredictable spikes, the right answer is a combination of predictive scaling (for the predictable component) and fast reactive scaling (for the unpredictable component).
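One simple way to combine the two signals is to let whichever asks for more capacity win, within the fleet's limits. The function below is a minimal sketch of that idea with hypothetical names; it is not a description of any particular autoscaler's internals.

```python
def desired_capacity(
    predicted_instances: int,
    reactive_instances: int,
    min_instances: int,
    max_instances: int,
) -> int:
    """Take the larger of the two signals, clamped to the fleet's limits."""
    wanted = max(predicted_instances, reactive_instances)
    return max(min_instances, min(wanted, max_instances))


# Forecast covers the morning ramp, so the predictive signal wins.
print(desired_capacity(predicted_instances=12, reactive_instances=8,
                       min_instances=4, max_instances=20))

# An unforecast spike pushes the reactive signal above the forecast.
print(desired_capacity(predicted_instances=6, reactive_instances=15,
                       min_instances=4, max_instances=20))
```

Taking the maximum means a bad forecast can never pull capacity below what the reactive policy wants, so the predictive component can only add headroom.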

What Predictive Scaling Actually Requires

AWS Predictive Scaling uses ML to forecast traffic based on historical patterns. It works well for workloads with recurring patterns such as business-hours peaks, weekly cycles, and monthly reporting runs. It needs at least 24 hours of CloudWatch metric history before it produces a forecast, and it draws on up to 14 days of history, so expect forecasts to become more reliable once two weeks of data are available.

The configuration:

import boto3

autoscaling = boto3.client('autoscaling')

autoscaling.put_scaling_policy(
    AutoScalingGroupName='my-api-fleet',
    PolicyName='predictive-scaling-policy',
    PolicyType='PredictiveScaling',
    PredictiveScalingConfiguration={
        'MetricSpecifications': [
            {
                'TargetValue': 70.0,  # Target CPU utilization
                'PredefinedMetricPairSpecification': {
                    'PredefinedMetricType': 'ASGCPUUtilization'
                }
            }
        ],
        'Mode': 'ForecastAndScale',  # Both forecast AND scale proactively
        'SchedulingBufferTime': 300,  # Scale up 5 minutes before forecast
        'MaxCapacityBreachBehavior': 'IncreaseMaxCapacity',
        'MaxCapacityBuffer': 10  # Allow 10% above configured max during peaks
    }
)

The SchedulingBufferTime parameter is critical — it tells AWS to provision capacity N seconds before the forecasted need arrives. For a service where instance startup takes 3 minutes, a 5-minute buffer ensures capacity is available and healthy before traffic hits.
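A buffer can be derived from measured startup times rather than guessed. The helper below is a hypothetical sketch: the 1.5x margin is an arbitrary illustrative choice, and the 3600-second cap reflects my understanding of the documented upper bound for SchedulingBufferTime, which is worth checking against the current API reference.

```python
def scheduling_buffer_seconds(
    startup_seconds: int,
    warmup_seconds: int = 0,
    margin: float = 1.5,
    max_buffer: int = 3600,  # assumed API upper bound; verify before relying on it
) -> int:
    """Pick a buffer that covers instance startup plus warm-up, with some margin."""
    needed = int((startup_seconds + warmup_seconds) * margin)
    return min(needed, max_buffer)


# A service that boots in 3 minutes and needs ~20 seconds to pass health checks.
print(scheduling_buffer_seconds(startup_seconds=180, warmup_seconds=20))
```

With a 3-minute boot and a short warm-up, this lands on the same 5-minute buffer used in the configuration above.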

Incorporating Business Signals

The limitation of ML-based scaling that only uses infrastructure metrics is that it can't account for business events: product launches, marketing campaigns, sales events, seasonal patterns specific to your business. A Black Friday sale that's scheduled but not yet in the historical data won't appear in the forecast.

The pattern: build a capacity planning API that accepts business calendar events and adjusts the baseline forecast:

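import math
from datetime import datetime, timedelta

# CapacityPoint is assumed to be a simple record type (for example a dataclass)
# with timestamp, baseline_rps, adjusted_rps, and event_names fields.

class CapacityPlanningEngine: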
    def __init__(self, baseline_model, event_registry):
        self.baseline_model = baseline_model
        self.event_registry = event_registry
    
    def get_capacity_forecast(
        self, 
        service: str, 
        start_time: datetime,
        end_time: datetime
    ) -> list[CapacityPoint]:
        
        # 1. Get baseline ML forecast
        baseline = self.baseline_model.predict(service, start_time, end_time)
        
        # 2. Check for business events in the window
        events = self.event_registry.get_events(start_time, end_time)
        
        # 3. Apply event multipliers to baseline
        adjusted = []
        for point in baseline:
            # Only events that target this service and cover this timestamp apply
            active = [
                e for e in events
                if e.affects_service(service) and e.covers_time(point.timestamp)
            ]
            multiplier = 1.0
            for event in active:
                # e.g., Black Friday event has multiplier=8.0 for checkout service
                multiplier *= event.traffic_multiplier

            adjusted.append(CapacityPoint(
                timestamp=point.timestamp,
                baseline_rps=point.predicted_rps,
                adjusted_rps=point.predicted_rps * multiplier,
                # Report the same events that produced the multiplier
                event_names=[e.name for e in active]
            ))
        
        return adjusted
    
    def get_recommended_capacity(self, service: str, time: datetime) -> int:
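        forecast = self.get_capacity_forecast(service, time, time + timedelta(hours=1))
        # default=0.0 avoids a ValueError if the model returns no points
        peak_rps = max((p.adjusted_rps for p in forecast), default=0.0)
        
        # Convert RPS to instance count with safety buffer
        # get_service_capacity is assumed to return the sustainable RPS per
        # instance, measured from load tests
        rps_per_instance = self.get_service_capacity(service)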
        instances_needed = math.ceil(peak_rps / rps_per_instance)
        safety_buffer = math.ceil(instances_needed * 0.2)  # 20% buffer
        
        return instances_needed + safety_buffer

The event registry is populated by teams across the company: engineering event registration (load tests, releases), marketing event registration (campaign launches), and business calendar integration (sales events, seasonality). The capacity planning engine combines these with the ML baseline forecast to produce a single capacity recommendation.
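The engine above only needs three things from the registry: get_events on the registry, plus affects_service and covers_time on each event. A minimal in-memory sketch of that contract might look like the following. The class names BusinessEvent and EventRegistry, the register method, and the example event are all hypothetical; a real registry would more likely sit behind a database or calendar integration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class BusinessEvent:
    name: str
    start: datetime
    end: datetime
    traffic_multiplier: float
    services: frozenset[str] = field(default_factory=frozenset)

    def affects_service(self, service: str) -> bool:
        return service in self.services

    def covers_time(self, timestamp: datetime) -> bool:
        # Half-open interval: the event ends just before `end`.
        return self.start <= timestamp < self.end


class EventRegistry:
    def __init__(self) -> None:
        self._events: list[BusinessEvent] = []

    def register(self, event: BusinessEvent) -> None:
        self._events.append(event)

    def get_events(self, start_time: datetime, end_time: datetime) -> list[BusinessEvent]:
        # An event is relevant if its interval overlaps the query window.
        return [e for e in self._events if e.start < end_time and e.end > start_time]


registry = EventRegistry()
registry.register(BusinessEvent(
    name="black-friday",
    start=datetime(2025, 11, 28),
    end=datetime(2025, 11, 29),
    traffic_multiplier=8.0,
    services=frozenset({"checkout"}),
))

window = registry.get_events(datetime(2025, 11, 28, 12), datetime(2025, 11, 28, 13))
print([e.name for e in window])
```

Scoping each event to an explicit set of services keeps a checkout-heavy sale from inflating the forecast for unrelated services.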

The Capacity Model Validation Loop

Predictive models are only useful if they're accurate, so validate them continuously. After each forecasting period, compare the forecast to what actually happened:

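from datetime import datetime

# get_stored_forecast and get_actual_traffic are assumed helpers that return
# time-aligned sequences of points, each with an rps attribute.
def validate_forecast_accuracy(service: str, period: tuple[datetime, datetime]) -> dict:
    forecast = get_stored_forecast(service, period)
    actual = get_actual_traffic(service, period)
    
    errors = []
    for predicted, observed in zip(forecast, actual):
        # max(..., 1) guards against division by zero during idle periods
        error_pct = abs(predicted.rps - observed.rps) / max(observed.rps, 1) * 100
        errors.append(error_pct)

    # Without this guard, an empty period fails below with ZeroDivisionError
    if not errors:
        raise ValueError(f"no forecast/actual pairs for {service} in {period}")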
    
    return {
        "mean_absolute_percentage_error": sum(errors) / len(errors),
        "p95_error": sorted(errors)[int(len(errors) * 0.95)],
        "max_error": max(errors),
        "underforecast_events": [  # Times we predicted less than actual (dangerous)
            p for p, o in zip(forecast, actual) 
            if p.rps < o.rps * 0.8  # Predicted < 80% of actual
        ]
    }

Alert when the mean absolute percentage error (MAPE) exceeds your target (typically 15-20% is acceptable for weekly forecasts). Underforecast events, where the prediction was significantly below reality, are the dangerous ones: they mean you provisioned less capacity than needed. Prioritize understanding and fixing the underforecast cases.

Retrain on prediction failures. When the model significantly underforecasts (>50% error), capture the business context (what events were happening?) and feed it back into the model as labeled training data. Over time, the model learns to account for the specific patterns relevant to your business.
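Capturing those failures can be as simple as filtering for severe shortfalls and attaching whatever business context is known for each timestamp. The sketch below assumes hypothetical names throughout (RetrainingExample, collect_retraining_examples, the events_at lookup) and leaves the actual retraining step to whatever model you use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RetrainingExample:
    timestamp: datetime
    predicted_rps: float
    observed_rps: float
    event_names: list[str]  # business context, e.g. from the event registry


def collect_retraining_examples(
    points: list[tuple[datetime, float, float]],  # (timestamp, predicted, observed)
    events_at: dict[datetime, list[str]],
    error_threshold_pct: float = 50.0,
) -> list[RetrainingExample]:
    """Keep points where the forecast fell short of actual by more than the threshold."""
    examples = []
    for timestamp, predicted, observed in points:
        if observed <= 0:
            continue  # no traffic, so nothing was underforecast
        shortfall_pct = (observed - predicted) / observed * 100
        if shortfall_pct > error_threshold_pct:
            examples.append(RetrainingExample(
                timestamp, predicted, observed, events_at.get(timestamp, [])
            ))
    return examples


t1, t2, t3 = datetime(2025, 3, 1, 9), datetime(2025, 3, 1, 10), datetime(2025, 3, 1, 11)
examples = collect_retraining_examples(
    points=[(t1, 100, 110), (t2, 100, 300), (t3, 500, 100)],
    events_at={t2: ["flash-sale"]},
)
print([(e.timestamp, e.event_names) for e in examples])
```

Only the second point qualifies here: the first is a small miss and the third is an overforecast, which wastes money but does not threaten availability.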

Capacity Planning for AI Workloads

AI workloads have capacity planning characteristics that don't fit the standard web service model, and they're increasingly a significant fraction of infrastructure cost.

LLM inference capacity planning: Token throughput (tokens/second) is the relevant metric, not requests/second. A fleet sized for 1,000 requests/second may only support 200 requests/second if the requests have 5x the average context length. Capacity planning must account for the distribution of context lengths in your workload.
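The arithmetic behind that example can be sketched in a few lines. Everything here is an assumption for illustration: the function names, the token counts, the per-instance throughput, and the simplification that fleet throughput depends only on average tokens per request (in practice batching and prompt versus completion tokens also matter).

```python
import math


def supported_rps(fleet_tokens_per_second: float, tokens_per_request: float) -> float:
    """Requests per second a fleet can serve at a given request size."""
    return fleet_tokens_per_second / tokens_per_request


def llm_instances_needed(
    requests_per_second: float,
    avg_tokens_per_request: float,
    tokens_per_second_per_instance: float,
    headroom: float = 0.2,
) -> int:
    """Size the fleet on token throughput instead of request count."""
    required_tokens_per_second = requests_per_second * avg_tokens_per_request
    base = required_tokens_per_second / tokens_per_second_per_instance
    return math.ceil(base * (1 + headroom))


# Hypothetical fleet sized for 1,000 RPS at 400 tokens per request.
fleet_tokens_per_second = 1000 * 400
print(supported_rps(fleet_tokens_per_second, 400))   # same request size
print(supported_rps(fleet_tokens_per_second, 2000))  # 5x longer requests

# Hypothetical sample of observed request sizes: mostly short, some long.
sampled_tokens = [400] * 8 + [2000] * 2
avg_tokens = sum(sampled_tokens) / len(sampled_tokens)
print(llm_instances_needed(500, avg_tokens, tokens_per_second_per_instance=20_000))
```

Feeding the function an average taken from a real sample of requests, rather than a nominal request size, is what ties the plan to the actual context-length distribution.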

Training job capacity: GPU training jobs have specific instance requirements (you can't substitute CPU instances), and the duration of training jobs is workload-dependent. Capacity planning for a training cluster means modeling your pipeline's job queue, the expected duration of each job type, and the desired queue waiting time.
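A first-pass estimate for a training cluster can come from average GPU-hours of demand, with utilization held below 100% so that jobs do not queue for long. This is a rough sketch with hypothetical job types and numbers; it ignores burstiness and job packing, so a proper queueing model or simulation would be needed to hit a specific wait-time target.

```python
import math
from dataclasses import dataclass


@dataclass
class JobType:
    name: str
    jobs_per_day: float
    hours_per_job: float
    gpus_per_job: int


def gpus_needed(job_types: list[JobType], target_utilization: float = 0.7) -> int:
    """Estimate cluster size from average GPU-hours of demand per day.

    Keeping utilization below 1.0 leaves slack so queue waits stay short;
    0.7 is an illustrative target, not a recommendation.
    """
    gpu_hours_per_day = sum(
        j.jobs_per_day * j.hours_per_job * j.gpus_per_job for j in job_types
    )
    average_busy_gpus = gpu_hours_per_day / 24
    return math.ceil(average_busy_gpus / target_utilization)


jobs = [
    JobType("fine-tune", jobs_per_day=6, hours_per_job=4, gpus_per_job=8),
    JobType("eval", jobs_per_day=24, hours_per_job=1, gpus_per_job=1),
]
print(gpus_needed(jobs))
```

Lowering target_utilization is the lever for shorter queue waits: the same workload needs more GPUs as the target drops.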

The cold start economics of GPU instances: GPU instances take longer to provision than CPU instances. If you're scaling GPU capacity reactively, the lag is longer. Minimum fleet sizes for GPU workloads should be higher (relative to average utilization) than for CPU workloads.

Making Capacity Planning a Shared Discipline

The organizational pattern that works: capacity planning as a shared ritual between engineering and the business, not a quarterly report that SRE produces in isolation.

Hold monthly capacity reviews where product teams share upcoming initiatives, marketing shares campaign calendars, and engineering translates these into infrastructure needs. The output is a 90-day forward-looking capacity plan that SRE uses for procurement and autoscaling configuration.

This ritual surfaces the information that ML models can't predict — the product launch that's coming next month, the partnership integration that will triple API traffic from a new enterprise customer. The model handles the predictable; the humans handle the novel.

*Zak Hassan is a Staff SRE specializing in AI-powered infrastructure automation, capacity engineering, and data platform reliability. Find him at zakhassan.com or on LinkedIn.*

