*By Zak Hassan — Staff SRE | May 2026*


Kubernetes resource management is one of the most consequential and least understood operational concerns in production clusters. Most engineering teams set requests and limits once during initial deployment and never revisit them. The result is a cluster where half the nodes are overcommitted and the other half sit idle, while applications get OOM-killed at traffic peaks and the infrastructure bill grows without explanation.

Getting resource management right requires understanding how Kubernetes scheduling works, how the autoscalers interact, and how to build a feedback loop that keeps resource configuration current as applications evolve.


Requests, Limits, and the Scheduling Contract

Kubernetes distinguishes between resource requests (what the scheduler uses to place pods) and resource limits (what the kubelet enforces at runtime). This distinction matters more than most teams realize.

Requests are a scheduling guarantee: when Kubernetes places a pod on a node, it reserves that amount of CPU and memory. A node with 4 CPU cores and 8GB RAM, running pods totaling 3.5 CPU requests and 7GB memory requests, will not accept a new pod requesting 1 CPU and 2GB — even if actual utilization is 1.5 CPU and 3GB.
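
The arithmetic in that example can be sketched in a few lines of Python. This is a simplified model, not the scheduler's actual code: the function name `fits_on_node` is hypothetical, and it compares against full node capacity, whereas a real node exposes somewhat less (its allocatable capacity) after system reservations.

```python
def fits_on_node(capacity, requested, pod_request):
    """Return True if a pod's requests fit in the node's unreserved capacity.

    All arguments are dicts like {"cpu": cores, "memory_gb": gigabytes}.
    Only requests are considered; actual utilization plays no role.
    """
    return all(
        requested[res] + pod_request[res] <= capacity[res]
        for res in capacity
    )


node_capacity = {"cpu": 4.0, "memory_gb": 8.0}
already_requested = {"cpu": 3.5, "memory_gb": 7.0}

# 3.5 + 1 > 4 CPU and 7 + 2 > 8 GB, so this pod does not fit
big_pod_fits = fits_on_node(node_capacity, already_requested,
                            {"cpu": 1.0, "memory_gb": 2.0})

# A smaller pod still fits in the remaining 0.5 CPU / 1 GB
small_pod_fits = fits_on_node(node_capacity, already_requested,
                              {"cpu": 0.25, "memory_gb": 0.5})

print(big_pod_fits, small_pod_fits)
```

Note that actual utilization never appears as an input: the decision is made on requests alone.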

Limits are runtime enforcement: a container that hits its CPU limit is throttled (its processes get fewer CPU cycles), and a container that exceeds its memory limit is OOM-killed and restarted. The asymmetry matters: a workload that occasionally hits its CPU limit gets slower; one that exceeds its memory limit dies.

yaml
# The resource block that actually matters
resources:
  requests:
    cpu: "500m"       # 0.5 CPU cores, used for scheduling
    memory: "512Mi"   # 512 MiB, used for scheduling
  limits:
    cpu: "2000m"      # 2 CPU cores, throttled if exceeded
    memory: "1Gi"     # 1 GiB, OOMKilled if exceeded

# The anti-pattern: no requests, no limits
# Result: pods scheduled arbitrarily, no resource accounting,
# nodes get overloaded, OOM killer makes random choices

Quality of Service classes are derived from requests/limits:

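  • Guaranteed: every container sets both CPU and memory limits, with requests equal to limits. Highest priority; last to be evicted.
  • Burstable: at least one container sets a request or limit, but the pod does not meet the Guaranteed criteria (for example, requests < limits). Medium priority.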
  • BestEffort: no requests or limits at all. First to be evicted when the node is under memory pressure.

Production workloads should be Guaranteed or Burstable. BestEffort pods are appropriate for batch jobs that can tolerate interruption — nothing else.
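
The classification rules above can be expressed as a short function. This is a minimal sketch rather than the kubelet's implementation: `qos_class` is a hypothetical helper, it ignores init containers, and it assumes resource values have already been normalized to plain numbers (cores and bytes) so they can be compared directly.

```python
def qos_class(containers):
    """Classify a pod's QoS class from its containers' resource settings.

    containers: list of dicts with optional "requests" and "limits" dicts,
    each mapping "cpu" / "memory" to numeric values (cores, bytes).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for res in ("cpu", "memory"):
            req, lim = requests.get(res), limits.get(res)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit on every resource, with the request
            # either omitted (it then defaults to the limit) or equal to it.
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pinned = [{"requests": {"cpu": 0.5, "memory": 512}, "limits": {"cpu": 0.5, "memory": 512}}]
bursty = [{"requests": {"cpu": 0.5, "memory": 512}, "limits": {"cpu": 2, "memory": 1024}}]
unset = [{}]

print(qos_class(pinned), qos_class(bursty), qos_class(unset))
```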


Setting Correct Requests: The Measurement Approach

The right way to set requests is to measure what the application actually uses, not to guess. Most teams set requests based on developer intuition and then never revisit them.

The measurement workflow:

bash
# Check current resource utilization for a deployment
kubectl top pods -l app=my-service --containers

# Top 20 containers cluster-wide by CPU (column 4 is CPU when --all-namespaces is set)
kubectl top pods --all-namespaces --containers --no-headers | sort -k4 -rh | head -20

# For historical data, query Prometheus and graph the ratios below over the past 7 days

promql
# Actual CPU usage vs requested (lower = over-provisioned)
sum(rate(container_cpu_usage_seconds_total{container!="", namespace="production"}[5m])) by (pod, container)
/
sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod, container)

# Memory: ratio of actual to requested
sum(container_memory_working_set_bytes{container!="", namespace="production"}) by (pod, container)
/
sum(kube_pod_container_resource_requests{resource="memory", namespace="production"}) by (pod, container)

A ratio below 0.3 (using less than 30% of requested resources) is a strong signal of over-provisioning. A sustained ratio above 0.8 suggests requests are too low: the pod is running with little headroom over what the scheduler reserved for it.

The practical request-setting rule: set CPU requests to the p50 (median) CPU usage, set memory requests to the p90 memory usage (memory doesn't compress — under-requesting memory causes OOM kills). Set CPU limits at 2-4x requests for bursty workloads; set memory limits at 1.2-1.5x requests.
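
As a rough illustration of that rule, the sketch below turns a list of usage samples into a resources block. The helper names, the nearest-rank percentile method, and the default multipliers (2x for CPU, 1.2x for memory, the low ends of the ranges above) are assumptions chosen for the example; the samples are assumed to be integer millicores and MiB exported from a metrics system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_resources(cpu_millicores, memory_mib,
                        cpu_limit_factor=2.0, memory_limit_factor=1.2):
    """Apply the rule: CPU request = p50, memory request = p90,
    limits = request multiplied by the given factors."""
    cpu_request = percentile(cpu_millicores, 50)
    memory_request = percentile(memory_mib, 90)
    return {
        "requests": {
            "cpu": f"{int(cpu_request)}m",
            "memory": f"{int(memory_request)}Mi",
        },
        "limits": {
            # Rounded to whole millicores / MiB
            "cpu": f"{round(cpu_request * cpu_limit_factor)}m",
            "memory": f"{round(memory_request * memory_limit_factor)}Mi",
        },
    }


cpu_samples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
memory_samples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
recommendation = recommend_resources(cpu_samples, memory_samples)
print(recommendation)
```

In practice the samples should cover at least one full traffic cycle (for example a week), otherwise the percentiles reflect only a quiet or a busy period.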


Vertical Pod Autoscaler (VPA)

Setting requests manually doesn't scale across hundreds of deployments. VPA automates it: it observes actual usage, recommends (or automatically applies) request adjustments, and keeps resource configuration current as applications change.

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"   # Start with "Off" — recommendations only, no auto-apply
    # Options: Off (recommend only), Initial (apply on pod creation), Auto (evict and recreate)
  resourcePolicy:
    containerPolicies:
      - containerName: my-service
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        controlledResources: ["cpu", "memory"]

bash
# Check VPA recommendations without applying them
kubectl describe vpa my-service-vpa

# Output includes:
# Lower Bound: minimum safe values
# Target: recommended values based on observed usage
# Upper Bound: maximum based on usage spikes
# Uncapped Target: recommendation without min/max constraints

VPA caveats: VPA in Auto mode evicts pods to apply new resource values — avoid Auto for single-replica deployments where eviction causes downtime. Use Initial (applies on pod creation only) as a middle ground. VPA also conflicts with HPA on CPU metrics — if you use HPA for scaling, disable VPA CPU recommendations and let VPA manage memory only.


Horizontal Pod Autoscaler (HPA)

HPA scales the number of pod replicas based on metrics. The most common configuration scales on CPU utilization, but production HPA usually needs custom metrics.

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    # Scale on CPU — simple but often wrong
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

    # Scale on requests per second — usually better for web services
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

    # Scale on queue depth for async workers
    - type: External
      external:
        metric:
          name: sqs_messages_visible
          selector:
            matchLabels:
              queue: my-worker-queue
        target:
          type: AverageValue
          averageValue: "100"  # Scale so each replica handles ~100 messages

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60     # Wait 60s before scaling up further
      policies:
        - type: Pods
          value: 4                        # Add at most 4 pods per scaling event
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300    # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10                       # Remove at most 10% of replicas per event
          periodSeconds: 60

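The behavior block is where most HPA configurations go wrong. By default, HPA applies a 300-second scale-down stabilization window but may then remove all excess replicas at once, and it applies no stabilization to scale-up. If the window is shortened or the removal rate is left uncapped, HPA can shed replicas during a brief traffic lull only to add them back when traffic returns, causing availability problems and excessive pod churn. A 5-minute scale-down window combined with a per-period removal cap is a reasonable starting point for most services.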
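
To make the scaling arithmetic concrete, here is a simplified sketch of how HPA derives a replica count from a single metric: desired = ceil(current × metric / target), skipped when the ratio is within a tolerance (0.1 by default). The function name and parameters are hypothetical. It mirrors the example manifest's replica bounds and its cap of 4 added pods per step, but it does not model stabilization windows, the percent-based scale-down policy, or multiple metrics.

```python
import math


def desired_replicas(current, metric_value, target_value,
                     min_replicas=3, max_replicas=50,
                     max_pods_added=4, tolerance=0.1):
    """Simplified single-metric HPA calculation."""
    ratio = metric_value / target_value
    # Within tolerance of the target: leave the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    # Scale-up policy: add at most `max_pods_added` pods in one step
    if desired > current:
        desired = min(desired, current + max_pods_added)
    # Always stay within the configured replica bounds
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(10, 90, 70))    # moderate overload
print(desired_replicas(10, 140, 70))   # heavy overload, capped by the policy
print(desired_replicas(10, 72, 70))    # within tolerance, no change
```

The second call shows why the scale-up policy matters: the raw formula asks for 20 replicas, but the cap limits the step to 14, so a sustained spike takes several periods to absorb.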


Cluster Autoscaler: Node-Level Scaling

When pod autoscalers add replicas faster than existing nodes can accommodate, Cluster Autoscaler (CA) provisions new nodes. When nodes are underutilized, CA removes them.

CA decisions depend on pod requests, not actual utilization — another reason accurate requests matter. CA won't scale down a node if doing so would require evicting a pod that has no other node that can satisfy its requests.
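
The scale-down condition can be sketched as a bin-packing check on requests. This is an illustrative approximation, not Cluster Autoscaler's algorithm: `can_drain_node` is a hypothetical helper using first-fit placement on integer millicores and MiB, and it ignores PodDisruptionBudgets, affinity rules, and CA's utilization threshold.

```python
def can_drain_node(candidate_pods, other_nodes_free):
    """Return True if every pod on the candidate node fits elsewhere.

    candidate_pods: list of {"cpu": millicores, "memory": MiB} requests.
    other_nodes_free: list of {"cpu": ..., "memory": ...} giving the
    unrequested capacity left on each remaining node.
    """
    free = [dict(node) for node in other_nodes_free]  # copy; inputs stay intact
    # Place the largest pods first so small pods do not block them
    ordered = sorted(candidate_pods,
                     key=lambda p: (p["cpu"], p["memory"]), reverse=True)
    for pod in ordered:
        for node in free:
            if pod["cpu"] <= node["cpu"] and pod["memory"] <= node["memory"]:
                node["cpu"] -= pod["cpu"]
                node["memory"] -= pod["memory"]
                break
        else:
            # No remaining node can satisfy this pod's requests
            return False
    return True


pods = [{"cpu": 500, "memory": 512}, {"cpu": 1000, "memory": 2048}]
roomy = [{"cpu": 1200, "memory": 2048}, {"cpu": 600, "memory": 1024}]
tight = [{"cpu": 1200, "memory": 1024}, {"cpu": 600, "memory": 1024}]

print(can_drain_node(pods, roomy), can_drain_node(pods, tight))
```

Inflated requests make the second case more common than it should be: a node that is nearly idle in practice can still be pinned in place by pods whose requests fit nowhere else.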

yaml
# Common CA configuration gotchas

# 1. PodDisruptionBudgets must be set correctly — CA respects them
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2         # Always keep 2 pods running
  selector:
    matchLabels:
      app: my-service

bash
# 2. Scale-down annotations on nodes you want to protect
kubectl annotate node my-critical-node \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# 3. CA logs for debugging why it's not scaling
kubectl logs -n kube-system -l app=cluster-autoscaler | grep -i "scale\|cannot\|unschedulable"

Node pool design for CA efficiency: use multiple node pools with different instance sizes, with small instances for low-resource pods and large instances for memory-intensive workloads. Which pool CA expands depends on the configured expander (for example least-waste, priority, or price where the cloud provider supports it), so choose one deliberately rather than assuming the cheapest pool wins. A pool of spot/preemptible instances for batch workloads can reduce cost substantially.


Capacity Planning: The Cluster-Level View

Node-level autoscaling handles traffic variance, but capacity planning addresses structural growth: the cluster the team needs 6 months from now.

python
# Capacity planning projection from historical data
import pandas as pd
from sklearn.linear_model import LinearRegression

def project_cluster_capacity(utilization_history_df, horizon_days=180):
    """
    Project resource needs based on historical utilization trends.
    
    utilization_history_df: DataFrame with columns [date, cpu_cores_used, memory_gb_used]
    """
    # Sort chronologically so y[-1] below really is the most recent observation
    df = utilization_history_df.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    df['day_index'] = (df['date'] - df['date'].min()).dt.days
    
    projections = {}
    for resource in ['cpu_cores_used', 'memory_gb_used']:
        X = df[['day_index']].values
        y = df[resource].values
        
        model = LinearRegression()
        model.fit(X, y)
        
        future_day = df['day_index'].max() + horizon_days
        projected_value = model.predict([[future_day]])[0]
        
        # Add 30% headroom for variance + safety margin
        projections[resource] = projected_value * 1.30
        
        print(f"{resource}: currently {y[-1]:.1f}, projected {projected_value:.1f} "
              f"(+{(projected_value/y[-1] - 1)*100:.0f}%), with headroom: {projections[resource]:.1f}")
    
    return projections

Capacity planning inputs beyond utilization trends:

  • Planned feature launches with estimated traffic impact
  • Marketing campaigns or seasonal traffic patterns
  • Data retention growth for stateful workloads
  • Regulatory requirements affecting instance types or regions

The output of capacity planning isn't a single number — it's a range with confidence levels, reviewed quarterly and updated when assumptions change.
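
One simple way to produce a range instead of a point estimate is to widen the linear projection by the scatter of the historical data around the trend line. The sketch below is an assumption-laden illustration: `project_range` is a hypothetical helper, and the band of two residual standard deviations is a rough heuristic, not a formal prediction interval (it does not widen with the forecast horizon).

```python
import numpy as np


def project_range(day_index, usage, horizon_days=180, band_sigmas=2.0):
    """Project usage `horizon_days` past the last sample, with a rough band."""
    days = np.asarray(day_index, dtype=float)
    used = np.asarray(usage, dtype=float)

    # Least-squares straight line; polyfit returns [slope, intercept]
    slope, intercept = np.polyfit(days, used, 1)

    # Spread of the history around the trend line
    residuals = used - (slope * days + intercept)
    spread = band_sigmas * residuals.std(ddof=2) if len(days) > 2 else 0.0

    expected = slope * (days.max() + horizon_days) + intercept
    return {
        "low": float(expected - spread),
        "expected": float(expected),
        "high": float(expected + spread),
    }


# Perfectly linear history: 10 cores growing by 2 per day
steady = project_range([0, 1, 2, 3], [10, 12, 14, 16], horizon_days=10)
print(steady)
```

Noisier history produces a wider band, which is a useful prompt for the quarterly review: a wide range usually means the trend assumption itself deserves scrutiny.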


Namespace Resource Quotas: Guardrails Against Runaway Usage

In multi-tenant clusters, namespace resource quotas prevent one team from consuming resources that others need:

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"           # Max 20 CPU cores requested across all pods
    requests.memory: 40Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    count/pods: "100"            # Max 100 pods in this namespace
    count/persistentvolumeclaims: "20"

---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-backend-defaults
  namespace: team-backend
spec:
  limits:
    - type: Container
      default:                   # Applied when no limits are specified
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:            # Applied when no requests are specified
        cpu: "100m"
        memory: "128Mi"
      max:                       # No single container can exceed these
        cpu: "8"
        memory: "16Gi"

LimitRange defaults are particularly important: they ensure pods without explicit resource configuration get reasonable defaults rather than BestEffort QoS, which would make them first candidates for eviction.
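
The defaulting behavior can be sketched for a single resource as follows. This is a simplified model of what the Kubernetes documentation describes, not the admission controller's code: `apply_limit_range` is a hypothetical helper, values are integer millicores, and the `max` check from the LimitRange is not modeled.

```python
def apply_limit_range(container, default_limit, default_request):
    """Fill in a missing request/limit for one resource.

    container: dict with optional "request" and "limit" keys.
    """
    request = container.get("request")
    limit = container.get("limit")
    if limit is None:
        # No limit given: the namespace default limit applies
        limit = default_limit
        if request is None:
            # Neither given: the default request applies as well
            request = default_request
    elif request is None:
        # Limit given without a request: the request matches the limit
        request = limit
    return {"request": request, "limit": limit}


# Defaults from the LimitRange above: limit 500m, request 100m
nothing_set = apply_limit_range({}, default_limit=500, default_request=100)
limit_only = apply_limit_range({"limit": 1000}, default_limit=500, default_request=100)
request_only = apply_limit_range({"request": 250}, default_limit=500, default_request=100)

print(nothing_set, limit_only, request_only)
```

The limit-only case is the one that tends to surprise people: the container ends up with its request equal to its limit rather than the namespace's default request.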


*Zak Hassan is a Staff SRE specializing in Kubernetes operations, capacity planning, and cloud infrastructure reliability. Find him at zakhassan.com or on LinkedIn.*
