*By Zak Hassan — Staff SRE | May 2026*
Kubernetes resource management is one of the most consequential and least well-understood operational concerns in production clusters. Most engineering teams set requests and limits once during initial deployment and never revisit them. The result is a cluster where half the nodes are overcommitted and the other half sit idle, while applications get OOM-killed at traffic peaks and the infrastructure bill grows without explanation.
Getting resource management right requires understanding how Kubernetes scheduling works, how the autoscalers interact, and how to build a feedback loop that keeps resource configuration current as applications evolve.
Requests, Limits, and the Scheduling Contract
Kubernetes distinguishes between resource requests (what the scheduler uses to place pods) and resource limits (what the kubelet enforces at runtime). This distinction matters more than most teams realize.
Requests are a scheduling guarantee: when Kubernetes places a pod on a node, it reserves that amount of CPU and memory. A node with 4 CPU cores and 8GB RAM, running pods totaling 3.5 CPU requests and 7GB memory requests, will not accept a new pod requesting 1 CPU and 2GB — even if actual utilization is 1.5 CPU and 3GB.
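The arithmetic in that example can be sketched in a few lines of Python. This is a simplified model of the scheduler's fit check, not its actual implementation: the `Resources` type and `fits` function are hypothetical names, 8GB is treated as 8192 MiB, and the difference between a node's capacity and its allocatable resources is ignored.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_millicores: int
    memory_mib: int


def fits(node: Resources, reserved: Resources, new_pod: Resources) -> bool:
    """Return True if the new pod's requests fit in the node's unreserved capacity.

    Only requests are considered; actual utilization plays no part.
    """
    free_cpu = node.cpu_millicores - reserved.cpu_millicores
    free_memory = node.memory_mib - reserved.memory_mib
    return new_pod.cpu_millicores <= free_cpu and new_pod.memory_mib <= free_memory


node = Resources(cpu_millicores=4000, memory_mib=8192)      # 4 cores, 8GB
reserved = Resources(cpu_millicores=3500, memory_mib=7168)  # sum of existing requests

print(fits(node, reserved, Resources(1000, 2048)))  # False: only 500m CPU and 1024 MiB are unreserved
print(fits(node, reserved, Resources(500, 1024)))   # True
```

Note that the node's real utilization (1.5 CPU and 3GB in the example) never appears in the check, which is exactly why inflated requests strand capacity.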
Limits are runtime enforcement: a container that hits its CPU limit is throttled (its processes get fewer CPU cycles), and a container that exceeds its memory limit is OOM-killed and then restarted according to the pod's restart policy. The asymmetry matters: a pod that occasionally exceeds its CPU limit gets slower; a pod that exceeds its memory limit dies.

    # The resource block that actually matters
    resources:
      requests:
        cpu: "500m"      # 0.5 CPU cores, used for scheduling
        memory: "512Mi"  # 512 MiB, used for scheduling
      limits:
        cpu: "2000m"     # 2 CPU cores, throttled if exceeded
        memory: "1Gi"    # 1 GiB, OOMKilled if exceeded

    # The anti-pattern: no requests, no limits
    # Result: pods scheduled arbitrarily, no resource accounting,
    # nodes get overloaded, OOM killer makes random choices

Quality of Service classes are derived from requests/limits:
- Guaranteed: requests == limits for all containers. Highest priority; last to be evicted.
- Burstable: requests < limits, or some containers have requests without limits. Medium priority.
- BestEffort: no requests or limits at all. First to be evicted when the node is under memory pressure.
Production workloads should be Guaranteed or Burstable. BestEffort pods are appropriate for batch jobs that can tolerate interruption — nothing else.
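The classification rules above can be expressed as a small function. This is a minimal sketch, not the kubelet's code: `qos_class` is a hypothetical helper, containers are plain dicts, and quantities are compared as strings, so `"500m"` and `"0.5"` would be treated as different values.

```python
def qos_class(containers):
    """Classify a pod from its containers' requests and limits.

    containers: list of dicts with optional "requests" and "limits" dicts
    keyed by "cpu" and "memory".
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"cpu": "500m", "memory": "512Mi"},
                  "limits": {"cpu": "2000m", "memory": "1Gi"}}]))
```

Running a check like this over manifests in CI is one way to catch BestEffort pods before they reach production.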
Setting Correct Requests: The Measurement Approach
The right way to set requests is to measure what the application actually uses, not to guess. Most teams set requests based on developer intuition and then never revisit them.
The measurement workflow:

    # Check current resource utilization for a deployment, per container
    kubectl top pods -l app=my-service --containers

    # Top 20 containers across all namespaces by CPU (column 4 in this output)
    kubectl top pods --all-namespaces --containers --no-headers | sort -k4 -rn | head -20

    # For historical data, query Prometheus
    # (for example, average CPU usage over the past 7 days, per pod)

    # Actual CPU usage vs requested (lower = over-provisioned)
    sum(rate(container_cpu_usage_seconds_total{container!="", namespace="production"}[5m])) by (pod, container)
    /
    sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod, container)

    # Memory: ratio of actual to requested
    sum(container_memory_working_set_bytes{container!="", namespace="production"}) by (pod, container)
    /
    sum(kube_pod_container_resource_requests{resource="memory", namespace="production"}) by (pod, container)

A ratio below 0.3 (using less than 30% of requested resources) is a strong signal of over-provisioning. A sustained ratio above 0.8 means requests are likely too low and the pod has little guaranteed headroom.
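Those two thresholds are easy to apply to exported query results. The sketch below assumes the ratio has already been averaged over a meaningful window (so "sustained" is handled upstream); `classify_ratio` and its labels are hypothetical names.

```python
def classify_ratio(usage, request, low=0.3, high=0.8):
    """Label a usage/request ratio using the 0.3 and 0.8 thresholds."""
    if request <= 0:
        return "no-request"  # nothing to compare against
    ratio = usage / request
    if ratio < low:
        return "over-provisioned"
    if ratio > high:
        return "under-provisioned"
    return "ok"


# CPU in cores: 0.1 used against a 0.5 request
print(classify_ratio(0.1, 0.5))    # over-provisioned
# Memory in MiB: 460 used against a 512 request
print(classify_ratio(460, 512))    # under-provisioned
```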
The practical request-setting rule: set CPU requests to the p50 (median) CPU usage and memory requests to the p90 memory usage. Memory is incompressible, so a pod that needs more than it requested cannot simply be slowed down; under node memory pressure it is evicted or OOM-killed. Set CPU limits at 2-4x requests for bursty workloads; set memory limits at 1.2-1.5x requests.
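As a rough illustration, the rule can be applied to a list of usage samples exported from your metrics system. Everything here is an assumption made for the sketch: the `percentile` and `recommend` helpers are hypothetical, the nearest-rank percentile is one of several common definitions, the sample values are made up, and the limit factors (2x CPU, 1.5x memory) are simply picked from the ranges given above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend(cpu_millicores, memory_mib, cpu_limit_factor=2.0, memory_limit_factor=1.5):
    """Derive requests and limits from usage samples: p50 CPU, p90 memory."""
    cpu_request = percentile(cpu_millicores, 50)
    memory_request = percentile(memory_mib, 90)
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{memory_request}Mi"},
        "limits": {
            "cpu": f"{math.ceil(cpu_request * cpu_limit_factor)}m",
            "memory": f"{math.ceil(memory_request * memory_limit_factor)}Mi",
        },
    }


# Made-up samples: one CPU spike (900m) and one memory spike (512 MiB)
cpu_samples = [120, 150, 200, 180, 900, 160, 170, 140, 155, 165]
memory_samples = [300, 310, 320, 305, 330, 340, 350, 360, 400, 512]
print(recommend(cpu_samples, memory_samples))
```

Using the median for CPU keeps a single spike from inflating the request, while the p90 for memory leans conservative because running short on memory is fatal rather than slow.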
Vertical Pod Autoscaler (VPA)
Setting requests manually doesn't scale across hundreds of deployments. VPA automates it: it observes actual usage, recommends (or automatically applies) request adjustments, and keeps resource configuration current as applications change.

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: my-service-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service
      updatePolicy:
        updateMode: "Off"  # Start with "Off": recommendations only, no auto-apply
        # Options: Off (recommend only), Initial (apply on pod creation), Auto (evict and recreate)
      resourcePolicy:
        containerPolicies:
          - containerName: my-service
            minAllowed:
              cpu: "100m"
              memory: "128Mi"
            maxAllowed:
              cpu: "4"
              memory: "8Gi"
            controlledResources: ["cpu", "memory"]

    # Check VPA recommendations without applying them
    kubectl describe vpa my-service-vpa

    # Output includes:
    # Lower Bound: minimum safe values
    # Target: recommended values based on observed usage
    # Upper Bound: maximum based on usage spikes
    # Uncapped Target: recommendation without min/max constraints

VPA caveats: VPA in Auto mode evicts pods to apply new resource values, so avoid Auto for single-replica deployments where eviction causes downtime. Use Initial (applies on pod creation only) as a middle ground. VPA also conflicts with HPA on CPU metrics: if you use HPA for scaling, disable VPA CPU recommendations and let VPA manage memory only.
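The caveats can be condensed into a small decision helper. This is only a sketch of the guidance in this section, with a hypothetical `vpa_settings` function; it assumes you have already reviewed recommendations in "Off" mode and are deciding what to enable next.

```python
def vpa_settings(replicas: int, hpa_scales_on_cpu: bool) -> dict:
    """Pick a VPA update mode and controlled resources for a deployment."""
    # Auto evicts pods, which means downtime for a single replica
    update_mode = "Initial" if replicas <= 1 else "Auto"
    # Leave CPU to HPA when it already scales on CPU
    controlled = ["memory"] if hpa_scales_on_cpu else ["cpu", "memory"]
    return {"updateMode": update_mode, "controlledResources": controlled}


print(vpa_settings(replicas=1, hpa_scales_on_cpu=False))
print(vpa_settings(replicas=6, hpa_scales_on_cpu=True))
```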
Horizontal Pod Autoscaler (HPA)
HPA scales the number of pod replicas based on metrics. The most common configuration scales on CPU utilization, but production HPA usually needs custom metrics.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service
      minReplicas: 3
      maxReplicas: 50
      metrics:
        # Scale on CPU: simple but often wrong
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        # Scale on requests per second: usually better for web services
        - type: Pods
          pods:
            metric:
              name: http_requests_per_second
            target:
              type: AverageValue
              averageValue: "1000"
        # Scale on queue depth for async workers
        - type: External
          external:
            metric:
              name: sqs_messages_visible
              selector:
                matchLabels:
                  queue: my-worker-queue
            target:
              type: AverageValue
              averageValue: "100"  # Scale so each replica handles ~100 messages
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60  # Wait 60s before scaling up further
          policies:
            - type: Pods
              value: 4  # Add at most 4 pods per scaling event
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
          policies:
            - type: Percent
              value: 10  # Remove at most 10% of replicas per event
              periodSeconds: 60

The behavior block is where most HPA configurations go wrong. Kubernetes already defaults to a 300-second scale-down stabilization window, but its default scale-down policy allows every surplus replica to be removed at once when that window ends. If the window is shortened or the removal rate is left uncapped, HPA can scale down aggressively during a brief traffic lull only to scale back up when traffic returns, causing availability problems and excessive pod churn. A 5-minute scale-down window with a modest removal cap is a reasonable starting point for most services.
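To see how the target and the rate limits interact, here is a simplified model in Python. It uses the replica formula from the Kubernetes HPA documentation (desired = ceil(current x currentMetric / targetMetric), skipped when the ratio is within a 10% tolerance) and then applies the caps from the manifest above. The function names are hypothetical, and the real controller also tracks stabilization windows and scaling history per period, which this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count the metric asks for, before any rate limiting."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


def apply_rate_limits(current, desired, max_pods_up=4, max_percent_down=10,
                      min_replicas=3, max_replicas=50):
    """Cap one scaling step: at most +4 pods up, at most 10% of replicas down."""
    if desired > current:
        desired = min(desired, current + max_pods_up)
    elif desired < current:
        lowest_allowed = current * (100 - max_percent_down) // 100
        desired = max(desired, lowest_allowed)
    return max(min_replicas, min(max_replicas, desired))


# CPU at 140% against a 70% target on 10 replicas: the metric wants 20, the policy allows 14
wanted = desired_replicas(10, 140, 70)
print(wanted, apply_rate_limits(10, wanted))

# Traffic halves on 20 replicas: the metric wants 10, the policy allows 18
wanted = desired_replicas(20, 35, 70)
print(wanted, apply_rate_limits(20, wanted))
```

The second case shows why the scale-down cap matters: without it, a short lull would drop half the fleet in one step.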
Cluster Autoscaler: Node-Level Scaling
When pod autoscalers add replicas faster than existing nodes can accommodate, Cluster Autoscaler (CA) provisions new nodes. When nodes are underutilized, CA removes them.
CA decisions depend on pod requests, not actual utilization — another reason accurate requests matter. CA won't scale down a node if doing so would require evicting a pod that has no other node that can satisfy its requests.
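The scale-down condition can be sketched as a bin-packing check over requests. This is a simplified illustration, not Cluster Autoscaler's actual simulation: `can_remove_node` is a hypothetical function, placement is first-fit with the largest pods first, and affinity rules, taints, and PodDisruptionBudgets are ignored.

```python
def can_remove_node(node_name, nodes, pods):
    """Return True if every pod on node_name fits on some other node.

    nodes: {name: {"cpu": allocatable_millicores, "memory": allocatable_mib}}
    pods:  list of {"name", "node", "cpu", "memory"} where cpu/memory are requests
    """
    # Unreserved capacity on every other node, based on requests only
    free = {name: dict(capacity) for name, capacity in nodes.items() if name != node_name}
    for pod in pods:
        if pod["node"] != node_name:
            free[pod["node"]]["cpu"] -= pod["cpu"]
            free[pod["node"]]["memory"] -= pod["memory"]

    # Try to place the displaced pods, largest first
    displaced = sorted((p for p in pods if p["node"] == node_name),
                       key=lambda p: (p["cpu"], p["memory"]), reverse=True)
    for pod in displaced:
        target = next((name for name, f in free.items()
                       if f["cpu"] >= pod["cpu"] and f["memory"] >= pod["memory"]), None)
        if target is None:
            return False  # this pod has nowhere to go, so the node stays
        free[target]["cpu"] -= pod["cpu"]
        free[target]["memory"] -= pod["memory"]
    return True


nodes = {"node-a": {"cpu": 4000, "memory": 8192},
         "node-b": {"cpu": 4000, "memory": 8192}}
pods = [{"name": "api", "node": "node-a", "cpu": 1000, "memory": 2048},
        {"name": "worker", "node": "node-b", "cpu": 3500, "memory": 4096}]

print(can_remove_node("node-a", nodes, pods))  # False: node-b has only 500m CPU unreserved
```

If the worker's request were inflated well beyond what it uses, node-a would stay up for no reason other than stale configuration.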

    # Common CA configuration gotchas

    # 1. PodDisruptionBudgets must be set correctly; CA respects them
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-service-pdb
    spec:
      minAvailable: 2  # Always keep 2 pods running
      selector:
        matchLabels:
          app: my-service

    # 2. Scale-down annotations on nodes you want to protect
    kubectl annotate node my-critical-node \
      cluster-autoscaler.kubernetes.io/scale-down-disabled=true

    # 3. CA logs for debugging why it's not scaling
    kubectl logs -n kube-system -l app=cluster-autoscaler | grep -i "scale\|cannot\|unschedulable"

Node pool design for CA efficiency: use multiple node pools with different instance sizes. Small instances for low-resource pods, large instances for memory-intensive workloads. Which pool CA grows depends on its expander setting: the default picks randomly among the pools that can fit the pending pods, while options such as least-waste or priority (or price, where the cloud provider supports it) steer it toward cheaper capacity. Having a pool of spot/preemptible instances for batch workloads reduces cost dramatically.
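The idea of matching a pending pod to the cheapest pool that can hold it can be sketched as follows. This only illustrates the selection idea and is not Cluster Autoscaler's expander logic; the pool names, node sizes, and hourly prices are hypothetical.

```python
def cheapest_fitting_pool(pools, pod_cpu, pod_memory):
    """Return the name of the cheapest pool whose node size fits the pod's requests."""
    candidates = [p for p in pools if p["cpu"] >= pod_cpu and p["memory"] >= pod_memory]
    if not candidates:
        return None  # no pool has nodes large enough for this pod
    return min(candidates, key=lambda p: p["hourly_cost"])["name"]


# Per-node allocatable CPU (millicores), memory (MiB), and a made-up hourly price
pools = [
    {"name": "small", "cpu": 2000, "memory": 4096, "hourly_cost": 0.05},
    {"name": "large-memory", "cpu": 8000, "memory": 65536, "hourly_cost": 0.50},
    {"name": "spot-batch", "cpu": 8000, "memory": 32768, "hourly_cost": 0.12},
]

print(cheapest_fitting_pool(pools, pod_cpu=500, pod_memory=1024))
print(cheapest_fitting_pool(pools, pod_cpu=4000, pod_memory=16384))
```

In practice the spot pool would also carry taints so that only interruption-tolerant workloads land on it.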
Capacity Planning: The Cluster-Level View
Node-level autoscaling handles traffic variance, but capacity planning addresses structural growth: the cluster the team needs 6 months from now.

    # Capacity planning projection from historical data
    import pandas as pd
    from sklearn.linear_model import LinearRegression


    def project_cluster_capacity(utilization_history_df, horizon_days=180):
        """
        Project resource needs based on historical utilization trends.

        utilization_history_df: DataFrame with columns [date, cpu_cores_used, memory_gb_used]
        """
        df = utilization_history_df.copy()
        # Parse dates and sort chronologically so y[-1] is the latest observation
        df['date'] = pd.to_datetime(df['date'])
        df = df.sort_values('date')
        df['day_index'] = (df['date'] - df['date'].min()).dt.days

        projections = {}
        for resource in ['cpu_cores_used', 'memory_gb_used']:
            X = df[['day_index']].values
            y = df[resource].values

            # Fit a straight-line trend: usage as a function of elapsed days
            model = LinearRegression()
            model.fit(X, y)

            future_day = df['day_index'].max() + horizon_days
            projected_value = model.predict([[future_day]])[0]

            # Add 30% headroom for variance + safety margin
            projections[resource] = projected_value * 1.30

            print(f"{resource}: currently {y[-1]:.1f}, projected {projected_value:.1f} "
                  f"({(projected_value/y[-1] - 1)*100:+.0f}%), with headroom: {projections[resource]:.1f}")

        return projections

Capacity planning inputs beyond utilization trends:
- Planned feature launches with estimated traffic impact
- Marketing campaigns or seasonal traffic patterns
- Data retention growth for stateful workloads
- Regulatory requirements affecting instance types or regions
The output of capacity planning isn't a single number — it's a range with confidence levels, reviewed quarterly and updated when assumptions change.
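One simple way to turn a trend into a range is to widen the point projection by the scatter of past data around the fitted line. The sketch below is a rough band under stated assumptions, not a formal prediction interval: it takes one value per day, fits an ordinary least-squares line in pure Python, and uses plus or minus `z` residual standard deviations. The `project_range` name and the sample data are hypothetical.

```python
import math


def project_range(daily_values, horizon_days=180, z=2.0):
    """Return (low, point, high) for usage horizon_days after the last sample.

    daily_values: one observation per day, oldest first (at least 2 values).
    """
    n = len(daily_values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_values)) / sxx
    intercept = mean_y - slope * mean_x

    # Scatter of the history around the fitted line
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, daily_values)]
    spread = math.sqrt(sum(r * r for r in residuals) / max(n - 2, 1))

    point = intercept + slope * (n - 1 + horizon_days)
    return point - z * spread, point, point + z * spread


cpu_cores_used = [40, 42, 41, 44, 46, 45, 48, 50, 49, 52]
low, point, high = project_range(cpu_cores_used, horizon_days=30)
print(f"CPU cores in 30 days: {point:.1f} (range {low:.1f} to {high:.1f})")
```

A band like this understates uncertainty for long horizons, which is one more reason to revisit the projection quarterly rather than trust a single 6-month number.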
Namespace Resource Quotas: Guardrails Against Runaway Usage
In multi-tenant clusters, namespace resource quotas prevent one team from consuming resources that others need:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-backend-quota
      namespace: team-backend
    spec:
      hard:
        requests.cpu: "20"  # Max 20 CPU cores requested across all pods
        requests.memory: 40Gi
        limits.cpu: "80"
        limits.memory: 160Gi
        count/pods: "100"  # Max 100 pods in this namespace
        count/persistentvolumeclaims: "20"
    ---
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: team-backend-defaults
      namespace: team-backend
    spec:
      limits:
        - type: Container
          default:  # Applied when no limits are specified
            cpu: "500m"
            memory: "512Mi"
          defaultRequest:  # Applied when no requests are specified
            cpu: "100m"
            memory: "128Mi"
          max:  # No single container can exceed these
            cpu: "8"
            memory: "16Gi"

LimitRange defaults are particularly important: they ensure pods without explicit resource configuration get reasonable defaults rather than BestEffort QoS, which would make them first candidates for eviction.
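The defaulting behavior can be illustrated with a short sketch. It models how the LimitRange admission step fills in missing values for a container, following the Kubernetes documentation: a missing limit takes `default`, and a missing request takes the container's own limit if one was specified, otherwise `defaultRequest`. The `apply_limit_range` function is hypothetical and ignores `max`/`min` validation.

```python
def apply_limit_range(container, default, default_request):
    """Fill in missing requests and limits the way LimitRange defaults would."""
    requests = dict(container.get("requests", {}))
    limits = dict(container.get("limits", {}))
    for resource in ("cpu", "memory"):
        limit_was_set = resource in limits
        if not limit_was_set and resource in default:
            limits[resource] = default[resource]
        if resource not in requests:
            if limit_was_set:
                # An explicit limit without a request: the request matches the limit
                requests[resource] = limits[resource]
            elif resource in default_request:
                requests[resource] = default_request[resource]
    return {"requests": requests, "limits": limits}


default = {"cpu": "500m", "memory": "512Mi"}
default_request = {"cpu": "100m", "memory": "128Mi"}

# A container with no resources block ends up Burstable instead of BestEffort
print(apply_limit_range({}, default, default_request))
```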
*Zak Hassan is a Staff SRE specializing in Kubernetes operations, capacity planning, and cloud infrastructure reliability. Find him at zakhassan.com or on LinkedIn.*