*By Zak Hassan — Staff SRE | May 2026*
Kubernetes networking is one of those areas where the abstraction feels clean until it isn't. The mental model — every pod gets an IP, every service gets a ClusterIP, DNS just works — holds right up until the moment you're staring at a 502 at 2am with no obvious cause in the application logs. The problem with most operational guides is that they treat each layer in isolation: CNI is a "platform team problem," CoreDNS gets a mention in the DNS troubleshooting blog post, and ingress controllers are configured once and forgotten. Real failures cut across all of these layers simultaneously, and diagnosing them requires a coherent mental model of how data actually flows through a cluster.
The Kubernetes Networking Model and CNI
Kubernetes enforces a flat network model: every pod must be able to communicate with every other pod without NAT, and every node must be able to reach every pod IP directly. This constraint is not implemented by Kubernetes itself — it is delegated entirely to the CNI (Container Network Interface) plugin installed in the cluster.
When a pod is scheduled, the kubelet calls the CNI plugin's binary to set up the pod's network namespace, assign an IP from the node's CIDR block, and configure the routing so that traffic destined for that IP arrives at the right network namespace. Flannel, the simplest CNI, uses a VXLAN overlay: it wraps pod-to-pod packets in UDP and routes them between nodes via a flannel.1 virtual interface. The overhead is real and the debug surface is awkward. Calico takes a different approach — it uses BGP to advertise pod CIDRs between nodes, so pod traffic routes natively at L3 without encapsulation (in most configurations). Cilium is the most sophisticated: it uses eBPF programs loaded directly into the kernel to handle packet routing, policy enforcement, and observability at line rate.
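To make "an IP from the node's CIDR block" concrete, here is a minimal Python sketch of how a cluster-wide pod range might be carved into per-node blocks, and how a single block runs dry. The 10.244.0.0/16 range, the /24 per-node mask, and the NodeIpam class are illustrative assumptions, not any particular CNI's implementation; real IPAM plugins reserve addresses differently.

```python
import ipaddress


def node_pod_cidrs(cluster_cidr, node_prefix, node_count):
    """Carve a cluster-wide pod CIDR into one block per node (illustrative)."""
    blocks = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)
    return [next(blocks) for _ in range(node_count)]


class NodeIpam:
    """Toy per-node allocator: hands out host addresses until the block is empty."""

    def __init__(self, pod_cidr):
        # hosts() skips the network and broadcast addresses of the block.
        self._free = list(ipaddress.ip_network(pod_cidr).hosts())
        self.allocated = {}

    def allocate(self, pod_name):
        if not self._free:
            # A real plugin surfaces this as an IPAM allocation error.
            raise RuntimeError("no IP addresses available in range")
        ip = self._free.pop(0)
        self.allocated[pod_name] = ip
        return ip

    def release(self, pod_name):
        self._free.append(self.allocated.pop(pod_name))


# Assumed values: a /16 cluster range split into /24 blocks for three nodes.
cidrs = node_pod_cidrs("10.244.0.0/16", node_prefix=24, node_count=3)
ipam = NodeIpam(str(cidrs[1]))
first_ip = ipam.allocate("api-0")
print(cidrs[1], first_ip)
```

Under these assumptions a /24 block yields 254 usable pod addresses, so a node with high pod churn or a smaller per-node mask can hit the limit well before it runs out of CPU or memory.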
When CNI breaks, the symptoms are blunt. Pods will be stuck in ContainerCreating with events like networkPlugin cni failed to set up pod "..." network: failed to allocate for range 0: no IP addresses available in range set. IPAM exhaustion — running out of pod IPs on a node — is a common and nasty failure mode. Check node IP availability like this:
# Count allocated IPs per IPAM block (Calico example; unallocated slots are null)
kubectl get ipamblocks -o json | jq '.items[] | {cidr: .spec.cidr, allocations: (.spec.allocations | map(select(. != null)) | length)}'
# For Flannel, check the subnet config
kubectl get configmap kube-flannel-cfg -n kube-flannel -o jsonpath='{.data.net-conf\.json}' | jq .
# Describe a stuck pod to see CNI errors
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
DNS Resolution in Kubernetes
CoreDNS is the cluster DNS server, deployed as a Deployment (typically two replicas) in kube-system. Every pod's /etc/resolv.conf is injected by kubelet and looks roughly like this:
nameserver 10.96.0.10
search production.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
The ndots:5 option is the source of a subtle performance problem that affects nearly every Kubernetes cluster. When a pod resolves a name with fewer than five dots, the resolver treats it as relative and appends each search domain in sequence before trying the name as absolute. For an in-namespace service such as postgres this is cheap: the first candidate, postgres.production.svc.cluster.local, already matches. External hostnames pay the full price. Resolving api.stripe.com with the search list above produces four queries: api.stripe.com.production.svc.cluster.local, api.stripe.com.svc.cluster.local, api.stripe.com.cluster.local, and finally api.stripe.com itself. The first three are guaranteed NXDOMAIN responses from CoreDNS, and the count typically doubles when the resolver asks for both A and AAAA records.
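The search-list expansion can be sketched as a small function. This is a simplified model of glibc-style resolver behavior, not the resolver's actual code; the function name and the SEARCH list (copied from the resolv.conf above) are illustrative, and other resolvers such as musl differ in the details.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the names a glibc-style stub resolver tries, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + searched
    # Too few dots: walk the search list first, the literal name last.
    return searched + [absolute]


SEARCH = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("api.stripe.com", SEARCH):
    print(query)
```

The resolver stops at the first candidate that returns an answer, so the list is a worst case: an in-namespace name resolves on the first entry, while an external name only resolves on the last.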
The fix is to use fully-qualified names (trailing dot) in application configuration or to reduce ndots for pods that resolve mostly external names:
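# Pod spec DNS config for services that mostly call external APIs
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen
For internal service names, use the fully qualified name including the trailing dot, as in "postgres.production.svc.cluster.local." Without the trailing dot, the name has only four dots and still goes through the search list under ndots:5. This trades verbosity for predictable, single-query resolution.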
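One way to keep fully qualified names consistent is to build them in a single place. The helper below is a hypothetical sketch: the function name is made up, and cluster.local is only the common default cluster domain, so check what your cluster actually uses.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the absolute DNS name of a Service, including the trailing dot."""
    for label in (service, namespace):
        if not label or "." in label:
            raise ValueError(f"invalid DNS label: {label!r}")
    return f"{service}.{namespace}.svc.{cluster_domain}."


print(service_fqdn("postgres", "production"))
```

Some clients handle a trailing dot poorly, for example in TLS hostname checks or HTTP Host headers, so it is worth testing against the actual client library before rolling this out widely.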
DNS Debugging Workflow
When DNS is misbehaving, the first step is always to get a debug pod running in the same namespace as the failing workload.
# Spin up a debug pod with DNS tools
kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 \
--namespace=production --restart=Never -it --rm -- /bin/sh
# Inside the pod — test cluster DNS
nslookup postgres
nslookup postgres.production.svc.cluster.local
dig postgres.production.svc.cluster.local @10.96.0.10 +stats
# Check what resolv.conf looks like in that namespace context
cat /etc/resolv.conf
Common DNS failure modes and how to spot them:
CoreDNS OOM is the most dangerous. It presents as intermittent, cluster-wide DNS failures with no obvious pattern. Check it with:
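kubectl top pods -n kube-system -l k8s-app=kube-dns
# OOMKilling events are usually recorded against the node, so search all namespaces
kubectl get events -A --field-selector reason=OOMKilling
# A last state of OOMKilled confirms the container hit its memory limit
kubectl describe pods -n kube-system -l k8s-app=kube-dns | grep -A 5 "Last State"
kubectl logs -n kube-system -l k8s-app=kube-dns --previous
If CoreDNS is cycling, increase its memory limit and check whether the autopath plugin is enabled. Autopath requires the kubernetes plugin's pods verified mode, which makes CoreDNS track every pod in the cluster and raises memory use substantially, so it is usually best left disabled in large clusters.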
Negative caching is subtler. With the cache 30 setting in the default Corefile, NXDOMAIN responses can be served from cache for up to 30 seconds. If a lookup fails because a Service is briefly missing and the name then becomes resolvable again, pods may keep seeing the cached failure for up to 30 seconds. Customize the CoreDNS ConfigMap to reduce the negative TTL:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30 {
            # Syntax is "denial CAPACITY [TTL]": keep the default capacity (9984)
            # and cap the negative cache TTL at 5 seconds
            denial 9984 5
        }
        loop
        reload
        loadbalance
    }
Service Networking
ClusterIP services are implemented entirely by kube-proxy — Kubernetes itself has no forwarding logic. When you create a Service, the API server writes it to etcd, kube-proxy watches the API server, and rewrites the node's iptables (or IPVS) rules to intercept traffic destined for the ClusterIP and DNAT it to a random healthy endpoint.
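The "random healthy endpoint" choice in iptables mode is made by a chain of rules, each matching with a fixed probability, rather than by a single random draw. The sketch below is a simplified model of that scheme and not kube-proxy's code: it assumes rule i matches with probability 1/(endpoints remaining), with the last rule matching unconditionally, and the function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(endpoint_count):
    """Per-rule match probability for a chain of sequential random-match rules."""
    # Rule i is only reached if rules 0..i-1 did not match, so it must match
    # with probability 1/(remaining endpoints) for the overall pick to be uniform.
    return [Fraction(1, endpoint_count - i) for i in range(endpoint_count)]


def overall_probabilities(rule_probs):
    """Chance that each endpoint is the one finally selected."""
    result = []
    not_matched_yet = Fraction(1)
    for p in rule_probs:
        result.append(not_matched_yet * p)
        not_matched_yet *= 1 - p
    return result


rules = rule_probabilities(4)
print([str(p) for p in rules])                         # ['1/4', '1/3', '1/2', '1']
print([str(p) for p in overall_probabilities(rules)])  # ['1/4', '1/4', '1/4', '1/4']
```

The per-rule probabilities look uneven, but each endpoint ends up equally likely. It also shows why the rule count grows linearly with the number of endpoints.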
In iptables mode, every ClusterIP becomes a chain of rules. A service with 50 endpoints means roughly 50 endpoint rules, and the rules are walked sequentially for the first packet of each new connection (conntrack handles the rest of the flow). With thousands of services the rule set grows large, which is why big clusters see higher connection-setup latency and slow rule updates under load. IPVS mode uses a kernel-level hash table and scales far better; switch to it in the kube-proxy ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    mode: "ipvs"
    ipvs:
      scheduler: "rr"
Stale endpoints are a common source of intermittent 5xx errors. When a pod is terminated, kube-proxy needs time to remove it from the iptables rules. During the window between pod termination and iptables update, requests can be routed to a closed socket. Mitigate this with a preStop hook and a graceful shutdown delay:
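# Container-level field (spec.containers[].lifecycle)
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]
# Pod-level field (spec.terminationGracePeriodSeconds)
terminationGracePeriodSeconds: 30
Inspect current service endpoints and their readiness directly: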
kubectl get endpoints postgres -n production -o yaml
kubectl describe endpoints postgres -n production
# Inspect iptables rules for a ClusterIP
CLUSTER_IP=$(kubectl get svc postgres -n production -o jsonpath='{.spec.clusterIP}')
iptables -t nat -L -n | grep "$CLUSTER_IP"
Ingress Controllers
The Ingress resource is just a spec — enforcement requires a controller. NGINX Ingress is the most widely deployed; Traefik is popular for dynamic environments; Contour (backed by Envoy) is preferred where advanced traffic management like weighted routing matters. The IngressClass resource, introduced in Kubernetes 1.18, decouples which controller handles which Ingress object:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-body-size: "16m"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080
For production NGINX Ingress, the controller's ConfigMap governs global defaults. Misconfigured timeouts and buffer sizes are the leading cause of 502 and 504 errors under load:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Connection and timeout tuning
  proxy-connect-timeout: "10"
  proxy-send-timeout: "120"
  proxy-read-timeout: "120"
  keep-alive: "75"
  keep-alive-requests: "1000"
  # Buffer tuning: prevents 502 when upstream sends large headers
  proxy-buffer-size: "16k"
  # The ConfigMap key takes the buffer count only; the size comes from proxy-buffer-size
  proxy-buffers-number: "8"
  proxy-busy-buffers-size: "32k"
  # Connection limits
  max-worker-connections: "16384"
  worker-processes: "auto"
  # Enable real IP from load balancer
  use-forwarded-headers: "true"
  compute-full-forwarded-for: "true"
  # Log format for structured logging
  log-format-upstream: '{"time":"$time_iso8601","addr":"$remote_addr","method":"$request_method","uri":"$uri","status":$status,"bytes":$bytes_sent,"duration":$request_time,"upstream":"$upstream_addr","upstream_status":"$upstream_status"}'
Network Debugging Toolkit
The netshoot image by Nicolaka is the Swiss army knife for in-cluster network debugging. It bundles tcpdump, curl, netstat, iperf3, mtr, and dozens of other tools:
# Run netshoot in the same namespace as the failing workload
kubectl run netshoot --image=nicolaka/netshoot -n production \
  --restart=Never -it --rm -- /bin/bash
# Test raw TCP connectivity to a service (PostgreSQL does not speak HTTP, so use nc rather than curl)
nc -zv postgres.production.svc.cluster.local 5432
# Check what's listening in the current network namespace (run on the node itself to see node sockets)
netstat -tulnp
# Run a bandwidth test between pods (run iperf3 -s in one pod first)
iperf3 -c <target-pod-ip> -t 10
For deeper packet inspection, you can run tcpdump inside a pod's network namespace by finding the pod's PID on the node and entering its netns:
# On the node: find the pod sandbox PID
POD_ID=$(crictl pods --namespace production --name api-pod -q)
PID=$(crictl inspectp "$POD_ID" | jq '.info.pid')
# Run tcpdump in the pod's network namespace
nsenter -t "$PID" -n -- tcpdump -i eth0 -nn port 5432 -w /tmp/capture.pcap
# Alternatively, run tcpdump in an ephemeral debug container (Kubernetes 1.25+)
kubectl debug -it <pod-name> -n production \
  --image=nicolaka/netshoot \
  --target=<container-name> \
  -- tcpdump -i eth0 -nn -w /tmp/capture.pcap
Check the conntrack table when you suspect NAT or connection tracking is dropping traffic:
# On the node
conntrack -L | grep <pod-ip>
conntrack -S # Show statistics; look for drop/insert_failed counters
For Cilium-specific debugging, the cilium CLI is indispensable:
# Check network policy enforcement for a pod
kubectl exec -n kube-system ds/cilium -- cilium endpoint list
kubectl exec -n kube-system ds/cilium -- cilium policy trace \
--src-endpoint <src-id> --dst-endpoint <dst-id> --dport 5432/TCP
# Live packet drops with reason
kubectl exec -n kube-system ds/cilium -- cilium monitor --type drop
Common Networking Failures and Their Symptoms
ImagePullBackOff caused by DNS is more common than it should be. Image pulls are performed by the container runtime using the node's own resolver, so the failure shows up when that resolver is unhealthy: the upstream DNS server is unreachable or, on nodes configured to resolve through cluster DNS, the CoreDNS pod serving the node is being evicted or restarted. The runtime cannot resolve the registry hostname and the pod reports ImagePullBackOff. Check DNS health from the node itself, and consider deploying NodeLocal DNSCache, which runs a caching DNS agent on every node and reduces the dependency on CoreDNS availability for workloads on that node.
Inter-pod connectivity failures that appear as random connection timeouts usually point to a CNI misconfiguration or a network policy blocking traffic. Verify that NetworkPolicy isn't silently dropping traffic:
# List all NetworkPolicies in a namespace
kubectl get networkpolicy -n production -o yaml
# Test with a temporary allow-all policy to isolate the issue
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: production
spec:
  podSelector: {}
  ingress:
    - {}
  egress:
    - {}
  policyTypes:
    - Ingress
    - Egress
EOF
NodePort not responding is almost always one of two things: the service's endpoint list is empty (no healthy pods), or the node's firewall is blocking the port range (30000-32767 by default). Verify with kubectl get endpoints <service> and check cloud security groups or iptables INPUT rules on the node.
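A quick probe from outside the cluster can help tell these two cases apart. The sketch below is a hypothetical helper, not part of any Kubernetes tooling; the interpretation is a rule of thumb. A refused connection often means something actively rejected it (kube-proxy in iptables mode typically rejects traffic to a service with no endpoints), while a timeout often means a firewall or security group is silently dropping packets.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else (no route to host, name resolution failure, ...).
        return f"error: {exc}"


if __name__ == "__main__":
    # Hypothetical target: replace with a real node IP and NodePort.
    print(probe_tcp("127.0.0.1", 30080, timeout=1.0))
```

Treat the result as a hint for where to look next, and confirm with the endpoint list and the firewall rules themselves.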
Ingress 502 errors typically mean the ingress controller could not get a valid response from the upstream service. The cause is usually a pod that has terminated but whose endpoint hasn't been removed yet, a keep-alive mismatch (the upstream closes idle connections sooner than the ingress expects), or response headers too large for proxy-buffer-size. Check the NGINX ingress controller logs first, since they include upstream response codes and timing. Then check the upstream service's own logs and verify endpoint health:
# With the JSON log format configured earlier, match on '"status":502' instead of " 502 "
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100 | grep " 502 "
kubectl get endpoints <service-name> -n production
kubectl describe ingress <ingress-name> -n production
Ingress 504 is a timeout: the upstream is alive but slow. Cross-reference proxy-read-timeout in the ConfigMap against actual upstream p99 latency from your metrics system. A backend that normally responds in 30 seconds will 504 against the default 60-second timeout only under load, when response time degrades. Raise the timeout with an annotation on the specific Ingress rather than globally, to avoid masking real latency regressions elsewhere.
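The comparison between timeout and observed latency can be sketched in a few lines. This is an illustrative calculation only: the sample durations are made up, the function names are hypothetical, and a real check would read the percentile from your metrics system rather than from a raw list.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def timeout_headroom(latencies_s, timeout_s, pct=99):
    """Return (pct-th percentile latency, fraction of the timeout still unused)."""
    observed = percentile(latencies_s, pct)
    return observed, 1 - observed / timeout_s


# Hypothetical request durations in seconds, e.g. exported from a metrics system.
latencies = [28, 30, 31, 29, 33, 30, 45, 30, 58, 32]
p99, headroom = timeout_headroom(latencies, timeout_s=60)
print(f"p99={p99}s headroom={headroom:.0%}")  # p99=58s headroom=3%
```

A backend whose typical latency sits at half the timeout can still have almost no headroom at the tail, which is where the 504s come from.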
Kubernetes networking rewards operators who understand the full stack: from how CNI allocates IPs, through how CoreDNS resolves names and how kube-proxy installs its forwarding rules, to how an ingress controller proxies HTTP. Each layer has its own failure modes and its own debugging tools. The clusters that stay healthy are the ones where the team has instrumented all of these layers before the outage, not after.
*Zak Hassan is a Staff SRE specializing in Kubernetes infrastructure, distributed systems reliability, and cloud-native observability. Find him at zakhassan.com or on LinkedIn.*