*By Zak Hassan — Staff SRE | May 2026*


Kubernetes networking is one of those areas where the abstraction feels clean until it isn't. The mental model — every pod gets an IP, every service gets a ClusterIP, DNS just works — holds right up until the moment you're staring at a 502 at 2am with no obvious cause in the application logs. The problem with most operational guides is that they treat each layer in isolation: CNI is a "platform team problem," CoreDNS gets a mention in the DNS troubleshooting blog post, and ingress controllers are configured once and forgotten. Real failures cut across all of these layers simultaneously, and diagnosing them requires a coherent mental model of how data actually flows through a cluster.

The Kubernetes Networking Model and CNI

Kubernetes enforces a flat network model: every pod must be able to communicate with every other pod without NAT, and every node must be able to reach every pod IP directly. This constraint is not implemented by Kubernetes itself — it is delegated entirely to the CNI (Container Network Interface) plugin installed in the cluster.

When a pod is scheduled, the kubelet asks the container runtime to create the pod sandbox, and the runtime invokes the CNI plugin binary to set up the pod's network namespace, assign an IP from the node's CIDR block, and configure routing so that traffic destined for that IP arrives at the right network namespace. Flannel, the simplest CNI, uses a VXLAN overlay by default: it wraps pod-to-pod packets in UDP and routes them between nodes via a flannel.1 virtual interface. The encapsulation costs roughly 50 bytes of headers per packet, which lowers the usable pod MTU, and a capture on the node's physical interface shows only the outer UDP packets, which makes debugging awkward. Calico takes a different approach: it uses BGP to advertise pod CIDRs between nodes, so pod traffic routes natively at L3 without encapsulation (in most configurations). Cilium is the most sophisticated: it uses eBPF programs loaded directly into the kernel to handle packet routing, policy enforcement, and observability without traversing long iptables chains.

When CNI breaks, the symptoms are blunt. Pods will be stuck in ContainerCreating with events like networkPlugin cni failed to set up pod "..." network: failed to allocate for range 0: no IP addresses available in range set. IPAM exhaustion — running out of pod IPs on a node — is a common and nasty failure mode. Check node IP availability like this:

bash
# Check how many IPs are allocated per Calico IPAM block
# (.spec.allocations has one slot per address; unallocated slots are null)
kubectl get ipamblocks -o json | jq '.items[] | {cidr: .spec.cidr, allocations: ([.spec.allocations[] | select(. != null)] | length)}'

# For Flannel, check the subnet config
kubectl get configmap kube-flannel-cfg -n kube-flannel -o jsonpath='{.data.net-conf\.json}' | jq .

# Describe a stuck pod to see CNI errors
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
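
As a rough capacity check, the arithmetic behind IPAM exhaustion can be sketched in shell. The helper names and the example CIDRs below are hypothetical; 110 is the kubelet's default maximum pods per node. The count ignores reserved addresses (network, broadcast, gateway), so treat it as an upper bound rather than an exact figure.

```bash
# Sketch: compare the address count of a per-node pod CIDR with max pods.
# cidr_address_count and fits_max_pods are hypothetical helper names.

cidr_address_count() {
  local prefix=${1#*/}               # "10.244.1.0/24" -> "24"
  echo $(( 1 << (32 - prefix) ))     # raw number of IPv4 addresses
}

fits_max_pods() {
  local cidr=$1 max_pods=$2
  if [ "$(cidr_address_count "$cidr")" -gt "$max_pods" ]; then
    echo "ok"
  else
    echo "too small"
  fi
}

fits_max_pods 10.244.1.0/24 110   # 256 addresses for 110 pods
fits_max_pods 10.244.1.0/26 110   # 64 addresses for 110 pods
```

A /24 per node leaves headroom for address churn during rolling updates; a /26 is exhausted well before the node reaches its pod limit.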

DNS Resolution in Kubernetes

CoreDNS is the cluster DNS server, deployed as a Deployment (typically two replicas) in kube-system. Every pod's /etc/resolv.conf is injected by kubelet and looks roughly like this:

text
nameserver 10.96.0.10
search production.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

The ndots:5 option is the source of a subtle performance problem that affects nearly every Kubernetes cluster. When a pod resolves a name with fewer than 5 dots, the resolver treats it as relative and appends each search domain in sequence before trying the name as absolute, stopping at the first successful answer. A lookup for postgres from the production namespace is cheap: the first candidate, postgres.production.svc.cluster.local, resolves immediately. External names are where the cost shows up. With the search list above, resolving api.stripe.com produces four queries: api.stripe.com.production.svc.cluster.local, api.stripe.com.svc.cluster.local, api.stripe.com.cluster.local, and finally api.stripe.com. The first three are guaranteed NXDOMAIN responses from CoreDNS, and because resolvers typically ask for A and AAAA records separately, the real query count is often double that.
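
To make the search-list behaviour concrete, here is a minimal sketch that lists candidate names in the order a glibc-style resolver would try them. The function resolver_candidates is a hypothetical helper, not part of any real tool, and other resolver implementations (musl, for example) may order or skip candidates differently.

```bash
# Sketch: print candidate FQDNs in the order a glibc-style resolver tries them.
# Usage: resolver_candidates NAME NDOTS SEARCH_DOMAIN...
resolver_candidates() {
  local name=$1 ndots=$2
  shift 2                     # remaining arguments are the search domains

  # A trailing dot marks the name as fully qualified: no search list at all.
  case $name in
    *.) echo "$name"; return ;;
  esac

  # Count the dots in the name.
  local dots
  dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))

  # With at least ndots dots, the name is tried as written first.
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."
  fi

  local domain
  for domain in "$@"; do
    echo "$name.$domain."
  done

  # With fewer than ndots dots, the name as written is tried last.
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name."
  fi
}

resolver_candidates api.stripe.com 5 \
  production.svc.cluster.local svc.cluster.local cluster.local
```

With ndots set to 5 the call prints the three search-domain candidates before api.stripe.com. itself. Passing 2 instead of 5 moves the name as written to the front of the list.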

The fix is to use fully-qualified names (trailing dot) in application configuration or to reduce ndots for pods that resolve mostly external names:

yaml
# Pod spec DNS config for services that mostly call external APIs
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen

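For internal service names, use the full FQDN, postgres.production.svc.cluster.local, and end it with a trailing dot. The name contains only four dots, so under ndots:5 the resolver would otherwise still walk the search list before trying it as written. This trades verbosity for predictable, single-query resolution.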

DNS Debugging Workflow

When DNS is misbehaving, the first step is always to get a debug pod running in the same namespace as the failing workload.

bash
# Spin up a debug pod with DNS tools
kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 \
  --namespace=production --restart=Never -it --rm -- /bin/sh

# Inside the pod — test cluster DNS
nslookup postgres
nslookup postgres.production.svc.cluster.local
dig postgres.production.svc.cluster.local @10.96.0.10 +stats

# Check what resolv.conf looks like in that namespace context
cat /etc/resolv.conf

Common DNS failure modes and how to spot them:

CoreDNS OOM is the most dangerous. It presents as intermittent, cluster-wide DNS failures with no obvious pattern. Check it with:

bash
kubectl top pods -n kube-system -l k8s-app=kube-dns
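# OOMKilling events attach to the Node, so search all namespaces
kubectl get events -A --field-selector reason=OOMKilling
kubectl get pods -n kube-system -l k8s-app=kube-dns \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'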
kubectl logs -n kube-system -l k8s-app=kube-dns --previous

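If CoreDNS is cycling, increase its memory limit and check whether the autopath plugin is enabled. autopath requires the kubernetes plugin to run in pods verified mode, which makes CoreDNS watch every pod in the cluster and raises its memory footprint considerably; in large clusters it is usually better to disable it.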

Negative caching is subtler. CoreDNS caches NXDOMAIN responses for 30 seconds by default. If a service is temporarily missing its endpoints and then recovers, pods may continue seeing failures for up to 30 seconds. Customize the CoreDNS ConfigMap to reduce negative TTL:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30 {
           denial 9984 5   # denial CAPACITY TTL: default capacity, negative TTL capped at 5 seconds
        }
        loop
        reload
        loadbalance
    }
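
After changing the Corefile it is worth confirming the negative TTL that pods actually see. The sketch below extracts the response status and the SOA TTL from dig output; dig_negative_ttl is a hypothetical helper name, and the parsing assumes dig's standard text output, where an NXDOMAIN answer carries an SOA record in the authority section.

```bash
# Sketch: read dig output on stdin, print "<STATUS> <SOA TTL>".
# dig_negative_ttl is a hypothetical helper, not a dig feature.
dig_negative_ttl() {
  awk '
    # Header line looks like: ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: ..."
    /->>HEADER<<-/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "status:") {
          status = $(i + 1)
          sub(/,$/, "", status)   # strip the trailing comma
        }
      }
    }
    # Authority record: "<zone> <ttl> IN SOA ..."
    $3 == "IN" && $4 == "SOA" { ttl = $2 }
    END { print status, ttl }
  '
}

# Usage from a debug pod (the queried name is a placeholder that should not exist):
#   dig does-not-exist.production.svc.cluster.local | dig_negative_ttl
```

Running the query twice a few seconds apart should show the TTL counting down from the configured cap if the second answer came from the cache.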

Service Networking

ClusterIP services are implemented on each node by kube-proxy; the ClusterIP itself is a virtual address that no process listens on. When you create a Service, the API server persists it to etcd, kube-proxy (which watches the API server) rewrites the node's iptables (or IPVS) rules to intercept traffic destined for the ClusterIP, and those rules DNAT each new connection to a randomly chosen ready endpoint.

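In iptables mode, every Service becomes a set of chains: a KUBE-SVC chain per service port and a KUBE-SEP chain per endpoint. A service with 50 endpoints gets 50 probabilistic jump rules, and the first packet of each new connection is matched against the service rules in sequence until one hits (established connections are handled by conntrack). This is why large clusters with many services see added connection-setup latency and slow rule updates under load. IPVS mode uses a kernel-level hash table and scales far better; switch to it in the kube-proxy ConfigMap: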

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
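    # Excerpt: other fields omitted; restart the kube-proxy pods after changing mode
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"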
    ipvs:
      scheduler: "rr"
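
Before deciding whether a switch is worthwhile, it helps to see how large the iptables rule set has grown. The sketch below counts kube-proxy's per-service-port and per-endpoint chains from iptables-save output; count_kube_chains is a hypothetical helper name, and it assumes kube-proxy is running in iptables mode on the node.

```bash
# Sketch: count kube-proxy chains in `iptables-save -t nat` output read from stdin.
# Chain declarations look like ":KUBE-SVC-<hash> - [0:0]".
count_kube_chains() {
  awk '
    /^:KUBE-SVC-/ { svc++ }   # one chain per service port
    /^:KUBE-SEP-/ { sep++ }   # one chain per endpoint
    END { printf "service_ports=%d endpoints=%d\n", svc, sep }
  '
}

# Usage on a node, as root:
#   iptables-save -t nat | count_kube_chains
```

Tracking these two numbers over time gives a simple signal for when rule-sync cost starts to matter.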

Stale endpoints are a common source of intermittent 5xx errors. When a pod is terminated, kube-proxy needs time to remove it from the iptables rules. During the window between pod termination and iptables update, requests can be routed to a closed socket. Mitigate this with a preStop hook and a graceful shutdown delay:

yaml
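# Pod spec excerpt: preStop is set per container, the grace period per pod
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: api   # hypothetical container name
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]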
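
One detail that is easy to miss: the preStop hook runs inside the termination grace period, so the sleep is subtracted from the time the application has to drain after SIGTERM. A minimal sketch of that arithmetic, with remaining_shutdown_budget as a hypothetical helper name:

```bash
# Sketch: seconds left for the application to shut down after the preStop sleep.
# Usage: remaining_shutdown_budget GRACE_PERIOD_SECONDS PRESTOP_SLEEP_SECONDS
remaining_shutdown_budget() {
  local grace=$1 prestop=$2
  if [ "$prestop" -ge "$grace" ]; then
    echo 0                          # the hook consumes the whole grace period
  else
    echo $(( grace - prestop ))
  fi
}

remaining_shutdown_budget 30 5      # values from the pod spec above
```

With the values shown, the application has 25 seconds to finish in-flight requests before it is killed.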

Inspect current service endpoints and their readiness directly:

bash
kubectl get endpoints postgres -n production -o yaml
kubectl describe endpoints postgres -n production

# Inspect iptables NAT rules for a ClusterIP (run the iptables command on a node, as root)
CLUSTER_IP=$(kubectl get svc postgres -n production -o jsonpath='{.spec.clusterIP}')
iptables -t nat -L -n | grep "$CLUSTER_IP"

Ingress Controllers

The Ingress resource is just a spec — enforcement requires a controller. NGINX Ingress is the most widely deployed; Traefik is popular for dynamic environments; Contour (backed by Envoy) is preferred where advanced traffic management like weighted routing matters. The IngressClass resource, introduced in Kubernetes 1.18, decouples which controller handles which Ingress object:

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-body-size: "16m"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080

For production NGINX Ingress, the controller's ConfigMap governs global defaults. Misconfigured timeouts and buffer sizes are the leading cause of 502 and 504 errors under load:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Connection and timeout tuning
  proxy-connect-timeout: "10"
  proxy-send-timeout: "120"
  proxy-read-timeout: "120"
  keep-alive: "75"
  keep-alive-requests: "1000"

  # Buffer tuning — prevents 502 when upstream sends large headers
  proxy-buffer-size: "16k"
  proxy-buffers-number: "8"
  proxy-busy-buffers-size: "32k"

  # Connection limits
  max-worker-connections: "16384"
  worker-processes: "auto"

  # Enable real IP from load balancer
  use-forwarded-headers: "true"
  compute-full-forwarded-for: "true"

  # Log format for structured logging
  log-format-upstream: '{"time":"$time_iso8601","addr":"$remote_addr","method":"$request_method","uri":"$uri","status":$status,"bytes":$bytes_sent,"duration":$request_time,"upstream":"$upstream_addr","upstream_status":"$upstream_status"}'

Network Debugging Toolkit

The netshoot image by Nicolaka is the Swiss army knife for in-cluster network debugging. It bundles tcpdump, curl, netstat, iperf3, mtr, and dozens of other tools:

bash
# Run netshoot in the same namespace as the failing workload
kubectl run netshoot --image=nicolaka/netshoot -n production \
  --restart=Never -it --rm -- /bin/bash

# Test raw TCP connectivity to the service (Postgres does not speak HTTP, so use nc rather than curl)
nc -zv postgres.production.svc.cluster.local 5432

# Check what's listening inside this pod's network namespace
netstat -tulnp

# Run a bandwidth test between pods (run iperf3 -s in one pod first)
iperf3 -c <target-pod-ip> -t 10

For deeper packet inspection, you can run tcpdump inside a pod's network namespace by finding the pod's PID on the node and entering its netns:

bash
# On the node — find the container PID
POD_ID=$(crictl pods --namespace production --name api-pod -q)
PID=$(crictl inspectp $POD_ID | jq '.info.pid')

# Run tcpdump in the pod's network namespace (the capture file lands in /tmp on the node)
nsenter -t "$PID" -n -- tcpdump -i eth0 -nn port 5432 -w /tmp/capture.pcap

# Alternatively, run tcpdump in an ephemeral debug container (stable since Kubernetes 1.25)
kubectl debug -it <pod-name> -n production \
  --image=nicolaka/netshoot \
  --target=<container-name> \
  -- tcpdump -i eth0 -nn -w /tmp/capture.pcap

Check the conntrack table when you suspect NAT or connection tracking is dropping traffic:

bash
# On the node
conntrack -L | grep <pod-ip>
conntrack -S  # Show statistics — look for drop/insert_failed counters
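
Because conntrack -S prints one line of key=value counters per CPU, the totals are easier to read once summed. A small sketch, with sum_conntrack_counter as a hypothetical helper name:

```bash
# Sketch: sum one counter across all per-CPU lines of `conntrack -S` output (stdin).
# Usage: conntrack -S | sum_conntrack_counter insert_failed
sum_conntrack_counter() {
  local counter=$1
  awk -v key="$counter" '
    {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) total += kv[2]   # exact match, so "drop" skips "early_drop"
      }
    }
    END { print total + 0 }
  '
}
```

A non-zero and growing insert_failed or drop total is the usual sign that the conntrack table is full or that connections are racing on insert.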

For Cilium-specific debugging, the cilium CLI is indispensable:

bash
# Check network policy enforcement for a pod
kubectl exec -n kube-system ds/cilium -- cilium endpoint list
kubectl exec -n kube-system ds/cilium -- cilium policy trace \
  --src-endpoint <src-id> --dst-endpoint <dst-id> --dport 5432/TCP

# Live packet drops with reason
kubectl exec -n kube-system ds/cilium -- cilium monitor --type drop

Common Networking Failures and Their Symptoms

ImagePullBackOff caused by DNS is more common than it should be. Image pulls are performed by the container runtime on the node, which resolves the registry hostname through the node's own resolver configuration rather than through a pod's resolv.conf. If the node's upstream resolver is failing, or if the node has been pointed at cluster DNS while CoreDNS is being evicted or restarted, the registry hostname cannot be resolved and the pod reports ImagePullBackOff. Check DNS from the node itself first. For pod-level lookups, consider deploying NodeLocal DNSCache, which runs a caching DNS agent on every node and reduces the dependency on CoreDNS availability.

Inter-pod connectivity failures that appear as random connection timeouts usually point to a CNI misconfiguration or a network policy blocking traffic. Verify that NetworkPolicy isn't silently dropping traffic:

bash
# List all NetworkPolicies in a namespace
kubectl get networkpolicy -n production -o yaml

# Test with a temporary allow-all policy to isolate the issue
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: production
spec:
  podSelector: {}
  ingress:
  - {}
  egress:
  - {}
  policyTypes:
  - Ingress
  - Egress
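EOF

# Remove the temporary policy as soon as the test is done
kubectl delete networkpolicy debug-allow-all -n production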

NodePort not responding is almost always one of two things: the service's endpoint list is empty (no healthy pods), or the node's firewall is blocking the port range (30000-32767 by default). Verify with kubectl get endpoints <service> and check cloud security groups or iptables INPUT rules on the node.

Ingress 502 errors typically mean the ingress controller could not get a valid response from the upstream service. The cause is usually a pod that has terminated but whose endpoint hasn't been removed yet, a keep-alive mismatch (the upstream closes idle connections while the ingress is still reusing them), or response headers that overflow proxy-buffer-size. Check the NGINX ingress controller logs first, since they include upstream response codes and timing. Then check the upstream service's own logs and verify endpoint health:

bash
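# Matches both the JSON log format configured above and the default access log format
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100 | grep -E '"status":502| 502 '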
kubectl get endpoints <service-name> -n production
kubectl describe ingress <ingress-name> -n production
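
When the controller uses the JSON log-format-upstream shown earlier, a quick status histogram often tells you whether 502s are a trickle or a flood. The sketch below assumes that log format is in effect; count_statuses is a hypothetical helper name.

```bash
# Sketch: count responses per HTTP status from JSON access-log lines on stdin.
# Relies on the `"status":<code>` field of the JSON log format; other lines are ignored.
count_statuses() {
  grep -o '"status":[0-9]*' \
    | cut -d: -f2 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'    # print "<status> <count>"
}

# Usage:
#   kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=1000 | count_statuses
```

The leading quote in the pattern keeps the upstream_status field from being counted as a second status.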

Ingress 504 is a timeout: the upstream is alive but slow. Cross-reference proxy-read-timeout in the ConfigMap against actual upstream p99 latency from your metrics system. A backend that normally responds in 30 seconds will hit the default 60-second timeout only under load, when response time degrades. Raise the timeout with an annotation on the specific Ingress rather than globally, so that real latency regressions elsewhere are not masked.

Kubernetes networking rewards operators who understand the full stack: how CNI allocates IPs, how CoreDNS resolves names, how kube-proxy installs its forwarding rules, and how an ingress controller proxies HTTP. Each layer has its own failure modes and its own debugging tools. The clusters that stay healthy are the ones where the team has instrumented all of these layers before the outage, not after.


*Zak Hassan is a Staff SRE specializing in Kubernetes infrastructure, distributed systems reliability, and cloud-native observability. Find him at zakhassan.com or on LinkedIn.*
