Linux Performance Engineering: eBPF, Profiling, and Finding the Real Bottleneck

Most performance investigations start at the wrong layer. CPU high? Scale horizontally. Memory high? Increase instance size. Latency high? Add a cache. These interventions sometimes work, often mask the real problem, and occasionally make things worse. The engineers who consistently diagnose and fix performance problems correctly are the ones who know how to look at what the system is actually doing — at the OS level, where the real bottleneck lives.

This is the Linux performance engineering toolkit for SREs: the tools, the methodology, and the failure modes that require OS-level investigation to understand.

The USE Method: Starting with a Framework

Before reaching for profiling tools, apply the USE Method (Utilization, Saturation, Errors) systematically. For every resource in your system (CPU, memory, disks, network interfaces, any other hardware), answer:

Utilization: What percentage of the resource's capacity is being used?
Saturation: Is anything waiting for this resource? (Queue depth, wait time)
Errors: Are there errors from this resource?

This forces you to check the obvious before diving into profiling. A disk at 95% utilization with a write queue of 50 outstanding I/Os is your bottleneck — you don't need a flame graph to tell you that.

# Quick USE method snapshot on Linux

# CPU utilization + saturation (run queue)
vmstat 1 5
# r column = runnable tasks (saturation when consistently above the CPU count); us + sy columns = utilization

# Memory utilization + saturation (swapping)
free -h
vmstat 1 5
# si/so columns = swap in/out (if non-zero: memory saturation)

# Disk utilization + saturation
iostat -xz 1 5
# %util column = utilization; aqu-sz (avgqu-sz in older sysstat) = queue depth, r_await/w_await = average latency in ms (saturation indicators)

# Network utilization
sar -n DEV 1 5
# rxkB/s and txkB/s vs interface capacity

# Errors (all hardware)
dmesg | grep -i error | tail -50
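
The CPU part of this checklist can be scripted. The following is a minimal sketch, assuming the standard `vmstat` column names (`r`, `us`, `sy`); the function name `summarize_vmstat` and the sample output are illustrative, not taken from a real host. It skips the first sample, which reports averages since boot.

```python
from statistics import mean


def summarize_vmstat(output: str, cpu_count: int) -> dict:
    """Summarize CPU utilization and saturation from `vmstat 1 N` text."""
    rows = [line.split() for line in output.strip().splitlines()]
    # The column header is the row that names both "r" and "us".
    header = next(row for row in rows if "r" in row and "us" in row)
    col = {name: index for index, name in enumerate(header)}
    data = [row for row in rows if row and row[0].isdigit()]
    # The first data row is the since-boot average, so drop it when possible.
    samples = data[1:] or data
    run_queue = mean(int(row[col["r"]]) for row in samples)
    busy = mean(int(row[col["us"]]) + int(row[col["sy"]]) for row in samples)
    return {
        "cpu_utilization_pct": busy,
        "avg_run_queue": run_queue,
        # Saturation: more runnable tasks than CPUs to run them on.
        "cpu_saturated": run_queue > cpu_count,
    }


# Illustrative vmstat output (made-up numbers).
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 402112    0    0     5    12  210  400  5  1 94  0  0
 6  0      0 801200  10240 402300    0    0     0    40  900 2100 70 20 10  0  0
 8  0      0 799800  10240 402310    0    0     0    36  950 2300 75 15 10  0  0
"""

result = summarize_vmstat(SAMPLE, cpu_count=4)
print(result)
```

On a live host, the text would come from running `vmstat 1 5` and the CPU count from `os.cpu_count()`.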

eBPF: The Modern Linux Observability Layer

Extended Berkeley Packet Filter (eBPF) is the technology that has transformed Linux observability over the past five years. eBPF lets you run sandboxed programs inside the Linux kernel, attaching to virtually any kernel function or system call to observe what's happening — without modifying the application or the kernel.

The operational power: you can answer questions like "which processes are making the most system calls?", "which kernel path is causing my network latency?", "which database queries are calling read() most often?" — in production, without a code change, without a restart.

BCC tools and bpftrace are the practical entry points:

# Install BCC tools (Ubuntu); the tools need root and the headers for the running kernel
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)

# Which processes are making the most system calls? (-P aggregates by process; use -p <PID> to break one process down by syscall)
syscount-bpfcc -P -d 10 | head -20

# Summarize disk I/O latency (1-second intervals, 10 outputs)
biolatency-bpfcc 1 10
# Shows a histogram of disk I/O latency; look for a tail above ~10ms

# Which files is a process reading most often? (-p takes a single PID; pgrep -o picks the oldest match)
filetop-bpfcc -p "$(pgrep -o my-service)" 1 5

# Profile CPU usage by stack trace (flame graph data); -f emits folded stacks
profile-bpfcc -F 99 -af 30 > /tmp/cpu-profile.folded
# Then render it: flamegraph.pl /tmp/cpu-profile.folded > cpu-flamegraph.svg

# Trace TCP session lifespans
tcplife-bpfcc | head -50
# Shows connection duration and bytes transferred per connection (tcpconnlat-bpfcc measures connect latency)

bpftrace for custom one-liners:

# How long do read() system calls from postgres processes take?
# (Syscall tracepoints are stable across kernels; kprobe:sys_read no longer resolves on recent ones.)
bpftrace -e '
  tracepoint:syscalls:sys_enter_read /comm == "postgres"/ { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }
  interval:s:10 { print(@latency_us); clear(@latency_us); }
'

# Which destinations are processes connecting to most often? (IPv4 only; assumes a kernel with BTF type information)
bpftrace -e '
  kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    @[comm, ntop(2, $sk->__sk_common.skc_daddr)] = count();
  }
  interval:s:5 { print(@); clear(@); }
'
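
The `hist()` call in the read() latency one-liner above groups values into power-of-two buckets. The following is a rough Python sketch of that bucketing, to help with reading the output; the helper names `log2_bucket` and `log2_histogram` and the sample latencies are made up for illustration, and bpftrace's exact labels may differ slightly.

```python
from collections import Counter


def log2_bucket(value: int) -> tuple[int, int]:
    """Return the [low, high) power-of-two bucket that contains value."""
    if value < 0:
        raise ValueError("latencies must be non-negative")
    if value == 0:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return (low, low * 2)


def log2_histogram(values) -> dict:
    """Count values per power-of-two bucket, ordered by bucket."""
    counts = Counter(log2_bucket(v) for v in values)
    return dict(sorted(counts.items()))


# Illustrative read() latencies in microseconds.
latencies_us = [3, 5, 6, 900, 1500, 1800]
histogram = log2_histogram(latencies_us)
for (low, high), count in histogram.items():
    print(f"[{low}, {high})".ljust(16), "@" * count)
```

Each bucket is twice as wide as the one before it, so the histogram covers microseconds to seconds in a few dozen rows. A second cluster of counts far from the main one is usually the interesting part.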

CPU Profiling: Finding Hot Code Paths

When CPU utilization is high and you need to know why, CPU profiling identifies which functions are consuming the most time.

perf for sampling profiling:

# Record CPU samples from my-service for 30 seconds (replace -p with -a to sample the whole system)
perf record -F 99 -p "$(pgrep -d, my-service)" --call-graph dwarf -- sleep 30

# Generate report
perf report --no-children

# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

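A flame graph visualizes sampled CPU time by call stack. The y-axis is stack depth; the x-axis is the share of samples, not the passage of time. The width of each box is proportional to the time spent in that function (including its callees). The widest boxes along the top edge are the functions that were most often running on-CPU, so they are your hottest code paths.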

Interpreting flame graphs:

Wide boxes near the top: these functions (and everything they call) consume significant CPU. Investigate whether they're doing necessary work.

Flat top: many narrow frames along the top edge, with none standing out. This indicates CPU is distributed across many code paths, with no single hotspot. Optimization requires reducing overall work, not targeting one function.

Tall, narrow stacks: deep call chains with little time at the leaves. May indicate recursive functions or unusual call patterns.
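
The distinction between a function's own time and the time of everything it calls can be computed directly from folded stacks, the `frame;frame;frame count` format that stackcollapse-perf.pl produces. The following is a minimal sketch; the function name `frame_costs` and the sample stacks are made up for illustration.

```python
from collections import Counter


def frame_costs(folded: str):
    """Return (inclusive, self_only, total) sample counts from folded stacks."""
    inclusive, self_only = Counter(), Counter()
    total = 0
    for line in folded.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        samples = int(count)
        frames = stack.split(";")
        total += samples
        # The last frame is the function that was actually on-CPU.
        self_only[frames[-1]] += samples
        # set() avoids double counting recursive frames in one stack.
        for frame in set(frames):
            inclusive[frame] += samples
    return inclusive, self_only, total


# Illustrative folded stacks (made-up function names and counts).
SAMPLE = """
main;handle_request;parse_json 40
main;handle_request;query_db 25
main;handle_request 5
main;gc 30
"""

inclusive, self_only, total = frame_costs(SAMPLE)
for name, samples in self_only.most_common(3):
    print(f"{name}: {100 * samples / total:.0f}% self, "
          f"{100 * inclusive[name] / total:.0f}% inclusive")
```

Inclusive counts correspond to box width in the flame graph; self counts correspond to the part of a box with nothing stacked on top of it. Here `handle_request` is wide (70% inclusive) but spends only 5% in its own code, so the work worth examining is in `parse_json` and `query_db`.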

Memory Leak Investigation

Memory leaks in production are insidious: memory grows slowly, the process eventually exhausts what is available, and the OOM killer terminates it (sometimes killing other processes on the host along the way).

valgrind for leak detection (testing, not production):

valgrind --leak-check=full --track-origins=yes \
  --log-file=valgrind-output.txt \
  ./my-service

In production, track memory growth over time:

# Watch RSS (resident set size) growth over time; the PID list is resolved once, when watch starts
watch -n 5 "ps -o pid,rss,vsz,comm -p $(pgrep -d, my-service)"

# Summarize the detailed memory map (smaps values are in kB; pgrep -o picks the oldest matching process)
awk '
  /^Rss:/ {rss+=$2}
  /^Private_Dirty:/ {dirty+=$2}
  END {print "RSS:", rss/1024, "MB | Private Dirty:", dirty/1024, "MB"}
' "/proc/$(pgrep -o my-service)/smaps"
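
Watching RSS by eye makes slow leaks easy to miss. One option is to sample it periodically and fit a trend line. The following is a sketch under stated assumptions: the helper names `read_rss_kb` and `rss_growth_kb_per_hour` are hypothetical, the sample values are made up, and a least-squares slope is only one reasonable way to summarize growth.

```python
def read_rss_kb(pid: int) -> int:
    """Read the VmRSS value (in kB) from /proc/<pid>/status on Linux."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise ValueError(f"no VmRSS entry for pid {pid}")


def rss_growth_kb_per_hour(samples: list[tuple[float, int]]) -> float:
    """Least-squares slope of RSS (kB) over time (seconds), scaled to kB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_rss = sum(rss for _, rss in samples) / n
    covariance = sum((t - mean_t) * (rss - mean_rss) for t, rss in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    return covariance / variance * 3600


# Illustrative (timestamp_seconds, rss_kb) samples taken 10 minutes apart.
samples = [(0, 100_000), (600, 101_000), (1200, 102_000), (1800, 103_000)]
growth = rss_growth_kb_per_hour(samples)
print(f"RSS is growing by about {growth:.0f} kB/hour")
```

A steady positive slope over many hours under flat traffic is a stronger leak signal than any single RSS reading, since caches and allocator behavior make RSS noisy in the short term.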

For Python services: tracemalloc for runtime leak analysis:

import os
import tracemalloc

def start_memory_tracing():
    tracemalloc.start(10)  # Keep 10 frames of traceback

def report_memory_usage(top_n: int = 10):
    snapshot = tracemalloc.take_snapshot()
    stats = snapshot.statistics('lineno')
    
    print(f"\nTop {top_n} memory allocations:")
    for stat in stats[:top_n]:
        frame = stat.traceback[0]
        filename = os.path.basename(frame.filename)
        print(f"  {stat.size / 1024:.1f} KB — {filename}:{frame.lineno}")
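
A single snapshot shows what is allocated now. To find what is growing, compare two snapshots taken some time apart. The following is a minimal sketch using `Snapshot.compare_to` from the standard library; the helper name `top_growth` and the simulated leak are illustrative assumptions.

```python
import tracemalloc


def top_growth(before, after, top_n=5):
    """Return the source lines whose allocated bytes grew most between snapshots."""
    # compare_to sorts by the absolute size difference, largest first.
    diffs = after.compare_to(before, "lineno")
    return [diff for diff in diffs if diff.size_diff > 0][:top_n]


tracemalloc.start(10)
before = tracemalloc.take_snapshot()

# Simulated leak (hypothetical): roughly 1 MB that is never released.
leaky_cache = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

growth = top_growth(before, after)
for diff in growth:
    frame = diff.traceback[0]
    print(f"+{diff.size_diff / 1024:.1f} KB  {frame.filename}:{frame.lineno}")
```

In a long-running service, the two snapshots would typically be taken minutes or hours apart, for example from a debug endpoint or a signal handler. Lines that keep appearing at the top across several comparisons are the leak candidates.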

Network Performance Debugging

Network problems manifest as elevated latency, packet loss, or connection failures. The layered debugging approach:

Layer 1: Is the problem local to this host?

# Check for network errors on the interface
ip -s link show eth0
# Look for: errors, dropped, overrun. Counters are cumulative, so any value that keeps increasing is concerning

# Check for connection tracking table exhaustion
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max
# If count approaches max, new connections get dropped: raise nf_conntrack_max or shorten the conntrack timeouts
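
The conntrack check is easy to automate, since both values are exposed as files under /proc. The following is a small sketch; the 80% warning threshold and the function names are assumptions for illustration, not kernel recommendations.

```python
from pathlib import Path

CONNTRACK_DIR = Path("/proc/sys/net/netfilter")
WARN_RATIO = 0.8  # hypothetical alert threshold


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the connection tracking table in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_conntrack_usage(base: Path = CONNTRACK_DIR) -> float:
    """Read the live counters; requires the nf_conntrack module to be loaded."""
    count = int((base / "nf_conntrack_count").read_text())
    maximum = int((base / "nf_conntrack_max").read_text())
    return conntrack_usage(count, maximum)


# Illustrative values rather than a live reading.
usage = conntrack_usage(230_000, 262_144)
if usage >= WARN_RATIO:
    print(f"conntrack table {usage:.0%} full")
```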

Layer 2: Is this a TCP-level problem?

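# TCP connection state summary
ss -s
# Count sockets in CLOSE-WAIT (ss -s does not break this state out)
ss -Htan state close-wait | wc -l
# TIME-WAIT accumulation is normal; CLOSE-WAIT accumulation usually means the application is not closing sockets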

# Detailed TCP statistics (counters are cumulative since boot)
netstat -s | grep -iE "retrans|failed|timeout|reset"
# Rising retransmits usually indicate packet loss; resets may indicate firewall issues
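
Because these counters are cumulative, a rate is more useful than a raw number. The following sketch computes the share of retransmitted segments between two readings of the `Tcp:` lines in /proc/net/snmp, using the `OutSegs` and `RetransSegs` fields. The function names and the sample counter values are made up for illustration.

```python
def tcp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the two 'Tcp:' lines (names, then values) from /proc/net/snmp."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.strip().startswith("Tcp:")]
    if len(rows) != 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = rows
    return {name: int(value) for name, value in zip(names[1:], values[1:])}


def retransmit_ratio(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of segments sent between two readings that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


HEADER = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
          "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
          "InErrs OutRsts InCsumErrors")
# Illustrative readings taken some seconds apart (made-up values).
FIRST = HEADER + "\nTcp: 1 200 120000 -1 5000 3000 12 40 85 990000 1000000 1200 0 55 0"
SECOND = HEADER + "\nTcp: 1 200 120000 -1 5040 3020 12 41 88 999000 1010000 1500 0 56 0"

ratio = retransmit_ratio(tcp_counters(FIRST), tcp_counters(SECOND))
print(f"retransmit ratio: {ratio:.1%}")
```

On a live host, each reading would come from `open("/proc/net/snmp").read()`. What counts as too high depends on the network, so compare the result against the same host's baseline.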

Layer 3: Is this a routing/latency problem?

# Trace path to destination with latency at each hop
mtr --report --report-cycles 10 database.internal.example.com

# Measure actual round-trip time distribution (intervals below 0.2s may require root)
ping -c 100 -i 0.2 database.internal.example.com | tail -5
# Look at: min/avg/max/mdev. A high mdev (deviation) relative to avg means inconsistent latency
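
To track latency consistency over time, the ping summary line can be parsed and reduced to one number. The following is a sketch assuming the iputils summary format (`rtt min/avg/max/mdev = ... ms`); the function names and the sample line are illustrative.

```python
import re

# Matches the iputils summary, e.g. "rtt min/avg/max/mdev = 0.3/0.4/1.2/0.1 ms".
_SUMMARY = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")


def parse_ping_summary(output: str) -> dict[str, float]:
    """Extract min, avg, max and mdev (all in ms) from ping output."""
    match = _SUMMARY.search(output)
    if match is None:
        raise ValueError("no rtt summary line found")
    keys = ("min", "avg", "max", "mdev")
    return dict(zip(keys, map(float, match.groups())))


def jitter_ratio(stats: dict[str, float]) -> float:
    """Deviation relative to the average; higher means less consistent latency."""
    return stats["mdev"] / stats["avg"] if stats["avg"] > 0 else 0.0


# Illustrative summary line (made-up numbers).
sample = "rtt min/avg/max/mdev = 0.321/0.412/1.203/0.150 ms"
stats = parse_ping_summary(sample)
print(stats, round(jitter_ratio(stats), 2))
```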

The Performance Investigation Mindset

The most important habit in performance engineering: measure before optimizing. The instinct to add a cache, increase threads, or change an algorithm is often wrong. Measurement tells you whether the bottleneck is CPU, memory, I/O, or network — and the optimization that fixes one often makes another worse.

The second most important habit: understand the baseline. A system running at "100% CPU" that was designed to run at 70% is in trouble. A system running at "100% CPU" that was designed for 100% is behaving as designed. You need the baseline to know whether the current utilization is abnormal.

The third: find the actual critical path. The function that's slow may not be the function that needs optimization — it may be slow because it's being called unnecessarily often, or because the data it operates on is structured inefficiently. Understanding why the bottleneck exists is as important as knowing where it is.

*Zak Hassan is a Staff SRE specializing in Linux performance engineering, distributed systems reliability, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*

