*By Zak Hassan — Staff SRE | May 2026*
Distributed tracing is the observability technique that makes microservice latency legible. In a monolith, a slow request is easy to profile — the call stack is right there. In a system where a single user-facing request fans out through 15 services, understanding which service is slow, why it's slow, and what upstream or downstream effects that slowness creates requires tracing.
But raw tracing data is easy to generate and hard to use. Most teams that instrument their services with OpenTelemetry end up with terabytes of trace data and limited ability to extract signal from it. This is the guide to making tracing actually useful — structuring spans correctly, sampling intelligently, and building analysis workflows that surface the latency problems that matter.
What a Trace Is and Isn't
A trace is a record of a single request's journey through a distributed system. It's composed of spans — each span represents one unit of work (a service processing the request, a database call, an external API call). Spans have a parent-child relationship that forms the trace tree.
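To make the parent-child structure concrete, here is a minimal sketch of a trace held as plain Python objects. The `Span` fields and the `build_trace_tree` and `render` helpers are hypothetical names chosen for illustration; they are not the OpenTelemetry data model or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    parent_id: Optional[str]  # None marks the root span of the trace
    start_ms: int
    duration_ms: int
    children: list["Span"] = field(default_factory=list)


def build_trace_tree(spans: list[Span]) -> Span:
    """Attach each span to its parent and return the root of the trace tree."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, depth-first, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in sorted(span.children, key=lambda c: c.start_ms):
        lines.extend(render(child, depth + 1))
    return lines
```

Calling `render(build_trace_tree(spans))` on a flat list of spans yields the same indented view a trace UI shows, which is all a trace tree really is: spans linked by parent ID.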
What traces are good at:
- Identifying which service in a call chain is adding latency
- Understanding call fan-out patterns (service A calls B, C, and D in parallel — which one is the bottleneck?)
- Correlating an individual user complaint with specific system behavior
- Debugging intermittent failures in specific code paths
What traces are not good at:
- Statistical analysis of latency across large populations (use metrics for that)
- Showing aggregate error rates (use metrics)
- Understanding why a problem is happening (traces show what happened, not why the system is configured that way)
The most effective observability stacks use traces as the drill-down tool after metrics identify that something is wrong.
Span Design: What to Instrument
The choice of what to make a span is the highest-leverage decision in tracing. Too coarse and the trace doesn't isolate latency. Too fine and the trace is noise.
Instrument at these boundaries:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("my-service")

# 1. Every inbound HTTP/gRPC request (auto-instrumented by most frameworks)
# If not auto-instrumented:
@app.route("/api/orders/<order_id>")
def get_order(order_id: str):
    with tracer.start_as_current_span("get_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", get_current_user_id())
        return _get_order_internal(order_id)

# 2. Every outbound call: database, cache, external API, internal service
def fetch_user(user_id: str) -> User:
    with tracer.start_as_current_span("db.fetch_user") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.operation", "SELECT")
        span.set_attribute("db.table", "users")
        span.set_attribute("user.id", user_id)
        try:
            result = db.query("SELECT * FROM users WHERE id = %s", user_id)
            span.set_attribute("db.rows_returned", len(result))
            return result
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise

# 3. Significant internal computations (>10ms expected duration)
def compute_recommendation_scores(user_features: dict) -> list:
    with tracer.start_as_current_span("recommendation.score_computation") as span:
        span.set_attribute("recommendation.feature_count", len(user_features))
        scores = _run_scoring_model(user_features)
        span.set_attribute("recommendation.candidates_scored", len(scores))
        return scores

# 4. Message queue publish/consume
def publish_order_event(order: Order):
    with tracer.start_as_current_span("kafka.publish") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "order-events")
        span.set_attribute("messaging.message_id", order.event_id)
        producer.send("order-events", order.to_bytes())

Span attributes that make traces searchable:
The value of a span isn't just its duration — it's the attributes you attach that let you filter and correlate later. At minimum: service name, version, environment, user/tenant ID (where applicable), the specific entity being operated on (order ID, product ID), and error details when errors occur.
Trace Context Propagation: The Most Common Failure Mode
Distributed tracing breaks when context doesn't propagate across service boundaries. If service A calls service B but doesn't pass the trace ID in the request headers, B starts a new trace — and the traces are unlinked. You can't see the full call chain.
Most tracing failures in production-like lab environments are propagation failures, not instrumentation failures.
# Correct: using auto-instrumented HTTP clients that handle propagation
import httpx
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

HTTPXClientInstrumentor().instrument()  # All httpx requests now propagate context

# Correct: explicit propagation for custom transports
from opentelemetry import propagate, trace

tracer = trace.get_tracer("my-service")

def call_internal_service(endpoint: str, payload: dict) -> dict:
    headers = {}
    propagate.inject(headers)  # Inject W3C trace context headers
    # headers now contains: traceparent (and tracestate, when set)
    response = custom_http_client.post(endpoint, json=payload, headers=headers)
    return response.json()

# Correct: async task queues, inject context into the job payload
def enqueue_processing_job(data: dict):
    carrier = {}
    propagate.inject(carrier)  # {"traceparent": "00-abc123...-def456...-01"}
    job_payload = {
        "data": data,
        "trace_context": carrier  # Carry the trace context through the queue
    }
    queue.enqueue(job_payload)

def process_job(job_payload: dict):
    # On the consumer side, extract the propagated context and start the
    # consumer span as a child of the producer's span
    ctx = propagate.extract(job_payload.get("trace_context", {}))
    with tracer.start_as_current_span("process_job", context=ctx):
        _process(job_payload["data"])

Propagation format: use W3C TraceContext (traceparent header) as the standard. If you have legacy services using B3 or Jaeger headers, configure the OTel SDK to accept multiple formats.
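When debugging a propagation failure, it helps to know what a traceparent header actually carries. The following is a minimal parsing sketch for version 00 of the W3C format, written with the standard library only; `TraceParent` and `parse_traceparent` are hypothetical helper names, and in real services the OTel propagators do this work for you.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
# Simplification: this only accepts the four-field layout used by version 00.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str        # 16 bytes, shared by every span in the trace
    parent_span_id: str  # 8 bytes, the caller's span ID
    sampled: bool        # lowest bit of the trace-flags byte


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

If service B logs the incoming header and `parse_traceparent` returns None, or the header is missing entirely, the break is on the caller's side; if the header parses but B's spans carry a different trace ID, the break is in B's extraction.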
Sampling: The Strategy That Determines What You Can Learn
You cannot store every trace in a high-traffic system — a service handling 10,000 requests per second generates 10,000 traces per second, and storing all of them is prohibitively expensive. Sampling is the decision of which traces to keep.
Head sampling makes the decision at the start of the trace, before any service has processed the request. Simple to implement; misses rare errors on non-sampled requests.
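A minimal sketch of a ratio-based head sampler, assuming the decision is derived deterministically from the trace ID so that every service reaches the same verdict for the same trace. The function name `should_sample` is hypothetical, and this is similar in spirit to ratio-based samplers in tracing SDKs rather than a copy of any one of them.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))
```

Note what the function does not take as input: status, latency, or anything else about how the request turned out. That is why a head sampler at 1% also discards roughly 99% of your error traces.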
Tail sampling makes the decision after the trace completes — letting you keep all error traces, all slow traces, and a random sample of everything else. Requires a central component (the OTel Collector) that buffers spans until the trace is complete.
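The decision a tail sampler makes once a trace is complete can be sketched in a few lines. `CompletedTrace` and `keep_trace` are hypothetical names, and the 2000 ms threshold and 1% baseline are illustrative assumptions rather than recommendations.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompletedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: CompletedTrace,
    slow_threshold_ms: float = 2000.0,  # assumed threshold for "slow"
    baseline_ratio: float = 0.01,       # assumed random baseline for healthy traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep every error trace, every slow trace, and a random slice of the rest."""
    if trace.has_error:
        return True
    if trace.duration_ms > slow_threshold_ms:
        return True
    return rng() < baseline_ratio
```

The catch is that `keep_trace` needs the whole trace, which is why something has to buffer spans from every service until the trace is finished.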
# Tail sampling policy in the OTel Collector
processors:
  tail_sampling:
    decision_wait: 10s                 # Wait up to 10s for all spans to arrive
    num_traces: 50000                  # Buffer up to 50k traces in memory
    expected_new_traces_per_sec: 1000
    policies:
      # Top-level policies are evaluated independently; a trace is kept if any one matches

      # Always keep error traces
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow traces (>2 seconds end-to-end)
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000

      # Always keep traces from specific high-value users
      - name: keep-vip-users
        type: string_attribute
        string_attribute:
          key: user.tier
          values: ["enterprise", "vip"]

      # Random baseline sample of all traces
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 1       # Keep a random 1% as a baseline

      # Composite: evaluate sub-policies in order and divide a total span budget among them.
      # NOTE: the contrib tail_sampling processor expects the policies named in policy_order
      # to be declared under composite_sub_policy inside this block; check the processor
      # README for your Collector version before deploying this part.
      - name: composite
        type: composite
        composite:
          max_total_spans_per_second: 5000
          policy_order: [keep-errors, keep-slow, keep-vip-users, probabilistic-sample]
          rate_allocation:
            - policy: keep-errors
              percent: 30
            - policy: keep-slow
              percent: 30
            - policy: keep-vip-users
              percent: 10
            - policy: probabilistic-sample
              percent: 30

Trace Analysis: Finding the Slow Service
The trace UI (Jaeger, Tempo, Datadog APM) is good for investigating specific incidents. But systematic latency analysis requires querying across many traces.
The critical path problem: in a trace with parallel calls, the total latency is determined by the longest parallel branch — the critical path. A trace where service A calls B and C in parallel, B takes 50ms and C takes 200ms, has a critical path through C. Optimizing B doesn't help.
def find_critical_path(trace: dict) -> list[dict]:
    """
    Given a trace (as returned by the Jaeger API), approximate the critical path:
    the chain of spans that determines the total trace duration.

    Heuristic: starting at the root, repeatedly follow the child span that
    finishes last, since that is the child its parent had to wait for.
    """
    # Index children by parent span ID once instead of rescanning every span
    children_by_parent: dict[str, list[dict]] = {}
    for span in trace['spans']:
        for ref in span.get('references', []):
            children_by_parent.setdefault(ref['spanID'], []).append(span)

    def get_end_time(span: dict) -> int:
        return span['startTime'] + span['duration']

    # Find root span (the one with no parent reference)
    root_span = next(s for s in trace['spans'] if not s.get('references'))

    critical_path = [root_span]
    current = root_span
    while children_by_parent.get(current['spanID']):
        current = max(children_by_parent[current['spanID']], key=get_end_time)
        critical_path.append(current)
    return critical_path

# Aggregate: which service spans appear most often on the critical path?
def aggregate_critical_path_analysis(traces: list[dict]) -> dict:
    service_critical_path_count = {}
    for trace in traces:
        # Raw Jaeger API responses keep service names in a trace-level
        # 'processes' map keyed by each span's 'processID'
        processes = trace.get('processes', {})
        for span in find_critical_path(trace):
            process = span.get('process') or processes.get(span.get('processID'), {})
            service = process.get('serviceName', 'unknown')
            service_critical_path_count[service] = service_critical_path_count.get(service, 0) + 1
    return dict(sorted(service_critical_path_count.items(), key=lambda x: x[1], reverse=True))

Trace-based SLO validation: for each user-facing operation, use traces to validate that latency SLOs are being met end-to-end:
import time

import requests

# Query Tempo for traces over SLO threshold
# TEMPO_URL is assumed to be defined elsewhere (for example, loaded from config)
def get_traces_exceeding_slo(operation: str, slo_ms: int, lookback_minutes: int = 60) -> list:
    # Using Tempo HTTP API; start and end are Unix epoch seconds
    now = int(time.time())
    params = {
        "tags": f"operation={operation}",
        "minDuration": f"{slo_ms}ms",
        "start": now - lookback_minutes * 60,
        "end": now,
        "limit": 100
    }
    response = requests.get(f"{TEMPO_URL}/api/search", params=params, timeout=10)
    response.raise_for_status()
    return response.json().get("traces", [])

The Exemplar Bridge: Connecting Metrics to Traces
The most powerful observability workflow connects metrics to traces: a latency histogram shows a spike, you click on it, and it takes you to a sample trace from that time period. This requires exemplars — trace IDs embedded in metric data points.
import time

from opentelemetry import trace
from prometheus_client import Histogram

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint', 'status_code'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

def handle_request(method: str, endpoint: str):
    start = time.time()
    try:
        response = _process_request()
        duration = time.time() - start

        # Get current span for exemplar
        current_span = trace.get_current_span()
        span_context = current_span.get_span_context()

        # Record with exemplar: Prometheus will store the trace ID alongside.
        # Exemplars are only exposed through the OpenMetrics exposition format.
        request_duration.labels(
            method=method,
            endpoint=endpoint,
            status_code=str(response.status_code)
        ).observe(duration, exemplar={
            "traceID": format(span_context.trace_id, '032x'),
            "spanID": format(span_context.span_id, '016x')
        })
        return response
    except Exception as e:
        # ... error handling
        raise

With exemplars in place, Grafana can display "sample trace" links on any histogram panel. Clicking a latency spike jumps directly to a representative trace from that exact time window.
*Zak Hassan is a Staff SRE specializing in observability engineering, distributed systems tracing, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*