Distributed Tracing in Production: Sampling, Tail Latency, and Making Traces Useful

Distributed tracing is the observability tool that most teams implement but few use to its potential. The initial setup — instrument services, emit spans, visualize in Jaeger or Tempo — is the easy part. The hard parts are sampling strategy (which requests do you trace, and how?), making traces actually useful during an incident, and evolving the tracing infrastructure as your system grows.

This is the deep dive on making distributed tracing a production reliability tool rather than a checkbox.

The Sampling Problem

You can't trace every request in a high-volume production system. The overhead of full-fidelity tracing — the CPU to record spans, the network to ship them, the storage to retain them — is significant at scale. A service processing 50,000 requests/second generating one span per request produces 4.3 billion spans per day. At even 1KB per span, that's 4.3TB of daily trace data. The economics are untenable.
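The volume figures above can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch: the function name is made up for illustration, and it assumes decimal units (1KB = 1,000 bytes, 1TB = 10^12 bytes) and exactly one span per request.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_trace_volume(requests_per_second, spans_per_request=1, bytes_per_span=1_000):
    """Return (spans per day, bytes per day) when every request is traced."""
    spans_per_day = requests_per_second * spans_per_request * SECONDS_PER_DAY
    return spans_per_day, spans_per_day * bytes_per_span


spans, size_bytes = daily_trace_volume(50_000)
print(f"{spans / 1e9:.2f} billion spans/day")  # 4.32 billion spans/day
print(f"{size_bytes / 1e12:.2f} TB/day")       # 4.32 TB/day
```

Real services usually emit several spans per request, so raising `spans_per_request` gives a more realistic (and larger) estimate.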

Sampling is the solution. The question is: which requests to sample, and how to decide?

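Head-based sampling makes the decision at the start of a request, before the outcome is known. The simplest form is random sampling: trace 1% of requests uniformly. The problem: the sampler cannot tell which requests will fail, so it keeps only 1% of the failures too. If your error rate is 0.01%, roughly one request in a million is both an error and traced, and any specific failure you need to investigate is 99% likely to have no trace at all. Important rare events fall below the sampling floor.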
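A minimal sketch of a head-based decision is shown below. It is a simplified illustration of the trace-ID-ratio idea, not the OpenTelemetry SDK's implementation; `head_sample` is a hypothetical name. The point it illustrates is that the only input is the trace ID, so the decision cannot take errors or latency into account.

```python
import random

_LOW_64_BITS = (1 << 64) - 1


def head_sample(trace_id: int, sample_rate: float) -> bool:
    """Decide at request start, using nothing but the trace ID.

    Comparing the low 64 bits of the ID with a threshold makes the decision
    deterministic: every service that sees the same trace ID agrees on it.
    """
    threshold = int(sample_rate * (1 << 64))
    return (trace_id & _LOW_64_BITS) < threshold


# Simulate 100,000 requests with random 128-bit trace IDs at a 1% rate.
rng = random.Random(0)
kept = sum(head_sample(rng.getrandbits(128), 0.01) for _ in range(100_000))
print(f"kept {kept} of 100000 traces")  # roughly 1% of them
```

Because the function never sees the response, a failing request has the same 1% chance of being kept as any other.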

Tail-based sampling makes the decision at the end of a request, after all spans have been collected. This is the right approach: you can prioritize tracing requests that were slow, erroring, or otherwise interesting — regardless of how rare they are.

The architecture of tail-based sampling:

Service → OTel Collector (buffer all spans) → Sampling Decision → Backend

The Collector buffers spans for a configurable window (e.g., 30 seconds).
At the end of the window, it evaluates the complete trace and decides:
- Errors: always keep (100% sample rate)
- P99+ latency: always keep
- Random sample of "boring" successful requests: 0.1%
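The decision rules listed above can be sketched as a single function that runs once a trace's spans have been buffered. This is an illustrative sketch, not the Collector's code: the `Span` class, `keep_trace`, and the idea of passing the current P99 in as `latency_threshold_ms` are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms, baseline_rate=0.001, rng=random.random):
    """Tail-based decision over the complete set of spans for one trace."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True  # errors: always keep
    # The longest span is used as a stand-in for the whole trace's duration.
    if max(s.duration_ms for s in spans) >= latency_threshold_ms:
        return True  # slow requests (at or above the threshold): always keep
    return rng() < baseline_rate  # boring successes: keep 0.1% at random


print(keep_trace([Span(12.0), Span(3.5, is_error=True)], latency_threshold_ms=1000))  # True
print(keep_trace([Span(1500.0)], latency_threshold_ms=1000))                          # True
```

Injecting `rng` keeps the random branch testable; in real use the default `random.random` is enough.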

OpenTelemetry Collector tail-based sampling configuration:

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s       # Wait up to 10s for all spans to arrive
    num_traces: 100000       # Buffer this many traces simultaneously
    expected_new_traces_per_sec: 10000
    policies:
      # Always trace errors
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      
      # Always trace slow requests (>1 second)
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 1000}
      
      # Always trace requests with a specific header (e.g., from QA team)
      - name: forced-sampling-policy
        type: string_attribute
        string_attribute:
          key: http.request.header.x-force-trace
          values: ["true"]
      
      # Baseline: random 1% of all traces (a trace is kept if any policy matches)
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

This configuration ensures you always have traces for the cases you care about most (errors, slow requests) while keeping storage costs manageable for the boring majority.

Trace Context Propagation: The Infrastructure That Makes Tracing Work

A distributed trace spans multiple services. For the spans to assemble into a coherent trace, every service must propagate the trace context — the trace ID and parent span ID — to downstream services.

The W3C TraceContext standard (traceparent header) is now the universal mechanism, and OTel handles propagation automatically for instrumented HTTP clients and servers. The failure mode to watch: uninstrumented services that receive a traceparent header and don't forward it. These services break the trace chain — you get a trace that shows service A calling something, then a separate disconnected trace for service B with no link to A.
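To make the propagation requirement concrete, here is a small sketch of what a service has to do with the traceparent header: parse it, then send a new header downstream with the same trace ID and its own span ID. The function and class names are hypothetical, and the parser is simplified to the version `00` layout (`version-traceid-parentid-flags`); in practice the OTel propagators do this for you.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse an incoming traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext, new_span_id: str) -> str:
    """Build the header to forward: same trace ID, this service's span ID."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(child_traceparent(ctx, "b7ad6b7169203331"))
```

A service that skips the second step starts a fresh trace ID downstream, which is exactly the broken chain described above.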

Audit your service graph for tracing gaps:

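# Script to identify services that receive traced requests but don't propagate
# context, by looking for traces that have incomplete paths.
# NOTE: tempo_client and the trace/span helpers used below stand for a
# hypothetical wrapper around your tracing backend's API, not a published library.

from collections import Counter
from datetime import datetime, timedelta, timezone

def find_tracing_gaps(tempo_client, time_range_hours=24):
    """
    Find service pairs where tracing context is dropped.
    A gap exists when service A's span shows a call to service B,
    but no corresponding span appears in service B with the same trace ID.
    """
    # Search from `time_range_hours` ago until now (assumes the client
    # accepts a datetime for `start`).
    start_time = datetime.now(timezone.utc) - timedelta(hours=time_range_hours)
    traces = tempo_client.search_traces(
        service="api-gateway",
        limit=1000,
        start=start_time,
    )

    gaps = []
    for trace in traces:
        for span in trace.spans:
            if span.has_outbound_calls():
                for call in span.outbound_calls:
                    # The caller recorded the call, but the callee emitted no span
                    if not trace.has_span_for_service(call.destination_service):
                        gaps.append({
                            "source": span.service_name,
                            "destination": call.destination_service,
                            "example_trace_id": trace.trace_id
                        })

    # Count gaps by service pair and return the 20 most frequent
    return Counter(
        (g["source"], g["destination"]) for g in gaps
    ).most_common(20)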

Making Traces Useful During Incidents

Traces are most valuable during incidents when you need to understand what a specific request did and why it failed. For traces to be useful in this context:

Correlation IDs that span systems. Your application logs, your metrics, and your traces need a common identifier so you can navigate between them during investigation. The trace ID serves this role — log it in every log line, surface it in error responses to users (as a request ID they can report).

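import json
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

def get_current_trace_id() -> str:
    # A span context can be valid even when the span is not recording
    # (for example, when it was sampled out), so check validity instead.
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        return format(ctx.trace_id, '032x')
    return "no-active-trace"

# Structured log with trace correlation
# (order_id and amount come from the surrounding request handler)
logger.info(json.dumps({
    "message": "Processing payment",
    "order_id": order_id,
    "trace_id": get_current_trace_id(),  # Links log to trace
    "amount_cents": amount
}))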
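Adding the trace ID by hand at every call site is easy to forget. One way to get it into every log line is a logging filter that stamps each record, sketched below with the standard library only. `TraceIdFilter`, `JsonFormatter`, and `build_logger` are hypothetical names, and the trace ID source is passed in as a callable so the sketch runs without a tracing SDK; in a real service that callable would be a function that reads the current span's trace ID.

```python
import io
import json
import logging


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""

    def __init__(self, trace_id_provider):
        super().__init__()
        self._provider = trace_id_provider

    def filter(self, record):
        record.trace_id = self._provider()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", "no-active-trace"),
        })


def build_logger(name, trace_id_provider, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter(trace_id_provider))
    logger.addHandler(handler)
    return logger


# Usage with a fixed ID standing in for the real provider.
buffer = io.StringIO()
log = build_logger("payments", lambda: "4bf92f3577b34da6a3ce929d0e0e4736", buffer)
log.info("Processing payment")
print(buffer.getvalue(), end="")
```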

Rich span attributes for post-hoc filtering. Spans should carry the attributes you'll want to filter on during investigation: user ID, tenant ID, order ID, feature flag values, cache hit/miss, database query hash. The span is your primary artifact for understanding what happened for a specific request.

tracer = trace.get_tracer(__name__)  # `trace` is opentelemetry.trace

with tracer.start_as_current_span("process_payment") as span:
    span.set_attributes({
        "payment.order_id": order_id,
        "payment.amount_cents": amount,
        "payment.currency": currency,
        "payment.processor": processor_name,
        "payment.retry_attempt": retry_count,
        "db.connection_pool.available": pool.available_connections,
    })

Span events for significant moments within a long span. If a span represents a multi-step operation, span events capture the timestamps of intermediate steps without requiring a child span for each one.

with tracer.start_as_current_span("batch_process") as span:
    span.add_event("validation_complete", attributes={"record_count": len(records)})
    
    results = process_records(records)
    span.add_event("processing_complete", attributes={"success_count": results.success_count})
    
    write_results(results)
    span.add_event("write_complete")

The Jaeger vs. Tempo Decision

Jaeger (CNCF project, originally from Uber) is one of the longest-established open-source distributed tracing backends. It's mature, well-documented, and has a capable UI for trace exploration. The operational overhead is meaningful: Jaeger has multiple components (collector, query, storage backend) that need to be operated and scaled.

Grafana Tempo is the newer entrant, designed for cost efficiency at scale. Tempo stores traces directly in object storage (S3, GCS) rather than requiring a specialized database, which dramatically reduces operational cost. The tradeoff: search can be slower than in a Jaeger deployment backed by an indexed store, because Tempo avoids maintaining a full index over span data. Looking up a trace by ID is fast; finding all traces for a service in a time window (for example, with a TraceQL query) means scanning the stored blocks for that period.

For new deployments, Tempo is the better choice if you're already running Grafana for metrics and logs — the unified UI (Grafana Explore with cross-datasource correlation) is genuinely valuable. Jaeger is the better choice if you have an existing Jaeger deployment and need maximum query flexibility.

Both are compatible with OpenTelemetry — switching backends is a Collector configuration change, not a re-instrumentation effort.

Tracing AI Agent Interactions

The GenAI semantic conventions for OTel (discussed in an earlier post) provide the schema. The specific value of tracing AI agents is capturing the multi-step reasoning chain as a trace:

Root span: agent.investigate_incident (duration: 3m 42s)
  ├── gen_ai.chat (LLM call 1 — initial analysis)
  │     duration: 2.3s, input_tokens: 1840, output_tokens: 312
  ├── gen_ai.tool.fetch_cloudwatch_logs (tool call 1)
  │     duration: 1.1s, log_group: /aws/lambda/payment, records_returned: 847
  ├── gen_ai.chat (LLM call 2 — after seeing logs)
  │     duration: 3.1s, input_tokens: 5200, output_tokens: 428
  ├── gen_ai.tool.get_recent_deployments (tool call 2)
  │     duration: 0.4s, service: payment-service, deployments_found: 2
  └── gen_ai.chat (LLM call 3 — final synthesis)
        duration: 2.8s, input_tokens: 6100, output_tokens: 892
        finish_reason: end_turn

This trace shows exactly what the agent did, how long each step took, what data it gathered, and how its token consumption accumulated. When the agent makes a wrong diagnosis, the trace shows you which tool call returned misleading data or which LLM call made the reasoning error.
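The token accumulation is easy to read off the trace programmatically. The sketch below hard-codes the three LLM spans from the example above and computes a running total; the `LlmCall` class and `cumulative_tokens` function are hypothetical, and in practice the values would come from span attributes returned by your tracing backend.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class LlmCall:
    label: str
    input_tokens: int
    output_tokens: int


# Values copied from the example trace above.
calls = [
    LlmCall("initial analysis", 1840, 312),
    LlmCall("after seeing logs", 5200, 428),
    LlmCall("final synthesis", 6100, 892),
]


def cumulative_tokens(llm_calls):
    """Running total of input + output tokens after each LLM call."""
    return list(accumulate(c.input_tokens + c.output_tokens for c in llm_calls))


for call, total in zip(calls, cumulative_tokens(calls)):
    print(f"{call.label}: in={call.input_tokens} out={call.output_tokens} running_total={total}")
# Running totals: 2152, 7780, 14772
```

The jump in input tokens between the first and second call (1840 to 5200) is the log data entering the context, which is the kind of growth this view makes visible.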

*Zak Hassan is a Staff SRE specializing in observability engineering, distributed systems, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*
