OpenTelemetry Collector in Production: Pipeline Design, Routing, and Cost Control

The OpenTelemetry Collector is the component that most production OTel deployments underinvest in. Teams instrument their services correctly, then pipe the data directly to a backend with a one-liner Collector config, and later discover they can't filter expensive data, can't route to multiple backends, can't transform attributes before storage, and have no way to control costs as telemetry volume grows.

The Collector is where your observability pipeline lives — and a well-designed pipeline gives you enormous flexibility. Here's how to design it for production.

The Collector Architecture

The Collector has three kinds of pipeline components: receivers (ingest telemetry from services), processors (transform, filter, sample), and exporters (send to backends). These assemble into pipelines, each handling one signal type (traces, metrics, or logs), and processors run in the order they are listed:

# otel-collector-config.yaml — production pipeline structure
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  
  # Also receive from existing Prometheus endpoints
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod

processors:
  # Add resource attributes (cluster, environment) to everything
  resource:
    attributes:
      - key: deployment.environment
        value: "production"
        action: insert
      - key: k8s.cluster.name
        # from_attribute copies another attribute; an environment variable needs ${env:...}
        value: "${env:KUBE_CLUSTER_NAME}"
        action: insert

  # Batch telemetry before exporting (reduces network overhead)
  batch:
    send_batch_size: 10000
    timeout: 10s

  # Memory limiter prevents OOM when backends are slow
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15

  # Tail sampling (covered in tracing post)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

exporters:
  # Primary backend
  otlp/datadog:
    endpoint: https://trace.agent.datadoghq.com
    headers:
      DD-API-KEY: "${DD_API_KEY}"
  
  # Long-term archive (cheap S3 storage)
  otlphttp/s3:
    endpoint: https://your-s3-otel-receiver.example.com
  
  # Prometheus metrics to Grafana
  prometheusremotewrite:
    endpoint: https://your-grafana-cloud.grafana.net/api/prom/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/datadog, otlphttp/s3]
    
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/datadog]

The Fanout Pattern: Multiple Backends from One Pipeline

One of the Collector's key advantages: you can send the same telemetry to multiple backends simultaneously. This enables:

Primary + archive. Send traces to your primary observability platform (Datadog, Grafana) for interactive investigation, and simultaneously to cheap S3 storage for long-term retention and compliance.

Gradual backend migration. When migrating from one observability platform to another, route to both simultaneously. Validate that the new backend receives correct data before cutting over.

Team-specific routing. Route traces from service A to the backend owned by team A's tooling, and traces from service B to team B's backend — all from a single Collector deployment.

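# Routing by service name with the routing processor (contrib).
# Recent contrib releases deprecate this processor in favor of the routing connector.
processors:
  routing:
    from_attribute: service.name
    attribute_source: resource   # service.name is a resource attribute; the default source is the request context
    table:
      - value: "payment-service"
        exporters: [otlp/datadog, otlphttp/security-siem]
      - value: "fraud-detection"
        exporters: [otlp/datadog, otlphttp/security-siem]
    default_exporters: [otlp/datadog]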
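
The same service-based split can also be expressed with the routing connector, which sits between pipelines instead of inside one. The following is a minimal sketch, not a drop-in config: the pipeline names (traces/in, traces/security, traces/default) and the SIEM endpoint are placeholders invented for illustration, and the exact connector options should be checked against the contrib release you run.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  routing:
    # Anything that matches no table entry goes here
    default_pipelines: [traces/default]
    table:
      # Statements are evaluated against resource attributes
      - statement: route() where attributes["service.name"] == "payment-service"
        pipelines: [traces/security]
      - statement: route() where attributes["service.name"] == "fraud-detection"
        pipelines: [traces/security]

exporters:
  otlp/datadog:
    endpoint: https://trace.agent.datadoghq.com
    headers:
      DD-API-KEY: "${DD_API_KEY}"
  otlphttp/security-siem:
    endpoint: https://siem.example.com   # placeholder endpoint

service:
  pipelines:
    # Ingest pipeline: exports into the connector
    traces/in:
      receivers: [otlp]
      exporters: [routing]
    # Security-sensitive services fan out to both backends
    traces/security:
      receivers: [routing]
      exporters: [otlp/datadog, otlphttp/security-siem]
    traces/default:
      receivers: [routing]
      exporters: [otlp/datadog]
```

Because each route targets a whole pipeline, per-team processors (extra redaction for the SIEM copy, for example) can be attached to one route without affecting the others.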

Attribute Filtering for Cost Control

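Telemetry cost scales with volume, and for metrics, volume is driven by cardinality: every distinct combination of attribute values becomes its own time series. A request counter with 5 methods, 10 status codes, and 50 routes tops out at 2,500 series; add a user.id attribute with 100,000 users and the ceiling becomes 250 million. Attributes that seem useful during development often produce expensive high-cardinality metrics in production.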

The attributes processor lets you control which attributes are included, excluded, or transformed before reaching your backend:

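processors:
  # Metrics pipeline: remove high-cardinality or sensitive attributes before storage
  attributes/metrics:
    actions:
      - key: http.url          # Full URL may contain PII or query params
        action: delete
      - key: user.id           # PII and unbounded cardinality
        action: delete

  # Traces pipeline: hash user IDs rather than removing them
  # (keeps per-user correlation without storing the raw ID)
  attributes/traces:
    actions:
      - key: http.url
        action: delete
      - key: user.id
        action: hash

  # The attributes processor has no truncate action; the transform processor does this.
  # truncate_all caps every string attribute on the span, db.statement included, at 500 chars.
  transform/truncate:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 500)

  # Add derived attributes from existing ones. extract reads `key` and writes one
  # attribute per named capture group. service.name is a resource attribute, so this
  # runs in the resource processor.
  resource/tier:
    attributes:
      - key: service.name
        action: extract
        pattern: '^(?P<service_tier>payment|order|checkout)-.*'
        # Sets service_tier="checkout" for "checkout-service", etc.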

Metric cardinality control with the metricstransform processor:

processors:
  metricstransform:
    transforms:
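      # Remove a high-cardinality label by aggregating it away before export
      - include: http_requests_total
        match_type: strict
        action: update
        operations:
          # aggregate_labels keeps only the labels in label_set and sums over the rest.
          # delete_label_value would instead drop data points carrying one specific value.
          - action: aggregate_labels
            label_set: [method, status_code]   # labels to keep (illustrative names); user_id is left out
            aggregation_type: sum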
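
Trimming attributes shrinks each data point; dropping whole spans that nobody queries removes the data point entirely. One way to do that is the filter processor from the contrib distribution. This is a sketch under assumptions: the /healthz and /readyz routes and the http.route attribute are examples, so substitute whatever your services actually emit.

```yaml
processors:
  filter/noise:
    error_mode: ignore        # skip a span on evaluation errors instead of failing the batch
    traces:
      span:
        # A span is dropped if ANY condition matches
        - attributes["http.route"] == "/healthz"
        - attributes["http.route"] == "/readyz"

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Filter early so later processors never pay for spans that get dropped
      processors: [memory_limiter, filter/noise, batch]
      exporters: [otlp/datadog]
```

Dropping individual spans can leave child spans without a parent, so this works best for leaf or single-span traces such as health checks.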

The Collector Deployment Model: Agent vs. Gateway

Agent mode: One Collector per node (deployed as a DaemonSet in Kubernetes). Services send telemetry to the local Collector agent over localhost. The agent handles local processing and forwards to a central gateway or directly to backends.

Gateway mode: A centralized Collector deployment (a Deployment in Kubernetes) that receives from all agents and handles the processing that needs a global view or significant resources (tail sampling, routing, heavy transformations).

The production pattern: agent + gateway in combination.

Services
  ↓ (send to localhost:4317)
Agent Collector (DaemonSet)
  - Add k8s resource attributes (pod name, namespace, node)
  - Basic batching
  - Send to gateway
  ↓
Gateway Collector (Deployment, scaled horizontally)
  - Tail sampling (requires seeing full trace — needs centralized view)
  - Routing to multiple backends
  - Attribute filtering for cost control
  - Export to backends
  ↓
Observability Backends

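Tail sampling belongs in the gateway because the sampler needs every span of a trace to make its decision. An agent sees only the spans emitted on its own node, so it cannot decide for the whole trace. The same constraint applies once the gateway is scaled horizontally: all spans of a trace must reach the same gateway instance, which usually means routing by trace ID in front of the sampling tier.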
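
A sketch of the agent side of this layout follows. It assumes the contrib distribution (for k8sattributes and the loadbalancing exporter) and a headless Kubernetes Service in front of the gateway pods; the hostname is a placeholder, and plaintext in-cluster traffic is an assumption you may not share.

```yaml
# Agent (DaemonSet) config sketch
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  # Adds pod, namespace, and node metadata; needs RBAC permission to read pods
  k8sattributes:
  batch:

exporters:
  loadbalancing:
    # Spans with the same trace ID always go to the same gateway instance,
    # so tail sampling there sees complete traces
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: plaintext inside the cluster
    resolver:
      dns:
        # Placeholder: headless Service that resolves to every gateway pod
        hostname: otel-gateway-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [loadbalancing]
```

When gateway pods are added or removed, the trace-to-instance mapping shifts, so traces in flight during a scaling event may be sampled on partial data.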

Collector Health Monitoring

The Collector exposes its own metrics and health endpoints. Monitor these:

# Enable internal metrics exposure
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888    # Scrape this with Prometheus (newer releases configure this under `readers` instead)
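
The health endpoint is not enabled by the telemetry block; it comes from the health_check extension. A minimal sketch, assuming the extension's conventional port 13133:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # point Kubernetes liveness/readiness probes here

service:
  # Extensions must be listed here to be started
  extensions: [health_check]
```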

Key Collector metrics to alert on:

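otelcol_processor_dropped_metric_points      # Data dropped by a processor
otelcol_exporter_send_failed_metric_points   # Export failures (backend unavailable or rejecting data)
otelcol_processor_batch_batch_send_size      # Batch efficiency
otelcol_exporter_queue_size                  # Export queue depth (alert if growing)

The _metric_points counters have _spans and _log_records equivalents for the other two signals. Exact metric names vary somewhat between Collector versions, so confirm them against the :8888 output of the version you run.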

An exporter queue that's growing means your backend can't keep up with ingest. Either reduce ingest volume (increase sampling), scale the backend, or scale the Collector (though more Collectors don't help if the backend is the bottleneck).
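
These signals can be turned into Prometheus alerting rules. The following is a sketch with assumed thresholds (80% queue utilization, a 10-minute hold) and assumed alert names; depending on the Collector version and scrape path, the counter may be exposed with a _total suffix.

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelExporterQueueFilling
        # Queue depth relative to its configured capacity, per exporter
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector export queue above 80% for 10 minutes"

      - alert: OtelExporterSendFailures
        # Any sustained failure rate means the backend is rejecting or unreachable
        expr: rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector is failing to export metric points"
```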

Migrating Off Proprietary Agents

The Collector's role in a migration from a proprietary observability agent (Datadog agent, New Relic agent) to OTel:

Phase 1: Run OTel instrumentation alongside the proprietary agent. OTel sends to the Collector; Collector sends to the proprietary backend. This validates that OTel data is correct before cutting over.

Phase 2: Add a new exporter to the Collector pointing to the new backend. Now data goes to both backends simultaneously.

Phase 3: Validate that the new backend has complete, correct data, then cut dashboards, alerts, and on-call workflows over to it.

Phase 4: Remove the proprietary agent instrumentation from services.

The Collector's fanout capability makes this migration path smooth — there's no "big bang" cutover where you move everything at once and hope it works.
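
As a sketch of Phase 2, the dual-write step is a one-line change to the pipeline's exporter list. The otlphttp/new-backend name and its endpoint are placeholders, not a real vendor address.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch:

exporters:
  # Existing backend, unchanged
  otlp/datadog:
    endpoint: https://trace.agent.datadoghq.com
    headers:
      DD-API-KEY: "${DD_API_KEY}"
  # New backend added in Phase 2 (placeholder endpoint)
  otlphttp/new-backend:
    endpoint: https://otlp.new-backend.example.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # Both exporters receive every batch; removing otlp/datadog later completes the cutover
      exporters: [otlp/datadog, otlphttp/new-backend]
```

Each exporter keeps its own sending queue, so the new backend's failures show up in its own exporter metrics, which makes the Phase 3 comparison easier.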

*Zak Hassan is a Staff SRE specializing in observability engineering, data platform reliability, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*
