The OpenTelemetry Collector is the component that most production OTel deployments underinvest in. Teams instrument their services correctly, then pipe the data directly to a backend with a one-liner Collector config, and later discover they can't filter expensive data, can't route to multiple backends, can't transform attributes before storage, and have no way to control costs as telemetry volume grows.
The Collector is where your observability pipeline lives — and a well-designed pipeline gives you enormous flexibility. Here's how to design it for production.
The Collector Architecture
The Collector has three conceptual layers: receivers (ingest telemetry from services), processors (transform, filter, sample), and exporters (send to backends). These assemble into pipelines, one per signal type:
# otel-collector-config.yaml: production pipeline structure
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Also receive from existing Prometheus endpoints
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
processors:
  # Add resource attributes (cluster, environment) to everything
  resource:
    attributes:
      - key: deployment.environment
        value: "production"
        action: insert
      - key: k8s.cluster.name
        # Read from the KUBE_CLUSTER_NAME environment variable.
        # (from_attribute would copy from another attribute, not the environment.)
        value: "${env:KUBE_CLUSTER_NAME}"
        action: insert
  # Batch telemetry before exporting (reduces network overhead)
  batch:
    send_batch_size: 10000
    timeout: 10s
  # Memory limiter prevents OOM when backends are slow
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  # Tail sampling (covered in tracing post)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
exporters:
  # Primary backend
  otlp/datadog:
    endpoint: https://trace.agent.datadoghq.com
    headers:
      DD-API-KEY: "${env:DD_API_KEY}"
  # Long-term archive (cheap S3 storage)
  otlphttp/s3:
    endpoint: https://your-s3-otel-receiver.example.com
  # Prometheus metrics to Grafana
  prometheusremotewrite:
    endpoint: https://your-grafana-cloud.grafana.net/api/prom/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlp/datadog, otlphttp/s3]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/datadog]
The Fanout Pattern: Multiple Backends from One Pipeline
One of the Collector's key advantages: you can send the same telemetry to multiple backends simultaneously. This enables:
Primary + archive. Send traces to your primary observability platform (Datadog, Grafana) for interactive investigation, and simultaneously to cheap S3 storage for long-term retention and compliance.
Gradual backend migration. When migrating from one observability platform to another, route to both simultaneously. Validate that the new backend receives correct data before cutting over.
Team-specific routing. Route traces from service A to the backend owned by team A's tooling, and traces from service B to team B's backend — all from a single Collector deployment.
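# Connector-based routing by service name (statements match resource attributes)
connectors:
  routing:
    default_pipelines: [traces/default]
    table:
      - statement: route() where attributes["service.name"] == "payment-service"
        pipelines: [traces/security]
      - statement: route() where attributes["service.name"] == "fraud-detection"
        pipelines: [traces/security]
service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [routing]
    traces/default:
      receivers: [routing]
      exporters: [otlp/datadog]
    traces/security:
      receivers: [routing]
      exporters: [otlp/datadog, otlphttp/security-siem]
Attribute Filtering for Cost Control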
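Telemetry cost scales with volume, and for metrics the volume is driven by cardinality: every distinct combination of attribute values becomes its own time series. A metric with 5 HTTP methods, 10 status codes, and 50 routes tops out at 2,500 series; add a user ID attribute with 100,000 users and the ceiling becomes 250 million. Attributes that seem useful during development often produce expensive high-cardinality metrics in production.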
The attributes processor lets you control which attributes are included, excluded, or transformed before reaching your backend:
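processors:
  attributes:
    actions:
      # Remove high-cardinality or sensitive attributes before storage
      - key: http.url   # Full URL may contain PII or query params
        action: delete
      # Hash user IDs rather than deleting them (values stay correlatable without
      # exposing PII). To drop user.id from metrics only, put a separate
      # attributes instance with action: delete in the metrics pipeline.
      - key: user.id
        action: hash
      # Add derived attributes from existing ones. extract reads the attribute
      # named by key and creates one attribute per named capture group.
      # Assumes service.name is present as a span attribute; for the resource
      # attribute, use the same action in the resource processor.
      - key: service.name
        action: extract
        pattern: "^(?P<service_tier>payment|order|checkout)-.*"
        # Sets service_tier="checkout" for "checkout-service", etc.
  # Truncate long attribute values that blow up storage. The attributes processor
  # has no truncate action; the transform processor's truncate_all applies the
  # limit to every string attribute on the span, db.statement included.
  transform:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 500)   # Keeps the first 500 chars of SQL queries
Metric cardinality control with the metricstransform processor: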
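processors:
  metricstransform:
    transforms:
      # Remove a high-cardinality label from a metric before export
      - include: http_requests_total
        match_type: strict
        action: update
        operations:
          # aggregate_labels keeps only the labels in label_set and sums the
          # others away; user_id makes this metric unbounded cardinality.
          # The kept labels (method, status_code) are example names.
          - action: aggregate_labels
            label_set: [method, status_code]
            aggregation_type: sum
The Collector Deployment Model: Agent vs. Gateway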
Agent mode: One Collector per node (deployed as a DaemonSet in Kubernetes). Services send telemetry to the Collector on their own node; in Kubernetes that usually means the node's IP rather than a literal localhost, since each pod has its own network namespace. The agent handles local processing and forwards to a central gateway or directly to backends.
Gateway mode: A centralized Collector deployment (a Deployment in Kubernetes) that receives from all agents and handles the expensive processing (tail sampling, routing, heavy transformations) centrally.
The production pattern: agent + gateway in combination.
Services
    ↓ (send to the node-local agent on port 4317)
Agent Collector (DaemonSet)
    - Add k8s resource attributes (pod name, namespace, node)
    - Basic batching
    - Send to gateway
    ↓
Gateway Collector (Deployment, scaled horizontally)
    - Tail sampling (needs a centralized view of the full trace)
    - Routing to multiple backends
    - Attribute filtering for cost control
    - Export to backends
    ↓
Observability Backends
Tail sampling requires the gateway because it needs to see all spans for a trace to make the sampling decision. An agent that only sees the spans for services on its node cannot make a complete tail sampling decision. The same constraint applies once the gateway is scaled horizontally: agents must send every span of a trace to the same gateway replica, typically with the loadbalancing exporter keyed on trace ID, or each replica sees only part of the trace.
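The agent tier described above can be sketched as the following config. This is a minimal illustration and not a drop-in file: the gateway hostname is hypothetical, it assumes a headless Service in front of the gateway pods so DNS returns one address per replica, it assumes the KUBE_NODE_NAME environment variable is set from the downward API, and the k8sattributes processor additionally needs RBAC permissions that are not shown.

```yaml
# agent-config.yaml: sketch of the DaemonSet (agent) tier
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  # Attach pod, namespace, and node names to incoming telemetry
  k8sattributes:
    filter:
      node_from_env_var: KUBE_NODE_NAME   # assumption: set via the downward API
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
  batch: {}

exporters:
  # Send all spans that share a trace ID to the same gateway replica,
  # so tail sampling on the gateway sees complete traces
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: in-cluster traffic without TLS
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local   # hypothetical
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [loadbalancing]
```

Keeping the agent this thin is deliberate: anything that needs a whole-trace or whole-fleet view stays on the gateway, so agents can be restarted or rescheduled without losing sampling state.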
Collector Health Monitoring
The Collector exposes its own metrics and health endpoints. Monitor these:
# Enable internal metrics exposure
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888   # Scrape this with Prometheus
Key Collector metrics to alert on:
otelcol_processor_dropped_metric_points      # Data being dropped (pipeline full)
otelcol_exporter_send_failed_metric_points   # Export failures (backend unavailable)
otelcol_processor_batch_batch_send_size      # Batch efficiency
otelcol_exporter_queue_size                  # Export queue depth (alert if growing)
The metric-point counters have matching _spans and _log_records variants for the other signals, and exact names vary between Collector versions, so confirm them against the :8888/metrics output before writing alerts. An exporter queue that's growing means your backend can't keep up with ingest. Either reduce ingest volume (sample more aggressively), scale the backend, or scale the Collector (though more Collectors don't help if the backend is the bottleneck).
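As a sketch of alerting on these metrics, here is a Prometheus rule file. It assumes Prometheus already scrapes the Collector on port 8888 and that the metric names match your Collector version (some versions append _total to counters). The alert names, thresholds, and durations are hypothetical starting points, not recommendations.

```yaml
# otel-collector-alerts.yaml: sketch of Prometheus alerting rules
groups:
  - name: otel-collector
    rules:
      # Any sustained export failure means data is not reaching the backend
      - alert: OtelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Exporter {{ $labels.exporter }} is failing to send metric points"

      # Queue more than 80% full: the backend is not keeping up with ingest
      - alert: OtelCollectorQueueNearlyFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Exporter {{ $labels.exporter }} queue is above 80% of capacity"
```

Alerting on the ratio of queue size to capacity, rather than on raw queue size, keeps the rule valid when the queue is resized later.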
Migrating Off Proprietary Agents
The Collector also anchors a phased migration from a proprietary observability agent (Datadog agent, New Relic agent) to OTel:
Phase 1: Run OTel instrumentation alongside the proprietary agent. OTel sends to the Collector; Collector sends to the proprietary backend. This validates that OTel data is correct before cutting over.
Phase 2: Add a new exporter to the Collector pointing to the new backend. Now data goes to both backends simultaneously.
Phase 3: Validate the new backend has complete, correct data. Cut dashboards, alerts, and on-call workflows over to it.
Phase 4: Remove the proprietary agent instrumentation from services.
The Collector's fanout capability makes this migration path smooth — there's no "big bang" cutover where you move everything at once and hope it works.
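Phase 2 amounts to a small config change. The fragment below is a sketch only: otlphttp/new-backend and its endpoint are hypothetical placeholders for whichever backend you are evaluating, and the receivers and processors it references are assumed to be defined as in the pipeline config earlier in this post.

```yaml
# Phase 2 sketch: dual-write traces to the existing and the new backend
exporters:
  # Existing backend, unchanged
  otlp/datadog:
    endpoint: https://trace.agent.datadoghq.com
    headers:
      DD-API-KEY: "${env:DD_API_KEY}"
  # New backend under evaluation (hypothetical name and endpoint)
  otlphttp/new-backend:
    endpoint: https://otlp.new-backend.example.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      # Both exporters receive the same data
      exporters: [otlp/datadog, otlphttp/new-backend]
```

Once the new backend is validated, completing the cutover is the reverse edit: remove the old exporter from the exporters list.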
*Zak Hassan is a Staff SRE specializing in observability engineering, data platform reliability, and AI-powered operations. Find him at zakhassan.com or on LinkedIn.*