Log Management at Scale: Structured Logging, Routing, and Cost Control

*By Zak Hassan — Staff SRE | May 2026*

Every team eventually hits the same wall: the system is on fire, engineers are tailing logs across eight terminal windows, and nobody can find the one line that explains what went wrong. The logs exist — terabytes of them — but they're a wall of freeform text, scattered across nodes, formatted inconsistently, and indexed at a cost that would make a CFO cry. Log management done well is invisible infrastructure. Done poorly, it's the reason the on-call rotation burns out. This post covers the full stack: how to emit logs that are actually useful, how to route them cheaply, how to query them during incidents, and how to stop paying for logs nobody reads.

The Structured Logging Imperative

printf("user %s logged in from %s", user, ip) works fine when you have one server and twenty users. At scale it's a trap. You can't filter it programmatically, you can't aggregate across it, and you can't join it with traces without writing fragile regex that breaks the moment a developer changes the message string.
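To make the fragility concrete, here is a minimal sketch comparing a regex written against the freeform message with a lookup on a structured line. The regex, the reworded message, and the field names `user` and `ip` are illustrative assumptions, not taken from any real service.

```python
import json
import re

# Hypothetical regex written against the original freeform wording.
LOGIN_RE = re.compile(r"^user (\S+) logged in from (\S+)$")

def parse_freeform(line: str):
    """Return (user, ip) if the line matches the expected wording, else None."""
    match = LOGIN_RE.match(line)
    return match.groups() if match else None

def parse_structured(line: str):
    """Return (user, ip) from a JSON log line, regardless of message wording."""
    doc = json.loads(line)
    return doc["user"], doc["ip"]

old_line = "user alice logged in from 10.0.0.7"
new_line = "login ok: user=alice ip=10.0.0.7"  # a developer reworded the message

# The regex only survives as long as nobody touches the message text.
freeform_old = parse_freeform(old_line)
freeform_new = parse_freeform(new_line)

# The structured line keeps working because the fields are addressed by name.
structured_line = json.dumps({"message": "login ok", "user": "alice", "ip": "10.0.0.7"})
structured = parse_structured(structured_line)
```

The regex returns a match for the old wording and `None` for the new one, while the structured lookup is unaffected by the message change.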

Structured logging means every log line is a machine-readable document — JSON in practice for most stacks. Every line must carry a fixed set of fields: timestamp (RFC3339, always UTC), level, service, trace_id, request_id, and message. Everything else is context on top of those.

Here's a Python setup using python-json-logger that enforces this contract:

import logging
import sys
import time
from pythonjsonlogger import jsonlogger

def build_logger(service_name: str) -> logging.Logger:
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)

    # Avoid attaching a second handler (and duplicate lines) if called twice
    if logger.handlers:
        return logger

    handler = logging.StreamHandler(sys.stdout)
    formatter = jsonlogger.JsonFormatter(
        fmt="%(asctime)s %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
        rename_fields={"asctime": "timestamp", "levelname": "level", "name": "service"},
    )
    # logging renders asctime in local time by default; force UTC to match the "Z" suffix
    formatter.converter = time.gmtime
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger

logger = build_logger("payments-api")

# Bind request-scoped context so every subsequent call carries it
import contextvars

_log_ctx: contextvars.ContextVar[dict] = contextvars.ContextVar("log_ctx", default={})

class ContextLogger:
    def __init__(self, base: logging.Logger):
        self._log = base

    def _merge(self, extra: dict) -> dict:
        return {**_log_ctx.get(), **extra}

    def info(self, msg: str, **kwargs):
        self._log.info(msg, extra=self._merge(kwargs))

    def error(self, msg: str, **kwargs):
        self._log.error(msg, extra=self._merge(kwargs))

    def warn(self, msg: str, **kwargs):
        self._log.warning(msg, extra=self._merge(kwargs))

log = ContextLogger(logger)

# In a FastAPI HTTP middleware (needs `from uuid import uuid4`):
# token = _log_ctx.set({"trace_id": request.headers.get("X-Trace-Id"), "request_id": str(uuid4())})
# try: response = await call_next(request)
# finally: _log_ctx.reset(token)

log.info("payment processed", amount_cents=4999, currency="USD", customer_id="cust_abc123")
# Example output for a call made inside that middleware (trace_id and request_id
# only appear once the context has been bound; key order may differ):
# {"timestamp": "2026-05-08T14:22:01Z", "level": "INFO", "service": "payments-api",
#  "message": "payment processed", "amount_cents": 4999, "currency": "USD",
#  "customer_id": "cust_abc123", "trace_id": "4bf92f3577b34da6", "request_id": "req-uuid"}

The output is a single JSON object per line. Any log collector, any query engine, any alert rule can work with it without custom parsing.
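One way to keep the contract honest is to check emitted lines against the required field list, for example in a unit test or a CI step. The following is a minimal sketch under that assumption; the helper name `missing_fields` is hypothetical and the field list is the one stated earlier in this post.

```python
import json

# The fixed field set every log line is expected to carry.
REQUIRED_FIELDS = ("timestamp", "level", "service", "trace_id", "request_id", "message")

def missing_fields(line: str) -> list:
    """Return the required fields that are absent or empty in one JSON log line."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError:
        # A line that is not JSON at all violates the whole contract.
        return list(REQUIRED_FIELDS)
    if not isinstance(doc, dict):
        return list(REQUIRED_FIELDS)
    return [field for field in REQUIRED_FIELDS if doc.get(field) in (None, "")]

good = json.dumps({
    "timestamp": "2026-05-08T14:22:01Z", "level": "INFO", "service": "payments-api",
    "trace_id": "4bf92f3577b34da6", "request_id": "req-uuid", "message": "payment processed",
})
bad = json.dumps({"level": "INFO", "message": "payment processed"})

good_result = missing_fields(good)
bad_result = missing_fields(bad)
```

A check like this catches the common failure mode where a log call runs outside the request middleware and silently loses `trace_id` and `request_id`.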

Log Levels in Practice

Most teams treat INFO as a catch-all and then wonder why their observability platform bill is enormous. The contract should be explicit:

DEBUG: everything needed to reconstruct execution, such as SQL queries, HTTP request/response bodies, and cache hits and misses. Off in production by default. High volume by design.
INFO: business-meaningful events that confirm normal operation. A user logged in, a payment succeeded, a batch job completed. One or two per request, not twenty.
WARN: something unexpected happened but the system compensated — a retry succeeded, a circuit breaker opened, a config value fell back to default. Warrants investigation, not a page.
ERROR: something failed and either a user was affected or an operator needs to act. Every ERROR should be either tied to an alert or explicitly documented as expected noise.

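The practical problem is that DEBUG is useless if you can only turn it on by restarting pods. Dynamic log level adjustment solves this. Expose a /debug/log-level endpoint that adjusts a logger's level at runtime, protected by mTLS or an internal network boundary. A logger that has its own explicit level (such as the one build_logger configures) ignores changes to the root logger, so the endpoint accepts an optional logger name:

from fastapi import APIRouter, HTTPException
import logging

router = APIRouter()

# Explicit allowlist instead of getattr(logging, ...), which would accept any integer attribute
LEVELS = {"DEBUG": logging.DEBUG, "INFO": logging.INFO,
          "WARNING": logging.WARNING, "ERROR": logging.ERROR}

@router.put("/debug/log-level")
async def set_log_level(level: str, logger_name: str = ""):
    numeric = LEVELS.get(level.upper())
    if numeric is None:
        raise HTTPException(status_code=400, detail=f"Invalid log level: {level}")
    # An empty name targets the root logger; pass e.g. "payments-api" for a named one
    logging.getLogger(logger_name or None).setLevel(numeric)
    return {"level": level.upper(), "logger": logger_name or "root", "applied": True}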

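This lets you crank up DEBUG on a single pod for thirty seconds during an incident, capture what you need, then return to INFO, with no deployment required. Send the request to the pod directly (for example through kubectl port-forward), because a Service would load-balance it to an arbitrary replica.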
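Relying on someone to remember to switch back to INFO is risky in the middle of an incident. A minimal sketch of an automatic revert, assuming the standard library `threading.Timer` is acceptable in your service; the helper name `set_level_temporarily` is hypothetical:

```python
import logging
import threading

def set_level_temporarily(logger_name: str, level: int, ttl_seconds: float) -> threading.Timer:
    """Set a logger's level now and restore the previous level after ttl_seconds."""
    target = logging.getLogger(logger_name)
    previous = target.level  # remember what to go back to

    def restore() -> None:
        target.setLevel(previous)

    target.setLevel(level)
    timer = threading.Timer(ttl_seconds, restore)
    timer.daemon = True  # never block process shutdown on a pending revert
    timer.start()
    return timer
```

Calling `set_level_temporarily("payments-api", logging.DEBUG, 30.0)` from the endpoint handler would give the thirty-second window described above; the returned timer can be cancelled if a second request extends the window.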

The Log Pipeline

Logs travel from process stdout through at least three hops before they're queryable. Getting that pipeline wrong costs you either money, reliability, or both.

Collection: Fluent Bit as a DaemonSet. Fluent Bit runs on every node and tails container logs from /var/log/containers/. It's lightweight (~450KB binary, ~10MB RAM), and it's where you do your first-pass filtering — drop before you ship.

# fluent-bit-config.yaml (ConfigMap)
[SERVICE]
    Flush         5
    Log_Level     warn
    Parsers_File  parsers.conf

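[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    # containerd and CRI-O write CRI-format lines; listing both parsers covers either runtime
    multiline.parser  docker, cri
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10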

# Drop health check and readiness probe noise before it ever leaves the node.
# This filter runs before the kubernetes filter because Merge_Log On with
# Keep_Log Off removes the raw `log` key once a JSON line has been merged.
[FILTER]
    Name    grep
    Match   kube.*
    Exclude log /healthz|/readyz|/metrics

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Keep_Log            Off
    Annotations         Off

[OUTPUT]
    Name          forward
    Match         kube.*
    Host          vector-aggregator.logging.svc.cluster.local
    Port          24224

Transformation and routing: Vector. Vector sits between collection and storage. This is where you apply your routing logic — ERRORs go to Elasticsearch for fast-access search, DEBUG goes to S3 for cheap archival, everything goes to Loki for unified querying.

# vector.toml
[sources.fluent]
  type = "fluent"
  address = "0.0.0.0:24224"

[transforms.parse_json]
  type = "remap"
  inputs = ["fluent"]
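  source = '''
    # Parse the nested JSON log message if Fluent Bit has not already merged it
    if is_string(.log) {
      parsed, err = parse_json(string!(.log))
      if err == null && is_object(parsed) {
        . = merge(., object!(parsed))
      }
    }

    # Normalize level to uppercase, defaulting to INFO when missing or not a string
    level = string(.level) ?? "INFO"
    .level = upcase(level)
    # Python's logging module emits WARNING; fold it into WARN so the routes match
    if .level == "WARNING" {
      .level = "WARN"
    }

    # Drop roughly 90% of DEBUG logs to control volume
    if .level == "DEBUG" {
      if random_float(0.0, 1.0) < 0.9 {
        abort
      }
    }

    # Enrich with environment metadata
    .env = get_env_var!("ENVIRONMENT")
    .cluster = get_env_var!("CLUSTER_NAME")
  '''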

[transforms.route_by_level]
  type = "route"
  inputs = ["parse_json"]
  [transforms.route_by_level.route]
    errors  = '.level == "ERROR" || .level == "WARN"'
    debug   = '.level == "DEBUG"'
    info    = '.level == "INFO"'

# Fast-access store for errors, searchable within seconds
# (recent Vector releases take a list of endpoints and nest the index under bulk)
[sinks.elasticsearch_errors]
  type = "elasticsearch"
  inputs = ["route_by_level.errors"]
  endpoints = ["https://es-cluster.internal:9200"]
  bulk.index = "logs-errors-%Y.%m.%d"
  bulk.action = "index"
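# Cheap archival for debug: gzipped JSON in S3, which Athena can query in place
# (converting to Parquet would be a separate batch step, not something this sink does)
[sinks.s3_debug]
  type = "aws_s3"
  inputs = ["route_by_level.debug"]
  bucket = "logs-archive-prod"
  # The aws_s3 sink also needs a region (or custom endpoint) setting for this bucket
  key_prefix = "debug/%Y/%m/%d/%H/"
  encoding.codec = "json"
  compression = "gzip"
  batch.timeout_secs = 300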

# Loki gets everything: unified querying across all levels.
# Reading straight from the transform means levels that match no route
# (for example CRITICAL) still arrive instead of being silently dropped.
[sinks.loki]
  type = "loki"
  inputs = ["parse_json"]
  endpoint = "http://loki.monitoring.svc.cluster.local:3100"
  labels.service = "{{ service }}"
  labels.level = "{{ level }}"
  labels.env = "{{ env }}"
  encoding.codec = "json"
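Routing rules are easy to get subtly wrong, so it can help to mirror them in a few lines of Python and unit-test the expected destinations before touching the collector. This is only a sketch of the routing described above, not Vector itself: it ignores sampling, and folding `WARNING` into `WARN` is an assumption made here because Python's logging module emits `WARNING`.

```python
def normalize_level(raw) -> str:
    """Uppercase the level, default to INFO, and fold WARNING into WARN."""
    level = raw.upper() if isinstance(raw, str) and raw else "INFO"
    return "WARN" if level == "WARNING" else level

def sinks_for(event: dict) -> list:
    """Return the sinks an event would reach under the routing described above."""
    level = normalize_level(event.get("level"))
    sinks = ["loki"]  # Loki receives every level
    if level in ("ERROR", "WARN"):
        sinks.append("elasticsearch_errors")  # fast-access search
    elif level == "DEBUG":
        sinks.append("s3_debug")  # cheap archival
    return sinks

error_sinks = sinks_for({"level": "error"})
warning_sinks = sinks_for({"level": "WARNING"})
debug_sinks = sinks_for({"level": "debug"})
default_sinks = sinks_for({"message": "no level field"})
```

A table of expected destinations like this doubles as documentation for the on-call engineer who needs to know where a given line ended up.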

Log Routing and Filtering

The most expensive logs are the ones that carry no signal. Health checks, readiness probes, and /metrics scrapes can easily account for 20-30% of your total log volume in a Kubernetes environment, and they tell you nothing you don't already know from your uptime monitor.
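Before trusting a figure like that for your own cluster, measure it on a sample. A minimal sketch, assuming you have pulled a few thousand raw lines from one node into memory; the pattern is the same one used in the Fluent Bit grep filter, and the helper name `probe_share` is hypothetical:

```python
import re

# Same pattern as the Fluent Bit grep filter.
PROBE_RE = re.compile(r"/healthz|/readyz|/metrics")

def probe_share(lines) -> float:
    """Fraction of total bytes contributed by lines that match the probe pattern."""
    total = 0
    noisy = 0
    for line in lines:
        size = len(line.encode("utf-8"))
        total += size
        if PROBE_RE.search(line):
            noisy += size
    return noisy / total if total else 0.0

sample = [
    "GET /healthz 200",
    "payment processed",
]
share = probe_share(sample)
```

Measuring by bytes rather than line count matters because most hosted platforms bill on ingested volume.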

Drop them at the node level in Fluent Bit (shown above) before they hit the network. The second layer of filtering happens in Vector with the sampling logic in the VRL transform: randomly dropping 90% of DEBUG logs preserves enough for pattern detection without the volume. For services where you need guaranteed DEBUG capture on specific requests (say, for a specific customer ID during an investigation), send a request header X-Debug-Capture: true, have the application copy it into a log field (http_request_debug_capture in the snippet that follows), and let that field bypass the sampling:

# In the VRL remap transform, replace the DEBUG sampling block:
if .level == "DEBUG" {
  # A missing field reads as null, so this also covers lines without the flag
  if .http_request_debug_capture != "true" {
    if random_float(0.0, 1.0) < 0.9 {
      abort
    }
  }
}
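The collector can only see the flag if the application writes it into each log line. A minimal sketch of the application side, assuming request headers are available as a plain dict; the helper name `debug_capture_fields` is hypothetical, and the field name matches the one the VRL block checks.

```python
DEBUG_CAPTURE_HEADER = "x-debug-capture"

def debug_capture_fields(headers: dict) -> dict:
    """Return the log-context field the collector checks, if the header opts in."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    if normalized.get(DEBUG_CAPTURE_HEADER, "").strip().lower() == "true":
        return {"http_request_debug_capture": "true"}
    return {}

opted_in = debug_capture_fields({"X-Debug-Capture": "true"})
opted_out = debug_capture_fields({"X-Trace-Id": "4bf92f3577b34da6"})
```

Merging the returned dict into the request-scoped logging context means every line for that request carries the flag without touching individual log calls.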

Loki LogQL for Incident Investigation

Loki's LogQL is designed for this kind of log pipeline. The label-first model (you select streams by labels, then filter within them) means queries are fast even over large volumes — Loki only reads chunks that match your label selectors.

# Find all errors in the payments service (the window, e.g. the last 30 minutes, comes from the query's time range, not the selector)
{service="payments-api", level="ERROR"} | json | line_format "{{.message}} trace={{.trace_id}}"

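# Follow a specific trace across all services (the |= line filter runs before JSON parsing and discards most lines cheaply)
{env="prod"} |= "4bf92f3577b34da6" | json | trace_id="4bf92f3577b34da6"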

# Count error rate by service over time (metric extraction from logs)
sum by (service) (
  rate({env="prod", level="ERROR"} [5m])
)

# Find slow requests — extract latency_ms and filter
{service="payments-api"} | json | latency_ms > 2000 | line_format "{{.latency_ms}}ms {{.request_id}} {{.customer_id}}"

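# Pattern extraction: pull a field out of messages that share structure even if values differ.
# The lines are JSON, so reduce each line to its message before applying the pattern.
{service="payments-api", level="ERROR"} | json | line_format "{{.message}}" | pattern "<_> failed: <reason> after <_> retries" | line_format "{{.reason}}"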

# Log-to-metric bridge for alerting on log patterns
count_over_time({service="payments-api"} |= "payment declined" [1m])
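During an incident it is often handy to run these queries from a script rather than the Grafana UI. The sketch below builds a request for Loki's `query_range` HTTP endpoint using only the standard library. The base URL is the in-cluster address from the Vector sink config; the helper names and the 30-minute default are assumptions for illustration.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Optional

LOKI_URL = "http://loki.monitoring.svc.cluster.local:3100"

def build_query_range_url(logql: str, minutes: int = 30, limit: int = 100,
                          now_ns: Optional[int] = None) -> str:
    """Build a query_range URL covering the last `minutes` minutes."""
    end = now_ns if now_ns is not None else time.time_ns()
    start = end - minutes * 60 * 1_000_000_000  # Loki accepts nanosecond epochs
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start,
        "end": end,
        "limit": limit,
        "direction": "backward",  # newest lines first
    })
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"

def fetch_lines(url: str, timeout: float = 10.0) -> list:
    """Run the query and flatten the returned streams into a list of log lines."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    # Each stream carries [timestamp, line, ...] entries under "values".
    return [entry[1] for stream in body["data"]["result"] for entry in stream["values"]]
```

For example, `fetch_lines(build_query_range_url('{service="payments-api", level="ERROR"}'))` would return the most recent error lines, assuming the script runs somewhere that can reach the Loki service.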

The exemplars bridge between logs and metrics is particularly powerful: if your metrics include a trace_id exemplar label (supported in Prometheus 2.26+ and exposed via OpenTelemetry), Grafana can jump from a metric spike directly to the Loki stream for that trace. Configure it in the application's Prometheus histogram:

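from prometheus_client import Histogram

PAYMENT_LATENCY = Histogram("payment_latency_seconds", "Payment processing latency",
                            labelnames=["status"])
# When recording, pass an exemplar carrying the current trace ID.
# current_trace_id() stands in for however your tracing library exposes the active trace ID.
# Exemplars are only emitted in the OpenMetrics exposition format, so the scrape has to negotiate it.
PAYMENT_LATENCY.labels(status="success").observe(0.342, exemplar={"trace_id": current_trace_id()})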

Retention Policy Design

Retention is a compliance question first and a cost question second. Get the compliance requirements wrong and the cost savings mean nothing. A working starting point for most B2B SaaS:

Tier	Storage	Retention	What Lives Here
Hot	Elasticsearch / Loki ingesters	7 days	All levels; fully indexed; sub-second queries
Warm	Loki object store (S3 + index)	30 days	INFO and above; queryable but slower
Cold	S3 Glacier / raw Parquet	1–7 years	Compliance-required logs (auth, payments, PII access); Athena-queryable
Delete	—	>7 years	Everything else
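The table can be read as a small decision function, which is a useful thing to have in tests for lifecycle tooling. A minimal sketch, assuming age is measured in days and that a boolean flag marks compliance-required logs; the function name `storage_tier` is hypothetical, and the thresholds are the ones in the table.

```python
# Levels that qualify for the warm tier ("INFO and above").
WARM_LEVELS = ("INFO", "WARN", "WARNING", "ERROR", "CRITICAL")

def storage_tier(age_days: int, level: str, compliance: bool) -> str:
    """Map a log record's age and class to a tier from the table above."""
    if age_days <= 7:
        return "hot"  # all levels, fully indexed
    if age_days <= 30 and level.upper() in WARM_LEVELS:
        return "warm"  # INFO and above, slower queries
    if compliance and age_days <= 7 * 365:
        return "cold"  # auth, payments, PII access
    return "delete"

recent_debug = storage_tier(3, "DEBUG", compliance=False)
month_old_info = storage_tier(20, "INFO", compliance=False)
old_debug = storage_tier(20, "DEBUG", compliance=False)
audit_log = storage_tier(400, "INFO", compliance=True)
expired = storage_tier(3000, "INFO", compliance=True)
```

Encoding the policy this way makes it easy to spot gaps, such as which tier a compliance-flagged DEBUG line belongs to after the first week.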

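Auth logs (login success/failure, permission changes) and payment logs almost always have a regulatory floor. PCI DSS requires at least one year of audit log history, with the most recent three months immediately available for analysis, and SOC 2 expects you to define and enforce a policy. That three-month window is longer than the 30-day warm tier above, so in-scope logs need their own longer override. Build your Loki retention rules and S3 lifecycle policies to enforce this automatically: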

# Loki retention config (loki.yaml)
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  working_directory: /data/retention

limits_config:
  retention_period: 744h  # 31 days default
  # Per-stream overrides (example: keep DEBUG only for the 7-day hot window)
  retention_stream:
    - selector: '{level="DEBUG"}'
      priority: 1
      period: 168h

# S3 lifecycle for the cold archive bucket (applied via Terraform)
# resource "aws_s3_bucket_lifecycle_configuration" "logs_archive" {
#   bucket = aws_s3_bucket.logs_archive.id  # assumes a bucket resource with this name
#   rule {
#     id     = "transition-to-glacier"
#     status = "Enabled"
#     transition {
#       days          = 90
#       storage_class = "GLACIER"
#     }
#     expiration {
#       days = 2555  # 7 years
#     }
#   }
# }

Log-Driven Cost Control

In hosted platforms (Datadog, Splunk Cloud, New Relic), log cost has three components: ingestion, indexing, and retention. Ingestion and indexing are where teams hemorrhage money.
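A rough model of those three components makes it easier to see which lever matters. The sketch below is an assumption-heavy estimate: the per-GB prices are placeholders rather than any vendor's actual rates, the month is taken as 30 days, and stored volume is approximated as daily volume times retention days.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogPricing:
    """Per-GB prices. The values used below are placeholders, not vendor quotes."""
    ingest_per_gb: float
    index_per_gb: float
    retain_per_gb_month: float

def monthly_cost(gb_per_day: float, indexed_fraction: float,
                 retention_days: int, pricing: LogPricing) -> float:
    """Estimate one month of spend across ingestion, indexing, and retention."""
    ingested_gb = gb_per_day * 30
    indexed_gb = ingested_gb * indexed_fraction
    # At steady state the platform holds retention_days worth of daily volume.
    stored_gb = gb_per_day * retention_days
    return (ingested_gb * pricing.ingest_per_gb
            + indexed_gb * pricing.index_per_gb
            + stored_gb * pricing.retain_per_gb_month)

PLACEHOLDER = LogPricing(ingest_per_gb=0.10, index_per_gb=1.50, retain_per_gb_month=0.03)

# Same ingest volume, but only 40% of it indexed in the second case.
before = monthly_cost(100, 1.0, 30, PLACEHOLDER)
after = monthly_cost(100, 0.4, 30, PLACEHOLDER)
print(f"before={before:.2f} after={after:.2f} saved={1 - after / before:.0%}")
```

With these placeholder prices, indexing dominates the bill, which is why indexing fewer logs tends to save more than shortening retention. Plug in your own contract rates before drawing conclusions.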

The most expensive log sources are almost always:

Istio/Envoy access logs — one line per request per sidecar, often at DEBUG, for every service-to-service call
Database slow query logs misconfigured with a threshold of 0ms (logging every query)
Health check and readiness probe endpoints not excluded at the collector
Batch job progress logs emitting a line every record processed

Audit your top sources before tuning anything. In Loki, this query identifies your highest-volume services:

# Top 10 services by log volume in bytes over the last hour
# (rate() would count lines per second rather than bytes)
topk(10, sum by (service) (bytes_over_time({env="prod"}[1h])))

Then work through a tiered reduction strategy:

Drop at source: health checks, Envoy access logs for 2xx responses, batch progress logs (replace with a single summary line at job completion)
Sample at collector: DEBUG at 10%, high-volume INFO endpoints (the /api/v1/status endpoint that gets polled every 5 seconds) at 1%
Reduce field cardinality: don't log raw SQL with parameter values embedded in the query string — that creates unbounded cardinality in any index. Log query name and bind parameters separately.
Shorten messages: on volume-based pricing, a 4KB log line costs 10x what a 400-byte line costs. Stack traces belong in error tracking (Sentry, Rollbar), not in your log platform.
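As an illustration of the "drop at source" step, here is a minimal sketch of a batch loop that emits one summary line at completion instead of one line per record. The function name `process_batch` and the summary field names are assumptions; failures still get their own short line so they stay visible.

```python
import logging
import time

def process_batch(records, handle, logger: logging.Logger) -> dict:
    """Process records and emit one summary line instead of one line per record."""
    started = time.monotonic()
    ok = 0
    failed = 0
    for record in records:
        try:
            handle(record)
            ok += 1
        except Exception as exc:
            failed += 1
            # Keep the failure line short; the stack trace goes to error tracking.
            logger.error("record failed", extra={"error_type": type(exc).__name__})
    summary = {
        "processed": ok,
        "failed": failed,
        "duration_s": round(time.monotonic() - started, 3),
    }
    logger.info("batch complete", extra=summary)
    return summary
```

For a job that handles a million records, this turns a million INFO lines into one, while the counts in the summary still answer the question the progress lines were there for.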

The combination of Fluent Bit edge filtering, Vector sampling and routing, and a well-designed retention policy can reduce hosted log platform spend by 40-60% without losing a single meaningful signal. The work is in auditing what you actually query during incidents and being ruthless about the rest.

*Zak Hassan is a Staff SRE specializing in observability, log infrastructure, and reliability engineering at scale. Find him at zakhassan.com or on LinkedIn.*
