*By Zak Hassan — Staff SRE | May 2026*
Every team eventually hits the same wall: the system is on fire, engineers are tailing logs across eight terminal windows, and nobody can find the one line that explains what went wrong. The logs exist — terabytes of them — but they're a wall of freeform text, scattered across nodes, formatted inconsistently, and indexed at a cost that would make a CFO cry. Log management done well is invisible infrastructure. Done poorly, it's the reason the on-call rotation burns out. This post covers the full stack: how to emit logs that are actually useful, how to route them cheaply, how to query them during incidents, and how to stop paying for logs nobody reads.
The Structured Logging Imperative
printf("user %s logged in from %s", user, ip) works fine when you have one server and twenty users. At scale it's a trap. You can't filter it programmatically, you can't aggregate across it, and you can't join it with traces without writing fragile regex that breaks the moment a developer changes the message string.
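A minimal sketch of that fragility. The regex and the reworded message are hypothetical, chosen only to illustrate the point: the pattern works until the message text changes, while the JSON line keeps yielding the same fields.

```python
import json
import re

# Hypothetical pattern written against the original printf-style message.
LOGIN_RE = re.compile(r"user (\S+) logged in from (\S+)")


def parse_freeform(line: str):
    """Extract user and ip from a freeform line, or None if the wording changed."""
    match = LOGIN_RE.search(line)
    if match is None:
        return None
    return {"user": match.group(1), "ip": match.group(2)}


original = "user alice logged in from 10.0.0.7"
reworded = "login succeeded for alice (ip=10.0.0.7)"  # same event, new wording
structured = '{"message": "login succeeded", "user": "alice", "ip": "10.0.0.7"}'

print(parse_freeform(original))        # fields recovered
print(parse_freeform(reworded))        # None: the regex silently stops matching
print(json.loads(structured)["user"])  # field access does not depend on message text
```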
Structured logging means every log line is a machine-readable document — JSON in practice for most stacks. Every line must carry a fixed set of fields: timestamp (RFC3339, always UTC), level, service, trace_id, request_id, and message. Everything else is context on top of those.
Here's a Python setup using python-json-logger that enforces this contract:
import contextvars
import logging
import sys
import time

from pythonjsonlogger import jsonlogger


def build_logger(service_name: str) -> logging.Logger:
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(sys.stdout)
    formatter = jsonlogger.JsonFormatter(
        fmt="%(asctime)s %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
        rename_fields={"asctime": "timestamp", "levelname": "level", "name": "service"},
    )
    # logging renders asctime in local time by default; force UTC so the "Z" suffix is true
    formatter.converter = time.gmtime
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger


logger = build_logger("payments-api")

# Bind request-scoped context so every subsequent call carries it
_log_ctx: contextvars.ContextVar[dict] = contextvars.ContextVar("log_ctx", default={})


class ContextLogger:
    def __init__(self, base: logging.Logger):
        self._log = base

    def _merge(self, extra: dict) -> dict:
        # Request-scoped fields first, so call-site fields can override them
        return {**_log_ctx.get(), **extra}

    def info(self, msg: str, **kwargs):
        self._log.info(msg, extra=self._merge(kwargs))

    def error(self, msg: str, **kwargs):
        self._log.error(msg, extra=self._merge(kwargs))

    def warn(self, msg: str, **kwargs):
        self._log.warning(msg, extra=self._merge(kwargs))


log = ContextLogger(logger)

# In a FastAPI HTTP middleware:
#   token = _log_ctx.set({"trace_id": request.headers.get("X-Trace-Id"), "request_id": str(uuid4())})
#   try:
#       response = await call_next(request)
#   finally:
#       _log_ctx.reset(token)

log.info("payment processed", amount_cents=4999, currency="USD", customer_id="cust_abc123")
# Output when called inside a request with the context bound:
# {"timestamp": "2026-05-08T14:22:01Z", "level": "INFO", "service": "payments-api",
#  "message": "payment processed", "amount_cents": 4999, "currency": "USD",
#  "customer_id": "cust_abc123", "trace_id": "4bf92f3577b34da6", "request_id": "req-uuid"}
The output is a single JSON object per line. Any log collector, any query engine, any alert rule can work with it without custom parsing.
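A contract like this is easy to state and easy to drift from, so it can help to check emitted lines in a unit test or CI step. The following is a minimal sketch, not part of python-json-logger; `contract_violations` and `REQUIRED_FIELDS` are names invented here, and the field list is the one given earlier in this section.

```python
import json
from datetime import datetime, timedelta

REQUIRED_FIELDS = ("timestamp", "level", "service", "trace_id", "request_id", "message")


def contract_violations(line: str) -> list:
    """Return human-readable problems; an empty list means the line conforms."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(doc, dict):
        return ["not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    ts = doc.get("timestamp")
    if isinstance(ts, str):
        try:
            # fromisoformat accepts a trailing "Z" only on Python 3.11+, so normalize it.
            parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not RFC3339")
        else:
            if parsed.utcoffset() != timedelta(0):
                problems.append("timestamp is not UTC")
    return problems
```

Feeding it the captured stdout of a test request gives a quick signal when a service stops honoring the shared fields.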
Log Levels in Practice
Most teams treat INFO as a catch-all and then wonder why their observability platform bill is enormous. The contract should be explicit:
- DEBUG: everything needed to reconstruct execution, such as SQL queries, HTTP request/response bodies, and cache hits and misses. Never on in production by default. High volume by design.
- INFO: business-meaningful events that confirm normal operation. A user logged in, a payment succeeded, a batch job completed. One or two per request, not twenty.
- WARN: something unexpected happened but the system compensated — a retry succeeded, a circuit breaker opened, a config value fell back to default. Warrants investigation, not a page.
- ERROR: something failed and either a user was affected or an operator needs to act. Every ERROR should be either tied to an alert or explicitly documented as expected noise.
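One way to read that contract in code is a retry wrapper, where the same failure lands at a different level depending on how it ends. This is an illustrative sketch; `call_with_retry` is a hypothetical helper, not something from a library.

```python
import logging

logger = logging.getLogger("levels-demo")


def call_with_retry(operation, attempts: int = 3):
    """Run operation, logging each outcome at the level the contract assigns."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = operation()
        except Exception as exc:
            last_error = exc
            # DEBUG: execution detail, only useful when reconstructing a failure.
            logger.debug("attempt %d raised %r", attempt, exc)
            continue
        if attempt > 1:
            # WARN: something went wrong but the system compensated.
            logger.warning("succeeded after %d attempts", attempt)
        return result
    # ERROR: the caller is affected and an operator may need to act.
    logger.error("failed after %d attempts: %r", attempts, last_error)
    raise last_error
```

The INFO line for the business event itself (for example "payment processed") stays with the caller, which keeps it at one per request.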
The practical problem is that DEBUG is useless if you can only turn it on by restarting pods. Dynamic log level adjustment solves this. Expose a /debug/log-level endpoint that adjusts the root logger level at runtime, protected by mTLS or an internal network boundary:
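from fastapi import APIRouter, HTTPException
import logging

router = APIRouter()


@router.put("/debug/log-level")
async def set_log_level(level: str, logger_name: str = ""):
    # Accept only real level names (DEBUG, INFO, WARNING, ERROR, CRITICAL)
    numeric = getattr(logging, level.upper(), None)
    if not isinstance(numeric, int):
        raise HTTPException(status_code=400, detail=f"Invalid log level: {level}")
    # An empty name targets the root logger. Loggers that were given an explicit level
    # (such as "payments-api" above) ignore the root level, so pass their name instead.
    logging.getLogger(logger_name or None).setLevel(numeric)
    return {"level": level.upper(), "logger": logger_name or "root", "applied": True}
This lets you crank up DEBUG on a single pod for thirty seconds during an incident, capture what you need, then return to INFO, with no deployment required.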
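Returning to INFO is the step people forget mid-incident. A small sketch of an automatic revert using only the standard library; `set_level_temporarily` is a hypothetical helper that an endpoint like the one above could call instead of `setLevel` directly.

```python
import logging
import threading


def set_level_temporarily(logger: logging.Logger, level: int, seconds: float) -> threading.Timer:
    """Apply level now and schedule a revert to the logger's previous level."""
    previous = logger.level
    logger.setLevel(level)
    timer = threading.Timer(seconds, logger.setLevel, args=(previous,))
    timer.daemon = True  # never keep the process alive just to revert a log level
    timer.start()
    return timer


# Usage: thirty seconds of DEBUG on one logger, then back to whatever it was.
# set_level_temporarily(logging.getLogger("payments-api"), logging.DEBUG, 30)
```

The returned timer can be cancelled if an operator wants to extend the window and set a new one.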
The Log Pipeline
Logs travel from process stdout through at least three hops before they're queryable. Getting that pipeline wrong costs you either money, reliability, or both.
Collection: Fluent Bit as a DaemonSet. Fluent Bit runs on every node and tails container logs from /var/log/containers/. It's lightweight (~450KB binary, ~10MB RAM), and it's where you do your first-pass filtering — drop before you ship.
# fluent-bit.conf (classic format, mounted from a ConfigMap)
[SERVICE]
    Flush         5
    Log_Level     warn
    Parsers_File  parsers.conf

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    # The docker parser expects Docker's JSON log files; on containerd or CRI-O nodes use the cri parser
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10

# Drop health check and readiness probe noise before it ever leaves the node.
# This filter runs before the kubernetes filter because Merge_Log On with Keep_Log Off
# removes the raw log key once a JSON line has been merged.
[FILTER]
    Name     grep
    Match    kube.*
    Exclude  log /healthz|/readyz|/metrics

[FILTER]
    Name             kubernetes
    Match            kube.*
    Kube_URL         https://kubernetes.default.svc:443
    Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log        On
    Keep_Log         Off
    Annotations      Off

[OUTPUT]
    Name   forward
    Match  kube.*
    Host   vector-aggregator.logging.svc.cluster.local
    Port   24224
Transformation and routing: Vector. Vector sits between collection and storage. This is where you apply your routing logic: ERRORs and WARNs go to Elasticsearch for fast-access search, DEBUG goes to S3 for cheap archival, and everything goes to Loki for unified querying.
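Before the Vector config, the routing rule can be stated as a small function. This is only a Python restatement for clarity, not how Vector evaluates routes; the sink names are shorthand, and WARN is grouped with ERROR because the config that follows routes them together.

```python
def destinations(event: dict) -> list:
    """Return the sinks an event is sent to, based on its level."""
    level = str(event.get("level", "INFO")).upper()
    sinks = ["loki"]  # everything goes to Loki for unified querying
    if level in ("ERROR", "WARN"):
        sinks.append("elasticsearch")  # fast-access search
    elif level == "DEBUG":
        sinks.append("s3")  # cheap archival
    return sinks
```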
# vector.toml
[sources.fluent]
type = "fluent"
address = "0.0.0.0:24224"
[transforms.parse_json]
type = "remap"
inputs = ["fluent"]
source = '''
  # Parse the nested JSON log message if present
  if is_string(.log) {
    parsed, err = parse_json(.log)
    if err == null && is_object(parsed) {
      . = merge(., object!(parsed))
    }
  }
  # Normalize level to uppercase; fall back to INFO when it is missing or not a string
  .level = upcase(string(.level) ?? "INFO")
  # Drop roughly 90% of DEBUG logs to control volume
  if .level == "DEBUG" {
    if random_float(0.0, 1.0) < 0.9 {
      abort
    }
  }
  # Enrich with environment metadata
  .env = get_env_var!("ENVIRONMENT")
  .cluster = get_env_var!("CLUSTER_NAME")
'''
[transforms.route_by_level]
type = "route"
inputs = ["parse_json"]
[transforms.route_by_level.route]
errors = '.level == "ERROR" || .level == "WARN"'
debug = '.level == "DEBUG"'
info = '.level == "INFO"'
# Fast-access store for errors — searchable within seconds
[sinks.elasticsearch_errors]
type = "elasticsearch"
inputs = ["route_by_level.errors"]
endpoint = "https://es-cluster.internal:9200"
index = "logs-errors-%Y.%m.%d"
bulk.action = "index"
# Cheap archival for debug: gzip-compressed JSON in S3 (convert to Parquet downstream if Athena needs it)
[sinks.s3_debug]
type = "aws_s3"
inputs = ["route_by_level.debug"]
bucket = "logs-archive-prod"
key_prefix = "debug/%Y/%m/%d/%H/"
encoding.codec = "json"
compression = "gzip"
batch.timeout_secs = 300
# Loki gets everything — unified querying across all levels
[sinks.loki]
type = "loki"
inputs = ["route_by_level.errors", "route_by_level.info", "route_by_level.debug"]
endpoint = "http://loki.monitoring.svc.cluster.local:3100"
labels.service = "{{ service }}"
labels.level = "{{ level }}"
labels.env = "{{ env }}"
encoding.codec = "json"
Log Routing and Filtering
The most expensive logs are the ones that carry no signal. Health checks, readiness probes, and /metrics scrapes can easily account for 20-30% of your total log volume in a Kubernetes environment, and they tell you nothing you don't already know from your uptime monitor.
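Before dropping anything, it is worth measuring how much of your own volume is probe noise. A rough sketch that computes the byte share of matching lines in a sample; `noise_share` is a hypothetical helper, and the pattern mirrors the paths excluded in the Fluent Bit grep filter.

```python
import re

# Same paths the Fluent Bit grep filter excludes.
NOISE_RE = re.compile(r"/healthz|/readyz|/metrics")


def noise_share(lines) -> float:
    """Fraction of total bytes taken by probe and scrape lines (0.0 for empty input)."""
    total = 0
    noise = 0
    for line in lines:
        size = len(line.encode("utf-8"))
        total += size
        if NOISE_RE.search(line):
            noise += size
    return noise / total if total else 0.0
```

Run it over an hour of raw container logs from a busy node; the result tells you whether the edge filter is worth prioritizing.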
Drop them at the node level in Fluent Bit (shown above) before they hit the network. The second layer of filtering happens in Vector with the sampling logic in the VRL transform — 90% of DEBUG logs dropped randomly preserves enough for pattern detection without the volume. For services where you need guaranteed DEBUG capture on specific requests (say, for a specific customer ID during an investigation), add a request header X-Debug-Capture: true that bypasses the sampling:
# In the VRL remap transform, replace the DEBUG sampling block:
if .level == "DEBUG" {
  if !exists(.http_request_debug_capture) || .http_request_debug_capture != "true" {
    if random_float(0.0, 1.0) < 0.9 {
      abort
    }
  }
}
Loki LogQL for Incident Investigation
Loki's LogQL is designed for this kind of log pipeline. The label-first model (you select streams by labels, then filter within them) means queries are fast even over large volumes — Loki only reads chunks that match your label selectors.
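The label-first idea can be shown with a toy in-memory model. This is not Loki's implementation, only an illustration of why narrow label selectors reduce the number of lines that have to be scanned; `TinyLogStore` is invented for this example.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy label-first store: streams are keyed by their full label set."""

    def __init__(self):
        self._streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = ""):
        """Select streams by labels first, then scan only those streams' lines."""
        wanted = set(selector.items())
        scanned = 0
        hits = []
        for stream_labels, lines in self._streams.items():
            if not wanted <= stream_labels:
                continue  # stream skipped without reading a single line
            for line in lines:
                scanned += 1
                if contains in line:
                    hits.append(line)
        return hits, scanned
```

The `scanned` count is the point: a selector with both `service` and `level` touches fewer lines than one with `level` alone, which is the same reason to put the most selective labels in the stream selector of a real query.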
# Find all errors in the payments service in the last 30 minutes
{service="payments-api", level="ERROR"} | json | line_format "{{.message}} trace={{.trace_id}}"
# Follow a specific trace across all services
{env="prod"} | json | trace_id="4bf92f3577b34da6"
# Count error rate by service over time (metric extraction from logs)
sum by (service) (
rate({env="prod", level="ERROR"} [5m])
)
# Find slow requests — extract latency_ms and filter
{service="payments-api"} | json | latency_ms > 2000 | line_format "{{.latency_ms}}ms {{.request_id}} {{.customer_id}}"
# Pattern detection — find messages that share structure even if values differ
{service="payments-api", level="ERROR"} | pattern "<_> failed: <reason> after <_> retries" | line_format "{{.reason}}"
# Log-to-metric bridge for alerting on log patterns
count_over_time({service="payments-api"} |= "payment declined" [1m])
The exemplars bridge between logs and metrics is particularly powerful: if your metrics include a trace_id exemplar label (supported in Prometheus 2.26+ and exposed via OpenTelemetry), Grafana can jump from a metric spike directly to the Loki stream for that trace. Configure it in the application's Prometheus histogram:
from prometheus_client import Histogram
PAYMENT_LATENCY = Histogram("payment_latency_seconds", "Payment processing latency",
                            labelnames=["status"])
# When recording, pass an exemplar with the current trace_id. current_trace_id() stands in
# for your tracing library's accessor; exemplars are only exposed in OpenMetrics format.
PAYMENT_LATENCY.labels(status="success").observe(0.342, exemplar={"trace_id": current_trace_id()})
Retention Policy Design
Retention is a compliance question first and a cost question second. Get the compliance requirements wrong and the cost savings mean nothing. A working starting point for most B2B SaaS:
| Tier | Storage | Retention | What Lives Here |
|---|---|---|---|
| Hot | Elasticsearch / Loki ingesters | 7 days | All levels; fully indexed; sub-second queries |
| Warm | Loki object store (S3 + index) | 30 days | INFO and above; queryable but slower |
| Cold | S3 Glacier / raw Parquet | 1–7 years | Compliance-required logs (auth, payments, PII access); Athena-queryable |
| Delete | n/a | After the tier's retention ends (7 years at most) | Everything without a compliance requirement |
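The table can be read as a decision function over a record's age, level, and category. A sketch under stated assumptions: the category names are invented for illustration, and the cold tier uses the table's 7-year upper bound.

```python
COMPLIANCE_CATEGORIES = {"auth", "payments", "pii_access"}  # assumed category names
LEVEL_RANK = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}


def tier_for(age_days: int, level: str, category: str = "") -> str:
    """Map a log record to hot, warm, cold, or delete using the table's boundaries."""
    if age_days <= 7:
        return "hot"  # all levels
    if age_days <= 30 and LEVEL_RANK.get(level.upper(), 1) >= LEVEL_RANK["INFO"]:
        return "warm"  # INFO and above
    if category in COMPLIANCE_CATEGORIES and age_days <= 7 * 365:
        return "cold"  # compliance-required logs only
    return "delete"
```

Writing the policy down this way makes gaps visible, for example what should happen to a ten-day-old DEBUG line that belongs to a compliance category.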
Auth logs (login success/failure, permission changes) and payment logs almost always have a regulatory floor — PCI DSS requires one year, SOC 2 expects you to define and enforce a policy. Build your Loki retention rules and S3 lifecycle policies to enforce this automatically:
# Loki retention config (loki.yaml)
compactor:
  working_directory: /data/retention
  retention_enabled: true
  retention_delete_delay: 2h
limits_config:
  retention_period: 744h   # 31 days default
  # Per-stream overrides (the selector and period here are illustrative)
  retention_stream:
    - selector: '{level="DEBUG"}'
      priority: 1
      period: 168h

# S3 lifecycle for the cold archive bucket (applied via Terraform)
# resource "aws_s3_bucket_lifecycle_configuration" "logs_archive" {
#   bucket = aws_s3_bucket.logs_archive.id   # assumed bucket resource name
#   rule {
#     id     = "transition-to-glacier"
#     status = "Enabled"
#     filter {}
#     transition {
#       days          = 90
#       storage_class = "GLACIER"
#     }
#     expiration {
#       days = 2555   # 7 years
#     }
#   }
# }
Log-Driven Cost Control
In hosted platforms (Datadog, Splunk Cloud, New Relic), log cost has three components: ingestion, indexing, and retention. Ingestion and indexing are where teams hemorrhage money.
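A back-of-the-envelope model of those three components. Every price below is a placeholder, not any vendor's rate card, and charging retention per indexed GB-day is an assumption; substitute the terms from your own contract.

```python
def monthly_log_cost(ingested_gb: float, indexed_fraction: float, retention_days: int,
                     ingest_per_gb: float, index_per_gb: float,
                     retain_per_gb_day: float) -> dict:
    """Sum the three components: ingestion, indexing, and retention."""
    indexed_gb = ingested_gb * indexed_fraction
    ingestion = ingested_gb * ingest_per_gb
    indexing = indexed_gb * index_per_gb
    retention = indexed_gb * retain_per_gb_day * retention_days
    return {
        "ingestion": ingestion,
        "indexing": indexing,
        "retention": retention,
        "total": ingestion + indexing + retention,
    }


# Placeholder prices for illustration only.
PRICES = dict(ingest_per_gb=0.25, index_per_gb=1.5, retain_per_gb_day=0.01)
baseline = monthly_log_cost(1000, 1.0, 30, **PRICES)
trimmed = monthly_log_cost(1000, 0.5, 30, **PRICES)
print(baseline["total"], trimmed["total"])
```

In this model, indexing half the volume leaves the ingestion charge unchanged and halves the other two components, which is why routing low-value logs to an unindexed archive pays off even when you still ship them.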
The most expensive log sources are almost always:
- Istio/Envoy access logs — one line per request per sidecar, often at DEBUG, for every service-to-service call
- Database slow query logs misconfigured with a threshold of 0ms (logging every query)
- Health check and readiness probe endpoints not excluded at the collector
- Batch job progress logs emitting a line every record processed
Audit your top sources before tuning anything. In Loki, this query identifies your highest-volume label streams:
# Top 10 services by log volume (bytes) in the last hour
topk(10, sum by (service) (bytes_over_time({env="prod"}[1h])))
Then work through a tiered reduction strategy:
- Drop at source: health checks, Envoy access logs for 2xx responses, batch progress logs (replace with a single summary line at job completion)
- Sample at collector: DEBUG at 10%, high-volume INFO endpoints (the /api/v1/status endpoint that gets polled every 5 seconds) at 1%
- Reduce field cardinality: don't log raw SQL with parameter values embedded in the query string, because that creates unbounded cardinality in any index. Log query name and bind parameters separately.
- Shorten messages: a 4KB log line costs 10x what a 400-byte line costs. Stack traces belong in error tracking (Sentry, Rollbar), not in your log platform.
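The sampling step can be sketched as a small rule table. The keep rates are the ones listed above; the rule shape, the `path` field, and `should_keep` are assumptions made for illustration, and the bypass field name follows the `http_request_debug_capture` field used in the VRL example earlier.

```python
import random

# (level, path) -> keep rate; a path of None is the default for that level.
KEEP_RATES = {
    ("DEBUG", None): 0.10,
    ("INFO", "/api/v1/status"): 0.01,
}


def should_keep(event: dict, rng=random.random) -> bool:
    """Decide whether an event survives sampling at the collector."""
    if event.get("http_request_debug_capture") == "true":
        return True  # explicit capture requests bypass sampling
    level = str(event.get("level", "INFO")).upper()
    default_rate = KEEP_RATES.get((level, None), 1.0)
    rate = KEEP_RATES.get((level, event.get("path")), default_rate)
    return rng() < rate
```

Passing `rng` as a parameter keeps the decision testable; in production the default `random.random` applies.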
The combination of Fluent Bit edge filtering, Vector sampling and routing, and a well-designed retention policy can reduce hosted log platform spend by 40-60% without losing a single meaningful signal. The work is in auditing what you actually query during incidents and being ruthless about the rest.
*Zak Hassan is a Staff SRE specializing in observability, log infrastructure, and reliability engineering at scale. Find him at zakhassan.com or on LinkedIn.*