Observability-Driven Development: Instrumentation as a Definition of Done

Test-driven development taught a generation of engineers to write tests before code. The discipline worked because it forced engineers to think about correctness before implementation, and it created a feedback loop that caught regressions automatically. Observability-driven development (ODD) is the same idea applied to production understanding: write the instrumentation before — or alongside — the code, and treat unobservable code as unfinished code.

Most teams treat observability as something you add after a feature ships. "We can add metrics once it's in production." "We can add tracing when we have a problem to debug." This approach is backwards, and it's expensive: you discover the gaps in your observability at exactly the moment you need it most, during an incident, at 2am, under pressure.

Why "We Can Add It Later" Doesn't Work

The economics of post-hoc instrumentation are worse than they look. When a service goes to production without instrumentation:

You don't know what "normal" looks like. Anomaly detection, trend analysis, and alerting all require a baseline. A service that has run unobserved for six months has six months of production behavior that was never recorded and can never be recovered. Your first alert threshold is set without knowing whether 50ms latency is normal or elevated.

The instrumentation gap compounds. A service that ships without tracing depends on other services that also shipped without tracing. When you add tracing to service A, you get incomplete traces that terminate at the boundary with service B. Fixing observability in a system of uninstrumented services requires coordinated effort across many teams.

Engineers forget the internals. The engineer who built the service knows exactly which code paths are performance-critical, which caches can go stale, and which database queries are slow under specific conditions. That knowledge has largely evaporated six months after the feature ships. Adding instrumentation later means re-learning what should have been captured at the time.

Making Observability Part of the Definition of Done

The organizational mechanism that enforces ODD is simple: a service or feature is not done until it has the required instrumentation. This needs to be a team agreement, enforced in code review, not an aspiration in a document nobody reads.

A practical definition of done checklist for a new service:

## Observability Checklist — Required Before Production

### Logging
- [ ] All errors logged at ERROR level with full context (request ID, user ID, relevant parameters)
- [ ] All external API calls logged with method, URL (sanitized), status code, latency
- [ ] Structured JSON format (not free text)
- [ ] No PII in log output (validated, not assumed)

### Metrics
- [ ] Request rate exposed (counter)
- [ ] Error rate exposed (counter, by error type)
- [ ] Latency distribution exposed (histogram, P50/P95/P99)
- [ ] Resource utilization (CPU, memory, connection pool usage)
- [ ] Business metric: at least one metric that reflects business value
  (e.g., orders_processed, users_authenticated, documents_indexed)

### Tracing
- [ ] All inbound requests produce a root span
- [ ] All outbound calls (database, cache, external APIs) produce child spans
- [ ] Trace context propagated to all downstream services
- [ ] Error status set on spans when errors occur

### Alerting
- [ ] At least one SLO-aligned alert configured before production traffic
- [ ] Alert routes to the correct on-call rotation
- [ ] Alert runbook written and linked from the alert definition

### Dashboards
- [ ] Service dashboard exists with: request rate, error rate, latency, key business metric
- [ ] Dashboard link added to service runbook

This checklist goes in the PR template. "Observability checklist complete" is a required field before merge. The PR reviewer validates it — not by checking a box, but by looking at the instrumentation code in the diff.
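The logging items in the checklist are the easiest to make concrete. What follows is a minimal sketch using only the Python standard library, not a prescribed implementation. The `JsonFormatter` class, the `build_logger` helper, and the context field names (`request_id`, `user_id`, `operation`) are assumptions chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (field names are illustrative)."""

    # Context keys callers are expected to pass via `extra=`; an assumption.
    CONTEXT_KEYS = ("request_id", "user_id", "operation")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy only the whitelisted context fields that are present on the record.
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        # Include the traceback when the record was logged with exception info.
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("payment-service").error("charge failed", extra={"request_id": "req-123"})` then emits a single JSON line with the request ID attached. Whitelisting context keys, rather than dumping every attribute, is one way to keep unexpected PII out of the output.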

The Four Golden Signals, Done Right

The four golden signals (rate, errors, latency, saturation) are the minimum instrumentation baseline. Getting them right takes more care than most implementations give them.

Rate: count requests, not just HTTP requests. For a service that also processes queue messages, runs batch jobs, or handles WebSocket connections, HTTP request rate is incomplete. Count all meaningful units of work the service performs.

Errors: separate categories matter. A 401 (authentication failure) is not the same reliability signal as a 503 (service unavailable). Client errors (4xx) indicate misuse or auth issues; server errors (5xx) indicate reliability problems. Track them separately and alert on them differently.

from opentelemetry import metrics

meter = metrics.get_meter("payment-service")

request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests",
)

error_counter = meter.create_counter(
    "http_errors_total",
    description="Total HTTP errors by category",
)

latency_histogram = meter.create_histogram(
    "http_request_duration_ms",
    description="HTTP request latency in milliseconds",
    unit="ms",
)

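def record_request(method: str, path: str, status_code: int, duration_ms: float):
    # normalize_path is assumed to be defined elsewhere. It should collapse IDs in
    # the path (e.g. /orders/123 -> /orders/:id) so the "path" label stays bounded.
    labels = {"method": method, "path": normalize_path(path), "status": str(status_code)}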
    
    request_counter.add(1, labels)
    latency_histogram.record(duration_ms, labels)
    
    if status_code >= 500:
        error_counter.add(1, {**labels, "error_category": "server_error"})
    elif status_code == 429:
        error_counter.add(1, {**labels, "error_category": "rate_limited"})
    elif status_code >= 400:
        error_counter.add(1, {**labels, "error_category": "client_error"})
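The `record_request` function above calls `normalize_path`, which the snippet does not define. One possible sketch, assuming that ID-like path segments are either purely numeric or UUIDs, is shown here. The placeholder names `:id` and `:uuid` are arbitrary, and if your web framework exposes the matched route template, using that is usually the more reliable option.

```python
import re

# Assumed shapes for ID-like path segments; adjust to your own routes.
_NUMERIC_RE = re.compile(r"^\d+$")
_UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(path: str) -> str:
    """Replace ID-like segments with placeholders to keep label cardinality low."""
    path = path.split("?", 1)[0]  # drop the query string, if any
    segments = []
    for segment in path.split("/"):
        if _NUMERIC_RE.match(segment):
            segments.append(":id")
        elif _UUID_RE.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)
```

With this sketch, `/orders/12345/items?page=2` and `/orders/67890/items` both map to `/orders/:id/items`, so they share one time series instead of creating one per order.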

Latency: histogram over average. Average latency is misleading — a service with P50 latency of 10ms and P99 latency of 5,000ms has "average" latency of maybe 60ms, which tells you nothing useful about the 1% of users experiencing 5-second responses. Always instrument with a histogram and expose percentiles.
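The gap between the average and the tail is easy to check with a few lines of arithmetic. The sketch below uses a synthetic sample, an assumption made for illustration: 989 requests at 10ms and 11 at 5,000ms, so that the 99th percentile falls inside the slow tail. The `percentile` helper is a simple nearest-rank implementation, not what a metrics backend does with histogram buckets.

```python
import math
import statistics
from typing import Sequence


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Synthetic sample: 989 fast requests at 10 ms, 11 slow requests at 5,000 ms.
latencies_ms = sorted([10.0] * 989 + [5000.0] * 11)

mean_ms = statistics.fmean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.2f}ms p50={p50_ms:.0f}ms p99={p99_ms:.0f}ms")
```

This prints `mean=64.89ms p50=10ms p99=5000ms`. The mean sits far from both the typical request and the slow one, which is why it describes neither.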

Saturation: measure what you're most likely to run out of. For CPU-bound services: CPU utilization. For I/O-bound services: I/O wait. For database-heavy services: connection pool utilization. For memory-constrained services: heap utilization with GC pressure. The right saturation metric is different per service — choose the resource that's most likely to be the bottleneck.
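For the connection pool case, saturation comes down to a ratio plus a queue signal. This is a rough sketch under stated assumptions: the `PoolSnapshot` shape is hypothetical (real pool libraries expose these numbers under different names), and the 0.8 threshold is an arbitrary starting point rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSnapshot:
    """Point-in-time view of a connection pool (hypothetical shape)."""

    in_use: int
    max_size: int
    waiting: int = 0  # callers currently blocked waiting for a connection


def pool_utilization(snapshot: PoolSnapshot) -> float:
    """Fraction of pool capacity currently checked out."""
    if snapshot.max_size <= 0:
        raise ValueError("max_size must be positive")
    return snapshot.in_use / snapshot.max_size


def is_saturated(snapshot: PoolSnapshot, threshold: float = 0.8) -> bool:
    """Saturated when utilization reaches the threshold or any caller is queued."""
    return snapshot.waiting > 0 or pool_utilization(snapshot) >= threshold
```

Queued callers are treated as saturation regardless of the ratio, because waiting work is the earliest visible symptom of a resource running out.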

Business Metrics: The Signal SREs Miss

Infrastructure metrics tell you whether the service is healthy. Business metrics tell you whether the service is doing its job. The two are not always correlated.

A service with 0% error rate and <100ms P99 latency is "technically healthy." If the service is an order processing pipeline and it's processing zero orders, it's not healthy from a business perspective. If it's an email sending service and it's sending emails but they're all going to spam, technical metrics are green and business outcomes are red.

Every service should have at least one metric that reflects business value:

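# Business metrics alongside infrastructure metrics.
# Reuses `meter` from the previous example. Order, ProcessingResult, and
# _process_order_internal are application code assumed to exist elsewhere.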
orders_processed = meter.create_counter(
    "orders_processed_total",
    description="Total orders successfully processed",
)

orders_failed = meter.create_counter(
    "orders_failed_total",
    description="Total orders that failed processing, by reason",
)

order_value_processed = meter.create_counter(
    "order_value_processed_cents",
    description="Total value of successfully processed orders in cents",
)

def process_order(order: Order) -> ProcessingResult:
    result = _process_order_internal(order)
    
    if result.success:
        orders_processed.add(1, {"payment_method": order.payment_method})
        order_value_processed.add(order.amount_cents, {"currency": order.currency})
    else:
        orders_failed.add(1, {
            "reason": result.failure_reason,
            "payment_method": order.payment_method
        })
    
    return result

Business metrics enable the SLO definitions that actually matter: "99.5% of payment processing attempts succeed" rather than "99.9% HTTP availability."
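To show how counters like these feed an SLO, here is a minimal sketch that turns success and failure counts over a window into a success ratio and a remaining error budget. The function names and the handling of an empty window are assumptions. In practice this calculation usually lives in the metrics backend's query language rather than in application code.

```python
def success_ratio(succeeded: int, failed: int) -> float:
    """Share of attempts that succeeded in the window."""
    total = succeeded + failed
    if total == 0:
        # No attempts means nothing violated the objective. Zero traffic
        # should be caught by a separate alert, not by this ratio.
        return 1.0
    return succeeded / total


def error_budget_remaining(succeeded: int, failed: int, slo_target: float) -> float:
    """Fraction of the error budget left; negative when the budget is overspent."""
    total = succeeded + failed
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


# Example window: 9,960 orders processed, 40 failed, against a 99.5% target.
ratio = success_ratio(9960, 40)
budget = error_budget_remaining(9960, 40, slo_target=0.995)
```

In this example the ratio is 0.996, which meets the 99.5% target, and roughly 20% of the error budget remains: 40 failures were spent out of the 50 the target allows for 10,000 attempts.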

Instrumentation Code Review: What to Look For

When reviewing a PR for observability quality, the questions to ask:

Is this code's behavior observable in production? If something goes wrong with this code path, will you be able to see it in your monitoring? Can you tell whether the new feature is being used? Can you measure its performance impact?

Are errors surfaced or swallowed? A try-except that logs "An error occurred" and returns a default value is hiding signal. Errors should be logged with full context, counted in metrics, and (where appropriate) recorded as span errors in traces.
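The difference between swallowing and surfacing is easiest to see in code. The following is an illustrative sketch only: `fetch_rate`, `convert`, and `RateLookupError` are hypothetical, and the `Counter` stands in for a real metrics counter. The point is the shape of the `except` block, which counts the failure, logs it with context, and re-raises instead of returning a default.

```python
import logging
from collections import Counter

logger = logging.getLogger("payment-service")

# Stand-in for a real metrics counter, keyed by error type (illustrative only).
error_counts: Counter = Counter()


class RateLookupError(Exception):
    """Raised when an exchange rate cannot be found."""


def fetch_rate(currency: str) -> float:
    # Hypothetical dependency; fails for currencies it does not know.
    rates = {"EUR": 1.08, "GBP": 1.27}
    try:
        return rates[currency]
    except KeyError as exc:
        raise RateLookupError(f"no rate for {currency}") from exc


def convert(amount: float, currency: str, request_id: str) -> float:
    try:
        return amount * fetch_rate(currency)
    except RateLookupError:
        # Surface the failure: count it, log it with context, then re-raise
        # so the caller decides what to do instead of receiving a silent default.
        error_counts["rate_lookup_failed"] += 1
        logger.exception(
            "rate lookup failed",
            extra={"request_id": request_id, "currency": currency},
        )
        raise
```

A reviewer looking at this diff can answer the question directly: the failure shows up in a metric, in a log line with the request ID, and in the caller's control flow.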

Is the metric cardinality reasonable? High-cardinality labels (user ID, request ID, IP address) in metrics create performance problems for your metrics backend and cost explosions for hosted metrics. User IDs belong in traces, not in metric labels. Reject metrics that use unbounded cardinality dimensions.
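Part of this review can be automated. The sketch below shows two hedged approaches: a static deny-list of label keys, and a runtime guard that counts distinct values per label. The key names in `HIGH_CARDINALITY_KEYS` and the default limit of 100 are assumptions to tune for your own backend.

```python
from collections import defaultdict

# Label keys assumed to be unbounded; extend for your own domain.
HIGH_CARDINALITY_KEYS = frozenset(
    {"user_id", "request_id", "ip_address", "session_id", "email"}
)


def check_labels(labels: dict[str, str]) -> list[str]:
    """Return the label keys that should not be used as metric dimensions."""
    return sorted(key for key in labels if key.lower() in HIGH_CARDINALITY_KEYS)


class CardinalityGuard:
    """Track distinct values per label key and flag keys that exceed a limit."""

    def __init__(self, max_values_per_label: int = 100) -> None:
        self.max_values_per_label = max_values_per_label
        self._seen: dict[str, set[str]] = defaultdict(set)

    def observe(self, labels: dict[str, str]) -> list[str]:
        """Record one label set; return the keys currently over the limit."""
        for key, value in labels.items():
            self._seen[key].add(value)
        return sorted(
            key
            for key, values in self._seen.items()
            if len(values) > self.max_values_per_label
        )
```

The deny-list catches the obvious cases in a unit test or lint step. The guard catches labels that look harmless by name but turn out to be unbounded once real traffic arrives.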

Do the metric names follow your naming convention? Inconsistent naming makes metrics hard to discover and use. Enforce a naming standard such as {service}_{noun}_{unit}_{aggregation}, e.g., payment_service_request_duration_ms_p99.
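A convention like this is only enforceable if something checks it. Here is one possible validator, offered as a sketch: the lists of known services, units, and aggregations are assumptions, since the convention names the slots but does not enumerate their values. Because service names can themselves contain underscores, the pattern matches against an explicit list of services instead of trying to split on `_`.

```python
import re
from typing import Optional

# Assumed vocabularies; the convention does not enumerate them.
KNOWN_SERVICES = ("payment_service", "order_service")
ALLOWED_UNITS = ("ms", "seconds", "bytes", "cents", "count")
ALLOWED_AGGREGATIONS = ("total", "p50", "p95", "p99", "max")

_NAME_RE = re.compile(
    r"^(?P<service>{services})_(?P<noun>[a-z][a-z0-9_]*?)"
    r"_(?P<unit>{units})_(?P<aggregation>{aggs})$".format(
        services="|".join(KNOWN_SERVICES),
        units="|".join(ALLOWED_UNITS),
        aggs="|".join(ALLOWED_AGGREGATIONS),
    )
)


def parse_metric_name(name: str) -> Optional[dict]:
    """Split a metric name into its convention parts, or return None if invalid."""
    match = _NAME_RE.match(name)
    return match.groupdict() if match else None
```

Running a check like this over every metric name registered in a PR gives the reviewer a mechanical answer, and the parsed parts can double as a search index for metric discovery.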

*Zak Hassan is a Staff SRE specializing in observability engineering, AI-powered operations, and data platform reliability. Find him at zakhassan.com or on LinkedIn.*
