Topic Hub

Observability and Incident Learning

Prometheus, Grafana, OpenTelemetry, tracing, metrics, alerting, and operational feedback loops that reduce uncertainty during incidents.

Curated Writing

76 posts in this signal path.

9 min read · SRE Research

On-Call Engineering: Reducing Toil and Improving Handoffs

On-call is where reliability theory meets operational reality. The best-designed system still has a human being paged at 2am when something goes wrong, and that human being's ability to respond effectively — quickly, calmly, without making things worse —...

7 min read · SRE Research

What On-Call Actually Feels Like in 2026

I've been on-call in some form for most of my career. Early on, that meant being woken up by a PagerDuty alert at 3am, fumbling for my laptop, opening a blog post in one tab and a dozen monitoring dashboards in another, and spending the next hour trying to...

7 min read · AI Infrastructure

AI-Powered Security Operations: What Actually Works in 2026

Security operations and SRE share more DNA than either community usually acknowledges. Both involve monitoring large volumes of signals to detect anomalies, both require rapid triage and investigation when something goes wrong, and both are fighting the same...

6 min read · Observability

The Observability Landscape in 2026: Choosing Your Stack

The observability market has consolidated significantly over the past few years, but "consolidated" doesn't mean "simple." Datadog, Grafana Cloud, New Relic, Honeycomb, Dynatrace, and a collection of newer entrants are all competing for the same budget line.

6 min read · Platform Engineering

What Platform Engineering Looks Like at E-Commerce Scale

Platform engineering is having a moment. The concept — a dedicated team building the internal tools, abstractions, and "paved roads" that product engineers use to ship software — has moved from a boutique practice at FAANG companies to a mainstream...

7 min read · AI Infrastructure

I Modeled a 6x Cloud Cost Reduction with an LLM Agent

Cloud cost optimization is one of those problems that's theoretically easy and practically miserable. Everyone knows the levers: right-size instances, delete unused resources, use Spot where possible, move cold data to cheaper storage tiers.
