Building a Learning Culture from Incidents: Beyond Blameless Postmortems

*By Zak Hassan — Staff SRE | May 2026*

When a system fails, most engineering organizations reach for the same playbook: hold a postmortem, write up a timeline, assign action items, close the ticket. Over the last decade, the SRE community successfully evangelized removing blame from this process — and that was necessary progress. But somewhere along the way, "blameless postmortem" became conflated with "learning culture," as if dropping the punishment reflex was the end of the work rather than the beginning. Organizations that stop there end up with a process that is psychologically safer but epistemically no richer. The same classes of incidents recur. Action items pile up unresolved in Jira. People leave postmortem meetings relieved they weren't blamed, but not fundamentally changed in how they understand their systems. Building a genuine learning culture from incidents requires going much further — into how humans reason under pressure, who speaks in rooms, how reviews are structured, and how organizations measure whether any of this is actually working.

Why Blameless Is Necessary But Not Sufficient

Removing blame solves a specific problem: it stops individuals from hiding information out of fear of punishment. When an engineer knows that admitting "I didn't check the runbook" won't result in a mark against their performance, they're more likely to tell the truth about the sequence of events. That truth-telling is the prerequisite for learning. Without it, you're analyzing a sanitized version of reality.

But blamelessness, on its own, creates a vacuum. If you remove blame without replacing it with a genuine inquiry framework, postmortems tend to drift toward comfortable conclusions. Teams identify the proximate technical cause — the query that caused the table lock, the deployment that flipped the feature flag — and declare victory. The contributing factors that made the system brittle enough for that trigger to matter go unexamined. A blameless culture can still produce shallow analysis. The difference between a blameless culture and a learning culture is that the latter is designed to generate insight, not just safety. It asks harder questions, uses structured techniques to surface system conditions, and builds organizational memory that changes future behavior.

The Fundamental Attribution Error in Incident Reviews

Cognitive psychology has a name for the bias that derails most postmortems: the Fundamental Attribution Error. When something goes wrong, humans systematically over-weight individual character and decision-making as causes, and under-weight situational and systemic factors. An engineer deployed on a Friday afternoon before a long weekend; the system fell over. The Fundamental Attribution Error pushes us toward "that was a bad decision" rather than "what conditions made that deployment seem reasonable at the time?"

This matters practically because individual attribution produces individual solutions (training, checklists, "don't do that again"), while systemic attribution produces systemic solutions (deployment gates, better observability, capacity buffers). Individual solutions don't scale and don't transfer. Systemic solutions do.

Designing incident reviews to resist this bias means changing the questions you ask. Replace "who deployed this?" with "what information was available to the person making this decision, and would a reasonable engineer have acted differently given that same information?" Replace "why didn't we catch this in code review?" with "what would code review have needed to look like for this class of issue to be detectable?" These framings force the room to reason about system conditions rather than individual failures, and they produce action items that actually change the environment rather than lecturing individuals who already feel bad.

Psychological Safety and Who Speaks in Postmortem Meetings

Even in a formally blameless environment, postmortem meetings have power dynamics. The engineer who was on-call during the incident, the junior developer who wrote the implicated code, the product manager who deprioritized the tech debt that contributed — none of these people enter the room with equal standing or equal comfort speaking. Senior engineers and managers talk more. People with less status self-censor. The result is that postmortems systematically miss information held by the people closest to the work.

Several structural techniques help counteract this. Pre-meeting surveys, sent to all incident participants before the postmortem, allow people to contribute observations asynchronously and pseudonymously. The facilitator synthesizes these into the meeting agenda, which means quieter voices shape the conversation before it starts, without requiring anyone to assert themselves in a room full of more senior colleagues. Anonymous contribution channels — a shared doc where people can add observations without attribution — serve a similar function during the meeting itself.

Round-robin check-ins, where the facilitator explicitly asks each participant for one observation before opening general discussion, break the pattern where two or three confident speakers fill all available airtime. The goal isn't to force people to talk; it's to create a moment where speaking is the default rather than an act of courage. These are small procedural changes, but they materially alter whose knowledge enters the analysis.

Learning Reviews vs. Postmortems

A postmortem is incident-specific. It asks: what happened in this event, why, and what will the team do about it? That's a necessary unit of analysis, but it's not sufficient for organizational learning. Individual incidents are noisy. A single postmortem might produce idiosyncratic findings that don't generalize. Treating each incident in isolation also makes it easy to miss patterns: the same class of failure appearing in different subsystems, or the same type of action item that never gets implemented.

Learning reviews operate at a different timescale and scope. Run quarterly, they ask: what did the organization learn across all incidents this period? Which failure modes recurred? Which action items were completed, and which were deprioritized? What does the incident data suggest about systemic risks that haven't been addressed? A learning review treats the postmortem corpus as evidence about the state of the sociotechnical system, not just a collection of individual events.

The template below shows what a learning review agenda might look like as a structured document, used by the facilitator to guide a 90-minute quarterly session:

## Quarterly Learning Review Template

**Period:** Q[N] [Year]
**Facilitator:** [Name]
**Participants:** SRE leads, Engineering managers, On-call representatives

### 1. Incident Inventory (15 min)
- Total incidents by severity (P1/P2/P3)
- Affected services and domains
- Incidents with completed postmortems vs. outstanding

### 2. Pattern Analysis (20 min)
- Recurring failure modes (same root class appearing ≥2x)
- Services with disproportionate incident load
- Time-of-day / deployment-correlation patterns

### 3. Action Item Audit (20 min)
- Action items from previous quarter: completed / in-progress / deprioritized
- Deprioritized items: were any implicated in this quarter's incidents?
- Velocity trend: are action items closing faster or slower than they open?

### 4. Psychological Safety Pulse (15 min)
- Anonymous survey results: "I feel comfortable raising concerns during postmortems"
- Facilitator observations: participation distribution across postmortems
- Changes to postmortem format trialed this quarter and outcomes

### 5. Forward-Looking Risks (20 min)
- Based on patterns: what failure modes are we NOT seeing that we should be?
- Known tech debt with incident potential
- Proposed systemic investments for next quarter
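
The pattern analysis step in section 2 of the template can be prepared before the session from an export of the quarter's incidents. The following is a minimal sketch, assuming each incident record carries a `service` and a `failure_mode` tag; the field names and sample records are hypothetical and would need to match whatever your incident tracker exports.

```python
from collections import Counter

# Hypothetical incident export; field names are assumptions for illustration.
incidents = [
    {"id": "INC-101", "service": "checkout", "failure_mode": "cache-invalidation"},
    {"id": "INC-102", "service": "search", "failure_mode": "config-drift"},
    {"id": "INC-103", "service": "checkout", "failure_mode": "cache-invalidation"},
    {"id": "INC-104", "service": "checkout", "failure_mode": "dependency-timeout"},
]


def recurring_failure_modes(records, min_count=2):
    """Return (failure_mode, count) pairs seen at least min_count times."""
    counts = Counter(r["failure_mode"] for r in records)
    return [(mode, n) for mode, n in counts.most_common() if n >= min_count]


def incident_share_by_service(records):
    """Return each service's fraction of the total incident count."""
    counts = Counter(r["service"] for r in records)
    total = sum(counts.values())
    return {service: n / total for service, n in counts.items()}
```

A service whose share is far above its share of traffic or ownership is a candidate for the "disproportionate incident load" discussion; what counts as disproportionate is a judgment the group makes, not something this sketch decides.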

Counterfactual Analysis: Finding Leverage Points

Traditional root cause analysis looks backward and asks "what caused this?" Counterfactual analysis asks a subtly different question: "what would have had to be true for this incident not to happen?" That shift in framing changes what you find.

Root cause analysis tends to terminate at the first plausible explanation, which is usually the proximate trigger. Counterfactual analysis forces you to enumerate multiple conditions that were jointly sufficient for the outcome and then, crucially, to assess which of those conditions you can actually change. An incident might not have happened if: the cache invalidation logic had been different, or the traffic load had been 20% lower, or the alert had fired 10 minutes earlier, or the runbook had included the escalation path. Each of these is a counterfactual. Each represents a potential leverage point. The question is which ones are worth investing in based on generalizability and cost.

This technique reframes postmortems away from "what was the root cause" (which implies a single answer) toward "what is the portfolio of interventions that would reduce the probability of this class of incident?" That's a more honest framing of how complex systems fail, and it produces richer action item sets.
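
One way to make the portfolio framing concrete is to record each counterfactual with a rough score and rank the ones the team can actually change. The sketch below is illustrative only: the 1-5 scales, the `leverage` ratio, and the example scores are assumptions, not an established method, and a real review would argue about the scores rather than trust them.

```python
from dataclasses import dataclass


@dataclass
class Counterfactual:
    description: str
    changeable: bool       # can the team realistically alter this condition?
    generalizability: int  # 1-5: how many classes of incident it would help with
    cost: int              # 1-5: rough effort to implement

    @property
    def leverage(self) -> float:
        # Hypothetical heuristic: broader benefit per unit of effort ranks higher.
        return self.generalizability / self.cost


def rank_leverage_points(items):
    """Drop conditions outside the team's control, then sort by leverage."""
    candidates = [c for c in items if c.changeable]
    return sorted(candidates, key=lambda c: c.leverage, reverse=True)


# Example scores are made up for illustration.
counterfactuals = [
    Counterfactual("Cache invalidation logic had been different", True, 2, 4),
    Counterfactual("Traffic load had been 20% lower", False, 1, 5),
    Counterfactual("Alert had fired 10 minutes earlier", True, 4, 2),
    Counterfactual("Escalation path had been documented", True, 3, 1),
]

ranked = rank_leverage_points(counterfactuals)
```

The value of an exercise like this is less the ordering itself than the conversation it forces about which conditions are changeable at all.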

Vendor Failures and External Dependencies

When an incident is caused by a third-party outage — a cloud provider's managed database flaps, a payment processor goes down — there is enormous organizational pressure to close the postmortem quickly. "Stripe was down; nothing the team could have done." The incident gets attributed to the external party, the postmortem is thin, and the team moves on.

This is a mistake. Vendor failures are the richest source of architectural learning precisely because they reveal your system's dependencies in their most exposed state. The questions that remain after "Stripe was down" include: did the circuit breakers activate correctly? Did the service degrade gracefully or fail completely? Did customers see a coherent error, or cascading chaos? How long did it take the on-call engineer to determine the failure was external? Did the team have a vendor status dashboard integrated into the observability stack, or was someone manually checking a status page?

Each of these questions points at something within your control. You can't prevent Stripe from having incidents. You can determine how your system behaves when they do. Postmortems for vendor-caused incidents should have a dedicated section: "Given that this external failure will happen again, what is the desired system behavior, and does the current architecture produce it?"
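
To ground the "desired system behavior" question, here is a minimal circuit breaker sketch with a fallback path for graceful degradation. It is a simplified illustration, not a production implementation: the class, its parameters, and the injected `clock` are all hypothetical, and a real service would more likely use an established resilience library and catch narrower exception types.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # After the timeout, allow one trial call through (half-open state).
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, fn, fallback):
        if self.is_open():
            return fallback()  # degrade without touching the vendor
        try:
            result = fn()
        except Exception:  # simplification: real code should catch specific errors
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In a vendor-incident postmortem, the useful questions map directly onto this shape: was there a fallback at all, did the breaker open when expected, and how long did the system keep hammering a dependency that was already down?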

Measuring Whether Your Learning Culture Is Working

Learning cultures are easy to perform and hard to actually build. Organizations can run excellent-looking postmortems, produce detailed timelines and well-formatted action items, and still be making no progress. Measurement is how you tell the difference.

The most direct signal is incident recurrence rate — the fraction of incidents that represent a failure mode you've seen before. A genuine learning culture drives this number down over time. Calculating it requires tagging incidents with failure mode categories and querying across your incident management data:

# Prometheus: fraction of incidents tagged as recurring failure modes
# Requires incident labels: failure_mode, is_recurrence="true|false"

sum(increase(incident_opened_total{is_recurrence="true"}[90d]))
/
sum(increase(incident_opened_total[90d]))
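
If incidents are not exported as metrics, the same ratio can be computed offline from incident records. The sketch below assumes each record has an `opened` date and a `failure_mode` tag (both field names are hypothetical) and treats an incident as a recurrence when its failure mode has appeared in any earlier incident in the data set, which is one possible definition among several.

```python
from datetime import date


def tag_recurrences(incidents):
    """Mark each incident whose failure_mode was already seen in an earlier one."""
    seen = set()
    tagged = []
    for inc in sorted(incidents, key=lambda i: i["opened"]):
        tagged.append({**inc, "is_recurrence": inc["failure_mode"] in seen})
        seen.add(inc["failure_mode"])
    return tagged


def recurrence_rate(tagged):
    """Fraction of incidents flagged as recurrences; None if there are no incidents."""
    if not tagged:
        return None
    return sum(1 for i in tagged if i["is_recurrence"]) / len(tagged)


# Hypothetical records for illustration.
history = [
    {"id": "INC-1", "opened": date(2026, 1, 5), "failure_mode": "cache-invalidation"},
    {"id": "INC-2", "opened": date(2026, 1, 20), "failure_mode": "config-drift"},
    {"id": "INC-3", "opened": date(2026, 2, 11), "failure_mode": "cache-invalidation"},
    {"id": "INC-4", "opened": date(2026, 3, 2), "failure_mode": "cache-invalidation"},
]
rate = recurrence_rate(tag_recurrences(history))
```

The metric is only as good as the tagging: if failure mode categories are too coarse, everything looks like a recurrence, and if they are too fine, nothing does.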

Beyond recurrence rate, track action item closure velocity — the ratio of action items completed per quarter to action items opened. If this ratio is below 1.0, your backlog is growing and systemic debt is accumulating faster than you're paying it down. Track the age distribution of open action items; items older than 90 days without progress are a signal of organizational prioritization failure, not engineering failure.
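
Both measures are simple to compute once action items are exported with dates and a status. This is a minimal sketch under stated assumptions: the `opened` and `status` fields and the status values are hypothetical, and "without progress" is approximated here as an item still in the `open` state.

```python
from datetime import date


def closure_velocity(opened_count, completed_count):
    """Ratio of action items completed to action items opened in a period."""
    if opened_count == 0:
        return None  # nothing opened, so the ratio is undefined
    return completed_count / opened_count


def stale_items(items, today, max_age_days=90):
    """Items still untouched ('open') and older than max_age_days."""
    return [
        item for item in items
        if item["status"] == "open" and (today - item["opened"]).days > max_age_days
    ]


# Hypothetical action items for illustration.
action_items = [
    {"id": "AI-1", "opened": date(2026, 1, 1), "status": "open"},
    {"id": "AI-2", "opened": date(2026, 4, 1), "status": "open"},
    {"id": "AI-3", "opened": date(2025, 12, 1), "status": "done"},
]
stale = stale_items(action_items, today=date(2026, 5, 1))
```

A velocity below 1.0 over several consecutive quarters is the pattern worth escalating; a single quarter can dip simply because a large incident opened many items at once.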

Psychological safety is harder to quantify but essential to track. A simple quarterly pulse survey — "I feel comfortable raising concerns during incident reviews" on a 1-5 scale — gives you a trend line. Participation distribution across postmortems, measured by counting unique speakers against attendee count, tells you whether your format changes are actually broadening who contributes.
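
The participation measure can be sketched as follows, assuming the facilitator keeps a simple log of who spoke during each postmortem. The function name and input shapes are hypothetical; the log could come from facilitator notes or a meeting transcript.

```python
def participation_ratio(attendees, speaker_turns):
    """Unique attendees who spoke at least once, divided by attendee count."""
    attendee_set = set(attendees)
    if not attendee_set:
        return None
    # Ignore turns from people not on the attendee list (for example, drop-ins).
    speakers = set(speaker_turns) & attendee_set
    return len(speakers) / len(attendee_set)


# Hypothetical meeting: five attendees, two of whom did all the talking.
attendees = ["ana", "ben", "chen", "dee", "eli"]
turns = ["ana", "ben", "ana", "ana", "ben"]
ratio = participation_ratio(attendees, turns)
```

Tracked per postmortem and averaged per quarter, this gives a trend line to set beside the survey results; it says nothing about the quality of contributions, only their breadth.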

The honest signal is whether postmortems are changing how the organization works, or just documenting how it failed. If your system design looks the same a year from now despite a year of postmortems, the learning culture is aspirational rather than real. If the on-call engineers are citing specific postmortem learnings when making architectural decisions, it's working.

The goal was never to have better postmortem documents. The goal is an organization that genuinely understands its systems more deeply after each failure than it did before — and acts on that understanding. That's the gap between blameless and learning, and closing it is worth the effort.

*Zak Hassan is a Staff SRE specializing in incident management, reliability culture, and sociotechnical systems design. Find him at zakhassan.com or on LinkedIn.*
