Incident Communication: The Skill Every SRE Underestimates

The technical work of incident response — diagnosis, remediation, root cause analysis — gets most of the attention in SRE literature. The communication work gets almost none, despite being the part that most often determines whether an incident is remembered as "handled well" or "handled badly" by the people it affected.

A technically perfect incident response that communicated poorly leaves customers and stakeholders feeling blindsided, disrespected, and less trusting of the organization. A technically adequate incident response that communicated well leaves people feeling informed, respected, and confident the team is on top of it. Communication is not soft skills decoration on top of real SRE work. It is real SRE work.

The Three Communication Audiences

Every significant incident has three distinct audiences, each with different needs:

Customers affected by the degradation. They want to know: is the problem real (not just me), how bad is it, when will it be fixed, and what can I do in the meantime? They don't need technical details; they need enough information to decide how to respond to the disruption.

Internal stakeholders — your support team, customer success, sales, leadership. They need to know what's happening so they can respond to customer inquiries, set expectations, and decide whether to escalate. They need more detail than customers but still not the full technical picture.

Your own team — engineering, on-call, SREs. They need complete technical detail: what's broken, what's been tried, what the current theory is, who's doing what, when the next update is.

The mistake is using one communication channel and one level of detail for all three audiences. Customer-facing communications written for engineers confuse customers. Technical communications written for customers obscure the information engineers need. Maintain separate channels and tailor the level of detail to each audience.
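
One way to keep the three audiences separate is to hold a single incident record and render a different view of it per audience. The following is a minimal Python sketch under stated assumptions: the `Incident` fields and the `render_for` function are hypothetical names chosen for illustration, not part of any particular incident tool.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    # All field names are illustrative assumptions.
    service: str
    user_impact: str        # plain-language impact, safe for customers
    eta: str                # e.g. "within 30 minutes"
    current_theory: str     # technical hypothesis, for engineers only
    tried: list[str] = field(default_factory=list)
    assignments: dict[str, str] = field(default_factory=dict)  # engineer -> task


def render_for(audience: str, inc: Incident) -> str:
    """Return a message whose level of detail suits the given audience."""
    if audience == "customers":
        # Impact and expectation only; no technical detail.
        return f"{inc.service}: {inc.user_impact}. Next update {inc.eta}."
    if audience == "stakeholders":
        # Impact plus enough context to answer customer questions.
        return (f"{inc.service}: {inc.user_impact}. "
                f"Engineering is engaged; next update {inc.eta}. "
                f"Remediation steps tried so far: {len(inc.tried)}.")
    if audience == "engineering":
        # Full technical picture: theory, attempts, and who is doing what.
        who = "; ".join(f"{name}: {task}" for name, task in inc.assignments.items())
        return (f"{inc.service} | theory: {inc.current_theory} | "
                f"tried: {', '.join(inc.tried) or 'nothing yet'} | doing: {who}")
    raise ValueError(f"unknown audience: {audience}")
```

Keeping the technical fields out of the customer branch entirely, rather than trimming them afterwards, makes it harder for engineering detail to leak into a customer-facing post.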

The Status Page: Your Most Important Reliability Signal

A status page is not just a convenience for customers — it's a trust signal. An organization that maintains an accurate, timely status page is demonstrating transparency and operational maturity. An organization whose status page says "all systems operational" while customers are experiencing outages is destroying trust faster than the outage itself.

The operational discipline of a good status page:

Update it before customers start asking. If your alerting fires and you're investigating, your status page should reflect "investigating" within 5 minutes. You don't need a root cause to post an update. Something as simple as "Teams are aware of elevated error rates affecting [service] and are actively investigating. An update will be provided within 30 minutes." is better than silence.

Post updates on a predictable cadence. During an active incident, customers and stakeholders don't know whether silence means "everything is fine" or "things are getting worse and they're too busy to update us." A commitment such as "updates will be posted every 30 minutes while this incident is active" tells people when to expect the next update, which reduces the anxiety of watching a status page that isn't changing.
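
A cadence commitment is easier to keep when something checks it for you. The sketch below assumes a 30-minute interval, matching the example above. The names `UPDATE_INTERVAL`, `next_update_due`, and `is_update_overdue` are hypothetical; in practice a check like this might feed a reminder in the incident channel.

```python
from datetime import datetime, timedelta

# Assumed cadence, taken from the 30-minute example in the text.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Time by which the next status update has been promised."""
    return last_update + interval


def is_update_overdue(last_update: datetime, now: datetime,
                      interval: timedelta = UPDATE_INTERVAL) -> bool:
    """True once the promised update time has passed without a new post."""
    return now > next_update_due(last_update, interval)
```

For example, with a last update at 14:00, `next_update_due` returns 14:30, and `is_update_overdue` becomes true for any `now` after that.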

Write for customers, not engineers. "Database failover in progress, Aurora replica promotion underway" is not a status page update. "Customers may experience login failures and slow page loads. Our team is working on a fix and expects service to be restored within the next 20 minutes" is a status page update.

The five-word summary test: Can you describe the user impact in five words or fewer? "Users cannot complete checkout" does it in four. Write that first; the details come after.
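
If status posts are drafted through a script or form, the five-word test can be applied as a simple check on the headline. This is a minimal sketch; `passes_five_word_test` is a hypothetical helper, and splitting on whitespace is a deliberately crude definition of a word.

```python
def passes_five_word_test(summary: str, limit: int = 5) -> bool:
    """True if the impact summary is non-empty and uses at most `limit` words."""
    words = summary.split()
    return 0 < len(words) <= limit
```

A check like this only measures length. Whether the summary describes user impact rather than internal mechanics still needs a human reviewer.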

The Incident Commander Role

Complex incidents need an incident commander (IC): a person whose sole job is coordination and communication, not technical investigation. The IC is not necessarily the most senior engineer in the room. They're the person who:

Keeps track of what's been tried and what the current working theory is
Assigns tasks explicitly ("Alice, can you investigate the database logs from 14:00-14:30?")
Posts status updates to the incident channel on a regular cadence
Runs the customer-facing communication
Manages the incident timeline (documenting when things happened)
Calls the incident resolved when the criteria are met

Without an IC, incidents devolve into multiple engineers investigating the same things, nobody posting external updates because everyone assumes someone else will, and a post-incident timeline that's reconstructed from memory and Slack scrollback.

The IC skill is learnable and should be practiced in game days, not discovered during a real incident. Rotate the IC responsibility so multiple engineers develop the skill.

The Communication Templates

Pre-written templates reduce cognitive load during incidents. You shouldn't be writing from scratch at 2am with adrenaline running.

Initial status page post:

[INVESTIGATING] [Service name] degraded performance

Teams are currently investigating reports of [brief description of user impact].
An update will be provided within 30 minutes.

Updated: [timestamp]

Update post (in-progress):

[INVESTIGATING] [Service name] degraded performance — Update

Teams have identified [brief description of what's been found] as a contributing factor
and are working on a resolution.

Impact: [Current user impact — what is and isn't affected]
Estimated time to resolution: [Estimate, or "We do not yet have an ETA"]

Another update will be provided within 30 minutes.

Updated: [timestamp]

Resolution post:

[RESOLVED] [Service name] is fully operational

The issue affecting [brief description] has been resolved. Service is operating normally.

Duration: [Start time] to [End time] ([X hours Y minutes])
Impact: [Description of what was affected and how many users/transactions]

A full incident report will be published within 24-72 hours.

Updated: [timestamp]
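
The status page templates above can also be filled in by a small script, so that nobody does duration arithmetic by hand at 2am. Here is a sketch for the resolution post using Python's standard `string.Template`. The placeholder names, the `format_duration` and `render_resolved` helpers, and the timestamp format are all assumptions made for illustration, and timezone handling is left out.

```python
from datetime import datetime
from string import Template

# Mirrors the resolution post template; placeholder names are assumptions.
RESOLVED_TEMPLATE = Template(
    "[RESOLVED] $service is fully operational\n"
    "\n"
    "The issue affecting $description has been resolved. "
    "Service is operating normally.\n"
    "\n"
    "Duration: $start to $end ($duration)\n"
    "Impact: $impact\n"
    "\n"
    "A full incident report will be published within 24-72 hours.\n"
    "\n"
    "Updated: $updated"
)

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed display format


def format_duration(start: datetime, end: datetime) -> str:
    """Express the incident duration as 'X hours Y minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def render_resolved(service: str, description: str, impact: str,
                    start: datetime, end: datetime, posted_at: datetime) -> str:
    """Fill in the resolution post; substitute() raises KeyError if a field is missing."""
    return RESOLVED_TEMPLATE.substitute(
        service=service,
        description=description,
        impact=impact,
        start=start.strftime(TIME_FORMAT),
        end=end.strftime(TIME_FORMAT),
        duration=format_duration(start, end),
        updated=posted_at.strftime(TIME_FORMAT),
    )
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing field raises an error instead of publishing a post with an unfilled placeholder.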

Internal Slack update template:

INCIDENT UPDATE [timestamp]

Status: [Investigating / Mitigating / Monitoring / Resolved]
IC: [Name]
Current theory: [Best current hypothesis]
What has been tried: [Brief list]
Currently doing: [Alice: investigating DB logs | Bob: checking deployment]
External status: [Link to status page post]
Next update: [Time]
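
The internal update has more fields than the status page posts, which makes it a reasonable candidate for a small structured type that refuses to render an invalid status. The sketch below is one possible shape. `InternalUpdate` and its field names are hypothetical, and it only builds the message text; posting it to Slack is left to whatever integration you already use.

```python
from dataclasses import dataclass

# Status values taken from the template above.
ALLOWED_STATUSES = ("Investigating", "Mitigating", "Monitoring", "Resolved")


@dataclass
class InternalUpdate:
    # Field names are illustrative assumptions.
    timestamp: str
    status: str
    ic: str
    theory: str
    tried: list[str]
    doing: dict[str, str]   # engineer -> current task
    status_page_url: str
    next_update: str

    def render(self) -> str:
        """Build the internal update text, rejecting unknown status values."""
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"status must be one of {ALLOWED_STATUSES}")
        doing = " | ".join(f"{who}: {task}" for who, task in self.doing.items())
        lines = [
            f"INCIDENT UPDATE {self.timestamp}",
            "",
            f"Status: {self.status}",
            f"IC: {self.ic}",
            f"Current theory: {self.theory}",
            f"What has been tried: {', '.join(self.tried) or 'nothing yet'}",
            f"Currently doing: {doing}",
            f"External status: {self.status_page_url}",
            f"Next update: {self.next_update}",
        ]
        return "\n".join(lines)
```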

Writing the Customer-Facing Incident Report

The postmortem document is a technical artifact for your platform team. The customer-facing incident report is a communication document for users. They're not the same thing.

A good customer-facing incident report:

Acknowledges the customer impact first. Not "we experienced a failure in an Aurora cluster" but "customers experienced login failures for approximately 47 minutes."

Explains what happened in plain English. Not the technical chain of events, but the understandable version: "A database configuration change deployed at 2:15pm caused our login service to stop responding correctly. We detected the issue within 3 minutes and immediately began working to restore service."

Describes what you did, not just what failed. "The on-call team identified the configuration change as the cause within 8 minutes and rolled it back. Service began recovering at 3:02pm."

Explains what you're doing to prevent recurrence. Be specific: "Teams are adding automated validation for database configuration changes before they are deployed" is better than "teams are taking steps to improve the processes."

Doesn't over-apologize. One clear, sincere apology at the beginning is appropriate. Repeated apologies throughout the document feel performative rather than genuine.

AI-Assisted Incident Communication

AI tooling assists the communication track of incident response in practical ways:

Drafting status page updates. Given a brief technical summary, an AI drafts a customer-facing update in the right tone and format. The IC reviews and approves; they don't write from scratch.

Generating the incident timeline. By parsing the incident Slack channel, an AI tool can reconstruct the event timeline (who said what and when, what was tried, when the status changed) much faster than a human scrolling through message history, which shortens postmortem preparation.
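
Whatever tool does the summarizing, the mechanical first step is putting the channel messages in time order. The sketch below shows that step only, without any AI involved. It assumes each message is a dict with `ts` (epoch seconds as a string), `user`, and `text` keys, which resembles the shape of Slack message data but should be verified against your actual export. `build_timeline` is a hypothetical helper.

```python
from datetime import datetime, timezone


def build_timeline(messages: list[dict]) -> list[str]:
    """Sort messages by timestamp and format one timeline line per message."""
    ordered = sorted(messages, key=lambda m: float(m["ts"]))
    lines = []
    for m in ordered:
        # Assumption: "ts" holds epoch seconds; render in UTC for consistency.
        when = datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc)
        lines.append(f"{when:%H:%M:%S} UTC  {m['user']}: {m['text']}")
    return lines
```

The resulting ordered lines can be pasted into the postmortem directly or handed to a drafting tool as input, with the IC still reviewing the result.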

Drafting the customer-facing incident report. Given the technical postmortem, an AI drafts the customer version — translating technical language, identifying the customer impact summary, structuring the narrative.

What AI cannot replace: the judgment about what to disclose, the nuanced tone decisions, and the stakeholder relationships that make people trust an incident report when it arrives.

*Zak Hassan is a Staff SRE with experience running incident response at Red Hat, SAP, and Hootsuite. Find him at zakhassan.com or on LinkedIn.*
