*By Zak Hassan — Staff SRE | May 2026*


Every production incident has two parallel problems: the technical problem the engineers are solving, and the communication problem no one assigned. Customers are hitting errors and don't know why. Sales is fielding calls from enterprise accounts and has no information. Leadership is watching Slack for updates that aren't coming. The CEO's assistant emailed asking for a status briefing.

Poor incident communication doesn't make the technical problem worse, but it makes the blast radius larger: customer trust erodes, account managers lose confidence, and leadership's distraction ripples through the organization long after the systems are back up.

Incident communication is a practiced discipline. The templates, channels, and cadence should all be defined before an incident happens — because defining them during an incident, when the team is already at capacity, is too late.


The Incident Communication Roles

The most important structural decision in incident response: the engineer investigating the technical problem should not be the person writing stakeholder updates.

Incident Commander (IC): coordinates the technical response. Delegates tasks, tracks progress, makes prioritization calls. Does not write stakeholder updates.

Communications Lead (Comms): owns all external and internal communication during the incident. Writes the updates, posts to the status page, responds to stakeholder questions. Gets information from the IC; does not investigate the technical problem.

These can be the same person on a small team, but should be different people on any incident lasting more than 15 minutes. The cognitive load of simultaneously debugging a production system and composing stakeholder updates degrades performance at both tasks.


The Status Page: Your Public Communication Channel

For any customer-facing incident, the status page is the primary communication channel. Customers who can see that you know about the problem and are working on it are far more forgiving than customers who discover the outage themselves and have no information.

The initial acknowledgment (within 15 minutes of incident confirmation):

text
Title: Investigating Issues with [Service/Feature]

Status: Investigating

We are aware of an issue affecting [describe impact in customer-facing terms, 
not technical terms]. Our team is actively investigating and will provide 
updates as we learn more.

Affected: [List specific features or services affected]
Not Affected: [List major features confirmed working, if applicable]

Next update: [specific time, e.g., "within 30 minutes"]

Posted: [time] UTC

The key elements: customer-facing language (not "the primary database lost quorum"), specific next update time (not "soon"), and what is and isn't affected (customers need to know if their workflows are impacted).
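Filling in the template by hand under pressure is where "soon" tends to creep in. One way to guard against that is to render the post from structured inputs, so the next update is always an absolute time. The following is a minimal Python sketch, not a real status page integration: `render_initial_ack` and its parameters are hypothetical names, and the 30-minute default simply mirrors the example in the template.

```python
from datetime import datetime, timedelta, timezone


def render_initial_ack(service, impact, affected, not_affected,
                       posted_at, next_update_minutes=30):
    """Render the initial status page acknowledgment as plain text."""
    if posted_at.tzinfo is None:
        raise ValueError("posted_at must be timezone-aware")
    if next_update_minutes <= 0:
        raise ValueError("next_update_minutes must be positive")
    posted_utc = posted_at.astimezone(timezone.utc)
    # Commit to an absolute time rather than "soon".
    next_update = posted_utc + timedelta(minutes=next_update_minutes)
    lines = [
        f"Title: Investigating Issues with {service}",
        "",
        "Status: Investigating",
        "",
        f"We are aware of an issue affecting {impact}. Our team is actively "
        "investigating and will provide updates as we learn more.",
        "",
        f"Affected: {', '.join(affected)}",
    ]
    if not_affected:  # list only features confirmed to be working
        lines.append(f"Not Affected: {', '.join(not_affected)}")
    lines += [
        "",
        f"Next update: by {next_update:%H:%M} UTC",
        "",
        f"Posted: {posted_utc:%H:%M} UTC",
    ]
    return "\n".join(lines)


post = render_initial_ack(
    service="Checkout",
    impact="payments at checkout",
    affected=["Checkout", "Saved payment methods"],
    not_affected=["Browsing", "Order history"],
    posted_at=datetime(2026, 5, 8, 15, 27, tzinfo=timezone.utc),
)
print(post)
```

The sketch rejects naive timestamps because every time on the page is stated in UTC; a local time silently treated as UTC would put the wrong commitment in front of customers.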

The in-progress update (every 30 minutes until resolved):

text
Title: [Service] — Update [N]

Status: Identified / Monitoring / In Progress

We have identified the cause of the issue affecting [service]. The root 
cause is [customer-facing description, e.g. "a configuration change that 
affected payment processing", not "an invalid circuit breaker threshold"].

We are currently [specific action in progress]. We expect to have a fix 
deployed within [time estimate].

Customer impact: [current state — "Users may experience errors or 
timeouts when attempting to complete purchases."]

Next update: [specific time]

Posted: [time] UTC | Duration: [time since first notice]

The resolution notice:

text
Title: [Service] — Resolved

Status: Resolved

The issue affecting [service] has been resolved as of [time] UTC. 
Service is now operating normally.

Root cause: [brief, customer-appropriate description]

We will publish a full incident report within [24-72 hours depending 
on severity] describing the root cause, impact, and the steps we are 
taking to prevent recurrence.

Duration: [start time] to [end time] UTC ([total duration])

Posted: [time] UTC

Internal Incident Communication

The status page handles customers. Internal communication — to other engineering teams, customer success, sales, and leadership — needs its own structure.

The incident Slack channel:

Every significant incident gets a dedicated Slack channel, created immediately when the incident is declared: #incident-YYYY-MM-DD-brief-description. This concentrates incident communication in one place and keeps it out of general channels where it would be lost.
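A naming convention holds up better under pressure when it is generated rather than typed. The sketch below builds a name in the format above from the declaration date and a short description. It assumes names are limited to lowercase letters, digits, and hyphens with an 80-character cap (verify against your workspace's actual rules); `incident_channel_name` is a hypothetical helper, not part of any Slack SDK.

```python
import re
from datetime import date


def incident_channel_name(declared_on: date, description: str,
                          max_length: int = 80) -> str:
    """Build a channel name of the form incident-YYYY-MM-DD-brief-description."""
    # Collapse every run of characters that is not a-z or 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    name = f"incident-{declared_on.isoformat()}-{slug}"
    # Truncate to the assumed length cap without leaving a trailing hyphen.
    return name[:max_length].rstrip("-")


print(incident_channel_name(date(2026, 5, 8), "Payment Processing Disruption"))
# prints: incident-2026-05-08-payment-processing-disruption
```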

Pinned to the channel immediately:

  • The incident ticket link
  • The video bridge link (if doing voice coordination)
  • The current status page link
  • The IC and Comms Lead names

The IC posts brief technical updates in this channel as the investigation progresses. Not every thought — a status update every 15-20 minutes, and when significant decisions are made:

text
[15:23 UTC] — IC Update

Root cause identified: the payment processor API is returning 503s on 
the /v1/charge endpoint. The retry logic is hitting the circuit breaker.

Mitigation in progress: deploying config to enable failover to backup 
payment processor (Braintree). ETA 15 minutes.

Monitoring: error rate is at 42%, up from baseline 0.2%.

Executive Briefings: The Separate Channel

Leadership needs different information than engineering needs. They need business impact, timeline, and mitigation status — not a replay of the debugging session.

The executive update template (for P1 incidents):

text
Subject: [INCIDENT] Payment Processing Disruption — Current Status

Severity: P1 (Highest)
Start Time: 15:12 UTC
Duration: 18 minutes (as of this update)

Customer Impact:
- Checkout is failing for approximately 40% of customers
- Estimated revenue impact: $X,XXX per minute at current run rate
- [N] enterprise customers known to be affected (CSM has been notified)

Current Status:
- Root cause identified: primary payment processor API outage
- Mitigation in progress: failing over to backup payment processor
- Expected resolution: 15:45 UTC (approximately 15 minutes from now)

Actions Taken:
- Engineering team engaged at 15:14
- Failover to backup processor approved at 15:28
- Customer communication posted to status page at 15:27

What we need:
- No action needed from leadership at this time
- Will update at 15:45 or when status changes

IC: [Name] | Comms Lead: [Name]

The executive update answers four questions leadership actually has: what is broken, how many customers are affected, how much is this costing us, and when will it be fixed. Everything else can wait for the postmortem.
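The duration and cost figures in a briefing are simple arithmetic, but they are easy to get wrong by hand mid-incident. The sketch below is one possible way to derive them, assuming losses scale linearly with the share of failing checkouts. That is a simplification, since some customers retry successfully later. The function names and the dollar figures are hypothetical.

```python
from datetime import datetime, timezone


def minutes_elapsed(started_at: datetime, now: datetime) -> int:
    """Whole minutes between incident start and the time of the update."""
    return int((now - started_at).total_seconds() // 60)


def revenue_impact(normal_revenue_per_minute: float, failing_percent: float,
                   duration_minutes: int) -> tuple[float, float]:
    """Return (loss per minute, cumulative loss), assuming linear loss."""
    per_minute = normal_revenue_per_minute * failing_percent / 100
    return per_minute, per_minute * duration_minutes


started = datetime(2026, 5, 8, 15, 12, tzinfo=timezone.utc)
now = datetime(2026, 5, 8, 15, 30, tzinfo=timezone.utc)
duration = minutes_elapsed(started, now)

# Hypothetical inputs: $5,000/minute normal revenue, 40% of checkouts failing.
per_minute, total = revenue_impact(5_000, 40, duration)

print(f"Duration: {duration} minutes (as of this update)")
print(f"Estimated revenue impact: ${per_minute:,.0f} per minute "
      f"(${total:,.0f} so far)")
```

Deriving the duration from the start timestamp each time an update is sent keeps it from drifting out of step with the other times quoted in the same message.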

Send executive updates on a schedule, not just when there's news: every 30 minutes for P1s, every hour for P2s. An update that says "no change, still working on mitigation" is more reassuring than silence. Silence implies either nothing is happening or nobody remembered to communicate.
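A fixed cadence is straightforward to enforce with a timer. The sketch below encodes the intervals from the paragraph above and produces a "no change" update when one is due. The names are hypothetical, and how the reminder actually reaches the Comms Lead (bot, pager, calendar) is left out.

```python
from datetime import datetime, timedelta, timezone

# Cadence from the text: every 30 minutes for P1s, every hour for P2s.
UPDATE_INTERVAL = {"P1": timedelta(minutes=30), "P2": timedelta(hours=1)}


def next_update_due(severity: str, last_update_at: datetime) -> datetime:
    """Time by which the next scheduled update must be sent."""
    return last_update_at + UPDATE_INTERVAL[severity]


def is_update_overdue(severity: str, last_update_at: datetime,
                      now: datetime) -> bool:
    """True once the promised interval has passed without a new update."""
    return now > next_update_due(severity, last_update_at)


def no_change_update(severity: str, now: datetime) -> str:
    """A scheduled update for when there is nothing new to report."""
    due = next_update_due(severity, now)
    return (f"[{now:%H:%M} UTC] No change, still working on mitigation. "
            f"Next update by {due:%H:%M} UTC.")


last_sent = datetime(2026, 5, 8, 15, 30, tzinfo=timezone.utc)
now = datetime(2026, 5, 8, 16, 5, tzinfo=timezone.utc)

if is_update_overdue("P1", last_sent, now):
    print(no_change_update("P1", now))
```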


Customer-Facing Communication Templates

For enterprise customers with SLAs, the customer success team needs templates they can use to proactively reach out:

text
Subject: [Company Name] Service Disruption Notice

Dear [Customer Name],

We want to proactively inform you that [Company Name] is currently
experiencing a disruption affecting [specific feature].

What you may experience: [specific customer-facing symptoms, e.g. "errors
when attempting to process payments" or "delays in report generation"]

Current status: Our engineering team identified the issue at [time]
and is actively working on a resolution. Our current estimate is
restoration by [time].

You can track real-time updates at: [status page URL]

We will follow up directly once service is fully restored.
If you have an urgent business need in the meantime, please contact
your account manager at [contact].

We apologize for this disruption and appreciate your patience.

[Customer Success Manager Name]
[Company Name] Customer Success

What to avoid in customer communication:

  • Technical jargon ("the primary database replica lost quorum")
  • Uncertainty without a next update time ("we're working on it")
  • Minimizing language ("a minor disruption") when the impact is significant
  • Passive constructions that obscure responsibility ("errors are occurring")
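Some of these checks can be run mechanically on a draft before it goes out. The sketch below flags three of the four patterns for human review. It cannot judge whether "minor" is accurate, and it makes no attempt at detecting passive constructions, which needs a real grammar tool. The word lists are illustrative assumptions; a real list would come from your own past incident messages.

```python
import re

# Illustrative term lists; tune these to your own product and past incidents.
JARGON = ("quorum", "replica", "circuit breaker", "503")
MINIMIZERS = ("minor", "brief", "slight")

# A clock time ("15:57") or a bounded interval ("within 30 minutes").
SPECIFIC_TIME = re.compile(
    r"\b\d{1,2}:\d{2}\b|\bwithin \d+ (?:minutes?|hours?)\b", re.IGNORECASE
)


def lint_customer_message(text: str) -> list[str]:
    """Return review findings for a draft customer message (empty if clean)."""
    lowered = text.lower()
    findings = [f"jargon: {term}" for term in JARGON if term in lowered]
    findings += [
        f"minimizing language: {word}"
        for word in MINIMIZERS
        if re.search(rf"\b{re.escape(word)}\b", lowered)
    ]
    if not SPECIFIC_TIME.search(text):
        findings.append("no specific time for the next update")
    return findings


draft = ("We are experiencing a minor disruption because the primary "
         "database replica lost quorum. We are working on it.")
for finding in lint_customer_message(draft):
    print(finding)
```

A clean result means only that none of the listed patterns matched; the Comms Lead still reads the message before sending it.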

The Post-Incident Communication: Closing the Loop

After the incident, communication continues through two more deliverables:

Internal postmortem (within 5 business days for P1/P2): the full technical account — what happened, timeline, root cause, contributing factors, and action items. Internal audience: engineering, product, leadership.

Customer-facing incident report (within 48-72 hours for significant incidents): a public or customer-delivered account of what happened and what you're doing to prevent recurrence. Customer-appropriate language; appropriate level of technical detail for your customer base.

markdown
# Incident Report: Payment Processing Disruption
## May 8, 2026 | 15:12 – 16:03 UTC

### What happened

On May 8, 2026 from 15:12 to 16:03 UTC, customers using [Company Name]'s 
checkout experienced elevated error rates due to a disruption in the primary 
payment processing integration.

### Impact

During the 51-minute incident, approximately 38% of checkout attempts were 
unsuccessful. Customers who received errors during this period were not charged; 
any interrupted payment attempts can be safely retried.

### Root cause

Our primary payment processor experienced an outage affecting its charge API.
Our automatic failover to the backup payment processor did not activate
as expected due to a misconfigured health check threshold.

### What we did

At 15:14 UTC, our on-call team was alerted and began investigating. By 15:28 UTC,
the team had identified the root cause and manually initiated failover to the backup payment
processor. Service was restored at 16:03 UTC.

### What we are doing to prevent recurrence

1. We have corrected the health check configuration that prevented automatic failover.
2. We have added automated testing of the failover path as part of the deployment process.
3. We have reduced the time threshold for automatic failover from 5 minutes to 90 seconds.

We take the reliability of the checkout experience seriously and regret the disruption
this caused to your business.

[Company Name] SRE Team

Measuring Communication Effectiveness

Like any operational practice, incident communication should be measured:

Time to first external communication: from incident declaration to first status page update. Target: under 15 minutes for P1s.

Update cadence adherence: did updates happen at the promised time? Promising "next update in 30 minutes" and then posting 45 minutes later erodes trust faster than making no promise at all.

Stakeholder satisfaction: quarterly survey to customer success, sales, and leadership: "During the last major incident, did you have the information you needed to do your job?" The answers identify communication gaps.

Escalation calls: how often does leadership have to initiate contact with engineering for incident status? Zero should be the target — if leadership is calling, your proactive communication isn't meeting their needs.
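The first two metrics can be computed directly from timestamps that the incident tooling already records. The sketch below is a minimal illustration: the function names are hypothetical, the timestamps are made up, and the two-minute grace period for counting an update as "on time" is an assumption to tune.

```python
from datetime import datetime, timedelta, timezone


def time_to_first_update(declared_at: datetime,
                         update_times: list[datetime]) -> timedelta:
    """Time from incident declaration to the first status page update."""
    return min(update_times) - declared_at


def cadence_adherence(update_times: list[datetime],
                      promised_interval: timedelta,
                      grace: timedelta = timedelta(minutes=2)) -> float:
    """Fraction of follow-up updates posted within the promised interval."""
    times = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        return 1.0  # a single update has no follow-up promise to break
    on_time = sum(gap <= promised_interval + grace for gap in gaps)
    return on_time / len(gaps)


def utc(hour: int, minute: int) -> datetime:
    return datetime(2026, 5, 8, hour, minute, tzinfo=timezone.utc)


declared = utc(15, 14)
updates = [utc(15, 27), utc(15, 57), utc(16, 42)]  # last one is 45 min late

first = time_to_first_update(declared, updates)
adherence = cadence_adherence(updates, timedelta(minutes=30))

print(f"Time to first update: {first.total_seconds() / 60:.0f} minutes")
print(f"Cadence adherence: {adherence:.0%}")
```

In this made-up timeline the first update lands inside the 15-minute target, but only one of the two follow-ups arrives when promised, so adherence is 50%.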


*Zak Hassan is a Staff SRE specializing in incident management, reliability culture, and engineering operations. Find him at zakhassan.com or on LinkedIn.*
