Incident Commander

First 10 Minutes

1. DECLARE: Formally open the incident. Assign severity now. It's easier to downgrade later than to upgrade under pressure.
2. STAFF THE BRIDGE: Confirm three minimum roles before anything else: Technical Lead, Comms Lead, Scribe.
3. GET THE ONE-LINER: What is broken. Who is affected. Since when. One sentence. No theories yet. Scope before you solve.
4. SEND THE FIRST UPDATE: Send an update within 10 minutes. "We are investigating" is enough. Silence is not.
5. SET THE CLOCK: Set the cadence out loud: "We update every 15 minutes." Then keep it.
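The 10-minute first update and the 15-minute cadence can be turned into a list of concrete deadlines. The sketch below is one way to do that, not a prescribed tool: the function name `update_schedule` is hypothetical, and it assumes the 15-minute clock starts from the first update rather than from the declaration.

```python
from datetime import datetime, timedelta

FIRST_UPDATE_DEADLINE = timedelta(minutes=10)  # "Send an update within 10 minutes"
UPDATE_CADENCE = timedelta(minutes=15)         # "We update every 15 minutes"


def update_schedule(declared_at: datetime, count: int = 4) -> list[datetime]:
    """Return the first `count` update deadlines after the incident is declared.

    Assumption: the cadence is counted from the first update deadline.
    """
    first = declared_at + FIRST_UPDATE_DEADLINE
    return [first + i * UPDATE_CADENCE for i in range(count)]


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 9, 0)
    for due in update_schedule(declared):
        print(due.strftime("%H:%M"))
```

Reading the deadlines out loud on the bridge gives the Comms Lead and Scribe a shared clock to hold the Commander to.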

Severity Quick Reference

Severity	Definition	Response
P1	Complete service loss or critical data exposure. Revenue, brand, or safety impact.	NOW
P2	Significant degradation. A workaround exists, but it will not hold.	URGENT
P3	Limited impact. One team or non-critical function affected.	MONITOR
P4	Cosmetic or informational. No user-facing service impact.	LOG
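The table above can be encoded as a lookup so tooling and people use the same words. This is a minimal sketch: the `Severity` class, the `suggest_severity` helper, and its boolean flags are hypothetical names chosen for illustration, and the definitions are shortened from the table. The suggestion is only a starting point; the Commander still assigns severity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    level: str
    definition: str
    response: str


SEVERITIES = {
    s.level: s
    for s in (
        Severity("P1", "Complete service loss or critical data exposure.", "NOW"),
        Severity("P2", "Significant degradation; workaround will not hold.", "URGENT"),
        Severity("P3", "Limited impact; one team or non-critical function.", "MONITOR"),
        Severity("P4", "Cosmetic or informational; no user-facing impact.", "LOG"),
    )
}


def suggest_severity(
    *,
    complete_loss: bool = False,
    data_exposure: bool = False,
    significant_degradation: bool = False,
    limited_impact: bool = False,
) -> Severity:
    """Suggest a severity, checking the worst case first (illustrative flags)."""
    if complete_loss or data_exposure:
        return SEVERITIES["P1"]
    if significant_degradation:
        return SEVERITIES["P2"]
    if limited_impact:
        return SEVERITIES["P3"]
    return SEVERITIES["P4"]
```

Checking the worst case first mirrors the advice to assign high and downgrade later.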

Bridge Roles & Accountability

Commander	Owns the process and timeline. Stays out of the fix. Drives cadence and decisions.
Technical Lead	Owns investigation and resolution. Single technical voice to the Commander.
Comms Lead	Owns stakeholder messaging. Nothing goes out without their sign-off.
Scribe	Real-time log of actions, decisions, and timestamps. Facts only.

Bridge Checkpoints

→Keep every update aligned on severity, impact, and ETA.
→Re-check metrics before calling a workaround resolved.
→End each update with the next decision window. No silent gaps.
→Bring in extra expertise by 30 minutes if the bridge needs it.
→Log owner, time, and rationale before moving on.
→If scope grows, escalate at the 30-minute check.
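The checkpoint "log owner, time, and rationale before moving on" maps naturally onto a small record type. The sketch below is an assumption about how a Scribe's log entry could look; `LogEntry` and `format_entry` are hypothetical names, and timestamps are assumed to be recorded in UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEntry:
    owner: str
    action: str
    rationale: str
    # Assumption: timestamps are recorded in UTC at the moment of logging.
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        # Refuse entries that skip the owner or the rationale.
        if not self.owner.strip() or not self.rationale.strip():
            raise ValueError("owner and rationale are required before moving on")


def format_entry(entry: LogEntry) -> str:
    """Render one entry as a single facts-only line for the incident record."""
    return f"{entry.at.strftime('%H:%M')}Z | {entry.owner} | {entry.action} | {entry.rationale}"
```

Rejecting an entry with no owner or rationale enforces the checkpoint at the moment of logging rather than during the PIR.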

Bridge Pulse

🕒 Update rhythm met? Comms and Technical leads both confirm.
📈 Observability trend is flat or improving since the last fix.
🧭 Commander's next action matches the stated decision window.

Commander's Rules

→Manage the process, not the fix. Stay out of the technical weeds.
→One voice to stakeholders. Always. Conflicting comms destroy trust.
→No blame on the bridge. It kills candour when you need it most.
→If you do not know, say so, then find out. Speculation is not an update.
→Fatigue degrades judgment. Rotate at two hours.
→Document decisions in real time. The Scribe is the record of truth.
→Scope the blast radius before fixing. Know what else could be hit.
→A workaround that restores service is valid. Perfection can wait.
→Before approving a fix, confirm the Scribe has logged the evidence.

Minimum Stakeholder Comms Templates

■ Initial Declaration

We are experiencing [IMPACT]. This affects [WHO / WHAT]. We are investigating. Next update at [TIME].

■ Bridge Update

Update at [TIME]. Status: [SITUATION]. In progress: [WHAT / WHO]. Next update at [TIME].

■ Resolution Declaration

Service restored at [TIME]. Root cause: [BRIEF STATEMENT]. Monitoring remains in place. PIR within [TIMEFRAME].
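The three templates above can be filled programmatically so that no placeholder is left blank in a sent message. This is a sketch only: the field names (`impact`, `who`, `next_update`, and so on) are hypothetical stand-ins for the bracketed placeholders, and `render` is an illustrative helper, not part of any existing tool.

```python
TEMPLATES = {
    "initial": (
        "We are experiencing {impact}. This affects {who}. "
        "We are investigating. Next update at {next_update}."
    ),
    "update": (
        "Update at {time}. Status: {situation}. "
        "In progress: {in_progress}. Next update at {next_update}."
    ),
    "resolution": (
        "Service restored at {time}. Root cause: {root_cause}. "
        "Monitoring remains in place. PIR within {pir_timeframe}."
    ),
}


def render(kind: str, **fields: str) -> str:
    """Fill a template; fail loudly if any placeholder is missing."""
    template = TEMPLATES[kind]
    try:
        return template.format(**fields)
    except KeyError as missing:
        raise ValueError(f"missing field {missing} for '{kind}' message") from None


if __name__ == "__main__":
    print(render(
        "initial",
        impact="elevated checkout errors",
        who="EU web customers",
        next_update="09:25",
    ))
```

Failing on a missing field keeps a half-filled message such as "Next update at [TIME]" from reaching stakeholders.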

Resolution Phases

Triage

Confirm scope
Assign severity
Staff bridge
Send initial comms

›

Investigate

Form hypotheses
Test one at a time
Eliminate, don't assume
Log all findings

›

Resolve

Apply fix or workaround
Confirm service restored
Monitor for regression
Update all parties

›

Close

Formal declaration
Final stakeholder update
Lock the incident record
Schedule PIR

Escalation Triggers

If any condition below is met, escalate.

No viable root-cause hypothesis after 30 minutes.

Customer data exposure confirmed or reasonably suspected.

Fix applied but service not restored. Reassess severity now.

Regulatory, legal, or safety implications appear at any point.

P2 widening in scope? Re-evaluate for P1 immediately.
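The triggers above can be expressed as a checklist function that returns every reason to escalate. The sketch below assumes a hypothetical `BridgeState` snapshot maintained by the Commander or Scribe; the field names are illustrative and not taken from any real tool.

```python
from dataclasses import dataclass


@dataclass
class BridgeState:
    minutes_elapsed: int
    has_viable_hypothesis: bool
    data_exposure_suspected: bool  # confirmed or reasonably suspected
    fix_applied: bool
    service_restored: bool
    regulatory_legal_or_safety: bool
    severity: str                  # "P1" .. "P4"
    scope_widening: bool


def escalation_reasons(state: BridgeState) -> list[str]:
    """Return every trigger that is currently met; an empty list means none."""
    reasons = []
    if state.minutes_elapsed >= 30 and not state.has_viable_hypothesis:
        reasons.append("No viable root-cause hypothesis after 30 minutes")
    if state.data_exposure_suspected:
        reasons.append("Customer data exposure confirmed or suspected")
    if state.fix_applied and not state.service_restored:
        reasons.append("Fix applied but service not restored; reassess severity")
    if state.regulatory_legal_or_safety:
        reasons.append("Regulatory, legal, or safety implications")
    if state.severity == "P2" and state.scope_widening:
        reasons.append("P2 widening in scope; re-evaluate for P1")
    return reasons
```

Because any single condition is enough to escalate, the caller only needs to check whether the returned list is non-empty.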

Quick Commander Cues

Lead with facts: service, scope, impact, and start time.
Name the next update time, then deliver it even if nothing changed.
Confirm key decisions with the Scribe so the timeline stays accurate.
Monitor for two full update cycles before closing.

Post-Incident Review (PIR)

Schedule the PIR within 5 business days for P1/P2 incidents. Keep it blameless and focused on process and tooling, not people. Outputs must include a verified root cause, a reconstructed timeline, contributing factors, and at least two preventive actions with owners and due dates. Feed PIR actions into Problem Management and review progress one month later.
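The 5-business-day window can be computed rather than counted by hand. The sketch below makes two assumptions: business days are Monday to Friday with no holiday calendar, and the count starts on the day after resolution. The names `add_business_days` and `pir_deadline` are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def add_business_days(start: date, days: int) -> date:
    """Step forward `days` weekdays from `start` (holidays are not considered)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def pir_deadline(resolved_on: date, severity: str) -> Optional[date]:
    """Latest PIR date for P1/P2; None where the window does not apply."""
    if severity in {"P1", "P2"}:
        return add_business_days(resolved_on, 5)
    return None


if __name__ == "__main__":
    # An incident resolved on Friday 2024-01-05 needs its PIR by Friday 2024-01-12.
    print(pir_deadline(date(2024, 1, 5), "P1"))
```

Teams with public holidays or regional calendars would need to extend the weekday check accordingly.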