INCIDENT COMMANDER

Quick Reference Guide  ·  First 10 Minutes & Beyond

MAJOR INCIDENT RESPONSE
First 10 Minutes
  1. DECLARE: Formally open the incident. Assign severity now. It's easier to downgrade later than to upgrade under pressure.
  2. STAFF THE BRIDGE: Confirm at least three roles before anything else (Technical Lead, Comms Lead, Scribe).
  3. GET THE ONE-LINER: What is broken. Who is affected. Since when. One sentence. No theories yet. Scope before you solve.
  4. SEND THE FIRST UPDATE: Send an update within the first 10 minutes. "We are investigating" is enough. Silence is not.
  5. SET THE CLOCK: Set the cadence out loud: "We update every 15 minutes." Then keep it.
Severity Quick Reference
P1: Complete service loss or critical data exposure. Revenue, brand, or safety impact. Response: NOW
P2: Significant degradation. A workaround exists, but it will not hold. Response: URGENT
P3: Limited impact. One team or non-critical function affected. Response: MONITOR
P4: Cosmetic or informational. No user-facing service impact. Response: LOG
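The severity table above can be read as a top-down decision: check the P1 conditions first, then P2, and so on. The sketch below is one possible encoding, not an official classifier. The `Impact` fields and the `classify` function are hypothetical names chosen for illustration, and real triage will need more judgment than four booleans.

```python
from dataclasses import dataclass

# Response label for each severity, taken from the table above.
RESPONSE = {"P1": "NOW", "P2": "URGENT", "P3": "MONITOR", "P4": "LOG"}


@dataclass
class Impact:
    """Hypothetical triage inputs; field names are assumptions."""
    service_down: bool = False             # complete service loss
    data_exposure: bool = False            # critical data exposure
    significant_degradation: bool = False  # degraded, workaround will not hold
    user_facing: bool = True               # any user-facing service impact


def classify(impact: Impact) -> str:
    """Return the highest severity whose condition is met."""
    if impact.service_down or impact.data_exposure:
        return "P1"
    if impact.significant_degradation:
        return "P2"
    if impact.user_facing:
        return "P3"
    return "P4"


if __name__ == "__main__":
    sev = classify(Impact(significant_degradation=True))
    print(sev, RESPONSE[sev])
```

Checking P1 first follows the guide's own advice: when in doubt, start high and downgrade later.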
Bridge Roles & Accountability
Commander: Owns the process and timeline. Stays out of the fix. Drives cadence and decisions.
Technical Lead: Owns investigation and resolution. Single technical voice to the Commander.
Comms Lead: Owns stakeholder messaging. Nothing goes out without their sign-off.
Scribe: Real-time log of actions, decisions, and timestamps. Facts only.
Bridge Checkpoints
  • Keep every update aligned on severity, impact, and ETA.
  • Re-check metrics before calling a workaround resolved.
  • End each update with the next decision window. No silent gaps.
  • Bring in extra expertise by the 30-minute mark if the bridge needs it.
  • Log owner, time, and rationale before moving on.
  • If scope grows, escalate at the 30-minute check.
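The "log owner, time, and rationale" checkpoint can be enforced with a small record type that refuses incomplete entries. This is a minimal sketch under assumptions: `LogEntry` and its field names are hypothetical, and the pipe-separated line format is only one way a Scribe might lay out the log.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEntry:
    """One Scribe record. Facts only: who, when, what, why."""
    time: datetime
    owner: str
    action: str
    rationale: str

    def __post_init__(self) -> None:
        # Reject entries that are missing any of the required facts.
        for name in ("owner", "action", "rationale"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def line(self) -> str:
        """Render the entry as a single log line (time assumed to be UTC)."""
        return f"{self.time:%H:%M}Z | {self.owner} | {self.action} | {self.rationale}"


if __name__ == "__main__":
    entry = LogEntry(
        time=datetime.now(timezone.utc),
        owner="Technical Lead",
        action="Restart cache tier",
        rationale="Hit rate dropped after deploy",
    )
    print(entry.line())
```

Making the record immutable (`frozen=True`) mirrors the idea that the Scribe's log is the record of truth: entries are appended, not edited.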
Bridge Pulse
  • 🕒 Update rhythm met? Comms and Technical leads both confirm.
  • 📈 Observability trend is flat or improving since the last fix.
  • 🧭 Commander next action matches the stated decision window.
Commander's Rules
  • Manage the process, not the fix. Stay out of the technical weeds.
  • One voice to stakeholders. Always. Conflicting comms destroy trust.
  • No blame on the bridge. It kills candour when you need it most.
  • If you do not know, say so, then find out. Speculation is not an update.
  • Fatigue degrades judgment. Rotate at two hours.
  • Document decisions in real time. The Scribe is the record of truth.
  • Scope the blast radius before fixing. Know what else could be hit.
  • A workaround that restores service is valid. Perfection can wait.
  • Before approving a fix, confirm the Scribe has logged the evidence.
Minimum Stakeholder Comms Templates
■ Initial Declaration

We are experiencing [IMPACT]. This affects [WHO / WHAT]. We are investigating. Next update at [TIME].

■ Bridge Update

Update at [TIME]. Status: [SITUATION]. In progress: [WHAT / WHO]. Next update at [TIME].

■ Resolution Declaration

Service restored at [TIME]. Root cause: [BRIEF STATEMENT]. Monitoring remains in place. PIR within [TIMEFRAME].
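The three templates above can be filled programmatically so that no bracketed placeholder reaches stakeholders unfilled. The sketch below is illustrative only: the field names (`impact`, `who`, `next_update`, and so on) are assumptions standing in for the bracketed slots, and `render` is a hypothetical helper.

```python
# The three templates from this guide, with bracketed slots mapped to
# named fields. Field names are illustrative assumptions.
TEMPLATES = {
    "initial": (
        "We are experiencing {impact}. This affects {who}. "
        "We are investigating. Next update at {next_update}."
    ),
    "update": (
        "Update at {time}. Status: {status}. "
        "In progress: {in_progress}. Next update at {next_update}."
    ),
    "resolution": (
        "Service restored at {time}. Root cause: {root_cause}. "
        "Monitoring remains in place. PIR within {pir_timeframe}."
    ),
}


def render(kind: str, **fields: str) -> str:
    """Fill a template; a missing field raises KeyError instead of
    sending a message with a gap in it."""
    return TEMPLATES[kind].format(**fields)


if __name__ == "__main__":
    print(render(
        "initial",
        impact="elevated checkout errors",
        who="all web customers",
        next_update="14:30 UTC",
    ))
```

Failing loudly on a missing field is deliberate: an incomplete update is caught before the Comms Lead signs off, not after it is sent.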

Resolution Phases
Triage
  • Confirm scope
  • Assign severity
  • Staff bridge
  • Send initial comms
Investigate
  • Form hypotheses
  • Test one at a time
  • Eliminate, don't assume
  • Log all findings
Resolve
  • Apply fix or workaround
  • Confirm service restored
  • Monitor for regression
  • Update all parties
Close
  • Formal declaration
  • Final stakeholder update
  • Lock the incident record
  • Schedule PIR
Escalation Triggers
If any condition below is met, escalate.
  • No viable root-cause hypothesis after 30 minutes.
  • Customer data exposure confirmed or reasonably suspected.
  • Fix applied but service not restored. Reassess severity now.
  • Regulatory, legal, or safety implications appear at any point.
  • P2 widening in scope? Re-evaluate for P1 immediately.
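The triggers above amount to an "any of these" check that can be run at every update. The following is a rough sketch, assuming the bridge state can be reduced to a handful of flags. `BridgeState`, its fields, and `escalation_reasons` are hypothetical names, and the wording of each reason is copied from the list above.

```python
from dataclasses import dataclass


@dataclass
class BridgeState:
    """Hypothetical snapshot of the bridge; field names are assumptions."""
    minutes_elapsed: int
    has_viable_hypothesis: bool
    data_exposure_suspected: bool = False
    fix_applied: bool = False
    service_restored: bool = False
    regulatory_legal_or_safety: bool = False
    severity: str = "P2"
    scope_widening: bool = False


def escalation_reasons(s: BridgeState) -> list[str]:
    """Return every trigger that currently applies; empty means no escalation."""
    reasons = []
    if s.minutes_elapsed >= 30 and not s.has_viable_hypothesis:
        reasons.append("No viable root-cause hypothesis after 30 minutes")
    if s.data_exposure_suspected:
        reasons.append("Customer data exposure confirmed or reasonably suspected")
    if s.fix_applied and not s.service_restored:
        reasons.append("Fix applied but service not restored")
    if s.regulatory_legal_or_safety:
        reasons.append("Regulatory, legal, or safety implications")
    if s.severity == "P2" and s.scope_widening:
        reasons.append("P2 widening in scope: re-evaluate for P1")
    return reasons


if __name__ == "__main__":
    state = BridgeState(minutes_elapsed=35, has_viable_hypothesis=False)
    for reason in escalation_reasons(state):
        print("ESCALATE:", reason)
```

Returning the list of reasons, not just a yes/no, gives the Scribe the rationale to log alongside the escalation decision.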
Quick Commander Cues
  • Lead with facts: service, scope, impact, and start time.
  • Name the next update time, then deliver it even if nothing changed.
  • Confirm key decisions with the Scribe so the timeline stays accurate.
  • Monitor for two full update cycles before closing.
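The cadence and the "two full update cycles" cue are simple clock arithmetic. As a small sketch, assuming the 15-minute cadence used as the example earlier in this guide (both function names are hypothetical):

```python
from datetime import datetime, timedelta


def update_times(start: datetime, cadence_minutes: int = 15, count: int = 4) -> list[datetime]:
    """Times of the next `count` updates after `start`."""
    return [start + timedelta(minutes=cadence_minutes * i) for i in range(1, count + 1)]


def earliest_close(restored_at: datetime, cadence_minutes: int = 15, cycles: int = 2) -> datetime:
    """Earliest time to close: `cycles` full update cycles after restoration."""
    return restored_at + timedelta(minutes=cadence_minutes * cycles)


if __name__ == "__main__":
    restored = datetime(2024, 1, 1, 14, 0)
    print("Next updates:", [t.strftime("%H:%M") for t in update_times(restored, count=2)])
    print("Earliest close:", earliest_close(restored).strftime("%H:%M"))
```

With a 15-minute cadence, service restored at 14:00 means updates at 14:15 and 14:30, and the incident is not closed before 14:30.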
Post-Incident Review (PIR)
Schedule within 5 business days for P1/P2. Keep it blameless and focused on process and tooling, not people. Outputs must include verified root cause, reconstructed timeline, contributing factors, and at least two preventive actions with owners and due dates. Feed PIR actions into Problem Management and review progress a month later.
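The 5-business-day window is easy to miscount across a weekend. The sketch below computes the latest PIR date under two stated assumptions: only Saturdays and Sundays are skipped (public holidays are ignored), and the count starts the day after the incident is closed. `add_business_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: weekends only; public holidays are not considered.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


if __name__ == "__main__":
    closed_on = date(2024, 1, 5)  # a Friday
    print("PIR due by:", add_business_days(closed_on, 5))
```

An incident closed on Friday 5 January 2024 would have its PIR due by Friday 12 January 2024 under these assumptions.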