POST-INCIDENT REVIEW

Template  ·  Blameless Analysis and Action Tracking

PIR Metadata
Core Details

Incident ID: [INC-XXXX] | Severity: [P1 / P2 / P3] | Start: [DATE TIME UTC] | Restore: [DATE TIME UTC]

Service(s): [SERVICE NAMES] | Business owner: [NAME] | IC: [NAME]

Comms: [NAME] | Resolver: [TEAM / NAME] | PIR class: [MANDATORY / DISCRETIONARY]

PIR Date: [DATE] | Facilitator: [NAME] | Scribe: [NAME] | Approver: [NAME]

Executive Summary

[3-4 SENTENCES: WHAT HAPPENED, IMPACT, HOW SERVICE WAS RESTORED, TOP LEARNING, AND THE NEXT PRIORITY ACTION.]

Timeline Reconstruction
  • 1. Detection [TIME]: Alert / report received via [SOURCE].
  • 2. Declaration [TIME]: Incident declared at severity [LEVEL]; bridge opened.
  • 3. Diagnosis [TIME RANGE]: Key hypotheses tested and evidence collected.
  • 4. Mitigation / Fix [TIME]: Workaround or fix applied by [TEAM].
  • 5. Recovery Confirmed [TIME]: Metrics stable, user impact ended, monitoring continued.
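The timeline stamps above can be turned into response intervals with simple date arithmetic. The following is a minimal sketch only, assuming each single-point milestone is recorded as an ISO 8601 UTC timestamp; the function name `timeline_intervals`, the milestone keys, and the sample timestamps are hypothetical and not part of this template. The Diagnosis step is a time range and is left out of the sketch.

```python
from datetime import datetime, timedelta

# Hypothetical keys for the single-point milestones in the timeline.
ORDER = ["detection", "declaration", "mitigation", "recovery"]


def timeline_intervals(stamps: dict) -> dict:
    """Return elapsed time between consecutive milestones.

    `stamps` maps each key in ORDER to an ISO 8601 timestamp string.
    Raises ValueError if a milestone is earlier than the one before it.
    """
    times = [datetime.fromisoformat(stamps[name]) for name in ORDER]
    intervals = {}
    for i in range(len(ORDER) - 1):
        start, end = times[i], times[i + 1]
        if end < start:
            raise ValueError(f"{ORDER[i + 1]} is earlier than {ORDER[i]}")
        intervals[f"{ORDER[i]}_to_{ORDER[i + 1]}"] = end - start
    # Overall span from first alert to confirmed recovery.
    intervals["detection_to_recovery"] = times[-1] - times[0]
    return intervals


# Made-up sample values, for illustration only.
sample = {
    "detection": "2024-01-01T10:00:00+00:00",
    "declaration": "2024-01-01T10:05:00+00:00",
    "mitigation": "2024-01-01T10:45:00+00:00",
    "recovery": "2024-01-01T11:00:00+00:00",
}
for label, delta in timeline_intervals(sample).items():
    print(label, delta)
```

Note that the detection-to-recovery span is not necessarily the user-impacted duration: impact may begin before the first alert, so the Duration field in the Impact Summary still needs its own evidence.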
Impact Summary
Customers: [WHO WAS AFFECTED / HOW MANY / REGIONS] | REQUIRED
Business: [REVENUE, SLA, BRAND, LEGAL IMPACT] | REQUIRED
Duration: [TOTAL USER-IMPACTED DURATION + DEGRADED WINDOW] | REQUIRED
Targets: [SLA / SLO BREACH, SERVICE CREDITS, CONTRACTUAL OR REGULATORY EXPOSURE] | USEFUL
Ops Load: [TICKET / CONTACT SPIKE, BACKLOG CREATED, FOLLOW-ON WORKLOAD] | USEFUL
Response Learnings
Worked Well

[DETECTION / COMMAND / COMMS / COLLABORATION / RESTORE DECISION]

Did Not Go Well

[OWNERSHIP / TELEMETRY / ESCALATION / WORKAROUND RISK / DECISION DELAY]

Where We Got Lucky

[LOW TRAFFIC / FAILOVER HELD / MANUAL SAFEGUARD / SINGLE POINT DID NOT FAIL]

Contributing Factors
Primary Root Cause

[SINGLE STATEMENT OF THE VERIFIED TECHNICAL / PROCESS ROOT CAUSE]

Why It Escaped

[CONTROL OR DETECTION GAP: TEST COVERAGE, CHANGE CHECKS, MONITORING, OWNERSHIP]

Recurrence Risk

[LIKELIHOOD OF RECURRENCE BEFORE ACTIONS LAND, AND TEMPORARY CONTROLS IN PLACE NOW]

Contributors

[ENVIRONMENT, TOOLING, DEPENDENCY, DOCUMENTATION, OR CAPACITY FACTORS]

5 Whys Prompt
  • 1. Why did the customer-facing impact occur? [ANSWER]
  • 2. Why was that condition possible? [ANSWER]
  • 3. Why did controls fail to prevent it? [ANSWER]
  • 4. Why was detection/response slower than target? [ANSWER]
  • 5. Why does the system/process still allow recurrence? [ANSWER]
Corrective and Preventive Actions (CAPA)
Action 1: [PREVENT RECURRENCE ACTION] | Priority: [HIGH] | Owner: [NAME] | Due: [DATE] | Metric: [SUCCESS MEASURE] | Validate: [NAME / DATE] | Status: [OPEN]
Action 2: [DETECTION / OBSERVABILITY IMPROVEMENT] | Priority: [MEDIUM] | Owner: [NAME] | Due: [DATE] | Metric: [SUCCESS MEASURE] | Validate: [NAME / DATE] | Status: [OPEN]
Action 3: [RUNBOOK / TRAINING / PROCESS CHANGE] | Priority: [MEDIUM] | Owner: [NAME] | Due: [DATE] | Metric: [SUCCESS MEASURE] | Validate: [NAME / DATE] | Status: [OPEN]
Governance and Follow-Up
  • Customer PIR summary by: [DATE/TIME]
  • Problem linked: [PRB-XXXX]
  • Change / release linked: [CHG / REL-XXXX]
  • Known error / KB updated: [KE / KB-XXXX]
  • Risk / continuity / supplier follow-up: [REF / OWNER]
  • Leadership readout: [DATE]
  • 30-day check: [DATE]
Blameless Standard
Focus on systems, signals, response quality, and control design. Do not attribute fault to individuals. A PIR is complete only when actions are owned, measured, due-dated, and tracked to closure.
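The completion rule above (actions owned, measured, due-dated, and tracked to closure) can be checked mechanically. The sketch below is one possible reading, not a prescribed implementation: the `Action` fields, the function names, and the rule that a PIR needs at least one action are all assumptions made for illustration, and "tracked" is taken here to mean only that a status value has been recorded.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Action:
    """One CAPA row. Field names are illustrative, not part of the template."""
    description: str
    owner: Optional[str] = None
    metric: Optional[str] = None
    due: Optional[date] = None
    status: Optional[str] = None  # e.g. "OPEN" or "CLOSED"


def missing_fields(action: Action) -> List[str]:
    """List which of the four completion criteria an action does not meet."""
    checks = {
        "owned": action.owner,
        "measured": action.metric,
        "due-dated": action.due,
        "tracked": action.status,
    }
    return [name for name, value in checks.items() if not value]


def pir_is_complete(actions: List[Action]) -> bool:
    """True only if there is at least one action (an assumption) and none has gaps."""
    return bool(actions) and all(not missing_fields(a) for a in actions)


# Made-up sample rows, for illustration only.
ready = Action("Add pre-deploy config check", owner="A. Owner",
               metric="No repeat incident in 90 days",
               due=date(2030, 1, 31), status="OPEN")
draft = Action("Update runbook", owner="B. Owner")

print(missing_fields(draft))
print(pir_is_complete([ready, draft]))
```

A check like this only confirms that the fields are filled in; whether the metric is meaningful and the due date realistic remains a judgment for the facilitator and approver.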