Wolf Incident Postmortem
January 8th, 2023
Tags: kids, satire, tech
Status: Complete, one action item outstanding.
Summary: Sentinel consumed by wolf after repeated false alarms.
Impact: Loss of sentinel. No flock impact.
Root causes: Sentinel generated noisy alerts due to premature deployment, incomplete training, and an overly monotonous task. Oncalls failed to respond to the true positive due to alert fatigue.
Resolution: Gathered flock. Deployed replacement sentinel.
Detection: Sentinel did not report at end of shift.
Action items
| Priority | Action item | Type | Status |
|---|---|---|---|
| P0 | Deploy replacement sentinel | mitigate | complete |
| P1 | Update playbook for wolf alerts | prevent | complete |
| P2 | Update remaining sentinels | prevent | complete |
| P2 | Revise sentinel training program | prevent | complete |
| P2 | Investigate equipping sentinels with flutes or slings | prevent | in progress |
What went well
- Flock gathering proceeded without issues.
- No flock injuries or losses.
- Replacement sentinel did not exhibit false positive alerts.
What went wrong
- Noisy alerts not addressed.
- Alerts silenced contrary to playbook.
- Loss of sentinel.
Where we got lucky
- Only one wolf.
- Wolf sated after sentinel consumption.
- Replacement sentinel available.
Timeline
All times local.
- 16:32 Oncalls paged "wolf".
- 16:34 First oncall arrives at sentinel location.
- 16:34 Alert diagnosed as false positive. No corrective action performed.
- 14:15 Oncalls paged "wolf".
- 14:19 First oncall arrives at sentinel location.
- 14:19 Alert diagnosed as false positive. No corrective action performed.
- 17:03 (Reconstructed) Outage begins, sentinel notices wolf.
- 17:03 Oncalls paged "wolf".
- 17:04 Oncalls paged "wolf".
- 17:04 Oncalls paged "real wolf".
- 17:05 (Reconstructed) Wolf consumes sentinel.
- 18:45 Sentinel does not report at end of shift.
- 19:05 Primary oncall dispatched to field.
- 19:10 Oncall diagnoses issue.
- 19:10 Incident begins, secondary and tertiary oncalls paged.
- 19:15 First sheep located.
- 19:52 Last sheep located.
- 20:05 Flock safe in pens.
- 20:05 Outage ends, flock protection fully restored.
- 20:45 Replacement sentinel identified.
- 07:38 Replacement sentinel deployed.
- 18:45 Replacement sentinel reports at end of shift.
- 18:45 Incident ends, 24 hours without wolf alerts or activity (exit criterion).