  • Wolf Incident Postmortem

    January 8th, 2023
    kids, satire, tech

    Incident #210

    Status

    Complete, one action item outstanding.

    Summary

    Sentinel consumed by wolf after repeated false alarms.

    Impact

    Loss of sentinel. No flock impact.

    Root causes

    Sentinel generated noisy alerts due to premature deployment, incomplete training, and an overly monotonous task. Oncalls failed to respond to a true positive due to alert fatigue.

    Trigger

    Wolf.

    Resolution

    Gathered flock. Deployed replacement sentinel.

    Detection

    Sentinel did not report at end of shift.

    Action Items

    Priority  Action Item                                            Type      Status
    P0        Gather flock                                           mitigate  complete
    P0        Deploy replacement sentinel                            mitigate  complete
    P1        Update playbook for wolf alerts                        prevent   complete
    P2        Update remaining sentinels                             prevent   complete
    P2        Revise sentinel training program                       prevent   complete
    P2        Investigate equipping sentinels with flutes or slings  prevent   in progress

    Lessons Learned

    What went well

    • Flock gathering proceeded without issues.
    • No flock injuries or losses.
    • Replacement sentinel did not exhibit false positive alerts.

    What went wrong

    • Noisy alerts not addressed.
    • Alerts silenced contrary to playbook.
    • Loss of sentinel.

    Where we got lucky

    • Only one wolf.
    • Wolf sated after sentinel consumption.
    • Replacement sentinel available.

    Timeline

    All times local

    March 3rd:

    • 16:32 Oncalls paged "wolf".
    • 16:34 First oncall arrives at sentinel location.
    • 16:34 Alert diagnosed as false positive. No corrective action performed.

    March 4th:

    • 14:15 Oncalls paged "wolf".
    • 14:19 First oncall arrives at sentinel location.
    • 14:19 Alert diagnosed as false positive. No corrective action performed.

    March 5th:

    • 17:03 (Reconstructed) Outage begins, sentinel notices wolf.
    • 17:03 Oncalls paged "wolf".
    • 17:04 Oncalls paged "wolf".
    • 17:04 Oncalls paged "real wolf".
    • 17:05 (Reconstructed) Wolf consumes sentinel.
    • 18:45 Sentinel does not report at end of shift.
    • 19:05 Primary oncall dispatched to field.
    • 19:10 Oncall diagnoses issue.
    • 19:10 Incident begins, secondary and tertiary oncalls paged.
    • 19:15 First sheep located.
    • 19:52 Last sheep located.
    • 20:05 Flock safe in pens.
    • 20:05 Outage ends, flock protection fully restored.
    • 20:45 Replacement sentinel identified.

    March 6th:

    • 07:38 Replacement sentinel deployed.
    • 18:45 Replacement sentinel reports at end of shift.
    • 18:45 Incident ends, 24hr without wolf alerts or activity (exit criterion).
