Incident Archeology

An approach to learning across a broad collection of incidents, with an emphasis on breadth rather than depth. Are incidents just paperwork? "Can't I just fix things and move on with my job?" How can we capture the benefits of understanding communication, accountability, coordination, and learning from the paperwork? Incident Archeology, Clint Byrum @ LFI Conf 2023.

YOUTUBE cKurUbYvWLA Incident Archeology. Clint Byrum @ LFI Conf 2023. youtube

Once you file the ticket for an incident, we ask for all of this extra work: write up the timeline, make sure status is accurate, set start time, time of detection, end time, estimate impact, document actions, coordinate post-incident meetings, facilitate discussions, track remediation, close the ticket. The teams are busy and being measured on other things. It is important to answer "why am I doing all this extra work?"
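As a rough sketch of the record all that paperwork produces, here is a hypothetical incident data structure in Python; the field names are illustrative assumptions, not Spotify's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class Incident:
    """Illustrative post-incident record; fields mirror the paperwork listed above."""
    ticket_id: str
    status: str                             # e.g. "open", "mitigated", "closed"
    started_at: Optional[datetime] = None   # when the problem began (often disputed)
    detected_at: Optional[datetime] = None  # when someone noticed
    ended_at: Optional[datetime] = None     # when impact stopped (also disputed)
    impact_estimate: str = ""               # free text or a severity grade
    teams_involved: List[str] = field(default_factory=list)
    actions_taken: List[str] = field(default_factory=list)
    postmortem_doc: Optional[str] = None    # link to a written post-mortem, if any
    remediation_tickets: List[str] = field(default_factory=list)
```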

Despite its strongly autonomous culture, Spotify's incident response is still very top-down.

The first hypothesis was disproved: "after-hours incidents will have high MTTR and complexity." It is a falsifiable claim, but it is built on the shaky foundation of MTTR (all start and end times are disputed) and the even shakier ground of measuring complexity. We did get some data. We found that there are almost no incidents after hours. We also found that MTTR is useless data. On the other hand, the graph of start times for incidents proved to be very reassuring to worried engineers going on-call.
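A minimal sketch of this kind of analysis, assuming a collection of records like the hypothetical Incident above; the business-hours window and field names are assumptions.

```python
from collections import Counter

def time_to_resolve(incident):
    """The per-incident ingredient of MTTR: end minus start.
    Only as trustworthy as the disputed timestamps it is built from."""
    if incident.started_at and incident.ended_at:
        return incident.ended_at - incident.started_at
    return None

def start_hour_histogram(incidents):
    """Count incidents by the hour they were declared; this is the chart
    that reassured engineers going on-call."""
    return Counter(i.started_at.hour for i in incidents if i.started_at)

def after_hours_share(incidents, business_start=9, business_end=17):
    """Fraction of incidents starting outside an assumed 09:00-17:00 window."""
    starts = [i.started_at for i in incidents if i.started_at]
    if not starts:
        return 0.0
    after = [t for t in starts if not (business_start <= t.hour < business_end)]
    return len(after) / len(starts)
```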

Digging into complexity. "How hard was it to fix? Did it have a clear and obvious resolution? Were senior engineers required to fix it? Graded 1 to 5 from simplest to hardest known solution." The main thing we learned here was how bad this question is. We spent so much time trying to estimate the complexity of an incident, and it turns out complexity is complex. We got no material value from this work.

Second hypothesis: "At least 50% of incidents have high avoidability. How easily could we have avoided this with proactive work? Did we see it coming and fail to act, or was this an unpredictable event? Graded 1 to 5 from hard to predict to identified before it happened." This one is embarrassing to show the LFI community. We're basically showing that we like counterfactuals and we're going to measure them. We learned how easy it is to score a 3. We also learned that scoring a 5 didn't actually teach us any way to improve. The main lesson was that this is another bad question.

We did learn something very unexpected: 45% of our incidents had no post-mortem, despite our policy requiring post-mortems. We had assumed all teams were running post-mortems on all incidents. This was evidence that we were asking people to do too much. When we showed this data, people got angry. It turns out they almost always had a meeting, but in many cases they didn't write anything down.

We also published a chart comparing post-mortem rate with MTTR, broken down by team. Our data scientists were so polite in telling us how terribly stupid this chart was. The big lesson here is to talk to your data scientists, or anyone with skills in statistical analysis, before you go to print. Also, don't bother trying to compare MTTR with post-mortem rates.

In the following year we ran a big campaign to advocate that teams perform post-mortems and write down their findings. We simplified our measurement to a simple binary: did we find any evidence of the post-mortem? But the results were disappointing: we moved from 55% compliance to 62%. So we followed up with usability research to find out why teams weren't doing it. What we learned is that they knew they were supposed to do them, but they protested about the heaviness of the template and process. We learned that if you don't give people lighter options, they will skip the heavy process.

We found an easy-to-collect rubric that yielded a high signal in our incident data: incidents that were easy for one person to solve didn't get post-mortems, while incidents involving many teams did. We were then able to change our advertising to encourage more post-mortems about less complex incidents.
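A sketch of that kind of easy-to-collect signal, again assuming the hypothetical Incident record above: bucket incidents by how many teams were involved and compare the share that have any written post-mortem (the same simple binary measurement).

```python
from collections import defaultdict

def postmortem_rate_by_team_count(incidents):
    """For each number of teams involved, compute the fraction of incidents
    that have any written post-mortem attached."""
    buckets = defaultdict(lambda: [0, 0])  # team_count -> [with_postmortem, total]
    for i in incidents:
        counts = buckets[len(i.teams_involved)]
        counts[1] += 1
        if i.postmortem_doc:
            counts[0] += 1
    return {n: with_pm / total for n, (with_pm, total) in sorted(buckets.items())}
```

If the data looks like the talk describes, single-team incidents show a low rate and multi-team incidents a high one, which tells you where to aim the advocacy.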

24m25s Stuff we learned that we weren't looking for. Severity of impact isn't the only driver of chaos; even small impacts can pull hundreds of people into an incident. Nobody knows what the start or end time of an incident means, which is why MTTR is a useless metric. Uptime success can hide massive problems with productivity; teams with high uptime might be completely burned out. 80% of our incidents are declared during business hours. Only 30% of declared incidents are local change failures; everything else was a change in the environment.