Improve Incident Response

Responding to incidents is now business as usual. There are a lot of ideas out there about Incident Command Systems (ICS). They offer some help. But deep improvement needs more than policies, a severity matrix, documentation, roles and responsibilities, and basic training.

Detection is the first phase of incident response.

The fastest path to improving incident response is building relationships between customer support teams and software engineering teams.

In the short term, simply building those relationships will improve detection, because the people who hear about customer pain first will know which engineers to tell.

The longer-term investment is to develop meaningful service-level indicators (SLIs) and service-level objectives (SLOs). Here again, the best SLI/SLO story you can build comes from collaboration between the people closest to your customers and the people closest to your code. Build upon the relationships between customer support, product design, and engineering.

The people close to the customer will have the best understanding of the customer's pain.

The people close to the code will have the best understanding of how to instrument the system to report data that approximates, as closely as possible, when our systems are causing customer pain.

Neither of those teams can build the SLI/SLO alone. It must come from their collaboration, and that collaboration must be preceded by fostering healthy relationships and communication between the teams.
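To make the division of labor concrete, here is a minimal sketch of an SLI and SLO for a hypothetical checkout service. Every name and number in it is an assumption for illustration: `CheckoutSLO`, the 99.5% target, and the 2000 ms latency threshold are not from any real system. The definition of a "good" request is the part that customer support and product design would inform; the measurement is the part engineering would instrument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutSLO:
    """Hypothetical SLO: share of checkout requests that must be 'good'."""
    target: float = 0.995           # assumed objective, agreed by both teams
    max_latency_ms: float = 2000.0  # assumed point where slowness hurts customers


def is_good(status_code: int, latency_ms: float, slo: CheckoutSLO) -> bool:
    """A request is 'good' if it did not fail server-side and was fast enough."""
    return status_code < 500 and latency_ms <= slo.max_latency_ms


def sli(requests: list[tuple[int, float]], slo: CheckoutSLO) -> float:
    """SLI = good requests / total requests (1.0 when there is no traffic)."""
    if not requests:
        return 1.0
    good = sum(1 for status, latency in requests if is_good(status, latency, slo))
    return good / len(requests)


def error_budget_remaining(sli_value: float, slo: CheckoutSLO) -> float:
    """Fraction of the error budget left; negative means the SLO is breached."""
    budget = 1.0 - slo.target
    return (budget - (1.0 - sli_value)) / budget
```

Each request is a `(status_code, latency_ms)` pair. The latency threshold is a judgment about customer pain, not a technical fact, which is why it belongs in a definition both teams can review and change.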

The ultimate goal is to have the engineering teams 1) practiced at instrumenting their systems to provide signals of potential problems for customers, 2) practiced at monitoring those signals, 3) practiced at responding when the signals fire, and 4) practiced at adapting and updating the signals in collaboration with the people closest to the customers.

In an ideal world, customer support should be able to start incidents. Even better is when SLO thresholds can start incidents too, because that should mean faster detection for known failure modes.
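The two ways of starting an incident can be sketched as two entry points into the same incident log. This is an illustrative sketch only: `Incident`, `IncidentLog`, `open_from_support`, and `open_from_slo` are hypothetical names, and a real setup would page people through whatever alerting tool the organization already uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Incident:
    title: str
    source: str  # "customer_support" or "slo_alert" (assumed labels)
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class IncidentLog:
    """In-memory stand-in for an incident tracker (hypothetical)."""
    incidents: List[Incident] = field(default_factory=list)

    def open(self, title: str, source: str) -> Incident:
        incident = Incident(title=title, source=source)
        self.incidents.append(incident)
        return incident


def open_from_support(log: IncidentLog, summary: str) -> Incident:
    """Customer support starts an incident directly from a customer report."""
    return log.open(title=summary, source="customer_support")


def open_from_slo(
    log: IncidentLog, service: str, sli_value: float, target: float
) -> Optional[Incident]:
    """Start an incident automatically when the SLI drops below the SLO target."""
    if sli_value >= target:
        return None  # within objective, nothing to open
    title = f"{service}: SLI {sli_value:.3f} below SLO target {target:.3f}"
    return log.open(title=title, source="slo_alert")
```

Recording the `source` on every incident makes it possible to review later how many incidents each path caught, and which path caught them first.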

Both ways of starting incidents need feedback loops to balance planned work against responsive customer service. Foster healthy relationships across organizational boundaries.


Elisa Binette and Beth Adele Long gave an exceptional talk about how to be a better Incident Commander. It's really a map of the whole ecosystem. See Incident Facilitation.

Evolution of Incident Management at Slack. Brent Chapman @ SRECon2021. Learn how we've made incident management a core capability of everyone on our engineering team: where we are, how we got here, and where we're going. Brent has basically given everyone a roadmap to building an effective incident response program. Links: usenix, pdf, youtube.