In the CLL 19 talk, New Relic's Elisa Binette and Beth Adele Long explored the basics of incident response and coordination, how to be effective as an incident commander (IC), and how organizations can cultivate a strong pool of ICs so that both engineers and customers are happier. youtube
(note: video has Elisa's name misspelled.)
YOUTUBE pFTohdgeG1s (34min) Beth Adele Long and Elisa Binette, How to be a great Incident Commander
Sharp End—especially the people who carry the pager, and also people who write code.
Use severity to signal how many resources to devote. Firefighters need to know how many ladders to send. In software we need to know how many engineers are needed and which skillsets.
Two clear roles: Tech Lead does hands-on-keyboard troubleshooting, "Incident Commander is focus of this talk."
(editorial: don't let that fool you. There's an ecosystem described here. Incident Commander is important, but pay attention to the whole ecosystem)
# Overview
Wide variation in incidents. Also wide variation in incident management process, on a spectrum from rigid to ad hoc.
Lifecycle of incident.
Something precipitates.
Someone notices a problem and declares the start of the incident response.
Triage. Assessment.
Hypothesis phase and Interventions.
Resolution. Symptoms mitigated. Incident is declared over.
Recovery—sometimes there's a mess that needs immediate cleanup after the resolution.
Post-incident activity. Learning. Remediation.
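The lifecycle above can be read as a small state machine. The following is only a sketch of these notes, not something shown in the talk: the names `Phase`, `NEXT`, and `Incident` are hypothetical, and the allowed transitions are an assumption (hypotheses and interventions may repeat, and recovery is optional because cleanup is only sometimes needed).

```python
from enum import Enum


class Phase(Enum):
    """Lifecycle phases as listed in the notes (names are illustrative)."""
    PRECIPITATED = "something precipitates"
    DECLARED = "problem noticed, response declared"
    TRIAGE = "triage and assessment"
    HYPOTHESIS = "hypotheses and interventions"
    RESOLVED = "symptoms mitigated, incident declared over"
    RECOVERY = "immediate cleanup after resolution"
    POST_INCIDENT = "learning and remediation"


# Assumed transitions: the hypothesis phase can loop on itself,
# and recovery can be skipped when there is no mess to clean up.
NEXT = {
    Phase.PRECIPITATED: {Phase.DECLARED},
    Phase.DECLARED: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.HYPOTHESIS},
    Phase.HYPOTHESIS: {Phase.HYPOTHESIS, Phase.RESOLVED},
    Phase.RESOLVED: {Phase.RECOVERY, Phase.POST_INCIDENT},
    Phase.RECOVERY: {Phase.POST_INCIDENT},
    Phase.POST_INCIDENT: set(),
}


class Incident:
    """Tracks the current phase and the path taken to reach it."""

    def __init__(self):
        self.phase = Phase.PRECIPITATED
        self.history = [self.phase]

    def advance(self, target):
        """Move to `target`, rejecting moves the lifecycle does not allow."""
        if target not in NEXT[self.phase]:
            raise ValueError(
                f"cannot move from {self.phase.name} to {target.name}"
            )
        self.phase = target
        self.history.append(target)
```

A rigid process would enforce transitions like these in tooling. An ad-hoc one might only record `history` for the post-incident review.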
Tools involved revolve around communication, collaboration, and coordination.
Roles: Tech Lead. Communications Lead. Incident Commander.
Characteristics of Incidents—what makes them different?
1. High Stakes.
2. Time Pressure. People make different trade-offs under time pressure.
3. Group Activity. Incident Response is a team sport.
# Incident Commanders
Working with many people and many perspectives is very expensive.
Regulating three Flows. Reduce stress for all the participants.
1. Emotion
2. Information
3. Analysis
Danger triggers Fight, Flight, Freeze
Stress hormones are terrible for solving engineering problems.
IC helps people regulate the fear response to keep our cortex online.
Information: Listening, Filtering, Acting on the right information.
Do we have the right people in the room?
Calibrating Mental Models!
Incidents can be a really valuable calibration time when people learn how their mental models are out of sync with reality, or out of sync with each other.
Ask the right questions.
Challenge assumptions.
Ask for evidence to support claims.
Ask tech leads to articulate their thought process.
Remind the team of abandoned or stalled lines of inquiry.
Notice brute forcing and suggest bisecting the problem.
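To illustrate the difference between brute forcing and bisecting, here is a minimal sketch. It assumes the suspects (for example, recent deploys) are ordered and that once one is bad, every later one is bad too. The function name `first_bad` and the `is_bad` check are hypothetical. In a real incident the check might be "roll back to this deploy and see if the symptom persists."

```python
def first_bad(changes, is_bad):
    """Return the index of the first change for which is_bad() is True.

    Assumes `changes` is ordered and that every change after the first
    bad one is also bad. Returns None if no change is bad.
    """
    lo, hi = 0, len(changes)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid        # the culprit is at mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return lo if lo < len(changes) else None
```

Checking 100 candidates one at a time can take up to 100 checks. Halving the search space each time needs at most 7.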
Skills:
Broad sense of the system and reasonable sense of fluency in general architecture
What areas are under the most strain?
Understand the organization to know where to get help and how to reach those people.
Need to be familiar with the incident response process, with good muscle memory.
Sufficient mastery of technical jargon. Need to be able to follow the deep technical conversations between experts.
Need to know company priorities to inform hard sacrifice decisions.
# Learning and Growth
How to prepare for incident command.
Have tooling, clear processes, easy-to-navigate-even-under-stress documentation.
Training.
Basic training for all engineers.
Advanced training for the advanced incident commanders.
Gamedays. To practice the incident response itself, not just to see how software responds to faults.
Adversarial gamedays. Similar to gameday, but the team doesn't know exactly when or how the fault will be injected, and one member of the team gets to trigger the surprise. Try this in staging first.
New incident commanders shadow experienced responders to become familiar with being on call before also having the pressure to resolve a real outage.
Reverse shadowing by experienced responders when a new incident commander is taking their first rotations in the lead.
Strong incident responders may not know what makes them good at it. Helping them understand their own expertise can make them better mentors for helping new responders learn.
Some related talks for comparison.
The IC process focuses on clear communication, delegation, and trust between teams working in harmony. New Relic has used the IC process for over two years (as of 2016), iterating and refining the process as we go. We train all our engineers to be ICs and have used this process to handle small deployment hiccups to network outages. We’ve built tools to support and archive our incident responses and have seen significant improvement in our understanding and response to such situations. Alice Goldfuss. nrrd 911 ic me: The Incident Commander Role. SRECon 2016. youtube
The Incident Command System is a decades-old tool used for responding to real-world incidents and emergencies, and some form of it has been adopted by many operational teams. However, most don’t know the origins of the system, how it grew to what it is today or why it’s as useful for computer systems as it is for hurricane response. Come learn about the history of the ICS, its successes and failures, and how you can adopt the best aspects of it for your emergencies, today! Alex Hidalgo. Earthquakes, Forest Fires, and Your Next Production Incident. LISA 2019. youtube