In the CLL 19 talk, New Relic's Elisa Binette and Beth Adele Long explored the basics of incident response and coordination, how to be effective as an incident commander (IC), and how organizations can cultivate a strong pool of ICs so that both engineers and customers are happier.
(note: video has Elisa's name misspelled.)
YOUTUBE pFTohdgeG1s (34min) Beth Adele Long and Elisa Binette, How to be a great Incident Commander
Sharp End—especially the people who carry the pager, and also people who write code.
Use severity to signal how many resources to devote. Firefighters need to know how many ladders to send. In software we need to know how many engineers are needed and with which skillsets.
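A minimal sketch of severity as a staffing signal, in Python. Everything specific here is an assumption for illustration and not from the talk: the three severity levels, the convention that 1 is the most severe, the headcounts, the skillset names, and the `ResponsePlan` and `plan_for` helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    engineers: int     # how many engineers to page
    skillsets: tuple   # which kinds of expertise to bring in


# Hypothetical table: placeholder values, to be replaced by an
# organization's own severity definitions.
PLANS = {
    1: ResponsePlan(engineers=6, skillsets=("database", "networking", "frontend")),
    2: ResponsePlan(engineers=3, skillsets=("database", "networking")),
    3: ResponsePlan(engineers=1, skillsets=("on-call generalist",)),
}


def plan_for(severity: int) -> ResponsePlan:
    """Look up who to send for a declared severity (1 = most severe here)."""
    if severity not in PLANS:
        raise ValueError(f"unknown severity: {severity}")
    return PLANS[severity]


plan = plan_for(2)
print(plan.engineers, plan.skillsets)
```

The point of the table is the same as the ladder count: declaring a severity answers the staffing question up front, so nobody has to negotiate it mid-incident.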
Two clear roles: the Tech Lead does hands-on-keyboard troubleshooting, while the "Incident Commander is focus of this talk."
(editorial: don't let that fool you. There's an ecosystem described here. Incident Commander is important, but pay attention to the whole ecosystem)
# Overview
Incidents vary widely. Incident management processes vary widely too, on a spectrum from rigid to ad hoc.
Lifecycle of incident.
Something precipitates.
Someone notices a problem and declares the start of the incident response.
Triage. Assessment.
Hypothesis phase and Interventions.
Resolution. Symptoms mitigated. Incident is declared over.
Recovery—sometimes there's a mess that needs immediate cleanup after the resolution.
Post-incident activity. Learning. Remediation.
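The lifecycle above can be read as a small state machine. The following Python sketch is one possible encoding, not something presented in the talk: the `Phase` names and the `advance` helper are hypothetical. Two transitions are assumptions drawn from the notes: recovery is optional ("sometimes"), and the hypothesis/intervention phase may repeat until something works.

```python
from enum import Enum


class Phase(Enum):
    PRECIPITATING_EVENT = 1
    DETECTION = 2                    # someone notices and declares the incident
    TRIAGE = 3                       # triage and assessment
    HYPOTHESIS_AND_INTERVENTION = 4
    RESOLUTION = 5                   # symptoms mitigated, incident declared over
    RECOVERY = 6                     # immediate cleanup, only sometimes needed
    POST_INCIDENT = 7                # learning and remediation


# Allowed moves between phases (an illustrative assumption).
TRANSITIONS = {
    Phase.PRECIPITATING_EVENT: {Phase.DETECTION},
    Phase.DETECTION: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.HYPOTHESIS_AND_INTERVENTION},
    # A failed intervention leads to another hypothesis.
    Phase.HYPOTHESIS_AND_INTERVENTION: {
        Phase.HYPOTHESIS_AND_INTERVENTION,
        Phase.RESOLUTION,
    },
    # Recovery can be skipped when there is no mess to clean up.
    Phase.RESOLUTION: {Phase.RECOVERY, Phase.POST_INCIDENT},
    Phase.RECOVERY: {Phase.POST_INCIDENT},
    Phase.POST_INCIDENT: set(),
}


def advance(current: Phase, nxt: Phase) -> Phase:
    """Return nxt if the move is allowed, otherwise raise ValueError."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.name} to {nxt.name}")
    return nxt


phase = advance(Phase.RESOLUTION, Phase.POST_INCIDENT)
print(phase.name)
```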
Tools involved revolve around communication, collaboration, and coordination.
Roles: Tech Lead. Communications Lead. Incident Commander.
Characteristics of Incidents—what makes them different?
1. High Stakes.
2. Time Pressure. People make different trade-offs under time pressure.
3. Group Activity. Incident Response is a team sport.
# Incident Commanders
Working with many people and many perspectives is very expensive.
Regulating three Flows. Reduce stress for all the participants.
1. Emotion
2. Information
3. Analysis
Danger triggers Fight, Flight, Freeze
Stress hormones are terrible for solving engineering problems.
IC helps people regulate the fear response to keep our cortex online.
Information: Listening, Filtering, Acting on the right information.
Do we have the right people in the room?
Calibrating Mental Models!
Incidents can be a really valuable calibration time when people learn how their mental models are out of sync with reality, or out of sync with each other.
Ask the right questions.
Challenge assumptions.
Ask for evidence to support claims.
Ask tech leads to articulate their thought process.
Remind people of abandoned or stalled lines of inquiry.
Notice brute forcing and suggest bisecting the problem.
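To make the last point concrete, here is a minimal Python sketch of bisecting instead of brute forcing. It assumes an ordered list of candidate changes (for example, recent deploys) in which every change before the culprit is good, the culprit and every later change is bad, and the last change is known to be bad. The `first_bad` function, the deploy IDs, and the `is_bad` check are all hypothetical.

```python
def first_bad(changes, is_bad):
    """Return the index of the first bad change using binary search.

    Assumes is_bad is False for a prefix of `changes`, True for the rest,
    and that the final change is bad.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return lo


# Hypothetical example: 16 deploys, the problem starts at deploy 109.
deploys = list(range(100, 116))
checked = []


def is_bad(deploy_id):
    checked.append(deploy_id)   # record each check to show how few are needed
    return deploy_id >= 109


culprit = deploys[first_bad(deploys, is_bad)]
print(culprit, len(checked))
```

Checking the deploys one by one from the oldest would take ten checks to reach the culprit in this example; halving the search space takes four.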
Skills:
Broad sense of the system and reasonable fluency in its general architecture.
What areas are under the most strain?
Understand the organization to know where to get help and how to reach those people.
Need to be familiar with the incident response process, to the point of muscle memory.
Sufficient mastery of technical jargon. Need to be able to follow the deep technical conversations between experts.
Need to know company priorities to inform hard sacrifice decisions.
# Learning and Growth