I presented a poster session at the 2019 New Relic Product Offsite titled "More Agile than Agile: Learning from Incidents". The story draws heavily from resilience engineering, especially the Stella Report and the Theory of Graceful Extensibility. Here we elaborate on three sections of the poster itself, the three main ideas to take away, and a call to action for teams at New Relic that almost certainly applies more broadly across the software industry.
> Teams are part of the system
>
> Teams adapt when computers fall down
>
> Resilience happens because humans learn
Experts in resilience engineering have been studying complex systems in high-stakes industries for decades. Software has joined these illustrious ranks. What can we learn from the industries that grew into astronomical complexity before us?

First, systems get good at managing incidents. The scale and the boundless growth of complexity ensure that there are always components in varying states of distress and fatigue; there is a continuous stream of surprises. We get good at responding and mitigating in the face of those surprises. Second, the especially resilient systems also get good at learning from the surprises. This second behavior, learning from incidents, is orthogonal to managing the incidents. Failure to learn from incidents is a common thread among complex systems that have failed catastrophically.
> Two orthogonal concerns:
>
> * Managing Incidents
> * Learning from Incidents
>
> It is common that complex systems get good at the first part out of necessity. Getting good at the second part is much more elusive.
Context: Line of Representation
Context: About Graceful Extensibility
Analysis: Assessing Team Health Survey
The blind spots we discovered by analyzing our Team Health indicators are all of the statements within Outmaneuvering Constraints. In broad strokes, these statements concern how teams support or inhibit each other's capacity for maneuver. This is where you come in. We need organic efforts across and between teams to build communication channels and practice working together. This will take practice. Our incidents can serve to focus our investment in practice.
See Learning From Incidents for more detail on what is paraphrased below.
Pay attention next time you're in an incident retro arguing about which team has to own the SLO impact. The incident in question is telling us there's entanglement between our teams that caught us by surprise this time.
* What dashboards and alerts were the other team using to investigate the symptoms that showed up in the incident?
* What Slack channels do they follow closely or respond to quickly?
* Where can you look at tickets to coordinate work between teams?
* Perhaps plan a shared game day to practice inter-team communications and troubleshooting.
Make the effort to actively learn to improve coordination with your neighboring teams. The next incident is unlikely to be anything like this one. But this one does tell us that our respective components can collide.