Incident chat bot, incident database
Related docs, policies and training materials.
First experience in a full-blown DevOps culture.
Teams deployed their own code—kinda GitOps
Everything was instrumented with New Relic.
No hoverboards here, kid (Blade Runner)
Felt like I had time-traveled into the future.
No hoverboards. Lotta rain.
Lifecycle of a typical incident.
A bunch of things change.
Someone's pager goes off, or customers call support.
A team of experts quickly assembles and runs around in the smoke of terminals, dashboards, and log files, trying to figure out where the fire is.
Various things are tried to put the fire out.
At some point the smoke clears.
Then an incident review happens.
TODO lists are created "so it never happens again."
And then it happens again.
Failing to get traction on SLIs/SLOs.