Incident Management

Incident chat bot, incident database

Related docs, policies and training materials.

First experience in a full-blown DevOps culture.

Teams deployed their own code—kinda GitOps

Everything was instrumented with New Relic.

No hoverboards here, kid (Blade Runner)

Felt like I had time-traveled into the future.

No hoverboards. Lotta rain.


Lifecycle of a typical incident.

A bunch of things change.

Someone's pager goes off, or customers call support.

A team of experts quickly assembles and runs around in the smoke of terminals, dashboards, and log files, trying to figure out where the fire is.

Various things are tried to put the fire out.

At some point the smoke clears.

Then an incident review happens.

TODO lists are created "so it never happens again."

And then it happens again.


Failing to get traction on SLIs/SLOs.