Chaos Engineering

I heard about Netflix & Chaos Monkey while I was at Comverge/Itron.

Right place, right time at New Relic.

My 2nd team held the keys to Gremlin & were evaluating whether it was the right chaos platform.

Netflix Chaos Monkey. Randomly turned off VMs in AWS.

Gremlin. A Chaos Engineering SaaS.

We were also extra hands for the SRCs.

Just before the COVID-19 shutdowns in spring 2020.

+10 ms latency to the Big Databases: Accounts & Agents.

Chaos Engineering is not creating more chaos.

Engineering amid the chaos that already exists.

Testing in production SAFELY.

Build Confidence.

Design

Run

Reflect

Design

Met with 13 teams.

Ran many experiments in staging.

Proving we could stop the experiment.

Proving teams were well calibrated with their systems.

Every team: "But staging tells us nothing about prod."

1. Map dependencies

2. Use latency. 80% of problems show up as latency. Spend your design time on other details.

3. We used tc (Linux traffic control) to add latency at the kernel level.

Whether or not you run the experiment, you've already discovered a ton of value.

5. Run the experiment — this part is boring.

You'll have gained confidence.
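
A minimal sketch of the tc step, assuming the latency was injected with the netem qdisc (the notes only say "tc", so netem, the interface name eth0, and every function name here are illustrative assumptions, not the commands we actually ran). It shows the two things the design phase had to prove: the delay can be added, and it is always removed when the experiment stops, even on an abort. Note that a root netem qdisc delays all egress traffic on the interface, not just database traffic.

```python
from __future__ import annotations

import subprocess
from contextlib import contextmanager


def add_delay_cmd(dev: str, delay_ms: int) -> list[str]:
    """Build the tc command that adds a fixed delay on a device (netem assumed)."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "delay", f"{delay_ms}ms"]


def remove_delay_cmd(dev: str) -> list[str]:
    """Build the tc command that removes the netem qdisc again."""
    return ["tc", "qdisc", "del", "dev", dev, "root", "netem"]


@contextmanager
def injected_latency(dev: str, delay_ms: int, run=subprocess.run):
    """Add latency for the duration of the with-block, then always clean up.

    `run` is injectable so the sequence can be dry-run without root.
    """
    run(add_delay_cmd(dev, delay_ms), check=True)
    try:
        yield
    finally:
        # Runs on normal exit and on any exception: this is the "stop" button.
        run(remove_delay_cmd(dev), check=True)


if __name__ == "__main__":
    issued = []

    def dry_run(cmd, check=True):
        # Hypothetical stand-in for subprocess.run: record instead of execute.
        issued.append(" ".join(cmd))

    with injected_latency("eth0", 10, run=dry_run):
        pass  # observe dashboards here during a real run

    print("\n".join(issued))
```

Running the real commands needs root on a Linux host; the dry-run path only records what would be executed, which is a cheap way to rehearse the rollback in staging first.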