Chaos and Lightning

Just a random brain dump about what Reliability Fitness has been up to with Chaos Engineering.

Hi, I'm Eric Dobbs. I joined Reliability Fitness in early November. You may know me as the guy who makes a lot of noise about learning from incidents. I also teach aikido, which means one of the things I do for fun is get knocked down and get back up again. A lot.

Chaos is where we sail.

It isn't about creating chaos. It is about admitting that Chaos is where we sail. We're rebuilding our ship at sea, way out, off the charts, where there be dragons.

We create controlled incidents where we get to skip the discovery process of finding out what broke and focus our attention on all of the peripheral side effects. That changes what we can learn compared to learning from incidents.

We can also start the incident knowing where to be looking. We can prepare our dashboards and alerting with specific hypotheses about expected failures. There are likely to be surprises that teach us things.

Two core questions to ask when designing experiments. What do we think will surprise us? How will we know?

Two core questions to ask during and after the experiment. What actually surprised us? Does this change how we are prioritizing our work?
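One way to keep those four questions honest is to write the expectations down as checks before the experiment starts. Here's a rough sketch, not anything we actually run: the class names, metric names and observed numbers are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Hypothesis:
    description: str                  # what we think will happen
    holds: Callable[[Metrics], bool]  # "how will we know?"


@dataclass
class Experiment:
    name: str
    hypotheses: List[Hypothesis] = field(default_factory=list)

    def surprises(self, observed: Metrics) -> List[str]:
        # A surprise is any expectation the observed metrics contradict.
        return [h.description for h in self.hypotheses if not h.holds(observed)]


# Hypothetical experiment; metric names and thresholds are assumptions.
experiment = Experiment(
    name="latency between one client and one service",
    hypotheses=[
        Hypothesis("client error rate stays under 1%",
                   lambda m: m["error_rate"] < 0.01),
        Hypothesis("the latency alert fires at least once",
                   lambda m: m["alerts_fired"] >= 1),
    ],
)

# Made-up numbers standing in for what the dashboards showed.
observed = {"error_rate": 0.04, "alerts_fired": 1}
print(experiment.surprises(observed))
```

Whatever comes back from surprises() is the raw material for the last question: does this change how we are prioritizing our work?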

Chaos Engineering is a tool to help us proactively discover latent risks and surprising combinations of behaviors. It is also a way to exercise our observability and monitoring—which of our dashboards help us and which don't.

What we're going to do with Gremlin. One (or two) kinds of attack: introducing latency or creating a network blackhole around a service. Why so few? 1) Resource attacks like CPU & RAM & Disk are hard to control with Gremlin in our Mesos/Marathon clusters. 2) The impact they have on other systems is generally latency anyway. The other obvious attacks of broken payloads are probably better covered in unit tests, which are a less costly investment.

One core mechanic for limiting blast radius is to introduce the latency on the client side of a client-server relationship. Then other clients of that service do not have to suffer from the experiment. Obviously, services downstream of that consumer should expect symptoms.
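To make the client-side idea concrete, here is a minimal sketch in plain Python. It is not how Gremlin does it (Gremlin works at the network level); it just shows the shape of the mechanic. The names with_injected_latency and fetch_account are hypothetical, and fetch_account stands in for a real network call.

```python
import random
import time
from typing import Any, Callable


def with_injected_latency(
    call: Callable[..., Any],
    delay_seconds: float,
    probability: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Callable[..., Any]:
    """Wrap one client's outbound call so only that client sees the delay."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # The delay happens inside this client, before the request is made.
        # The service itself and its other clients are untouched.
        if rng() < probability:
            sleep(delay_seconds)
        return call(*args, **kwargs)

    return wrapped


def fetch_account(account_id: int) -> dict:
    # Hypothetical stand-in for a request to a shared service.
    return {"id": account_id, "status": "ok"}


# Only the client under experiment uses the slowed-down version.
slow_fetch_account = with_injected_latency(fetch_account, delay_seconds=0.5)
```

The sleep and rng parameters are there so the wrapper can be exercised in a test without actually waiting. Anything that calls slow_fetch_account inherits the delay, which is the downstream symptom to expect.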

What else? We want to take a close look at Chaos Panda and start building skills across the org in designing experiments for latency in nerdlets and nerdpacks.

If we just add to the list of problems we already ignore and defer and postpone, then it's a wasted investment. You'd do better to spend that time working down your existing risk matrix. You have a risk matrix, right?

What do we hope to learn? Key thing is to get some confirmation about where we have existing robustness. For example, it should be easy for us to automate killing some services in container fabric 'cos we know Mesos is good at restarting them.

Note for people outside of New Relic. Chaos Panda is an internal tool for introducing fault injection into the GraphQL ecosystem within New Relic. Nerdlets and Nerdpacks are components built to work within the programmable corners of New Relic One.