Complexity Distributed

Peter Alvaro, Assistant Professor of CS at UCSC, presented _The Twilight of the Experts_ at ObservabilityCon 2018. The whole talk is a gem and well worth your time. I've chosen a few key points for my story.

> Failure is just the absence of a message in a distributed system.
>
> We can't hide the complexity. We can't turn this substrate into a reliable component and build up. Fault tolerance does not compose! Our abstractions are gonna leak.
>
> They're gonna do more than leak.

Fault tolerance in a distributed system goes all the way up the stack. It does more than leak. It burns everything to the ground.

Alvaro has done some work with Netflix. He has some hope for fault injection (a.k.a. chaos engineering). But nine minutes into his talk he presents an analysis of the combinatorial space of faults that could be injected into a specific Netflix system composed of 100 services (YouTube, 9:30).

The time it would take to test every combination of failures across all the services approaches the heat death of the universe.

This is combinatorial explosion.

Log-scale plot of growth rates that exceed linear.
Copyright 2018 Eric Dobbs. Licensed CC BY-SA 4.0
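To get a feel for that growth, here is a minimal Python sketch. It assumes a simple counting model that is mine, not necessarily the one used in the talk: an experiment fails exactly `k` of the `n` services at once, so there are "n choose k" experiments for each `k`, and `2**n - 1` experiments if every non-empty combination is fair game. The function names are hypothetical.

```python
from math import comb

SERVICES = 100  # size of the Netflix system mentioned in the talk


def experiments(n: int, k: int) -> int:
    """Number of distinct ways to fail exactly k of n services at once."""
    return comb(n, k)


def all_experiments(n: int) -> int:
    """Number of non-empty combinations of failed services."""
    return 2 ** n - 1


if __name__ == "__main__":
    # Assumption: each experiment is one unique set of simultaneously failed services.
    for k in range(1, 8):
        print(f"{k} simultaneous faults: {experiments(SERVICES, k):,} experiments")
    print(f"any combination: {all_experiments(SERVICES):,} experiments")
```

Even stopping at a handful of simultaneous faults, the count climbs by orders of magnitude with each step, which is why exhaustive testing is off the table.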

As Alvaro puts it, we can't just throw darts into this space:

> We're gonna have to be smart about how we select our experiments if we're gonna do this testing plus fault injection thing.

He goes on to offer a promising technique for discovering good targets for fault injection in large-scale distributed systems.

Use distributed tracing to discover a graph of successful paths through the system.

Enumerate those paths and the faults therein that would interrupt the paths.

Compose the collection of faults as a Boolean satisfiability problem and run it through a SAT solver.

The solutions are good candidates for fault injection that are likely to reveal weaknesses in the system.
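The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Alvaro's implementation: the traced paths and service names are made up, each path is reduced to the set of services it touches, and the formula is one plausible encoding in which every known successful path must lose at least one service. A brute-force search stands in for a real SAT solver, which is only workable at toy sizes.

```python
from itertools import combinations

# Hypothetical traces: each successful request path, as the set of services it used.
PATHS = [
    {"cache", "ratings"},
    {"db", "ratings"},
    {"cache", "fallback"},
]


def minimal_fault_sets(paths):
    """Return the minimal sets of services whose failure interrupts every path.

    Each path contributes one clause: (fail(s1) OR fail(s2) OR ...).
    A satisfying assignment is a set of faults that hits every clause.
    Brute force is used here in place of a SAT solver (assumption for brevity).
    """
    services = sorted(set().union(*paths))
    solutions = []
    for size in range(1, len(services) + 1):
        for candidate in combinations(services, size):
            chosen = set(candidate)
            # Skip supersets of a smaller solution: they add no new information.
            if any(found <= chosen for found in solutions):
                continue
            # Satisfied when every path contains at least one failed service.
            if all(chosen & path for path in paths):
                solutions.append(chosen)
    return solutions


if __name__ == "__main__":
    for faults in minimal_fault_sets(PATHS):
        print("inject faults into:", sorted(faults))
```

With these three made-up paths, no single failure breaks everything, but a few pairs do. Those pairs are the experiments worth running first: if the system survives one of them, tracing has revealed a path that was not in the original graph, and if it does not, the experiment has found a weakness.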