Peter Alvaro, Assistant Professor of CS at UCSC, presented _The Twilight of the Experts_ at ObservabilityCon 2018. The whole talk is a gem and well worth your time. I've chosen a few key points for my story.
> Failure is just the absence of a message in a distributed system.
>
> We can't hide the complexity. We can't turn this substrate into a reliable component and build up. Fault tolerance does not compose! Our abstractions are gonna leak.
>
> They're gonna do more than leak.
Alvaro has done some work with Netflix, and he has some hope for fault injection (a.k.a. chaos engineering). But nine minutes into his talk he presents an analysis of what it would take to inject faults across the combinatorial space of a specific Netflix system composed of 100 services (YouTube, 9:30).
This is combinatorial explosion.
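To get a feel for the scale, here is a rough back-of-the-envelope sketch. It assumes, for illustration only, that each experiment picks some subset of the 100 services to fail at once; the function name `experiments` and the chosen values of `k` are mine, not figures quoted from the talk.

```python
from math import comb

SERVICES = 100  # size of the system described above


def experiments(k, n=SERVICES):
    """Number of distinct ways to fail exactly k of n services together."""
    return comb(n, k)


# Experiments needed to cover every combination of k simultaneous faults.
for k in (1, 2, 3, 7):
    print(f"{k} simultaneous faults: {experiments(k):,} experiments")

# Every possible subset of failed services, including "nothing fails".
total = 2 ** SERVICES
print(f"all fault combinations: {total:,}")
```

Single faults need only 100 experiments, but triples already need 161,700, and groups of seven need about 16 billion.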
As Alvaro puts it, we can't just throw darts into this space:

> We're gonna have to be smart about how we select our experiments if we're gonna do this testing plus fault injection thing.
He goes on to offer a promising technical solution for discovering good fault-injection targets in large-scale distributed systems:
1. Use distributed tracing to discover a graph of successful paths through the system.
2. Enumerate those paths and the faults therein that would interrupt the paths.
3. Compose the collection of faults as a boolean satisfiability problem and run it through a SAT solver.
The solutions are good candidates for fault injection that are likely to reveal weaknesses in the system.
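The steps above can be sketched in a few lines. This is a toy illustration, not Alvaro's implementation. The service names and traced paths are hypothetical, and a brute-force search stands in for a real SAT solver. Each successful path becomes a clause ("at least one service on this path fails"), and a fault set is a solution when it satisfies every clause, that is, when it breaks every known successful path.

```python
from itertools import combinations

# Hypothetical output of distributed tracing: each successful request
# path is recorded as the set of services it touched.
paths = [
    {"gateway", "cache"},
    {"gateway", "db-primary"},
    {"gateway", "db-replica"},
]


def breaks_all_paths(faults, clauses):
    """True if the fault set hits at least one service on every path."""
    return all(faults & clause for clause in clauses)


def minimal_fault_sets(clauses, max_size=3):
    """Brute-force stand-in for a SAT solver: smallest satisfying fault sets."""
    services = sorted(set().union(*clauses))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(services, size):
            faults = set(combo)
            # Skip supersets of a solution we already have; they add nothing.
            if any(prev <= faults for prev in found):
                continue
            if breaks_all_paths(faults, clauses):
                found.append(faults)
    return found


candidates = minimal_fault_sets(paths)
for faults in candidates:
    print("inject faults into:", sorted(faults))
```

Of the 16 possible fault combinations across these four services, only two come back as candidates: the single point of failure (`gateway`) and the one combination that defeats all the redundancy behind it. A brute-force search like this would not scale to 100 services, which is where a real SAT solver comes in.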