Software businesses exhibit a lot of wishful thinking about process. Every business process is barely held together, and it holds only because skilled people work around its gaps and contradictions. In many cases skilled people actively ignore or defy the approved process where it has proven counterproductive, and make judgment calls about which of the conflicting priorities and rules really matter right now.
To get real reliability that actually improves customer experience, what’s needed is investment in skill-building and time for humans to practice.
When the tempo of the work is right, the natural operational work might provide a good rhythm of mostly-right-sized opportunities for teams to practice and learn and improve.
If that tempo is too slow, then teams should be running game days to create their own tempo for maintaining the skills that make for effective incident response.
If that tempo is too fast, the team will need outside reinforcements to improve the baseline reliability of their services until the tempo becomes manageable. As the systems stabilize into a better tempo, the reinforcements can turn their attention to other parts of the system.
High-performing teams regularly conduct Operational Review Meetings. The idea is for teams to regularly review their dashboards, monitoring, and alerting to reduce false alarms and to make sure they can see how their systems are working. Teams develop a feeling for what is normal right now. This builds the invisible skill of anticipating when the pace of change is accelerating beyond the current capacities and limits of the system.
Review runbooks. Review continuous deploy code. Review test automation. Build a practice that enables the team members to refresh each other's mental models of how the systems behave under normal conditions so that everyone will recognize when conditions have become abnormal.
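One way to bring data to the alert portion of an Operational Review Meeting is to rank alerts by how rarely they required a human to act. The sketch below is only an illustration: `AlertEvent`, the `actionable` flag, and the thresholds `min_fires` and `max_actionable_ratio` are assumptions made for the example, not part of any particular monitoring tool, and a real team would pull this history from its own paging or alerting records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    """One firing of an alert (hypothetical record shape)."""
    alert_name: str
    actionable: bool  # did a human actually need to do something?


def noisy_alerts(events, min_fires=5, max_actionable_ratio=0.2):
    """Return (name, fire_count, actionable_ratio) for alerts that fire
    often but rarely need action, noisiest first."""
    fires = Counter(e.alert_name for e in events)
    actionable = Counter(e.alert_name for e in events if e.actionable)
    report = []
    for name, count in fires.items():
        ratio = actionable[name] / count
        if count >= min_fires and ratio <= max_actionable_ratio:
            report.append((name, count, ratio))
    # Lowest actionable ratio first; ties broken by higher fire count.
    return sorted(report, key=lambda row: (row[2], -row[1]))


if __name__ == "__main__":
    history = (
        [AlertEvent("disk_full", False)] * 6
        + [AlertEvent("api_5xx", True)] * 5
        + [AlertEvent("cpu_high", True)]
        + [AlertEvent("cpu_high", False)] * 9
    )
    for name, count, ratio in noisy_alerts(history):
        print(f"{name}: fired {count} times, {ratio:.0%} actionable")
```

A list like this is a conversation starter for the review, not a verdict: the team still decides whether a noisy alert should be tuned, rerouted, or deleted.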
Adjacent to Operational Review Meetings, teams should periodically run game days to double-check their failover, circuit breakers, and other safety measures. Start these as table-top role-playing exercises and grow incrementally into chaos engineering experiments.
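Between a table-top exercise and a full chaos experiment, a small scripted drill can check that a safety measure behaves the way the team believes it does. The sketch below assumes a toy, hand-written `CircuitBreaker` class and a simulated dependency; every name and threshold in it is hypothetical, and it stands in for whatever breaker a team actually runs rather than describing a real library.

```python
import time


class CircuitBreaker:
    """Toy breaker: opens after N consecutive failures, then allows one
    trial call once the cool-down has elapsed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: permit one trial call (half-open).
            # A single failure from here reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def run_drill():
    """Inject failures, then recovery, and record what the breaker did."""
    now = [0.0]  # fake clock so the drill does not need to sleep
    breaker = CircuitBreaker(3, 30.0, clock=lambda: now[0])
    calls_reaching_dependency = [0]

    def broken_dependency():
        calls_reaching_dependency[0] += 1
        raise ConnectionError("injected failure")

    def healthy_dependency():
        calls_reaching_dependency[0] += 1
        return "ok"

    for _ in range(5):
        try:
            breaker.call(broken_dependency)
        except (ConnectionError, RuntimeError):
            pass  # expected during the simulated outage

    observations = {"calls_during_outage": calls_reaching_dependency[0]}
    now[0] += 31.0  # advance past the cool-down
    observations["recovered"] = breaker.call(healthy_dependency) == "ok"
    observations["closed_after_recovery"] = breaker.opened_at is None
    return observations


if __name__ == "__main__":
    print(run_drill())
```

The value of a drill like this is less the script than the discussion it prompts: the team states what it expects to happen before running it, then compares that expectation with what the system did.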
Establishing these practices is a big cultural change that will require the help of many hands. Recruiting that help only works by developing relationships with the people who already have the power, authority, or influence to advocate for the changes.
The Law of Fluency: expertise hides the effort in work. From the outside it looks like the process works. What's really happening is that mundane experts make the thing work and the process gets all the credit.
One signal that your organization misunderstands this truth about expertise: when things break, how often do you attribute the breakage to human error—mistakes, failure to follow procedure, etc.? How much effort do you expend in trying to create and enforce process, OKRs, KPIs and the like?
The truth is your people are holding your rickety ship together every single day and only occasionally do the seas churn violently enough to expose the brittleness you have collectively ignored.