The Evolution of SRE at Google: Using STAMP to improve resilience in Google production systems – Falzone and Sloss (2024)

[hat tip John Siegrist]

December 18, 2024

Research

Authors: Falzone and Sloss

Article shepherded by: Rik Farrow

Billions of people around the world use Google’s products every day, and they count on those products to work reliably. Behind the scenes, Google’s services have increased dramatically in scale over the last 25 years, and failures have become rarer even as the scale has grown. Google’s SRE team has pioneered methods to keep failures rare by engineering reliability into every part of the stack. SREs have scaled up methods that have gotten us very far: Service Level Objectives (SLOs), error budgets, isolation strategies, thorough postmortems, progressive rollouts, and other techniques. In the face of increasing system complexity and emerging challenges, we at Google are always asking ourselves: what’s next? How can we continue to push the boundaries of reliability and safety?

To address these challenges, Google SRE has embraced systems theory and control theory. We have adopted the STAMP (System-Theoretic Accident Model and Processes) framework, developed by Professor Nancy Leveson at MIT, which shifts the focus from preventing individual component failures to understanding and managing complex system interactions. STAMP incorporates tools like Causal Analysis based on Systems Theory (CAST) for post-incident investigations and System-Theoretic Process Analysis (STPA) for hazard analysis.

In this article, we will explore the limitations of our traditional approaches and introduce you to STAMP. Through a real-world case study and lessons learned, we’ll show you why we believe STAMP represents the future of SRE not just at Google, but across the tech industry.

Continues in link:
The Evolution of SRE at Google | USENIX

https://www.usenix.org/publications/loginonline/evolution-sre-google

