Weekly Tech Notes #19: Availability

🎬 A video about the difference between SLIs, SLOs and SLAs. In a nutshell, SLIs drive SLOs, which in turn inform SLAs. More specifically:

SLIs are Service Level Indicators: metrics measured over time that describe the health of a service (e.g. the 95th percentile of latency over the last 5 minutes).
SLOs are Service Level Objectives: agreed-upon bounds for how often those SLIs must be met (e.g. the 95th percentile latency SLI stays within its target 99.9% of the time over a year).
SLAs are Service Level Agreements: business-level agreements that define the service availability promised to a customer and the penalties for breaking that promise.
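
To make the SLI/SLO relationship concrete, here is a minimal Python sketch. It assumes latency samples are already grouped into 5-minute windows, uses a nearest-rank percentile, and counts a window as "good" when its 95th percentile stays under a threshold. The sample data, the 300 ms threshold, and the function names are all hypothetical, chosen only for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def slo_compliance(windows, threshold_ms, pct=95):
    """Fraction of windows whose pct-th percentile latency (the SLI)
    is at or below the threshold."""
    good = sum(1 for w in windows if percentile(w, pct) <= threshold_ms)
    return good / len(windows)

# Hypothetical latency samples (ms), one inner list per 5-minute window.
windows = [
    [120, 130, 110, 140, 150],
    [200, 450, 180, 190, 210],  # p95 is 450 ms, so this window is "bad"
    [100, 105, 98, 110, 102],
    [130, 125, 140, 135, 128],
]

THRESHOLD_MS = 300   # assumed SLI target
SLO_TARGET = 0.999   # the 99.9% objective from the example above

compliance = slo_compliance(windows, THRESHOLD_MS)
slo_met = compliance >= SLO_TARGET
print(f"compliance={compliance:.3f}, SLO met: {slo_met}")
```

With only four windows and one of them bad, compliance is 75%, far below the 99.9% objective. In practice the same calculation would run over a much longer period, and an SLA would then attach penalties to missing the objective.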

📚 Availability is an important concept for distributed systems. There’s math behind it, but also reasonable rules of thumb to follow. This week, I’m sharing a few articles that cover the topic.

The quest for availability in the cloud – Some basic definitions and AWS concepts to achieve high availability.
The Calculus of Service Availability – This article explains some more in-depth concepts from the Google SRE book. It is a good starting point, but if you want the full picture, just read the book.
Lessons Learned from Twenty Years of Site Reliability Engineering – Eleven lessons learned from Google’s SRE team over 20 years.
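
As a taste of the math mentioned above, here is a small Python sketch of three common rules of thumb: how much downtime an availability target allows, how availability drops when a service depends on several components in series, and how it improves with redundant replicas. It assumes failures are independent, which real systems rarely satisfy, so treat the numbers as rough bounds. The function names are hypothetical and not taken from any of the articles.

```python
HOURS_PER_YEAR = 365 * 24

def downtime_budget_hours(availability, period_hours=HOURS_PER_YEAR):
    """Allowed downtime for a given availability target over a period."""
    return (1 - availability) * period_hours

def serial_availability(components):
    """All components must be up: multiply their availabilities."""
    result = 1.0
    for a in components:
        result *= a
    return result

def redundant_availability(replicas):
    """At least one replica must be up: 1 minus the chance all are down."""
    all_down = 1.0
    for a in replicas:
        all_down *= (1 - a)
    return 1 - all_down

print(f"99.9% per year allows {downtime_budget_hours(0.999):.2f} hours of downtime")
print(f"Two 99.9% dependencies in series: {serial_availability([0.999, 0.999]):.4%}")
print(f"Two 99% replicas in parallel: {redundant_availability([0.99, 0.99]):.4%}")
```

The takeaway is that every hard dependency in the request path lowers the availability ceiling, while redundancy raises it only as far as failures are actually uncorrelated.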