Real-World SRE

Chapter 1. Introduction

As the internet has grown, people have become used to having access to content all of the time, from a variety of devices. This means that the reputation of a brand has slowly become connected with the responsiveness and reliability of its products. People choose Google for searching because it always returns relevant and useful results quickly. People share content on Twitter because their message will be seen in real time by their followers. Netflix's great content selection is useless if it cannot deliver consistently on a variety of network speeds. As this reliability has become more important to businesses, a specialization focused on software reliability has emerged: Site Reliability Engineering (SRE). This chapter will introduce you to the field and also describe what you will learn from this book, helping you to write software to navigate the ever-changing internet landscape.

Before we explain what the field and role of SRE entail, let us start with a thought experiment. Imagine that it's early in the morning and you wake up to a screenshot of a blank web page in a text message from a friend with the caption: "I can't load your website."

If your personal website is indeed down, maybe you will message back, "I'll check it after breakfast," or, "Oh yeah, been meaning to look into that." If it is your company's website, or the page hosting the resume that you just sent to 15 possible employers, then a stream of expletives and indecipherable emojis will probably erupt, both out loud and in your reply. This is because, for many businesses, websites have become the main source of incoming business. For some companies, like Facebook, Amazon, or iFixit, their entire business is a website. For other businesses, like restaurants or advertising agencies, a website acts as a way for people interested in the organization to learn more. It is often part of the marketing flow that helps companies to grow.

It is probably impossible to completely remove the adrenaline spike that comes from discovering a website is down if you are responsible for fixing it. However, we can work to set up a framework to limit how often things break. We can create a world where responding to outages is easy, and transition from, "Oh god, everything is on fire, what do I do?!" to "Oh hey, a page isn't loading, so let's check out what's having a rough day."

This chapter is our introduction to the book and the field of SRE. We will cover the following topics in the next few pages:

  • Exploring a brief history of the people who work on information systems
  • Defining what SRE is
  • Describing what is in the book and providing a rough framework for SRE