SRE: Site/Systems Reliability Engineer

A site reliability engineer (SRE) spends up to 50% of their time on ops-related work such as handling issues, on-call duty, and manual intervention. Because the software system an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the remaining 50% on development tasks such as new features, scaling, or automation. The ideal SRE candidate is a highly skilled system administrator with knowledge of code and automation.

“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”

Site Reliability Engineer = Software Engineer + Systems Enthusiast

The goal is to bridge the gap between the development team that wants to ship things as fast as possible and the operations team that doesn’t want anything to blow up in production. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams’ goals are fundamentally in tension. Maintaining 100% availability isn’t the goal of SRE. “Instead, the product team and the SRE team select an appropriate availability target for the service and its user base, and the service is managed to that SLO. Deciding on such a target requires strong collaboration from the business.”
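To make the idea of managing a service to an availability target concrete, here is a minimal sketch of the arithmetic. It assumes a time-based availability SLO over a rolling 30-day window. The function names and the window length are illustrative choices, not something the quoted text prescribes. The allowed downtime implied by the target is commonly called an "error budget".

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability SLO permits in the window.

    `slo` is a fraction, e.g. 0.999 for "three nines" (assumed time-based).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Return the unspent downtime budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a 30-minute outage, about 13 minutes of budget remain.
    print(round(budget_remaining(0.999, 30.0), 1))
```

Under this framing, a 100% target leaves no budget for change at all, which is one way to see why it is not the goal.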

Some of the typical responsibilities of a site reliability engineer:

  • Proactively monitor and review application performance
  • Handle on-call and emergency support
  • Ensure software has good logging and diagnostics
  • Create and maintain operational runbooks
  • Help triage escalated support tickets
  • Work on feature requests, defects and other development tasks
  • Contribute to overall product roadmap

Innovation and stability are always at odds, which is why DevOps has become popular. Make developers help keep the site up, and suddenly they care a lot more about stability. This is more efficient, both because it reduces organizational boundaries and because it is easier to build reliability in up front than to retrofit it later. The downside of DevOps is less specialization. Some people strongly prefer either engineering or ops, and even those who are balanced must learn a broader set of skills, which means each skill is learned less deeply. This is where the SRE comes into the picture, splitting their time roughly evenly between engineering and operations.

SREs are in charge of deployed services and are dedicated to taking care of specific ones. They monitor these services and make sure they handle the expected QPS (queries per second) across geographies and respond quickly enough. SREs are in charge of deployment, monitoring, failure handling, traffic management, and so on. At the end of the day, being an SRE is not just about running a scalable architecture that exposes a web stack on top of distributed data backends; it is also about good practices and standardization. SREs are also required to commit time to being on duty or on call, monitoring and responding to critical situations quickly enough, which is a big responsibility in itself.
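As a rough illustration of the monitoring described above, the sketch below flags regions whose traffic falls below an expected QPS floor or whose 95th-percentile latency exceeds a ceiling. The region names, thresholds, and data shapes are all hypothetical. A real setup would pull these numbers from whatever monitoring system the team uses.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; `pct` is an integer 1..100."""
    ordered = sorted(values)
    # Ceiling division done with integers to avoid floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def regions_out_of_bounds(latencies_ms_by_region, qps_by_region, min_qps, max_p95_ms):
    """Return {region: [reasons]} for regions that violate either threshold."""
    flagged = {}
    for region, latencies in latencies_ms_by_region.items():
        reasons = []
        if qps_by_region.get(region, 0.0) < min_qps:
            reasons.append("low_qps")
        if latencies and percentile(latencies, 95) > max_p95_ms:
            reasons.append("slow_p95")
        if reasons:
            flagged[region] = reasons
    return flagged


if __name__ == "__main__":
    # Hypothetical samples: "us" has a slow tail, "eu" has too little traffic.
    latencies = {"us": [100] * 18 + [900, 900], "eu": [100] * 20}
    qps = {"us": 500.0, "eu": 50.0}
    print(regions_out_of_bounds(latencies, qps, min_qps=100.0, max_p95_ms=300))
```

Checking a tail percentile rather than the mean is a deliberate choice here, since a handful of very slow requests can hide behind a healthy average.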

There is a mistaken notion that the person on call only has to fix things. The on-call engineer’s job is not actually to fix an issue but to see that it gets fixed, whether by getting help, filing bugs, declaring an incident, or whatever else it takes. They also have to write a postmortem document that covers what happened, what the causes were, and, critically, what is being done to prevent a recurrence.

Running reliable services requires reliable release processes. Site Reliability Engineers (SREs) need to know that the binaries and configurations they use are built in a reproducible, automated way so that releases are repeatable and aren’t “unique snowflakes.” Changes to any aspect of the release process should be intentional, rather than accidental. SREs care about this process from source code to deployment.
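One simple way to check that a build is reproducible is to build the same source twice and compare cryptographic digests of the resulting artifacts. The sketch below assumes the artifacts are ordinary files and uses SHA-256 from Python's standard library. The helper names are illustrative, and this is only one possible check, not a description of any particular release system.

```python
import hashlib
from pathlib import Path


def digest_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of an in-memory artifact."""
    return hashlib.sha256(data).hexdigest()


def digest_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def builds_match(first: Path, second: Path) -> bool:
    """True when two build outputs are byte-for-byte identical."""
    return digest_file(first) == digest_file(second)
```

If two builds of the same commit produce different digests, something in the process (timestamps, unpinned dependencies, build-host state) is leaking into the output, which is the kind of accidental change the paragraph above warns against.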

“Page me once, shame on you; page me twice, shame on me.”