I barely know jack shit about anything, but just the idea of being an SRE, of a system being down and me being called to fix it, pumps the blood in my veins. I have been a software dev, an infra engineer (3 months) and a security engineer (2.5 months). I do not have more than 2 years of experience, but I want to be an SRE. I see it as a responsibility more than just a tag for a role.
What does it mean to be an SRE and what are the tasks:
Able to understand and often predict the emergent behavior of complex systems. At least one SRE will be involved in the design of any large system, because of their ability to reason about the failure modes of the system under design.
Able to solve problems with high-quality code. Google runs large services that are constantly getting new features, and SRE is responsible for writing the software that makes that possible with a sub-linear number of humans running the machine. No joke, capacity planning is hard enough at this scale that a couple of SREs decided it would be best to write a large-scale mixed-integer solver to approximate the bin-packing.
Statistically literate; part of SRE’s job is measuring and enforcing service SLAs. In practice, this means that SREs spend a lot of time bringing rigor to the administrative process.
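To make the "measuring and enforcing" part concrete, here is a minimal sketch of how an availability target turns into an error budget. The 99.9% target, the 30-day window, and the function names are illustrative assumptions, not figures from any real service.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - target)


def budget_remaining(target: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the target was missed)."""
    budget = error_budget_minutes(target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target, 10.8 minutes of downtime so far this window.
    print(round(error_budget_minutes(0.999), 1))    # about 43.2 minutes per 30 days
    print(round(budget_remaining(0.999, 10.8), 2))  # about 0.75 of the budget left
```

The point of the arithmetic is that "three nines" over a month leaves well under an hour of total downtime, which is why the rigor matters.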
Both SRE and DevOps are methodologies addressing organizations’ needs for production operation management. But the differences between the two doctrines are quite significant: While DevOps teams raise problems and dispatch them to Dev to solve, the SRE approach is to find problems and solve some of them themselves. While DevOps teams would usually choose the more conservative approach, leaving the production environment untouched unless absolutely necessary, SREs are more confident in their ability to maintain a stable production environment and push for rapid changes and software updates. Not unlike the DevOps team, SREs also thrive on a stable production environment, but one of the SRE team’s goals is to improve performance and operational efficiency.
SREs do write code, but they tend to spend at least as much time on operational work. Officially, when the service is considered “healthy,” SREs are expected to spend up to 50% of their time on operational work. For services that aren’t so healthy, it can be even more.
Operational work is everything you do to maintain the health of your service that isn’t building software. That includes highly skilled, engaging work like troubleshooting outages in real time, and responding to problems detected by monitoring systems to prevent outages before they happen.
It also includes less skilled, tedious work like preparing for planned maintenance, tweaking the size or location of your service to handle additional users, rolling out new versions of code, rolling out config changes, configuring A/B tests to verify new code and configurations before they roll out, waiting for rollouts to finish, checking in on rollouts to see why they haven’t finished yet, and filing tickets with other teams whose bugs are keeping your rollouts from finishing.
When you do write code as an SRE, it probably won’t be big projects. It’ll be refactoring configurations, automating some of that tedious operational work, and tweaking tools you use frequently. Most of it won’t be visible to anyone outside your team, and none of it will be visible to anyone outside of Google.
Your involvement with big, user-facing applications will mostly take the form of reviewing new designs (and sometimes code) with an eye for reliability. Occasionally you might dig into your service’s code to track down a bug, but more likely you’ll hand that off to the developers once you suspect the bug exists.
There’s a key difference to notice in the job requirements for SWE and SRE roles. SREs are expected to have:
“Experience with Unix/Linux operating systems internals and administration (e.g., filesystems, inodes, system calls) or networking (e.g., TCP/IP, routing, network topologies, and hardware, SDN).”
And SWEs are expected to have: “Experience with algorithms, data structures, complexity analysis, and software design.”
As opposed to a typical site operations/support role in most companies, Google’s SREs are a very specialized crew of sharp engineers. They are specifically trained for months on pieces of the Google stack. Deployment, monitoring, failure handling, traffic management: SREs are in charge of these. Yes, it has bits that have to do with systems programming and computer architecture, but it is also much more. It is about running a scalable architecture that exposes a web stack on top of distributed data backends. It is also about good practices and standardization. Google has built sophisticated systems towards this end, and SREs are the people who are well-acquainted with these systems and processes. In addition to that, SREs are also required to make a time commitment while ‘on duty’: being able to monitor and respond to critical situations quickly enough. This extra commitment is also duly compensated by Google (by paying a little extra on top of what a typical software engineer at the same level gets).
A good SRE is proactive in dealing with incidents. The good SRE will ensure that all pages are actionable, and that non-actionable pages are fixed or have their thresholds adjusted. The good SRE is well aware of the system demands and knows the true meaning of N+1 redundancy. He will never accept a system with no redundancy, nor will he end up building an N+10 system that will never get utilized.
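As a rough illustration of the N+1 idea, the sketch below sizes a fleet so that it still carries peak load after losing one instance. The load and capacity numbers, and the function names, are made up for the example.

```python
import math


def instances_needed(peak_load: float, per_instance_capacity: float, spares: int = 1) -> int:
    """N instances to carry peak load, plus `spares` extra (N+1 when spares == 1)."""
    if peak_load < 0 or per_instance_capacity <= 0 or spares < 0:
        raise ValueError("load and spares must be non-negative, capacity positive")
    n = math.ceil(peak_load / per_instance_capacity)
    return n + spares


def survives_failures(total_instances: int, peak_load: float,
                      per_instance_capacity: float, failures: int = 1) -> bool:
    """True if the fleet still covers peak load after `failures` instances are lost."""
    return (total_instances - failures) * per_instance_capacity >= peak_load


if __name__ == "__main__":
    # Hypothetical service: 4500 requests/s at peak, 1000 requests/s per instance.
    total = instances_needed(4500, 1000)         # N = 5, so N+1 = 6
    print(total, survives_failures(total, 4500, 1000))
```

With these numbers, five instances carry the load but cannot lose one, six can, and sixteen would be the N+10 system that never gets utilized.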
A good SRE needs good communication skills. Just being a naysayer to any launch isn’t productive. Instead, the SRE will assess and mitigate the risks, and help in all ways possible with the launch.
SRE work at Google covers a lot of ground. One has to be familiar with the internal workings of the Linux OS, networking, system administration, and programming and algorithms. Nobody is expected to be an expert in all of those fields at once, but one should have at least solid fundamental knowledge in each area.
An SRE has to have mastery of:
- Software engineering (algorithms and data structures, architecture and design, etc.)
- Sysadmin work (Linux administration, networking, troubleshooting, etc.)
“Class SRE implements DevOps”
What this means is that SRE covers DevOps but is much more. So what are the extra aspects? The major one is programming; system-level programming, to be precise.
The main concern here is that, though a typical entry-level SRE knows how to code (hacks, dirty patching, small feature snippets, etc.), he falls short of the regular SWE experience where traditional languages such as C, C++, and Java are used. (That is me.) At most companies, newbie SREs tend to use infrastructure- and scripting-friendly languages like Python and Go (maybe Rust), but some companies need someone with significant experience in the traditional languages mentioned above, and they might be willing to train that person on the infrastructure/systems side once he comes on board.
Most big companies and good startups tend to look at what you have learnt over the past years. They might be willing to train you in the areas where you fall short; their most important goal is to hire smart folks, or folks who they feel could contribute greatly to the team sooner or later.
At the end of the day, the job of an SRE is to increase the reliability of whatever it is that they’re touching on that day. This could be any number of things:
- the application/product itself
- the server/OS environment you run in
- the human processes you use
- the monitoring systems you depend on
- the additional tech you use (databases, web servers, the list goes on)
“Increasing the reliability” of things is a very vague phrase. The actual output of an SRE, at least as the role is defined at StumbleUpon, varies from day to day. In the last two weeks, people on my team have done a bunch of things that fall within the scope of reliability engineering, including:
- Identified and fixed an issue with our monitoring data by writing a tool to export, munge, and re-import the data into the underlying HBase storage system used by OpenTSDB. Output: code.
- Identified an issue with HBase (in our configuration) where a specific type of single node failure can bring the site down after ~5 minutes. Output: detailed information and a bug report for HBase developers.
- Wrote a postmortem for an outage we had, describing what went wrong and how we intend to ensure that we can identify and recover quickly from these problems in the future. Output: a report for general consumption by the company.
- Debugged an issue with a long running set of Gearman workers leaking memory and making worker machines become unavailable. Output: a suggested fix for the engineering team to evaluate.
Some of these are code related. Some of the code related items are easily driven by a single person working in isolation. Most of the work that my team does, though, requires interacting with the rest of the company. Not just engineers, either, and not just people who are within the company. The SREs on my team spend a lot of time interacting with people.
SRE is what happens when you assign a group of developers to an operations role, and ask them to make it not suck. It is the discipline of maintaining the illusion that Google is always working. It’s an illusion; the truth is that things are always broken… but nearly all the time we manage to keep it below the threshold where anyone else will notice.
Developers get bored of running manual procedures, and so automate everything they can.
Eventually, you get SRE, where the operations jobs are mostly done by automation, and the SREs are there to teach the automation to do new things, and to fix it when it goes wrong.
Google invented SRE because it very quickly became clear that you can’t operate enough computers to do a search engine if you are manually configuring each one. Instead, the tag line is “machines are cattle, not pets”; they have serial numbers, not names, and if one fails it is taken offline, repaired, and reinstalled from scratch as it is returned to service. Most of the time even SREs don’t think much about individual machines.
An SRE team is responsible for a set of related systems within Google; that set is chosen so it is feasible to learn enough about them all to do oncall. The number of systems can be anything from one, to five or six closely connected but quite complex systems (my team’s portfolio is like this), or up to a few hundred simple and very similar systems (Maps is like this, it’s made of a lot of microservices).
There are two kinds of typical day: either you’re oncall, or you’re not.
If oncall, your time is dedicated to interrupts from the system and/or other engineers. The monitoring systems have rules in them that will alert the oncall if something that should be true about the system is not. At that point, the oncall will take a quick look and decide if they can fix it themselves, or if they need to enlist some help. Enlisting help might mean asking the oncall of another team to examine what is happening at their end, or declaring an incident and launching the incident management process. The oncall’s job is not actually to fix an issue but to see that it gets fixed, by getting help, filing bugs, declaring an incident, or whatever it takes.
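A monitoring rule of that kind can be thought of as an invariant plus a name to page on. The sketch below is a toy model only: the metric names, thresholds, and the `Rule` type are invented for illustration and do not describe any real monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    invariant: Callable[[Metrics], bool]  # returns True while the system is healthy


def firing_alerts(rules: List[Rule], metrics: Metrics) -> List[str]:
    """Names of the rules whose invariant does not hold for this metrics snapshot."""
    return [rule.name for rule in rules if not rule.invariant(metrics)]


# Hypothetical rules: each states something that should be true about the service.
RULES = [
    Rule("error_ratio_below_1_percent",
         lambda m: m["requests"] == 0 or m["errors"] / m["requests"] < 0.01),
    Rule("p99_latency_under_500ms",
         lambda m: m["p99_latency_ms"] < 500),
]

if __name__ == "__main__":
    snapshot = {"requests": 10000, "errors": 250, "p99_latency_ms": 320}
    print(firing_alerts(RULES, snapshot))  # only the error-ratio rule fires here
```

Writing rules as "this should be true" keeps each page tied to a concrete, checkable statement about the system.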
Oncall frequently spills over into the following days, because incidents require documentation. Someone has to write a postmortem document that covers what happened, what were the causes, and critically, what is being done to prevent a recurrence. There is a tight deadline for this.
If not oncall, you have project work, which might be to make some improvements to the system you work on. Often this has to do with improving its deployment, monitoring, or performance, but if the service is infrastructure you might actually be doing development work on it to build features. My current project is somewhat like that: it involves replacing the mechanism by which one of our services finds and loads its configuration. Often project work arises from the work items in postmortem documents.
But project work can also be development work on generic SRE-wide tools, like infrastructure for building dashboards, or the alerting tools, or things like the programs (there are four or so choices…) that build SRE oncall calendars.
Project work can also be writing standards documents, like the set of standard practices that Maps SRE use to run a few hundred services; for them, it’s pretty common for an oncall to be paged by something they’ve never heard of, and yet be able to figure out what it is and why it paged and get it back serving within a fairly short time frame. That’s only possible because they’re all as close to identical as they can be given the core function is different. Same monitoring, same links in the same places in the dashboards, same documentation layout, autogenerated links to everything that matters in the alert itself.
SRE project work is sometimes what is called a “production readiness review”, which is the process by which an SRE team takes over the pager for a service from the team that developed it. The review goes through the team and SRE-wide standards for services, makes sure the new one complies, gets the rate of pages down to an acceptable level, and checks that all the details are set up so it is manageable.
SRE teams are always split across two sites about 6 timezones apart; this means your business days overlap, so you have times you can hold a team meeting, but it also means you can have a 12 hour oncall shift that does not require anyone to be working past midnight or before 5 AM unless there is a major incident going on. Occasionally the handoff and reporting requirements of the largest incidents mean people’s oncall shifts stretch a bit at each end; when this happens, there’s usually some kind of recognition that it was over-and-above normal.
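The shift arithmetic can be checked with a few lines of Python. The UTC offsets (0 and -6) and the hand-off times (05:00 and 17:00 UTC) below are assumptions picked to match the description, not actual Google sites or schedules.

```python
from typing import Tuple


def local_shift(start_utc_hour: int, length_hours: int, utc_offset: int) -> Tuple[int, int]:
    """Return (start, end) of a shift as local clock hours in the range 0..23."""
    start = (start_utc_hour + utc_offset) % 24
    end = (start + length_hours) % 24
    return start, end


if __name__ == "__main__":
    # Hypothetical pair of sites six hours apart, each taking a 12-hour shift.
    print(local_shift(5, 12, 0))    # site A: 05:00 to 17:00 local time
    print(local_shift(17, 12, -6))  # site B: 11:00 to 23:00 local time
```

Between them the two shifts cover all 24 hours, and with these assumed hand-off times neither site works past midnight or before 5 AM local time.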
There’s a pretty substantial bonus for taking the pager, either in cash or extra leave. SRE culture is very supportive, and postmortems truly are blameless; everyone in SRE knows that they’ve done something that in hindsight wasn’t the best action, the aim being that you do something reasonable at the time given what you know. Better to act than to sit and allow a problem to grow out of hand.
Only the largest and/or most important systems have SRE support. It’s expensive. So many products are run by their developers, and SRE teams also do a fair bit of consulting for those, since SREs are the experts on how production works and the best practices for keeping it happy.