Over the years, many new methodologies have aimed to help organizations manage their technology more efficiently, whether that means improving the efficiency of programmers or of the operators who manage a company’s technology infrastructure. DevOps, which sought to bring developers and operators together, is one such example, and one that has seen strong uptake. But one of Google’s internal processes has also surged in popularity lately: Site Reliability Engineering (SRE).
SRE is a development practice that incorporates operations thinking. “SRE is what you get when you treat operations as if it’s a software problem,” Google’s SRE documentation states.
“So [development and operations] are really important for ensuring the services are up and running. But these two are entirely separate entities where they work in silos,” said Nith Mehta, VP of technical services at Catchpoint. “So obviously when you have two different teams working in silos, but for the same cause, there’s going to be a gap…And this is where the need for SRE comes in because they [can] bridge this.”
Mehta explained that smaller companies can get away with these siloed environments, but SRE comes in as a company scales. “[With SRE], you’re not having a separate team, but you’re trying to marry an engineer and ops skillset,” he said. “Typically companies look for someone who has an engineering background, but also is pretty good at operating systems, and also understands networks pretty well. So that way there is less friction, more efficiency, and there is one single team that is capable of seeing through end-to-end and learning through mistakes and improving as they progress.”
Finding the right person to be an SRE can be a challenge for companies. Unlike developers or system admins, SREs need to have a wider range of skills, or at least be capable of learning them. On the development side, SREs need to be able to code, and on the operations side, they need to be familiar with networking concepts in order to handle the infrastructure aspect of applications, Mehta explained.
The talent pool for SREs is much smaller than the one a hiring manager can draw on for developers or operators. “Traditionally, organizations have looked at engineers who could code and operations folks who could run the systems, like system admins and so on,” said Mehta. “That’s a model that has worked for decades…But for SREs, you’re looking at developers who also understand the operating systems and network and so on. So that kind of reduces the talent pool that is available for organizations to hire.”
According to Mehta, this requirement slows down the hiring process. To compensate, companies will often look to build up an SRE team with a balance of skills across their SREs. “The organizations, what they do is they try to build a mix of these different skill sets, hoping that they will eventually learn from each other,” said Mehta. “And you kind of balance it out around the career path and the progression of an engineer being able to learn the Ops side of the world and vice versa. That’s a process. That’s not something that’s going to happen overnight. So this whole talent pool is the first problem to tackle.”
Apart from the technical skills, there are also some soft skills that are helpful, Auth0 software engineer Damian Schenkelman explained in a talk at Datadog Dash last year. He believes those who are teachers, advocates, and problem-solvers would be a good fit for SRE. In order for the SRE organization to scale, SREs need to be able to transfer their knowledge to others. They also need to be advocates for reliability and for the SRE brand. “The SRE team brand is very important because people need to be aware of what it does and doesn’t do in order for your team SRE to be effective,” he said. Finally, SRE team members need to be good problem-solvers because these teams will get all sorts of issues thrown at them.
He believes these qualities are all things that can be learned by someone who is willing. “I definitely think all of those qualities can be learned. What needs to come from the person is the willingness to learn those things. Not everyone might be interested in those skills, and that’s OK,” he said.
Apart from the talent pool, building up an SRE organization requires a change in the way that a team works and collaborates. In order to be successful, teams need to adopt blameless postmortems. “And that can only be solved when management comes in and introduces a process in place, which helps reduce the blame games and fear of blame for a problem, where companies can collaborate,” said Mehta.
Mehta recommends introducing things like error budgets and performance budgets. This gives organizations room to collaborate and try things out.
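The article does not define an error budget. Under the common definition, it is the amount of failure a service can absorb in a given window while still meeting its reliability target. The sketch below illustrates that arithmetic for a request-based target; the function names and the example figures are hypothetical and do not come from Mehta or Catchpoint.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests allowed to fail in the window without missing the target.

    slo_target is a fraction between 0 and 1, e.g. 0.999 for "99.9%".
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    The result drops below zero once failures exceed the budget.
    """
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("a 100% target (or zero traffic) leaves no budget")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% target.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

With these assumed numbers the budget is roughly 1,000 failed requests, so 250 failures would leave about three quarters of it unspent. A team in that position has room to ship changes and experiment, which is the kind of latitude the budgets are meant to provide.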
Mehta also explained that, as with any other cultural shift, it’s important to ensure that your team doesn’t fall back into old roles and habits. “How do you ensure that they’re not back to the old days of doing the job, just the operations folks or system admins, they’re lost into handling outages, incidents, day-after-day, which means they don’t really have the time to do the actual job of an SRE, which is building a system, making it more reliable, automating some of those manual efforts.”
To succeed with SRE, Mehta recommended that companies start gradually, introducing SRE to a single service first. He said that organizations that have been successful with SRE have taken this gradual approach.
Organizations also need to ensure that they’re providing their SREs with enough time to actually do SRE. “First you have to measure the amount of time that SREs spend on incidents and troubleshooting, being on call because if they’re spending most of their time on this, then they’re not SREs,” said Mehta. “They are SREs in title, but they’re essentially doing the job of a system admin or operations.”
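One rough way to take the measurement Mehta describes is to total the hours logged per activity and compute the share that goes to incidents, troubleshooting, and on-call work. The sketch below assumes a simple mapping of activity names to hours; the activity labels and the function name are hypothetical, and the one-half threshold is only a literal reading of “most of their time.”

```python
def operational_share(hours_by_activity: dict[str, float],
                      operational: set[str]) -> float:
    """Fraction of logged hours spent on operational (reactive) work."""
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops_hours = sum(hours for activity, hours in hours_by_activity.items()
                    if activity in operational)
    return ops_hours / total


if __name__ == "__main__":
    # Hypothetical week for one SRE; labels are illustrative only.
    week = {"incidents": 12, "on_call": 10, "automation": 10, "design": 8}
    share = operational_share(week, {"incidents", "on_call"})
    # "Most of their time" is read here as more than half.
    print(share, share > 0.5)
```

Tracked over several weeks, a share that stays above one half would suggest that the team is, in Mehta’s terms, SRE in title only.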
Since implementing SRE, Auth0 has seen a number of benefits, Schenkelman explained. It has created a culture of building reliable services, instrumenting important things in production, and creating actionable alerts. Engineers are also now more aware of the techniques used in building reliable systems and consider them in their designs. Tooling and libraries for instrumentation have also improved, especially for alerting, which was one of the team’s focus areas. Finally, reliability of SRE-owned services has been great, and the team has also improved reliability for the whole system by contributing code across different teams, said Schenkelman.
Challenges of implementing SRE
The main challenge for Auth0 when implementing SRE was bringing clarity and understanding to the work that the SRE team would be doing. There was a big focus on education and on explicitly stating what SRE would mean at Auth0, since SRE is such a widely used term across the industry. To that end, the team created internal blog posts, gave presentations, and held office hours.
Schenkelman has learned a lot from this process, and has a lot of advice to share for those about to begin this journey:
- “Start with the ‘why,’” he said. It’s important to understand what your motivations and goals are. He believes SRE should be “a means to an end, not an end itself.”
- Once the “why” is determined, do research to decide your company’s “SRE flavor.” According to Schenkelman, there are many different ways that companies do SRE. “Even teams at Google have different practices, and they wrote the book on SRE.”
- Communication is key. You must communicate your plan clearly to stakeholders. He explained that some stakeholders might have heard about SRE and just need clarification, while others may be new to SRE completely.
- “Collaborate with other teams, deliver value frequently internally, and showcase it often,” said Schenkelman. He explained that teams should quickly show how the new decision is paying off for the organization.
For all of these points, he believes that it can be helpful to have a sponsor for the idea high up in the organization. “They can open doors for the SRE team(s), point you to opportunities, help have tough conversations and ensure budget for SRE as a team,” said Schenkelman.