Transitioning to SRE

Published: February 11th, 2020

Over the years, many new methodologies have aimed to help organizations manage their technology more efficiently, whether by making programmers more productive or by supporting the operators who manage a company’s technology infrastructure. DevOps, which sought to bring developers and operators together, is one example that has seen strong uptake. But one of Google’s internal processes has also surged in popularity lately: Site Reliability Engineering (SRE).

SRE is a development practice that incorporates operations thinking. “SRE is what you get when you treat operations as if it’s a software problem,” Google’s SRE documentation states.

“So [development and operations] are really important for ensuring the services are up and running. But these two are entirely separate entities where they work in silos,” said Nith Mehta, VP of technical services at Catchpoint. “So obviously when you have two different teams working in silos, but for the same cause, there’s going to be a gap…And this is where the need for SRE comes in because they [can] bridge this.”

Mehta explained that smaller companies can get away with these siloed environments, but as a company scales, this is where SRE comes in. “[With SRE], you’re not having a separate team, but you’re trying to marry an engineer and ops skillset,” he said. “Typically companies look for someone who has an engineering background, but also is pretty good at operating systems, and also understands networks pretty well. So that way there is less friction, more efficiency, and there is one single team that is capable of seeing through end-to-end and learning through mistakes and improving as they progress.”

Finding the right person to be an SRE can be a challenge for companies. Unlike developers or system admins, SREs need to have a wider range of skills, or at least be capable of learning them. On the development side, SREs need to be able to code, and on the operations side, they need to be familiar with networking concepts in order to handle the infrastructure aspect of applications, Mehta explained.

The talent pool for SREs is much smaller than a hiring manager would expect from developers or operators. “Traditionally, organizations have looked at engineers who could code and operations folks who could run the systems, like system admins and so on,” said Mehta. “That’s a model that has worked for decades…But for SREs, you’re looking at developers who also understand the operating systems and network and so on. So that kind of reduces the talent pool that is available for organizations to hire.”

According to Mehta, this requirement slows down the hiring process. To compensate, companies will often look to build up an SRE team with a balance of skills across their SREs. “The organizations, what they do is they try to build a mix of these different skill sets, hoping that they will eventually learn from each other,” said Mehta. “And you kind of balance it out around the career path and the progression of an engineer being able to learn the Ops side of the world and vice versa. That’s a process. That’s not something that’s going to happen overnight. So this whole talent pool is the first problem to tackle.”

Apart from the technical skills, there are also some soft skills that are helpful, Auth0 software engineer Damian Schenkelman explained in a talk at Datadog Dash last year. He believes those who are teachers, advocates, and problem-solvers would be a good fit for SRE. In order for the SRE organization to scale, SREs need to be able to transfer their knowledge to others. They also need to be advocates for reliability and for the SRE brand. “The SRE team brand is very important because people need to be aware of what it does and doesn’t do in order for your team SRE to be effective,” he said. Finally, SRE team members need to be good problem-solvers because these teams will get all sorts of issues thrown at them.

He believes these qualities are all things that can be learned by someone who is willing. “I definitely think all of those qualities can be learned. What needs to come from the person is the willingness to learn those things. Not everyone might be interested in those skills, and that’s OK,” he said.

Apart from the talent pool, building up an SRE organization requires a change in the way that a team works and collaborates. In order to be successful, teams need to adopt blameless postmortems. “And that can only be solved when management comes in and introduces a process in place, which helps reduce the blame games and fear of blame for a problem, where companies can collaborate,” said Mehta.

Mehta recommends introducing things like error budgets and performance budgets. This gives organizations room to collaborate and try things out.
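As a rough illustration of the arithmetic behind an error budget, the sketch below uses the common definition in which the budget is the share of a time window that a service may be unavailable while still meeting its availability target. The article does not give a formula, so the function names, the 30-day window, and the 99.9% target are all assumptions made for this example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability tolerated over the window (hypothetical helper).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Assumed example values: a 99.9% target and 10.8 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 10.8), 2))
```

Under these assumptions, a team with budget left has room to ship changes and experiment, while a team that has spent its budget has a shared, blame-free signal to slow down and focus on reliability.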

Mehta also explained that, like any other cultural shift, it’s important to ensure that your team doesn’t fall back into old roles and habits. “How do you ensure that they’re not back to the old days of doing the job, just the operations folks or system admins, they’re lost into handling outages, incidents, day-after-day, which means they don’t really have the time to do the actual job of an SRE, which is building a system, making it more reliable, automating some of those manual efforts.”

To succeed with SRE, Mehta recommended that companies start gradually, picking a certain service and introducing SRE to that area first. Organizations that have been successful with SRE have taken this gradual approach, he said.

Organizations also need to ensure that they’re providing their SREs with enough time to actually do SRE. “First you have to measure the amount of time that SREs spend on incidents and troubleshooting, being on call because if they’re spending most of their time on this, then they’re not SREs,” said Mehta. “They are SREs in title, but they’re essentially doing the job of a system admin or operations.”
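One way to make that measurement concrete is to total logged hours by category and check what share goes to reactive work. This is only a sketch: the category names, the sample week, and the 50% threshold (taken here as a reading of "most of their time") are assumptions for illustration, not something Mehta or Catchpoint specified.

```python
from typing import Mapping

# Hypothetical category labels for reactive work; adapt to your own time tracking.
REACTIVE_CATEGORIES = {"incident", "troubleshooting", "on_call"}


def reactive_share(hours_by_category: Mapping[str, float]) -> float:
    """Fraction of logged hours spent on reactive work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    reactive = sum(hours for category, hours in hours_by_category.items()
                   if category in REACTIVE_CATEGORIES)
    return reactive / total


def mostly_reactive(hours_by_category: Mapping[str, float],
                    threshold: float = 0.5) -> bool:
    """True when reactive work exceeds the (assumed) threshold share."""
    return reactive_share(hours_by_category) > threshold


if __name__ == "__main__":
    # Invented sample week for one engineer, in hours.
    week = {"incident": 12, "on_call": 10, "troubleshooting": 4,
            "automation": 8, "design": 6}
    print(round(reactive_share(week), 2), mostly_reactive(week))
```

Tracking this share over time, rather than for a single week, gives a clearer picture of whether a team is sliding back into a pure operations role.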

Since implementing SRE, Auth0 has seen a number of benefits, Schenkelman explained. It has created a culture of building reliable services, instrumenting important things in production, and creating actionable alerts. The team has also noted that engineers are now more aware of the techniques used in building reliable systems and now consider those in their designs. Tooling and libraries for instrumentation have also improved, especially with alerting, which was one of the team’s focus areas. Finally, reliability of SRE-owned services has been great, and they have also improved reliability for the whole system by contributing code across different teams, said Schenkelman.

Challenges of implementing SRE
The main challenge for Auth0 when implementing SRE was bringing clarity and understanding to the work that the SRE team would be doing. There was a big focus on education and explicitly stating what SRE would mean at Auth0, since SRE is such a widely-used term across the industry. They also created internal blog posts, did presentations, and held office hours to help them with this goal.

Schenkelman has learned a lot from this process, and has a lot of advice to share for those about to begin this journey:

“Start with the ‘why,’” he said. It’s important to understand what your motivations and goals are. He believes SRE should be “a means to an end, not an end itself.”
Once the “why” is determined, do research to decide your company’s “SRE flavor.” According to Schenkelman, there are many different ways that companies do SRE. “Even teams at Google have different practices, and they wrote the book on SRE.”
Communication is key. You must communicate your plan clearly to stakeholders. He explained that some stakeholders might have heard about SRE and just need clarification, while others may be new to SRE completely.
“Collaborate with other teams, deliver value frequently internally, and showcase it often,” said Schenkelman. He explained that teams should quickly show how the new decision is paying off for the organization.

For all of these points, he believes that it can be helpful to have a sponsor for the idea high up in the organization. “They can open doors for the SRE team(s), point you to opportunities, help have tough conversations and ensure budget for SRE as a team,” said Schenkelman.

Article Tags

auth0, Catchpoint, Google, Site Reliability Engineering, SRE

About Jenna Barron

Jenna Barron is Executive News Editor of SD Times and ITOps Times.

