Site reliability engineers (SREs) are failing to build observability into their processes. A new report found that only 53% of respondents have an observability tool. The top tool categories SREs use are monitoring and alerting, dashboarding, and infrastructure as code.

Additionally, when asked about their key responsibilities, SREs left out core aspects of observability such as events, metrics and tracing.

The 2020 SRE Survey Report was conducted by Catchpoint and the DevOps Institute, and is based on more than 600 responses from SREs. According to the report, the lack of observability among SREs is troubling because SRE is what delivers improved service reliability, incident management effectiveness and customer satisfaction.

“Solving complex problems and ensuring reliability in today’s highly distributed world can be very difficult and requires greater monitoring and true observability. Prior to the pandemic, most companies had a handle on end-user/customer experience monitoring for distributed systems,” said Mehdi Daoudi, CEO of Catchpoint. “But now with a greater distribution of users comes new challenges and added reliability needs. True observability is the key to ensure reliability and customer experiences for all things distributed.”

The report also found that while Google, which developed the SRE role, believes SREs should split their time 50/50 between operations and development work, SREs were spending about 75% of their time on operations work pre-COVID. When the survey asked the same question again during the COVID-19 pandemic, respondents reported an additional 10% increase in ops-related activities. In addition, 53% of the respondents felt they were involved too late in the application life cycle.

Looking at how the SRE role has changed in the wake of COVID-19, the report found that work is expected to stay remote even after the pandemic. In 2018, the report found that 81% of SREs worked in an office, but 50% of respondents now believe they will continue to work remotely post-pandemic. Working from home, however, comes with its own set of challenges, such as maintaining a good work/life balance and staying focused.

Other challenges SREs are dealing with include spending too much time on manual and repetitive tasks as well as debugging activities.

Based on the findings, the report suggests that SREs consider not only code but also networks and third-party services, get involved earlier in the development process, and turn previously ignored challenges into strategic differentiators, for example by focusing on morale, employee experience and engagement.

“SRE is one of the most innovative approaches to managing services since the early days of ITIL and is most closely aligned with the principles and practices of Agile and DevOps. The data in this report supports the rising criticality of both SRE as a practice and Site Reliability Engineer as a role for any organization trying to adapt to the digital age,” said Jayne Groll, CEO of the DevOps Institute.