On-call is broken. It’s not a little broken. It’s really broken. We all take the current on-call experience for granted. It’s the way it’s always been. Most of us never stop and think about how expensive, manual, and error-prone on-call really is. That is, until a major outage sets your company back and damages trust with your customers.

If you’re like most developers, you hate being on call. No one wants to be awakened in the middle of the night to resize a disk! But it goes way beyond that. It’s one thing to support your own code; it’s another to diagnose problems in code you didn’t write. All too often, there is no runbook for the issue you’re being asked to diagnose, and if there is one, it’s long, complicated, and too often out of date. And, of course, there is no glory in debugging an issue. Developers get recognized for innovation: a new feature or a new piece of automation, not a fixed incident.

The fact that on-call incident response is broken costs you time, toil, and money. It can also cost you customers, damage your company’s reputation, and block potential innovation.

On August 25, 2021, Amazon’s 13 minutes of downtime translated to almost $5 million in lost revenue.

On October 4, 2021, Facebook (and its subsidiaries Messenger, Instagram, and WhatsApp) was globally unavailable for six hours due to a network configuration error. Estimated lost revenue: $100 million.

Additional outages have been reported with Verizon, Microsoft, and AWS, affecting millions of users and costing millions of dollars.

And that is just the beginning. Engineering leaders often overlook the magnitude of the day-to-day costs of being on call. There are over 1.2 million site reliability engineers (SREs) and cloud operations engineers on LinkedIn. These are the engineers who work on improving software system reliability across a number of key areas, including incident response. The cost of these engineers is over $180 billion, more than the revenues of AWS, Azure, and Google Cloud Platform combined, and the incidents they are fixing lead to almost 1 billion hours of degraded service for customers. Businesses are constantly trying to hire more SREs, yet as demand for SREs reaches an all-time high, so does SRE burnout; the average SRE tenure is less than 18 months. Companies are hiring more people to play “whack-a-mole,” handling one issue as three more pop up. The result? Companies spend more time keeping the lights on than innovating, placing them at a competitive disadvantage.

So, how did we get here?

Operations are more complex than ever before. Today’s production fleets are convoluted environments with a mixture of VMs and containers running across multiple clouds and multiple accounts. Each environment has its own nuances, credentials, and APIs, all of which makes on-call work tedious and automation even harder. On top of that, faster release cycles place an ever-increasing burden on the engineers working with systems in production.

Few companies have fully internalized and aggressively adopted an automation strategy for incidents, and this is a huge gap in the software development lifecycle. While testing, deployment, and configuration have been automated, the manual execution of tasks in production has become a bottleneck, with engineers addressing the same or similar tasks repeatedly. Companies do have observability and incident management tools in place that can shine some light on an issue and route it to the appropriate channel, but despite the automatically generated alert, a human is still required to repair the issue manually. This lack of effective automation within production operations means that downtime, errors, and toil just continue to grow.

What about simply enabling more people to run existing scripts? While a good first step, this is quite difficult with most of today’s tools, and it isn’t scalable. If you write a script that runs on one box, running it is a straightforward task. However, determining where and when to run this script across thousands of boxes, with the right credentials for each, can be a daunting project.
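
To make that gap concrete, here is a minimal sketch of fanning a single one-box diagnostic out across a small fleet over SSH. The host names, the disk-usage check, and the assumption that SSH keys and access are already in place are all illustrative rather than prescriptive; real fleets add per-environment credentials, inventories, and audit trails, which is exactly where the daunting part begins.

    #!/usr/bin/env python3
    # Minimal sketch: run one diagnostic across a small fleet over SSH.
    # Assumptions (illustrative only): hosts are reachable with pre-shared
    # SSH keys, and the diagnostic is a simple disk-usage check.
    import subprocess

    # Hypothetical inventory; in practice this comes from a CMDB or cloud API.
    HOSTS = ["web-01.example.com", "web-02.example.com", "db-01.example.com"]
    DIAGNOSTIC = "df -h /var/log"  # the script you already run on one box

    def run_on_host(host: str) -> str:
        """Run the diagnostic on one host; return its output or the error."""
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", host, DIAGNOSTIC],
            capture_output=True, text=True, timeout=30,
        )
        return result.stdout if result.returncode == 0 else result.stderr

    if __name__ == "__main__":
        for host in HOSTS:
            print(f"=== {host} ===")
            print(run_on_host(host))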

When it comes to debugging and repair, the engineer must log into box after box when an alarm sounds, first to diagnose and then to fix the problem. Since automation itself can be time-consuming, on-call teams only automate away a tiny fraction of the issues they deal with on a day-to-day basis. Because there is such a huge array of on-call incidents, the effort to automate away any one issue is often deemed too great. On top of that, the industry overall is reinventing the wheel repeatedly: every company experiences full disks, memory leaks, and networking issues, and each company is figuring out how to debug these issues even though thousands of companies have done it before.

The missing link

Production operations and on-call lack a critical third pillar: incident automation. People have far more tools to find problems than they do to actually fix them. Companies have addressed observability (monitoring and detection) and incident management (which assigns and prioritizes incidents), but few are focused on incident automation, which covers automated diagnosis and repair. Too much of on-call and incident repair is manual. The lack of automation, or even partial automation, in this key aspect of production ops is costing the industry dearly.

In the production ops world, even a 0.1 percent human error rate can lead to a major outage down the line. And reliance on runbooks, the detailed guides for completing commonly repeated tasks within the IT operations process, isn’t working. In fact, these runbooks (documentation or wikis) are often ignored. Meanwhile, employee skill sets vary from beginner to advanced, and institutional knowledge is continually walking out the door due to high turnover. The knowledge lives in engineers’ heads, not in the runbooks, and when those engineers leave, so does valuable information.

Now is the time to automate production ops. Companies should work to automate away repetitive incidents in production, including expired certificates, disk failures, stuck pods, and JVM memory leaks. This is particularly important since production operations is a 24×7 function. Automation reduces errors and IT fatigue and increases the time available for higher-value work. Yet companies have been reluctant to automate, citing how complex and time-consuming it is.

How do you start your path to production ops automation? 

Automation can be an intimidating process. While you can’t automate everything, smart automation will simplify and streamline your production ops. Here are some guidelines:

  1. Crawl before you walk, and walk before you run. Start tracking and categorizing your tickets so that you can truly understand both the impact on customers and the engineering cost for each issue. 
  2. When you ticket or track your issues, be sure you can measure how many hours it took to address each issue. This will allow you to prioritize where to invest in automation.
  3. Standardize debugging practices. At most companies, there are five to seven diagnostics that your best engineers run almost every time they debug an issue. Automate the collection of these diagnostics.
  4. Build precise alarms mapped to specific issues. This is an overlooked but critical step for automation. If your alarm is too generic, you can’t know what caused the issue, and if you can’t know the cause, you can’t automate the repair.
  5. Then build “human-in-the-loop” automations to repair issues (a minimal sketch follows this list). This will ensure that you still have human oversight while dramatically improving mean time to repair. It will also allow you to empower a much broader team to repair many common incidents.
  6. Ensure your human-in-the-loop automations include post-repair diagnostics so that you can confirm that your automation is actually what fixed the problem.
  7. Once your team has seen and fixed the same issue multiple times with the same approach, then you should be ready for full automation. No matter how you choose to automate, make sure that you treat your automations just like any other production code and integrate the deployment of these automations into your standard CI/CD process.
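
As a hedged illustration of steps 3, 5, and 6, the sketch below collects a diagnostic for one recurring incident type, asks a human to approve the repair, and then re-runs the diagnostic to confirm the fix. The scenario (a near-full /var/log partition repaired by compressing week-old logs), the 90 percent threshold, and the commands used are placeholders chosen for the example; the pattern of diagnose, confirm, repair, and re-diagnose is the point.

    #!/usr/bin/env python3
    # Sketch of a human-in-the-loop repair for one recurring incident type.
    # Assumptions (illustrative only): the incident is a near-full /var/log
    # partition, and the agreed repair is compressing logs older than 7 days.
    import shutil
    import subprocess

    MOUNT = "/var/log"
    THRESHOLD = 0.90  # alarm threshold: 90% full (placeholder value)

    def disk_usage(path: str) -> float:
        """Return the fraction of the filesystem at `path` that is in use."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def repair() -> None:
        """Compress week-old logs: the repair the team already runs by hand."""
        subprocess.run(
            ["find", MOUNT, "-name", "*.log", "-mtime", "+7",
             "-exec", "gzip", "-f", "{}", ";"],
            check=True,
        )

    if __name__ == "__main__":
        before = disk_usage(MOUNT)
        print(f"Diagnostic: {MOUNT} is {before:.0%} full")

        if before < THRESHOLD:
            print("Below threshold; no repair needed.")
        elif input("Compress logs older than 7 days? [y/N] ").lower() == "y":
            repair()
            after = disk_usage(MOUNT)  # post-repair diagnostic (step 6)
            print(f"Post-repair: {MOUNT} is {after:.0%} full")
        else:
            print("Repair declined; escalating to on-call.")

Once a script like this has resolved the same incident the same way several times, it becomes a natural candidate for the full, CI/CD-deployed automation described in step 7.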

The time to invest in automation is now. The current manual strategy is doomed. Every year, the number of incidents increases exponentially. Few companies are fixing tomorrow’s issues today, and automation keeps you ahead of the curve as your team spends more time innovating (and less time debugging).