Zebrium, the leader in the use of machine learning to automatically find the root cause of software problems, today announced the launch of Zebrium Root Cause as a Service (RCaaS), a new solution that enables monitoring and observability tools such as Datadog, New Relic, Elastic, Dynatrace, Grafana, AppDynamics, ScienceLogic, and others to automatically find the root cause of software and infrastructure incidents.
When an incident occurs in production software, Zebrium RCaaS automatically finds the root cause and presents a summary of the problem directly on existing monitoring dashboards, alongside other charts showing metrics, traces and APM data. This allows Site Reliability Engineers (SREs), DevOps personnel and developers to reduce the mean time to resolve (MTTR) for software or infrastructure problems by 90 percent.
Today, when technical teams encounter a new service outage or problem, they typically rely on observability tools to facilitate the troubleshooting process. Without Zebrium, this involves looking at metrics to determine “when” the problem started, drilling down on traces or APM data to narrow down the source of the problem (the “where”), and finally combing through large volumes of logs from the application and infrastructure stack to determine the root cause (the “why”). This process can take many hours and requires extensive team resources while critical services remain impacted. Now, with Zebrium RCaaS, the painstaking process of digging through logs is automated. The end result is that RCaaS quickly uncovers the root cause indicators that technical teams would otherwise have found only eventually, by manually combing through logs.
In validation testing, RCaaS found the correct root cause in over 95 percent of incidents. “The Cisco Technical Assistance Center (TAC) spends thousands of hours each month analyzing software logs to find the root cause of customer incidents,” said Koree Mires, Director, Global TAC Innovation, Automation and Disruption at Cisco Systems. “We had been investigating ways to help automate this process for many years. When we came across Zebrium, we were immediately impressed. In order to validate its effectiveness, we tested RCaaS with four product lines and 192 actual customer incidents. We were astonished to find that RCaaS correctly found the root cause automatically over 95 percent of the time. We are now leveraging the technology to speed up customer incident resolution and will continue rolling it out to more product lines throughout the year.”
Zebrium RCaaS is designed to make root cause details available in the same tools and workflows that SREs, DevOps engineers and developers are already using. RCaaS has complete “out-of-the-box” integrations with popular observability and monitoring tools, including Datadog, New Relic, Elastic, Dynatrace, AppDynamics, Grafana, ScienceLogic and others. It also natively integrates with incident management and response platforms including PagerDuty, Opsgenie, VictorOps, Slack, Teams and email systems. Additional third-party tools can also be integrated easily through a set of open APIs.
“The cost of downtime keeps rising, and throwing engineers at the problem is not a scalable solution,” said Ajay Singh, CEO, Zebrium. “Since speed and accuracy are essential when software teams need to resolve application incidents, the only way forward is an automated approach to Root Cause Analysis (RCA). Zebrium RCaaS is a proven way to do this. Since our platform does not require any manual training or rules, customers can get started in just a few minutes, and leverage RCaaS with almost any kind of observability tool already in place.”