Gremlin, a pioneer in Chaos Engineering, has announced the launch of Reliability Intelligence, an AI-driven solution for analyzing and remediating reliability concerns in modern, complex systems. The solution combines automated fault injection experiments, continuous resilience analysis, and a Model Context Protocol (MCP) server for LLM integration to help keep systems available and performing well.
“The Gremlin team has been managing complex online systems for decades – we know that you can’t just throw LLMs at the hard engineering problems involved with building and maintaining business-critical systems,” Kolton Andrus, CEO of Gremlin, said in the launch announcement. “Reliability Intelligence provides actionable recommendations based on a deep understanding of your systems architecture and its dependencies across various cloud providers and 3rd party services.”
According to the latest DORA report, code is being deployed to production 70% faster using AI coding assistants. This leaves digital businesses increasingly susceptible to reliability issues and major outages from errors, bugs, and other inefficiencies. Chaos Engineering and other forms of proactive reliability efforts can help address these issues and place healthy guardrails on the AI boom — but many of these practices require deep expertise that only a small number of SREs possess.
To help bridge the gap, Gremlin has been laser-focused on building out a platform that helps businesses maximize the value of their proactive reliability efforts. Recent product developments include Reliability Scoring, Intelligent Health Checks, Dependency Discovery, and Executive Reporting. With today’s launch of Reliability Intelligence, Gremlin is removing the high barrier to entry for maintaining the reliability of complex systems in a fast-changing AI landscape.
Reliability Intelligence introduces three capabilities:
Experiment Analysis: While automated testing has been part of Gremlin for years, analyzing the results and comparing them to expected behavior was left to engineers to perform manually. Experiment Analysis compares test results against expected behavior based on past performance, detects anomalous behavior during the test, and uncovers why a test fails.
Recommended Remediation: By leveraging industry best practices and system behavior from millions of tests, Gremlin provides engineers with specific recommended actions after a failed test. These actions guide the user in resolving issues, which can include anything from adjusting code to fine-tuning observability alerts.
MCP Server: Explore your data with Gremlin’s MCP server integration. Connect your favorite LLM to query data, uncover insights, and create custom dashboards.
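To make the Experiment Analysis idea concrete, the following is a minimal sketch of comparing a test run against expected behavior derived from past performance. It is not Gremlin's implementation: the function name `analyze_experiment`, the z-score approach, the threshold of 3 standard deviations, and the latency figures are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def analyze_experiment(baseline, observed, threshold=3.0):
    """Flag observed samples that deviate strongly from a historical baseline.

    Hypothetical helper: a simple z-score check, assumed here for illustration.
    `baseline` needs at least two samples so a standard deviation exists.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    anomalies = []
    for index, value in enumerate(observed):
        if sigma == 0:
            # A flat baseline: any different value counts as a deviation.
            z = 0.0 if value == mu else float("inf")
        else:
            z = (value - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((index, value, z))
    return {"passed": not anomalies, "baseline_mean": mu, "anomalies": anomalies}


# Made-up p95 latencies in milliseconds from earlier runs and from one test run.
baseline_latency_ms = [120, 118, 125, 122, 119, 121, 123, 120]
observed_latency_ms = [121, 124, 310, 119]

result = analyze_experiment(baseline_latency_ms, observed_latency_ms)
print(result["passed"], result["anomalies"])
```

In this sketch the third sample falls far outside the historical range, so the run is marked as failed and the offending sample is reported. Explaining why a test failed, as the product claims to do, would need more context than a single metric.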
“In high-velocity environments reliability can’t be an afterthought,” said Arul Martin, Director of Performance Engineering at Sephora. “Reliability Intelligence equips SRE and performance teams with deep, real-time insights from telemetry and trace data — enabling early detection of reliability regressions, faster root cause isolation, and proactive remediation without disrupting release velocity.”