In 2014, Netflix developed Mantis, a platform for gaining better operational insight into its increasingly complex environment. Today, Mantis helps Netflix engineers better understand the behavior of their applications, which in turn helps them provide a better user experience.
Mantis is a platform on which real-time stream processing applications can be built. Mantis applications (jobs) are deployed on the Mantis platform, which provides the APIs to manage the life cycle of those jobs. The platform also manages the underlying resources by containerizing a common pool of servers, and it allows jobs to discover and communicate with other jobs.
The company recognizes that the challenges that led to Mantis are common across the industry, which is why it is now open sourcing the platform. “Today we’re excited to announce that we’re open sourcing Mantis, a platform that helps Netflix engineers better understand the behavior of their applications to ensure the highest quality experience for our members. We believe the challenges we face here at Netflix are not necessarily unique to Netflix which is why we’re sharing it with the broader community,” Jeff Chao, senior software engineer at Netflix, wrote in a post on behalf of the Mantis team.
According to Netflix, the traditional way of working with metrics and logs wasn’t sufficient for its growing system, which is why the company created Mantis. Metrics and logs require you to know ahead of time what questions you want to answer, Chao explained. Mantis sidesteps this limitation by letting engineers answer new questions without having to add new metrics.
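The difference between a predefined metric and a query over raw events can be illustrated with a small sketch. This is plain Python, not the Mantis API; the event fields (`device`, `country`, `status`, `latency_ms`) and both function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical raw events; the field names are illustrative, not a Mantis schema.
EVENTS = [
    {"device": "tv", "country": "US", "status": 200, "latency_ms": 120},
    {"device": "phone", "country": "BR", "status": 500, "latency_ms": 900},
    {"device": "tv", "country": "BR", "status": 500, "latency_ms": 870},
    {"device": "phone", "country": "US", "status": 200, "latency_ms": 95},
]


def predefined_error_count(events):
    """A metric chosen ahead of time: only the total error count survives."""
    return sum(1 for e in events if e["status"] >= 500)


def ad_hoc_query(events, predicate, group_by):
    """Ask a new question of the raw events: filter, then count per group."""
    return Counter(e[group_by] for e in events if predicate(e))


# The predefined metric says that errors happened, but not where.
error_total = predefined_error_count(EVENTS)

# With raw events, "which countries are affected?" needs no new instrumentation.
errors_by_country = ad_hoc_query(EVENTS, lambda e: e["status"] >= 500, "country")
```

In this toy data set, the predefined counter can only report how many errors occurred, while the query over raw events can break them down by any field that was captured, including ones nobody thought to chart in advance.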
When creating Mantis, Netflix had four guiding principles in mind:
- “We should have access to raw events
- We should be able to access these events in realtime
- We should be able to ask new questions of this data without having to add new instrumentation to your application.
- We should be able to do all of the above in a cost-effective way.”
At Netflix, engineers have built several applications on top of Mantis following those principles. These applications serve a variety of use cases, such as identifying issues, triggering alerts, and applying remediations.
“Where other systems may take over ten minutes to process metrics accurately, Mantis reduces that from tens of minutes down to seconds, effectively reducing our Mean-Time-To-Detect. This is crucial because any amount of downtime is brutal and comes with an incredibly high impact to our members — every second counts during an outage. As the company continues to grow our member base, and as those members use the Netflix service even more, having cost-efficient, rapid, and precise insights into the operational health of our systems is only growing in importance,” said Chao.