With site reliability becoming more important as software releases grow in frequency and complexity, a startup called Blameless today released an SRE platform that can handle the increasing velocity of code deployments while offering faster, more efficient incident resolution.
Ashar Rizqi, CEO of Blameless, said the company’s vision is to enable any modern software business to adopt SRE best practices. He explained that the name Blameless refers to a culture without finger-pointing. Rizqi, who ran the SRE team at Box.com and then the container team at Mulesoft, said a blameless culture shifts the blame for problems that users encounter onto the system, and away from the people creating, delivering and maintaining software in production.
Today’s software user experience appears seamless, and users don’t see, or care about, the internal and external APIs and services that can lead to reliability issues. “The user experience is critical, so you want to push the envelope, but reliability can’t be compromised,” Rizqi told ITOps Times. So the question becomes: how do you innovate while still ensuring reliability?
The SRE platform enables organizations to improve their incident resolution. It pulls data through integrations with application and network monitoring tools, and uses communication software such as Slack to assemble the proper team, which then uses that data to resolve the issue. Afterward, through post-mortems, teams can figure out how to prevent similar issues from happening again, as the platform automatically delivers relevant data and SRE best practices.
The Blameless platform provides an SLO dashboard for viewing an error budget, the quantified amount of downtime or degraded performance that is considered acceptable. “For continuous delivery, for example, the number of releases is the metric related to success,” Rizqi said. “In SRE, it’s the error budget. SLOs drive the internal bar for what you can achieve. With the platform, you can identify the customer pain points and quantify them to create a service level indicator.”
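To make the error budget idea concrete, here is a minimal sketch of the arithmetic as it is commonly done in SRE practice: the budget is the share of a time window that the SLO allows a service to be unavailable. This is not taken from the Blameless platform. The function names, the 30-day window and the availability-only view are assumptions made for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction such as 0.999 (that is, 99.9% availability).
    The 30-day default window is an illustrative assumption.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO leaves over: 1 - target.
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget has been exceeded.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"Allowed downtime: {budget:.1f} minutes")
    print(f"Budget remaining: {budget_remaining(0.999, 21.6):.0%}")
```

Under these assumptions, a 99.9% availability target over 30 days leaves about 43 minutes of acceptable downtime, and a team that has already had 21.6 minutes of downtime has spent roughly half of its budget.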
A more overarching Reliability Insights Dashboard offers a broad view of the forces underlying reliability, such as all incidents, post-mortems, action items, log events and change events, the company’s announcement explained. Event data can be queried by specific teams to find the signals they need from the huge stream of DevOps data, it said.
The Blameless SRE platform is available now. The startup is backed by Lightspeed Venture Partners and Accel.