Last week, Cloudflare suffered a major outage, bringing down a significant portion of the Internet. Users of websites that have implemented Cloudflare’s services were greeted with a 502 error for about 30 minutes last Tuesday morning.
“We recognize that an incident like this is very painful for our customers. Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future,” Cloudflare wrote in a post mortem on its website following the incident.
Robert Reeves, co-founder and CTO of Datical, explained that the phrase “bad software deploy” generally implies a manual deployment that wasn’t tested properly. “In 2019, with automation software available for all deployment tasks, downtime caused by human error is simply no longer acceptable. It’s time for this to stop,” said Reeves.
Reeves explained that no matter how detailed a company’s testing process is, there will still be blind spots. However, he believes that as organizations become more dependent on third-party providers, they should also demand more from those providers.
Website owners working with third-party providers need to determine how much downtime they can tolerate, said Monique Becenti, product and channel specialist at SiteLock. “If I had a website then I would want to ensure that my website was up 95% of the time.” Becenti added that this number will vary from site to site. For example, a small business might get a lot of traffic through its website, so keeping that website up is important, whereas a personal blogger who isn’t monetizing a site may not have uptime requirements that are as strict, she explained.
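To make an uptime target concrete, it can help to translate the percentage into a downtime budget. The following is a minimal sketch, not anything Becenti or SiteLock prescribed: the function names and the 30-day measurement window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(uptime_target: float, period_days: int = 30) -> float:
    """Return the minutes of downtime permitted over the period.

    uptime_target is a fraction between 0 and 1 (0.95 means 95% uptime).
    The 30-day default window is an illustrative assumption.
    """
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - uptime_target)


def within_budget(outage_minutes: float, uptime_target: float,
                  period_days: int = 30) -> bool:
    """Check whether a single outage fits inside the downtime budget."""
    return outage_minutes <= allowed_downtime_minutes(uptime_target, period_days)


if __name__ == "__main__":
    # Compare a 30-minute outage against several hypothetical targets.
    for target in (0.95, 0.999, 0.9999):
        budget = allowed_downtime_minutes(target)
        verdict = "within" if within_budget(30, target) else "over"
        print(f"{target:.2%} uptime: {budget:.1f} min budget, 30-min outage is {verdict} budget")
```

Under these assumptions, a 95% target over 30 days allows roughly 2,160 minutes (36 hours) of downtime, so a 30-minute outage fits comfortably. A site that needs 99.99% uptime has only about 4.3 minutes to spare over the same window, and the same outage would exceed its budget several times over.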
Organizations should have a plan in place for what to do in these types of scenarios. In this situation, removing Cloudflare’s services from their site might allow them to bring their site back up, but it could also open them up to vulnerabilities, Becenti explained. She recommends that organizations consider having another security option in place that they can switch over to if their main security method is down.