
Early Monday morning, starting around 3 AM ET, an issue at AWS caused widespread outages across the internet that lasted about three hours.
In an update to its status page, AWS said the issue was related to DNS resolution for the API endpoint of its DynamoDB database service. Over the three hours it took to resolve the issue, 142 different AWS services in the US-EAST-1 Region were affected.
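That failure mode is mundane but far-reaching: if the regional DynamoDB endpoint stops resolving, requests fail before they ever reach the service. The check below is a minimal sketch of that dependency using only Python's standard library; the endpoint name is the standard public one for DynamoDB in US-EAST-1.

```python
import socket

# The public regional endpoint for DynamoDB in US-EAST-1; every SDK call
# to the service in that region depends on this name resolving.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # Ask the system resolver for the endpoint's addresses (IPv4 and IPv6).
    addrs = {info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)}
    print(f"{ENDPOINT} resolves to: {', '.join(sorted(addrs))}")
except socket.gaierror as exc:
    # This is the class of failure described above: the service itself may be
    # healthy, but clients cannot find it, so requests never get through.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```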
“Networking is certainly a foundational component of AWS services. When it stumbles in a region like US-East-1, the effects go way beyond, it ripples through EC2, S3, DynamoDB, RDS, and pretty much every service that depends on them,” said Corey Beck, director of cloud technologies at DataStrike, a company that offers managed DBA, cloud, and BI services.
According to Jamil Ahmed, director and distinguished engineer at event-driven integration company Solace, this event serves as another reminder of the risks associated with single cloud deployments.
“Even the largest cloud hosts – Google, AWS (as we have just seen), Microsoft – suffer from outages. Having all your digital eggs in one cloud basket leaves businesses at risk of serious failure as we keep on seeing,” he said.
He explained that organizations should build fault tolerance into their infrastructure so that they can remain operational during major outages like this one.
DataStrike’s Beck added that resilient systems aren’t about avoiding failure, but rather ensuring that customers don’t notice when an issue occurs.
“Real resilience takes planning, multi-region design, regular testing, and a mindset that assumes things will break. That’s what separates a minor hiccup from a full-blown outage,” he said.
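As a rough illustration of that multi-region mindset (a sketch of one common pattern, not something Beck or AWS prescribes), the snippet below attempts a DynamoDB read in a primary region and falls back to a second region when the call fails. It assumes boto3 is installed and credentials are configured, and that the table, named "orders" here as a placeholder, is already replicated across both regions, for example as a global table.

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical table replicated across two regions (e.g., a global table).
TABLE_NAME = "orders"
REGIONS = ["us-east-1", "us-west-2"]  # primary first, fallback second


def get_order(order_id: str):
    """Read one item, failing over to the next region if a call errors out."""
    last_error = None
    for region in REGIONS:
        try:
            table = boto3.resource("dynamodb", region_name=region).Table(TABLE_NAME)
            response = table.get_item(Key={"order_id": order_id})
            return response.get("Item")
        except (BotoCoreError, ClientError) as exc:
            # Networking, DNS, and service errors surface here; try the next region.
            last_error = exc
    raise RuntimeError(f"All regions failed for order {order_id}") from last_error
```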
Ahmed said this is one of the benefits of multi-cloud deployments: one cloud can fail, and the business can switch everything over to another with minimal disruption.
“When a business is multi-cloud, the end user should never even be able to detect that a failure has occurred within the business. Any service downtime is avoided, as the failing is shielded by another cloud host,” he said.
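At the application level, that shielding often comes down to having more than one place to send a request. The sketch below is a hypothetical, provider-agnostic version of the idea: two copies of the same service hosted with different providers (the URLs are placeholders), with the client quietly moving to the backup when the primary cannot be reached.

```python
import urllib.error
import urllib.request

# Hypothetical copies of the same service hosted with two different cloud providers.
ENDPOINTS = [
    "https://api.cloud-a.example.com",
    "https://api.cloud-b.example.com",
]


def fetch_with_failover(path: str) -> bytes:
    """Try each provider in order; the caller only sees an error if all of them fail."""
    last_error = None
    for base in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=3) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            # DNS failures, timeouts, and HTTP/connection errors land here;
            # the end user never notices unless every provider is down.
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

In practice this failover usually lives in DNS or a global load balancer rather than in client code, but the principle is the same: the request has somewhere else to go.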
Forrester analysts also wrote a blog post following the incident, in which they called out the overreliance on DNS, which was not designed for the modern demands of the cloud, as well as the issue of AWS having so many internal dependencies. “DynamoDB, the first service identified as impacted by the DNS issues, plays a central role in other AWS services for analytics, machine learning, search, and more,” they said.
The analysts offered up some actionable recommendations to help organizations improve their cloud resilience:
- Invest in infrastructure observability and analytics to get early visibility into outages
- Build an infrastructure automation platform that connects observability data to automated systems so that problems can be fixed when they are still small and manageable
- Use CDNs to cache static content at edge locations
- Develop application portability and additional clouds for key workloads
- Test infrastructure and application resilience regularly (a minimal probe sketch follows this list)
- Map third-party critical dependencies
- Reevaluate third-party risk strategies
- Update vendor contracts to assign accountability during outages and establish time frames for vendors to patch and remediate systems
- Test vendor resilience plans
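As a concrete starting point for the testing and dependency-mapping items above (an illustrative sketch, not something Forrester prescribes), the script below takes a hand-maintained map of critical dependencies, all placeholder names except the real DynamoDB endpoint, and reports whether each one still resolves and responds. Run on a schedule, its output gives early, if crude, visibility into exactly the kind of DNS failure that triggered this outage.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request

# Hand-maintained map of critical external dependencies (placeholder URLs,
# except the real DynamoDB regional endpoint).
DEPENDENCIES = {
    "dynamodb": "https://dynamodb.us-east-1.amazonaws.com",
    "payments": "https://api.payments.example.com/health",
    "cdn": "https://cdn.example.com/ping",
}


def probe(name: str, url: str) -> str:
    host = urllib.parse.urlsplit(url).hostname
    try:
        socket.getaddrinfo(host, 443)  # does the name still resolve?
    except socket.gaierror as exc:
        return f"{name}: DNS FAILURE ({exc})"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return f"{name}: OK (HTTP {resp.status})"
    except urllib.error.HTTPError as exc:
        # The host resolved and answered, just not with a success status.
        return f"{name}: responded with HTTP {exc.code}"
    except (urllib.error.URLError, TimeoutError) as exc:
        return f"{name}: UNREACHABLE ({exc})"


if __name__ == "__main__":
    for name, url in DEPENDENCIES.items():
        print(probe(name, url))
```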
“Despite past outages, organizations that failed to address that complexity got a front-row seat as cascading issues disrupted systems, processes, and operations. The entrenchment of cloud, especially AWS, in modern enterprises, coupled with an interwoven ecosystem of SaaS services, outsourced software development, and virtually no visibility into dependencies, is not a bug — it’s a feature of a highly concentrated risk where even small service outages can ripple through the global economy,” the Forrester analysts wrote.