AWS outage traced to a bug in automation software affecting thousands of sites

Story Highlights

– AWS outage caused by bug in automation software.
– Empty DNS record led to widespread service disruption.
– Over 2,000 companies, including Signal and Roblox, affected.
– Manual intervention required to fix the DNS issue.
– Expert warns against dependency on major cloud providers.

Full Story

Amazon has identified the cause of a recent major outage of its AWS services, which disrupted thousands of online platforms, including Signal and various smart home devices. The issue stemmed from a flaw in the automation software that manages domain name system (DNS) operations, which led to cascading failures across the sites and applications hosted on AWS.

On Thursday, AWS detailed the incident, explaining that customers had difficulty connecting to DynamoDB, its database system, because of “a latent defect within the service’s automated DNS management system.” DynamoDB maintains a vast number of DNS records and relies on automation to monitor and update them regularly, so that capacity can be added, hardware failures handled, and traffic distributed efficiently.

The outage was ultimately traced to an empty DNS record for the US-East-1 region in Virginia. The automation that would normally correct such an error failed to do so, and engineers had to intervene manually to resolve the situation. In response, AWS has temporarily disabled its DynamoDB DNS automation worldwide while it fixes the underlying problems and adds further safeguards.
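To illustrate why an empty or missing DNS record can take a healthy service offline, here is a minimal Python sketch. It is not AWS's code: the hostname, the addresses, and the `resolve` and `apply_update` helpers are all hypothetical, and the guard shown is only one assumed form a safeguard might take, since AWS has not said what its safeguards will be.

```python
class EmptyRecordError(LookupError):
    """Raised when a hostname has no published addresses."""


def resolve(records: dict[str, list[str]], hostname: str) -> list[str]:
    """Return the addresses for hostname, or raise if none are published."""
    addresses = records.get(hostname, [])
    if not addresses:
        # The servers may be running, but clients cannot find them.
        raise EmptyRecordError(f"no addresses published for {hostname}")
    return addresses


def apply_update(records: dict[str, list[str]], hostname: str,
                 new_addresses: list[str]) -> bool:
    """Hypothetical safeguard: refuse an update that would empty a record."""
    if not new_addresses:
        return False  # keep the last known-good record instead
    records[hostname] = list(new_addresses)
    return True


# Illustrative hostname and documentation-range IPs, not real AWS data.
ENDPOINT = "db.region.example.com"
records = {ENDPOINT: ["192.0.2.10", "192.0.2.11"]}
```

In this toy model, an automated update that bypasses the guard and writes an empty list makes every later `resolve` call fail until someone restores the record, which mirrors the manual intervention described above.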

Other AWS services were also interrupted as a result of the DynamoDB issue. According to Downdetector, which tracks online outages, more than 8.1 million users reported problems across roughly 2,000 affected businesses, including popular platforms like Snapchat, Roblox, and Duolingo, along with banking sites and Ring doorbell services.

While services were restored within a few hours, the outage had a noticeable impact. Customers of Eight Sleep, a company that produces internet-connected smart beds, found themselves unable to manage features like bed temperature or incline through its app. Matteo Franceschetti, the company’s CEO, issued an apology on X and announced a new update enabling users to control essential functions via Bluetooth during connectivity issues.

Dr. Suelette Dreyfus, a lecturer in computing and information systems at the University of Melbourne, commented on the broader implications of such outages, describing global dependency on a limited number of service providers as a single point of failure. “That single point isn’t just AWS (they are the largest cloud provider with around 30% of the market) but rather the cloud ecosystem itself, largely dominated by three companies,” she stated.

Dr. Dreyfus pointed out that while the internet was originally designed to be resilient, reliance on a few major tech firms has diminished its redundancy, potentially increasing vulnerability to widespread disruptions.