Amazon on Friday published an explanation of the Web Services outage that knocked parts of the internet offline for several hours on December 7th — and promised more clarity if this happens in the future. As CNBC reports, Amazon revealed an automated capacity scaling feature led to “unexpected behavior” from internal network clients. Devices connecting that internal network to AWS were swamped, stalling communications.
“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” the company wrote in a post on its website. As a result, devices connecting an internal Amazon network and AWS’ network became overloaded.
The nature of the failure hampered teams' efforts to pinpoint and fix the problem, Amazon added. With internal monitoring tools also affected, they had to rely on logs to find out what happened. Engineers were “extremely deliberate” in restoring service to avoid breaking still-functional workloads, and had to contend with a “latent issue” that prevented networking clients from backing off and giving systems a chance to recover.
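For readers unfamiliar with the term, "backing off" means a client waits longer and longer between retries instead of hammering an overloaded system. Amazon has not published the client code involved, so the following is only a generic sketch of exponential backoff with random jitter. The function names, the delay values, and the use of `ConnectionError` are all illustrative assumptions, not anything taken from AWS.

```python
import random
import time


def backoff_ceiling(attempt, base=0.5, cap=30.0):
    """Longest wait allowed before retry `attempt` (0-based): doubles each time, up to `cap` seconds."""
    return min(cap, base * 2 ** attempt)


def call_with_backoff(operation, max_attempts=6, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random):
    """Run `operation`, retrying on ConnectionError with a randomized, growing wait."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Random wait between zero and the ceiling, so that many
            # clients do not all retry at the same instant.
            sleep(rng.uniform(0.0, backoff_ceiling(attempt, base, cap)))
```

A client that skips the waiting step retries immediately and adds to the congestion, which is the general kind of behavior the "latent issue" is described as causing.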
The AWS division has temporarily disabled the scaling that led to the problem, and won’t switch it back on until there are solutions in place. A fix for the latent glitch is coming within two weeks, Amazon said. There’s also an extra network configuration to shield devices in the event of a repeat failure.
Downtime can hurt the perception that cloud infrastructure is reliable and ready to handle migrations of applications from physical data centers. It can also have major implications for businesses. AWS has millions of customers and is the leading provider in the market.
Amazon’s own retail operations were brought to a standstill in some pockets of the U.S. Internal apps used by Amazon’s warehouse and delivery workforce rely on AWS, so for most of Tuesday employees were unable to scan packages or access delivery routes. Third-party sellers also couldn’t access a site used to manage customer orders.
During the outage, AWS tried to keep customers aware of what was happening, but the cloud provider ran into trouble updating its status page, known as the Service Health Dashboard.
“As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue,” AWS said.
In addition, customers couldn’t create support cases for seven hours during the disruption. AWS said it’s now taking action to address both of those issues.
“We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers,” AWS said.
This is not the first time AWS has changed the way it reports issues.
In 2017, an outage that hit the popular AWS S3 storage service prevented engineers from showing the right color to indicate uptime on the Service Health Dashboard. Amazon posted banners and went to Twitter to release new information.
“We have changed the SHD administration console to run across multiple AWS regions,” Amazon said in a message about that episode.
You might have an easier time understanding crises the next time around. A new version of AWS’ service status dashboard is due in early 2022 to provide a clearer view of any outages, and a multi-region support system will help Amazon get in touch with customers that much sooner.
These won’t bring AWS back any faster during an incident, but they may eliminate some of the mystery when services go dark. That matters when an outage takes down everything from Disney+ to Roomba vacuums, Amazon’s Ring security cameras, and other internet-connected devices such as smart cat litter boxes and app-connected ceiling fans.