AWS: The Cloud Is Falling
Aug 26, 2013 2:20 PM PT
Amazon Web Services suffered disruptions in some areas on Sunday, six days after Web stores on the service in the United States and Canada went down for about 30 minutes, causing losses estimated at up to US$45 million.
Sunday's outage lasted about an hour and affected Facebook's Instagram app, Twitter's Vine app and Airbnb, among others.
Amazon blamed the problem on a networking device that caused packet loss; the company has since replaced the device.
The Amazon Web Services service health dashboard indicated that Amazon's Elastic Compute Cloud and its relational database service in Northern Virginia were affected.
"I believe their shared systems are structural points of exposure that take customers all down together," Brian Hierholzer, CEO of Artisan Infrastructure, told the E-Commerce Times.
Networking and system design can be choke points in large distributed systems such as Amazon's, remarked Joe Clabby, president of Clabby Analytics.
What Happened on Sunday
Amazon's service health dashboard indicated that there was a problem with storage services offered in its Elastic Block Store in Northern Virginia. Some EBS volumes suffered degraded performance, and there were elevated rates of EBS-related API errors and EBS-backed instance launch errors in a single availability zone (AZ) in the US-EAST-1 region, according to Amazon. Those problems were due to network packet loss.
Amazon also experienced connectivity issues with its Elastic Load Balancing service and its Relational Database Service, also in Northern Virginia.
All the problems were eventually fixed. Amazon did not respond to our request for further details.
Stumbling Toward Growth
Amazon Web Services is the largest cloud computing vendor in the business; Gartner estimates it has five times the combined capacity of the other 14 providers in the top 15.
That makes it difficult to understand why the service continues to suffer from outages and interruptions.
In 2012 alone, it had four major outages at its Northern Virginia data center, which is the company's most heavily trafficked complex.
Amazon reportedly blamed the last of those outages on a developer who had accidentally deleted some key data, and stated the disruption affected its Elastic Load Balancing (ELB) service.
Load balancing is key for services such as Amazon's -- it distributes workloads among servers equally to prevent their being swamped.
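To make the idea concrete, here is a minimal sketch of round-robin distribution, one of the simplest load-balancing strategies. It is an illustration only, not a description of how Amazon's ELB works; the server names and the `round_robin` function are hypothetical.

```python
from collections import Counter
from itertools import cycle


def round_robin(requests, servers):
    """Assign each request to the next server in a fixed rotation."""
    if not servers:
        raise ValueError("at least one server is required")
    rotation = cycle(servers)
    return [(request, next(rotation)) for request in requests]


# Hypothetical server names, used only for this example.
servers = ["server-a", "server-b", "server-c"]
assignments = round_robin(range(9), servers)

# Count how many requests each server received.
load = Counter(server for _, server in assignments)
print(load)
```

With nine requests and three servers, each server handles three requests, so no single machine is swamped. Production load balancers typically also weigh server health and current load, which this sketch ignores.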
Still, "previous outages haven't seemed to slow the uptake of AWS, so perhaps Amazon's customers believe that the value they receive somehow balances out poor reliability and quality of service," Charles King, principal analyst at Pund-IT, told the E-Commerce Times.
The Blame Game
Some have speculated that the NSA's surveillance programs are causing these repeated service interruptions and crashes, but that is "most likely a very low probability," Artisan Infrastructure's Hierholzer said. Other possibilities include unexpectedly large numbers of people trying to access the services and distributed denial-of-service (DDoS) attacks.
"There should be more failure domains and ways to isolate these service providers and the resources they consume," Hierholzer suggested.
It could be that Amazon is growing on the cheap. The company has reportedly spent about $12 billion on real and virtual revenue-generating IT assets since 2005, while Microsoft has spent close to $18 billion and Google has forked over nearly $21 billion.
"If you're doing cookie-cutter large volumes, sometimes you don't build in enough redundancy because your attitude is that if one server drops out, you can slot in another and don't have to worry about failure," Clabby Analytics' Clabby told the E-Commerce Times.
On the other hand, the problem could lie mainly with the networks.
"There are so many factors in these complex networks, including the servers, that nothing will ever be perfect," Jim McGregor, principal analyst at Tirias Research, told the E-Commerce Times. "As a result, you have to build in more redundancy, just as the communications vendors have always been required to do."