A Typo Took Down The Internet
SAN FRANCISCO - The major outage that hit tens of thousands of websites using Amazon's AWS cloud computing service on Tuesday now has an explanation. Who knew that a single mistyped command could take down Amazon's cloud services for four hours?
The four-hour outage at Amazon Web Services' S3 system, a giant provider of backend services for close to 150,000 websites, caused disruptions, slowdowns and failure-to-load errors across the United States.
Amazon's Simple Storage Service (S3) lets companies use the cloud to store files, photos, video and other information they serve up on their websites. It contains literally trillions of these items, known as "objects" to programmers. During the outage, no one was able to access the affected websites, photos, logos, lists, data and various other systems. Many sites also had broken links and were only partially functional.
Today, Amazon published a public letter explaining what happened:
On Tuesday morning, an Amazon team was investigating a problem that was slowing down the S3 billing system.
At 9:37 am Pacific time, one of the team members executed a command that was meant to take a few of the S3 servers offline.
"Unfortunately," Amazon said in its posting, one part of that command was entered incorrectly — i.e. it had a typo.
That mistake took more servers offline than the team had intended. Two of those servers ran some important systems for the whole East Coast region, such as the ones that let all those trillions of files be placed into customers' websites.
To bring them back, both systems required a full restart, which takes a lot longer than simply rebooting your laptop.
This didn't just affect Amazon's S3 customers. It also hit other Amazon cloud customers, because the services they use rely on S3, too.
While Amazon says it designed its system to work even if big parts failed, it also acknowledged that it hadn't actually done a full restart on the main subsystems that went offline "for many years."
During that time, the S3 system had gotten a whole lot bigger, so restarting it, and doing all the safety checks to make sure its files hadn't gotten corrupted in the process, took much longer than expected.
It wasn't until 1:54 pm Pacific time, four hours and 17 minutes after the mistyped command was first entered, that the entire system was back up and running.
To make sure the problem doesn't happen again, Amazon has rewritten its software tools so its engineers can't make the same mistake, and it's doing safety checks elsewhere in the system.
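Amazon's letter, as summarized here, does not spell out how the rewritten tools work, so the following is only a rough sketch of what such a safeguard might look like. It assumes a hypothetical rule that a removal request is rejected if it exceeds a small batch size or would leave a subsystem below a minimum server count. The names Subsystem, check_removal, min_required and max_batch are illustrative and are not taken from Amazon's tooling.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    # Hypothetical floor: fewer servers than this and the subsystem cannot serve traffic.
    min_required: int


def check_removal(subsystem: Subsystem, requested: int, max_batch: int = 5) -> int:
    """Validate a removal request and return the server count that would remain.

    Nothing is actually taken offline here; the function only decides
    whether the request is safe to carry out.
    """
    if requested <= 0:
        raise CapacityRemovalError("requested server count must be positive")

    # Guard 1: cap how many servers a single command may remove.
    if requested > max_batch:
        raise CapacityRemovalError(
            f"refusing to remove {requested} servers at once (limit {max_batch})"
        )

    # Guard 2: never drop a subsystem below its minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityRemovalError(
            f"{subsystem.name} would drop to {remaining} servers, "
            f"below the minimum of {subsystem.min_required}"
        )

    return remaining
```

With a check like this, an engineer who meant to remove 4 servers but typed 40 would get an error instead of an outage, and even a small request would be refused if the subsystem were already running close to its minimum.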
Amazon apologized to its customers for the event, saying it "will do everything we can to learn from this event and use it to improve our availability even further."