
From Reddit and Snapchat to Delta and Zoom, more than a thousand websites around the globe were knocked offline early Monday—and continued to have problems into the afternoon—after a major outage at Amazon Web Services, the single largest cloud service provider in the world.
A Northeastern University cybersecurity and cloud expert says the outage is emblematic of just how fragile the internet has become in the decades since its founding, particularly with regard to data infrastructure and resilience.
“There’s been a massive amount of centralization in terms of our dependence on a small number of cloud providers,” says David Choffnes, a professor in the Khoury College of Computer Sciences and director of Northeastern University’s Cybersecurity and Privacy Institute.
“When they go down, so much of what we depend on goes down,” he adds.
Amazon Web Services, which makes up 30% of the cloud computing market, first reported latency and error-rate issues Monday morning at multiple AWS data centers in Northern Virginia.
Hours later, Amazon Web Services pinpointed the problem, attributing it to a Domain Name System, or DNS, issue impacting one of its biggest databases, DynamoDB.
The Domain Name System was developed to translate names that people can read into the numeric addresses that computers use, Choffnes explains.
“DNS is the thing that translates human readable names like Amazon.com into names that computers use to address each other on the internet,” he says.
That’s where Internet Protocol addresses, or IP addresses, come in, he says.
“Many people have probably heard of IP addresses. Your computer gets one when it’s connected to the internet,” he says. “Everything connected to the internet has an IP address.”
DNS is essentially a big spreadsheet of names and corresponding IP addresses, says Choffnes.
In the case of the DynamoDB database, the entry for that specific service likely disappeared for a while, which in turn caused the outage issue, he says.
“So computers would ask, ‘How do I get to the DynamoDB database?’ The internet would come back and say, ‘I have no response for you. I don’t know what the IP address is.’”
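The failure Choffnes describes can be illustrated with a toy lookup table. This is only a rough sketch of the idea, not how AWS or real DNS software works: the hostname `dynamodb.example.com`, the address `192.0.2.10`, and the `resolve` and `connect` helpers are all made up for illustration.

```python
from typing import Optional

# Hypothetical name-to-address table; the hostname and IP are illustrative only.
records = {"dynamodb.example.com": "192.0.2.10"}


def resolve(name: str, table: dict[str, str]) -> Optional[str]:
    """Return the IP address recorded for name, or None if there is no entry."""
    return table.get(name)


def connect(name: str, table: dict[str, str]) -> str:
    """Describe what a client can do after looking up name in table."""
    ip = resolve(name, table)
    if ip is None:
        # No answer from the lookup, so the client has nowhere to connect.
        return f"cannot reach {name}: no address found"
    return f"connecting to {name} at {ip}"


# Normal operation: the entry exists and the lookup succeeds.
print(connect("dynamodb.example.com", records))

# Simulate the entry disappearing from the table for a while.
removed = records.pop("dynamodb.example.com")
print(connect("dynamodb.example.com", records))

# Putting the entry back makes lookups succeed again.
records["dynamodb.example.com"] = removed
print(connect("dynamodb.example.com", records))
```

In this sketch the service itself never changes. Only the table entry is removed and restored, which is enough to make the service unreachable for every client that relies on the lookup.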
Fixing that kind of issue can be a “huge” mess, says Choffnes, given the number of different internet services that relied on that database.
Downdetector, an online service that tracks internet outages, received more than 6.5 million reports Monday.
Amazon Web Services initially sent out a statement Monday morning after addressing the DNS issue, noting that its systems were recovering.
Yet it had to walk that back a few hours later after reporting that users continued to have connectivity issues. In particular, users were having issues accessing the Amazon Elastic Compute Cloud (EC2) network, which allows companies to run their own virtual servers.
Choffnes isn’t surprised.
“In my experience with large systems like this, it’s a domino effect,” he says. “You can fix the problem. You can put the mapping back into DNS and now computers when they look up DynamoDB, they can figure out where on the internet to communicate with. … But a lot of those systems might have crashed. They might have gone into some error state. And they might need to get rebooted.”
Having so many systems trying to use the network at once could certainly overwhelm and overload Amazon’s servers and other services, he says.
Monday afternoon, Amazon released another statement saying its servers were again recovering, but that it would likely be a while before services were back to normal.
But when exactly will that be?
That’s the question, says Choffnes, but it’s very hard to predict.
“It’s sort of like Whack a Mole. You fix a problem and you think, ‘Good, we’re done.’ And then all of the sudden that fix means something else starts happening,” he says.
“I’m sure there are people at Amazon who have been awake since the early hours of the morning trying to fix one problem, running into the next one, and working to address what they can.”
Written by Cesareo Contreras, Northeastern University.