AWS outage post-mortem fingers DNS as the culprit that took out a chunk of the internet and services for days — automation systems race and crash


The recent Amazon Web Services outage that took out a significant portion of the internet, games, and even smart home devices for days was extensively covered in the news. The distributed architecture of cloud services is supposed to protect customers from failures like this one, so what went wrong? Amazon published a detailed technical post-mortem of the failure, and as the famous haiku goes: "It's not DNS. / There's no way it's DNS. / It was DNS."

As a rough analogy, consider what happens when there's a car crash. A traffic jam stretches for miles, in an accordion-like effect that lasts well after the accident scene has been cleared. The initial problem was fixed relatively quickly, in an outage of roughly three hours that ran from October 19 at 11:48 PM until October 20 at 2:40 AM. However, as with the traffic jam, dependent services started breaking and didn't fully come back online until much later.

The specific technical issue behind the DNS failure was a programmer's "favorite" bug: a race condition, in which the outcome depends on the timing of concurrent operations, so two repeating processes can keep redoing or undoing each other's work. The famous GIF of Bugs Bunny and Daffy Duck with the poster is illustrative.
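To make the idea concrete, here is a minimal Python sketch of one common kind of race: a delayed worker overwriting newer state with stale data. It is not a reconstruction of Amazon's automation. The names `Plan`, `RecordStore`, `apply_unchecked`, `apply_checked`, and `replay_race` are hypothetical, and the problematic interleaving is replayed in a fixed order rather than produced by real threads so the result is repeatable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """A hypothetical configuration update; a higher version means newer."""
    version: int
    value: str


class RecordStore:
    """A hypothetical shared record that several workers may update."""

    def __init__(self) -> None:
        self.current: Optional[Plan] = None

    def apply_unchecked(self, plan: Plan) -> bool:
        # Last writer wins, regardless of how old the plan is.
        self.current = plan
        return True

    def apply_checked(self, plan: Plan) -> bool:
        # Reject any plan that is not newer than what is already stored.
        if self.current is not None and plan.version <= self.current.version:
            return False
        self.current = plan
        return True


def replay_race(checked: bool) -> Plan:
    """Replay one bad interleaving: the slow worker's write lands last."""
    store = RecordStore()
    old_plan = Plan(version=1, value="old-plan")  # picked up by the slow worker
    new_plan = Plan(version=2, value="new-plan")  # picked up by the fast worker

    apply = store.apply_checked if checked else store.apply_unchecked
    apply(new_plan)  # the fast worker finishes first
    apply(old_plan)  # the delayed worker arrives afterwards with stale data

    assert store.current is not None
    return store.current


if __name__ == "__main__":
    print(replay_race(checked=False).value)  # stale plan overwrote the new one
    print(replay_race(checked=True).value)   # stale plan was rejected
```

Without the version check, whichever worker writes last decides the final state, so a delayed worker can silently undo a newer update. Comparing versions before writing is one common guard. In a real concurrent system, the check and the write would also have to happen atomically, for example under a lock or with a conditional write.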


Bruno Ferreira is a contributing writer for Tom's Hardware. He has decades of experience with PC hardware and assorted sundries, alongside a career as a developer. He's obsessed with detail and has a tendency to ramble on the topics he loves. When not doing that, he's usually playing games, or at live music shows and festivals.
