Ohmygod it was DNS
I can’t tell if that site is dynamic or not. I may never be able to tell.
Enterprise tribal knowledge is vital. It’s intangible and not on anybody’s scorecard, but I’m certain this outage was related to layoffs or quiet firings (RTO).
From my experience before I left, and from what I’ve heard since, it isn’t that Amazon as a whole is losing engineers. They’re losing them from critical services.
Some of those “foundational services” have teams with really brutal on-call and ops load. It gets so bad that you can’t fix it: you spend so much time keeping the lights on that you can’t automate anything (and for those teams, leadership will never hear of decreasing the workload). And with constant churn, you also need to spend time training the new guys on top of keeping the lights on.
Then the team fails and directors fold the responsibilities onto a new team and the cycle starts again.
And that’s where the brain drain hits the hardest.
It also doesn’t help that there’s a massive amount of red tape you need to clear to build anything. It’s easier in AWS than retail but it’s still an awful amount compared to other places.
At AWS’s scale, all of their issues are complex. This isn’t going to be a simple issue that someone should have caught, because they already hit the simple ones years ago and ironed out the kinks in their resilience story.
Imagine if it was a coffee spill or a bad git push to production servers.
This is The Register, a respected journalistic outlet. As a result, I know that if I publish this piece as it stands now, an AWS PR flak will appear as if by magic, waving their hands, insisting that “there is no talent exodus at AWS,” a la Baghdad Bob.
🤣 I love reading The Register for these
Here’s to hoping the next big “oh shit, everything is down” happens this year. Maybe another CrowdStrike in December. If it’s not big enough, it won’t beat the inertia of “well, I’m sure it was an isolated incident.”
It’s always either DNS or IAM in us-east-1. You would think they would find a way to avoid those bottlenecks.
DNS over HTTPS will solve all our problems!
/s