On February 28th, the internet broke. Again. Wired published this piece on it.
I had similar thoughts and wanted to discuss solutions and mitigation for future issues. Apparently, S3 wasn't the only service affected on AWS.
Train Analogy of the Internet
Let me kick this off by explaining the strength of the internet as a web, specifically the strength of having information distributed.
Internet in Series
Picture a regular old train made up of multiple cars, each connected to one car in front and one car behind. A car can be compromised by being disconnected or packed with zombies. A compromised car fragments the entire train: you can no longer travel from one end of the train to the other. The cars are connected in series.
Internet in Parallel
Now, take the same train and have a direct connection from each car to every other car so that if any car gets disconnected or is packed with zombies, we’re still able to get from our car to the end of the train. Sounds doable.
If we have 2 cars we only need 1 connection. But as we add more cars, the number of direct connections required grows faster than the number of cars. 3 cars, 3 connections. 4 cars, 6 connections. 5 cars, 10 connections. If you have n cars, you need n * (n - 1) / 2 connections. This is a quadratic relationship. The cars are connected in parallel.
Problem: If we want to add one more car to a train of n cars, we have to add n more direct connections to support that car in this setup.
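The arithmetic is easy to check with a few lines of Python. This is only a sketch; the function names `series_links` and `full_mesh_links` are made up for illustration.

```python
def series_links(n: int) -> int:
    """Connections in a regular train: each car is coupled to the next one."""
    return max(n - 1, 0)


def full_mesh_links(n: int) -> int:
    """Connections needed when every car is wired directly to every other car."""
    return n * (n - 1) // 2


# Compare the two setups as the train grows.
for cars in (2, 3, 4, 5, 10, 100):
    print(f"{cars:>3} cars: series={series_links(cars):>3}  parallel={full_mesh_links(cars):>5}")

# Adding one car to a fully connected train of n cars costs n new connections.
n = 5
print(full_mesh_links(n + 1) - full_mesh_links(n))
```

At 100 cars the series train still needs only 99 couplings, while the fully connected one needs 4,950.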
How do we optimize this?
Internet as Web
We can reduce the number of connections required so that, while the train is not as robust as having every car connected in parallel, there are enough redundancies in place to prevent catastrophic failure. We can get from one car to any other car through various proxies. Distributed risk. This way, if we add a new car, it only requires a single connection.
I know I used a terrible example, but hopefully you get the picture. This is the concept of the World Wide Web. We can mitigate risk and keep things feasible by distributing connections, allowing every node to communicate with every other node despite outages, whether directly or through proxies.
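To make the three setups concrete, here is a small sketch that models a six-car train as a graph and asks whether the remaining cars can still reach each other after any single car is lost. The "web" layout (a ring plus two shortcuts) is an arbitrary choice for illustration, not a claim about how the real internet is wired.

```python
from collections import deque
from itertools import combinations


def is_connected(cars, links):
    """Breadth-first search: can every car in `cars` reach every other one?"""
    cars = set(cars)
    if not cars:
        return True
    neighbours = {car: set() for car in cars}
    for a, b in links:
        if a in cars and b in cars:  # ignore links that touch a lost car
            neighbours[a].add(b)
            neighbours[b].add(a)
    start = next(iter(cars))
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in neighbours[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == cars


def survives_any_single_loss(cars, links):
    """True if the train stays in one piece no matter which single car is lost."""
    return all(is_connected([c for c in cars if c != lost], links) for lost in cars)


cars = list(range(6))
series = [(i, i + 1) for i in range(5)]                            # 5 links
parallel = list(combinations(cars, 2))                             # 15 links
web = [(i, (i + 1) % 6) for i in range(6)] + [(0, 3), (1, 4)]      # 8 links

for name, links in (("series", series), ("parallel", parallel), ("web", web)):
    print(name, len(links), survives_any_single_loss(cars, links))
```

In this toy model the web layout tolerates the loss of any one car with 8 links instead of 15, which is the trade-off described above.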
Fast forward to two days ago and the S3 blackout. Yes, Amazon uses multiple data centers distributed globally. They definitely have redundancies in place that I can’t even imagine. But what did that do to prevent a whole host of services and products from going offline during the S3 outage?
Having a large portion of the internet hosted on AWS and relying on its services creates a massive dependency on Amazon's uptime, and with it a massive risk.
When I started seeing broken images across a few sites I regularly browse, I was curious. Sending a direct request for the images, I saw CloudFront in the URL. I was sitting across from my co-founder and said, “CloudFront is down.” He replied, “S3 is down.” Then we noticed Quora was timing out. Coincidentally, Stack Exchange was under attack the night before. I instantly thought of an IoT DDoS. But as it turns out, they weren’t related and the issue was a “flaw” with S3.
These are the chokepoints of the internet, whether they are based on resolving domains or serving static files. Granted, no one can provide perfect uptime, and most hosting providers claim 99.9% uptime. We’ve even had our blog go down for a couple of hours in the one month we’ve been running it due to a fried server.
The problem becomes larger when an outage affects services that other websites and services rely on:
- Every single company that relies on GitHub
- Every single company that relies on Trello
The blast radius is now compounded. What was originally supposed to be a connected graph becomes more and more disconnected as a consequence of service dependencies.
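To put rough numbers on that compounding, here is a back-of-the-envelope sketch. It assumes every dependency independently delivers the 99.9% uptime mentioned earlier, and that your site is down whenever any one of them is down. Both are simplifying assumptions, and the dependency names are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0.999 = 99.9%)."""
    return (1 - availability) * HOURS_PER_YEAR


def chained_availability(availabilities) -> float:
    """Availability of a site that needs ALL of its dependencies up at once.

    Assumes failures are independent, which real outages often are not.
    """
    return math.prod(availabilities)


# Hypothetical stack: every layer promises 99.9%.
dependencies = {"file host": 0.999, "CDN": 0.999, "DNS": 0.999}

single = downtime_hours_per_year(0.999)
combined = downtime_hours_per_year(chained_availability(dependencies.values()))

print(f"one provider at 99.9%: about {single:.1f} hours down per year")
print(f"three chained at 99.9%: about {combined:.1f} hours down per year")
```

Under these assumptions, a single 99.9% provider is allowed close to nine hours of downtime a year, and chaining three of them roughly triples that.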
Imagine how disastrous it would be as a marketing company serving a client's ad and using their budget, only to have broken images displayed on the publisher end.
So how do we avoid what happened a few nights ago?
First of all, tell your large tech teams to have a plan for this: disaster recovery and failovers.
Second, here are some steps to get you started:
- Set up replication across different geographic regions for failover.
- Get your own health checks for all your providers.
- Consider using a CDN that has separate servers from your static file host. CDN goes down? Reroute DNS to hit your file store directly. File host goes down? Cache for longer on your CDN if you can catch it before your TTL expires. We use Cloudflare.
- Use multiple A-records in your DNS as a lazy way of resolving to your mirrored hosts.
- Look at multi-provider replication. AWS + Google Cloud.
- Hire a proper DevOps team.
- Migration to big hosting platforms can't be stopped at this point. Google Cloud, Azure, AWS. Take your pick.
- People will point to this and bash on the cloud, but you should still have resiliency for on-premises hosting.
- Maybe the sites that went down had tight budgets to deal with and it didn't make sense to invest in a solution.
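As a starting point for the "own health checks" and failover steps above, here is a minimal sketch using only the Python standard library. The URLs are placeholders, and `http_probe` and `pick_origin` are hypothetical helper names. A real setup would run the checks on a schedule from several locations and then update DNS or CDN settings through your provider's own tooling, which is not shown here.

```python
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, DNS failures, refused connections, timeouts, bad URLs.
        return False


def pick_origin(origins, probe=http_probe):
    """Return the first origin, in priority order, whose health check passes."""
    for origin in origins:
        if probe(origin):
            return origin
    return None  # everything is down: time to page a human


# Placeholder endpoints in priority order: CDN, file store, mirror on another provider.
ORIGINS = [
    "https://cdn.example.com/health",
    "https://files.example.com/health",
    "https://mirror.example.net/health",
]

# Simulate a CDN outage with a fake probe so the example runs without network access.
down = {"https://cdn.example.com/health"}
print(pick_origin(ORIGINS, probe=lambda url: url not in down))
```

Passing the probe in as a parameter keeps the selection logic testable offline. Leave `probe` at its default to hit real endpoints.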
For those companies that didn't even flinch during the downtime: What plans do you have in place?
I'll let Big Sean play us out:
Last night took an L but tonight I bounce back.
If you a real winner you know how to bounce back.