For 45 minutes in the UK morning of June 8, a significant chunk of the world wide web did not work. People who tried to visit a huge array of websites, including CNN, Reddit, Amazon and Hulu, received a blank white page and an error message telling them the connection was unavailable.
A report from The Guardian confirmed the errors were focused on large websites with substantial traffic, but weren’t universal: users in some places, such as Berlin, Germany, reported no problems throughout the outage.
The cause turned out to be a problem at the “edge cloud” provider Fastly. “Within a few minutes, the company admitted on a status page that it was experiencing problems. With the exception of a few providers, including the BBC, which had backup systems in place, every affected website had to wait for Fastly to fix the error before they could restore service,” The Guardian’s report explained.
About Fastly
Fastly, a major middleman in internet traffic, offers a content delivery network (CDN) service. When it works, a CDN is supposed to improve the speed and reliability of the internet. Rather than visitors to a website all having to connect to servers run by that company – which might not even be in the same country as they are – they instead contact Fastly, which runs huge server farms all around the world that host copies of its clients’ websites.
That means that the page loads faster for the user, because the physical signals don’t have to travel as far. It also improves the reliability of the website, by ensuring that if there’s a big spike in traffic, it first hits Fastly’s servers, which are designed to handle a lot of traffic.
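To make the idea concrete, here is a minimal sketch in Python of what a CDN does. It is a toy model, not Fastly's actual software: the `Origin` and `EdgeServer` classes, the `nearest_edge` helper, the two locations and the distances are all made up for illustration. It shows the two benefits described above: the visitor is sent to the closest location, and a spike in traffic lands on the CDN's copy of the page rather than on the website's own server.

```python
from dataclasses import dataclass, field


@dataclass
class Origin:
    """The website's own server (a hypothetical stand-in)."""
    pages: dict
    hits: int = 0

    def fetch(self, path: str) -> str:
        # Count every request that actually reaches the origin.
        self.hits += 1
        return self.pages[path]


@dataclass
class EdgeServer:
    """One CDN location holding copies of the origin's pages."""
    location: str
    origin: Origin
    cache: dict = field(default_factory=dict)

    def get(self, path: str) -> str:
        # Only go back to the origin the first time a page is requested here.
        if path not in self.cache:
            self.cache[path] = self.origin.fetch(path)
        return self.cache[path]


def nearest_edge(edges, distances_km):
    """Pick the edge with the smallest (assumed) distance to the visitor."""
    return min(edges, key=lambda e: distances_km[e.location])


origin = Origin(pages={"/": "<h1>Home</h1>"})
edges = [EdgeServer("London", origin), EdgeServer("Tokyo", origin)]

# A visitor for whom London is closer than Tokyo (made-up distances).
edge = nearest_edge(edges, {"London": 300, "Tokyo": 9500})

# A spike of 1,000 requests is served from the edge's copy of the page.
for _ in range(1000):
    body = edge.get("/")

print(edge.location, origin.hits)
```

In this toy run the visitor is routed to London, and the origin is contacted only once, however many requests arrive. The flip side is the one seen on June 8: if the edge layer itself fails, every site that sits behind it becomes unreachable unless it has another route to its visitors.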
What happened on Tuesday?
According to The Verge’s report, Fastly abruptly took a nap for around an hour early Tuesday morning, and effectively knocked a huge number of major websites offline in the process. The company now says the issue stemmed from a software bug that was triggered by one customer’s configuration change.
“We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change,” Nick Rockwell, the company’s SVP of engineering and infrastructure, wrote in a blog post on Tuesday night. “This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”
Apparently, whatever the change was, it triggered a bug that had been introduced to Fastly’s systems by an update in mid-May. Clearly the fault lies with Fastly for not catching that it had broken its own code, but we have to ask: Who did it? Valid change or not, some specific, unnamed customer triggered the bug.
It’s obvious why Fastly has not named the customer or the exact set of circumstances that created this undesirable outcome, but we just want to know: What does it feel like to take down half the internet by accident?