To say the cloud is important to how we work, live, and play is a major understatement – the cloud is now a critical element of tech infrastructure. Yet when the cloud goes down, the underlying cause is often a mystery. That’s why I was intrigued by a recent webcast from ThousandEyes, which looked under the covers at the major cloud outages of 2023.
Hosted by Brian Tobia, ThousandEyes’ Lead Technical Marketing Engineer, the webcast included a look at the anatomy of an outage. “It’s important to understand the different types of outages we see,” he said. “Understanding them can help you understand how to mitigate some of their impact.” He said outages could vary in the blast radius, whether they’re planned or unplanned, and their mean time to recovery.
Let’s take a look at what caused the year’s major cloud outages – and what we can learn from these unfortunate incidents.
Different Types of Cloud Outages
“The distributed architecture of today’s applications means there are a lot of different moving parts that need to be orchestrated for something to actually work,” Tobia said. “And a lot of these parts are often single points of failure. Because they’re reused in multiple applications, like an API or a common service, we can see the impact of an outage more widely felt, despite it being a single service.”
Tobia noted that tracking cloud computing outages can help teams identify patterns and prevent customer service disruptions.
Looking at the ThousandEyes report from 2023, Tobia said there were many different types of outages. “Overall, we still saw the most common type being ISP-related outages,” he said. “But we saw that there was an increase in CSP outages in 2023 compared to the previous years.”
In 2023, the share of US-centric outages increased from 34% to 37%, and minor outages became more common. “We’re seeing that these smaller, more contained outages are becoming more common,” he said. “Before, there were traditionally a lot of bigger network outages – like really big ones – that would take down a whole bunch of services. But now we’re also seeing smaller ones.”
But even a cloud outage that starts in the US due to maintenance activity at night can cascade into other geographies in the middle of their business days.
Application Outages On the Rise
Tobia said that application outages, which continued to rise in 2023, can have a greater impact than network outages. A network outage typically affects the users of a single provider, while an application outage reaches users no matter which provider they connect through. “The application outage really cascades because a bunch of people are relying on that one application,” he said. “It doesn’t matter what network you’re coming from.”
He then moved on to look at some outage examples from 2023, focusing on how ThousandEyes works. “We’re able to collect all this data through ThousandEyes,” he said. “Being able to correlate that and collect all this data, it’s really important to get the end-to-end picture of where an outage might occur. And then, also really important, correlate that across every layer.”
He added that ThousandEyes can show users every layer of a connection, whether it’s related to border gateway protocol (BGP), networks, applications, HTTP errors, or page load times.
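To make the network-versus-application distinction concrete, here is a minimal Python sketch of the kind of layer-by-layer reasoning Tobia describes. It is not ThousandEyes' method or API: the `Probe` type, the `classify` function, and the thresholds are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    """One hypothetical measurement of a service from a single vantage point."""
    packet_loss_pct: float       # network layer: percentage of packets lost
    latency_ms: float            # network layer: round-trip latency
    http_status: Optional[int]   # application layer: None if no response arrived


def classify(probe: Probe,
             loss_threshold_pct: float = 5.0,
             latency_threshold_ms: float = 500.0) -> str:
    """Label a probe as 'network', 'application', or 'healthy'.

    The thresholds are illustrative assumptions, not vendor defaults.
    """
    network_bad = (probe.packet_loss_pct >= loss_threshold_pct
                   or probe.latency_ms >= latency_threshold_ms)
    app_bad = probe.http_status is None or probe.http_status >= 500

    # Check the network first: if packets are not getting through,
    # application errors are probably a symptom rather than the cause.
    if network_bad:
        return "network"
    if app_bad:
        return "application"
    return "healthy"


# Clean network path but a server error points at the application.
print(classify(Probe(packet_loss_pct=0.0, latency_ms=40.0, http_status=503)))
# Heavy packet loss and no response points at the network.
print(classify(Probe(packet_loss_pct=60.0, latency_ms=40.0, http_status=None)))
```

A real monitoring setup would correlate many probes across many vantage points and layers, but the ordering of the checks captures the basic idea: rule out the network path before blaming the application.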
Top Cloud Outages of 2023
Tobia detailed the list of 2023 outages, including:
- A 90-minute outage for Microsoft on January 25: This was due to BGP changes that caused network issues. “This was total chaos from a BGP perspective,” Tobia said.
- A two-hour outage for Outlook on February 7: This resulted in service unavailable/application errors. “The last [January] outage was more around some changes on their ISP routers and other WAN routers,” he said. “This may have been more on the application side.”
- A seven-hour Virgin Media outage on April 4: This outage arose because of a BGP route withdrawal that caused network traffic loss. “It was kind of similar to what we saw on the Microsoft side, when those BGP changes were occurring,” he said. “Without a route to the Virgin Media UK network, a lot of the Internet and transit providers dropped the traffic.”
- A two-hour AWS cloud outage on June 14: This outage caused latency, server timeouts, and HTTP errors. “They eventually identified the issue as being part of their capacity management system located within US-EAST-1,” he said. “And this impacted services like Lambda, API Gateway, the actual management console itself, and Global Accelerator.”
- A two-hour Slack cloud outage on August 2: As a result of this outage, users couldn’t send or receive messages. “Network paths were totally fine,” he said. “We didn’t see any packet loss, latency, or anything like that. So it was purely an application or client issue.”
- An 18-hour Square cloud outage on September 8: This outage resulted in app errors and backend transactions failing. “This outage prevented it from processing transactions,” he said. “So end users were actually able to submit a transaction – some sellers who were using this to receive payments were successful, [but some users] reported connections dropping out or things not working.”
- A 36-hour Workday/Cloudflare outage that started on November 2: A complete power failure at a Cloudflare data center caused application and service issues. “Cloudflare was the provider and Workday was an application that runs on Cloudflare infrastructure,” he said. He noted that disaster recovery (DR) resources took six hours to come online. “So there was a complete outage until that facility came online,” he added. “It was able to serve requests at a diminished rate and then full resolution didn’t happen until more than 36 hours later.”
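Since mean time to recovery is one of the ways Tobia says outages vary, it can help to see how the durations above compare. The short Python sketch below uses the approximate durations from this list, rounded to headline figures (the Cloudflare incident, for instance, ran past 36 hours), so the results are rough. The names `OUTAGE_HOURS` and `recovery_stats` are illustrative.

```python
from statistics import mean, median

# Approximate durations in hours, taken from the list above.
OUTAGE_HOURS = {
    "Microsoft (Jan 25)": 1.5,
    "Outlook (Feb 7)": 2.0,
    "Virgin Media (Apr 4)": 7.0,
    "AWS (Jun 14)": 2.0,
    "Slack (Aug 2)": 2.0,
    "Square (Sep 8)": 18.0,
    "Workday/Cloudflare (Nov 2)": 36.0,
}


def recovery_stats(durations):
    """Summarize recovery times: mean, median, and the longest outage."""
    hours = list(durations.values())
    return {
        "mean_hours": mean(hours),
        "median_hours": median(hours),
        "longest": max(durations, key=durations.get),
    }


stats = recovery_stats(OUTAGE_HOURS)
print(f"Mean time to recovery: {stats['mean_hours']:.1f} hours")
print(f"Median time to recovery: {stats['median_hours']:.1f} hours")
print(f"Longest outage: {stats['longest']}")
```

The mean comes out near 9.8 hours while the median is only two, which shows how a couple of long incidents can dominate an average recovery figure.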
Bottom Line: Need for Monitoring to Prevent Cloud Outages
It was a busy year, and Tobia’s examples of cloud outages were sobering. Who would have thought a full power outage was still possible, given the sophisticated data centers that providers like Cloudflare run? Yet Cloudflare still had to go through the arduous process of bringing its DR resources online while working with a power company that may not move at the fastest pace.
Tobia’s presentation also underscored the importance of monitoring resources to understand what happens when a service goes down so that one can learn and avoid repeating the same mistakes.
Unfortunately, for most businesses, maintaining a backup for every cloud service the organization uses would be fiscally challenging. As an alternative, IT leaders can use outage data from companies like ThousandEyes to make uptime part of their evaluation criteria when choosing providers.