Though the mechanics behind many cloud outages are eventually revealed, some of the issues might recur because of tradeoffs made by providers for the sake of cost and profitability.
The majority of cloud outages boil down to software updates or configuration changes gone wrong, says Kurt Seifried, chief blockchain officer and director of special projects with the Cloud Security Alliance. He and other experts see the cloud growing increasingly complex with new features rolled out to meet demand and expectations for innovation, yet the drive to release updates can lead to some corners being cut. “Ultimately, that’s a human failure in that they should have tested it out more,” Seifried says, though he acknowledges that when changes are made to a major system, at some point testing must stop and the updates must be deployed.
Knowing the Problem Does Not Always Fix the Problem
He says that though the major problems that lead to outages are relatively well known, the ubiquity and necessity of the cloud for modern commerce mean there is little choice but to go along with the practices of current providers. “Most businesses make the tradeoff because what are customers going to do? Leave? That’s part of the problem,” Seifried says. “The cost of these outages is largely externalized.”
In early July, Rogers Communications suffered an outage that lasted some 19 hours and affected commerce, including banking and other vital services. Rogers, which has some 2.25 million retail internet customers and more than 10 million wireless customers, initially offered its customers an automatic credit equivalent to five days’ service fees. More recently, the company announced it would spend $7.74 billion US over the coming three years to bolster testing and leverage AI to avoid future outages.
The incident led to the Canadian government ordering a probe into the matter, with calls for new protocols to keep the public better informed, but market-driven motivations can hamper the resiliency of the cloud.
“Do you want the network fast, or do you want it reliable, or do you want it cheap? You can pick two,” Seifried says. The tendency, he says, is for customers to opt for fast and cheap.
No Cloud Means No Business
The reliance on the cloud continues to escalate, says Workspot CEO Amitabh Sinha, whose customers deploy cloud PCs around different parts of the world and require access to the cloud. “If it’s not available, people don’t do work,” he says.
Among Workspot’s clientele with cloud PCs, outages can result in an average productivity loss of $150 per user, per hour, Sinha says.
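That per-user rate makes the exposure easy to estimate. The sketch below uses the $150-per-hour figure from the article; the fleet size and outage duration are hypothetical examples, not numbers from any Workspot customer:

```python
# Rough outage-cost estimate based on the $150/user/hour productivity figure
# cited in the article. Fleet size and duration below are hypothetical.

LOSS_PER_USER_HOUR = 150  # USD, per Workspot's estimate

def outage_cost(users: int, hours: float) -> float:
    """Estimated productivity loss for an outage affecting `users` for `hours`."""
    return users * hours * LOSS_PER_USER_HOUR

# e.g. 500 cloud-PC users idled by a 4-hour outage
print(outage_cost(500, 4))  # 300000.0 USD
```

Even a short regional outage against a modest fleet quickly runs into six figures, which is why the rollout discipline described below matters.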
With bad driver updates, rather than natural disasters or cyberattacks, frequently the culprit in cloud outages, Sinha says providers have gotten more adept at preparing for such issues. “Cloud providers have learned one thing, which is ‘Don’t push an update worldwide on day one,’” he says.
Instead, those updates may be pushed to one region to start. Even with the damage restricted to a region, the severity can increase if the problem is a bad fabric update rather than just a driver issue, he says. “If you push a bad update to your fabric, it affects the whole fabric,” Sinha says. “Those are slightly more catastrophic.” A bad fabric update can take down every customer in a region, he says, and rolling it back may take six to 24 hours. “It doesn’t happen very often — once every year at least.”
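The region-by-region discipline Sinha describes can be sketched as a simple canary loop. This is a generic illustration only, not any provider’s actual tooling; the region names and the `deploy_to`, `healthy`, and `rollback` helpers are hypothetical stand-ins:

```python
# Minimal sketch of a staged, region-by-region rollout with automatic
# rollback -- a generic illustration, not any cloud provider's real tooling.
import time

REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "ap-south-1"]  # hypothetical

def deploy_to(region: str, version: str) -> None:
    print(f"deploying {version} to {region}")

def healthy(region: str) -> bool:
    # Stand-in for real health signals: error rates, latency, support tickets.
    return True

def rollback(region: str, version: str) -> None:
    print(f"rolling back {version} in {region}")

def staged_rollout(version: str, soak_seconds: float = 0) -> bool:
    """Push to one region at a time; on the first failed health check,
    roll back everywhere the update has reached and stop."""
    done = []
    for region in REGIONS:
        deploy_to(region, version)
        time.sleep(soak_seconds)  # let the change "soak" before expanding
        if not healthy(region):
            for r in [region] + done:
                rollback(r, version)
            return False
        done.append(region)
    return True
```

The key property is the blast-radius limit: a bad update is caught while it affects one region, which is exactly the “don’t push worldwide on day one” lesson Sinha cites.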
Outages in certain densely packed regions, though, can bring down major services such as Netflix, which tends to draw significant notice from the public, Sinha says. “When that region goes down, it feels like the world has come to a halt.” He still considers the overall cloud network resilient, even if regional failures appear more frequently. “They’re not global failures,” Sinha says. “The cloud providers have a good model of making sure that failures are detected early and fixed early.”
Regional Outages Can Cause Wide Ripples
That still did not reduce the disruption of the Rogers outage, which Seifried says also revealed the reach of communications providers. “We all learned that Rogers owns Interac, which is our primary payment processing network here for debit cards,” he says. When Rogers went down, it left Interac debit and other services unavailable to the public. That opened up a deeper political discussion, Seifried says, about the provider being forthcoming about its sway and impact on Canada. “It’s pretty clear they made a 3 a.m. maintenance window whoopsie and killed their network for a day,” he says.
Seifried compares that with Cloudflare’s handling of outages: the company, he says, posts an initial report within 30 minutes to an hour of an incident, followed a day later by a full root-cause analysis detailing the remedies taken to ensure the incident does not repeat. “A lot of companies are scared to be honest about why they screwed up,” he says.
Phone companies that are cloud providers, Seifried says, may be reluctant to lay out initially what precipitated an outage. “They’re not going to tell you the truth anytime soon without spinning it because they don’t want to get sued,” he says. “We need to get to that more mature space because this is everything now.”
The majority of cloud outages may stem from cloud provider mistakes, but Seifried says there have been malicious actors in some outlier cases. For example, when the Mirai botnet struck in 2016, launching distributed denial-of-service attacks on Dyn and OVH, it triggered fears that nation-state cyberattacks were underway and that the entire web was at risk, Seifried says. “It turned out to be three people in their 20s doing Minecraft server shenanigans,” he says. “Essentially, they were running a protection racket. They were doing this out of a dorm room basically.”
Still, most known outages stem from providers themselves, Seifried says, such as the BGP (border gateway protocol) outage last October, which disrupted Facebook, Instagram, WhatsApp, and other sites for some six hours. BGP is how networks find routes to other networks on the internet. “You break that and you’ve broken everything,” he says.
Facebook reported that the outage was “triggered by the system that manages our global backbone network capacity. The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.”
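The mechanics are easy to see in miniature. In the toy model below, withdrawing a BGP advertisement for a prefix leaves the rest of the internet with no route to it, even though the servers behind it keep running; the prefix and ASN pairings are illustrative examples, not a reconstruction of the actual incident:

```python
# Toy model of BGP advertisement and withdrawal -- illustrative only,
# not a reconstruction of the October Facebook outage.

# One router's view: prefix -> AS path learned via BGP (example values)
routes = {
    "203.0.113.0/24": ["AS64500"],  # documentation-range prefix, private-use ASN
    "198.51.100.0/24": ["AS64501"],
}

def reachable(prefix: str) -> bool:
    """A router can only deliver traffic to prefixes it holds a route for."""
    return prefix in routes

# Normal operation: the prefix is advertised, traffic can be delivered.
assert reachable("203.0.113.0/24")

# A bad backbone or config change withdraws the advertisement...
routes.pop("203.0.113.0/24")

# ...and the rest of the internet simply has no route to the prefix,
# even though the data centers behind it are still up.
print(reachable("203.0.113.0/24"))  # False
```

This is why a routing-layer failure feels total from the outside: nothing is broken inside the provider’s data centers, yet no packet can find its way there.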
In earlier days, such an outage may have affected a smaller digital footprint but now the interconnectivity of the cloud means outages are less ignorable. “It used to be, ‘Oh, the internet’s down. No big deal,’” says Seifried. “Now it’s like, ‘Internet’s down. Nobody can buy food.’”
A Massively Complex Problem
Companies such as AWS and Cloudflare are woven into the makeup of the cloud and must continually innovate to scale up and out, Seifried says, and the severity of outages can be tied to that increasing complexity. “These are horrendously large, complex-scale systems that are also constantly changing and evolving,” he says.
Security and safety measures may be compromised, Seifried says, as new capabilities are deployed, though providers do a pretty good job covering their bases. “When Cloudflare goes down, that’s like 30% of the world’s internet. Cloudflare usually fixes it within 30 to 40 minutes,” he says.
In some ways, the pace of change in the cloud has also led to an inverse of the legacy, tech debt issue. Instead of companies scrambling to find engineers versed in maintaining older systems, it is getting harder and harder to keep up with the latest systems. “In the past, you deployed a computer system and used it for 10 years,” Seifried says. “Now, can you realistically think of a company deploying a computer system as-is and not majorly upgrading or changing it over the next 10 years?”
This raises questions about the future of cloud resiliency as providers face systems that continue to scale up, exponentially increasing the digital parts they need to monitor for outages and fixes. “Where do you learn to do stuff at Amazon-scale other than at Amazon? You can’t just learn this in your basement,” Seifried says. “My biggest fear is that we’re getting to the point of complexity where you can’t learn this without doing an apprenticeship. There’s no way a university can teach you to handle a system with 100 million compute nodes spanning the globe.”