Enterprises are overestimating the reliability of their cloud providers and need to rethink their cloud strategy, according to Sam Barker, vice president of telecoms market research at Juniper Research.
He maintained in a company blog that enterprises over-rely on a single provider for their cloud services, although that may change in the wake of the Amazon Web Services (AWS) outage last month, which disrupted a key database service and caused outages at many services that depend on AWS, including Disney+, Fortnite, HBO Max, Robinhood, Roblox, Slack, Venmo, and Zoom.
“Despite the disruption, Amazon’s stock remained relatively stable, suggesting continued investor confidence in the company’s long-term market leadership,” Barker wrote. “However, the incident could accelerate demand for multicloud orchestration tools, edge computing, and services that increase the overall resilience of cloud services.”
“Overall, we expect the outage to prompt enterprises to explore new solutions or business models to increase the uptime of their services,” he added.
While recent outages at AWS and Microsoft Azure caused degraded performance and downtime for many organizations, noted Gartner Vice President Analyst Lydia Leong, “These events highlight an important truth: cloud disruptions happen, but they are not evidence that the cloud is inherently unreliable.”
In an article published on the Gartner website, she warned that moving workloads out of hyperscale providers (repatriation) or to smaller sovereign clouds (geopatriation) won’t eliminate outage risk. “In fact,” she wrote, “these moves often introduce new risks and may even slow down your recovery when things do go wrong.”
“It’s tempting to think multicloud is the answer,” she continued. “But Gartner research shows that pursuing multicloud resilience can cost more than it saves, introducing technical complexity without truly eliminating systemic risk.”
“Cloud outages make headlines because they affect so many people at once, but context matters,” she added. “Every major provider has experienced similar events, from Microsoft Azure to Google Cloud Platform. The real differentiator is how well your organization plans for and recovers from inevitable disruption.”
Can’t Engineer Away Risk
The past few years have shown just how fragile the digital world can be, observed Shawn Michels, vice president of product management at Akamai Technologies, a content delivery network service provider in Cambridge, Mass. “From cloud platform outages to undersea cable cuts, even the most sophisticated systems can experience failures,” he told TechNewsWorld.
“A lot of organizations still assume that because something runs in the cloud, it’s automatically resilient, but that’s not the case,” he said. “Even the biggest clouds don’t have perfect uptime.”
“What separates the best from the rest is how well a system reacts to small failures to prevent a larger outage,” he continued. “You can’t stop every component from breaking, but you can design systems to recover so quickly that customers barely notice.”
He added that outages remind us that you can’t engineer away all risk. “The most resilient organizations are rethinking their architectures by using phased rollouts, automated rollback capabilities, and continuous observability to make sure problems are caught and contained early,” he explained. “True resilience is as much about culture as it is about technical architecture. It’s how people prepare for failure, respond under stress, and learn from every incident.”
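To make that concrete, here is a minimal sketch of what a phased rollout with automated rollback might look like. The stage fractions, error budget, and the deploy, error_rate, and rollback helpers are all hypothetical stand-ins for an organization’s real deployment and observability tooling, not any particular vendor’s API.

```python
# Hypothetical sketch of a phased (canary) rollout with automated rollback.
# deploy(), error_rate(), and rollback() are placeholders for real tooling.
import time

STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of traffic per phase
ERROR_BUDGET = 0.005                # abort if more than 0.5% of requests fail

def deploy(version: str, fraction: float) -> None:
    print(f"routing {fraction:.0%} of traffic to {version}")

def error_rate(version: str) -> float:
    return 0.001  # placeholder: in practice, read from observability metrics

def rollback(version: str) -> None:
    print(f"rolling back {version}")

def phased_rollout(version: str) -> bool:
    for fraction in STAGES:
        deploy(version, fraction)
        time.sleep(1)  # in practice: a soak period of minutes or hours
        if error_rate(version) > ERROR_BUDGET:
            rollback(version)  # contain the failure before full exposure
            return False
    return True

if __name__ == "__main__":
    phased_rollout("v2.3.1")
```

The point of the pattern is the one Michels makes: a small failure caught at 1% of traffic never becomes a large outage.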
While the major hyperscale providers are extremely reliable, they’re not equally reliable, contended Rich Mogull, chief analyst at the Cloud Security Alliance, a not-for-profit organization dedicated to cloud best practices. “Enterprises tend to gloss over these differences,” he said.
“For example,” he continued, “AWS rarely has cross-region failures, and when they do, they tend to be limited. You can largely plan around this potential. Azure, by comparison, is more likely to experience global failures due to how their infrastructure is designed.”
No Immunity to Downtime
Enterprises absolutely overestimate cloud reliability, often assuming that global cloud infrastructure is inherently immune to downtime due to redundancy, maintained Ensar Seker, CISO of SOCRadar, a threat intelligence company in Newark, Del.
“In reality, redundancy mitigates risk, but it doesn’t eliminate it,” he told TechNewsWorld. “Even hyperscalers like AWS or Azure operate in a complex web of dependencies across regions, zones, and third-party services. An issue in one layer — like identity federation, DNS propagation, or load balancer routing — can still ripple out and break critical functionality, even if core compute nodes are up.”
“What’s critical for enterprises to internalize is that cloud outages are inevitable, not hypothetical,” he said. “The question isn’t if, but how often and how prepared your organization is.”
“The AWS outage in June 2023, for example, disrupted everything from banking portals to hospital systems — not because AWS lacked redundancy, but because enterprises hadn’t built their apps to withstand regional or service-specific degradation,” he added.
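One common way to keep a failing dependency from rippling outward, as Seker describes, is a circuit breaker: after repeated failures, callers stop hitting the dependency for a cooldown period and serve a degraded fallback instead. A minimal sketch, with hypothetical thresholds:

```python
# Minimal circuit-breaker sketch: after repeated failures of a dependency
# (DNS, identity, a downstream API), fail fast for a cooldown period and
# serve a fallback instead of letting the failure cascade.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()      # circuit open: fail fast
            self.failures = 0          # cooldown elapsed: try again

        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback()

if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, cooldown=5.0)

    def flaky():
        raise ConnectionError("dependency down")

    for _ in range(4):
        print(breaker.call(flaky, lambda: "cached fallback response"))
```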
“The day that there are clouds that have 100% uptime is the day when all problems in this world are eliminated,” declared John Strand of Strand Consulting, a Denmark-based consulting firm focused on telecom.
“Right now, everyone — and especially hyperscalers — is building tons of new data centers across the world,” he told TechNewsWorld. “The size and complexity of these centers is exploding, and when that happens, the risk of something going wrong increases. I’m sure that many of these problems will be eliminated over time, while new problems will arise.”
Misreading Meaning of Reliability
Enterprises don’t overestimate cloud reliability; they just misread what it really means, contended Sergiy Balynsky, vice president of engineering at Spin.AI, a Palo Alto, Calif.-based cybersecurity company specializing in protecting SaaS applications from ransomware, data loss, insider threats, and compliance risks. “The cloud isn’t a silver bullet,” he told TechNewsWorld. “It’s a shared responsibility model.”
He noted that the AWS outage illustrates that perfectly. “Cloud providers offer highly resilient building blocks — regions, availability zones, failover mechanisms — but it’s up to the enterprise to design for resilience and continuity,” Balynsky explained.
“That’s exactly what Business Continuity Planning (BCP) and strong architecture or SRE practices are for. BCP and SRE teams plan for failure, spread the risk, and keep critical systems running during outages. Relying on a single region or skipping redundancy isn’t a provider failure. It’s an architectural oversight,” he said.
If a customer is concerned about reliability, they can hedge their concerns by duplicating what they do in one region in another region, noted David Stone, director in the office of the CISO for Google Cloud.
“Customers can absolutely design in resiliency by using different data centers in other regions, deploying it into different zones in those regions, and being able to build out that architected framework, even to the point where they can build out applications spanning multicloud environments for resiliency,” he told TechNewsWorld.
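A crude illustration of the client-side half of that design: probe a primary region’s health endpoint and fall back to a standby region. The endpoint URLs below are hypothetical, and production systems usually push this failover into DNS or global load balancers rather than application code, but the idea is the same.

```python
# Hedged sketch of client-side regional failover. The URLs are hypothetical.
import urllib.request

ENDPOINTS = [
    "https://api.us-east-1.example.com/health",   # primary region
    "https://api.eu-west-1.example.com/health",   # standby region
]

def first_healthy(endpoints, timeout: float = 2.0) -> str | None:
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue   # region unreachable; try the next one
    return None        # every region down: fall back to BCP procedures

if __name__ == "__main__":
    print(first_healthy(ENDPOINTS) or "no healthy region")
```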
Srini Srinivasan, founder and CTO of Aerospike, a real-time NoSQL database company in Mountain View, Calif., added that cloud providers offer a variety of capabilities that allow any enterprise to deliver extremely high availability. “I mean like four nines,” he told TechNewsWorld.
“There is no reason that, using any of the existing cloud provider features and capabilities, an enterprise cannot achieve that,” he said. “The fallacy people have is that the cloud provider will solve everything for them.”
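The arithmetic behind “four nines” shows what enterprises are actually budgeting for: at 99.99% availability, the allowance is roughly 52.6 minutes of downtime per year. Composing two regions, under the idealized assumption that they fail independently, multiplies the nines:

```python
# Back-of-the-envelope availability math.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    budget = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines: {budget:8.1f} minutes of downtime per year")

# Idealized active-passive pair: an outage requires both regions to fail
# at once. Real regions share dependencies, so this is an upper bound.
single = 0.999
combined = 1 - (1 - single) ** 2
print(f"two independent 99.9% regions: {combined:.6%}")
```

The caveat in the last comment is the crux of the debate above: shared dependencies such as DNS or control planes mean regions rarely fail fully independently.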
Scale Does Not Equal Invulnerability
However, Aykut Duman, a partner in the digital and analytics practice at the global strategy and management consulting firm Kearney, pointed out that during the AWS outage, despite deploying workloads across multiple availability zones, organizations experienced complete downtime due to a DNS resolution failure that disrupted core services such as DynamoDB and EC2.
“This incident revealed that reliability depends as much on workload architecture and distribution as it does on provider infrastructure,” he told TechNewsWorld. “Enterprises often assume redundancy at the provider level guarantees uptime, but resilience must be deliberately engineered at the application level.”

“Enterprises overestimate cloud reliability because they often equate cloud scale with invulnerability,” he said. “While hyperscalers like AWS, Microsoft, and Google offer impressive uptime, no system is immune to failure.”
“Enterprises tend to underestimate how complex interdependent cloud services are, and how quickly cascading failures can occur across distributed systems,” he continued. “Reliability is high, but not absolute. The recent AWS outage exposed the misconception that cloud-native automatically means resilient.”
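Retry behavior is one of the classic amplifiers of the cascades Duman describes: when thousands of clients retry in lockstep, they can keep a recovering service down. Exponential backoff with jitter, sketched below against a hypothetical flaky call, is the standard mitigation:

```python
# Sketch of exponential backoff with full jitter, which spreads client
# retries out over time instead of piling them onto a recovering service.
import random
import time

def call_with_backoff(func, retries: int = 5, base: float = 0.5, cap: float = 30.0):
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise
            # full jitter: sleep a random amount up to the capped backoff
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise TimeoutError("transient failure")
        return "ok"

    print(call_with_backoff(flaky))
```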