
Hiding In Plain Sight

By Andrew Wade

27 April 2026

Executive Summary

Recent cloud outages (e.g. AWS Gulf data centers damaged during drone attacks, Azure’s Oct 2025 global outage) and strategic infrastructure threats (e.g. Russia-linked sabotage of undersea cables) have exposed overlooked vulnerabilities in cloud deployments. The core lesson of these recent upsets, and those that are sure to come, is not to scatter everything everywhere, but to recognise and remove single points of failure - whether that represents a datacenter, a cable, or an authority that your organisation relies upon just a little too heavily.

Identify which systems truly need cross-region or cross-country redundancy, and ensure they have tested recovery paths. Identify the critical dependencies in your cloud architecture (regional, control-plane, legal) and test your backup and failover plans for those cases. Treat sovereignty and provider boundaries as constraints to your design, not afterthoughts to be regretted. A targeted, prioritised, workload-specific approach beats panic migrations every day of the week.

Cloud services are robust, but they’re not magic shields against regional crises or geopolitical risk. Below, we use recent events to illustrate and highlight underlying principles to ensure no key resource is vulnerable to any one failure mode.

Resilience Obstacles are Hiding in Plain Sight

…but recent incidents make the pattern easier to see. In March 2026, AWS reported drone strikes on its Bahrain and UAE data centers [1]. AWS warned customers to activate disaster recovery (DR) plans and migrate away from those regions. On 20 Oct 2025 AWS had a multi-service outage from an internal DynamoDB DNS bug [2] and on 29 Oct 2025 Azure Front Door and Azure CDN suffered a multi-region incident after a sequence of valid configuration changes generated incompatible metadata that triggered a latent data-plane defect [3]. And this is nothing new; 2021 brought us a networking issue that caused cascading failures across multiple AWS services [4] and 2017 saw a major S3 outage during routine maintenance [5] that took out Slack, Trello, Quora, and others [6]. Separately, Western governments have warned about Russian vessel activity near undersea infrastructure [7], and Red Sea cable cuts have already caused regional disruption [8], although attribution for specific incidents is often contested.

These incidents demonstrate different failure modes all similarly rooted in singular dependencies. A missile can physically take out a datacenter; a misconfiguration can disable a control plane; an undersea cable cut can isolate an entire region. Critically, undersea cables carry nearly all long-distance Internet traffic [9] - an attack on them can materially degrade or isolate connectivity even when the cloud region itself remains healthy. The common thread is that one point of failure (region, link, or authority) can cascade.

Treat these triggers as drills. Ask where your architecture has analogous weak points. If all your critical apps are in one region, move one to a second region before a crisis. If a particular network link or ISP is essential, plan an alternate. Don’t wait for disasters to recover; simulate a region failure or cable cut and ensure your system recovers as gracefully as you expect it to.
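A region-failure drill can start on paper before anyone touches production. The following is a minimal sketch, assuming a hand-maintained inventory of services and the regions where each has a working replica; the `Service` type, the service names, and the `region-a`/`region-b` labels are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Service:
    name: str
    regions: FrozenSet[str]  # regions where a working replica exists
    critical: bool = False


def simulate_region_loss(services: Iterable[Service], failed_region: str) -> List[str]:
    """Return the names of services with no replica left once a region is gone."""
    return sorted(s.name for s in services if not (s.regions - {failed_region}))


def single_region_critical(services: Iterable[Service]) -> List[str]:
    """Return critical services that exist in exactly one region."""
    return sorted(s.name for s in services if s.critical and len(s.regions) == 1)


# Hypothetical inventory, for illustration only.
INVENTORY = [
    Service("payments-api", frozenset({"region-a"}), critical=True),
    Service("web-frontend", frozenset({"region-a", "region-b"}), critical=True),
    Service("reporting", frozenset({"region-b"})),
]

print(simulate_region_loss(INVENTORY, "region-a"))
print(single_region_critical(INVENTORY))
```

A table-top check like this only tells you where to look; the real test is still to fail the region in a controlled exercise and watch what actually breaks.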

Zonal, Regional, & Geopolitical Resilience

Cloud redundancy works in layers. Within a region, multiple availability zones (AZs) primarily reduce exposure to datacenter-scale failures such as power, cooling, or networking faults [10]. Across regions, you can design for region-level disaster recovery, but that still does not automatically address legal, identity, or geopolitical dependencies. For example, two AZs in California still fail together if an earthquake or cyberattack knocks out the entire state grid. Replicating data to another country removes that single point of failure, but it might violate local privacy laws.

If your organisation is debating “zonal redundancy vs multi-region vs sovereign cloud,” you are likely mixing threat models. Use this taxonomy to force clarity:

  • Operational resilience: Staying up through routine faults (instance failure, a rack issue, an AZ impairment).
  • Regional disaster recovery: Recovering when an entire region cannot meet your Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Geopolitical resilience: Managing disruption from conflict, sanctions, instability, and targeted action, where the event is not “technical” but still breaks availability.
  • Sovereignty / jurisdictional resilience: Ensuring your recovery option remains lawful and usable under relevant legal regimes and regulatory approvals.
  • Provider platform resilience: What the hyperscaler designs (fault isolation boundaries, resilient control planes/data planes, region independence).
  • Customer workload resilience: What you build and test (replication, restore, runbooks, identity/key/orchestration independence).

This becomes a tool to clearly communicate risk to your board: “We are operationally resilient to Availability Zone loss; we have regional Disaster Recovery to Region B within X hours; we are jurisdictionally compliant to fail over for workload class Y.”

AWS and others architect regions to be independent fault domains (each with separate infrastructure) [11][12], but this is an engineering stance, not a guarantee against all risks. Regions rely on shared global services (like identity), and geopolitical events (and missiles) don’t respect datacenter boundaries.

Align each workload’s criticality with an appropriate strategy. If sub-minute RTO is non-negotiable (financial transactions, healthcare systems), use multi-region active/passive with automated failover, spanning different jurisdictions as compliance allows. If data must stay onshore for legal reasons, consider local-only DR even if it is slower. For everything else, a well-tested multi-AZ plan may suffice. Your cloud service provider’s documentation (AWS DR options [13], Google’s best practices [14]) is key to guiding each choice, but your own documentation should define which systems belong in which “resilience tier.”
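The tiering rule described above can be written down so that it is applied consistently across workloads. This is a minimal sketch, not a prescribed policy: the `Workload` fields, the 60-second threshold standing in for "sub-minute", and the tier labels are assumptions chosen for illustration, and the sketch lets the onshore legal constraint take priority over the RTO target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    rto_seconds: int         # maximum tolerable downtime
    must_stay_onshore: bool  # legal requirement to keep data in-country


def resilience_tier(workload: Workload, fast_rto_seconds: int = 60) -> str:
    """Map a workload to a resilience tier (hypothetical labels)."""
    # Legal constraints win: onshore-only data gets in-country DR, even if slower.
    if workload.must_stay_onshore:
        return "in-country DR"
    # Sub-minute RTO calls for multi-region with automated failover.
    if workload.rto_seconds < fast_rto_seconds:
        return "multi-region active/passive"
    # Everything else: a well-tested multi-AZ plan.
    return "multi-AZ"


# Hypothetical workloads, for illustration only.
for w in (
    Workload("ledger", rto_seconds=30, must_stay_onshore=False),
    Workload("patient-records", rto_seconds=30, must_stay_onshore=True),
    Workload("internal-wiki", rto_seconds=3600, must_stay_onshore=False),
):
    print(w.name, "->", resilience_tier(w))
```

A workload that needs both a sub-minute RTO and onshore-only data is exactly the case to escalate, since the two requirements may not be satisfiable together.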

 

| Resilience Option | Cost / Complexity | Sovereignty Fit | RTO / RPO | Control-Plane Dependency | Use-Case Examples |
|---|---|---|---|---|---|
| Multi-AZ (single region) | Low / Low | High (stays in country) | Moderate (minutes) | Low (only one region’s IAM) | High-availability intra-country apps (e.g. banking). |
| Multi-Region (same provider) | Medium / Medium | Variable (can be in-country or cross-border depending on region choice) | High (seconds-minutes) | Medium (depends on identity and control-plane design) | Global services requiring fast failover (e.g. public web apps). |
| Cross-Jurisdiction | High / High | Variable (multinational) | High (if active-active) | High (multi-cloud IAM) | Very critical data (national infra, finance) needing geo-diversity. |
| Hybrid/Offline (on-prem or alternate cloud) | Medium / High | High (choose location) | Variable (often hours) | Low (manual processes) | Ransomware recovery, regulatory archives (air-gapped backups). |


Shared Responsibility

A recovery environment is only useful if you can still authenticate into it, decrypt what you need, and trigger the recovery path. Shared responsibility varies by service model, but customers still own key responsibilities for data, identities, access, and workload configuration [15]. This split matters most when the provider’s infrastructure fails you.

For instance, if AWS IAM or Azure AD (now Microsoft Entra ID) were knocked offline, even a healthy backup region could become unreachable because you would have no way to obtain access tokens. The Oct 2025 Azure outage was a stark example of control-plane risk: configuration changes to Azure Front Door triggered a latent data-plane defect and disrupted traffic across multiple regions [3]. You can have perfectly replicated servers, but if you lose the management plane, failover may not trigger.

Don’t neglect the control plane. Where supported, use regional STS and SAML paths [16] rather than depending on a single global endpoint. Review whether single-region key dependencies are acceptable, and use multi-Region or alternate key-recovery designs where appropriate [17]. Maintain “break-glass” credentials outside your normal identity path, in a separate region or account, and consider an external identity provider for emergency access. Test your failover by simulating control-plane loss: for example, temporarily disable your primary admin credentials in a test environment and confirm that your recovery scripts still run using the break-glass path.
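As one illustration of preferring regional identity paths, the sketch below builds regional AWS STS endpoint URLs and picks the first one that a health check reports as reachable. The hostname pattern `sts.<region>.amazonaws.com` applies to the standard commercial AWS partition; the `probe` callable and the region ordering are assumptions, stand-ins for whatever reachability check and preference order your recovery tooling actually uses.

```python
from typing import Callable, Iterable, Optional


def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL (standard commercial partition)."""
    return f"https://sts.{region}.amazonaws.com"


def first_reachable(
    regions: Iterable[str], probe: Callable[[str], bool]
) -> Optional[str]:
    """Return the first regional endpoint the probe accepts, or None.

    `probe` is a hypothetical health check supplied by the caller; it
    receives the endpoint URL and returns True if it is usable.
    """
    for region in regions:
        url = regional_sts_endpoint(region)
        try:
            if probe(url):
                return url
        except OSError:
            # Treat network-level errors as "unreachable" and try the next region.
            continue
    return None


# Drill: pretend the primary region's endpoint is down.
def fake_probe(url: str) -> bool:
    return "us-east-1" not in url


print(first_reachable(["us-east-1", "eu-west-1"], fake_probe))
```

Injecting the probe keeps the selection logic testable without network access, which matters in a drill where the thing being simulated is the network or control plane being unavailable.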

| Dependency | Failure Mode | Mitigation |
|---|---|---|
| Identity (IAM) | Locked out (cannot authenticate) | Multi-region identity (global and regional endpoints), offline break-glass admin account. |
| Key Management (KMS) | Keys locked in failed region | Replicate keys to another region or backup key; use external escrow. |
| Orchestration (APIs) | Cannot redeploy if API is down | Keep IaC code in an external repo; pre-provision DR-region resources/quota. |
| Backups/Logs | No recent backup or logs if locked out | Cross-region or offline immutable backups; test restores. |
| DNS/Networking | DNS failover doesn’t propagate | Use multi-cloud DNS (Route53 health checks, Cloudflare load balancing) for automated rerouting. |

Key Management

The UK’s NCSC explicitly encourages users to trust cloud KMS if they trust a cloud with their workloads [18], and customer-managed keys (CMKs) empower you with key control in a cloud KMS. Major providers let you create and rotate CMKs, often with import or HSM options. Customer-managed keys can improve policy control, auditability, and some compliance outcomes, but if the keys remain within the provider’s KMS they do not eliminate dependence on that provider’s legal and operational environment. If a provider is forced to surrender keys (e.g. by law enforcement), having a CMK in their service doesn’t eliminate that risk. Region-specific key dependencies can also complicate recovery; if the KMS service in one region is down, any data encrypted with that key could be locked until the service comes back.

Use CMKs for critical data, but design for key failure. For instance, replicate or copy keys to a second region’s KMS, or maintain a copy in an external HSM. Plan your key-rotation policy and practice key recovery drills. If a key is lost or compromised, how quickly can you generate a replacement and re-encrypt the affected data? How easily can you decrypt a backup using an alternate key? Remember: CMKs add a layer of control, but they are only part of a broader resilience strategy.

Sovereignty & Regulatory Constraints

Disaster recovery is not just a technical question; a DR site is only a real recovery option if it remains lawful and operationally available during the incident. Even after careful technical planning, data laws can override your technical design. The US CLOUD Act (2018) can require disclosure of data within the possession, custody, or control of a provider subject to U.S. jurisdiction, even when the data is stored abroad [19]. GDPR and local privacy laws (e.g. Brazilian LGPD, Saudi Arabia’s PDPL) have strict data-export rules [20][21][22]. Providers offer “sovereign” options (EU Data Boundary [23], Azure Government [24], AWS European Sovereign Cloud [25], etc.), but sovereign offerings differ materially: some are policy/data-boundary controls, some are physically isolated national clouds, and some are separately operated sovereign environments. Read the scope carefully.

You must plan with regulations, not around them. For example, moving EU citizen data to an out-of-Europe DR site could violate GDPR unless appropriate third-country transfer mechanisms apply. Following the recent drone strikes that hit AWS data centers, at least one insurance firm experienced significant disruption while it awaited emergency waivers from the UAE Central Bank to use out-of-region cloud backups [26]. Even with careful sovereignty planning, U.S. authorities could still invoke the CLOUD Act against data held in a European AWS region [27] (cf. [28]), so “in-country” is not a legal guarantee.

Build a compliance map and design your DR architecture to fit. For each workload, list its legal constraints and approved geographic zones. If laws forbid cross-border sync, set up in-country standby sites. If cross-border sync is allowed only in emergencies, document the process for obtaining waivers and expect delays. Treat “sovereign cloud” claims skeptically: read the fine print and factor them into your risk analysis, rather than treating them as catch-all solutions.
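A compliance map can be as simple as a lookup that your failover runbook consults before moving data. The following is a minimal sketch under stated assumptions: the workload names, zone labels, and the three-way split into approved, waiver-only, and forbidden targets are hypothetical, and the real entries must come from your legal and compliance teams rather than from code.

```python
from typing import Dict, Set

# Hypothetical compliance map: for each workload, the zones approved for
# routine replication and the zones usable only with an emergency waiver.
COMPLIANCE_MAP: Dict[str, Dict[str, Set[str]]] = {
    "customer-records": {
        "approved": {"eu-central", "eu-west"},
        "waiver_only": set(),
    },
    "claims-processing": {
        "approved": {"me-central"},
        "waiver_only": {"eu-west"},
    },
}


def check_dr_target(
    compliance_map: Dict[str, Dict[str, Set[str]]], workload: str, target_zone: str
) -> str:
    """Classify a proposed DR target for a workload."""
    entry = compliance_map.get(workload)
    if entry is None:
        # An unmapped workload is itself a finding: nobody has reviewed it.
        return "unmapped"
    if target_zone in entry["approved"]:
        return "allowed"
    if target_zone in entry["waiver_only"]:
        return "waiver required"
    return "forbidden"


print(check_dr_target(COMPLIANCE_MAP, "claims-processing", "eu-west"))
```

Returning "waiver required" as a distinct outcome makes the expected delay visible in planning, instead of being discovered mid-incident.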

Data Flows & Supply Chains

Clouds don’t exist in a vacuum, and most Internet and data flows depend on a few critical pieces of infrastructure. Nearly all intercontinental data crosses undersea cables [9]. Western governments have warned that Russia is developing and deploying naval capabilities that threaten these cables [7], and Red Sea cable cuts have already caused regional Internet outages [8].

A cut cable can be like losing a region. Even if your cloud is healthy, customers can be cut off if the physical network between you and them breaks at this scale. And don’t forget other supply links: failures at critical DNS/CDN providers or major ISPs, or in the electricity and water supply for data centers, can also compromise access at scale.

Map these flows as rigorously as you map cloud dependencies. Identify which cables your traffic uses and where alternate routing exists. For critical data flows, use redundant CDNs or multiple transit providers to avoid any one undersea link. Include satellite backup or other out-of-band links if cable risk is high. In your DR exercises, simulate a cable cut: reroute traffic via alternate paths and ensure your operations continue. Remember, resilience is not just in the cloud; it’s in every step along the way.
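Once the flows are mapped, finding a hidden shared dependency is a set intersection. This is a minimal sketch, assuming each network path has been written down as a list of the components it traverses; the path names and the component labels (`transit-1`, `cable-x`, `cdn-a`, and so on) are hypothetical.

```python
from typing import Dict, List, Set


def shared_components(paths: Dict[str, List[str]]) -> Set[str]:
    """Components present on every path: candidate single points of failure."""
    component_sets = [set(components) for components in paths.values()]
    if not component_sets:
        return set()
    return set.intersection(*component_sets)


def surviving_paths(paths: Dict[str, List[str]], failed: str) -> List[str]:
    """Names of the paths that do not traverse the failed component."""
    return sorted(name for name, components in paths.items() if failed not in components)


# Hypothetical flow map: two "redundant" paths that share one cable and one CDN.
PATHS = {
    "primary": ["transit-1", "cable-x", "cdn-a"],
    "backup": ["transit-2", "cable-x", "cdn-a"],
}

print(sorted(shared_components(PATHS)))
print(surviving_paths(PATHS, "cable-x"))
```

In this made-up example the backup path protects against losing a transit provider but not against a cut to the shared cable, which is the kind of finding the mapping exercise is meant to surface.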

Hyperscaler Architecture Philosophy

By now it should be clear: Providers want you to architect resilience of your own. Hyperscalers provide meaningful fault-isolation primitives, but resilience behavior is service-specific. Azure region pairs support some services and some aspects of disaster recovery [29], and some Google multi-regional services are designed to withstand the loss of an entire region [30].

So, do trust the design to a point. Leverage regional services, but verify them yourself. Align your design with these principles. Wherever possible, call region-specific endpoints (e.g. use a region’s SNS topic, or a region-local Redis cluster) so failures don’t ripple [12]. Read each cloud’s DR documentation closely and do practice runs to understand first-hand how it’s likely to affect you when the unexpected happens. Shut off a region’s nodes, see what breaks, and fix those gaps. Stay up-to-date on provider resiliency best-practices (the Azure/AWS postmortems are highly educational [2][3][4][5]).

Hybrid & Offline Recovery

Some contingencies require stepping outside the cloud entirely. CISA and NCSC recommend keeping offline and even air-gapped backups of critical data [31][32]. A hybrid model (cloud + on-prem or alternate cloud) can mitigate worst-case risks (total provider outage, legislative cut-off), but it adds cost and complexity. It can also violate sovereignty rules if it’s mishandled (e.g. replicating EU health data to a US site).

Prepare for complete cloud loss scenarios, even if they’re unlikely in their totality or longevity. Maintain copies of data and code offline (encrypted and stored securely). For critical systems, have a bare-metal or alternate-cloud failover plan (e.g., spin up VMware images from backups on a private cluster). Ensure that third-party SaaS tools you rely on have emergency access methods (some offer “break glass” token codes). Document and test these fallbacks at least annually.
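The "at least annually" rule is easy to let slip, so it helps to check it mechanically. The sketch below flags fallbacks whose last successful test is older than a maximum age, or that have never been tested. The fallback names and dates are hypothetical, and the 365-day default is an assumption standing in for "annually".

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def overdue_drills(
    last_tested: Dict[str, Optional[date]],
    today: date,
    max_age: timedelta = timedelta(days=365),
) -> List[str]:
    """Return fallbacks never tested, or last tested more than max_age ago."""
    return sorted(
        name
        for name, tested in last_tested.items()
        if tested is None or today - tested > max_age
    )


# Hypothetical drill log, for illustration only.
DRILL_LOG = {
    "offline-backup-restore": date(2025, 1, 10),
    "alternate-cloud-failover": date(2025, 11, 1),
    "saas-break-glass-access": None,  # never exercised
}

print(overdue_drills(DRILL_LOG, today=date(2026, 4, 27)))
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the check reproducible when it is run as part of an audit or a scheduled job.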

Five Questions To Ask Before The Next Incident

  1. What are our single-region or single-cable dependencies? Identify any service, database, or network link that exists in only one place.
  2. Can we truly fail over? When was the last time we rebuilt from backup in another region/account? Does DNS failover or traffic routing work under fire?
  3. How do we handle identity and key loss? If the primary IAM/KMS goes offline, do we have alternate logins or key access?
  4. What legal boundaries apply? Which workloads and data are restricted to certain countries or clouds? Are there regulatory approvals needed for emergency moves?
  5. What offline options exist? Do we have recent offline backups or a cold-standby site? Could we run a minimal version of services without Internet access, enough to maintain essential operations?

Answering these will almost certainly expose gaps in your DR plan. For example, you might discover a mission-critical API only runs in “us-east-1” (Q1), or that your legal team hasn’t approved the planned DR site (Q4).

Conclusion

The most dangerous dependencies are often the ones outside the obvious workload design. No organisation should be surprised by these issues; they’ve been hiding in plain sight this whole time. The cloud industry designs for resilience, but the ultimate responsibility lies with you. The key takeaway: Don’t let any single region, cable, or control-plane be your Achilles’ heel. By mapping dependencies, respecting jurisdictional limits, and rehearsing recoveries now, you ensure a smoother outcome when real disruptions occur. Remember, it’s better to find a problem in a drill than in a crisis.

References:

[1] Reuters, "Amazon cloud unit flags issues after Bahrain, UAE data centers were hit amid Iran strikes" https://www.reuters.com/world/middle-east/amazon-cloud-unit-flags-issues-bahrain-uae-data-centers-amid-iran-strikes-2026-03-02/

[2] About Amazon, "Update - AWS services operating normally" https://www.aboutamazon.com/news/aws/aws-service-disruptions-outage-update

[3] Microsoft Azure Status History, "Post Incident Review - Azure Front Door / Azure CDN connectivity issues across multiple regions" https://azure.status.microsoft/en-us/status/history/?trackingId=YKYN-BWZ

[4] AWS, "Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region" https://aws.amazon.com/message/12721/

[5] AWS, "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region" https://aws.amazon.com/message/41926/

[6] BBC News, "Major websites hit by Amazon outage" https://www.bbc.com/news/world-us-canada-39119089

[7] Reuters, "UK monitors Russian spy ship, steps up undersea cable protection" https://www.reuters.com/world/uk/uk-monitors-russian-spy-ship-steps-up-undersea-cable-protection-2025-01-22/

[8] Reuters, "Red Sea cable cuts disrupt internet across Asia and the Middle East" https://www.reuters.com/world/middle-east/red-sea-cable-cuts-disrupt-internet-across-asia-middle-east-2025-09-07/

[9] ITU, "Submarine cable resilience" https://www.itu.int/en/mediacentre/backgrounders/Pages/submarine-cable-resilience.aspx

[10] AWS, "Availability Zones - AWS Fault Isolation Boundaries" https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html

[11] AWS, "AWS Fault Isolation Boundaries" https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-fault-isolation-boundaries/aws-fault-isolation-boundaries.pdf

[12] AWS, "Global services - AWS Fault Isolation Boundaries" https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/global-services.html

[13] AWS, "REL13-BP02 Use defined recovery strategies to meet the recovery objectives" https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_planning_for_recovery_disaster_recovery.html

[14] Google Cloud, "Disaster recovery planning guide" https://docs.cloud.google.com/architecture/dr-scenarios-planning-guide

[15] Microsoft Learn, "Shared responsibility in the cloud" https://learn.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility

[16] AWS, "Manage AWS STS in an AWS Region" https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html

[17] AWS KMS, "Multi-Region keys in AWS KMS" https://docs.aws.amazon.com/kms/latest/developerguide/multi-region-keys-overview.html

[18] NCSC, "Mythbusting cloud key management services" https://www.ncsc.gov.uk/blog-post/mythbusting-cloud-key-management-services

[19] U.S. Department of Justice, "CLOUD Act White Paper" https://www.justice.gov/d9/press-releases/attachments/2019/04/10/department_of_justice_cloud_act_white_paper_2019_04_10_final_0.pdf

[20] GDPR, "Chapter V - Transfers of personal data to third countries or international organisations" https://gdpr-info.eu/chapter-5/

[21] International Trade Administration, "Brazil's new rules on international data transfers" https://www.trade.gov/market-intelligence/brazils-new-rules-international-data-transfers

[22] IAPP, "Saudi PDPL’s first anniversary: Amendments, enforcement and ongoing developments" https://iapp.org/news/a/saudi-pdpl-s-first-anniversary-amendments-enforcement-and-ongoing-developments

[23] Microsoft Learn, "What is the EU Data Boundary?" https://learn.microsoft.com/en-us/privacy/eudb/eu-data-boundary-learn

[24] Microsoft Learn, "Azure Government documentation" https://learn.microsoft.com/en-us/azure/azure-government/

[25] AWS, "Opening the AWS European Sovereign Cloud" https://aws.amazon.com/blogs/aws/opening-the-aws-european-sovereign-cloud/

[26] Reuters, "India's Policybazaar UAE unit expects full recovery within 48 hours after AWS disruption" https://www.reuters.com/business/indias-policybazaar-uae-unit-expects-full-recovery-within-48-hours-after-aws-2026-03-06/

[27] AWS, "Clarifying Lawful Overseas Use of Data (CLOUD) Act" https://aws.amazon.com/compliance/cloud-act/

[28] Heise, "Canadian Court: OVHcloud from France must hand over user data" https://www.heise.de/en/news/Canadian-Court-OVHcloud-from-France-must-hand-over-user-data-11092029.html

[29] Microsoft Learn, "Azure region pairs and nonpaired regions" https://learn.microsoft.com/en-us/azure/reliability/regions-paired

[30] Google Cloud, "Patterns for scalable and resilient apps" https://docs.cloud.google.com/architecture/scalable-and-resilient-apps

[31] CISA, "Cybersecurity Performance Goals 2.0" https://www.cisa.gov/cybersecurity-performance-goals-2-0-cpg-2-0

[32] NCSC, "Data security" https://www.ncsc.gov.uk/collection/10-steps/data-security