04 November 2025
October saw Amazon Web Services (AWS) suffer one of its most dramatic outages in recent memory, knocking a huge swathe of the internet offline for hours.
The trouble began in the US-EAST-1 region (Northern Virginia), home to one of AWS’s most critical infrastructure hubs. The root cause, as AWS later disclosed, was a bug in its automated DNS management system tied to DynamoDB (AWS’s scalable NoSQL database). A DNS record was corrupted and failed to self-repair, which prevented many services from correctly resolving DNS queries. Cascading failures ensued.
At the height of the outage, major apps and services were affected, including Snapchat, Fortnite, Ring, Roblox, cloud-hosted websites, and even Amazon’s own Alexa and Prime services.
AWS declared the underlying issue fully mitigated within hours of the outage, though backlogs and degradation in related subsystems lingered. In response, AWS disabled the faulty automation tool and began adding further safeguards.
The outage has spotlighted a perennial worry: the fragility that comes from over-reliance on a handful of cloud behemoths.
Stewart Laing, CEO at Asanti Data Centres, warns this is a wake-up call about “a single point in a single location, coupled with an over-reliance on one provider, and the absence of robust resilience planning.”
While the major cloud service providers do have their obvious advantages, “they also have some fairly significant drawbacks which can be lost in the noise generated by their marketing buzz,” explains Matt Seaton, Director, Netwise. “By deploying services with the likes of AWS, you’re placing your delicate eggs into a very large basket and handing over total operational control of said basket to an enormous, faceless corporate entity with little in the way of transparency.”
Indeed, the AWS outage shows what happens when you build everything on one foundation.
“This is not just about uptime,” says Laing. “It’s about resilience by design, and asking the hard question: where was your business continuity plan? It also, again, calls into question the UK government’s ‘cloud first’ policy.”
According to Kashif Nazir, Senior Technical Architect at Cloudhouse, many enterprises have redundancy plans, but too often these systems go untested, and when disaster strikes, backups fail.
“True resilience isn’t just about designing for failure, it requires actively validating disaster recovery and multi-cloud strategies,” continues Nazir. “Equally important is understanding your service supply chain. Even if you’re not a direct customer of a major provider like AWS, the services you rely on may be, leaving hidden single points of failure. Simulating failures through chaos engineering helps organisations test these vulnerabilities in practice, ensuring continuity and resilience when it really matters.”
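To make Nazir’s point concrete, the kind of failure simulation he describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any tool used by AWS or Cloudhouse: the provider names, the `ProviderDown` exception and the helper functions are all hypothetical, and the "providers" are simulated in-process rather than real cloud endpoints.

```python
import random
from typing import Callable, Sequence


class ProviderDown(Exception):
    """Raised when a (simulated) provider cannot serve a request."""


def make_provider(name: str, failure_rate: float,
                  rng: random.Random) -> Callable[[str], str]:
    """Return a fake provider call that fails with the given probability."""
    def call(request: str) -> str:
        # Fault injection: pretend the provider is unreachable.
        if rng.random() < failure_rate:
            raise ProviderDown(f"{name} unavailable")
        return f"{name} handled {request}"
    return call


def call_with_failover(providers: Sequence[Callable[[str], str]],
                       request: str) -> str:
    """Try each provider in order; raise only if every one of them fails."""
    last_error = None
    for provider in providers:
        try:
            return provider(request)
        except ProviderDown as exc:
            last_error = exc
    raise ProviderDown("all providers failed") from last_error


def run_experiment(primary_failure_rate: float, trials: int = 1000,
                   seed: int = 0) -> float:
    """Return the fraction of requests served while the primary is degraded."""
    rng = random.Random(seed)
    primary = make_provider("primary", primary_failure_rate, rng)
    secondary = make_provider("secondary", 0.0, rng)  # assumed healthy
    served = 0
    for i in range(trials):
        try:
            call_with_failover([primary, secondary], f"req-{i}")
            served += 1
        except ProviderDown:
            pass
    return served / trials
```

Calling `run_experiment(1.0)` simulates a total outage of the primary; if the failover path works, every request should still be served by the secondary. Removing the secondary from the list reproduces the single-provider case, in which every request fails. A real exercise would inject faults into staging or production dependencies rather than into in-process stand-ins.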
“This incident is a timely reminder that resilience should be built into every layer of data centre infrastructure, especially the physical equipment powering them,” concludes Alice Oakes, Service and Support Manager at Arfon. “With billions set to be invested in UK data centres over the coming years, operators have a golden opportunity to futureproof their facilities. Predictive maintenance should be a cornerstone of both new build and retrofit facilities to adapt and ensure continuity in a sector where downtime simply isn’t an option.”



