Reading through the US-EAST-1 Service Disruption Summary Report
On October 19 and 20, 2025, the AWS North Virginia (us-east-1) region faced a disruption that took down services worldwide, from small websites to large e-commerce sites (including Amazon itself), banks, and government services. According to AWS, the event started at 11:48 PM PDT on October 19 and ended at 2:20 PM PDT on October 20. Shortly afterwards, AWS released a summary of the event, which serves both as a sneak peek into the inner workings of AWS and as a case study of how complex systems fail. This article is an attempt at unraveling that summary, walking through the original timeline and adding personal comments to it. I’ll be quoting the original AWS document throughout, editing out redundant portions but without changing the content or meaning of the original text.
Glossary
To make the text a bit more accessible to a wider audience, I’ll put here a brief explanation of some of the terms used below:
IP address - Logical address of a system on the Internet or similar internal networks. This is what computers use to reference other computers when trying to connect to them.
DNS server - Domain Name System server. This maps names to IP addresses. For example: www.google.com → <IP address of the systems providing Google's web search services>. The first step in any attempt to access a service on the Internet consists of checking a DNS server for the appropriate IP address to use when trying to reach that service. Only then do computers try to reach the IP address returned from this inquiry.
Load Balancer - A server which acts as a proxy for other services behind it, redirecting connections according to availability and capacity. You can imagine going to a laundry service and handing your clothes to a person who puts them in one of the available laundry machines and later returns them to you. This person doesn’t do the laundry and you don’t choose the machine, nor can you infer how many machines there are. But from the customer's perspective, this person does the laundry, in the sense that you give them dirty clothes and receive clean clothes back.
Stack - A data structure that works as an ordered pile of things: you put stuff on top of it and you pick stuff from the top of it. The last item put on top of it is the next item that's going to be picked up, unless someone puts more stuff on it before someone picks up the last item put there. You can imagine a pile of dirty dishes, for example.
Stochastic Process - A process whose behavior is randomly determined, presenting a pattern that may be analyzed statistically but may not be predicted precisely.
Long Tail Events - Probabilistic events with such a small likelihood that they are often treated as if they never occur at all.
Lock - A synchronization primitive used to prevent a record from being modified by multiple systems at the same time, which could lead to inconsistencies. For example: imagine two people trying to transfer cash to the same bank account at the same time. Each tries to apply: new balance = old balance + transfer value. The desired result is new balance = old balance + first value + second value, but without the proper controls, it might end up being either new balance = old balance + first value or new balance = old balance + second value. (A small code sketch at the end of this glossary illustrates this.)
EIP - Elastic IP Address. An IP address within AWS's network that can be dynamically attached to servers on their network.
Hypervisor - A software that creates and runs virtual machines by abstracting and allocating a single physical server's resources like CPU, memory, and storage to multiple guest operating systems.
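To make the Lock example above concrete, here's a tiny Python sketch, with a hypothetical account and invented numbers, of two concurrent deposits. The read-modify-write is what needs protection; with the lock held around it, the result is always the sum of both deposits.

```python
import threading

balance = 100                  # shared "bank account" record
lock = threading.Lock()

def deposit_unsafe(amount):
    global balance
    read = balance             # both depositors may read the same old balance...
    balance = read + amount    # ...so one deposit can silently overwrite the other

def deposit_safe(amount):
    global balance
    with lock:                 # only one depositor may read-modify-write at a time
        balance = balance + amount

t1 = threading.Thread(target=deposit_safe, args=(50,))
t2 = threading.Thread(target=deposit_safe, args=(30,))
t1.start()
t2.start()
t1.join()
t2.join()
print(balance)  # always 180 with the lock; the unsafe version could print 150 or 130
```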
DynamoDB Request Routing Failure
Between 11:48 PM PDT on October 19 and 2:40 AM PDT on October 20, customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service.
The incident was triggered by a latent defect within the service’s automated DNS management system that caused endpoint resolution failures for DynamoDB.
The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
To explain this event, we need to share some details about the DynamoDB DNS management architecture. The system is split across two independent components:
The DNS Planner, monitors the health and capacity of the load balancers and periodically creates a new DNS plan for each of the service’s endpoints consisting of a set of load balancers and weights.
The DNS Enactor, enacts DNS plans by applying the required changes in the Amazon Route53 service. The DNS Enactor operates redundantly and fully independently in three different Availability Zones (AZs). Each of these independent instances of the DNS Enactor looks for new plans and attempts to update Route53 by replacing the current plan with a new plan.
As users, we can see DynamoDB as the abstraction of an infinite NoSQL database, which is highly durable, highly available and highly performant. Here AWS shares a bit of how this is done:
Each region has a large, dynamically changing number of servers running DynamoDB instances.
Servers get mapped to load balancers, through an undisclosed process — either by self-registering or by some orchestration mechanism. The end result is that each load balancer is responsible for a set of servers.
A planning system, the DNS Planner, monitors the load balancers' health and decides whether they should be part of the regional DynamoDB fleet, and how much they should contribute to it. The DNS Planner establishes a plan, but doesn’t execute it.
The plan is essentially a weighted list of load balancer IP addresses. And a deployment system, the DNS Enactor, acts on the Planner's plans by atomically deploying them to Route53, so that users know where they should connect to reach DynamoDB.
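AWS doesn't show what a plan actually looks like, but from the description (a weighted list of load balancer IP addresses per endpoint, produced in successive generations), a hypothetical plan could be as simple as the sketch below. The field names and addresses are entirely invented; only the idea comes from the summary.

```python
# Hypothetical shape of a DNS plan; invented field names and addresses.
dns_plan = {
    "generation": 421_337,                     # monotonically increasing plan version
    "endpoint": "dynamodb.us-east-1.amazonaws.com",
    "records": [
        {"ip": "203.0.113.10", "weight": 40},  # healthy load balancer, plenty of capacity
        {"ip": "203.0.113.11", "weight": 40},
        {"ip": "203.0.113.12", "weight": 20},  # less capacity, so a lower weight
    ],
}
```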
Under normal operations, a DNS Enactor picks up the latest plan and begins working through the service endpoints to apply this plan. This process typically completes rapidly and does an effective job of keeping DNS state freshly updated. Before it begins to apply a new plan, the DNS Enactor makes a one-time check that its plan is newer than the previously applied plan. As the DNS Enactor makes its way through the list of endpoints, it is possible to encounter delays as it attempts a transaction and is blocked by another DNS Enactor updating the same endpoint. In these cases, the DNS Enactor will retry each endpoint until the plan is successfully applied to all endpoints.
Here we can infer three things:
The communication path between the DNS Planner and the DNS Enactor behaves like a stack. Whenever a DNS Planner runs, it puts its new plan on top of the stack; whenever the DNS Enactor runs, it executes whatever is the newest plan on top of the stack.
The job of the DNS Enactor is to pull the latest plan and deploy it across several DNS server instances, which should quickly start answering queries based on that plan. So the DNS Enactor is an orchestrator for a distributed DNS system.
An endpoint gets an update lock when an Enactor is working on it. And if a second Enactor reaches it when it's locked, it'll keep retrying the update until it's able to perform the task.
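Putting those three inferences together, here's a rough sketch of what one DNS Enactor run might look like. Everything here is a guess: the interfaces (route53, plan_store) stand in for AWS-internal systems and the clean-up threshold is invented; the sketch only tries to capture the behavior described in the summary.

```python
import time

class EndpointLocked(Exception):
    """Raised when another Enactor is currently updating the same endpoint."""

MANY_GENERATIONS = 100   # invented threshold used by the clean-up step

def enact(plan, endpoints, route53, plan_store):
    # One-time check, as described by AWS: only start if this plan is newer
    # than the previously applied one.
    if plan["generation"] <= route53.last_applied_generation():
        return

    for endpoint in endpoints:
        while True:
            try:
                # Atomically replace the endpoint's records with this plan.
                route53.apply(endpoint, plan)
                break
            except EndpointLocked:
                time.sleep(1)   # blocked by another Enactor: keep retrying

    # Housekeeping: delete plans "many generations" older than the one just
    # applied. Combined with a badly delayed Enactor elsewhere, this is the
    # step that ended up deleting a plan that was still in use.
    plan_store.delete_older_than(plan["generation"] - MANY_GENERATIONS)
```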
Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening.
This, in my view, is the root cause. Unfortunately, the people at AWS felt it wasn’t appropriate to comment on why it experienced unusually high delays. Maybe to preserve some industry secret, maybe because the process is intrinsically stochastic and they hit a long tail. But here is where the chain reaction starts: one instance of the DNS Enactor getting stuck for an unexpectedly long time while applying a plan.
First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints.
The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them.
I can imagine that when this system was first designed, the architect didn’t envision a situation where a second Enactor would start after the first one, skip the line, and finish first. Earlier we learned that an Enactor checks that its plan is newer than the previously applied one before it starts, and that when two Enactors try to change the same endpoint, one waits for the other to release its lock. I can also imagine that in the architect's mind an Enactor applying a newer plan could end up waiting for the first Enactor to finish updating an endpoint before applying its own plan. What seems to have happened was the opposite: for some reason, the first Enactor (with the older plan) ended up stuck waiting for lock releases from the second Enactor (with the newer plan). And then the second Enactor, in this unexpected state, performed its perfectly reasonable (within the original assumptions) housekeeping job of deleting the plan still being applied by the other Enactor.
At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan.
The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.
The meaning of “many generations” is unclear, but I can totally see someone looking at test or statistical data in which no plan older than N generations is ever needed, hardcoding many = 2 * N or something like it, and then we hit a long tail.
As this plan was deleted, all IP addresses for the regional endpoint were immediately removed. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors.
It isn’t clear to me what they mean by saying “IP addresses for the regional endpoint were removed”. Does it mean that the IP addresses were disconnected from the DynamoDB instances running the service? Or that the DNS servers rely on an outside source of data for the plan, which vanished under them? When I first read this, my interpretation was the latter, though now I think only the former makes sense:
If the DNS endpoints use an outside source of data, why would they need to run a process updating them one by one instead of just updating this data source? Plus, this would create an extra point of failure.
On the other hand, it does make sense that the instances get an EIP, this EIP gets tied to a plan and the EIP can only be deallocated from the instance and reused after the resource using it (the plan) doesn’t exist anymore.
The problem was: the plan was being used. There seems to be a missing piece here tracking plan usage, instead of assuming a plan isn’t in use just because it is many generations behind the latest applied one (though being many generations behind could also lead to a collapse later, even if the rug weren’t pulled from under the DNS endpoints).
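If the missing piece really is usage tracking, the guard could be as small as the sketch below: before deleting an "old" plan, check whether any endpoint still points at it. Again, all the names are mine and this is just my reading of the report, not AWS's announced fix.

```python
def safe_cleanup(plan_store, route53, just_applied_generation, many_generations=100):
    # Whatever generation each endpoint currently points at must survive,
    # no matter how old it is.
    in_use = {route53.applied_generation(ep) for ep in route53.endpoints()}

    for plan in plan_store.plans_older_than(just_applied_generation - many_generations):
        if plan["generation"] in in_use:
            continue             # old, but still active: skip it instead of deleting it
        plan_store.delete(plan)
```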
When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB.
Fair.
EC2 Instance Management Failure
DropletWorkflow Manager
Between 11:48 PM PDT on October 19 and 1:50 PM PDT on October 20, customers experienced increased EC2 API error rates, latencies, and instance launch failures in the N. Virginia (us-east-1) Region.
During this period new instance launches failed with either a “request limit exceeded” or “insufficient capacity” error.
To understand what happened, we need to share some information about a few subsystems that are used for the management of EC2 instance launches:
The DropletWorkflow Manager (DWFM) is responsible for the management of all the underlying physical servers (a.k.a. “droplets”) that are used for the hosting of EC2 instances.
The Network Manager is responsible for the management and propagation of network state to all EC2 instances and network appliances.
Each DWFM manages a set of droplets within each Availability Zone and maintains a lease for each droplet currently under management. This lease allows DWFM to track the droplet state, ensuring that all actions from the EC2 API or within the EC2 instance itself, such as shutdown or reboot operations originating from the EC2 instance operating system, result in the correct state changes within the broader EC2 systems.
As part of maintaining this lease, each DWFM host has to check in and complete a state check with each droplet that it manages every few minutes.
Here we’re learning about the systems that live on the interface between bare metal and virtual machines on EC2. At first it seems like DWFM tracks the physical servers' health and availability, but later it seems like it also tracks the general state of the virtual machines running on them. So it seems to me that the DWFM is an orchestrator of hypervisors. And just as a Load Balancer health-checks the services under its responsibility (or lease, in DWFM language), the DWFMs do the same for the hypervisors under theirs.
Starting at 11:48 PM PDT on October 19, these DWFM state checks began to fail as the process depends on DynamoDB and was unable to complete. While this did not affect any running EC2 instance, it did result in the droplet needing to establish a new lease with a DWFM before further instance state changes could happen for the EC2 instances it is hosting. Between 11:48 PM on October 19 and 2:24 AM on October 20, leases between DWFM and droplets within the EC2 fleet slowly started to time out.
Here's the root cause of the failure on EC2: DWFMs lose their leases on the hypervisors they manage because the state checks that maintain those leases depend on DynamoDB.
This shows a bit of how resiliency works at EC2: instances and hosts are able to work independently from any management as long as there are no state changes needed.
And why does it matter to us? If your recovery playbook, manual or automatic, includes restarting or recreating instances during a failure event, it's best to review it: you can pretty easily move from a recoverable state to an unrecoverable one if the EC2 orchestration systems are having any issues.
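A lease like this is usually just a record with an expiry that has to be refreshed periodically. The sketch below uses invented names and boils the DynamoDB dependency down to a single state_check call, but it shows why an outage in that dependency makes every lease in the fleet expire at roughly the same time:

```python
import time

LEASE_TTL = 300   # invented value: seconds a lease stays valid without renewal

class DropletLease:
    def __init__(self, droplet_id):
        self.droplet_id = droplet_id
        self.expires_at = time.time() + LEASE_TTL

    def renew(self, state_check):
        # The periodic state check has to succeed for the lease to be renewed.
        # In the real system that check depended on DynamoDB, so while DynamoDB
        # was unreachable no lease could be renewed, and they all timed out.
        state_check(self.droplet_id)            # raises if the dependency is down
        self.expires_at = time.time() + LEASE_TTL

    @property
    def active(self):
        return time.time() < self.expires_at    # inactive droplets can't host new launches
```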
At 2:25 AM PDT, with the recovery of the DynamoDB APIs, DWFM began to re-establish leases with droplets across the EC2 fleet. Since any droplet without an active lease is not considered a candidate for new EC2 launches, the EC2 APIs were returning “insufficient capacity errors” for new incoming EC2 launch requests.
Fair: a lack of hypervisors leased to DWFMs feels exactly the same as a lack of physical servers available in the data center to launch enough instances.
Before this, I'd be pretty astonished to get an “insufficient capacity error” when trying to launch an EC2 instance. I'd probably wonder where I configured such a capacity limit and which capacity limit I was hitting. I don’t know if it'd dawn on me that AWS just didn’t have enough servers available to run my workload. And to be fair, this is totally to AWS' merit: they usually do such a good job of keeping these systems working that we — or I, at least — never have to consider that their physical capacity isn’t infinite.
(With the recovery of the DynamoDB APIs) DWFM began the process of reestablishing leases with droplets across the EC2 fleet; however, due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. Additional work was queued to reattempt establishing the droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases.
From the DWFMs’ perspective, this process must have looked like the ramp-up not of one data center, but of every data center in every availability zone in the entire region, with the difference that the droplets were already running customers’ workloads, the region was already receiving traffic, and a queue of requests had built up over the preceding hours.
I can understand why the DWFM wasn’t designed for this specific scenario. Though it's unclear to me why it’d enter “a state of congestive collapse” and become “unable to make forward progress in recovering droplet leases”. My best guess is that the APIs used by the DWFMs were receiving the equivalent of an internal DDoS attack.
Here we have the root cause of the problem “EC2 fleet is unable to self-heal”: the DWFM fleet was unable to handle a scenario where an out-of-specification number of droplet leases needs to be re-established at the same time.
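My mental model of that collapse is the feedback loop in the toy simulation below: when everything retries at once, each lease re-establishment takes longer than its timeout, so it fails and gets re-queued, and the backlog never shrinks. The numbers are invented and this is not a model of DWFM internals, just of the general failure mode:

```python
from collections import deque

def simulate(droplets=100_000, capacity_per_tick=5_000, timeout_ticks=2, ticks=50):
    queue = deque(range(droplets))
    recovered = 0
    for _ in range(ticks):
        in_flight = [queue.popleft() for _ in range(min(capacity_per_tick, len(queue)))]
        # Under saturation, finishing the backlog takes ~queue/capacity ticks;
        # if that exceeds the timeout, the in-flight work fails and is re-queued.
        if len(queue) / capacity_per_tick > timeout_ticks:
            queue.extend(in_flight)          # timed out: no forward progress this tick
        else:
            recovered += len(in_flight)
    return recovered, len(queue)

print(simulate())  # (0, 100000): the queue never drains until the incoming work is throttled
```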
Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues. After attempting multiple mitigation steps, at 4:14 AM engineers throttled incoming work and began selective restarts of DWFM hosts to recover from this situation. Restarting the DWFM hosts cleared out the DWFM queues, reduced processing times, and allowed droplet leases to be established.
Maybe the text was just too vague for this analysis, but if selective restarts of DWFMs solved the situation, then it looks like the “equivalent of an internal DDoS attack” wasn't the reason why they couldn’t recover, but something more internal to how they’re architected.
In any case, my perception is that the personnel working on the case made a bold and wise decision in throttling the incoming work.
Network Manager
When a new EC2 instance is launched, a system called Network Manager propagates the network configuration. Shortly after the recovery of DWFM, Network Manager began propagating updated network configurations to newly launched instances and instances that had been terminated during the event. Since these network propagation events had been delayed by the issue with DWFM, a significant backlog of network state propagations needed to be processed. As a result, Network Manager started to experience increased latencies in network propagation times as it worked to process the backlog of network state changes. While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation. Engineers worked to reduce the load on Network Manager to address network configuration propagation times and took action to accelerate recovery.
This is a bit of AWS' magic: if you've ever worked with on-prem equipment, adding a server to a network usually consists of setting up its IP address, mask, gateway IP address, and DNS servers; then you plug it into the right switch port and check that it works. But on AWS the network is software-defined: the server is physically in one place, what you see as a server is a virtual machine, and the network it needs to connect to doesn’t have a physical switch port you can plug it into.
Given that an availability zone is a pool of data centers, the network you’re connecting to (i.e., the other systems you'd like to access) may be not only in a different data center, but in an entirely different city. But through what I’ll call a smart “tunneling” mechanism this is all abstracted away and all systems in your availability zone behave as if they were connected to the same switch you are.
My understanding is that the Network Manager is the system in charge of setting up all the proper configuration so that this abstraction works. Unfortunately, and at the same time understandably, not much detail was given about this piece of the system, or about how they “reduced the load” on the Network Manager.
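My mental model of this, purely a guess from the outside, is a big mapping table that every physical host involved needs to learn before traffic for a new instance can flow; the Network Manager would be the component pushing those updates around. All names and addresses below are invented:

```python
# Purely illustrative: a software-defined network ultimately boils down to
# mapping tables that hosts must learn before traffic for a new instance flows.
network_state = {
    # (VPC id, instance private IP) -> physical host currently running it
    ("vpc-0abc", "10.0.1.17"): "host-az1-0042",
    ("vpc-0abc", "10.0.2.45"): "host-az3-1187",
}

def propagate(change, hosts):
    # What a Network-Manager-like component has to do for every launch or
    # termination: push the updated mapping to every host that needs it.
    # A large backlog of such changes is what caused the propagation delays.
    for host in hosts:
        host.push(change)   # hypothetical per-host API
```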
Network Load Balancer
The delays in network state propagations for newly launched EC2 instances also caused impact to the Network Load Balancer (NLB) service.
NLB provides load balancing endpoints and routes traffic to backend targets. The architecture also makes use of a separate health check subsystem that regularly executes health checks against all nodes within the NLB architecture and will remove any nodes from service that are considered unhealthy.
During the event the NLB health checking subsystem began to experience increased health check failures. This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated.
This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy. This resulted in health checks alternating between failing and healthy. This caused NLB nodes and backend targets to be removed from DNS, only to be returned to service when the next health check succeeded.
From this, it looks like the NLB is composed of at least two systems running on top of EC2:
The load balancer itself, which appears to be a specially configured EC2 virtual machine
The health check subsystem, which decides to bring NLB nodes (the load balancers themselves) online or offline by adding or removing their IP addresses from the DNS names.
There is presumably a third (not covered) subsystem that dynamically scales the number of NLB instances up or down. Whenever NLB instances are launched, scaled up, or replaced after a failure, their network configuration needs to be propagated by the Network Manager; if that propagation takes too long, the health check for an instance being brought up can end up failing.
So you could have an EC2 virtual machine which didn’t change state and was perfectly set up and reachable, but whose NLB nodes, responsible for forwarding traffic to it, needed for some reason to be replaced or scaled up. In that case, the time it took for their network configuration to propagate led to health check failures (that is, checks of the NLB VMs' health, not of the underlying targets), leading to an inability to add NLB capacity to the fleet.
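Here's how I picture the flapping, as a hedged sketch: the health checker keeps a node in DNS only while its probes pass, so a node whose network state hasn't propagated yet keeps failing and passing alternately, and gets pulled from and re-added to DNS even though the node itself is fine. The names below are invented:

```python
def health_check_cycle(nodes, dns, network_ready, probe):
    # network_ready(node): has the Network Manager finished propagating this
    # node's configuration? probe(node): the actual health check.
    for node in nodes:
        if network_ready(node) and probe(node):
            dns.add(node)      # brought (back) into service
        else:
            dns.remove(node)   # pulled from service, even if the node itself is healthy
```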
The alternating health check results increased the load on the health check subsystem, causing it to degrade, resulting in delays in health checks and triggering automatic AZ DNS failover to occur. For multi-AZ load balancers, this resulted in capacity being taken out of service. In this case, an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load.
It's interesting to note how modern-day multi-layered load balancing and health checking ended up making this specific system less reliable. Back in 1986, the MX DNS record was introduced with this format:
Domain TTL Class Type Priority Host
example.com. 1936 IN MX 10 onemail.example.com.
example.com. 1936 IN MX 10 twomail.example.com.
In this example, it specifies that the email servers handling mail for example.com are onemail.example.com and twomail.example.com, balanced equally between them (through the priority field). When querying the MX records for example.com, both hosts are returned, and it’s the responsibility of the client trying to access the system to randomly choose one of the two servers. Failure is also handled client-side, by retrying a different email server if the connection fails.
This is what has powered email delivery ever since, and there is no health-checking or load-balancing system to fail: if there is a healthy underlying target (in NLB’s language), the delivery will happen. Big email providers like AWS itself and Gmail probably do have some sort of load balancer behind their listed email hosts, rather than the email servers themselves. But at least for smaller-scale services, this poor man's DNS-based load balancing works well even today.
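As a small illustration of how much of that logic lives in the client, here's a sketch of MX-style server selection in Python. It uses the third-party dnspython package for the lookup and the standard smtplib for the connection; hosts with equal priority are tried in random order, and failover is simply "try the next one":

```python
import random
import smtplib
import dns.resolver   # third-party "dnspython" package

def connect_to_mail_server(domain):
    answers = dns.resolver.resolve(domain, "MX")
    # Lowest preference value wins; ties are broken at random by the client.
    for record in sorted(answers, key=lambda r: (r.preference, random.random())):
        host = str(record.exchange).rstrip(".")
        try:
            return smtplib.SMTP(host, 25, timeout=10)   # first healthy target wins
        except OSError:
            continue                                    # failover: just try the next host
    raise RuntimeError(f"no reachable mail server for {domain}")
```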
Other AWS Services
Lambda, ECS, EKS, Fargate, Amazon Connect
NLB health check failures triggered instance terminations leaving a subset of Lambda internal systems under-scaled.
We see here that Lambda's internal systems run on EC2 instances behind NLBs.
Customers experienced container launch failures and cluster scaling delays across Amazon Elastic Container Service (ECS), Elastic Kubernetes Service (EKS), and Fargate in the N. Virginia (us-east-1) Region.
Fair: container services’ capacities are provided by EC2, either directly through EC2 instances or through managed services, like Fargate.
Amazon Connect customers experienced elevated errors handling calls, chats, and cases. Following the restoration of DynamoDB endpoints, most Connect features recovered. Starting at 7:04 AM, customers again experienced increased errors which was caused by impact to the NLBs used by Connect as well as increased error rates and latencies for Lambda function invocations.
AWS does eat its own dog food: Amazon Connect runs on EC2, behind NLBs accessing DynamoDB and calling Lambda functions.
IAM and STS
Customers experienced AWS Security Token Service (STS) API errors and latency in the N. Virginia (us-east-1) Region. STS recovered at 1:19 AM after the restoration of internal DynamoDB endpoints. Between 8:31 AM and 9:59 AM, STS API error rates and latency increased again as a result of NLB health check failures. By 9:59 AM, we recovered from the NLB health check failures, and the service began normal operations.
AWS customers attempting to sign into the AWS Management Console using an IAM user experienced increased authentication failures due to underlying DynamoDB issues. Customers with IAM Identity Center configured in N. Virginia (us-east-1) Region were also unable to sign in using Identity Center. Customers using their root credential, and customers using identity federation configured to use signin.aws.amazon.com experienced errors when trying to log into the AWS Management Console in regions outside of the N. Virginia (us-east-1) Region. As DynamoDB endpoints became accessible, the service began normal operations.
I’d have imagined IAM and STS would get their own independent systems, given that they’re also used to control access to DynamoDB, the very service they depend on.
IAM and STS being dependent on N. Virginia DynamoDB raises one interesting reliability issue: if you're using any AWS system, in any region, that depends on IAM, your service may be affected when us-east-1 has issues, particularly during extended outages.
It’s reasonable to go multi-region to minimize latency, to meet regulatory data-location requirements, and to keep multiple copies of data with greater physical separation (in case of geopolitical events, for example). But as long as you depend on IAM (as most things running on AWS do), if N. Virginia goes down, your service may suffer as well, regardless of the region you set it up in.
Granted, I’m a mere outside observer; but as an AWS customer, I’d really prefer that they did some work on the IAM architecture so that workloads in other regions wouldn’t depend on authentication being available in N. Virginia.
Redshift
Redshift query processing relies on DynamoDB endpoints to read and write data from clusters.
Redshift query processing depends on DynamoDB: fair.
As DynamoDB endpoints recovered, Redshift automation triggered workflows to replace the underlying EC2 hosts with new instances. With EC2 launches impaired, these workflows were blocked, putting clusters in a “modifying” state that prevented query processing and making the cluster unavailable for workloads.
Redshift compute also comes from EC2: fair.
Amazon Redshift customers in all AWS Regions were unable to use IAM user credentials for executing queries due to a Redshift defect that used an IAM API in the N. Virginia (us-east-1) Region to resolve user groups. As a result, IAM’s impairment during this period caused Redshift to be unable to execute these queries. Redshift customers in AWS Regions who use “local” users to connect to their Redshift clusters were unaffected.
As a customer, this inter-region dependency due to IAM is something I’d rather AWS worked on.
Event Response
We are making several changes as a result of this operational event. We have already disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide. In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans.
I don't see how they could just disable this automation. Maybe they reactivated some older, non-concurrent system not susceptible to the race condition?
It seems fair to assume that DynamoDB DNS entries need continuous updates as servers enter and exit the fleet, and that the scale of this is on the order of at least hundreds of updates per day (if there are thousands of DNS records, hundreds of updates per day seems conservative).
If that’s the case, I don’t believe this could be stopped or turned into a manual process overnight.
For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover. For EC2, we are building an additional test suite to augment our existing scale testing, which will exercise the DWFM recovery workflow to identify any future regressions. We will improve the throttling mechanism in our EC2 data propagation systems to rate limit incoming work based on the size of the waiting queue to protect the service during periods of high load.
LGTM.
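The last item, rate limiting incoming work based on the size of the waiting queue, is a classic admission-control pattern. Here is a minimal sketch of the idea (mine, not AWS's implementation):

```python
import queue

class OverloadedError(Exception):
    pass

class AdmissionControlledQueue:
    """Reject new work when the backlog is already too large, so the workers
    can finish what is queued instead of collapsing under ever-growing load.
    The threshold is an invented number."""

    def __init__(self, max_backlog=10_000):
        self._queue = queue.Queue()
        self._max_backlog = max_backlog

    def submit(self, item):
        if self._queue.qsize() >= self._max_backlog:
            raise OverloadedError("backlog full, retry later")   # push back on callers
        self._queue.put(item)

    def next_item(self):
        return self._queue.get()
```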
Final Comments
In my view, AWS did an amazing job with this report, sharing great detail about its inner workings as a way of showing respect and commitment to the customers affected by this outage. The fact that about a third of all Internet services run on top of it is a testament to the confidence the market has in them. And it amazes me that such systems exist and are able to power the Internet despite their complexity.
Despite some of the critical comments above, AWS, like other cloud providers, is a feat of engineering. And I can hardly imagine how much pressure their personnel felt while dealing with this incident as large portions of the Internet were unavailable. These are certainly remarkable professionals.