<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Leandro Lima]]></title><description><![CDATA[Leandro Lima]]></description><link>https://blog.lls-software.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 11:25:33 GMT</lastBuildDate><atom:link href="https://blog.lls-software.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Reading through the US-EAST-1 Service Disruption Summary Report]]></title><description><![CDATA[On October 19 and 20, 2025, the AWS North Virginia (us-east-1) region faced a disruption that took down services worldwide, from small websites to large e-commerce sites (including Amazon itself), banks and government services. According to AWS, the ...]]></description><link>https://blog.lls-software.com/reading-through-the-us-east-1-service-disruption-summary-report</link><guid isPermaLink="true">https://blog.lls-software.com/reading-through-the-us-east-1-service-disruption-summary-report</guid><category><![CDATA[AWS]]></category><category><![CDATA[Reliability]]></category><category><![CDATA[us-east-1]]></category><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Sun, 02 Nov 2025 21:44:01 GMT</pubDate><content:encoded><![CDATA[<p>On October 19 and 20, 2025, the AWS North Virginia (us-east-1) region faced a disruption that took down services worldwide, from small websites to large e-commerce sites (including Amazon itself), banks and government services. According to AWS, the event started at 11:48 PM PDT on October 19 and ended at 2:20 AM PDT on October 20. Shortly after the end of the event, AWS released a summary of the event which might serve both as a way of getting a sneak peek into the inner workings of AWS and as a case study of how complex systems fail. 
This article will be an attempt at unraveling AWS’ event summary, going through the original event's timeline and adding personal comments to it. I’ll be quoting the original document from AWS throughout the article, editing out redundant portions, but without changing the content or meaning of the original text.</p>
<h2 id="heading-glossary">Glossary</h2>
<p>To make the text a bit more accessible to a wider audience, I’ll put here a brief explanation of some of the terms used below:</p>
<ul>
<li><p>IP address - Logical address of a system on the Internet or similar internal networks. This is what computers use to reference other computers when trying to connect to them.</p>
</li>
<li><p>DNS server - Domain Name System server. This maps names to IP addresses. For example: www.google.com → &lt;IP address of systems providing Google's web search services&gt;. The first step in any attempt to access a service on the Internet consists of asking a DNS server for the appropriate IP address for that service. Only then do computers try to reach the IP address returned from this inquiry.</p>
</li>
<li><p>Load Balancer - A server which acts as a proxy for other services behind it, redirecting connections according to availability and capacity. You can imagine going to a laundry service and handing your clothes to a person who's going to put them in one of the available laundry machines and then return them to you. This person doesn’t do the laundry and you don’t choose the machine, nor can you infer how many machines there are. But from the customer's perspective, this person does the laundry, in the sense that you give them dirty clothes and receive clean clothes back.</p>
</li>
<li><p>Stack - A data structure that works as an ordered pile of things: you put stuff on top of it and you pick stuff from the top of it. The last item put on top of it is the next item that's going to be picked up, unless someone puts more stuff on it before someone picks up the last item put there. You can imagine a pile of dirty dishes, for example.</p>
</li>
<li><p>Stochastic Process - A process whose behavior is randomly determined, presenting a pattern that may be analyzed statistically but may not be predicted precisely.</p>
</li>
<li><p>Long Tail Events - Probabilistic events with a very small likelihood, to the point of often being treated as though they never occur at all.</p>
</li>
<li><p>Lock - A synchronization primitive used to prevent a record from being modified by multiple systems at the same time to prevent inconsistencies. For example: imagine two people trying to transfer cash to the same bank account at the same time. Each tries to apply: <em>new balance = old balance + transfer value</em>. The desired result is <em>new balance = old balance + first value + second value</em>, but without the proper controls, it might end up being either <em>new balance = old balance + first value</em> or <em>new balance = old balance + second value</em>.</p>
</li>
<li><p>EIP - Elastic IP Address. An IP address within AWS's network that can be dynamically attached to servers on their network.</p>
</li>
<li><p>Hypervisor - Software that creates and runs virtual machines by abstracting and allocating a single physical server's resources, like CPU, memory, and storage, to multiple guest operating systems.</p>
</li>
</ul>
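<p>The lock scenario in the glossary can be sketched in a few lines of Python. The account, amounts and method names are invented for illustration:</p>
<pre><code class="lang-python">import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def deposit_unsafe(self, amount, stale_balance):
        # writes "stale read + amount", ignoring anything committed in between
        self.balance = stale_balance + amount

    def deposit_safe(self, amount):
        # the lock makes the read-modify-write atomic
        with self._lock:
            self.balance = self.balance + amount

# Lost update: both writers read the balance (100) before either writes.
acct = Account(100)
read_a = acct.balance            # first transfer reads 100
read_b = acct.balance            # second transfer also reads 100
acct.deposit_unsafe(30, read_a)  # balance becomes 130
acct.deposit_unsafe(50, read_b)  # balance becomes 150: the 30 is lost

# With the lock, each deposit sees the previous result: 100 + 30 + 50 = 180.
safe = Account(100)
safe.deposit_safe(30)
safe.deposit_safe(50)
</code></pre>
<p>Note that the lock alone doesn’t help unless writers also re-read the balance after acquiring it, which is exactly the kind of subtlety that makes race conditions latent.</p>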
<h2 id="heading-dynamodb-request-routing-failure">DynamoDB Request Routing Failure</h2>
<blockquote>
<p>Between 11:48 PM PDT on October 19 and 2:40 AM PDT on October 20, customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service.</p>
<p>The incident was triggered by a latent defect within the service’s automated DNS management system that caused endpoint resolution failures for DynamoDB.</p>
<p>The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (<strong>dynamodb.us-east-1.amazonaws.com</strong>) that the automation failed to repair.</p>
<p>To explain this event, we need to share some details about the DynamoDB DNS management architecture. The system is split across two independent components:</p>
<ol>
<li><p>The DNS Planner, monitors the health and capacity of the load balancers and periodically creates a new DNS plan for each of the service’s endpoints consisting of a set of load balancers and weights.</p>
</li>
<li><p>The DNS Enactor, enacts DNS plans by applying the required changes in the Amazon Route53 service. The DNS Enactor operates redundantly and fully independently in three different Availability Zones (AZs). Each of these independent instances of the DNS Enactor looks for new plans and attempts to update Route53 by replacing the current plan with a new plan.</p>
</li>
</ol>
</blockquote>
<p>As users, we can see DynamoDB as the abstraction of an infinite NoSQL database, which is highly durable, highly available and highly performant. Here AWS shares a bit of how this is done:</p>
<ul>
<li><p>Each region has a large, dynamically changing number of servers running DynamoDB instances.</p>
</li>
<li><p>Servers get mapped to load balancers, through an undisclosed process — either by self-registering or by some orchestration mechanism. The end result is that each load balancer is responsible for a set of servers.</p>
</li>
<li><p>A planning system, the DNS Planner, monitors the load balancers’ health and decides whether they should be part of the regional DynamoDB fleet, and how much they should contribute to it. The DNS Planner establishes a plan, but doesn’t execute it.</p>
</li>
<li><p>The plan is essentially a weighted list of load balancer IP addresses. And a deployment system, the DNS Enactor, acts on the Planner's plans by atomically deploying them to Route53 so that users know where they should connect to reach DynamoDB.</p>
</li>
</ul>
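<p>Based on this description, a DNS plan might look roughly like the sketch below. The endpoint name is real; the structure, the IP addresses (from the documentation range) and the weights are my assumptions:</p>
<pre><code class="lang-python">import random

# Hypothetical shape of a DNS plan: a weighted list of load balancer IPs
# per endpoint. A higher weight means a larger share of the traffic.
plan = {
    "dynamodb.us-east-1.amazonaws.com": [
        ("192.0.2.10", 40),
        ("192.0.2.11", 40),
        ("192.0.2.12", 20),
    ]
}

def resolve(endpoint):
    # Weighted DNS answers behave roughly like a weighted random choice.
    records = plan[endpoint]
    ips = [ip for ip, weight in records]
    weights = [weight for ip, weight in records]
    return random.choices(ips, weights=weights, k=1)[0]

answer = resolve("dynamodb.us-east-1.amazonaws.com")
</code></pre>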
<blockquote>
<p>Under normal operations, a DNS Enactor picks up the latest plan and begins working through the service endpoints to apply this plan. This process typically completes rapidly and does an effective job of keeping DNS state freshly updated. Before it begins to apply a new plan, the DNS Enactor makes a one-time check that its plan is newer than the previously applied plan. As the DNS Enactor makes its way through the list of endpoints, it is possible to encounter delays as it attempts a transaction and is blocked by another DNS Enactor updating the same endpoint. In these cases, the DNS Enactor will retry each endpoint until the plan is successfully applied to all endpoints.</p>
</blockquote>
<p>Here we can infer three things:</p>
<ol>
<li><p>The communication path between the DNS Planner and the DNS Enactor behaves like a stack. Whenever a DNS Planner runs, it puts its new plan on top of the stack; whenever the DNS Enactor runs, it executes whatever is the newest plan on top of the stack.</p>
</li>
<li><p>The job of the DNS Enactor is to pull the latest plan and deploy it through several DNS server instances, which should be quickly updated to start answering queries based on the latest plan. So the DNS Enactor is an orchestrator for a distributed DNS system.</p>
</li>
<li><p>An endpoint gets an update lock when an Enactor is working on it. And if a second Enactor reaches it when it's locked, it'll keep retrying the update until it's able to perform the task.</p>
</li>
</ol>
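<p>Putting those three inferences together, a toy Enactor loop might look like this sketch (the data structures, names and locking model are all my assumptions):</p>
<pre><code class="lang-python">import time

plans = []            # stack of (generation, endpoints); the Planner appends
endpoint_locks = {}   # endpoint name to current holder, None when free
applied = {}          # endpoint name to generation currently live in DNS

def enact(enactor_id):
    generation, endpoints = plans[-1]   # pick up the newest plan
    # one-time check: refuse to start if something newer was already applied
    if any(applied.get(ep, -1) >= generation for ep in endpoints):
        return False
    for endpoint in endpoints:
        # blocked by another Enactor: retry until the lock is released
        while endpoint_locks.get(endpoint) is not None:
            time.sleep(0.01)
        endpoint_locks[endpoint] = enactor_id
        applied[endpoint] = generation  # the atomic swap in Route53
        endpoint_locks[endpoint] = None
    return True

plans.append((1, ["dynamodb.us-east-1.amazonaws.com"]))
enact("enactor-az1")    # applies generation 1
plans.append((2, ["dynamodb.us-east-1.amazonaws.com"]))
enact("enactor-az2")    # applies generation 2 on top
</code></pre>
<p>Note that in this reading the freshness check happens only once, before the loop; once an Enactor is inside the loop, nothing stops it from overwriting newer state.</p>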
<blockquote>
<p>Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening.</p>
</blockquote>
<p>This, in my view, is the root cause. Unfortunately, people at AWS felt it wasn’t appropriate to comment on why it experienced <em>unusually high delays.</em> Maybe to preserve some industry secret, maybe because the process is intrinsically stochastic and they hit a long tail. But here is where the chain reaction starts: one instance of the DNS Enactor being stuck for an <em>unexpected time</em> applying a plan.</p>
<blockquote>
<p>First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints.</p>
<p>The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them.</p>
</blockquote>
<p>I can imagine here that when this system was first designed, the architect didn’t envision a situation where it'd make sense for a second Enactor to start <em>after</em> the first one, <em>skip the line</em> and finish <em>first.</em> Earlier we learned that the Enactor makes sure it's applying the latest plan before it starts, and that a conflict between two Enactors trying to change the same endpoint results in one waiting for the other's lock release. I can also imagine that in the architect's mind an Enactor applying a newer plan could end up waiting for the first Enactor to finish updating an endpoint before applying its own plan. But what seems to have happened was the opposite: for some reason, the first Enactor (with the older plan) ended up stuck waiting for lock releases from the second Enactor (with the newer plan). And then the second Enactor, in this unexpected state, performed this perfectly reasonable — within the assumptions — housekeeping job of deleting the plan being applied by the other Enactor.</p>
<blockquote>
<p>At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan.</p>
<p>The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.</p>
</blockquote>
<p>The meaning of “many generations” is unclear, but I can totally see someone observing testing / statistical data where a plan older than N generations is never needed, and hardcoding many = 2 * N or something — and then we hit a long tail.</p>
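<p>That hypothesized hardcoded threshold would make the clean-up look something like this sketch (the threshold, names and generation numbers are all invented):</p>
<pre><code class="lang-python">MANY_GENERATIONS = 10   # invented threshold for "significantly older"

plans = {gen: "plan-%d" % gen for gen in range(15)}   # generations 0..14
still_being_applied = 2   # a delayed Enactor is still working on plan 2

def clean_up(just_applied):
    # Deletes plans far behind the one just applied, without checking
    # whether a slow Enactor is still in the middle of applying one.
    for gen in list(plans):
        if just_applied - gen > MANY_GENERATIONS:
            del plans[gen]

clean_up(just_applied=14)
# plan 2 is gone, even though an Enactor was still applying it
</code></pre>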
<blockquote>
<p>As this plan was deleted, all IP addresses for the regional endpoint were immediately removed. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors.</p>
</blockquote>
<p>It isn’t clear to me what they mean by saying “IP addresses for the regional endpoint were removed”. Does it mean that the IP addresses were disconnected from the DynamoDB instances running the service? Or that the DNS servers rely on an outside source of data for the plan that vanished under them? When I first read this, my interpretation was the latter, though my perception now is that only the first one makes sense:</p>
<ol>
<li><p>If the DNS endpoints use an outside source of data, why would they need to run a process updating them one by one instead of just updating this data source? Plus, this would create an extra point of failure.</p>
</li>
<li><p>On the other hand, it does make sense that the instances get an EIP, this EIP gets tied to a plan and the EIP can only be deallocated from the instance and reused after the resource using it (the plan) doesn’t exist anymore.</p>
</li>
</ol>
<p>The problem was: the plan <strong>was</strong> being used. There seems to be a missing piece here tracking plan usage, instead of assuming a plan isn’t in use merely because it's <em>many generations</em> behind the latest applied plan (though being <em>many generations</em> behind could also lead to a collapse later, even if the rug wasn’t pulled from under the DNS endpoints).</p>
<blockquote>
<p>When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB.</p>
</blockquote>
<p>Fair.</p>
<h2 id="heading-ec2-instance-management-failure">EC2 Instance Management Failure</h2>
<h3 id="heading-dropletworkflow-manager">DropletWorkflow Manager</h3>
<blockquote>
<p>Between 11:48 PM PDT on October 19 and 1:50 PM PDT on October 20, customers experienced increased EC2 API error rates, latencies, and instance launch failures in the N. Virginia (us-east-1) Region.</p>
<p>During this period new instance launches failed with either a “request limit exceeded” or “insufficient capacity” error.</p>
<p>To understand what happened, we need to share some information about a few subsystems that are used for the management of EC2 instance launches:</p>
<ol>
<li><p>The DropletWorkflow Manager (DWFM) is responsible for the management of all the underlying physical servers (a.k.a. “droplets”) that are used for the hosting of EC2 instances.</p>
</li>
<li><p>The Network Manager is responsible for the management and propagation of network state to all EC2 instances and network appliances.</p>
</li>
</ol>
<p>Each DWFM manages a set of droplets within each Availability Zone and maintains a lease for each droplet currently under management. This lease allows DWFM to track the droplet state, ensuring that all actions from the EC2 API or within the EC2 instance itself, such as shutdown or reboot operations originating from the EC2 instance operating system, result in the correct state changes within the broader EC2 systems.</p>
<p>As part of maintaining this lease, each DWFM host has to check in and complete a state check with each droplet that it manages every few minutes.</p>
</blockquote>
<p>Here we’re learning about the systems that live on the interface between bare metal and virtual machines on EC2. At first it seems like DWFM is tracking the physical servers’ health and availability, but later it seems like it also tracks the general state of the virtual machines running on them. So it seems to me that the DWFM is an orchestrator of hypervisors. And just as a Load Balancer does health checks on the services under its responsibility, the DWFMs do the same on the hypervisors under their responsibility (their lease, in DWFM language).</p>
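<p>The lease mechanism, as described, can be sketched like this (the TTL value and the names are my assumptions; the report only says “every few minutes”):</p>
<pre><code class="lang-python">class DropletLease:
    TTL = 300.0   # seconds; invented value

    def __init__(self):
        self.expires_at = 0.0

    def check_in(self, now):
        # a successful state check with the droplet renews the lease
        self.expires_at = now + self.TTL

    def launch_candidate(self, now):
        # without an active lease the droplet cannot host new launches,
        # although instances already running on it keep running
        return self.expires_at > now

lease = DropletLease()
lease.check_in(now=0.0)                      # DWFM completes a state check
active = lease.launch_candidate(now=200.0)   # lease still active
lapsed = lease.launch_candidate(now=600.0)   # check-ins kept failing
</code></pre>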
<blockquote>
<p>Starting at 11:48 PM PDT on October 19, these DWFM state checks began to fail as the process depends on DynamoDB and was unable to complete. While this did not affect any running EC2 instance, it did result in the droplet needing to establish a new lease with a DWFM before further instance state changes could happen for the EC2 instances it is hosting. Between 11:48 PM on October 19 and 2:24 AM on October 20, leases between DWFM and droplets within the EC2 fleet slowly started to time out.</p>
</blockquote>
<p>Here's the root cause of failure on EC2: DWFMs lose their connection to the hypervisors they manage due to dependency on DynamoDB for this.</p>
<p>This shows a bit of how resiliency works at EC2: instances and hosts are able to work independently from any management as long as there are no state changes needed.</p>
<p>And why does it matter to us? If your recovery playbook — manual or automatic — includes restarting or recreating instances on a failure event, it's best to review it, as you can pretty easily move from a recoverable state to an unrecoverable state if EC2 orchestration systems are having any issues.</p>
<blockquote>
<p>At 2:25 AM PDT, with the recovery of the DynamoDB APIs, DWFM began to re-establish leases with droplets across the EC2 fleet. Since any droplet without an active lease is not considered a candidate for new EC2 launches, the EC2 APIs were returning “insufficient capacity errors” for new incoming EC2 launch requests.</p>
</blockquote>
<p>Fair: a lack of hypervisors leased to DWFMs feels exactly the same as a lack of physical servers available in the data center to launch enough instances.</p>
<p>Before this, I'd be pretty astonished to get an “insufficient capacity error” when trying to launch an EC2 instance. I'd probably wonder where I configured such a capacity limit and which capacity limit I was hitting. I don’t know if it'd dawn on me that AWS just didn’t have enough servers available to run my workload. And to be fair, this is totally to AWS' merit: they usually do such a good job of keeping these systems working that we — or I, at least — never have to consider that their physical capacity isn’t infinite.</p>
<blockquote>
<p>(With the recovery of the DynamoDB APIs) DWFM began the process of reestablishing leases with droplets across the EC2 fleet; however, due to the large number of droplets, efforts to establish new droplet leases took long enough that the work could not be completed before they timed out. Additional work was queued to reattempt establishing the droplet lease. At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases.</p>
</blockquote>
<p>From the DWFMs’ perspective, this process must have looked like the ramp-up of not one data center, but of all data centers in all availability zones in the entire region. The difference was that the droplets were already running the customers’ workloads, and the region was already receiving live traffic on top of a queue of requests that had built up over the preceding hours.</p>
<p>I can understand why the DWFM wasn’t designed for this specific scenario. Though it's unclear to me why it’d enter “a state of congestive collapse” and become “unable to make forward progress in recovering droplet leases”. My best guess is that the APIs used by the DWFMs were receiving the equivalent of an internal DDoS attack.</p>
<p>Here we have the root cause of the problem “EC2 fleet is unable to self heal”: the DWFM fleet was unable to handle a scenario where an out-of-specification number of droplet connections needed to be reestablished at the same time.</p>
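<p>A toy model shows how such a system can stall: if the per-lease processing cost grows with the backlog, and attempts that exceed the timeout are simply re-queued, then a large enough backlog never drains. All numbers here are invented:</p>
<pre><code class="lang-python">def recover_leases(backlog, ticks=1000, timeout=1.5, base_cost=1.0):
    queue = list(range(backlog))
    established = 0
    for _ in range(ticks):
        if not queue:
            break
        # processing cost grows with queue depth (contention, retries, ...)
        cost = base_cost * (1 + len(queue) / 100)
        droplet = queue.pop(0)
        if cost > timeout:
            queue.append(droplet)   # timed out: the same work is re-queued
        else:
            established += 1
    return established, len(queue)

small = recover_leases(backlog=40)    # modest backlog: fully recovers
huge = recover_leases(backlog=200)    # region-wide backlog: zero progress
</code></pre>
<p>The mitigation AWS applied (throttling incoming work) and the fix they propose (rate limiting based on queue size) both attack exactly this feedback loop.</p>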
<blockquote>
<p>Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues. After attempting multiple mitigation steps, at 4:14 AM engineers throttled incoming work and began selective restarts of DWFM hosts to recover from this situation. Restarting the DWFM hosts cleared out the DWFM queues, reduced processing times, and allowed droplet leases to be established.</p>
</blockquote>
<p>Maybe the text was just too vague for this analysis, but if selective restarts of DWFMs solved the situation, then it looks like the “equivalent of an internal DDoS attack” wasn't the reason why they couldn’t recover, but something more internal to how they’re architected.</p>
<p>In any case, my perception was that the personnel working on the case made a bold and wise decision to throttle the incoming work.</p>
<h3 id="heading-network-manager">Network Manager</h3>
<blockquote>
<p>When a new EC2 instance is launched, a system called Network Manager propagates the network configuration. Shortly after the recovery of DWFM, Network Manager began propagating updated network configurations to newly launched instances and instances that had been terminated during the event. Since these network propagation events had been delayed by the issue with DWFM, a significant backlog of network state propagations needed to be processed. As a result, Network Manager started to experience increased latencies in network propagation times as it worked to process the backlog of network state changes. While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation. Engineers worked to reduce the load on Network Manager to address network configuration propagation times and took action to accelerate recovery.</p>
</blockquote>
<p>This is a bit of AWS' magic: if you’ve ever worked with on-prem equipment, adding a server to a network usually consists of setting up its IP address, netmask, gateway IP address and DNS servers; then you plug it into the right switch port and check that it works. But on AWS the network is software defined: the server is physically in one place, what you see as a server is a virtual machine, and the network it needs to connect to doesn’t have a physical switch port you can plug it into.</p>
<p>Given that an availability zone is a pool of data centers, the network you’re connecting to (i.e., the other systems you'd like to access) may be not only in a different data center, but in an entirely different city. But through what I’ll call a smart “tunneling” mechanism this is all abstracted away and all systems in your availability zone behave as if they were connected to the same switch you are.</p>
<p>My understanding is that Network Manager is the system in charge of setting up all the proper configurations so that this abstraction works. Unfortunately and, at the same time, understandably, not much detail was given about this piece of the system, or about how they “reduced the load” on the Network Manager.</p>
<h2 id="heading-network-load-balancer">Network Load Balancer</h2>
<blockquote>
<p>The delays in network state propagations for newly launched EC2 instances also caused impact to the Network Load Balancer (NLB) service.</p>
<p>NLB provides load balancing endpoints and routes traffic to backend targets. The architecture also makes use of a separate health check subsystem that regularly executes health checks against all nodes within the NLB architecture and will remove any nodes from service that are considered unhealthy.</p>
<p>During the event the NLB health checking subsystem began to experience increased health check failures. This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated.</p>
<p>This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy. This resulted in health checks alternating between failing and healthy. This caused NLB nodes and backend targets to be removed from DNS, only to be returned to service when the next health check succeeded.</p>
</blockquote>
<p>From this, it looks like the NLB is composed of at least two systems running on top of EC2:</p>
<ol>
<li><p>The load balancer itself, which appears to be a specially configured EC2 virtual machine</p>
</li>
<li><p>The health check subsystem, which decides to bring NLB nodes (the load balancers themselves) online or offline by adding or removing their IP addresses from the DNS names.</p>
</li>
</ol>
<p>A third subsystem (not covered in the report) at AWS dynamically scales the number of NLB instances up or down. Whether scaling up or replacing NLB instances that need to be taken down due to some failure, the new instances’ network configuration needs to be propagated by the Network Manager. And if this takes too long to sync up, the health check for an instance being brought up might end up failing.</p>
<p>So you could have an EC2 virtual machine which didn’t change state and was perfectly set up and reachable, but the NLB nodes responsible for forwarding traffic to it, for some reason, needed to be replaced or scaled up. In that case, the time it took for their network configuration to propagate led to health check failures (that is, checks of the NLB VMs’ health, not of the underlying targets), leading to an inability to add NLB capacity to the fleet.</p>
<blockquote>
<p>The alternating health check results increased the load on the health check subsystem, causing it to degrade, resulting in delays in health checks and triggering automatic AZ DNS failover to occur. For multi-AZ load balancers, this resulted in capacity being taken out of service. In this case, an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load.</p>
</blockquote>
<p>It's interesting to note how modern-day multi-layered load balancing and health checking ended up making this specific system less reliable; back in 1986, the DNS MX record was introduced with this format:</p>
<pre><code class="lang-plaintext">Domain            TTL   Class    Type  Priority      Host
example.com.    1936    IN        MX        10         onemail.example.com.
example.com.    1936    IN        MX        10         twomail.example.com.
</code></pre>
<p>In this example, it specifies that the email servers handling emails for example.com are onemail.example.com and twomail.example.com, equally balanced between them (through the priority field). When querying the MX records for example.com, both hosts are returned, and it’s the responsibility of the client trying to access the system to randomly choose one of the two servers. Failure is also handled client-side, by retrying a different email server if the connection fails.</p>
<p>This is what has powered all email systems since then, and it doesn’t have a health checking or load balancing system to fail — if there is a healthy underlying target (in NLB’s language), the delivery will happen. Big email providers like AWS itself and Gmail probably have some sort of load balancer as their listed email hosts, instead of the email servers themselves. But at least for smaller scale services, this poor man's DNS-based load balancing works well even today.</p>
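<p>The client-side behavior described above is simple enough to sketch; here <code>fake_send</code> stands in for an SMTP connection attempt:</p>
<pre><code class="lang-python">import random

def deliver(mx_hosts, try_send):
    # Poor man's load balancing: shuffle the equal-priority hosts, then
    # let the client itself fall through to the next host on failure.
    candidates = list(mx_hosts)
    random.shuffle(candidates)
    for host in candidates:
        try:
            return try_send(host)
        except ConnectionError:
            continue   # failure handling is also client-side
    raise ConnectionError("no mail host reachable")

def fake_send(host):
    # pretend onemail is down and twomail accepts the message
    if host == "onemail.example.com.":
        raise ConnectionError(host)
    return "delivered via " + host

result = deliver(["onemail.example.com.", "twomail.example.com."], fake_send)
</code></pre>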
<h2 id="heading-other-aws-services"><strong>Other AWS Services</strong></h2>
<h3 id="heading-lambda-ecs-eks-fargate-amazon-connect">Lambda, ECS, EKS, Fargate, Amazon Connect</h3>
<blockquote>
<p><strong>NLB health check failures triggered instance terminations leaving a subset of Lambda internal systems under-scaled.</strong></p>
</blockquote>
<p>We see here that Lambda Functions run on EC2 behind NLB.</p>
<blockquote>
<p>Customers experienced container launch failures and cluster scaling delays across both Amazon Elastic Container Service (ECS), Elastic Kubernetes Service (EKS), and Fargate in the N. Virginia (us-east-1) Region.</p>
</blockquote>
<p>Fair: container services’ capacities are provided by EC2, either directly through EC2 instances or through managed services, like Fargate.</p>
<blockquote>
<p>Amazon Connect customers experienced elevated errors handling calls, chats, and cases. Following the restoration of DynamoDB endpoints, most Connect features recovered. Starting at 7:04 AM, customers again experienced increased errors which was caused by impact to the NLBs used by Connect as well as increased error rates and latencies for Lambda function invocations.</p>
</blockquote>
<p>AWS does eat its own dog food: Amazon Connect runs on EC2, behind NLBs accessing DynamoDB and calling Lambda functions.</p>
<h3 id="heading-iam-and-sts">IAM and STS</h3>
<blockquote>
<p>Customers experienced AWS Security Token Service (STS) API errors and latency in the N. Virginia (us-east-1) Region. STS recovered at 1:19 AM after the restoration of internal DynamoDB endpoints. Between 8:31 AM and 9:59 AM, STS API error rates and latency increased again as a result of NLB health check failures. By 9:59 AM, we recovered from the NLB health check failures, and the service began normal operations.</p>
<p>AWS customers attempting to sign into the AWS Management Console using an IAM user experienced increased authentication failures due to underlying DynamoDB issues. Customers with IAM Identity Center configured in N. Virginia (us-east-1) Region were also unable to sign in using Identity Center. Customers using their root credential, and customers using identity federation configured to use signin.aws.amazon.com experienced errors when trying to log into the AWS Management Console in regions outside of the N. Virginia (us-east-1) Region. As DynamoDB endpoints became accessible, the service began normal operations.</p>
</blockquote>
<p>I’d imagine IAM and STS would get their own independent systems, given that they’re also used to control access to DynamoDB, the very service they turn out to depend on.</p>
<p>IAM and STS being dependent on N. Virginia DynamoDB raises one interesting reliability issue: if you're using any AWS system, in any region, that depends on IAM, your service may be affected when us-east-1 has issues, particularly during extended outages.</p>
<p>It’s reasonable to be multi-region to minimize latency, to meet regulatory data location requirements and to keep multiple copies of data with greater physical separation (in case of geopolitical events, for example). But as long as you depend on IAM (as most stuff running on AWS does), if N. Virginia goes down, your service may suffer as well, regardless of the region you set it up in.</p>
<p>Granted, I’m a mere outside observer; but as an AWS customer, I’d really prefer that they did some work on the IAM architecture so that workloads in different regions wouldn’t depend on authentication being available in N. Virginia.</p>
<h3 id="heading-redshift">Redshift</h3>
<blockquote>
<p>Redshift query processing relies on DynamoDB endpoints to read and write data from clusters. As DynamoDB endpoints recovered.</p>
</blockquote>
<p>Redshift query processing depends on DynamoDB: fair.</p>
<blockquote>
<p>Redshift automation triggers workflows to replace the underlying EC2 hosts with new instances. With EC2 launches impaired, these workflows were blocked, putting clusters in a “modifying” state that prevented query processing and making the cluster unavailable for workloads.</p>
</blockquote>
<p>Redshift compute also comes from EC2: fair.</p>
<blockquote>
<p>Amazon Redshift customers in all AWS Regions were unable to use IAM user credentials for executing queries due to a Redshift defect that used an IAM API in the N. Virginia (us-east-1) Region to resolve user groups. As a result, IAM’s impairment during this period caused Redshift to be unable to execute these queries. Redshift customers in AWS Regions who use “local” users to connect to their Redshift clusters were unaffected.</p>
</blockquote>
<p>As a customer, this inter-region dependency due to IAM is something I’d rather AWS worked on.</p>
<h2 id="heading-event-response">Event Response</h2>
<blockquote>
<p>We are making several changes as a result of this operational event. We have already disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide. In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans.</p>
</blockquote>
<p>I don't see how they could just disable this automation. Maybe they reactivated some older, non-concurrent system not susceptible to the race condition?</p>
<p>It seems fair to assume that DynamoDB DNS entries need continuous updates as servers enter and exit the fleet, and that the scale of this is on the order of at least hundreds per day (if there are thousands of DNS records, hundreds of updates per day seems conservative).</p>
<p>If that’s the case, I don’t believe this could be stopped or turned into a manual process overnight.</p>
<blockquote>
<p>For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover. For EC2, we are building an additional test suite to augment our existing scale testing, which will exercise the DWFM recovery workflow to identify any future regressions. We will improve the throttling mechanism in our EC2 data propagation systems to rate limit incoming work based on the size of the waiting queue to protect the service during periods of high load.</p>
</blockquote>
<p>LGTM.</p>
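<p>The queue-size-based throttling they describe is simple enough to sketch. This toy version is my own illustration, not AWS's implementation; all names and limits are made up:</p>

```python
import queue

class BackpressureQueue:
    """Admit new work only while the backlog stays below a threshold,
    so a recovering service isn't buried under its own waiting queue.
    (Illustrative sketch; names and limits are mine, not AWS's.)"""

    def __init__(self, max_waiting: int):
        self.max_waiting = max_waiting
        self._q = queue.Queue()

    def try_submit(self, item) -> bool:
        if self._q.qsize() >= self.max_waiting:
            return False  # shed load: caller retries later with backoff
        self._q.put(item)
        return True

    def take(self):
        return self._q.get_nowait()

q = BackpressureQueue(max_waiting=2)
print([q.try_submit(i) for i in range(4)])  # [True, True, False, False]
q.take()                 # a worker drains one item...
print(q.try_submit(4))   # ...and admission resumes: True
```

<p>Rejecting work at the front door keeps the queue bounded, so recovery time stays proportional to capacity rather than to however much work piled up during the outage.</p>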
<h2 id="heading-final-comments">Final Comments</h2>
<p>In my view, AWS did an amazing job with this report, sharing great details of its inner workings, as a way of showing respect and commitment to customers affected by this outage. The fact that about a third of all Internet services run on top of it is a testament to the confidence the market has placed in them. And it amazes me that such systems exist and are able to power the Internet despite their complexity.</p>
<p>Despite some critical comments written above, AWS, like other cloud providers, is a feat of engineering. And I can hardly imagine how much pressure their personnel felt in dealing with this incident while large portions of the Internet were unavailable. These are certainly remarkable professionals.</p>
]]></content:encoded></item><item><title><![CDATA[It's Time to Write Tests]]></title><description><![CDATA[Introduction
I've been working on a reminder assistant that operates through WhatsApp. Essentially, you ask to be reminded of something at a certain point in time, and you receive a message when that time arrives. There are other features like recurr...]]></description><link>https://blog.lls-software.com/its-time-to-write-tests</link><guid isPermaLink="true">https://blog.lls-software.com/its-time-to-write-tests</guid><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[Testing]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[llm]]></category><category><![CDATA[Quality Assurance]]></category><category><![CDATA[Reliability Engineering]]></category><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Thu, 11 Sep 2025 21:04:08 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>I've been working on a reminder assistant that operates through WhatsApp. Essentially, you ask to be reminded of something at a certain point in time, and you receive a message when that time arrives. There are other features like recurring reminders, but that's not the focus here. One thing that proved surprisingly challenging is how humans refer to and interpret time references. For computers, you usually want something strict and well-defined: a specific date, time, or a pattern you can match against, like a cron expression. But humans communicate time rather ambiguously—and it works!</p>
<p>Let's start with a simple example: “remind me to pick up my son at 9.” Is it 9 AM or 9 PM? Today? Tomorrow? Every day? We might think of a simple rule to resolve this: the next occurrence of 9, whether AM or PM. So if it's 8 PM now, we mean 9 PM, but if it's 10 PM now, it's 9 AM tomorrow. And it's recurring if we say something like “every” or “every day.” But what if the reminder is for “at 3”? Assuming it's past 3 PM, does this mean picking up my son at 3 AM? Unlikely. We can improve the rules for that scenario... and I've tried. But for every rule you add, a simple counter-example can be easily found in something humans say and understand naturally. There's a lot of “common sense” that goes into figuring it out.</p>
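<p>To make the fragility concrete, here's a minimal sketch (mine, not the service's actual code) of the "next occurrence" rule described above:</p>

```python
from datetime import datetime, timedelta

def next_occurrence(hour_12: int, now: datetime) -> datetime:
    """Naive rule: a bare hour like 'at 9' means the next time the
    clock reads 9, whether that turns out to be AM or PM."""
    candidates = []
    for h in (hour_12 % 12, hour_12 % 12 + 12):  # e.g. 9 AM and 9 PM
        c = now.replace(hour=h, minute=0, second=0, microsecond=0)
        if c <= now:
            c += timedelta(days=1)  # already passed today; take tomorrow's
        candidates.append(c)
    return min(candidates)

# At 8 PM, "at 9" resolves to 9 PM today; at 10 PM, to 9 AM tomorrow.
print(next_occurrence(9, datetime(2025, 9, 11, 20, 0)))  # 2025-09-11 21:00:00
print(next_occurrence(9, datetime(2025, 9, 11, 22, 0)))  # 2025-09-12 09:00:00

# But the same rule turns "pick up my son at 3", asked at 4 PM,
# into 3 AM tomorrow: almost certainly not what the speaker meant.
print(next_occurrence(3, datetime(2025, 9, 11, 16, 0)))  # 2025-09-12 03:00:00
```

<p>Every patch to the rule just invites the next counter-example.</p>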
<p>Because of that, I ended up using an LLM to translate what the customer means into something a computer can work with. More specifically, I tried Anthropic's Haiku 3.5, which worked most of the time but not always, and ended up settling on Sonnet 4, which in my tests was able to figure it out properly for all the samples I could come up with. But note what I just wrote: "in my tests... all the samples I could come up with."</p>
<h2 id="heading-my-take-on-software-tests">My Take on Software Tests</h2>
<p>I'm an electrical engineer, and I had a fascinating software development course in college with a strong focus on testing. The course involved developing software to control an elevator. It had the usual interface: users outside could push buttons to call the elevator, users inside could push buttons to select their floor, buttons could be pushed multiple times, and the elevator needed to switch directions, accelerate, decelerate, etc. We were tasked with building it modularly, with unit tests for each function and method, checking that for each interaction, the outputs and internal state were correct.</p>
<p>After college, I ended up in a trainee program at Embraer, where I had the opportunity to observe an aircraft fuselage under test for seven years. For airplanes, their lifespans are measured in cycles of pressurization and depressurization. Regulations stipulate that no aircraft can fly with more cycles than those tested at the factory (the tests don't have to be completed before the first delivery, just stay ahead of the operators). So if an aircraft will accumulate, say, 60,000 cycles over its lifetime, it should undergo at least 60,000 pressurization cycles at the factory, with instrumentation and regular checks for cracks and material fatigue. This leads to maintenance and correction bulletins being written for the aircraft.</p>
<p>But the two examples have a few differences from how most software development happens in real life:</p>
<ul>
<li><p>There's usually no clear and fixed specification for how most software should behave in all scenarios. There are cases where this does exist—for example, a vehicle controller, a video encoder following a certain specification, or the implementation of an API that should conform to a specific standard. But for most projects, the software specification is a living thing, which is why we have OTA updates, continuous deployment, A/B testing, etc.</p>
</li>
<li><p>Modern software is usually built from many parts, each with numerous possible states and failure modes. Think of a TCP connection: there are 11 possible states, packet losses, latency issues, etc. Building upon this, we have higher-level protocols, services, and entire applications behind them, turning this into a recursive problem. The result is that most applications have essentially an infinite state space once all of their parts are composed.</p>
</li>
<li><p>Most application failures aren't life-threatening. Let's say Google goes offline—I can't remember the last time that happened. Google is pretty important, and Google Search has almost utility status in modern society. People just expect it to be there, available and running. But it's unlikely that anyone has ever died or will ever die from Google Search being unavailable. If Tesla's autopilot crashes at the wrong time, though, people could be in real danger—even more so with a flight control system. The majority of applications have neither Google Search's utility status nor a flight control system's criticality.</p>
</li>
<li><p>Most software applications are more business-sensitive to lack of innovation than to lack of long-tail reliability.</p>
</li>
<li><p>Most applications aren't standardized like aircraft, where you can create and evolve one standard set of tests that over time can be used to increase the reliability of the entire industry.</p>
</li>
</ul>
<p>There are probably other differences we could enumerate, but because of the differences above, I believe that for most software, using tests to cover its state space is rather cumbersome—that is, possibly of infinite cost—and inadequate given the consequences associated with the risk of failure.</p>
<p>On the other hand, the code base our applications run on is finite, so one might argue that test suites should aim for high coverage. But even with 100% coverage, formal logic dictates that if a test fails, there is an error either in the test or in the software under test, but if it passes, there might or might not be an error in the code being tested, the test itself, or both.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Program with 100% test coverage</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_even</span>(<span class="hljs-params">number: int</span>) -&gt; bool:</span>
    <span class="hljs-keyword">return</span> number == <span class="hljs-number">2</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_is_even_true</span>() -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">assert</span> is_even(<span class="hljs-number">2</span>) <span class="hljs-keyword">is</span> <span class="hljs-literal">True</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_is_even_false</span>() -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">assert</span> is_even(<span class="hljs-number">3</span>) <span class="hljs-keyword">is</span> <span class="hljs-literal">False</span>
</code></pre>
<p>So, should we just abandon the notion of correct, reliable software? In my view, no. As illustrated above, if we write small, clear, well-defined functions with specific purposes, logical inspection is superior to test coverage as a way of ensuring software quality.</p>
<p>On a small note regarding such heresy, I don't intend to say that all tests are useless in all projects. In FOSS projects, for example, with a large number of contributors where each author has limited understanding of the whole, but the correct behavior is well agreed upon, and developer skill levels vary, a growing body of tests is, in my view, a good way to prevent the introduction of bugs previously envisioned or corrected by earlier developers.</p>
<h2 id="heading-but-ai-is-different">But AI is Different</h2>
<p>If regular software can be inspected for logic errors, AI systems cannot. Despite all the interpretability efforts, AI systems are usually considered black boxes—probabilistic systems that we know work most of the time for a certain subset of problems, but that largely can't be inspected for output errors. Remember when Google Photos used to misclassify pictures of humans as animals? Or when Gemini would generate ethnically diverse images of German soldiers in World War II? These are just well-known examples of large-scale failures from a company that certainly doesn't lack technical resources or personnel for developing some of the best technology in the world.</p>
<p>But failures in AI deployment aren't constrained to such obvious examples.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757613781325/76bb7d15-44c8-4398-841d-03116d53f9f3.png" alt class="image--center mx-auto" /></p>
<p>Above is what should be a montage of recent complaints about degraded performance for Claude Code. The montage was created using Google AI Studio, and interestingly enough, has several errors 😂.</p>
<p>In any case, Anthropic has since added a notice to their status page:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757614022472/50e96cb0-3a8a-44ff-8e81-4af4a01612ca.png" alt="Investigating - Last week, we opened an incident to investigate degraded quality in some Claude model responses. We found two separate issues that we’ve now resolved. We are continuing to monitor for any ongoing quality issues, including reports of degradation for Claude Opus 4.1.  Resolved issue 1 - A small percentage of Claude Sonnet 4 requests experienced degraded output quality due to a bug from Aug 5-Sep 4, with the impact increasing from Aug 29-Sep 4. A fix has been rolled out and this incident has been resolved.   Resolved issue 2 - A separate bug affected output quality for some Claude Haiku 3.5 and Claude Sonnet 4 requests from Aug 26-Sep 5. A fix has been rolled out and this incident has been resolved.  Importantly, we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.   We're grateful to the detailed community reports that helped us identify and isolate these bugs. We're continuing to investigate and will share an update by the end of the week." class="image--center mx-auto" /></p>
<p>And I totally buy it. I don't think Anthropic had any intention of degrading their models. If anything, Anthropic is, in my view, one of the most transparent AI labs, publishing extensive research on their models, educating people on their limitations and how to better use them, even when it doesn't necessarily favor them in some aspects.</p>
<p>Unfortunately, to my knowledge, there's no publicly available information on what caused this quality degradation, but one word catches my attention: "quality."</p>
<ul>
<li><p>Was there an outage? No.</p>
</li>
<li><p>Was there an increased number of "errors," in the sense of API 500 errors, for example? No.</p>
</li>
<li><p>Did the models provide a completion to users' inputs? Yes.</p>
</li>
</ul>
<p>From my engineering background, I understand the general concept of "quality" as how much a manufactured good adheres to a given specification. Intuitively, we know that quality goes beyond manufacturing, and we have a tacit understanding of what it means. But in the brief research I did for this article, I found it quite interesting how difficult it is to find a suitable definition of quality on Wikipedia that matches this problem:</p>
<ul>
<li><p>“Quality often focuses on manufacturing defects during the warranty phase”</p>
</li>
<li><p>“Inherent degree of excellence”</p>
</li>
<li><p>“Conformance to requirements or specifications at the start of use“</p>
</li>
<li><p>“Fraction of product units shipped that meet specifications“</p>
</li>
<li><p>“Number of warranty claims during the warranty period”</p>
</li>
<li><p>“Non-conformance with a requirement (e.g., basic functionality or a key dimension)”</p>
</li>
</ul>
<p>These are mostly Six Sigma-related definitions, but definitions found in "software quality" articles seemed to me equally or more inadequate.</p>
<p>From the above, the best fit came from "inherent degree of excellence," which is also the fuzziest and least applicable. And the notice from Anthropic shows this, as they mention their monitoring includes "reports of degradation."</p>
<h2 id="heading-different-tests-for-different-reasons">Different Tests for Different Reasons</h2>
<p>Back to the reminder service I’m running: around the same time such “quality” issues were reported with Claude, I noticed a few reminders I had requested being interpreted in an odd manner, such as “pick up my son at 9” changing from a single event to a recurring event. To be fair, these also happened around the same time I made some system prompt changes to fix a different issue—though seemingly unrelated to this matter.</p>
<p>Unfortunately, I currently don't have any specific tests in place to monitor this precisely. So did the interpretation of <strong>some</strong> time inputs change because my prompt changed or because something in the model changed? While I did minor testing when I changed the system prompt, I don't have a comprehensive battery of tests, and fortunately my service is still small enough that I'm able to notice the degradation and manually investigate it.</p>
<p>But I'm taking a lesson from this: if you're running LLMs in production, there must be some constructed metric that determines whether a completion is within your application's definition of correct, and tests for this should be run at both regular time intervals (to catch statistical deviations from the model provider) and whenever any input changes are made, even seemingly unrelated ones (to catch statistical deviations due to the change).</p>
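<p>A sketch of what such a harness could look like, with a deterministic stand-in where the real LLM call would go (all names here are hypothetical, not my production code):</p>

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for the production call that sends the phrase
# (and the current time) to the LLM and parses its structured reply.
# A deterministic toy rule keeps the sketch self-contained and runnable.
def interpret_reminder(phrase: str, now: datetime) -> dict:
    hour = int(phrase.rsplit(" ", 1)[-1])  # assumes the phrase ends with the hour
    candidates = []
    for h in (hour % 12, hour % 12 + 12):  # e.g. 9 AM and 9 PM
        c = now.replace(hour=h, minute=0, second=0, microsecond=0)
        if c <= now:
            c += timedelta(days=1)
        candidates.append(c)
    return {"when": min(candidates), "recurring": "every" in phrase}

# Golden set: (phrase, "now", expected time, expected recurring flag).
GOLDEN = [
    ("pick up my son at 9", datetime(2025, 9, 11, 20, 0),
     datetime(2025, 9, 11, 21, 0), False),
    ("every day pick up my son at 9", datetime(2025, 9, 11, 22, 0),
     datetime(2025, 9, 12, 9, 0), True),
]

def eval_pass_rate() -> float:
    passed = 0
    for phrase, now, when, recurring in GOLDEN:
        got = interpret_reminder(phrase, now)
        passed += int(got["when"] == when and got["recurring"] == recurring)
    return passed / len(GOLDEN)

rate = eval_pass_rate()
print(f"pass rate: {rate:.0%}")
assert rate >= 0.95, "completion quality drifted outside the accepted band"
```

<p>The point isn't this particular rule; it's that the same golden set runs on a schedule and after every prompt or model change, and alerts when the pass rate drops below the accepted band.</p>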
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757622115406/443f3c7a-db44-469e-8ea7-d508fb14a04a.png" alt="Humanoid taking Walk and Turn test" class="image--center mx-auto" /></p>
<p>Unlike strict software correctness tests, these should be more like field sobriety tests, where we're not measuring if the person has complete dexterity or absolutely correct pace, but whether the model is behaving statistically within what we consider to be normally accepted behavior for the application.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Regardless of which camp you're in regarding software quality assurance methods, AI technologies bring us to a different arena, where we're no longer dealing with mostly deterministic systems. Variability is expected, just like with human beings, and systems may subtly escape fuzzy definitions of quality. We must then strive to develop solutions to monitor this variability so that we can identify problems and their sources before they impact customers, and be able to openly disclose malfunctions to customers when they happen.</p>
]]></content:encoded></item><item><title><![CDATA[A path for improving LLM coding tools]]></title><description><![CDATA[Introduction
One thing you might notice when working with LLM-based coding tools is how much they struggle to get the right information into their context windows. LLMs usually know almost everything there is to know about libraries, code patterns, a...]]></description><link>https://blog.lls-software.com/a-path-for-improving-llm-coding-tools</link><guid isPermaLink="true">https://blog.lls-software.com/a-path-for-improving-llm-coding-tools</guid><category><![CDATA[claude-code]]></category><category><![CDATA[llm]]></category><category><![CDATA[SQL]]></category><category><![CDATA[coding]]></category><category><![CDATA[Python]]></category><category><![CDATA[PostgreSQL]]></category><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Tue, 09 Sep 2025 20:11:24 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>One thing you might notice when working with LLM-based coding tools is how much they struggle to get the right information into their context windows. LLMs usually know almost everything there is to know about libraries, code patterns, algorithms, language syntax, etc. When you ask them to plan and build a greenfield application—provided you specify correctly what you want—they usually do a fairly good job. However, if you ask them to change a specific feature in a somewhat large codebase, they often struggle to find their way around, frequently introducing bugs and duplicating code.</p>
<p>When using Claude Code, I've noticed two patterns: it either searches for information by reading entire files, or it reads small chunks using tail/head/grep. Now imagine if any human had to work like this—constantly choosing between bringing entire documents into your mind (your "context window") or sipping bits of information through trial and error as you try to find your way around. In this mode, there's no opportunity to quickly scan documents with your eyes, spend more time on one section, quickly discard what you've just read, or compress previously read information so you know where to return if needed. And I'm not even trying to be exhaustive about this problem.</p>
<h2 id="heading-not-only-an-llm-struggle">Not only an LLM struggle</h2>
<p>Now, I don't know about other developers, but I don't think LLMs are the only ones to struggle with code organization and navigation. We've come a long way with search engines and all, but with code it still feels like we're using abstractions that were adequate back when the first higher-level programming languages were created, some 30 or 40 years ago. As a developer who has used many editors—from Notepad in the '90s to PyCharm nowadays—I should say I got an immense productivity boost when I adopted JetBrains tools. While a range of IDE and text editor options exist nowadays, one thing about JetBrains IDEs keeps me locked in: their amazing capability for indexing and tracking references in a codebase. It's obviously a good idea to stay organized, especially when collaborating with others—or with your future self—but once you use a tool like that and give the IDE the proper syntactic clues (type annotations and such), navigating a codebase and finding what calls what becomes mostly a matter of ⌘-clicking function calls to see where they go, often even through library code.</p>
<h2 id="heading-a-different-abstraction">A different abstraction</h2>
<p>But thinking about this, what I believe the IDE is doing is essentially emulating an abstraction where we're no longer using files. There are variables, functions, and classes, and they do need to be written somewhere in a specific order for the compiler or interpreter, but when navigating them, we're jumping between them as if they were entities not tied to any specific location—simply nodes in a graph. The code could just as well be written in a single file, tape, or context window—as long as you can avoid name collisions (through some sort of namespacing) and quickly find and retrieve the right 'item,' you're done.</p>
<p>But why should only humans get to have nice things? Sure, taking this reasoning to its limits, we might conclude that the problem lies with modern programming languages themselves. We have modules, packages, crates, and all sorts of file-based code organization, but to my knowledge, none require each node to have its own file while being easily referenceable through the filesystem. Following this logic, only by creating a new language free of such design flaws could we solve the problem. Or perhaps this problem isn't even worth solving, or solving it would create even less manageable problems than the one described above: imagine the entire Linux kernel source code after preprocessing, in a single file, with every contributor working on part of this monolithic system. I'm certainly not proposing that.</p>
<p>On the other hand, there does exist a system where a "single file" isn't messy at all. Consider an SQL database—I'll use Postgres since it's the one I'm most familiar with. You have a database that contains schemas (think namespaces), which contain a myriad of objects: tables, views, procedures, constraints, indexes, etc. You may have a single-digit number of any of those, or hundreds of them. Within each table you have rows that might reference other rows, and within procedures you can reference other procedures in the same or different schemas, tables, and so on. None of this is tied to a "file"—you never grep a row and end up getting nearby rows as well, or end up getting just part of the row. Everything is directly addressable.</p>
<p>Obviously, this abstraction isn't free—there are files underneath it, along with a planner, indices, and sophisticated saving, syncing, and all sorts of machinery to keep this abstraction running. But compared to the work an LLM has to do by searching code through grep/head/tail and sending the results over the web, where GPU clusters process this through millions or billions of weights just to figure out that the function we need to edit is on line 27 of <code>something.py</code>, extending down to line 42—a relational database seems like a pretty cheap abstraction to me.</p>
<p>And if we're going down the rabbit hole, we could eventually even have the entire AST mapped through relationships, so that renaming anything could be easily achieved with an UPDATE, and bad references could be prevented by foreign keys.</p>
<h2 id="heading-back-to-reality">Back to reality</h2>
<p>But this thought experiment is going too far, and going that deep will certainly reveal problems I'm not even aware of—problems that someone with compiler experience will probably spot instantly.</p>
<p>But maybe there's a version of the solution that addresses a large chunk of the problem with only a small subset of the complexity: what if we just indexed namespaces?</p>
<p>Suppose an LLM could query:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">method</span> <span class="hljs-keyword">FROM</span> classes <span class="hljs-keyword">WHERE</span> class_name = <span class="hljs-string">'MyClass'</span> <span class="hljs-keyword">AND</span> module = <span class="hljs-string">'mypackage.mymodule'</span>;
</code></pre>
<p>Or even <code>SELECT body</code> for that class if that's what the model wanted to see? Or select line numbers?</p>
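<p>As a proof of concept, Python's own <code>ast</code> module plus SQLite (standing in for Postgres to keep the sketch self-contained) is enough to build such an index; the schema and names below are illustrative, not an existing tool:</p>

```python
import ast
import sqlite3

# Sketch: walk a module's AST and index its classes and methods into a
# relational table a coding tool could query directly.
SOURCE = '''
class MyClass:
    def greet(self):
        return "hi"

    def farewell(self):
        return "bye"
'''

def index_module(module: str, source: str, conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS classes"
        " (module TEXT, class_name TEXT, method TEXT, lineno INTEGER)"
    )
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    conn.execute(
                        "INSERT INTO classes VALUES (?, ?, ?, ?)",
                        (module, node.name, item.name, item.lineno),
                    )

conn = sqlite3.connect(":memory:")
index_module("mypackage.mymodule", SOURCE, conn)
rows = conn.execute(
    "SELECT method FROM classes"
    " WHERE class_name = 'MyClass' AND module = 'mypackage.mymodule'"
    " ORDER BY lineno"
).fetchall()
print([r[0] for r in rows])  # ['greet', 'farewell']
```

<p>With the index populated, the hypothetical query above works essentially verbatim, line numbers included.</p>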
<h2 id="heading-conclusion">Conclusion</h2>
<p>I believe that while the file abstraction might be too ingrained in our work to move away from, and while pulling the entire AST into a relational database might be too much to start with, indexing namespaces in a relational database might be a good start. I also wonder why JetBrains, which has already done much of the hard work, doesn't provide such an interface to LLMs working within their IDEs.</p>
<p>In any case, making this kind of index available through an MCP server doesn't seem like a huge endeavor, and it might be something I try in the near future—which makes me think: why doesn't this exist yet? Maybe I'll find out when I try to implement it.</p>
]]></content:encoded></item><item><title><![CDATA[AWS Lambda Cold Starts: Real-World Cost Optimization]]></title><description><![CDATA[Note on AI usage: this blog post was initially sent in a conversation format to a friend through Signal and converted into a blog post by Claude Opus 4. The content of it was created by myself, though the wording and the blog post format is by Claude...]]></description><link>https://blog.lls-software.com/aws-lambda-cold-starts-real-world-cost-optimization</link><guid isPermaLink="true">https://blog.lls-software.com/aws-lambda-cold-starts-real-world-cost-optimization</guid><category><![CDATA[AWS]]></category><category><![CDATA[aws lambda]]></category><category><![CDATA[cost-optimisation]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[Event Scheduling]]></category><category><![CDATA[AWS EventBridge]]></category><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Sat, 07 Jun 2025 18:11:51 GMT</pubDate><content:encoded><![CDATA[<p><em>Note on AI usage: this blog post was initially sent in a conversation format to a friend through Signal and converted into a blog post by Claude Opus 4. The content of it was created by myself, though the wording and the blog post format is by Claude. If this doesn’t work for you, you’re free to skip it.</em></p>
<p>When working with serverless architectures, one of the key performance considerations is the cold start problem. I'll share a practical example from a housekeeping Lambda function that revealed some useful insights about balancing performance and cost in AWS Lambda.</p>
<h2 id="heading-the-setup-a-simple-housekeeping-function">The Setup: A Simple Housekeeping Function</h2>
<p>I recently implemented a Lambda function that performs routine housekeeping tasks in our system. I configured it to run every 3 minutes using Amazon EventBridge (formerly CloudWatch Events). The function is configured with minimal resources - just 128 MB of memory - since the task is relatively lightweight.</p>
<h2 id="heading-measuring-cold-start-vs-warm-start-performance">Measuring Cold Start vs Warm Start Performance</h2>
<p>After deploying the function, I collected performance metrics from CloudWatch Logs. The results showed a significant difference:</p>
<p><strong>Cold Start Performance:</strong></p>
<pre><code class="lang-plaintext">REPORT RequestId: 09684470-a709-4e0a-9bbf-e6b5fa67b808 
Duration: 2304.58 ms 
Billed Duration: 2305 ms 
Memory Size: 128 MB 
Max Memory Used: 104 MB 
Init Duration: 702.59 ms
</code></pre>
<p><strong>Warm Start Performance:</strong></p>
<pre><code class="lang-plaintext">REPORT RequestId: 09684472-c309-4e0a-9bbf-e6b5fa67b808 
Duration: 69.60 ms 
Billed Duration: 70 ms 
Memory Size: 128 MB 
Max Memory Used: 104 MB
</code></pre>
<p>The difference is notable: cold starts take approximately 2,300ms, while warm starts complete in just 70ms. That's a 33x performance difference.</p>
<h2 id="heading-the-bimodal-nature-of-lambda-scheduling">The Bimodal Nature of Lambda Scheduling</h2>
<p>This performance characteristic creates an interesting optimization problem. AWS Lambda keeps functions "warm" (ready to execute without initialization) for a limited time after execution - typically around 3-5 minutes, though this isn't guaranteed.</p>
<p>This means we have two distinct scheduling strategies:</p>
<ol>
<li><p><strong>Frequent Execution (Every 3 minutes):</strong> The function stays warm, executing in 70ms each time</p>
</li>
<li><p><strong>Infrequent Execution (Every 1-2 hours):</strong> The function experiences a cold start each time, taking 2,300ms</p>
</li>
</ol>
<p>Here's the interesting part: from a billing perspective, running the function 33 times with warm starts costs roughly the same as running it once with a cold start (2,300ms ÷ 70ms ≈ 33).</p>
<h2 id="heading-the-cost-analysis">The Cost Analysis</h2>
<p>Let's break down the actual costs for running this function every 3 minutes:</p>
<p><strong>Monthly execution count:</strong></p>
<ul>
<li>30 days × 24 hours × 20 executions per hour = 14,400 executions per month</li>
</ul>
<p><strong>AWS Lambda Pricing (outside free tier):</strong></p>
<ul>
<li><p>Compute: $0.0000133334 per GB-second</p>
</li>
<li><p>Requests: $0.20 per million requests</p>
</li>
</ul>
<p><strong>Monthly cost calculation:</strong></p>
<p>For compute charges:</p>
<pre><code class="lang-plaintext">14,400 executions × 70ms × (1s/1000ms) × 128MB × (1GB/1024MB) × $0.0000133334/GB-s 
= $0.00168 per month
</code></pre>
<p>For request charges:</p>
<pre><code class="lang-plaintext">14,400 × $0.20/1,000,000 
= $0.00288 per month
</code></pre>
<p><strong>Total monthly cost: $0.00456</strong></p>
<p>That's less than half a cent per month for 14,400 executions.</p>
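<p>The whole comparison is easy to replay with a small model of the numbers above (prices as quoted in this post; adjust for your own region and architecture):</p>

```python
# Prices as quoted above: $0.0000133334/GB-s compute,
# $0.20 per million requests, 128 MB memory.
GB_SECOND = 0.0000133334
PER_MILLION_REQUESTS = 0.20
MEMORY_GB = 128 / 1024

def monthly_cost(interval_minutes: float, billed_ms: float) -> tuple[float, float]:
    """Return (compute, request) charges for one run every interval_minutes."""
    executions = 30 * 24 * 60 / interval_minutes
    compute = executions * (billed_ms / 1000) * MEMORY_GB * GB_SECOND
    requests = executions * PER_MILLION_REQUESTS / 1_000_000
    return compute, requests

warm_compute, warm_req = monthly_cost(3, 70)     # warm run every 3 minutes
cold_compute, cold_req = monthly_cost(96, 2305)  # cold run every ~1.6 hours

print(f"warm: compute ${warm_compute:.5f} + requests ${warm_req:.5f}")
print(f"cold: compute ${cold_compute:.5f} + requests ${cold_req:.5f}")
```

<p>The compute charges for the two schedules come out nearly identical, which is the equivalence this post is about; one nuance the model surfaces is that at this tiny scale the flat per-request fee is actually the larger line item on the frequent schedule.</p>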
<h2 id="heading-key-takeaways-and-optimization-strategies">Key Takeaways and Optimization Strategies</h2>
<p>The analysis reveals a fundamental characteristic of Lambda scheduling that has direct implications for cost optimization:</p>
<ol>
<li><p><strong>Lambda execution is bimodal:</strong> You essentially have two choices - run your function frequently enough to keep it warm (every ~3 minutes), or accept cold starts and run it infrequently. There's no middle ground that makes economic sense.</p>
</li>
<li><p><strong>Cost equivalence between strategies:</strong> Due to the 33x performance difference between cold and warm starts, running a function every 3 minutes with warm starts costs approximately the same as running it every 1.6 hours with cold starts. This creates an interesting economic equivalence where you can choose based on your needs rather than cost.</p>
</li>
<li><p><strong>The intermediate interval trap:</strong> The worst possible choice is to schedule your function at intermediate intervals like 15 minutes. At this frequency, the function will have gone cold between executions, so you're paying for cold starts on every run while still executing frequently. You get neither the benefit of warm execution nor the reduced frequency of accepting cold starts.</p>
</li>
<li><p><strong>Practical scheduling decisions:</strong> Given this bimodal nature, your scheduling strategy should be binary:</p>
<ul>
<li><p>If you need consistent low latency: Schedule every 2-3 minutes to maintain warm state</p>
</li>
<li><p>If latency isn't critical: Schedule every 1-2+ hours and accept cold starts</p>
</li>
<li><p>Never schedule in the 5-30 minute range unless you have a specific reason</p>
</li>
</ul>
</li>
</ol>
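<p>The intermediate-interval trap is easy to see with a quick comparison sketch (the 70ms warm and 2,300ms cold durations are the measurements from earlier; the interval labels are illustrative):</p>

```python
PRICE_GB_S = 0.0000133334     # $ per GB-second
PRICE_REQ = 0.20 / 1_000_000  # $ per request
MEMORY_GB = 128 / 1024        # 128MB function

def monthly_cost(interval_min: float, duration_ms: float) -> float:
    """Monthly cost of a scheduled function over a 30-day month."""
    executions = 30 * 24 * 60 / interval_min
    compute = executions * (duration_ms / 1000) * MEMORY_GB * PRICE_GB_S
    return compute + executions * PRICE_REQ

warm_3min = monthly_cost(3, 70)      # frequent enough to stay warm
cold_15min = monthly_cost(15, 2300)  # goes cold between runs: the trap
cold_99min = monthly_cost(99, 2300)  # ~1.65 hours, the cold "equivalent"

print(f"3min: ${warm_3min:.5f}  15min: ${cold_15min:.5f}  99min: ${cold_99min:.5f}")
```

<p>Despite running five times less often, the 15-minute schedule ends up costing more than the 3-minute one, because every single run pays the cold-start premium.</p>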
<p>The mathematics here are straightforward but the implications are significant. Understanding that Lambda pricing creates these two distinct optimal operating modes allows you to make informed decisions rather than picking arbitrary intervals that might seem reasonable but are actually inefficient.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This real-world example demonstrates a counterintuitive truth about AWS Lambda: there are really only two cost-effective ways to schedule your functions. The 33x performance difference between cold and warm starts creates a bimodal optimization landscape where intermediate scheduling intervals are economically inefficient.</p>
<p>The math reveals that running a function every 3 minutes (warm) costs the same as running it every 1.6 hours (cold). This cost equivalence means you should make a binary choice: either commit to keeping your function warm with frequent executions, or space them far enough apart to justify the cold start overhead. Anything in between - like the seemingly reasonable 15-minute interval - gives you the worst of both worlds.</p>
<p>For this housekeeping function, I chose to run it every 3 minutes to keep it warm. The decision was easy for several reasons:</p>
<p>First, at less than half a cent per month for 14,400 executions, the cost is negligible. But more importantly, running the housekeeping task frequently provides significant operational benefits. The more often I run it, the less work accumulates between runs, which means each execution processes fewer items. This distributes the database load more evenly throughout the day instead of creating periodic spikes. Since the task syncs user data, more frequent runs also mean users see more consistent, up-to-date information across the system.</p>
<p>The lesson here goes beyond simple cost optimization. Understanding the bimodal nature of Lambda execution helps you see that sometimes what appears to be "over-scheduling" is actually the optimal choice when you consider the full system impact. When the economics make frequent and infrequent execution equivalent in cost, you're free to choose based on what benefits your application and users the most.</p>
]]></content:encoded></item><item><title><![CDATA[Making Error Paths Visible: Learning from Rust's Type System]]></title><description><![CDATA[Introduction
Back when I first started with Python, my first web framework wasn’t Django or Flask — it was Tornado Web. I’m not sure of all the exact reasons why I started with it, but I’m thankful to this day that I did.. Tornado had this unique way...]]></description><link>https://blog.lls-software.com/making-error-paths-visible-learning-from-rusts-type-system</link><guid isPermaLink="true">https://blog.lls-software.com/making-error-paths-visible-learning-from-rusts-type-system</guid><category><![CDATA[Python]]></category><category><![CDATA[Rust]]></category><category><![CDATA[exceptionhandling]]></category><category><![CDATA[typing]]></category><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Sun, 15 Dec 2024 22:46:30 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Back when I first started with Python, my first web framework wasn’t Django or Flask — it was Tornado Web. I’m not sure of all the exact reasons why I started with it, but I’m thankful to this day that I did. Tornado had a unique way of handling asynchronous operations, long before Python's asyncio came along.</p>
<p>In Tornado, you'd write async code like this:</p>
<pre><code class="lang-python"><span class="hljs-meta">@gen.coroutine</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">async_function</span>():</span>
    result = <span class="hljs-keyword">yield</span> some_async_operation()
    <span class="hljs-keyword">raise</span> gen.Return(result)
</code></pre>
<p>Tornado's approach was a rather clever exploit of the language, but it revealed something deeper to me: <code>return</code> and <code>raise</code> are fundamentally the same thing. One is just conventionally used for success cases and the other for errors. You could think of one as a generalization of the other, and a language would only need one of them. Let's examine two contrasting patterns that highlight this duality.</p>
<h3 id="heading-return-only-syntax">Return-only syntax</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">div</span>(<span class="hljs-params">a: float, b: float</span>) -&gt; float | ZeroDivisionError:</span>
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> ZeroDivisionError(<span class="hljs-string">"division by zero"</span>)
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> a / b

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>() -&gt; <span class="hljs-keyword">None</span>:</span>
    a = float(input())
    b = float(input())
    ans = div(a, b)
    <span class="hljs-keyword">if</span> isinstance(ans, ZeroDivisionError):
        print(<span class="hljs-string">"Oops, can't divide by zero!"</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">f"Result: <span class="hljs-subst">{ans}</span>"</span>)
</code></pre>
<p>While the return-based approach offers explicit error handling, we can achieve similar results using Python's traditional exception mechanism. Here's how the same logic looks using <code>raise</code>:</p>
<h3 id="heading-raise-only-syntax">Raise-only syntax</h3>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Return</span>(<span class="hljs-params">Exception</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, ans: float</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        self.ans = ans

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">div</span>(<span class="hljs-params">a: float, b: float</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">raise</span> ZeroDivisionError(<span class="hljs-string">"division by zero"</span>)
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> Return(a / b)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>() -&gt; <span class="hljs-keyword">None</span>:</span>
    a = float(input())
    b = float(input())
    <span class="hljs-keyword">try</span>:
        ans = div(a, b)
    <span class="hljs-keyword">except</span> ZeroDivisionError:
        print(<span class="hljs-string">"Oops, can't divide by zero!"</span>)
    <span class="hljs-keyword">except</span> Return <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Result: <span class="hljs-subst">{e.ans}</span>"</span>)
</code></pre>
<p>As you can see above, <code>raise</code> can be thought of as one of multiple possible return paths, and, similarly, <code>return</code> can be thought of as just one more possible exception.</p>
<p>As this realization stayed with me, and as Python incorporated type annotations, it bothered me that while there was clear syntax for annotating return types, there was none for annotating exceptions. Given how ubiquitous exceptions are in Python (they are not really something exceptional), and given that they are essentially the same thing from a theoretical perspective, leaving them out of the function's signature is similar to annotating the return type as <code>T | Any</code>:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">div</span>(<span class="hljs-params">a: float, b: float</span>) -&gt; float | Any:</span>
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> ZeroDivisionError(<span class="hljs-string">"division by zero"</span>)
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> a / b

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>() -&gt; <span class="hljs-keyword">None</span>:</span>
    a = float(input())
    b = float(input())
    ans = div(a, b)
    <span class="hljs-keyword">if</span> isinstance(ans, float):
        print(<span class="hljs-string">f"Result: <span class="hljs-subst">{ans}</span>"</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Oops! Who knows what happened?"</span>)
</code></pre>
<h2 id="heading-rusts-elegant-solution-a-unified-type-system">Rust's Elegant Solution: A Unified Type System</h2>
<p>These patterns reveal the fundamental similarity between returns and exceptions, but neither approach feels completely satisfactory. This is where Rust's type system offers an elegant solution. Instead of having separate mechanisms for success and error cases, or leaving error cases invisible in type signatures, Rust unifies everything into a single type system concept. It provides two main types for this purpose: <code>Option&lt;T&gt;</code> for simple success/failure cases, and <code>Result&lt;T, E&gt;</code> for cases where you want to specify what went wrong.</p>
<h2 id="heading-the-option-type-simple-success-or-failure">The Option Type: Simple Success or Failure</h2>
<p>Let's start with the simpler case. Sometimes you just need to express "it worked" or "it didn't" without additional detail. In Python, you might write:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_div</span>(<span class="hljs-params">a: float, b: float</span>) -&gt; float | <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
    <span class="hljs-keyword">return</span> a / b
</code></pre>
<p>Rust makes this pattern more explicit and type-safe with <code>Option&lt;T&gt;</code>:</p>
<pre><code class="lang-rust"><span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Option</span></span>&lt;T&gt; {
    <span class="hljs-literal">None</span>,
    <span class="hljs-literal">Some</span>(T),
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">safe_div</span></span>(a: <span class="hljs-built_in">f64</span>, b: <span class="hljs-built_in">f64</span>) -&gt; <span class="hljs-built_in">Option</span>&lt;<span class="hljs-built_in">f64</span>&gt; {
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0.0</span> {
        <span class="hljs-literal">None</span>
    } <span class="hljs-keyword">else</span> {
        <span class="hljs-literal">Some</span>(a / b)
    }
}
</code></pre>
<h2 id="heading-result-when-you-need-more-detail">Result: When You Need More Detail</h2>
<p>Remember our Python example where we returned either a float or a ZeroDivisionError? Rust's <code>Result</code> type captures this pattern perfectly:</p>
<pre><code class="lang-rust"><span class="hljs-class"><span class="hljs-keyword">enum</span> <span class="hljs-title">Result</span></span>&lt;T, E&gt; {
    <span class="hljs-literal">Ok</span>(T),
    <span class="hljs-literal">Err</span>(E),
}

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">div</span></span>(a: <span class="hljs-built_in">f64</span>, b: <span class="hljs-built_in">f64</span>) -&gt; <span class="hljs-built_in">Result</span>&lt;<span class="hljs-built_in">f64</span>, &amp;<span class="hljs-symbol">'static</span> <span class="hljs-built_in">str</span>&gt; {
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0.0</span> {
        <span class="hljs-literal">Err</span>(<span class="hljs-string">"division by zero"</span>)
    } <span class="hljs-keyword">else</span> {
        <span class="hljs-literal">Ok</span>(a / b)
    }
}
</code></pre>
<p>Unlike our Python examples where we had to choose between using return or raise, or where error conditions weren't visible in type signatures, Rust's approach:</p>
<ol>
<li><p>Makes all possible outcomes explicit in the type signature</p>
</li>
<li><p>Forces handling of both success and error cases</p>
</li>
<li><p>Unifies error handling into a single, consistent pattern</p>
</li>
</ol>
<h2 id="heading-the-power-of-exhaustive-matching">The Power of Exhaustive Matching</h2>
<p>Where this really shines is in how Rust forces you to handle all cases:</p>
<pre><code class="lang-rust"><span class="hljs-keyword">match</span> div(a, b) {
    <span class="hljs-literal">Ok</span>(value) =&gt; {
        <span class="hljs-built_in">println!</span>(<span class="hljs-string">"Division result: {}"</span>, value);
    },
    <span class="hljs-literal">Err</span>(message) =&gt; {
        <span class="hljs-built_in">println!</span>(<span class="hljs-string">"Error occurred: {}"</span>, message);
    },
}
</code></pre>
<p>If you forget to handle either case, the compiler will refuse to compile your code. This eliminates the kind of bugs we saw in our Python example where we had to remember to check the return type with <code>isinstance()</code>.</p>
<h2 id="heading-the-duality-of-returns-and-errors-a-unified-perspective">The Duality of Returns and Errors: A Unified Perspective</h2>
<p>These examples reveal a fundamental truth: functions can return multiple types of values - some representing success, others representing failure. The only real difference is in how we encode and handle these different paths.</p>
<p>We could do the same in Python as in Rust by using generics:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Generic, TypeVar

T = TypeVar(<span class="hljs-string">'T'</span>)
E = TypeVar(<span class="hljs-string">'E'</span>)

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Result</span>(<span class="hljs-params">Generic[T, E]</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, value: T | E, is_ok: bool</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        self._value = value
        self._is_ok = is_ok

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ok</span>(<span class="hljs-params">cls, value: T</span>) -&gt; 'Result[T, E]':</span>
        <span class="hljs-keyword">return</span> cls(value, <span class="hljs-literal">True</span>)

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">err</span>(<span class="hljs-params">cls, error: E</span>) -&gt; 'Result[T, E]':</span>
        <span class="hljs-keyword">return</span> cls(error, <span class="hljs-literal">False</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">div</span>(<span class="hljs-params">a: float, b: float</span>) -&gt; Result[float, str]:</span>
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> Result.err(<span class="hljs-string">"division by zero"</span>)
    <span class="hljs-keyword">return</span> Result.ok(a / b)
</code></pre>
<p>Similarly, for simpler cases where we just care about success or failure:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Option</span>(<span class="hljs-params">Generic[T]</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, value: T | None</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        self._value = value

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">some</span>(<span class="hljs-params">cls, value: T</span>) -&gt; 'Option[T]':</span>
        <span class="hljs-keyword">return</span> cls(value)

<span class="hljs-meta">    @classmethod</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">none</span>(<span class="hljs-params">cls</span>) -&gt; 'Option[T]':</span>
        <span class="hljs-keyword">return</span> cls(<span class="hljs-literal">None</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_div</span>(<span class="hljs-params">a: float, b: float</span>) -&gt; Option[float]:</span>
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> Option.none()
    <span class="hljs-keyword">return</span> Option.some(a / b)
</code></pre>
<h2 id="heading-the-power-of-making-paths-explicit">The Power of Making Paths Explicit</h2>
<p>This approach of encoding all possible outcomes in types, whether implemented in Python, Rust, or any other language, has several benefits:</p>
<ol>
<li><p>The function signature tells you everything about what the function might return, including error cases.</p>
</li>
<li><p>The compiler or type checker can verify that all cases are handled.</p>
</li>
<li><p>Error handling becomes a first-class concern in your code's architecture</p>
</li>
<li><p>There's no hidden control flow through exceptions</p>
</li>
</ol>
<p>The result is code that's safer, clearer and more maintainable.</p>
<h2 id="heading-pattern-matching-makes-it-clean">Pattern Matching Makes it Clean</h2>
<p>One reason Rust's implementation of this pattern is particularly elegant is its pattern matching syntax, but we could implement something similar in Python using match statements (Python 3.10+):</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">div</span>(<span class="hljs-params">a: float, b: float</span>) -&gt; Result[float, DivisionByZeroError | OverflowError]:</span>
    <span class="hljs-keyword">if</span> b == <span class="hljs-number">0</span>:
        <span class="hljs-keyword">return</span> Err(DivisionByZeroError())
    <span class="hljs-keyword">if</span> abs(a) &gt; <span class="hljs-number">1e308</span>:  <span class="hljs-comment"># Python's float max</span>
        <span class="hljs-keyword">return</span> Err(OverflowError())
    <span class="hljs-keyword">return</span> Ok(a / b)

match div(a, b):
    case Ok(value):
        print(<span class="hljs-string">f"Result: <span class="hljs-subst">{value}</span>"</span>)
    case Err(DivisionByZeroError()):
        print(<span class="hljs-string">"Can't divide by zero"</span>)
    case Err(OverflowError()):
        print(<span class="hljs-string">"Number too large"</span>)
</code></pre>
<h2 id="heading-type-systems-as-developer-tools">Type Systems as Developer Tools</h2>
<p>The real power of encoding error paths in types becomes apparent when working with modern development tools. Consider a large code base with multiple layers of abstraction:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Low level database function</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fetch_from_db</span>(<span class="hljs-params">user_id: str</span>) -&gt; Result[dict, ConnectionError | NotFoundError]:</span>
    <span class="hljs-comment"># Simulating DB access...</span>
    <span class="hljs-keyword">if</span> user_id == <span class="hljs-string">""</span>:
        <span class="hljs-keyword">return</span> Err(ConnectionError())
    <span class="hljs-keyword">if</span> user_id == <span class="hljs-string">"404"</span>:
        <span class="hljs-keyword">return</span> Err(NotFoundError())
    <span class="hljs-keyword">return</span> Ok({<span class="hljs-string">"id"</span>: user_id, <span class="hljs-string">"name"</span>: <span class="hljs-string">"John"</span>})

<span class="hljs-comment"># Mid level business logic</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">validate_user</span>(<span class="hljs-params">data: dict</span>) -&gt; Result[dict, ValidationError]:</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"name"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> data:
        <span class="hljs-keyword">return</span> Err(ValidationError())
    <span class="hljs-keyword">return</span> Ok(data)

<span class="hljs-comment"># High level workflow</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_user</span>(<span class="hljs-params">user_id: str</span>) -&gt; Result[dict, ConnectionError | NotFoundError | ValidationError]:</span>
    <span class="hljs-comment"># The IDE will help us handle all error cases from lower levels</span>
    db_result = fetch_from_db(user_id)
    match db_result:
        case Ok(data):
            <span class="hljs-keyword">return</span> validate_user(data)  <span class="hljs-comment"># ValidationError automatically becomes part of our error type</span>
        case Err(ConnectionError()):
            <span class="hljs-keyword">return</span> Err(ConnectionError())  <span class="hljs-comment"># Pass through the lower level error</span>
        case Err(NotFoundError()):
            <span class="hljs-keyword">return</span> Err(NotFoundError())

<span class="hljs-comment"># Usage with pattern matching</span>
match get_user(<span class="hljs-string">"some_id"</span>):
    case Ok(user):
        print(<span class="hljs-string">f"Found user: <span class="hljs-subst">{user}</span>"</span>)
    case Err(ConnectionError()):
        print(<span class="hljs-string">"Could not connect to database"</span>)
    case Err(NotFoundError()):
        print(<span class="hljs-string">"User not found"</span>)
    case Err(ValidationError()):
        print(<span class="hljs-string">"Invalid user data"</span>)
</code></pre>
<p>With proper type annotations:</p>
<ol>
<li><p><strong>IDE Support</strong>: Tools like PyCharm can:</p>
<ul>
<li><p>Warn if you forget to handle an error case</p>
</li>
<li><p>Show you exactly what types of errors each function might return</p>
</li>
<li><p>Provide autocomplete for error handling patterns</p>
</li>
<li><p>Track error types through complex call chains</p>
</li>
</ul>
</li>
<li><p><strong>Refactoring Safety</strong>: When you change an error type in one function, the IDE can highlight every place that needs to be updated to handle the new error type.</p>
</li>
<li><p><strong>Documentation at Your Fingertips</strong>: Hover over any function to see not just what it returns on success, but all the ways it might fail.</p>
</li>
</ol>
<h2 id="heading-contrast-with-traditional-exceptions">Contrast with Traditional Exceptions</h2>
<p>Consider how this differs from traditional exception handling in a large code base:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Traditional exception approach</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fetch_from_db</span>(<span class="hljs-params">user_id: str</span>) -&gt; dict:</span>
    <span class="hljs-comment"># What can be raised here? Need to check implementation or docs</span>
    <span class="hljs-keyword">if</span> user_id == <span class="hljs-string">""</span>:
        <span class="hljs-keyword">raise</span> ConnectionError(<span class="hljs-string">"Database unavailable"</span>)
    <span class="hljs-keyword">if</span> user_id == <span class="hljs-string">"404"</span>:
        <span class="hljs-keyword">raise</span> KeyError(<span class="hljs-string">f"User <span class="hljs-subst">{user_id}</span> not found"</span>)
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"id"</span>: user_id, <span class="hljs-string">"name"</span>: <span class="hljs-string">"John"</span>}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">validate_user</span>(<span class="hljs-params">data: dict</span>) -&gt; dict:</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">"name"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> data:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"Invalid user data"</span>)
    <span class="hljs-keyword">return</span> data

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_user</span>(<span class="hljs-params">user_id: str</span>) -&gt; dict:</span>
    <span class="hljs-keyword">try</span>:
        data = fetch_from_db(user_id)
        <span class="hljs-keyword">return</span> validate_user(data)
    <span class="hljs-keyword">except</span> (ConnectionError, KeyError, ValueError) <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># Easy to miss an exception type</span>
        <span class="hljs-comment"># Easy to catch too many with 'except Exception'</span>
        <span class="hljs-comment"># Error handling tends to get condensed into a single case</span>
        log.error(<span class="hljs-string">f"Failed to get user: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">raise</span>  <span class="hljs-comment"># What exactly are we raising here?</span>

<span class="hljs-comment"># Usage:</span>
<span class="hljs-keyword">try</span>:
    user = get_user(<span class="hljs-string">"some_id"</span>)
    print(<span class="hljs-string">f"Found user: <span class="hljs-subst">{user}</span>"</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:  <span class="hljs-comment"># Often degrades to catch-all</span>
    print(<span class="hljs-string">f"Something went wrong: <span class="hljs-subst">{e}</span>"</span>)
</code></pre>
<p>Without error types in the signatures:</p>
<ul>
<li><p>IDEs can't help you identify possible errors</p>
</li>
<li><p>Documentation about error cases tends to get out of date</p>
</li>
<li><p>It's easy to accidentally catch too many or too few exceptions</p>
</li>
<li><p>Error handling patterns become inconsistent across a code base</p>
</li>
<li><p>Refactoring error handling becomes risky and time-consuming</p>
</li>
</ul>
<p>By making error paths explicit in our types, we turn the type system into a powerful tool for managing complexity in large code bases. The compiler and IDE become active participants in maintaining consistent and complete error handling throughout the system.</p>
<h2 id="heading-beyond-language-boundaries">Beyond Language Boundaries</h2>
<p>While Rust's implementation of this pattern through <code>Option&lt;T&gt;</code> and <code>Result&lt;T, E&gt;</code> is particularly well-designed, the underlying concept is universal. Any language with a type system can implement this pattern, and doing so brings many of the same benefits:</p>
<ol>
<li><p>Error cases become visible in function signatures</p>
</li>
<li><p>The type system helps ensure proper error handling</p>
</li>
<li><p>Control flow becomes more explicit and easier to follow</p>
</li>
<li><p>Code becomes more self-documenting</p>
</li>
</ol>
<p>Whether we're working in Python, TypeScript, Java, or any other language, we can learn from this approach and apply its principles to write more reliable and maintainable code.</p>
<h2 id="heading-practical-challenges-in-adoption">Practical Challenges in Adoption</h2>
<p>However, adopting this pattern isn't without its trade-offs. Teams with developers deeply familiar with traditional exception handling might find this approach initially counterintuitive. The learning curve can be particularly steep for junior developers who are already grappling with basic programming concepts. Additionally, introducing a new error handling pattern in an established codebase can lead to inconsistency if not implemented systematically across the entire project. Teams need to carefully weigh these practical considerations against the long-term benefits of more explicit error handling.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Throughout this exploration of error handling patterns, we've seen how making exceptions part of function signatures transforms them from invisible control flow into explicit, manageable parts of our program's type system. This approach isn't just about cleaner code—it's about building more reliable systems where error cases receive the same careful consideration as success paths.</p>
<p>The benefits of this approach extend beyond individual functions to entire code bases. When errors are part of our type signatures, we gain powerful tools for static analysis, better IDE support, and clearer documentation. Our type checkers and development tools become active participants in ensuring we handle errors consistently and comprehensively.</p>
<p>While the implementation details may vary across languages, the core principle remains: errors are not exceptional, they're essential parts of our program's logic. By making them explicit in our function signatures, we not only make our code more maintainable but also create systems that are more robust and easier to reason about.</p>
<p>Whether you're working in Python, Rust, or any other language, consider how making error types explicit in your function signatures might improve your code's reliability and maintainability.</p>
]]></content:encoded></item><item><title><![CDATA[Happy Easter!]]></title><link>https://blog.lls-software.com/happy-easter-2aafca8abbf8</link><guid isPermaLink="true">https://blog.lls-software.com/happy-easter-2aafca8abbf8</guid><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Sun, 31 Mar 2024 18:04:50 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274757861/d3c9bb43-369e-4bc0-b683-bdde0383b158.jpeg" alt /></p>
]]></content:encoded></item><item><title><![CDATA[Evolving CI/CD: From Manual Automation to GitHub Actions]]></title><description><![CDATA[Introduction
In a dynamic development environment, the efficiency and automation of deployment processes are key to the speed and reliability of software releases. This post details my journey in facing and overcoming CI/CD challenges at the company ...]]></description><link>https://blog.lls-software.com/evolving-ci-cd-from-manual-automation-to-github-actions-d0ba812fba1d</link><guid isPermaLink="true">https://blog.lls-software.com/evolving-ci-cd-from-manual-automation-to-github-actions-d0ba812fba1d</guid><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Mon, 18 Mar 2024 02:21:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274762291/35151dc9-ff83-404b-9f7b-cd96965a231c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction">Introduction</h3>
<p>In a dynamic development environment, the efficiency and automation of deployment processes are key to the speed and reliability of software releases. This post details my journey in facing and overcoming CI/CD challenges at the company I work for, culminating in the adoption of GitHub Actions for a more efficient and maintainable process.</p>
<h3 id="heading-the-starting-point">The Starting Point</h3>
<p>At the beginning of our journey to improve our deployment processes, we faced a scenario common to many startups and growing development teams. Our application consisted of a backend written in Python and a single-page application frontend in React. The former was deployed as a Docker container running on Kubernetes; the latter was served as a static application through S3 and CloudFront.</p>
<p>Initially, I was responsible for developing and maintaining both parts of the application. This “jack-of-all-trades” approach is not uncommon in the early stages of a project, but it presents its own challenges, especially as the team grows in size and diversity of skill sets.</p>
<h3 id="heading-growing-pains">Growing Pains</h3>
<p>With the addition of a new developer focused exclusively on the frontend, the process began to split. In the testing environment, I continued to package the backend and send it to the repository, while the new developer took on the responsibility of compiling the frontend and updating the S3 bucket. In the production environment, however, to ensure proper integration, I continued to handle both processes, compiling and packaging both parts before sending them to their respective destinations.</p>
<p>With the addition of a second — more junior — frontend developer to the team, we were faced with new problems: the lack of experience and a different operating system made it quite difficult to maintain the same process. The new team member worked on a Windows machine with WSL, which couldn't properly run the tooling we'd developed for our macOS and Linux environments. The time spent adjusting build and deployment scripts to work on Windows and WSL was unsustainable.</p>
<h3 id="heading-we-needed-to-automate">We Needed to Automate</h3>
<p>It was at this point that I recognized the critical need to automate more of our deployment process. The initial solution was designed only for the frontend and was pretty much a cloud wrapper around the tooling that we already had.</p>
<p>It involved the creation of a pipeline that started with a code push to our GitHub repository. This event triggered a webhook call to a Lambda function on AWS. Once triggered, the Lambda function was responsible for instantiating a virtual machine on EC2 from a custom AMI that I'd built, including all runtime dependencies and automation scripts.</p>
<p>To ensure isolation and reproducibility, the VM built a Docker image out of the code pulled from GitHub, compiled it, and synced the resulting assets with our S3 bucket.</p>
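<p>The glue described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the original code: the AMI id, instance type, and the <code>/opt/build/run.sh</code> path are placeholders.</p>

```python
import json


def build_launch_params(ami_id: str, instance_type: str, user_data: str) -> dict:
    """Arguments for ec2.run_instances: a single, self-terminating builder VM."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,
        # The build script ends with a shutdown, which terminates the instance.
        "InstanceInitiatedShutdownBehavior": "terminate",
    }


def lambda_handler(event: dict, context: object) -> dict:
    import boto3  # available by default in the AWS Lambda runtime

    # The GitHub webhook payload tells us which ref was pushed.
    ref = json.loads(event["body"]).get("ref", "refs/heads/main")
    user_data = f"#!/bin/sh\n/opt/build/run.sh '{ref}'\nshutdown -h now\n"
    ec2 = boto3.client("ec2")
    result = ec2.run_instances(
        **build_launch_params("ami-0123456789abcdef0", "c5.large", user_data)
    )
    return {"statusCode": 202, "body": result["Instances"][0]["InstanceId"]}
```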
<p>Aiming for consistency, we had decided in the early stages of our development to keep the backend and frontend in the same repository, so that a single commit hash could represent a version of the code base for the whole application, reducing the chance of a clash between versions due to API changes, for example.</p>
<p>The frontend deployment solution worked well enough that it made sense to generalize it so that the backend was built on a similar process. For this case, it involved downloading the latest code from the GitHub repository, building it into a Docker container and pushing the resulting image to the Amazon Elastic Container Registry (ECR).</p>
<p>After a successful upload, the script would automatically adjust the configuration of our Kubernetes to use the new Docker image, ensuring that the latest version of our backend was effectively deployed.</p>
<p>Each part would be built in parallel, so that the frontend and the backend were deployed at about the same time, maintaining their synchronization.</p>
<h3 id="heading-recognizing-the-problem">Recognizing the Problem</h3>
<p>As our project and team grew, it became evident that automating our deployment process was not just a matter of convenience, but a critical necessity to maintain the efficiency of our development workflow.</p>
<p>The previous solution, despite being functional, carried too much operational complexity. It required someone to maintain the AMI and all the glue logic scripts. It also required our team to have at least one person who had the knowledge and experience to do so, which could become a problem for the company in my absence.</p>
<p>Taking into account that this was not a problem unique to us, I decided to embark on a journey of simplification, looking for a tool or service we could outsource this to, avoiding the hassle of keeping our custom solution.</p>
<h3 id="heading-the-choice-for-github-actions">The Choice for GitHub Actions</h3>
<p>The natural choice would be either GitHub Actions or AWS CodePipeline, as those belong to the two major vendors we were already integrated with. Since all of our infrastructure is already set up on AWS, there is a growing concern that we might become too dependent on them, and that if we ever needed to move to another cloud provider, doing so would cause too much disruption to our processes and become a source of instability. For this reason, GitHub Actions started out as the preferred option. Add to that how easy and inexpensive it is, and it became a no-brainer.</p>
<p>With it, we were able to define workflows directly in our project’s Git repository, using simple YAML configuration tightly coupled with the specific code revision.</p>
<p>The workflow consists of two main jobs that run in parallel: one for the backend and another for the frontend. This approach not only saves time but also allows for finer management of the dependencies and environments of each part of the application.</p>
<p>Similar to what we were doing before, the backend job involves building the Docker image and pushing it to the Amazon Elastic Container Registry (ECR). To overcome the challenge of cross-compiling for the arm64 architecture, we use QEMU along with docker buildx. This allows us to maintain compatibility with arm64 infrastructure, despite GitHub Actions' current lack of native arm64 runners.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">backend:</span>
        <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
        <span class="hljs-attr">steps:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">the</span> <span class="hljs-string">code</span>
              <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Set</span> <span class="hljs-string">up</span> <span class="hljs-string">QEMU</span>
              <span class="hljs-attr">uses:</span> <span class="hljs-string">docker/setup-qemu-action@v3</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Set</span> <span class="hljs-string">up</span> <span class="hljs-string">Docker</span> <span class="hljs-string">Buildx</span>
              <span class="hljs-attr">uses:</span> <span class="hljs-string">docker/setup-buildx-action@v3</span>
              <span class="hljs-attr">with:</span>
                  <span class="hljs-attr">platforms:</span> <span class="hljs-string">linux/arm64</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Configure</span> <span class="hljs-string">AWS</span> <span class="hljs-string">Credentials</span>
              <span class="hljs-attr">uses:</span> <span class="hljs-string">aws-actions/configure-aws-credentials@v4</span>
              <span class="hljs-attr">with:</span>
                  <span class="hljs-attr">aws-access-key-id:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_ACCESS_KEY_ID</span> <span class="hljs-string">}}</span>
                  <span class="hljs-attr">aws-secret-access-key:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_SECRET_ACCESS_KEY</span> <span class="hljs-string">}}</span>
                  <span class="hljs-attr">aws-region:</span> <span class="hljs-string">...</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Login</span> <span class="hljs-string">to</span> <span class="hljs-string">Amazon</span> <span class="hljs-string">ECR</span>
              <span class="hljs-attr">id:</span> <span class="hljs-string">login-ecr</span>
              <span class="hljs-attr">uses:</span> <span class="hljs-string">aws-actions/amazon-ecr-login@v2</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">Buildx</span>
              <span class="hljs-attr">run:</span> <span class="hljs-string">|
                  docker buildx build \
                      --platform linux/arm64 \
                      ...
                      --push</span>
</code></pre>
<p>In parallel, the frontend job compiles the static assets and sends them to Amazon S3, from where they are served to users.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">frontend:</span>
        <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
        <span class="hljs-attr">steps:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/setup-node@v4</span>
          <span class="hljs-attr">with:</span>
              <span class="hljs-attr">node-version:</span> <span class="hljs-string">'20.11.1'</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">the</span> <span class="hljs-string">code</span>
          <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">Frontend</span>
          <span class="hljs-attr">working-directory:</span> <span class="hljs-string">./frontend</span>
          <span class="hljs-attr">run:</span> <span class="hljs-string">|
              npm install
              node ...
</span>        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Configure</span> <span class="hljs-string">AWS</span> <span class="hljs-string">Credentials</span>
          <span class="hljs-attr">uses:</span> <span class="hljs-string">aws-actions/configure-aws-credentials@v4</span>
          <span class="hljs-attr">with:</span>
              <span class="hljs-attr">aws-access-key-id:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_ACCESS_KEY_ID</span> <span class="hljs-string">}}</span>
              <span class="hljs-attr">aws-secret-access-key:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_SECRET_ACCESS_KEY</span> <span class="hljs-string">}}</span>
              <span class="hljs-attr">aws-region:</span> <span class="hljs-string">...</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">to</span> <span class="hljs-string">S3</span>
          <span class="hljs-attr">working-directory:</span> <span class="hljs-string">./frontend</span>
          <span class="hljs-attr">run:</span> <span class="hljs-string">aws</span> <span class="hljs-string">s3</span> <span class="hljs-string">sync</span> <span class="hljs-string">--delete</span> <span class="hljs-string">...</span>
</code></pre>
<h3 id="heading-but-there-was-still-an-issue">But there was still an issue</h3>
<p>In our previous process, the backend pipeline ended with a call to Kubernetes to update the current image. Two solutions crossed my mind: simply appending the kubectl command to the GitHub workflow to set the new image, or associating a Lambda function on AWS to be triggered when the image got pushed to ECR.</p>
<p>The first solution, though seemingly simple, isn't without downsides. The first is that our ECR would have to be open to the world. This isn't so much of a problem, especially given that this is actually its default configuration, and one that we've used for quite some time. But going forward, I'd like to keep access to it restricted to our VPC (which we access using a VPN). Additionally, it's another permission granted to an AWS role (the one used by GitHub) that is already too powerful. And on top of that, it's a series of packages (kubectl, awscli and their dependencies) that I'd have to install on the builder machine, which would then have to be maintained.</p>
<p>Mostly due to my familiarity with adding glue logic through AWS Lambda functions, implementing one to accomplish this task didn't seem like a big deal. It does carry the problem of not having automatic error reporting, like the GitHub Action does, but it's also a piece that, once set up, I've never seen fail. As mentioned, it allows the use of a role with powers limited to accessing our EKS, and it keeps all the communication within our VPC.</p>
<p>Creating this Lambda function ended up being a careful and interesting exercise. My goal was to keep the function as simple and free of external dependencies as possible. This led me to dive deep into the inner workings of AWS EKS authentication and how kubectl, the Kubernetes command-line tool, manages this integration. That investigation resulted in an elegant solution that authenticates with EKS using boto3 (already available in the Lambda environment) and communicates with Kubernetes directly through a REST API call.</p>
<p>By opting for a solution that utilizes the standard library and the internal capabilities of the Lambda environment, we managed to end up with an efficient and low-maintenance update process.</p>
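<p>For the curious, the core of such a function might look something like the sketch below. The helper names are hypothetical; the presigned STS URL is assumed to come from botocore's request signer (the same mechanism <code>aws eks get-token</code> uses), and the cluster endpoint and CA certificate from <code>eks.describe_cluster</code>. Only the standard library is needed for the Kubernetes call itself.</p>

```python
import base64
import json
import ssl
import tempfile
import urllib.request


def encode_eks_token(presigned_sts_url: str) -> str:
    """Wrap a presigned STS GetCallerIdentity URL in the EKS bearer-token format."""
    encoded = base64.urlsafe_b64encode(presigned_sts_url.encode("utf-8")).decode("utf-8")
    return "k8s-aws-v1." + encoded.rstrip("=")


def image_patch(container: str, image: str) -> bytes:
    """Strategic-merge-patch body replacing a single container's image."""
    patch = {"spec": {"template": {"spec": {"containers": [
        {"name": container, "image": image},
    ]}}}}
    return json.dumps(patch).encode("utf-8")


def set_image(endpoint, ca_pem, token, namespace, deployment, container, image):
    """PATCH the Deployment object directly through the Kubernetes REST API."""
    url = f"{endpoint}/apis/apps/v1/namespaces/{namespace}/deployments/{deployment}"
    with tempfile.NamedTemporaryFile("w", suffix=".pem") as ca_file:
        ca_file.write(ca_pem)  # cluster CA, as returned by eks.describe_cluster
        ca_file.flush()
        context = ssl.create_default_context(cafile=ca_file.name)
        request = urllib.request.Request(
            url,
            data=image_patch(container, image),
            method="PATCH",
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/strategic-merge-patch+json",
            },
        )
        with urllib.request.urlopen(request, context=context) as response:
            return json.load(response)
```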
<h3 id="heading-conclusion">Conclusion</h3>
<p>The migration to GitHub Actions simplified our deployment process, reducing our workload and increasing its maintainability. Configurable workflows gave us the flexibility to define specific deployment tasks for the backend and frontend, with the additional advantage of being integrated into the GitHub ecosystem, which increases the visibility of the process and facilitates collaboration.</p>
<p>We were able to reduce manual workload and improve the reliability and maintainability of our deployment process. At the time of this writing, we’re still struggling with the longer cross build times — an issue that we expect to see resolved as GitHub rolls out the ARM64 machines.</p>
<p>The journey to improve the CI/CD process reflects the importance of always learning and adapting. By adopting new tools and practices, we were not only able to solve immediate problems but also prepare our infrastructure for the future.</p>
]]></content:encoded></item><item><title><![CDATA[Database Migration Strategies]]></title><description><![CDATA[Introduction
Database schema management is essential as projects evolve over time.
A challenge arises when the schema, originally designed to accommodate a certain set of requirements, needs to change. These changes can include the introduction of ne...]]></description><link>https://blog.lls-software.com/database-migration-strategies-058e286c10a6</link><guid isPermaLink="true">https://blog.lls-software.com/database-migration-strategies-058e286c10a6</guid><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Sat, 17 Feb 2024 21:47:37 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274777997/a10c84f1-889d-453b-bc97-25da42b953f1.png" alt /></p>
<h3 id="heading-introduction">Introduction</h3>
<p>Database schema management is essential as projects evolve over time.</p>
<p>A challenge arises when the schema, originally designed to accommodate a certain set of requirements, needs to change. These changes can include the introduction of new tables, columns, views, or functions, as well as the modification of existing ones.</p>
<p>And even though these schema changes are part of the natural progression of any database-driven application, they need to be managed carefully to ensure that the application’s functionality remains consistent and reliable.</p>
<p>While this article has been written with Python and PostgreSQL in mind, the concepts described probably apply to most combinations of SQL databases and general purpose programming languages out there.</p>
<h3 id="heading-traditional-migration-approaches">Traditional Migration Approaches</h3>
<p>Traditionally, frameworks such as Django provide tools that allow for database access by describing tables as classes, with instances of those classes mapping almost directly into rows of the table.</p>
<p>Modifications to the underlying data model can then be made by changing those classes and running a tool that automatically translates them into a migration script which, when applied, will emit the appropriate DDL for the database.</p>
<h4 id="heading-django">Django</h4>
<p>In Django’s approach, the initial draft for the migration is generated by the tool and described in a very high-level manner, focusing fundamentally on changes to the Python model of the data rather than on the database table itself. Based on that, the migration script dynamically generates the DDL to move from one state to another (forward and backward) at apply time.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example of Django migration</span>
<span class="hljs-comment"># Original url: https://docs.djangoproject.com/en/5.0/topics/migrations/#migration-files</span>

<span class="hljs-keyword">from</span> django.db <span class="hljs-keyword">import</span> migrations, models

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Migration</span>(<span class="hljs-params">migrations.Migration</span>):</span>
    dependencies = [(<span class="hljs-string">"migrations"</span>, <span class="hljs-string">"0001_initial"</span>)]

    operations = [
        migrations.DeleteModel(<span class="hljs-string">"Tribble"</span>),
        migrations.AddField(<span class="hljs-string">"Author"</span>, <span class="hljs-string">"rating"</span>, models.IntegerField(default=<span class="hljs-number">0</span>)),
    ]
</code></pre>
<p>However, this ORM-based approach has limitations. It assumes the database to be somewhat of a passive storage system, while in fact, SQL databases are powerful engines capable of complex data modelling and processing.</p>
<h4 id="heading-sqlalchemy-amp-alembic">SQLAlchemy &amp; Alembic</h4>
<p>Python developers feeling this way often move to the SQLAlchemy SQL toolkit which, at the cost of abstracting away fewer details, offers access to both levels: the ORM and direct SQL constructs. And as a subproject, SQLAlchemy offers Alembic for schema management:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example of Alembic migration</span>
<span class="hljs-comment"># Original url: https://alembic.sqlalchemy.org/en/latest/tutorial.html#create-a-migration-script</span>

revision = <span class="hljs-string">'ae1027a6acf'</span>
down_revision = <span class="hljs-string">'1975ea83b712'</span>
branch_labels = <span class="hljs-literal">None</span>

<span class="hljs-keyword">from</span> alembic <span class="hljs-keyword">import</span> op
<span class="hljs-keyword">import</span> sqlalchemy <span class="hljs-keyword">as</span> sa

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">upgrade</span>():</span>
    op.create_table(
        <span class="hljs-string">'account'</span>,
        sa.Column(<span class="hljs-string">'id'</span>, sa.Integer, primary_key=<span class="hljs-literal">True</span>),
        sa.Column(<span class="hljs-string">'name'</span>, sa.String(<span class="hljs-number">50</span>), nullable=<span class="hljs-literal">False</span>),
        sa.Column(<span class="hljs-string">'description'</span>, sa.Unicode(<span class="hljs-number">200</span>)),
    )

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">downgrade</span>():</span>
    op.drop_table(<span class="hljs-string">'account'</span>)
</code></pre>
<p>While this approach certainly gives us more control over what's being done in the database, I've often found myself writing SQL directly and later trying to find a way to express it in Alembic's terms, only to have Alembic regenerate the SQL again.</p>
<h3 id="heading-the-raw-approach">The Raw Approach</h3>
<p>In essence, a migration system is not a complex thing. It requires:</p>
<ul>
<li><p>an ordered list of migration scripts</p>
</li>
<li><p>a list of which migrations have already been applied</p>
</li>
<li><p>a script to check and apply only pending migrations</p>
</li>
</ul>
<p>Based on this idea, I've decided to create a simple script to handle migrations written in direct SQL:</p>
<ul>
<li><p>a simple folder that contains a series of SQL files; each file is named following the pattern: <code>&lt;timestamp&gt;_&lt;summary&gt;.sql</code>; this way, scripts can be executed in creation order (based on the timestamp) and can be easily found (based on the summary)</p>
</li>
<li><p>I've also avoided the up/down migration approach, as I felt rolling back changes is rarely safe and an up/down mechanism can easily destroy data</p>
</li>
<li><p>each application environment contains a table which stores which migrations have been previously applied:</p>
</li>
</ul>
<pre><code class="lang-pgsql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> migrations (
    id <span class="hljs-type">bigint</span> <span class="hljs-keyword">primary key</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">null</span>,
    applied_on <span class="hljs-type">timestamp</span> <span class="hljs-type">with time zone</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">null</span> <span class="hljs-keyword">default</span> <span class="hljs-built_in">CURRENT_TIMESTAMP</span>
);

<span class="hljs-keyword">create</span> <span class="hljs-keyword">index</span> id <span class="hljs-keyword">on</span> migrations <span class="hljs-keyword">using</span> hash (id);
</code></pre>
<ul>
<li><p>a script to automate the process:</p>
<pre><code class="lang-bash">  $ migrations --<span class="hljs-built_in">help</span>
  Usage: migrations [OPTIONS] COMMAND [ARGS]...

  Options:
      --<span class="hljs-built_in">help</span> Show this message and <span class="hljs-built_in">exit</span>.

  Commands:
      apply
      check
      init
      new
  $
</code></pre>
</li>
</ul>
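<p>The core of the <code>check</code> command reduces to a small amount of pure logic: list the folder, parse the leading timestamp out of each filename, and keep the scripts whose ids are not yet present in the <code>migrations</code> table. A sketch (helper names are hypothetical; it assumes the timestamps all have the same number of digits, so lexicographic order matches chronological order):</p>

```python
import re
from pathlib import Path

# Filenames follow the <timestamp>_<summary>.sql pattern described above.
MIGRATION_NAME = re.compile(r"^(\d+)_.+\.sql$")


def pending_migrations(folder: Path, applied_ids: set) -> list:
    """Return the migration scripts not yet recorded as applied, oldest first."""
    pending = []
    for path in sorted(folder.glob("*.sql")):
        match = MIGRATION_NAME.match(path.name)
        if match and int(match.group(1)) not in applied_ids:
            pending.append(path)
    return pending
```

<p>The set of applied ids itself would come from querying the <code>migrations</code> table.</p>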
<h4 id="heading-checking-which-migrations-need-to-be-applied">Checking which migrations need to be applied</h4>
<p>To determine which migrations need to be applied, no more than a simple query is needed:</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">select</span>
    id
<span class="hljs-keyword">from</span>
    (<span class="hljs-keyword">values</span>
        (:timestamp_1),
        (:timestamp_2),
        ...
        (:timestamp_n)
    ) v (id)
<span class="hljs-keyword">left join</span> migrations m <span class="hljs-keyword">using</span> (id)
<span class="hljs-keyword">where</span>
    m.id <span class="hljs-keyword">is</span> <span class="hljs-keyword">null</span>
<span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> id;
</code></pre>
<h4 id="heading-applying-migrations">Applying migrations</h4>
<p>Applying migrations is easy, by just piping them into <code>psql</code>:</p>
<pre><code class="lang-bash">$ psql -q -e -1 -v ON_ERROR_STOP=1
</code></pre>
<p>where,</p>
<ul>
<li><p><code>psql</code>: this is the command-line interface to interact with PostgreSQL</p>
</li>
<li><p><code>-q</code>: this stands for “quiet” mode. It suppresses the printing of the welcome message, headers, and footers in the output.</p>
</li>
<li><p><code>-e</code>: this echoes the queries that <code>psql</code> executes to the standard output. It helps you to see the exact query that is being run, which can be beneficial for debugging</p>
</li>
<li><p><code>-1</code> or <code>--single-transaction</code>: this wraps all the SQL commands that are run inside a single transaction. It’s equivalent to issuing “BEGIN” before the first command and “COMMIT” after the last command, provided there are no errors. If there is an error and <code>ON_ERROR_STOP</code> is set, as in this command, it would issue a “ROLLBACK” instead</p>
</li>
<li><p><code>-v ON_ERROR_STOP=1</code>: this sets the <code>ON_ERROR_STOP</code> variable to <code>1</code>, telling <code>psql</code> to stop execution immediately if a SQL command results in an error</p>
</li>
</ul>
<h4 id="heading-sustainability-handling-migrations-bloat">Sustainability — Handling Migrations Bloat</h4>
<p>Over time, the number of migration scripts will grow; that's just the nature of modern, ever-evolving software projects. Because of this, this method treats individual scripts as naturally ephemeral: they get removed from the repository once they've been applied to all active environments.</p>
<p>For new installations, a master schema is kept by creating a dump (with <code>pg_dump</code>) of the current considered-to-be-correct schema, which is included in the directory.</p>
<p>This way we can easily install a new instance of the application, while maintaining only a short term backlog of applicable schema changes.</p>
<h3 id="heading-limitations">Limitations</h3>
<p>This approach is certainly not without its problems.</p>
<p>First, it does maintain some information duplication:</p>
<ol>
<li><p>The master schema, created from the dump.</p>
</li>
<li><p>The Python definitions of the schema, for the benefit of code inspections and correct data type conversion.</p>
</li>
<li><p>The migration scripts.</p>
</li>
</ol>
<p>Also, in my experience, as with any code we write, migration scripts carry their fair share of bugs. Some leeway in SQL allows for schema divergence, as with constraint names and column ordering, for example.</p>
<p>In addition, programming mistakes can cause the actual schema not to match what is specified in the Python code, not to mention the duplicated work of keeping both specifications in sync.</p>
<h3 id="heading-some-ideas-for-the-future">Some Ideas for the Future</h3>
<p>If you consider the application repository and the version control system history, the schema migration scripts and the schema dumps do carry some duplicate information. At the same time, they both need to exist, as one describes how to go from A to B, while the other specifies unequivocally where B is.</p>
<p>But if we use some automation, we can use one to help build the other, reducing the amount of errors. To go from migration to master, we can design a migration, apply it to a testing environment and dump the testing environment state — which is mostly what I do today.</p>
<p>But with AI technologies such as GPT-4, another approach could also offer interesting results: by showing the LLM the <code>git diff</code> between two schemas, and possibly commenting on it, we could get a draft of what a migration could look like. This wouldn't be a rule-based approach, as done by Django or Alembic, but one that adds semantic considerations to how data is migrated between the two states. The draft would then be reviewed and adjusted by developers to suit the exact requirements, combining the benefits of automation with human expertise.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>This article documents a novel approach to managing complex database migrations, a method that has been in use for a few years now. My aim is to share this approach's history and rationale not only with my team, but with the wider community, for collective improvement in handling similar challenges.</p>
]]></content:encoded></item><item><title><![CDATA[AWS Lambda Overview]]></title><description><![CDATA[Introduction
In this post, we’ll delve into AWS Lambda, AWS's serverless computing platform. My aim is to give a broad overview over it, shed light on its nuances, and to create an easier path for other developers.
Understanding AWS Lambda
AWS Lambda...]]></description><link>https://blog.lls-software.com/aws-lambda-overview-465944593915</link><guid isPermaLink="true">https://blog.lls-software.com/aws-lambda-overview-465944593915</guid><dc:creator><![CDATA[Leandro Lima]]></dc:creator><pubDate>Fri, 02 Feb 2024 00:35:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274773592/37035f99-1d98-4bc8-8a1d-792ddf4ee501.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274766623/05583b2a-fe04-4a17-8c73-0bd521750a42.png" alt /></p>
<h3 id="heading-introduction"><strong>Introduction</strong></h3>
<p>In this post, we’ll delve into AWS Lambda, AWS's serverless computing platform. My aim is to give a broad overview of it, shed light on its nuances, and create an easier path for other developers.</p>
<h3 id="heading-understanding-aws-lambda"><strong>Understanding AWS Lambda</strong></h3>
<p>AWS Lambda is an event-based computing service. Events can be triggered through API access, through a scheduler like EventBridge, through AWS's API using a library like <code>boto3</code>, through its integration with RDS, and through many other services within AWS's ecosystem.</p>
<p>It can perform mid-sized asynchronous jobs, spanning several minutes and using sizeable memory, as well as fast synchronous jobs, like transforming a CloudFront request or response.</p>
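<p>As a taste of the fast synchronous end of that spectrum, a Lambda@Edge viewer-request handler can rewrite a CloudFront request in a few lines. This is a generic illustration (the header name is made up), not tied to any particular distribution:</p>

```python
def lambda_handler(event: dict, context: object) -> dict:
    # CloudFront hands the request in under Records[0].cf.request;
    # returning it (possibly modified) lets processing continue.
    request = event["Records"][0]["cf"]["request"]
    request["headers"]["x-request-tagged"] = [
        {"key": "X-Request-Tagged", "value": "1"}
    ]
    return request
```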
<p>Such flexibility also comes with some constraints and it usually requires a different approach to application development — one that is specifically tailored for its environment.</p>
<p>Namely, a developer expecting to deploy on Lambda should consider:</p>
<ul>
<li><p>Cost</p>
</li>
<li><p>Running environment</p>
</li>
<li><p>Lifecycle</p>
</li>
<li><p>Time constraints</p>
</li>
<li><p>Network constraints</p>
</li>
<li><p>Resource access permissions</p>
</li>
</ul>
<p>We'll approach each of these topics below.</p>
<h3 id="heading-cost">Cost</h3>
<p>The cost for AWS Lambda is primarily based on:</p>
<ul>
<li><p>the number of requests you serve (US$0.20 per 1M requests)*</p>
</li>
<li><p>the amount of memory reserved times the number of seconds the function runs (~ US$0.000015 per GB-second)*</p>
</li>
</ul>
<p>* Prices for US East as of Jan '24</p>
<p>And on that, you get allotted 1 vCPU for each 1,769MB of memory, or fractional values (throttled on CPU time) for non-integer ratios of that.</p>
<p>This billing model is different from traditional availability-based pricing and requires a strategic approach to efficiency and execution time.</p>
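<p>To make the billing model concrete, here is a back-of-the-envelope estimate using the Jan '24 figures quoted above (a real bill also involves the free tier, architecture-dependent rates, and other line items):</p>

```python
PRICE_PER_MILLION_REQUESTS = 0.20  # US$, US East, Jan '24
PRICE_PER_GB_SECOND = 0.000015     # approximate, as quoted above


def monthly_cost(requests: int, avg_duration_s: float, memory_mb: int) -> float:
    """Rough monthly cost: request charge plus memory-duration charge."""
    gb_seconds = requests * avg_duration_s * (memory_mb / 1024)
    return (
        requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
        + gb_seconds * PRICE_PER_GB_SECOND
    )
```

<p>For example, 2 million requests a month at 100 ms average on a 512 MB function comes out to roughly US$0.40 for requests plus US$1.50 for duration.</p>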
<p>Do you expect your function to hang, waiting for some network answer while doing no work? Maybe Lambda isn't the best platform for it.</p>
<p>On the other hand, the same problem can be solved in different manners. And taking into account the environment where it's running can guide you towards the best way of adapting to it.</p>
<p>Take this task for example: download and unpack the Linux kernel.</p>
<p>If you're tight on memory, tight on CPU and loose on storage, this could be one way to do it:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> tarfile
<span class="hljs-keyword">from</span> tempfile <span class="hljs-keyword">import</span> TemporaryDirectory
<span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlopen

url = <span class="hljs-string">"https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.7.tar.gz"</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">lambda_handler</span>(<span class="hljs-params">event: dict, context: object</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">with</span> TemporaryDirectory() <span class="hljs-keyword">as</span> temp_dir:
        local_file_path = os.path.join(temp_dir, <span class="hljs-string">"linux-6.7.tar.gz"</span>)

        <span class="hljs-comment"># Download the file and save it to disk</span>
        <span class="hljs-keyword">with</span> urlopen(url) <span class="hljs-keyword">as</span> response:
            <span class="hljs-keyword">with</span> open(local_file_path, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f:
                <span class="hljs-keyword">while</span> (chunk := response.read(<span class="hljs-number">1024</span>)):
                    f.write(chunk)

        <span class="hljs-comment"># Extract the file</span>
        <span class="hljs-keyword">with</span> tarfile.open(name=local_file_path, mode=<span class="hljs-string">"r:gz"</span>) <span class="hljs-keyword">as</span> tar:
            tar.extractall(path=temp_dir)
</code></pre>
<p>You open the file for download, read small chunks and write small chunks until the download is complete; then you unpack the file you just saved.</p>
<p>But if you're loose on memory, tight on CPU and tight on storage, a different way to achieve the goal could be:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tarfile
<span class="hljs-keyword">from</span> io <span class="hljs-keyword">import</span> BytesIO
<span class="hljs-keyword">from</span> tempfile <span class="hljs-keyword">import</span> TemporaryDirectory
<span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlopen

url = <span class="hljs-string">"https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.7.tar.gz"</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">lambda_handler</span>(<span class="hljs-params">event: dict, context: object</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">with</span> TemporaryDirectory() <span class="hljs-keyword">as</span> temp_dir:
        <span class="hljs-comment"># Download the file into memory</span>
        <span class="hljs-keyword">with</span> urlopen(url) <span class="hljs-keyword">as</span> response:
            file_content = BytesIO(response.read())

        <span class="hljs-comment"># Extract the file</span>
        <span class="hljs-keyword">with</span> tarfile.open(fileobj=file_content, mode=<span class="hljs-string">"r:gz"</span>) <span class="hljs-keyword">as</span> tar:
            tar.extractall(path=temp_dir)
</code></pre>
<p>This way, you first download the entire file into memory, then unpack it from memory onto disk.</p>
<p>But if you're trying to minimize the time × memory product of the function's run, given the allocated resources, this could be a better solution:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tarfile
<span class="hljs-keyword">from</span> tempfile <span class="hljs-keyword">import</span> TemporaryDirectory
<span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlopen

url = <span class="hljs-string">"https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.7.tar.gz"</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">lambda_handler</span>(<span class="hljs-params">event: dict, context: object</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-keyword">with</span> TemporaryDirectory() <span class="hljs-keyword">as</span> temp_dir:
        <span class="hljs-comment"># Start file download</span>
        <span class="hljs-keyword">with</span> urlopen(url) <span class="hljs-keyword">as</span> response:
            <span class="hljs-comment"># Extract it while you download it</span>
            <span class="hljs-keyword">with</span> tarfile.open(fileobj=response, mode=<span class="hljs-string">"r:gz"</span>) <span class="hljs-keyword">as</span> tar:
                tar.extractall(path=temp_dir)
</code></pre>
<p>In this case, while you wait for the network buffer to fill up as bytes come in from the network, you use your allotted CPU time to unpack the file.</p>
<p>For reference, these are the empirical execution times and costs of the three versions on my laptop and with different amounts of memory on Lambda:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274768703/f5048af8-4402-458a-a33d-edb5be199f44.png" alt /></p>
<p><strong>Method A: Writing to Disk in Small Chunks</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274770209/3c14233f-d511-4e4b-aff9-a48f75fec3e2.png" alt /></p>
<p><strong>Method B: Downloading into Memory then Unpacking</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734274771919/598ad266-0e39-4529-a4f3-96f1b320df7d.png" alt /></p>
<p><strong>Method C: Unpacking while Downloading</strong></p>
<p>Notice how the performance of each method is not only influenced by the available memory but also by the proportionate CPU power. For example, at 1569MB, where a full vCPU is available, the methods show improved performance compared to lower memory allocations with fractional vCPUs.</p>
<p>Given the pricing model of AWS Lambda, where both memory allocation and execution time contribute to the cost, selecting a method that efficiently uses both CPU and memory can lead to substantial cost savings, especially for applications that are scaled up to handle a large number of requests.</p>
<p>Understanding the relationship between memory allocation and CPU throttling is crucial in AWS Lambda. It impacts not only the feasibility of different approaches under various resource constraints but also the overall efficiency in terms of performance and cost.</p>
<p>In this small example, methods A and C are less sensitive to file size, allowing the function to run even in the most constrained configuration, while methods B and C showed about the same time and cost profile. Method C’s resilience to file size and memory requirements, combined with its cost performance, makes it a potentially more adequate method for this environment.</p>
<h3 id="heading-running-environment">Running Environment</h3>
<p>Lambda supports multiple programming languages through the use of runtimes.</p>
<p>The easiest runtime to use is one provided by AWS. The underlying execution environment is an Amazon Linux distribution with a programming language interpreter selected by the user and some additional libraries made available by AWS. At the time of writing, this option is available for Node, Python, Java, .NET and Ruby. It can be used either by editing the code file directly from the AWS console or by uploading a ZIP file with the code — potentially including third-party libraries.</p>
<p>If more control is needed, a custom runtime can be supplied by the user through a container image. The container can be built in three ways (ordered from least to most complex, and from least to most freedom):</p>
<ul>
<li><p>based on base images provided by AWS;</p>
</li>
<li><p>using an AWS-supplied Lambda runtime interface client;</p>
</li>
<li><p>implementing the Lambda Runtime API.</p>
</li>
</ul>
<p>The image, containing both the environment and the application, can then be pushed to the Elastic Container Registry and selected as the code to be run.</p>
<p>Lambda functions operate within a stateless, isolated environment, where each function execution is distinct and has restricted filesystem access.</p>
<p>The execution environment is constrained by resource limits, including a maximum of 10GB of RAM and 10GB of ephemeral storage (to be configured by the user and billed accordingly), and a 15-minute execution time cap, after which AWS will forcibly terminate the function.</p>
<p>The filesystem is essentially read-only, with temporary files stored in the <code>/tmp</code> directory. Lambda may reuse the execution environment from a previous invocation if one is available, or it can create a new one, meaning that while data and state may persist across warm invocations of a function, they could also be completely erased by a cold start.</p>
<p>This environment, designed for short-lived, independent operations, mandates an approach centered on statelessness and autonomy.</p>
<h3 id="heading-lifecycle">Lifecycle</h3>
<p>Due to the serverless pay-per-use pricing structure, your function isn’t always running on AWS servers. Instead, it's deployed only when needed. Because of this, there may be a longer response time when the function is called: on the first invocation, the code is transferred, an execution environment is established and the function is called. After this, it maintains the execution environment for a certain duration. If you invoke your function again in this period, AWS reuses the existing environment, skipping the initialization phase. It should be noted, though, that you're only billed when the code is actually running (either on initialization or invocation); when the environment is dormant, waiting for a new invocation, no costs are incurred.</p>
<p><a target="_blank" href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html#runtimes-lifecycle">Lambda execution environment lifecycle</a></p>
<p>If another invocation happens while the already established environment is busy, AWS needs to spawn a new environment for it. Unlike a server that runs continuously and may serve multiple requests in parallel, each execution environment serves only one request at a time. While this may sound like a disadvantage, it's actually a reason to use Lambda: this model allows AWS to run as many instances as you need, instantiating them on demand.</p>
<p>It used to be that the latency involved in these cold starts was a significant hurdle, limiting Lambda’s applicability in time-sensitive scenarios. However, advancements in AWS Lambda have greatly reduced start-up times, even in cold start situations, usually bringing them down to a few seconds.</p>
<p>This improvement did expand the range of viable applications for Lambda, as long as the application architecture and function design took into account the different startup behaviors and planned accordingly.</p>
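<p>This lifecycle is why expensive setup is usually done at module level: it runs once per cold start, and warm invocations in the same environment reuse the result. A minimal sketch (the loaded "config" is a stand-in for real initialization work such as fetching secrets or opening database connections):</p>

```python
import time

def _load_config() -> dict:
    # Stand-in for expensive setup (secrets, connections, model loading...)
    time.sleep(0.1)
    return {"initialized_at": time.time()}

CONFIG = _load_config()   # paid once per execution environment (cold start)
INVOCATIONS = 0

def lambda_handler(event: dict, context: object) -> dict:
    global INVOCATIONS
    INVOCATIONS += 1      # module state survives across warm invocations
    return {"invocation": INVOCATIONS, "initialized_at": CONFIG["initialized_at"]}
```

<p>Two warm invocations in a row share the same <code>CONFIG</code>; a cold start in a fresh environment would pay the setup cost again and reset the counter.</p>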
<h3 id="heading-network-constraints">Network Constraints</h3>
<p>Lambda functions can run with or without access to your VPC's network resources. By default, a function runs in an AWS-managed network and can reach the internet directly, but not the resources inside your VPC. To access those, the function must be attached to subnets of your VPC; once attached, it gets no public IP address and therefore can't reach the internet unless a NAT gateway (or an equivalent EC2 instance) is set up to route traffic for the subnet in use. This additional setup is vital for VPC-attached functions that need to interact with the internet and could create a scenario of unexpected errors for someone just starting with the platform.</p>
<h3 id="heading-resource-access-permissions">Resource Access Permissions</h3>
<p>Every Lambda function is associated with an IAM (Identity and Access Management) role, known as an execution role. Based on the role's associated policies, access is granted or denied to other AWS resources on your account. Your function should have, at least, the ability to access Amazon CloudWatch Logs for the purpose of streaming logs. Also, if you need your function to run on a VPC and access network resources, this must also be included in the role’s policies.</p>
<p>It's a good idea to start with an appropriate AWS managed policy for the desired use case and include other permissions as needed. For basic execution, the managed policy is <code>AWSLambdaBasicExecutionRole</code>, which allows the function to publish logs to CloudWatch. If network access is needed, then the starting point should be <code>AWSLambdaVPCAccessExecutionRole</code>. Other managed policies can be found in <a target="_blank" href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html#permissions-executionrole-features">Lambda's official docs</a>.</p>
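<p>For reference, this is roughly what the pieces look like when building the role programmatically — a sketch only, with an assumed role name. The trust policy is what lets the Lambda service assume the role; the ARNs are those of the managed policies mentioned above:</p>

```python
import json

# Trust policy allowing the Lambda service to assume the execution role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# ARNs of the managed policies discussed above.
BASIC_EXECUTION = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
VPC_ACCESS = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"

def create_role_args(role_name: str) -> dict:
    """Keyword arguments you would pass to boto3's iam_client.create_role(**...)."""
    return {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(TRUST_POLICY),
    }
```

<p>After creating the role, you would attach one of the managed policies with <code>attach_role_policy</code> and reference the role's ARN when creating the function.</p>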
<h3 id="heading-development-approaches"><strong>Development Approaches</strong></h3>
<p>In developing for Lambda, the simpler, the better. While this might sound like generic advice that, in a sense, applies to any project, it is particularly true for this platform. For a long-running server, you can often get away with a lot of complexity pushed towards initialization time. After all, what are a few seconds of initialization for a server that's meant to run for days, weeks or even months without a restart? And even with a CI/CD pipeline where your software does get redeployed often, a number of strategies can be used so that users never notice the period when the new server is starting up.</p>
<p>On Lambda, on the other hand, you do need to count on your code starting up in response to a user's request. It's obviously not what we want, and likely not what will happen on every request; but if your user clicks on something and it takes five or ten seconds for the function to initialize, that will be noticed — and can even give the impression that the application has hung. Because of that, Lambda functions are especially sensitive to unnecessary complexity and dependencies. If you load a heavy library but use only a tiny bit of it, you'll still be paying the cost of loading the library on at least some requests. So be mindful of that when adding a new dependency.</p>
<p>On the other hand, there's often a trade-off between computing complexity and development complexity — especially in interpreted languages. For example: you're implementing a REST API endpoint and the user sends some parameter via a query argument. The simplest possible approach in terms of computing complexity would be to pick that argument directly from the <code>event</code> passed to the Lambda function and validate it yourself. But if this parameter is, for example, a credit card number, now you need to implement your own credit card validator. And if you need to handle more arguments of more types, development complexity might grow uncontrollably — so the natural path would be to use some library that handles that for you. But depending on how complete the library you chose is, and how much you'd like to outsource to libraries, you might end up loading a bunch of unused code and repeatedly paying the memory, latency and CPU costs associated with it.</p>
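<p>To make the trade-off concrete, here is what the "validate it yourself" end of the spectrum looks like for that credit card example — a hand-rolled Luhn check reading the parameter straight from an API Gateway-style event (the <code>queryStringParameters</code> shape and the <code>card</code> parameter name are assumptions for illustration):</p>

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def lambda_handler(event: dict, context: object) -> dict:
    # API Gateway proxy events carry query arguments here (name assumed):
    card = (event.get("queryStringParameters") or {}).get("card", "")
    if not card.isdigit() or not luhn_valid(card):
        return {"statusCode": 400, "body": "invalid card number"}
    return {"statusCode": 200, "body": "ok"}
```

<p>Fifteen lines, no imports, nothing loaded that isn't used — but every new parameter type means more hand-written validation code to maintain.</p>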
<p>Because of that, it's important to understand that there's no one-size-fits-all solution when working with lambda functions. Every task has its own set of constraints and you should be mindful of them — sometimes optimizing for development speed; others for execution overhead; others for a balance between them.</p>
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>As we’ve explored in this introduction to AWS Lambda, developers embarking on the journey of serverless architecture are presented with a plethora of considerations that can significantly impact the cost-effectiveness, performance, and scalability of their applications.</p>
<p>With its fine-grained pricing model, it encourages innovation and experimentation, allowing you to tailor your applications with precision to your use cases and to optimize for cost. Yet with flexibility comes the responsibility to understand the constraints and behaviors of this environment. It demands a thoughtful approach to managing resources, permissions, and application architecture. The serverless paradigm calls for applications designed for statelessness, event responsiveness, and autonomy, challenging traditional development models and encouraging inventive solutions.</p>
<p>In summary, AWS Lambda offers a powerful and versatile platform for serverless computing, but it demands a thoughtful approach to application development. Understanding its operational characteristics and limitations for building and deploying applications is key to leveraging its full potential. Whether for simple or complex tasks, AWS Lambda can be a scalable and efficient solution, provided we have its intricacies in mind.</p>
]]></content:encoded></item></channel></rss>