It's Time to Write Tests
How working with AI systems brings new testing challenges
Introduction
I've been working on a reminder assistant that operates through WhatsApp. Essentially, you ask to be reminded of something at a certain point in time, and you receive a message when that time arrives. There are other features like recurring reminders, but that's not the focus here. One thing that proved surprisingly challenging is how humans refer to and interpret time references. For computers, you usually want something strict and well-defined: a specific date, time, or a pattern you can match against, like a cron expression. But humans communicate time rather ambiguously—and it works!
Let's start with a simple example: “remind me to pick up my son at 9.” Is it 9 AM or 9 PM? Today? Tomorrow? Every day? We might think of a simple rule to resolve this: the next occurrence of 9, whether AM or PM. So if it's 8 PM now, we mean 9 PM, but if it's 10 PM now, it's 9 AM tomorrow. And it's recurring if we say something like “every” or “every day.” But what if the reminder is for “at 3”? Assuming it's past 3 PM, does this mean picking up my son at 3 AM? Unlikely. We can improve the rules for that scenario... and I've tried. But for every rule you add, a simple counter-example can be easily found in something humans say and understand naturally. There's a lot of “common sense” that goes into figuring it out.
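To make the "next occurrence" rule concrete, here is a minimal sketch of it. This is an illustration, not the code my service runs; the function name `next_occurrence` and its parameters are made up for the example, and it only handles whole hours on a 12-hour clock.

```python
from datetime import datetime, timedelta


def next_occurrence(hour_12: int, now: datetime) -> datetime:
    """Return the next moment the clock shows `hour_12` (1-12), AM or PM.

    Hypothetical helper: implements only the naive "next occurrence" rule.
    """
    candidates = []
    for day_offset in (0, 1):
        # Each 12-hour clock value maps to two 24-hour values (AM and PM).
        for hour_24 in (hour_12 % 12, hour_12 % 12 + 12):
            candidate = (now + timedelta(days=day_offset)).replace(
                hour=hour_24, minute=0, second=0, microsecond=0
            )
            if candidate > now:
                candidates.append(candidate)
    # Tomorrow's candidates are always in the future, so the list is never empty.
    return min(candidates)
```

Called at 8 PM with `hour_12=9`, this returns 9 PM the same day; at 10 PM it returns 9 AM the next day. Called at 4 PM with `hour_12=3`, it returns 3 AM the next day, which is exactly the school pickup nobody meant.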
Because of that, I ended up using an LLM to translate what the customer means into something a computer can work with. More specifically, I tried Anthropic's Haiku 3.5, which worked most of the time but not always, and ended up settling on Sonnet 4, which in my tests was able to figure it out properly for all the samples I could come up with. But note what I just wrote: "in my tests... all the samples I could come up with."
My Take on Software Tests
I'm an electrical engineer, and I had a fascinating software development course in college with a strong focus on testing. The course involved developing software to control an elevator. It had the usual interface: users outside could push buttons to call the elevator, users inside could push buttons to select their floor, buttons could be pushed multiple times, and the elevator needed to switch directions, accelerate, decelerate, etc. We were tasked with building it modularly, with unit tests for each function and method, checking that for each interaction, the outputs and internal state were correct.
After college, I joined a trainee program at Embraer, where I had the opportunity to observe an aircraft fuselage under test for seven years. Airplane lifespans are measured in cycles of pressurization and depressurization. Regulations stipulate that no aircraft can fly with more cycles than those tested at the factory (the tests don't have to be completed before the first delivery; they just have to stay ahead of the operators). So if an aircraft will accumulate, say, 60,000 cycles over its lifetime, it should undergo at least 60,000 pressurization cycles at the factory, with instrumentation and regular checks for cracks and material fatigue. The findings lead to maintenance and correction bulletins being written for the aircraft.
But these two examples differ in several ways from how most software is developed in real life:
There's usually no clear and fixed specification for how most software should behave in all scenarios. There are cases where this does exist—for example, a vehicle controller, a video encoder following a certain specification, or the implementation of an API that should conform to a specific standard. But for most projects, the software specification is a living thing, which is why we have OTA updates, continuous deployment, A/B testing, etc.
Modern software is usually built from many parts, each with numerous possible states and failure modes. Think of a TCP connection: there are 11 possible states, packet losses, latency issues, etc. Building upon this, we have higher-level protocols, services, and entire applications behind them, turning this into a recursive problem. The result is that most applications have essentially an infinite state space once all of their parts are composed.
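A rough back-of-the-envelope sketch of how quickly this grows, under a simplifying assumption that is mine and not a property of real systems: each connection is independent and contributes only its 11 TCP states, with packet loss, latency, and everything layered on top ignored. The function name is made up for the example.

```python
TCP_STATES = 11  # states in the TCP connection state machine, as cited above


def joint_state_count(connections: int, states_per_connection: int = TCP_STATES) -> int:
    """Number of joint states for `connections` independent state machines."""
    return states_per_connection ** connections


if __name__ == "__main__":
    for n in (1, 2, 10):
        print(n, joint_state_count(n))
```

Ten such connections already give 25,937,424,601 combinations, and that is a floor: it counts no application state at all.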
Most application failures aren't life-threatening. Let's say Google goes offline—I can't remember the last time that happened. Google is pretty important, and Google Search has almost utility status in modern society. People just expect it to be there, available and running. But it's unlikely that anyone has ever died or will ever die from Google Search being unavailable. If Tesla's autopilot crashes at the wrong time, though, people could be in real danger—even more so with a flight control system. The majority of applications have neither Google Search's utility status nor a flight control system's criticality.
Most software applications are more business-sensitive to lack of innovation than to lack of long-tail reliability.
Most applications aren't standardized like aircraft, where you can create and evolve one standard set of tests that over time can be used to increase the reliability of the entire industry.
There are probably other differences we could enumerate, but because of the differences above, I believe that for most software, using tests to cover its state space is rather cumbersome—that is, possibly of infinite cost—and inadequate given the consequences associated with the risk of failure.
On the other hand, the code base our applications run on is finite, so one might argue that test suites should aim for high coverage. But even with 100% coverage, a passing suite proves little. If a test fails, there is an error either in the test or in the software under test. If it passes, there may still be an error in the code being tested, in the test itself, or in both.
# Program with 100% test coverage
def is_even(number: int) -> bool:
    # Deliberately wrong: only 2 is reported as even, yet both tests below pass
    return number == 2

def test_is_even_true() -> None:
    assert is_even(2) is True

def test_is_even_false() -> None:
    assert is_even(3) is False
So, should we just abandon the notion of correct, reliable software? In my view, no. As illustrated above, if we write small, clear, well-defined functions with specific purposes, logical inspection is a better way of ensuring software quality than test coverage.
A small note on this heresy: I'm not saying that all tests are useless in all projects. Take FOSS projects with a large number of contributors, where each author has limited understanding of the whole, developer skill levels vary, and the correct behavior is well agreed upon. There, a growing body of tests is, in my view, a good way to prevent the introduction of bugs that earlier developers anticipated or fixed.
But AI is Different
While regular software can be inspected for logic errors, AI systems cannot. Despite all the interpretability efforts, AI systems are usually considered black boxes: probabilistic systems that we know work most of the time for a certain subset of problems, but whose outputs largely can't be inspected for errors ahead of time. Remember when Google Photos used to misclassify pictures of humans as animals? Or when Gemini would generate ethnically diverse images of German soldiers in World War II? These are just well-known examples of large-scale failures from a company that certainly doesn't lack technical resources or personnel for developing some of the best technology in the world.
But failures in AI deployment aren't constrained to such obvious examples.

Above is what should be a montage of recent complaints about degraded performance for Claude Code. The montage was created using Google AI Studio, and interestingly enough, has several errors 😂.
In any case, Anthropic has since added a notice to their status page:

And I totally buy it. I don't think Anthropic had any intention of degrading their models. If anything, Anthropic is, in my view, one of the most transparent AI labs, publishing extensive research on their models, educating people on their limitations and how to better use them, even when it doesn't necessarily favor them in some aspects.
Unfortunately, to my knowledge, there's no publicly available information on what caused this quality degradation, but one word catches my attention: "quality."
Was there an outage? No.
Was there an increased number of "errors," in the sense of API 500 errors, for example? No.
Did the models provide a completion to users' inputs? Yes.
Through my engineering background, I understand the general concept of "quality" as how much a manufactured good adheres to a given specification. Intuitively, we know that quality goes beyond manufacturing, and we have a tacit understanding of what it means. But in the brief research I did for this article, I found it quite interesting how difficult it is to find a suitable definition of quality on Wikipedia that matches this problem:
“Quality often focuses on manufacturing defects during the warranty phase”
“Inherent degree of excellence”
“Conformance to requirements or specifications at the start of use”
“Fraction of product units shipped that meet specifications”
“Number of warranty claims during the warranty period”
“Non-conformance with a requirement (e.g., basic functionality or a key dimension)”
These are mostly Six Sigma-related definitions, but definitions found in "software quality" articles seemed to me equally or more inadequate.
Of the definitions above, the best fit is "inherent degree of excellence," which is also the fuzziest and the hardest to apply in practice. The notice from Anthropic reflects this, as they mention their monitoring includes "reports of degradation."
Different Tests for Different Reasons
Back to the reminder service I’m running: around the same time such “quality” issues were reported with Claude, I noticed a few reminders I had requested being interpreted in an odd manner, such as “pick up my son at 9” changing from a single event to a recurring event. To be fair, these also happened around the same time I made some system prompt changes to fix a different issue—though seemingly unrelated to this matter.
Unfortunately, I currently don't have any specific tests in place to monitor this precisely. So did the interpretation of some time inputs change because my prompt changed or because something in the model changed? While I did minor testing when I changed the system prompt, I don't have a comprehensive battery of tests, and fortunately my service is still small enough that I'm able to notice the degradation and manually investigate it.
But I'm taking a lesson from this: if you're running LLMs in production, you need an explicit metric that determines whether a completion falls within your application's definition of correct. Tests against that metric should run both at regular intervals (to catch statistical drift coming from the model provider) and whenever any input changes, even a seemingly unrelated one (to catch drift caused by the change).
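One way to picture such a metric, as a sketch rather than something I have in place: keep a list of inputs with the interpretation you consider correct, run each one several times, and track the fraction that match. Everything named here is hypothetical, including `Case`, `pass_rate`, the `interpret` callable standing in for the LLM call, and the shape of the expected dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    text: str       # what the user typed
    now: str        # ISO timestamp the request is evaluated at
    expected: dict  # interpretation considered correct (hypothetical shape)


def pass_rate(
    interpret: Callable[[str, str], dict], cases: list[Case], runs: int = 5
) -> float:
    """Fraction of (case, run) pairs whose interpretation matches the expectation.

    Each case is repeated `runs` times because completions are not deterministic.
    """
    passed = 0
    for case in cases:
        for _ in range(runs):
            if interpret(case.text, case.now) == case.expected:
                passed += 1
    return passed / (len(cases) * runs)


# Example case built from the reminder discussed earlier.
CASES = [
    Case(
        text="remind me to pick up my son at 9",
        now="2025-01-06T20:00",
        expected={"kind": "once", "at": "2025-01-06T21:00"},
    ),
]
```

In practice `interpret` would wrap the real model call with the production system prompt, and the resulting rate would be recorded on a schedule and after every prompt change, so the two causes can be told apart.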

Unlike strict software correctness tests, these should be more like field sobriety tests. There, we're not measuring whether the person has complete dexterity or a perfectly steady pace, only whether they are within normal limits. Likewise, the question is whether the model is behaving statistically within what we consider acceptable for the application.
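A minimal sketch of what "statistically within normal" could mean, under assumptions of my own: a baseline pass rate was measured earlier, runs are independent, and a result counts as suspicious when so few passes would be very unlikely if the baseline still held. The function names and the `alpha` threshold are illustrative choices, not a recommendation.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def looks_degraded(
    passes: int, trials: int, baseline: float, alpha: float = 0.01
) -> bool:
    """True when this few passes would be unlikely if the baseline rate still held."""
    return binomial_cdf(passes, trials, baseline) < alpha
```

With a 97% baseline, 95 passes out of 100 is ordinary variation and is not flagged, while 85 out of 100 is. The test tolerates the occasional stumble and only reacts to a pattern.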
Conclusion
Regardless of which camp you're in regarding software quality assurance methods, AI technologies bring us to a different arena, where we're no longer dealing with mostly deterministic systems. Variability is expected, just like with human beings, and systems may subtly escape fuzzy definitions of quality. We must then strive to develop solutions to monitor this variability so that we can identify problems and their sources before they impact customers, and be able to openly disclose malfunctions to customers when they happen.