Liam DeVoe

Swarm testing

2026-06-03T00:00:00-04:00

Swarm testing is a technique for increasing behavioral diversity in randomized testing. It's conceptually simple, yet powerful, which makes it a favorite of mine. In this post, I describe a natural extension to swarm testing which yields an additional increase in behavioral diversity.

Traditional swarm testing

Consider a stack machine with three instructions (push, pop, add), and the corresponding stateful test¹:¹ In pseudocode, because I want to emphasize the behavior before any particular testing framework changes the distribution.

class StackMachineTest:

    @rule(integers())
    def push(self, value):
        ...

    @rule()
    def pop(self):
        # early-returns if less than one value on stack
        ...

    @rule()
    def add(self):
        # early-returns if less than two values on stack
        ...

Here is one simple approach to exercising this test. For each test case, sample the number of rules $n$ to run from some distribution centered on the desired average test case size. Then pick the next rule to run uniformly at random (from {push, pop, add}), until you've run $n$ rules total.
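The approach above can be sketched in plain Python. This is a minimal sketch rather than any framework's implementation: the Gaussian size distribution, the `mean_size` parameter, and the `run_uniform_case` name are assumptions made for illustration.

```python
import random

RULES = ("push", "pop", "add")

def run_uniform_case(rng, mean_size=30):
    """Run one test case; return (calls per rule, max stack depth reached)."""
    # Assumed size distribution: a Gaussian centered on mean_size, floored at 0.
    n = max(0, round(rng.gauss(mean_size, mean_size / 4)))
    stack = []
    counts = dict.fromkeys(RULES, 0)
    max_depth = 0
    for _ in range(n):
        rule = rng.choice(RULES)  # uniform over {push, pop, add}
        counts[rule] += 1
        if rule == "push":
            stack.append(rng.randint(-100, 100))
        elif rule == "pop":
            if stack:  # early-return when the stack is empty
                stack.pop()
        elif len(stack) >= 2:  # add: early-return with fewer than two values
            stack.append(stack.pop() + stack.pop())
        max_depth = max(max_depth, len(stack))
    return counts, max_depth
```

Calling `run_uniform_case(random.Random(0))` many times and recording `max_depth` gives a rough sense of how often a test case reaches a given stack size.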

This testing strategy has a weakness for our test. Suppose the stack machine implementation has a bug, which only manifests when the stack size is large (say, > 10). We can visualize whether the test finds this bug by plotting the number of calls to push vs pop:

This plot shows the joint distribution of the number of calls to push and pop within a test case².² push and pop are each approximately normally distributed: for a given rule, each pick is a Bernoulli trial, so its count over repeated draws is binomial, which is approximately normal. Each "point" on the plot represents a single test case. The bug lives in the lower right corner, where push - pop > 10. Because we expect to draw roughly as many pop rules as push rules, the stack is unlikely to grow large enough to trigger the bug.

This leads us to the following general observation: some features, like push and pop here, actively mask bugs when combined together. Conceptually, such bugs live along the axes of our plots:

And are unlikely to be triggered.

We would like a testing strategy which explores this part of the search space. The insight of swarm testing is that one can achieve this by randomly disabling certain features for an individual test case. For example, one might assign a 50% probability of disabling each rule³.³ This is the algorithm used in the swarm testing paper. Note however that this is a poor choice for other reasons: it is unlikely to disable either almost all, or almost no, rules as the number of rules grows. The fix is straightforward, but orthogonal to this article. For the interaction of Rule1 and Rule2, there are four equally-likely possibilities in a test case:

Both rules are enabled. As above.
Rule1 is enabled, but not Rule2. We see some exploration along $y = 0$.
Rule2 is enabled, but not Rule1. We see some exploration along $x = 0$.
Neither is enabled. No exploration; uninteresting.

We can visualize the resulting distribution as the sum of the first three cases:

This testing strategy now explores the previously unlikely region of the state space that contains this type of bug. It would easily find our push / pop bug, for example.
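As a rough sketch of the disabling step, assuming the same three rules as the stack machine: the `swarm_case` name and its parameters are hypothetical, and the sketch only generates the sequence of rule names rather than running them.

```python
import random

RULES = ("push", "pop", "add")

def swarm_case(rng, n, disable_probability=0.5):
    """Return the sequence of rule names for one swarm test case."""
    # Decide once per test case which rules are enabled.
    enabled = [r for r in RULES if rng.random() >= disable_probability]
    if not enabled:
        return []  # every rule was disabled: nothing to run
    # Then pick uniformly among the enabled rules only.
    return [rng.choice(enabled) for _ in range(n)]
```

A test case in which pop was disabled contains only push and add calls, which is the kind of case that lets the stack grow.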

A problem

Up to this point, I've described traditional swarm testing. And it's great; we get some nice increase in diversity. Specifically, we can explore states which require one or more rules to be completely disabled.

But, as you may have noticed, some under-explored areas remain⁴:⁴ I am intentionally ignoring the search space represented by the upper right area. This area can easily be covered by increasing the average number of rules run in a test case.

The newly-highlighted area corresponds to when Rule1 is enabled, but substantially less likely than Rule2; or vice versa.

To give a concrete example of why we might care about this case, suppose our stack machine gains a new optimize opcode. When run, optimize looks at the execution history of the machine and performs a dynamic JIT-style optimization. Now suppose that optimize has a bug only when the execution history is sufficiently long, and there is the right ratio of pop calls to push calls; say, 5 to 1:

This bug has two conditions: that push and pop have the right ratio, and that both rules are enabled. It therefore won't be caught by either the original testing strategy (which is unlikely to produce the right ratio) or by the swarm testing strategy (which either fully disables one of the rules, or leaves both enabled and behaves like the original strategy).

A simple extension

With this motivating example in mind, I propose a simple extension to swarm testing. Traditionally, each rule is disabled with 50% probability. Instead, I propose that for each test case, each rule $r$ is assigned an activation probability $r_p \in [0, 1]$, sampled uniformly. Then, whenever a rule would normally be run, it is instead skipped with probability $1 - r_p$.
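A minimal sketch of this extension follows. It assumes that a skipped rule does not count toward the $n$ rules run (the text leaves this open), and the `activation_case` name and the `max_attempts_per_rule` safety cap are hypothetical additions for illustration.

```python
import random

RULES = ("push", "pop", "add")

def activation_case(rng, n, max_attempts_per_rule=1000):
    """Return (activation probabilities, sequence of rules actually run)."""
    # One activation probability per rule, fixed for the whole test case.
    # rng.random() samples from [0, 1), a negligible difference from [0, 1].
    activation = {rule: rng.random() for rule in RULES}
    sequence = []
    attempts = 0
    # The cap guards against a case where every probability is close to 0.
    while len(sequence) < n and attempts < n * max_attempts_per_rule:
        attempts += 1
        rule = rng.choice(RULES)  # pick as the uniform strategy would
        if rng.random() < activation[rule]:  # run with probability r_p
            sequence.append(rule)  # otherwise skipped, with probability 1 - r_p
    return activation, sequence
```

With probabilities such as push at 0.1 and pop at 0.5, a test case still contains both rules, but at a skewed ratio that neither earlier strategy is likely to produce.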

It might be helpful to play around and see why this algorithm gives us coverage of the previously-rare regions:

Conceptually, we are letting the distribution "roam around" our graph uniformly. Because we're uniformly sampling $\text{push}_p \in [0, 1]$ and $\text{pop}_p \in [0, 1]$, we're equally likely to get a distribution centered on any point in the space of # calls to push vs # calls to pop.

Here, you can see that exploration in practice:

This testing strategy will easily find both the new optimize bug, and our original push / pop bug. I view it as a straightforward improvement on swarm testing.

One step further

I'll conclude with a teaser. Above, I said the activation probabilities are sampled from a uniform distribution on $[0, 1]$. Let's consider a program with more features; say, 10. Now suppose this program has a bug only in some particular configuration of relative feature probabilities. For example, that some set of three features are half as common as some other set of three. Uniformly sampling the activation probabilities is very unlikely to produce this configuration, and so we will miss this bug.

We want a distribution of activation probabilities that is likely to produce this configuration. Not only that, we want a distribution of activation probabilities that is also likely to produce any other possible bug-inducing configuration: feature A half as likely as B half as likely as C; A ten times as likely as all other features; feature probabilities distributed according to some power law; and many others besides.

This implies the distribution of activation probabilities should itself be randomly sampled from the space of distributions.

It's swarms all the way down.

Thanks to Zac for bouncing swarm testing ideas around with me.

My agent management software

2026-04-29T00:00:00-04:00

I write a lot of code. Or rather, I used to write a lot of code. After Claude Opus ~4.5, it's now more accurate to say that I review and design a lot of code.

Around the release of Opus 4.5 was also when I started working on Hegel. As a greenfield project spanning multiple repositories, my work on Hegel surfaced pain points I don't normally encounter when working on Hypothesis or other projects, such as managing the frequent small PRs and merge conflicts that come with a young, active codebase.

Thanks to some combination of these two factors, I found myself settling on a wishlist for tooling around my development flow:

I now context switch—a lot. I'm writing a feature spec one moment, bouncing design ideas off an agent the next, before getting pulled away to review a third agent's work. All while waiting for a long-running research or implementation agent in the background. I need something that manages my various task states, so I always feel that I can walk away and come back later.
Coordinating a change across multiple repositories requires a context switch to manage their branches, PRs, and GitHub interlinks. It shouldn't have to. I want to say what I want once, across all repositories, and let the agents get the git details right.
I never want to manually resolve a merge conflict again. The agents are here. We have the technology.

And, well—seeing as coding agents have made personalized tooling cheap (but not free, despite some claims to the contrary!), I figured I'd spend a week building exactly such a tool.

Plait

Here's Plait, my agent management software¹:¹ Heavily vibecoded, but not entirely. I gave detailed guidance on all the UI and the actual semantics, and on several gritty technical decisions.

The unit of work in a repository is a worktop. Each worktop has a git worktree, and has a nullable 1:1 correspondence with a pull request. That is, you can think of a worktop as scoped to the same unit of work as a PR, but which may or may not have an associated PR yet.

A worktop can contain multiple Claude sessions:

Claude sessions are standard claude processes. Claude Code persists sessions on disk automatically, which Plait resumes on demand with claude --resume <session_id>.

My most used workflow is to open a new worktop and talk with its session, eventually telling it to PR its changes. Many worktops only need this single session. Others, especially more involved features, benefit from the advanced context management you get with multiple sessions.

In the background, every 5 minutes, Plait kicks off a daemon process. This daemon checks for state changes in any worktops with associated pull requests. Is there a merge conflict? Has the CI turned from green to red? Are there new PR comments or reactions?

If so, the daemon starts a tend session. This is a Claude session with instructions to resolve the merge conflict, fix the CI if caused by our changes, and resolve any comments addressed towards it. Tend sessions are saved for each worktop if I need to inspect them later.²² Useful for debugging why a tend session didn't respect some part of its system prompt, for example.
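To make the state-change check concrete, here is a hypothetical sketch. It is not Plait's actual code: `PullRequestState`, its fields, and `reasons_to_tend` are names invented for illustration, and the sketch assumes the daemon compares the state seen on the previous poll with the current one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PullRequestState:
    # Hypothetical snapshot of a PR, taken on each poll.
    has_merge_conflict: bool
    ci_passing: bool
    comment_count: int
    reaction_count: int

def reasons_to_tend(previous, current):
    """Return the reasons, if any, to start a tend session."""
    reasons = []
    if current.has_merge_conflict and not previous.has_merge_conflict:
        reasons.append("merge conflict")
    if previous.ci_passing and not current.ci_passing:
        reasons.append("CI turned red")
    if (current.comment_count > previous.comment_count
            or current.reaction_count > previous.reaction_count):
        reasons.append("new comments or reactions")
    return reasons
```

An empty list means nothing changed since the last poll, so no tend session would be started under these assumptions.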

Finally, Plait has a higher order notion called a slate. A slate orchestrates multiple worktops, potentially across repositories.

I start a slate whenever a change touches more than one repository. I talk with the slate's session until I'm confident it has enough context to spawn sessions whose instructions I won't need to immediately revise. The slate then creates the appropriate worktops, spawning a session in each with instructions to implement its portion of the feature.

From here, I have two options. I can either dip down to a specific worktop to manually manage its sessions. Or, if I realize I need to make a cross-repo adjustment, I can tell that to the slate session, and have it spawn and manage the worktop sessions for me.

As an escape hatch to the underlying tools, I can always click VS Code on a worktop to open a VS Code window at that worktree. And I can click VS Code on a Claude session to open the same, additionally with a terminal window opened to that Claude session.³³ I don't need these often, but when I do, I really need them.

Plait is open-source here. I make no guarantees of support or stability. In fact, I almost guarantee it won't work for you!

To be clear, I fully expect Plait to be obsolete within 12 months. Either because one of the AI labs releases an AI-native GitHub that I feel is as good or better than Plait, or because the AI labs have made substantially more than just this workflow obsolete. For now, I'm enjoying it!

Property-based testing is about to rule the (software) world

2026-02-11T00:00:00-05:00

And what can we do to prepare?

Many people have strong opinions about the next few years of AI progress. Regardless of yours, I claim that (1) the models will continue to improve for at least another 6 months; and (2) even if that stopped today, Opus 4.6-tier models are already powerful enough to dramatically change how many developers write software.

I characterize this change as "AI code is treated as a black box". AI-pilled programmers care only about the observable outcome of code, not the implementation. In other words: the only thing that matters anymore is the guarantees on the box. When I ask the black-box z3 solver for a satisfying assignment, I don't care how it got there, only that the result actually satisfies my formula.
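The "guarantees on the box" idea can be illustrated with a small sketch. This does not use z3's API: `black_box_solver` is a brute-force stand-in for an opaque solver, and the CNF encoding (lists of signed integers) and both function names are assumptions for illustration. The point is that `satisfies` checks the result without knowing how it was produced.

```python
from itertools import product

def satisfies(cnf, assignment):
    """True if every clause has at least one literal made true by assignment."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

def black_box_solver(cnf, num_vars):
    """Stand-in for an opaque solver; here, plain brute force."""
    for bits in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if satisfies(cnf, assignment):
            return assignment
    return None  # no satisfying assignment exists

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
cnf = [[1, -2], [2, 3], [-1, -3]]
model = black_box_solver(cnf, num_vars=3)
```

Whatever the solver does internally, the guarantee we rely on is only that `satisfies(cnf, model)` holds whenever a model is returned.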

If we are to embrace AI code as an industry, we will and must adopt better ways to place guarantees on these black boxes. And I think property-based testing will quickly emerge as the forerunner.¹¹ At least until we can autonomously formally verify code according to the theorem statement "this code has no bugs". I expect this to be many years away even at current model progress rates. ²² Or fuzzing, if you prefer that framing. I largely see fuzzing and PBT as two views on the identical problem, and think it's unfortunate we don't have more communication between these two worlds.

Property-based testing

I have always been surprised at how under-adopted property-based testing is. Do companies not care about testing? Is it not mentioned enough in university curriculums? (Yes, but I digress). Has PBT just not permeated the cultural zeitgeist?

It doesn't really matter. AI is about to provide the forcing function for PBT to become a developer household name. Or, to put it another way: PBT is about to get a lot more users.

And yet, the PBT ecosystem is underprepared for this influx. In Python, I maintain Hypothesis, which I have no qualms in claiming as the most successful PBT library of all time.³³ See https://hypothesis.readthedocs.io/en/latest/usage.html. For example, 4% of 2024 PSF survey respondents report using Hypothesis. Python might well weather this storm.

But as much as I love Python, it comprises a small percentage of production code. What about other languages? Most do have a PBT library. And, to be clear, many years of development effort have gone into them. But I think even their maintainers will acknowledge most other libraries don't match the breadth and depth of Hypothesis:

Internal shrinking, which is consistently world-class
Pluggable backends, including z3 integration
Observability
Coverage-guided fuzzing integration
A powerful internal test case representation
Stateful testing
Test case database, for regressions
Test case deduplication

My point is not to glorify Hypothesis. Even after 11 years of development, there is always more to improve. Rather, the demand for PBT is about to explode, and I don't think any language is prepared for it—maybe not even Python.

My concrete call to action: as a PBT ecosystem, we need to figure out how to share improvements among all libraries, to consolidate and amplify the best of our development effort. I am not the first to say this, but it has never been more true than today. The open PBT observability spec is designed for any language and is a step in this direction.

What else can we standardize? Shrinking? The database? The choice sequence? How can we take the best parts of every library and combine them into one, in preparation for the PBT renaissance?

If you maintain a PBT library and want to collaborate with Hypothesis on this, reach out.

Homebrew catan

2024-08-27T00:00:00-04:00

My family's board game of choice is Catan. We've probably played close to 50 games of it in my lifetime. We've experimented with some small homebrew rules before, and more recently I saw real-time Catan, which we played two games of. Even after two games it was clear to us that real-time Catan is an enormous improvement, and I doubt we'll ever go back to regular Catan again.

That said, we did find we needed to tweak the rules. Here's our full homebrew ruleset, building off cities and knights + seafarers:

Turns have a set time limit. We generally start with 45 seconds a turn, and increase to 60 seconds later in the game if it's clear people need more time for more complex turns.
You may take any action on anybody's turn, including trading with anyone else.
The only exception to this is progress cards, which must be played on your turn.
When a player takes an action that requires a response from another player (e.g. master merchant), pause the timer for all players.
When a player reaches 13 victory points, the game does not end immediately. Instead there is an (indefinite, but reasonable) rebuttal period for the remainder of the turn where players continue to play.
If a player still has 13 VPs at the end of the turn, they win.
If two players are tied for VPs at the end of the turn, play continues until one player is ahead at the end of a turn.
If any actions conflict, ties are broken by turn order, with the person whose turn it is having priority, and so on continuing clockwise.
You may declare any progress card you own as tradeable by placing it face up in front of you.
You can barter with other players using tradeable progress cards as you would any other resource.
They are still progress cards in every respect. They count toward your progress card limit, they can be stolen by the spy, and you can still play them.

All other rules that interact with turns are still in play: you cannot play a progress card on the same turn you receive it, the player who rolls a 7 moves the robber, etc. The purpose of the rebuttal period is to deter players from waiting until the last second to reach 13 victory points. And the purpose of not immediately ending the game when a player "wins" is to avoid a mad rush to reach 13 victory points before anyone else on a turn! Requiring progress cards to be played on your turn is both to nerf them, as we found they were otherwise too powerful, and to reduce the potential for conflicting actions.

In my opinion, breaking ties by turn order is more elegant than casually deciding each case at the table, as the original post described. We found conflicting actions to be a large problem – they only happened ~once a game, but could turn the course of the game (such as a wedding played right as someone builds a settlement).

While we're on the topic of homebrews, we've long been searching for a way to make the green commodity's ability in cities and knights less powerful, but haven't found anything thematically satisfying while not nerfing it into the ground.

Thanks to Robert O'Callahan for describing the original idea!

Gödel's incompleteness theorem

2022-03-16T00:00:00-04:00

Ah, Gödel's incompleteness theorem. I won't say it's the most misused theorem in all of mathematics, but I would argue it has the worst ratio of "people who actually understand it" to "people who misapply it".

Here it is:

For any sufficiently strong theory $T$, there is a sentence $\sigma$ which is independent of $T$.

Before we can unpack it, you need a crash course in model theory. This will be a little bit painful, but I promise it's critically important.

Sentences and Theories

Here are the axioms of group theory, which you'll find at the beginning of any standard textbook:

\[ \begin{equation} \begin{aligned} & \sigma_0: \forall a \forall b \forall c \ (c*(a*b) = (c*a)*b) \\ & \sigma_1: \forall a \ (e*a = a) \\ & \sigma_2: \forall a \exists b \ (b*a = e) \\ \end{aligned} \end{equation} \]

(If you're not familiar with group theory, don't worry; the actual content of these axioms is largely irrelevant for us. They state that $*$ is associative, has an identity $e$, and every element has an inverse respectively).

What your group theory textbook probably didn't tell you is that there's a language $L_{group}$ associated with group theory. This is the set of special symbols we'd like to be able to refer to in our sentences: $L_{group} = \{e, *\}$.

The three axioms above are examples of sentences. A sentence is a statement in first order logic which contains only logical symbols ($\lnot$, $\land$, $\lor$, $\implies$, $\iff$, $\forall$, $\exists$), or symbols from our language $L$. For instance, the following is not an $L_{group}$-sentence:

\[\forall a \ (a + a = a)\]

because $+$ isn't in $L_{group}$. Note that whether a statement is a sentence or not depends on the language, which is why we say $L_{group}$-sentence instead of just sentence. When the language is clear from context, we'll drop the $L$- prefix and call it a sentence.

We can bundle these axioms together into an object called a theory. $T_{group} = \{\sigma_0, \sigma_1, \sigma_2\}$ is the theory of groups. A theory is any set of sentences.¹¹ Since theories consist of sentences, and sentences depend on a language, you would be right to suspect that theories also depend on a language. Formally we call a theory $T$ an $L$-theory, where $L$ is the language of the theory. We again drop the $L$- prefix when the language is clear from context.

Here's the incompleteness theorem again:

For any sufficiently strong theory $T$, there is a sentence $\sigma$ which is independent of $T$.

We've defined sentence and theory. Let's tackle independent next.

Models

To discuss independence of sentences, we first need to talk about models. We say that $\mathcal{A}$ is a model of a theory $T$ if $\sigma$ is true in $\mathcal{A}$ for all $\sigma \in T$.²² If $\sigma$ is true in $\mathcal{A}$, you'll see this written in the literature as $\mathcal{A} \vDash \sigma$. However, you'll see very shortly that this is an overloading of the $\vDash$ operator; its meaning changes depending on if the right hand side is a sentence $\sigma$ or a theory $T$. I've avoided writing $\mathcal{A} \vDash \sigma$ here for clarity, but it is the more precise usage.

Don't be scared by the notation. If you wanted to check whether something is a group or not, what do you do? You check that it satisfies all the axioms of being a group. That's all this definition is stating. Saying "$(\mathbb{Z}, +)$ is a group" is equivalent to saying "$(\mathbb{Z}, +)$ models $T_{group}$". And if $\mathcal{A}$ models $T$, we write $\mathcal{A} \vDash T$.

We need just one more definition. Let $T$ be any theory. Then $T$ is complete if, for all sentences $\sigma$ and for all models $\mathcal{A} \vDash T$ and $\mathcal{B} \vDash T$, $\sigma$ is true in $\mathcal{A}$ iff $\sigma$ is true in $\mathcal{B}$. In other words, "every model of $T$ agrees on the truth value of every sentence".

A natural question is whether $T_{group}$ is complete. Can you think of a sentence $\sigma$ which is true in some group $\mathcal{A} \vDash T_{group}$ but false in another group $\mathcal{B} \vDash T_{group}$? Hint: the answer is yes, there are several such sentences, and they aren't that complicated. Try and think of one now before you read on, if you like.

(pause...) If you said "$\mathcal{A}$ is abelian" (i.e. $*$ commutes in $\mathcal{A}$), you're correct! I also would have accepted "$\mathcal{A}$ has an element of order $n$" for some $n > 1$. Here's a sentence that is true in $\mathcal{A}$ iff $\mathcal{A}$ is abelian:

\[ \sigma_{abelian} = \forall a \forall b \ (a*b = b*a) \]

To see that $\sigma_{abelian}$ proves that $T_{group}$ is not complete, pick your favorite abelian group, say $(\mathbb{Z}, +)$, and your favorite non-abelian group, say $GL(2, \mathbb{R})$. $\sigma_{abelian}$ is true in $(\mathbb{Z}, +)$ and false in $GL(2, \mathbb{R})$, since addition commutes and matrix multiplication does not. But both $(\mathbb{Z}, +)$ and $GL(2, \mathbb{R})$ are models of $T_{group}$ — after all, they're both groups and thus satisfy the three axioms of $T_{group}$. So $\sigma_{abelian}$ is true in $(\mathbb{Z}, +) \vDash T_{group}$ and false in $GL(2, \mathbb{R}) \vDash T_{group}$, so $T_{group}$ is not complete.
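To make the failure of commutativity in $GL(2, \mathbb{R})$ concrete, here is one pair of invertible matrices that do not commute. The particular matrices are chosen only for illustration; any non-commuting pair would do.

\[ A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \quad AB = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \neq \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} = BA \]

Both $A$ and $B$ have determinant 1, so they are elements of $GL(2, \mathbb{R})$, and taking $a = A$, $b = B$ witnesses that $\sigma_{abelian}$ is false there.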

Independence

What about independence? Let $T$ be any theory and $\sigma$ be any sentence. Then $\sigma$ is independent of $T$ if there are two models $\mathcal{A} \vDash T$ and $\mathcal{B} \vDash T$ such that $\sigma$ is true in $\mathcal{A}$ and false in $\mathcal{B}$. In the example above, $\sigma_{abelian}$ is independent of $T_{group}$. A corollary is that a theory $T$ is not complete iff there is some sentence $\sigma$ which is independent of $T$. I'll do the proof explicitly below, but it's nothing more than unpacking the respective definitions.

$\implies$ Let $T$ be not complete. So there is some sentence $\sigma$ and some models $\mathcal{A} \vDash T$ and $\mathcal{B} \vDash T$, such that either $\sigma$ is true in $\mathcal{A}$ and false in $\mathcal{B}$, or false in $\mathcal{A}$ and true in $\mathcal{B}$. In either case, $\sigma$ is independent of $T$ (in the second case, swap the roles of $\mathcal{A}$ and $\mathcal{B}$).

$\impliedby$ Let $\sigma$ be a sentence independent of $T$. Then there are $\mathcal{A} \vDash T$ and $\mathcal{B} \vDash T$ such that $\sigma$ is true in $\mathcal{A}$ and false in $\mathcal{B}$. So $T$ is not complete. $\blacksquare$

If a theory $T$ is "not complete", we call $T$ incomplete.

Let's take a closer look at the incompleteness theorem, as stated:

For any sufficiently strong theory $T$, there is a sentence $\sigma$ which is independent of $T$.

We just proved that $T$ is incomplete iff there is a sentence $\sigma$ which is independent of $T$. So we can restate the theorem as:

Any sufficiently strong theory $T$ is incomplete.

This is where the "incompleteness" portion of the theorem's name comes from. Although these statements are equivalent, I'll continue to use the first, longer version, since I feel it's more intuitive (as it doesn't require you to unpack the definition of $T$ being incomplete).

Before I make our final definition of sufficiently strong, I want to take a detour into euclidean geometry as a final example to round out our discussion of theories and models.

Euclidean geometry

Euclidean geometry is another example of a theory. It contains five axioms and three "undefined terms": point, line, and plane are undefined and are referenced in the axioms without definition. Does that sound familiar? We did the exact same thing in groups, using the "undefined terms" $*$ and $e$ in our axioms, and defining them to be part of our language $L_{group}$. It turns out that the notion of a language has always been hiding in euclidean geometry. The language of euclidean geometry is just $L_{euclid} = \{\text{point}, \text{line}, \text{plane}\}$. I'll call $T_{EG}$ the theory of euclidean geometry, which is the set of the five axioms of euclidean geometry.

You'll notice that I'm not giving a precise mathematical definition of the axioms, but that's because Euclid himself didn't really give precise mathematical definitions either. Euclidean geometry can in fact be made precise (see Tarski's Axioms), and everything I say below will still hold, but I'll avoid deviating too much from the euclidean geometry described in Euclid's Elements.

You may also know of the particularly contentious parallel postulate (PP), the fifth axiom of euclidean geometry. Some people thought that the parallel postulate could be proven from the rest of the axioms, and gave the name "neutral geometry" to the set of axioms of euclidean geometry without PP. I'll call the theory of neutral geometry $T_{NG} = T_{EG} \setminus \{PP\}$. They then showed PP could not be proven from the rest of the axioms by constructing two models of $T_{NG}$: one in which PP is true (a model of euclidean geometry) and one in which PP is false (a model of hyperbolic geometry).

Once they had shown PP could not be proven from neutral geometry, they called PP independent of neutral geometry. Does this term "independent" sound familiar? It should: we defined $\sigma$ to be independent of $T$ if there are $\mathcal{A} \vDash T$, $\mathcal{B} \vDash T$ where $\sigma$ is true in $\mathcal{A}$ and false in $\mathcal{B}$. Here, $\sigma$ is the parallel postulate, $T$ is neutral geometry, $\mathcal{A}$ is a model of euclidean geometry, and $\mathcal{B}$ is a model of hyperbolic geometry. In general, proving that an axiom $\sigma \in T$ is "independent" of (cannot be proven from) the other axioms of $T$ is equivalent to proving that $\sigma$ is independent of $T \setminus \{\sigma\}$, in the formal sense of independence described above.

Because PP is independent of $T_{NG}$, $T_{NG}$ is incomplete. However, it turns out that $T_{EG} = T_{NG} \cup \{\text{PP}\}$ is complete, so by taking PP as a new axiom we've created a complete theory.³³ Proving that euclidean geometry is complete requires a more formal axiomatization than what Euclid gave, and so we turn to Tarski's Axioms (sometimes called elementary euclidean geometry) instead. This is outside the scope of this post, but Tarski proved that his theory was complete by showing that it admits quantifier elimination. Completeness follows from this since the language has no constants, which means the only sentences without quantifiers are $\top$ and $\bot$, which are true and false in every model respectively. We'll discuss this concept of "completing" a theory $T$ by adding new axioms again later, and whether this can save us from the consequences of the incompleteness theorem. Spoiler: it can't.

Sufficiently strong

I've left the simplest – or at least, easiest to informally explain – for last. A theory $T$ is sufficiently strong if it contains the natural numbers, addition on the natural numbers, and multiplication on the natural numbers (or contains objects isomorphic to them). More formally, $T$ is sufficiently strong if it contains Robinson arithmetic, called $Q$. If you're familiar with Peano arithmetic, $Q$ is Peano arithmetic without induction.
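For reference, here is one standard axiomatization of $Q$, in the language $L_Q = \{0, S, +, \cdot\}$ where $S$ is the successor function. Presentations vary slightly between textbooks, so treat this as one common formulation rather than the only one.

\[ \begin{aligned} & \forall x \ (Sx \neq 0) \\ & \forall x \forall y \ (Sx = Sy \implies x = y) \\ & \forall x \ (x \neq 0 \implies \exists y \ (x = Sy)) \\ & \forall x \ (x + 0 = x) \\ & \forall x \forall y \ (x + Sy = S(x + y)) \\ & \forall x \ (x \cdot 0 = 0) \\ & \forall x \forall y \ (x \cdot Sy = (x \cdot y) + x) \end{aligned} \]

Note that every axiom is a single $L_Q$-sentence, so $Q$ is a finite theory in exactly the sense defined earlier.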

Understanding why containing $Q$ is necessary gets to the heart of the proof of the incompleteness theorem and is a much deeper discussion than we can get into here, so I hope you'll forgive me for not going into any more detail.

Bringing it all together

Let's recap:

A sentence $\sigma$ is a statement in first order logic, potentially containing symbols from some language $L$
A theory $T$ is a set of sentences
$\mathcal{A}$ is a model of $T$ (written $\mathcal{A} \vDash T$) if $\sigma$ is true in $\mathcal{A}$ for all $\sigma \in T$
A sentence $\sigma$ is independent of a theory $T$ if there are models $\mathcal{A} \vDash T$, $\mathcal{B} \vDash T$ with $\sigma$ true in $\mathcal{A}$ and false in $\mathcal{B}$
A theory $T$ is sufficiently strong if it contains $Q$, aka Robinson arithmetic (informally, if it contains the natural numbers, addition, and multiplication)

And finally, the incompleteness theorem itself:

For any sufficiently strong theory $T$, there is a sentence $\sigma$ which is independent of $T$.

Congratulations — you now know everything you need to understand the statement of the incompleteness theorem. If that was your goal, you can walk away here, but I'll discuss the consequences of this theorem next.

Consequences

To start, a question: is the converse of the incompleteness theorem true? No. We saw above that $T_{group}$ has a sentence $\sigma_{abelian}$ which is independent of $T_{group}$, but $T_{group}$ certainly does not contain $Q$. Informally, this means that theories can be incomplete for "other reasons" than the incompleteness theorem (actually, it's quite easy to create incomplete theories; much easier than creating complete ones). The reason why the incompleteness theorem is so important is not because it applies to a large number of theories, but because the theories it does apply to are important ones that we would really prefer to be complete.

In particular, the incompleteness theorem often comes up adjacent to the foundations of mathematics, with theories like $\text{ZFC}$. Although perhaps not obvious just by looking at the axioms, $\text{ZFC}$ can prove the axioms of $Q$, and is in fact much, much stronger than it. So $\text{ZFC}$ is sufficiently strong and thus subject to the incompleteness theorem, so there is some $\sigma$ which is independent of $\text{ZFC}$. In other words, there are theorems (sentences) which we will never be able to prove or disprove from the axioms of $\text{ZFC}$.

This probably doesn't sound too bad. So what? Well, think about your favorite mathematical field (which almost certainly uses $\text{ZFC}$ as its mathematical foundations, unless you're a category theorist). Then think about some famous unsolved conjecture in that field. Most people think there are only two options: either that conjecture is true, or it's false. The incompleteness theorem says there's a third possibility: the conjecture is one of these independent sentences $\sigma$, and thus can never be proven or disproven in $\text{ZFC}$.

I would say that proving a famous theorem independent is a much worse fate than proving it either true or false. Consider $T_{NG}$, the theory of neutral geometry we discussed above, and take the theorem under discussion to be the parallel postulate PP, which we know is independent of $T_{NG}$. When PP was found to be independent of $T_{NG}$, it split the world of geometry in two. In one camp are the worlds in which PP is true; we call these euclidean geometries, with theory $T_{NG} \cup \{\text{PP}\}$. In the other camp are the worlds in which PP is false; we call these non-euclidean geometries, with theory $T_{NG} \cup \{\lnot \text{PP}\}$. Because PP is independent of $T_{NG}$, both of these worlds are "equally valid". In my opinion, having two possible worlds is worse than knowing for certain which "world" we live in, like we would if PP were not independent of $T_{NG}$ (and therefore either true or false).

However, there's a reason why non-euclidean geometries are significantly less studied: most people believe PP is "intuitively true", and study euclidean geometry instead of non-euclidean geometry. This is true of PP, but it's not true of all independent sentences. Sometimes an independent sentence really does fracture a theory into multiple, equally popular camps. In other words, it's not always obvious which "choice" to make (eg whether to add PP or $\lnot$PP).

For instance, in set theory, the Continuum Hypothesis (CH) is the most well known example of a sentence independent of $\text{ZFC}$. When it was proven to be independent, it split the world of set theory in two, just like PP did. But this time it's worse, because there is a large amount of disagreement among set theorists about whether CH is intuitively true. If you tried to get $\text{ZFC} \cup \{\text{CH}\}$ accepted as the foundation of mathematics (instead of $\text{ZFC} \cup \{\lnot \text{CH}\}$ or just $\text{ZFC}$), you would get significant pushback from set theorists, because to them, both worlds are equally interesting.

You might hold out hope that alright, fine, $\text{ZFC}$ has some independent sentences, but they're sentences we didn't really care about anyway. This is actually mostly true if you're not a set theorist and don't work with graduate level math! Most sentences independent of $\text{ZFC}$ come from set theory, and the rest are complicated statements in other fields, most of which I don't even understand the statement of.⁴⁴ See List of statements independent of ZFC. But the incompleteness theorem puts a "cap", so to speak, on $\text{ZFC}$ (and thus mathematics): the deeper into a subject you go, the closer and closer you brush up against independent statements. And if you're particularly unlucky, you'll actually run into a theorem in your work which is independent of $\text{ZFC}$, and you'll curse the incompleteness theorem when you do.

So, sentences being independent of a theory is bad, and all theories which can serve as the foundation of mathematics have independent sentences (because they are, without exception, sufficiently strong). This is the single most important implication of the incompleteness theorem.

Incompleteness of $T_{group}$

But wait — if a theory $T$ being incomplete is bad, and we proved that $T_{group}$ is incomplete above, isn't that bad news for group theorists? Well, it's not good, but it's also not bad. It's true that $\sigma_{abelian}$ splits $T_{group}$ into two theories: $T_{group} \cup \{\sigma_{abelian}\}$ and $T_{group} \cup \{\lnot \sigma_{abelian}\}$. But these are just the theories of abelian and non-abelian groups respectively. If I had asked you whether studying abelian and non-abelian groups separately bothers you, you would have looked at me like I'm crazy. After all, if you want to prove something about abelian groups, you just assume that $G$ is abelian (but note that this is identical to working in $T_{group} \cup \{\sigma_{abelian}\}$).
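To make the split concrete, here is a minimal sketch that brute-force checks the group axioms and $\sigma_{abelian}$ on two small finite groups. The helper names (`is_group`, `is_abelian`) and the choice of $\mathbb{Z}_8$ and $S_3$ as examples are mine, purely for illustration. Both structures are models of $T_{group}$, but $\sigma_{abelian}$ is true in one and false in the other, which is exactly what it means for $\sigma_{abelian}$ to be independent of $T_{group}$.

```python
from itertools import permutations, product


def is_group(elements, op, identity):
    """Brute-force check of the group axioms on a finite set."""
    closed = all(op(a, b) in elements for a, b in product(elements, repeat=2))
    associative = all(
        op(op(a, b), c) == op(a, op(b, c))
        for a, b, c in product(elements, repeat=3)
    )
    has_identity = all(op(identity, a) == a == op(a, identity) for a in elements)
    has_inverses = all(
        any(op(a, b) == identity == op(b, a) for b in elements) for a in elements
    )
    return closed and associative and has_identity and has_inverses


def is_abelian(elements, op):
    """Check sigma_abelian: a * b == b * a for every pair of elements."""
    return all(op(a, b) == op(b, a) for a, b in product(elements, repeat=2))


# Z_8: the integers 0..7 under addition mod 8.
z8 = list(range(8))


def add_mod_8(a, b):
    return (a + b) % 8


# S_3: all permutations of (0, 1, 2) under composition.
s3 = list(permutations(range(3)))
s3_identity = (0, 1, 2)


def compose(p, q):
    # (p after q)(i) = p[q[i]]
    return tuple(p[q[i]] for i in range(3))


z8_is_abelian = is_abelian(z8, add_mod_8)
s3_is_abelian = is_abelian(s3, compose)
```

Here `z8_is_abelian` and `s3_is_abelian` come out different even though both structures pass `is_group`. Nobody finds that troubling, because we were never hoping for $T_{group}$ to settle the question in the first place.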

The difference is that $T_{group}$ is not trying to be a theory of mathematics. You don't particularly care if you can't prove every possible statement for all groups, because if you can't, you can always look at a specific group you care about and prove whether that statement is true in that group or not. This isn't possible in a theory of mathematics.⁵⁵ This is because a theory of mathematics can't prove that it has any models, or else it would prove its own consistency, which contradicts Gödel's second incompleteness theorem. So there are no "specific models" of a theory of mathematics to look at, or at least none whose existence the theory itself can establish.

Completeness of $T_{EG}$

But wait — we said earlier that the theory of euclidean geometry, $T_{EG}$, was complete. Does this contradict the incompleteness theorem? No, because $T_{EG}$ is not "sufficiently strong". There are a number of interesting theories which are complete, like $T_{EG}$, but aren't strong enough to be subject to the incompleteness theorem.

"Completing" a theory

Recall that we saw $T_{NG}$ (neutral geometry), which is incomplete, could be extended to a complete theory $T_{EG}$ (euclidean geometry) by adding the parallel postulate. We say that $T_{EG}$ is a "completion" of $T_{NG}$, that $T_{NG}$ can be "completed" by adding PP, etc.

You might wonder if we could pull the same trick for theories affected by the incompleteness theorem. Given some sufficiently strong theory $T$, the incompleteness theorem says there is some $\sigma$ independent of $T$. Could we complete $T$ by adding either $\sigma$ or $\lnot \sigma$ to $T$ as an axiom? The answer is no, regardless of which we choose. Adding an axiom to a theory never makes that theory weaker (ie prove fewer sentences); it can only make it stronger. This new theory $T' = T \cup \{\sigma\}$ would still be sufficiently strong and thus subject to the incompleteness theorem, so there is some new sentence $\sigma'$ which is independent of $T'$.

So no matter how many independent sentences we add as axioms to a sufficiently strong theory, it will still be sufficiently strong and subject to the incompleteness theorem. A sufficiently strong theory can never be "completed".
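To spell the argument out as a chain of extensions (the indexed names $T_n$ and $\sigma_n$ are notation I'm introducing just for this sketch):

$$
\begin{aligned}
T_0 &= T \\
T_{n+1} &= T_n \cup \{\sigma_n\}, \quad \text{where } \sigma_n \text{ is independent of } T_n
\end{aligned}
$$

Every $T_{n+1}$ contains $T$, so it is still sufficiently strong. And because $\sigma_n$ is independent of $T_n$, it is true in at least one model of $T_n$, so $T_{n+1}$ still has a model. The incompleteness theorem therefore applies again at every step and hands us the next $\sigma_{n+1}$. Choosing $\lnot \sigma_n$ instead of $\sigma_n$ at any step works the same way.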

Independent sentences are "true"

This misunderstanding (I'm tempted to say "abuse") is the singular reason I wrote this post, so you'll have to forgive me if I rant a bit here. The single most common misuse of the incompleteness theorem is stating that the independent sentence is somehow "true". Here's a direct quote from Wikipedia:

For any such consistent formal system, there will always be statements about natural numbers that are true, but that are unprovable within the system.

(They're using "consistent formal system" to mean some theory $T$, "statements about the natural numbers" to mean some sentence $\sigma$, and "unprovable within the system" to mean "independent of $T$".)

Except this is wrong. An independent sentence $\sigma$ is absolutely not "true". It is, by definition, true in some model $\mathcal{A} \vDash T$ and false in some other model $\mathcal{B} \vDash T$, so calling it "true" is nonsense. It's neither true nor false; it's independent.

What people really mean when they say that an independent sentence $\sigma$ is true is that it's true in the "standard model", and therefore, they argue, intuitively true. What is the standard model? Nothing more than a particular model $\mathcal{A} \vDash T$ we have arbitrarily chosen as intuitive for visualizing $T$. For instance, the standard model of euclidean geometry $T_{EG}$ is the plane $\mathbb{R}^2$.

But for other theories, it's not clear at all what the standard model is – say, for $T_{group}$. You might suggest $(\mathbb{Z}, +)$, but there's no good reason to choose that group over, say, $(\mathbb{Z}_8, +)$, or even $GL(2, \mathbb{R})$. Here, the concept of a "standard model" breaks down.

For theories which have a standard model, this line of thinking does have some philosophical merit. I just wish people would say "there is a sentence which cannot be proven from $T$ but is true in the standard model", instead of saying "there is a true sentence which cannot be proven", which sounds like a contradiction. This seeming contradiction bothered me for many years when reading about the incompleteness theorem, and I was greatly relieved to eventually learn that people were simply misinterpreting the theorem.

Gödel's completeness theorem

Before his incompleteness theorem, Gödel proved another theorem about the completeness of first order logic. Informally, this theorem says that for all theories $T$ and sentences $\sigma$, if $\sigma$ is true in every model $\mathcal{A} \vDash T$, then there is a proof of $\sigma$ from the axioms of $T$. In other words, every sentence which is true in all models of $T$ has a proof from $T$.

The naming of these theorems suggests a contradiction: how can we have both Gödel's completeness theorem and Gödel's incompleteness theorem?

Well, because they refer to two different notions of completeness. "Completeness" in the completeness theorem means that "everything which is true in every model is provable". However, "incompleteness" in the incompleteness theorem means that some theories $T$ have sentences which are independent of $T$: true in some models of $T$ and false in others. These independent sentences don't even satisfy the conditions of the completeness theorem (since they're not true in every model), so these two theorems are entirely orthogonal.
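Written side by side, the two notions look like this. I'm using $T \vDash \sigma$ as shorthand for "$\sigma$ is true in every model of $T$" and $T \vdash \sigma$ for "there is a proof of $\sigma$ from the axioms of $T$"; treat both as notation assumed for this sketch.

$$
\begin{aligned}
\text{completeness theorem:} \quad & T \vDash \sigma \implies T \vdash \sigma \\
\text{$T$ is complete:} \quad & \text{for every } \sigma, \ T \vDash \sigma \ \text{ or } \ T \vDash \lnot \sigma
\end{aligned}
$$

The first line is a fact about first order logic as a whole and holds for every theory. The second is a property that a particular theory may or may not have, and an independent sentence is precisely a $\sigma$ for which neither $T \vDash \sigma$ nor $T \vDash \lnot \sigma$ holds.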

Technicalities

I haven't been entirely truthful with you. There are two extra assumptions we need to add before we get the true incompleteness theorem. They deal with what are essentially edge cases – though very important edge cases.

Satisfiable

First, we require that the theory $T$ be satisfiable. A theory $T$ is satisfiable if there is any model $\mathcal{A} \vDash T$ at all. Equivalently, $T$ is satisfiable if its axioms are consistent, ie you can't derive a contradiction from them. If we allowed $T$ to be unsatisfiable, then the incompleteness theorem would fail in the trivial case: let $T$ be any sufficiently strong, unsatisfiable theory. Then there are no models $\mathcal{A} \vDash T$, so vacuously, there are no independent sentences $\sigma$ (since an independent sentence requires at least two models). But this would contradict the incompleteness theorem.
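The vacuous case is easier to see in a toy setting. The sketch below uses propositional logic over two variables rather than first order logic, so it is only an analogy (the incompleteness theorem says nothing about theories this weak), and the names `models` and `is_independent` are my own. A "model" here is just a truth assignment, and a sentence is independent of a theory if it is true in some model of the theory and false in another.

```python
from itertools import product

VARIABLES = ("p", "q")


def models(theory):
    """All truth assignments which satisfy every axiom of the theory."""
    assignments = (
        dict(zip(VARIABLES, values))
        for values in product([False, True], repeat=len(VARIABLES))
    )
    return [a for a in assignments if all(axiom(a) for axiom in theory)]


def is_independent(sentence, theory):
    """True if the sentence is true in some model and false in another."""
    truth_values = {sentence(m) for m in models(theory)}
    return truth_values == {True, False}


# One axiom: p. This says nothing about q, so q is independent of it.
satisfiable_theory = [lambda a: a["p"]]

# Two axioms: p and not-p. No assignment satisfies both.
unsatisfiable_theory = [lambda a: a["p"], lambda a: not a["p"]]


def q(a):
    return a["q"]


q_independent_of_satisfiable = is_independent(q, satisfiable_theory)
q_independent_of_unsatisfiable = is_independent(q, unsatisfiable_theory)
```

With no models to disagree with each other, nothing can be independent of `unsatisfiable_theory`, which is why the theorem has to exclude unsatisfiable theories up front.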

Here's our updated incompleteness theorem:

For any sufficiently strong, satisfiable theory $T$, there is a sentence $\sigma$ which is independent of $T$.

Recursively enumerable

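For what are actually pretty technical reasons, we also require $T$ to be "recursively enumerable". This means that the elements of $T$ are "computable", ie there is an algorithm which lists the sentences of $T$ one by one, such that every $\sigma \in T$ eventually shows up in the list. It's not worth getting into the details here, but this basically rules out crazy theories where you just throw in so many axioms that the theory becomes complete (for instance, by taking every sentence which is true in some fixed model as an axiom), at the cost of no algorithm being able to list the axioms. Any "reasonable" theory like $\text{ZFC}$ or $Q$ is recursively enumerable.

You might also see the requirement stated as $T$ being "decidable", as in, there is an algorithm which can "decide" whether a sentence $\sigma$ is an element of $T$.⁶⁶ Strictly speaking, being decidable is a stronger condition than being recursively enumerable. But any theory with a recursively enumerable set of axioms can be given a decidable set of axioms which proves exactly the same sentences (this is Craig's theorem), so the difference doesn't matter here. The multitude of other names you'll run into, like "recursive" and "computable", is thanks to computability theory, which proved that several distinct notions of computability (all with their own names) are actually exactly equivalent.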
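As a rough illustration of the two kinds of algorithm involved, here is a sketch for a made-up, infinite set of axioms: every sentence of the form "n + 0 = n" for a natural number n. This axiom set, and the function names `enumerate_axioms` and `is_axiom`, are hypothetical and chosen only to keep the example short. The first function lists the axioms one by one; the second decides membership outright.

```python
import re
from itertools import count, islice


def enumerate_axioms():
    """List the axioms one by one; every axiom eventually appears."""
    for n in count():
        yield f"{n} + 0 = {n}"


# A numeral is 0, or a nonzero digit followed by more digits.
_NUMERAL = r"(0|[1-9][0-9]*)"
_AXIOM_PATTERN = re.compile(_NUMERAL + r" \+ 0 = " + _NUMERAL)


def is_axiom(sentence):
    """Decide membership: always answers, with either True or False."""
    match = _AXIOM_PATTERN.fullmatch(sentence)
    return match is not None and match.group(1) == match.group(2)


first_three = list(islice(enumerate_axioms(), 3))
```

For this axiom set both algorithms exist. The theories which the technicality rules out are the ones where not even the first kind of algorithm does.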

So our updated incompleteness theorem is then:

For any sufficiently strong, satisfiable, recursively enumerable theory $T$, there is a sentence $\sigma$ which is independent of $T$.

You can see why I didn't want to lead with this definition :)

I promise that I'm not holding anything back anymore — this is the genuine, full incompleteness theorem which Gödel himself proved⁷.⁷ Ok, fine, you got me: Gödel's original proof required something called $\omega$-consistency, a strengthening of consistency. However, it turns out this condition can be weakened to consistency alone, with Rosser's trick. These extra assumptions rarely come up in casual discussions, which is why I left them until now to discuss.

Afterword

I debated a lot about which examples of theories to use, and in fact originally wrote a draft where I used the theory of real-valued vector spaces. Unfortunately, its axiomatization is quite dirty, and so I dropped it in favor of the theory of groups, even though I think more people would be familiar with vector spaces than groups. Oh well.

There are also some philosophical implications I wanted to include, but I don't feel qualified to discuss them. I'm also not really convinced that the philosophical implications of the incompleteness theorem are all that big.
