Swarm testing
Swarm testing is a technique for increasing behavioral diversity in randomized testing. It's conceptually simple, yet powerful, which makes it a favorite of mine. In this post, I describe a natural extension to swarm testing which yields an additional increase in behavioral diversity.
Traditional swarm testing
Consider a stack machine with three instructions (push, pop, add), and the corresponding stateful test1:1 In pseudocode, because I want to emphasize the behavior before any particular testing framework changes the distribution.
class StackMachineTest:
@rule(integers())
def push(self, value):
...
@rule()
def pop(self):
# early-returns if less than one value on stack
...
@rule()
def add(self):
# early-returns if less than two values on stack
...
Here is one simple approach to exercising this test. For each test case, sample the number of rules \(n\) to run from some distribution centered on the desired average test case size. Then pick the next rule to run uniformly at random (from {push, pop, add}), until you've run \(n\) rules total.
This testing strategy has a weakness for our test. Suppose the stack machine implementation has a bug, which only manifests when the stack size is large (say, > 10). We can visualize whether the test finds this bug by plotting the number of calls to push vs pop:
This plot shows the joint distribution of the number of calls to push and pop within a test case2.2 push and pop are individually normally distributed, because picking the next rule uniformly at random is a bernoulli trial, from which repeated draws form a normal distribution. Each "point" on the plot represents a single test case. The bug lives in the lower right corner, where push - pop ≥ 10. Because we expect to draw roughly as many pop rules as push rules, the stack is unlikely to grow large enough to trigger the bug.
This leads us to the following general observation: some features, like push and pop here, actively mask bugs when combined together. Conceptually, such bugs live along the axes of our plots:
And are unlikely to be triggered.
We would like a testing strategy which explores this part of the search space. The insight of swarm testing is that one can achieve this by randomly disabling certain features for an individual test case. For example, one might assign a 50% probability of disabling each rule3.3 This is the algorithm used in the swarm testing paper. Note however that this is a poor choice for other reasons: it is unlikely to disable either almost all, or almost no, rules as the number of rules grows. The fix is straightforward, but orthogonal to this article. For the interaction of Rule1 and Rule2, there are four equally-likely possibilities in a test case:
- Both rules are enabled. As above.
- Rule1 is enabled, but not Rule2. We see some exploration along \(y = 0\).
- Rule2 is enabled, but not Rule1. We see some exploration along \(x = 0\).
- Neither are enabled. No exploration; uninteresting.
We can visualize the resulting distribution as the sum of the first three cases:
This testing strategy now explores the previously unlikely state space that contains this type of bug. This testing strategy would easily find our push / pop bug, for example.
A problem
Up to this point, I've described traditional swarm testing. And it's great; we get some nice increase in diversity. Specifically, we can explore states which require one rule or more rules to be completely disabled.
But, as you may have noticed, some under-explored areas remain4:4 I am intentionally ignoring the search space represented by the upper right area. This area can easily be covered by increasing the average number of rules run in a test case.
The newly-highlighted area corresponds to when Rule1 is enabled, but substantially less likely than Rule2; or vice versa.
To give a concrete example of why we might care about this case, suppose our stack machine gains a new optimize opcode. When run, optimize looks at the execution history of the machine and performs a dynamic JIT-style optimization. Now suppose that optimize has a bug only when the execution history is sufficiently long, and there is the right ratio of pop calls to push calls; say, 5 to 1:
This bug has two conditions: that push and pop have the right ratio, and that both rules are enabled. It therefore won't be caught by either the original testing strategy (which is unlikely to produce the right ratio) or by the swarm testing strategy (which will fully disable one of the rules).
A simple extension
With this motivating example in mind, I propose a simple extension to swarm testing. Traditionally, each rule is disabled with 50% probability. Instead, I propose that for each test case, each rule \(r\) is assigned an activation probability \(r_p \in [0, 1]\), sampled uniformly. Then, whenever a rule would normally be run, it is instead skipped with probability \(1 - r_p\).
It might be helpful to play around and see why this algorithm gives us coverage of the previously-rare regions:
Conceptually, we are letting the distribution "roam around" our graph uniformly. Because we're uniformly sampling pushp \(\in [0, 1]\) and popp \(\in [0, 1]\), we're equally likely to get a distribution centered on any point in the space of # calls to push vs # calls to pop.
Here, you can see that exploration in practice:
This testing strategy will easily find both the new optimize bug, and our original push / pop bug. I view it as a straightforward improvement on swarm testing.
One step further
I'll conclude with a teaser. Above, I said the activation probabilities are sampled from a uniform distribution on \([0, 1]\). Let's consider a program with more features; say, 10. Now suppose this program has a bug only in some particular configuration of relative feature probabilities. For example, that some set of three features are half as common as some other set of three. Uniformly sampling the activation probabilities is very unlikely to produce this configuration, and so we will miss this bug.
We want a distribution of activation probabilities that is likely to produce this configuration. Not only that, we want a distribution of activation probabilities that is also likely to produce any other possible bug-inducing configuration: feature A half as likely as B half as likely as C; A ten times as likely as all other features; feature probabilities distributed according to some power law; and many others besides.
This implies the distribution of activation probabilities should itself be randomly sampled from the space of distributions.
It's swarms all the way down.
Thanks to Zac for bouncing swarm testing ideas around with me.