Property based testing for scientific code in Python

Automated software testing starts with the often annoying and time-consuming process of writing tests. But no matter how annoying it is, in my experience it always pays off in the end. For this article, I assume that the reader acknowledges the importance of automated software testing, because I would like to show a way to write better tests in less time by using property based testing.

The Python package hypothesis provides a toolkit for doing property based testing. The concept works like this: You have a piece of code that you want to test, e.g. a function that accepts certain inputs as arguments. This function returns a value and you have certain assumptions about this value. These are the properties of the return value that you test. Some of these assumptions should always hold true, no matter what the input looks like. For instance, if you have a function that calculates some probability value, it should always be a numeric value in the interval [0, 1]. Other properties may depend on the input value, e.g. for certain inputs the result should always be zero (or close to zero, because you have to take floating point precision issues into account). And then you may have certain assumptions about the behavior of the function itself, e.g. that for certain inputs it should throw an exception.
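To make these three kinds of properties concrete, here is a minimal sketch in plain Python. The function prob_at_least_one() is a hypothetical example invented for illustration, and the checks only run over a handful of hand-picked inputs. A property based test would make the same assertions for generated inputs.

```python
import math


def prob_at_least_one(p, n):
    """Probability of at least one success in n independent trials.

    Hypothetical example function, used only to illustrate properties.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError('p must be in [0, 1]')
    if n < 0:
        raise ValueError('n must not be negative')
    return 1.0 - (1.0 - p) ** n


for p in (0.0, 0.1, 0.5, 1.0):
    for n in (0, 1, 10, 1000):
        res = prob_at_least_one(p, n)
        # property that always holds: the result is a probability
        assert 0.0 <= res <= 1.0
        # property that depends on the input: no trials or no chance
        # of success means a result of (close to) zero
        if n == 0 or p == 0.0:
            assert math.isclose(res, 0.0, abs_tol=1e-12)

# property about the behavior: invalid input raises an exception
try:
    prob_at_least_one(1.5, 3)
except ValueError:
    pass
else:
    raise AssertionError('expected a ValueError for p > 1')
```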

With the hypothesis package, you write a test that accepts arbitrary input data of a certain type and optionally with certain characteristics (e.g. integers between 0 and 10, or strings of length 3, etc.). Inside your test function, you set up the previously mentioned assumptions. During testing, hypothesis then generates random test data according to the specifications and throws that at your test function to see if your assumptions or assertions hold true:

If it finds an example where it doesn’t, it takes that example and cuts it down to size, simplifying it until it finds a much smaller example that still causes the problem. It then saves that example for later, so that once it has found a problem with your code it will not forget it in the future.

(from the hypothesis documentation)

A first example

Let’s try out hypothesis with an example: Suppose you have implemented a function called str_multisplit(s, sep) that behaves like Python’s str.split() method, but works with not just one separator string or character, but several of them:

str_multisplit('Split/this.example/string', ['/', '.'])
## ['Split', 'this', 'example', 'string']
str_multisplit('Split/this.example/string', ['/'])
## ['Split', 'this.example', 'string']
str_multisplit('Split/this.example/string', [','])
## ['Split/this.example/string']

Note how in the first example the input string s is split into a list of substrings by the characters '/' and '.'. I’ve put an example implementation of such a function on GitHub.
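For readers who want to follow along without looking up the code, a function with this behavior could be sketched as below. This is an assumption about how such a function might look, not necessarily the implementation on GitHub: it simply applies str.split() for one separator after the other.

```python
def str_multisplit(s, sep):
    """Split string s by every separator in the sequence sep.

    Sketch only: the separators are applied one after another,
    so every part is split again by the next separator.
    """
    parts = [s]
    for c in sep:
        next_parts = []
        for p in parts:
            next_parts.extend(p.split(c))
        parts = next_parts
    return parts


print(str_multisplit('Split/this.example/string', ['/', '.']))
## ['Split', 'this', 'example', 'string']
```

Note that this sketch does not validate its arguments. An empty separator string would make str.split() raise a ValueError.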

To write a property based test for this, we first list important properties that a result of this function should satisfy, e.g.:

  1. no matter what the inputs are, the function should always return a list
  2. if argument s is an empty string, the result must be a list containing only the empty string, i.e. ['']
  3. if sep is an empty sequence, the result must be a list containing only the input s, i.e. [s]
  4. each substring of the result must satisfy:
    • it must also be a substring of s
    • it must not contain any of the separator strings in sep
  5. the number of substrings in the result equals the sum of the occurrences of each unique sep-item in s plus 1

Now we can implement a test function that checks whether these properties hold for random inputs using the hypothesis package. The two most important objects from hypothesis that you need in almost all tests are the @given() decorator and the strategies module. Both are combined to define the arguments that a test function accepts and how the random values for these arguments should be generated – the latter is called a “data generation strategy”. So, let’s import these things first (using the short name st for the strategies module):

from hypothesis import given
from hypothesis import strategies as st

Our function str_multisplit() accepts two arguments, the input string s and the characters/strings sep to split by, and we want to provide random test data for both of these arguments. The first argument should be a random string. Hypothesis provides a data generation strategy for that, called text(). The second argument is a list of characters (or strings but we stick with characters for now). We can use the lists() strategy in combination with characters() to generate lists of characters. The generated data can then be passed to the function we want to test, str_multisplit(), and we store the result in a variable res:

@given(s=st.text(), sep=st.lists(st.characters()))
def test_str_multisplit_hypothesis(s, sep):
    res = str_multisplit(s, sep)

We could already run this test to see if some random input data provokes an exception, an endless loop or other problems in our function.

We can now implement the five properties that we listed above by checking res like this:

@given(s=st.text(), sep=st.lists(st.characters()))
def test_str_multisplit_hypothesis(s, sep):
    res = str_multisplit(s, sep)

    # 1. always return a list
    assert type(res) is list

    # 2. if argument s is an empty string, result must be ['']
    if len(s) == 0:
        assert res == ['']

    # 3. if sep is an empty sequence, result must be a list
    # containing only the input s, i.e. [s]
    if len(sep) == 0:
        assert res == [s]

    # 4. each substring must ...
    for p in res:
        assert p in s                # ... be a substring of s, too
        # ... not contain any of separator strings sep
        assert all(c not in p for c in sep)

    # 5. number of substrings in the result equals sum of the
    # occurrences of each *unique* sep-item c in s plus 1
    n_asserted_parts = 0
    for c in set(sep):
        n_asserted_parts += s.count(c)
    assert len(res) == n_asserted_parts + 1

Both the str_multisplit() function and the above test function are on GitHub. However, I strongly recommend always writing an additional test that checks against a set of fixed examples. The @pytest.mark.parametrize decorator is very useful for that and I’ve employed it in the test code on GitHub, too (lines 11ff).

If you run the above test it should pass without problems, i.e. all our assertions hold true for the generated test data (and the examples in the second test function).

Our function should also work with strings (with length > 0) as separators, not just characters, e.g.:

str_multisplit('Hello<>World///Hello<>Test!', ['<>', '///'])
## ['Hello', 'World', 'Hello', 'Test!']

We can adjust the hypothesis test for this by changing the data generation strategy for sep to st.lists(st.text(min_size=1)). However, this will make our test fail for the generated input s = '00', sep = ['0', '00'] (note that when you run this test, it may generate other random inputs that fail):

        n_asserted_parts = 0
        for c in set(sep):
            n_asserted_parts += s.count(c)
>       assert len(res) == n_asserted_parts + 1
E       assert 3 == 4

res        = ['', '', '']
... AssertionError

This is interesting, because with res = ['', '', ''] our function actually produces the correct result for str_multisplit('00', sep=['0', '00']) (Python’s str.split() gives the same result with '00'.split('0'), and '00' can’t be used to split it further). What’s wrong here is the last check in the test, because it doesn’t account for cases where one element in sep is part of another element in sep (like '0' being a substring of '00').

Of course, the generated test inputs s = '00', sep = ['0', '00'] are an exotic edge-case. But this is what hypothesis is for. It specifically searches for edge-cases that make your code break, cases that you probably wouldn’t think of yourself. This is one important take-away from this example. The other important take-away is that sometimes it is not the tested code that is faulty, but the test itself, as shown above.
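The mismatch can be reproduced with plain Python, without hypothesis. The following sketch only uses str.count() and str.split() on the failing input. The variable names are chosen for illustration.

```python
s = '00'
sep = ['0', '00']

# What the fifth check expects: the occurrences of every unique
# separator in s, plus 1. str.count() counts non-overlapping
# occurrences, so '0' is found twice and '00' once.
counts = {c: s.count(c) for c in sep}
expected_n_parts = sum(counts.values()) + 1

# What splitting actually yields: once s is split by '0', there is
# no '00' left in any part that could split it further.
parts = s.split('0')

print(counts)
## {'0': 2, '00': 1}
print(expected_n_parts)
## 4
print(parts)
## ['', '', '']
```

The check counts the same characters of s twice, once as two '0' and once as one '00', which is why it expects four parts where only three can exist.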

The hardest thing in property based testing is coming up with properties or assumptions that signal quickly that something is wrong with the result and are at the same time easy to check. In most cases it makes sense to:

  • check the result type
  • check for limits, e.g. if the result is in the expected interval (for instance a calculated probability to be in [0, 1]) or has the expected dimensions (for instance matrix operations resulting in the expected number of rows and columns)
  • check for (expected) exceptions
  • if you have a function f and the inverse of that function f_inv, you can check for identity (i.e. make sure that f(f_inv(x)) == f_inv(f(x)) == x), or, when the pair is not symmetric, as with an encode/decode function pair, make sure that decode(encode(x)) == x
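The last item in this list, the round-trip check, is usually the cheapest property to write. As a small sketch, the following two tests use only built-in Python operations as stand-ins for f, f_inv, encode and decode. The test names are made up for this illustration.

```python
from hypothesis import given
from hypothesis import strategies as st


@given(st.text())
def test_utf8_round_trip(s):
    # encode/decode pair: decoding the encoded text gives back the input
    assert s.encode('utf-8').decode('utf-8') == s


@given(st.lists(st.integers()))
def test_reverse_is_its_own_inverse(xs):
    # reversing a list twice gives back the original list
    assert xs[::-1][::-1] == xs
```

Both tests can be collected by pytest or simply be called without arguments, in which case hypothesis generates the inputs.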

It’s especially important that the checks are not too complex. If they are, it’s likely that you create a faulty test as shown in the above example. When you’re essentially re-implementing the actual function within your test function in order to check results, you’ve definitely taken it too far.

A second example: word co-occurrence

I’d like to show a second example that exhibits very well how the hypothesis package works together with functions and data types from the Scientific Python stack, namely NumPy. For this example, we take a function that calculates the word co-occurrence for a given document-term matrix (DTM).

Suppose you have the following DTM with three documents and four unique words:

# w_1 2  3  4 
 [[1, 0, 0, 2], # doc_1
  [5, 3, 3, 1], #     2
  [0, 0, 1, 1]] #     3

The rows represent the documents (let’s say doc_1 to doc_3) and the columns the words (let’s call them w_1 to w_4). This means, for example, that word w_1 (first column) occurs once in doc_1, 5 times in doc_2 and doesn’t occur in doc_3. A word co-occurrence matrix shows us how often each possible word pair occurs in the same document. For example, the pair (w_1, w_2) has a co-occurrence value of 1, because the two words co-occur only in doc_2. (w_1, w_4) co-occur in doc_1 and doc_2, hence their co-occurrence is 2. The full word co-occurrence matrix for the given DTM would look like this:

# w_1 2  3  4
 [[2, 1, 1, 2],  # w_1
  [1, 1, 1, 1],  #   2
  [1, 1, 2, 2],  #   3
  [2, 1, 2, 3]]  #   4

The columns and rows give the pairs of words, e.g. when you look up column 1 and row 4, you get the co-occurrence value for (w_1, w_4), which is 2 as we expected. The matrix is always symmetric, because the order of the words doesn’t matter for co-occurrence. The diagonal gives the document frequency of each word, i.e. the number of documents in which this word occurs at least once.
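The matrix above can be reproduced with a few lines of NumPy. The following is a minimal sketch under the assumption that co-occurrence is computed by first reducing the DTM to “word occurs in document: yes or no” and then multiplying that binary matrix with its own transpose. It is not necessarily how the implementation referenced in this article works.

```python
import numpy as np


def word_cooccurrence(dtm):
    """Count for each word pair the documents in which both words occur.

    Sketch only: binarize the DTM, then multiply the transposed
    binary matrix with the binary matrix.
    """
    dtm = np.asarray(dtm)
    occurs = (dtm > 0).astype(np.int64)  # 1 where a word occurs in a document
    return occurs.T @ occurs


dtm = np.array([[1, 0, 0, 2],
                [5, 3, 3, 1],
                [0, 0, 1, 1]])

expected = np.array([[2, 1, 1, 2],
                     [1, 1, 1, 1],
                     [1, 1, 2, 2],
                     [2, 1, 2, 3]])

cooc = word_cooccurrence(dtm)
assert np.array_equal(cooc, expected)   # matches the matrix shown above
assert np.array_equal(cooc, cooc.T)     # symmetric
```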

I’ve put an implementation of a word co-occurrence function on GitHub. It accepts a two-dimensional NumPy array as input. How do we create a property based test function with hypothesis for this? Luckily, hypothesis contains some extensions for generating test data for NumPy. What we need are the data generation strategies arrays and array_shapes from hypothesis.extra.numpy. The former allows us to generate NumPy arrays and the latter lets us define the shape of these arrays. Let’s generate a few examples:

import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays, array_shapes

arrays(np.int64, array_shapes(min_dims=2, max_dims=2)).example()
## array([[-7178833191759845341],
##       [              -19818],
##       [                2007]])

arrays(np.int64, array_shapes(min_dims=2, max_dims=2)).example()
## array([[-11281, -11281, -11281, -11281, -11281],
##        [-11281, -11281, -11281, -11281, -11281],
##        [-11281, -11281,  35434, -11281, -11281],
##        [ -5720, -11281, -11281, -11281, -11281],
##        [-11281, -11281, -11281, -11281, -11281]])

The first argument determines the data type for the array, here integers, and the second specifies its shape. The min_dims and max_dims arguments of array_shapes(), its first two parameters, are the minimum and maximum number of dimensions of the generated arrays. Here, we set both to 2 so that we always get a two-dimensional array. arrays() is a data generation strategy, similar to st.lists() or st.text() that we’ve used before. You can call .example() on such strategy objects to generate a concrete sample, which is handy for exploring a strategy interactively.

We can additionally specify a strategy for drawing the elements of the array, e.g. here limiting them to integers in range [0, 5]:

arrays(np.int64, array_shapes(min_dims=2, max_dims=2),
       elements=st.integers(0, 5)).example()

## array([[3],
##        [3],
##        [3],
##        [0],
##        [3]])

We can now plug this data generation strategy into the @given decorator for our test function and apply our word co-occurrence function to the generated DTM:

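# element range taken from the example above
@given(dtm=arrays(np.int64, array_shapes(min_dims=2, max_dims=2),
                  elements=st.integers(0, 5)))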
def test_word_cooccurrence(dtm):
    res = word_cooccurrence(dtm)

Now we can formulate our assumptions and implement them in our test. The result should always be a 2D integer array of shape N by N where N is the number of unique words (i.e. the number of columns in the DTM). All values in the result matrix must be greater or equal 0 and cannot be greater than the number of documents (i.e. the number of rows in the DTM). And finally the result matrix must be symmetric, as already mentioned. Formulating this in Python gives:

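# element range taken from the example above
@given(dtm=arrays(np.int64, array_shapes(min_dims=2, max_dims=2),
                  elements=st.integers(0, 5)))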
def test_word_cooccurrence(dtm):
    res = word_cooccurrence(dtm)

    n_docs, vocab_size = dtm.shape

    assert isinstance(res, np.ndarray)
    assert np.issubdtype(res.dtype, np.integer)  # any integer dtype
    assert res.ndim == 2
    assert res.shape == (vocab_size, vocab_size)
    assert np.all((res >= 0) & (res <= n_docs)) # min/max
    assert np.array_equal(res, res.T)  # symmetry

The GitHub gist contains the full code for this example. Of course, you should also create a traditional test with some known results for given inputs.

All in all, the hypothesis package is a great addition to an existing test suite and helps you to find bugs in your code for cases that you wouldn’t think of. It allows you to save time by writing shorter traditional tests that cover only a few example cases. You can delegate the work of finding the bugs for edge-cases to hypothesis. I’ve integrated property based testing with hypothesis into some of my projects, for example the Text Mining and Topic Modeling Toolkit.
