Note: as of 2024, this Gist is obsolete - Hypothesis supports alternative backends, including hypothesis-crosshair. Further plans are documented in HypothesisWorks/hypothesis#3914 and future work will likely be discussed on our issue trackers and docs.

Hybrid concrete/symbolic testing

You’ve probably heard that "testing can be used to show the presence of bugs, but never to show their absence" - usually as an argument for formal methods like symbolic execution. On the other hand, testing is often easier and finds bugs in formally verified systems too. But why not use both?

Crosshair is an SMT-based symbolic execution tool for Python. Providing an interface or extension to check Hypothesis tests, like propverify in Rust (alas, see the retrospective) or DeepState in C and C++, would give users the best of both worlds.

Zac Hatfield-Dodds (Hypothesis core dev) and Phillip Schanely (Crosshair core dev) are planning to collaborate on various experiments or interoperability features; Zac is also supervising undergraduate projects in this area, in addition to our usual willingness to support open-source contributions.

Background

"Write tests once, run them many times (with many backends)"

The three main options for backends are

Random "generative" testing - Hypothesis already does this well
Feedback-guided mutation / fuzzing - you can plug in a fuzzer for single tests already. HypoFuzz can fuzz all your tests at once, with dynamic prioritization.
Solver-based exhaustive path testing - this project!

The Fuzzing Book is a great introduction to these topics, and has some simple examples (in Python!) of symbolic and concolic testing.

Crosshair basically works by creating proxy objects which record the conditions they're used in, and then using the Z3 solver to choose different paths each time. When the solver gets stuck, or a proxy is passed to code that can't handle symbolic values (a non-symbolic function call, etc.), the proxy chooses a random possible value and execution continues as normal. Ongoing development is mostly based on improving the engine/frontend architecture (currently mixed) and patching more of the builtins and stdlib to support symbolic operations.
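To make the "proxy objects which record conditions" idea concrete, here is a minimal toy sketch. It is not Crosshair's implementation: SymbolicInt, classify and run_once are hypothetical names, there is no solver involved, and the concrete value is supplied by hand, where a real engine would ask Z3 for an input that negates one of the recorded conditions.

```python
class SymbolicInt:
    """Toy stand-in for a symbolic proxy; NOT how Crosshair is implemented."""

    def __init__(self, name, concrete, trace):
        self.name = name
        self.concrete = concrete  # value used to decide each branch on this run
        self.trace = trace        # list of (condition, outcome) records

    def _record(self, op_text, other, outcome):
        # Remember which condition was evaluated and which way it went.
        self.trace.append((f"{self.name} {op_text} {other}", outcome))
        return outcome

    def __lt__(self, other):
        return self._record("<", other, self.concrete < other)

    def __eq__(self, other):
        return self._record("==", other, self.concrete == other)


def classify(x):
    """Ordinary code under test; it does not know it was given a proxy."""
    if x < 0:
        return "negative"
    if x == 0:
        return "zero"
    return "positive"


def run_once(concrete):
    """Run classify() on a proxy and return (result, recorded path conditions)."""
    trace = []
    result = classify(SymbolicInt("x", concrete, trace))
    return result, trace


print(run_once(5))
print(run_once(-3))
```

Each run yields one path through classify() plus the conditions that selected it. Picking the next path amounts to flipping one recorded outcome and solving for a value that satisfies the new conditions.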

Write for Hypothesis, run with Crosshair

Integration UX

run Crosshair on Hypothesis tests just like any other kind of contract, or
integrate Crosshair into HypoFuzz for an automatic hybrid fuzzing workflow

Running this as part of Hypothesis itself is tempting, but too prone to unpredictable performance issues or otherwise misbehaving on reasonable test code to be worth it.

Crosshair -> choice sequence -> whole test

Basic support for this approach was added in crosshair 0.0.17!

Hypothesis is implemented in terms of a "choice sequence", which you can think of as the sequence of choices we make while traversing a decision tree of what to generate. David's paper on this explains more details.
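As a rough illustration of the choice-sequence idea (a toy model, not Hypothesis internals), the sketch below generates a list by making one choice at a time and records every choice. ChoiceSource and draw_list_of_digits are hypothetical names invented for this example.

```python
import random


class ChoiceSource:
    """Toy choice source: replays a given prefix, then chooses randomly."""

    def __init__(self, prefix=(), seed=0):
        self.prefix = list(prefix)
        self.rng = random.Random(seed)
        self.record = []  # every choice made during this run, in order

    def choose(self, n):
        """Return an integer in range(n), replaying the prefix where possible."""
        i = len(self.record)
        if i < len(self.prefix):
            value = self.prefix[i] % n
        else:
            value = self.rng.randrange(n)
        self.record.append(value)
        return value


def draw_list_of_digits(source):
    """A hand-written 'strategy'; each choice is one step down the decision tree."""
    result = []
    while source.choose(2) == 1:          # choice: add another element?
        result.append(source.choose(10))  # choice: which digit?
    return result


first = ChoiceSource(seed=1)
value = draw_list_of_digits(first)

# Replaying the recorded choices regenerates exactly the same value.
replay = ChoiceSource(prefix=first.record)
assert draw_list_of_digits(replay) == value
```

Anything that can produce such a sequence (a random generator, a fuzzer, or a solver) can drive the same test, which is why this level is an attractive integration point.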

If we can drive Hypothesis tests using CrossHair from the choice sequence level, this would dodge essentially all the problems of converting strategies. Because that's our standard intermediate representation, it would also work beautifully with our database logic, target(), fuzzing, and so on.

The current representation is a bytestring, which is what makes the fuzzer integration possible. However, asking Crosshair to analyse all of the Hypothesis-internal transformations between bits and other types leads to terrible performance. We therefore plan to try adding a "mid-level IR" layer consisting of a sequence of typed values (bool, int, float, bytes, str) that can be losslessly 'lowered' back to our standard choice sequence, but is more tractable for symbolic analysis.

MIR design notes

Zac's initial plan is to expand the role of the ConjectureData object by adding methods corresponding to the existing conjecture.utils helpers for e.g. integer_range and biased_coin. If we arrange this right, it should be possible to subclass ConjectureData in such a way that we get the actual values from Crosshair and then "write back" the lower-level bits that would have generated them - so that shrinking and our database logic "just works". We'll see how it goes!
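A minimal sketch of the "write back" idea, under loudly stated assumptions: ByteData is a toy stand-in for ConjectureData, and the method names draw_boolean / draw_integer, the two-byte integer encoding, and ProvidedData are all invented for illustration. None of this is the real Hypothesis API. The values list plays the role of whatever Crosshair would supply.

```python
class ByteData:
    """Toy stand-in for ConjectureData: draws typed values from a byte buffer."""

    def __init__(self, buffer=b""):
        self.buffer = bytes(buffer)
        self.index = 0
        self.written = bytearray()  # the low-level bytes behind each draw

    def _draw_bytes(self, n):
        chunk = self.buffer[self.index:self.index + n].ljust(n, b"\x00")
        self.index += n
        self.written += chunk
        return chunk

    def draw_boolean(self):
        return (self._draw_bytes(1)[0] & 1) == 1

    def draw_integer(self, min_value, max_value):
        # Toy encoding: two bytes reduced modulo the range size.
        # Assumes the range contains at most 65536 values.
        span = max_value - min_value + 1
        raw = int.from_bytes(self._draw_bytes(2), "big")
        return min_value + raw % span


class ProvidedData(ByteData):
    """Takes typed values from an external provider, writes back matching bytes."""

    def __init__(self, values):
        super().__init__()
        self.values = iter(values)

    def draw_boolean(self):
        value = bool(next(self.values))
        self.written += bytes([int(value)])
        return value

    def draw_integer(self, min_value, max_value):
        value = next(self.values)
        assert min_value <= value <= max_value
        self.written += (value - min_value).to_bytes(2, "big")
        return value


def draw_signed(data):
    """Test-side code is identical for both data sources."""
    negate = data.draw_boolean()
    x = data.draw_integer(-10, 10)
    return -x if negate else x


provided = ProvidedData([True, 4])
result = draw_signed(provided)

# The written-back bytes replay to the same result via the byte-level path.
assert draw_signed(ByteData(bytes(provided.written))) == result
```

Because the subclass leaves behind an ordinary byte buffer, anything that already consumes such buffers (in the real system, shrinking and the example database) would not need to know where the values came from.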

HypothesisWorks/hypothesis#3086

Generating a high-coverage seed corpus

One major advantage of fuzzing a Hypothesis test is that you can build up a "seed" corpus with good coverage just by running many random tests. Using Crosshair to discover new paths could plausibly work better - in practice you'd run both and take the union of the two corpora - so long as Crosshair can report the choice sequence which would lead Hypothesis to execute the same path. This is particularly promising as a way to discover inputs which cover very rare branches.
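The "take the union of the two corpora" step can be sketched as follows. This is a toy model with hypothetical names (branches_hit, build_corpus, covered): coverage is tracked by hand rather than by real instrumentation, and the "solver" corpus is a hand-supplied input standing in for whatever a symbolic tool would discover.

```python
import random


def branches_hit(n):
    """Toy instrumented function: returns the set of branch labels n covers."""
    hit = {"entry"}
    if n % 2 == 0:
        hit.add("even")
    if n == 123456:
        hit.add("rare")  # random generation is very unlikely to reach this
    return frozenset(hit)


def build_corpus(inputs):
    """Keep one input (the smallest) per distinct coverage signature."""
    corpus = {}
    for n in inputs:
        key = branches_hit(n)
        if key not in corpus or n < corpus[key]:
            corpus[key] = n
    return corpus


def covered(corpus):
    """All branch labels covered by at least one corpus entry."""
    return frozenset().union(*corpus)


rng = random.Random(0)
random_corpus = build_corpus(rng.randrange(1000) for _ in range(200))
solver_corpus = build_corpus([123456])  # stand-in for a solver-found input

# Union of the two corpora, deduplicated by coverage signature.
merged = build_corpus([*random_corpus.values(), *solver_corpus.values()])

print(sorted(covered(random_corpus)))
print(sorted(covered(merged)))
```

In the real workflow the corpus entries would be choice sequences rather than raw inputs, so that Hypothesis can replay them.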

The crosshair cover command demonstrates that this approach is viable.

"Assimilating" examples: output -> Crosshair -> Hypothesis

"Assimilating" external examples into Hypothesis (i.e. reversing the parse operation done by strategies) would allow Hypothesis to shrink examples from e.g. user bug reports, making debugging a lot easier.

This would be impossible in the general case if we implemented it as a new method on Hypothesis' SearchStrategy objects, though it may be practical in enough cases to be interesting.

However, using Crosshair to solve for a choice sequence which produces some example would mean running 'forwards' instead of 'backwards', and that's much more practical - the limit becomes "what does Crosshair support" rather than "can we invert arbitrary functions". It wouldn't support everything, but it would work much better on non-trivial code, e.g. any strategy with .map(some_function).
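A toy illustration of running 'forwards' rather than inverting: the sketch below finds a choice sequence that reproduces a given output by exhaustive search. Brute-force enumeration is swapped in for the solver here, which only works because the search space is tiny. The generate and assimilate functions are hypothetical names.

```python
from itertools import product


def generate(choices):
    """Toy 'strategy': two choices in range(10), then a .map()-style transform."""
    a, b = choices
    # Awkward to invert by hand, trivial to run forwards.
    return f"{a * b:03d}-{a + b}"


def assimilate(target):
    """Search forwards for a choice sequence whose output equals the target."""
    for choices in product(range(10), repeat=2):
        if generate(choices) == target:
            return choices
    return None  # the target cannot be produced by this strategy


choices = assimilate("012-7")
print(choices, generate(choices))
```

Once a matching choice sequence is found, the external example can be treated like any other generated one, including shrinking it.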

Strategies -> Crosshair

Matthew Law tried this approach, confirming that it works for some tests but that the choice-sequence approach is more general.

The main challenge for this subproject is to extract generated types and constraints from Hypothesis, since SearchStrategy objects can't currently be inspected to determine what type of object they generate or any constraints on it.

That could probably be fixed, or we could just enumerate all the low-level strategies; but this approach will only ever work for a subset of tests regardless. That's because strategies can be composed with arbitrary user code, and it matters whether the string "hi" was generated from just("hi") or just("hi").map(ensure_in_database), so let's just consider the cases where it can work.

preconditions with assume() are relatively easy (see this issue)
the .filter() method can likewise just add constraints

A trickier-to-implement but useful application of this logic is that if we can prove that the test always passes for some subset of inputs (e.g. numeric ranges, small collection sizes, etc), it would be possible to redefine the strategy to exclude those inputs and thus focus random testing on only the unproven part of the input space. Equivalently, we might mark the corresponding branches of the choice sequence tree as exhausted so the conjecture backend ignores them.
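For the numeric-range case, "redefine the strategy to exclude those inputs" reduces to interval subtraction. The helper below is a hypothetical sketch (remaining_ranges is not part of Hypothesis or Crosshair) and assumes the proven subsets are given as inclusive integer ranges.

```python
def remaining_ranges(lo, hi, proven):
    """Subtract inclusive integer ranges proven safe from the range [lo, hi]."""
    remaining = []
    cursor = lo  # smallest value not yet known to be proven
    for start, end in sorted(proven):
        if end < cursor:
            continue  # entirely below what we still need to cover
        if start > hi:
            break     # entirely above the domain
        if start > cursor:
            remaining.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= hi:
        remaining.append((cursor, hi))
    return remaining


# Suppose the test was proven to pass for 0..10 and 50..60.
print(remaining_ranges(0, 100, [(0, 10), (50, 60)]))
```

The resulting ranges could then be turned back into a strategy, for example with st.one_of(*(st.integers(a, b) for a, b in ranges)), so that random testing only spends time on the unproven part of the input space.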

The exception here is for tests which use type annotations and hypothesis.infer. In that case, we can just work directly on types - albeit at the expense of the choice-sequence integration points.

Conclusions

It seems unlikely that this will outperform Hypothesis' existing backend on many realistic workloads in the near future, but it is likely to be complementary in the medium term. We are particularly optimistic about hybrid approaches, where it is possible to choose the most suitable technique at runtime.

Overall, we're having a lot of fun working on this and tossing ideas around :-)

Zac-HD/hypothesis-crosshair.md