
Fast Tests, Slow Tests, and Hot Lava

The database is hot lava!

— Casey Kinsey

We’ve come to the end of the book, and the end of our journey with this to-do app and its tests. Let’s recap our test structure so far:

  • We have a suite of functional tests that use Selenium to test that the whole app really works. On several occasions, the FTs have saved us from shipping broken code—whether it was broken CSS, a broken database due to filesystem permissions, or broken email integration.

  • And we have a suite of unit tests that use the Django test client, enabling us to test-drive our code for models, forms, views, URLs, and even (to some extent) templates. They’ve enabled us to build the app incrementally, to refactor with confidence, and they’ve supported a fast unit-test/code cycle.

  • We’ve also spent a good bit of time on our infrastructure, packaging up our app with Docker for ease of deployment, and we’ve set up a CI pipeline to run our tests automatically on push.

However, since this is a simple app that has to fit in a book, there are inevitably some limitations and simplifications in our approach. In this chapter, I’d like to talk about how to carry your testing principles forward as you move into larger, more complex applications in the real world.

Let’s find out why someone might say that the database is hot lava!

Why Do We Test? Our Desiderata for Effective Tests

At testdesiderata.com, Kent Beck and Kelly Sutton set out several desiderata (desirable characteristics) for tests, summarised in Test desiderata:

Table 1. Test desiderata

Isolated: Tests should return the same results regardless of the order in which they are run.

Composable: We should be able to test different dimensions of variability separately and combine the results.

Deterministic: If nothing changes, the test results shouldn’t change.

Fast: Tests should run quickly.

Writable: Tests should be cheap to write, relative to the cost of the code being tested.

Readable: Tests should be comprehensible for readers, invoking the motivation for writing the particular test.

Behavioural: Tests should be sensitive to changes in the behaviour of the code under test. If the behaviour changes, the test result should change.

Structure-agnostic: Tests should not change their results if the structure of the code changes.

Automated: Tests should run without human intervention.

Specific: If a test fails, the cause of the failure should be obvious.

Predictive: If the tests all pass, then the code under test should be suitable for production.

Inspiring: Passing the tests should inspire confidence.

We’ve talked about almost all of these desiderata in the book:

  • Isolation, when we switched to using the Django test runner.

  • Composability, when discussing the car factory example in [chapter_21_mocking_2].

  • Readability, when we talked about the given-when-then structure and when implementing helper methods in our FTs.

  • Testing behaviour rather than implementation, at several points, including in the mocking chapters.

  • Structure-agnosticism, in the forms chapters, when we showed that the higher-level views tests enabled us to refactor more freely than the lower-level forms tests.

  • Specificity, when we split up our tests to have fewer assertions each.

  • Determinism, when discussing flaky tests and the use of wait_for() in our FTs, and again in the production debugging chapter.

And in this chapter, we’re going to talk primarily about speed, and about what makes tests inspiring.

But first, it’s worth taking a step back from the list in Test desiderata, and asking: "What do we want from our tests?"

Confidence and Correctness (Preventing Regression)

A fundamental part of programming is that, now and again, you need to check whether "it works". Automated testing is the solution to the problem that checking things manually quickly becomes tedious and unreliable. We want our tests to tell us that our code works—both at the low level of individual functions or classes, and at the higher level of "does it all hang together?"

A Productive Workflow

Our tests need to be quick to write but, more importantly, fast to run. We want to get into a smooth, productive workflow, and to reach that most holy of states for programmers: the "flow state". Beyond that, we want our tests to take some of the stress out of programming, encouraging us to work in small increments, with frequent bursts of dopamine from seeing green tests.

Driving Better Design

And our tests should help us to write better code: first, by enabling fearless refactoring and, second, by giving us feedback on the design of our code. Writing the tests first lets us think about our API from the outside in, before we write it—and we’ve seen that. But in this chapter, we’ll also talk about the potential for tests to give you feedback on your design in more subtle ways. As we’ll see, designing code to be more testable often leads to code that has clearly identified dependencies, and is more modular and more decoupled.

As we continuously think about what kinds of tests to write, we are trying to achieve the optimum balance of these different desiderata.

Were Our Unit Tests Integration Tests All Along? What Is That Warm Glow Coming from the Database?

Almost all of the "unit" tests in the book perhaps should have been called integration tests, because they all rely on Django’s test runner, which gives us a real database to talk to. Many also use the Django test client, which does a lot of magic with the middleware layers that sit between requests and responses. The end result is that our tests are heavily integrated with both the database and Django itself.
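
To see what that integration looks like in practice, here’s a simplified version of the kind of model test we’ve been writing throughout the book (a sketch, not an exact listing from our code):

    from django.test import TestCase

    from lists.models import Item, List  # real models, backed by the real ORM


    class ItemModelTest(TestCase):  # TestCase wraps each test in its own database transaction
        def test_saving_and_retrieving_items(self):
            # Every line below talks to an actual (if temporary) database.
            mylist = List.objects.create()
            Item.objects.create(text="Item the first", list=mylist)

            saved_items = Item.objects.all()
            self.assertEqual(saved_items.count(), 1)
            self.assertEqual(saved_items.first().text, "Item the first")

There’s nothing wrong with a test like this, but notice that there is no way to run it without a database: it’s ORM calls all the way down.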

We’ve Been in the "Sweet Spot"

Now, actually, this has been a pretty good thing for us so far. We’re very much in the "sweet spot" of Django’s testing tools. Our unit tests have been fast enough to enable a smooth workflow, and they’ve given us a strong reassurance that our application really works—from the models all the way up to the templates. By allying them with a small-ish suite of functional tests, we’ve got a lot of confidence in our code. And we’ve been able to use them to get at least a bit of feedback on our design, and to enable lots of refactoring.

What Is a "True" Unit Test? Does It Matter?

But people will often tell you that a "true" unit test should be more isolated. It’s meant to test a single "unit" of software, and your database "should" be outside of that. Why do they say that (other than for the smugness they get from should-ing us)?

As you can tell, I think the argument from definitions is a bit of a red herring. But you might hear instead, "the database is hot lava!"—as Casey Kinsey put it in a memorable DjangoCon talk. There is real feeling and real experience behind these comments. What are people getting at?

Integration and Functional Tests Get Slower Over Time

The problem is that, as your application and codebase grow, involving the database in every single test starts to carry an unacceptable cost—in terms of execution speed. Casey’s company, for example, was struggling with test suites that took several hours.

At PythonAnywhere, our functional test suite didn’t just rely on the database; it would spin up a full test cluster of six virtual machines. A full run used to take at least 12 hours, and we’d have to wait overnight for our results. That was one of the least productive parts of an otherwise extraordinary workflow.

At Kraken, the full test suite does only take about 45 minutes, which is not bad for nearly 10 million lines of code, but that’s only thanks to a quite frankly ridiculous level of parallelisation and associated expenditure on CI. We’re now spending a lot of effort on trying to move more of our unit tests to being "true" unit tests.

The problem is that these things don’t scale linearly. The more database tables you have, and the more relationships between them, the more work every database-backed test has to do, and that cost starts to grow geometrically.

So you can see why, over time, these kinds of tests are going to fail to meet our desiderata: they become too slow to enable a productive workflow and a fast feedback cycle.

Don’t take it from me! Gary Bernhardt, a legend in both the Ruby and Python testing communities, has a talk simply called "Fast Test, Slow Test", which is a great tour of the problems I’m discussing here.

The Holy Flow State

Thinking sociologically for a moment, we programmers have our own culture and our own "religion" in a way. It has many congregations within it—such as the cult of TDD, into which you are now initiated. There are the followers of Vim and the heretics of Emacs. But one thing we all agree on—one particular spiritual practice, our own transcendental meditation—is the holy flow state. That feeling of pure focus, of concentration, where hours pass like no time at all, where code flows naturally from our fingers, where problems are just tricky enough to be interesting but not so hard that they defeat us…

There is absolutely no hope of achieving flow if you spend your time waiting for a slow test suite to run. Anything longer than a few seconds and you’re going to let your attention wander, you context-switch, and the flow state is gone. And the flow state is a fragile dream; once it’s gone, it takes a long time to come back.[1]

We’re Not Getting the Full Potential Benefits of Testing

TDD experts often say, "It should be called test-driven design, not test-driven development". What do they mean by that?

We have definitely seen a bit of the positive influence of TDD on our design. We’ve talked about how our tests are the first clients of any API we create, and we’ve talked about the benefits of "programming by wishful thinking" and outside-in.

But there’s more to it. These same TDD experts also often say that you should "listen to your tests". Unless you’ve read the online Appendix: Test Isolation and "Listening to Your Tests", that will still sound like a bit of a mystery.

So, how can we get to a position where our tests are giving us maximum feedback on our design?

The Ideal of the Test Pyramid

I know I said I didn’t want to get bogged down in arguments based on definitions, but let’s set out the way people normally think about these three types of tests:

Functional/end-to-end tests

FTs check that the system works end-to-end, exercising the full stack of the application, including all dependencies and connected external systems. An FT is the ultimate test that it all hangs together, and that things are "really" going to work.

Integration tests

The purpose of an integration test should be to check that the code you write is integrated correctly with some "external" system or dependency.

(True) unit tests

Unit tests are the lowest-level tests, and are supposed to test a single "unit" of code or behaviour. The ideal unit test is fully isolated from everything external to the unit under test, such that changes to things outside cannot break the test.
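
To make the distinction concrete, here’s a hedged sketch of what a "true" unit test might look like for a bit of pure business logic; the priority_for function is invented for this illustration and isn’t part of our to-do app:

    import unittest


    def priority_for(item_text):
        """Pure business logic: no ORM, no Django, no I/O of any kind."""
        return "high" if item_text.lower().startswith("urgent") else "normal"


    class PriorityTest(unittest.TestCase):
        def test_urgent_items_are_high_priority(self):
            self.assertEqual(priority_for("URGENT: buy milk"), "high")

        def test_other_items_are_normal_priority(self):
            self.assertEqual(priority_for("buy milk"), "normal")


    if __name__ == "__main__":
        unittest.main()

No database, no test client, no fixtures: a test like this runs in microseconds, and the only thing that can break it is a change to the behaviour of the unit itself.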

The canonical advice is that you should aim to have the majority of your tests be unit tests, with a smaller number of integration tests, and an even smaller number of functional tests—as in the classic "test pyramid" shown in Figure 1, The test pyramid.

A Pyramid shape, with a large bottom layer of unit tests, a medium layer of integration tests, and a small peak of FTs
Figure 1. The test pyramid

Bottom layer: unit tests (the vast majority)

These isolated tests are fast and they pinpoint failures precisely. We want these to cover the majority of our functionality, and the entirety of our business logic if possible.

Middle layer: integration tests (a significant portion)

In an ideal world, these are reserved purely for testing the interactions between our code and external systems—like the database, or even (arguably) Django itself. These are slower, but they give us the confidence that our components work together.

Top layer: a minimal set of functional/end-to-end tests

These tests are there to give us the ultimate reassurance that everything works end-to-end and top-to-bottom. But because they are the slowest and most brittle, we want as few of them as possible.

On Acceptance Tests

What about "acceptance tests"? You might have heard this term bandied about. Often, people use it to mean the same thing as functional tests or end-to-end tests. But, as taught to me by one of the legends of quality assurance at MADE.com (Hi, Marta!), any kind of test can be an acceptance test if it maps onto one of your acceptance criteria.

The point of an acceptance test is to validate a piece of behaviour that’s important to the user. In our application, that’s how we’ve been thinking about our FTs.

But, ultimately, using FTs to test every single piece of user-relevant functionality is not sustainable. We need to figure out ways to have our integration tests and unit tests do the work of verifying user-visible behaviour, understood at the right level of abstraction.

Avoiding Mock Hell

Well, that’s all very well, Harry (you might say), but our current test setup is nothing like this! How do we get there from here? We’ve seen how to use mocks to isolate ourselves from external dependencies. Are they the solution, then?

As I was at pains to point out in the mocking chapters, the use of mocks comes with painful trade-offs:

  • They make tests harder to read and write.

  • They leave your tests tightly coupled to implementation details.

  • As a result, they tend to impede refactoring.

  • And, in the extreme, you can sometimes end up with mocks testing mocks, almost entirely disconnected from what the code actually does.

Ed Jung calls this Mock Hell.
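
To give a flavour of what that looks like, here’s a deliberately exaggerated sketch; the new_list function and its collaborators are invented for illustration, not taken from our app:

    import unittest
    from unittest.mock import MagicMock


    # The "code under test": a thin orchestration function (invented for illustration).
    def new_list(form_class, email_sender, redirector, data):
        form = form_class(data=data)
        if form.is_valid():
            item = form.save()
            email_sender(item)
            return redirector(item)
        return None


    class MockHellTest(unittest.TestCase):
        def test_creates_item_sends_email_and_redirects(self):
            # Every collaborator is a mock, so all the test can do is
            # restate the implementation, call by call.
            form_class, email_sender, redirector = MagicMock(), MagicMock(), MagicMock()
            form = form_class.return_value
            form.is_valid.return_value = True

            new_list(form_class, email_sender, redirector, data={"text": "hi"})

            form_class.assert_called_once_with(data={"text": "hi"})
            form.save.assert_called_once_with()
            email_sender.assert_called_once_with(form.save.return_value)
            redirector.assert_called_once_with(form.save.return_value)


    if __name__ == "__main__":
        unittest.main()

The test passes, but it tells us very little about whether the feature actually works, and any behaviour-preserving refactor of new_list will break it.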

This isn’t to say that mocks are always bad! But just that, from experience, attempting to use them as your primary tool for decoupling your tests from external dependencies is not a viable solution; it carries costs that often outweigh the benefits.

I’m glossing over the use of mocks in a London-school approach to TDD. See the Online Appendix: Test Isolation and "Listening to Your Tests".

The Actual Solutions Are Architectural

The actual solution to the problem isn’t obvious from where we’re standing. It lies in rethinking the architecture of our application. In brief, if we can decouple the core business logic of our application from its dependencies, then we can write true unit tests for it that do not depend on those, um, dependencies.

Integration tests are most necessary at the boundaries of a system—at the points where our code integrates with external systems—like the database, filesystem, network, or a UI. Similarly, it’s at the boundaries that the downsides of test isolation and mocks are at their worst, because it’s at the boundaries that you’re most likely to be annoyed if your tests are tightly coupled to an implementation, or to need more reassurance that things are integrated properly.

Conversely, code at the core of our application—code that’s purely concerned with our business domain and business rules, code that’s entirely under our control—has no intrinsic need for integration tests.

So, the way to get what we want is to minimise the amount of our code that has to deal with boundaries. Then we test our core business logic with unit tests, and test the rest with integration and functional tests.

But how do we do that?

Ports and Adapters/Hexagonal/Onion/Clean Architecture

The classic solutions to this problem from the object-oriented world come under different names, but they’re all variations of the same trick: identifying the boundaries, creating an interface to define those boundaries, and then using that interface at test time to swap out fake versions of your real dependencies.

Steve Freeman and Nat Pryce, in their book Growing Object-Oriented Software, Guided by Tests, call this approach "Ports and Adapters" (see Ports and Adapters (diagram by Nat Pryce)).

Illustration of ports and adapters architecture, with isolated core and integration points
Figure 2. Ports and Adapters (diagram by Nat Pryce)

This pattern, or variations of it, are known as "Hexagonal Architecture" (by Alistair Cockburn), "Clean Architecture" (by Robert C. Martin, aka Uncle Bob), or "Onion Architecture" (by Jeffrey Palermo).
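
To make the trick concrete, here’s a minimal sketch of a "port" and two of its adapters in Python. None of this is code from our to-do app; all the names are invented for illustration:

    from dataclasses import dataclass
    from typing import Protocol


    @dataclass
    class ListItem:
        text: str


    class ItemRepository(Protocol):
        """The 'port': the interface our core code depends on."""

        def add(self, item: ListItem) -> None: ...
        def list_all(self) -> list[ListItem]: ...


    def add_item_if_new(repo: ItemRepository, text: str) -> bool:
        """Core business rule (no duplicates). Knows nothing about Django or SQL."""
        if any(item.text == text for item in repo.list_all()):
            return False
        repo.add(ListItem(text))
        return True


    class FakeItemRepository:
        """A test-time 'adapter': an in-memory fake, no database required."""

        def __init__(self):
            self._items = []

        def add(self, item: ListItem) -> None:
            self._items.append(item)

        def list_all(self) -> list[ListItem]:
            return list(self._items)


    # A true unit test of the business rule: fast and fully isolated.
    def test_duplicate_items_are_rejected():
        repo = FakeItemRepository()
        assert add_item_if_new(repo, "buy milk") is True
        assert add_item_if_new(repo, "buy milk") is False

In production you’d write a second adapter that implements the same interface on top of the Django ORM, plus a small number of integration tests to prove that that adapter really does talk to the database correctly.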

Time for a Plug! Read More in "Cosmic Python"

At the end of the process of writing this book (the first time around), I realised that I was going to have to learn about these architectural solutions, and it was at MADE.com that I met Bob Gregory, who was to become my coauthor. There, we explored "ports and adapters" and related architectures, which were quite rare at the time in the Python world.

So if you’d like a take on these architectural patterns with a Pythonic twist, check out Architecture Patterns with Python, which we subtitled "Cosmic Python" because "cosmos" is the opposite of "chaos" in Greek.

Functional Core, Imperative Shell

Gary Bernhardt pushes this further, recommending an architecture he calls "Functional Core, Imperative Shell", whereby the "shell" of the application (the place where interaction with boundaries happens) follows the imperative programming paradigm, and can be tested by integration tests, functional tests, or even (gasp!) not at all (if it’s kept minimal enough).

But the core of the application is actually written following the functional programming paradigm (complete with the "no side effects" corollary), which allows fully isolated, "pure" unit tests—without any mocks or fakes.
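
Here’s a minimal sketch of what that separation might look like in Python; the reminder-email example is invented for illustration and isn’t a feature of our app:

    # Functional core: a pure function, trivially unit-testable with no mocks or fakes.
    def items_needing_reminder(items, already_notified):
        """Decide which to-do items deserve a reminder email (an illustrative rule)."""
        return [
            text for text in items
            if text.startswith("URGENT") and text not in already_notified
        ]


    # Imperative shell: all the I/O lives here, kept thin enough to barely need testing.
    def send_reminders(fetch_items, fetch_notified, send_email):
        for text in items_needing_reminder(fetch_items(), fetch_notified()):
            send_email(text)


    # A pure unit test of the core:
    def test_only_new_urgent_items_get_reminders():
        items = ["URGENT buy milk", "walk the dog", "URGENT pay rent"]
        notified = {"URGENT pay rent"}
        assert items_needing_reminder(items, notified) == ["URGENT buy milk"]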

Check out Gary’s presentation titled "Boundaries" for more on this approach.

The Central Conceit: These Architectures Are "Better"

These patterns do not come for free! Introducing the extra indirection and abstraction can add complexity to your code. In fact, the creator of Ruby on Rails, David Heinemeier Hansson (DHH), has a famous blog post where he describes these architectures as test-induced design damage. That post eventually led to quite a thoughtful and nuanced discussion between DHH, Martin Fowler, and Kent Beck.

Like any technique, these patterns can be misused, but I wanted to make the case for their upside: by making our software more testable, we also make it more modular and maintainable. We are forced to clearly separate our concerns, and we make it easier to do things like upgrade our infrastructure when we need to. This is where the "improved design" desideratum comes in.

Making our software more testable also often leads to a better design.

Testing in Production

I should also make brief mention of the power of observability and monitoring.

Kent Beck tells a story about his first few weeks at Facebook, when one of the first tests he wrote turned out to be flaky in the build. Someone just deleted it. Shocked and asking why, he was told, "We know production is up. Your test is just producing noise; we don’t need it". [2]

Facebook has such confidence in its production monitoring and observability that these can provide most of the feedback the team needs about whether the system is working.

Not everywhere is Facebook! But it’s a good indication that automated tests aren’t the be-all and end-all.

The Hardest Part: Knowing When to Make the Switch

An illustration of a frog being slowly boiled in a pan

When is it time to hop out?

For small- to medium-sized applications, as we’ve seen, the Django test runner and the integration tests it encourages us to write are just fine. The problem is knowing when it’s time to make the change to a more decoupled architecture, and to start striving explicitly for the test pyramid.

It’s hard to give good advice here, as I’ve only experienced environments where either someone else made the decision before I joined, or the company is already struggling, at a point where it’s (at least arguably) too late.

One thing to bear in mind, though, is that the longer you leave it, the harder it is. Another is that, because the pain is only going to set in gradually, like the apocryphal boiled frog, you’re unlikely to notice until you’re past the "perfect" moment to switch. And on top of that, it’s never going to be a convenient time to switch. This is one of those things, like tech debt, that is always going to struggle to justify itself in the face of more immediate priorities.

So, perhaps one strategy would be an Odysseus pact: tie yourself to the mast, and make a commitment—while the tests are still fast—to set a "red line" for when to switch. For example, "If the tests ever take more than 10 seconds to run locally, then it’s time to rethink the architecture".

I’m not saying 10 seconds is the right number, by the way. I know plenty of people who are perfectly happy to wait 30 seconds. And I know Gary Bernhardt, for one, would get very nervous at a test suite that takes more than 100 milliseconds.

But I think the idea of drawing that line in the sand, wherever it is, before you get there, might be a good way to fight the "boiled frog" problem. Failing all of that, if the best time to make the change was "ages ago", then the second best time is "right now".
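
If you run your tests with pytest, one way to make a pact like that hard to ignore is to let the test run police itself. Here’s a hedged sketch of a conftest.py hook; the ten-second budget is just the example number from above, not a recommendation:

    # conftest.py -- complain loudly if the whole suite exceeds our agreed budget.
    import time

    TEST_SUITE_BUDGET_SECONDS = 10  # the "red line" agreed on while the tests were still fast


    def pytest_sessionstart(session):
        session.config._suite_started_at = time.monotonic()


    def pytest_sessionfinish(session, exitstatus):
        elapsed = time.monotonic() - session.config._suite_started_at
        if elapsed > TEST_SUITE_BUDGET_SECONDS:
            # Make the breach visible; you might prefer to fail the build instead.
            print(
                f"\nWARNING: test suite took {elapsed:.1f}s, over the "
                f"{TEST_SUITE_BUDGET_SECONDS}s red line. Time to talk about architecture?"
            )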

Other than that, I can only wish you good luck, and hope that by warning you of the dangers, you’ll keep an eye on your test suite and spot the problems before they get too large.

Happy testing!

Wrap-Up

In this book, I’ve been able to show you how to use TDD, and to talk a bit about why we do it and what makes a good test. But we’re inevitably limited by the scope of the project, which means that some of the more advanced uses of TDD, particularly the interplay between testing and architecture, have been beyond our reach.

But I hope that this chapter has given you a bit of a guide to help you find your way around that topic as your career progresses.

Further Reading

A few places to go for more inspiration:

"Fast Test, Slow Test" and "Boundaries"

Gary Bernhardt’s talks from PyCon 2012 and 2013. His screencasts are also well worth a look.

Integration tests are a scam

J.B. Rainsberger has a famous rant about the way integration tests will ruin your life.[3] Then check out a couple of follow-up posts, particularly the defence of acceptance tests, and the analysis of how slow tests kill productivity.

Ports and Adapters

Steve Freeman and Nat Pryce wrote about this in their book. You can also catch a good discussion in Steve’s talk. See also Uncle Bob’s description of the clean architecture, and Alistair Cockburn coining the term "Hexagonal Architecture".

The test-double testing wiki

Justin Searls' online resource is a great source of definitions and discussions of testing pros and cons, and it arrives at its own conclusions about the right way to do things: the testing wiki.

Fowler on unit tests

Martin Fowler (author of Refactoring) offers a balanced and pragmatic tour of what unit tests are, and of the trade-offs around speed.

A take from the world of functional programming

Grokking Simplicity by Eric Normand explores the idea of "Functional Core, Imperative Shell". Don’t worry; you don’t need a crazy functional programming language like Haskell or Clojure to understand it—it’s written in perfectly sensible JavaScript.


1. Some people say it takes at least 15 minutes to get back into the flow state. In my experience, that’s overblown, and I sometimes wonder if it’s thanks to TDD. I think TDD reduces the cognitive load of programming. By breaking our work down into small increments, by simplifying our thinking—"What’s the current failing test? What’s the simplest code I can write to make it pass?"—it’s often actually quite easy to context-switch back into coding. Maybe it’s less true for the times when we’re doing design work and thinking about what the abstractions in our code should be though. But also there’s absolutely no hope for you if you’ve started scrolling social media while waiting for your tests to finish. See you in 20 minutes to an hour!
2. There’s a transcript of this story.
3. Rainsberger actually distinguishes "integrated" tests from "integration" tests: an integrated test is any test that’s not fully isolated from things outside the unit under test.
