Some tests fail seemingly randomly. What can you do to prevent that from getting out of hand?
Unaddressed flaky tests slow down the entire team
It always starts the same way. The continuous integration build has failed, and the failing test seems to have nothing to do with the changes you made. You shrug and re-run the test suite. This time the test passes. Relieved that you don’t need to investigate further, you merge your pull request. You forget about it… until it happens again. If it’s happening to you, chances are it’s happening to everyone else pushing to the repository. Soon, whatever your test suite’s runtime is, it doubles or triples for every pull request, because everyone has gotten used to pressing the retry button. Test runtime becomes a bottleneck for releases.
A high-level strategy for dealing with flaky tests
You can stop the spiral with a small change in process.
- Confirm that a test is flaky
- Quarantine the test
- Make test failure reproducible
- Fix the problem
Chances are that you’re performing the first step already anyway.
The biggest win comes from step two: putting the test under quarantine.
You can defer steps three and four, actually fixing the test, for as long as you need.
Test quarantine
To quarantine a test is to isolate it from the reliable, passing tests.
Putting a flaky test in quarantine has multiple advantages.
First and foremost, it makes it clear that there’s a problem with the test. You’ve already spent time confirming that the test is flaky. Nobody else needs to wonder if what they just changed broke this test.
Second, quarantined tests can be executed in a different way. For example, you may configure your pipeline such that known flaky tests don’t break the build. Or you could automatically rerun them. A separate group for quarantined tests is key to keeping your overall continuous delivery cycle fast.
Last, it builds a catalog of flaky tests. The team can track the size of the problem and plan accordingly. For example, one could dedicate time to fix one flaky test a week alongside feature development work. You could even decide to delete flaky tests that have been unattended for a long time.
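As an example of running quarantined tests differently, here is a minimal sketch of a separate CI job, assuming GitHub Actions and the :flaky RSpec tag introduced in the next section; continue-on-error lets the quarantined tests fail without failing the whole pipeline.
jobs:
  flaky-tests:
    runs-on: ubuntu-latest
    # A failure here is reported, but does not fail the workflow.
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true
      # Run only the quarantined tests.
      - run: bundle exec rspec --tag flaky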
Quarantining tests using RSpec
In RSpec, you can annotate tests with tags:
it "test description", :flaky do
  # ...
end
That allows you to run only tests with a certain tag:
rspec --tag flaky
…and run tests excluding a tag:
rspec --tag "~flaky"
With this, you can exclude flaky tests from the main test suite and run them in a separate job.
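If you would rather not remember the command-line flag, here is a small sketch of an RSpec configuration that excludes flaky tests by default (RUN_FLAKY is a hypothetical environment variable name):
# spec/spec_helper.rb
RSpec.configure do |config|
  # Skip examples tagged :flaky unless explicitly requested,
  # e.g. with RUN_FLAKY=1 bundle exec rspec --tag flaky
  config.filter_run_excluding :flaky unless ENV["RUN_FLAKY"]
end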
Because the test is marked as flaky directly in the code, this process is tracked like the rest of development. You can open a pull request for visibility to the rest of the team and leave a trail for future investigation. The commit message can explain in which circumstances the test fails and other details, e.g. the random seed, a hypothesis for why it fails, links to successful and failing test run results, etc.
Quarantining tests using Minitest
Minitest has an option to run or exclude tests based on a name pattern. The Rails test runner preserves these options.
You could adopt the convention of including “FLAKY” in the names of flaky tests.
def test_user_can_change_language_FLAKY
  # ...
end
The capital letters also go against method naming conventions and call attention to something being wrong with the test.
With the Rails style of writing tests, it would look more like this:
test "FLAKY: user can change language" do
  # ...
end
Once you’ve marked the flaky tests, your default test task could skip them:
bin/rails test --exclude /FLAKY/
…and you could have a task dedicated to running only the flaky tests:
bin/rails test --name /FLAKY/
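To avoid typing these options by hand, you could wrap them in Rake tasks. A minimal sketch, with hypothetical file and task names:
# lib/tasks/flaky_tests.rake
namespace :flaky_tests do
  desc "Run the suite without tests marked FLAKY"
  task :exclude do
    sh "bin/rails test --exclude /FLAKY/"
  end

  desc "Run only the tests marked FLAKY"
  task :only do
    sh "bin/rails test --name /FLAKY/"
  end
end
Then bin/rails flaky_tests:exclude runs the main suite and bin/rails flaky_tests:only runs the quarantined tests.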
Reproducing flaky test failures
Common reasons for tests that sometimes fail include:
- Lack of test state isolation. State changed by one test leaks into the tests that run after it.
- Concurrency and non-determinism. When multiple tasks run concurrently, their start and end times can differ between test runs.
The problem of state isolation can be discovered by running tests in a random order. The problem is usually not in the test that started failing, but in another test that ran before it and did not clean up correctly.
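For example, here is a contrived sketch (format_price is a hypothetical helper): the first test changes a global setting and never restores it, so the second test only fails when it happens to run after the first one.
require "test_helper"

class CurrencyFormattingTest < ActiveSupport::TestCase
  test "formats prices in French" do
    I18n.locale = :fr # leaks: the locale is never reset after this test
    assert_equal "10,00 €", format_price(10)
  end
end

class CheckoutTest < ActiveSupport::TestCase
  test "shows the price in the default locale" do
    # Passes when it runs first, fails when CurrencyFormattingTest
    # happened to run before it and left I18n.locale set to :fr.
    assert_equal "$10.00", format_price(10)
  end
end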
The problem of the sequence of events within a test can be reproduced by running the flaky test many times.
Running tests in the same order
Checking whether the tests are order dependent is a good first investigation step.
Test runners show the random seed that controls the sequence in which the tests run. For example:
bin/rails test
...
Run options: --seed 4137
You can use that seed to reproduce the same order:
bin/rails test --seed 4137
(RSpec also has a --seed option.)
If the test fails when running the entire suite with the same seed, but passes with a different seed, that indicates a problem with isolation between tests.
In a Ruby project, you can narrow down the faulty tests with a tool like minitest-bisect:
bundle add minitest-bisect
minitest_bisect --seed 4137 test/**/*_test.rb
RSpec has the feature built-in:
rspec spec --bisect --seed 4137
It’s a bit trickier if your test suite is split into multiple groups that run in parallel. In that case you need two pieces of information:
- the --seed parameter from RSpec or Minitest,
- the list of tests that ran in the same group as the one you are interested in.
For example, I often use the split-test utility in combination with the “matrix run” feature of CI platforms to parallelize test suites.
bundle exec rspec --format progress --seed SEED $(./split-test --junit-xml-report-dir tmp --node-index CURRENT_NODE --node-total TOTAL_NODES --tests-glob 'spec/**/*_spec.rb' 2> /dev/null)
where:
- SEED is the random seed
- CURRENT_NODE is the test group index (e.g. 0, 1, 2, 3)
- TOTAL_NODES is the total number of groups (e.g. 4)
In that example, I’d have to figure out how many groups of tests ran in total and which group contained the failing test.
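The grouping itself is defined in the CI configuration. Here is a sketch of what the matrix setup could look like with GitHub Actions (job and step names are illustrative, four groups are assumed, and SEED remains a placeholder as above):
jobs:
  tests:
    strategy:
      matrix:
        # TOTAL_NODES groups, numbered 0 to 3
        node: [0, 1, 2, 3]
    steps:
      # ...
      - name: Run one group of the test suite
        run: |
          bundle exec rspec --format progress --seed SEED \
            $(./split-test --junit-xml-report-dir tmp \
              --node-index ${{ matrix.node }} \
              --node-total 4 \
              --tests-glob 'spec/**/*_spec.rb' 2> /dev/null)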
Running a test in a loop until it fails
Sometimes the problem is contained within a single test.
If the test fails when run by itself, you can try to get it to fail by running it over and over again in a loop.
I’ve written the following script to help me with that.
#!/usr/bin/env bash
set -euo pipefail

# Print how many runs it took before the command failed.
function print_iterations() {
  echo ""
  echo "$1 runs"
}
trap 'print_iterations $count' EXIT

# Re-run the given command until it exits with a non-zero status.
count=1
while "$@"; do
  ((count++))
done
To use it, save it in a file, then make that file executable.
chmod +x until_failure.sh
Now you can prefix the test run command with the helper script. For example, if rspec spec/models/user_spec.rb:4 is flaky, run:
./until_failure.sh rspec spec/models/user_spec.rb:4
The script takes a command as an argument and runs that command. If the command exits normally (with exit code 0), it is run again. If the command exits with an error, the script exits too and prints how many tries it took to reproduce the failure.
Beware that reproducing the error that way can take a long time. You can interrupt the process with Ctrl+C at any time.
Inspect test logs from CI
You may not be able to reproduce the failure locally. As an extra debugging step, you can save the Rails log for the group of quarantined tests. Here’s an example with GitHub Actions:
jobs:
  flaky-tests:
    # ...
    steps:
      # ...
      - name: Run flaky tests
        run: bundle exec rspec --tag flaky
      - name: Save test logs on test failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: 'test.log'
          path: 'log/test.log'
When the test fails, you can download the test log and compare it to the log of a successful test run.
With multiple tests, it becomes difficult to find in the logs where a test starts and ends. You can annotate the logs with test start and end delimiters. Here’s an example with RSpec:
# spec/support/test_logs.rb
module TestLogHelpers
  def self.included(example_group)
    example_group.around(:each, :flaky) do |example|
      Rails.logger.debug "=== BEGIN #{example.location_rerun_argument}"
      example.run
      Rails.logger.debug "=== END #{example.location_rerun_argument}"
    end
  end
end

RSpec.configure do |config|
  config.include TestLogHelpers, :flaky
end
With that information, I have found it useful to put the failure and success logs for a single test in two files. Then I inspect them side by side and also compare them with a diff viewer (like vimdiff).
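To produce those two files, you can extract everything between the BEGIN and END markers for the example in question. A sketch using awk (the spec location is just an example):
# Keep only the lines between the BEGIN and END markers for this example
awk '/=== BEGIN spec\/models\/user_spec.rb:4/,/=== END spec\/models\/user_spec.rb:4/' \
  log/test.log > failing.log
# Repeat with the log of a passing run, then compare
vimdiff failing.log passing.log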
A more abrupt approach to eliminating flakiness
Depending on the size of your test suite and the time it takes to fix flaky tests, a viable strategy is simply to remove them.
Ask yourself:
- What would be the value of this test if it were reliable?
- How long would it take to make the test reliable?
- What is the downside of not having an automated test?
- How critical is it to catch regressions in that area of the code before release?
- Is it quickly obvious if the corresponding production code breaks (via error monitoring, user bug reports, etc.)?
- How is the functionality documented besides this flaky test?
- Are there any other tests that cover some/all of the functionality at a different level (e.g.: unit test vs browser test)?
Perhaps it’s not worth the time investment to fix the test.
The process becomes:
- confirm that a test is flaky,
- quarantine the test,
- decide whether it’s worth fixing the test,
- if it isn’t, delete the test.
A little bit of process goes a long way
Debugging and fixing flaky tests takes a long time. Identifying and triaging them can be fast.
With a simple system to mark tests as flaky, you avoid guesswork and unblock the continuous integration pipeline.