Some tests fail seemingly randomly. What can you do to prevent that from getting out of hand?
Unaddressed flaky tests slow down the entire team
It always starts the same way. The continuous integration build has failed, and the failing test seems to have nothing to do with the changes you made. You shrug and re-run the test suite. This time the test passes. Relieved that you don’t need to investigate further, you merge your pull request. You forget about it… until it happens again. If it’s happening to you, chances are it’s happening to everyone else pushing to the repository. Soon, whatever your test suite’s runtime is, it doubles or triples for every pull request, because everyone has gotten used to pressing the retry button. Test runtime becomes a bottleneck for releases.
A high-level strategy for dealing with flaky tests
You can stop the spiral with a small change in process.
- Confirm that a test is flaky
- Quarantine the test
- Make test failure reproducible
- Fix the problem
Chances are that you’re performing the first step already anyway.
The biggest win comes from step two: putting the test under quarantine.
You can defer steps three and four, actually fixing the test, for as long as you need.
Test quarantine
To quarantine a test is to isolate it from the reliable, passing tests.
Putting a flaky test in quarantine has multiple advantages.
First and foremost, it makes it clear that there’s a problem with the test. You’ve already spent time confirming that the test is flaky. Nobody else needs to wonder if what they just changed broke this test.
Second, quarantined tests can be executed in a different way. For example, you may configure your pipeline such that known flaky tests don’t break the build. Or you could automatically rerun them. A separate group for quarantined tests is key to keeping your overall continuous delivery cycle fast.
Last, it builds a catalog of flaky tests. The team can track the size of the problem and plan accordingly. For example, one could dedicate time to fix one flaky test a week alongside feature development work. You could even decide to delete flaky tests that have been unattended for a long time.
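As an example of running quarantined tests differently, here is a minimal sketch of a separate CI job, assuming GitHub Actions and the :flaky RSpec tag introduced in the next section; continue-on-error lets the quarantined tests fail without failing the whole pipeline.
jobs:
  flaky-tests:
    runs-on: ubuntu-latest
    # A failure here is reported, but does not fail the workflow.
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true
      # Run only the quarantined tests.
      - run: bundle exec rspec --tag flaky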
Quarantining tests using RSpec
In RSpec, you can annotate tests with tags:
it "test description", :flaky do
  # ...
end
That allows you to run only tests with a certain tag:
rspec --tag flaky
…and run tests excluding a tag:
rspec --tag "~flaky"
With this, you can exclude flaky tests from the main test suite and run them in a separate job.
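If you would rather not remember the command-line flag, here is a small sketch of an RSpec configuration that excludes flaky tests by default (RUN_FLAKY is a hypothetical environment variable name):
# spec/spec_helper.rb
RSpec.configure do |config|
  # Skip examples tagged :flaky unless explicitly requested,
  # e.g. with RUN_FLAKY=1 bundle exec rspec --tag flaky
  config.filter_run_excluding :flaky unless ENV["RUN_FLAKY"]
end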
Because the test is marked as flaky directly in the code, this process is tracked like the rest of development. You can open a pull request for visibility to the rest of the team and leave a trail for future investigation. The commit message can explain in which circumstances the test fails and other details, e.g. the random seed, a hypothesis for why it fails, links to successful and failing test run results, etc.
Quarantining tests using Minitest
Minitest has an option to run or exclude tests based on a name pattern. The Rails test runner preserves these options.
You could adopt the convention of including “FLAKY” in the names of flaky tests.
def test_user_can_change_language_FLAKY
  # ...
end
The capital letters also go against method naming conventions and call attention to something being wrong with the test.
With the Rails style of writing tests, it would look more like this:
test "FLAKY: user can change language" do
  # ...
end
Once you’ve marked the flaky tests, your default test task could skip them:
bin/rails test --exclude /FLAKY/
…and you could have a task dedicated to running only the flaky tests:
bin/rails test --name /FLAKY/
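To avoid typing these options by hand, you could wrap them in Rake tasks. A minimal sketch, with hypothetical file and task names:
# lib/tasks/flaky_tests.rake
namespace :flaky_tests do
  desc "Run the suite without tests marked FLAKY"
  task :exclude do
    sh "bin/rails test --exclude /FLAKY/"
  end

  desc "Run only the tests marked FLAKY"
  task :only do
    sh "bin/rails test --name /FLAKY/"
  end
end
Then bin/rails flaky_tests:exclude runs the main suite and bin/rails flaky_tests:only runs the quarantined tests.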
Reproducing flaky test failures
Common reasons for tests that sometimes fail include:
- Lack of test state isolation. State changed by one test leaks into the tests that run after it.
- Concurrency and non-determinism. When multiple tasks run concurrently, their start and end times can differ between test runs.
The problem of state isolation can be discovered by running tests in a random order. The problem is usually not in the test that started failing, but in another test that ran before it and did not clean up correctly.
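For example, here is a contrived sketch (format_price is a hypothetical helper): the first test changes a global setting and never restores it, so the second test only fails when it happens to run after the first one.
require "test_helper"

class CurrencyFormattingTest < ActiveSupport::TestCase
  test "formats prices in French" do
    I18n.locale = :fr # leaks: the locale is never reset after this test
    assert_equal "10,00 €", format_price(10)
  end
end

class CheckoutTest < ActiveSupport::TestCase
  test "shows the price in the default locale" do
    # Passes when it runs first, fails when CurrencyFormattingTest
    # happened to run before it and left I18n.locale set to :fr.
    assert_equal "$10.00", format_price(10)
  end
end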
The problem of the sequence of events within a test can be reproduced by running the flaky test many times.
Running tests in the same order
Checking whether the tests are order dependent is a good first investigation step.
Test runners show the random seed that controls the sequence in which the tests run. For example:
bin/rails test
...
Run options: --seed 4137
You can use that seed to reproduce the same order:
bin/rails test --seed 4137
(RSpec also has a --seed option.)
If the test fails when running the entire suite with the same seed, but passes with a different seed, that indicates a problem with isolation between tests.
In a Ruby project, you can narrow down the faulty tests with a tool like minitest-bisect:
bundle add minitest-bisect
minitest_bisect --seed 4137 test/**/*_test.rb
RSpec has the feature built-in:
rspec spec --bisect --seed 4137
It’s a bit trickier if your test suite is split into multiple groups that run in parallel. In that case you need two pieces of information:
- the --seed parameter from RSpec or Minitest,
- the list of tests that ran in the same group as the one you are interested in.
For example, I often use the split-test utility in combination with the “matrix run” feature of CI platforms to parallelize test suites.
bundle exec rspec --format progress --seed SEED $(./split-test --junit-xml-report-dir tmp --node-index CURRENT_NODE --node-total TOTAL_NODES --tests-glob 'spec/**/*_spec.rb' 2> /dev/null)
where:
- SEED is the random seed
- CURRENT_NODE is the test group index (e.g. 0, 1, 2, 3)
- TOTAL_NODES is the total number of groups (e.g. 4)
In that example, I’d have to figure out how many groups of tests ran in total and which group contained the failing test.
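The grouping itself is defined in the CI configuration. Here is a sketch of what the matrix setup could look like with GitHub Actions (job and step names are illustrative, four groups are assumed, and SEED remains a placeholder as above):
jobs:
  tests:
    strategy:
      matrix:
        # TOTAL_NODES groups, numbered 0 to 3
        node: [0, 1, 2, 3]
    steps:
      # ...
      - name: Run one group of the test suite
        run: |
          bundle exec rspec --format progress --seed SEED \
            $(./split-test --junit-xml-report-dir tmp \
              --node-index ${{ matrix.node }} \
              --node-total 4 \
              --tests-glob 'spec/**/*_spec.rb' 2> /dev/null)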
Running a test in a loop until it fails
Sometimes the problem is contained within a single test.
If the test fails when run by itself, you can try to get it to fail by running it over and over again in a loop.
I’ve written the following script to help me with that.
#!/usr/bin/env bash
set -euo pipefail

# Print how many runs it took before the command failed.
function print_iterations() {
  echo ""
  echo "$1 runs"
}
trap 'print_iterations $count' EXIT

# Re-run the given command until it exits with a non-zero status.
count=1
while "$@"; do
  ((count++))
done
To use it, save it in a file, then make that file executable.
chmod +x until_failure.sh
Now you can prefix the test run command with the helper script. For example, if rspec spec/models/user_spec.rb:4 is flaky, run:
./until_failure.sh rspec spec/models/user_spec.rb:4
The script takes a command as an argument and runs that command. If the command exits normally (with exit code 0), it is run again. If the command exits with an error, the script exits too and prints how many tries it took to reproduce the failure.
Beware that reproducing the error that way can take a long time. You can interrupt the process with Ctrl+C at any time.
Inspect test logs from CI
You may not be able to reproduce the failure locally. As an extra debugging step, you can save the Rails log for the group of quarantined tests. Here’s an example with GitHub Actions:
jobs:
  flaky-tests:
    # ...
    steps:
      # ...
      - name: Run flaky tests
        run: bundle exec rspec --tag flaky
      - name: Save test logs on test failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: 'test.log'
          path: 'log/test.log'
When the test fails, you can download the test log and compare it to the log of a successful test run.
With multiple tests, it becomes difficult to find in the logs where a test starts and ends. You can annotate the logs with test start and end delimiters. Here’s an example with RSpec:
# spec/support/test_logs.rb
module TestLogHelpers
  def self.included(example_group)
    example_group.around(:each, :flaky) do |example|
      Rails.logger.debug "=== BEGIN #{example.location_rerun_argument}"
      example.run
      Rails.logger.debug "=== END #{example.location_rerun_argument}"
    end
  end
end

RSpec.configure do |config|
  config.include TestLogHelpers, :flaky
end
With that information, I have found it useful to put the failure and success logs for a single test in two files. Then I inspect them side by side and also compare them with a diff viewer (like vimdiff).
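To produce those two files, you can extract everything between the BEGIN and END markers for the example in question. A sketch using awk (the spec location is just an example):
# Keep only the lines between the BEGIN and END markers for this example
awk '/=== BEGIN spec\/models\/user_spec.rb:4/,/=== END spec\/models\/user_spec.rb:4/' \
  log/test.log > failing.log
# Repeat with the log of a passing run, then compare
vimdiff failing.log passing.log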
A more abrupt approach to eliminating flakiness
Depending on the size of your test suite and the time it takes to fix flaky tests, a viable strategy is simply to remove them.
Ask yourself:
- What would be the value of this test if it were reliable?
- How long would it take to make the test reliable?
- What is the downside of not having an automated test?
- How critical is it to catch regressions in that area of the code before release?
- Is it quickly obvious if the corresponding production code breaks (via error monitoring, user bug reports, etc.)?
- How is the functionality documented besides this flaky test?
- Are there any other tests that cover some/all of the functionality at a different level (e.g.: unit test vs browser test)?
Perhaps it’s not worth the time investment to fix the test.
The process becomes:
- confirm that a test is flaky,
- quarantine the test,
- decide whether it’s worth fixing the test,
- if it isn’t, delete the test.
A little bit of process goes a long way
Debugging and fixing flaky tests takes a long time. Identifying and triaging them can be fast.
With a simple system to mark tests as flaky, you avoid guesswork and unblock the continuous integration pipeline.