A race condition is a common class of bug that occurs when a sequence of actions assumes a state, but that state can be modified by an external source. We’ll go into depth with several examples, but a quick one is a mobile app that assumes a data connection exists because it previously performed an API call successfully; the state of the network is controlled by an external source. The app’s assumed network state and the actual network state are in a race for the next network call.
A special case of race condition is the time-of-check to time-of-use (TOCTOU) bug. These bugs arise in the gap between checking the state and acting on that knowledge. Continuing the above example: checking whether you have a network connection, and then sending an HTTP request on the assumption that it will work. That network check might have been the network’s last breath before the phone enters a tunnel, thus presenting a TOCTOU bug.
These bugs are subtle, frequent, tricky to debug, and in some cases a security concern. In this article let’s examine some of the more common ones in the hope that we can internalize the fundamental structure, gain an intuition for recognizing this class of bugs, and avoid these classic examples in our own code.
(“TOCTOU” is not pronounced “toucan”, which is a disappointment to us all.)
Networks
The first example we’ll dive into is the one mentioned in the intro: checking for network connectivity before making the request. Here’s an example in Ruby; it contains a TOCTOU bug:
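A minimal sketch of this check-then-use shape, assuming a `has_connection?` helper built on `Addrinfo.getaddrinfo`. The method bodies, the `fetch_page` name, and the default host are illustrative guesses rather than a definitive listing:

```ruby
require "socket"
require "net/http"
require "uri"

# Hypothetical helper: treats a successful address lookup as "we are online".
def has_connection?(host = "example.com")
  Addrinfo.getaddrinfo(host, 80).any?
rescue SocketError
  false
end

# BUG (TOCTOU): the connection can drop between the check and the request.
def fetch_page(url)
  uri = URI(url)
  return nil unless has_connection?(uri.host) # time of check

  Net::HTTP.get(uri)                          # time of use
end
```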
This example exhibits the TOCTOU bug: it has a gap between the time of check and the time of use. We can use a sequence diagram to understand this gap:
The Ruby code seems sequential, but the state the program is tracking is a snapshot of state that is controlled externally. If the Ruby program uses its last network touchpoint to check for access instead of sending the HTTP request, we will see unexpected errors.
The correct way to solve this is hinted at in `has_connection?`. We already have to handle an exception; that is how the `Addrinfo.getaddrinfo` method communicates failure. Therefore, we can make the HTTP call and handle the exception all in one fell swoop:
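One way this could look, as a sketch: the `fetch_page` name and the `NETWORK_ERRORS` list are assumptions, and the list is illustrative rather than exhaustive.

```ruby
require "net/http"
require "uri"

# Failures we might expect when the network is unavailable
# (an illustrative list, not a complete one).
NETWORK_ERRORS = [
  SocketError,          # name lookup failed
  Errno::ECONNREFUSED,
  Errno::EHOSTUNREACH,
  Errno::ENETUNREACH,
  Net::OpenTimeout,
  Net::ReadTimeout
].freeze

# No up-front check: attempt the request and handle the failure.
def fetch_page(url)
  Net::HTTP.get_response(URI(url)).body
rescue *NETWORK_ERRORS => e
  warn "request failed: #{e.class}"
  nil
end
```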
This has the added advantage of making only one network call, instead of two. The sequence diagram is much simpler now:
The network example is nice because it’s familiar and also a good template. It’s familiar as a user: you’ve definitely experienced bugs in mobile apps that assumed you had network connectivity because you previously had network connectivity. It’s also familiar as a programmer: the network APIs we use all work via an exception-based pattern, and all examples are like the above. In this way it also provides a good template: move from an explicit conditional (asking permission) to a model where we do the action we want to do and handle the failure (asking forgiveness).
Note that a variant of this pattern holds for languages without exceptions. Many of these are too verbose for a blog post, but for example the Rust `reqwest::get` function returns a `Result<Response>` value instead of just `Response`, indicating that it can have an `Error`. Similarly in the C libcurl library, the `curl_easy_perform(3)` function returns `CURLE_OK` on success and non-zero on error, allowing you to make a request and handle the errors manually.
Files
Let’s play Spot The Bug with this code sample:
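A sketch of such a sample; the `read_config` name is hypothetical:

```ruby
# BUG (TOCTOU): the file can be deleted between the check and the read.
def read_config(path)
  if File.exist?(path) # time of check
    File.read(path)    # time of use
  end
end
```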
This shape looks identical to the buggy network code above. Here’s an example sequence diagram:
This sequence diagram shows the bug: Ruby’s state of the filesystem does not match the actual state of the filesystem. We can recover this with exceptions, as we did with the network example:
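As a sketch, with the same hypothetical `read_config` name, the read is attempted directly and a missing file is handled as a failure of the read itself:

```ruby
# No existence check: attempt the read and handle the failure.
def read_config(path)
  File.read(path)
rescue Errno::ENOENT
  nil # the file does not exist, or vanished before we could open it
end
```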
The chance of this bug appearing is low, but the ability to debug it is even lower. This kind of bug can be triggered because of the complexities of how operating systems schedule processes and threads. After each operation there’s a good chance your program will stop for an unknown amount of time while other processes run. The arbitrary and seemingly random nature of this means that these kinds of bugs can occur, and can be impossible to reproduce.
File security
A more subtle, security-sensitive bug can occur when checking permission and file access. For example, this has a bug:
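A sketch of what such code might look like, assuming a `print_secrets` method that takes a path and the requesting user's numeric ID (both parameters are hypothetical):

```ruby
# BUG (TOCTOU): the path can be swapped (for example, replaced by a symlink)
# between the ownership check and the read.
def print_secrets(path, uid)
  if File.stat(path).uid == uid # time of check
    puts File.read(path)        # time of use: path may now point elsewhere
  end
end
```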
The TOCTOU bug becomes apparent under a sequence diagram showing the attack:
This example shows a much more difficult consideration than the above. If an attacker can know when `File.stat` is called vs `File.read`, they can manipulate the program into opening an unexpected file. This example is much more difficult because there’s no good solution here. Not all TOCTOU bugs are as trivial to solve as the file existence example. However, we should try to rely on underlying filesystem permissions as much as possible. This is harder to mitigate, which means that this kind of bug does happen in the wild.
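One way to lean on filesystem permissions, sketched under the assumption that the process runs with the requesting user's own privileges: drop the explicit check and let the operating system refuse the read. The one-argument `print_secrets` signature and the warning message are illustrative.

```ruby
# No File.stat check: the operating system enforces access at open time.
def print_secrets(path)
  puts File.read(path)
rescue Errno::EACCES
  warn "permission denied: #{path}"
end
```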
This is harder to exploit because the attacker needs to time calls to `#print_secrets` instead of `File.stat`, which is presumably more user-dependent. In addition, note that we now handle the exception for a file that we cannot access.
Authorization
The above examples are ones where the external state is under the control of a different entity – the network stack or the filesystem. As a last example let’s talk about what infrastructure you need to build when planning around TOCTOU bugs.
Let’s use the example of authorization: a user needs to be able to do the things they are allowed to do, and nothing more (or less!). Some examples of what users can do are read, publish, delete their own content, delete other content, and create admins.
User type | Can read? | Can publish? | Can delete their own content? | Can delete other content? | Can create admins? |
---|---|---|---|---|---|
Visitor | Yes | No | No | No | No |
User | Yes | Yes | Yes | No | No |
Moderator | Yes | Yes | Yes | Yes | No |
Admin | Yes | Yes | Yes | Yes | Yes |
If you check what a user can do when they sign in, you are storing that state for as long as they are authenticated. Here’s an example of buggy authorization code:
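A plain-Ruby sketch of that shape, with a `Struct` standing in for the user model and a `Hash` for the session (the `sign_in` name and the field names are assumptions):

```ruby
# Stand-in for a user record (hypothetical fields).
User = Struct.new(:id, :can_promote)

# BUG (TOCTOU): permissions are copied into the session once, at sign-in.
def sign_in(session, user)
  session[:user] = { id: user.id, can_promote: user.can_promote }
end

session = {}
user = User.new(1, true)
sign_in(session, user)

user.can_promote = false     # an admin demotes the user
session[:user][:can_promote] # still true: the session holds a stale snapshot
```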
Setting authorization details once at sign-in means that every later check reads a stale snapshot. Here’s an example of a buggy method of checking whether a user can promote others:
def can_promote?
  session[:user][:can_promote]
end
If the user is promoted, they will need to sign out and back in to see the effects. This is something the user wants, so they’ll be encouraged to do so. However, if the user is demoted or otherwise moderated by the admins, they will not feel the effects until they re-authenticate – which they have no interest in doing.
The TOCTOU bug may go deeper. Let’s look at some example code that shows the link to the admin page, the controller backing it, and the model code:
When designing your system, you can avoid introducing TOCTOU bugs by delaying any checks until the last moment, and relying on an “ask forgiveness, not permission” mindset.
A bug-free version of `#can_promote?` would go directly to the database every time:
def can_promote?
  current_user.can_promote?
end
But it goes deeper: the actual check must happen in the model layer. It is the last safeguard of the data and the keeper of authorization state.
The prior method for promoting a user, `User#promote`, is replaced with one that tracks details on both users, `User#promote_by!`. This allows us to perform checks at the lowest level.
We want an atomic world with the latest values, so we run two database queries inside a transaction. The first query is for a user that matches our promoter’s ID and also has the permission required to perform the promotion. The second query is the update itself.
There remains a chance that the promoting user will be updated outside of this transaction, so we lock that row for consistency.
We choose to switch to exceptions here for consistency with other examples in Ruby where we try to avoid TOCTOU bugs, such as in networking or file code, and also because `ActiveRecord::Base#transaction` requires that.
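The same check-inside-the-critical-section idea can be sketched without a database. In this in-memory stand-in, a `Mutex` plays the role of the transaction and row lock; the `UserStore` class, the `NotAuthorized` error, and the field names are all hypothetical, and a real Rails model would use ActiveRecord's transaction and locking support instead.

```ruby
# In-memory stand-in for the users table (hypothetical; not ActiveRecord).
class UserStore
  class NotAuthorized < StandardError; end

  def initialize
    @lock = Mutex.new
    @users = {}
  end

  def add(id, can_promote: false)
    @lock.synchronize { @users[id] = { can_promote: can_promote } }
  end

  # Simulates an admin revoking the permission elsewhere.
  def demote(id)
    @lock.synchronize { @users.fetch(id)[:can_promote] = false }
  end

  def can_promote?(id)
    @lock.synchronize { @users.fetch(id)[:can_promote] }
  end

  # The permission check and the update run under one lock, so the
  # promoter's permission cannot change between the check and the write.
  def promote_by!(target_id, promoter_id)
    @lock.synchronize do
      promoter = @users.fetch(promoter_id)
      raise NotAuthorized, "user #{promoter_id} may not promote" unless promoter[:can_promote]

      @users.fetch(target_id)[:can_promote] = true
    end
  end
end
```

A caller would rescue `UserStore::NotAuthorized` rather than checking `can_promote?` first, which keeps the decision next to the data it protects.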
Conclusion
When dealing with files, keep an eye on `stat(2)` calls, such as `File.exist?` in Ruby, `test -e` in shell, or `stat()` in C. These indicate potential bugs, and you may be able to re-write them using exceptions instead.
When working with or designing a system, take note of when you’re tracking state and who can control that state. If it can be modified by an external actor – a train tunnel, the user in another shell, the user opening another tab, or a security-minded attacker – then you have the potential for a time-of-check to time-of-use bug.
TOCTOU bugs range in severity from simply being tricky to track down to actual security issues. It is in the best interest of you and your users to build your software to avoid TOCTOU bugs. Take note of the textbook examples to avoid those in general.
The underlying principle to avoid a TOCTOU bug is “ask for forgiveness, not for permission”. This can be seen as another form of “tell, don’t ask”, and can also be seen as a form of the Law of Demeter. The object that maintains the resource is the final authority on the resource’s state, and tracking the state elsewhere is setting yourself up for a TOCTOU bug.