This is your threading bug

A race condition bug is a common class of bugs that occur when a sequence of actions assumes a state, but the state can be modified from an external source. We’ll go in depth in many examples, but a quick example is a mobile app that assumes a data connection exists because you had previously performed an API call; the state of the network is controlled by an external source. These two representations of network state are in a race for the next network call.

A special case of race condition is the time-of-check to time-of-use (TOCTOU) bug. These bugs live in the gap between checking the state and acting on the knowledge of that state. Continuing the above example: checking whether you have a network connection, and then sending an HTTP request with the assumption that it will work. That network check might have been the network’s last breath before the phone enters a tunnel, thus presenting a TOCTOU bug.

These bugs are subtle, frequent, tricky to debug, and in some cases a security concern. In this article let’s examine some of the more common ones in the hope that we can internalize the fundamental structure, gain an intuition for recognizing this class of bugs, and avoid these classic examples in our own code.

(“TOCTOU” is not pronounced “toucan”, which is a disappointment to us all.)

Networks

The first example we’ll dive into is the one mentioned in the intro: checking for network connectivity before making the request. Here’s an example in Ruby; it contains a TOCTOU bug:

def request(req)
  if has_connection?(req)
    Net::HTTP.start(req.host, req.port) { |http| http.request(req) }
  end
end

def has_connection?(req)
  Addrinfo.getaddrinfo(req.host, req.port)
rescue SocketError
  false
end
This sample of a TOCTOU bug defines a `request` method that makes an HTTP request, returning the result if there is a network connection, or `nil` otherwise. It tests for a network connection via a DNS lookup.

This example exhibits the TOCTOU bug: it has a gap between the time of check and the time of use. We can use a sequence diagram to understand this gap:

getaddrinfo returns success before we enter the tunnel but Net::HTTP.start fails after
The `request` method starts with a connected network. The `has_connection?` and subsequent `getaddrinfo` call happen while the network is connected, but the network disconnects immediately after returning from `getaddrinfo`. This leaves the `Net::HTTP.start` in the false belief that the network is connected.

The Ruby code seems sequential, but the state the program is tracking is a snapshot of state that is controlled externally. If the Ruby program uses its last network touchpoint to check for access instead of sending the HTTP request, we will see unexpected errors.

The correct way to solve this is hinted at in has_connection?. We already have to handle an exception; that is how the Addrinfo.getaddrinfo method communicates failure. Therefore, we can make the HTTP call and handle the exception in one fell swoop:

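require "net/http"

# Exceptions the network stack can raise when a request fails
HTTP_ERRORS = [
  EOFError,
  Errno::ECONNREFUSED,  # nothing is listening on the remote port
  Errno::ECONNRESET,
  Errno::EHOSTUNREACH,  # no route to the host
  Errno::EINVAL,
  Errno::ENETUNREACH,   # the network itself is unreachable
  Net::HTTPBadResponse,
  Net::HTTPHeaderSyntaxError,
  Net::ProtocolError,
  SocketError,
  Timeout::Error,
]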

def request(req)
  Net::HTTP.start(req.host, req.port) { |http| http.request(req) }
rescue *HTTP_ERRORS
  nil
end
This re-definition of the `#request` method makes the HTTP call immediately, relying on exceptions to communicate network disconnections. The underlying network libraries can raise a slew of exceptions, so we define a constant `HTTP_ERRORS` to format them more neatly.

This has the added advantage of making only one network call, instead of two. The sequence diagram is much simpler now:

Make the Net::HTTP.start call without any checks and ignore the tunnel
The `request` method starts with a connected network. It makes the `Net::HTTP.start` call immediately. It does not try to track the network state, and handles errors as they come.

The network example is nice because it’s familiar and also a good template. It’s familiar as a user: you’ve definitely experienced bugs in mobile apps that assumed you had network connectivity because you previously had network connectivity. It’s also familiar as a programmer: the network APIs we use all work via an exception-based pattern, and all examples are like the above. In this way it also provides a good template: move from an explicit conditional (asking permission) to a model where we do the action we want to do and handle the failure (asking forgiveness).

Note that a variant of this pattern holds for languages without exceptions. Many of these are too verbose for a blog post, but for example the Rust reqwest::get function returns a Result<Response> value instead of just Response, indicating that it can have an Error. Similarly in the C libcurl library, the curl_easy_perform(3) function returns CURLE_OK on success and non-zero on error, allowing you to make a request and handle the errors manually.
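The same attempt-then-inspect shape can be sketched in Ruby without letting exceptions reach the caller. The `Outcome` struct and `attempt` helper below are hypothetical names for illustration, loosely mirroring a Rust-style `Result`; they are not part of any library.

```ruby
# Hypothetical value type: holds either a value or an error.
Outcome = Struct.new(:value, :error) do
  def ok?
    error.nil?
  end
end

# Run the block once and report what happened as a value.
# There is no separate "can I do this?" check that could go stale.
def attempt(*error_classes)
  Outcome.new(yield, nil)
rescue *error_classes => e
  Outcome.new(nil, e)
end

outcome = attempt(Errno::ENOENT) { File.read("/nonexistent/Gemfile.local") }
puts outcome.ok? ? outcome.value : "failed: #{outcome.error.class}"
```

The caller still has to look at the outcome, but the only source of truth is the attempt itself.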

Files

Let’s play Spot The Bug with this code sample:

def eval_gemfile
  if File.exist?("Gemfile.local")
    eval(File.read("Gemfile.local"))
  end
end
This is a simple method that reads and evaluates the contents of a file named `Gemfile.local`, but only if the file exists. This exposes a TOCTOU bug.

This shape looks identical to the buggy network code above. Here’s an example sequence diagram:

Checking a file before reading from it means it can be deleted in between
In this sequence the `Gemfile.local` file exists at the start of `eval_gemfile` and during the call to `File.exist?`, but an outside force deletes it before we get to the `File.read`.

This sequence diagram shows the bug: Ruby’s state of the filesystem does not match the actual state of the filesystem. We can recover this with exceptions, as we did with the network example:

def eval_gemfile
  eval(File.read("Gemfile.local"))
rescue Errno::ENOENT
end
This is a simple method that reads and evaluates the contents of a file named `Gemfile.local`. If the file does not exist `File.read` raises `Errno::ENOENT`, but we rescue and silence that one exception. This is equivalent to the prior example but without the TOCTOU bug.

The chance of this bug appearing is low, but the ability to debug it is even lower. This kind of bug can be exploited due to the complexities of how operating systems schedule processes and threads. After each operation there’s a good chance your program will stop for an unknown amount of time while other processes run. The arbitrary and seemingly random nature of this means that these kinds of bugs can occur, and can be impossible to reproduce.
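Although the real race depends on scheduling, the gap can be forced open on purpose to see both shapes side by side. This is a minimal sketch, not a reproduction of a real scheduler: the `interloper` lambda is a hypothetical stand-in for another process that runs while ours is paused, and the method names are made up for this example.

```ruby
require "fileutils"
require "tmpdir"

# Buggy shape: check, then use.
def check_then_read(path, interloper)
  return nil unless File.exist?(path) # time of check
  interloper.call                     # the gap, forced open
  File.read(path)                     # time of use
end

# Fixed shape: just read, and treat a missing file as a normal outcome.
def read_or_nil(path, interloper)
  interloper.call
  File.read(path)
rescue Errno::ENOENT
  nil
end

def race_demo
  Dir.mktmpdir do |dir|
    path = File.join(dir, "Gemfile.local")
    delete = -> { FileUtils.rm_f(path) }

    File.write(path, "gem 'rake'\n")
    buggy = begin
      check_then_read(path, delete)
    rescue Errno::ENOENT => e
      e.class
    end

    File.write(path, "gem 'rake'\n")
    fixed = read_or_nil(path, delete)

    [buggy, fixed]
  end
end

p race_demo
```

The buggy shape surfaces `Errno::ENOENT` even though its existence check passed, while the fixed shape returns `nil`.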

File security

A more subtle, security-sensitive bug can occur when checking permission and file access. For example, this has a bug:

def print_secrets
  if has_access?("secrets")
    puts File.read("secrets")
  end
end

def has_access?(filename)
  # File.stat calls stat(2) on the path without opening the file
  status = File.stat(filename)
  status.file? && status.owned?
end
The `print_secrets` method checks whether the user should have access to the `secrets` file and, if so, prints its contents. The access check is done using the OS `stat(2)` function. The business logic is that a user has access to a file if it is a regular file and the user owns it. This implementation has a TOCTOU bug.

The TOCTOU bug becomes apparent under a sequence diagram showing the attack:

Checking a file's permissions before reading from it opens us up to symlink attacks
In this sequence diagram the `print_secrets` method begins with `secrets` under the user’s control and as a regular file. We perform the `File.stat`, which confirms this. Then the attacker removes `secrets`, replacing it with a link to `/etc/master.passwd`, the file containing not just user account names but also their password hashes. Ruby calls `File.read` on this symlink.

This example shows a much more difficult consideration than the ones above. If an attacker can time their change to land between File.stat and File.read, they can manipulate the program into opening an unexpected file. It is more difficult because there’s no fully satisfying solution here; not all TOCTOU bugs are as trivial to solve as the file existence example. We should rely on the underlying filesystem permissions as much as possible, but because this is hard to mitigate, this kind of bug does happen in the wild. Here is a version that narrows the window:

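def print_secrets
  file = File.new("secrets")
  contents = file.read

  # Check the handle we actually read from, not the path
  if has_permission?(file)
    puts contents
  end
rescue Errno::EACCES
  # Not readable by this user: print nothing
ensure
  file&.close
end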

def has_permission?(file)
  status = file.stat
  status.file? && status.owned?
end
This `#print_secrets` implementation also has a TOCTOU bug, but it is harder to exploit. We first create a file handle via `File.new`; this can raise `Errno::EACCES` if the user does not have permissions to read this file, which we handle explicitly. After reading the contents of the file we additionally check the permission of the filehandle we read, and only print the contents if that matches.

This is harder to exploit because the attacker needs to time calls to #print_secrets instead of File.stat, which is presumably more user-dependent. In addition, note that we now handle the exception for a file that we cannot access.
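One way to narrow the window further is to refuse symlinks at open time and run the ownership check on the descriptor that was actually opened. This is a swapped-in technique rather than the approach above: it assumes a POSIX platform where Ruby defines `File::NOFOLLOW` (the `O_NOFOLLOW` flag to open(2)), and `read_if_owned` is a hypothetical helper name.

```ruby
# Sketch: open without following a symlink, then check the open handle.
# Assumes File::NOFOLLOW is available on this platform.
def read_if_owned(path)
  File.open(path, File::RDONLY | File::NOFOLLOW) do |file|
    status = file.stat # fstat(2) on the descriptor, not on the path
    return nil unless status.file? && status.owned?

    file.read
  end
rescue Errno::ENOENT, Errno::EACCES, Errno::ELOOP, Errno::EMLINK
  # Missing, unreadable, or a symlink (reported as ELOOP on Linux and
  # macOS, EMLINK on FreeBSD): treat all of them as "no access".
  nil
end
```

Because the check and the read go through the same descriptor, swapping the path afterwards does not change which file is read. Note that `O_NOFOLLOW` only guards the final path component; symlinked parent directories are still followed.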

Authorization

The above examples are ones where the external state is under the control of a different entity: the network stack or the filesystem. As a last example let’s talk about what infrastructure you need to build when planning around TOCTOU bugs.

Let’s use the example of authorization: a user needs to be able to do the things they are able to do, and nothing more (or less!). Some examples of what users can do are read, publish, delete, delete other content, and create admins.

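| User type | Can read? | Can publish? | Can delete their own content? | Can delete other content? | Can create admins? |
| --- | --- | --- | --- | --- | --- |
| Visitor | Yes | No | No | No | No |
| User | Yes | Yes | Yes | No | No |
| Moderator | Yes | Yes | Yes | Yes | No |
| Admin | Yes | Yes | Yes | Yes | Yes |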

If you check what a user can do when they sign in, you are storing that state for as long as they are authenticated. Here’s an example of buggy authorization code:

class SignInsController < ApplicationController
  def create
    if user = User.authorize_by(params[:email], params[:password])
      session[:user] = {
        id: user.to_param,
        can_read: user.can_read?,
        can_write: user.can_write?,
        can_delete: user.can_delete?,
        can_moderate: user.can_moderate?,
        can_promote: user.can_promote?,
      }
      redirect_to dashboard_path
    else
      flash.now[:error] = t(".incorrect_username_or_password")
      render :new
    end
  end
end
This Rails controller mixes authorization into authentication. When the user authenticates correctly, we stuff their abilities into the session. This sets us up for a TOCTOU bug.

Setting authorization details once at sign-in means that any future check is static. Here’s an example of a buggy method of checking whether a user is an admin:

def can_promote?
  session[:user][:can_promote]
end

If the user is promoted, they will need to sign out and back in to see the effects. This is something the user wants, so they’ll be encouraged to do so. However, if the user is demoted or otherwise moderated by the admins, they will not feel the effects until they re-authenticate – which they have no interest in doing.
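The staleness can be seen without Rails at all. The following is a minimal sketch that uses a plain hash as a stand-in for the users table and another hash as the session; `Account`, `sign_in`, and the two predicate methods are hypothetical names for this example.

```ruby
# Hypothetical stand-in for a row in the users table.
Account = Struct.new(:id, :can_promote)

# Copies the permission into the session at sign-in (the buggy shape).
def sign_in(accounts, id)
  account = accounts.fetch(id)
  { id: account.id, can_promote: account.can_promote }
end

# Reads the snapshot taken at sign-in.
def stale_can_promote?(session)
  session[:can_promote]
end

# Asks the store that owns the state, at the moment of use.
def live_can_promote?(accounts, session)
  accounts.fetch(session[:id]).can_promote
end

def demotion_demo
  accounts = { 1 => Account.new(1, true) }
  session = sign_in(accounts, 1)
  accounts.fetch(1).can_promote = false # an admin demotes the user

  [stale_can_promote?(session), live_can_promote?(accounts, session)]
end

p demotion_demo
```

After the demotion the session snapshot still answers `true`, while the live lookup answers `false`.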

The TOCTOU bug may go deeper. Let’s look at some example code that shows the link to the admin page, the controller backing it, and the model code:

<nav>
  <ul>
    <li><%= link_to "Write new post", new_article_path %></li>
    <% if can_promote? %>
      <li><%= link_to "Promote user", new_promotions_path %></li>
    <% end %>
  </ul>
</nav>

class PromotionsController < ApplicationController
  def new
    @user = User.new
  end

  def create
    @user = User.find(params[:user_id])

    if @user.promote
      redirect_to new_promotions_path
    else
      render :new
    end
  end
end

class User < ActiveRecord::Base
  def promote
    update(can_promote: true)
  end
end
The ERb template uses our `#can_promote?` helper to decide whether to show a link to the user promotions form. The `create` action for the promotions controller finds a user, promotes them, and redirects. Promoting a user (`User#promote`) simply updates the `can_promote` field in the database. Because `User#promote` performs no check of its own, it relies entirely on a check made earlier and elsewhere, a style that invites a TOCTOU bug.

When designing your system, you can avoid introducing TOCTOU bugs by delaying any checks until the last moment, and relying on an “ask forgiveness, not permission” mindset.

A bug-free version of #can_promote? would go directly to the database every time:

def can_promote?
  current_user.can_promote?
end

But it goes deeper: the actual check must happen in the model layer. It is the last safeguard of the data and the keeper of authorization state.

class User < ActiveRecord::Base
  def promote_by!(promoter_id)
    transaction do
      promoter = self.class.lock.find_by(
        id: promoter_id,
        can_promote: true,
      )

      if promoter
        update!(can_promote: true)
      else
        raise ArgumentError, "can only be promoted by an admin"
      end
    end
  end
end
We’ve replaced `User#promote` with `User#promote_by!`, which takes the ID of the user doing the promoting, executes everything inside a database transaction, finds and locks the promoting user only if they are an admin, and promotes the target user only if that promoting user was found.

The prior method for promoting a user, User#promote, is replaced with one that tracks details on both users, User#promote_by!. This allows us to perform checks at the lowest level.

We want an atomic world with the latest values, so we run two database queries inside a transaction. The first query is for a user that matches our promoter’s ID and also has the permission required to perform the promotion. The second query is the update itself.

There remains a chance that the promoting user will be updated outside of this transaction, so we lock that row for consistency.

We choose to switch to exceptions here for consistency with the other Ruby examples where we avoid TOCTOU bugs, such as the networking and file code, and also because raising an exception is how an ActiveRecord transaction block is told to roll back.
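The same check-and-act-together idea can be sketched in plain Ruby, with a `Mutex` playing the role that the transaction and row lock play in the database. This is an in-memory analogy under that assumption, not a replacement for the ActiveRecord code; `Registry` and its methods are hypothetical names.

```ruby
# Hypothetical in-memory registry of who may promote.
class Registry
  NotAuthorized = Class.new(StandardError)

  def initialize(permissions)
    @can_promote = permissions.dup # { user_id => true or false }
    @lock = Mutex.new
  end

  # The permission check and the update run under the same lock, so a
  # concurrent demotion cannot slip in between them.
  def promote_by!(promoter_id, user_id)
    @lock.synchronize do
      unless @can_promote[promoter_id]
        raise NotAuthorized, "can only be promoted by an admin"
      end

      @can_promote[user_id] = true
    end
  end

  def demote!(user_id)
    @lock.synchronize { @can_promote[user_id] = false }
  end

  def can_promote?(user_id)
    @lock.synchronize { @can_promote.fetch(user_id, false) }
  end
end
```

A caller simply attempts the promotion and rescues `Registry::NotAuthorized`, rather than asking `can_promote?` first and acting on a possibly stale answer.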

Conclusion

When dealing with files, keep an eye on stat(2) calls, such as File.exist? in Ruby, test -e in shell, or stat() in C. These indicate potential bugs, and you may be able to re-write them using exceptions instead.

When working with or designing a system, take note of when you’re tracking state and who can control that state. If it can be modified by an external actor – a train tunnel, the user in another shell, the user opening another tab, or a security-minded attacker – then you have the potential for a time-of-check to time-of-use bug.

TOCTOU bugs range in severity from simply being tricky to track down to actual security issues. It is in the best interest of you and your users to build your software to avoid TOCTOU bugs. Take note of the textbook examples to avoid those in general.

The underlying principle to avoid a TOCTOU bug is “ask for forgiveness, not for permission”. This can be seen as another form of “tell, don’t ask”, and can also be seen as a form of the Law of Demeter. The object that maintains the resource is the final authority on the resource’s state, and tracking the state elsewhere is setting yourself up for a TOCTOU bug.