---
title: Graceful Switching of Worker Processes
teaser: 'How to avoid lost work from your job queues and worker processes.

  '
tags: web,heroku,ruby
author: Oli Peate
published_on: 2017-11-28
---

A recent client project featured a subtle misconfiguration which left the
application susceptible to losing jobs performed by worker processes. This post
explores the issue and builds a demonstration on Heroku.

## A Primer: Why Use Worker Processes?

When a user makes an HTTP request sometimes it's necessary to perform a slow
task like sending an email or generating a report. If this slow task is
performed by the web server before sending an HTTP response then the user can be
left staring at a blank page. That's a poor experience for the user.

It's also accompanied by a bunch of potential headaches: What if the request
times out? And what happens if the user hits refresh in their browser?

To keep a web application responsive for users it's common to offload slow tasks
to worker processes which are distinct from the web processes handling HTTP
requests & responses. This is outlined in the [twelve-factor app guide]:

[twelve-factor app guide]: https://12factor.net/concurrency

> [...] the developer can architect their app to handle diverse workloads by
assigning each type of work to a *process type*. For example, HTTP requests may
be handled by a web process, and long-running background tasks handled by a
worker process

Using worker processes can provide a slicker user experience; allowing the web
processes to return an HTTP response quickly — stating the task is underway —
followed by updates with the outcome of the asynchronous task. Behind the scenes
one or more workers then crunch through the tasks which are passed via a job
queue.

Offloading tasks to workers is particularly important if your application runs
on a Platform-as-a-Service (PaaS) like Heroku or Google App Engine because
request durations are capped at the routing layer. Slow responses to HTTP
requests will be timed out automatically. For example this timeout is thirty
seconds on Heroku.

## Reliability Is Key

In adopting this pattern it's important the background job processing system is
robust. We don't want jobs to be lost or crash without completing the task. Many
background processing frameworks, like [Sidekiq] or [Faktory], feature
excellent retry mechanisms, integrations with exception trackers, and web
dashboards to address and monitor reliability.

[Sidekiq]: https://github.com/mperham/sidekiq
[Faktory]: https://github.com/contribsys/faktory

One under-appreciated aspect of reliable job processing is having a clean
switchover when a new version of the code is deployed. Teams embracing
continuous deployment may release new code to production anywhere from tens to
hundreds of times per day.

Often new code is deployed for both web and worker processes at the same time,
even if only the web-related code has changed, to ensure compatibility. This
involves stopping the existing processes running the previous version and
starting fresh processes with the new version.

As we are about to discover the switching technique is important...

![A graceful switch by Indiana Jones in Raiders of the Lost Ark](https://images.thoughtbot.com/blog-vellum-image-uploads/vuHPpt2FTqi8sJ1snVnC_graceful-switcheroo.gif)

## & Now... A Problem

The misconfiguration in question lay in the [`Procfile`][procfile], a common
file format used on platforms like Heroku to describe which process types should
be run as part of the application. The `worker` definition was launching two
processes:

[procfile]: https://devcenter.heroku.com/articles/procfile

```procfile
web: bundle exec puma -C ./config/puma.rb
worker: bundle exec sidekiq -q critical & bundle exec sidekiq -q default
```

What's happening on line #2 is:

1. The first Sidekiq process (handling the 'critical' queue) is started in the
   background
2. The second Sidekiq process (handling the 'default' queue) is then
   started

The `&` informs the shell to run the preceding command in the background (a
feature of [Unix job control]). Background in this context relates to the
execution of the Unix process, not to be confused with 'background' workers, as
in the code which is being run. *Author mops brow as the overlapping terms pile
up left, right and center. Deep breaths.*

[Unix job control]: https://en.wikibooks.org/wiki/A_Quick_Introduction_to_Unix/Job_Control

What's wrong with this? When a new version of the code is deployed the existing
processes won't be shutdown gracefully. Graceful in this context means either
the work being finished within the grace period or the job can be put back on
the queue to be picked up by another worker.

If the two worker processes don't shutdown gracefully then there's potential for
jobs to be lost between deployments. Running workers in the background can also
lead to crashed processes not being restarted by the platform.

## Is This Really a Problem? Prove It!

For this experiment highly sophisticated `tick` & `tock` worker scripts are
required. Here's `tick.rb` (and you can imagine how `tock.rb` works):

```ruby
#!/usr/bin/env ruby

STDOUT.sync = true

def info
  "PID=#{Process.pid} WORKER_VERSION=1"
end

puts "TICK: Starting process... #{info}"

trap "TERM" do
  puts "TICK: Received SIGTERM #{info}"
  sleep 5 # Pretend to be finishing some hard work
  puts "TICK: Graceful shutdown #{info}"
  exit
end

loop do
  puts "TICK #{info}"
  sleep 2 # Pretend to be doing hard work
end
```

If I run this script locally in a terminal window:

```sh
chmod +x ./tick.rb
./tick.rb
```

It produces output like this indefinitely:

<pre>
<samp>TICK: Starting process... PID=10346 WORKER_VERSION=4
TICK PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1</samp>
</pre>

If one terminates the `tick` process by reference to its [PID][PID] (in another
terminal window):

[PID]: https://en.wikipedia.org/wiki/Process_identifier

```sh
kill 10346
```

Then the original terminal window displays:

<pre>
<samp>TICK: Received SIGTERM PID=10346 WORKER_VERSION=1
TICK: Graceful shutdown PID=10346 WORKER_VERSION=1</samp>
</pre>

The first line is displayed immediately - showing the request for termination
has been received by the `trap` line of the Ruby script. The second line is
displayed five seconds later, which is the delay the script requested. This
proves the tick process was given time to complete its 'hard work'.

### Lift-off: To the Cloud!

Now it's time to deploy `tick` and `tock` to Heroku. First up craft a `Procfile`
like this:

```procfile
worker: ./tick.rb & ./tock.rb
```

After committing the `Procfile` along with `tick.rb` and `tock.rb` the app is
ready to be deployed. Once the `worker` process type is scaled to one dyno, then
the Heroku logs should show alternating lines from the `tick` and `tock`
processes:

<pre>
<samp>app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1</samp>
</pre>

This works as expected until it's time to deploy the next version of the
workers:

<pre>
<samp>app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[api]: Build started by user oli@redacted.com
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[api]: Deploy 46be478c by user oli@redacted.com
app[api]: Release v2 created by user oli@redacted.com
heroku[worker.1]: Restarting
heroku[worker.1]: State changed from up to starting
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[worker.1]: TOCK: Received SIGTERM PID=6 WORKER_VERSION=1
heroku[worker.1]: Stopping all processes with SIGTERM
heroku[worker.1]: Process exited with status 143
heroku[worker.1]: Starting process with command `./tick.rb & ./tock.rb`
heroku[worker.1]: State changed from starting to up
app[worker.1]: TICK: Starting process... PID=5 WORKER_VERSION=2
app[worker.1]: TICK PID=5 WORKER_VERSION=2
app[worker.1]: TOCK: Starting process... PID=6 WORKER_VERSION=2
app[worker.1]: TOCK PID=6 WORKER_VERSION=2</samp>
</pre>

The exit status was `143` which suggests an error occurred. Also there's no
graceful shutdown message from either the `tick` or `tock` process. If the
processes had been handling real jobs these may have been lost.

To address this the `Procfile` needs to be updated to declare two separate
process types:

```procfile
tick: ./tick.rb
tock: ./tock.rb
```

After scaling these new process types to one dyno each, it's time to deploy yet
another new version to observe the deployment transition:

<pre>
<samp>app[tick.1]: TICK PID=4 WORKER_VERSION=3
app[tock.1]: TOCK PID=4 WORKER_VERSION=3
app[api]: Build started by user oli@redacted.com
app[tick.1]: TICK PID=4 WORKER_VERSION=3
app[tock.1]: TOCK PID=4 WORKER_VERSION=3
app[api]: Deploy 3546fe29 by user oli@redacted.com
app[api]: Release v4 created by user oli@redacted.com
heroku[tick.1]: Restarting
heroku[tick.1]: State changed from up to starting
heroku[tock.1]: Restarting
heroku[tock.1]: State changed from up to starting
heroku[tick.1]: Stopping all processes with SIGTERM
heroku[tock.1]: Stopping all processes with SIGTERM
app[tick.1]: TICK: Received SIGTERM PID=4 WORKER_VERSION=3
app[tock.1]: TOCK: Received SIGTERM PID=4 WORKER_VERSION=3
heroku[tick.1]: Starting process with command `./tick.rb`
heroku[tock.1]: Starting process with command `./tock.rb`
heroku[tick.1]: State changed from starting to up
heroku[tock.1]: State changed from starting to up
app[tick.1]: TICK: Starting process... PID=4 WORKER_VERSION=4
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tock.1]: TOCK: Starting process... PID=4 WORKER_VERSION=4
app[tock.1]: TOCK PID=4 WORKER_VERSION=4
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tick.1]: TICK: Graceful shutdown PID=4 WORKER_VERSION=3
app[tock.1]: TOCK: Graceful shutdown PID=4 WORKER_VERSION=3
heroku[tick.1]: Process exited with status 0
heroku[tock.1]: Process exited with status 0
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tock.1]: TOCK PID=4 WORKER_VERSION=4</samp>
</pre>

Hooray! Now the graceful shutdown message is printed for both `tick` and `tock`
worker processes. The exit status of both of the outgoing processes was `0`,
indicating there was no error. This means Heroku waited five seconds for the
workers to shutdown (as requested by our scripts).

On Heroku you get up to [30 seconds to shut down a process][heroku-shutdown]:

[heroku-shutdown]: https://devcenter.heroku.com/articles/dynos#shutdown

> The application processes have 30 seconds to shut down cleanly (ideally, they
will do so more quickly than that). During this time they should stop accepting
new requests or jobs and attempt to finish their current requests, or put jobs
back on the queue for other worker processes to handle. If any processes remain
after that time period, the dyno manager will terminate them forcefully with
`SIGKILL`.

## Multiple Processes Per Dyno

Why were two processes being declared together on one line of the `Procfile`?
This decision was probably made to increase the resource utilization of the dyno
running the worker process, which can be more cost effective than paying for
more dynos which are less heavily utilized. In practice this implementation made
the system less robust so I'd argue it's not a good trade-off.

It is safe to run multiple processes per dyno *if* there's a master process to
coordinate a bunch of child processes. This is how Puma webserver runs in
[clustered mode]. Sidekiq also has the ability to coordinate [multiple
processes] in the enterprise version.

[clustered mode]: https://github.com/puma/puma#clustered-mode
[multiple processes]: https://github.com/mperham/sidekiq/wiki/Ent-Multi-Process

## Time for a Checkup

It's worth checking *any* long running process exits gracefully when it receives
a message to shutdown, not just job queue workers. Time to go review the
`Procfile`s (and equivalents) in your codebases!
