A recent client project featured a subtle misconfiguration which left the application susceptible to losing jobs performed by worker processes. This post explores the issue and builds a demonstration on Heroku.
A Primer: Why Use Worker Processes?
When a user makes an HTTP request sometimes it’s necessary to perform a slow task like sending an email or generating a report. If this slow task is performed by the web server before sending an HTTP response then the user can be left staring at a blank page. That’s a poor experience for the user.
It’s also accompanied by a bunch of potential headaches: What if the request times out? And what happens if the user hits refresh in their browser?
To keep a web application responsive for users it’s common to offload slow tasks to worker processes which are distinct from the web processes handling HTTP requests & responses. This is outlined in the twelve-factor app guide:
[…] the developer can architect their app to handle diverse workloads by assigning each type of work to a process type. For example, HTTP requests may be handled by a web process, and long-running background tasks handled by a worker process
Using worker processes can provide a slicker user experience: the web process returns an HTTP response quickly, stating the task is underway, and the user is later shown the outcome of the asynchronous task. Behind the scenes one or more workers then crunch through the tasks, which are passed via a job queue.
Offloading tasks to workers is particularly important if your application runs on a Platform-as-a-Service (PaaS) like Heroku or Google App Engine because request durations are capped at the routing layer. Slow responses to HTTP requests will be timed out automatically. For example this timeout is thirty seconds on Heroku.
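To make the pattern concrete, here's a minimal sketch using Sidekiq (the class name, queue and arguments are made up for illustration): the web process enqueues a job and responds immediately, and a worker process performs it later.

require "sidekiq"

# Runs in a worker process, not in the web process
class ReportWorker
  include Sidekiq::Worker
  sidekiq_options queue: "default", retry: 5 # retries are handled by the framework

  def perform(report_id)
    # Generate the report here, outside the request/response cycle
  end
end

# In the web process: enqueue the job and return an HTTP response straight away
ReportWorker.perform_async(42)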
Reliability Is Key
In adopting this pattern it’s important the background job processing system is robust. We don’t want jobs to be lost or crash without completing the task. Many background processing frameworks, like Sidekiq or Faktory, feature excellent retry mechanisms, integrations with exception trackers, and web dashboards to address and monitor reliability.
One under-appreciated aspect of reliable job processing is having a clean switchover when a new version of the code is deployed. Teams embracing continuous deployment may release new code to production anywhere from tens to hundreds of times per day.
Often new code is deployed for both web and worker processes at the same time, even if only the web-related code has changed, to ensure compatibility. This involves stopping the existing processes running the previous version and starting fresh processes with the new version.
As we are about to discover, the switching technique is important…
& Now… A Problem
The misconfiguration in question lay in the Procfile, a common file format used on platforms like Heroku to describe which process types should be run as part of the application. The worker definition was launching two processes:
web: bundle exec puma -C ./config/puma.rb
worker: bundle exec sidekiq -q critical & bundle exec sidekiq -q default
What’s happening on line #2 is:
- The first Sidekiq process (handling the ‘critical’ queue) is started in the background
- The second Sidekiq process (handling the ‘default’ queue) is then started
The & informs the shell to run the preceding command in the background (a feature of Unix job control). Background in this context relates to the execution of the Unix process, not to be confused with ‘background’ workers, as in the code which is being run. Author mops brow as the overlapping terms pile up left, right and center. Deep breaths.
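If the distinction feels abstract, here's a rough stand-in you can try locally, using sleep in place of the real workers:

# Stand-in for `bundle exec sidekiq ... & bundle exec sidekiq ...`
sleep 300 & sleep 300

# The first sleep is backgrounded and the second runs in the foreground;
# running `pgrep -l sleep` in another terminal shows two separate processes.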
What’s wrong with this? When a new version of the code is deployed the existing processes won’t be shut down gracefully. Graceful in this context means the work either finishes within the grace period or the job is put back on the queue to be picked up by another worker.
If the two worker processes don’t shut down gracefully then there’s potential for jobs to be lost between deployments. Running workers in the background can also lead to crashed processes not being restarted by the platform.
Is This Really a Problem? Prove It!
For this experiment highly sophisticated tick & tock worker scripts are required. Here’s tick.rb (and you can imagine how tock.rb works):
#!/usr/bin/env ruby
STDOUT.sync = true

def info
  "PID=#{Process.pid} WORKER_VERSION=1"
end

puts "TICK: Starting process... #{info}"

trap "TERM" do
  puts "TICK: Received SIGTERM #{info}"
  sleep 5 # Pretend to be finishing some hard work
  puts "TICK: Graceful shutdown #{info}"
  exit
end

loop do
  puts "TICK #{info}"
  sleep 2 # Pretend to be doing hard work
end
If I run this script locally in a terminal window:
chmod +x ./tick.rb
./tick.rb
It produces output like this indefinitely:
TICK: Starting process... PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1
TICK PID=10346 WORKER_VERSION=1
If one terminates the tick process by reference to its PID (in another terminal window):
kill 10346
Then the original terminal window displays:
TICK: Received SIGTERM PID=10346 WORKER_VERSION=1
TICK: Graceful shutdown PID=10346 WORKER_VERSION=1
The first line is displayed immediately - showing the request for termination has been received by the trap line of the Ruby script. The second line is displayed five seconds later, which is the delay the script requested. This proves the tick process was given time to complete its ‘hard work’.
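As for tock.rb, it's presumably identical apart from the label, along these lines:

#!/usr/bin/env ruby
STDOUT.sync = true

def info
  "PID=#{Process.pid} WORKER_VERSION=1"
end

puts "TOCK: Starting process... #{info}"

trap "TERM" do
  puts "TOCK: Received SIGTERM #{info}"
  sleep 5 # Pretend to be finishing some hard work
  puts "TOCK: Graceful shutdown #{info}"
  exit
end

loop do
  puts "TOCK #{info}"
  sleep 2 # Pretend to be doing hard work
end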
Lift-off: To the Cloud!
Now it’s time to deploy tick and tock to Heroku. First up craft a Procfile like this:
worker: ./tick.rb & ./tock.rb
After committing the Procfile along with tick.rb and tock.rb the app is ready to be deployed. Once the worker process type is scaled to one dyno, the Heroku logs should show alternating lines from the tick and tock processes:
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
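For anyone following along, the commands to deploy, scale and watch the logs look roughly like this (a sketch assuming the Heroku CLI; the git remote and branch names are assumptions):

git push heroku main     # deploy the committed Procfile and scripts
heroku ps:scale worker=1 # run one dyno of the worker process type
heroku logs --tail       # stream the output shown above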
This works as expected until it’s time to deploy the next version of the workers:
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[api]: Build started by user oli@redacted.com
app[worker.1]: TICK PID=5 WORKER_VERSION=1
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[api]: Deploy 46be478c by user oli@redacted.com
app[api]: Release v2 created by user oli@redacted.com
heroku[worker.1]: Restarting
heroku[worker.1]: State changed from up to starting
app[worker.1]: TOCK PID=6 WORKER_VERSION=1
app[worker.1]: TOCK: Received SIGTERM PID=6 WORKER_VERSION=1
heroku[worker.1]: Stopping all processes with SIGTERM
heroku[worker.1]: Process exited with status 143
heroku[worker.1]: Starting process with command `./tick.rb & ./tock.rb`
heroku[worker.1]: State changed from starting to up
app[worker.1]: TICK: Starting process... PID=5 WORKER_VERSION=2
app[worker.1]: TICK PID=5 WORKER_VERSION=2
app[worker.1]: TOCK: Starting process... PID=6 WORKER_VERSION=2
app[worker.1]: TOCK PID=6 WORKER_VERSION=2
The exit status was 143, which suggests an error occurred (143 is 128 + 15, i.e. the process was terminated by SIGTERM rather than exiting cleanly). Also there’s no graceful shutdown message from either the tick or tock process. If the processes had been handling real jobs these may have been lost.
To address this the Procfile needs to be updated to declare two separate process types:
tick: ./tick.rb
tock: ./tock.rb
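Switching over might look something like this with the Heroku CLI (a sketch; the process type names follow the Procfile above):

# Stop the combined worker dyno and run one dyno of each new process type
heroku ps:scale worker=0 tick=1 tock=1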
After scaling these new process types to one dyno each, it’s time to deploy yet another new version to observe the deployment transition:
app[tick.1]: TICK PID=4 WORKER_VERSION=3
app[tock.1]: TOCK PID=4 WORKER_VERSION=3
app[api]: Build started by user oli@redacted.com
app[tick.1]: TICK PID=4 WORKER_VERSION=3
app[tock.1]: TOCK PID=4 WORKER_VERSION=3
app[api]: Deploy 3546fe29 by user oli@redacted.com
app[api]: Release v4 created by user oli@redacted.com
heroku[tick.1]: Restarting
heroku[tick.1]: State changed from up to starting
heroku[tock.1]: Restarting
heroku[tock.1]: State changed from up to starting
heroku[tick.1]: Stopping all processes with SIGTERM
heroku[tock.1]: Stopping all processes with SIGTERM
app[tick.1]: TICK: Received SIGTERM PID=4 WORKER_VERSION=3
app[tock.1]: TOCK: Received SIGTERM PID=4 WORKER_VERSION=3
heroku[tick.1]: Starting process with command `./tick.rb`
heroku[tock.1]: Starting process with command `./tock.rb`
heroku[tick.1]: State changed from starting to up
heroku[tock.1]: State changed from starting to up
app[tick.1]: TICK: Starting process... PID=4 WORKER_VERSION=4
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tock.1]: TOCK: Starting process... PID=4 WORKER_VERSION=4
app[tock.1]: TOCK PID=4 WORKER_VERSION=4
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tick.1]: TICK: Graceful shutdown PID=4 WORKER_VERSION=3
app[tock.1]: TOCK: Graceful shutdown PID=4 WORKER_VERSION=3
heroku[tick.1]: Process exited with status 0
heroku[tock.1]: Process exited with status 0
app[tick.1]: TICK PID=4 WORKER_VERSION=4
app[tock.1]: TOCK PID=4 WORKER_VERSION=4
Hooray! Now the graceful shutdown message is printed for both the tick and tock worker processes. The exit status of both outgoing processes was 0, indicating there was no error. This means Heroku waited five seconds for the workers to shut down (as requested by our scripts).
On Heroku you get up to 30 seconds to shut down a process:
The application processes have 30 seconds to shut down cleanly (ideally, they will do so more quickly than that). During this time they should stop accepting new requests or jobs and attempt to finish their current requests, or put jobs back on the queue for other worker processes to handle. If any processes remain after that time period, the dyno manager will terminate them forcefully with SIGKILL.
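Sidekiq, for example, keeps its own shutdown timeout comfortably inside that window: it defaults to 25 seconds and can be tuned with the -t flag, along these lines:

worker: bundle exec sidekiq -q default -t 25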
Multiple Processes Per Dyno
Why were two processes being declared together on one line of the Procfile? This decision was probably made to increase the resource utilization of the dyno running the worker process, which can be more cost effective than paying for more dynos which are less heavily utilized. In practice this implementation made the system less robust so I’d argue it’s not a good trade-off.
It is safe to run multiple processes per dyno if there’s a master process to coordinate a bunch of child processes. This is how the Puma web server runs in clustered mode. Sidekiq also has the ability to coordinate multiple processes in its Enterprise version.
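For example, Puma's clustered mode is enabled by declaring worker processes in its config file; a rough sketch of a config/puma.rb (the values are illustrative):

# config/puma.rb (sketch): the master process forks and supervises the children
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2)) # number of child processes
threads 5, 5                                     # min/max threads per child
preload_app!                                     # load the app once before forking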
Time for a Checkup
It’s worth checking any long running process exits gracefully when it receives
a message to shutdown, not just job queue workers. Time to go review the
Procfile
s (and equivalents) in your codebases!