When working with Sidekiq, the framework guarantees that a background job will execute at least once.[^1] Under normal operation, the process runs smoothly and queues of jobs are processed without issue. It's critical to realize, however, that there is a small chance of job loss when using the open-source version of Sidekiq: a guarantee that jobs execute at least once is not a guarantee of job reliability.
Jobs that exist in memory may be lost
To clarify, the open-source core of Sidekiq provides a basic job-fetching strategy (basic_fetch) that aims to be simple and efficient. This algorithm is ideal in many scenarios because it avoids polling Redis with additional requests. It uses the Redis BRPOP command to fetch jobs in constant time (O(1) complexity) while also blocking, allowing the client to wait until a new job is pushed to the queue.
Despite this efficiency, the tradeoff is that there is a period of time during which a job exists only in memory. That is, when a job is fetched for processing, its data is popped off the queue list and is no longer persisted in Redis. Sidekiq then executes the job and handles errors. The danger of job loss arises when the process terminates before the job completes or has its data written back to Redis.
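That in-memory window can be illustrated with a toy simulation (plain Ruby, no real Redis; the queue contents and names are illustrative):

```ruby
# Toy simulation of the basic_fetch window (no real Redis; names are illustrative).
queue = ["job-1", "job-2"] # stands in for the Redis list behind a Sidekiq queue

# BRPOP pops the job off the tail of the list. From this moment on,
# the job's data lives only in this process's memory.
job = queue.pop

# If the process is killed right here, `job` is gone: nothing in `queue`
# refers to it anymore, so there is nothing left in Redis to recover.
```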
Sidekiq recovers jobs most of the time
Fortunately, Sidekiq recovers jobs 99% of the time[^2] because it attempts a graceful shutdown procedure under normal shutdown conditions (such as when it handles termination signals like SIGINT or SIGTERM). During this graceful shutdown, the fetching of new jobs stops; running jobs are given a time limit to complete; and any jobs still running at the time limit are written back to Redis.
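The three shutdown steps can be sketched as a toy simulation (plain Ruby, no real Redis or signal handling; the job names are illustrative):

```ruby
# Toy simulation of Sidekiq's graceful-shutdown steps (no real Redis;
# names and jobs are illustrative).
fetching = true
running  = ["job-A", "job-B"] # jobs currently executing in worker threads
queue    = []                 # stands in for the Redis queue list

# 1. On SIGTERM/SIGINT, stop fetching new jobs.
fetching = false

# 2. Running jobs get a deadline to finish (Sidekiq's -t option).
running.delete("job-A") # pretend job-A completed within the deadline

# 3. Jobs still running at the deadline are pushed back to Redis.
queue.concat(running)
running.clear
```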
Job loss occurs when Sidekiq is unable to shut down properly. Should the process crash or receive a KILL signal, Sidekiq cannot complete the steps necessary to write data back to Redis. As a result, the data for all running jobs is lost.
The SuperFetch strategy prevents job loss
To solve this, Sidekiq Pro provides a super_fetch strategy[^3] to increase job reliability.[^4] Once you have a license and Sidekiq Pro installed, the configuration to switch from basic_fetch to super_fetch is straightforward:
```ruby
Sidekiq::Client.reliable_push! unless Rails.env.test?

Sidekiq.configure_server do |config|
  config.super_fetch!
  config.reliable_scheduler!
end
```
The super_fetch strategy uses the Redis LMOVE command to maintain a list of running jobs in Redis, ensuring a job is never removed from Redis until its completion is acknowledged. While this command still runs in constant time, it does not block, so Sidekiq must poll Redis to check for new jobs. The increased job resiliency therefore comes at the cost of additional network traffic and CPU usage. Yet, when job loss threatens significant harm to the end-user experience, the tradeoff pays off.
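The core idea can be illustrated with a toy simulation (plain Ruby, no real Redis; the list names are illustrative, not Sidekiq Pro's actual key names):

```ruby
# Toy simulation of the super_fetch idea (no real Redis; names are illustrative).
# LMOVE atomically moves a job from the queue to a per-process "working" list,
# so the job is always persisted somewhere in Redis while it runs.
queue   = ["job-1"]
working = [] # stands in for the private working list kept in Redis

job = queue.pop    # LMOVE: the job leaves the queue...
working.push(job)  # ...and lands on the working list in the same atomic step

# ...the job executes here; if the process dies, `working` still holds it,
# and a new Sidekiq process can recover it from the orphaned working list...

working.delete(job) # acknowledged: removed from Redis only after completion
```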
[^1]: Sidekiq makes a guarantee that jobs will run at least once, not exactly once: https://github.com/sidekiq/sidekiq/wiki/Best-Practices#2-make-your-job-idempotent-and-transactional
[^2]: The Sidekiq Wiki notes that graceful shutdowns are effective at recovering unfinished jobs 99% of the time: https://github.com/sidekiq/sidekiq/wiki/Pro-Reliability-Server
[^3]: The super_fetch strategy attempts to solve the existing drawbacks of the basic_fetch strategy: https://github.com/sidekiq/sidekiq/wiki/Pro-Reliability-Server#super_fetch
[^4]: The Sidekiq Wiki discusses considerations and details regarding job reliability: https://github.com/sidekiq/sidekiq/wiki/Reliability