When working with Sidekiq, the framework guarantees that a background job will execute at least once.[^1] Under normal operation, the process runs smoothly and queues of jobs are processed without issue. It's critical to realize, however, that there is a small chance of job loss when using the open-source version of Sidekiq: a guarantee that jobs execute at least once is not a guarantee of job reliability.
Jobs that exist in memory may be lost
To clarify, the open-source core of Sidekiq provides a basic job-fetching strategy (basic_fetch) that aims to be simple and efficient. This algorithm is ideal in many scenarios because it avoids polling Redis with additional requests. It uses the Redis BRPOP command to fetch jobs in constant time (O(1) complexity) while also blocking, allowing the client to wait until a new job is pushed to the queue.
Despite this efficiency, the tradeoff is that there is a period of time during which a job exists only in memory. That is, when a job is fetched for processing, its data is popped off the queue list and is no longer persisted in Redis. Sidekiq then executes the job and handles errors. The danger of job loss arises when the process terminates before the job completes or has its data written back to Redis.
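That in-memory window can be illustrated with a toy simulation (plain Ruby, no real Redis; the queue contents and names are illustrative):

```ruby
# Toy simulation of the basic_fetch window (no real Redis; names are illustrative).
queue = ["job-1", "job-2"] # stands in for the Redis list behind a Sidekiq queue

# BRPOP pops the job off the tail of the list. From this moment on,
# the job's data lives only in this process's memory.
job = queue.pop

# If the process is killed right here, `job` is gone: nothing in `queue`
# refers to it anymore, so there is nothing left in Redis to recover.
```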
Sidekiq recovers jobs most of the time
Fortunately, Sidekiq recovers jobs 99% of the time[^2] because it attempts a graceful shutdown procedure under normal shutdown conditions (such as when it handles termination signals like SIGINT or SIGTERM). During this graceful shutdown, the fetching of new jobs stops; running jobs are given a time limit to complete; and any jobs still running at the time limit are written back to Redis.
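The three shutdown steps can be sketched as a toy simulation (plain Ruby, no real Redis or signal handling; the job names are illustrative):

```ruby
# Toy simulation of Sidekiq's graceful-shutdown steps (no real Redis;
# names and jobs are illustrative).
fetching = true
running  = ["job-A", "job-B"] # jobs currently executing in worker threads
queue    = []                 # stands in for the Redis queue list

# 1. On SIGTERM/SIGINT, stop fetching new jobs.
fetching = false

# 2. Running jobs get a deadline to finish (Sidekiq's -t option).
running.delete("job-A") # pretend job-A completed within the deadline

# 3. Jobs still running at the deadline are pushed back to Redis.
queue.concat(running)
running.clear
```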
Job loss occurs when Sidekiq is unable to shut down properly. Should the process crash or receive a KILL signal, Sidekiq cannot complete the steps necessary to write data back to Redis. As a result, the data for all running jobs is lost.
The SuperFetch strategy prevents job loss
To solve this, Sidekiq Pro provides a super_fetch strategy[^3] to increase job reliability.[^4] Once you have a license and Sidekiq Pro installed, the configuration to switch from basic_fetch to super_fetch is straightforward:
```ruby
Sidekiq::Client.reliable_push! unless Rails.env.test?

Sidekiq.configure_server do |config|
  config.super_fetch!
  config.reliable_scheduler!
end
```
The super_fetch strategy uses the Redis LMOVE command to maintain a list of running jobs in Redis, ensuring a job is never removed from Redis until its completion is acknowledged. While this command still runs in constant time, it does not block, so Sidekiq must poll Redis to check for new jobs. The increased job resiliency therefore comes at the cost of additional network traffic and CPU usage. Yet, when job loss threatens significant harm to the end-user experience, the tradeoff pays off.
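The core idea can be illustrated with a toy simulation (plain Ruby, no real Redis; the list names are illustrative, not Sidekiq Pro's actual key names):

```ruby
# Toy simulation of the super_fetch idea (no real Redis; names are illustrative).
# LMOVE atomically moves a job from the queue to a per-process "working" list,
# so the job is always persisted somewhere in Redis while it runs.
queue   = ["job-1"]
working = [] # stands in for the private working list kept in Redis

job = queue.pop    # LMOVE: the job leaves the queue...
working.push(job)  # ...and lands on the working list in the same atomic step

# ...the job executes here; if the process dies, `working` still holds it,
# and a new Sidekiq process can recover it from the orphaned working list...

working.delete(job) # acknowledged: removed from Redis only after completion
```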
[^1]: Sidekiq makes a guarantee that jobs will run at least once, not exactly once: https://github.com/sidekiq/sidekiq/wiki/Best-Practices#2-make-your-job-idempotent-and-transactional
[^2]: The Sidekiq Wiki notes that graceful shutdowns are effective at recovering unfinished jobs 99% of the time: https://github.com/sidekiq/sidekiq/wiki/Pro-Reliability-Server
[^3]: The super_fetch strategy attempts to solve the existing drawbacks of the basic_fetch strategy: https://github.com/sidekiq/sidekiq/wiki/Pro-Reliability-Server#super_fetch
[^4]: The Sidekiq Wiki discusses considerations and details regarding job reliability: https://github.com/sidekiq/sidekiq/wiki/Reliability