May 8, 2018
Sidekiq is a background job processing library for Ruby. Sidekiq offers three versions: OSS, Pro and Enterprise.
OSS is free and open source and has basic features. The Pro and Enterprise versions are closed source and paid, and thus come with more advanced features. To compare the list of features offered by each of these versions, please visit the Sidekiq website.
Sidekiq Pro 3.4.0 introduced the super_fetch strategy to reliably fetch jobs from the queue in Redis. In this post, we will discuss the benefits of using the super_fetch strategy.
The open source version of Sidekiq comes with the basic_fetch strategy. Let's see an example to understand how it works. Let's add Sidekiq to our Gemfile and run bundle install to install it.
gem 'sidekiq'
Add the following Sidekiq worker in app/workers/sleep_worker.rb.
class SleepWorker
  include Sidekiq::Worker

  def perform(name)
    puts "Started #{name}"
    sleep 30
    puts "Finished #{name}"
  end
end
This worker does nothing fancy; it just sleeps for 30 seconds.
Let's open Rails console and schedule this worker to run as a background job asynchronously.
>> require "sidekiq/api"
=> true
>> Sidekiq::Queue.new.size
=> 0
>> SleepWorker.perform_async("A")
=> "5d8bf898c36a60a1096cf4d3"
>> Sidekiq::Queue.new.size
=> 1
As we can see, the queue now has 1 job scheduled to be processed.
Let's start Sidekiq in another terminal tab.
$ bundle exec sidekiq
40510 TID-owu1swr1i INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-40510", :url=>nil}
40510 TID-owu1swr1i INFO: Starting processing, hit Ctrl-C to stop
40510 TID-owu1tr5my SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: start
Started A
As we can see, the job with ID 5d8bf898c36a60a1096cf4d3 was picked up by Sidekiq, which started processing it.
If we check the Sidekiq queue size in the Rails console, it will be zero now.
>> Sidekiq::Queue.new.size
=> 0
Let's shut down the Sidekiq process gracefully while it is still in the middle of processing our scheduled job. Press Ctrl-C or run the kill -SIGINT <PID> command.
$ kill -SIGINT 40510
40510 TID-owu1swr1i INFO: Shutting down
40510 TID-owu1swr1i INFO: Terminating quiet workers
40510 TID-owu1x00rm INFO: Scheduler exiting...
40510 TID-owu1swr1i INFO: Pausing to allow workers to finish...
40510 TID-owu1swr1i WARN: Terminating 1 busy worker threads
40510 TID-owu1swr1i WARN: Work still in progress [#<struct Sidekiq::BasicFetch::UnitOfWork queue="queue:default", job="{\"class\":\"SleepWorker\",\"args\":[\"A\"],\"retry\":true,\"queue\":\"default\",\"jid\":\"5d8bf898c36a60a1096cf4d3\",\"created_at\":1525427293.956314,\"enqueued_at\":1525427293.957355}">]
40510 TID-owu1swr1i INFO: Pushed 1 jobs back to Redis
40510 TID-owu1tr5my SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: fail: 19.576 sec
40510 TID-owu1swr1i INFO: Bye!
As we can see, Sidekiq pushed the unfinished job back to the Redis queue when it received a SIGINT signal. Let's verify that.
>> Sidekiq::Queue.new.size
=> 1
Before we move on, let's learn some basics about signals such as SIGINT.
SIGINT is an interrupt signal. It is an alternative to hitting Ctrl-C on the keyboard. When a process is running in the foreground, we can hit Ctrl-C to signal the process to shut down. When the process is running in the background, we can use the kill command to send a SIGINT signal to the process' PID. A process can optionally catch this signal and shut itself down gracefully. If the process does not respect this signal and ignores it, then nothing really happens and the process keeps running. Both INT and SIGINT are identical signals.
Another useful signal is SIGTERM, the termination signal. A process can either catch it and perform the necessary cleanup, or just ignore it. Similar to a SIGINT signal, if a process ignores this signal, the process keeps running. Note that if no signal is supplied to the kill command, SIGTERM is used by default. Both TERM and SIGTERM are identical signals.
SIGTSTP or TSTP is called the terminal stop signal. It is an alternative to hitting Ctrl-Z on the keyboard. This signal causes a process to suspend further execution.
SIGKILL is known as the kill signal. It is intended to kill the process immediately and forcefully. A process cannot catch this signal, therefore it cannot perform cleanup or a graceful shutdown. This signal is used when a process does not respond to either SIGINT or SIGTERM. KILL, SIGKILL and 9 are identical signals.
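To make this concrete, here is a small standalone Ruby sketch (not Sidekiq code, just an illustration) that traps INT and TERM and shuts down gracefully. SIGKILL, on the other hand, cannot be trapped, so no such handler would ever run for it.

# trap_demo.rb - illustration only, not part of Sidekiq.
# Run it, then hit Ctrl-C or send SIGTERM to see the graceful shutdown.
shutdown_requested = false

Signal.trap("INT")  { shutdown_requested = true }
Signal.trap("TERM") { shutdown_requested = true }

until shutdown_requested
  # Pretend to do a unit of work, then check whether a signal arrived.
  sleep 1
end

puts "Received INT or TERM, cleaning up and exiting gracefully"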
There are a lot of other signals besides these, but they are not relevant for this post. Please check them out here.
A Sidekiq process respects all of these signals and behaves as we would expect. When Sidekiq receives a TERM or SIGTERM signal, it terminates itself gracefully.
Coming back to our example from above, we had sent a SIGINT signal to the Sidekiq process.
$ kill -SIGINT 40510
On receiving this SIGINT signal, the Sidekiq process with PID 40510 terminated its quiet workers, paused the queue and waited for a while to let the busy workers finish their jobs. Since our busy SleepWorker did not finish quickly, Sidekiq terminated that busy worker and pushed its job back to the queue in Redis. After that, Sidekiq gracefully terminated itself with exit code 0.
Note that the default timeout is 8 seconds; that is how long Sidekiq waits to let busy workers finish before it pushes the unfinished jobs back to the queue in Redis. This timeout can be changed with the -t option passed when starting the Sidekiq process, for example bundle exec sidekiq -t 25.
Sidekiq recommends sending a TSTP signal followed by a TERM signal to ensure that the Sidekiq process shuts down safely and gracefully. On receiving a TSTP signal, Sidekiq stops pulling in new work and finishes the work that is already in progress. The idea is to first send a TSTP signal, wait as long as possible (by default 8 seconds, as discussed above) to let busy workers finish their jobs, and then send a TERM signal to shut down the process.
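For illustration, the same sequence can be driven from Ruby with Process.kill; the PID and the wait duration below are only placeholders for this example, not values Sidekiq prescribes.

# Illustration only: send TSTP first, wait, then send TERM.
sidekiq_pid = 40510 # placeholder PID, taken from the example output above

# Ask Sidekiq to stop pulling new jobs and finish the ones in progress.
Process.kill("TSTP", sidekiq_pid)

# Wait as long as the deployment allows for busy workers to finish.
sleep 30

# Then ask Sidekiq to shut down gracefully.
Process.kill("TERM", sidekiq_pid)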
Sidekiq pushes the unprocessed job back to Redis when terminated gracefully. This means that when we restart the Sidekiq process, it pulls the unfinished job and starts processing it again.
$ bundle exec sidekiq
45916 TID-ovfq8ll0k INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-45916", :url=>nil}
45916 TID-ovfq8ll0k INFO: Starting processing, hit Ctrl-C to stop
45916 TID-ovfqajol4 SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: start
Started A
Finished A
45916 TID-ovfqajol4 SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: done: 30.015 sec
We can see that Sidekiq pulled the previously terminated job with ID 5d8bf898c36a60a1096cf4d3 and processed that job again. So far so good.
This behavior is implemented by the basic_fetch strategy, which is present in the open source version of Sidekiq. Sidekiq uses the BRPOP Redis command to fetch a scheduled job from the queue. When a job is fetched, it is removed from the queue and no longer exists in Redis. If the fetched job is processed successfully, then all is good. Also, if the Sidekiq process is terminated gracefully on receiving either a SIGINT or a SIGTERM signal, Sidekiq pushes the unfinished jobs back to the queue in Redis.
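Conceptually, basic_fetch boils down to something like the following sketch using the redis gem. This is a simplification rather than Sidekiq's actual code; the queue name queue:default matches the logs above.

require "redis"
require "json"

redis = Redis.new

# BRPOP blocks until a job is available and atomically removes it from the
# queue. From this moment on, the job exists only in this process' memory.
_queue, payload = redis.brpop("queue:default", timeout: 2)

if payload
  job = JSON.parse(payload)
  # If the process crashes here, before the job finishes,
  # the job is gone - it is no longer anywhere in Redis.
  puts "Processing #{job['class']} with args #{job['args']}"
end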
But what if the Sidekiq process crashes in the middle of processing that fetched job? A process is considered crashed if it does not shut down gracefully. As we discussed before, when we send a SIGKILL signal to a process, the process cannot catch this signal. Because it cannot shut down gracefully, it crashes. When a Sidekiq process crashes, the jobs it had fetched but not yet finished are lost forever.
Let's try to reproduce this scenario. We will schedule another job.
>> SleepWorker.perform_async("B")
=> "37a5ab4139796c4b9dc1ea6d"
>> Sidekiq::Queue.new.size
=> 1
Now, let's start the Sidekiq process and kill it using a SIGKILL (or 9) signal.
$ bundle exec sidekiq
47395 TID-ow8q4nxzf INFO: Starting processing, hit Ctrl-C to stop
47395 TID-ow8qba0x7 SleepWorker JID-37a5ab4139796c4b9dc1ea6d INFO: start
Started B
[1] 47395 killed bundle exec sidekiq
$ kill -SIGKILL 47395
Let's check if Sidekiq had pushed the busy (unprocessed) job back to the queue in Redis before terminating.
>> Sidekiq::Queue.new.size
=> 0
No, it did not. The Sidekiq process did not get a chance to shut down gracefully when it received the SIGKILL signal. If we restart the Sidekiq process, it cannot fetch that unprocessed job since the job was never pushed back to the queue in Redis.
$ bundle exec sidekiq
47733 TID-ox1lau26l INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-47733", :url=>nil}
47733 TID-ox1lau26l INFO: Starting processing, hit Ctrl-C to stop
Therefore, the job with the name argument B and ID 37a5ab4139796c4b9dc1ea6d is completely lost. There is no way to get that job back. Losing jobs like this may not be a problem for some applications, but for critical applications it could be a huge issue.
We faced a similar problem. One of our clients' applications is deployed on a Kubernetes cluster. Our Sidekiq process runs in a Docker container inside Kubernetes pods which we call background pods.
Here's a stripped-down version of our Kubernetes deployment manifest, which creates a Kubernetes Deployment resource. Our Sidekiq process runs in the pods spawned by that Deployment.
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: background
spec:
  replicas: 2
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: background
          image: <%= ENV['IMAGE'] %>
          env:
            - name: POD_TYPE
              value: background
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/bash
                  - -l
                  - -c
                  - for pid in tmp/pids/sidekiq*.pid; do bin/bundle exec sidekiqctl stop $pid 60; done
When we apply an updated version of this manifest, for example to change the Docker image, the running pods are terminated and new pods are created. Before terminating the only container in the pod, Kubernetes executes the sidekiqctl stop $pid 60 command which we have defined using the preStop event handler.
Note that Kubernetes also sends a SIGTERM signal to the container being terminated inside the pod, once the preStop handler has completed. The default termination grace period is 30 seconds, and it is configurable. If the container doesn't terminate within the termination grace period, a SIGKILL signal is sent to forcefully terminate the container.
The sidekiqctl stop $pid 60 command executed in the preStop handler does three things:
1. Sends a SIGTERM signal to the Sidekiq process running in the container.
2. Waits up to 60 seconds for that process to shut down gracefully.
3. Sends a SIGKILL signal to kill the Sidekiq process forcefully if the process has not terminated gracefully yet.
This worked for us when the count of busy jobs was relatively small.
When the number of in-progress jobs is higher, Sidekiq does not get enough time to quiet the busy workers and fails to push some of the jobs back to the Redis queue.
We found that some of the jobs were getting lost when our background pod restarted. We had to restart our background pod for reasons such as updating the Kubernetes deployment manifest, or the pod being automatically evicted by Kubernetes because the host node ran into an OOM (out of memory) issue.
We tried increasing both terminationGracePeriodSeconds in the deployment manifest and the sidekiqctl stop command's timeout. Despite that, we kept facing the same issue of losing jobs whenever a pod restarted.
We even tried sending a TSTP and then a TERM after a timeout considerably longer than 60 seconds. But the pod was still being terminated harshly, without gracefully terminating the Sidekiq process running inside it. Therefore we kept losing the busy jobs which were running during pod termination.
We were looking for a way to stop losing our Sidekiq jobs, or a way to recover them reliably, when our background Kubernetes pod restarts. We realized that the commercial version of Sidekiq, Sidekiq Pro, offers an additional fetch strategy, super_fetch, which seemed more efficient and reliable compared to the basic_fetch strategy. Let's see what difference the super_fetch strategy makes over basic_fetch.
We will need to use the sidekiq-pro gem, which needs to be purchased. Since the Sidekiq Pro gem is closed source, we cannot fetch it from the default public gem registry, https://rubygems.org. Instead, we have to fetch it from a private gem registry which we get access to after purchasing it.
We add the following code to our Gemfile and run bundle install.
source ENV['SIDEKIQ_PRO_GEM_URL'] do
  gem 'sidekiq-pro'
end
To enable super_fetch, we need to add the following code in an initializer, config/initializers/sidekiq.rb.
Sidekiq.configure_server do |config|
  config.super_fetch!
end
Well, that's it. Sidekiq will now use super_fetch instead of basic_fetch.
$ bundle exec sidekiq
75595 TID-owsytgvqj INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
75595 TID-owsytgvqj INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-75595", :url=>nil}
75595 TID-owsytgvqj INFO: Starting processing, hit Ctrl-C to stop
75595 TID-owsys5imz INFO: SuperFetch activated
When super_fetch is activated, the Sidekiq process' graceful shutdown behavior is similar to that of basic_fetch.
>> SleepWorker.perform_async("C")
=> "f002a41393f9a79a4366d2b5"
>> Sidekiq::Queue.new.size
=> 1
$ bundle exec sidekiq
76021 TID-ow6kdcca5 INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
76021 TID-ow6kdcca5 INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-76021", :url=>nil}
76021 TID-ow6kdcca5 INFO: Starting processing, hit Ctrl-C to stop
76021 TID-ow6klq2cx INFO: SuperFetch activated
76021 TID-ow6kiesnp SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: start
Started C
>> Sidekiq::Queue.new.size
=> 0
$ kill -SIGTERM 76021
76021 TID-ow6kdcca5 INFO: Shutting down
76021 TID-ow6kdcca5 INFO: Terminating quiet workers
76021 TID-ow6kieuwh INFO: Scheduler exiting...
76021 TID-ow6kdcca5 INFO: Pausing to allow workers to finish...
76021 TID-ow6kdcca5 WARN: Terminating 1 busy worker threads
76021 TID-ow6kdcca5 WARN: Work still in progress [#<struct Sidekiq::Pro::SuperFetch::Retriever::UnitOfWork queue="queue:default", job="{\"class\":\"SleepWorker\",\"args\":[\"C\"],\"retry\":true,\"queue\":\"default\",\"jid\":\"f002a41393f9a79a4366d2b5\",\"created_at\":1525500653.404454,\"enqueued_at\":1525500653.404501}", local_queue="queue:sq|vishal.local:76021:3e64c4b08393|default">]
76021 TID-ow6kdcca5 INFO: SuperFetch: Moving job from queue:sq|vishal.local:76021:3e64c4b08393|default back to queue:default
76021 TID-ow6kiesnp SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: fail: 13.758 sec
76021 TID-ow6kdcca5 INFO: Bye!
>> Sidekiq::Queue.new.size
=> 1
That looks good. As we can see, Sidekiq moved the busy job from its private queue back to the public queue in Redis when it received a SIGTERM signal.
Now, let's try to kill the Sidekiq process forcefully, without allowing a graceful shutdown, by sending a SIGKILL signal. Since Sidekiq was shut down gracefully before, if we restart it, it will re-process the pushed back job with ID f002a41393f9a79a4366d2b5.
$ bundle exec sidekiq
76890 TID-oxecurbtu INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
76890 TID-oxecurbtu INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-76890", :url=>nil}
76890 TID-oxecurbtu INFO: Starting processing, hit Ctrl-C to stop
76890 TID-oxecyhftq INFO: SuperFetch activated
76890 TID-oxecyotvm SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: start
Started C
[1] 76890 killed bundle exec sidekiq
$ kill -SIGKILL 76890
>> Sidekiq::Queue.new.size
=> 0
It appears that Sidekiq didn't get a chance to push the busy job back to the queue in Redis on receiving the SIGKILL signal. So, where is the magic of super_fetch? Did we lose our job again? Let's restart Sidekiq and see for ourselves.
$ bundle exec sidekiq
77496 TID-oum04ghgw INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
77496 TID-oum04ghgw INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-77496", :url=>nil}
77496 TID-oum04ghgw INFO: Starting processing, hit Ctrl-C to stop
77496 TID-oum086w9s INFO: SuperFetch activated
77496 TID-oum086w9s WARN: SuperFetch: recovered 1 jobs
77496 TID-oum08eu3o SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: start
Started C
Finished C
77496 TID-oum08eu3o SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: done: 30.011 sec
Whoa, isn't that cool? See the line which says SuperFetch: recovered 1 jobs. Although the job wasn't pushed back to the queue in Redis, Sidekiq somehow recovered our lost job with ID f002a41393f9a79a4366d2b5 and processed it again! Interested in learning how Sidekiq did that? Keep on reading.
Note that, since Sidekiq Pro is closed source commercial software, we cannot explain super_fetch's exact implementation details.
As we discussed in depth before, Sidekiq's basic_fetch strategy uses the BRPOP Redis command to fetch a job from the queue in Redis. It works well to some extent, but it is prone to losing jobs if Sidekiq crashes or does not shut down gracefully. On the other hand, Sidekiq Pro offers the super_fetch strategy, which uses the RPOPLPUSH Redis command to fetch a job.
The RPOPLPUSH Redis command provides a unique approach to implementing a reliable queue. RPOPLPUSH accepts two lists, a source list and a destination list. The command atomically returns and removes the last element from the source list and pushes that element as the first element of the destination list. Atomically means that the pop and the push are performed as a single operation; either both succeed or both are treated as failed.
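As a rough sketch of this reliable queue pattern (this is not Sidekiq Pro's code; the private queue name below is made up, loosely modeled on the queue:sq|... names visible in the logs above):

require "redis"
require "json"

redis = Redis.new

public_queue  = "queue:default"
# Per-process working queue; this name is invented for the example.
private_queue = "queue:sq|example-host:1234|default"

# Atomically move a job from the public queue to the private working queue.
# Even if this process crashes right after, the job is still in Redis.
payload = redis.rpoplpush(public_queue, private_queue)

if payload
  job = JSON.parse(payload)
  puts "Processing #{job['class']} with args #{job['args']}"

  # Only after the job finishes successfully is it removed
  # from the private working queue.
  redis.lrem(private_queue, 1, payload)
end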
super_fetch registers a private queue in Redis for each Sidekiq process on start-up. super_fetch atomically fetches a scheduled job from the public queue in Redis and pushes that job into the private queue (or working queue) using the RPOPLPUSH Redis command. Once the job has finished processing, Sidekiq removes it from the private queue. During a graceful shutdown, Sidekiq moves the unfinished jobs back from the private queue to the public queue. If the shutdown of a Sidekiq process is not graceful, the unfinished jobs of that process remain in its private queue; these are called orphaned jobs. On restarting, or on starting another Sidekiq process, super_fetch looks for such orphaned jobs in the private queues. If Sidekiq finds orphaned jobs, it re-enqueues them and processes them again.
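Conceptually, the recovery step could look something like the sketch below. Again, this is only an illustration and not Sidekiq Pro's actual implementation; in particular, deciding which private queues belong to dead processes is the hard part that super_fetch handles for us, and the queue name pattern is an assumption based on the logs above.

require "redis"

redis = Redis.new

# Walk the private working queues (assumed here to match "queue:sq|*|default")
# and push any jobs left behind by dead processes back to the public queue.
redis.scan_each(match: "queue:sq|*|default") do |private_queue|
  while redis.rpoplpush(private_queue, "queue:default")
    puts "Recovered a job from #{private_queue}"
  end
end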
It may happen that we have multiple Sidekiq processes running at the same time. If one of them dies, its unfinished jobs become orphans. This Sidekiq wiki describes in detail the criteria super_fetch relies upon to identify which jobs are orphaned and which are not.
If we don't restart or start another Sidekiq process, super_fetch may take anywhere from 5 minutes to 3 hours to recover such orphaned jobs. The recommended approach is to restart or start another Sidekiq process to signal super_fetch to look for orphans.
Interestingly, in older versions of Sidekiq Pro, super_fetch performed its check for orphaned jobs and queues every 24 hours, at Sidekiq process startup. Due to this, when a Sidekiq process crashed, its orphaned jobs could remain unpicked for up to 24 hours, until the next restart. This orphan check delay window was later lowered to 1 hour in Sidekiq Pro 3.4.1.
Another fun fact: older versions of Sidekiq Pro shipped two other fetch strategies, namely reliable_fetch and timed_fetch. Apparently, reliable_fetch did not work with Docker, and timed_fetch had an asymptotic computational complexity of O(log N), which is less efficient than super_fetch's O(1). Both of these strategies were deprecated in Sidekiq Pro 3.4.0 in favor of super_fetch. Later, both were removed in Sidekiq Pro 4.0 and are no longer documented.
We have enabled super_fetch in our application and it seems to be working without any major issues so far. Our Kubernetes background pods do not seem to be losing any jobs when they are restarted.
Update: Mike Perham, the author of Sidekiq, posted the following comment.
Faktory provides all of the beanstalkd functionality, including the same reliability, with a nicer Web UI. It's free and OSS. https://github.com/contribsys/faktory http://contribsys.com/faktory/