May 8, 2018
Sidekiq is a background job processing library for Ruby. Sidekiq offers three versions: OSS, Pro and Enterprise.
OSS is free and open source and has basic features. The Pro and Enterprise versions are closed source and paid, and thus come with more advanced features. To compare the list of features offered by each of these versions, please visit the Sidekiq website.
Sidekiq Pro 3.4.0 introduced the super_fetch strategy to reliably fetch jobs from the queue in Redis. In this post, we will discuss the benefits of using the super_fetch strategy.
The open source version of Sidekiq comes with the basic_fetch strategy. Let's see an example to understand how it works. Let's add Sidekiq to our Gemfile and run bundle install to install it.
gem 'sidekiq'
Add the following Sidekiq worker in app/workers/sleep_worker.rb.
class SleepWorker
  include Sidekiq::Worker

  def perform(name)
    puts "Started #{name}"
    sleep 30
    puts "Finished #{name}"
  end
end
This worker does nothing fancy; it just sleeps for 30 seconds.
Let's open Rails console and schedule this worker to run as a background job asynchronously.
>> require "sidekiq/api"
=> true
>> Sidekiq::Queue.new.size
=> 0
>> SleepWorker.perform_async("A")
=> "5d8bf898c36a60a1096cf4d3"
>> Sidekiq::Queue.new.size
=> 1
As we can see, the queue now has 1 job scheduled to be processed.
Let's start Sidekiq in another terminal tab.
$ bundle exec sidekiq
40510 TID-owu1swr1i INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-40510", :url=>nil}
40510 TID-owu1swr1i INFO: Starting processing, hit Ctrl-C to stop
40510 TID-owu1tr5my SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: start
Started A
As we can see, the job with ID 5d8bf898c36a60a1096cf4d3 was picked up by Sidekiq, which started processing it.
If we check the Sidekiq queue size in the Rails console, it will be zero now.
>> Sidekiq::Queue.new.size
=> 0
Let's shut down the Sidekiq process gracefully while it is still in the middle of processing our scheduled job. Press Ctrl-C or run the kill -SIGINT <PID> command.
$ kill -SIGINT 40510
40510 TID-owu1swr1i INFO: Shutting down
40510 TID-owu1swr1i INFO: Terminating quiet workers
40510 TID-owu1x00rm INFO: Scheduler exiting...
40510 TID-owu1swr1i INFO: Pausing to allow workers to finish...
40510 TID-owu1swr1i WARN: Terminating 1 busy worker threads
40510 TID-owu1swr1i WARN: Work still in progress [#<struct Sidekiq::BasicFetch::UnitOfWork queue="queue:default", job="{\"class\":\"SleepWorker\",\"args\":[\"A\"],\"retry\":true,\"queue\":\"default\",\"jid\":\"5d8bf898c36a60a1096cf4d3\",\"created_at\":1525427293.956314,\"enqueued_at\":1525427293.957355}">]
40510 TID-owu1swr1i INFO: Pushed 1 jobs back to Redis
40510 TID-owu1tr5my SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: fail: 19.576 sec
40510 TID-owu1swr1i INFO: Bye!
As we can see, Sidekiq pushed the unfinished job back to the Redis queue when it received a SIGINT signal. Let's verify that.
>> Sidekiq::Queue.new.size
=> 1
Before we move on, let's learn some basics about signals such as SIGINT.
SIGINT is an interrupt signal. It is an alternative to hitting Ctrl-C on the keyboard. When a process is running in the foreground, we can hit Ctrl-C to signal the process to shut down. When the process is running in the background, we can use the kill command to send a SIGINT signal to the process' PID. A process can optionally catch this signal and shut itself down gracefully. If the process does not respect this signal and ignores it, then nothing really happens and the process keeps running. Both INT and SIGINT are identical signals.
Another useful signal is SIGTERM, the termination signal. A process can either catch it and perform the necessary cleanup, or just ignore it. Similar to a SIGINT signal, if a process ignores this signal, the process keeps running. Note that if no signal is supplied to the kill command, SIGTERM is used by default. Both TERM and SIGTERM are identical signals.
SIGTSTP or TSTP is called the terminal stop signal. It is an alternative to hitting Ctrl-Z on the keyboard. This signal causes a process to suspend further execution.
SIGKILL is known as the kill signal. It is intended to kill the process immediately and forcefully. A process cannot catch this signal, therefore it cannot perform cleanup or a graceful shutdown. This signal is used when a process does not respond to either SIGINT or SIGTERM. KILL, SIGKILL and 9 are identical signals.
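To make this concrete, here is a small standalone Ruby sketch (not Sidekiq code, just an illustration) that traps INT and TERM and shuts down gracefully. SIGKILL, on the other hand, cannot be trapped, so no such handler would ever run for it.

# trap_demo.rb - illustration only, not part of Sidekiq.
# Run it, then hit Ctrl-C or send SIGTERM to see the graceful shutdown.
shutdown_requested = false

Signal.trap("INT")  { shutdown_requested = true }
Signal.trap("TERM") { shutdown_requested = true }

until shutdown_requested
  # Pretend to do a unit of work, then check whether a signal arrived.
  sleep 1
end

puts "Received INT or TERM, cleaning up and exiting gracefully"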
There are a lot of other signals besides these, but they are not relevant for this post. Please check them out here.
A Sidekiq process respects all of these signals and behaves as we would expect. When Sidekiq receives a TERM or SIGTERM signal, it terminates itself gracefully.
Coming back to our example from above, we had sent a SIGINT signal to the Sidekiq process.
$ kill -SIGINT 40510
On receiving this SIGINT signal, the Sidekiq process with PID 40510 terminated its quiet workers, paused the queue and waited for a while to let the busy workers finish their jobs. Since our busy SleepWorker did not finish quickly, Sidekiq terminated that busy worker and pushed its job back to the queue in Redis. After that, Sidekiq gracefully terminated itself with exit code 0.
Note that the default timeout is 8 seconds; that is how long Sidekiq waits to let busy workers finish before it pushes the unfinished jobs back to the queue in Redis. This timeout can be changed with the -t option passed when starting the Sidekiq process, for example bundle exec sidekiq -t 25.
Sidekiq recommends sending a TSTP signal followed by a TERM signal to ensure that the Sidekiq process shuts down safely and gracefully. On receiving a TSTP signal, Sidekiq stops pulling in new work and finishes the work that is already in progress. The idea is to first send a TSTP signal, wait as long as possible (by default 8 seconds, as discussed above) to let busy workers finish their jobs, and then send a TERM signal to shut down the process.
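For illustration, the same sequence can be driven from Ruby with Process.kill; the PID and the wait duration below are only placeholders for this example, not values Sidekiq prescribes.

# Illustration only: send TSTP first, wait, then send TERM.
sidekiq_pid = 40510 # placeholder PID, taken from the example output above

# Ask Sidekiq to stop pulling new jobs and finish the ones in progress.
Process.kill("TSTP", sidekiq_pid)

# Wait as long as the deployment allows for busy workers to finish.
sleep 30

# Then ask Sidekiq to shut down gracefully.
Process.kill("TERM", sidekiq_pid)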
Sidekiq pushes the unprocessed job back to Redis when terminated gracefully. This means that when we restart the Sidekiq process, it pulls the unfinished job and starts processing it again.
$ bundle exec sidekiq
45916 TID-ovfq8ll0k INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-45916", :url=>nil}
45916 TID-ovfq8ll0k INFO: Starting processing, hit Ctrl-C to stop
45916 TID-ovfqajol4 SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: start
Started A
Finished A
45916 TID-ovfqajol4 SleepWorker JID-5d8bf898c36a60a1096cf4d3 INFO: done: 30.015 sec
We can see that Sidekiq pulled the previously terminated job with ID 5d8bf898c36a60a1096cf4d3 and processed that job again. So far so good.
This behavior is implemented by the basic_fetch strategy, which is present in the open source version of Sidekiq. Sidekiq uses the BRPOP Redis command to fetch a scheduled job from the queue. When a job is fetched, it is removed from the queue and no longer exists in Redis. If the fetched job is processed successfully, then all is good. Also, if the Sidekiq process is terminated gracefully on receiving either a SIGINT or a SIGTERM signal, Sidekiq pushes the unfinished jobs back to the queue in Redis.
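Conceptually, basic_fetch boils down to something like the following sketch using the redis gem. This is a simplification rather than Sidekiq's actual code; the queue name queue:default matches the logs above.

require "redis"
require "json"

redis = Redis.new

# BRPOP blocks until a job is available and atomically removes it from the
# queue. From this moment on, the job exists only in this process' memory.
_queue, payload = redis.brpop("queue:default", timeout: 2)

if payload
  job = JSON.parse(payload)
  # If the process crashes here, before the job finishes,
  # the job is gone - it is no longer anywhere in Redis.
  puts "Processing #{job['class']} with args #{job['args']}"
end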
But what if the Sidekiq process crashes in the middle of processing that fetched job? A process is considered crashed if it does not shut down gracefully. As we discussed before, when we send a SIGKILL signal to a process, the process cannot catch this signal. Because it cannot shut down gracefully, it crashes. When a Sidekiq process crashes, the jobs it had fetched but not yet finished are lost forever.
Let's try to reproduce this scenario. We will schedule another job.
>> SleepWorker.perform_async("B")
=> "37a5ab4139796c4b9dc1ea6d"
>> Sidekiq::Queue.new.size
=> 1
Now, let's start the Sidekiq process and kill it using a SIGKILL (or 9) signal.
$ bundle exec sidekiq
47395 TID-ow8q4nxzf INFO: Starting processing, hit Ctrl-C to stop
47395 TID-ow8qba0x7 SleepWorker JID-37a5ab4139796c4b9dc1ea6d INFO: start
Started B
[1] 47395 killed bundle exec sidekiq
$ kill -SIGKILL 47395
Let's check if Sidekiq had pushed the busy (unprocessed) job back to the queue in Redis before terminating.
>> Sidekiq::Queue.new.size
=> 0
No, it did not. The Sidekiq process did not get a chance to shut down gracefully when it received the SIGKILL signal. If we restart the Sidekiq process, it cannot fetch that unprocessed job since the job was never pushed back to the queue in Redis.
$ bundle exec sidekiq
47733 TID-ox1lau26l INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-47733", :url=>nil}
47733 TID-ox1lau26l INFO: Starting processing, hit Ctrl-C to stop
Therefore, the job with the name argument B and ID 37a5ab4139796c4b9dc1ea6d is completely lost. There is no way to get that job back. Losing jobs like this may not be a problem for some applications, but for critical applications it could be a huge issue.
We faced a similar problem. One of our clients' applications is deployed on a Kubernetes cluster. Our Sidekiq process runs in a Docker container inside Kubernetes pods which we call background pods.
Here's a stripped-down version of our Kubernetes deployment manifest, which creates a Kubernetes Deployment resource. Our Sidekiq process runs in the pods spawned by that Deployment.
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: background
spec:
  replicas: 2
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: background
          image: <%= ENV['IMAGE'] %>
          env:
            - name: POD_TYPE
              value: background
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/bash
                  - -l
                  - -c
                  - for pid in tmp/pids/sidekiq*.pid; do bin/bundle exec sidekiqctl stop $pid 60; done
When we apply an updated version of this manifest, for example to change the Docker image, the running pods are terminated and new pods are created. Before terminating the only container in the pod, Kubernetes executes the sidekiqctl stop $pid 60 command which we have defined using the preStop event handler.
Note that Kubernetes also sends a SIGTERM signal to the container being terminated inside the pod, once the preStop handler has completed. The default termination grace period is 30 seconds, and it is configurable. If the container doesn't terminate within the termination grace period, a SIGKILL signal is sent to forcefully terminate the container.
The sidekiqctl stop $pid 60 command executed in the preStop handler does three things:
1. Sends a SIGTERM signal to the Sidekiq process running in the container.
2. Waits up to 60 seconds for that process to shut down gracefully.
3. Sends a SIGKILL signal to kill the Sidekiq process forcefully if the process has not terminated gracefully yet.
This worked for us when the count of busy jobs was relatively small.
When the number of in-progress jobs is higher, Sidekiq does not get enough time to quiet the busy workers and fails to push some of the jobs back to the Redis queue.
We found that some of the jobs were getting lost when our background pod restarted. We had to restart our background pod for reasons such as updating the Kubernetes deployment manifest, or the pod being automatically evicted by Kubernetes because the host node ran into an OOM (out of memory) issue.
We tried increasing both terminationGracePeriodSeconds in the deployment manifest and the sidekiqctl stop command's timeout. Despite that, we kept facing the same issue of losing jobs whenever a pod restarted.
We even tried sending a TSTP and then a TERM after a timeout considerably longer than 60 seconds. But the pod was still being terminated harshly, without gracefully terminating the Sidekiq process running inside it. Therefore we kept losing the busy jobs which were running during pod termination.
We were looking for a way to stop losing our Sidekiq jobs, or a way to recover them reliably, when our background Kubernetes pod restarts. We realized that the commercial version of Sidekiq, Sidekiq Pro, offers an additional fetch strategy, super_fetch, which seemed more efficient and reliable compared to the basic_fetch strategy. Let's see what difference the super_fetch strategy makes over basic_fetch.
We will need to use the sidekiq-pro gem, which needs to be purchased. Since the Sidekiq Pro gem is closed source, we cannot fetch it from the default public gem registry, https://rubygems.org. Instead, we have to fetch it from a private gem registry which we get access to after purchasing it.
We add the following code to our Gemfile and run bundle install.
source ENV['SIDEKIQ_PRO_GEM_URL'] do
  gem 'sidekiq-pro'
end
To enable super_fetch, we need to add the following code in an initializer, config/initializers/sidekiq.rb.
Sidekiq.configure_server do |config|
  config.super_fetch!
end
Well, that's it. Sidekiq will now use super_fetch instead of basic_fetch.
$ bundle exec sidekiq
75595 TID-owsytgvqj INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
75595 TID-owsytgvqj INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-75595", :url=>nil}
75595 TID-owsytgvqj INFO: Starting processing, hit Ctrl-C to stop
75595 TID-owsys5imz INFO: SuperFetch activated
When super_fetch is activated, the Sidekiq process' graceful shutdown behavior is similar to that of basic_fetch.
>> SleepWorker.perform_async("C")
=> "f002a41393f9a79a4366d2b5"
>> Sidekiq::Queue.new.size
=> 1
$ bundle exec sidekiq
76021 TID-ow6kdcca5 INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
76021 TID-ow6kdcca5 INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-76021", :url=>nil}
76021 TID-ow6kdcca5 INFO: Starting processing, hit Ctrl-C to stop
76021 TID-ow6klq2cx INFO: SuperFetch activated
76021 TID-ow6kiesnp SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: start
Started C
>> Sidekiq::Queue.new.size
=> 0
$ kill -SIGTERM 76021
76021 TID-ow6kdcca5 INFO: Shutting down
76021 TID-ow6kdcca5 INFO: Terminating quiet workers
76021 TID-ow6kieuwh INFO: Scheduler exiting...
76021 TID-ow6kdcca5 INFO: Pausing to allow workers to finish...
76021 TID-ow6kdcca5 WARN: Terminating 1 busy worker threads
76021 TID-ow6kdcca5 WARN: Work still in progress [#<struct Sidekiq::Pro::SuperFetch::Retriever::UnitOfWork queue="queue:default", job="{\"class\":\"SleepWorker\",\"args\":[\"C\"],\"retry\":true,\"queue\":\"default\",\"jid\":\"f002a41393f9a79a4366d2b5\",\"created_at\":1525500653.404454,\"enqueued_at\":1525500653.404501}", local_queue="queue:sq|vishal.local:76021:3e64c4b08393|default">]
76021 TID-ow6kdcca5 INFO: SuperFetch: Moving job from queue:sq|vishal.local:76021:3e64c4b08393|default back to queue:default
76021 TID-ow6kiesnp SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: fail: 13.758 sec
76021 TID-ow6kdcca5 INFO: Bye!
>> Sidekiq::Queue.new.size
=> 1
That looks good. As we can see, Sidekiq moved the busy job from its private queue back to the public queue in Redis when it received a SIGTERM signal.
Now, let's try to kill the Sidekiq process forcefully, without allowing a graceful shutdown, by sending a SIGKILL signal. Since Sidekiq was shut down gracefully before, if we restart it, it will re-process the pushed back job with ID f002a41393f9a79a4366d2b5.
$ bundle exec sidekiq
76890 TID-oxecurbtu INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
76890 TID-oxecurbtu INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-76890", :url=>nil}
76890 TID-oxecurbtu INFO: Starting processing, hit Ctrl-C to stop
76890 TID-oxecyhftq INFO: SuperFetch activated
76890 TID-oxecyotvm SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: start
Started C
[1] 76890 killed bundle exec sidekiq
$ kill -SIGKILL 76890
>> Sidekiq::Queue.new.size
=> 0
It appears that Sidekiq didn't get a chance to push the busy job back to the queue in Redis on receiving the SIGKILL signal. So, where is the magic of super_fetch? Did we lose our job again? Let's restart Sidekiq and see for ourselves.
$ bundle exec sidekiq
77496 TID-oum04ghgw INFO: Sidekiq Pro 4.0.2, commercially licensed. Thanks for your support!
77496 TID-oum04ghgw INFO: Booting Sidekiq 5.1.3 with redis options {:id=>"Sidekiq-server-PID-77496", :url=>nil}
77496 TID-oum04ghgw INFO: Starting processing, hit Ctrl-C to stop
77496 TID-oum086w9s INFO: SuperFetch activated
77496 TID-oum086w9s WARN: SuperFetch: recovered 1 jobs
77496 TID-oum08eu3o SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: start
Started C
Finished C
77496 TID-oum08eu3o SleepWorker JID-f002a41393f9a79a4366d2b5 INFO: done: 30.011 sec
Whoa, isn't that cool? See the line which says SuperFetch: recovered 1 jobs. Although the job wasn't pushed back to the queue in Redis, Sidekiq somehow recovered our lost job with ID f002a41393f9a79a4366d2b5 and processed it again! Interested in learning how Sidekiq did that? Keep on reading.
Note that, since Sidekiq Pro is closed source commercial software, we cannot explain super_fetch's exact implementation details.
As we discussed in depth before, Sidekiq's basic_fetch strategy uses the BRPOP Redis command to fetch a job from the queue in Redis. It works well to some extent, but it is prone to losing jobs if Sidekiq crashes or does not shut down gracefully. On the other hand, Sidekiq Pro offers the super_fetch strategy, which uses the RPOPLPUSH Redis command to fetch a job.
The RPOPLPUSH Redis command provides a unique approach to implementing a reliable queue. RPOPLPUSH accepts two lists, a source list and a destination list. The command atomically returns and removes the last element from the source list and pushes that element as the first element of the destination list. Atomically means that the pop and the push are performed as a single operation; either both succeed or both are treated as failed.
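As a rough sketch of this reliable queue pattern (this is not Sidekiq Pro's code; the private queue name below is made up, loosely modeled on the queue:sq|... names visible in the logs above):

require "redis"
require "json"

redis = Redis.new

public_queue  = "queue:default"
# Per-process working queue; this name is invented for the example.
private_queue = "queue:sq|example-host:1234|default"

# Atomically move a job from the public queue to the private working queue.
# Even if this process crashes right after, the job is still in Redis.
payload = redis.rpoplpush(public_queue, private_queue)

if payload
  job = JSON.parse(payload)
  puts "Processing #{job['class']} with args #{job['args']}"

  # Only after the job finishes successfully is it removed
  # from the private working queue.
  redis.lrem(private_queue, 1, payload)
end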
super_fetch registers a private queue in Redis for each Sidekiq process on start-up. super_fetch atomically fetches a scheduled job from the public queue in Redis and pushes that job into the private queue (or working queue) using the RPOPLPUSH Redis command. Once the job has finished processing, Sidekiq removes it from the private queue. During a graceful shutdown, Sidekiq moves the unfinished jobs back from the private queue to the public queue. If the shutdown of a Sidekiq process is not graceful, the unfinished jobs of that process remain in its private queue; these are called orphaned jobs. On restarting, or on starting another Sidekiq process, super_fetch looks for such orphaned jobs in the private queues. If Sidekiq finds orphaned jobs, it re-enqueues them and processes them again.
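Conceptually, the recovery step could look something like the sketch below. Again, this is only an illustration and not Sidekiq Pro's actual implementation; in particular, deciding which private queues belong to dead processes is the hard part that super_fetch handles for us, and the queue name pattern is an assumption based on the logs above.

require "redis"

redis = Redis.new

# Walk the private working queues (assumed here to match "queue:sq|*|default")
# and push any jobs left behind by dead processes back to the public queue.
redis.scan_each(match: "queue:sq|*|default") do |private_queue|
  while redis.rpoplpush(private_queue, "queue:default")
    puts "Recovered a job from #{private_queue}"
  end
end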
It may happen that we have multiple Sidekiq processes running at the same time. If one of them dies, its unfinished jobs become orphans. This Sidekiq wiki describes in detail the criteria super_fetch relies upon to identify which jobs are orphaned and which are not.
If we don't restart or start another Sidekiq process, super_fetch may take anywhere from 5 minutes to 3 hours to recover such orphaned jobs. The recommended approach is to restart or start another Sidekiq process to signal super_fetch to look for orphans.
Interestingly, in older versions of Sidekiq Pro, super_fetch performed its check for orphaned jobs and queues every 24 hours, at Sidekiq process startup. Due to this, when a Sidekiq process crashed, its orphaned jobs could remain unpicked for up to 24 hours, until the next restart. This orphan check delay window was later lowered to 1 hour in Sidekiq Pro 3.4.1.
Another fun fact: older versions of Sidekiq Pro shipped two other fetch strategies, namely reliable_fetch and timed_fetch. Apparently, reliable_fetch did not work with Docker, and timed_fetch had an asymptotic computational complexity of O(log N), which is less efficient than super_fetch's O(1). Both of these strategies were deprecated in Sidekiq Pro 3.4.0 in favor of super_fetch. Later, both were removed in Sidekiq Pro 4.0 and are no longer documented.
We have enabled super_fetch in our application and it seems to be working without any major issues so far. Our Kubernetes background pods do not seem to be losing any jobs when they are restarted.
Update: Mike Perham, the author of Sidekiq, posted the following comment.
Faktory provides all of the beanstalkd functionality, including the same reliability, with a nicer Web UI. It's free and OSS. https://github.com/contribsys/faktory http://contribsys.com/faktory/