Issue with Delayed Job lifecycle and Postgres Errors

Vipul

July 23, 2014

Recently, in one of our projects, we ran into some strange errors from Delayed::Job. The workers started successfully, but as soon as they began locking jobs they failed with PG::Error: no connection to server or PG::Error: FATAL: invalid frontend message type 60 errors.

After some searching, we found that others had already run into such issues (link is not available).

We started isolating the problem and digging through the recent changes we had made to the project. Since the last release, the only significant modification had been to internationalization: we had started using i18n-active_record.

# config/initializers/locale.rb

require 'i18n/backend/active_record'
Translation = I18n::Backend::ActiveRecord::Translation

if (ActiveRecord::Base.connected? && Translation.table_exists?) ||
   in_delayed_job_process?

  I18n.backend = I18n::Backend::ActiveRecord.new
  I18n::Backend::ActiveRecord.send(:include, I18n::Backend::Memoize)
  I18n::Backend::ActiveRecord.send(:include, I18n::Backend::Flatten)
  I18n::Backend::Simple.send(:include, I18n::Backend::Memoize)
  I18n::Backend::Simple.send(:include, I18n::Backend::Pluralization)
  I18n.backend = I18n::Backend::Chain.new(I18n::Backend::Simple.new, I18n.backend)

end

For Delayed Job we had an extra check, defined as:

def in_delayed_job_process?
  executable_name = File.basename $0
  arguments = $*
  rake_args_regex = /\Ajobs:/
  (executable_name == 'delayed_job') || (executable_name == 'rake' && arguments.find { |v| v =~ rake_args_regex })
end
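Because the helper reads the globals $0 and $* directly, it is awkward to try out outside a real worker process. Here is a minimal sketch of the same check with the program name and arguments passed in explicitly. The method name delayed_job_invocation? and its parameters are hypothetical names of ours, not part of Delayed::Job:

```ruby
# Hypothetical variant of in_delayed_job_process? that takes the program
# name and arguments as parameters instead of reading $0 and $* directly.
def delayed_job_invocation?(program = $0, arguments = $*)
  executable_name = File.basename(program)
  return true if executable_name == 'delayed_job'

  # Rake tasks such as `rake jobs:work` all start with "jobs:".
  executable_name == 'rake' && arguments.any? { |arg| arg =~ /\Ajobs:/ }
end

delayed_job_invocation?('bin/delayed_job', ['start'])   # => true
delayed_job_invocation?('/usr/bin/rake', ['jobs:work']) # => true
delayed_job_invocation?('rake', ['db:migrate'])         # => false
delayed_job_invocation?('rails', ['server'])            # => false
```

Using any? instead of find also makes the method return a strict true or false rather than the matching argument.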

After some serious searching and digging through both the Delayed::Job source code and the way we had configured it, we started noticing some issues.

The first thing we found was that the problem did not occur when the Delayed Job workers were started with the rake jobs:work task.

After looking at DelayedJob internals, we found that the main difference between the rake task and the binstub was the fork that the binstub version performs. The binstub ran each worker through Delayed::Command#run_process, which daemonizes it via the Daemons gem's Daemons.run_proc, and so had a slightly different execution lifecycle.
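To see why a fork matters for anything that holds an open connection, here is a minimal, self-contained sketch in plain Ruby. It does not use DelayedJob or Postgres: a pipe stands in for the database socket, and inherited_descriptor_demo is a hypothetical method name of ours. It shows that a descriptor opened before fork is shared by parent and child, so both end up writing into the same channel (fork is not available on every platform, Windows in particular):

```ruby
# Hypothetical demo: a pipe stands in for a database socket that was
# opened before the worker process is forked.
def inherited_descriptor_demo
  reader, writer = IO.pipe

  pid = fork do
    # The child never opened this pipe; it inherited the descriptor.
    reader.close
    writer.syswrite('child;')
    exit!(0) # leave immediately, skipping at_exit handlers copied from the parent
  end

  Process.wait(pid)          # let the child finish first so the order is deterministic
  writer.syswrite('parent;') # the parent writes into the very same pipe
  writer.close

  reader.read
ensure
  reader.close if reader && !reader.closed?
end

puts inherited_descriptor_demo # prints "child;parent;"
```

With a real database socket, two processes writing into one connection is one plausible way to end up with protocol errors like the ones we saw, which is why connections are normally closed before a fork and reopened afterwards.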

DelayedJob lifecycle

Let's take a look at DelayedJob internals before proceeding. DelayedJob has a system of hooks that plugin writers and applications can use. All of this event functionality lives in the Delayed::Lifecycle class. Each worker has its own instance of that class.

So, which events exactly do we have here?

Job-related events:

:enqueue
:perform
:error
:failure
:invoke_job

Worker-related events:

:execute
:loop
:perform
:error
:failure

You can set up callbacks to run before, after, or around an event using the Delayed::Worker.lifecycle.before, Delayed::Worker.lifecycle.after and Delayed::Worker.lifecycle.around methods.
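To make the before/after/around idea concrete, here is a small self-contained sketch of such a hook system in plain Ruby. MiniLifecycle is a hypothetical class of ours written only for illustration; it is not the actual Delayed::Lifecycle implementation, and it skips details such as validating event names:

```ruby
# Hypothetical, simplified hook registry illustrating before/after/around.
class MiniLifecycle
  def initialize
    @callbacks = Hash.new { |h, event| h[event] = { before: [], after: [], around: [] } }
  end

  def before(event, &block)
    @callbacks[event][:before] << block
  end

  def after(event, &block)
    @callbacks[event][:after] << block
  end

  def around(event, &block)
    @callbacks[event][:around] << block
  end

  # Runs the before hooks, then the block wrapped in the around hooks,
  # then the after hooks. Returns the value of the wrapped block chain.
  def run_callbacks(event, *args, &block)
    hooks = @callbacks[event]
    hooks[:before].each { |cb| cb.call(*args) }

    # Wrap from the inside out so the first registered around hook is outermost.
    chain = hooks[:around].reverse.inject(block) do |inner, cb|
      proc { |*a| cb.call(*a, &inner) }
    end
    result = chain.call(*args)

    hooks[:after].each { |cb| cb.call(*args) }
    result
  end
end

log = []
lifecycle = MiniLifecycle.new
lifecycle.before(:execute) { |worker| log << "before #{worker}" }
lifecycle.around(:execute) do |worker, &blk|
  log << 'around-start'
  blk.call(worker)
  log << 'around-end'
end
lifecycle.after(:execute) { |worker| log << "after #{worker}" }

lifecycle.run_callbacks(:execute, 'worker-1') { |worker| log << "run #{worker}" }

puts log
# before worker-1
# around-start
# run worker-1
# around-end
# after worker-1
```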

The Solution

Let's move on to our problem. It turned out that the delayed_job_active_record gem closes all database connections in its before_fork hook and re-establishes them in its after_fork hook. It was clear that i18n-active_record did not play well with this, causing the issue at hand.

We looked into the DelayedJob lifecycle and chose the before :execute hook, which runs after the DelayedJob ActiveRecord backend has finished manipulating its connections.

Finally, the locale initializer for delayed_job workers was changed as follows:
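The reason the hook choice matters is ordering. The sketch below simulates that ordering in plain Ruby, with hypothetical FakeConnection and FakePool classes standing in for real database connections. It is not how ActiveRecord or the DelayedJob backend are implemented, and it assumes that something set up before the fork hooks keeps pointing at a connection those hooks later close:

```ruby
# Hypothetical stand-in for a database connection.
class FakeConnection
  attr_reader :id

  def initialize(id)
    @id = id
    @open = true
  end

  def close
    @open = false
  end

  def open?
    @open
  end
end

# Hypothetical stand-in for connection handling: one current connection
# that can be closed and replaced by a fresh one.
class FakePool
  attr_reader :current

  def initialize
    @count = 0
    establish
  end

  def establish
    @count += 1
    @current = FakeConnection.new(@count)
  end

  def clear!
    @current.close
  end
end

pool = FakePool.new
boot_connection = pool.current # grabbed at initializer time
pool.clear!                    # what a before_fork hook would do
pool.establish                 # what an after_fork hook would do
hook_connection = pool.current # grabbed in a before :execute hook

puts boot_connection.open? # false
puts hook_connection.open? # true
```

Anything done in the before :execute hook sees the connection that was established after the fork, rather than one that has already been closed.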

require 'i18n/backend/active_record'
Translation = I18n::Backend::ActiveRecord::Translation

Delayed::Worker.lifecycle.before :execute do
  if (ActiveRecord::Base.connected? && Translation.table_exists?) || in_delayed_job_process?
    I18n.backend = I18n::Backend::ActiveRecord.new

    I18n::Backend::ActiveRecord.send(:include, I18n::Backend::Memoize)
    I18n::Backend::ActiveRecord.send(:include, I18n::Backend::Flatten)
    I18n::Backend::Simple.send(:include, I18n::Backend::Memoize)
    I18n::Backend::Simple.send(:include, I18n::Backend::Pluralization)

    I18n.backend = I18n::Backend::Chain.new(I18n::Backend::Simple.new, I18n.backend)
  end
end

This helped us to mitigate the connection errors, and connections stopped dying abruptly.
