November 15, 2016
This is a story of how we found and fixed a memory leak in shoryuken.
We use shoryuken to process SQS messages inside Docker containers. A while back, we noticed that the workers' memory was growing without bound, and every few days we had to restart all the containers as a temporary workaround.
Since the workers were running inside a Docker container, we had limited tools at our disposal, so we went ahead with the UNIX way of investigating the issue.
First, we noticed that the number of threads inside the worker was high: 115 in our case. shoryuken boots up all of its worker threads at startup.
# ps --no-headers uH p <PID> | wc -l
#=> 115
The proc filesystem exposes a lot of useful information about all the running processes. The `/proc/[pid]/task` directory contains one entry per thread of a process.
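For example, counting the entries in that directory (`<PID>` is a placeholder, as above) should match the thread count reported by ps:

# ls /proc/<PID>/task | wc -l
#=> 115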
Some of the threads with lower IDs were executing syscall 23 (select) or 271 (ppoll); these threads were waiting for a message to arrive in the SQS queue. Most of the threads, however, were executing syscall 202 (futex). (The syscall numbers here are for x86_64.)
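This, too, can be read straight out of procfs: the first field of a task's `syscall` file is the number of the syscall the thread is currently blocked in (`<TID>` below is a placeholder thread ID):

# cat /proc/<PID>/task/<TID>/syscall
#=> 202 0x... (the leading number is the syscall; the rest are its arguments)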
At this point we had an idea about the root cause of the memory leak: the worker was starting a lot of threads that were never terminated. Next, we wanted to know how and when these threads were being started.
Ruby 2.0.0 introduced TracePoint, which provides an interface to a lot of internal Ruby events, such as when an exception is raised, when a method is called, or when a method returns.
We added the following code to our workers.
# subscribe to thread creation and termination events
tc = TracePoint.new(:thread_begin, :thread_end) do |tp|
  puts tp.event       # :thread_begin or :thread_end
  puts tp.self.class  # the class of the thread being started or stopped
end
tc.enable
Executing the Ruby workers with tracing enabled revealed that a new Celluloid::Thread was being created before each message was processed, and that thread was never terminated. Hence the number of zombie threads in the worker was growing with the number of messages processed.
thread_begin
Celluloid::Thread
[development] [306203a5-3c07-4174-b974-77390e8a4fc3] SQS Message: ...snip...
thread_begin
Celluloid::Thread
[development] [2ce2ed3b-d314-46f1-895a-f1468a8db71e] SQS Message: ...snip...
Unfortunately, TracePoint didn't pinpoint the place where the thread was started, so we added a couple of puts statements to investigate the issue further.
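One way to do this kind of digging (an illustrative sketch, not the exact statements we added) is to monkey patch Thread.new so that every spawn logs its call site:

# Illustrative debugging patch: log the call site of every thread spawn.
# Subclasses such as Celluloid::Thread inherit Thread.new, so their
# spawns get logged too.
class Thread
  class << self
    alias_method :original_new, :new

    def new(*args, &block)
      puts "Thread spawned from:"
      puts caller.first(5)  # top few frames of the spawning stack
      original_new(*args, &block)
    end
  end
end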
After a lot of debugging, we found that a new thread was being started in a shoryuken middleware to keep extending the visibility timeout of the in-flight SQS message when auto_visibility_timeout was set to true.
The fix was to terminate the thread after the work is done.
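In outline, the bug and the fix look something like this sketch (hypothetical class and values, not shoryuken's actual source): a heartbeat thread keeps extending the message's visibility timeout while the worker runs, and it has to be killed in an ensure block so it can't outlive the message it belongs to:

# Hypothetical sketch of the pattern, not shoryuken's actual source.
class AutoVisibilityExtender
  HEARTBEAT_INTERVAL = 10 # seconds; illustrative value

  def call(worker, queue, sqs_msg, body)
    extender = Thread.new do
      loop do
        sleep HEARTBEAT_INTERVAL
        # keep the in-flight message invisible to other consumers
        # (assumed SQS message API)
        sqs_msg.change_visibility(visibility_timeout: 2 * HEARTBEAT_INTERVAL)
      end
    end

    yield # process the message
  ensure
    # the fix: without this, every processed message leaks one thread
    extender.kill if extender
  end
end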
If this blog post was helpful, check out our full blog archive.