This is the story of how we found and fixed a memory leak in shoryuken.
We use shoryuken to process SQS messages inside docker containers. A while back we noticed that the workers' memory usage was growing without bound, and every few days we had to restart all the containers as a temporary workaround.
Since the workers ran inside a docker container, we had a limited set of tools at our disposal, so we went ahead with the UNIX way of investigating the issue.
First, we noticed that the number of threads inside the worker process was unusually high: 115 in our case. shoryuken boots up all of its worker threads at startup, so the thread count should not grow as messages are processed.
```bash
# ps --no-header uH p <PID> | wc -l
#=> 115
```
The proc filesystem exposes a lot of useful information about all the running processes. The `/proc/[pid]/task` directory has information about all the threads of a process.
Some of the threads with lower IDs were executing syscall 23 (select) or 271 (ppoll); these were the threads waiting for a message to arrive in the SQS queue. Most of the threads, however, were blocked in syscall 202 (futex).
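For reference, the same information can be gathered programmatically. The sketch below is our illustration rather than part of the original investigation; it assumes a Linux kernel that exposes `/proc/[pid]/task/[tid]/syscall` and permission to read it.

```ruby
# Sketch: tally the syscall each thread of a process is blocked in, using
# /proc/<pid>/task/<tid>/syscall (first field = syscall number, or "running").
pid = ARGV.fetch(0)

counts = Hash.new(0)
Dir.glob("/proc/#{pid}/task/*/syscall") do |path|
  state = File.read(path).split.first rescue next # thread may have exited
  counts[state] += 1
end

counts.sort_by { |_, n| -n }.each do |nr, n|
  puts "syscall #{nr}: #{n} threads" # on our worker, 202 (futex) dominated
end
```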
At this point we had an idea about the root cause of the memory leak: the worker was starting a lot of threads that were never terminated. We wanted to know how and when these threads were being started.
Ruby 2.0.0 introduced TracePoint, which provides an interface to a lot of internal Ruby events, such as when an exception is raised, when a method is called, or when a method returns.
We added the following code to our workers.
```ruby
tc = TracePoint.new(:thread_begin, :thread_end) do |tp|
  puts tp.event
  puts tp.self.class
end

tc.enable
```
Running the Ruby workers with tracing enabled revealed that a new Celluloid::Thread was being created before each message was processed, and that thread was never terminated. Hence the number of zombie threads in the worker grew with the number of messages processed.
```
thread_begin
Celluloid::Thread
[development] [306203a5-3c07-4174-b974-77390e8a4fc3] SQS Message: ...snip...

thread_begin
Celluloid::Thread
[development] [2ce2ed3b-d314-46f1-895a-f1468a8db71e] SQS Message: ...snip...
```
Unfortunately, TracePoint didn't pinpoint the place where the thread was started, so we added a couple of puts statements to investigate further.
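In hindsight, one possible shortcut (a hypothetical sketch, not the puts-debugging we actually did) is to wrap `Thread.new` and log the creation site of every new thread; this also catches subclasses such as Celluloid::Thread as long as they don't define their own `new`.

```ruby
# Hypothetical debugging aid: print a short backtrace whenever a thread is
# created, so the creation site of leaking threads shows up in the logs.
class Thread
  class << self
    alias_method :original_new, :new

    def new(*args, &block)
      warn "Thread started from:\n  #{caller.first(5).join("\n  ")}"
      original_new(*args, &block)
    end
  end
end
```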
After a lot of debugging, we found that a shoryuken middleware started a new thread to extend the visibility timeout of the SQS message whenever `auto_visibility_timeout` was set to true.
The fix was to terminate that thread once the work was done.
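Here is a minimal sketch of the pattern behind the fix. The method and parameter names are illustrative, not shoryuken's actual middleware code, and it assumes an SQS message object that responds to `change_visibility`.

```ruby
# Illustrative sketch of the fix pattern: the extender thread periodically
# bumps the message's visibility timeout and is killed in `ensure`, so it
# cannot outlive the message it was extending.
def process_with_auto_extension(sqs_msg, extend_every: 10)
  extender = Thread.new do
    loop do
      sleep extend_every
      sqs_msg.change_visibility(visibility_timeout: extend_every * 2)
    end
  end

  yield
ensure
  extender.kill if extender # without this, one zombie thread leaks per message
end
```

The important part is the `ensure` block: whether the message is processed successfully or raises, the extender thread is reaped together with the unit of work, so zombie threads no longer accumulate.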