---
title: "Hunting down a memory leak in shoryuken"
description: "Hunting down a memory leak in shoryuken worker"
canonical_url: "https://www.bigbinary.com/blog/hunting-down-a-memory-leak-in-shoryuken"
markdown_url: "https://www.bigbinary.com/blog/hunting-down-a-memory-leak-in-shoryuken.md"
---

# Hunting down a memory leak in shoryuken

Hunting down a memory leak in shoryuken worker

- Author: Rohit Kumar
- Published: November 15, 2016
- Categories: Ruby

This is a story of how we found and fixed memory leak in
[shoryuken](https://github.com/phstc/shoryuken/).

We use shoryuken to process SQS messages inside of docker containers. A while
back we noticed that memory was growing without bound. After every few days, we
had to restart all the docker containers as a temporary workaround.

Since the workers were inside of a docker container we had limited tools. So we
went ahead with the UNIX way of investigating the issue.

First we noticed that the number of threads inside the worker was high, 115 in
our case. shoryuken boots up all the worker threads at
[startup](https://github.com/phstc/shoryuken/blob/cea4f5f475b2e756f82e367152db831f9e2cc6f5/lib/shoryuken/manager.rb#L22).

```ruby

# ps --no-header uH p <PID> | wc -l
#=> 115
```

The proc filesystem exposes a lot of useful information of all the running
processes. ``/proc/[pid]/task` directory has information about all the threads
of a process.

Some of the threads with lower ID's were executing syscall 23
[(select)](http://man7.org/linux/man-pages/man2/select.2.html) and 271
[(ppoll)](https://linux.die.net/man/2/ppoll). These threads were waiting for a
message to arrive in the SQS queue, but most of the threads were executing
syscall 202 [(futex)](http://man7.org/linux/man-pages/man2/futex.2.html).

At this point we had an idea about the root cause of the memory leak - it was
due to the worker starting a lot of threads which were not getting terminated.
We wanted to know how and when these threads are started.

Ruby 2.0.0 introduced
[tracepoint](http://ruby-doc.org/core-2.0.0/TracePoint.html), which provides an
interface to a lot of internal ruby events like when a exception is raised, when
a method is called or when a method returns, etc.

We added the following code to our workers.

```ruby

tc = TracePoint.new(:thread_begin, :thread_end) do |tp|
  puts tp.event
  puts tp.self.class
end

tc.enable

```

Executing the ruby workers with tracing enabled revealed that a new
`Celluloid::Thread` was being created before each method was processed and that
thread was never terminated. Hence the number of zombie threads in the worker
was growing with the number messages processed.

```ruby

thread_begin
Celluloid::Thread
[development] [306203a5-3c07-4174-b974-77390e8a4fc3] SQS Message: ...snip...

thread_begin
Celluloid::Thread
[development] [2ce2ed3b-d314-46f1-895a-f1468a8db71e] SQS Message: ...snip...

```

Unfortunately tracepoint didn't pinpoint the place where the thread was started,
hence we added a couple of puts statements to investigate the issue further.

After a lot of debugging, we were able to find that a new thread was started to
increase the
[visibility time](https://github.com/phstc/shoryuken/blob/a40e6b26e2dfbc4a78a36198d02010dce66a977e/lib/shoryuken/middleware/server/auto_extend_visibility.rb#L48)
of the SQS message in a shoryuken middleware when auto_visibility_timeout was
true.

The fix was to [terminate](https://github.com/phstc/shoryuken/pull/267) the
thread after the work is done.

## Links

- [Human page](https://www.bigbinary.com/blog/hunting-down-a-memory-leak-in-shoryuken)
