In this blog we discuss three separate case studies of Redis running out of memory. All three case studies have videos demonstrating how the debugging was done.
All three videos were prepared for my team members to show how to go about debugging. The videos are presented as they were recorded.
First Case Study
When a job fails in Sidekiq, Sidekiq puts that job in the RetrySet and retries it until the job succeeds or reaches the maximum number of retries. By default the maximum number of retries is 25. If a job fails 25 times, it is moved to the DeadSet. By default Sidekiq stores up to 10,000 jobs in the DeadSet.
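Both limits are configurable. Here is a minimal sketch; the worker name is hypothetical, and the `dead_max_jobs` / `dead_timeout_in_seconds` options assume a Sidekiq 6-style configuration, so check the Sidekiq wiki for the exact form in your version.

```ruby
class HardJob
  include Sidekiq::Job

  # Give up after 5 retries instead of the default 25;
  # after the final retry the job moves to the DeadSet.
  sidekiq_options retry: 5

  def perform(record_id)
    # ...
  end
end

# In an initializer (Sidekiq 6-style options): cap the DeadSet and shorten
# how long dead jobs are kept (defaults are 10,000 jobs for 6 months).
Sidekiq.options[:dead_max_jobs] = 5_000
Sidekiq.options[:dead_timeout_in_seconds] = 30 * 24 * 60 * 60 # 1 month
```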
We had a situation where Redis was running out of memory. Here is how the debugging was done.
How to inspect the deadset
```ruby
ds = Sidekiq::DeadSet.new
ds.each do |job|
  puts "Job #{job['jid']}: #{job['class']} failed at #{job['failed_at']}"
end
```
Run the following to view the latest entry in the DeadSet and the number of entries:
```ruby
ds.first
ds.count
```
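To figure out which jobs are bloating the DeadSet, we can total up the size of each entry's JSON payload per job class. This is a rough sketch; it assumes `Sidekiq::SortedEntry#value` (the raw JSON string behind each entry) is available in your Sidekiq version.

```ruby
require "sidekiq/api"

ds = Sidekiq::DeadSet.new

# Group dead jobs by class and sum the size of their JSON payloads.
usage = Hash.new(0)
ds.each do |job|
  usage[job["class"]] += job.value.bytesize
end

usage.sort_by { |_klass, bytes| -bytes }.first(10).each do |klass, bytes|
  puts "#{klass}: #{bytes / 1024} KB across dead jobs"
end
```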
To see the memory usage, the following commands were executed in the Redis console.
```
> memory usage dead
30042467

> type dead
zset
```
As discussed in the video, a large payload was being sent to the worker. This is not the right way to send data to a worker. Ideally only some sort of id should be sent, and the worker should fetch the necessary data from the database based on the received id.
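For example, instead of serializing a whole record into the job arguments, pass just the id and load the record inside the worker. The names below (`ReportWorker`, `Report`) are hypothetical; this is only a sketch of the pattern.

```ruby
# Before: the whole record travels through Redis and sits in the
# RetrySet/DeadSet if the job fails.
ReportWorker.perform_async(report.attributes)

# After: only the id goes to Redis; the worker loads fresh data itself.
ReportWorker.perform_async(report.id)

class ReportWorker
  include Sidekiq::Job

  def perform(report_id)
    report = Report.find(report_id)
    # work with report...
  end
end
```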
References
- How to increase the number of jobs in the Sidekiq deadset or disable deadset
- Maximum number of job retries in Sidekiq
- Maximum number of jobs in Sidekiq Deadset
Second Case Study
In this case, the Redis instance of neetochat was running out of memory. The Redis instance had a capacity of 50 MB, but we were getting the following error.
```
ERROR: heartbeat: OOM command not allowed when used memory > 'maxmemory'.
```
We were pushing too many geo info records to Redis and that caused the memory to fill up. Here is the video capturing the debugging session.
Following are the commands that were executed while debugging.
```
> ping
PONG

> info

> info memory

> info keyspace

> keys *failed*

> keys *process*

> keys *geocoder*

> get geocoder:http://ipinfo.io/41.174.30.55/geo?
```
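One way to keep such cached geo lookups from accumulating forever is to write them with a TTL so Redis can expire them on its own. A minimal sketch using the redis gem; the key mirrors the `geocoder:<lookup URL>` pattern seen above, and the value is a stand-in for the real API response.

```ruby
require "redis"
require "json"

redis = Redis.new

key   = "geocoder:http://ipinfo.io/41.174.30.55/geo"
value = { city: "Example City", country: "XX" }.to_json

# Expire the cached lookup after a day instead of keeping it forever.
redis.set(key, value, ex: 24 * 60 * 60)
```

On a cache-only Redis instance, setting a `maxmemory-policy` such as `allkeys-lru` is another option, since Redis then evicts old entries instead of raising OOM errors.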
Third Case Study
In this case the authentication service of neeto was failing because of memory exhaustion.
Here the number of keys was limited, but the payloads were huge and all that payload data was hogging the memory. Here is the video capturing the debugging session.
Following are the commands that were executed while debugging.
```
> ping

> info keyspace
db0:keys=106,expires=86,avg_ttl=1233332728573

> keys * (to see all the keys)
```
The last command listed all 106 keys. Next we needed to find out how much memory each of these keys was using. For that, the following commands were executed.
```
> memory usage organizations/subdomains/bigbinary/neeto_app_links
736 bytes

> memory usage failed
10316224 (10MB)

> memory usage dead
29871174 (29MB)
```
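Checking keys one by one works for 106 keys, but the same audit can be scripted. A rough sketch using the redis gem, scanning all keys and asking Redis for their sizes; `MEMORY USAGE` is sent as a raw command here, assuming your redis gem version supports `call`.

```ruby
require "redis"

redis = Redis.new

# Walk every key with SCAN (non-blocking, unlike KEYS *) and record
# how many bytes each one takes.
sizes = {}
redis.scan_each(match: "*") do |key|
  sizes[key] = redis.call("MEMORY", "USAGE", key).to_i
end

# Print the biggest offenders first.
sizes.sort_by { |_key, bytes| -bytes }.first(10).each do |key, bytes|
  puts format("%-60s %8.2f KB", key, bytes / 1024.0)
end
```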