In this blog we discuss three separate case studies of Redis running out of memory. All three case studies have videos demonstrating how the debugging was done.
All three videos were prepared for my team members to show how to go about debugging. The videos are presented as they were recorded.
First Case Study
When a job fails in Sidekiq, Sidekiq puts that job in the RetrySet and retries it until the job succeeds or reaches the maximum number of retries. By default the maximum number of retries is 25. If a job fails 25 times, it is moved to the DeadSet. By default Sidekiq stores up to 10,000 jobs in the DeadSet.
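Both limits are configurable. Here is a minimal sketch; the worker name is hypothetical, and the `dead_max_jobs` / `dead_timeout_in_seconds` options assume a Sidekiq 6-style configuration, so check the Sidekiq wiki for the exact form in your version.

```ruby
class HardJob
  include Sidekiq::Job

  # Give up after 5 retries instead of the default 25;
  # after the final retry the job moves to the DeadSet.
  sidekiq_options retry: 5

  def perform(record_id)
    # ...
  end
end

# In an initializer (Sidekiq 6-style options): cap the DeadSet and shorten
# how long dead jobs are kept (defaults are 10,000 jobs for 6 months).
Sidekiq.options[:dead_max_jobs] = 5_000
Sidekiq.options[:dead_timeout_in_seconds] = 30 * 24 * 60 * 60 # 1 month
```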
We had a situation where Redis was running out of memory. Here is how the debugging was done.
How to inspect the deadset
```ruby
ds = Sidekiq::DeadSet.new
ds.each do |job|
  puts "Job #{job['jid']}: #{job['class']} failed at #{job['failed_at']}"
end
```
Run the following to view the latest entry in the DeadSet and the number of entries:
```ruby
ds.first
ds.count
```
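To figure out which jobs are bloating the DeadSet, we can total up the size of each entry's JSON payload per job class. This is a rough sketch; it assumes `Sidekiq::SortedEntry#value` (the raw JSON string behind each entry) is available in your Sidekiq version.

```ruby
require "sidekiq/api"

ds = Sidekiq::DeadSet.new

# Group dead jobs by class and sum the size of their JSON payloads.
usage = Hash.new(0)
ds.each do |job|
  usage[job["class"]] += job.value.bytesize
end

usage.sort_by { |_klass, bytes| -bytes }.first(10).each do |klass, bytes|
  puts "#{klass}: #{bytes / 1024} KB across dead jobs"
end
```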
To see the memory usage, the following commands were executed in the Redis console.
```
> memory usage dead
30042467

> type dead
zset
```
As discussed in the video, a large payload was being sent to the worker. This is not the right way to send data to a worker. Ideally only some sort of id should be sent, and the worker should fetch the necessary data from the database based on the received id.
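For example, instead of serializing a whole record into the job arguments, pass just the id and load the record inside the worker. The names below (`ReportWorker`, `Report`) are hypothetical; this is only a sketch of the pattern.

```ruby
# Before: the whole record travels through Redis and sits in the
# RetrySet/DeadSet if the job fails.
ReportWorker.perform_async(report.attributes)

# After: only the id goes to Redis; the worker loads fresh data itself.
ReportWorker.perform_async(report.id)

class ReportWorker
  include Sidekiq::Job

  def perform(report_id)
    report = Report.find(report_id)
    # work with report...
  end
end
```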
References
- How to increase the number of jobs in the Sidekiq deadset or disable deadset
- Maximum number of job retries in Sidekiq
- Maximum number of jobs in Sidekiq Deadset
Second Case Study
In this case, the Redis instance of neetochat was running out of memory. The Redis instance had a capacity of 50 MB, but we were getting the following error.
```
ERROR: heartbeat: OOM command not allowed when used memory > 'maxmemory'.
```
We were pushing too many geo info records to Redis and that caused the memory to fill up. Here is the video capturing the debugging session.
Following are the commands that were executed while debugging.
```
> ping
PONG

> info

> info memory

> info keyspace

> keys *failed*

> keys *process*

> keys *geocoder*

> get geocoder:http://ipinfo.io/41.174.30.55/geo?
```
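One way to keep such cached geo lookups from accumulating forever is to write them with a TTL so Redis can expire them on its own. A minimal sketch using the redis gem; the key mirrors the `geocoder:<lookup URL>` pattern seen above, and the value is a stand-in for the real API response.

```ruby
require "redis"
require "json"

redis = Redis.new

key   = "geocoder:http://ipinfo.io/41.174.30.55/geo"
value = { city: "Example City", country: "XX" }.to_json

# Expire the cached lookup after a day instead of keeping it forever.
redis.set(key, value, ex: 24 * 60 * 60)
```

On a cache-only Redis instance, setting a `maxmemory-policy` such as `allkeys-lru` is another option, since Redis then evicts old entries instead of raising OOM errors.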
Third Case Study
In this case the authentication service of neeto was failing because of memory exhaustion.
Here the number of keys was limited, but the payloads were huge and all that payload data was hogging the memory. Here is the video capturing the debugging session.
Following are the commands that were executed while debugging.
```
> ping

> info keyspace
db0:keys=106,expires=86,avg_ttl=1233332728573

> keys * (to see all the keys)
```
The last command listed all 106 keys. Next we needed to find out how much memory each of these keys was using. For that, the following commands were executed.
```
> memory usage organizations/subdomains/bigbinary/neeto_app_links
736 bytes

> memory usage failed
10316224 (10MB)

> memory usage dead
29871174 (29MB)
```
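Checking keys one by one works for 106 keys, but the same audit can be scripted. A rough sketch using the redis gem, scanning all keys and asking Redis for their sizes; `MEMORY USAGE` is sent as a raw command here, assuming your redis gem version supports `call`.

```ruby
require "redis"

redis = Redis.new

# Walk every key with SCAN (non-blocking, unlike KEYS *) and record
# how many bytes each one takes.
sizes = {}
redis.scan_each(match: "*") do |key|
  sizes[key] = redis.call("MEMORY", "USAGE", key).to_i
end

# Print the biggest offenders first.
sizes.sort_by { |_key, bytes| -bytes }.first(10).each do |key, bytes|
  puts format("%-60s %8.2f KB", key, bytes / 1024.0)
end
```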