In MPDX, the Google Contacts sync job takes a long time, and the loop that syncs each Google account could benefit from `find_each(batch_size: 1)`. Basically, `find_each` pulls in records in batches and holds each batch in an array to enumerate over. Here's a comparison of the memory results with different batch sizes, using a similar but contrived `MemHungry` model. To set up, first run `6.times { MemHungry.create }`.
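The model itself isn't shown above, so here's a plain-Ruby stand-in for what `MemHungry` might look like. The `eat_memory` behavior, the allocation size, and the `/proc`-based memory reporting are all assumptions for illustration; the real model would be an ActiveRecord class:

```ruby
# Plain-Ruby stand-in for the contrived MemHungry model (assumption: the
# real eat_memory allocates a large object and keeps a reference to it, so
# the memory stays live for as long as the record itself does).
class MemHungry
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def eat_memory
    # Report memory around a GC and a large allocation, mirroring the
    # "Memory before GC / before allocation / after allocation" log lines.
    puts "Memory before GC: #{rss_mb}"
    GC.start
    puts "Memory before allocation: #{rss_mb}"
    @blob = "x" * 100_000_000 # retain ~100 MB per live record
    puts "Memory after allocation: #{rss_mb}"
  end

  private

  # Resident set size in MB (assumption: Linux /proc; reports 0.0 elsewhere).
  def rss_mb
    File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_f / 1024.0
  rescue Errno::ENOENT
    0.0
  end
end
```

Because `@blob` is an instance variable, the allocated memory can only be reclaimed once the `MemHungry` instance itself becomes garbage, which is exactly what makes the batch size matter below.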
Using the default `batch_size` of 1000, the memory at the end reflects all 6 objects and the RAM they hold onto:
```
[1] pry(main)> MemHungry.all.find_each { |m| puts m.id; m.eat_memory; }
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" ORDER BY "mem_hungries"."id" ASC LIMIT 1000
1
Memory before GC: 112.63671875
Memory before allocation: 159.90234375
Memory after allocation: 1759.90234375
2
Memory before GC: 1759.90234375
Memory before allocation: 1759.90234375
Memory after allocation: 3359.90234375
3
Memory before GC: 3359.90234375
Memory before allocation: 3359.90234375
Memory after allocation: 4959.90234375
4
Memory before GC: 4959.90234375
Memory before allocation: 4959.90234375
Memory after allocation: 6559.90234375
5
Memory before GC: 6559.90234375
Memory before allocation: 6559.90234375
Memory after allocation: 8159.90234375
6
Memory before GC: 8159.90234375
Memory before allocation: 8159.90234375
Memory after allocation: 9150.45703125
=> nil
```
But when we run with `batch_size: 1`, the memory stays lower because `find_each` won't hold onto all of the models at once (as it would with the default batch size, since there are fewer than 1000 records):
```
[1] pry(main)> MemHungry.all.find_each(batch_size: 1) { |m| puts m.id; m.eat_memory; }
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" ORDER BY "mem_hungries"."id" ASC LIMIT 1
1
Memory before GC: 116.43359375
Memory before allocation: 159.27734375
Memory after allocation: 1759.28125
  MemHungry Load (0.6ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 1) ORDER BY "mem_hungries"."id" ASC LIMIT 1
2
Memory before GC: 1759.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 2) ORDER BY "mem_hungries"."id" ASC LIMIT 1
3
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 3) ORDER BY "mem_hungries"."id" ASC LIMIT 1
4
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 4) ORDER BY "mem_hungries"."id" ASC LIMIT 1
5
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 5) ORDER BY "mem_hungries"."id" ASC LIMIT 1
6
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 6) ORDER BY "mem_hungries"."id" ASC LIMIT 1
=> nil
```
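The queries in that log show what `find_each` is doing under the hood: keyset pagination, where each batch is fetched with `WHERE id > last_seen_id ORDER BY id LIMIT batch_size`, so only one batch of records is materialized at a time. Here's a minimal sketch of that loop over a plain array standing in for the table (the query-count return value is just for illustration, not something the real method does):

```ruby
# Minimal sketch of find_each-style keyset pagination over an in-memory
# "table". Only one batch of records exists at a time; batch_size trades
# memory held per batch against the number of queries issued.
TABLE = (1..6).map { |i| { id: i } }.freeze

def find_each(batch_size:)
  last_id = 0
  queries = 0
  loop do
    # Stands in for: SELECT * FROM mem_hungries
    #                WHERE id > last_id ORDER BY id ASC LIMIT batch_size
    batch = TABLE.select { |r| r[:id] > last_id }.first(batch_size)
    queries += 1
    break if batch.empty?
    batch.each { |record| yield record }
    last_id = batch.last[:id]
    # A short batch means the table is exhausted; no extra query needed.
    break if batch.size < batch_size
  end
  queries # returned for illustration only
end
```

With `batch_size: 1000` this issues a single query for all 6 rows (matching the first transcript), while `batch_size: 1` issues seven: one per row plus a final empty fetch, matching the seven `SELECT`s in the second transcript.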
The trade-off of a smaller batch size is the cost of an extra query to retrieve each record one-by-one. When the records feed a big Sidekiq background job that runs for a long time and allocates a lot of memory per record, fetching them one-by-one makes the most sense. If the operation on each model were faster (or the models didn't hold references to the allocated objects), then a larger batch size would make more sense.