In MPDX, the Google Contacts sync job takes a long time, and the loop that syncs each Google account could benefit from `find_each(batch_size: 1)`. Basically, `find_each` pulls in records in batches and holds each batch in an array to enumerate over. Here's a comparison of the memory results with different batch sizes, using a similar but contrived `MemHungry` model. To set up, first run `6.times { MemHungry.create }`.
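The model itself isn't shown above, so here's a plain-Ruby stand-in for what `MemHungry` might look like. The `eat_memory` behavior, the allocation size, and the `/proc`-based memory reporting are all assumptions for illustration; the real model would be an ActiveRecord class:

```ruby
# Plain-Ruby stand-in for the contrived MemHungry model (assumption: the
# real eat_memory allocates a large object and keeps a reference to it, so
# the memory stays live for as long as the record itself does).
class MemHungry
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def eat_memory
    # Report memory around a GC and a large allocation, mirroring the
    # "Memory before GC / before allocation / after allocation" log lines.
    puts "Memory before GC: #{rss_mb}"
    GC.start
    puts "Memory before allocation: #{rss_mb}"
    @blob = "x" * 100_000_000 # retain ~100 MB per live record
    puts "Memory after allocation: #{rss_mb}"
  end

  private

  # Resident set size in MB (assumption: Linux /proc; reports 0.0 elsewhere).
  def rss_mb
    File.read("/proc/self/status")[/VmRSS:\s+(\d+)/, 1].to_f / 1024.0
  rescue Errno::ENOENT
    0.0
  end
end
```

Because `@blob` is an instance variable, the allocated memory can only be reclaimed once the `MemHungry` instance itself becomes garbage, which is exactly what makes the batch size matter below.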
Using the default `batch_size` of 1000, the memory at the end reflects all 6 objects and the RAM they hold onto:
```
[1] pry(main)> MemHungry.all.find_each { |m| puts m.id; m.eat_memory; }
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" ORDER BY "mem_hungries"."id" ASC LIMIT 1000
1
Memory before GC: 112.63671875
Memory before allocation: 159.90234375
Memory after allocation: 1759.90234375
2
Memory before GC: 1759.90234375
Memory before allocation: 1759.90234375
Memory after allocation: 3359.90234375
3
Memory before GC: 3359.90234375
Memory before allocation: 3359.90234375
Memory after allocation: 4959.90234375
4
Memory before GC: 4959.90234375
Memory before allocation: 4959.90234375
Memory after allocation: 6559.90234375
5
Memory before GC: 6559.90234375
Memory before allocation: 6559.90234375
Memory after allocation: 8159.90234375
6
Memory before GC: 8159.90234375
Memory before allocation: 8159.90234375
Memory after allocation: 9150.45703125
=> nil
```
But when we run with `batch_size: 1`, the memory stays lower because `find_each` won't hold onto all of the models at once (as it would with the default batch size, since there are fewer than 1000 records):
```
[1] pry(main)> MemHungry.all.find_each(batch_size: 1) { |m| puts m.id; m.eat_memory; }
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" ORDER BY "mem_hungries"."id" ASC LIMIT 1
1
Memory before GC: 116.43359375
Memory before allocation: 159.27734375
Memory after allocation: 1759.28125
  MemHungry Load (0.6ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 1) ORDER BY "mem_hungries"."id" ASC LIMIT 1
2
Memory before GC: 1759.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 2) ORDER BY "mem_hungries"."id" ASC LIMIT 1
3
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 3) ORDER BY "mem_hungries"."id" ASC LIMIT 1
4
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 4) ORDER BY "mem_hungries"."id" ASC LIMIT 1
5
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 5) ORDER BY "mem_hungries"."id" ASC LIMIT 1
6
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT "mem_hungries".* FROM "mem_hungries" WHERE ("mem_hungries"."id" > 6) ORDER BY "mem_hungries"."id" ASC LIMIT 1
=> nil
```
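The queries in that log show what `find_each` is doing under the hood: keyset pagination, where each batch is fetched with `WHERE id > last_seen_id ORDER BY id LIMIT batch_size`, so only one batch of records is materialized at a time. Here's a minimal sketch of that loop over a plain array standing in for the table (the query-count return value is just for illustration, not something the real method does):

```ruby
# Minimal sketch of find_each-style keyset pagination over an in-memory
# "table". Only one batch of records exists at a time; batch_size trades
# memory held per batch against the number of queries issued.
TABLE = (1..6).map { |i| { id: i } }.freeze

def find_each(batch_size:)
  last_id = 0
  queries = 0
  loop do
    # Stands in for: SELECT * FROM mem_hungries
    #                WHERE id > last_id ORDER BY id ASC LIMIT batch_size
    batch = TABLE.select { |r| r[:id] > last_id }.first(batch_size)
    queries += 1
    break if batch.empty?
    batch.each { |record| yield record }
    last_id = batch.last[:id]
    # A short batch means the table is exhausted; no extra query needed.
    break if batch.size < batch_size
  end
  queries # returned for illustration only
end
```

With `batch_size: 1000` this issues a single query for all 6 rows (matching the first transcript), while `batch_size: 1` issues seven: one per row plus a final empty fetch, matching the seven `SELECT`s in the second transcript.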
The trade-off of a smaller batch size is the cost of an extra query to retrieve each record one-by-one. When the records feed a big Sidekiq background job that runs for a long time and allocates a lot of memory per record, fetching them one-by-one makes the most sense. If the operation on each model were faster (or the models didn't hold references to the allocated objects), then a larger batch size would make more sense.