Dive into Ruby enumeration

Enumeration is Ruby's term for iteration, and there are significant differences in how it is handled within Ruby's existing conventions when compared to other programming languages. Let us first establish some terms:

Enumerable, a built-in mixin. Other languages usually call this "iterable".
Enumerator, a built-in class. Other languages usually call this "iterator".

Refresher: `each` with a block

The most basic way to iterate over something in Ruby is to call the #each method on an object, with a block:

[1, 2, 3].each { |x| puts x }

The #each method forms the backbone of enumeration in Ruby. For example, it is very common to make something enumerable by including the mixin Enumerable in a class, and the only pre-requisite for using this mixin is implementing the #each method. This allows the class to get a lot of methods "for free", e.g. #any? and #all? can all be implemented in terms of #each.

Blocks and `yield`

Let's say we want to make an object that produces 1, 2, and 3. The easy way is to write (1..3), but let's do this from scratch:

class Podium
  include Enumerable

  def each
    yield 1
    yield 2
    yield 3
  end
end

Here, we are on the other side of the fence: rather than using #each, we are now implementing #each. The yield keyword is of interest: it calls the block that is passed to the method. In other words, the following code is equivalent to the code above:

class Podium
  include Enumerable

  def each(&blk)
    blk.call(1)
    blk.call(2)
    blk.call(3)
  end
end

The & sigil in &blk binds the block — provided by the caller — to the name blk. If the caller failed to provide a block, then blk becomes nil. The following two are equivalent:

yield v if block_given?
blk.call(v) unless blk.nil?

Remember that a block by itself is not an object. Consult "What is the difference between a block, a proc, and a lambda in Ruby?" for more information on this topic.

Warning: if you are coming from Python, the yield keyword in Ruby does not turn a function into a generator. The strict Ruby equivalent of a Python generator is called a Fiber, and fibers are usually not used for iteration, unlike their Python counterparts. Hang tight for a straightforward Ruby analog to Pythonic iterative generators! We will explore them near the end.

Materialization and enumerators

The #each method effectively divorces how to iterate over a container from what to do with each iterated element. The method handles the how, and the given block handles the what. In other words, the method is responsible for materializing every element inside the container.

What if we only want the very first element from #each, and we do not have access to other methods, such as #[]? That is where an enumerator comes in. Starting from Ruby 1.9, the convention is that if we call #each without a block, then it returns an enumerator object. We can then call the #next method on that enumerator to produce just one element. We can see this behavior in a built-in array:

irb(main):001:0> a = [3, 5, 7]
=> [3, 5, 7]
irb(main):002:0> e = a.each
=> #<Enumerator: [3, 5, 7]:each>
irb(main):003:0> e.next
=> 3

Our Podium class doesn't have this behavior yet, but Ruby provides the Object#to_enum method on all objects to make it easy to support this behavior:

class Podium
  include Enumerable

  def each
    return to_enum unless block_given?
    yield 1
    yield 2
    yield 3
  end
end

It works exactly the same way:

irb(main):011:0> p = Podium.new
=> #<Podium:0x00007fe189942358>
irb(main):012:0> e = p.each
=> #<Enumerator: #<Podium:0x00007fe189942358>:each>
irb(main):013:0> e.next
=> 1

A deeper question is: why would we want to materialize one element at a time? The value lies in space efficiency: by supporting one-at-a-time materialization, we can operate in a low-memory environment without fear of running out of memory when given an enormous data source.

Aside: `#to_proc`

In Ruby, the #to_proc method is the convention for converting something into a callable. For example, the built-in type Symbol implements #to_proc, so that we can abbreviate { |x| x.odd? } into :odd?.to_proc. This process is so common that the unary operator & does the exact same thing, allowing us to further abbreviate :odd?.to_proc as &:odd.

The choice of the punctuation & visually connects to writing &blk in an arguments list, where the meaning of & similarly means "capture a block as a proc".

For a deeper dive, consult "How Does Symbol#to_proc Work?".

Operation chaining: intermediate data structures

A common idiom in Ruby is to chain multiple enumerable methods in order to produce a list of results. For example, if we want to take the numbers from 0 to 99 (inclusive), add one to each of them, pick only the odd ones, and sum them up, we could write:

(0..99)
  .map(&:succ)
  .select(&:odd?)
  .reduce(0, &:+)

The (0..99) is syntax sugar for constructing a Range object. The desugared form is Range.new(0, 99). This chain of methods looks great until we look at what each step of the process does:

irb(main):001:0> (0..99).map(&:succ)
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

It turns out that each call produces an entirely new array! If our data source had a million elements, we would be creating an additional, intermediate million-long array for each additional method we add to the chain.

In the case of arrays, we can cheat this cost away by using mutative methods:

(0..99)
  .to_a
  .map!(&:succ)
  .select!(&:odd?)
  .reduce(0, &:+)

Here, we first create a new array from (0..99) using #to_a, which guarantees us access to those mutative methods. But this forces us to remember when mutative methods are available, and which ones are possible. The cleaner solution should be that every method call avoids immediately doing any work until we actually want the result!

Aside: ActiveRecord prior art

In Rails, when dealing with data models, ActiveRecord actually does avoid doing any work until you want the result. Every method you call gives back a relation object, which records the SQL query it has built up so far. It's not until you try to use a model object that the query gets executed!

Operation chaining: removing intermediates

It turns out that there is a method called Enumerator#lazy that gives us the ability to eliminate intermediates for free:

(0..99)
  .each
  .lazy
  .map(&:succ)
  .select(&:odd?)
  .reduce(0, &:+)

The call above does not build any intermediate arrays. This is so common, in fact, that another convention is that if an object supports returning an enumerator from #each, it also supports calling #lazy as a shorthand for .each.lazy:

(0..99)
  .lazy
  .map(&:succ)
  .select(&:odd?)
  .reduce(0, &:+)

We can now apply the same behavior to podium:

class Podium
  include Enumerable

  def each
    return to_enum unless block_given?
    yield 1
    yield 2
    yield 3
  end

  def lazy
    each.lazy
  end
end

We can verify that it's not until we actually try to use each value that those values are calculated:

irb(main):015:0> p = Podium.new
=> #<Podium:0x00007fb64cf13ce0>
irb(main):016:0> p.lazy.map(&:succ).reject(&:odd?)
=> #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator: #<Podium:0x00007fb64cf13ce0>:each>>:map>:reject>
irb(main):017:0> p.lazy.map(&:succ).reject(&:odd?).each { |i| puts i }
2
4
=> nil

The Enumerator::Lazy class, used to implement Enumerator#lazy, also supports a #force method that lets use force an eager evaluation all at once:

irb(main):018:0> p.lazy.map(&:succ).reject(&:odd?).force
=> [2, 4]

Its behavior is the same as calling #to_a.

Converted enumerator vs enumerator from scratch

So far we've built an enumerator by having an object wrap itself in an enumerator. It turns out there is a second way to do this: by passing a block to Enumerator.new:

def podium
  Enumerator.new do |yielder|
    yielder.yield 1
    yielder.yield 2
    yielder.yield 3
  end
end

It can be used in exactly the same way:

irb(main):008:0> p = podium
=> #<Enumerator: #<Enumerator::Generator:0x00007fdaa15e9d08>:each>
irb(main):009:0> p.lazy.map(&:succ).reject(&:odd?)
=> #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator: #<Enumerator::Generator:0x00007fdaa15e9d08>:each>>:map>:reject>
irb(main):010:0> p.lazy.map(&:succ).reject(&:odd?).each { |i| puts i }
2
4
=> nil

Making an enumerator from scratch is very handy if we find ourselves not needing a class.

If you're coming from Python, the block form of Enumerator.new is the closest Ruby analog to Python generators. In fact, the low-level way to use an enumerator works almost exactly the same way:
irb(main):008:0> p = podium
=> #<Enumerator: #<Enumerator::Generator:0x00007fa2021f2b78>:each>
irb(main):009:0> p.next
=> 1
irb(main):010:0> p.next
=> 2
irb(main):011:0> p.next
=> 3
irb(main):012:0> p.next
Traceback (most recent call last):
        2: from (irb):12
        1: from (irb):12:in `next'
StopIteration (iteration reached an end)
Enumerators in Ruby are also powered by fibers, which brings us to the question: when should one be used over the other? If the intent is to make something iterable, then enumerators are a better choice, because they include the Enumerable mixin by default. Otherwise, if the purpose is to use coroutines as a way to perform flow control, then fibers provide a much more direct API.

kourge/dive-into-ruby-enumeration.md

Dive into Ruby enumeration

Refresher: `each` with a block

Blocks and `yield`

Materialization and enumerators

Aside: `#to_proc`

Operation chaining: intermediate data structures

Aside: ActiveRecord prior art

Operation chaining: removing intermediates

Converted enumerator vs enumerator from scratch

ianwremmel commented May 16, 2018

kourge/dive-into-ruby-enumeration.md

Dive into Ruby enumeration

Refresher: each with a block

Blocks and yield

Materialization and enumerators

Aside: #to_proc

Operation chaining: intermediate data structures

Aside: ActiveRecord prior art

Operation chaining: removing intermediates

Converted enumerator vs enumerator from scratch

ianwremmel commented May 16, 2018

Refresher: `each` with a block

Blocks and `yield`

Aside: `#to_proc`