How We Used a Custom Enumerator to Fix a Production Problem

Enumerators are a powerful Ruby feature. Did you know you can build your own enumerators with whatever iteration logic you need? Today, I will tell you how we fixed a production problem with a custom enumerator.

Problem

We had a method like the following:

def customer_property_ids
  sql = "SELECT DISTINCT PropertyId from AddressMatch"
  client.execute(sql).map { _1["PropertyId"] }
end

It runs a Microsoft SQL Server query and returns an array of property IDs. Ruby arrays include the Enumerable module, so we get its rich assortment of iteration methods for free.
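For instance, slicing an array of IDs into batches is a one-liner. (This is a toy illustration, not code from the actual class.)

ids = [101, 102, 103, 104, 105]

ids.each_slice(2) { |batch| puts batch.inspect }
# => [101, 102]
#    [103, 104]
#    [105]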

Other methods from the same class relied on the array returned by customer_property_ids; we had three occurrences of the following call throughout the class:

customer_property_ids.each_slice(BATCH_SIZE) do |batch_ids|
  # ...
end

One day the application went down, and we were able to pinpoint the cause of the crash: the SQL Server query returned more than 5 million property IDs, and materializing them all at once exhausted memory and made the Ruby process core dump.

We needed to push a quick fix to production to get the application back up while limiting the scope of the changes for maximum safety. Getting back to business was a matter of the utmost urgency. What to do?

Solution

We would avoid the core dump if we kept fewer IDs in memory at a time, and that’s where our custom enumerator comes in. The first thing we did was figure out how to paginate a query in SQL Server. Here’s how we do it:

SELECT DISTINCT PropertyId
FROM AddressMatch
ORDER BY PropertyId ASC
OFFSET 0 ROWS FETCH FIRST 100000 ROWS ONLY

This query retrieves the first 100,000 IDs. By interpolating increasing OFFSET values (with the same FETCH FIRST limit) into the SQL string, we can retrieve the subsequent pages.
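As a sketch of what that interpolation looks like in Ruby, the per-page SQL can be built with a small helper; paginated_sql below is a hypothetical name, not part of the original code:

# Hypothetical helper: builds the SQL for a single page of results.
def paginated_sql(offset, limit)
  <<~SQL
    SELECT DISTINCT PropertyId
    FROM AddressMatch
    ORDER BY PropertyId ASC
    OFFSET #{offset} ROWS FETCH FIRST #{limit} ROWS ONLY
  SQL
end

puts paginated_sql(0, 100_000)       # first page
puts paginated_sql(100_000, 100_000) # second page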

Finally, we limited the changes to a single spot: the customer_property_ids method. Putting everything together, the enumerator code looked as follows:

def customer_property_ids
  Enumerator.new do |yielder|
    limit = 100_000
    offset = 0

    loop do
      sql = <<~SQL
        SELECT DISTINCT PropertyId
        FROM AddressMatch
        ORDER BY PropertyId ASC
        OFFSET #{offset} ROWS FETCH FIRST #{limit} ROWS ONLY
      SQL

      # Fetch one page and keep only the IDs.
      ids = client.execute(sql).map { |row| row["PropertyId"] }

      # An empty page means we have drained the results.
      break if ids.none?

      # Feed each ID from this page to the yielder.
      ids.each { |id| yielder << id }

      offset += limit
    end
  end
end

In the enumerator, we manage a limit and an offset and execute the query within a loop until it returns an empty result. Each time the query returns the results for a page, we give the property IDs to the yielder.

With that scoped change, we avoided touching the methods that call customer_property_ids.each_slice. We built our own custom enumerator that paginates the query results and avoids memory bloat, and pagination remains an internal detail encapsulated by the enumerator: the callers never see it.
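To see the whole idea end to end in isolation, here is a self-contained toy version that paginates over an in-memory array instead of SQL Server; ALL_IDS, fake_fetch, and paginated_ids are illustrative stand-ins, not the production code:

ALL_IDS = (1..10).to_a

# Stand-in for client.execute: returns one "page" of IDs.
def fake_fetch(offset, limit)
  ALL_IDS[offset, limit] || []
end

def paginated_ids
  Enumerator.new do |yielder|
    limit  = 4
    offset = 0

    loop do
      page = fake_fetch(offset, limit)
      break if page.none?

      page.each { |id| yielder << id }
      offset += limit
    end
  end
end

# Callers consume it exactly like an array:
paginated_ids.each_slice(3) { |batch| puts batch.inspect }
# => [1, 2, 3]
#    [4, 5, 6]
#    [7, 8, 9]
#    [10]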

Explanation

Let’s explain the solution from a mental model perspective, which doesn’t necessarily correspond to the internal Ruby implementation details.

The magic of enumerators is that they can use Fibers underneath: in our example, whenever we shove a value into the yielder, the enumerator's block pauses and control returns to each_slice. If each_slice does not yet have enough values to fill a slice, it resumes the block and asks for more. Here's a brief example:

enum = Enumerator.new do |yielder|
  puts "Starting the custom enumerator..."

  [1, 2].each do |n|
    puts "Giving #{n} to the yielder"
    yielder << n
  end

  puts "each_slice is still asking for more values..."

  [3, 4].each do |n|
    puts "Giving #{n} to the yielder"
    yielder << n
  end
end

enum.each_slice(3) do |slice|
  puts "We have enough, let's take a slice: #{slice}"
end

And the output:

Starting the custom enumerator...
Giving 1 to the yielder
Giving 2 to the yielder
each_slice is still asking for more values...
Giving 3 to the yielder
We have enough, let's take a slice: [1, 2, 3]
Giving 4 to the yielder
We have enough, let's take a slice: [4]

Back to the customer_property_ids enumerator: that continuous context switching ensures that values are processed as the yielder is fed, which lets the garbage collector discard results that have already been consumed. Compare that to eagerly loading all 5 million IDs into memory!
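If you want to observe that pausing behavior directly, external iteration with Enumerator#next makes it visible; this is a small irb experiment, unrelated to the production code:

enum = Enumerator.new do |yielder|
  3.times do |i|
    puts "Producing #{i}"
    yielder << i
  end
end

puts enum.next # prints "Producing 0", then 0
puts enum.next # prints "Producing 1", then 1
# The block stays suspended between calls and produces values on demand.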

While working on that fix, we noticed the code could benefit from further refactoring to scale better, but that simple, focused change quickly put our client back in business.