HyperLogLogs in Redis

Calle Erlandsson

A hyper-what-now?

A HyperLogLog is a probabilistic data structure used to count unique values — or, as it’s referred to in mathematics, calculating the cardinality of a set.

These values can be anything: for example, IP addresses for the visitors of a website, search terms, or email addresses.

Counting unique values with exact precision requires an amount of memory proportional to the number of unique values. The reason for this is that there is no way of determining if a value has already been seen other than by comparing it to the previously seen values.

Since memory is a limited resource, doing this becomes problematic when working with large sets of values.
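To make the memory cost concrete, exact counting can be sketched with Ruby’s Set class: every distinct value has to be retained so that later values can be compared against it, so memory grows with the number of unique values.

```ruby
require 'set'

# Exact unique counting: every distinct value must be kept in memory
# so that new values can be compared against the ones already seen.
seen = Set.new
%w[alice bob alice carol bob].each { |visitor| seen.add(visitor) }

seen.size #=> 3
```

With millions of unique values, `seen` would hold millions of entries — this is exactly the cost a HyperLogLog avoids.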

A HyperLogLog solves this problem by trading precision for memory consumption, making it possible to estimate cardinalities larger than 10⁹ with a standard error of 2% using only 1.5 kilobytes of memory[1].

How to use HyperLogLogs in Redis

HyperLogLogs have been available in Redis since version 2.8.9 and are accessed using the PFADD, PFCOUNT, and PFMERGE commands.

A Redis HyperLogLog consumes at most 12 kilobytes of memory and produces approximations with a standard error of 0.81%. The 12 kilobytes do not include the bytes required to store the actual key.

Ignoring PFMERGE, using a HyperLogLog is very similar to using a Set for the same purpose: instead of adding members with SADD, add them with PFADD and instead of retrieving the cardinality with SCARD, retrieve it with PFCOUNT.

The example below shows the output of an interactive Redis session. You can follow along by running redis-cli and entering the commands shown on the lines beginning with >. Make sure an instance of redis-server is also running.

> PFADD visitors alice bob carol
(integer) 1
> PFCOUNT visitors
(integer) 3

PFADD returns 1 if the approximated cardinality of the HyperLogLog was changed when adding the element. If not, it returns 0.

After calculating the cardinality, PFCOUNT will store the calculated value in the last 8 bytes of the HyperLogLog to serve as a cache. This cache is invalidated on the next PFADD.

The PFMERGE command is used to produce a HyperLogLog that approximates the cardinality of the union of two or more existing HyperLogLogs:

> PFADD customers alice dan
(integer) 1
> PFMERGE everyone visitors customers
OK
> PFCOUNT everyone
(integer) 4

The same result can be achieved by supplying multiple keys to PFCOUNT. In this case an on-the-fly merge is performed:

> PFCOUNT visitors customers
(integer) 4

Under the hood, HyperLogLogs are actually stored as strings and can be GET and SET as such:

> GET visitors
"HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
> SET visitors "HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
OK
> PFCOUNT visitors
(integer) 3

This is useful if the HyperLogLog needs to be persisted elsewhere and later restored.

The PFADD, PFCOUNT, and PFMERGE commands are prefixed with PF in memory of Philippe Flajolet, the inventor of the HyperLogLog algorithm.

The theory behind HyperLogLogs

To get started, let’s look at something more familiar than HyperLogLogs.

Imagine someone flipping a coin multiple times, while recording the maximum number of heads they get in a row. If they told you this number — would you be able to estimate how many times they’d flipped the coin?

Not with great accuracy, but since a longer run of heads is more probable when flipping the coin many times, the longest run of heads in a series of coin flips can indicate how many times the coin was flipped: if the longest run of heads is short, the number of coin flips is most probably low. Conversely, if the longest run of heads is long, the number of coin flips is most probably high.

If several separate series of the same number of coin flips were recorded, a much more precise estimate could be made by combining the maximum runs from each series. This can be observed using the following methods:

# Simulate n coin flips.
def coin_flips(n)
  Array.new(n) { [:heads, :tails].sample }
end

# Return the length of the longest run of heads in a series of flips.
def heads_in_a_row(flips)
  run = max = 0

  flips.each do |flip|
    if flip == :heads
      run += 1
    else
      max = [run, max].max
      run = 0
    end
  end

  [run, max].max # account for a series that ends in heads
end

class Array
  def average
    reduce(:+).to_f / size
  end
end

The bigger the number of coin flips, the longer the maximum run of heads tends to be:

heads_in_a_row coin_flips(10)    #=> 3
heads_in_a_row coin_flips(100)   #=> 5
heads_in_a_row coin_flips(1000)  #=> 9
heads_in_a_row coin_flips(10000) #=> 15

The bigger the number of series, the more stable the averaged output. With a more stable output, a better estimate can be made:

heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(10) #=> 7
heads_in_a_row coin_flips(10) #=> 4

Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.449
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.442
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.469

Similar to how the number of coin flips can be estimated from the longest run of heads, the calculations of a HyperLogLog are based on the observation that the cardinality of a set of uniformly distributed random numbers can be estimated from the maximum number of leading zeros in the binary representations of those numbers.

For there to be leading zeros, the numbers must be represented as integers with the same number of bits, for instance as 64-bit unsigned integers.

If the maximum number of leading zeros is n, the estimated number of unique values in the set is 2ⁿ.
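This observation can be tried out with a short Ruby sketch. The 64-bit width and the sample size below are arbitrary choices for illustration:

```ruby
# Estimate the cardinality of a set of uniformly distributed 64-bit
# integers from the maximum number of leading zeros observed.
BITS = 64

def leading_zeros(x)
  x.zero? ? BITS : BITS - x.bit_length
end

def estimate(numbers)
  2**numbers.map { |n| leading_zeros(n) }.max
end

numbers = Array.new(100_000) { rand(2**BITS) }
estimate(numbers) # a power of two roughly in the vicinity of 100_000
```

A single observation like this has very high variance — which is why the full algorithm splits the values into many subsets and combines the estimates, as described below.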

To make sure values are uniformly distributed and represented as same-size integers, the HyperLogLog algorithm applies a hash function to all values. This transforms the values into a set of uniformly distributed random numbers with the same cardinality as the original set.
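As a sketch of this step, any hash function with well-distributed output will do. Here SHA-1, truncated to 64 bits, stands in for whatever (typically non-cryptographic) hash function an actual implementation uses:

```ruby
require 'digest'

# Hash an arbitrary value into a uniformly distributed 64-bit integer
# by taking the first 8 bytes of its SHA-1 digest.
def hash64(value)
  Digest::SHA1.digest(value.to_s)[0, 8].unpack1('Q>')
end

hash64('alice') == hash64('alice') #=> true (deterministic)
hash64('alice') < 2**64            #=> true (fits in 64 bits)
```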

The first few bits of the hashed values are used to divide them into different subsets, much like the separate series of coin flips in the example above. For each subset, the maximum number of leading zeros among its values is stored in a register.

To calculate the approximate cardinality of the whole set, the estimates for all the subsets are combined using a harmonic mean.
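Putting the pieces together, here is a toy estimator — not Redis’s implementation. The register count, the SHA-1-based hash, and the bias-correction constant alpha follow the scheme of the original paper[1] but are illustrative choices:

```ruby
require 'digest'

REGISTER_BITS = 10
REGISTERS = 2**REGISTER_BITS # 1024 registers
ALPHA = 0.7213 / (1 + 1.079 / REGISTERS) # bias-correction constant

# Hash a value into a uniformly distributed 64-bit integer.
def hll_hash(value)
  Digest::SHA1.digest(value.to_s)[0, 8].unpack1('Q>')
end

# Position of the first 1-bit in a `bits`-wide integer, counted from
# the left (i.e. the number of leading zeros plus one).
def rho(x, bits)
  bits - x.bit_length + 1
end

def estimate_cardinality(values)
  registers = Array.new(REGISTERS, 0)

  values.each do |value|
    h = hll_hash(value)
    index = h >> (64 - REGISTER_BITS)            # first bits pick a register
    rest = h & ((1 << (64 - REGISTER_BITS)) - 1) # remaining bits
    registers[index] = [registers[index], rho(rest, 64 - REGISTER_BITS)].max
  end

  # Combine the registers using a harmonic mean.
  harmonic = registers.sum { |r| 2.0**-r }
  (ALPHA * REGISTERS**2 / harmonic).round
end

estimate_cardinality((1..10_000).map(&:to_s)) # an estimate close to 10_000
```

With 1,024 registers the expected standard error is about 1.04/√1024 ≈ 3.25%, so the estimate for 10,000 distinct values typically lands within a few hundred of the true count.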

An example use-case

Here’s an example of how to use a Redis HyperLogLog to count unique visitors of a site built with Sinatra using a custom Rack middleware:

require 'redis'
require 'sinatra'

$redis = Redis.new

# Rack middleware that records each visitor's IP address in a
# HyperLogLog before passing the request on.
class VisitorCounter
  def initialize(app)
    @app = app
  end

  def call(env)
    $redis.pfadd 'visitors', Rack::Request.new(env).ip
    @app.call(env)
  end
end

use VisitorCounter

get '/' do
  visitors = $redis.pfcount 'visitors'
  "This website has been visited by #{visitors} unique visitors."
end

References

  1. Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”. AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.