---
title: HyperLogLogs in Redis
teaser: Learn what a HyperLogLog is and how you can use it to count billions of unique
  things.
tags: redis,web,ruby
author: Calle Erlandsson
published_on: 2016-08-01
---

*A hyper-what-now?*

A <dfn>HyperLogLog</dfn> is a probabilistic data structure used to count unique
values — or as it's referred to in mathematics: calculating the cardinality of a
set.

These values can be anything: for example, IP addresses for the visitors of a
website, search terms, or email addresses.

Counting unique values with *exact precision* requires an amount of memory
proportional to the number of unique values. The reason for this is that there
is no way of determining if a value has already been seen other than by
comparing it to the previously seen values.

Since memory is a limited resource, doing this becomes problematic when working
with *large sets of values*.

A HyperLogLog solves this problem by allowing to trade memory consumption for
precision making it possible to estimate cardinalities larger than
10<sup>9</sup> with a standard error of 2% using only 1.5 kilobytes of
memory<sup>[[1]](#1)</sup>.

## How to use HyperLogLogs in Redis

HyperLogLogs have been available in Redis since version 2.8.9 and are accessed
using the [`PFADD`], [`PFCOUNT`], and [`PFMERGE`] commands.

A Redis HyperLogLog consumes at most 12 kilobytes of memory and produces
approximations with a standard error of 0.81%. The 12 kilobytes do not include
the bytes required to store the actual key.

Ignoring [`PFMERGE`], using a HyperLogLog is very similar to using a Set for the
same purpose: instead of adding members with [`SADD`], add them with [`PFADD`]
and instead of retrieving the cardinality with [`SCARD`], retrieve it with
[`PFCOUNT`].

The example below shows the output of an interactive Redis session. You can
follow along by running `redis-cli` and entering the commands shown on the lines
beginning with `>`. Make sure an instance of `redis-server` is also running.

    > PFADD visitors alice bob carol
    (integer) 1
    > PFCOUNT visitors
    (integer) 3

[`PFADD`] returns `1` if the approximated cardinality of the HyperLogLog was
changed when adding the element. If not, it returns `0`.

After calculating the cardinality, [`PFCOUNT`] will store the calculated value
in the last 8 bytes of the HyperLogLog to serve as a cache. This cache is
invalidated on the next [`PFADD`].

The [`PFMERGE`] command is used to produce a HyperLogLog that approximates the
cardinality of the union of two or more existing HyperLogLogs:

    > PFADD customers alice dan
    (integer) 1
    > PFMERGE everyone visitors customers
    OK
    > PFCOUNT everyone
    (integer) 4

The same result can be achieved by supplying multiple keys to [`PFCOUNT`]. In
this case an on-the-fly merge is performed:

    > PFCOUNT visitors customers
    (integer) 4

Under the hood, HyperLogLogs are actually stored as strings and can be [`GET`]
and [`SET`] as such:

    > GET visitors
    "HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
    > SET visitors "HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
    OK
    > PFCOUNT visitors
    (integer) 3

This is useful if the HyperLogLog needs to be persisted elsewhere and later
restored.

The [`PFADD`], [`PFCOUNT`], and [`PFMERGE`] commands are prefixed with `PF` in
memory of [Philippe Flajolet], the inventor of the HyperLogLog algorithm.

## The theory behind HyperLogLogs

To get started, let's look at something more familiar than HyperLogLogs.

Imagine someone flipping a coin multiple times, while recording the maximum
number of heads they get in a row. If they told you this number — would you be
able to estimate how many times they'd flipped the coin?

Not with great accuracy, but since a longer run of heads is more probable when
flipping the coin many times, the number of consecutive heads or tails in a
series of coin flips *can* indicate how many times the coin was flipped: if the
number of heads in a row is low, the number of coin flips is most probably
*also* low. Conversely, if the number of heads in a row is high the number of
coin flips is most probably high.

If the same number of coin flips were done and recorded separately, a much more
precise estimate could be made by combining these numbers. This can be
observed using the following methods:

```ruby
def coin_flips(n)
  Array.new(n) { [:heads, :tails].sample }
end

def heads_in_a_row(flips)
  run = max = 0

  flips.each do |flip|
    if flip == :heads
      run += 1
    else
      max = [run, max].max
      run = 0
    end
  end

  max
end

class Array
  def average
    reduce(:+).to_f / size
  end
end
```

The bigger the number of coin flips, the bigger the number of heads in a row.

```ruby
heads_in_a_row coin_flips(10)    #=> 3
heads_in_a_row coin_flips(100)   #=> 5
heads_in_a_row coin_flips(1000)  #=> 9
heads_in_a_row coin_flips(10000) #=> 15
```

The bigger the number of series, the better the stability of the output. With a
more stable output, a better estimate can be made:

```ruby
heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(10) #=> 7
heads_in_a_row coin_flips(10) #=> 4

Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.449
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.442
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.469
```

Similarly to how the number of coin flips can be estimated by observing the
number of heads in a row, the calculations of a HyperLogLog are based on the
observation that the cardinality of a set of *uniformly distributed random
numbers* can be estimated by counting the maximum number of leading zeros in the
binary representation of the numbers.

For there to *be* leading zeros, the numbers must be represented as integers
with the same number of bits, for instance as 64-bit unsigned integers.

If the maximum number of leading zeros is *n*, the estimated number of unique
values in the set is 2<sup>*n*</sup>.

To make sure values are uniformly distributed and represented as same-size
integers, the HyperLogLog algorithm applies a hash function to all values. This
transforms the values into a set of uniformly distributed random numbers with
the same cardinality as the original set.

The first few bits of the hashed values are used to divide them into different
subsets, much like the separate series of coin flips in the example above. For
each subset the maximum number of leading zeros within its values are stored in
a register.

To calculate the approximate cardinality of the whole set, the estimates for all
the subsets are combined using a [harmonic mean].

## An example use-case

Here's an example of how to use a Redis HyperLogLog to count unique visitors of
a site built with [Sinatra] using a custom [Rack] middleware:

```ruby
require 'redis'
require 'sinatra'

$redis = Redis.new

class VisitorCounter
  def initialize(app)
    @app = app
  end

  def call(env)
    $redis.pfadd 'visitors', Rack::Request.new(env).ip
    @app.call(env)
  end
end

use VisitorCounter

get '/' do
  visitors = $redis.pfcount 'visitors'
  "This website has been visited by #{visitors} unique visitors."
end
```

## References

1. <span id="1">Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007).
   ["HyperLogLog: the analysis of a near-optimal cardinality estimation
   algorithm"][paper]. AOFA ’07: Proceedings of the 2007 International
   Conference on the Analysis of Algorithms.</span>

[Philippe Flajolet]: https://en.wikipedia.org/wiki/Philippe_Flajolet
[Rack]: https://rack.github.io
[Sinatra]: http://www.sinatrarb.com/
[`GET`]: http://redis.io/commands/get
[`PFADD`]: http://redis.io/commands/pfadd
[`PFCOUNT`]: http://redis.io/commands/pfcount
[`PFMERGE`]: http://redis.io/commands/pfmerge
[`SADD`]: http://redis.io/commands/sadd
[`SCARD`]: http://redis.io/commands/scard
[`SET`]: http://redis.io/commands/set
[harmonic mean]: https://en.wikipedia.org/wiki/Harmonic_mean
[paper]: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
