*A hyper-what-now?*

A HyperLogLog is a probabilistic data structure used to count unique values — or as it’s referred to in mathematics: calculating the cardinality of a set.

These values can be anything: for example, IP addresses for the visitors of a website, search terms, or email addresses.

Counting unique values with *exact precision* requires an amount of memory
proportional to the number of unique values. The reason for this is that there
is no way of determining if a value has already been seen other than by
comparing it to the previously seen values.

Since memory is a limited resource, doing this becomes problematic when working
with *large sets of values*.

A HyperLogLog solves this problem by allowing to trade memory consumption for
precision making it possible to estimate cardinalities larger than
10^{9} with a standard error of 2% using only 1.5 kilobytes of
memory^{[1]}.

## How to use HyperLogLogs in Redis

HyperLogLogs have been available in Redis since version 2.8.9 and are accessed
using the `PFADD`

, `PFCOUNT`

, and `PFMERGE`

commands.

A Redis HyperLogLog consumes at most 12 kilobytes of memory and produces approximations with a standard error of 0.81%. The 12 kilobytes do not include the bytes required to store the actual key.

Ignoring `PFMERGE`

, using a HyperLogLog is very similar to using a Set for the
same purpose: instead of adding members with `SADD`

, add them with `PFADD`

and instead of retrieving the cardinality with `SCARD`

, retrieve it with
`PFCOUNT`

.

The example below shows the output of an interactive Redis session. You can
follow along by running `redis-cli`

and entering the commands shown on the lines
beginning with `>`

. Make sure an instance of `redis-server`

is also running.

```
> PFADD visitors alice bob carol
(integer) 1
> PFCOUNT visitors
(integer) 3
```

`PFADD`

returns `1`

if the approximated cardinality of the HyperLogLog was
changed when adding the element. If not, it returns `0`

.

After calculating the cardinality, `PFCOUNT`

will store the calculated value
in the last 8 bytes of the HyperLogLog to serve as a cache. This cache is
invalidated on the next `PFADD`

.

The `PFMERGE`

command is used to produce a HyperLogLog that approximates the
cardinality of the union of two or more existing HyperLogLogs:

```
> PFADD customers alice dan
(integer) 1
> PFMERGE everyone visitors customers
OK
> PFCOUNT everyone
(integer) 4
```

The same result can be achieved by supplying multiple keys to `PFCOUNT`

. In
this case an on-the-fly merge is performed:

```
> PFCOUNT visitors customers
(integer) 4
```

Under the hood, HyperLogLogs are actually stored as strings and can be `GET`

and `SET`

as such:

```
> GET visitors
"HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
> SET visitors "HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
OK
> PFCOUNT visitors
(integer) 3
```

This is useful if the HyperLogLog needs to be persisted elsewhere and later restored.

The `PFADD`

, `PFCOUNT`

, and `PFMERGE`

commands are prefixed with `PF`

in
memory of Philippe Flajolet, the inventor of the HyperLogLog algorithm.

## The theory behind HyperLogLogs

To get started, let’s look at something more familiar than HyperLogLogs.

Imagine someone flipping a coin multiple times, while recording the maximum number of heads they get in a row. If they told you this number — would you be able to estimate how many times they’d flipped the coin?

Not with great accuracy, but since a longer run of heads is more probable when
flipping the coin many times, the number of consecutive heads or tails in a
series of coin flips *can* indicate how many times the coin was flipped: if the
number of heads in a row is low, the number of coin flips is most probably
*also* low. Conversely, if the number of heads in a row is high the number of
coin flips is most probably high.

If the same number of coin flips were done and recorded separately, a much more precise estimate could be made by combining these numbers. This can be observed using the following methods:

```
def coin_flips(n)
Array.new(n) { [:heads, :tails].sample }
end
def heads_in_a_row(flips)
run = max = 0
flips.each do |flip|
if flip == :heads
run += 1
else
max = [run, max].max
run = 0
end
end
max
end
class Array
def average
reduce(:+).to_f / size
end
end
```

The bigger the number of coin flips, the bigger the number of heads in a row.

```
heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(100) #=> 5
heads_in_a_row coin_flips(1000) #=> 9
heads_in_a_row coin_flips(10000) #=> 15
```

The bigger the number of series, the better the stability of the output. With a more stable output, a better estimate can be made:

```
heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(10) #=> 7
heads_in_a_row coin_flips(10) #=> 4
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.449
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.442
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.469
```

Similarly to how the number of coin flips can be estimated by observing the
number of heads in a row, the calculations of a HyperLogLog are based on the
observation that the cardinality of a set of *uniformly distributed random
numbers* can be estimated by counting the maximum number of leading zeros in the
binary representation of the numbers.

For there to *be* leading zeros, the numbers must be represented as integers
with the same number of bits, for instance as 64-bit unsigned integers.

If the maximum number of leading zeros is *n*, the estimated number of unique
values in the set is 2^{n}.

To make sure values are uniformly distributed and represented as same-size integers, the HyperLogLog algorithm applies a hash function to all values. This transforms the values into a set of uniformly distributed random numbers with the same cardinality as the original set.

The first few bits of the hashed values are used to divide them into different subsets, much like the separate series of coin flips in the example above. For each subset the maximum number of leading zeros within its values are stored in a register.

To calculate the approximate cardinality of the whole set, the estimates for all the subsets are combined using a harmonic mean.

## An example use-case

Here’s an example of how to use a Redis HyperLogLog to count unique visitors of a site built with Sinatra using a custom Rack middleware:

```
require 'redis'
require 'sinatra'
$redis = Redis.new
class VisitorCounter
def initialize(app)
@app = app
end
def call(env)
$redis.pfadd 'visitors', Rack::Request.new(env).ip
@app.call(env)
end
end
use VisitorCounter
get '/' do
visitors = $redis.pfcount 'visitors'
"This website has been visited by #{visitors} unique visitors."
end
```

## References

- Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”. AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.