A hyper-what-now?
A HyperLogLog is a probabilistic data structure used to count unique values — or as it’s referred to in mathematics: calculating the cardinality of a set.
These values can be anything: for example, IP addresses for the visitors of a website, search terms, or email addresses.
Counting unique values with exact precision requires an amount of memory proportional to the number of unique values. The reason for this is that there is no way of determining if a value has already been seen other than by comparing it to the previously seen values.
Since memory is a limited resource, doing this becomes problematic when working with large sets of values.
A HyperLogLog solves this problem by allowing to trade memory consumption for precision making it possible to estimate cardinalities larger than 109 with a standard error of 2% using only 1.5 kilobytes of memory[1].
How to use HyperLogLogs in Redis
HyperLogLogs have been available in Redis since version 2.8.9 and are accessed
using the PFADD
, PFCOUNT
, and PFMERGE
commands.
A Redis HyperLogLog consumes at most 12 kilobytes of memory and produces approximations with a standard error of 0.81%. The 12 kilobytes do not include the bytes required to store the actual key.
Ignoring PFMERGE
, using a HyperLogLog is very similar to using a Set for the
same purpose: instead of adding members with SADD
, add them with PFADD
and instead of retrieving the cardinality with SCARD
, retrieve it with
PFCOUNT
.
The example below shows the output of an interactive Redis session. You can
follow along by running redis-cli
and entering the commands shown on the lines
beginning with >
. Make sure an instance of redis-server
is also running.
> PFADD visitors alice bob carol
(integer) 1
> PFCOUNT visitors
(integer) 3
PFADD
returns 1
if the approximated cardinality of the HyperLogLog was
changed when adding the element. If not, it returns 0
.
After calculating the cardinality, PFCOUNT
will store the calculated value
in the last 8 bytes of the HyperLogLog to serve as a cache. This cache is
invalidated on the next PFADD
.
The PFMERGE
command is used to produce a HyperLogLog that approximates the
cardinality of the union of two or more existing HyperLogLogs:
> PFADD customers alice dan
(integer) 1
> PFMERGE everyone visitors customers
OK
> PFCOUNT everyone
(integer) 4
The same result can be achieved by supplying multiple keys to PFCOUNT
. In
this case an on-the-fly merge is performed:
> PFCOUNT visitors customers
(integer) 4
Under the hood, HyperLogLogs are actually stored as strings and can be GET
and SET
as such:
> GET visitors
"HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
> SET visitors "HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
OK
> PFCOUNT visitors
(integer) 3
This is useful if the HyperLogLog needs to be persisted elsewhere and later restored.
The PFADD
, PFCOUNT
, and PFMERGE
commands are prefixed with PF
in
memory of Philippe Flajolet, the inventor of the HyperLogLog algorithm.
The theory behind HyperLogLogs
To get started, let’s look at something more familiar than HyperLogLogs.
Imagine someone flipping a coin multiple times, while recording the maximum number of heads they get in a row. If they told you this number — would you be able to estimate how many times they’d flipped the coin?
Not with great accuracy, but since a longer run of heads is more probable when flipping the coin many times, the number of consecutive heads or tails in a series of coin flips can indicate how many times the coin was flipped: if the number of heads in a row is low, the number of coin flips is most probably also low. Conversely, if the number of heads in a row is high the number of coin flips is most probably high.
If the same number of coin flips were done and recorded separately, a much more precise estimate could be made by combining these numbers. This can be observed using the following methods:
def coin_flips(n)
Array.new(n) { [:heads, :tails].sample }
end
def heads_in_a_row(flips)
run = max = 0
flips.each do |flip|
if flip == :heads
run += 1
else
max = [run, max].max
run = 0
end
end
max
end
class Array
def average
reduce(:+).to_f / size
end
end
The bigger the number of coin flips, the bigger the number of heads in a row.
heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(100) #=> 5
heads_in_a_row coin_flips(1000) #=> 9
heads_in_a_row coin_flips(10000) #=> 15
The bigger the number of series, the better the stability of the output. With a more stable output, a better estimate can be made:
heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(10) #=> 7
heads_in_a_row coin_flips(10) #=> 4
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.449
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.442
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.469
Similarly to how the number of coin flips can be estimated by observing the number of heads in a row, the calculations of a HyperLogLog are based on the observation that the cardinality of a set of uniformly distributed random numbers can be estimated by counting the maximum number of leading zeros in the binary representation of the numbers.
For there to be leading zeros, the numbers must be represented as integers with the same number of bits, for instance as 64-bit unsigned integers.
If the maximum number of leading zeros is n, the estimated number of unique values in the set is 2n.
To make sure values are uniformly distributed and represented as same-size integers, the HyperLogLog algorithm applies a hash function to all values. This transforms the values into a set of uniformly distributed random numbers with the same cardinality as the original set.
The first few bits of the hashed values are used to divide them into different subsets, much like the separate series of coin flips in the example above. For each subset the maximum number of leading zeros within its values are stored in a register.
To calculate the approximate cardinality of the whole set, the estimates for all the subsets are combined using a harmonic mean.
An example use-case
Here’s an example of how to use a Redis HyperLogLog to count unique visitors of a site built with Sinatra using a custom Rack middleware:
require 'redis'
require 'sinatra'
$redis = Redis.new
class VisitorCounter
def initialize(app)
@app = app
end
def call(env)
$redis.pfadd 'visitors', Rack::Request.new(env).ip
@app.call(env)
end
end
use VisitorCounter
get '/' do
visitors = $redis.pfcount 'visitors'
"This website has been visited by #{visitors} unique visitors."
end
References
- Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”. AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.