Histogram Distribution in Ruby is Double Tally

Here’s a fun mental shortcut I recently learned: getting the histogram distribution in Ruby is double #tally.

Consider this opening passage from the novel A Tale of Two Cities:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness

We’d like to build a histogram of letter frequency. How would we get those numbers?

Per-letter distribution

Enumerable#tally was introduced in Ruby 2.7 and is equivalent to the older idiom group_by(&:itself).transform_values(&:count). We start by turning the string of text into a cleaned array of normalized characters, which I’m calling corpus.
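The cleaning step isn’t shown here, so the following is only a minimal sketch of one way to build corpus. It assumes that "cleaned" and "normalized" mean lowercasing the text and keeping only the letters a-z. PASSAGE and build_corpus are hypothetical names used for illustration.

```ruby
# Opening passage quoted above.
PASSAGE = "It was the best of times, it was the worst of times, " \
          "it was the age of wisdom, it was the age of foolishness, " \
          "it was the epoch of belief, it was the epoch of incredulity, " \
          "it was the season of Light, it was the season of Darkness"

# Hypothetical helper (an assumption, not necessarily the original cleaning
# code): lowercase the text and keep only the letters a-z.
def build_corpus(text)
  text.downcase.scan(/[a-z]/)
end

corpus = build_corpus(PASSAGE)

# Enumerable#tally (Ruby 2.7+) and the older idiom produce the same counts.
older_idiom = corpus.group_by(&:itself).transform_values(&:count)
puts corpus.tally == older_idiom   # prints true
puts corpus.tally["e"]             # prints 22
```

Under these assumptions, the counts agree with the per-letter numbers listed below.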

Calling #tally on that array tells us how often each letter is used (shown here sorted by count):

> corpus.tally
=> {"k"=>1,
 "y"=>1,
 "u"=>1,
 "b"=>2,
 "p"=>2,
 "m"=>3,
 "r"=>3,
 "g"=>3,
 "d"=>3,
 "c"=>3,
 "l"=>4,
 "n"=>5,
 "f"=>10,
 "w"=>10,
 "h"=>12,
 "a"=>13,
 "o"=>16,
 "i"=>16,
 "t"=>22,
 "s"=>22,
 "e"=>22}

Graphing it, we get:

Bar graph showing how frequently each letter is used

Frequency of frequencies

This isn’t quite good enough, though. A histogram doesn’t answer the question “how many times did the letter e get used?” Instead, it answers the question “how many letters get used 22 times?”

We no longer care about the individual letters. Instead, we want to #tally the frequencies:

> corpus.tally.values.tally
=> {1=>3, 2=>2, 3=>5, 4=>1, 5=>1, 10=>2, 12=>1, 13=>1, 16=>2, 22=>3}

And the graph looks like this:

Histogram showing how often each frequency occurs

Three letters got used once, two letters got used twice, and so on. We can see that most letters get used only a few times, but there are some outliers that get used extremely frequently.
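To make the double #tally concrete, here is a small sketch that starts from the per-letter counts listed earlier and prints a rough text histogram. LETTER_COUNTS and frequency_histogram are hypothetical names. The extra sort is an assumption added for display, since #tally on its own returns keys in the order they are first seen.

```ruby
# Per-letter counts copied from the corpus.tally output earlier in the post.
LETTER_COUNTS = {
  "k" => 1, "y" => 1, "u" => 1, "b" => 2, "p" => 2, "m" => 3, "r" => 3,
  "g" => 3, "d" => 3, "c" => 3, "l" => 4, "n" => 5, "f" => 10, "w" => 10,
  "h" => 12, "a" => 13, "o" => 16, "i" => 16, "t" => 22, "s" => 22, "e" => 22
}.freeze

# Hypothetical helper: the second tally counts how many letters share each
# frequency; sorting by frequency is only there to make the output readable.
def frequency_histogram(letter_counts)
  letter_counts.values.tally.sort.to_h
end

# Print one row per frequency, with one "#" per letter used that many times.
frequency_histogram(LETTER_COUNTS).each do |frequency, letters|
  puts format("%2d | %s", frequency, "#" * letters)
end
```

Each row is a frequency, and the length of its bar is the number of letters that occur that often.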

The next time you’re trying to calculate a histogram, bring out the handy double #tally!

Full corpus

And for those who are curious, here’s what the distribution looks like for the whole book. Note that each column represents a range of frequencies, in slices of 5000.
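How the slices were computed isn’t shown. One plausible approach, sketched here as an assumption, is to round each count down to the start of its slice before the second #tally. binned_histogram is a hypothetical helper, and the example reuses the passage’s counts with a slice width of 5 rather than the whole book’s numbers, which aren’t listed in this post.

```ruby
# Hypothetical helper: map each count to the lower bound of its bin
# (integer division rounds down), then tally the bins and sort them.
def binned_histogram(counts, width)
  counts.map { |count| count / width * width }.tally.sort.to_h
end

# Per-letter counts from the opening passage (the values of corpus.tally).
passage_counts = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5,
                  10, 10, 12, 13, 16, 16, 22, 22, 22]

# With a width of 5, the bins start at 0, 5, 10, 15 and 20
# and hold 11, 1, 4, 2 and 3 letters respectively.
binned_histogram(passage_counts, 5).each do |bin_start, letters|
  puts "#{bin_start}-#{bin_start + 4}: #{letters}"
end
```

For the whole book, the same helper would be called with a width of 5000. This sketch assumes the slices start at zero.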

Histogram of letter frequency for whole book