Here’s a fun mental shortcut I recently learned: getting a histogram distribution in Ruby is a double #tally.
Consider this opening passage from the novel A Tale of Two Cities:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness
We’d like to build a histogram of letter frequency. How would we get those numbers?
Per-letter distribution
Enumerable#tally is a method that was introduced in Ruby 2.7, equivalent to the older solution group_by(&:itself).transform_values(&:count). We start by turning the string of text into a cleaned array of normalized characters, which I’m calling corpus.
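The cleanup step isn't shown here, so this is a minimal sketch of one way to build corpus. It assumes "cleaned" means letters only and "normalized" means lowercased; the text variable and the regex are illustrative choices, not necessarily the exact code behind the numbers in this post. It also checks that #tally agrees with the older group_by idiom.

```ruby
# Assumed input: the opening passage quoted above, stored in `text`.
text = "It was the best of times, it was the worst of times, " \
       "it was the age of wisdom, it was the age of foolishness, " \
       "it was the epoch of belief, it was the epoch of incredulity, " \
       "it was the season of Light, it was the season of Darkness"

# Lowercase everything, then keep only the letters a-z.
# Spaces, commas and any other punctuation are dropped.
corpus = text.downcase.scan(/[a-z]/)

corpus.first(5) # => ["i", "t", "w", "a", "s"]

# #tally and the pre-2.7 idiom build the same hash.
older = corpus.group_by(&:itself).transform_values(&:count)
corpus.tally == older # => true
```

Using scan with a character class does the cleaning and the splitting in one pass. If accented letters matter for your text, a Unicode-aware class such as /\p{L}/ may be a better fit.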
Calling #tally on that array tells us how often each letter is used:
> corpus.tally
=> {"k"=>1,
"y"=>1,
"u"=>1,
"b"=>2,
"p"=>2,
"m"=>3,
"r"=>3,
"g"=>3,
"d"=>3,
"c"=>3,
"l"=>4,
"n"=>5,
"f"=>10,
"w"=>10,
"h"=>12,
"a"=>13,
"o"=>16,
"i"=>16,
"t"=>22,
"s"=>22,
"e"=>22}
Graphing it visually we get:
Frequency of frequencies
This isn’t quite good enough though. A histogram doesn’t answer the question “how many times did the letter e get used?” Instead, it answers the question “how many letters get used 22 times?” We no longer care about the individual letters; instead, we want to #tally the frequencies:
> corpus.tally.values.tally
=> {1=>3, 2=>2, 3=>5, 4=>1, 5=>1, 10=>2, 12=>1, 13=>1, 16=>2, 22=>3}
and the graph looks like:
3 letters got used once, 2 letters got used twice, etc. We can see that most letters get used only a few times but that there are some outliers that get used extremely frequently.
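Putting the pieces together, here is a hedged end-to-end sketch. One thing to keep in mind: #tally returns keys in the order they were first seen, so the sketch sorts by key to get the tidy ordering shown in this post. The corpus cleanup (downcase plus a letters-only scan) and the text-based bars are my assumptions, standing in for whatever produced the original graph.

```ruby
text = "It was the best of times, it was the worst of times, " \
       "it was the age of wisdom, it was the age of foolishness, " \
       "it was the epoch of belief, it was the epoch of incredulity, " \
       "it was the season of Light, it was the season of Darkness"

# Assumed cleanup: lowercase letters only.
corpus = text.downcase.scan(/[a-z]/)

# First tally: letter => times used.
# Second tally: times used => number of letters used that often.
# Sorting by key puts the buckets in ascending order.
histogram = corpus.tally.values.tally.sort.to_h
# => {1=>3, 2=>2, 3=>5, 4=>1, 5=>1, 10=>2, 12=>1, 13=>1, 16=>2, 22=>3}

# Draw one row per bucket, with one "#" per letter in that bucket.
histogram.each do |times_used, letter_count|
  puts format("%2d | %s", times_used, "#" * letter_count)
end
```

The first row printed is " 1 | ###" (three letters used once) and the last is "22 | ###" (three letters used 22 times).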
The next time you’re trying to calculate a histogram, bring out the handy double #tally!
Full corpus
And for those who are curious, here’s what the distribution looks like for the whole book. Note that each column represents a range of values in slices of 5000.
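Grouping into slices can be folded into the same trick: round each count down to the start of its slice before the second #tally. This is only a sketch of that idea. The counts below are made up for illustration (they are not the book's real numbers), and labelling each slice by its lower edge is my assumption.

```ruby
# Made-up per-letter counts, purely for illustration.
counts = [120, 4_800, 5_000, 7_300, 9_999, 14_999]

BIN_WIDTH = 5_000

# Integer division floors each count to the lower edge of its slice,
# so 7_300 lands in the slice labelled 5000 (covering 5000..9999).
binned = counts.map { |count| count / BIN_WIDTH * BIN_WIDTH }

binned.tally # => {0=>2, 5000=>3, 10000=>1}
```

With a real corpus, counts would be corpus.tally.values, which makes the whole thing corpus.tally.values.map { ... }.tally.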