---
title: Histogram Distribution in Ruby is Double Tally
teaser: Using `#tally` to calculate the distribution of letters in a piece of text.
tags: ruby,development
author: Joël Quenneville
published_on: 2022-10-31
---

Here’s a fun mental shortcut I recently learned: getting the histogram
distribution in Ruby is double `#tally`.

Consider this opening passage from the novel _A Tale of Two Cities_:

> It was the best of times, it was the worst of times, it was the age of wisdom,
> it was the age of foolishness, it was the epoch of belief, it was the epoch of
> incredulity, it was the season of Light, it was the season of Darkness

We'd like to build a histogram of letter frequency. How would we get those numbers?

## Per-letter distribution

[`Enumerable#tally`], introduced in Ruby 2.7, counts how many times each
element occurs in a collection. It is equivalent to the older idiom
`group_by(&:itself).transform_values(&:count)`. We start by turning the string
of text into a cleaned array of normalized characters, which I'm calling
`corpus`.
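
The cleanup step itself isn't spelled out here. As a minimal sketch, assuming
that "cleaned and normalized" means lowercasing the text and keeping only the
letters a to z, `corpus` could be built like this (the `passage` variable name
is just for illustration):

```ruby
# The passage quoted above, stored in a hypothetical `passage` variable.
passage = <<~TEXT
  It was the best of times, it was the worst of times, it was the age of wisdom,
  it was the age of foolishness, it was the epoch of belief, it was the epoch of
  incredulity, it was the season of Light, it was the season of Darkness
TEXT

# Assumption: "cleaned and normalized" means lowercase letters only,
# with spaces, punctuation, and newlines dropped.
corpus = passage.downcase.scan(/[a-z]/)

corpus.first(5) # => ["i", "t", "w", "a", "s"]

# The pre-2.7 idiom produces the same hash as #tally.
corpus.tally == corpus.group_by(&:itself).transform_values(&:count) # => true
```

Other normalizations (keeping digits, handling accented characters) would be
equally reasonable, depending on the text.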

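Calling `#tally` on that array tells us how often each letter is used. The
output below is shown sorted by count for readability; `#tally` itself returns
keys in the order they first appear: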

```ruby
> corpus.tally
=> {"k"=>1,
 "y"=>1,
 "u"=>1,
 "b"=>2,
 "p"=>2,
 "m"=>3,
 "r"=>3,
 "g"=>3,
 "d"=>3,
 "c"=>3,
 "l"=>4,
 "n"=>5,
 "f"=>10,
 "w"=>10,
 "h"=>12,
 "a"=>13,
 "o"=>16,
 "i"=>16,
 "t"=>22,
 "s"=>22,
 "e"=>22}
```

Graphing it, we get:

![Bar graph showing how frequently each letter is used](https://images.thoughtbot.com/jq-tally/PmMSLJciT8SCEG1t0JEo_letter-frequencies.png)

[`Enumerable#tally`]: https://ruby-doc.org/core-3.1.2/Enumerable.html#method-i-tally

## Frequency of frequencies

This isn't quite what we want, though. A histogram doesn't answer the question
“how many times did the letter `e` get used?” Instead, it answers the question
“how many letters get used 22 times?”

We no longer care about the individual letters. Instead, we want to `#tally` the
frequencies:

```ruby
> corpus.tally.values.tally
=> {1=>3, 2=>2, 3=>5, 4=>1, 5=>1, 10=>2, 12=>1, 13=>1, 16=>2, 22=>3}
```

The graph looks like this:

![Histogram showing how often each frequency occurs](https://images.thoughtbot.com/jq-tally/1FuudsdTZiMb4Tw2lXns_tale-of-two-cities-letter-distribution.png)

Three letters are used once, two letters are used twice, and so on. We can see
that most letters are used only a few times, but a handful of outliers are used
far more often.
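
If you don't have a charting tool handy, a rough text rendering can show the
same shape. This is only a sketch: it starts from the hash returned above, and
the `distribution` and `lines` names and the bar format are my own choices for
illustration.

```ruby
# The result of corpus.tally.values.tally from above.
distribution = {1=>3, 2=>2, 3=>5, 4=>1, 5=>1, 10=>2, 12=>1, 13=>1, 16=>2, 22=>3}

# One line per frequency, with one "#" per letter used that many times.
lines = distribution.sort.map do |frequency, letter_count|
  format("%2d | %s", frequency, "#" * letter_count)
end

puts lines
```

Each row is a frequency and each `#` is one letter that occurs that many times,
so the tallest bar sits at a frequency of 3.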

The next time you’re trying to calculate a histogram, bring out the handy double
`#tally`!

## Full corpus

And for those who are curious, here’s what the distribution looks like for the
[whole book]. Note that each column now represents a range of frequencies, in
slices of 5000, rather than a single value.

![Histogram of letter frequency for whole book](https://images.thoughtbot.com/jq-tally/g4UbhQmrQ5G3oy6nGkMf_tale-of-two-cities-letter-distribution-complete.png)
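
The binning isn't shown here. One plausible approach, sketched under
assumptions, is to round each frequency down to the start of its slice before
the second tally. To keep the example self-contained, it reuses the per-letter
counts from the short passage with a hypothetical `bin_width` of 5; for the
whole book the width would be 5000.

```ruby
# Per-letter counts for the short passage (the values of corpus.tally).
counts = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 10, 10, 12, 13, 16, 16, 22, 22, 22]

# Assumption: slices start at 0 and each covers `bin_width` consecutive values.
bin_width = 5

# Round each count down to the start of its slice, then tally the slices.
binned = counts
  .map { |count| (count / bin_width) * bin_width }
  .tally
  .sort
  .to_h

binned # => {0=>11, 5=>1, 10=>4, 15=>2, 20=>3}
```

Each key is the lower bound of a slice, so `0=>11` means eleven letters are used
between 0 and 4 times.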

[whole book]: https://www.gutenberg.org/files/98/98-0.txt
