Pipelining without pipes

GitHub has a library called Linguist, which it uses to detect languages and syntax highlight code snippets. It has an extensive list of languages and their characteristics (e.g., name, file extensions, color, etc.). I had some free time, so I came up with this silly question: what would be the average programming language color?

Hacking it

We can get this done with a few lines of Ruby:

require "net/http"
require "yaml"

# Some helper functions

def fetch_url(url)
  Net::HTTP.get(URI(url))
end

def parse_yaml(yaml_string)
  YAML.safe_load(yaml_string)
end

def hex_color_to_rgb(color)
  color.delete("#").scan(/../).map(&:hex)
end

def rgb_color_to_hex(color)
  color.map { |channel| channel.to_s(16).rjust(2, "0") }.join
end

GITHUB_LANGS_URL = "https://raw.githubusercontent.com/github/linguist/6b02d3bd769d07d1fdc661ba9e37aad0fd70e2ff/lib/linguist/languages.yml"

# Fetch and parse the language list YAML
langs_yaml = fetch_url(GITHUB_LANGS_URL)
langs = parse_yaml(langs_yaml)

# Calculate the average color of the programming languages
red_sum = 0
green_sum = 0
blue_sum = 0
color_count = 0

langs.each do |_name, details|
  next if details["type"] != "programming" # Skip "non-programming" languages
  next if details["color"].nil? # Skip languages without a color

  rgb = hex_color_to_rgb(details["color"])
  red_sum += rgb[0] ** 2
  green_sum += rgb[1] ** 2
  blue_sum += rgb[2] ** 2
  color_count += 1
end

average_red = Math.sqrt(red_sum / color_count).to_i
average_green = Math.sqrt(green_sum / color_count).to_i
average_blue = Math.sqrt(blue_sum / color_count).to_i

average_color = "##{rgb_color_to_hex([average_red, average_green, average_blue])}"

puts average_color

There you go! It prints out the average color. We can close our laptops and call it a day. There are few reasons to improve code that we won’t touch again.
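One detail worth spelling out: the loop squares each channel before summing and takes the square root of the mean, so it computes a root mean square per channel rather than a plain mean. Squaring before averaging is a common trick for blending gamma-encoded sRGB values; treat that motivation as an assumption, since the script itself doesn't say. A tiny sketch on one made-up channel (not Linguist data) shows how the two differ:

```ruby
# Red channel of two hypothetical colors, chosen only for illustration.
channel = [255, 0]

# Plain arithmetic mean (integer division, like the script above).
plain_mean = channel.sum / channel.size

# Root mean square: mean of the squares, then the square root.
rms = Math.sqrt(channel.sum { |value| value ** 2 } / channel.size).to_i

puts plain_mean # 127
puts rms        # 180
```

The root mean square leans toward the brighter value, which is why the result is not simply the midpoint.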

I’m smelling something

Well, for a throwaway project, that solution is indeed enough. Sometimes we have fun coding, and that’s it, no worries. In this particular case, though, I thought it would be an excellent exercise to practice functional programming. So, I decided to refactor it.

Some things bothered me in the original code. That each block does a lot of stuff, and, as generally happens with things with lots of responsibilities, it doesn’t do them well:

- The logic to calculate the average color is split into different parts of the code. Some of it is inside the each block, and some of it is outside.
- The code is fragile:
  - Updating color_count has to happen after the next if ... calls;
  - It’s easy to miss why color_count is necessary at all and, instead, use langs.size to calculate the average color, which would give us the wrong result.
- The code is very procedural, and it feels weird in Ruby.

It seems like having color_count and the color sums as separate variables is causing some pain, so we could change those variables to be a single array of colors and calculate the mean later. Iteratively building a collection is an anti-pattern, but it does shine a light on a direction we can follow.
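As a rough sketch of that intermediate step, here is the loop collecting colors into one array. The langs hash is a hypothetical stand-in for the parsed YAML (made-up names and colors), so the snippet runs without a network call:

```ruby
# Hypothetical stand-in for the parsed languages.yml (not real Linguist data).
langs = {
  "Alpha" => { "type" => "programming", "color" => "#ff0000" },
  "Beta"  => { "type" => "programming", "color" => "#0000ff" },
  "Gamma" => { "type" => "data", "color" => "#00ff00" },
  "Delta" => { "type" => "programming" } # no color
}

def hex_color_to_rgb(color)
  color.delete("#").scan(/../).map(&:hex)
end

# Still an each loop, but it only collects; no sums or counters to keep in sync.
colors = []
langs.each do |_name, details|
  next if details["type"] != "programming"
  next if details["color"].nil?

  colors << hex_color_to_rgb(details["color"])
end

# The count now comes from the collection itself, so langs.size can't sneak in.
average = [0, 1, 2].map do |index|
  Math.sqrt(colors.sum { |rgb| rgb[index] ** 2 } / colors.size).to_i
end

p colors  # [[255, 0, 0], [0, 0, 255]]
p average # [180, 0, 180]
```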

Data transformation 🤝 Functional programming

Functional programming teaches us to think in terms of data transformation. Each function takes data and returns it in a new form. We can compose several functions together and form a pipeline.

Let’s walk our code and try to convert it into a pipeline. We can keep the imports and helper functions, so let’s skip to this part:

# Fetch and parse the language list YAML
langs_yaml = fetch_url(GITHUB_LANGS_URL)
langs = parse_yaml(langs_yaml)

In a functional programming language like Elixir, this could be written as:

GITHUB_LANGS_URL
|> fetch_url()
|> parse_yaml()

We hit our first roadblock: we have no pipe operator in Ruby! It is a common feature of functional programming languages that passes the result of an expression as a parameter of another expression.

So how can we do this in Ruby? We could write it as parse_yaml(fetch_url(GITHUB_LANGS_URL)), but keeping this pattern leads to quite unreadable code. Ruby is an object-oriented language, so we have to think in terms of objects and messages (i.e., methods).

We need something that passes the caller to a given function, or, in other words, that yields self to a block. Luckily, Ruby has a method that does exactly that: yield_self, or its nicer-sounding alias then. Here’s how that code would look:

GITHUB_LANGS_URL
  .then { |url| fetch_url(url) }
  .then { |languages_yaml| parse_yaml(languages_yaml) }
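To try then without hitting the network, here is a minimal sketch that chains two small helpers. strip_hash and split_channels are hypothetical names made up for this illustration; they are not part of the script above:

```ruby
# Hypothetical helpers, used only to have something to chain.
def strip_hash(color)
  color.delete("#")
end

def split_channels(hex)
  hex.scan(/../)
end

# Each then block receives the previous result and returns the next one.
rgb = "#6e4c13"
  .then { |color| strip_hash(color) }
  .then { |hex| split_channels(hex) }
  .then { |channels| channels.map(&:hex) }

# yield_self behaves the same way; then is just the shorter alias.
same = "#6e4c13".yield_self { |color| strip_hash(color) }

p rgb  # [110, 76, 19]
p same # "6e4c13"
```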

Using Ruby’s numbered parameters, we can avoid having to name the block arguments:

GITHUB_LANGS_URL
  .then { fetch_url _1 }
  .then { parse_yaml _1 }

Cool, that is pretty close to the Elixir code. Now, we have to transform that big each block into a pipeline. In essence, that part of the code filters out non-programming languages and languages without a color, then calculates the average color. Let’s split those two parts into separate steps. parse_yaml returns a hash, so we can use Hash#filter to select the languages we want.

  # ...
  .then { parse_yaml _1 }
  .filter { |_lang_name, lang_details|
    lang_details["type"] == "programming" && lang_details["color"]
  }

Then, we get the colors of each language and convert them to RGB:

  # ...
  .filter { |_lang_name, lang_details|
    lang_details["type"] == "programming" && lang_details["color"]
  }
  .map { |_lang_name, lang_details|
    hex_color_to_rgb(lang_details["color"])
  }

This code works, but alas, it iterates over the languages twice (once in filter and again in map). We could use Enumerable#reduce to do this in a single pass, but that would be a bit lengthy (and many folks don’t know Enumerable#reduce). Again, Ruby has our back and provides Enumerable#filter_map. It calls the given block on each element of the enumerable and returns an array containing the truthy values returned by the block. We can merge those two steps into one:

  .filter_map { |_lang_name, lang_details|
    next if lang_details["type"] != "programming"
    next if lang_details["color"].nil?

    hex_color_to_rgb(lang_details["color"])
  }

I split the filter condition into two steps because I think it’s easier to read. Also note that the if conditions are now inverted.
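Here is a small self-contained sketch of that step, checking that filter_map gives the same result as filter followed by map. The langs hash is made-up sample data standing in for the parsed YAML:

```ruby
# Hypothetical sample data (not real Linguist entries).
langs = {
  "Alpha" => { "type" => "programming", "color" => "#ff0000" },
  "Beta"  => { "type" => "markup", "color" => "#00ff00" },
  "Gamma" => { "type" => "programming" }, # no color
  "Delta" => { "type" => "programming", "color" => "#0000ff" }
}

def hex_color_to_rgb(color)
  color.delete("#").scan(/../).map(&:hex)
end

# One pass: a bare next returns nil, and filter_map drops nil/false results.
one_pass = langs.filter_map { |_lang_name, lang_details|
  next if lang_details["type"] != "programming"
  next if lang_details["color"].nil?

  hex_color_to_rgb(lang_details["color"])
}

# Two passes, for comparison.
two_pass = langs
  .filter { |_lang_name, lang_details|
    lang_details["type"] == "programming" && lang_details["color"]
  }
  .map { |_lang_name, lang_details| hex_color_to_rgb(lang_details["color"]) }

p one_pass             # [[255, 0, 0], [0, 0, 255]]
p one_pass == two_pass # true
```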

Now we have an array of colors, with each color as an array of red, green, and blue values. We need to sum all red values together, then all green values, and all blue values. Let’s reshape our data representation to group values by color channel, so this will be easier:

  .filter_map {
    # ...
  }
  .transpose
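If Array#transpose is new to you, this sketch with three made-up colors shows the reshaping: rows of [red, green, blue] become one row per channel.

```ruby
# Three hypothetical RGB colors, one per row.
colors = [
  [255, 0, 0],
  [0, 0, 255],
  [16, 32, 48]
]

# transpose swaps rows and columns, so each row now holds a single channel.
reds, greens, blues = colors.transpose

p reds   # [255, 0, 16]
p greens # [0, 0, 32]
p blues  # [0, 255, 48]
```

Note that transpose raises an IndexError if the inner arrays differ in length, so it relies on every color having exactly three channels.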

The pipeline is coming together, but we still have work to do. Calculating the average color now is fairly simple using Enumerable#sum (can we get Enumerable#mean, tho? 😅):

  .transpose
  .map { |channel_values|
    squared_average = channel_values.sum { |value| value ** 2 } / channel_values.size

    Math.sqrt(squared_average).to_i
  }

Readability, performance and balance

Those with sharp eyes will notice that we’re still iterating over the values multiple times (filter_map, transpose, and then sum for each channel). Again, using Enumerable#reduce would be an option for a single-pass solution, but a single pass isn’t a hard requirement for this exercise (every version here is O(n) anyway).

Also, the body of that reduce call could be hard to grasp, so I decided to sacrifice a bit of performance to ease reading/teaching. As developers, we constantly have to balance readability, performance, and maintainability.
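For the curious, here is one way the single-pass reduce could look. It is only a sketch under assumptions: the langs hash is made-up sample data, and the accumulator shape (an array of squared sums plus a counter) is just one possible choice.

```ruby
# Hypothetical sample data (not real Linguist entries).
langs = {
  "Alpha" => { "type" => "programming", "color" => "#ff0000" },
  "Beta"  => { "type" => "data", "color" => "#00ff00" },
  "Gamma" => { "type" => "programming", "color" => "#0000ff" }
}

def hex_color_to_rgb(color)
  color.delete("#").scan(/../).map(&:hex)
end

# The accumulator carries [squared sums per channel, number of colors seen].
squared_sums, count = langs.reduce([[0, 0, 0], 0]) { |(sums, seen), (_name, details)|
  # Skipped languages pass the accumulator through unchanged.
  next [sums, seen] if details["type"] != "programming" || details["color"].nil?

  rgb = hex_color_to_rgb(details["color"])
  [sums.zip(rgb).map { |sum, value| sum + value ** 2 }, seen + 1]
}

# Raises ZeroDivisionError if no language qualified (count == 0).
average = squared_sums.map { |sum| Math.sqrt(sum / count).to_i }

p squared_sums # [65025, 0, 65025]
p count        # 2
p average      # [180, 0, 180]
```

It works, but the filtering, the conversion, and the bookkeeping are all tangled together again, which is the problem the pipeline set out to fix.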

Lastly, we convert the color, represented as a 3-element array, to a hex string and print it. Here’s the full solution:

require "net/http"
require "yaml"

def fetch_url(url)
  Net::HTTP.get(URI(url))
end

def parse_yaml(yaml_string)
  YAML.safe_load(yaml_string)
end

def hex_color_to_rgb(color)
  color.delete("#").scan(/../).map(&:hex)
end

def rgb_color_to_hex(color)
  color.map { |channel| channel.to_s(16).rjust(2, "0") }.join
end

GITHUB_LANGS_URL = "https://raw.githubusercontent.com/github/linguist/6b02d3bd769d07d1fdc661ba9e37aad0fd70e2ff/lib/linguist/languages.yml"

GITHUB_LANGS_URL
  .then { fetch_url _1 }
  .then { parse_yaml _1 }
  .filter_map { |_lang_name, lang_details|
    next if lang_details["type"] != "programming"
    next if lang_details["color"].nil?

    hex_color_to_rgb(lang_details["color"])
  }
  .transpose
  .map { |channel_values|
    squared_average = channel_values.sum { |value| value ** 2 } / channel_values.size

    Math.sqrt(squared_average).to_i
  }
  .then { |average_color| puts "##{rgb_color_to_hex(average_color)}" }

One of the neat things about that pipeline is that we can extract any part of it into a separate method, and it will still be chainable.
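As a sketch of that idea, the averaging steps could move into their own methods and be chained with then. The names root_mean_square and average_color are hypothetical, and the input colors are made up so the example runs offline:

```ruby
def rgb_color_to_hex(color)
  color.map { |channel| channel.to_s(16).rjust(2, "0") }.join
end

# Hypothetical extraction of the per-channel math.
def root_mean_square(values)
  Math.sqrt(values.sum { |value| value ** 2 } / values.size).to_i
end

# Hypothetical extraction of the transpose + map steps.
def average_color(colors)
  colors.transpose.map { |channel_values| root_mean_square(channel_values) }
end

# Two made-up colors stand in for the filter_map output.
hex = [[255, 0, 0], [0, 0, 255]]
  .then { |colors| average_color(colors) }
  .then { |color| "##{rgb_color_to_hex(color)}" }

puts hex # #b400b4
```

The extracted methods are plain functions of their input, so each one can be tested on its own.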

Takeaways

Ruby is an OOP language, so thinking about objects and methods is the natural way of programming. Whenever you can, use methods (like those on the Enumerable module), or create objects that provide the ones you need.

Ruby also has good support for functional programming, and we can take advantage of that, particularly when doing data transformation. Mixing OOP and FP is not a sin, and Ruby has great features to support it.

Moreover, remember that it’s okay to start with a simple solution and improve it later. That’s the natural flow when doing TDD.

Hey, wait! You forgot something!

What? Oh, the color! Here it is:

A shade of puce or grayish mauve picked by a color selector containing a range of color shades and values
