Ruby HTML Sanitization with Loofah

Stefanni Brasil

This post was originally published on the hexdevs blog.


As a Rails developer, when you want to sanitize some user’s HTML input, you just write <%= sanitize some_user_provided_string %> and call it a day.

But sometimes, this helper is not what you really need. You need to sanitize the user-provided HTML string outside of a Rails view or template. For example: when the unsafe HTML string comes from an integration, and you need to clean it before storing it.

That’s when the Loofah gem shines. It is a Ruby library for HTML/XML transformation and sanitization, built on top of Nokogiri. Fun fact: you can provide a custom Loofah scrubber to the Rails sanitize method 💡

Ruby HTML Sanitization with the Loofah gem

Loofah sanitizes HTML through objects called “Scrubbers”. It ships with several built-in scrubbers that do amazing things for you:

doc = Loofah.html5_document(input)
doc.scrub!(:strip)       # replaces unknown/unsafe tags with their inner text
doc.scrub!(:prune)       # removes unknown/unsafe tags and their children

Cool, right? Even cooler is the fact that you can create your own Scrubbers entirely, or combine Loofah’s scrubbers with your own.

Combine built-in and custom Loofah Scrubbers

Time to scrub some HTML from its potential dirtiness. Html::Sanitizer combines built-in scrubbers and a custom one:

module Html
  class Sanitizer
    class InvalidHTMLError < StandardError; end

    def self.clean(content)
      sanitized_html = Loofah.fragment(content)
                             .scrub!(:prune)
                             .scrub!(:noopener)
                             .scrub!(:nofollow)
                             .scrub!(:targetblank)
                             .scrub!(:unprintable)
                             .scrub!(CoolScrubber.new).to_s

      return sanitized_html unless sanitized_html.empty?

      raise InvalidHTMLError, "Invalid HTML received"
    end
  end

  class CoolScrubber < Loofah::Scrubber
    def scrub(node)
      # custom HTML sanitization and transformation
    end
  end
end

Let’s go through it piece by piece.

Loofah built-in scrubbers

First, we parse the HTML content with Loofah’s fragment method:

sanitized_html = Loofah.fragment(content)

With an HTML fragment in hand, we apply a mix of HTML transformations and sanitization to scrub the content:

.scrub!(:prune) # => prunes unsafe tags and their subtrees, removing all traces that they ever existed
.scrub!(:noopener) # => adds rel="noopener" attribute to links
.scrub!(:nofollow) # => adds rel="nofollow" attribute to links
.scrub!(:targetblank) # => adds target="_blank" attribute to links
.scrub!(:unprintable) # => removes unprintable characters from text nodes
.scrub!(CoolScrubber.new).to_s # => custom scrubber, see next section

Lastly, we chain a custom scrubber to apply some business logic 👔

Loofah Custom Scrubbers

We needed this custom Scrubber to check for HTML elements that we don’t accept. Here is what it looks like:

class CoolScrubber < Loofah::Scrubber
  def scrub(node)
    handle_not_allowed_nodes(node)
    handle_method_elements(node, %w[href src data srcset])
    # do whatever else you need besides what Loofah gives you
  end
end

And to make sure we are sanitizing our HTML as we expect, this class has an extensive test suite.

Test Ruby HTML Sanitization with RSpec

When I was working on this feature, I followed OWASP’s Cheat Sheet to write the tests. It was my first time doing this, so having a guide to verify the HTML cleaning was helpful. Some test examples:

it "removes malicious CSS attributes while retaining safe ones, if safe" do
  html = "<p style=\"display: block; background-image:url('http://www.ragingplatypus.com/i/cam-full.jpg'); background-color: blue;\"></p>"

  result = Html::Sanitizer.clean(html)

  expect(result).to eq "<p style=\"display:block;background-color:blue;\"></p>"
end

it "raises an InvalidHTMLError error message, if there are malicious attributes from different elements" do
  html = "<div style='background-image:url(javascript:alert('XSS'))'>" \
          "<input type='image' src='javascript:alert('XSS');''></div>" \
          "<div style='width: expression(alert('XSS'));'></div>"

  expect do
    Html::Sanitizer.clean(html)
  end.to raise_error(Html::Sanitizer::InvalidHTMLError, /Invalid HTML received/)
end

it "adds a target=_blank to all links even if they already have a target value" do
  html = "<a href=\"www.example.com/event-1\" target=\"_top\">Community Gathering Event</a>"

  result = Html::Sanitizer.clean(html)

  expect(result).to eq "<a href=\"www.example.com/event-1\" target=\"_blank\" rel=\"noopener nofollow\">Community Gathering Event</a>"
end

Resources for HTML sanitization and transformation

These are the resources I used for this work and recommend checking out:

Contributing to Loofah is fun

By learning more about the Loofah gem, I ended up finding an opportunity to contribute to the project. I was adding target=_blank to all links manually in my project. And I thought: “if I have this problem, others might have it too, and they could benefit from having this feature available in the library.”

I co-authored this PR to add target=_blank to all links as a built-in scrubber, which was a great way to contribute back to the gem. It’s available in Loofah >= 2.22.0.

This was my first time doing HTML Sanitization and it was a great learning opportunity. I had the chance to meet @flavorjones at RubyConf 2023, which was really nice :)


What about you? Have you had to sanitize HTML before? How did you go about it? What other tools have you used?