This post was originally published on the hexdevs blog.
As a Rails developer, when you want to sanitize some user’s HTML input, you just write <%= sanitize some_user_provided_string %> and call it a day.
But sometimes, this helper is not what you really need. You need to sanitize the user-provided HTML string outside of a Rails view or template. For example: when the unsafe HTML string comes from an integration, and you need to clean it before storing it.
That’s when the Loofah gem shines. It is a Ruby library for HTML/XML transformation and sanitization, built on top of Nokogiri. Fun fact: you can provide a custom Loofah scrubber to the Rails sanitize method 💡
Ruby HTML Sanitation with the Loofah gem
Loofah does its work through “scrubbers”: objects that walk the parsed document and clean or transform each node. It ships with several built-in scrubbers that do amazing things for you:
doc = Loofah.html5_document(input)
doc.scrub!(:strip) # replaces unknown/unsafe tags with their inner text
doc.scrub!(:prune) # removes unknown/unsafe tags and their children
Cool, right? Even cooler is the fact that you can write your own scrubbers from scratch, or combine Loofah’s scrubbers with your own.
Combine built-in and custom Loofah Scrubbers
Time to scrub some HTML from its potential dirtiness. Html::Sanitizer combines built-in scrubbers and a custom one:
require "loofah"

module Html
  class Sanitizer
    class InvalidHTMLError < StandardError; end

    # Parses the input as an HTML fragment, runs every scrubber in order,
    # and returns the cleaned HTML string.
    def self.clean(content)
      sanitized_html = Loofah.fragment(content)
        .scrub!(:prune)
        .scrub!(:noopener)
        .scrub!(:nofollow)
        .scrub!(:targetblank)
        .scrub!(:unprintable)
        .scrub!(CoolScrubber.new).to_s

      # If nothing survives the scrubbing, treat the input as invalid.
      return sanitized_html unless sanitized_html.empty?

      raise InvalidHTMLError, "Invalid HTML received"
    end
  end

  class CoolScrubber < Loofah::Scrubber
    def scrub(node)
      # custom HTML sanitation and transformation
    end
  end
end
Let’s go through it piece by piece.
Loofah built-in scrubbers
First, we parse the HTML content with Loofah’s fragment method:
sanitized_html = Loofah.fragment(content)
With an HTML fragment in hand, we apply a mix of HTML transformation and sanitization to scrub the content:
.scrub!(:prune) # => prunes unsafe tags and their subtrees, removing all traces that they ever existed
.scrub!(:noopener) # => adds rel="noopener" attribute to links
.scrub!(:nofollow) # => adds rel="nofollow" attribute to links
.scrub!(:targetblank) # => adds target="_blank" attribute to links
.scrub!(:unprintable) # => removes unprintable characters from text nodes
.scrub!(CoolScrubber.new).to_s # => custom scrubber, see next section
Lastly, we chain a custom scrubber to apply some business logic 👔
Loofah Custom Scrubbers
We needed this custom scrubber to catch some HTML elements that we don’t accept. Here is what it looks like:
class CoolScrubber < Loofah::Scrubber
  def scrub(node)
    handle_not_allowed_nodes(node)
    handle_method_elements(node, %w[href src data srcset])
    # do whatever else you need beyond what Loofah gives you
  end
end
And to make sure we are sanitizing our HTML as we expect, this class has an extensive test suite.
Test Ruby HTML Sanitation with RSpec
When I was working on this feature, I followed OWASP’s Cheat Sheet to write the tests. It was my first time doing this, so having a guide to verify the HTML cleaning was helpful. Some test examples:
it "removes malicious CSS attributes while retaining safe ones, if safe" do
  html = "<p style=\"display: block; background-image:url('http://www.ragingplatypus.com/i/cam-full.jpg'); background-color: blue;\"></p>"
  result = Html::Sanitizer.clean(html)
  expect(result).to eq "<p style=\"display:block;background-color:blue;\"></p>"
end

it "raises an InvalidHTMLError error message, if there are malicious attributes from different elements" do
  html = "<div style='background-image:url(javascript:alert('XSS'))'>" \
    "<input type='image' src='javascript:alert('XSS');''></div>" \
    "<div style='width: expression(alert('XSS'));'></div>"
  expect do
    Html::Sanitizer.clean(html)
  end.to raise_error(Html::Sanitizer::InvalidHTMLError, /Invalid HTML received/)
end

it "adds a target=_blank to all links even if they already have a target value" do
  html = "<a href=\"www.example.com/event-1\" target=\"_top\">Community Gathering Event</a>"
  result = Html::Sanitizer.clean(html)
  expect(result).to eq "<a href=\"www.example.com/event-1\" target=\"_blank\" rel=\"noopener nofollow\">Community Gathering Event</a>"
end
Resources for HTML sanitization and transformation
These are the resources I used for this work and recommend checking out:
Contributing to Loofah is fun
By learning more about the Loofah gem, I ended up finding an opportunity to contribute to the project. I was adding target=_blank to all links manually in my project. And I thought: “if I have this problem, others might have it too, and they could benefit from having this feature available in the library.”
I co-authored this PR to add target=_blank to all links as a built-in scrubber, which was a great way to contribute back to the gem. It’s available in Loofah 2.22.0 and later.
This was my first time doing HTML Sanitization and it was a great learning opportunity. I had the chance to meet @flavorjones at RubyConf 2023, which was really nice :)
What about you? Have you had to sanitize HTML before? How did you go about it? What other tools have you used?