---
title: Fight back UTF-8 Invalid Byte Sequences
teaser:
tags: web,ruby,testing
author: Joel Oliveira
published_on: 2013-02-09
---

Chances are, some of you have run into the `invalid byte sequence in UTF-8`
error when dealing with user-submitted data. A [Google
search](https://www.google.com/search?q=ruby+utf-8+Invalid+Byte+Sequences) shows
that my hunch isn't off.

Among the search results are plenty of answers&mdash;some using the deprecated
iconv library&mdash;that might lead you to a sufficient fix. However, among the
slew of queries are few answers on how to reliably replicate and test the issue.

In developing the [Griddler](http://www.github.com/thoughtbot/griddler) gem we
ran into some cases where the data being posted back to our controller had
invalid UTF-8 bytes. For Griddler, our failing case needs to simulate an email
body that is tagged as UTF-8 but contains an invalid byte.

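What are valid and invalid bytes? [This table on
Wikipedia](http://en.wikipedia.org/wiki/UTF-8#Codepage_layout) tells us bytes
192, 193, and 245-255 are off limits, and that bytes 128-191 are continuation
bytes, which are only valid after a lead byte. In a Ruby string literal we can
write such a byte with an escape. Note that `\255` is an *octal* escape, so it
produces byte 173 (`0xAD`), a stray continuation byte, rather than byte 255: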

    > "hi \255"
     => "hi \xAD"

There's our string with the invalid byte! How do we know for sure? In that IRB
session we can simulate a comparable issue by sending the string a message it
won't like, such as `split` or `gsub`.

    > "hi \255".split(' ')
    ArgumentError: invalid byte sequence in UTF-8
      from (irb):9:in `split'
      from (irb):9
      from /Users/joel/.rvm/rubies/ruby-1.9.3-p125/bin/irb:16:in `<main>'

Yup. It certainly does not like that.
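
If you'd rather not provoke an exception to find out, `String#valid_encoding?`
answers the question directly. Here is a minimal sketch; the variable names are
just for illustration, and the `dup.force_encoding` call is only there so the
result doesn't depend on the source file's encoding:

```ruby
# "\255" is an octal escape: 0o255 == 173 == 0xAD, a lone continuation byte.
text = "hi \255".dup.force_encoding(Encoding::UTF_8)

last_byte = text.bytes.last      # raw byte value, independent of the encoding tag
valid     = text.valid_encoding? # false when the bytes are not well-formed UTF-8

puts "last byte: #{last_byte} (0x#{last_byte.to_s(16).upcase})"
puts "valid UTF-8? #{valid}"
```

This is handy in a console session when you want to confirm that a fixture
really contains a broken byte before writing a test around it.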

Let's create a very real-world, enterprise-level, business-critical test case:

`invalid_byte_spec.rb`

    require 'rspec'

    def replace_name(body, name)
      body.gsub(/joel/, name)
    end

    describe 'replace_name' do
      it 'removes my name' do
        body = "hello joel"

        replace_name(body, 'hank').should eq "hello hank"
      end

      it 'clears out invalid UTF-8 bytes' do
        body = "hello joel\255"

        replace_name(body, 'hank').should eq "hello hank"
      end
    end

The first test passes as expected, and the second will fail as expected but not
with the error we want. By adding that extra byte we *should* see an exception
raised similar to what we simulated in IRB. Instead it's failing in the
comparison with the expected value.

    1) replace_name clears out invalid UTF-8 bytes
       Failure/Error: replace_name(body, 'hank').should eq "hello hank"

         expected: "hello hank"
              got: "hello hank\xAD"

         (compared using ==)
       # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'

Why isn't it failing properly? If we pry into our running test we find out that
inside our file the strings being passed around are encoded as `ASCII-8BIT`
instead of `UTF-8`. Any byte is valid in `ASCII-8BIT`, so `gsub` has nothing to
complain about.

    [2] pry(#<RSpec::Core::ExampleGroup::Nested_1>)> body.encoding
    => #<Encoding:ASCII-8BIT>

As a result we'll have to force that string's encoding to UTF-8:

    it 'clears out invalid UTF-8 bytes' do
      body = "hello joel\255".force_encoding('UTF-8')

      replace_name(body, 'hank').should_not raise_error(ArgumentError)
      replace_name(body, 'hank').should eq "hello hank"
    end
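
It's worth being clear about what `force_encoding` does here: it only changes
the label on the string, not the bytes. The sketch below builds the same bytes
with `pack` so that it doesn't depend on the source file's encoding (`raw` and
`relabeled` are illustrative names), and shows that the exception appears only
once those bytes are tagged as UTF-8:

```ruby
# Build "hello joel" followed by the byte 0xAD as a binary (ASCII-8BIT) string.
raw = "hello joel".b + [0xAD].pack('C')

raw_encoding = raw.encoding         # ASCII-8BIT
raw_valid    = raw.valid_encoding?  # true: every byte is valid binary

# On the binary string, gsub works and simply carries the odd byte along.
binary_result = raw.gsub(/joel/, 'hank')

# Relabel a copy as UTF-8. The bytes are untouched, only the tag changes.
relabeled  = raw.dup.force_encoding(Encoding::UTF_8)
same_bytes = relabeled.bytes == raw.bytes

# Now the regexp engine refuses to work on the malformed string.
error_message = begin
  relabeled.gsub(/joel/, 'hank')
  nil
rescue ArgumentError => e
  e.message
end

puts "raw: #{raw_encoding}, valid: #{raw_valid}"
puts "same bytes after force_encoding: #{same_bytes}"
puts "error: #{error_message.inspect}"
```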

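Running the test now gives us our desired exception. It is raised inside `gsub`
while the first expectation's receiver is being evaluated, so the `raise_error`
matcher never runs; to assert on the exception itself the call would have to be
wrapped in a block, as in `expect { ... }.not_to raise_error`. The failure
output: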

    1) replace_name clears out invalid UTF-8 bytes
       Failure/Error: body.gsub(/joel/, name)
       ArgumentError:
         invalid byte sequence in UTF-8
       # ./invalid_byte_spec.rb:4:in `gsub'
       # ./invalid_byte_spec.rb:4:in `replace_name'
       # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'

    Finished in 0.00426 seconds
    2 examples, 1 failure

Now that we're comfortably in the **red** part of *[red/green/refactor][rgr]* we
can move on to getting this passing by updating our `replace_name` method.

[rgr]: http://en.wikipedia.org/wiki/Test-driven_development#Development_style

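    def replace_name(body, name)
      # Treat the bytes as binary and transcode them to UTF-8, dropping every
      # byte that has no UTF-8 equivalent (that is, anything >= 0x80).
      body
        .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
        .gsub(/joel/, name)
    end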

And the test?

    Finished in 0.04252 seconds
    2 examples, 0 failures
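
One caveat with this fix: because the source encoding is given as `'binary'`,
every byte above 127 counts as undefined and is dropped, which includes
perfectly valid multibyte characters. If that matters for your data, and you
are on Ruby 2.1 or later (newer than the 1.9.3 used above), `String#scrub` is
one alternative that removes only the malformed bytes. The sketch below
compares the two; the method names `replace_name_binary` and
`replace_name_scrub` are made up for this comparison:

```ruby
# The approach from this post: transcode from binary, dropping all bytes >= 0x80.
def replace_name_binary(body, name)
  body
    .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    .gsub(/joel/, name)
end

# Hypothetical alternative (Ruby 2.1+): drop only the malformed byte sequences.
def replace_name_scrub(body, name)
  body.scrub('').gsub(/joel/, name)
end

# "héllo joel" plus a stray 0xAD byte, assembled as bytes and tagged as UTF-8.
body = ("h\u00E9llo joel".b + [0xAD].pack('C')).force_encoding(Encoding::UTF_8)

binary_result = replace_name_binary(body, 'hank')
scrub_result  = replace_name_scrub(body, 'hank')

puts binary_result  # the valid "é" is lost along with the bad byte
puts scrub_result   # only the bad byte is removed
```

Which behavior you want depends on the input: for ASCII-only bodies the two
are equivalent, while for anything with accented or non-Latin text the
difference is visible to users.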

For such a small piece of code we admittedly had to jump through some hoops.
Through that process, however, we learned a bit about character encoding and how
to put ourselves in the right position&mdash;through the red/green/refactor
cycle&mdash;to fix bugs we will undoubtedly run into while writing software.
