Why should I avoid regular expressions?

Summer ☀️

In my experience, most developers dread working with regular expressions, but I love them. I enjoy using their dense syntax to write expressive patterns that solve all sorts of cool challenges. I use them every day: on the command line with grep and sed, in my editor’s search-and-replace tool, in crosswords, and of course in production code.

But their power must be carefully wielded, because it’s very easy to introduce unexpected false positives. Let’s take a look at a regular expression that demonstrates this danger and talk about how to fix it.

An example of a bad regular expression

Imagine that you’re implementing a commenting feature for a well-regarded tech blog. Users should be able to link text to other blog posts, but they shouldn’t be allowed to link to other websites. So you write a regular expression to ensure that link URLs only refer to the blog’s domain name, for example https://thoughtbot.com/blog/write-code-to-be-read:

unless url =~ /thoughtbot.com/
  raise "#{url} is invalid"
end

At first glance, this may seem sensible, but unfortunately it leaves several loopholes that make the check easy to bypass. Let’s fix them!

In regular expressions, . matches any single character except a newline (\n). A period/full stop fits that description, but so does almost every other character. For example, a user could circumvent our restrictions with a URL like https://thoughtbot-com.example.com/phishing-page. So let’s use \. to represent the character literally:

/thoughtbot\.com/
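To see the difference, here is a small sketch that runs both patterns against the bypass URL from above. The constant names are made up for illustration; the URLs are the ones already mentioned in this post.

```ruby
# The two patterns from the text: unescaped vs. escaped dot.
UNESCAPED_DOT = /thoughtbot.com/   # "." matches any character except a newline
ESCAPED_DOT   = /thoughtbot\.com/  # "\." matches only a literal period

# URLs taken from the examples in this post (constant names are hypothetical).
PHISHING_URL   = "https://thoughtbot-com.example.com/phishing-page"
LEGITIMATE_URL = "https://thoughtbot.com/blog/write-code-to-be-read"

puts UNESCAPED_DOT.match?(PHISHING_URL)   # true: the "-" satisfies "."
puts ESCAPED_DOT.match?(PHISHING_URL)     # false: no literal "thoughtbot.com"
puts ESCAPED_DOT.match?(LEGITIMATE_URL)   # true
```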

Next, while it’s obvious to a human that our regular expression is meant to check the domain-name part of the URL, we didn’t actually specify that. A user could provide a URL like https://example.com/phishing-page?xyz=thoughtbot.com. We should make sure that we’re matching on the domain name:

/https:\/\/thoughtbot\.com/

Better, but there’s still more to fix. We’re not checking that the entire domain name is correct; a user could use subdomains to get around this, e.g. https://thoughtbot.com.example.com/phishing-page. We’ll ensure that the domain name is followed by a / to prevent this:

/https:\/\/thoughtbot\.com\//

Finally, we forgot to make sure that the domain name appears at the beginning of the string, meaning that someone could pass https://example.com/phishing-page/https://thoughtbot.com/. In Ruby, \A is the proper anchor for this, because ^ matches at the start of every line rather than only at the start of the string; in other languages, it might be ^ (with multiline mode disabled):

/\Ahttps:\/\/thoughtbot\.com\//
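As a sanity check, the sketch below runs the final pattern against each bypass attempt described so far, and then compares \A with ^ on a two-line string. The constant names and the multi-line input are my own assumptions for illustration.

```ruby
# The final pattern from the text.
SAFE_PATTERN = /\Ahttps:\/\/thoughtbot\.com\//

# The same pattern anchored with "^" instead of "\A", for comparison only.
LINE_ANCHORED_PATTERN = /^https:\/\/thoughtbot\.com\//

# The bypass attempts described in this post.
BYPASS_URLS = [
  "https://thoughtbot-com.example.com/phishing-page",
  "https://example.com/phishing-page?xyz=thoughtbot.com",
  "https://thoughtbot.com.example.com/phishing-page",
  "https://example.com/phishing-page/https://thoughtbot.com/",
].freeze

# Each of these prints "rejected".
BYPASS_URLS.each do |url|
  verdict = SAFE_PATTERN.match?(url) ? "accepted" : "rejected"
  puts "#{url}: #{verdict}"
end

# Hypothetical input with an embedded newline. In Ruby, "^" matches at the
# start of every line, so the second line satisfies the "^" version.
MULTILINE_INPUT = "https://example.com/phishing-page\nhttps://thoughtbot.com/"

puts LINE_ANCHORED_PATTERN.match?(MULTILINE_INPUT)  # true
puts SAFE_PATTERN.match?(MULTILINE_INPUT)           # false
```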

Wow. It took a lot of work to create a sufficiently safe regular expression. And for more complex patterns, we might also have to consider their efficiency so that we don’t run into the kind of problem that caused a Stack Overflow outage.

Alternatives to regular expressions

You shouldn’t completely reject regular expressions, but it’s well worth being judicious about when they’re an appropriate solution. Consider how the strings you’re analyzing are ultimately used. When a browser sends a request for a URL and a server receives it, those systems scan through that string to separate it into its components (scheme, domain, path, query, etc.); where possible, use the same technique (perhaps via a widely used third-party library) to reduce the likelihood of mismatches. For the example above, we can use the URI parser from Ruby’s standard library:

require 'uri'

# Split the URL into its components, then compare scheme and host exactly.
parsed_url = URI.parse(url)
unless parsed_url.scheme == 'https' && parsed_url.host == 'thoughtbot.com'
  raise "#{url} is invalid"
end
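One thing to keep in mind is that URI.parse raises URI::InvalidURIError when it can’t parse its input, so a comment form would probably want to treat that case as invalid too. Here is a minimal sketch of how that might look; the helper name allowed_blog_url? and the ALLOWED_HOST constant are hypothetical, not part of any library.

```ruby
require 'uri'

# Hypothetical constant: the only host that comment links may point to.
ALLOWED_HOST = 'thoughtbot.com'

# Hypothetical helper: returns true only for https URLs on the allowed host.
def allowed_blog_url?(url)
  parsed_url = URI.parse(url)
  parsed_url.scheme == 'https' && parsed_url.host == ALLOWED_HOST
rescue URI::InvalidURIError
  # Input that cannot be parsed as a URI (e.g. it contains spaces) is invalid.
  false
end

puts allowed_blog_url?("https://thoughtbot.com/blog/write-code-to-be-read")  # true
puts allowed_blog_url?("https://thoughtbot.com.example.com/phishing-page")   # false
puts allowed_blog_url?("not a url")                                          # false
```

Because the comparison is made against the parsed host rather than the raw string, each of the bypass attempts from the first half of this post is rejected without any escaping or anchoring on our part.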

For almost any programming/markup language or standardized file format, I’d recommend using an official or popular parser; I feel much more confident with them than with ad-hoc regular expressions. If you’re analyzing strings whose format is trivial and largely inconsequential (e.g. guarding against typos in postcodes or ID card numbers), then I think regular expressions work well.
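As an illustration of that last case, here is a rough sketch of a typo guard for US ZIP codes, assuming the five-digit format with an optional four-digit extension. The helper name is hypothetical, and the check only looks at the shape of the string, not whether the ZIP code actually exists.

```ruby
# Five digits, optionally followed by a hyphen and four more digits.
# \A and \z anchor the match to the whole string.
ZIP_CODE_PATTERN = /\A\d{5}(?:-\d{4})?\z/

# Hypothetical helper: true if the input has the shape of a US ZIP code.
def plausible_zip_code?(input)
  ZIP_CODE_PATTERN.match?(input)
end

puts plausible_zip_code?("02108")       # true
puts plausible_zip_code?("02108-1234")  # true
puts plausible_zip_code?("2108")        # false: only four digits
puts plausible_zip_code?("02108\n")     # false: \z rejects a trailing newline
```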