In my experience, most developers dread working with regular expressions, but I love them. I enjoy using their dense syntax to write expressive patterns that solve all sorts of cool challenges. I use them every day: on the command line with grep and sed, in my editor’s search-and-replace tool, in crosswords, and of course in production code.
But their power must be carefully wielded, because it’s very easy to introduce unexpected false positives. Let’s take a look at a regular expression that demonstrates this danger and talk about how to fix it.
An example of a bad regular expression
Imagine that you’re implementing a commenting feature for a well-regarded tech blog. Users should be able to link text to other blog posts, but they shouldn’t be allowed to link to other websites. So you write a regular expression to ensure that link URLs only refer to the blog’s domain name, for example https://thoughtbot.com/blog/write-code-to-be-read:
unless url =~ /thoughtbot.com/
  raise "#{url} is invalid"
end
At first glance, this may seem sensible, but unfortunately it leaves several loopholes that make the check easy to bypass. Let’s fix them!
In regular expressions, . matches any single character except a newline (\n). While a period/full stop fits that description, so does anything else. For example, a user could circumvent our restrictions with a URL like https://thoughtbot-com.example.com/phishing-page. So let’s use \. to represent the character literally:
/thoughtbot\.com/
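To see the difference in practice, here is a minimal sketch that runs both patterns against the lookalike URL from above. The helper names are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helpers, named here only for illustration.
def loose_match?(url)
  url.match?(/thoughtbot.com/)   # unescaped: "." matches any character
end

def literal_match?(url)
  url.match?(/thoughtbot\.com/)  # escaped: "." must be a literal period
end

lookalike = "https://thoughtbot-com.example.com/phishing-page"

puts loose_match?(lookalike)    # the lookalike domain slips through
puts literal_match?(lookalike)  # rejected once the period is escaped

# When the domain comes from a variable, Regexp.escape does the escaping.
pattern = Regexp.new(Regexp.escape("thoughtbot.com"))
puts pattern.match?(lookalike)
```

If the domain name lives in configuration rather than in the pattern itself, building the pattern with Regexp.escape is one way to avoid forgetting the backslash.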
Next, while it’s obvious to a human that our regular expression is meant to check the domain-name part of the URL, we didn’t actually specify that. A user could provide a URL like https://example.com/phishing-page?xyz=thoughtbot.com. We should make sure that we’re matching on the domain name:
/https:\/\/thoughtbot\.com/
Better, but there’s still more to fix. We’re not checking that the entire domain name is correct; a user could use subdomains to get around this, e.g. https://thoughtbot.com.example.com/phishing-page. We’ll ensure that the domain name is followed by a / to prevent this:
/https:\/\/thoughtbot\.com\//
Finally, we forgot to make sure that the domain name appears at the beginning of the string, meaning that someone could pass https://example.com/phishing-page/https://thoughtbot.com/. In Ruby, \A is the proper anchor for this; in other languages, it might be ^ (with multiline mode disabled):
/\Ahttps:\/\/thoughtbot\.com\//
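To check the whole progression, the sketch below runs every version of the pattern against the four bypass URLs from the walkthrough. The method names and labels are hypothetical, and the patterns are written with %r{} only to avoid escaping each slash; they are otherwise the same expressions. The two-line string at the end is my own illustrative input, included because Ruby’s ^ matches at the start of every line rather than only at the start of the string.

```ruby
# Each pattern from the walkthrough, in the order it was introduced.
def patterns
  {
    "unescaped dot"  => /thoughtbot.com/,
    "escaped dot"    => /thoughtbot\.com/,
    "with scheme"    => %r{https://thoughtbot\.com},
    "trailing slash" => %r{https://thoughtbot\.com/},
    "anchored"       => %r{\Ahttps://thoughtbot\.com/}
  }
end

# The URLs used above to defeat each successive pattern.
def bypass_urls
  [
    "https://thoughtbot-com.example.com/phishing-page",
    "https://example.com/phishing-page?xyz=thoughtbot.com",
    "https://thoughtbot.com.example.com/phishing-page",
    "https://example.com/phishing-page/https://thoughtbot.com/"
  ]
end

patterns.each do |name, pattern|
  accepted = bypass_urls.count { |url| url.match?(pattern) }
  puts "#{name}: accepts #{accepted} of #{bypass_urls.size} bypass URLs"
end

# Illustrative two-line input: ^ matches after the newline, \A does not.
multiline = "https://example.com/phishing-page\nhttps://thoughtbot.com/"
puts multiline.match?(%r{^https://thoughtbot\.com/})
puts multiline.match?(%r{\Ahttps://thoughtbot\.com/})
```

Each fix closes exactly one loophole: the patterns accept 4, 3, 2, 1, and finally 0 of the bypass URLs.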
Wow. That took a lot of work to create a sufficiently safe regular expression. And for more complex patterns, we might have to consider their efficiency so that we don’t encounter a problem similar to the regular expression that caused a Stack Overflow outage.
Alternatives to regular expressions
You shouldn’t completely reject regular expressions, but it’s well worth being judicious about when they’re an appropriate solution. Consider how the strings you’re analyzing are ultimately used. When a browser sends a request for a URL and a server receives it, those systems parse that string into its components (scheme, domain, path, query, etc.); where possible, use the same technique (perhaps via a widely used third-party library) to reduce the likelihood of mismatches. For the example above, we can use Ruby’s built-in URL parser:
require 'uri'

parsed_url = URI.parse(url)
unless parsed_url.scheme == 'https' && parsed_url.host == 'thoughtbot.com'
  raise "#{url} is invalid"
end
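One thing to keep in mind is that URI.parse itself raises URI::InvalidURIError when the input can’t be parsed at all. Here is a minimal sketch of the same check wrapped in a predicate that treats such input as invalid; the method name allowed_link? is hypothetical, and returning false for unparseable input is an assumption about the desired behavior.

```ruby
require 'uri'

# Hypothetical helper: true only for https URLs whose host is exactly
# thoughtbot.com.
def allowed_link?(url)
  parsed = URI.parse(url)
  parsed.scheme == 'https' && parsed.host == 'thoughtbot.com'
rescue URI::InvalidURIError
  false # assumption of this sketch: unparseable input counts as invalid
end

[
  "https://thoughtbot.com/blog/write-code-to-be-read",
  "https://thoughtbot-com.example.com/phishing-page",
  "https://example.com/phishing-page?xyz=thoughtbot.com",
  "https://thoughtbot.com.example.com/phishing-page",
  "not a url"
].each do |url|
  puts "#{url.inspect}: #{allowed_link?(url)}"
end
```

Only the first URL passes. The others fail because the parsed host is not exactly thoughtbot.com, and the last fails because it can’t be parsed.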
For almost any programming/markup language or standardized file format, I’d recommend using an official or popular parser; I feel much more confident with them than with ad-hoc regular expressions. If you’re analyzing strings whose format is trivial and largely inconsequential (e.g. guarding against typos in postcodes or ID card numbers), then I think regular expressions work well.
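As an example of that last case, here is a minimal sketch of a typo guard. The format (five digits, optionally followed by a hyphen and four more) and the method name plausible_zip? are assumptions chosen for illustration; a real postcode format depends on the country.

```ruby
# Hypothetical typo guard for an assumed format: five digits, optionally
# followed by a hyphen and four more digits.
def plausible_zip?(input)
  # \A and \z anchor the pattern to the whole string, as discussed above.
  input.strip.match?(/\A\d{5}(-\d{4})?\z/)
end

puts plausible_zip?("02108")
puts plausible_zip?("02108-1234")
puts plausible_zip?("0210")
```

A false negative here costs the user a retry at most, which is what makes a regular expression a reasonable fit.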