---
title: Why should I avoid regular expressions?
teaser: 'Regular expressions are wondrous, but they''re ill-suited for many tasks;
  parsers are often more reliable.

  '
tags: regex,ruby,security,web
author: Summer ☀️
published_on: 2024-12-13
---

In my experience, most developers dread working with
[regular expressions](https://thoughtbot.com/blog/back-to-basics-regular-expressions),
but I love them. I enjoy using their dense syntax to write expressive patterns
that solve all sorts of cool challenges. I use them every day: on the command
line with `grep` and `sed`, in my editor's search-and-replace tool, in
[crosswords](https://regexcrossword.com/), and of course in production code.

But their power must be carefully wielded, because it's very easy to introduce
unexpected false positives. Let's take a look at a regular expression that
demonstrates this danger and talk about how to fix it.

## An example of a bad regular expression

Imagine that you're implementing a commenting feature for
[a well-regarded tech blog](https://thoughtbot.com/blog). Users should be able
to link text to other blog posts, but they shouldn't be allowed to link to other
websites. So you write a regular expression to ensure that link URLs only refer
to the blog's domain name, for example
[`https://thoughtbot.com/blog/write-code-to-be-read`](https://thoughtbot.com/blog/write-code-to-be-read):

```ruby
unless url =~ /thoughtbot.com/
  raise "#{url} is invalid"
end
```

At first glance, this may seem sensible, but unfortunately it leaves several
loopholes that can be easily exploited. Let's fix them!

In regular expressions, `.` matches any single character except a newline
(`\n`). While a period/full stop fits that description, so does anything else.
For example, a user could circumvent our restrictions with a URL like
`https://thoughtbot-com.example.com/phishing-page`. So let's use `\.` to
represent the character literally:

```ruby
/thoughtbot\.com/
```
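
As a quick sanity check, here's a small sketch that runs both versions of the
pattern against the phishing URL from above (the variable names are just for
illustration):

```ruby
phishing_url = "https://thoughtbot-com.example.com/phishing-page"

# The unescaped `.` happily matches the `-` in "thoughtbot-com".
unescaped_match = phishing_url.match?(/thoughtbot.com/)

# The escaped `\.` only matches a literal period.
escaped_match = phishing_url.match?(/thoughtbot\.com/)

puts unescaped_match # true
puts escaped_match   # false
```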

Next, while it's obvious to a human that our regular expression is meant to
check the domain-name part of the URL, we didn't actually specify that. A user
could provide a URL like `https://example.com/phishing-page?xyz=thoughtbot.com`.
We should make sure that we're matching on the domain name by requiring it to
come directly after the scheme:

```ruby
/https:\/\/thoughtbot\.com/
```

Better, but there's still more to fix. We're not checking that the _entire_
domain name is correct; a user could use subdomains to get around this, _e.g._
`https://thoughtbot.com.example.com/phishing-page`. We'll ensure that the domain
name is followed by a `/` to prevent this:

```ruby
/https:\/\/thoughtbot\.com\//
```

Finally, we forgot to make sure that the domain name appears at the beginning of
the string, meaning that someone could pass
`https://example.com/phishing-page/https://thoughtbot.com/`.
[In Ruby, `\A` is the proper anchor for this](https://andycroll.com/ruby/beginning-and-end-of-string-in-regex/);
in other languages, it might be `^` (with
[multiline mode](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/multiline)
disabled):

```ruby
/\Ahttps:\/\/thoughtbot\.com\//
```
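
To convince ourselves, here's a rough sketch that runs this final pattern
against every bypass attempt from this post. It also shows why `^` isn't enough
in Ruby, where it matches at the start of any line rather than the start of the
string. The variable names and the multi-line URL are my own assumptions for
illustration, and I've written the pattern with `%r{}` to avoid escaping the
slashes:

```ruby
# Same pattern as above, written with %r{} so the slashes need no escaping.
pattern = %r{\Ahttps://thoughtbot\.com/}

bypass_attempts = [
  "https://thoughtbot-com.example.com/phishing-page",
  "https://example.com/phishing-page?xyz=thoughtbot.com",
  "https://thoughtbot.com.example.com/phishing-page",
  "https://example.com/phishing-page/https://thoughtbot.com/",
]

# None of the earlier bypass attempts should match.
all_rejected = bypass_attempts.none? { |url| url.match?(pattern) }

# A legitimate link still matches.
accepted = "https://thoughtbot.com/blog/write-code-to-be-read".match?(pattern)

# In Ruby, `^` matches after any newline, so a multi-line string slips through.
multiline_url = "https://example.com/phishing-page\nhttps://thoughtbot.com/"
caret_match = multiline_url.match?(%r{^https://thoughtbot\.com/})
anchored_match = multiline_url.match?(pattern)

puts all_rejected   # true
puts accepted       # true
puts caret_match    # true
puts anchored_match # false
```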

Wow. It took a lot of work to create a sufficiently safe regular expression.
And for more complex patterns, we might have to consider their efficiency so
that we don't encounter a problem similar to
[the regular expression that caused a Stack Overflow outage](https://stackstatus.tumblr.com/post/147710624694/outage-postmortem-july-20-2016).

## Alternatives to regular expressions

You shouldn't completely reject regular expressions, but it's well worth being
judicious about when they're an appropriate solution. Consider how the
strings you're analyzing are ultimately used. When a browser sends a request for
a URL and a server receives it, those systems scan through that string to
separate it into its components (scheme, domain, path, query, etc.); where
possible, use the same technique (perhaps via a widely used third-party library)
to reduce the likelihood of mismatches. For the example above, we can use
[Ruby's built-in URL parser](https://docs.ruby-lang.org/en/3.3/URI.html):

```ruby
require 'uri'

# URI.parse raises URI::InvalidURIError for strings it can't parse.
parsed_url = URI.parse(url)
unless parsed_url.scheme == 'https' && parsed_url.host == 'thoughtbot.com'
  raise "#{url} is invalid"
end
```
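
If you'd rather treat unparseable strings as invalid instead of letting
`URI.parse` raise, one option is to wrap the check in a small predicate. This is
only a sketch: `allowed_blog_url?` is a hypothetical helper name, and the
hard-coded scheme and host are assumptions carried over from the example above.

```ruby
require "uri"

# Hypothetical helper: true only for HTTPS URLs on the blog's exact host.
def allowed_blog_url?(url)
  parsed_url = URI.parse(url)
  parsed_url.scheme == "https" && parsed_url.host == "thoughtbot.com"
rescue URI::InvalidURIError
  # Strings that can't be parsed as URIs are treated as invalid.
  false
end

puts allowed_blog_url?("https://thoughtbot.com/blog/write-code-to-be-read") # true
puts allowed_blog_url?("https://thoughtbot.com.example.com/phishing-page")  # false
puts allowed_blog_url?("not a url")                                         # false
```

Because the parser separates the host from the path and query for us, none of
the earlier tricks need special handling.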

For almost any programming/markup language or standardized file format, I'd
recommend using an official or popular parser; I feel much more confident with
them than with ad-hoc regular expressions. If you're analyzing strings whose
format is trivial and largely inconsequential (_e.g._ guarding against typos
in postcodes or ID card numbers), then I think regular expressions work well.
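
As a sketch of that kind of low-stakes check, here's a typo guard for US ZIP
codes. The format (five digits, optionally followed by a hyphen and four more)
and the variable names are assumptions I've picked for illustration; your
postcode format will likely differ.

```ruby
# Assumed format: five digits, optionally followed by a hyphen and four more
# digits. `\A` and `\z` anchor the pattern to the whole string.
zip_code_pattern = /\A\d{5}(-\d{4})?\z/

valid_short      = "02108".match?(zip_code_pattern)
valid_long       = "02108-1234".match?(zip_code_pattern)
typo             = "0210".match?(zip_code_pattern)
trailing_newline = "02108\n".match?(zip_code_pattern)

puts valid_short      # true
puts valid_long       # true
puts typo             # false
puts trailing_newline # false
```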
