Regular expressions have a reputation for being cryptic and arcane, and with good reason: their syntax is dense and non-obvious. Unfortunately, that leads many people to stop treating them as real code: they copy and paste them without checking what they actually match, or they skim past them in code reviews. This isn’t ideal. Is there a way to help ensure that regexes get the scrutiny they warrant?
Yes: we can comment them! For example, let’s take this regex for a USA postal code in Ruby:
usa_postal_code_pattern = /\A\d{5}(-\d{4})?\z/
That’s pretty hard to read; no wonder we want to gloss over it. Using Ruby’s “extended mode” for regexes via the x flag (and a %r{⋯} symmetrical percent literal for better readability across multiple lines), we can split that into parts and add comments explaining them:
usa_postal_code_pattern = %r{
  \A       # Beginning of string
  \d{5}    # 5 digits
  (        # ZIP+4
    -      # Hyphen
    \d{4}  # 4 digits
  )?       # ZIP+4 is optional
  \z       # End of string
}x
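As a quick sanity check, the commented pattern should accept and reject exactly the same strings as the one-line original. Here is a minimal sketch of that comparison; the variable names and the sample strings are chosen purely for illustration:

```ruby
# One-line original
compact_pattern = /\A\d{5}(-\d{4})?\z/

# Commented version in extended mode
commented_pattern = %r{
  \A       # Beginning of string
  \d{5}    # 5 digits
  (        # ZIP+4
    -      # Hyphen
    \d{4}  # 4 digits
  )?       # ZIP+4 is optional
  \z       # End of string
}x

# Illustrative inputs: two valid codes, then several near misses
samples = ["12345", "12345-6789", "1234", "12345-678", "12345 6789", "12345\n"]

# Run every sample through both patterns
results = samples.map do |sample|
  [sample, compact_pattern.match?(sample), commented_pattern.match?(sample)]
end

results.each do |sample, compact, commented|
  puts format("%-14s compact=%-5s commented=%s", sample.inspect, compact, commented)
end
```

Both columns should agree on every line. If they ever diverge, the comments (or the whitespace handling) have changed the pattern’s meaning.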
Beware that because whitespace is deliberately ignored in this mode, you must escape it when you want to represent literal whitespace characters. For example, here’s a pattern for UK postal codes:
uk_postal_code_pattern = %r{
  \A          # Beginning of string
  [A-Z]{1,2}  # 1–2 capital letters
  \d          # Digit
  [A-Z\d]?    # Optional capital letter or digit
  (\ )        # Single space
  \d          # Digit
  [A-Z]{2}    # 2 capital letters
  \z          # End of string
}x
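To see the whitespace rule in action, here is a small sketch comparing an unescaped space with an escaped one in extended mode, followed by the UK pattern applied to a few inputs. The toy patterns and sample strings are made up for illustration:

```ruby
# Unescaped spaces are ignored in extended mode, so this is equivalent to /\Aab\z/
ignored_space = /\A a b \z/x

# The backslash-escaped space is kept as a literal space character
escaped_space = /\A a\ b \z/x

puts ignored_space.match?("ab")   # true
puts ignored_space.match?("a b")  # false
puts escaped_space.match?("a b")  # true
puts escaped_space.match?("ab")   # false

uk_postal_code_pattern = %r{
  \A          # Beginning of string
  [A-Z]{1,2}  # 1–2 capital letters
  \d          # Digit
  [A-Z\d]?    # Optional capital letter or digit
  (\ )        # Single space
  \d          # Digit
  [A-Z]{2}    # 2 capital letters
  \z          # End of string
}x

# Illustrative inputs: with the space, without it, and with two spaces
uk_samples = ["SW1A 1AA", "M1 1AE", "SW1A1AA", "SW1A  1AA"]
uk_results = uk_samples.map { |sample| uk_postal_code_pattern.match?(sample) }
puts uk_results.inspect
```

Only the first two UK samples should match, because the escaped space requires exactly one space between the two halves of the code.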
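This is possible in other languages too! Perl supports it with its own x modifier; Python calls it the “verbose” flag (re.VERBOSE); in JavaScript you can get a similar effect by concatenating commented string fragments and passing the result to the RegExp constructor.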