Notes

In this episode of the Weekly Iteration, Chris is joined by Matthew Mongeau, aka Goose, to discuss regular expressions. Learn how to work with them, where they fit, and perhaps more importantly where they don't.

Regex Introduction

If you're new to regular expressions, check out these resources to get up to speed:

Interactive Regex Testing Utilities

  • Rubular - Ruby regex testing tool
  • Scriptular - Similar to Rubular, but using JS
  • Reggyapp - OS X app with multiple language support

Regex Visualization Utilities

  • Debuggex - See the parsed state machine representation of your regex
  • Regexper - Similar to Debuggex, but with friendlier summaries.

Support

  • Ruby 1.9+ has great support thanks to the Oniguruma engine
  • Vim has a relatively complete regex implementation but uses a very non-standard syntax that defaults to treating most characters literally. Use \v for "very magic" mode, which gives a more standard regex interpretation. See :h \v for more info.
  • JavaScript has a solid and mostly standard implementation, although not as complete as those in Vim and Ruby. Try XRegExp for more advanced features.
  • Git: use git log -i -E --grep=<pattern> for a case-insensitive, extended-regex search of commit messages.

Ruby Tricks

String#scan

Use String#scan to slice all matches to a pattern out of a target string.

sample = <<-PHONE
  You can call me on my cell: (555) 555-5123, or my
  home phone: (555) 555-6457 after 7pm
PHONE

phone_number_pattern = /\(\d{3}\)\s\d{3}-\d{4}/

sample.scan(phone_number_pattern).inspect
#=> ["(555) 555-5123", "(555) 555-6457"]
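
When the pattern contains capture groups, String#scan returns an array of the captured groups for each match rather than the whole matched string. The sketch below illustrates this with a hypothetical variant of the phone pattern (the name grouped_pattern and the shortened sample text are assumptions made for this example):

```ruby
sample = "Cell: (555) 555-5123, home: (555) 555-6457"

# Hypothetical variant of the phone pattern with three capture groups added
grouped_pattern = /\((\d{3})\)\s(\d{3})-(\d{4})/

# With groups present, each match becomes an array of its captures
parts = sample.scan(grouped_pattern)

parts.each do |area, prefix, line|
  puts "#{area}-#{prefix}-#{line}"
end
```

This can be handy when you want to reformat each match rather than just collect it.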

gsub with a block

Use String#gsub to replace all matches to a pattern, optionally yielding the matches to a block for more complex replacement behavior.

sample = "Hello world, this is a test string."

word_pattern = /\w+/

substitutions = {
  "world" => "planet",
  "test" => "experiment",
  "a" => "an"
}

sample.gsub!(word_pattern) do |word|
  puts "Processing '#{word}'"
  substitutions.fetch(word, word)
end
#=> "Hello planet, this is an experiment string."
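
For simple lookups like this one, String#gsub also accepts a hash as its second argument in place of a block. A minimal sketch, assuming you only want to match words that are keys in the hash (keys_pattern is a name introduced for this example). With the hash form, any matched text that is missing from the hash is replaced with an empty string, which is why the pattern is narrowed to the known keys:

```ruby
sample = "Hello world, this is a test string."

substitutions = {
  "world" => "planet",
  "test" => "experiment",
  "a" => "an"
}

# Match only whole words that appear as keys in the hash
keys_pattern = /\b#{Regexp.union(substitutions.keys)}\b/

# Each match is looked up in the hash and replaced with its value
result = sample.gsub(keys_pattern, substitutions)
puts result
```

The block form remains the better fit when the replacement needs logic beyond a lookup.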

Named Captures

Use named captures to label the capture groups in a pattern and produce a more expressive and maintainable pattern.

# url_pattern = /^(https?:\/\/)?(\w+\.)?(\w+)\.(com|org|net|biz)$/

url_pattern = %r{
  ^
  (?<protocol>https?:\/\/)?
  (?<subdomain>\w+\.)?
  (?<domain>\w+)\.
  (?<tld>com|org|net|biz)
  $
}x

urls = [
  "http://google.com",
  "https://www.google.com",
  "www.google.com",
  "google.com"
]

urls.each do |url|
  matches = url.match(url_pattern)

  puts matches.inspect
end

# <MatchData "http://google.com" protocol:"http://" subdomain:nil domain:"google" tld:"com">
# <MatchData "https://www.google.com" protocol:"https://" subdomain:"www." domain:"google" tld:"com">
# <MatchData "www.google.com" protocol:nil subdomain:"www." domain:"google" tld:"com">
# <MatchData "google.com" protocol:nil subdomain:nil domain:"google" tld:"com">
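
Once a pattern has named captures, the resulting MatchData can be indexed by name instead of by position. A short sketch reusing the pattern from above (the variable names match and parts are assumptions for this example):

```ruby
url_pattern = %r{
  ^
  (?<protocol>https?:\/\/)?
  (?<subdomain>\w+\.)?
  (?<domain>\w+)\.
  (?<tld>com|org|net|biz)
  $
}x

match = url_pattern.match("https://www.google.com")

# Look up captures by name rather than by position
puts match[:protocol]
puts match[:domain]

# Pair each group name with its captured value
parts = match.names.zip(match.captures).to_h
puts parts.inspect
```

Reading match[:domain] is far less error-prone than remembering that the domain is the third group.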

Proper String Anchoring

To match the entirety of a string, use \A and \z rather than ^ and $, which anchor to the start and end of each line rather than of the whole string. Proper string anchoring is a security measure: it helps [avoid script injections][]. See the [Rails Guide to Security][] for additional detail.

dangerous_email = "me@example.com\n<script>alert('pwned');</script>"

# line anchored
# valid_email_pattern = /^\w+@\w+\.\w+$/

# string anchored
valid_email_pattern = /\A\w+@\w+\.\w+\z/

def validate_email(email, pattern)
  if email =~ pattern
    puts "\nValid\n\n"
    puts "email '#{email}' is valid!"
  else
    puts "\nNOT VALID!!!!\n\n"
    puts "email '#{email}' is not valid!!!!"
  end
end

validate_email(dangerous_email, valid_email_pattern)
#=> NOT VALID
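
To see the difference side by side, the sketch below runs both anchoring styles against the same dangerous input (the names line_anchored, string_anchored, and the result variables are assumptions for this example). It also shows why the lowercase \z is preferred: the uppercase \Z still tolerates a trailing newline.

```ruby
dangerous_email = "me@example.com\n<script>alert('pwned');</script>"

line_anchored   = /^\w+@\w+\.\w+$/
string_anchored = /\A\w+@\w+\.\w+\z/

# ^ and $ match around the first line only, so the payload slips through
line_result = !(dangerous_email =~ line_anchored).nil?

# \A and \z require the whole string to match, so the payload is rejected
string_result = !(dangerous_email =~ string_anchored).nil?

# \Z (uppercase) still allows a single trailing newline
trailing_newline_result = !("me@example.com\n" =~ /\A\w+@\w+\.\w+\Z/).nil?

puts "line anchored:   #{line_result}"
puts "string anchored: #{string_result}"
puts "\\Z with trailing newline: #{trailing_newline_result}"
```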

[avoid script injections]: http://homakov.blogspot.com/2012/05/saferweb-injects-in-various-ruby.html
[Rails Guide to Security]: http://guides.rubyonrails.org/security.html#regular-expressions

Using Comments to Document Your Pattern

Ruby's free-spacing mode (the x flag) and the (?#...) construct let you [use comments to document][] your pattern.

[use comments to document]: http://ruby-doc.org/core-1.9.3/Regexp.html#class-Regexp-label-Free-Spacing+Mode+and+Comments
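
As a rough illustration, here is the phone number pattern from the String#scan section rewritten with the x flag, which makes the regex engine ignore whitespace and treat everything after a # as a comment (the name inline_commented is an assumption for this example):

```ruby
phone_number_pattern = /
  \(\d{3}\)   # area code wrapped in literal parentheses
  \s          # a single whitespace character
  \d{3}       # three-digit prefix
  -           # literal hyphen
  \d{4}       # four-digit line number
/x

# Inline comments are also available without the x flag
inline_commented = /\d{3}(?#prefix)-\d{4}(?#line number)/

sample = "Call (555) 555-5123 or (555) 555-6457"
numbers = sample.scan(phone_number_pattern)
puts numbers.inspect
```

Note that in free-spacing mode a literal space in the pattern must be written as \s or escaped, since bare whitespace is ignored.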

When Not To Use

While regular expressions are an incredibly powerful tool, they fall short in a vast number of use cases. In those cases, look to parsers and lexers to split up a complex input more intelligently and reliably, based on multiple and more subtle rules.

  • [Parsing HTML With Regex][]
  • [CodingHorror - Regular Expressions: Now You Have Two Problems][]
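
URLs are one such case. The url_pattern from the named captures section only handles a handful of shapes, whereas a purpose-built parser copes with paths, queries, and other details. A minimal sketch using Ruby's standard library URI module (the example URL is purely illustrative):

```ruby
require "uri"

# Let the standard library's parser split the URL into its components
uri = URI.parse("https://www.google.com/search?q=regex")

puts uri.scheme
puts uri.host
puts uri.path

# The query string can be decoded into key/value pairs as well
params = URI.decode_www_form(uri.query)
puts params.inspect
```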

[Parsing HTML With Regex]: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
[CodingHorror - Regular Expressions: Now You Have Two Problems]: http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/

Additional References

  • [Oniguruma cheat sheet][] - Comprehensive summary of the operations and syntax supported by Oniguruma, the Ruby regex engine.
  • [Ruby Rogues Episode][] - Great episode covering all things regex.
  • [Regular Expressions.info][] - An impressive resource covering the more nuanced and powerful features of regex engines, often found in Google results.
  • [Mastering Regular Expressions O'Reilly][] - When things get heavy, turn to this book.

[Ruby Rogues Episode]: http://rubyrogues.com/105-rr-regular-expressions-with-nell-shamrell/
[Regular Expressions.info]: http://www.regular-expressions.info
[Mastering Regular Expressions O'Reilly]: http://shop.oreilly.com/product/9780596528126.do?sortby=publicationDate
[Oniguruma cheat sheet]: https://thoughtbot-images.s3.amazonaws.com/upcase/oniguruma-regex-cheat-sheet.txt