In this episode of the Weekly Iteration, Chris is joined by Matthew Mongeau, aka Goose, to discuss regular expressions. Learn how to work with them, where they fit, and perhaps more importantly where they don't.
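If you're new to regular expressions, check out these resources to get up to speed:

- [Ruby Rogues Episode][]
- [Regular Expressions.info][]
- [Mastering Regular Expressions O'Reilly][]
- [Oniguruma cheat sheet][]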
In Vim, use `\v` for "very magic" mode, a more standard regex interpretation. See `:h \v` for more info.

In Git, use `git log -i -E --grep` for a case-insensitive extended regex search of commit message bodies.

Use `String#scan` to slice all matches to a pattern out of a target string.
```ruby
sample = <<-PHONE
You can call me on my cell: (555) 555-5123, or my
home phone: (555) 555-6457 after 7pm
PHONE

phone_number_pattern = /\(\d{3}\)\s\d{3}-\d{4}/

sample.scan(phone_number_pattern).inspect
#=> ["(555) 555-5123", "(555) 555-6457"]
```
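When the pattern contains capture groups, `String#scan` returns the captured groups for each match rather than the whole matched text. The following is a minimal sketch of that behavior; the constant `GROUPED_PHONE_PATTERN`, the helper `phone_parts`, and the shortened sample string are hypothetical names introduced for illustration, not part of the episode.

```ruby
# Hypothetical variation on the phone pattern: each part of the number is
# wrapped in a capture group, so scan returns one array of groups per match.
GROUPED_PHONE_PATTERN = /\((\d{3})\)\s(\d{3})-(\d{4})/

# Illustrative helper (not from the episode): returns the captured parts
# of every phone number found in the text.
def phone_parts(text)
  text.scan(GROUPED_PHONE_PATTERN)
end

sample = "Cell: (555) 555-5123, home: (555) 555-6457"
p phone_parts(sample)
#=> [["555", "555", "5123"], ["555", "555", "6457"]]
```

This is handy when you need the pieces of each match separately, at the cost of losing the full matched string.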
Use `String#gsub` to replace all matches to a pattern, optionally yielding the matches to a block for more complex replacement behavior.
```ruby
sample = "Hello world, this is a test string."

word_pattern = /\w+/

substitutions = {
  "world" => "planet",
  "test" => "experiment",
  "a" => "an"
}

# gsub! mutates sample in place; the block's return value replaces each match
sample.gsub!(word_pattern) do |word|
  puts "Processing '#{word}'"
  substitutions.fetch(word, word)
end
#=> "Hello planet, this is an experiment string."
```
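For simple lookups like the one above, `String#gsub` also accepts a Hash as its second argument, in which case each match is looked up as a key. Here is a rough sketch of that form; the names `SUBSTITUTIONS`, `SUBSTITUTION_PATTERN`, and `substitute_words` are assumptions made for this example, and the pattern is narrowed to the Hash keys because a match with no corresponding key would otherwise be replaced with an empty string.

```ruby
# Hypothetical lookup table, mirroring the substitutions used above.
SUBSTITUTIONS = {
  "world" => "planet",
  "test" => "experiment",
  "a" => "an"
}.freeze

# Regexp.union builds an alternation of the keys; the \b word boundaries
# keep the match on whole words, so "testing" would be left alone.
SUBSTITUTION_PATTERN = /\b#{Regexp.union(SUBSTITUTIONS.keys)}\b/

# Illustrative helper: returns a new string and leaves the input untouched.
def substitute_words(text)
  # With a Hash as the second argument, each match is looked up as a key.
  text.gsub(SUBSTITUTION_PATTERN, SUBSTITUTIONS)
end

puts substitute_words("Hello world, this is a test string.")
#=> Hello planet, this is an experiment string.
```

The block form remains the better fit when the replacement needs logic, such as logging or a fallback value.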
Use named captures to label the capture groups in a pattern and produce a more expressive and maintainable pattern.
```ruby
# url_pattern = /^(https?:\/\/)?(\w+\.)?(\w+)\.(com|org|net|biz)$/

url_pattern = %r{
  ^
  (?<protocol>https?:\/\/)?
  (?<subdomain>\w+\.)?
  (?<domain>\w+)\.
  (?<tld>com|org|net|biz)
  $
}x

urls = [
  "http://google.com",
  "https://www.google.com",
  "www.google.com",
  "google.com"
]

urls.each do |url|
  matches = url.match(url_pattern)
  puts matches.inspect
end

# <MatchData "http://google.com" protocol:"http://" subdomain:nil domain:"google" tld:"com">
# <MatchData "https://www.google.com" protocol:"https://" subdomain:"www." domain:"google" tld:"com">
# <MatchData "www.google.com" protocol:nil subdomain:"www." domain:"google" tld:"com">
# <MatchData "google.com" protocol:nil subdomain:nil domain:"google" tld:"com">
```
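Once a pattern uses named captures, the resulting `MatchData` can be read by name instead of by position. The sketch below shows one way to do that; `URL_PATTERN` and `url_parts` are hypothetical names for this example, and the pattern is anchored with `\A` and `\z` here as an assumption about how you would want to validate a whole string.

```ruby
# Hypothetical constant: the same URL pattern, anchored to the whole string.
URL_PATTERN = %r{
  \A
  (?<protocol>https?://)?
  (?<subdomain>\w+\.)?
  (?<domain>\w+)\.
  (?<tld>com|org|net|biz)
  \z
}x

# Illustrative helper: returns a Hash of capture names to values,
# or nil when the URL does not match at all.
def url_parts(url)
  match = URL_PATTERN.match(url)
  return nil unless match

  match.named_captures
end

match = URL_PATTERN.match("https://www.google.com")
puts match[:domain] #=> google
puts match[:tld]    #=> com

# Groups that did not participate in the match come back as nil.
p url_parts("google.com")["subdomain"] #=> nil
```

Reading `match[:domain]` keeps calling code stable if a new group is later added to the pattern, which positional access like `match[3]` does not.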
To ensure you match the entirety of a string, use `\A` and `\z` rather than `^` and `$`, which in Ruby anchor to the start and end of a line. Proper string anchoring matters for security: it is needed to [avoid script injections][]. See the [Rails Guide to Security][] for additional detail.
```ruby
dangerous_email = "me@example.com\n<script>alert('pwned');</script>"

# line anchored
# valid_email_pattern = /^\w+@\w+\.\w+$/

# string anchored
valid_email_pattern = /\A\w+@\w+\.\w+\z/

def validate_email(email, pattern)
  if email =~ pattern
    puts "\nValid\n\n"
    puts "email '#{email}' is valid!"
  else
    puts "\nNOT VALID!!!!\n\n"
    puts "email '#{email}' is not valid!!!!"
  end
end

validate_email(dangerous_email, valid_email_pattern)
#=> NOT VALID
```
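To see the difference between the two anchoring styles side by side, the sketch below runs both patterns against the same multi-line input. The constant names `LINE_ANCHORED` and `STRING_ANCHORED` are hypothetical, chosen for this example only.

```ruby
# Hypothetical names for the two patterns discussed above.
LINE_ANCHORED = /^\w+@\w+\.\w+$/
STRING_ANCHORED = /\A\w+@\w+\.\w+\z/

dangerous_email = "me@example.com\n<script>alert('pwned');</script>"

# ^ and $ are satisfied by the first line alone, so the payload slips through.
puts LINE_ANCHORED.match?(dangerous_email) #=> true

# \A and \z must span the entire string, so the trailing payload is rejected.
puts STRING_ANCHORED.match?(dangerous_email) #=> false

# A plain single-line address still passes the stricter pattern.
puts STRING_ANCHORED.match?("me@example.com") #=> true
```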
[avoid script injections]: http://homakov.blogspot.com/2012/05/saferweb-injects-in-various-ruby.html
[Rails Guide to Security]: http://guides.rubyonrails.org/security.html#regular-expressions
You can use a special regex construct to [use comments to document][] your pattern.
[use comments to document]: http://ruby-doc.org/core-1.9.3/Regexp.html#class-Regexp-label-Free-Spacing+Mode+and+Comments
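As a rough illustration of documenting a pattern with comments, the phone number pattern from earlier could be written in free-spacing mode (the `x` flag), where whitespace in the pattern is ignored and `#` starts a comment. The constant name `PHONE_NUMBER_PATTERN` and the descriptions of each part are assumptions made for this sketch.

```ruby
# Hypothetical constant: the phone pattern rewritten in free-spacing mode.
PHONE_NUMBER_PATTERN = /
  \( \d{3} \)   # area code wrapped in literal parentheses
  \s            # a single whitespace character
  \d{3}         # three digit exchange
  -             # literal hyphen
  \d{4}         # four digit line number
/x

puts PHONE_NUMBER_PATTERN.match?("(555) 555-5123") #=> true
puts PHONE_NUMBER_PATTERN.match?("555-5123")       #=> false
```

Because literal whitespace is ignored under the `x` flag, the space between the area code and the exchange has to be matched explicitly with `\s`.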
While regular expressions are an incredibly powerful tool, they fall short in a vast number of use cases. Instead, look to parsers and lexers to more intelligently and reliably split up a complex input based on multiple and more subtle rules.
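As one possible starting point for a lexer, Ruby's standard library ships `StringScanner` (in `strscan`), which walks through the input applying one small pattern at a time. The sketch below is only an illustration: the token rules, the tiny arithmetic language, and the `tokenize` helper are all assumptions made for this example, not something covered in the episode.

```ruby
require "strscan"

# Hypothetical token rules for a tiny arithmetic language.
TOKEN_RULES = [
  [:number, /\d+/],
  [:operator, %r{[-+*/]}],
  [:lparen, /\(/],
  [:rparen, /\)/]
].freeze

# Illustrative lexer: turns the input into [type, text] pairs and raises
# on anything that no rule recognizes.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []

  until scanner.eos?
    next if scanner.skip(/\s+/) # ignore whitespace between tokens

    # check tests a pattern at the current position without advancing.
    rule = TOKEN_RULES.find { |_type, pattern| scanner.check(pattern) }
    raise ArgumentError, "unexpected character at position #{scanner.pos}" unless rule

    type, pattern = rule
    tokens << [type, scanner.scan(pattern)] # scan consumes the matched text
  end

  tokens
end

p tokenize("(1 + 23) * 4")
#=> [[:lparen, "("], [:number, "1"], [:operator, "+"], [:number, "23"], [:rparen, ")"], [:operator, "*"], [:number, "4"]]
```

Each rule is still a small regular expression, but nesting and ordering are left to ordinary code (or a parser consuming these tokens) rather than to one large pattern.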
[Parsing HTML With Regex]: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
[CodingHorror - Regular Expressions: Now You Have Two Problems]: http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/
[Ruby Rogues Episode]: http://rubyrogues.com/105-rr-regular-expressions-with-nell-shamrell/
[Regular Expressions.info]: http://www.regular-expressions.info
[Mastering Regular Expressions O'Reilly]: http://shop.oreilly.com/product/9780596528126.do?sortby=publicationDate
[Oniguruma cheat sheet]: https://thoughtbot-images.s3.amazonaws.com/upcase/oniguruma-regex-cheat-sheet.txt