Regular expressions have been around since the early days of computer science. They gained widespread adoption with the introduction of Unix. A regular expression is a notation for describing sets of character strings. They are used to identify patterns within strings. There are many useful applications of this functionality, most notably string validations, find and replace, and pulling information out of strings.
Regular expressions are just strings themselves. Each character in a regular expression can either be part of a code that makes up a pattern to search for, or it can represent a letter, character or word itself. Let’s take a look at some examples.
Basics
First let’s look at an example of a regular expression that is made up of only actual characters and none of the special characters or patterns that generally make up regular expressions.
To get started, let’s fire up `irb` and create our regular expression:
> regex = /back to basics/
=> /back to basics/
Notice we create a regular expression by entering a pattern between two forward slashes. The pattern we’ve used here will only match strings that contain the string `back to basics`. Let’s use the `match` method, which gives us information about the first match it finds, to look at some examples of what matches and what doesn’t:
> regex.match('basics to back')
=> nil
We’re getting close, but nothing in this string matches our regular expression, so we get `nil`.
> regex.match('i enjoyback to basics')
=> <MatchData "back to basics">
After an unsuccessful attempt we have a match. Notice that our regular expression matched even though there are no spaces between the pattern and the words before it.
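When all we need is a yes-or-no answer, for instance in a validation, a full match object is more than we need. The following is a minimal sketch rather than anything from the examples above: it assumes Ruby 2.4 or later for `match?`, and the constant and method names are made up for illustration.

```ruby
# Hypothetical names, for illustration only.
BASICS = /back to basics/

# Regexp#match? returns true or false without building a MatchData object.
def mentions_basics?(text)
  BASICS.match?(text)
end

# =~ returns the index where the match starts, or nil when nothing matches.
def basics_index(text)
  BASICS =~ text
end

p mentions_basics?('i enjoyback to basics') # => true
p mentions_basics?('basics to back')        # => false
p basics_index('i enjoyback to basics')     # => 7
```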
MatchData
The object returned from a `Regexp` object’s `match` method is of type `MatchData`. This object can tell us all sorts of things about a particular match. Let’s take a look at some of the information we can get about our match.
We can use the `begin` method to find out the offset of the beginning of our match in the original string:
> match = regex.match('i enjoyback to basics')
=> <MatchData "back to basics">
> match.begin(0)
=> 7
> 'i enjoyback to basics'[7]
=> "b"
The argument we pass to the method specifies which capture (a concept covered below) we’re asking about; 0 refers to the entire match. In the example above, `begin` tells us that our match starts at index 7 of our string. As we can see from the code above, the 8th character in the string (index 7) is ‘b’, the first letter of our match.
Similarly, we can get the index of the character following the end of our match using the `end` method:
> match.end(0)
=> 21
> 'i enjoyback to basics'[21]
=> nil
In this case we get nil since the end of our match is also the end of our string.
We can also use the `to_s` method to get the matched text as a string:
> match.to_s
=> "back to basics"
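To tie `begin`, `end` and the matched text together, here is a small sketch. The `match_span` helper is hypothetical, not part of Ruby; it only combines the `MatchData` calls shown above.

```ruby
# Hypothetical helper: summarize where a pattern matched inside a string.
def match_span(regex, text)
  match = regex.match(text)
  return nil unless match

  start  = match.begin(0) # index of the first matched character
  finish = match.end(0)   # index just past the last matched character
  # An exclusive range (three dots) stops right before `finish`,
  # so the slice is exactly the matched text.
  { start: start, finish: finish, text: text[start...finish] }
end

p match_span(/back to basics/, 'i enjoyback to basics')
p match_span(/back to basics/, 'basics to back') # => nil
```

Because `end` points one past the match, slicing with an exclusive range gives back the same text as `to_s`.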
Patterns
The regular expression’s real power becomes obvious when we introduce patterns. Let’s take a look at some examples.
Metacharacters
A metacharacter is any character that has a special meaning within a regular expression. Let’s start with something simple: say we want to find out if our string contains a number. This requires our first pattern, `\d`, a metacharacter that matches any digit:
> string_to_match = 'back 2 basics'
> regex = /\d/
=> /\d/
> regex.match(string_to_match)
=> <MatchData "2">
Our regular expression matches the number 2 in our string.
Character Classes
Let’s say we wanted to find out if any of the letters from ‘k’ to ‘s’ were in our string. This will require we use a character class. A character class lets us specify a list of characters or patterns that we’re looking for:
> string_to_match = 'i enjoy making stuff'
> regex = /[klmnopqrs]/
=> /[klmnopqrs]/
> regex.match(string_to_match)
=> <MatchData "n">
In this example we entered all the letters of the alphabet we were interested in between the brackets, and the first instance of any of those characters results in a match. We can simplify the above regular expression by using a range. This is done by entering two characters or numbers separated by a `-`:
> string_to_match = 'i enjoy making stuff'
> regex = /[k-s]/
=> /[k-s]/
> regex.match(string_to_match)
=> <MatchData "n">
As expected, we get the same results with our simplified regular expression.
It’s also possible to invert a character class. This is done by adding a `^` to the beginning of the class. If we wanted to look for the first character not between ‘k’ and ‘s’ we would use the pattern `/[^k-s]/`:
> string_to_match = 'i enjoy making stuff'
> regex = /[^k-s]/
=> /[^k-s]/
> regex.match(string_to_match)
=> <MatchData "i">
Since ‘i’ isn’t in our range the first letter in our string meets the criteria our regular expression specified.
Another thing worth noting is that the `\d` metacharacter we used above is shorthand for the character class `[0-9]`.
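A quick way to convince ourselves that the two spellings behave the same is to run them side by side. This is only a sketch; the constant and helper names are hypothetical.

```ruby
# Hypothetical names, for illustration only.
DIGIT_SHORTHAND = /\d/
DIGIT_CLASS     = /[0-9]/

# Return the text of the first match, or nil when there is none.
def first_digit(regex, text)
  match = regex.match(text)
  match && match[0]
end

p first_digit(DIGIT_SHORTHAND, 'back 2 basics') # => "2"
p first_digit(DIGIT_CLASS, 'back 2 basics')     # => "2"
p first_digit(DIGIT_CLASS, 'no digits here')    # => nil
```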
Modifiers
We have the ability to set a regular expression’s matching mode via modifiers. In Ruby this is done by appending characters after the closing slash of the pattern. A particularly useful modifier is the case insensitive modifier `i`. Let’s take a look:
> string_to_match = 'BACK to BASICS'
> regex = /back to basics/i
=> /back to basics/i
> regex.match(string_to_match)
=> <MatchData "BACK to BASICS">
The regular expression matches our string in spite of the fact that the cases are clearly not the same. We’ll look at another common modifier later on in the blog.
Repetitions
Repetitions give us the ability to look for repeated patterns. We can search broadly for patterns that repeat an indeterminate number of times, or we can get as granular as the exact number of repetitions we’re looking for.
Let’s try to identify all the numbers in a string again:
> string_to_match = 'The Mavericks beat the Spurs by 21 in game two.'
> regex = /\d/
=> /\d/
> regex.match(string_to_match)
=> <MatchData "2">
Because we used only a single `\d` we only got the first digit, in this case ‘2’. What we’re actually looking for is the entire number, not just the first digit. We can fix this by modifying our pattern so that it finds any group of contiguous digits. For this we can use the `+` metacharacter. This tells the regular expression engine to find one or more of whatever the previous pattern matches. Let’s take a look:
> string_to_match = 'The Mavericks beat the Spurs by 21 in game two.'
> regex = /\d+/
=> /\d+/
> regex.match(string_to_match)
=> <MatchData "21">
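We could also look for an exact number of repetitions. Let’s say we only wanted to look for three-digit numbers, those between 100 and 999. One way we could do that would be using the `{n}` pattern, where `n` indicates the exact number of repetitions we’re looking for (on its own it will also match three digits inside a longer number, which is fine for our sample string):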
> string_to_match = 'In 30 years the San Francisco Giants have had two 100 win seasons.'
> regex = /\d{3}/
=> /\d{3}/
> regex.match(string_to_match)
=> <MatchData "100">
Our pattern doesn’t match 30, but does match 100 because we told it only three repeating digit characters constituted a match.
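One caveat is that `\d{3}` is just as happy to match three digits in the middle of a longer number. The sketch below shows one possible way to tighten it up. It relies on the `\b` word-boundary anchor, which this post does not cover, and on a `[1-9]` class to rule out a leading zero; the constant and helper names are hypothetical.

```ruby
# Hypothetical names, for illustration only.
THREE_DIGITS_ANYWHERE = /\d{3}/          # any three consecutive digits
STANDALONE_100_TO_999 = /\b[1-9]\d{2}\b/ # a whole number from 100 to 999

# Return the text of the first match, or nil when there is none.
def first_number(regex, text)
  match = regex.match(text)
  match && match[0]
end

p first_number(THREE_DIGITS_ANYWHERE, 'The Giants won in 2014.') # => "201"
p first_number(STANDALONE_100_TO_999, 'The Giants won in 2014.') # => nil
p first_number(STANDALONE_100_TO_999, 'two 100 win seasons')     # => "100"
```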
Let’s look for words that are five or more characters long. This will require a new metacharacter, `\w`, which matches any word character (a letter, a digit or an underscore). Then we’ll use the `{n,}` pattern, which says look for `n` or more of the previous pattern:
> string_to_match = 'we are only looking for long words'
> regex = /\w{5,}/
=> /\w{5,}/
> regex.match(string_to_match)
=> <MatchData "looking">
You can also specify an upper bound with `{,m}` (at most `m` repetitions) and a range with `{n,m}` (at least `n` and at most `m` repetitions).
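Here is a short sketch of the two bounded forms in action. The constants and the `matched_text` helper are hypothetical names used only for this illustration.

```ruby
# Hypothetical names, for illustration only.
AT_MOST_THREE = /a{,3}/  # zero to three 'a' characters
TWO_TO_FOUR   = /a{2,4}/ # two to four 'a' characters

# Return the text of the first match, or nil when there is none.
def matched_text(regex, text)
  match = regex.match(text)
  match && match[0]
end

p matched_text(AT_MOST_THREE, 'aaaaa') # => "aaa"
p matched_text(AT_MOST_THREE, 'bbb')   # => "" (zero repetitions still count as a match)
p matched_text(TWO_TO_FOUR, 'aaaaa')   # => "aaaa"
p matched_text(TWO_TO_FOUR, 'a')       # => nil
```

Note that `{,m}` allows zero repetitions, so it succeeds with an empty match even when the character never appears.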
Grouping
Grouping gives us the ability to combine several patterns into a single cohesive unit. This can be very useful when combined with repetitions. Earlier we looked at using repetitions with a single metacharacter, `\d`, but rarely will that be enough to satisfy our needs.
Let’s look at how we might create a more complicated regular expression that matches phone numbers in several different formats. We’ll use groups and repetitions to do this:
> phone_format_one = '5125551234'
=> "5125551234"
> phone_format_two = '512.555.1234'
=> "512.555.1234"
> phone_format_three = '512-555-1234'
=> "512-555-1234"
> regex = /(\d{3,4}[\.-]{0,1}){3}/
=> /(\d{3,4}[\.-]{0,1}){3}/
> regex.match(phone_format_one)
=> <MatchData "5125551234" 1:"234">
> regex.match(phone_format_two)
=> <MatchData "512.555.1234" 1:"1234">
> regex.match(phone_format_three)
=> <MatchData "512-555-1234" 1:"1234">
We have successfully created our regular expression, but there is a lot going on there. Let’s break it down. First we say that our pattern is made up of runs of three or four digits with `\d{3,4}`. Next we allow for an optional ‘-’ or ‘.’ separator with `[\.-]{0,1}`. Outside of a character class ‘.’ is a metacharacter that acts as a wildcard and would need to be escaped; inside the brackets it is already treated literally, so the backslash here is harmless but not required. Finally we say we need three of these by grouping the previous two patterns together and applying a repetition of three: `(\d{3,4}[\.-]{0,1}){3}`. Because parentheses also create a capture (covered below), the match data additionally shows the text matched by the last repetition of the group.
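One way to keep a pattern like this readable is to spread it over several lines with comments. The sketch below assumes the `x` (extended) modifier, which this post does not otherwise cover: it tells Ruby to ignore whitespace and `#` comments inside the pattern. It also uses `?` as shorthand for `{0,1}`. The constant and method names are hypothetical.

```ruby
# Hypothetical names, for illustration only.
PHONE_NUMBER = /
  (            # open the group
    \d{3,4}    # three or four digits
    [.-]?      # an optional separator; ? is shorthand for {0,1}
  ){3}         # close the group and repeat it three times
/x

# Return the matched phone number text, or nil when there is none.
def phone_match(text)
  match = PHONE_NUMBER.match(text)
  match && match[0]
end

p phone_match('5125551234')   # => "5125551234"
p phone_match('512.555.1234') # => "512.555.1234"
p phone_match('512-555-1234') # => "512-555-1234"
p phone_match('555-1234')     # => nil
```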
Lazy and Greedy
Repetitions in regular expressions are greedy by default, which means they’ll match as much text as possible. Often that isn’t the behavior we’re looking for. When creating our patterns it’s possible to tell Ruby we want a lazy match, one that stops at the first point where the rest of the pattern can be satisfied.
Let’s look at an example. Let’s say we wanted to parse out the timestamp of a log entry. We know our log is configured to output the date inside square brackets, so we’ll start out just trying to grab everything in between them. In this pattern we’ll use a new metacharacter. The `.` is a wildcard that matches any character except a newline:
> string_to_match = '[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo] received.'
> regex = /\[.+\]/
=> /\[.+\]/
> regex.match(string_to_match)
=> <MatchData "[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo]">
Instead of matching just the text in between the first two square brackets, it grabbed everything between the first instance of an opening square bracket and the last instance of a closing square bracket. We can fix this by telling the regular expression to be lazy using the `?` metacharacter. Let’s take another shot:
> string_to_match = '[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo] received.'
> regex = /\[.+?\]/
=> /\[.+?\]/
> regex.match(string_to_match)
=> <MatchData "[2014-05-09 10:10:14]">
Notice that we added our `?` after our repetition metacharacter. This tells the regular expression engine to consume as few characters as possible, stopping at the first place where the next part of the pattern matches rather than the last.
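A lazy repetition is not the only way to get this result. A commonly used alternative, sketched here with hypothetical constant and method names, is an inverted character class that refuses to cross a closing bracket in the first place.

```ruby
# Hypothetical names, for illustration only.
LOG_LINE = '[2014-05-09 10:10:14] An error occurred in your application. ' \
           'Invalid input [foo] received.'

LAZY_BRACKETS    = /\[.+?\]/    # as few characters as possible
NEGATED_BRACKETS = /\[[^\]]+\]/ # one or more characters that are not ']'

# Return the text of the first match, or nil when there is none.
def first_bracketed(regex, text)
  match = regex.match(text)
  match && match[0]
end

p first_bracketed(LAZY_BRACKETS, LOG_LINE)    # => "[2014-05-09 10:10:14]"
p first_bracketed(NEGATED_BRACKETS, LOG_LINE) # => "[2014-05-09 10:10:14]"
```

Both patterns give the same answer here; the inverted class simply states the intent ("anything but a closing bracket") more directly.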
Assertions
Assertions are parts of a regular expression that do not add any characters to a match. They just assert that certain patterns are present, or that a match occurs at a certain place within a string. There are two types of assertions. Let’s take a closer look.
Anchors
The simplest type of assertion is an anchor. Anchors are metacharacters that let us specify positions in our patterns. What makes these metacharacters different is that they match positions, not characters.
Let’s look at how we can determine if a line starts with Back to Basics using the `^` anchor, which denotes the beginning of a line:
> multi_line_string_to_match = <<-STRING
"> I hope Back to Basics is fun to read.
"> Back to Basics is fun to write.
"> STRING
=> "I hope Back to Basics is fun to read.\nBack to Basics is fun to write.\n"
> regex = /^Back to Basics/
=> /^Back to Basics/
> match = regex.match(multi_line_string_to_match)
=> <MatchData "Back to Basics">
> match.begin(0)
=> 38
Looking at where our match begins, we can see it’s the second instance of the string “Back to Basics” we’ve matched. Another thing to take note of is that the `^` anchor doesn’t only match the beginning of a string, but the beginning of any line within a string.
There are many anchors available. I encourage you to review the Regexp documentation and check out some of the others.
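As a taste of those other anchors, the sketch below compares the line anchors `^` and `$` with `\A` and `\z`, which match only at the very start and very end of the whole string. The constant and method names are hypothetical, and `match?` assumes Ruby 2.4 or later.

```ruby
# Hypothetical names, for illustration only.
TWO_LINES = "I hope Back to Basics is fun to read.\n" \
            "Back to Basics is fun to write.\n"

# Return the index where the match starts, or nil when there is none.
def match_offset(regex, text)
  match = regex.match(text)
  match && match.begin(0)
end

p match_offset(/^Back to Basics/, TWO_LINES)  # => 38  (start of the second line)
p match_offset(/\ABack to Basics/, TWO_LINES) # => nil (the string starts with "I hope")
p(/read\.$/.match?(TWO_LINES))                # => true  ($ matches at the end of a line)
p(/read\.\z/.match?(TWO_LINES))               # => false (\z matches only at the end of the string)
```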
Lookarounds
The second type of assertion is called a lookaround. Lookarounds allow us to provide a pattern that must be matched in order for a regular expression to be satisfied, but that will not be included in a successful match. These are called lookahead and lookbehind patterns.
Let’s say we had a comma delimited list of companies and the year they were founded. Let’s match the year that thoughtbot was founded. In this case we only want the year, we’re not interested in including the company in the match, but we’re only interested in thoughtbot, not the other two companies. To do this we’ll use a positive lookbehind. This means we’ll provide a pattern we expect to appear before the pattern we want to match.
> string_to_match = 'Dell: 1984, Apple: 1976, thoughtbot: 2003'
> regex = /(?<=.thoughtbot: )\d{4}/
=> /(?<=.thoughtbot: )\d{4}/
> regex.match(string_to_match)
=> <MatchData "2003">
Even though the lookbehind pattern asserting that thoughtbot precedes our match is part of our regular expression, the text it matches isn’t included in our match data. This is exactly the behavior we were looking for.
To specify a positive lookbehind we use `(?<=...)`. If we wanted to use a negative lookbehind, meaning the match we want isn’t preceded by some particular text, we would use `(?<!...)`.
To do a positive lookahead we use `(?=...)`. A negative lookahead is achieved using `(?!...)`.
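Here are all four lookarounds run against the same list of companies. This is a sketch; the constant and helper names are hypothetical.

```ruby
# Hypothetical names, for illustration only.
COMPANIES = 'Dell: 1984, Apple: 1976, thoughtbot: 2003'

# Return the text of the first match, or nil when there is none.
def first_hit(regex, text)
  match = regex.match(text)
  match && match[0]
end

# Positive lookbehind: four digits preceded by "thoughtbot: ".
p first_hit(/(?<=thoughtbot: )\d{4}/, COMPANIES) # => "2003"
# Negative lookbehind: four digits NOT preceded by "thoughtbot: ".
p first_hit(/(?<!thoughtbot: )\d{4}/, COMPANIES) # => "1984"
# Positive lookahead: a word followed by ": 2003".
p first_hit(/\w+(?=: 2003)/, COMPANIES)          # => "thoughtbot"
# Negative lookahead: four digits NOT followed by a comma.
p first_hit(/\d{4}(?!,)/, COMPANIES)             # => "2003"
```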
Captures
Another useful tool is called a capture. This gives us the ability to match on a whole pattern, but pull out only the parts of it that are of interest to us. We accomplish this by surrounding the part of the pattern we intend to capture with parentheses, which is also how we specify a group. Let’s look at how we might pull the quantity and price for an item off of an invoice:
> string_to_match = 'Mac Book Pro - Quantity: 1 Price: 2000.00'
> regex = /[\w\s]+ - Quantity: (\d+) Price: ([\d\.]+)/
=> /[\w\s]+ - Quantity: (\d+) Price: ([\d\.]+)/
> match = regex.match(string_to_match)
=> <MatchData "Mac Book Pro - Quantity: 1 Price: 2000.00" 1:"1" 2:"2000.00">
> match[0]
=> "Mac Book Pro - Quantity: 1 Price: 2000.00"
> match[1]
=> "1"
> match[2]
=> "2000.00"
Notice we can index into the match data like an array. The first element is the entire match and the next two are our captures, in the order their opening parentheses appear in the pattern.
We can make working with captures simpler by using what is called a named capture. Instead of indexing into the match data by position, we can provide a name for each capture and look the values up by those names, much like a hash, after the match has occurred. Let’s take a look:
> string_to_match = 'Mac Book Pro - Quantity: 1 Price: 2000.00'
> regex = /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/
=> /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/
> match = regex.match(string_to_match)
=> <MatchData "Mac Book Pro - Quantity: 1 Price: 2000.00" quantity:"1" price:"2000.00">
> match[:quantity]
=> "1"
> match[:price]
=> "2000.00"
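Named captures pair nicely with a small parsing method. The following is only a sketch: `parse_line_item` and `LINE_ITEM` are hypothetical names, and converting the price with `Float` is a simplification for illustration.

```ruby
# Hypothetical names, for illustration only.
LINE_ITEM = /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d.]+)/

# Turn an invoice line into a hash of typed values, or nil if it doesn't match.
def parse_line_item(line)
  match = LINE_ITEM.match(line)
  return nil unless match

  # Captures are always strings, so convert them explicitly.
  { quantity: Integer(match[:quantity]), price: Float(match[:price]) }
end

item = parse_line_item('Mac Book Pro - Quantity: 1 Price: 2000.00')
p item[:quantity]                       # => 1
p item[:price]                          # => 2000.0
p LINE_ITEM.names                       # => ["quantity", "price"]
p parse_line_item('not an invoice row') # => nil
```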
Strings
There are also some useful methods that take advantage of regular expressions in the `String` class. Let’s take a look at some of the things we can do.
sub and gsub
The `sub` and `gsub` methods both allow us to provide a pattern and a string to replace instances of that pattern with. The difference between the two methods is that `gsub` will replace all instances of the pattern, while `sub` will only replace the first instance.
The `gsub` method gets its name from global substitution: it behaves the way a substitution with the global `g` flag does in tools like sed and Perl. Ruby regular expressions have no `g` modifier of their own; we choose between `sub` and `gsub` instead.
Let’s take a look at some examples.
> string_to_match = "My home number is 5125551234, so please call me at 5125551234."
=> "My home number is 5125551234, so please call me at 5125551234."
> string_to_match.sub(/5125551234/, '(512) 555-1234')
=> "My home number is (512) 555-1234, so please call me at 5125551234."
When we use `sub` we can see we still have one instance of our phone number that isn’t formatted. Let’s use `gsub` to fix it.
> string_to_match.gsub(/5125551234/, '(512) 555-1234')
=> "My home number is (512) 555-1234, so please call me at (512) 555-1234."
As expected, `gsub` replaces both instances of our phone number.
While our previous example demonstrates the way the methods work, it isn’t a particularly useful regular expression. If we were trying to format all the phone numbers in a large document we obviously couldn’t make our pattern the number in each case, so let’s revisit our example and see if we can make it more useful.
> string_to_match = "My home number is 5125554321. My office number is 5125559876."
=> "My home number is 5125554321. My office number is 5125559876."
> string_to_match.gsub(/(?<area_code>\d{3})(?<exchange>\d{3})(?<subscriber>\d{4})/, '(\k<area_code>) \k<exchange>-\k<subscriber>')
=> "My home number is (512) 555-4321. My office number is (512) 555-9876."
Now our regular expression will format any phone number in our string. Notice that we take advantage of named captures in our regular expression and refer to them in our replacement string using `\k<name>`.
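`gsub` also accepts a block in place of the replacement string, which can be easier to read once the replacement needs any logic. A minimal sketch, with hypothetical constant and method names, that builds the same formatted numbers:

```ruby
# Hypothetical names, for illustration only.
TEN_DIGITS = /(?<area_code>\d{3})(?<exchange>\d{3})(?<subscriber>\d{4})/

# Reformat every ten-digit run in the text as (AAA) EEE-SSSS.
def format_phone_numbers(text)
  text.gsub(TEN_DIGITS) do
    # Inside the block, Regexp.last_match is the MatchData for the current hit.
    m = Regexp.last_match
    "(#{m[:area_code]}) #{m[:exchange]}-#{m[:subscriber]}"
  end
end

puts format_phone_numbers('My home number is 5125554321. My office number is 5125559876.')
# prints: My home number is (512) 555-4321. My office number is (512) 555-9876.
```

Note that this pattern, like the one above, will also match ten digits that sit inside a longer run of digits.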
scan
The `scan` method lets us pull all regular expression matches out of a string. Let’s look at some examples.
> string_to_scan = "I've worked in TX and CA so far in my career."
=> "I've worked in TX and CA so far in my career."
> string_to_scan.scan(/[A-Z]{2}/)
=> ["TX", "CA"]
Using a regular expression we pull out all the state codes in our string. One thing to keep in mind as you continue to learn is to pay close attention to the assorted metacharacters available and how their meanings change depending on context. Just in this introductory post we saw multiple meanings for both the `^` and `?` characters, and we didn’t even cover all of the possible meanings of those two. Sorting out when each metacharacter means what is one of the more difficult parts of mastering regular expressions.
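To make that concrete, here are `^` and `?` side by side in their different roles. This is a sketch with hypothetical names; the "optional" use of `?` is one of the meanings this post did not cover.

```ruby
# Hypothetical names, for illustration only.
SAMPLE = "i enjoy making stuff\nback to basics"

# Return the text of the first match, or nil when there is none.
def hit(regex, text)
  match = regex.match(text)
  match && match[0]
end

# ^ outside brackets: anchor to the beginning of a line.
p hit(/^back/, SAMPLE)              # => "back"
# ^ as the first character inside brackets: invert the character class.
p hit(/[^a-z]/, SAMPLE)             # => " "
# ? after a pattern: that pattern is optional (zero or one).
p hit(/colou?r/, 'color')           # => "color"
# ? after a repetition: make the repetition lazy.
p hit(/\[.+?\]/, '[a] [b]')         # => "[a]"
# ? right after an opening parenthesis: a lookaround (or a named capture).
p hit(/(?<=: )\d{4}/, 'Dell: 1984') # => "1984"
```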
Regular expressions are one of the most powerful tools we have at our disposal with Ruby. Keep them in mind as you code and you’ll be surprised how often they can provide a nice clean solution to an otherwise daunting task!