---
title: 'Back to Basics: Regular Expressions'
teaser: 'Regular expressions are a core skill for developers in any language. Revisit
  the basics with us to master this foundational technique.

  '
tags: back to basics,regex
author: Britt Ballard
published_on: 2014-05-12
---

Regular expressions have been around since the early days of computer
science. They gained widespread adoption with the introduction of Unix. A
regular expression is a notation for describing a set of character strings, and
it is used to identify patterns within strings. There are many useful
applications of this functionality, most notably validating strings, finding and
replacing text, and pulling information out of strings.

Regular expressions are just strings themselves. Each character in a regular
expression can either be part of a code that makes up a pattern to search for, or
it can represent a letter, character or word itself. Let's take a look at some
examples.

## Basics

First let's look at an example of a regular expression that is made up of only
actual characters and none of the special characters or patterns that generally
make up regular expressions.

To get started let's fire up `irb` and create our regular expression:

    > regex = /back to basics/
     => /back to basics/

Notice we create a regular expression by entering a pattern between two forward
slashes. The pattern we've used here will only match strings that contain the
string `back to basics`. Let's use the
[`match`](http://www.ruby-doc.org/core-2.1.1/Regexp.html#method-i-match) method,
which gives us information about the first match it finds, to look at some
examples of what matches and what doesn't:

    > regex.match('basics to back')
     => nil

We're getting close, but nothing in this string matches our regular expression,
so we get `nil`.

    > regex.match('i enjoyback to basics')
     => <MatchData "back to basics">

After an unsuccessful attempt we have a match. Notice that our regular
expression matched even though there are no spaces between the pattern and the
words before it.

## MatchData

The object returned from the
[`Regexp`](http://www.ruby-doc.org/core-2.1.1/Regexp.html) object's
[`match`](http://www.ruby-doc.org/core-2.1.1/Regexp.html#method-i-match) method
is of type [`MatchData`](http://www.ruby-doc.org/core-2.1.1/MatchData.html).
This object can tell us all sorts of things about a particular match. Let's take
a look at some of the information we can get about our match.

We can use the
[`begin`](http://www.ruby-doc.org/core-2.1.1/MatchData.html#method-i-begin)
method to find out the offset of the beginning of our match in the original
string:

```irb
> match = regex.match('i enjoyback to basics')
 => <MatchData "back to basics">

> match.begin(0)
 => 7

> 'i enjoyback to basics'[7]
 => "b"
```

The argument we pass to the method specifies a capture, a concept covered
below, within our match; `0` refers to the entire match. In the example above
`begin` tells us that our match starts at index 7 of our string. As the code
shows, the character at index 7 (the 8th character in the string) is 'b', the
first letter of our match.

Similarly we can get the index of the character following the end of our match
using the
[`end`](http://www.ruby-doc.org/core-2.1.1/MatchData.html#method-i-end) method:

```irb
> match.end(0)
 => 21

> 'i enjoyback to basics'[21]
 => nil
```

In this case we get `nil` since the end of our match is also the end of our string.

We can also use the
[`to_s`](http://www.ruby-doc.org/core-2.1.1/MatchData.html#method-i-to_s)
method to print our match:

    > match.to_s
     => "back to basics"
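
As a quick sanity check on `begin` and `end`, the sketch below slices the original string with the two offsets and recovers the match. The variable names are illustrative and not part of the session above. It also guards against the `nil` that `match` returns when nothing matches.

```ruby
regex = /back to basics/
text = 'i enjoyback to basics'

match = regex.match(text)

# match is nil when nothing matches, so guard before calling MatchData methods.
matched_text =
  if match
    # begin(0) is the first index of the match; end(0) is one past the last.
    text[match.begin(0)...match.end(0)]
  end

# No match here, so this is nil rather than a MatchData object.
no_match = regex.match('basics to back')
```

Using an exclusive range (`...`) lines up with `end` returning the index just after the match.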

## Patterns

The regular expression's real power becomes obvious when we introduce patterns.
Let's take a look at some examples.

## Metacharacters

A metacharacter is any character that has a special meaning within a regular
expression. Let's start with something simple. Say we want to find out if our
string contains a number. This requires our first pattern, `\d`, a
metacharacter that matches any digit:

```irb
> string_to_match = 'back 2 basics'

> regex = /\d/
 => /\d/

> regex.match(string_to_match)
 => <MatchData "2">
```

Our regular expression matches the number 2 in our string.

## Character Classes

Let's say we wanted to find out if any of the letters from 'k' to 's' were in our
string. This will require we use a character class. A character class lets us
specify a list of characters or patterns that we're looking for:

```irb
> string_to_match = 'i enjoy making stuff'

> regex = /[klmnopqrs]/
 => /[klmnopqrs]/

> regex.match(string_to_match)
 => <MatchData "n">
```

In this example we entered all the letters of the alphabet we were
interested in between the brackets, and the first instance of any of those
characters results in a match. We can simplify the above regular expression by
using a range. This is done by entering two characters or numbers separated by a
`-`:

```irb
> string_to_match = 'i enjoy making stuff'

> regex = /[k-s]/
 => /[k-s]/

> regex.match(string_to_match)
 => <MatchData "n">
```

As expected, we get the same results with our simplified regular expression.

It's also possible to invert a character class. This is done by adding a `^` to
the beginning of the pattern. If we wanted to look for the first letter not
in between 'k' and 's' we would use the pattern `/[^k-s]/`:

```irb
> string_to_match = 'i enjoy making stuff'

> regex = /[^k-s]/
 => /[^k-s]/

> regex.match(string_to_match)
 => <MatchData "i">
```

Since 'i' isn't in our range the first letter in our string meets the criteria
our regular expression specified.

Another thing worth noting is that the `\d` metacharacter we used above is
shorthand for the character class `[0-9]`.
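
To illustrate, here is a small sketch comparing `\d` with `[0-9]` on the earlier sample string. It also combines two ranges in a single class, something the examples above don't show. The second sample string and the variable names are made up for this illustration, and the comparison assumes plain ASCII input.

```ruby
text = 'back 2 basics'

# Both patterns find the same digit at the same position.
shorthand = /\d/.match(text)
explicit  = /[0-9]/.match(text)

# A class can hold several ranges; this one matches any digit or
# uppercase letter, whichever comes first in the string.
digit_or_upper = /[0-9A-Z]/.match('back to Basics 2')
```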

## Modifiers

We have the ability to set a regular expression's matching mode via
[`modifiers`](http://www.regular-expressions.info/modifiers.html). In Ruby this
is done by appending characters after the regular expression pattern is
defined. A particularly useful matching modifier is the case insensitive
modifier `i`. Let's take a look:

```irb
> string_to_match = 'BACK to BASICS'

> regex = /back to basics/i
 => /back to basics/i

> regex.match(string_to_match)
 => <MatchData "BACK to BASICS">
```

The regular expression matches our string even though the cases are clearly
not the same. We'll touch on another common modifier later in this post.

## Repetitions

Repetitions give us the ability to look for repeated patterns. We can search
broadly for patterns that repeat an indeterminate number of times, or we can be
as specific as the exact number of repetitions we're looking for.

Let's try to identify all the numbers in a string again:

```irb
> string_to_match = 'The Mavericks beat the Spurs by 21 in game two.'

> regex = /\d/
 => /\d/

> regex.match(string_to_match)
 => <MatchData "2">
```

Because we used only a single `\d` we only got the first digit, in this case
'2'. What we're actually looking for is the entire number, not just the first
digit. We can fix this by modifying our pattern. We need to specify a pattern
that will say find any group of contiguous digits. For this we can use the `+`
metacharacter. This tells the regular expression engine to find one
or more of the character or characters that match the previous pattern. Let's
take a look:

```irb
> string_to_match = 'The Mavericks beat the Spurs by 21 in game two.'

> regex = /\d+/
 => /\d+/

> regex.match(string_to_match)
 => <MatchData "21">
```

We could also look for an exact number of repetitions. Let's say we only wanted
to look for three-digit numbers. One way we could do that would be
using the `{n}` pattern, where `n` indicates the number of repetitions we're
looking for:

```irb
> string_to_match = 'In 30 years the San Francisco Giants have had two 100 win seasons.'

> regex = /\d{3}/
 => /\d{3}/

> regex.match(string_to_match)
 => <MatchData "100">
```

Our pattern doesn't match 30, but does match 100 because we told it only three
repeating digit characters constituted a match.
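
One caveat worth sketching: `{3}` only requires three digits in a row, so the pattern can also match inside a longer number. The example below uses the `\b` word-boundary anchor, which this post doesn't otherwise cover, as one possible way to insist on a standalone three-digit number. The sample strings are invented for the illustration.

```ruby
three_digits = /\d{3}/

# Matches the first three digits of a four-digit year.
inside_longer_number = three_digits.match('thoughtbot was founded in 2003')

# \b asserts a word boundary, so the three digits must stand alone.
standalone = /\b\d{3}\b/
no_match   = standalone.match('thoughtbot was founded in 2003')
hundred    = standalone.match('two 100 win seasons')
```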

Let's look for words that are at least five characters long. This will
require a new metacharacter, `\w`, which matches any word character. Then
we'll use the `{n,}` pattern, which says look for `n` or more of the previous
pattern:

```irb
> string_to_match = 'we are only looking for long words'

> regex = /\w{5,}/
 => /\w{5,}/

> regex.match(string_to_match)
 => <MatchData "looking">
```

You can also specify an upper bound with `{,m}` (at most `m` repetitions) and a
range with `{n,m}` (between `n` and `m` repetitions).
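
A short sketch of both forms follows. The sample string is made up. Note one easy-to-miss consequence of `{,m}`: because zero repetitions is allowed, the pattern can succeed with an empty match.

```ruby
text = 'a12345'

# {2,3}: at least two and at most three digits; greedy, so it takes three.
between = /\d{2,3}/.match(text)

# {,2}: zero to two digits. Zero repetitions is allowed, so the pattern
# succeeds with an empty match at index 0, before the letter 'a'.
at_most = /\d{,2}/.match(text)
```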

## Grouping

Grouping gives us the ability to combine several patterns into one single
cohesive unit. This can be very useful when combined with repetitions. Earlier
we looked at using repetitions with a single metacharacter `\d`, but rarely
will that be enough to satisfy our needs. Let's look at how we could define a
more complex pattern we expect to see repeated.

Let's look at how we might create a more complicated regular expression that
matches phone numbers in several different formats. We'll use groups and
repetitions to do this:

```irb
> phone_format_one = '5125551234'
 => "5125551234"

> phone_format_two = '512.555.1234'
 => "512.555.1234"

> phone_format_three = '512-555-1234'
 => "512-555-1234"

> regex = /(\d{3,4}[.-]{0,1}){3}/
 => /(\d{3,4}[.-]{0,1}){3}/

> regex.match(phone_format_one)
 => <MatchData "5125551234" 1:"234">

> regex.match(phone_format_two)
 => <MatchData "512.555.1234" 1:"1234">

> regex.match(phone_format_three)
 => <MatchData "512-555-1234" 1:"1234">
```

We have successfully created our regular expression, but there is a lot going
on there. Let's break it down. First we define that our pattern will be made up
of groups of three or four digits with `\d{3,4}`. Next we allow for a '-' or
'.' separator with the character class `[.-]`. Outside a character class the
'.' is a metacharacter that acts as a wildcard, but inside the brackets it is
treated as a literal character, so it doesn't need to be escaped. We don't want
to require a separator, so we make it optional with `{0,1}`. Finally we say we
need three of these by grouping the previous two patterns together and applying
a repetition of three: `(\d{3,4}[.-]{0,1}){3}`.
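
`{0,1}` has a common shorthand, `?`, meaning zero or one of the preceding pattern. The sketch below rewrites the phone pattern with it and checks that both versions match the same text for each format. The `formats` array and the variable names are only for illustration.

```ruby
original  = /(\d{3,4}[.-]{0,1}){3}/
shorthand = /(\d{3,4}[.-]?){3}/

formats = ['5125551234', '512.555.1234', '512-555-1234']

# Each entry is true when both patterns produce the same overall match.
same_results = formats.map do |number|
  original.match(number).to_s == shorthand.match(number).to_s
end

# A repeated group keeps only the text from its final repetition.
last_group = shorthand.match('512.555.1234')[1]
```

This is why the `MatchData` output above shows only the last chunk of digits next to `1:`.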

## Lazy and Greedy

Regular expressions are by default greedy, which means they'll find the
largest possible match. Often that isn't the behavior we're looking for. When
creating our patterns it's possible to tell Ruby we're looking for a lazy
match, which stops at the shortest text that satisfies our pattern.

Let's look at an example. Let's say we wanted to parse out the timestamp of a
log entry. We'll start out just trying to grab everything between the square
brackets that our log is configured to wrap the date in. In this pattern we'll
use a new metacharacter: the `.`, a wildcard that by default matches any
character except a newline:

```irb
> string_to_match = '[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo] received.'

> regex = /\[.+\]/
 => /\[.+\]/

> regex.match(string_to_match)
 => <MatchData "[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo]">
```

Instead of matching just the text in between the first two square brackets it
grabbed everything between the first instance of an opening square bracket and
the last instance of a closing square bracket. We can fix this by telling the
regular expression to be lazy using the `?` metacharacter. Let's take another
shot:

```irb
> string_to_match = '[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo] received.'

> regex = /\[.+?\]/
 => /\[.+?\]/

> regex.match(string_to_match)
 => <MatchData "[2014-05-09 10:10:14]">
```

Notice that we added our `?` after our repetition metacharacter. This tells
the regular expression engine to keep looking for the next part of the pattern
only until it finds a match; not until it finds the last match.
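
Another way to get the same result, sketched here as an alternative rather than a replacement for the lazy quantifier, is an inverted character class that refuses to cross a closing bracket. The variable names are illustrative.

```ruby
log_line = '[2014-05-09 10:10:14] An error occurred in your application. Invalid input [foo] received.'

greedy = /\[.+\]/.match(log_line)
lazy   = /\[.+?\]/.match(log_line)

# [^\]]+ matches one or more characters that are not a closing bracket,
# so the match cannot run past the first ']'.
negated = /\[[^\]]+\]/.match(log_line)
```

Both the lazy and the negated versions stop at the first closing bracket, while the greedy version runs on to the last one.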

## Assertions

Assertions are parts of regular expressions that do not add any characters to a
match. They just assert that certain patterns are present, or that a match
occurs at a certain place within a string. There are two types of assertions.
Let's take a closer look.

## Anchors

The simplest type of assertion is an anchor. Anchors are metacharacters that
let us specify positions in our patterns. The thing that makes these
metacharacters different is they don't match characters, only positions.

Let's look at how we can determine if a line starts with Back to Basics using
the `^` anchor, which denotes the beginning of a line:

```irb
> multi_line_string_to_match = <<-STRING
"> I hope Back to Basics is fun to read.
"> Back to Basics is fun to write.
"> STRING
 => "I hope Back to Basics is fun to read.\nBack to Basics is fun to write.\n"

> regex = /^Back to Basics/
 => /^Back to Basics/

> match = regex.match(multi_line_string_to_match)
 => <MatchData "Back to Basics">

> match.begin(0)
 => 38
```

Looking at where our match begins we can see it's the second instance of the
string "Back to Basics" we've matched. Another thing to take note of is the `^`
anchor doesn't only match the beginning of a string, but the beginning of a
line within a string.

There are many anchors available. I encourage you to review the
[Regex](http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Anchors)
documentation and check out some of the others.
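
As a small taste of those other anchors, the sketch below contrasts `^` with `\A`, which matches only at the start of the whole string, and shows `$`, which matches at the end of a line. The string is the same two-line text as above, written with explicit `\n` characters; the variable names are illustrative.

```ruby
text = "I hope Back to Basics is fun to read.\nBack to Basics is fun to write.\n"

# ^ matches at the start of any line, so the second line qualifies.
line_start = /^Back to Basics/.match(text)

# \A matches only at the very start of the string, so nothing matches here.
string_start = /\ABack to Basics/.match(text)

# $ matches at the end of a line; this finds "read." at the end of line one.
line_end = /read\.$/.match(text)
```

The distinction matters for validation: a pattern anchored with `^` and `$` can be satisfied by a single line of a multi-line input.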

## Lookarounds

The second type of assertion is called a lookaround. Lookarounds allow us to
provide a pattern that must be matched in order for a regular expression to be
satisfied, but that will not be included in a successful match. These are called
lookahead and lookbehind patterns.

Let's say we had a comma delimited list of companies and the year they were
founded. Let's match the year that thoughtbot was founded. In this case we only
want the year, we're not interested in including the company in the match, but
we're only interested in thoughtbot, not the other two companies.  To do this
we'll use a positive lookbehind. This means we'll provide a pattern we expect
to appear before the pattern we want to match.

```irb
> string_to_match = 'Dell: 1984, Apple: 1976, thoughtbot: 2003'

> regex = /(?<=.thoughtbot: )\d{4}/
 => /(?<=.thoughtbot: )\d{4}/

> regex.match(string_to_match)
 => <MatchData "2003">
```

Even though the pattern asserting that thoughtbot precedes our match appears in
our regular expression, it isn't included in our match data. This is exactly
the behavior we were looking for.

To specify a positive lookbehind we use `(?<=pattern)`. If we wanted to use a
negative lookbehind, meaning the match we want isn't preceded by some particular
text, we would use `(?<!pattern)`.

To do a positive lookahead we use `(?=pattern)`. A negative lookahead is achieved using `(?!pattern)`.
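
The other three lookarounds can be sketched against the same company list. The patterns below are illustrative examples, not the only way to write these matches.

```ruby
text = 'Dell: 1984, Apple: 1976, thoughtbot: 2003'

# Positive lookahead: a word followed by ": 2003". The suffix is asserted
# but not included in the match.
founded_2003 = /\w+(?=: 2003)/.match(text)

# Negative lookahead: four digits that are not followed by a comma.
last_year = /\d{4}(?!,)/.match(text)

# Negative lookbehind: four digits that are not preceded by "thoughtbot: ".
other_year = /(?<!thoughtbot: )\d{4}/.match(text)
```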

## Captures

Another useful tool is called a capture. This gives us the ability to match on
a pattern, but capture only the parts of the pattern that are of interest to us.
We accomplish this by surrounding the part of the pattern we intend to capture
with parentheses, which is also how we specify a group. Let's look at how we
might pull the quantity and price for an item off of an invoice:

```irb
> string_to_match = 'Mac Book Pro - Quantity: 1 Price: 2000.00'

> regex = /[\w\s]+ - Quantity: (\d+) Price: ([\d\.]+)/
 => /[\w\s]+ - Quantity: (\d+) Price: ([\d\.]+)/

> match = regex.match(string_to_match)
 => <MatchData "Mac Book Pro - Quantity: 1 Price: 2000.00" 1:"1" 2:"2000.00">

> match[0]
 => "Mac Book Pro - Quantity: 1 Price: 2000.00"

> match[1]
 => "1"

> match[2]
 => "2000.00"
```

Notice we can index into the match data like an array. The first element is the
entire match and the next two are our captures. We indicate we want something
to be captured by surrounding it in parentheses.

We can make working with captures simpler by using what is called a named
capture. Instead of indexing into the match data by position we can provide a
name for each capture and look the values up by name, much like a hash, after
the match has occurred. Let's take a look:

```irb
> string_to_match = 'Mac Book Pro - Quantity: 1 Price: 2000.00'

> regex = /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/
 => /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/

> match = regex.match(string_to_match)
 => <MatchData "Mac Book Pro - Quantity: 1 Price: 2000.00" quantity:"1" price:"2000.00">

> match[:quantity]
 => "1"

> match[:price]
 => "2000.00"
```
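
A brief sketch of a few related `MatchData` methods follows, along with a reminder that captured values are strings. The `line_total` calculation is a hypothetical use added for illustration.

```ruby
regex = /[\w\s]+ - Quantity: (?<quantity>\d+) Price: (?<price>[\d\.]+)/
match = regex.match('Mac Book Pro - Quantity: 1 Price: 2000.00')

# names lists the capture names; captures lists their values in order.
capture_names  = match.names
capture_values = match.captures

# Captures are always strings, so convert before doing arithmetic.
line_total = Integer(match[:quantity]) * Float(match[:price])
```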

## Strings

There are also some useful methods that take advantage of regular expressions
in the `String` class. Let's take a look at some of the things we can do.

## `sub` and `gsub`

The `sub` and `gsub` methods both allow us to provide a pattern and a string
to replace instances of that pattern with. The difference between the two
methods is that `gsub` will replace all instances of the pattern, while `sub`
will only replace the first instance.

The `gsub` method gets its name from global substitution. In many other regular
expression flavors a global match is requested with the modifier `g`. Ruby has
no such modifier and provides separate methods instead.

Let's take a look at some examples.

```irb
> string_to_match = "My home number is 5125551234, so please call me at 5125551234."
 => "My home number is 5125551234, so please call me at 5125551234."

> string_to_match.sub(/5125551234/, '(512) 555-1234')
 => "My home number is (512) 555-1234, so please call me at 5125551234."
```

When we use `sub` we can see we still have one instance of our phone
number that isn't formatted. Let's use `gsub` to fix it.

    > string_to_match.gsub(/5125551234/, '(512) 555-1234')
     => "My home number is (512) 555-1234, so please call me at (512) 555-1234."

As expected `gsub` replaces both instances of our phone number.

While our previous example demonstrates the way the methods work, it isn't a
particularly useful regular expression. If we were trying to format all the
phone numbers in a large document we obviously couldn't hard-code each number
as its own pattern, so let's revisit our example and see if we can make it
more useful.

```irb
> string_to_match = "My home number is 5125554321. My office number is 5125559876."
 => "My home number is 5125554321. My office number is 5125559876."

> string_to_match.gsub(/(?<area_code>\d{3})(?<exchange>\d{3})(?<subscriber>\d{4})/, '(\k<area_code>) \k<exchange>-\k<subscriber>')
 => "My home number is (512) 555-4321. My office number is (512) 555-9876."
```

Now our regular expression will format any phone number in our string. Notice
that we take advantage of named captures in our regular expression and use them
in our replacement by using `\k`.
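
Two other ways to write the same replacement are sketched below: numbered backreferences, and the block form of `gsub`. The variable names are illustrative, and the block version reads the current match through `Regexp.last_match`.

```ruby
text = 'My home number is 5125554321. My office number is 5125559876.'

# Numbered backreferences: \1, \2 and \3 refer to the groups in order.
numbered = text.gsub(/(\d{3})(\d{3})(\d{4})/, '(\1) \2-\3')

# Block form: the block's return value replaces each match, and
# Regexp.last_match exposes the MatchData for the current match.
phone = /(?<area_code>\d{3})(?<exchange>\d{3})(?<subscriber>\d{4})/
with_block = text.gsub(phone) do
  m = Regexp.last_match
  "(#{m[:area_code]}) #{m[:exchange]}-#{m[:subscriber]}"
end
```

The block form is handy when the replacement needs logic that a plain replacement string can't express.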

## `scan`

The `scan` method lets us pull all regular expression matches out of a string.
Let's look at some examples.

```irb
> string_to_scan = "I've worked in TX and CA so far in my career."
 => "I've worked in TX and CA so far in my career."

> string_to_scan.scan(/[A-Z]{2}/)
 => ["TX", "CA"]
```

Using a regular expression we pull out all the state codes in our string. One
thing to keep in mind as you continue to learn is to pay close attention to the
assorted metacharacters available and how their meanings change depending on
context. Just in this introductory post we saw multiple meanings for both the
`^` and `?` characters, and we didn't even cover all of the possible meanings of
those two. Sorting out when each metacharacter means what is one of the more
difficult parts of mastering regular expressions.

Regular expressions are one of the most powerful tools we have at our disposal
with Ruby. Keep them in mind as you code and you'll be surprised how often they
can provide a nice clean solution to an otherwise daunting task!

## What's next

If you found this useful, you might also enjoy:

* [Regular Expressions on Wikipedia](http://en.wikipedia.org/wiki/Regular_expression)
* [Ruby Regex Docs](http://www.ruby-doc.org/core-2.1.1/Regexp.html)
* [Rubular, a site that lets you test out your regular expressions](http://rubular.com/)
* [Exploring Ruby’s Regular Expression Algorithm](http://patshaughnessy.net/2012/4/3/exploring-rubys-regular-expression-algorithm)
* [Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)](http://swtch.com/~rsc/regexp/regexp1.html)