Given the following HTML document (as a string), we want to extract all of the opening tags.
<div>
<p class="content">
Consider the SUT safe.
<a href="https://www.google.com">
Google
</a>
is your friend. <br />
Another line here.
<hr class="divider" />
Final line
</p>
</div>
Our first attempt comes close:
html_string = <<-HTML
<div>
<p class="content">
Consider the SUT safe.
<a href="https://www.google.com">
Google
</a>
is your friend. <br />
Another line here.
<hr class="divider" />
Final line
</p>
</div>
HTML
pattern = /<([a-z]+) *[^\/]*?>/
html_string.scan(pattern)
#=> [["div"], ["p"]]
but we seem to be missing the a tag in the list. What can we do to fix this?
Oh no, we've entered dark territory here! While regexes are very powerful, there are times when the task at hand should not (or cannot!) be properly solved using regex. Check out the section in the Weekly Iteration on when not to use regex for a discussion on when you might want to reach for more than a single regex.
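To see why the a tag goes missing: the character class [^\/] refuses to cross any slash, and the href value contains several. The sketch below compares the original pattern with a hypothetical patched one (the name `patched` is ours, not part of the original card) that uses a negative lookbehind to reject only tags ending in "/>". It handles this sample, but the last line shows how easily it is fooled.

```ruby
html_string = <<~HTML
  <div>
  <p class="content">
  Consider the SUT safe.
  <a href="https://www.google.com">
  Google
  </a>
  is your friend. <br />
  Another line here.
  <hr class="divider" />
  Final line
  </p>
  </div>
HTML

# Original pattern: [^\/] cannot match the slashes inside the href URL,
# so the a tag never reaches its closing ">".
original = /<([a-z]+) *[^\/]*?>/

# Hypothetical patch: allow anything except ">" inside the tag, and use a
# lookbehind to skip tags whose ">" is immediately preceded by "/".
patched = /<([a-z]+)(?:\s[^>]*?)?(?<!\/)>/

html_string.scan(original) #=> [["div"], ["p"]]
html_string.scan(patched)  #=> [["div"], ["p"], ["a"]]

# The patch has no idea what a comment is, so it reports a tag that is not
# really part of the document.
"<!-- <b>old</b> --><i>new</i>".scan(patched) #=> [["b"], ["i"]]
```

Each new edge case (comments, uppercase tag names, a ">" inside an attribute value) would need yet another tweak to the pattern, which is why a single regex is a poor fit here.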
Instead, parsers are a better choice for complex tasks like parsing HTML. Check out our Weekly Iteration episode on parsing for an introduction to the topic, and be sure to check out the classic StackOverflow answer You Can't Parse HTML With Regex. (Seriously, check it out if you haven't. At least skim to the bottom. It's amazing.)
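As a rough illustration of what "more than a single regex" can look like, here is a minimal hand-rolled tokenizer built on Ruby's standard StringScanner. It is a sketch, not a real HTML parser: the helper names opening_tag_names and consume_attributes are hypothetical, and it only understands comments, quoted attribute values, closing tags, and self-closing tags. For real documents, a dedicated HTML parsing library (Nokogiri is a common choice in Ruby) is the safer route.

```ruby
require "strscan"

# Hypothetical helper: consumes the rest of a tag up to and including its
# closing ">", stepping over quoted attribute values as whole units.
# Returns true when the tag ended with "/>" (self-closing).
def consume_attributes(scanner)
  until scanner.eos?
    if scanner.skip(/"[^"]*"|'[^']*'/)
      # quoted value: slashes and ">" inside it are just text
    elsif scanner.skip(/\/\s*>/)
      return true
    elsif scanner.skip(/>/)
      return false
    else
      scanner.getch
    end
  end
  false
end

# Hypothetical helper: returns the names of opening (non-self-closing) tags
# in document order.
def opening_tag_names(html)
  scanner = StringScanner.new(html)
  names = []

  until scanner.eos?
    if scanner.skip(/<!--.*?-->/m)
      # comment: ignore everything inside it
    elsif scanner.skip(/<\//)
      scanner.skip_until(/>/) # closing tag: ignore
    elsif scanner.scan(/<([a-zA-Z][a-zA-Z0-9]*)/)
      name = scanner[1].downcase
      names << name unless consume_attributes(scanner)
    else
      scanner.getch # plain text: move on one character
    end
  end

  names
end

html_string = <<~HTML
  <div>
  <p class="content">
  Consider the SUT safe.
  <a href="https://www.google.com">
  Google
  </a>
  is your friend. <br />
  Another line here.
  <hr class="divider" />
  Final line
  </p>
  </div>
HTML

opening_tag_names(html_string)                    #=> ["div", "p", "a"]
opening_tag_names("<!-- <b>old</b> --><i>new</i>") #=> ["i"]
```

Because the scanner tracks which context it is in (comment, quoted value, tag, or text), slashes in a URL no longer matter. That context tracking is the essential thing a parser adds over a lone pattern.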