---
title: Named Entity Recognition
teaser: Extracting meaning from text with machine learning.
tags: machine learning,natural language processing
author: George Brocklehurst
published_on: 2017-09-28
---

## Downloading recipes

I wanted to experiment with a reasonably large recipe dataset, to play around
with some meal planning ideas. The trouble was, I didn't have a dataset.

No problem, I thought, there are _loads_ of recipes on the Web---I'll use some
of those!

Thanks to embedded data formats like the [h-recipe microformat][h-recipe], and
the [recipe schema from Schema.org][recipe-schema], many of the recipes
published on the Web are marked up semantically. Even better, there's a [Ruby
gem called `hangry`][hangry] to parse these formats. In no time, I was turning
recipes into structured data.

The thing I was most interested in was ingredients, and here I hit my next
problem: I had human readable lists of ingredients, but nothing sufficiently
structured to compare quantities, find similarities, or convert units.

## Ingredients are hard

The first few examples I looked at seemed pretty simple:

```ruby
[
  "2 tablespoons butter",
  "2 tablespoons flour",
  "1/2 cup white wine",
  "1 cup chicken broth",
]
```

It seemed like a clear pattern was emerging, and maybe one line of Ruby
code would suffice:

```ruby
quantity, unit, name = description.split(" ", 3)
```

Unfortunately, the reality was much more complex. I found more and more examples
that didn't fit this simple pattern. Some ingredients had multiple quantities
that needed to be combined ("3 cups and 2 tablespoons", or "2 10 ounce
packages"); others had alternative quantities in metric and imperial, or in cups
and ounces; still others followed the ingredient name with preparation
instructions, or listed multiple ingredients together in the same item.
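
To make the failure concrete, here's a rough sketch of how that one-line split behaves. The `naive_parse` helper name is made up for illustration, and the second and third inputs are assembled from the patterns described above rather than copied from a particular recipe.

```ruby
# Hypothetical helper wrapping the one-line split shown earlier.
def naive_parse(description)
  quantity, unit, name = description.split(" ", 3)
  { quantity: quantity, unit: unit, name: name }
end

examples = [
  "2 tablespoons butter",                      # fits the pattern
  "3 cups and 2 tablespoons flour",            # second quantity leaks into the name
  "confectioners' sugar for dusting the cake", # no quantity or unit at all
]

examples.each { |description| p naive_parse(description) }
```

Only the first example comes out right. In the second, the name becomes `"and 2 tablespoons flour"`, and in the third the words of the name are spread across the quantity and unit slots.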

The special cases piled higher and higher, and my simple Ruby code got more and
more tangled. I stopped feeling good about the code, then I stopped feeling like
it would be OK after refactoring, and eventually I threw it away.

I needed a whole new plan.

## Named entity recognition

This seemed like the perfect problem for supervised machine learning---I had
lots of data I wanted to categorise; manually categorising a single example was
pretty easy; but manually identifying a general pattern was at best hard, and at
worst impossible.

After considering my options, a [named entity recogniser][wiki-ner] looked like
the right tool to use. Named entity recognisers identify pre-defined categories
in text; in my case I wanted one to recognise names, quantities, and units of
ingredients.

I opted for the [Stanford <abbr title="named entity
recogniser">NER</abbr>][stanford-ner], which uses a conditional random field
sequence model. To be perfectly honest, I don't understand the maths behind this
particular type of model, but you can read the paper<sup><a href="#fn1" id="r1"
title="Footnote 1">1</a></sup> if you want all the gory details. The important
thing for me was that I could train this NER model on my own dataset.

The process I followed to train my model was based on the [Stanford NER FAQ's
Jane Austen example][ner-faq].

## Training the model

The first thing I did was gather my example data. Within a single recipe, the
way the ingredients are written is quite uniform. I wanted to make sure I had a
good range of formats, so I combined the ingredients from around 30,000 online
recipes into a single list, shuffled it, and picked the first 1,500 to be my
training set.

It looked like this:

    confectioners' sugar for dusting the cake
    1 1/2 cups diced smoked salmon
    1/2 cup whole almonds (3 oz), toasted
    ...
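
One way to draw a sample like this, sketched in Ruby. The `sample_split` helper, the seed, and the de-duplication step are assumptions for illustration, not necessarily what was used for this post. The same shuffle can also yield a second, non-overlapping set to hold back for later.

```ruby
# Shuffle once with a seeded generator, then carve off two non-overlapping
# samples. The seed only makes the sample reproducible.
# Assumption: exact duplicates are dropped first so the same line
# can't land in both samples.
def sample_split(descriptions, train_size:, test_size:, seed: 42)
  shuffled = descriptions.uniq.shuffle(random: Random.new(seed))
  train = shuffled.first(train_size)
  test = shuffled.drop(train_size).first(test_size)
  [train, test]
end

# Stand-in data; the real list would come from the parsed recipes.
descriptions = (1..100).map { |i| "#{i} cups ingredient #{i}" }
train, test = sample_split(descriptions, train_size: 15, test_size: 5)

puts train.first(3)
puts "overlap: #{(train & test).size}"
```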

Next, I used part of Stanford's suite of <abbr title="natural language
processing">NLP</abbr> tools to split these into tokens.

The following command will read text from standard input, and output tokens to
standard output:

```shell
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer
```

In this case, I wanted to build a model that will understand a single ingredient
description, not a whole set of ingredient descriptions. In NLP parlance, that
means each ingredient description should be considered a separate document. To
represent that to the Stanford NER tools, we need to separate each set of tokens
with a blank line.

I broke them up using a little shell script:

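```shell
# Tokenise one ingredient description at a time, with a blank line
# after each one to mark the end of the document.
while IFS= read -r line; do
  echo "$line" | java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >> train.tok
  echo >> train.tok
done < train.txt
```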

The output looked like this:

<pre>
<samp>confectioners
'
sugar
for
dusting
the
cake

1 1/2
cups
diced
smoked
salmon

1/2
cup
whole
almonds
-LRB-
3
oz
-RRB-
,
toasted

...</samp>
</pre>

The last manual step was to tag the tokens, indicating which were part of the
name of an ingredient, which were part of the quantity, and which were part of
the unit. The 1,500 examples came to around 10,000 tokens, each labelled by
hand. Never let anyone tell you machine learning is all glamour.

Every token needs a label, even tokens that aren't interesting, which are
labelled with `O`. Stanford NER expects each token and its label to be
separated by a tab character. To get started, I labelled every token with `O`:

```shell
perl -ne 'chomp; $_ =~ /^$/ ? print "\n" : print "$_\tO\n"' \
  train.tok > train.tsv
```

Several hours in vim later, the results looked something like this:

<pre>
<samp>confectioners  NAME
'              NAME
sugar          NAME
for            O
dusting        O
the            O
cake           O

1 1/2          QUANTITY
cups           UNIT
diced          O
smoked         NAME
salmon         NAME

1/2            QUANTITY
cup            UNIT
whole          O
almonds        NAME
-LRB-          O
3              QUANTITY
oz             UNIT
-RRB-          O
,              O
toasted        O

...</samp>
</pre>

Now the training set was finished, I could build the model:

```shell
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -trainFile train.tsv \
  -serializeTo ner-model.ser.gz \
  -prop train.prop
```

The `train.prop` file I used was very similar to the Stanford NER FAQ's example
file, [`austen.prop`][austen-prop].

And there I had it! A model that could classify new examples.

## Testing the model

One of the downsides of machine learning is that it's somewhat opaque. I knew I
had trained a model, but I didn't know how accurate it was going to be.
Fortunately, Stanford provide testing tools to let you know how well your model
can generalise to new examples.

I took another 500 or so examples at random from my dataset, and went through
the same glamorous process of hand-labelling the tokens. Now I had a test set I
could use to validate my model. The measures of accuracy are based on how the
token labels produced by the model differ from the token labels I wrote by
hand.
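
As a rough illustration of that comparison, here's a token-level sketch that counts agreements and disagreements for one label. The `confusion_counts` helper and the sample pairs are made up, and the Stanford tools may count differently, for example by treating a run of adjacent tokens with the same label as a single entity.

```ruby
# Count true positives, false positives, and false negatives for one label,
# given [hand_label, predicted_label] pairs for every token.
def confusion_counts(pairs, label)
  counts = { tp: 0, fp: 0, fn: 0 }
  pairs.each do |gold, predicted|
    if predicted == label && gold == label
      counts[:tp] += 1 # model and hand label agree
    elsif predicted == label
      counts[:fp] += 1 # model said `label`, hand label disagrees
    elsif gold == label
      counts[:fn] += 1 # model missed a hand-labelled token
    end
  end
  counts
end

# Made-up pairs for illustration, not results from the real test set.
pairs = [%w[NAME NAME], %w[O NAME], %w[NAME O], %w[UNIT UNIT], %w[O O]]
p confusion_counts(pairs, "NAME")
```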

I tested the model using this command:

```shell
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -testFile test.tsv
```

This test command outputs the test data with the label I'd given each token
and the label the model predicted for each token, followed by a summary of the
accuracy:

<pre><samp>
CRFClassifier tagged 4539 words in 514 documents at 3953.83 words per second.
         Entity P       R       F1      TP      FP      FN
           NAME 0.8327  0.7764  0.8036  448     90      129
       QUANTITY 0.9678  0.9821  0.9749  602     20      11
           UNIT 0.9501  0.9630  0.9565  495     26      19
         Totals 0.9191  0.9067  0.9129  1545    136     159
</samp></pre>

The column headings are a little opaque, but they're standard machine
learning metrics that make good sense with a little explanation.

- `P` is **precision**: this is the number of tokens of a given type that the
  model identified correctly, out of the total number of tokens the model
  predicted were that type. 83% of the tokens the model identified as `NAME`
  tokens really were `NAME` tokens, 97% of the tokens the model identified as
  `QUANTITY` tokens really were `QUANTITY` tokens, etc.

- `R` is **recall**: this is the number of tokens of a given type that the model
  identified correctly, out of the total number of tokens of that type in the
  test set. The model found 78% of the `NAME` tokens, 98% of the `QUANTITY`
  tokens, etc.

- `F1` is the **F<sub>1</sub> score**, the harmonic mean of precision and
  recall. It's possible for a model to be very inaccurate but still score highly
  on precision or on recall: imagine a model that labelled every token as
  `NAME`. It would get a perfect recall score for `NAME`, but terrible
  precision. By combining the two as an F<sub>1</sub> score we get a single
  number that's more representative of overall quality.

- `TP`, `FP`, and `FN` are **true positives**, **false positives**, and **false
  negatives** respectively.
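
The three scores follow directly from those counts. Here's a small sketch using the standard definitions. The `scores` helper is a made-up name, and the numbers passed in are the `NAME` row from the summary above.

```ruby
# Precision, recall, and F1 from true positive, false positive,
# and false negative counts.
def scores(tp:, fp:, fn:)
  precision = tp.fdiv(tp + fp) # correct predictions / all predictions
  recall = tp.fdiv(tp + fn)    # correct predictions / all real examples
  f1 = 2 * precision * recall / (precision + recall) # harmonic mean
  { precision: precision.round(4), recall: recall.round(4), f1: f1.round(4) }
end

# Counts taken from the NAME row of the summary.
p scores(tp: 448, fp: 90, fn: 129)
```

Rounded to four places, this reproduces the 0.8327, 0.7764, and 0.8036 reported for `NAME`.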

## Using the model

Now I had a model and confidence that it was reasonably accurate, I could use
it to classify new examples that weren't in the training or test sets.

Here's the command to run the model:

```shell
$ echo "1/2 cup of flour" | \
  java -cp stanford-ner/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -readStdin
Invoked on Wed Sep 27 08:18:42 EDT 2017 with arguments: -loadClassifier
ner-model.ser.gz -readStdin
loadClassifier=ner-model.ser.gz
readStdin=true
Loading classifier from ner-model.ser.gz ... done [0.3 sec].
1/2/QUANTITY cup/UNIT of/O flour/NAME
CRFClassifier tagged 4 words in 1 documents at 18.87 words per second.
```

The output looks quite noisy, but most of it goes to STDERR, so we can throw it
away if we choose to:

```shell
$ echo "1/2 cup of flour" | \
  java -cp stanford-ner/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -readStdin 2>/dev/null
1/2/QUANTITY cup/UNIT of/O flour/NAME
```
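
That `token/LABEL` line is easy to pull back into Ruby. Here's a minimal sketch, assuming the output format shown above. `parse_tagged` and `group_by_label` are made-up helper names, and joining every token that shares a label is a simplification that would merge two separate quantities in the same description.

```ruby
# Turn one line of "token/LABEL" output into [token, label] pairs.
# rpartition splits on the last slash, so "1/2/QUANTITY" keeps "1/2" whole.
def parse_tagged(line)
  line.split.map do |tagged|
    token, _slash, label = tagged.rpartition("/")
    [token, label]
  end
end

# Group tokens by label, dropping the uninteresting "O" tokens.
def group_by_label(pairs)
  pairs.reject { |_token, label| label == "O" }
       .group_by { |_token, label| label }
       .transform_values { |group| group.map(&:first).join(" ") }
end

pairs = parse_tagged("1/2/QUANTITY cup/UNIT of/O flour/NAME")
p group_by_label(pairs)
```

For this input the result maps `QUANTITY` to `"1/2"`, `UNIT` to `"cup"`, and `NAME` to `"flour"`.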

## Iterating on the model

Even with these seemingly high F<sub>1</sub> scores, the model was only as good
as its training set. When I went back and ran my full corpus of ingredient
descriptions through the model I quickly discovered some flaws.

The most obvious problem was that the model couldn't recognise fluid ounces as a
unit of measurement. When I looked back at the training set and the test set,
there wasn't a single example of `fluid ounces`, `fl ounces`, or `fl oz`.

My random sample hadn't been large enough to truly represent the data.

I selected additional training and testing examples, taking care to include
various representations of fluid ounces in both sets. The updated model got
similar scores on the updated test set, and it no longer had trouble with fluid
ounces.
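
A gap like this can be spotted before training with a simple coverage check over the raw descriptions. The sketch below is illustrative only: the `missing_units` helper and the sample data are made up, and the list of spellings would need to be extended by hand.

```ruby
# Spellings expected to turn up as units; the list is illustrative.
EXPECTED_UNITS = ["fluid ounces", "fl ounces", "fl oz"].freeze

# Return the expected spellings that never occur in the descriptions.
def missing_units(descriptions, expected = EXPECTED_UNITS)
  lines = descriptions.map(&:downcase)
  expected.reject { |unit| lines.any? { |line| line.include?(unit) } }
end

# Stand-in for the raw training descriptions.
training = ["1/2 cup white wine", "8 fl oz double cream"]
p missing_units(training)
```

Here only `fl oz` is covered, so the other two spellings are reported as missing.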

## The moral of the story

It's an exciting time for machine learning. Like Web development a decade ago,
the tools are becoming increasingly accessible, to the point where developers
can focus less on the mechanism and more on the problem they're solving.

It's not a silver bullet---no technology solves every problem---but I'm excited
to have these tools at our disposal, when the right kind of problems come along.

If you want to try this for yourself, I packaged up the commands I used into a
Makefile to avoid typing a lot of long-winded commands. You can find that on
GitHub: <https://github.com/georgebrock/ner-tools>

Named entity recognisers aren't the only form of machine learning. If you want
to learn about other models, get comfortable with ideas like precision, recall,
and F<sub>1</sub> scores, and much more, I'd recommend [Andrew Ng's machine
learning course on Coursera][coursera].

---

<a href="#r1" id="fn1">[1]</a> [Jenny Rose Finkel, Trond Grenager, and
Christopher Manning. 2005. Incorporating Non-local Information into Information
Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of
the Association for Computational Linguistics (ACL 2005), pp. 363-370.][paper]

[h-recipe]: http://microformats.org/wiki/h-recipe
[recipe-schema]: http://schema.org/Recipe
[hangry]: https://github.com/iancanderson/hangry
[stanford-ner]: https://nlp.stanford.edu/software/CRF-NER.html
[ner-faq]: https://nlp.stanford.edu/software/crf-faq.shtml#a
[wiki-ner]: https://en.wikipedia.org/wiki/Named-entity_recognition
[paper]: http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
[austen-prop]: https://nlp.stanford.edu/software/ner-example/austen.prop
[coursera]: https://www.coursera.org/learn/machine-learning