I wanted to experiment with a reasonably large recipe dataset, to play around with some meal planning ideas. The trouble was, I didn’t have a dataset.
No problem, I thought, there are loads of recipes on the Web—I’ll use some of those!
Thanks to embedded data formats like the h-recipe microformat and the recipe schema from Schema.org, many of the recipes published on the Web are marked up semantically. Even better, there's a Ruby gem called hangry to parse these formats. In no time, I was turning recipes into structured data.
The thing I was most interested in was ingredients, and here I hit my next problem: I had human readable lists of ingredients, but nothing sufficiently structured to compare quantities, find similarities, or convert units.
The first few examples I looked at seemed pretty simple:
```ruby
[
  "2 tablespoons butter",
  "2 tablespoons flour",
  "1/2 cup white wine",
  "1 cup chicken broth",
]
```
It seemed like a clear pattern was emerging, and maybe one line of Ruby code would suffice:
```ruby
quantity, unit, name = description.split(" ", 3)
```
Unfortunately, the reality was much more complex. I found more and more examples that didn’t fit this simple pattern. Some ingredients had multiple quantities that needed to be combined (“3 cups and 2 tablespoons”, or “2 10 ounce packages”); others had alternative quantities in metric and imperial, or in cups and ounces; still others followed the ingredient name with preparation instructions, or listed multiple ingredients together in the same item.
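To make the failure concrete, here's the one-line approach run against one of those trickier formats. The simple case works, but the "2 10 ounce packages" shape from above does not (the full spinach description here is my own made-up example):

```ruby
# The one-liner handles the simple cases:
quantity, unit, name = "2 tablespoons butter".split(" ", 3)
# quantity is "2", unit is "tablespoons", name is "butter"

# ...but misreads anything with a compound quantity:
quantity, unit, name = "2 10 ounce packages frozen spinach".split(" ", 3)
# quantity is "2", and unit is "10" -- which is not a unit at all
```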
The special cases piled higher and higher, and my simple Ruby code got more and more tangled. I stopped feeling good about the code, then I stopped feeling like it would be OK after refactoring, and eventually I threw it away.
I needed a whole new plan.
This seemed like the perfect problem for supervised machine learning—I had lots of data I wanted to categorise; manually categorising a single example was pretty easy; but manually identifying a general pattern was at best hard, and at worst impossible.
After considering my options, a named entity recogniser looked like the right tool to use. Named entity recognisers identify pre-defined categories in text; in my case I wanted one to recognise the names, quantities, and units of ingredients.
I opted for the Stanford NER, which uses a conditional random field sequence model. To be perfectly honest, I don't understand the maths behind this particular type of model, but you can read the paper [1] if you want all the gory details. The important thing for me was that I could train this NER model on my own dataset.
The process I followed to train my model was based on the Stanford NER FAQ’s Jane Austen example.
The first thing I did was gather my example data. Within a single recipe, the way the ingredients are written is quite uniform. I wanted to make sure I had a good range of formats, so I combined the ingredients from around 30,000 online recipes into a single list, sorted them randomly, and picked the first 1,500 to be my training set.
It looked like this:
```
confectioners' sugar for dusting the cake
1 1/2 cups diced smoked salmon
1/2 cup whole almonds (3 oz), toasted
...
```
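The sampling step can be sketched in a few lines of Ruby. The method name, the fixed seed, and the file names in the usage comment are my own additions, not from the original scripts:

```ruby
# Sketch: pick a random, reproducible sample of ingredient
# descriptions to hand-label as a training set.
def sample_training_set(all_ingredients, size: 1500, seed: 42)
  # Fixing the seed makes the "random" sample reproducible between runs.
  all_ingredients.shuffle(random: Random.new(seed)).take(size)
end

# Usage, assuming ingredients.txt holds one description per line:
#   ingredients = File.readlines("ingredients.txt", chomp: true)
#   File.write("train.txt", sample_training_set(ingredients).join("\n"))
```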
Next, I used part of Stanford’s suite of NLP tools to split these into tokens.
The following command will read text from standard input, and output tokens to standard output:
```
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer
```
In this case, I wanted to build a model that will understand a single ingredient description, not a whole set of ingredient descriptions. In NLP parlance, that means each ingredient description should be considered a separate document. To represent that to the Stanford NER tools, we need to separate each set of tokens with a blank line.
I broke them up using a little shell script:
```
while read line; do
  echo $line | java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >> train.tok
  echo >> train.tok
done < train.txt
```
The output looked like this:
```
confectioners
'
sugar
for
dusting
the
cake

1 1/2
cups
diced
smoked
salmon

1/2
cup
whole
almonds
-LRB-
3
oz
-RRB-
,
toasted

...
```
The last manual step was to tag the tokens, indicating which were part of the name of an ingredient, which were part of the quantity, and which were part of the unit. 1,500 examples came to around 10,000 tokens, each labelled by hand: never let anyone tell you machine learning is all glamour.
Every token needs a label, even the tokens that aren't interesting, which get the catch-all label O. Stanford NER expects each token and its label to be separated by a tab character. To get started, I labelled every token O with a one-line Perl script:

```
perl -ne 'chomp; $_ =~ /^$/ ? print "\n" : print "$_\tO\n"' \
  train.tok > train.tsv
```
Several hours in vim later, the results looked something like this:
```
confectioners	NAME
'	NAME
sugar	NAME
for	O
dusting	O
the	O
cake	O

1 1/2	QUANTITY
cups	UNIT
diced	O
smoked	NAME
salmon	NAME

1/2	QUANTITY
cup	UNIT
whole	O
almonds	NAME
-LRB-	O
3	QUANTITY
oz	UNIT
-RRB-	O
,	O
toasted	O

...
```
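Hand-labelling thousands of lines in vim invites typos, so it's worth sanity-checking the file before training. This little validator is my own addition; the label set is the one used above:

```ruby
# Sketch: flag lines of a hand-labelled TSV file that aren't
# "token<TAB>LABEL" with one of the expected labels.
LABELS = %w[NAME QUANTITY UNIT O].freeze

def invalid_lines(tsv)
  tsv.each_line.with_index(1).reject do |line, _number|
    line = line.chomp
    next true if line.empty? # blank lines separate documents
    token, label, extra = line.split("\t")
    !token.nil? && LABELS.include?(label) && extra.nil?
  end.map(&:last)
end

# Usage:
#   invalid_lines(File.read("train.tsv"))
#   # => the line numbers that need fixing
```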
Now that the training set was finished, I could build the model:
```
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -trainFile train.tsv \
  -serializeTo ner-model.ser.gz \
  -prop train.prop
```
The train.prop file I used was very similar to the Stanford NER FAQ's example.
And there I had it! A model that could classify new examples.
One of the downsides of machine learning is that it’s somewhat opaque. I knew I had trained a model, but I didn’t know how accurate it was going to be. Fortunately, Stanford provide testing tools to let you know how well your model can generalise to new examples.
I took around another 500 examples at random from my dataset and went through the same glamorous process of hand-labelling the tokens. Now I had a test set I could use to validate my model. The measures of accuracy are based on how the token labels produced by the model differ from the token labels I wrote by hand.
I tested the model using this command:
```
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -testFile test.tsv
```
This test command outputs the test data with the label I’d given each token and the label the model predicted for each token, followed by a summary of the accuracy:
```
CRFClassifier tagged 4539 words in 514 documents at 3953.83 words per second.
         Entity P       R       F1      TP      FP      FN
           NAME 0.8327  0.7764  0.8036  448     90      129
       QUANTITY 0.9678  0.9821  0.9749  602     20      11
           UNIT 0.9501  0.9630  0.9565  495     26      19
         Totals 0.9191  0.9067  0.9129  1545    136     159
```
The column headings are a little opaque, but they’re standard machine learning metrics that make good sense with a little explanation.
P is precision: the number of tokens of a given type that the model identified correctly, out of the total number of tokens the model predicted were that type. 83% of the tokens the model identified as NAME tokens really were NAME tokens, 97% of the tokens it identified as QUANTITY tokens really were QUANTITY tokens, and so on.

R is recall: the number of tokens of a given type that the model identified correctly, out of the total number of tokens of that type in the test set. The model found 78% of the NAME tokens, 98% of the QUANTITY tokens, and so on.

F1 combines precision and recall. It's possible for a model to be very inaccurate but still score highly on one of these measures: imagine a model that labelled every token as a NAME; it would get a perfect recall score for NAME tokens, despite terrible precision. Combining the two as an F1 score gives a single number that's more representative of overall quality.

TP, FP, and FN are true positives, false positives, and false negatives respectively.
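As a sanity check, the scores in the table follow directly from the TP/FP/FN counts. A quick sketch of the arithmetic, using the NAME row:

```ruby
# Precision, recall, and F1 from true positive, false positive,
# and false negative counts.
def precision(tp, fp)
  tp.fdiv(tp + fp)
end

def recall(tp, fn)
  tp.fdiv(tp + fn)
end

def f1(precision, recall)
  # F1 is the harmonic mean of precision and recall.
  2 * precision * recall / (precision + recall)
end

# The NAME row from the results above: TP = 448, FP = 90, FN = 129.
p = precision(448, 90) # ~0.8327
r = recall(448, 129)   # ~0.7764
f1(p, r)               # ~0.8036
```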
Now I had a model and confidence that it was reasonably accurate, I could use it to classify new examples that weren’t in the training or test sets.
Here’s the command to run the model:
```
$ echo "1/2 cup of flour" | \
  java -cp stanford-ner/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -readStdin
Invoked on Wed Sep 27 08:18:42 EDT 2017 with arguments: -loadClassifier ner-model.ser.gz -readStdin
loadClassifier=ner-model.ser.gz
readStdin=true
Loading classifier from ner-model.ser.gz ... done [0.3 sec].
1/2/QUANTITY cup/UNIT of/O flour/NAME
CRFClassifier tagged 4 words in 1 documents at 18.87 words per second.
```
The output looks quite noisy, but most of it goes to STDERR, so we can throw it away if we choose to:
```
$ echo "1/2 cup of flour" | \
  java -cp stanford-ner/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -readStdin 2>/dev/null
1/2/QUANTITY cup/UNIT of/O flour/NAME
```
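Back in Ruby, the slash-separated output is easy to turn into the structured data I was after. A minimal sketch (the method name is my own; note that rpartition is needed because tokens like 1/2 contain slashes themselves):

```ruby
# Sketch: convert "token/LABEL token/LABEL ..." classifier output
# into a hash of label => text.
def parse_ner_output(line)
  groups = Hash.new { |hash, key| hash[key] = [] }
  line.split.each do |pair|
    # rpartition splits on the *last* "/", so "1/2/QUANTITY" becomes
    # the token "1/2" with the label "QUANTITY".
    token, _, label = pair.rpartition("/")
    groups[label] << token unless label == "O"
  end
  groups.transform_values { |tokens| tokens.join(" ") }
end

parse_ner_output("1/2/QUANTITY cup/UNIT of/O flour/NAME")
# => {"QUANTITY"=>"1/2", "UNIT"=>"cup", "NAME"=>"flour"}
```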
Even with these seemingly high F1 scores, the model was only as good as its training set. When I went back and ran my full corpus of ingredient descriptions through the model I quickly discovered some flaws.
The most obvious problem was that the model couldn't recognise fluid ounces as a unit of measurement. When I looked back at the training set and the test set, there wasn't a single example of fluid ounces in any of their written forms. My random sample hadn't been large enough to truly represent the data.
I selected additional training and testing examples, taking care to include various representations of fluid ounces in both sets. The updated model got similar scores on the updated test set, and it no longer had trouble with fluid ounces.
It’s an exciting time for machine learning. Like Web development a decade ago, the tools are becoming increasingly accessible, to the point where developers can focus less on the mechanism and more on the problem we’re solving.
It’s not a silver bullet—no technology solves every problem—but I’m excited to have these tools at our disposal, when the right kind of problems come along.
If you want to try this for yourself, I packaged up the commands I used into a Makefile to avoid typing a lot of long-winded commands. You can find that on GitHub: https://github.com/georgebrock/ner-tools
Named entity recognisers aren’t the only form of machine learning. If you want to learn about other models, get comfortable with ideas like precision, recall, and F1 scores, and much more, I’d recommend Andrew Ng’s machine learning course on Coursera.
[1] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363–370.