---
title: Full-text search in your browser
teaser: 'Add search functionality to your web page using the lunr.js JavaScript library.

  '
tags: javascript,search,new bamboo,web
author: Oliver Nightingale
published_on: 2013-02-26
---

_This post was originally published on the New Bamboo blog, before [New Bamboo
joined thoughtbot in London][new-bamboo-thoughtbot]._

---

Search is an important part of any application: users instinctively know where
to look when they want to search within your site.  For many users search is
often the preferred way of getting to the functionality they want in your app.

Implementing search usually involves some external process which manages
indexing and handling queries against your data, such as [Solr] or [Elastic
Search].  This is often a secondary piece of your application's architecture,
and is yet another service to host and maintain. For some sites or applications
this extra overhead can be an additional layer of complexity too far.

[lunr.js] is a small JavaScript library that offers full text search in your
browser, providing simple, yet powerful search.  By removing the need of extra
server side processes, search can be a feature on sites or apps that otherwise
would not have warranted the extra complexity.

## Usage

Adding full-text search to your app or site couldn't be easier, after including
lunr in your page simply define your index:

```javascript
var index = lunr(function () {
    this.field('title', {boost: 10})
    this.field('body')
    this.ref('id')
})
```

This is setting up an index with two fields, 'title' and 'body'. The 'title'
field is given a boost, meaning that terms in the title will have a larger
impact on the overall document score than those in the 'body' field. This also
specifies the reference to use for uniquely identifying documents in the index,
in this case the 'id' field is used.

With the index defined JSON documents can be added:

```javascript
var documents = [{
    id: 1,
    title: 'Twelfth Night',
    body: 'If music be the food of love, play on'
}, {
    id: 2,
    title: 'Macbeth',
    body: 'When shall we three meet again, In thunder, lightning, or in rain?'
}, {
    id: 3,
    title: 'Richard III',
    body: 'Now is the winter of our discontent, Made glorious summer by this sun of York;'
}]

documents.forEach(function (document) {
    index.add(document)
})
```

Searching is equally simple:

```javascript
index.search('love') -> [{ ref: 1, score: 0.7854 }]
```

## Techniques

lunr implements a number of techniques from information retrieval, combining
them into a compact yet powerful library. It applies these in two main areas:
performing text processing on incoming documents and queries, and storing and
retrieving documents by term in an index. This article will go into some of the
techniques used to achieve this.

### Tokenising

The first step when processing the text is tokenisation. This converts strings
containing words and sentences into distinct tokens that are easier on which to
perform further processing. This is a deceptively complex task for a computer to
accomplish.

For example, consider the following sentence:

> Friends, Romans, Countrymen, lend me your ears;

It should be clear that this sentence would be converted to the following
tokens:

    "Friends", "Romans", "Countrymen", "lend", "me", "your", "ears"

However what about the following:

> Mr. O'Neil hasn't visited New York.

Should the first token be "Mr." or "Mr"? How do you distinguish between this
word, where a full-stop follows a word, and a word at the end of sentence, such
as "York."?

The name "O'Neil" is another edge case. As humans we have no problem in seeing
the token as "O'Neil", but the computer doesn't know anything about names. This
same limitation is shown when trying to tokenise "New York". In this sentence
"New York" clearly has meaning as a single token, but a tokenizer would probably
split this into two tokens: "New" and "York".

Splitting the sentence into tokens discards any inherent meaning the words have
based on their relative positions. It does, however, make indexing and search
much simpler, and for most cases will provide decent enough accuracy and recall.
It allows the remaining text processing to work on single tokens  which reduces
complexity.

### Stemming

Once the documents have been tokenised, lunr applies a [stemming] algorithm to
each token. Stemming is a means of reducing a word down to its base, or stem.
For example 'search', 'searching' and 'searched' all get reduced to the stem
'search'.

Stemming greatly reduces the number of terms that the search engine has to
index, which leads to a much more performant search experience.  It also means
that search queries do not have to contain exact word matches to the documents
in the index.  A search for the term 'fish' should match all documents that
contained any word that can be stemmed to 'fish'. This improves recall;
returning results of which more will be relevant to the user.

### Stop word filtering

In any document there will be many words that appear regularly but provide
little or no extra meaning to the document. Words such as 'the', 'and', 'is' and
'on' are very frequent in the English language and most documents will contain
many instances of them.  These words are generally not very useful when
searching, they are not normally what users are searching for when entering
queries.  Because of this it can be beneficial to remove these words from the
index. This has the benefit of reducing the size of the index, as well as
improving the performance of retrieving documents from the index.

Before entering the index, lunr passes documents through a stop word filter,
removing many common words before any further processing takes place. This
filter is fairly generic, but can be paired with a more corpus specific filter
for even better reduction in the number of useless words in the index.

### Text processing Pipeline

lunr uses a configurable pipeline to perform text processing on documents before
they enter the index. The built-in text processors, discussed above, are part of
this pipeline. Following a middleware style architecture allows lunr to have an
easily extensible text processing system. Whilst it ships with English specific
text algorithms, it is simple to replace these with language specific
processors, which allows lunr to work with any language.

This pipeline also allows extra text processing to be added. For example, adding
[Metaphoning] reduces tokens to sounds to allow misspelt queries to still match
documents, or [n-gram] processing to combine preceding tokens into the index.

### TF-IDF

Term frequency-inverse document frequency ([TF-IDF]) is a technique used to
retrieve relevant documents from a collection of documents. It normalises the
importance of a term across many documents of varying sizes. lunr makes use of
TF-IDF to help score terms in documents.

TF-IDF is calculated by multiplying the term frequency by the inverse of the
document frequency of that term.  Term frequency is how often a particular term
appears in this document, and document frequency is how many documents contain
at least one instance of that term.

The term frequency represents how important a term is to a document. It makes
sense that a document with many occurrences of a term is probably very relevant
to a search for that term, hence the term frequency will be high.  This is
normalised to take into account the length of the document, which prevents
longer documents dominating search results.

The inverse document frequency acts to prevent very common terms from affecting
the search results. A term that appears in all documents is not a good term to
differentiate documents, so very common terms across all documents are penalised
and contribute less to a particular document's score.

In this way "stop words, very common words such as 'and', 'the' or 'but', have
little impact on a search. lunr excludes common stop words before this step so
as to reduce the size of the index, but TF-IDF can help to lessen the impact of
any corpus specific stop words that might not be included in the default stop
word filter.

### Cosine Similarity

Once terms have been scored for their relevancy  to a document, lunr can
calculate how relevant each document is to a query. To do this, lunr uses a
measure called cosine similarity.

Both documents and queries can be represented as vectors. These vectors have a
dimension for every token in the entire index. The values of these dimensions
will be the TF-IDF score for a token in the document/query being represented.

The angle, &theta;, between these vectors represents how similar they are to
each other, which in turn tells us how similar a document is to a query, or
another document. Identical vectors will have cos &theta; = 1, whilst completely
opposite vectors will have cos &theta; = 0.

This is much easier to visualise when dealing with just two dimensions, as shown
in the diagram below. Although harder to imagine, vectors with a large number of
dimensions behave in the same way.

![](https://images.thoughtbot.com/new-bamboo/blog/full-text-search-in-your-browser/grAjTCWzRZmkyQG9szme_cosine_similarity.png)

In the diagram, V<sub>q</sub> represents the query, whilst V<sub>1</sub> and
V<sub>2</sub> are documents. The angle, &theta;, is smaller between
V<sub>q</sub> and V<sub>1</sub> and so this document is more similar to the
query than V<sub>2</sub>.

### Inverted Index

An inverted index is a key part of any search engine. They are used to map from
tokens to documents containing that token. lunr uses a [trie] to implement its
inverted index which makes prefix searching simple to achieve. The token lookup
time is dependent only on  the length of the token to be found.

![](https://images.thoughtbot.com/new-bamboo/blog/full-text-search-in-your-browser/69Ks9UR8QHWw2raDf2Kp_trie.png)

A trie is a kind of tree where each node contains a section of the key, and all
of its child nodes represent terms that share a common prefix. By traversing the
tree to the end of the given input key we find a sub-tree where all the nodes
contain words that match the prefix.

![](https://images.thoughtbot.com/new-bamboo/blog/full-text-search-in-your-browser/Xblx4vKnSVm6QiZBcg1G_trie_heat.png)

In the example above we get to the word "heat" by navigating down the trie for
each letter, at the last letter of our term, the node will store all the
documents that contain the term heat.  The sub-tree below this node will contain
all the documents that contain a term starting with "heat".

For suffix matching, a duplicate trie can be created where the tokens are
reversed before being entered into the trie; combining both the prefix and
suffix search tries allows for powerful query expansions.

## Conclusion

Combining the above techniques enables lunr to be a surprisingly powerful search
engine that is easy to start using in your applications today. It is not as
fully featured as server side search software, however it is trivial in
comparison to set up and maintain. Its simplicity also makes it easy to
understand and customise for specific use-cases if required.

It works well even with relatively large datasets, and can be run in a
web-worker if performance starts to become an issue. It has the benefit of being
available to service search queries even when the user is offline, and since it
is running on the client, it has a head start performance wise over server side
searches since there is no network overhead.

[Elastic Search]: http://www.elasticsearch.org/
[lunr.js]: http://lunrjs.com
[Metaphoning]: http://en.wikipedia.org/wiki/Metaphone
[n-gram]: http://en.wikipedia.org/wiki/N-gram
[new-bamboo-thoughtbot]: https://thoughtbot.com/blog/new-bamboo-joins-thoughtbot-in-london
[Solr]: http://lucene.apache.org/solr/
[stemming]: http://en.wikipedia.org/wiki/Stemming
[TF-IDF]: http://en.wikipedia.org/wiki/TF_IDF
[trie]: http://en.wikipedia.org/wiki/Trie
