Full-text search in your browser

This post was originally published on the New Bamboo blog, before New Bamboo joined thoughtbot in London.

Search is an important part of any application: users instinctively know where to look when they want to search within your site. For many users search is often the preferred way of getting to the functionality they want in your app.

Implementing search usually involves some external process which manages indexing and handling queries against your data, such as Solr or Elastic Search. This is often a secondary piece of your application’s architecture, and is yet another service to host and maintain. For some sites or applications this extra overhead can be an additional layer of complexity too far.

lunr.js is a small JavaScript library that offers full text search in your browser, providing simple, yet powerful search. By removing the need of extra server side processes, search can be a feature on sites or apps that otherwise would not have warranted the extra complexity.

Usage

Adding full-text search to your app or site couldn’t be easier, after including lunr in your page simply define your index:

var index = lunr(function () {
    this.field('title', {boost: 10})
    this.field('body')
    this.ref('id')
})

This is setting up an index with two fields, ‘title’ and ‘body’. The ‘title’ field is given a boost, meaning that terms in the title will have a larger impact on the overall document score than those in the ‘body’ field. This also specifies the reference to use for uniquely identifying documents in the index, in this case the ‘id’ field is used.

With the index defined JSON documents can be added:

var documents = [{
    id: 1,
    title: 'Twelfth Night',
    body: 'If music be the food of love, play on'
}, {
    id: 2,
    title: 'Macbeth',
    body: 'When shall we three meet again, In thunder, lightning, or in rain?'
}, {
    id: 3,
    title: 'Richard III',
    body: 'Now is the winter of our discontent, Made glorious summer by this sun of York;'
}]

documents.forEach(function (document) {
    index.add(document)
})

Searching is equally simple:

index.search('love') -> [{ ref: 1, score: 0.7854 }]

Techniques

lunr implements a number of techniques from information retrieval, combining them into a compact yet powerful library. It applies these in two main areas: performing text processing on incoming documents and queries, and storing and retrieving documents by term in an index. This article will go into some of the techniques used to achieve this.

Tokenising

The first step when processing the text is tokenisation. This converts strings containing words and sentences into distinct tokens that are easier on which to perform further processing. This is a deceptively complex task for a computer to accomplish.

For example, consider the following sentence:

Friends, Romans, Countrymen, lend me your ears;

It should be clear that this sentence would be converted to the following tokens:

"Friends", "Romans", "Countrymen", "lend", "me", "your", "ears"

However what about the following:

Mr. O'Neil hasn’t visited New York.

Should the first token be “Mr.” or “Mr”? How do you distinguish between this word, where a full-stop follows a word, and a word at the end of sentence, such as “York.”?

The name “O'Neil” is another edge case. As humans we have no problem in seeing the token as “O'Neil”, but the computer doesn’t know anything about names. This same limitation is shown when trying to tokenise “New York”. In this sentence “New York” clearly has meaning as a single token, but a tokenizer would probably split this into two tokens: “New” and “York”.

Splitting the sentence into tokens discards any inherent meaning the words have based on their relative positions. It does, however, make indexing and search much simpler, and for most cases will provide decent enough accuracy and recall. It allows the remaining text processing to work on single tokens which reduces complexity.

Stemming

Once the documents have been tokenised, lunr applies a stemming algorithm to each token. Stemming is a means of reducing a word down to its base, or stem. For example ‘search’, ‘searching’ and ‘searched’ all get reduced to the stem ‘search’.

Stemming greatly reduces the number of terms that the search engine has to index, which leads to a much more performant search experience. It also means that search queries do not have to contain exact word matches to the documents in the index. A search for the term ‘fish’ should match all documents that contained any word that can be stemmed to ‘fish’. This improves recall; returning results of which more will be relevant to the user.

Stop word filtering

In any document there will be many words that appear regularly but provide little or no extra meaning to the document. Words such as ‘the’, ‘and’, ‘is’ and ‘on’ are very frequent in the English language and most documents will contain many instances of them. These words are generally not very useful when searching, they are not normally what users are searching for when entering queries. Because of this it can be beneficial to remove these words from the index. This has the benefit of reducing the size of the index, as well as improving the performance of retrieving documents from the index.

Before entering the index, lunr passes documents through a stop word filter, removing many common words before any further processing takes place. This filter is fairly generic, but can be paired with a more corpus specific filter for even better reduction in the number of useless words in the index.

Text processing Pipeline

lunr uses a configurable pipeline to perform text processing on documents before they enter the index. The built-in text processors, discussed above, are part of this pipeline. Following a middleware style architecture allows lunr to have an easily extensible text processing system. Whilst it ships with English specific text algorithms, it is simple to replace these with language specific processors, which allows lunr to work with any language.

This pipeline also allows extra text processing to be added. For example, adding Metaphoning reduces tokens to sounds to allow misspelt queries to still match documents, or n-gram processing to combine preceding tokens into the index.

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a technique used to retrieve relevant documents from a collection of documents. It normalises the importance of a term across many documents of varying sizes. lunr makes use of TF-IDF to help score terms in documents.

TF-IDF is calculated by multiplying the term frequency by the inverse of the document frequency of that term. Term frequency is how often a particular term appears in this document, and document frequency is how many documents contain at least one instance of that term.

The term frequency represents how important a term is to a document. It makes sense that a document with many occurrences of a term is probably very relevant to a search for that term, hence the term frequency will be high. This is normalised to take into account the length of the document, which prevents longer documents dominating search results.

The inverse document frequency acts to prevent very common terms from affecting the search results. A term that appears in all documents is not a good term to differentiate documents, so very common terms across all documents are penalised and contribute less to a particular document’s score.

In this way “stop words, very common words such as ‘and’, ‘the’ or ‘but’, have little impact on a search. lunr excludes common stop words before this step so as to reduce the size of the index, but TF-IDF can help to lessen the impact of any corpus specific stop words that might not be included in the default stop word filter.

Cosine Similarity

Once terms have been scored for their relevancy to a document, lunr can calculate how relevant each document is to a query. To do this, lunr uses a measure called cosine similarity.

Both documents and queries can be represented as vectors. These vectors have a dimension for every token in the entire index. The values of these dimensions will be the TF-IDF score for a token in the document/query being represented.

The angle, θ, between these vectors represents how similar they are to each other, which in turn tells us how similar a document is to a query, or another document. Identical vectors will have cos θ = 1, whilst completely opposite vectors will have cos θ = 0.

This is much easier to visualise when dealing with just two dimensions, as shown in the diagram below. Although harder to imagine, vectors with a large number of dimensions behave in the same way.

In the diagram, V_q represents the query, whilst V₁ and V₂ are documents. The angle, θ, is smaller between V_q and V₁ and so this document is more similar to the query than V₂.

Inverted Index

An inverted index is a key part of any search engine. They are used to map from tokens to documents containing that token. lunr uses a trie to implement its inverted index which makes prefix searching simple to achieve. The token lookup time is dependent only on the length of the token to be found.

A trie is a kind of tree where each node contains a section of the key, and all of its child nodes represent terms that share a common prefix. By traversing the tree to the end of the given input key we find a sub-tree where all the nodes contain words that match the prefix.

In the example above we get to the word "heat” by navigating down the trie for each letter, at the last letter of our term, the node will store all the documents that contain the term heat. The sub-tree below this node will contain all the documents that contain a term starting with “heat”.

For suffix matching, a duplicate trie can be created where the tokens are reversed before being entered into the trie; combining both the prefix and suffix search tries allows for powerful query expansions.

Conclusion

Combining the above techniques enables lunr to be a surprisingly powerful search engine that is easy to start using in your applications today. It is not as fully featured as server side search software, however it is trivial in comparison to set up and maintain. Its simplicity also makes it easy to understand and customise for specific use-cases if required.

It works well even with relatively large datasets, and can be run in a web-worker if performance starts to become an issue. It has the benefit of being available to service search queries even when the user is offline, and since it is running on the client, it has a head start performance wise over server side searches since there is no network overhead.