---
title: A Grand Piano for Your Violin
teaser: Thoughts on safe indexes to add to Rails app databases.
tags: web,postgresql,sql
author: Matt Jankowski
published_on: 2009-08-15
---

## Database indexes

What are database indexes?  If you're building web apps powered by a relational
database, you should know.

A database index is a data structure in the database which improves the speed of
operations (typically row lookups) on a database table, or across tables. Think
of them like the index in the back of a book.  If you're reading a book about
the Wu Tang Clan and you want to find all the Method Man references, it'd be a
lot easier to flip to the index and find "Method Man", then open up directly to
all those pages numbers (as opposed to scanning every page of the book for the
words "Method Man"), wouldn't it?

Well that's basically how database indexes work.  Assuming you have an indexed
`users.id` column on your users table, when you query a database with...

```sql
SELECT name FROM users WHERE id = 1
```

...the database can find that row very quickly, rather than having to scan
through every row in the table and check whether it's "id" value is equal to 1.

So, in a typical rails application, what things should you index?

## Primary keys

Most <abbr title="Structured Query Language">SQL</abbr> databases with the
concept of a primary key will automatically create an index on the primary key
column when it exists.  In Rails, this is typically the "id" column in a table,
and because Rails tells the DB that it's a primary key, you'll get the index
"for free", created by the database.  This is very important for a "show view"
like /users/1, for example.  The request comes in, and the query to "find user
with id equal to 1" occurs very quickly because the `users.id` column is
indexed.

Best practice: index every primary key

## Foreign keys

Now what if that user has\_many comments?  In Rails, you most likely have a
user\_id column in your comments table, which the `Comment` model will use to
determine the user that it belongs to.  You should have an index on every
foreign key column.  When you make a page like /users/1/comments, two things
need to happen.  First, we lookup the user with id equal to 1.  Assuming we've
indexed primary keys, this will be fast.  Second, we want to find all comments
that belong to this user.  If we've indexed `comments.user_id`, this will be
fast too.

We've actually taken this to a "policy" point where we try to avoid ever naming
a column with an `_id` suffix, unless it is an integer foreign key column that
needs to be indexed.

Best practice: index every foreign key column

## Columns used by to_param

Let's say you've got indexes in place on `users.id` and `comments.user_id`, but
then a request comes in to make "pretty urls" on these pages.  Ok, no problem.
We add a `users.keyword` column to users, and allow users to specify a username
with their account.  Now we can make requests to urls like
`/users/matt/comments`, and see all comments by that user.  Well, same situation
as before, we need to do a find, but this time we're matching against the
'keyword' column and not the 'id' column, so this needs to be indexed.

Best practice: index every column used in a WHERE clause

## Composite keys on join models

What if multiple users could write one comment together, and we needed to store
a timestamp to know *when* each of them last edited the comment?  We could
introduce a `Commentary` join model, which would connect users to comments. This
model would have a `user_id` and `comment_id` column, along with a
`last_edited_at` timestamp, or something.

In this case, there should be a composite index (an index across multiple
columns, basically) on the (`user_id`, `comment_id`) combination on this join
model.  That way, when we run a query to find all comments that a certain user
has been an editor on, we can quickly those comments through the join table.

Best practice: index composite keys on join models

## State columns

What if comments can be in "Draft" (not done yet), "Submitted" (done, awaiting
approval), "Published" (approved) or "Unpublished" (approved then taken down)
states?  We're most likely going to have an interface for users which shows "all
of your draft comments" so that they can resume working on them, and "all your
submitted comments" so they can review their past wisdom.

In this case, we'll add a `comments.publication_state` column to comments (I
have an unfounded fear of using 'state' as a column name; sounds like a reserved
word to me, but it's not!), which will be a varchar column and hold one of those
states as a string.  We should index that column so that the queries to find
comments in a certain state are fast.

Best practice: index every varchar/string column that's used for storing state

## Boolean columns

Well, with the introduction of the "submit for approval" concept we also need to
add an "admin" concept for users, with the option to either be an admin or not
be an admin.  We need a "all users who are admins" view so that we can see at a
glance who is doing comment approval and click through to see what they've
approved.

In this case, we'll add a boolean column called 'admin' to the users table (in
mysql, this is implemented with a TINYINT column, may be different elsewhere).
This column should be indexed, so that the query to find all admin users is
fast.

Best practice: index every boolean column

## Datetime columns

As this commenting application enjoys continued success, we've decided that
viewing comments in the order in which they were posted would really improve our
users' ability to find what they were looking for.  To do this, we'll add an
order by clause to the query that gets the comments for them to view.

To keep this performant, we should add an index to the `created_at` column on
comments.  If we had a view sorting by `published_at` or `updated_at`, we'd want
to index that as well.

Best practice: index every datetime column

## Columns used in polymorphic conditional joins

In Rails when you establish a polymorphic relationship you end up with a
condition on a join in the resulting query.  For example, let's say you can
apply a `Tag` to a `Comment` via a `Tagging` association.  There will be queries
like:

```sql
    SELECT * FROM comments
    INNER JOIN taggings
    ON taggings.taggable_type = 'Comment' and taggings.taggable_id = '3'
    INNER JOIN tags on taggings.tag_id = tags.id
```

In this case, you should add a composite index to `taggings(taggable_type,
taggable_id)` pair, so that the initial lookup goes well.  You should also have
already indexed `taggings(tag_id)` because this is a foreign key association.

Best practice: index all \_type/\_id pairs on polymorphic join tables

## Columns used in validations

Nine times out of ten you probably already have this covered by following the
best practices already enumerated in this post, but another thing to watch out
for are Rails data validations on models.  If you have a column which has a
uniqueness constraint, for example, every time you save a record of that class
(regardless of whether that particular column has changed or not), ActiveRecord
is going to run a query to try to find other rows which have the same data in
that column.  It will be much faster to do this comparison on indexed columns
than non-indexed columns.

An example of this would be something like a `users.email` column, where the
email had to be unique across all user rows in the table.

Best practice: index all columns that will generate queries from model
validations

## Columns used for STI

With the [Single Table
Inheritance](http://martinfowler.com/eaaCatalog/singleTableInheritance.html)
pattern in Active Record, there is a 'type' column created to store the parent
class of a subclass.  For example, if `FancyGroup` inherits from `Group` with
<abbr title="Single Table Inheritance">STI</abbr>, there will be groups table
with a 'type' column, which all `FancyGroup` records will populate with the
"FancyGroup" string, to indicate that those records are of that subclass.

This means that every query which looks up `FancyGroup` records is going to have
at least a `WHERE groups.type = 'FancyGroup'` clause in it, and that means that
the 'type' column on that table should be indexed.

Best practice: index all 'type' columns on tables used for <abbr title="Single
Table Inheritance">STI</abbr>

## How to add an index

In Rails, adding an index is really straightforward.  Here's an example from a
recent application where I was converting what was previously a numeric lookup
to be a "pretty url" lookup, and wanted something like
`/products/thirty-six-chambers` instead of `/products/123`...

```ruby
class AddIndexToProductsKeyword < ActiveRecord::Migration
  def self.up
    add_index :products, :keyword
  end
  def self.down
    remove_index :products, :keyword
  end
end
```

The `products.keyword` column already existed in this case, it was a matter of
adding an index to that column.  The Rails guides site has more about [Rails
Migrations](http://guide.rails.info/migrations.html).

To add a composite index (an index across multiple columns), Rails uses an Array
syntax.  Here's an example for the earlier `Tagging` scenario:

```ruby
class AddIndicesToAllPolymorphicTables < ActiveRecord::Migration
  def self.up
    add_index :taggings, [:taggable_type, :taggable_id]
  end
  def self.down
    remove_index :taggings, :column => [:taggable_type, :taggable_id]
  end
end
```

Side note - the pattern by which Rails generates *names* for indexes has changed
over time.  In earlier versions of rails, for example, the table name was
prefixed onto the index name, creating names like `posts_published_index`. In
more recent versions, the same migration would generate an index named
`index_posts_on_published`.  If you are starting a new application, you can
probably afford to not care about this at all.  If you are updating a legacy
application which added indexes to your production database using the old
naming, be careful about how you add/remove/rename indexes, so that local
developer environments don't wind up out of sync with the production
environment.

## But I don't have that much data

Contrary to developer myth, database indexes are not "just for speed" (they are
technically for speed, but they are for achieving it in the first place, not
fixing problems), or only meant to be used once your data reaches a certain
size, or a case of premature optimization.  Sure, maybe you'll get away with
decent performance on your local database running in development mode with you
as the only user, but that's not the real use case for the application, is it?
Database indexes are not tools to troubleshoot speed problems or data growing
pains - **they're a fundamental tool that relational databases use to function
correctly in the first place**.

## What about search

This post doesn't cover doing full text search against string and text fields,
perhaps a future post will.  That being said, basically everything in this post
will likely assist in speeding up any query that is doing search, by virtue of
speeding up the joins and ordering and state restrictions that may be included
in a search query.

## Ok, so this gets me like a 5% speed increase, right?  Who cares

Totally wrong.  It's hard to put an average on the speed increase, because it
depends on the schema and the query.  But that's not the point.  If you're not
properly indexing the columns in the database and making sure your queries are
using them, you're not using the tool correctly.

For the sake of example - it's not too out of the ordinary to see a non-indexed
query that takes 1 minute or more to complete drop down to the tenths or
hundredths of a second in length once it's been made to use proper indexes. It's
a BIG DEAL.  Run your own tests if you don't believe me.  Have someone else run
tests for you if you don't know how to run tests.  Just take my word for it if
you don't know anyone competent enough to run tests for you.

## What's the downside

Well, there's no "catch", other than needing to remember to do this regularly.
There are two "downsides" to indexing.

First, if you haven't introduced indexes from day one, or you are adding an
index to a new column in a table with many rows, the migration can take quite a
bit of time.  For example, when we passed about 12 million rows in the
"messages" table in [Thunder Thimble](http://thunderthimble.com), we started
experiencing some sluggishness on a certain query, and found that indexing a
previously unindexed column was necessary to get things snappy again.  Running
that migration on the 12 million row table took about 7 hours, and the bulk of
that time was spent adding the index.

Second, you can't just go about willy nilly and add database indexes to EVERY
column in every table in your database.  There *is a cost* to the database to
*maintain* those indexes, and every time an INSERT or UPDATE occurs, there is
work to be done that would not need to be done without the index in place.  In
most scenarios, this is more of a theoretical fear than a practical fear, but
definitely do your homework on it before you go index crazy.

## How about some brunch

Adding database indexes to migrations when you should, and doing regular
database schema reviews to look for missing indexes is part of any applications
healthy breakfast.

* [Database Indexes on wikipedia](http://en.wikipedia.org/wiki/Index_(database))
* [20bits interview on database indexes](http://20bits.com/articles/interview-questions-database-indexes/)
* [How mySQL uses indexes](http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html)
* [Jamis Buck explains indexing databases](http://weblog.jamisbuck.org/2006/10/23/indexing-for-db-performance)

You can find missing database indexes with the
[lol_dba](https://github.com/plentz/lol_dba) gem.
