A Grand Piano for Your Violin

Database indexes

What are database indexes? If you’re building web apps powered by a relational database, you should know.

A database index is a data structure in the database which improves the speed of operations (typically row lookups) on a database table, or across tables. Think of them like the index in the back of a book. If you’re reading a book about the Wu Tang Clan and you want to find all the Method Man references, it’d be a lot easier to flip to the index and find “Method Man”, then open up directly to all those pages numbers (as opposed to scanning every page of the book for the words “Method Man”), wouldn’t it?

Well that’s basically how database indexes work. Assuming you have an indexed users.id column on your users table, when you query a database with…

SELECT name FROM users WHERE id = 1

…the database can find that row very quickly, rather than having to scan through every row in the table and check whether it’s “id” value is equal to 1.

So, in a typical rails application, what things should you index?

Primary keys

Most SQL databases with the concept of a primary key will automatically create an index on the primary key column when it exists. In Rails, this is typically the “id” column in a table, and because Rails tells the DB that it’s a primary key, you’ll get the index “for free”, created by the database. This is very important for a “show view” like /users/1, for example. The request comes in, and the query to “find user with id equal to 1” occurs very quickly because the users.id column is indexed.

Best practice: index every primary key

Foreign keys

Now what if that user has_many comments? In Rails, you most likely have a user_id column in your comments table, which the Comment model will use to determine the user that it belongs to. You should have an index on every foreign key column. When you make a page like /users/1/comments, two things need to happen. First, we lookup the user with id equal to 1. Assuming we’ve indexed primary keys, this will be fast. Second, we want to find all comments that belong to this user. If we’ve indexed comments.user_id, this will be fast too.

We’ve actually taken this to a “policy” point where we try to avoid ever naming a column with an _id suffix, unless it is an integer foreign key column that needs to be indexed.

Best practice: index every foreign key column

Columns used by to_param

Let’s say you’ve got indexes in place on users.id and comments.user_id, but then a request comes in to make “pretty urls” on these pages. Ok, no problem. We add a users.keyword column to users, and allow users to specify a username with their account. Now we can make requests to urls like /users/matt/comments, and see all comments by that user. Well, same situation as before, we need to do a find, but this time we’re matching against the ‘keyword’ column and not the ‘id’ column, so this needs to be indexed.

Best practice: index every column used in a WHERE clause

Composite keys on join models

What if multiple users could write one comment together, and we needed to store a timestamp to know when each of them last edited the comment? We could introduce a Commentary join model, which would connect users to comments. This model would have a user_id and comment_id column, along with a last_edited_at timestamp, or something.

In this case, there should be a composite index (an index across multiple columns, basically) on the (user_id, comment_id) combination on this join model. That way, when we run a query to find all comments that a certain user has been an editor on, we can quickly those comments through the join table.

Best practice: index composite keys on join models

State columns

What if comments can be in “Draft” (not done yet), “Submitted” (done, awaiting approval), “Published” (approved) or “Unpublished” (approved then taken down) states? We’re most likely going to have an interface for users which shows “all of your draft comments” so that they can resume working on them, and “all your submitted comments” so they can review their past wisdom.

In this case, we’ll add a comments.publication_state column to comments (I have an unfounded fear of using ‘state’ as a column name; sounds like a reserved word to me, but it’s not!), which will be a varchar column and hold one of those states as a string. We should index that column so that the queries to find comments in a certain state are fast.

Best practice: index every varchar/string column that’s used for storing state

Boolean columns

Well, with the introduction of the “submit for approval” concept we also need to add an “admin” concept for users, with the option to either be an admin or not be an admin. We need a “all users who are admins” view so that we can see at a glance who is doing comment approval and click through to see what they’ve approved.

In this case, we’ll add a boolean column called ‘admin’ to the users table (in mysql, this is implemented with a TINYINT column, may be different elsewhere). This column should be indexed, so that the query to find all admin users is fast.

Best practice: index every boolean column

Datetime columns

As this commenting application enjoys continued success, we’ve decided that viewing comments in the order in which they were posted would really improve our users’ ability to find what they were looking for. To do this, we’ll add an order by clause to the query that gets the comments for them to view.

To keep this performant, we should add an index to the created_at column on comments. If we had a view sorting by published_at or updated_at, we’d want to index that as well.

Best practice: index every datetime column

Columns used in polymorphic conditional joins

In Rails when you establish a polymorphic relationship you end up with a condition on a join in the resulting query. For example, let’s say you can apply a Tag to a Comment via a Tagging association. There will be queries like:

    SELECT * FROM comments
    INNER JOIN taggings
    ON taggings.taggable_type = 'Comment' and taggings.taggable_id = '3'
    INNER JOIN tags on taggings.tag_id = tags.id

In this case, you should add a composite index to taggings(taggable_type, taggable_id) pair, so that the initial lookup goes well. You should also have already indexed taggings(tag_id) because this is a foreign key association.

Best practice: index all _type/_id pairs on polymorphic join tables

Columns used in validations

Nine times out of ten you probably already have this covered by following the best practices already enumerated in this post, but another thing to watch out for are Rails data validations on models. If you have a column which has a uniqueness constraint, for example, every time you save a record of that class (regardless of whether that particular column has changed or not), ActiveRecord is going to run a query to try to find other rows which have the same data in that column. It will be much faster to do this comparison on indexed columns than non-indexed columns.

An example of this would be something like a users.email column, where the email had to be unique across all user rows in the table.

Best practice: index all columns that will generate queries from model validations

Columns used for STI

With the Single Table Inheritance pattern in Active Record, there is a ‘type’ column created to store the parent class of a subclass. For example, if FancyGroup inherits from Group with STI, there will be groups table with a ‘type’ column, which all FancyGroup records will populate with the “FancyGroup” string, to indicate that those records are of that subclass.

This means that every query which looks up FancyGroup records is going to have at least a WHERE groups.type = 'FancyGroup' clause in it, and that means that the ‘type’ column on that table should be indexed.

Best practice: index all ‘type’ columns on tables used for STI

How to add an index

In Rails, adding an index is really straightforward. Here’s an example from a recent application where I was converting what was previously a numeric lookup to be a “pretty url” lookup, and wanted something like /products/thirty-six-chambers instead of /products/123…

class AddIndexToProductsKeyword < ActiveRecord::Migration
  def self.up
    add_index :products, :keyword
  end
  def self.down
    remove_index :products, :keyword
  end
end

The products.keyword column already existed in this case, it was a matter of adding an index to that column. The Rails guides site has more about Rails Migrations.

To add a composite index (an index across multiple columns), Rails uses an Array syntax. Here’s an example for the earlier Tagging scenario:

class AddIndicesToAllPolymorphicTables < ActiveRecord::Migration
  def self.up
    add_index :taggings, [:taggable_type, :taggable_id]
  end
  def self.down
    remove_index :taggings, :column => [:taggable_type, :taggable_id]
  end
end

Side note - the pattern by which Rails generates names for indexes has changed over time. In earlier versions of rails, for example, the table name was prefixed onto the index name, creating names like posts_published_index. In more recent versions, the same migration would generate an index named index_posts_on_published. If you are starting a new application, you can probably afford to not care about this at all. If you are updating a legacy application which added indexes to your production database using the old naming, be careful about how you add/remove/rename indexes, so that local developer environments don’t wind up out of sync with the production environment.

But I don’t have that much data

Contrary to developer myth, database indexes are not “just for speed” (they are technically for speed, but they are for achieving it in the first place, not fixing problems), or only meant to be used once your data reaches a certain size, or a case of premature optimization. Sure, maybe you’ll get away with decent performance on your local database running in development mode with you as the only user, but that’s not the real use case for the application, is it? Database indexes are not tools to troubleshoot speed problems or data growing pains - they’re a fundamental tool that relational databases use to function correctly in the first place.

What about search

This post doesn’t cover doing full text search against string and text fields, perhaps a future post will. That being said, basically everything in this post will likely assist in speeding up any query that is doing search, by virtue of speeding up the joins and ordering and state restrictions that may be included in a search query.

Ok, so this gets me like a 5% speed increase, right? Who cares

Totally wrong. It’s hard to put an average on the speed increase, because it depends on the schema and the query. But that’s not the point. If you’re not properly indexing the columns in the database and making sure your queries are using them, you’re not using the tool correctly.

For the sake of example - it’s not too out of the ordinary to see a non-indexed query that takes 1 minute or more to complete drop down to the tenths or hundredths of a second in length once it’s been made to use proper indexes. It’s a BIG DEAL. Run your own tests if you don’t believe me. Have someone else run tests for you if you don’t know how to run tests. Just take my word for it if you don’t know anyone competent enough to run tests for you.

What’s the downside

Well, there’s no “catch”, other than needing to remember to do this regularly. There are two “downsides” to indexing.

First, if you haven’t introduced indexes from day one, or you are adding an index to a new column in a table with many rows, the migration can take quite a bit of time. For example, when we passed about 12 million rows in the “messages” table in Thunder Thimble, we started experiencing some sluggishness on a certain query, and found that indexing a previously unindexed column was necessary to get things snappy again. Running that migration on the 12 million row table took about 7 hours, and the bulk of that time was spent adding the index.

Second, you can’t just go about willy nilly and add database indexes to EVERY column in every table in your database. There is a cost to the database to maintain those indexes, and every time an INSERT or UPDATE occurs, there is work to be done that would not need to be done without the index in place. In most scenarios, this is more of a theoretical fear than a practical fear, but definitely do your homework on it before you go index crazy.

How about some brunch

Adding database indexes to migrations when you should, and doing regular database schema reviews to look for missing indexes is part of any applications healthy breakfast.

You can find missing database indexes with the lol_dba gem.