Database indexes
What are database indexes? If you’re building web apps powered by a relational database, you should know.
A database index is a data structure in the database which improves the speed of operations (typically row lookups) on a database table, or across tables. Think of them like the index in the back of a book. If you’re reading a book about the Wu Tang Clan and you want to find all the Method Man references, it’d be a lot easier to flip to the index and find “Method Man”, then open up directly to all those pages numbers (as opposed to scanning every page of the book for the words “Method Man”), wouldn’t it?
Well that’s basically how database indexes work. Assuming you have an indexed
users.id
column on your users table, when you query a database with…
SELECT name FROM users WHERE id = 1
…the database can find that row very quickly, rather than having to scan through every row in the table and check whether it’s “id” value is equal to 1.
So, in a typical rails application, what things should you index?
Primary keys
Most SQL databases with the
concept of a primary key will automatically create an index on the primary key
column when it exists. In Rails, this is typically the “id” column in a table,
and because Rails tells the DB that it’s a primary key, you’ll get the index
“for free”, created by the database. This is very important for a “show view”
like /users/1, for example. The request comes in, and the query to “find user
with id equal to 1” occurs very quickly because the users.id
column is
indexed.
Best practice: index every primary key
Foreign keys
Now what if that user has_many comments? In Rails, you most likely have a
user_id column in your comments table, which the Comment
model will use to
determine the user that it belongs to. You should have an index on every
foreign key column. When you make a page like /users/1/comments, two things
need to happen. First, we lookup the user with id equal to 1. Assuming we’ve
indexed primary keys, this will be fast. Second, we want to find all comments
that belong to this user. If we’ve indexed comments.user_id
, this will be
fast too.
We’ve actually taken this to a “policy” point where we try to avoid ever naming
a column with an _id
suffix, unless it is an integer foreign key column that
needs to be indexed.
Best practice: index every foreign key column
Columns used by to_param
Let’s say you’ve got indexes in place on users.id
and comments.user_id
, but
then a request comes in to make “pretty urls” on these pages. Ok, no problem.
We add a users.keyword
column to users, and allow users to specify a username
with their account. Now we can make requests to urls like
/users/matt/comments
, and see all comments by that user. Well, same situation
as before, we need to do a find, but this time we’re matching against the
‘keyword’ column and not the ‘id’ column, so this needs to be indexed.
Best practice: index every column used in a WHERE clause
Composite keys on join models
What if multiple users could write one comment together, and we needed to store
a timestamp to know when each of them last edited the comment? We could
introduce a Commentary
join model, which would connect users to comments. This
model would have a user_id
and comment_id
column, along with a
last_edited_at
timestamp, or something.
In this case, there should be a composite index (an index across multiple
columns, basically) on the (user_id
, comment_id
) combination on this join
model. That way, when we run a query to find all comments that a certain user
has been an editor on, we can quickly those comments through the join table.
Best practice: index composite keys on join models
State columns
What if comments can be in “Draft” (not done yet), “Submitted” (done, awaiting approval), “Published” (approved) or “Unpublished” (approved then taken down) states? We’re most likely going to have an interface for users which shows “all of your draft comments” so that they can resume working on them, and “all your submitted comments” so they can review their past wisdom.
In this case, we’ll add a comments.publication_state
column to comments (I
have an unfounded fear of using ‘state’ as a column name; sounds like a reserved
word to me, but it’s not!), which will be a varchar column and hold one of those
states as a string. We should index that column so that the queries to find
comments in a certain state are fast.
Best practice: index every varchar/string column that’s used for storing state
Boolean columns
Well, with the introduction of the “submit for approval” concept we also need to add an “admin” concept for users, with the option to either be an admin or not be an admin. We need a “all users who are admins” view so that we can see at a glance who is doing comment approval and click through to see what they’ve approved.
In this case, we’ll add a boolean column called ‘admin’ to the users table (in mysql, this is implemented with a TINYINT column, may be different elsewhere). This column should be indexed, so that the query to find all admin users is fast.
Best practice: index every boolean column
Datetime columns
As this commenting application enjoys continued success, we’ve decided that viewing comments in the order in which they were posted would really improve our users’ ability to find what they were looking for. To do this, we’ll add an order by clause to the query that gets the comments for them to view.
To keep this performant, we should add an index to the created_at
column on
comments. If we had a view sorting by published_at
or updated_at
, we’d want
to index that as well.
Best practice: index every datetime column
Columns used in polymorphic conditional joins
In Rails when you establish a polymorphic relationship you end up with a
condition on a join in the resulting query. For example, let’s say you can
apply a Tag
to a Comment
via a Tagging
association. There will be queries
like:
SELECT * FROM comments
INNER JOIN taggings
ON taggings.taggable_type = 'Comment' and taggings.taggable_id = '3'
INNER JOIN tags on taggings.tag_id = tags.id
In this case, you should add a composite index to taggings(taggable_type,
taggable_id)
pair, so that the initial lookup goes well. You should also have
already indexed taggings(tag_id)
because this is a foreign key association.
Best practice: index all _type/_id pairs on polymorphic join tables
Columns used in validations
Nine times out of ten you probably already have this covered by following the best practices already enumerated in this post, but another thing to watch out for are Rails data validations on models. If you have a column which has a uniqueness constraint, for example, every time you save a record of that class (regardless of whether that particular column has changed or not), ActiveRecord is going to run a query to try to find other rows which have the same data in that column. It will be much faster to do this comparison on indexed columns than non-indexed columns.
An example of this would be something like a users.email
column, where the
email had to be unique across all user rows in the table.
Best practice: index all columns that will generate queries from model validations
Columns used for STI
With the Single Table
Inheritance
pattern in Active Record, there is a ‘type’ column created to store the parent
class of a subclass. For example, if FancyGroup
inherits from Group
with
STI, there will be groups table
with a ‘type’ column, which all FancyGroup
records will populate with the
“FancyGroup” string, to indicate that those records are of that subclass.
This means that every query which looks up FancyGroup
records is going to have
at least a WHERE groups.type = 'FancyGroup'
clause in it, and that means that
the ‘type’ column on that table should be indexed.
Best practice: index all ‘type’ columns on tables used for STI
How to add an index
In Rails, adding an index is really straightforward. Here’s an example from a
recent application where I was converting what was previously a numeric lookup
to be a “pretty url” lookup, and wanted something like
/products/thirty-six-chambers
instead of /products/123
…
class AddIndexToProductsKeyword < ActiveRecord::Migration
def self.up
add_index :products, :keyword
end
def self.down
remove_index :products, :keyword
end
end
The products.keyword
column already existed in this case, it was a matter of
adding an index to that column. The Rails guides site has more about Rails
Migrations.
To add a composite index (an index across multiple columns), Rails uses an Array
syntax. Here’s an example for the earlier Tagging
scenario:
class AddIndicesToAllPolymorphicTables < ActiveRecord::Migration
def self.up
add_index :taggings, [:taggable_type, :taggable_id]
end
def self.down
remove_index :taggings, :column => [:taggable_type, :taggable_id]
end
end
Side note - the pattern by which Rails generates names for indexes has changed
over time. In earlier versions of rails, for example, the table name was
prefixed onto the index name, creating names like posts_published_index
. In
more recent versions, the same migration would generate an index named
index_posts_on_published
. If you are starting a new application, you can
probably afford to not care about this at all. If you are updating a legacy
application which added indexes to your production database using the old
naming, be careful about how you add/remove/rename indexes, so that local
developer environments don’t wind up out of sync with the production
environment.
But I don’t have that much data
Contrary to developer myth, database indexes are not “just for speed” (they are technically for speed, but they are for achieving it in the first place, not fixing problems), or only meant to be used once your data reaches a certain size, or a case of premature optimization. Sure, maybe you’ll get away with decent performance on your local database running in development mode with you as the only user, but that’s not the real use case for the application, is it? Database indexes are not tools to troubleshoot speed problems or data growing pains - they’re a fundamental tool that relational databases use to function correctly in the first place.
What about search
This post doesn’t cover doing full text search against string and text fields, perhaps a future post will. That being said, basically everything in this post will likely assist in speeding up any query that is doing search, by virtue of speeding up the joins and ordering and state restrictions that may be included in a search query.
Ok, so this gets me like a 5% speed increase, right? Who cares
Totally wrong. It’s hard to put an average on the speed increase, because it depends on the schema and the query. But that’s not the point. If you’re not properly indexing the columns in the database and making sure your queries are using them, you’re not using the tool correctly.
For the sake of example - it’s not too out of the ordinary to see a non-indexed query that takes 1 minute or more to complete drop down to the tenths or hundredths of a second in length once it’s been made to use proper indexes. It’s a BIG DEAL. Run your own tests if you don’t believe me. Have someone else run tests for you if you don’t know how to run tests. Just take my word for it if you don’t know anyone competent enough to run tests for you.
What’s the downside
Well, there’s no “catch”, other than needing to remember to do this regularly. There are two “downsides” to indexing.
First, if you haven’t introduced indexes from day one, or you are adding an index to a new column in a table with many rows, the migration can take quite a bit of time. For example, when we passed about 12 million rows in the “messages” table in Thunder Thimble, we started experiencing some sluggishness on a certain query, and found that indexing a previously unindexed column was necessary to get things snappy again. Running that migration on the 12 million row table took about 7 hours, and the bulk of that time was spent adding the index.
Second, you can’t just go about willy nilly and add database indexes to EVERY column in every table in your database. There is a cost to the database to maintain those indexes, and every time an INSERT or UPDATE occurs, there is work to be done that would not need to be done without the index in place. In most scenarios, this is more of a theoretical fear than a practical fear, but definitely do your homework on it before you go index crazy.
How about some brunch
Adding database indexes to migrations when you should, and doing regular database schema reviews to look for missing indexes is part of any applications healthy breakfast.
- Database Indexes on wikipedia
- 20bits interview on database indexes
- How mySQL uses indexes
- Jamis Buck explains indexing databases
You can find missing database indexes with the lol_dba gem.