Why Postgres Won't Always Use an Index

We were analyzing slow responses on a project recently and found ourselves questioning the performance of some PostgreSQL queries. As is typical when digging into queries, we had some questions around what the query was doing. So, we used our handy EXPLAIN command to shed some light on the database’s behavior. Upon inspecting the query it turned out an index we had created was not being used. We wanted to know why.

EXPLAIN explained

Let’s first look at how the EXPLAIN command works.

The Postgres docs have a helpful article for learning about EXPLAIN. The basics are that in your psql console prepend the word EXPLAIN in front of your query and you’ll get a query plan.

In SQL

Run psql <name of your database>, or rails dbconsole if you are inside of a Rails project. Prepend EXPLAIN to your query to see an explanation of how Postgres is going to break down your problem in the most efficient way possible.

EXPLAIN SELECT * FROM posts;

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on posts  (cost=0.00..53.84 rows=281 width=704)
(1 row)

In Rails

If you don’t feel like dropping down to the database you can get explain statements from ActiveRecord::Relations by calling the method explain. Calling the following command will give you the same result you’ll see from the Postgres console:

Post.all.explain

How to interpret the output

Now we have the query plan. What is it telling us? Query plans are read bottom to top and display the following information for each step of the query:

estimated statement execution cost: an arbitrary unit of measuring cost
estimated number of rows: the number of rows the query will produce
estimated width of rows: the size of the rows in bytes

For a more thorough explanation of how to analyze these query plans’ costs and number of rows check out the Postgres docs on using explain.

What’s important to our question about indexes is the type of query being performed and whether an index is being used or not. If we see a step containing the words Index Scan we’re using the index.

A side note about query costs

As noted in the Postgres docs, the query plans are based on a sample of statistics about the tables in the query. These statistics are kept up to date by Postgres when your database is VACUUMed. As a result, even when the underlying data is identical, you may see slightly different values for the same query run at different times.

An Example

It doesn’t take much digging to find interesting results in the query plans. Let’s use a database with a posts table containing a couple hundred post records. id is the primary key and by default has an index. Let’s see what the query looks like to get the first 10 posts.

EXPLAIN SELECT * FROM POSTS WHERE id < 10;

Our plan is:

 Bitmap Heap Scan on posts  (cost=2.07..16.39 rows=9 width=704)
   Recheck Cond: (id < 10)
   ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.07 rows=9 width=0)
         Index Cond: (id < 10)
(4 rows)

This query is using the index because we see Index Scan in the plan. The meaning of the Recheck Cond: statement is beyond the scope of this article, but the postgres-performance mailing list has a helpful explanation.

Now, let’s see what it looks like to get the rest of the posts in the table.

EXPLAIN SELECT * FROM POSTS WHERE id > 10;

Our plan is:

 Seq Scan on posts  (cost=0.00..53.98 rows=272 width=704)
   Filter: (id > 10)
(2 rows)

This query is going to perform a sequential scan (Seq Scan) and not use the index. A sequential scan? This plan means that Postgres is going to read the whole table to find what we’re looking for. That approach seems inefficient given the index we already have. What’s going on here? To answer this question it’s helpful to think about what an index is. An index is a specialized representation in memory of the contents of a particular column (or multiple columns).

How indexes are used

As we saw above, running a couple of queries on our posts table reveals that even given an index to use, Postgres will not always choose to use it. The reason why this is the case is that indexes have a cost to create and maintain (on writes) and use (on reads).

When an index is used in a SELECT query, first the position of the requested rows is fetched from the index (instead of from the table directly). Then, the requested rows (a subset of the total) are actually fetched from the table. It’s a two step process. Given a large enough number of requested rows (such as all the posts with an id greater than 10), it is actually more efficient to go fetch all those rows from the table directly rather than to go through the extra step of asking the index for the position of those rows and next fetch those specific rows from the table. If the percentage of the rows is smaller than 5-10% of all rows in the table the benefit of using the information stored in the index outweighs the additional intermediate step.

What’s next

Hear more about Postgres indexes, caching, and performance from Harold Giménez, head of Heroku Postgres, on A Beautiful Thing, an episode of the Giant Robots Smashing Into Other Giants podcast. Or, watch the Improving Rails Performance screencast from Joe Ferris, our CTO.

Sign up to receive a weekly recap from thoughtbot

About thoughtbot