We were analyzing slow responses on a project recently and found ourselves
questioning the performance of some PostgreSQL queries. As is typical when
digging into queries, we had some questions around what the query was doing. So,
we used our handy
EXPLAIN command to shed some light on the database’s
behavior. Upon inspecting the query it turned out an index we had created was
not being used. We wanted to know why.
Let’s first look at how the
EXPLAIN command works.
The Postgres docs
have a helpful article for learning about
EXPLAIN. The basics are that in your
psql console prepend the word
EXPLAIN in front of your query and you’ll get
a query plan.
psql <name of your database>, or
rails dbconsole if you are inside of a
Rails project. Prepend
EXPLAIN to your query to see an explanation of how
Postgres is going to break down your problem in the most efficient way possible.
EXPLAIN SELECT * FROM posts;
QUERY PLAN ---------------------------------------------------------- Seq Scan on posts (cost=0.00..53.84 rows=281 width=704) (1 row)
If you don’t feel like dropping down to the database you can get explain
ActiveRecord::Relations by calling the method
Calling the following command will give you the same result you’ll see from the
How to interpret the output
Now we have the query plan. What is it telling us? Query plans are read bottom to top and display the following information for each step of the query:
- estimated statement execution cost: an arbitrary unit of measuring cost
- estimated number of rows: the number of rows the query will produce
- estimated width of rows: the size of the rows in bytes
For a more thorough explanation of how to analyze these query plans’ costs and number of rows check out the Postgres docs on using explain.
What’s important to our question about indexes is the type of query being
performed and whether an index is being used or not. If we see a step containing
Index Scan we’re using the index.
A side note about query costs
As noted in the Postgres docs, the query plans are based on a sample of
statistics about the tables in the query. These statistics are kept up to date
by Postgres when your database is
VACUUMed. As a
result, even when the underlying data is identical, you may see slightly
different values for the same
query run at different times.
It doesn’t take much digging to find interesting results in the query plans.
Let’s use a database with a
posts table containing a couple hundred post
id is the primary key and by default has an index.
Let’s see what the query looks like to get the first 10 posts.
EXPLAIN SELECT * FROM POSTS WHERE id < 10;
Our plan is:
Bitmap Heap Scan on posts (cost=2.07..16.39 rows=9 width=704) Recheck Cond: (id < 10) -> Bitmap Index Scan on posts_pkey (cost=0.00..2.07 rows=9 width=0) Index Cond: (id < 10) (4 rows)
This query is using the index because we see
Index Scan in the plan. The
meaning of the
Recheck Cond: statement is beyond the scope of this article,
but the postgres-performance mailing
has a helpful explanation.
Now, let’s see what it looks like to get the rest of the posts in the table.
EXPLAIN SELECT * FROM POSTS WHERE id > 10;
Our plan is:
Seq Scan on posts (cost=0.00..53.98 rows=272 width=704) Filter: (id > 10) (2 rows)
This query is going to perform a sequential scan (
Seq Scan) and not use the
index. A sequential scan? This plan means that Postgres is going to read the
whole table to find what we’re looking for. That approach seems inefficient
given the index we already have. What’s going on here? To answer this question
it’s helpful to think about what an index is. An index is a specialized
representation in memory of the contents of a particular column (or multiple
How indexes are used
As we saw above, running a couple of queries on our
posts table reveals that
even given an index to use, Postgres will not always choose to use it. The
reason why this is the case is that indexes have a cost to create and maintain
(on writes) and use (on reads).
When an index is used in a
SELECT query, first the position of the requested
rows is fetched from the index (instead of from the table directly). Then, the
requested rows (a subset of the total) are actually fetched from the table. It’s
a two step process. Given a large enough number of requested rows (such as all
the posts with an id greater than 10), it is actually more efficient to go fetch
all those rows from the table directly rather than to go through the extra step
of asking the index for the position of those rows and next fetch those specific
rows from the table. If the percentage of the rows is smaller than 5-10% of all
rows in the table the benefit of using the information stored in the index
outweighs the additional intermediate step.
Hear more about Postgres indexes, caching, and performance from Harold Giménez, head of Heroku Postgres, on A Beautiful Thing, an episode of the Giant Robots Smashing Into Other Giants podcast. Or, watch the Improving Rails Performance screencast from Joe Ferris, our CTO.