We were analyzing slow responses on a project recently and found ourselves
questioning the performance of some PostgreSQL queries. As is typical when
digging into queries, we had some questions around what the query was doing. So,
we used our handy EXPLAIN
command to shed some light on the database’s
behavior. Upon inspecting the query it turned out an index we had created was
not being used. We wanted to know why.
EXPLAIN explained
Let’s first look at how the EXPLAIN
command works.
The Postgres docs
have a helpful article for learning about EXPLAIN
. The basics are that in your
psql
console prepend the word EXPLAIN
in front of your query and you’ll get
a query plan.
In SQL
Run psql <name of your database>
, or rails dbconsole
if you are inside of a
Rails project. Prepend EXPLAIN
to your query to see an explanation of how
Postgres is going to break down your problem in the most efficient way possible.
EXPLAIN SELECT * FROM posts;
QUERY PLAN
----------------------------------------------------------
Seq Scan on posts (cost=0.00..53.84 rows=281 width=704)
(1 row)
In Rails
If you don’t feel like dropping down to the database you can get explain
statements from ActiveRecord::Relation
s by calling the method explain
.
Calling the following command will give you the same result you’ll see from the
Postgres console:
Post.all.explain
How to interpret the output
Now we have the query plan. What is it telling us? Query plans are read bottom to top and display the following information for each step of the query:
- estimated statement execution cost: an arbitrary unit of measuring cost
- estimated number of rows: the number of rows the query will produce
- estimated width of rows: the size of the rows in bytes
For a more thorough explanation of how to analyze these query plans’ costs and number of rows check out the Postgres docs on using explain.
What’s important to our question about indexes is the type of query being
performed and whether an index is being used or not. If we see a step containing
the words Index Scan
we’re using the index.
A side note about query costs
As noted in the Postgres docs, the query plans are based on a sample of
statistics about the tables in the query. These statistics are kept up to date
by Postgres when your database is
VACUUM
ed. As a
result, even when the underlying data is identical, you may see slightly
different values for the same
query run at different times.
An Example
It doesn’t take much digging to find interesting results in the query plans.
Let’s use a database with a posts
table containing a couple hundred post
records. id
is the primary key and by default has an index.
Let’s see what the query looks like to get the first 10 posts.
EXPLAIN SELECT * FROM POSTS WHERE id < 10;
Our plan is:
Bitmap Heap Scan on posts (cost=2.07..16.39 rows=9 width=704)
Recheck Cond: (id < 10)
-> Bitmap Index Scan on posts_pkey (cost=0.00..2.07 rows=9 width=0)
Index Cond: (id < 10)
(4 rows)
This query is using the index because we see Index Scan
in the plan. The
meaning of the Recheck Cond:
statement is beyond the scope of this article,
but the postgres-performance mailing
list
has a helpful explanation.
Now, let’s see what it looks like to get the rest of the posts in the table.
EXPLAIN SELECT * FROM POSTS WHERE id > 10;
Our plan is:
Seq Scan on posts (cost=0.00..53.98 rows=272 width=704)
Filter: (id > 10)
(2 rows)
This query is going to perform a sequential scan (Seq Scan
) and not use the
index. A sequential scan? This plan means that Postgres is going to read the
whole table to find what we’re looking for. That approach seems inefficient
given the index we already have. What’s going on here? To answer this question
it’s helpful to think about what an index is. An index is a specialized
representation in memory of the contents of a particular column (or multiple
columns).
How indexes are used
As we saw above, running a couple of queries on our posts
table reveals that
even given an index to use, Postgres will not always choose to use it. The
reason why this is the case is that indexes have a cost to create and maintain
(on writes) and use (on reads).
When an index is used in a SELECT
query, first the position of the requested
rows is fetched from the index (instead of from the table directly). Then, the
requested rows (a subset of the total) are actually fetched from the table. It’s
a two step process. Given a large enough number of requested rows (such as all
the posts with an id greater than 10), it is actually more efficient to go fetch
all those rows from the table directly rather than to go through the extra step
of asking the index for the position of those rows and next fetch those specific
rows from the table. If the percentage of the rows is smaller than 5-10% of all
rows in the table the benefit of using the information stored in the index
outweighs the additional intermediate step.
What’s next
Hear more about Postgres indexes, caching, and performance from Harold Giménez, head of Heroku Postgres, on A Beautiful Thing, an episode of the Giant Robots Smashing Into Other Giants podcast. Or, watch the Improving Rails Performance screencast from Joe Ferris, our CTO.