We recently ran into a case where a join was getting out of hand and we were struggling with how to rein in the size of it. Then we found window functions. Window functions (Oracle calls them analytic functions) are a part of the SQL standard and this post will explore how to use them in Postgres. Let’s see how they work and what kind of problems they can help us solve.
Our problem
Let’s say we have a blogging platform where we have posts
that have many
comments
. Here’s our setup:
CREATE TABLE posts (
id integer PRIMARY KEY,
body varchar,
created_at timestamp DEFAULT current_timestamp
);
CREATE TABLE comments (
id INTEGER PRIMARY KEY,
post_id integer NOT NULL,
body varchar,
created_at timestamp DEFAULT current_timestamp
);
/* make two posts */
INSERT INTO posts VALUES (1, 'foo');
INSERT INTO posts VALUES (2, 'bar');
/* make 4 comments for the first post */
INSERT INTO comments VALUES (1, 1, 'foo old');
INSERT INTO comments VALUES (2, 1, 'foo new');
INSERT INTO comments VALUES (3, 1, 'foo newer');
INSERT INTO comments VALUES (4, 1, 'foo newest');
/* make 4 comments for the second post */
INSERT INTO comments VALUES (5, 2, 'bar old');
INSERT INTO comments VALUES (6, 2, 'bar new');
INSERT INTO comments VALUES (7, 2, 'bar newer');
INSERT INTO comments VALUES (8, 2, 'bar newest');
Here’s a SQL Fiddle with our example if you’d like to query the data.
Our Goal
We want to get each post
‘s three most recent comment
s. We could start by
writing a query such as the one below to pull all the posts and comments. We
could then filter this result in whatever application language we’re using
(Ruby, PHP, Go, whatever) on top
of our database:
SELECT posts.id AS post_id, comments.id AS comment_ids, comments.body AS body
FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id;
resulting in:
| POST_ID | COMMENT_IDS | BODY |
|---------|-------------|------------|
| 1 | 1 | foo old |
| 1 | 2 | foo new |
| 1 | 3 | foo newer |
| 1 | 4 | foo newest |
| 2 | 5 | bar old |
| 2 | 6 | bar new |
| 2 | 7 | bar newer |
| 2 | 8 | bar newest |
This solution may work for awhile but what happens when we have 10’s of thousands of posts that each have 10’s of thousands of comments (it’s a really popular blog)? That resulting join table just got really big.
We want to see just these results without requiring any further filtering:
| POST_ID | COMMENT_ID | BODY |
|---------|------------|------------|
| 1 | 4 | foo newest |
| 1 | 3 | foo newer |
| 1 | 2 | foo new |
| 2 | 8 | bar newest |
| 2 | 7 | bar newer |
| 2 | 6 | bar new |
To get this subset we need to some how rank or order the comments for each post and then limit the set. How do we go about this more targeted task in SQL? Enter window functions.
What are window functions
Window functions are a tool to perform advanced sorting and limiting on a subset
of a joined table of data (hence, the name window
). We’ll look at one
particular function, dense_rank()
, but all built-in (sum
, for example) and
user-defined aggregate functions can act as window functions by calling the
OVER
keyword. Some other popular functions include row_number()
, rank()
,
and percent_rank()
. A complete list of available window functions can be found
here.
Back to our problem
How can we use this new knowledge to write our query? We want the three most
recent comments so we know we’ll need to do some sort of sorting by created_at
and then limit the number of results.
Here’s how we can get the ranking information we need using dense_rank()
:
SELECT posts.id AS post_id, comments.id AS comment_id, comments.body AS body,
dense_rank() OVER (
PARTITION BY post_id
ORDER BY comments.created_at DESC
) AS comment_rank
FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id;
Resulting in:
| POST_ID | COMMENT_ID | BODY | COMMENT_RANK |
|---------|------------|------------|--------------|
| 1 | 4 | foo newest | 1 |
| 1 | 3 | foo newer | 2 |
| 1 | 2 | foo new | 3 |
| 1 | 1 | foo old | 4 |
| 2 | 8 | bar newest | 1 |
| 2 | 7 | bar newer | 2 |
| 2 | 6 | bar new | 3 |
| 2 | 5 | bar old | 4 |
We have the same query from before but we’ve added an additional field in our
SELECT
statement:
dense_rank() OVER (
PARTITION BY post_id
ORDER BY comments.created_at DESC
) AS comment_rank
OVER
is the keyword that triggers the use of the window function. Inside the
OVER
statement we’re saying:
- rank the comments by
created_at
(ORDER BY comments.created_at
) and - scope the ranking to each post (
PARTITION BY post_id
)
This result is still returning the full set but now we have the added
comment_rank
field showing us the comments for each post in order of age.
Given this ranking, now we need to fetch just those comments that have a
comment_rank
less than 4 (since we want only the three most recent comments).
We can achieve this goal by using our new ranking query as a sub-select
statement like so:
SELECT comment_id, post_id, body FROM (
SELECT posts.id AS post_id, comments.id AS comment_id, comments.body AS body,
dense_rank() OVER (
PARTITION BY post_id
ORDER BY comments.created_at DESC
) AS comment_rank
FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id
) AS ranked_comments
WHERE comment_rank < 4;
resulting in:
| POST_ID | COMMENT_ID | BODY |
|---------|------------|------------|
| 1 | 4 | foo newest |
| 1 | 3 | foo newer |
| 1 | 2 | foo new |
| 2 | 8 | bar newest |
| 2 | 7 | bar newer |
| 2 | 6 | bar new |
And that’s our answer. Our query has become pretty hard to read though.
Refactoring using Common Table Expressions
We can gain back some of the readability by using a Common Table Expression (or CTE). Our query would become:
WITH ranked_comments AS (
SELECT posts.id AS post_id, comments.id AS comment_id, comments.body AS body,
dense_rank() OVER (
PARTITION BY post_id
ORDER BY comments.created_at DESC
) AS comment_rank
FROM posts LEFT OUTER JOIN comments ON posts.id = comments.post_id
)
SELECT
post_id,
comment_id,
body
FROM ranked_comments
WHERE comment_rank < 4;
Window functions introduce some additional syntax but are a really powerful tool when we need more complex sorting or limiting behavior.
Want to learn more
For some more help on the syntax and options for using these functions check out
these
docs
or this guide. There are some
gotchas around dealing with NULL
values and how ordering works so be sure to
read through this
section
and test your code. Happy querying.