---
title: Querying the oldest and newest records within an SQL group
teaser: 'How to find the oldest or most recent record associated with a particular
  foreign key.

  '
tags: postgresql,sql,databases,rails
author: Jack Wilson
published_on: 2017-09-18
---

A common query that crops up in web apps is finding the oldest or the newest
record in a single database table. This is straightforward in SQL.

You can even write the relevant part of the query without knowing anything about the table
(other than the fact that it has a timestamp column called `created_at`):

```sql
ORDER BY created_at LIMIT 1
```

## SQL select oldest or newest records relative to another field

Let's start with the following simple DB schema:

- Users: [ `id`, `username` ]
- Orders: [ `id`, `user_id`, `quantity`, `price_in_cents`, `created_at` ]

And let's say you need to display a list of users next to the date of their most
recent order. [GROUP BY] exists to solve this problem:

```sql
SELECT
  user_id, MAX(created_at)
FROM
  orders
GROUP BY
  user_id
```

What if you wanted to display the user's `name` instead of the `id`? This adds a
bit of complexity, but with a [JOIN] and a sub-select, the query remains
pretty straightforward:

```sql
SELECT
  users.username, latest_orders.created_at
FROM
  (SELECT
     user_id, MAX(created_at) AS created_at
   FROM
     orders
   GROUP BY
     user_id) AS latest_orders
INNER JOIN
  users
ON
  users.id = latest_orders.user_id
```

Now let's say you actually need the entire order record. Maybe you need to
display the price and quantity or to perform a calculation on
them. This can be achieved by joining the result of the sub-select back to the
orders table itself.

```sql
SELECT
  orders.*
FROM
  (SELECT
     user_id, MAX(created_at) AS created_at
   FROM
     orders
   GROUP BY
     user_id) AS latest_orders
INNER JOIN
  orders
ON
  orders.user_id = latest_orders.user_id AND
  orders.created_at = latest_orders.created_at
;
```

Alright, this is slightly less intuitive. The key idea is that you select only
the orders that match the `user_id` _and_ the `created_at` timestamp of the
`latest_orders` sub-selected table. If you accept that the sub-select will
return only the subset of `user_id`s paired with their most recent order this
becomes relatively clear.

One caveat: there's a slight issue with using a datetime as an identifier here.
What if two orders have the same datetime (specifically the most recent)?
In the case of `users` making `orders`, it's pretty unlikely short of a double-submit
bug, but depending on your use-case for the table in question and the scale of
your application, it is possible. The upshot is that this query would include
2 entries for that user. This is likely a problem. A small one if you are
displaying the latest orders to the admins of your application (they see two
orders for one user). A larger one if you are doing something user-facing. In this situation, you'd want to pick one of the two by adding a tie-breaker to the query. The order with the highest ID is likely to be the most recent. Adding a `WHERE` clause to our query gives us a fallback for this unlikely edge case.



```sql
SELECT
  orders.*
FROM
  (SELECT
     user_id, MAX(created_at) AS created_at
   FROM
     orders
   GROUP BY
     user_id) AS latest_orders
INNER JOIN
  orders
ON
  orders.user_id = latest_orders.user_id AND
  orders.created_at = latest_orders.created_at
WHERE orders.id = (
  SELECT MAX(id)
  FROM orders AS o2
  WHERE o2.user_id = orders.user_id
    AND o2.created_at = orders.created_at
  );
```

This strategy can be expanded for any number of similar situations, even those
that require grouping by several fields. Let's look at another, slightly more
complicated example.

Consider this DB schema:

- Users: [ `id`, `username` ... ]
- Companies: [ `id`, `company_name` ... ]
- StockHoldings: [ `id`, `user_id`, `company_id`, `quantity`, `purchase_price` ]
- BondHoldings: [ `id`, `user_id`, `company_id`, `purchase_price`, `par_value`,
  `interest_rate` ]
- HoldingValueStamps: [ `user_id`, `holding_id`, `holding_type`, `value`,
  `created_at` ]

Here `StockHolding` and `BondHolding` represent ownership of stock and bonds. If
you bought 10 shares of AAPL, we would create a `StockHolding` for you with
`company_id: 'AAPL', quantity: 10`. The value of those AAPL shares would
fluctuate over time. `HoldingValueStamp` will be how we track their value.

Every minute, we'll create a `HoldingValueStamp` with the current value of your
10 shares of AAPL. These stamps are created to keep an historical record of the
value of our user's holdings. Having these stamps would also allow us to show
our users a trend-line of their portfolio's value overtime.

## SQL sub-select multiple columns with INNER JOIN

We also need to be able to tell the user the current value of their portfolio.
To do this we need the most recent `HoldingValueStamp`s for each holding (at
this point only `StockHoldings` and `BondHoldings`) that belongs to the user.

```sql
SELECT
  holding_value_stamps.*
FROM
  (SELECT
     holding_id,
     holding_type,
     MAX(created_at) as created_at
   FROM
     holding_value_stamps
   GROUP BY
     holding_id, holding_type) AS most_recent_stamps
INNER JOIN
  holding_value_stamps
ON
  holding_value_stamps.created_at = most_recent_stamps.created_at
  holding_value_stamps.holding_id = most_recent_stamps.holding_id AND
  holding_value_stamps.holding_type = most_recent_stamps.holding_type
WHERE
  holding_value_stamps.user_id = #{current_user.id}
```

This example is very similar to the sub-select above but specific to a certain
user.

Here, the sub-select finds the most recent `created_at` timestamp per holding
(`holding_id` and `holding_type` together ensure a unique `StockHolding` or
`BondHolding`). Unfortunately, we can't select the `value` field from this query
because we don't want to group by it or aggregate it.

We can, however, match the `created_at` timestamps from `most_recent_timestamps`
to the full row of `holding_value_stamps`. It's very important that we include
all of the fields we used in the `GROUP BY` as well as the selected `created_at`
timestamp in the `ON` clause. If we don't, we may wind up
with the wrong data (due to some `BondHoldings` or `StockHoldings` having
`HoldingValueStamps` with identical `created_at` timestamps).

Finally, the `WHERE` clause ensures that we don't select any that don't
belong to the current user.

## SQL sub-select multiple columns with OUTER JOIN

Another way to achieve the same result:

```sql
SELECT
  holding_value_stamps1.*
FROM
  holding_value_stamps AS holding_value_stamps1
LEFT OUTER JOIN
  holding_value_stamps AS holding_value_stamps2
ON
  holding_value_stamps1.holding_id = holding_value_stamps2.holding_id AND
  holding_value_stamps1.holding_type = holding_value_stamps2.holding_type AND
  holding_value_stamps1.created_at < holding_value_stamps2.created_at
WHERE
  holding_value_stamps1.user_id = #{current_user.id} AND
  holding_value_stamps2.id IS NULL
```

Here we ensure that we are not selecting any rows from `holding_value_stamps1`
when a row in `holding_value_stamps2` (the same table) exists with a more recent
`created_at`.

This solution has the same issue when multiple records have the same
`created_at` that we discussed earlier, both of those records would find their
way into the result. This is because there are no other records to satisfy the
`ON` clause for the `OUTER JOIN` for either of the two records, i.e. there is
no record with a `created_at` that is greater than their own.

Therefore, they
would not join to any actual data from `holding_value_stamps2`, and
`holding_value_stamps2.id` would be `NULL` (which in turn causes them to be
included in the result set per the `holding_value_stamps2.id IS NULL` in the
`WHERE` clause).

One way that this issue can be dealt with is favoring the stamp with the smaller
id:

```sql
SELECT
  holding_value_stamps1.*
FROM
  holding_value_stamps AS holding_value_stamps1
LEFT OUTER JOIN
  holding_value_stamps AS holding_value_stamps2
ON
  holding_value_stamps1.holding_id = holding_value_stamps2.holding_id AND
  holding_value_stamps1.holding_type = holding_value_stamps2.holding_type AND
  (
    holding_value_stamps1.created_at < holding_value_stamps2.created_at OR
    (
      holding_valuation_stamps1.created_at = holding_valuation_stamps2.created_at AND
      holding_valuation_stamps.id < h2.id
    )
  )
WHERE
  holding_value_stamps1.user_id = #{current_user.id} AND
  holding_value_stamps2.id IS NULL
```

## Filtering and Ordering records with Rails

If performance isn't a concern, there are other, possibly-clearer ways to handle
this kind of filtering and ordering in Ruby:

```ruby
HoldingValueStamp.where(user_id: current_user.id).
  group_by { |stamp| [stamp.holding_id, stamp.holding_type] }.
  map { |_holding, stamps| stamps.max(&:created_at) }
```

That said, the database level solutions in this post have been performant in
real production environments.

[GROUP BY]: https://thoughtbot.com/blog/understanding-sql-group-by
[JOIN]: https://thoughtbot.com/blog/understanding-sql-join
