A common query that crops up in web apps is finding the oldest or the newest record in a single database table. This is straightforward in SQL.
You can even write the relevant part of the query without knowing anything about the table
(other than the fact that it has a timestamp column called created_at
):
ORDER BY created_at LIMIT 1
SQL select oldest or newest records relative to another field
Let’s start with the following simple DB schema:
- Users: [
id
,username
] - Orders: [
id
,user_id
,quantity
,price_in_cents
,created_at
]
And let’s say you need to display a list of users next to the date of their most recent order. GROUP BY exists to solve this problem:
SELECT
user_id, MAX(created_at)
FROM
orders
GROUP BY
user_id
What if you wanted to display the user’s name
instead of the id
? This adds a
bit of complexity, but with a JOIN and a sub-select, the query remains
pretty straightforward:
SELECT
users.username, latest_orders.created_at
FROM
(SELECT
user_id, MAX(created_at) AS created_at
FROM
orders
GROUP BY
user_id) AS latest_orders
INNER JOIN
users
ON
users.id = latest_orders.user_id
Now let’s say you actually need the entire order record. Maybe you need to display the price and quantity or to perform a calculation on them. This can be achieved by joining the result of the sub-select back to the orders table itself.
SELECT
orders.*
FROM
(SELECT
user_id, MAX(created_at) AS created_at
FROM
orders
GROUP BY
user_id) AS latest_orders
INNER JOIN
orders
ON
orders.user_id = latest_orders.user_id AND
orders.created_at = latest_orders.created_at
Alright, this is slightly less intuitive. The key idea is that you select only
the orders that match the user_id
and the created_at
timestamp of the
latest_orders
sub-selected table. If you accept that the sub-select will
return only the subset of user_id
s paired with their most recent order this
becomes relatively clear.
One caveat: there’s a slight issue with using a datetime as an identifier here.
What if two orders have the same datetime (specifically the most recent)?
In the case of users
making orders
, it’s pretty unlikely short of a double-submit
bug, but depending on your use-case for the table in question and the scale of
your application, it is possible. The upshot is that this query would include
2 entries for that user. This is likely a problem. A small one if you are
displaying the latest orders to the admins of your application (they see two
orders for one user). A larger one if you are doing something user-facing. In
this situation you’d really just want to pick one of the two and adding a
DISTINCT
clause on user_id
in the above example would do just that.
This strategy can be expanded for any number of similar situations, even those that require grouping by several fields. Let’s look at another, slightly more complicated example.
Consider this DB schema:
- Users: [
id
,username
… ] - Companies: [
id
,company_name
… ] - StockHoldings: [
id
,user_id
,company_id
,quantity
,purchase_price
] - BondHoldings: [
id
,user_id
,company_id
,purchase_price
,par_value
,interest_rate
] - HoldingValueStamps: [
user_id
,holding_id
,holding_type
,value
,created_at
]
Here StockHolding
and BondHolding
represent ownership of stock and bonds. If
you bought 10 shares of AAPL, we would create a StockHolding
for you with
company_id: 'AAPL', quantity: 10
. The value of those AAPL shares would
fluctuate over time. HoldingValueStamp
will be how we track their value.
Every minute, we’ll create a HoldingValueStamp
with the current value of your
10 shares of AAPL. These stamps are created to keep an historical record of the
value of our user’s holdings. Having these stamps would also allow us to show
our users a trend-line of their portfolio’s value overtime.
SQL sub-select multiple columns with INNER JOIN
We also need to be able to tell the user the current value of their portfolio.
To do this we need the most recent HoldingValueStamp
s for each holding (at
this point only StockHoldings
and BondHoldings
) that belongs to the user.
SELECT
holding_value_stamps.*
FROM
(SELECT
holding_id,
holding_type,
MAX(created_at) as created_at
FROM
holding_value_stamps
GROUP BY
holding_id, holding_type) AS most_recent_stamps
INNER JOIN
holding_value_stamps
ON
holding_value_stamps.created_at = most_recent_stamps.created_at
holding_value_stamps.holding_id = most_recent_stamps.holding_id AND
holding_value_stamps.holding_type = most_recent_stamps.holding_type
WHERE
holding_value_stamps.user_id = #{current_user.id}
This example is very similar to the sub-select above but specific to a certain user.
Here, the sub-select finds the most recent created_at
timestamp per holding
(holding_id
and holding_type
together ensure a unique StockHolding
or
BondHolding
). Unfortunately, we can’t select the value
field from this query
because we don’t want to group by it or aggregate it.
We can, however, match the created_at
timestamps from most_recent_timestamps
to the full row of holding_value_stamps
. It’s very important that we include
all of the fields we used in the GROUP BY
as well as the selected created_at
timestamp in the ON
clause. If we don’t, we may wind up
with the wrong data (due to some BondHoldings
or StockHoldings
having
HoldingValueStamps
with identical created_at
timestamps).
Finally, the WHERE
clause ensures that we don’t select any that don’t
belong to the current user.
SQL sub-select multiple columns with OUTER JOIN
Another way to achieve the same result:
SELECT
holding_value_stamps1.*
FROM
holding_value_stamps AS holding_value_stamps1
LEFT OUTER JOIN
holding_value_stamps AS holding_value_stamps2
ON
holding_value_stamps1.holding_id = holding_value_stamps2.holding_id AND
holding_value_stamps1.holding_type = holding_value_stamps2.holding_type AND
holding_value_stamps1.created_at < holding_value_stamps2.created_at
WHERE
holding_value_stamps1.user_id = #{current_user.id} AND
holding_value_stamps2.id IS NULL
Here we ensure that we are not selecting any rows from holding_value_stamps1
when a row in holding_value_stamps2
(the same table) exists with a more recent
created_at
.
This solution has the same issue when multiple records have the same
created_at
that we discussed earlier, both of those records would find their
way into the result. This is because there are no other records to satisfy the
ON
clause for the OUTER JOIN
for either of the two records, i.e. there is
no record with a created_at
that is greater than their own.
Therefore, they
would not join to any actual data from holding_value_stamps2
, and
holding_value_stamps2.id
would be NULL
(which in turn causes them to be
included in the result set per the holding_value_stamps2.id IS NULL
in the
WHERE
clause).
One way that this issue can be dealt with is favoring the stamp with the smaller id:
SELECT
holding_value_stamps1.*
FROM
holding_value_stamps AS holding_value_stamps1
LEFT OUTER JOIN
holding_value_stamps AS holding_value_stamps2
ON
holding_value_stamps1.holding_id = holding_value_stamps2.holding_id AND
holding_value_stamps1.holding_type = holding_value_stamps2.holding_type AND
(
holding_value_stamps1.created_at < holding_value_stamps2.created_at OR
(
holding_valuation_stamps1.created_at = holding_valuation_stamps2.created_at AND
holding_valuation_stamps.id < h2.id
)
)
WHERE
holding_value_stamps1.user_id = #{current_user.id} AND
holding_value_stamps2.id IS NULL
Filtering and Ordering records with Rails
If performance isn’t a concern, there are other, possibly-clearer ways to handle this kind of filtering and ordering in Ruby:
HoldingValueStamp.where(user_id: current_user.id).
group_by { |stamp| [stamp.holding_id, stamp.holding_type] }.
map { |_holding, stamps| stamps.max(&:created_at) }
That said, the database level solutions in this post have been performant in real production environments.