When I first had to use
GROUP BY, it was anything but intuitive. It was only after I learned the mental model below that it started to make sense.
Let’s say we want to know the number of languages spoken in each country in our database. This is a perfect case for grouping. The
countrylanguage table has one row per language per country so we should be able to count them up. According to this, there are twelve languages spoken in the USA:
SELECT * FROM countrylanguage WHERE countrycode = 'USA';
But how do we show that as a number?
A lot of people (including a past-me) think of
GROUP BY like a parent-child sort of structure, where you group by some key (like
countrycode) and all other attributes that related to that key are nested inside.
In the last post, I stressed that SQL actions like
SELECT act on a single table. There’s a corollary for rows: every table row has one and only one value per column. That means the above table is not possible to represent in SQL.
If you ran the query at the start of the post, you’ll know that the data that we have before we group looks more like this:
Grouping in SQL is closer to a process of sorting, then squishing or flattening. SQL calls this flattening aggregation. If every table has one and only one value per column then we need tell our query how to turn those multiple values into one.
Aggregation can take many forms. We’re about to simply count up the number of rows, but you could also:
- Join each word together with commas:
- Get the biggest number: biggest of
- Get an average of numbers: average of
There are lots of ways to combine rows together based on your needs.
Before we start grouping we need to decide which columns to flatten, and which columns to
GROUP BY. Both types of columns go in the
SELECT clause. For this query, we’re going to use
The grouped columns will, by definition, be the same value so we don’t have to decide how to combine (aggregate) them. Multiple
countrycode values will all merge together into the same value.
It’s worth noting that you can also group by multiple columns. An example would be grouping people by current country and language spoken. E.g. you don’t want everyone from Andorra in one group. Instead you want everyone who can speak Spanish, Catalan, and Portuguese from Andorra in distinct groups.
The remaining, ungrouped columns will need to use an aggregate function so they know how to be merged together. No other columns can be included in the
SELECT clause. If there are extra columns you’ll see this error message:
column “language” must appear in the GROUP BY clause or be used in an aggregate function
In our example we want to group everything by the
countrycode and aggregate the number of languages. Our query looks like this:
SELECT countrycode, language -- this will need some kind of aggregate function FROM countrylanguage GROUP BY countrycode;
For our query, we want to know the number of languages per country, so we will use the
SELECT countrycode, COUNT(language) FROM countrylanguage GROUP BY countrycode;
Even with all this explanation,
GROUP BY can remain a bit baffling. Let’s walk through it one more time step-by-step:
- Start with the full, ungrouped table.
- Sort all the rows together based on the
GROUP BYcolumns. So in our example: move all the
USArows together, and then all the
TUVrows, and so on.
- For each group
- All of the columns specified in the
GROUP BYclause are already the same so they simply collapse down into that value.
- For the other columns, they get combined based on the aggregate function. In our case, we
COUNTeach row that exists and display the final number.
- All of the columns specified in the
GROUP BY is such a useful feature of SQL, but it’s not always intuitive the first few times you use it. My hope is that these mental models will serve you in your SQL travels. Good luck!