This is part 3 of a series on SQL basics. Read part 1 and part 2 for context on
SELECT
,WHERE
, andJOIN
and how to set up a test database to follow along with these examples.
When I first had to use GROUP BY
, it was anything but intuitive. It was only after I learned the mental model below that it started to make sense.
Do You Speak SQL?
Let’s say we want to know the number of languages spoken in each country in our database. This is a perfect case for grouping. The countrylanguage
table has one row per language per country so we should be able to count them up. According to this, there are twelve languages spoken in the USA:
SELECT * FROM countrylanguage
WHERE countrycode = 'USA';
But how do we show that as a number?
There Can Be Only One
A lot of people (including a past-me) think of GROUP BY
like a parent-child sort of structure, where you group by some key (like countrycode
) and all other attributes that related to that key are nested inside.
In the last post, I stressed that SQL actions like WHERE
and SELECT
act on a single table. There’s a corollary for rows: every table row has one and only one value per column. That means the above table is not possible to represent in SQL.
If you ran the query at the start of the post, you’ll know that the data that we have before we group looks more like this:
Grouping in SQL is closer to a process of sorting, then squishing or flattening. SQL calls this flattening aggregation. If every table has one and only one value per column then we need tell our query how to turn those multiple values into one.
Aggregation can take many forms. We’re about to simply count up the number of rows, but you could also:
- Join each word together with commas:
English,Spanish,French
- Get the biggest number: biggest of
1
,6
,2
=6
- Get an average of numbers: average of
1
,6
,2
=3
There are lots of ways to combine rows together based on your needs.
Picking Favorites
Before we start grouping we need to decide which columns to flatten, and which columns to GROUP BY
. Both types of columns go in the SELECT
clause. For this query, we’re going to use countrycode
and language
.
The grouped columns will, by definition, be the same value so we don’t have to decide how to combine (aggregate) them. Multiple countrycode
values will all merge together into the same value.
It’s worth noting that you can also group by multiple columns. An example would be grouping people by current country and language spoken. E.g. you don’t want everyone from Andorra in one group. Instead you want everyone who can speak Spanish, Catalan, and Portuguese from Andorra in distinct groups.
The remaining, ungrouped columns will need to use an aggregate function so they know how to be merged together. No other columns can be included in the SELECT
clause. If there are extra columns you’ll see this error message:
column “language” must appear in the GROUP BY clause or be used in an aggregate function
In our example we want to group everything by the countrycode
and aggregate the number of languages. Our query looks like this:
SELECT
countrycode,
language -- this will need some kind of aggregate function
FROM countrylanguage
GROUP BY countrycode;
For our query, we want to know the number of languages per country, so we will use the COUNT
function.
SELECT
countrycode,
COUNT(language)
FROM countrylanguage
GROUP BY countrycode;
Even with all this explanation, GROUP BY
can remain a bit baffling. Let’s walk through it one more time step-by-step:
- Start with the full, ungrouped table.
- Sort all the rows together based on the
GROUP BY
columns. So in our example: move all theUSA
rows together, and then all theTUV
rows, and so on. - For each group
- All of the columns specified in the
GROUP BY
clause are already the same so they simply collapse down into that value. - For the other columns, they get combined based on the aggregate function. In our case, we
COUNT
each row that exists and display the final number.
- All of the columns specified in the
GROUP BY
is such a useful feature of SQL, but it’s not always intuitive the first few times you use it. My hope is that these mental models will serve you in your SQL travels. Good luck!