Understanding SQL GROUP BY

Edward Loveall

This is part 3 of a series on SQL basics. Read part 1 and part 2 for context on SELECT, WHERE, and JOIN and how to set up a test database to follow along with these examples.

When I first had to use GROUP BY, it was anything but intuitive. It was only after I learned the mental model below that it started to make sense.

Do You Speak SQL?

Let’s say we want to know the number of languages spoken in each country in our database. This is a perfect case for grouping. The countrylanguage table has one row per language per country so we should be able to count them up. According to this, there are twelve languages spoken in the USA:

SELECT * FROM countrylanguage
WHERE countrycode = 'USA';

But how do we show that as a number?

There Can Be Only One

A lot of people (including a past-me) think of GROUP BY like a parent-child sort of structure, where you group by some key (like countrycode) and all other attributes that related to that key are nested inside.

Missing data in a languages table A table with two columns listing country codes and a language spoken in that country. The country code is USA for the first row, then blank for the remaining.

In the last post, I stressed that SQL actions like WHERE and SELECT act on a single table. There’s a corollary for rows: every table row has one and only one value per column. That means the above table is not possible to represent in SQL.

If you ran the query at the start of the post, you’ll know that the data that we have before we group looks more like this:

Duplicated data in a languages table A table with two columns listing country codes and a language spoken in that country. The country code is USA for every row.

Grouping in SQL is closer to a process of sorting, then squishing or flattening. SQL calls this flattening aggregation. If every table has one and only one value per column then we need tell our query how to turn those multiple values into one.

Aggregation can take many forms. We’re about to simply count up the number of rows, but you could also:

  • Join each word together with commas: English,Spanish,French
  • Get the biggest number: biggest of 1, 6, 2 = 6
  • Get an average of numbers: average of 1, 6, 2 = 3

There are lots of ways to combine rows together based on your needs.

Picking Favorites

Before we start grouping we need to decide which columns to flatten, and which columns to GROUP BY. Both types of columns go in the SELECT clause. For this query, we’re going to use countrycode and language.

The grouped columns will, by definition, be the same value so we don’t have to decide how to combine (aggregate) them. Multiple countrycode values will all merge together into the same value.

It’s worth noting that you can also group by multiple columns. An example would be grouping people by current country and language spoken. E.g. you don’t want everyone from Andorra in one group. Instead you want everyone who can speak Spanish, Catalan, and Portuguese from Andorra in distinct groups.

The remaining, ungrouped columns will need to use an aggregate function so they know how to be merged together. No other columns can be included in the SELECT clause. If there are extra columns you’ll see this error message:

column “language” must appear in the GROUP BY clause or be used in an aggregate function

In our example we want to group everything by the countrycode and aggregate the number of languages. Our query looks like this:

SELECT
  countrycode,
  language -- this will need some kind of aggregate function
FROM countrylanguage
GROUP BY countrycode;

For our query, we want to know the number of languages per country, so we will use the COUNT function.

SELECT
  countrycode,
  COUNT(language)
FROM countrylanguage
GROUP BY countrycode;

Duplicated data in a languages table A table with two columns listing country codes and a language spoken in that country. The country code is USA for every row.

Even with all this explanation, GROUP BY can remain a bit baffling. Let’s walk through it one more time step-by-step:

  1. Start with the full, ungrouped table.
  2. Sort all the rows together based on the GROUP BY columns. So in our example: move all the USA rows together, and then all the TUV rows, and so on.
  3. For each group
    1. All of the columns specified in the GROUP BY clause are already the same so they simply collapse down into that value.
    2. For the other columns, they get combined based on the aggregate function. In our case, we COUNT each row that exists and display the final number.

GROUP BY is such a useful feature of SQL, but it’s not always intuitive the first few times you use it. My hope is that these mental models will serve you in your SQL travels. Good luck!