Understanding SQL JOIN

This is part 2 of a series on SQL basics. Read part 1 for context on SELECT and WHERE and how to set up a test database to follow along with these examples. Part 3 will cover GROUP BY.

Table for One, Please

In the last post we talked about the following query:

SELECT name, countrycode
FROM city
WHERE population < 1000;

We specifically focused on WHERE and SELECT. The one part of the query not mentioned was FROM. It (along with the JOIN clause) specifies the table that our data comes from. This is the data that SELECT and WHERE use to do their filtering.

One thing I want to emphasize here is that SQL actions like WHERE and SELECT act on a single table. Even when we talk about combining tables together, it does just that: combines multiple tables into one.

JOIN Together

As an example, let’s query for all the cities that exist in a country that has a long life expectancy. We know how to get city data, but lifeexpectancy exists on the country table. How do we get them both together? The only way is to create one big table using JOIN.

SQL joins can be confusing when you first come across them. This is what the syntax looks like:

FROM city
JOIN country ON city.countrycode = country.code;

This creates a single, larger table (again, not multiple) containing all the columns from the city and country tables. But if it’s only one table, what do the rows look like? And how does it know which rows to use? We’ll address the second question first.

You may have noticed the ON part of the JOIN. Its job is to describe how the JOIN will combine rows from each table together. The left side of the equation must match the right side. To visualize the process for what JOIN and ON are doing, I picture the two separate tables side by side:

The city and country table A sample of some rows and columns from the city and country table showing ids, names, and country codes.

I start with either table, city for example. Then I look at the ON portion of the JOIN. It connects city.countrycode to country.code. So for each city row, I take the city.countrycode and find the matching country.code row in the country table. Then I take those two rows and add them together. They are now one, larger row.

The city and country table Rows from the city and country table being fused together by their country code, then combined into a single table.

Note for you DB experts out there: This is using a Hash join behind the scenes. It works by scanning the smaller table first (country), hashes each compared value (e.g. country.code) into a hash table, and then scans the larger table (city) to see if any of the values from the larger table (e.g. city.countrycode) can be found in the hash table. Hash joins are a great tool for optimizing queries but less useful for understanding JOIN.

This works from either direction. If I start from country and find matching rows from city Postgres is smart enough to figure out which table to look at and we’ll get the same number of results.

If there is no match when iterating through each row, it’s removed from the results. You can see that above in how Armenia isn’t present in the final table. JOIN also can duplicate data. Both Luanda and Huambo match the same country so it is present twice after the JOIN. This makes it possible to filter rows based on the new data.

JOIN is short for INNER JOIN, which is just one way of joining tables. There are a few different types of joins. They each change what happens when there is (or is not) a match found for a row. We won’t cover other types here because the important part is understanding how rows combine into one, larger table.

Important Life Questions

Once we have this big table, we can return to our original question. How many cities exist in a country that has a long life expectancy? Once we have all the data we need (a single table with both city and country data) we pare down the data a bit with WHERE and SELECT:

SELECT city.name, country.lifeexpectancy
FROM city
JOIN country ON country.code = city.countrycode
WHERE country.lifeexpectancy > 80;

To recap, the process looks like:

  • Create one big table with FROM and JOIN
  • Pair down the rows with WHERE
  • Pair down the columns with SELECT

I recommend playing around with JOIN more. Try joining in countrylanguage so you have all three tables together at once. For example, try to add the language spoken in the life expectancy query above.

Next we’ll talk about combining results with GROUP BY.