In the last post we talked about the following query:
SELECT name, countrycode
WHERE population < 1000;
We specifically focused on
SELECT. The one part of the query not mentioned was
FROM. It (along with the
JOIN clause) specifies the table that our data comes from. This is the data that
WHERE use to do their filtering.
One thing I want to emphasize here is that SQL actions like
SELECT act on a single table. Even when we talk about combining tables together, it does just that: combines multiple tables into one.
As an example, let’s query for all the cities that exist in a country that has a long life expectancy. We know how to get
city data, but
lifeexpectancy exists on the
country table. How do we get them both together? The only way is to create one big table using
SQL joins can be confusing when you first come across them. This is what the syntax looks like:
JOIN country ON city.countrycode = country.code;
This creates a single, larger table (again, not multiple) containing all the columns from the
country tables. But if it’s only one table, what do the rows look like? And how does it know which rows to use? We’ll address the second question first.
You may have noticed the
ON part of the
JOIN. Its job is to describe how the
JOIN will combine rows from each table together. The left side of the equation must match the right side. To visualize the process for what
ON are doing, I picture the two separate tables side by side:
I start with either table,
city for example. Then I look at the
ON portion of the
JOIN. It connects
country.code. So for each
city row, I take the
city.countrycode and find the matching
country.code row in the
country table. Then I take those two rows and add them together. They are now one, larger row.
Note for you DB experts out there: This is using a Hash join behind the scenes. It works by scanning the smaller table first (
country), hashes each compared value (e.g.
country.code) into a hash table, and then scans the larger table (
city) to see if any of the values from the larger table (e.g.
city.countrycode) can be found in the hash table. Hash joins are a great tool for optimizing queries but less useful for understanding
This works from either direction. If I start from
country and find matching rows from
city Postgres is smart enough to figure out which table to look at and we’ll get the same number of results.
If there is no match when iterating through each row, it’s removed from the results. You can see that above in how Armenia isn’t present in the final table.
JOIN also can duplicate data. Both Luanda and Huambo match the same country so it is present twice after the
JOIN. This makes it possible to filter rows based on the new data.
JOIN is short for
INNER JOIN, which is just one way of joining tables. There are a few different types of joins. They each change what happens when there is (or is not) a match found for a row. We won’t cover other types here because the important part is understanding how rows combine into one, larger table.
Once we have this big table, we can return to our original question. How many cities exist in a country that has a long life expectancy? Once we have all the data we need (a single table with both city and country data) we pare down the data a bit with
SELECT city.name, country.lifeexpectancy
JOIN country ON country.code = city.countrycode
WHERE country.lifeexpectancy > 80;
To recap, the process looks like:
- Create one big table with
- Pair down the rows with
- Pair down the columns with
I recommend playing around with
JOIN more. Try joining in
countrylanguage so you have all three tables together at once. For example, try to add the language spoken in the life expectancy query above.
Next we’ll talk about combining results with