This is part 2 of a series on SQL basics. Read part 1 for context on
SELECT
andWHERE
and how to set up a test database to follow along with these examples. Part 3 will coverGROUP BY
.
Table for One, Please
In the last post we talked about the following query:
SELECT name, countrycode
FROM city
WHERE population < 1000;
We specifically focused on WHERE
and SELECT
. The one part of the query not mentioned was FROM
. It (along with the JOIN
clause) specifies the table that our data comes from. This is the data that SELECT
and WHERE
use to do their filtering.
One thing I want to emphasize here is that SQL actions like WHERE
and SELECT
act on a single table. Even when we talk about combining tables together, it does just that: combines multiple tables into one.
JOIN Together
As an example, let’s query for all the cities that exist in a country that has a long life expectancy. We know how to get city
data, but lifeexpectancy
exists on the country
table. How do we get them both together? The only way is to create one big table using JOIN
.
SQL joins can be confusing when you first come across them. This is what the syntax looks like:
SELECT *
FROM city
JOIN country ON city.countrycode = country.code;
This creates a single, larger table (again, not multiple) containing all the columns from the city
and country
tables. But if it’s only one table, what do the rows look like? And how does it know which rows to use? We’ll address the second question first.
You may have noticed the ON
part of the JOIN
. Its job is to describe how the JOIN
will combine rows from each table together. The left side of the equation must match the right side. To visualize the process for what JOIN
and ON
are doing, I picture the two separate tables side by side:
I start with either table, city
for example. Then I look at the ON
portion of the JOIN
. It connects city.countrycode
to country.code
. So for each city
row, I take the city.countrycode
and find the matching country.code
row in the country
table. Then I take those two rows and add them together. They are now one, larger row.
Note for you DB experts out there: This is using a Hash join behind the scenes. It works by scanning the smaller table first (
country
), hashes each compared value (e.g.country.code
) into a hash table, and then scans the larger table (city
) to see if any of the values from the larger table (e.g.city.countrycode
) can be found in the hash table. Hash joins are a great tool for optimizing queries but less useful for understandingJOIN
.
This works from either direction. If I start from country
and find matching rows from city
Postgres is smart enough to figure out which table to look at and we’ll get the same number of results.
If there is no match when iterating through each row, it’s removed from the results. You can see that above in how Armenia isn’t present in the final table. JOIN
also can duplicate data. Both Luanda and Huambo match the same country so it is present twice after the JOIN
. This makes it possible to filter rows based on the new data.
JOIN
is short for INNER JOIN
, which is just one way of joining tables. There are a few different types of joins. They each change what happens when there is (or is not) a match found for a row. We won’t cover other types here because the important part is understanding how rows combine into one, larger table.
Important Life Questions
Once we have this big table, we can return to our original question. How many cities exist in a country that has a long life expectancy? Once we have all the data we need (a single table with both city and country data) we pare down the data a bit with WHERE
and SELECT
:
SELECT city.name, country.lifeexpectancy
FROM city
JOIN country ON country.code = city.countrycode
WHERE country.lifeexpectancy > 80;
To recap, the process looks like:
- Create one big table with
FROM
andJOIN
- Pair down the rows with
WHERE
- Pair down the columns with
SELECT
I recommend playing around with JOIN
more. Try joining in countrylanguage
so you have all three tables together at once. For example, try to add the language spoken in the life expectancy query above.
Next we’ll talk about combining results with GROUP BY
.