Querying with Custom Joins

Video

Want to see the full-length video right now for free?

Notes

ActiveRecord can write a lot of your SQL for you. It can figure out a lot of your joins and do a lot of things without you ever having to leave the comfort of your Ruby home.

However, there will be situations where you need to do something that ActiveRecord doesn't know how to do. Fortunately, ActiveRecord exposes holes where you can, when necessary, plug in bits of your own SQL. This lets you combine ActiveRecord's nice, composable, Ruby-like separation of concerns with all the raw power of SQL.

This frees you up to start with the clean and simple Ruby/ActiveRecord querying that we've practiced in the last two lessons, with the confidence that you can break out into custom SQL if you really need to. We can get the best of both worlds by using... Custom Joins!

Our example's domain model

This time, we're going to focus on just one model, Person, and add a self- referential association:

class Person < ActiveRecord::Base
  belongs_to :manager, class_name: "Person", foreign_key: :manager_id
  has_many :employees, class_name: "Person", foreign_key: :manager_id
end

The current state of affairs in the database is:

People.all

id	name	role_id	location_id	manager_id
1	Eve	2	2	NULL
2	Bill	2	1	NULL
3	Wendell	1	1	1
4	Christie	1	1	1
5	Sandy	1	3	2

If you look at the list of people, you'll notice we have two people with no manager (Eve and Bill), two people managed by Eve (Wendell and Christie), and one person managed by Bill (Sandy).

`belongs_to` challenge

Let's start by finding all the people who are not managed by "Eve." That includes Eve and Bill, since they do not have a manager at all.

Naive `joins` method

We can start out with what we know by [joining a named association with ActiveRecord][]:

Person.joins(:manager)

ActiveRecord can figure out that it needs to join the people table to itself:

SELECT "people".*
FROM "people"
INNER JOIN "people" "managers_people"
  ON "managers_people"."id" = "people"."manager_id"

You can see that ActiveRecord made up the alias managers_people to keep things straight, since we're joining a table to itself. The resulting data looks like this:

people					managers_people
id	name	role_id	location_id	manager_id	id	name	role_id	location_id	manager_id
3	Wendell	1	1	1	1	Eve	2	2	NULL
4	Christie	1	1	1	1	Eve	2	2	NULL
5	Sandy	1	3	2	2	Bill	2	1	NULL

Again, the part on the left side gets sent back to ActiveRecord and built into objects, while the right side is available while in the database to query against.

[joining a named association with ActiveRecord]: http://guides.rubyonrails.org/active_record_querying.html#using-array-hash-of-named-associations

The problem with inner joins

You'll notice that we're missing Eve and Bill on the left side. This is because the ActiveRecord joins method uses a SQL "inner" join. We're not going to go into the ins and outs of inner and outer joins in SQL too much ([you can read more about them here][]), but a good thing to know is that with an inner join, you need to have a match on both sides in order to get a row back.

"Outer" joins can be used when you want to get back a row even if it doesn't have a corresponding record in the joined table. In our case, we want people to show up on the left side even if they don't have a corresponding manager on the right side. To achieve this, we can use a "left outer" join.

[you can read more about them here]: http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

ActiveRecord's `not` method

First, let's narrow down our results a bit: we were interested in finding people not managed by Eve, not just anybody with a manager. We can do this cleanly with [ActiveRecord's not method][]:

Person.
  joins(:manager).
  where.
  not(managers_people: { id: Person.find_by!(name: "Eve") })

which generates SQL like this:

SELECT "people".*
FROM "people"
INNER JOIN "people" "managers_people"
  ON "managers_people"."id" = "people"."manager_id"
WHERE ("managers_people"."id" != 1);

and retrieves data like this:

people					managers_people
id	name	role_id	location_id	manager_id	id	name	role_id	location_id	manager_id
5	Sandy	1	3	2	2	Bill	2	1	NULL

You can see that we are now just left with Sandy, who is managed by Bill. So far so good, but we're still missing Eve and Bill, both of whom are also not managed by Eve.

[ActiveRecord's not method]: https://robots.thoughtbot.com/activerecords-wherenot

Using a custom join

To get Eve and Bill as well, we need a left outer join rather than an inner join. Unfortunately, ActiveRecord doesn't provide a built-in method to perform a left outer join. However, we can solve this with a Custom Join:

Person.
  joins(<<-SQL).
    LEFT JOIN people managers
    ON managers.id = people.manager_id
  SQL
  where.
  not(managers: { id: Person.find_by!(name: "Eve") })

A couple of notes on the above:

We're using a [Ruby Heredoc] as a convenient way to write a multiline string for our SQL, which is a pretty common approach.
We are using the same old joins method, but passing it a string rather than a symbol; and so rather than looking for an association to join, it executes the raw SQL provided.
Since we are now writing the SQL ourselves, we have to be aware of the particular syntax that our database requires. ActiveRecord can't translate for us anymore.
Try to keep custom SQL like this encapsulated in your model so that in case you switch your database or table or foreign key names, it won't be too hard to update it.
The joins method still returns an ActiveRecord relation, which means we can still keep the rest of our query the same, which is nice!

That said, this is the SQL that is generated:

SELECT "people".*
FROM "people"
LEFT JOIN "people" "managers"
  ON "managers"."id" = "people"."manager_id"
WHERE ("managers"."id" != 1);

which retrieves data like this:

people					managers_people
id	name	role_id	location_id	manager_id	id	name	role_id	location_id	manager_id
5	Sandy	1	3	2	2	Bill	2	1	NULL

[Ruby Heredoc]: http://weblog.jamisbuck.org/2015/9/12/little-things-heredocs.html

NULL = NULL is not truthy

Unfortunately, our output still hasn't changed! Frustrating. We still don't get Eve and Bill back. This is due to an oddity with comparisons in SQL:

In SQL, a row is returned if the condition inside the where clause is truthy -- e.g., not false and not NULL. This is what we are used to in Ruby, with everything being truthy except for false and nil.

However, there is a slight difference between how Ruby handles comparisons with nil and how SQL handles comparisons with NULL:

Ruby Expression	Result	SQL Expression	Result
`1 == 1`	`true`	`1 = 1`	`true`
`1 == 2`	`false`	`1 = 2`	`false`
`1 != 2`	`true`	`1 != 2`	`true`
`1 == nil`	`false`	`1 = NULL`	`NULL`
`1 != nil`	`true`	`1 != NULL`	`NULL`

The last comparison in SQL might be surprising, coming from a Ruby world. In SQL, any comparison to NULL results in NULL -- even NULL = NULL:

Ruby expression	Result		SQL expression	Result
`nil == nil`	`true`		`NULL = NULL`	`NULL`

This is a problem for us, because it means when we ask "Is this person's manager ID not equal to 1?" (for Eve and Bill, the manager ID is NULL) in order to check whether to return a row, the answer is falsy, and so the row isn't being returned.

In other words,

Person.where("manager_id != null").all

generates SQL like this:

SELECT "people".* FROM "people" WHERE (manager_id != null)

but returns an empty set:

=> []

Meanwhile,

Person.where("manager_id != 5").all

generates SQL like this:

SELECT "people".* FROM "people" WHERE (manager_id != 5)

and returns people that have a non-NULL value for manager_id that is not equal to 5:

=> [#<Person:0x005575ad16f6c0 id: 3, name: "Wendell", role_id: 1, location_id: 1, manager_id: 1, salary: 35000>,
#<Person:0x005575ad16f3f0 id: 4, name: "Christie", role_id: 1, location_id: 1, manager_id: 1, salary: 30000>,
#<Person:0x005575ad16f1e8 id: 5, name: "Sandy", role_id: 1, location_id: 3, manager_id: 2, salary: 40000>]

but still no Eve or Bill!

One solution

There are several solutions to this problem, but here is one we like:

Person.
  joins(<<-SQL).
    LEFT JOIN people managers
    ON managers.id = people.manager_id
  SQL
  where(
    "managers.id != ? OR managers.id IS NULL",
    Person.find_by!(name: "Eve")
  )

As with joins, the where method can accept a string rather than a hash, and it will know to expect raw SQL to execute. We need to use custom SQL here because ActiveRecord doesn't have built-in support for SQL's OR ([yet][]).

Our custom string includes an extra comparison, OR managers.id IS NULL, which behaves as you would expect:

SELECT "people".*
FROM "people"
LEFT JOIN "people" "managers"
  ON "managers"."id" = "people"."manager_id"
WHERE ("managers"."id" != 1 OR "managers"."id" IS NULL);

[yet]: https://github.com/rails/rails/pull/16052

Never interpolate into SQL

Notice that even though we are passing the where method a Ruby string, we are not interpolating the value that we want into that string. Instead, we put a question mark in the string where we want the value to go, and then pass the value as a subsequent argument to where.

This is because if someday a user was somehow able to supply that value, we would be vulnerable to a [SQL injection attack][]. So one thing to always keep in mind while working with custom SQL is to never let potentially user-supplied values come near your raw SQL strings. By using the placeholder (question mark) syntax, we can allow ActiveRecord to sanitize the string for us and then embed it.

[SQL injection attack]: https://xkcd.com/327/

Final solution

Finally, the output that we get from our query:

id	name	role_id	location_id	manager_id	id	name	role_id	location_id	manager_id
1	Eve	2	2	NULL	NULL	NULL	NULL	NULL	NULL
2	Bill	2	1	NULL	NULL	NULL	NULL	NULL	NULL
5	Sandy	1	3	2	2	Bill	2	1	NULL

At last, we have successfully found all people not managed by Eve, including Eve and Bill. And we did it all in the database, with one efficient query. Success!

Tricking ActiveRecord into doing a left join with the `includes` method

Now, we said earlier that there isn't a way to have ActiveRecord perform a left join for us, since the joins method uses inner joins exclusively. But that's not completely true -- while there isn't a way to tell ActiveRecord explicitly that we want a left join, there is another method that, coincidentally, uses a left join: [includes][].

includes is intended to be used for eager loading data, and if you also use [references][] to specify conditions on the eager loaded data, then it just so happens that ActiveRecord will use a left join:

Person.
  includes(:manager).
  references(:manager).
  where(
    "managers_people.id != ? OR managers_people.id IS NULL",
    Person.find_by!(name: "Eve")
  )

generates SQL like this:

SELECT "people"."id" AS t0_r0, "people"."name" AS t0_r1, "people"."role_id" AS t0_r2,
"people"."location_id" AS t0_r3, "people"."manager_id" AS t0_r4, "managers_people"."id" AS t1_r0,
"managers_people"."name" AS t1_r1, "managers_people".role_id AS t1_r2,
"managers_people"."location_id" AS t1_r3, "managers_people"."manager_id" AS t1_r4
FROM "people"
LEFT OUTER JOIN "people" "managers_people"
  ON "managers_people"."id" = "people"."managers_id"
WHERE ("managers_people"."id" != 1 OR "managers_people"."id" IS NULL)

and retrieves data like this:

t0_r0	t0_r1	t0_r2	t0_r3	t0_r4	t1_r0	t1_r1	t1_r2	t1_r3	t1_r4
1	Eve	2	2	NULL	NULL	NULL	NULL	NULL	NULL
2	Bill	2	1	NULL	NULL	NULL	NULL	NULL	NULL
5	Sandy	1	3	2	2	Bill	2	1	NULL

But we think it's a bad idea to depend on this, for a number of reasons:

ActiveRecord may, in the future, change the implementation of includes and references and break our query.
More importantly, it obscures our intention. We don't want to eager load data -- we want an outer join. A teammate (or future you) might be confused and remove the includes if they notice that the eager loaded data isn't being used.
The way that ActiveRecord aliases all of the columns in order to ensure no conflicts between columns with the same names in the different tables (t0_r0, t0_r1, etc) makes the query hard to read and debug.
It's eager loading a lot of data that you don't need, which defeats part of our purpose -- not to build a bunch of unnecessary ActiveRecord objects.

[includes]: http://guides.rubyonrails.org/active_record_querying.html#eager-loading-multiple-associations [references]: http://guides.rubyonrails.org/active_record_querying.html#specifying-conditions-on-eager-loaded-associations

has_many Challenge

One final challenge: let's find all distinct locations with at least one person who belongs to a billable role, ordered by region name, then location name.

Wait, this sounds familiar! We've done this before [in our has_many lesson][].

Old problem

If you recall, we had something like

Location.
  joins(people: :role).
  where(roles: { billable: true }).
  distinct.
  joins(:region).
  merge(Region.order(:name)).
  order(:name)

which caused Postgres to choke:

SELECT DISTINCT "locations".*
FROM "locations"
INNER JOIN "people"
  ON "people"."location_id" = "locations"."id"
INNER JOIN "roles"
  ON "roles"."id" = "people"."role_id"
INNER JOIN "regions"
  ON "regions"."id" = "locations"."region_id"
WHERE "roles"."billable" = 't'
ORDER BY "regions"."name" ASC, "locations"."name" ASC;

PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must
appear in select list
LINE 1: ...gion_id" WHERE "roles"."billable" = 't'  ORDER BY "regions"....

[in our has_many lesson]: https://upcase.com/videos/advanced-querying-has-many#our-first-challenge

Old solution

The way we resolved this issue previously was with a sub-query using [ActiveRecord's from method][] method:

Location.
  from(
    Location.
      joins(people: :role).
      where(roles: { billable: true }).
      distinct,
    :locations
  ).
  joins(:region).
  merge(Region.order(:name)).
  order(:name)

which generates SQL like this:

SELECT "locations".*
FROM (
  SELECT DISTINCT "locations".*
  FROM "locations"
  INNER JOIN "people"
    ON "people"."location_id" = "locations"."id"
  INNER JOIN "roles"
    ON "roles"."id" = "people"."role_id"
  WHERE "roles"."billable" = 't'
) locations
INNER JOIN "regions"
  ON "regions"."id" = "locations"."region_id"
ORDER BY "regions"."name" ASC, "locations"."name" ASC;

and retrieves data like this:

locations			regions
id	name	region_id	id	name
1	Boston	1	1	East
3	Denver	2	2	West

[ActiveRecord's from method]: http://api.rubyonrails.org/classes/ActiveRecord/QueryMethods.html#method-i-from

Solution with custom join

There is another way, which involves using a custom join:

Location.
  joins(
    "INNER JOIN (" +
      Location.
        joins(people: :role).
        where(roles: { billable: true }).
        distinct.
        to_sql +
      ") billable_locations " \
      "ON locations.id = billable_locations.id"
  ).
  joins(:region).
  merge(Region.order(:name)).
  order(:name)

This time, rather than selecting from a sub-query, we are joining on to a sub- query. This would generate SQL like this:

SELECT "locations".*
FROM "locations"
INNER JOIN "regions"
  ON "regions"."id" = "locations"."region_id"
INNER JOIN (
  SELECT DISTINCT "locations".*
  FROM "locations"
  INNER JOIN "people"
    ON "people"."location_id" = "locations"."id"
  INNER JOIN "roles"
    ON "roles"."id" = "people"."role_id"
  WHERE "roles"."billable" = 't'
) billable_locations
  ON "locations"."id" = "billable_locations"."id"
ORDER BY "regions"."name" ASC, "locations"."name" ASC;

and retrieves data like this:

locations			regions		billable_locations
id	name	region_id	id	name	id	name	region_id
1	Boston	1	1	East	1	Boston	1
3	Denver	2	2	West	3	Denver	2

Choosing which use

So, which one to use? It's mostly a matter of preference. There are situations where one is more performant than the other, and even in these cases, Postgres is frequently smart enough to rewrite the query to use the better option.

So the most important thing is to use the one that makes the most sense to you. We often find joins to be easier to visualize than sub-queries because they are two-dimensional. But both are good options.

ActiveRecord's `to_sql` method

The trick that makes this work is ActiveRecord's to_sql method, which you can call on any ActiveRecord relation, and which will return a string of SQL rather than execute that SQL in the database.

This is very handy for debugging, but also like in this case, for generating SQL to use inside of custom joins.

Exception to Never Concatenate SQL

This might seem like an exception to our previous rule about never interpolating or concatenating SQL, but since to_sql gets its SQL from ActiveRecord, it should be safe. But it's good that you're being vigilant!

Conclusion

So now we have a sense of how we can use custom joins to, when all else fails, plug in our own raw SQL into the middle of a chain of ActiveRecord methods. This way we can still achieve our goal of letting the database do as much work as possible even when ActiveRecord doesn't have a specific built-in method for what we need.

We have an exercise for you to practice custom joins, and then we'll see you in the final lesson of this series: Aggregations.

Good luck!

Advanced ActiveRecord Querying

23 minutes

Video

Notes

Our example's domain model

`belongs_to` challenge

Naive `joins` method

The problem with inner joins

ActiveRecord's `not` method

Using a custom join

NULL = NULL is not truthy

One solution

Never interpolate into SQL

Final solution

Tricking ActiveRecord into doing a left join with the `includes` method

has_many Challenge

Old problem

Old solution

Solution with custom join

Choosing which use

ActiveRecord's `to_sql` method

Exception to Never Concatenate SQL

Conclusion

Video

Our example's domain model

belongs_to challenge

Naive joins method

The problem with inner joins

ActiveRecord's not method

Using a custom join

NULL = NULL is not truthy

One solution

Never interpolate into SQL

Final solution

Tricking ActiveRecord into doing a left join with the includes method

has_many Challenge

Old problem

Old solution

Solution with custom join

Choosing which use

ActiveRecord's to_sql method

Exception to Never Concatenate SQL

Conclusion

`belongs_to` challenge

Naive `joins` method

ActiveRecord's `not` method

Tricking ActiveRecord into doing a left join with the `includes` method

ActiveRecord's `to_sql` method