The Problem
We have a bunch of embedded schemas on an Ecto schema. They’re self-contained and don’t need to be normalized or unique. They need to be flexible and lightweight, without a lot of overhead in either UX or code.
Here’s a migration that defines the embeds_many column the way Ecto tells you to.
defmodule Grocer.Repo.Migrations.CreateFoods do
  use Ecto.Migration

  def change do
    create table("foods") do
      add :name, :string
      add :upc, :string
      add :ingredients, {:array, :map}, default: []
    end
  end
end
And the associated schemas:
schema "foods" do
field :name, :string
field :upc, :string
embeds_many :ingredients, Grocer.Ingredient
end
embedded_schema do
field :name, :string
field :mg, :integer
end
Pretty standard stuff. Anyone who’s defined an embeds_many has seen that before. I’ve found that users of ORMs like Ecto don’t always look at the resulting database table, so let’s look. When we go into psql and enter \d foods, we’ll see this:
Table "public.foods"
Column | Type | Modifiers
-------------+------------------------+----------------------------------------------------
id | bigint | not null default nextval('foods_id_seq'::regclass)
name | character varying(255) |
upc | character varying(255) |
ingredients | jsonb[] | default ARRAY[]::jsonb[]
Indexes:
"foods_pkey" PRIMARY KEY, btree (id)
Ecto’s {:array, :map} column type gives us jsonb[] in the database. That means “a Postgres array of jsonb values”.
And there’s the thing: Postgres has an array type of its own (and has for a while). Arrays are well and good and well-supported, but they don’t work like JSON arrays do. But you know what does work like JSON? JSON.
Getting our hands dirty
I wanted to see how well an array of JSON would stack up against a single JSON document that contains an array. Since we have the JSONB[] column already, let’s start checking there. Important note: JSONB and JSON are functionally equivalent, but have some important under-the-hood differences. Put briefly: You want JSONB. You don’t want JSON.
I inserted 1 million rows using this quick script, which isn’t relevant, but I’ll link it because the names it made entertained me greatly (and I want you to be entertained, too). Through the magic of EXPLAIN ANALYZE, we can see how the query performs. Our SQL statement:
EXPLAIN ANALYZE SELECT name
FROM foods
WHERE ARRAY_TO_JSON(ingredients)::JSONB @> '[{"name": "Milk"}]';
We get these results:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on foods  (cost=0.00..50729.56 rows=1000 width=23) (actual time=0.134..4395.623 rows=287015 loops=1)
Filter: ((array_to_json(ingredients))::jsonb @> '[{"name": "Milk"}]'::jsonb)
Rows Removed by Filter: 712980
Planning time: 0.040 ms
Execution time: 4411.212 ms
4.4 seconds? Dag, yo. That’s not really tenable.
Ok, so that’s one option. The other option is simply JSONB. I changed the column to a real :jsonb in the migration (note that the default is '[]' and not [] like it was before):
add :ingredients, :jsonb, default: "[]"
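If you’re making this change to an existing table, a sketch of the migration might look like the following (the module name is just for illustration, and a real migration would need to convert existing data, which this skips):

defmodule Grocer.Repo.Migrations.ChangeIngredientsToJsonb do
  use Ecto.Migration

  def up do
    alter table("foods") do
      # Dropping and recreating the column throws away existing data;
      # convert it first if you need to keep it.
      remove :ingredients
      add :ingredients, :jsonb, default: "[]"
    end
  end

  def down do
    alter table("foods") do
      remove :ingredients
      add :ingredients, {:array, :map}, default: []
    end
  end
end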
And Postgres said the column type became this:

ingredients | jsonb | default '[]'::jsonb
With no index on the column, the equivalent query from before, EXPLAIN ANALYZE SELECT * FROM foods WHERE ingredients @> '[{"name": "Milk"}]';, looked like this:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on foods (cost=0.00..39618.01 rows=1000 width=187) (actual time=0.040..575.419 rows=208370 loops=1)
Filter: (ingredients @> '[{"name": "Milk"}]'::jsonb)
Rows Removed by Filter: 791631
Planning time: 0.036 ms
Execution time: 585.420 ms
Better! But, let’s be honest, ~0.6 seconds still isn’t great. However, a very important property of JSONB columns is that you can index them, and you can’t really expect good performance from a non-trivial query without good indexing.
To add an index, we can add execute "CREATE INDEX ingredients_gin ON foods USING GIN (ingredients);" to our migration, and execute "DROP INDEX ingredients_gin;" to the down migration.
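Spelled out, that migration might look like this (module name invented for the example):

defmodule Grocer.Repo.Migrations.IndexFoodsIngredients do
  use Ecto.Migration

  # execute/1 runs raw SQL that Ecto can't automatically reverse,
  # so we write explicit up/down instead of a single change function.
  def up do
    execute "CREATE INDEX ingredients_gin ON foods USING GIN (ingredients);"
  end

  def down do
    execute "DROP INDEX ingredients_gin;"
  end
end

With the index in place, the same query as before gives these results: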
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foods (cost=27.75..3407.64 rows=1000 width=187) (actual time=42.464..248.331 rows=208370 loops=1)
Recheck Cond: (ingredients @> '[{"name": "Milk"}]'::jsonb)
Heap Blocks: exact=27113
-> Bitmap Index Scan on ingredients_gin (cost=0.00..27.50 rows=1000 width=0) (actual time=37.778..37.778 rows=208370 loops=1)
Index Cond: (ingredients @> '[{"name": "Milk"}]'::jsonb)
Planning time: 0.054 ms
Execution time: 258.965 ms
That’s better. That’s way better. It’s still not 100% perfect, but that’s serviceable for columns you’re not querying on in a tight loop. And it’s quite a bit better than the 4.4 seconds from the first example.
Can I use this?
This is all fine here in a blog post. It’s useless to me if I can’t actually use this in my code. Let’s see it in action in Elixir and not just in SQL.
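One thing I glossed over: the schemas above never defined a changeset. Here’s a minimal sketch of the one the call below assumes (the exact cast fields are a guess from the schema):

# In Grocer.Food:
def changeset(food, attrs) do
  food
  |> Ecto.Changeset.cast(attrs, [:name, :upc])
  # cast_embed/2 turns the incoming list of maps into Grocer.Ingredient
  # structs, using Grocer.Ingredient.changeset/2 by default.
  |> Ecto.Changeset.cast_embed(:ingredients)
end

# In Grocer.Ingredient:
def changeset(ingredient, attrs) do
  Ecto.Changeset.cast(ingredient, attrs, [:name, :mg])
end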
iex(1)> %Grocer.Food{} |> Grocer.Food.changeset(%{name: "Egg", upc: "12345", ingredients: [%{name: "Egg", mg: 1}]}) |> Grocer.Repo.insert
[debug] QUERY OK db=42.6ms
INSERT INTO "foods" ("ingredients","name","upc") VALUES ($1,$2,$3) RETURNING "id" [[%{id: "fea00c41-7050-4795-9aca-5b5d39affa60", mg: 1, name: "Egg"}], "Egg", "12345"]
{:ok,
%Grocer.Food{
__meta__: #Ecto.Schema.Metadata<:loaded, "foods">,
id: 1000002,
ingredients: [
%Grocer.Ingredient{
id: "fea00c41-7050-4795-9aca-5b5d39affa60",
mg: 1,
name: "Egg"
}
],
name: "Egg",
upc: "12345"
}}
It certainly appears to work just as well as the recommended column type. What does the database look like?
grocer_dev=# select * from foods where upc = '12345';
   id    | name |  upc  |                               ingredients
---------+------+-------+--------------------------------------------------------------------------
 1000002 | Egg  | 12345 | [{"id": "fea00c41-7050-4795-9aca-5b5d39affa60", "mg": 1, "name": "Egg"}]
That looks like it works just fine!
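And for reading it back in Elixir rather than psql, a containment query like the SQL above can be written with a fragment. A sketch (assuming Jason for JSON encoding; none of this appears in the session above):

import Ecto.Query

# Postgres's @> containment operator isn't exposed by Ecto's query DSL,
# so we drop down to a fragment and cast the parameter to jsonb.
milk = Jason.encode!([%{name: "Milk"}])

from(f in Grocer.Food,
  where: fragment("? @> ?::jsonb", f.ingredients, ^milk),
  select: f.name
)
|> Grocer.Repo.all()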
So why do we use JSONB[]?
Postgres came out with JSON support in version 9.2. It was implemented as little more than a text blob, though. Version 9.4 added JSONB support, but that didn’t come out until the end of 2014. Ecto’s first commits for embeds_many came in mid-2015. It seems quite likely that the concern at the time was that older versions of Postgres (which would still have been very common) wouldn’t have good JSONB support and therefore wouldn’t be performant enough.
Since we’re over 3 years out from the release of Postgres 9.4, I think it’s probably been long enough to consider switching your column types, especially on new development, and when you control your database or use something like Heroku, which keeps its databases nice and shiny.