The Problem
We have a bunch of embedded schemas on an Ecto schema. They’re self-contained and don’t need to be normalized or unique. They need to be flexible and lightweight, without a lot of overhead in either UX or code.
Here’s a migration that defines the embeds_many column the way Ecto tells you to.
defmodule Grocer.Repo.Migrations.CreateFoods do
  use Ecto.Migration

  def change do
    create table("foods") do
      add :name, :string
      add :upc, :string
      add :ingredients, {:array, :map}, default: []
    end
  end
end
And the associated schemas:
schema "foods" do
field :name, :string
field :upc, :string
embeds_many :ingredients, Grocer.Ingredient
end
embedded_schema do
field :name, :string
field :mg, :integer
end
Pretty standard stuff. Anyone who’s defined an embeds_many has seen that before. I’ve found that users of ORMs like Ecto don’t always look at the resulting database table, so let’s look. When we go into psql and enter \d foods, we’ll see this:
Table "public.foods"
Column | Type | Modifiers
-------------+------------------------+----------------------------------------------------
id | bigint | not null default nextval('foods_id_seq'::regclass)
name | character varying(255) |
upc | character varying(255) |
ingredients | jsonb[] | default ARRAY[]::jsonb[]
Indexes:
"foods_pkey" PRIMARY KEY, btree (id)
Ecto’s {:array, :map} column type gives us jsonb[] in the database. That means “a Postgres array of jsonb values”.
And there’s the thing: Postgres has an array type of its own (and has for a while). Arrays are well and good and well-supported, but they don’t work like JSON arrays do. But you know what does work like JSON? JSON.
Getting our hands dirty
I wanted to see how well an array of JSON would stack up against a single JSON document that contains an array. Since we have the JSONB[] column already, let’s start checking there. Important note: JSONB and JSON are functionally equivalent, but have some important under-the-hood differences. Put briefly: You want JSONB. You don’t want JSON.
I inserted 1 million rows using this quick script, which isn’t relevant, but I’ll link it because the names it made entertained me greatly (and I want you to be entertained, too). Through the magic of EXPLAIN ANALYZE, we can see how the query performs. Our SQL statement:
EXPLAIN ANALYZE SELECT name
FROM foods
WHERE ARRAY_TO_JSON(ingredients)::JSONB @> '[{"name": "Milk"}]';
We get these results:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on foods  (cost=0.00..50729.56 rows=1000 width=23) (actual time=0.134..4395.623 rows=287015 loops=1)
Filter: ((array_to_json(ingredients))::jsonb @> '[{"name": "Milk"}]'::jsonb)
Rows Removed by Filter: 712980
Planning time: 0.040 ms
Execution time: 4411.212 ms
4.4 seconds? Dag, yo. That’s not really tenable.
Ok, so that’s one option. The other option is simply JSONB. I changed the column to a real :jsonb in the migration (note that the default is '[]' and not [] like it was before):
add :ingredients, :jsonb, default: "[]"
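If you’re making this change to an existing table, a sketch of the migration might look like the following (the module name is just for illustration, and a real migration would need to convert existing data, which this skips):

defmodule Grocer.Repo.Migrations.ChangeIngredientsToJsonb do
  use Ecto.Migration

  def up do
    alter table("foods") do
      # Dropping and recreating the column throws away existing data;
      # convert it first if you need to keep it.
      remove :ingredients
      add :ingredients, :jsonb, default: "[]"
    end
  end

  def down do
    alter table("foods") do
      remove :ingredients
      add :ingredients, {:array, :map}, default: []
    end
  end
end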
And Postgres said the column type became this:

ingredients | jsonb | default '[]'::jsonb
With no index on the column, the equivalent query from before, EXPLAIN ANALYZE SELECT * FROM foods WHERE ingredients @> '[{"name": "Milk"}]';, looked like this:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on foods (cost=0.00..39618.01 rows=1000 width=187) (actual time=0.040..575.419 rows=208370 loops=1)
Filter: (ingredients @> '[{"name": "Milk"}]'::jsonb)
Rows Removed by Filter: 791631
Planning time: 0.036 ms
Execution time: 585.420 ms
Better! But, let’s be honest, ~0.6 seconds still isn’t great. However, a very important property of JSONB columns is that you can index them, and you can’t really expect good performance from a non-trivial query without good indexing.
To add an index, we can add execute "CREATE INDEX ingredients_gin ON foods USING GIN (ingredients);" to our migration, and execute "DROP INDEX ingredients_gin;" to the down migration.
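Spelled out, that migration might look like this (module name invented for the example):

defmodule Grocer.Repo.Migrations.IndexFoodsIngredients do
  use Ecto.Migration

  # execute/1 runs raw SQL that Ecto can't automatically reverse,
  # so we write explicit up/down instead of a single change function.
  def up do
    execute "CREATE INDEX ingredients_gin ON foods USING GIN (ingredients);"
  end

  def down do
    execute "DROP INDEX ingredients_gin;"
  end
end

With the index in place, the same query as before gives these results: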
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foods (cost=27.75..3407.64 rows=1000 width=187) (actual time=42.464..248.331 rows=208370 loops=1)
Recheck Cond: (ingredients @> '[{"name": "Milk"}]'::jsonb)
Heap Blocks: exact=27113
-> Bitmap Index Scan on ingredients_gin (cost=0.00..27.50 rows=1000 width=0) (actual time=37.778..37.778 rows=208370 loops=1)
Index Cond: (ingredients @> '[{"name": "Milk"}]'::jsonb)
Planning time: 0.054 ms
Execution time: 258.965 ms
That’s better. That’s way better. It’s still not 100% perfect, but that’s serviceable for columns you’re not querying on in a tight loop. And it’s quite a bit better than the 4.4 seconds from the first example.
Can I use this?
This is all fine here in a blog post. It’s useless to me if I can’t actually use this in my code. Let’s see it in action in Elixir and not just in SQL.
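One thing I glossed over: the schemas above never defined a changeset. Here’s a minimal sketch of the one the call below assumes (the exact cast fields are a guess from the schema):

# In Grocer.Food:
def changeset(food, attrs) do
  food
  |> Ecto.Changeset.cast(attrs, [:name, :upc])
  # cast_embed/2 turns the incoming list of maps into Grocer.Ingredient
  # structs, using Grocer.Ingredient.changeset/2 by default.
  |> Ecto.Changeset.cast_embed(:ingredients)
end

# In Grocer.Ingredient:
def changeset(ingredient, attrs) do
  Ecto.Changeset.cast(ingredient, attrs, [:name, :mg])
end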
iex(1)> %Grocer.Food{} |> Grocer.Food.changeset(%{name: "Egg", upc: "12345", ingredients: [%{name: "Egg", mg: 1}]}) |> Grocer.Repo.insert
[debug] QUERY OK db=42.6ms
INSERT INTO "foods" ("ingredients","name","upc") VALUES ($1,$2,$3) RETURNING "id" [[%{id: "fea00c41-7050-4795-9aca-5b5d39affa60", mg: 1, name: "Egg"}], "Egg", "12345"]
{:ok,
%Grocer.Food{
__meta__: #Ecto.Schema.Metadata<:loaded, "foods">,
id: 1000002,
ingredients: [
%Grocer.Ingredient{
id: "fea00c41-7050-4795-9aca-5b5d39affa60",
mg: 1,
name: "Egg"
}
],
name: "Egg",
upc: "12345"
}}
It certainly appears to work just as well as the recommended column type. What does the database look like?
grocer_dev=# select * from foods where upc = '12345';
   id    | name |  upc  |                               ingredients
---------+------+-------+--------------------------------------------------------------------------
 1000002 | Egg  | 12345 | [{"id": "fea00c41-7050-4795-9aca-5b5d39affa60", "mg": 1, "name": "Egg"}]
That looks like it works just fine!
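And for reading it back in Elixir rather than psql, a containment query like the SQL above can be written with a fragment. A sketch (assuming Jason for JSON encoding; none of this appears in the session above):

import Ecto.Query

# Postgres's @> containment operator isn't exposed by Ecto's query DSL,
# so we drop down to a fragment and cast the parameter to jsonb.
milk = Jason.encode!([%{name: "Milk"}])

from(f in Grocer.Food,
  where: fragment("? @> ?::jsonb", f.ingredients, ^milk),
  select: f.name
)
|> Grocer.Repo.all()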
So why do we use JSONB[]?
Postgres came out with JSON support in version 9.2. It was implemented as little more than a text blob, though. Version 9.4 added JSONB support, but that didn’t come out until the end of 2014. Ecto’s first commits for embeds_many came in mid-2015. It seems quite likely that the concern at the time was that older versions of Postgres (which would still have been very common) wouldn’t have good JSONB support and therefore wouldn’t be performant enough.
Since we’re over 3 years out from the release of Postgres 9.4, I think it’s probably been long enough to consider switching your column types, especially on new development, and when you control your database or use something like Heroku, which keeps its databases nice and shiny.