Postgres full-text search is awesome, but without tuning, searching large columns can be slow. Introducing a tsvector column to cache lexemes and using a trigger to keep the lexemes up-to-date can improve the speed of full-text searches. This article shows how to accomplish that in Rails.
Using pg_search with Rails

We want our users to search for products. Let's add the pg_search gem to our Rails app:

gem "pg_search"

Then, configure it for our Product model:
class Product < ActiveRecord::Base
  include PgSearch

  pg_search_scope(
    :search,
    against: %i(
      description
      manufacturer_name
      name
    ),
    using: {
      tsearch: {
        dictionary: "english",
      }
    }
  )
end
This example configures full-text search of the products table's description, manufacturer_name, and name columns.
See Implementing Multi-Table Full Text Search with Postgres in Rails for an example of full-text search across multiple tables.

Our pg_search_scope is named :search, so we can invoke it with:

Product.search("wool")
We're explicitly specifying the :tsearch option (which is the default Postgres full-text search) in order to use the english dictionary instead of the default simple dictionary.
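The difference between the two dictionaries is easy to see in psql: simple only lowercases and splits words, while english also stems them to root lexemes (a sketch; runnable against any stock Postgres database):

```sql
-- The simple dictionary lowercases and splits, but does not stem:
SELECT to_tsvector('simple', 'Searching Sunglasses');
-- => 'searching':1 'sunglasses':2

-- The english dictionary also stems, so variants of a word
-- ("Searching", "searched", "searches") share one lexeme:
SELECT to_tsvector('english', 'Searching Sunglasses');
-- => 'search':1 'sunglass':2
```

Stemming is what lets a query for "search" match a product described with "searching".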
What we get for a SQL query
Wonderful. We have full-text searching set up in minutes.
Now, what does our SQL query look like?
SELECT products.*
FROM products
INNER JOIN (
  SELECT products.id AS pg_search_id,
    (
      ts_rank(
        (
          to_tsvector('english', coalesce(products.description::text, '')) ||
          to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
          to_tsvector('english', coalesce(products.name::text, ''))
        ),
        (to_tsquery('english', ''' ' || 'wool' || ' ''')),
        0
      )
    ) AS rank
  FROM products
  WHERE (
    (
      to_tsvector('english', coalesce(products.description::text, '')) ||
      to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
      to_tsvector('english', coalesce(products.name::text, ''))
    ) @@ (to_tsquery('english', ''' ' || 'wool' || ' '''))
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0
This is all pretty standard SQL, plus a few cool functions: ts_rank, to_tsvector, and to_tsquery.
The to_tsvector function is worth a closer look. It generates tsvector data types, which are "a sorted list of distinct lexemes." Lexemes, in turn, are "words that have been normalized to make different variants of the same word look alike".
For example, given the following product:

Product.create(
  description: "Michael Kors",
  name: "Sunglasses",
  manufacturer_name: "Michael Kors"
)

The tsvector looks like:

'kor':2,4,6 'michael':1,3,5 'sunglass':7
The resulting lexemes were "normalized to make different variants look alike" by lowercasing, removing suffixes, etc. They were then sorted into a list, and the numbers represent each lexeme's position in the original strings.
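We can build the same kind of combined vector by hand in psql, concatenating per-column tsvectors with || just as the generated query does (a sketch; positions continue numbering across the concatenated vectors):

```sql
SELECT to_tsvector('english', 'Michael Kors') ||  -- description
       to_tsvector('english', 'Michael Kors') ||  -- manufacturer_name
       to_tsvector('english', 'Sunglasses');      -- name
```

Each repeated lexeme accumulates additional positions rather than appearing twice in the list.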
For tons of awesome examples and details on these three functions, see Postgres full-text search is Good Enough!
Caching tsvector lexemes
On a large products table, our searches may be slow. If so, we have some tuning options.

One option would be to cache the tsvectors using a materialized view. Read Caching with Postgres materialized views or Postgres full-text search is Good Enough! (again) for more information on materialized views with Postgres and Ruby.
Materialized views may be a good option for your data. One downside is that the entire view must be refreshed with:
REFRESH MATERIALIZED VIEW view_name;
That may be a good fit in some scenarios,
perhaps run daily as a cron or Heroku Scheduler job.
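For reference, a materialized view caching the same lexemes might look like this (a sketch; the product_searches view name is hypothetical, and the column list mirrors our pg_search_scope):

```sql
CREATE MATERIALIZED VIEW product_searches AS
SELECT products.id AS product_id,
       to_tsvector('english', coalesce(products.description, '')) ||
       to_tsvector('english', coalesce(products.manufacturer_name, '')) ||
       to_tsvector('english', coalesce(products.name, '')) AS tsv
FROM products;

-- The entire view must be refreshed to pick up new or changed rows:
REFRESH MATERIALIZED VIEW product_searches;
```

Rows created or updated between refreshes are invisible to search, which is the downside we want to avoid here.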
In our case, we want a cache to be updated when a Product is created or updated. Let's edit our pg_search_scope:

 using: {
   tsearch: {
+    tsvector_column: "tsv",
   }
 }
Since we can't dump a tsvector column to schema.rb, we need to switch to the SQL schema format in our config/application.rb:

config.active_record.schema_format = :sql

Remove the now-unnecessary db/schema.rb:

rm db/schema.rb

And generate a migration:
class AddTsvectorColumns < ActiveRecord::Migration
  def up
    add_column :products, :tsv, :tsvector
    add_index :products, :tsv, using: "gin"

    execute <<-SQL
      CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
      ON products FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(
        tsv, 'pg_catalog.english', description, manufacturer_name, name
      );
    SQL

    now = Time.current.to_s(:db)
    update("UPDATE products SET updated_at = '#{now}'")
  end

  def down
    execute <<-SQL
      DROP TRIGGER tsvectorupdate
      ON products
    SQL

    remove_index :products, :tsv
    remove_column :products, :tsv
  end
end
This change introduces a tsv column of type tsvector to search against, a GIN index on the new column, a BEFORE INSERT OR UPDATE trigger on products to keep the column in sync, and a backfill UPDATE that touches every existing product so the trigger populates tsv for rows created before the migration. Postgres has a built-in tsvector_update_trigger function to make this easier.
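After the migration runs, we can spot-check the trigger from psql: inserting a row should populate tsv with no application code involved (a sketch; tsvector_update_trigger concatenates the listed columns and runs them through the english dictionary, and your products table may have other NOT NULL columns to supply):

```sql
INSERT INTO products (description, manufacturer_name, name, created_at, updated_at)
VALUES ('Michael Kors', 'Michael Kors', 'Sunglasses', now(), now());

-- The trigger has already cached the lexemes for the new row:
SELECT tsv FROM products WHERE name = 'Sunglasses';
```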
The GIN index could alternatively be a GiST index. See the GIN vs. GiST tradeoffs.
Here's the resulting query with the new tsvector-type column:
SELECT products.*
FROM products
INNER JOIN (
  SELECT products.id AS pg_search_id,
    (
      ts_rank(
        (products.tsv),
        (to_tsquery('english', ''' ' || 'wool' || ' ''')),
        0
      )
    ) AS rank
  FROM products
  WHERE (
    ((products.tsv) @@ (to_tsquery('english', ''' ' || 'wool' || ' ''')))
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0
We can see that our run-time to_tsvector function calls are gone, and the query now reads the cached tsvector data from the GIN-indexed tsv column.
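To confirm the index is actually used, we can EXPLAIN the search predicate (a sketch; index_products_on_tsv is Rails' default name for our add_index call, and the exact plan depends on table size and statistics):

```sql
EXPLAIN ANALYZE
SELECT id
FROM products
WHERE tsv @@ to_tsquery('english', 'wool');
-- On a large table, expect a Bitmap Index Scan on index_products_on_tsv
-- rather than a sequential scan with per-row to_tsvector calls.
```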
We've now improved the speed of our queries by introducing a tsvector column to cache lexemes. The trigger will keep the lexemes up-to-date as products are created and updated, without any daily cron job to run.

Happy searching.