Postgres full-text search is awesome,
but without tuning, searching large columns can be slow.
Introducing a tsvector column to cache lexemes
and using a trigger to keep the lexemes up-to-date
can improve the speed of full-text searches.
This article shows how to accomplish that in Rails.
Using pg_search with Rails
We want our users to search for products.
Let’s add the pg_search gem
to our Rails app:
gem "pg_search"
Then, configure it for our Product model:
class Product < ActiveRecord::Base
  include PgSearch

  pg_search_scope(
    :search,
    against: %i(
      description
      manufacturer_name
      name
    ),
    using: {
      tsearch: {
        dictionary: "english",
      }
    }
  )
end
This example shows configuration for
full-text search of the products table’s
description, manufacturer_name, and name columns.
See Implementing Multi-Table Full Text Search with Postgres in Rails
for an example of full-text search of multiple tables.
Our pg_search_scope is named :search,
so we can invoke it with:
Product.search("wool")
We’re explicitly specifying the :tsearch option
(which is the default Postgres full-text search)
in order to use the english dictionary instead of the default
simple dictionary.
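We can compare the two dictionaries directly in psql (the input string is our own example):

```sql
-- The simple dictionary only lowercases and splits the text:
SELECT to_tsvector('simple', 'Running Shoes');
-- 'running':1 'shoes':2

-- The english dictionary also stems words to their root forms:
SELECT to_tsvector('english', 'Running Shoes');
-- 'run':1 'shoe':2
```

With the english dictionary, a search for "run" will match a product named "Running Shoes"; with the simple dictionary, it won't.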
What we get for a SQL query
Wonderful. We have full-text searching set up in minutes.
Now, what does our SQL query look like?
SELECT products.*
FROM products
INNER JOIN (
  SELECT products.id AS pg_search_id,
    (
      ts_rank(
        (
          to_tsvector('english', coalesce(products.description::text, '')) ||
          to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
          to_tsvector('english', coalesce(products.name::text, ''))
        ),
        (
          to_tsquery('english', ''' ' || 'wool' || ' ''')
        ),
        0
      )
    ) AS rank
  FROM products
  WHERE (
    (
      (
        to_tsvector('english', coalesce(products.description::text, '')) ||
        to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
        to_tsvector('english', coalesce(products.name::text, ''))
      ) @@
      (
        to_tsquery('english', ''' ' || 'wool' || ' ''')
      )
    )
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0
This is all pretty standard SQL plus a few cool functions:
ts_rank, to_tsvector, and to_tsquery.
The to_tsvector function is worth a closer look.
It generates tsvector data types,
which are “a sorted list of distinct lexemes.”
Lexemes, in turn, are “words that have been normalized
to make different variants of the same word look alike”.
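We can watch that normalization happen in psql (the input words are our own example):

```sql
SELECT to_tsvector('english', 'jumping jumped jumps');
-- 'jump':1,2,3
```

All three variants collapse into the single lexeme jump, with each word's position preserved.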
For example, given the following product:
Product.create(
description: "Michael Kors",
name: "Sunglasses",
manufacturer_name: "Michael Kors"
)
The tsvector looks like:
'kor':2,4,6 'michael':1,3,5 'sunglass':7
The resulting lexemes were “normalized to make different variants” by lowercasing, removing suffixes, etc. The lexemes were sorted into a list and the numbers represent the position of the lexeme in the original strings.
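The other two functions can also be tried directly in psql (the strings here are our own examples). to_tsquery builds a query of lexemes, @@ tests whether a tsvector matches it, and ts_rank scores the match:

```sql
-- Does the document match the query? (the query is stemmed too,
-- so 'sunglasses' normalizes to 'sunglass' and matches)
SELECT to_tsvector('english', 'Michael Kors Sunglasses')
       @@ to_tsquery('english', 'sunglasses');
-- t

-- How relevant is the match? Returns a float; higher ranks better.
SELECT ts_rank(
  to_tsvector('english', 'Michael Kors Sunglasses'),
  to_tsquery('english', 'sunglasses')
);
```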
For tons of awesome examples and details on these three functions, see Postgres full-text search is Good Enough!
Caching tsvector lexemes
On a large products table,
our searches may be slow.
If so, we have some tuning options.
One option would be to cache the tsvectors using a materialized view.
Read Caching with Postgres materialized views or
Postgres full-text search is Good Enough! (again)
for more information on materialized views with Postgres and Ruby.
Materialized views may be a good option for your data. One downside is that the entire view must be refreshed with:
REFRESH MATERIALIZED VIEW view_name;
That may be a good fit in some scenarios,
perhaps run daily as a cron or Heroku Scheduler job.
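As a sketch of that approach (the view and column names here are illustrative, not from our app), the view would precompute the same tsvector expression our query builds at run time:

```sql
CREATE MATERIALIZED VIEW product_searches AS
  SELECT products.id AS product_id,
    (
      to_tsvector('english', coalesce(products.description::text, '')) ||
      to_tsvector('english', coalesce(products.manufacturer_name::text, '')) ||
      to_tsvector('english', coalesce(products.name::text, ''))
    ) AS tsv
  FROM products;

-- Run on a schedule, e.g. nightly, to pick up new and changed rows:
REFRESH MATERIALIZED VIEW product_searches;
```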
In our case,
we want a cache to be updated
when a Product is created or updated.
Let’s edit our pg_search_scope:
  using: {
    tsearch: {
+     tsvector_column: "tsv",
    }
  }
Since we can’t dump a tsvector column to schema.rb,
we need to switch to the SQL schema format
in our config/application.rb:
config.active_record.schema_format = :sql
Remove the now-unnecessary db/schema.rb:
rm db/schema.rb
And generate a migration:
class AddTsvectorColumns < ActiveRecord::Migration
  def up
    add_column :products, :tsv, :tsvector
    add_index :products, :tsv, using: "gin"

    execute <<-SQL
      CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
      ON products FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(
        tsv, 'pg_catalog.english', description, manufacturer_name, name
      );
    SQL

    now = Time.current.to_s(:db)
    update("UPDATE products SET updated_at = '#{now}'")
  end

  def down
    execute <<-SQL
      DROP TRIGGER tsvectorupdate
      ON products
    SQL

    remove_index :products, :tsv
    remove_column :products, :tsv
  end
end
This change introduces a tsv column of type tsvector to search against,
a GIN index on the new column,
a trigger that fires BEFORE INSERT OR UPDATE on products
to populate tsv from the three searchable columns,
and a backfill UPDATE for existing products:
touching updated_at fires the trigger,
which fills in tsv for every existing row.
Postgres has a built-in tsvector_update_trigger function
to make this easier.
The GIN index could alternatively be a GiST index. See the GIN vs. GiST tradeoffs.
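If GiST fits better (for example, on a write-heavy table where cheaper index updates matter more than lookup speed), only the index type changes. The equivalent SQL, with a hypothetical index name, would be:

```sql
-- GiST variant; in the migration this is add_index with using: "gist"
CREATE INDEX index_products_on_tsv ON products USING gist (tsv);
```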
Here’s the resulting query with the new tsvector-type column:
SELECT products.*
FROM products
INNER JOIN (
  SELECT products.id AS pg_search_id,
    (
      ts_rank(
        (products.tsv),
        (to_tsquery('english', ''' ' || 'wool' || ' ''')), 0
      )
    ) AS rank
  FROM products
  WHERE (
    ((products.tsv) @@ (to_tsquery('english', ''' ' || 'wool' || ' ''')))
  )
) pg_search ON products.id = pg_search.pg_search_id
ORDER BY pg_search.rank DESC
LIMIT 24
OFFSET 0
We can see that our run-time to_tsvector function calls are gone,
and the query now reads the cached tsvector data
from the GIN-indexed tsv column instead.
We’ve now improved the speed of our queries
by introducing a tsvector column to cache lexemes.
The trigger will keep the lexemes up-to-date
as products are created and updated,
without any daily cron job to run.
Happy searching.