Action Text excels at capturing user-provided rich text content and rendering the way end-users intended by editing and storing HTML content. However, once that content is encoded and stored as HTML, it’s becomes indecipherable to search engines.
Wouldn’t it be great if the same database that stores the HTML content could find the proverbial needle in that content’s haystack?
Setting up
We’ll start with an out-of-the-box Rails installation generated by an rails
new
command, and skip the majority of the application’s setup (including the
Action Text installation steps).
If you’d like to read the rest of this article’s source code (including a test suite!), it can be found on GitHub.
Scaffolding our model
We’ll create our Article
model’s scaffolding with Rails’ model
generator
to serve as a starting point for our extensions and customizations:
bin/rails generate model \
Article \
title:text \
--no-fixture
bin/rails db:migrate
We’ll also add a has_rich_text :content
declaration to the model so that
it accepts rich text content for its content
attribute:
--- b/app/models/article.rb
+++ b/app/models/article.rb
class Article < ApplicationRecord
+ has_rich_text :content
end
Searching our content with SQL ILIKE
Postgres supports the LIKE and ILIKE keywords for loose pattern matching:
string LIKE pattern [ESCAPE escape-character]
The
LIKE
expression returnstrue
if the*string*
matches the supplied*pattern*
.If
pattern
does not contain percent signs or underscores, then thepattern
only represents thestring
itself; in that caseLIKE
acts like the equals operator. An underscore (_
) in pattern stands for (matches) any single character; a percent sign (%
) matches any sequence of zero or more characters.
To test it out, declare a with_content_containing
scope that joins onto the
related ActionText::RichText model, and implemented with an ILIKE
query:
--- a/app/models/article.rb
+++ b/app/models/article.rb
class Article < ApplicationRecord
has_rich_text :content
+
+ scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
+ body ILIKE ?
+ SQL
end
Using an ILIKE
clause achieves our desired outcome, but has some
quirks and drawbacks. For instance, ILIKE
won’t account for any HTML
characters, so while a search for "needle in the haystack"
matches
content like <div>The text of this article matches has the needle in
the haystack<div>
, it won’t match content with HTML interspersed (e.g.
<div>The text of this article matches has the <strong>needle</strong>
in the haystack</div>
.
Ignoring HTML debris
Let’s address the major drawback: HTML junk in our content.
To do so, we’ll add a plain_text_body
column to our action_text_rich_texts
table to mirror the existing body
column so that our ActionText::RichText
models can write to and search from it. We’ll create the migration file through
Rails’ migration
generator:
# Created by:
#
# bin/rails generate migration AddPlainTextBodyToActionTextRichTexts \
# plain_text_body:text
#
class AddPlainTextBodyToActionTextRichTexts < ActiveRecord::Migration[7.0]
def change
add_column :action_text_rich_texts, :plain_text_body, :text
end
end
To write to the new action_text_rich_texts.plain_text_body
, we’ll declare a
before_save
callback so that creating or updating an Article
instance
transforms by invoking ActionText::RichText#to_plain_text
. The call to
to_plain_text
will remove all HTML syntax, leaving behind only the text
content of the body
to be written to plain_text_body
:
--- a/app/models/article.rb
+++ b/app/models/article.rb
class Article < ApplicationRecord
has_rich_text :content
+
+ before_save { content.plain_text_body = content.body.to_plain_text }
scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
body ILIKE ?
SQL
end
Querying with plain-text search
Now that we have a column free of HTML entities, we can point our existing query to it in support querying of values like:
<div>
The text of this Article contains the <strong>needle</strong> in the haystack.
</div>
While this is a major improvement in what’s possible to query, there
are still some quirks and drawbacks. For instance, querying with ILIKE
does not account for “stop words” like “in”, “of”, or “the”. This means
that while a query like "needle in the haystack"
will match "The text
of this Article contains the needle in the haystack."
, "needle
haystack"
will not.
--- a/app/models/article.rb
+++ b/app/models/article.rb
class Article < ApplicationRecord
has_rich_text :content
before_save { content.plain_text_body = content.body.to_plain_text }
scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
- body ILIKE ?
+ plain_text_body ILIKE ?
SQL
end
Querying with full-text search
We’ll need to ignore stop words and boost the performance of our query with
text search vectors by transforming content
with to_tsvector
.
PostgreSQL full-text search works by comparing collections of text search tokens,
and ignoring stop words like “in”, “of”, and “the”. To query a column for a
search term, both the column and the query must be converted to text search
vectors via to_tsvector
. We’ll convert the query string by passing it to
websearch_to_tsquery
, and convert the plain_text_body
column by passing
it to to_tsvector
:
--- a/app/models/article.rb
+++ b/app/models/article.rb
before_save { content.plain_text_body = content.body.to_plain_text }
- scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
- plain_text_body ILIKE ?
+ scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) }
+ to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
SQL
end
To improve querying performance, declare a Generalized Inverted Index
(GIN) index on
action_text_rich_texts.plain_text_body
to store the to_tsvector
return value:
class AddTsvectorIndexToActionTextRichTexts < ActiveRecord::Migration[7.0]
def change
add_index :action_text_rich_texts,"to_tsvector('english', plain_text_body)", using: :gin, name: "tsvector_body_idx"
end
end
Generalizing our search
While declaring the before_save
callback and with_content_containing
scope within the Article
model has utility, its value is limited to
Article
instances despite the implementation being tied to
ActionText::RichText
concepts.
Let’s generalize our full-text search wins by re-opening the
ActionText::RichText
class through an ActiveSupport.on_load
hook,
in the config/initializers/action_text_rich_text.rb
initializer,
making it accessible to future models that have rich text:
ActiveSupport.on_load :action_text_rich_text do
before_save { self.plain_text_body = body.to_plain_text }
scope :with_body_containing, ->(query) { where <<~SQL, query }
to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
SQL
end
After extracting the code to an initializer, we can change our
scope :with_body_containing
implementation to depend on the
ActionText::RichText.with_body_containing
scope:
--- a/app/models/article.rb
+++ b/app/models/article.rb
class Article < ApplicationRecord
has_rich_text :content
- before_save { content.plain_text_body = content.body.to_plain_text }
-
- scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) }
- to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
- SQL
+ scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.with_body_containing(query)) }
end
Wrapping up
We’ve unlocked PostgreSQL full-text searching capabilities for content that was previously indecipherable to its search engine. To do so, we’ve extended existing Action Text concepts and classes to purge rich-text content of its HTML, encoded that content into a text-searchable vector, and stored it in a way that is performant to retrieve. We’ve done so with the tools already at our disposal: Active Record and PostgresSQL. All without adding any additional system-level dependencies or services!
Special thanks to George Claghorn, Javan Makhmali, and Sam Stephenson for their collective work on Action Text and Trix.