Full-text search with PostgreSQL and Action Text

Sean Doyle

Action Text excels at capturing user-provided rich text content and rendering the way end-users intended by editing and storing HTML content. However, once that content is encoded and stored as HTML, it’s becomes indecipherable to search engines.

Wouldn’t it be great if the same database that stores the HTML content could find the proverbial needle in that content’s haystack?

Setting up

We’ll start with an out-of-the-box Rails installation generated by an rails new command, and skip the majority of the application’s setup (including the Action Text installation steps).

If you’d like to read the rest of this article’s source code (including a test suite!), it can be found on GitHub.

Scaffolding our model

We’ll create our Article model’s scaffolding with Rails’ model generator to serve as a starting point for our extensions and customizations:

bin/rails generate model \
  Article \
  title:text \
  --no-fixture
bin/rails db:migrate

We’ll also add a has_rich_text :content declaration to the model so that it accepts rich text content for its content attribute:

--- b/app/models/article.rb
+++ b/app/models/article.rb

class Article < ApplicationRecord
+  has_rich_text :content
end

Searching our content with SQL ILIKE

Postgres supports the LIKE and ILIKE keywords for loose pattern matching:

string LIKE pattern [ESCAPE escape-character]

The LIKE expression returns true if the *string* matches the supplied *pattern*.

If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE acts like the equals operator. An underscore (_) in pattern stands for (matches) any single character; a percent sign (%) matches any sequence of zero or more characters.

To test it out, declare a with_content_containing scope that joins onto the related ActionText::RichText model, and implemented with an ILIKE query:

--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content
+
+  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
+    body ILIKE ?
+  SQL
  end

Using an ILIKE clause achieves our desired outcome, but has some quirks and drawbacks. For instance, ILIKE won’t account for any HTML characters, so while a search for "needle in the haystack" matches content like <div>The text of this article matches has the needle in the haystack<div>, it won’t match content with HTML interspersed (e.g. <div>The text of this article matches has the <strong>needle</strong> in the haystack</div>.

Ignoring HTML debris

Let’s address the major drawback: HTML junk in our content.

To do so, we’ll add a plain_text_body column to our action_text_rich_texts table to mirror the existing body column so that our ActionText::RichText models can write to and search from it. We’ll create the migration file through Rails’ migration generator:

# Created by:
#
#  bin/rails generate migration AddPlainTextBodyToActionTextRichTexts \
#    plain_text_body:text
#
class AddPlainTextBodyToActionTextRichTexts < ActiveRecord::Migration[7.0]
  def change
    add_column :action_text_rich_texts, :plain_text_body, :text
  end
end

To write to the new action_text_rich_texts.plain_text_body, we’ll declare a before_save callback so that creating or updating an Article instance transforms by invoking ActionText::RichText#to_plain_text. The call to to_plain_text will remove all HTML syntax, leaving behind only the text content of the body to be written to plain_text_body:

--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content
+
+  before_save { content.plain_text_body = content.body.to_plain_text }

    scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
      body ILIKE ?
    SQL
  end

Now that we have a column free of HTML entities, we can point our existing query to it in support querying of values like:

<div>
  The text of this Article contains the <strong>needle</strong> in the haystack.
</div>

While this is a major improvement in what’s possible to query, there are still some quirks and drawbacks. For instance, querying with ILIKE does not account for “stop words” like “in”, “of”, or “the”. This means that while a query like "needle in the haystack" will match "The text of this Article contains the needle in the haystack.", "needle haystack" will not.

--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content

    before_save { content.plain_text_body = content.body.to_plain_text }

    scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
-    body ILIKE ?
+    plain_text_body ILIKE ?
    SQL
  end

We’ll need to ignore stop words and boost the performance of our query with text search vectors by transforming content with to_tsvector.

PostgreSQL full-text search works by comparing collections of text search tokens, and ignoring stop words like “in”, “of”, and “the”. To query a column for a search term, both the column and the query must be converted to text search vectors via to_tsvector. We’ll convert the query string by passing it to websearch_to_tsquery, and convert the plain_text_body column by passing it to to_tsvector:

--- a/app/models/article.rb
+++ b/app/models/article.rb

    before_save { content.plain_text_body = content.body.to_plain_text }

-  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
-    plain_text_body ILIKE ?
+  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) }
+    to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
    SQL
  end

To improve querying performance, declare a Generalized Inverted Index (GIN) index on action_text_rich_texts.plain_text_body to store the to_tsvector return value:

class AddTsvectorIndexToActionTextRichTexts < ActiveRecord::Migration[7.0]
  def change
    add_index :action_text_rich_texts,"to_tsvector('english', plain_text_body)", using: :gin, name: "tsvector_body_idx"
  end
end

While declaring the before_save callback and with_content_containing scope within the Article model has utility, its value is limited to Article instances despite the implementation being tied to ActionText::RichText concepts.

Let’s generalize our full-text search wins by re-opening the ActionText::RichText class through an ActiveSupport.on_load hook, in the config/initializers/action_text_rich_text.rb initializer, making it accessible to future models that have rich text:

ActiveSupport.on_load :action_text_rich_text do
  before_save { self.plain_text_body = body.to_plain_text }

  scope :with_body_containing, ->(query) { where <<~SQL, query }
    to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
  SQL
end

After extracting the code to an initializer, we can change our scope :with_body_containing implementation to depend on the ActionText::RichText.with_body_containing scope:

--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content

-  before_save { content.plain_text_body = content.body.to_plain_text }
-
-  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) }
-    to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
-  SQL
+  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.with_body_containing(query)) }
  end

Wrapping up

We’ve unlocked PostgreSQL full-text searching capabilities for content that was previously indecipherable to its search engine. To do so, we’ve extended existing Action Text concepts and classes to purge rich-text content of its HTML, encoded that content into a text-searchable vector, and stored it in a way that is performant to retrieve. We’ve done so with the tools already at our disposal: Active Record and PostgresSQL. All without adding any additional system-level dependencies or services!

Special thanks to George Claghorn, Javan Makhmali, and Sam Stephenson for their collective work on Action Text and Trix.