---
title: Full-text search with PostgreSQL and Action Text
teaser: 'Search through user-provided HTML content with PostgreSQL and Active Record.

  '
tags: rails,postgresql,actiontext,activerecord
author: Sean Doyle
published_on: 2021-05-17
---

[Action Text][] excels at capturing user-provided rich text content and
rendering the way end-users intended by [editing][] and storing HTML content.
However, once that content is encoded and stored as HTML, it's becomes
indecipherable to search engines.

Wouldn't it be great if the same database that stores the HTML content could
find the proverbial needle in that content's haystack?

## Setting up

We'll start with an out-of-the-box Rails installation generated by an `rails
new` command, and skip the majority of the application's setup (including the
[Action Text installation][] steps).

If you'd like to read the rest of this article's source code (including a test
suite!), it can be found [on GitHub][].

[on GitHub]: https://github.com/seanpdoyle/action-text-postgres-full-text-search/commits/main
[editing]: https://trix-editor.org
[Action Text]: https://edgeguides.rubyonrails.org/action_text_overview.html
[Action Text installation]: https://edgeguides.rubyonrails.org/action_text_overview.html#installation

## Scaffolding our model

We'll create our `Article` model's [scaffolding][] with Rails' `model` generator
to serve as a starting point for our extensions and customizations:

```sh
bin/rails generate model \
  Article \
  title:text \
  --no-fixture
bin/rails db:migrate
```

We'll also add a [`has_rich_text :content`][] declaration to the model so that
it accepts rich text content for its `content` attribute:

```diff
--- b/app/models/article.rb
+++ b/app/models/article.rb

class Article < ApplicationRecord
+  has_rich_text :content
end
```

[scaffolding]: https://edgeguides.rubyonrails.org/getting_started.html#mvc-and-you-generating-a-model
[`has_rich_text :content`]: https://edgeapi.rubyonrails.org/classes/ActionText/Attribute.html#method-i-has_rich_text

## Searching our content with SQL ILIKE

Postgres supports the [LIKE and ILIKE][] keywords for loose pattern
matching:

```
string LIKE pattern [ESCAPE escape-character]
```

> The `LIKE` expression returns `true` if the `*string*` matches the
> supplied `*pattern*`.

> If `pattern` does not contain percent signs or underscores, then the
> `pattern` only represents the `string` itself; in that case `LIKE`
> acts like the equals operator. An underscore (`_`) in pattern stands
> for (matches) any single character; a percent sign (`%`) matches any
> sequence of zero or more characters.

To test it out, declare a `with_content_containing` scope that joins onto the
related [ActionText::RichText][] model, and implemented with an `ILIKE` query:

```diff
--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content
+
+  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
+    body ILIKE ?
+  SQL
  end
```

Using an `ILIKE` clause achieves our desired outcome, but has some
quirks and drawbacks. For instance, `ILIKE` won't account for any HTML
characters, so while a search for `"needle in the haystack"` matches
content like `<div>The text of this article matches has the needle in
the haystack<div>`, it won't match content with HTML interspersed (e.g.
`<div>The text of this article matches has the <strong>needle</strong>
in the haystack</div>`.

[LIKE and ILIKE]: https://www.postgresql.org/docs/12/functions-matching.html#FUNCTIONS-LIKE
[ActionText::RichText]: https://edgeapi.rubyonrails.org/classes/ActionText/RichText.html

## Ignoring HTML debris

Let's address the major drawback: HTML junk in our content.

To do so, we'll add a `plain_text_body` column to our `action_text_rich_texts`
table to mirror the existing `body` column so that our `ActionText::RichText`
models can write to and search from it. We'll create the migration file through
Rails' `migration` generator:

```ruby
# Created by:
#
#  bin/rails generate migration AddPlainTextBodyToActionTextRichTexts \
#    plain_text_body:text
#
class AddPlainTextBodyToActionTextRichTexts < ActiveRecord::Migration[7.0]
  def change
    add_column :action_text_rich_texts, :plain_text_body, :text
  end
end
```

To write to the new `action_text_rich_texts.plain_text_body`, we'll declare a
[`before_save`][] callback so that creating or updating an `Article` instance
transforms by invoking [`ActionText::RichText#to_plain_text`][]. The call to
`to_plain_text` will remove all HTML syntax, leaving behind only the text
content of the `body` to be written to `plain_text_body`:

```diff
--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content
+
+  before_save { content.plain_text_body = content.body.to_plain_text }

    scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
      body ILIKE ?
    SQL
  end
```

### Querying with plain-text search

Now that we have a column free of HTML entities, we can point our
existing query to it in support querying of values like:

```html
<div>
  The text of this Article contains the <strong>needle</strong> in the haystack.
</div>
```

While this is a major improvement in what's _possible_ to query, there
are still some quirks and drawbacks. For instance, querying with `ILIKE`
does not account for "stop words" like "in", "of", or "the". This means
that while a query like `"needle in the haystack"` will match `"The text
of this Article contains the needle in the haystack."`, `"needle
haystack"` will not.

```diff
--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content

    before_save { content.plain_text_body = content.body.to_plain_text }

    scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
-    body ILIKE ?
+    plain_text_body ILIKE ?
    SQL
  end
```

[`before_save`]: https://edgeapi.rubyonrails.org/classes/ActiveRecord/Callbacks.html
[`ActionText::RichText#to_plain_text`]: https://edgeapi.rubyonrails.org/classes/ActionText/RichText.html#method-i-to_plain_text

### Querying with full-text search

We'll need to ignore stop words *and* boost the performance of our query with
text search vectors by transforming `content` with `to_tsvector`.

PostgreSQL full-text search works by comparing collections of text search tokens,
and ignoring [stop words][] like "in", "of", and "the". To query a column for a
search term, both the column and the query must be converted to text search
vectors via [`to_tsvector`][]. We'll convert the query string by passing it to
[`websearch_to_tsquery`][], and convert the `plain_text_body` column by passing
it to [`to_tsvector`][]:

```diff
--- a/app/models/article.rb
+++ b/app/models/article.rb

    before_save { content.plain_text_body = content.body.to_plain_text }

-  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, "%" + query + "%") }
-    plain_text_body ILIKE ?
+  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) }
+    to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
    SQL
  end
```

To improve querying performance, declare a Generalized Inverted Index
(<abbr title="Generalized Inverted Index">[GIN][]</abbr>) index on
`action_text_rich_texts.plain_text_body` to store the [`to_tsvector`][]
return value:

```ruby
class AddTsvectorIndexToActionTextRichTexts < ActiveRecord::Migration[7.0]
  def change
    add_index :action_text_rich_texts,"to_tsvector('english', plain_text_body)", using: :gin, name: "tsvector_body_idx"
  end
end
```

[`to_tsvector`]: https://www.postgresql.org/docs/12/textsearch-controls.html#TEXTSEARCH-PARSING-DOCUMENTS
[stop words]: https://www.postgresql.org/docs/12/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS
[`websearch_to_tsquery`]: https://www.postgresql.org/docs/12/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
[GIN]: https://www.postgresql.org/docs/12/textsearch-indexes.html#TEXTSEARCH-INDEXES

## Generalizing our search

While declaring the `before_save` callback and `with_content_containing`
scope within the `Article` model has utility, its value is limited to
`Article` instances despite the implementation being tied to
`ActionText::RichText` concepts.

Let's generalize our full-text search wins by re-opening the
`ActionText::RichText` class through an [`ActiveSupport.on_load`][] hook,
in the `config/initializers/action_text_rich_text.rb` initializer,
making it accessible to future models that have rich text:

```ruby
ActiveSupport.on_load :action_text_rich_text do
  before_save { self.plain_text_body = body.to_plain_text }

  scope :with_body_containing, ->(query) { where <<~SQL, query }
    to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
  SQL
end
```

After extracting the code to an initializer, we can change our
`scope :with_body_containing` implementation to depend on the
`ActionText::RichText.with_body_containing` scope:

```diff
--- a/app/models/article.rb
+++ b/app/models/article.rb

  class Article < ApplicationRecord
    has_rich_text :content

-  before_save { content.plain_text_body = content.body.to_plain_text }
-
-  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.where <<~SQL, query) }
-    to_tsvector('english', plain_text_body) @@ websearch_to_tsquery(?)
-  SQL
+  scope :with_content_containing, ->(query) { joins(:rich_text_content).merge(ActionText::RichText.with_body_containing(query)) }
  end
```

## Wrapping up

We've unlocked PostgreSQL full-text searching capabilities for content that was
previously indecipherable to its search engine. To do so, we've extended
existing Action Text concepts and classes to purge rich-text content of its
HTML, encoded that content into a text-searchable vector, and stored it in a way
that is performant to retrieve. We've done so with the tools already at our
disposal: Active Record and PostgresSQL. All without adding any additional
system-level dependencies or services!

Special thanks to [George Claghorn](https://georgeclaghorn.com), [Javan
Makhmali](https://javan.us), and [Sam
Stephenson](https://twitter.com/sstephenson) for their collective work on Action
Text and Trix.

[`ActiveSupport.on_load`]: https://guides.rubyonrails.org/engines.html#available-load-hooks
