---
title: The Durable Document Store You Didn't Know You Had, But Did
teaser:
tags: web,postgresql
author: Harold Giménez
published_on: 2011-12-06
---

As it turns out, PostgreSQL has a number of ways of storing loosely structured
data &#8212; documents &#8212; in a column on a table.

* _`[hstore](http://www.postgresql.org/docs/current/static/hstore.html)`_ is a
  data type available as a contrib package that allows you to store key/value
  structures just like a dictionary or hash.
* You can store data in _JSON_ format on a `text` field, and then use
  [PLV8](http://code.google.com/p/plv8js/wiki/PLV8) to `JSON.parse()` it [right
  in the database](http://pgeu-plv8.herokuapp.com/).
* There is a native `xml` data type, along with a few interesting query
  functions that allow you to extract and operate on data that sits deep in an
  <abbr title="Extensible Markup Language">XML</abbr> structure.

XML Storage is the topic of this post.

The <abbr title="Extensible Markup Language">XML</abbr> data type has been
Postgres core since version 8.3 (and as part of a contrib package prior to
that). Documents that span more than one data block will be
[TOASTed](http://www.postgresql.org/docs/current/static/storage-toast.html) as
usual, so storing big documents should not be a problem. In essence, it is a
`text` data type, but it does bring a few features to the _table_.

For starters, it will perform basic balance integrity checks on the data. After
all, this _is_ Postgres, where data integrity is high up in the priority.

Postgres also supports a number of [utility
functions](http://www.postgresql.org/docs/9.1/static/xml2.html) for extracting
data from <abbr title="Extensible Markup Language">XML</abbr> data. Most
notably, you can use XPath expressions to extract portions of the <abbr
title="Extensible Markup Language">XML</abbr> document. You can use the
extracted data in queries as you would anything else. For example:

    xml_test=# CREATE TABLE beers(
      id serial primary key,
      data xml
    )
    [...]
    CREATE TABLE
    xml_test=# \d beers
                             Table "public.beers"
     Column |  Type   |                     Modifiers
    --------+---------+----------------------------------------------------
     id     | integer | not null default nextval('beers_id_seq'::regclass)
     data   | xml     |
    Indexes:
        "beers_pkey" PRIMARY KEY, btree (id)
    xml_test=# INSERT INTO beers (data) values ('<beer><name>Harpoon</name><location>Boston, USA</location></beer>');
    INSERT 0 1
    xml_test=# INSERT INTO beers (data) values ('<beer><name>Guinness</name><location>Dublin, Ireland</location></beer>');
    INSERT 0 1
    xml_test=# INSERT INTO beers (data) values ('<beer><name>Polar</name><location>Caracas, Venezuela</location><type>Pilsner</type></beer>');
    INSERT 0 1
    xml_test=# select * from beers;
     id |                                            data
    ----+--------------------------------------------------------------------------------------------
      1 | <beer><name>Harpoon</name><location>Boston, USA</location></beer>
      2 | <beer><name>Guinness</name><location>Dublin, Ireland</location></beer>
      3 | <beer><name>Polar</name><location>Caracas, Venezuela</location><type>Pilsner</type></beer>
    (3 rows)

We've basically created a beers table with three entries containing some <abbr
title="Extensible Markup Language">XML</abbr> data. You could use that data in a
number of ways, including bringing it into your application, deserializing and
parsing it there. This may be acceptable in many cases, especially if you have
also included foreign keys to other relations in your data model, or any other
data that you can use to filter data down. However Postgres offers the ability
to use XPath to extract data from the <abbr title="Extensible Markup
Language">XML</abbr> type directly:

    xml_test=# SELECT xpath('//beer/name', data) from beers;
             xpath
    ------------------------
     {<name>Harpoon</name>}
     {<name>Guinness</name>}
     {<name>Polar</name>}
    (3 rows)

The curly braces in Postgres indicate an array, so the xpath function returns an
array of elements that match your query. You can index the first element of that
array with square bracket notation:

    xml_test=# SELECT (xpath('//beer/name', data))[1]::text from beers;
            xpath
    ----------------------
     <name>Harpoon</name>
     <name>Guinness</name>
     <name>Polar</name>
    (3 rows)

And further, use the XPath `text()` function to extract the text within the
matched <abbr title="Extensible Markup Language">XML</abbr> nodes:

    xml_test=# SELECT (xpath('//beer/name/text()', data))[1]::text from beers;
      xpath
    ---------
     Harpoon
     Guinness
     Polar
    (3 rows)

This is useful for extracting data from your XML. Now let's use it to query for
beers in Venezuela:

    xml_test=# SELECT * from beers where (xpath('//beer/location/text()', data))[1]::text = 'Caracas, Venezuela'::text;
     id |                                            data
    ----+--------------------------------------------------------------------------------------------
      3 | <beer><name>Polar</name><location>Caracas, Venezuela</location><type>Pilsner</type></beer>
    (1 row)

Now we're getting somewhere. With a bit of creativity we can create a generic
finder for your ORM of choice that finds records by XPath expression. Here's a
simple one for ActiveRecord

    class Beer < ActiveRecord::Base
      def self.by_xpath(xpaths)
        xpaths.inject(scoped) do |s, (xpath, value)|
          s.where("(xpath(?, data))[1]::text = ?", xpath, value)
        end
      end
    end

Use it:

    Beer.by_xpath('//beer/location' => 'Caracas, Venezuela', '//beer/name' => 'Polar')

You may be thinking this can't possibly be fast to query. Postgres doesn't allow
you to add any indexes on the <abbr title="Extensible Markup
Language">XML</abbr> data type itself, but on the other hand any <abbr
title="Structured Query Language">SQL</abbr> expression can be used in an index.
Therefore you can choose to index part of the <abbr title="Extensible Markup
Language">XML</abbr> document using the `xpath` function:

    xml_test=# create index index_beers_on_location ON beers USING GIN ( CAST(xpath('//beer/location/text()', data) as TEXT[]) );
    CREATE INDEX

Because we're indexing an array value, we use a Generalized Inverted index (GIN).

## When should I use this

Some may argue that it does not make any sense to store <abbr title="Extensible
Markup Language">XML</abbr> data directly in Postgres, after all it is a
relational database and there are databases like CouchDB or MongoDB that are
designed to store document data. Storing it in your main data store has its
advantages though:

* You don't need to maintain a new database in your infrastructure. Any service
  that is required to run your app adds complexity that is not always justified.
* <abbr title="Extensible Markup Language">XML</abbr> data backups continue to
  occur as part of your main database backups.
* The <abbr title="Extensible Markup Language">XML</abbr> data can be used to
  reference and JOIN other data in your data model.

Anecdotically, I recently used this strategy for storing raw <abbr
title="Extensible Markup Language">XML</abbr> data that we are receiving in the
background from an API. We parse it, normalize some of it out to our own data
model, but keep the raw <abbr title="Extensible Markup Language">XML</abbr>
around. This is useful because we can easily refer back to the source data to
verify things when they seem off, or to extract and normalize more data out of
it. However we don't use it to display data back to the user or for reporting
&#8212; we just keep it around and refer back to it from the backend. It's
working great!
