Have you ever needed to write a query that combined data from two PostgreSQL databases? Our client Healthify has.
Since Healthify work with healthcare and store some patient information, they’re very careful not to send any personally identifiable information to third party services.
One such service is Segment, which Healthify use to track user activity on the site. Conveniently, Segment can send your data to a PostgreSQL data warehouse, so Healthify has two PostgeSQL databases full of useful information:
- Their app database, which their Rails app uses, and
- a reporting database, with anonymised data from Segment.
To better understand how people are using the application, we wanted to answer some questions that needed data from both sources. We wanted to filter the reporting data using attributes from the app data, without having to pass those attributes to Segment. This blog post explains what we did.
Foreign Data Wrapper
PostgreSQL has a useful feature called Foreign Data Wrapper, which lets you create foreign tables in a PostgreSQL database that are proxies for some other data source. When you make a query against a foreign table, the Foreign Data Wrapper will query the external data source and return the results as if they were coming from a table in your database.
There are two foreign data wrappers that ship with PostgreSQL:
file_fdw
to create foreign tables that represent flat files (Postgres 9.1 and later), andpostgres_fdw
to create foreign tables that represent tables in another PostgreSQL database (Postgres 9.3 and later).
You can also define your own wrapper,
or use an third party wrapper
(we’ve used tds_fdw
on other thoughtbot projects),
but for Healthify’s situation postgres_fdw
was exactly what we needed.
Before we could create foreign tables in our reporting database to proxy the tables in our app database, we needed to do a bit of setup in the reporting database.
Install the
postgres_fdw
extension:CREATE EXTENSION postgres_fdw;
Create a server:
CREATE SERVER app_database_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'postgres.example.com', dbname 'my_app');
Create a user mapping, which defines the credentials that a user on the local server will use to make queries against the remote server:
CREATE USER MAPPING FOR CURRENT_USER SERVER app_database_server OPTIONS (user 'reporting', password 'secret123');
Since we’re using this for reporting, we chose to create a separate user in our app database specifically for Foreign Data Wrapper to use. That user only has read permissions, so a badly written reporting query can’t accidentally
INSERT
orUPDATE
data.
Now we’ve told Foreign Data Wrapper where the server is, and the credentials it can use to connect, we’re ready to define the foreign tables.
We could use CREATE FOREIGN TABLE
for this,
but we’d end up specifying all of the fields
for each table we want to access.
We’d effectively end up repeating our whole schema.
Fortunately,
Postgres 9.5 introduced the IMPORT FOREIGN SCHEMA
command:
CREATE SCHEMA app;
IMPORT FOREIGN SCHEMA public
FROM SERVER app_database_server
INTO app;
This will create foreign tables for all of the tables
from our app database’s public
schema
into our reporting database’s app
schema.
A schema in PostgreSQL is just a namespace for grouping tables.
If you haven’t explicitly specified a schema
then you’re implicitly using the public
schema.
We could import from the app database’s public
schema
into the reporting database’s public
schema,
but keeping all of the foreign tables in a separate schema
means we don’t have to worry about naming collisions.
Segment also recommends that if you’re going to put any additional tables in a PostgreSQL database that you’re using as a Segment warehouse you should make sure it’s in a separate schema:
If you want to insert custom data into your warehouse, create new schemas that are not associated with an existing source, since these may be deleted upon a reload of the Segment data in the cluster.
Now we’ve imported our schema, we can query the foreign tables as if they were normal, local tables.
SELECT COUNT(*) FROM app.users;
We can even JOIN
between a local table and a foreign table.
SELECT COUNT(*)
FROM
production.events
JOIN app.users
ON production.events.user_id::INT = app.users.id
WHERE
app.users.company_id = 10;
It ain’t all good
Handling schema changes
When we ran IMPORT FOREIGN SCHEMA
we created foreign tables
based on the current structure of the app database.
If the app database changes—if we add a table, or drop a column—our
foreign tables will need updating too.
To work around this we added some code to update our reporting database’s foreign tables every time we run the migrations on the production app.
Rake::Task["db:migrate"].enhance do
if ENV["UPDATE_FOREIGN_TABLES_AFTER_MIGRATE"]
Rake::Task["db:export_schema_for_reporting"].invoke
end
end
The UPDATE_FOREIGN_TABLES_AFTER_MIGRATE
environment variable
lets us control which environments use this task.
It’s always going to be useful in production,
but rarely in development.
The db:export_schema_for_reporting
task
runs all of the queries we’ve previously seen,
slightly modified to make them idempotent:
BEGIN;
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
DROP SERVER IF EXISTS app CASCADE;
CREATE SERVER … ;
CREATE USER MAPPING … ;
DROP SCHEMA IF EXISTS app;
CREATE SCHEMA app;
IMPORT FOREIGN SCHEMA … ;
COMMIT;
Rebuilding the server config
and foreign tables
every time we migrate
keeps our reporting server up-to-date.
Wrapping the whole thing in a transaction
ensures queries against the foreign tables
that are run while we’re running the db:export_schema_for_reporting
task
will be blocked until the task finishes,
but won’t fail.
There were some fringe benefits from running this from our Rails application:
we can work out what arguments to pass to CREATE SERVER
by reading our application’s database configuration.
Performance considerations
Once we had everything set up
we learnt that we couldn’t always treat foreign tables
exactly like local tables.
In particular,
the performance of queries that JOIN
over multiple foreign tables
was very poor.
Behind the scenes,
PostgreSQL was making separate queries to the app database
and then JOIN
ing them after the fact in the reporting database.
Re-writing our JOIN
s as sub-queries improved things significantly.
Now that Postgres 9.6 has been released, this should be less of a problem. The Postgres 9.6 release notes say:
postgres_fdw
now supports remote joins, sorts,UPDATE
s, andDELETE
s