Lean & Maintainable Reports with Heroku Dataclips

Many organizations are hungry for data to help the team understand how customers are using their services. On a recent project the business owners needed a report but intended to hold off building an admin-y report-packed hub until core customer-facing functionality was complete.

Pacman eating CSV files

All reports take effort: Writing the database query, maintaining it as data structures change, and allowing business users to customize the report all cost time. Layer on supporting different output formats & security and one can see the gulf between cobbling together a casual database query and building a reporting tool with ongoing value.

The ideal solution for the client also needed to be self-service — let’s face it — no one wants to be the guardian of a jumble of SQL scripts and field ad-hoc requests to run them.

A solution

Heroku Dataclips is a data sharing tool that displays the results of a SQL query run against your Heroku PostgreSQL database. Dataclips are awesome because:

Sharing is easy via hard-to-guess public URLs (or you can enable authentication).
Export options include CSV, XLS, JSON.
The CSV URL is embeddable in Google Sheets using ImportData("https://dataclips.heroku.com/<id>.csv").
Business users can make adjustments to the SQL query if they need to.
Results are cached to minimize load on your database.

Forge & forget?

Although the Dataclips interface features a handy built-in editor, one doesn’t have to immediately jump to writing SQL. One point to bear in mind is keeping future maintenance costs small. If a database column is renamed during a migration this will break your report and a CSV-starved colleague will seek you out.

One approach to protect ourselves from brittle reports is to generate the SQL query with an ORM and test it at the same time. This has a double benefit. First, testing helps us verify a complex query returns the correct data. Second, a breaking migration will cause the test to fail, in turn prompting our diligent selves to update the query and the dataclip.

Example time

Here’s a simplified example which lists cinema bookings and information about the associated user.

Report

# app/reports/reports/bookings.rb

module Reports
  class Bookings
    def to_sql
      query.to_sql
    end

    private

    def query
      Booking.
        joins(:user).
        select("bookings.id").
        select("users.name AS client_name").
        select("users.email AS client_email").
        select("bookings.price AS price_paid")
    end
  end
end

Spec

# spec/reports/reports/bookings_report_spec.rb

require "rails_helper"

RSpec.describe Reports::Bookings do
  describe "#to_sql" do
    subject(:query) { described_class.new.to_sql }

    let(:report) { ActiveRecord::Base.connection.execute(query) }

    it "includes the session details" do
      booking = create(:booking)
      row = report.first

      expect(row["id"]).to eq booking.id
      expect(row["client_name"]).to eq booking.user.name
      expect(row["client_email"]).to eq booking.user.email
      expect(row["price_paid"]).to eq booking.price.to_s
    end

    it "has a row per client booking" do
      user = create(:user)
      booking1 = create(:booking, user: user)
      booking2 = create(:booking, user: user)

      report_booking_ids = report.map { |row| row["id"] }

      expect(report_booking_ids).to contain_exactly booking1.id, booking2.id
    end

    # etc
  end
end

Rake task

# lib/tasks/reports.rake

namespace :reports do
  desc "Generate SQL queries for Heroku Dataclips"
  task generate_sql: :environment do
    reports_path = Rails.root.join("tmp", "sql")
    timestamp = Time.current.strftime("%F@%H-%M-%S")

    FileUtils.mkdir_p(reports_path)

    Dir.chdir(reports_path) do
      File.write("bookings.#{timestamp}.sql", Reports::Bookings.new.to_sql)
    end

    puts "Reports SQL saved to:", reports_path
  end
end

The public interface of Reports::Bookings deliberately exposes only #to_sql, rather than the ActiveRecord Relation object from #query. That’s because the raw SQL is all that’s required to satisfy the Rake task.

There are benefits to the spec using the raw SQL too. It helps us verify the query can be executed standalone. Also the expectations are then written against an array of hashes, which are closer to the rows of strings in a CSV output, without the hassle of executing the Rake task and reading a CSV file in the tests.

Pulling it all together

Running the following command in the terminal creates the SQL files:

bundle exec rake reports:generate_sql

Which outputs the SQL files to tmp/sql, ready to paste into a dataclip:

SELECT bookings.id,
       users.name AS client_name,
       users.email AS client_email,
       bookings.price AS price_paid
FROM "bookings"
INNER JOIN "users" ON "users"."id" = "bookings"."user_id"

A vapourware cherry on top

What do you think would make Dataclips even more useful? Top of my wishlist would be updating dataclip queries via the Heroku CLI.

Picture your deployment script: Pushing changes to Heroku, running migrations, and updating dataclips could all be tidily bundled together. The upside is that reports and the database schema should remain in sync.

#!/bin/sh

# Run this script to deploy the app to Heroku.

set -eu

branch="$(git symbolic-ref HEAD --short)"
target="${1:-staging}"

# Backup
heroku pg:backups capture --remote "$target"

# Deploy
git push "$target" "$branch:master"

# Migrate
heroku run rails db:migrate --exit-code --remote "$target"
heroku restart --remote "$target"

# Upsert dataclips
bundle exec rake reports:generate_sql
heroku dataclips upsert --files tmp/sql --remote "$target"

Preferably the CLI would allow upserts — providing one combined command to create dataclips for new queries, and update dataclips for modified queries.

A minor quandary would be associating SQL files with dataclips. Perhaps the dataclip ID could be derived from the account name + SQL filename, or a YAML manifest could map the two sides. Hopefully this is a stumbling block worth navigating!