---
title: Data Migrations in Rails
teaser: A discussion about the reasons for using Active Record migrations vs. temporary
  rake tasks for data manipulations.
tags: rails,data,suspenders,web
author: Elle Meredith
published_on: 2015-12-08
---

Every now and again we need to change actual data in the production database.
The first obvious option that comes to mind is to use a Rails migration,
especially since the word "migration" is
already in the task at hand, a "data migration."
But let's talk about it some more and let me try to dissuade you from doing so.

Looking the at [Rails Guides for Active Records Migration],
the first section starts by saying:

> Migrations are a feature of Active Record that
  allows you to evolve your database schema over time.
  Rather than write schema modifications in pure SQL,
  migrations allow you to use an easy Ruby DSL
  to describe changes to your tables.

[Rails Guides for Active Records Migration]: http://edgeguides.rubyonrails.org/active_record_migrations.html

Have you noticed that the word "data" is absent in the above paragraph?
By definition, Rails migrations should only be used for schema changes
and not for actual data changes in the database.

Generally speaking, manipulating data in migrations is
a bad idea for a few reasons.
For one, data migrations files will
stay in the `db/migrate` directory for posterity,
and will run whenever a new developer sets their local development environment.
This is not very future proof.
For example, future changes to a class and its logic
can easily break the migration later on.
Furthermore, this is not business logic
and thus should not stay forever in the code base.

A second issue is that those data migrations might
be ignored by future developers if instead of running `rake db:migrate`
the developers run `rake db:schema:load` or `rake db:reset`.
Both commands merely load the latest version of the database structure
using the `schema.rb` file without touching the migrations.

A third issue is that your application deployment
is now dependent on the data migration to be completed.
This might not be a problem when your application is new
and your database is small.
But what about large databases with millions of records?
Your deployment will now have to wait for the data manipulation to be finished
and that is just asking for trouble, with possible hanging or failed migrations.

I would like to suggest a better alternative using temporary rake tasks.
Temporary rake tasks allow us to decouple a deployment from completed migrations.
It gives us more control of the data manipulation process
by encapsulating it in one place.
The downside is that we need remember to either add this rake task
to our deployment script or run the rake task manually after deployment.
We will also need to clean up after ourselves and
remove the temporary rake task
once the changes have been deployed and implemented.

When creating a temporary rake task for a data migration,
you might be tempted to write something like the following code:

```ruby
# lib/tasks/temporary/users.rake
namespace :users do
  task :set_newsletter => :environment do
    User.all.each do |user|
      if user.confirmed?
        user.receive_newsletter = true
        user.save
      end
    end
  end
end
```

There are five problems with the code above:

1. the task goes through every single user;
1. it invokes validations and callbacks,
   which may have unintended consequences;
1. it uses an `if` block to check whether a user needs to be updated;
1. it does not give us a visual indication that it is actually working; and
1. it does not include a description and thus
   we will not see the task when running `rake -T`.

Here is a suggestion for a better rake task:

```ruby
# lib/tasks/temporary/users.rake
namespace :users do
  desc "Update confirmed users to receive newsletter"
  task set_newsletter: :environment do
    users = User.confirmed
    puts "Going to update #{users.count} users"

    ActiveRecord::Base.transaction do
      users.each do |user|
        user.mark_newsletter_received!
        print "."
      end
    end

    puts " All done now!"
  end
end
```

In this case:

1. it includes a description so we will be able to see the task
   and its description when running `rake -T`;
1. it uses a scope to fetch only the records that need to be changed,
   thus removing the need for the `if` block
   and limiting the number of records that need to be touched;
1. it tells us in advance how many records will be manipulated,
   gives us a visual indication as it is working, and when it is done; and
1. it wraps the change in a transaction.

If your database supports [transactions], it is always a good idea to
wrap the code that actually changes the data with a `transaction`.
Transactions help to deal with crashes, failures, and data consistency.
They are extremely important when
working with multiple objects to be changed
and we want to ensure data integrity.
For example when moving money around, we want to guarantee that
there will not be a situation where money is withdrawn from one account,
but not deposited to another account.

[transactions]: https://en.wikipedia.org/wiki/Database_transaction

The examples above are very short. If the data change requires more actions,
consider pulling out that behaviour into its own method.

If our database is quite large, it is also recommended to
run the data migration in batches. You can read more about
batches on the Rails Guides at [Retrieving Multiple Objects in Batches].

[Retrieving Multiple Objects in Batches]:http://guides.rubyonrails.org/active_record_querying.html#retrieving-multiple-objects-in-batches

Last but not least, if you ever use the [Suspenders] gem
to generate your base Rails application with thoughtbot's standard defaults,
you might have noticed that it adds a `lib/tasks/dev.rake` file
and a `rake dev:prime` to your rake tasks.
This task should be used to add data to the development environment
so that `seeds.rb` can be kept for data that is required in any environment.

In summary, by creating a temporary rake task,
we get the benefits of more robust deployments,
non-destructive migrations,
and more control over the data manipulation process.
And did I remember to mention that we should always do a dry run
of our data migration rake task on development and staging first?

[Suspenders]:https://github.com/thoughtbot/suspenders
