Every now and again we need to change actual data in the production database. The first obvious option that comes to mind is to use a Rails migration, especially since the word “migration” is already in the task at hand, a “data migration.” But let’s talk about it some more and let me try to dissuade you from doing so.
Looking the at Rails Guides for Active Records Migration, the first section starts by saying:
Migrations are a feature of Active Record that allows you to evolve your database schema over time. Rather than write schema modifications in pure SQL, migrations allow you to use an easy Ruby DSL to describe changes to your tables.
Have you noticed that the word “data” is absent in the above paragraph? By definition, Rails migrations should only be used for schema changes and not for actual data changes in the database.
Generally speaking, manipulating data in migrations is
a bad idea for a few reasons.
For one, data migrations files will
stay in the db/migrate
directory for posterity,
and will run whenever a new developer sets their local development environment.
This is not very future proof.
For example, future changes to a class and its logic
can easily break the migration later on.
Furthermore, this is not business logic
and thus should not stay forever in the code base.
A second issue is that those data migrations might
be ignored by future developers if instead of running rake db:migrate
the developers run rake db:schema:load
or rake db:reset
.
Both commands merely load the latest version of the database structure
using the schema.rb
file without touching the migrations.
A third issue is that your application deployment is now dependent on the data migration to be completed. This might not be a problem when your application is new and your database is small. But what about large databases with millions of records? Your deployment will now have to wait for the data manipulation to be finished and that is just asking for trouble, with possible hanging or failed migrations.
I would like to suggest a better alternative using temporary rake tasks. Temporary rake tasks allow us to decouple a deployment from completed migrations. It gives us more control of the data manipulation process by encapsulating it in one place. The downside is that we need remember to either add this rake task to our deployment script or run the rake task manually after deployment. We will also need to clean up after ourselves and remove the temporary rake task once the changes have been deployed and implemented.
When creating a temporary rake task for a data migration, you might be tempted to write something like the following code:
# lib/tasks/temporary/users.rake
namespace :users do
task :set_newsletter => :environment do
User.all.each do |user|
if user.confirmed?
user.receive_newsletter = true
user.save
end
end
end
end
There are five problems with the code above:
- the task goes through every single user;
- it invokes validations and callbacks, which may have unintended consequences;
- it uses an
if
block to check whether a user needs to be updated; - it does not give us a visual indication that it is actually working; and
- it does not include a description and thus
we will not see the task when running
rake -T
.
Here is a suggestion for a better rake task:
# lib/tasks/temporary/users.rake
namespace :users do
desc "Update confirmed users to receive newsletter"
task set_newsletter: :environment do
users = User.confirmed
puts "Going to update #{users.count} users"
ActiveRecord::Base.transaction do
users.each do |user|
user.mark_newsletter_received!
print "."
end
end
puts " All done now!"
end
end
In this case:
- it includes a description so we will be able to see the task
and its description when running
rake -T
; - it uses a scope to fetch only the records that need to be changed,
thus removing the need for the
if
block and limiting the number of records that need to be touched; - it tells us in advance how many records will be manipulated, gives us a visual indication as it is working, and when it is done; and
- it wraps the change in a transaction.
If your database supports transactions, it is always a good idea to
wrap the code that actually changes the data with a transaction
.
Transactions help to deal with crashes, failures, and data consistency.
They are extremely important when
working with multiple objects to be changed
and we want to ensure data integrity.
For example when moving money around, we want to guarantee that
there will not be a situation where money is withdrawn from one account,
but not deposited to another account.
The examples above are very short. If the data change requires more actions, consider pulling out that behaviour into its own method.
If our database is quite large, it is also recommended to run the data migration in batches. You can read more about batches on the Rails Guides at Retrieving Multiple Objects in Batches.
Last but not least, if you ever use the Suspenders gem
to generate your base Rails application with thoughtbot’s standard defaults,
you might have noticed that it adds a lib/tasks/dev.rake
file
and a rake dev:prime
to your rake tasks.
This task should be used to add data to the development environment
so that seeds.rb
can be kept for data that is required in any environment.
In summary, by creating a temporary rake task, we get the benefits of more robust deployments, non-destructive migrations, and more control over the data manipulation process. And did I remember to mention that we should always do a dry run of our data migration rake task on development and staging first?