---
title: Protecting User Data in HIPAA Compliant Staging Environments
teaser: 'How to populate your staging environment with data while keeping user data
  secure.

  '
tags: health tech,security,data,web
author: Sweta Sanghavi
published_on: 2020-03-06
---

A staging environment is a powerful tool for testing code changes before deploying
them to production. The more closely a staging environment mirrors production,
the more breaking changes you can catch before they reach your users.
You may wonder how to populate a staging database with data that mimics production
data, while keeping sensitive user data private and secure.

Protecting user information is a concern for all applications, but it is of special
concern to applications in industries with regulations around protecting user
information. For example, health tech products need to comply with the Health
Insurance Portability and Accountability Act (HIPAA). HIPAA requires safeguards
that protect access to Protected Health Information (PHI). Exposing real user data
in staging means staging needs those safeguards too, which adds overhead and expense.

Let’s explore two different strategies to populate a staging database with data.
These approaches can be used for local development environments as well.

## Seed Development and Staging

One way to populate the staging environment with data is to start with a robust
seed data file and then have developers add additional data through the user interface
(UI) of the application. The seed file builds the data necessary to display the
bare-bones version of the application. It often includes things like seeding an admin
user and some objects necessary for the app’s important UI views to display.

This is the path we would recommend for most products. A benefit of this approach
is that there is no need for added security on the staging environment since it has
no real user data. Also, since additional data comes from the UI, the database should
have data that is similar to what is in production.

One downside is that it takes time to build up a high volume of data, since the
development team has to populate it themselves through the UI. Staging likely won’t
have the same volume of data as production, so you may not catch certain bugs by
testing on staging. For example, an inefficient data migration that touches a lot of
data may run without errors on staging but still cause errors on production.

Be mindful to use test phone numbers and email addresses in place of real ones,
especially if your staging application sends emails or SMS messages. For more
protection, use a sandbox environment when a third-party API provides one.
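
As a minimal sketch of what "test phone numbers and email addresses" can look like,
the helpers below generate obviously fake contact details. The method and constant
names are hypothetical and not part of any library. The sketch relies on two
reserved ranges: the `example.com` domain, which is set aside for documentation, and
the `555-0100` through `555-0199` phone numbers, which are set aside for fictional use.

```ruby
# Hypothetical helpers for illustration; adapt the names to your own seeds.
TEST_EMAIL_DOMAIN = 'example.com' # reserved for documentation use

# Builds a unique, obviously fake email address for the nth test user.
def test_email(index)
  "user#{index}@#{TEST_EMAIL_DOMAIN}"
end

# Builds a fake phone number ending in 0100..0199, the range reserved
# for fictional use, so no real person can receive a staging SMS.
def test_phone(index)
  format('555555%04d', 100 + (index % 100))
end

# True when an email address points at the reserved test domain.
def test_email?(email)
  email.to_s.downcase.end_with?("@#{TEST_EMAIL_DOMAIN}")
end
```

A check like `test_email?` could run before staging sends any message, as one extra
guard on top of a third-party sandbox.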

### Seed Data Best Practices

This approach needs to be paired with collective ownership of the seed file. When
new UI elements are added (e.g., a new index page), seeds to populate those
views also need to be added. A practice of seeding your development and staging
environments with the seed file, and regularly dropping the database and reseeding,
can ensure that your seeds are robust. This practice regularly surfaces missing
pieces of data. If UI elements you expect to see are absent, or pages break because
of missing data, add seed data to fill in those gaps. Developers using the
development environment can even reseed multiple times a day, giving your team quick
feedback when seed data is incomplete.

Your seed file should be idempotent, allowing developers to rerun it when
necessary without errors or duplicate records. Idempotency also makes it easier to
test your seeds as you add new ones. You may find yourself repeating certain user
flows in your application multiple times to build enough data. In those scenarios,
consider adding seeds that create that data instead.

An example seed file:

```ruby
puts 'Seeding an admin user and company'

admin_role = Role.find_or_create_by(
  name: Role::ADMIN
)

# Look the user up by name only, and assign roles when the record is created.
admin_user = User.find_or_create_by(name: 'Super User') do |user|
  user.roles = [admin_role]
end

company_location = Location.find_or_create_by(
  address: '123 Knope Way, Pawnee, IN'
)

company = Company.find_or_create_by(
  name: 'Chips and Stuff',
  location: company_location,
  phone: '5555555555',
  email: 'example@example.com'
)

blue = Color.find_or_create_by(
  name: Color::BLUE,
  hex_code: '#0099ff'
)

green = Color.find_or_create_by(
  name: Color::GREEN,
  hex_code: '#009900'
)

CompanyColor.find_or_create_by(
  company: company,
  color: blue
)

CompanyColor.find_or_create_by(
  company: company,
  color: green
)
```
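
To illustrate why `find_or_create_by` keeps the seed file above safe to rerun, here
is a plain-Ruby sketch. `FakeTable` is a hypothetical in-memory stand-in for a
database table, not ActiveRecord; it only imitates the find-or-create behavior so
the effect of running seeds twice is visible without a database.

```ruby
# Hypothetical in-memory stand-in for a database table; not ActiveRecord.
class FakeTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Always inserts, so rerunning seeds that use this would duplicate rows.
  def create(attributes)
    @rows << attributes
    attributes
  end

  # Returns the first row matching every given attribute, or inserts a new
  # row when none matches. This imitates find-or-create semantics.
  def find_or_create_by(attributes)
    existing = @rows.find do |row|
      attributes.all? { |key, value| row[key] == value }
    end
    existing || create(attributes)
  end
end

colors = FakeTable.new

# Simulate running the seed file twice.
2.times do
  colors.find_or_create_by(name: 'blue', hex_code: '#0099ff')
  colors.find_or_create_by(name: 'green', hex_code: '#009900')
end

puts colors.rows.size # prints 2, not 4
```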

## Bowdlerize Production Data

Another option is to copy production data and replace identifying user information.
This can take the form of a script that periodically takes a dump of production
data and, for every field that could be used to identify a user, replaces the real
value with developer-created data.
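
The replacement step of such a script might look like the sketch below. It assumes
records are plain hashes with an `:id` key, and the `IDENTIFYING_FIELDS` list and
`bowdlerize` method are hypothetical names chosen for illustration. A real script
would work against your own schema, and deciding which fields count as identifying
is the hard part that this sketch does not solve.

```ruby
# Hypothetical list of identifying fields; the real list depends on your
# schema and on what your compliance requirements treat as identifying.
IDENTIFYING_FIELDS = %i[name email phone address].freeze

# Returns a copy of the record with identifying fields replaced by
# developer-created values. Values are derived from the record id, so the
# same record gets the same fake name and email on every run.
def bowdlerize(record)
  id = record.fetch(:id)
  replacements = {
    name: "Test User #{id}",
    email: "user#{id}@example.com",
    phone: '5555555555',
    address: '123 Knope Way, Pawnee, IN'
  }
  # Only overwrite identifying fields that the record actually has.
  fields_present = IDENTIFYING_FIELDS & record.keys
  record.merge(replacements.slice(*fields_present))
end

production_rows = [
  { id: 1, name: 'Leslie Knope', email: 'leslie@pawnee.gov', plan: 'premium' }
]

staging_rows = production_rows.map { |row| bowdlerize(row) }
p staging_rows.first
```

Non-identifying fields such as `plan` pass through untouched, which is what lets
staging keep the shape and volume of production data. Free-text fields (notes,
comments) can also contain identifying details and would need their own handling.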

A benefit of this approach is that data on staging closely resembles production data.
The data captures how your users actually use the UI, rather than how your
development team seeds it programmatically. This process also produces lots of data
quickly, instead of needing to build it up through the UI. Having a high volume of
data allows staging to better resemble production in showing the effect of
data-intensive tasks.

While there are benefits to this approach, it can be difficult to implement. It adds
overhead to your development process, as your team will need to pull data and run
the bowdlerizing script frequently to keep your staging database schema in sync with
production. There is also a risk that the script leaves behind enough user-related
data to trace records back to a real user. That would violate a compliance
requirement, which is a costly bug. The risks of this method outweigh the benefits
for most products.

## Wrapping Up

These are a few strategies that we've seen work well in our experience with
HIPAA-compliant staging environments. If you're interested in learning more about thoughtbot
or how we work with health tech teams, check out [our work](https://thoughtbot.com/services/health-tech).