A staging environment is a powerful tool for testing code changes before deploying them to production. The more closely a staging environment mirrors production, the more breaking changes you can catch before they reach your users. You may wonder how to populate a staging database with data that mimics production data while keeping sensitive user data private and secure.
Protecting user information is a concern for all applications, but it is of special concern to applications in industries with regulations around protecting user information. For example, health tech products need to comply with the Health Insurance Portability and Accountability Act (HIPAA). HIPAA mandates safeguards that protect access to Protected Health Information. Exposing user data adds overhead and expense.
Let’s explore two different strategies to populate a staging database with data. These approaches can be used for local development environments as well.
One way to populate the staging environment with data is to start with a robust seed data file and then have developers add additional data through the user interface (UI) of the application. The seed file builds the data necessary to display the bare-bones version of the application. It often includes things like seeding an admin user and some objects necessary for the app’s important UI views to display.
This is the path we recommend for most products. One benefit of this approach is that staging needs no added security, since it holds no real user data. Also, because additional data comes from the UI, the database should contain data similar to what is in production.

One downside is that it takes time to reach a high volume of data, since the development team has to populate the data themselves through the UI. Staging likely won’t have the same volume of data as production, so testing on staging may not catch certain bugs. For example, an inefficient data migration that affects a lot of data may not cause database errors on staging, but may still cause errors on production.

Be mindful to use test phone numbers and email addresses in place of real ones, especially if your staging application sends emails or SMS messages. For more protection, use sandbox environments where third-party APIs provide them.
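One way to produce the test contact details mentioned above is a small helper that derives them from a record id. This is a minimal sketch, not a definitive implementation: the `StagingContact` module and its method names are hypothetical. It relies on `example.com` being a domain reserved for documentation and testing, and on the 555-0100 to 555-0199 range being set aside for fictional use in North American numbering.

```ruby
# Hypothetical helper for generating contact details that can never reach a real person.
module StagingContact
  # example.com is reserved for documentation and testing (RFC 2606).
  TEST_EMAIL_DOMAIN = 'example.com'

  # Builds a deterministic test email address from a record id.
  def self.email_for(id)
    "user#{id}@#{TEST_EMAIL_DOMAIN}"
  end

  # Builds a ten-digit number whose last seven digits fall in 555-0100..555-0199.
  def self.phone_for(id)
    format('555555%04d', 100 + (id % 100))
  end
end

puts StagingContact.email_for(7)
puts StagingContact.phone_for(7)
```

Deriving the values from the id keeps them stable across reseeds, which makes staging data easier to recognize when debugging.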
This approach needs to be paired with collective ownership of the seed file. When new UI elements are added (e.g., a new index page), seeds to populate those views need to be added too. Regularly dropping the database and reseeding your development and staging environments from the seed file helps keep your seeds robust, because it surfaces missing pieces of data quickly. If UI elements you expect to see are absent, or pages break because data is missing, add seed data to fill in those gaps. Developers can even reseed their development environment multiple times a day, giving your team quick feedback when seed data is incomplete.
Your seed file should be idempotent, so developers can rerun it whenever necessary without errors. Idempotency also makes it easier to test your seeds as you add new ones. If you find yourself repeating certain user flows in your application many times to build up enough data, consider adding seeds that create that data instead.
An example seed file:
    puts 'Seeding an admin user and company'

    admin_role = Role.find_or_create_by(
      name: Role::ADMIN
    )

    admin_user = User.find_or_create_by(
      name: 'Super User',
      roles: [admin_role]
    )

    company_location = Location.find_or_create_by(
      address: '123 Knope Way, Pawnee, IN'
    )

    company = Company.find_or_create_by(
      name: 'Chips and Stuff',
      location: company_location,
      phone: '5555555555',
      email: 'firstname.lastname@example.org'
    )

    blue = Color.find_or_create_by(
      name: Color::BLUE,
      hex_code: '#0099ff'
    )

    green = Color.find_or_create_by(
      name: Color::GREEN,
      hex_code: '#009900'
    )

    CompanyColor.find_or_create_by(
      company: company,
      color: blue
    )

    CompanyColor.find_or_create_by(
      company: company,
      color: green
    )
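To illustrate why looking a record up before creating it makes a seed safe to rerun, here is a plain-Ruby sketch that runs without a database. `SeedTable` is a hypothetical in-memory stand-in for a table; it only imitates the find-then-create behavior that the seed file relies on and is not how Active Record is implemented.

```ruby
# SeedTable is a hypothetical in-memory stand-in for a database table.
class SeedTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Returns the row matching the given attributes, creating it only if absent.
  def find_or_create_by(attributes)
    rows.find { |row| row == attributes } || attributes.dup.tap { |row| rows << row }
  end
end

# A tiny seed routine: safe to call any number of times.
def seed(roles)
  roles.find_or_create_by(name: 'admin')
  roles.find_or_create_by(name: 'member')
end

roles = SeedTable.new
2.times { seed(roles) } # rerunning the seed must not add duplicates
puts roles.rows.length
```

A quick check like this, run against a real database after seeding twice, is one way to confirm that row counts stay the same between runs.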
Another option is to copy production data and replace identifying user information. This can take the form of a script that periodically takes a dump of production data and replaces any fields that could identify users with developer-created data.
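The replacement step of such a script might look roughly like the sketch below. It works on plain hashes so that it runs on its own. The `IDENTIFYING_FIELDS` list, the `scrub` method, and the column names are all assumptions for illustration; which fields count as identifying depends on your schema and your compliance requirements.

```ruby
# Hypothetical map of identifying columns to developer-created replacements.
# A real list must be derived from your own schema and compliance review.
IDENTIFYING_FIELDS = {
  name: ->(row) { "Test User #{row[:id]}" },
  email: ->(row) { "user#{row[:id]}@example.com" },
  phone: ->(_row) { '5555550100' }
}.freeze

# Returns a copy of the row with every identifying field replaced.
# Fields not listed above are copied through unchanged.
def scrub(row)
  IDENTIFYING_FIELDS.each_with_object(row.dup) do |(field, replacement), scrubbed|
    scrubbed[field] = replacement.call(row) if scrubbed.key?(field)
  end
end

production_row = {
  id: 42,
  name: 'Leslie Knope',
  email: 'leslie@pawnee.gov',
  phone: '3175551234',
  plan: 'premium'
}

puts scrub(production_row)
```

Keeping the list of identifying fields in one place makes it easier to review whenever the schema changes.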
A benefit of this approach is that data on staging closely resembles production data. The data captures how your users actually use the UI, rather than what your development team seeded programmatically. This process also produces lots of data quickly, instead of requiring you to build it up through the UI. Having a high volume of data allows staging to better resemble production in showing the effect of data-intensive tasks.

While there are benefits to this approach, it can be difficult to implement. It adds overhead to your development process, as your team will need to pull data and run the sanitizing script frequently to keep your staging database schema in sync with production. There is also a risk of leaving behind enough user-related data to trace back to a real user, which would violate a compliance requirement and prove a costly mistake. The risks of this method outweigh the benefits for most products.
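One partial guard against leftover identifying data is to scan the sanitized rows for values that still look real. The sketch below checks only for email-like strings outside an allowed test domain. The `leaked_emails` method, the pattern, and the sample rows are hypothetical, and a check like this cannot prove a dataset is de-identified.

```ruby
# Rough email pattern; non-capturing group so String#scan returns whole matches.
EMAIL_PATTERN = /\b[\w.+-]+@(?:[\w-]+\.)+[A-Za-z]{2,}\b/
# Assumed test domain used by the sanitizing script.
ALLOWED_DOMAIN = 'example.com'

# Returns every email-like string found in any string column
# that does not belong to the allowed test domain.
def leaked_emails(rows)
  found = rows.flat_map do |row|
    row.values.grep(String).flat_map { |value| value.scan(EMAIL_PATTERN) }
  end
  found.reject { |email| email.end_with?("@#{ALLOWED_DOMAIN}") }
end

sanitized_rows = [
  { id: 1, email: 'user1@example.com', notes: 'Call leslie@pawnee.gov after 5pm' },
  { id: 2, email: 'user2@example.com', notes: nil }
]

puts leaked_emails(sanitized_rows).inspect
```

Free-text columns such as notes are a common place for identifying details to survive, which is why the sketch scans every string value rather than only the email column.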
These are a few strategies that we’ve seen work well in our experience with HIPAA-compliant staging environments. If you’re interested in learning more about thoughtbot or how we work with health tech teams, check out our work.