Executive guide to DevOps, Deployment, and Maintenance

You’re a business leader working to scale and optimize a promising digital product. You know your industry, your market, and your prospects well, but you are a bit out of your comfort zone when it comes to deploying and maintaining an application. You have choices to make about how to maintain and scale the code to keep your product stable and responsive and your data secure. Product development requires upskilling your current team, which will likely disrupt their essential work for your business.

We talked with a few of thoughtbot’s development leaders to offer the following overview of development, deployment, and maintenance—what you should know, and when, in order to ensure the success of your product.

For starters, a few definitions

You will hear thoughtbot’s teams talk about a number of services, especially these three:

DevOps: Simply defined, DevOps combines development and operations functions with an emphasis on collaboration and communications. With the goal of breaking down silos between teams, DevOps speeds the delivery of features to market, reduces post-deployment defects and incidents, and ensures quality, security, and compliance. DevOps is also an integral part of thoughtbot’s culture. From our open Slack channels to our open source tooling, we focus on enterprise-level open collaboration and communications to improve and expedite the experience for our teams, clients, and end users.
SRE (Site Reliability Engineering): SRE connects the user experience back to the developer experience. We look at your site’s reliability and identify issues that require an engineer’s input. SRE focuses on keeping a product running smoothly through better metrics and increased observability. With SRE, developers can confidently release new, high-quality features, scale your application for exponential growth, and migrate cloud platforms.
Platform Engineering: Platform engineers combine technology and tools to allow self-service capabilities and workflows for developers. After partnering to deliver code into production, platform engineers manage deployment to cloud servers and offer ongoing support to ensure your site remains reliable—available and responsive—while maintaining a rapid pace of feature development.

All three functions are interrelated. “It’s like setting up a tent: You’ve got to have all the poles to put it up. No matter where you choose to focus, eventually you have to coordinate all of it,” says Richard Newman, a thoughtbot development director.

When should you start paying attention to DevOps?

Once your website has grown so critical to your business that you can’t afford to have it go down for a day, you’re ready to embrace DevOps.

“You need to think about the reliability of your process to produce changes,” agrees Newman, “and the minimum time to deploy those changes. If it takes longer than a day to make simple changes, you should probably be focusing on DevOps. Are your customers experiencing downtime? If they tell you they’re frustrated, that might indicate it’s time to focus on site reliability.”

Additional sure signs that you need to move DevOps from the back burner: your developers are moving slower than you’d like, your developers are unhappy because of how long it takes to deploy changes, the quality of what you’re releasing is declining, and/or your development costs are rising.

Deprioritizing DevOps means asking for trouble: “It’s like you keep developing new features without thinking about the plumbing system, but the pipes are about to burst. It’s extremely stressful for everyone on the project when things start breaking and there are outages.”

What to know about cloud providers

The prevalence of cloud providers today means you have a range of prices and services to choose from for hosting your application, without having to invest in your own servers. In general, choosing a cloud provider takes care of a lot of the system operations, providing redundancy and, in most cases, disaster recovery services.

“Cloud providers mean you don’t have to reinvent the wheel,” says Fritz Meissner, Development Team Lead. “Someone has already solved a lot of potential problems.”

The costs of cloud hosting range from free or a few dollars a month for a small-scale site or start-up application to hundreds of thousands of dollars for enterprise applications. In the Ruby on Rails web application community, where thoughtbot’s developers are established experts, Salesforce’s Heroku is the biggest and best-known out-of-the-box hosting option. Amazon, Google, and Microsoft all offer cloud hosting; Amazon Web Services, or AWS, is the best known.

“It’s hard to do a lot better than something like AWS; it offers the best intersection of costs, features, and familiarity. Google’s version offers comparable tools and resources, but it’s just not in the same league of maturity and sophistication.” AWS also offers a robust billing section with projections for monthly costs, allowing you to experiment with variables to adjust the hosting package.

Finding the right team and setup: What do I need to consider?

If your core application involves a number of unique functions, you’ll most likely want to hire your own in-house development team.

“There isn’t a binary point where we say, ‘Now you need to hire a team.’ Ideally we’re working with people who already have people so there isn’t a large learning curve for picking up the code. Our process at thoughtbot is to help train people along the way. A foundational aspect of our work has always been to bring the clients’ teams along with us.”

As for whether to buy or build the tools for your site, Meissner recommends paying for tools if they already exist. “You have to compare the cost of paying people to build and the likelihood they will make mistakes versus the costs for a ‘platform in a box’ and the development costs to make that solution work for your specific circumstances.”

“With thoughtbot, what you’re really buying is engineering time with us,” he says. He also notes that the best tools are open source, “so they’re free. Except that you’re investing in the tool, and you’re contributing back to the development community.” As an example, Flight Deck, thoughtbot’s dashboard platform that offers a central overview of multiple metrics and helps identify and prevent issues, is an open source platform freely available to all.

Best practices for securing data, user privacy, and your application

Creating security and privacy policies is essential for peace of mind as your business grows. Your policies, which thoughtbot can help you create, should establish who is authorized to change your software.

“Give people the least authority they need to do their job,” advises Newman. “This is very important as you grow to an enterprise scale.”

“The principle of least access is the notion that if developers need access to deploy code, maybe they don’t need direct access to the production database, for example. It’s basically designing things in a defensive way where everyone shouldn’t have access to everything. You’re minimizing the surface area for how things interact.”
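The least-access idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the role names (`developer`, `dba`) and permission names (`deploy_code`, `read_production_db`, and so on) are hypothetical and are not taken from any real system or from thoughtbot's tooling. In practice this kind of rule usually lives in your cloud provider's access settings rather than in application code.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "developer": {"deploy_code", "read_logs"},
    "dba": {"read_production_db", "write_production_db"},
}


def is_allowed(role: str, permission: str) -> bool:
    """Return True only if the role was explicitly granted the permission."""
    # Unknown roles map to an empty set, so access is denied by default.
    return permission in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("developer", "deploy_code"))         # True
print(is_allowed("developer", "read_production_db"))  # False
```

The design choice worth noticing is the default: anything not explicitly granted is refused, so a new role or a typo results in less access, not more.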

Once your policies are established, test them with some internal dry runs, preparing for emergency situations such as hardware failures, major bugs, hacker attacks, and more. Monitoring and observability come into play here. Monitoring means you have a tool that tells you whether your site is up or down, records how fast your site is working, and flags any errors in the code. Observability means you can actually see what is happening inside the code. Most hosting services offer monitoring tools, but observability tells you where any problems are coming from.

“Observability is the notion that you don’t want key parts of how something works to be opaque. Monitoring tells you how many resources the system is using. It takes the pulse of the application, checks the blood pressure.” Adds Newman, “It doesn’t take very long for your site to be big enough that you really wish you could see where the bugs are coming from, otherwise you’re flying blind.”
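To make the monitoring half of this concrete, here is a minimal sketch in Python of a check that records whether a site is up and how fast it responds. It uses only the standard library. The names `ProbeResult`, `probe`, and `classify`, the 2-second "slow" threshold, and the rule that any 5xx status counts as "down" are all assumptions made for illustration; a hosted monitoring service would do considerably more.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    up: bool                # did the site answer without a server error?
    status: Optional[int]   # HTTP status code, or None if unreachable
    seconds: float          # how long the request took


def classify(result: ProbeResult, slow_after: float = 2.0) -> str:
    """Turn a raw probe into one of three states: down, slow, or ok."""
    if not result.up:
        return "down"
    if result.seconds > slow_after:
        return "slow"
    return "ok"


def probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Request the URL once and time the response."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx or 5xx status.
        status = error.code
    except OSError:
        # DNS failure, refused connection, or timeout: site is unreachable.
        return ProbeResult(False, None, time.monotonic() - started)
    return ProbeResult(status < 500, status, time.monotonic() - started)
```

A scheduled job could call `classify(probe(url))` against your own health-check address every minute. Note what this does not tell you: it reports that the site is slow, not why. Answering the "why" is the job of observability tooling.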

“Don’t have too many alerts, but have them be good and actionable alerts. Otherwise they just become background noise as everyone gets flooded with emails. A page at 2 a.m. is very unpleasant background noise. So establish what are the most critical things you want to know about, what merits waking you up at 2 a.m. and what needs to happen then.”
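One way to picture "few, actionable alerts" is as a routing rule that decides up front which events are allowed to wake someone. The Python sketch below is an assumption-laden illustration: the severity levels (`critical`, `warning`, `info`) and the destinations (`page`, `ticket`, `digest`) are hypothetical names, not the vocabulary of any particular alerting product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    severity: str  # hypothetical levels: "critical", "warning", "info"


def route(alert: Alert) -> str:
    """Decide where an alert goes; only critical alerts page a person."""
    if alert.severity == "critical":
        return "page"    # worth waking someone at 2 a.m.
    if alert.severity == "warning":
        return "ticket"  # handled during working hours
    return "digest"      # summarized later, never interrupts anyone


print(route(Alert("site is down", "critical")))  # page
print(route(Alert("disk 70% full", "warning")))  # ticket
```

The useful exercise is less the code than the conversation it forces: for each alert, someone has to decide which of the three buckets it belongs in.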

Paying for well-built infrastructure with on-call alerting is well worth the investment, allowing your team to manage who is on call and the system for escalation if someone is sick or on vacation.
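The escalation logic that such a system handles for you can be sketched as follows. This is a simplified Python illustration, assuming a fixed escalation order and a known set of people who are out; the function name `first_available` and the names in the example are hypothetical. Real on-call products add schedules, time zones, and acknowledgement timeouts on top of this.

```python
from typing import Iterable, Optional, Set


def first_available(escalation_order: Iterable[str],
                    unavailable: Set[str]) -> Optional[str]:
    """Return the first person in the escalation order who is not out."""
    for person in escalation_order:
        if person not in unavailable:
            return person
    # Nobody can respond: the caller should treat this as a gap in coverage.
    return None


order = ["primary", "secondary", "manager"]
print(first_available(order, unavailable={"primary"}))  # secondary
```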

How to build resilience for your app

Frequent “fire alarms” are like an early infection, a warning that things could quickly get worse, the developers say. And if your site’s downtime is interfering with customer service or other aspects of your business, it’s going to become an expensive problem. Monthly fire alarms can soon become weekly occurrences as your business and the demand on your site grow.

“When you start seeing those pain points, you need to be investing the time to assess the root cause of the problems. You’ll want to do some testing to make sure things are working the way they should. It could be that the hosting service isn’t large enough or we need to do some caching. Those types of hypotheses need to be suggested and questions need to be asked. Post-mortems, assessing things after something breaks, are important.”

“Lots of companies will throw more servers at the problem,” agrees Newman. “But that gets expensive and is really just throwing Band-Aids at the issue.”

Having a cloud provider and the redundancy it offers, a way to monitor and observe your site, and clear policies for how to handle emergencies all contribute to your site’s resilience—and they all depend on some level of testing. “Test your response to security breaches. Test your backups and try to restore your site from time to time. Try to see how fast you can deploy what we call a hot fix, a quick change for some surprise bug,” says Newman. “Don’t just say, ‘I hope it will be fine, I know it will be fine.’ Testing drives the resilience you want for the long term.”

“After working with thoughtbot, production and staging environments match, and we are building out even more environments to verify code quality and security. Downtime is rare and when it occurs, it is easy to roll back changes. We easily support exponentially more users with no impact to system performance.”

Denis Vilela, Lead Engineer, Branching Minds

Still have questions?

thoughtbot has developed a platform called Flight Deck for deploying and managing applications on Kubernetes with a curated set of pre-configured open source Terraform modules and AWS products. The AWS platform guide documents our approach and provides guidance for both platform engineers and developers using the platform.

Check out the DevOps best practices our team is writing and talking about

Don’t hesitate to reach out to your thoughtbot team for advice and other questions.