SRE: The Next Big Thing from thoughtbot's Platform Engineering team

thoughtbot is proud to add Site Reliability Engineering to our DevOps, SRE & Cloud Platform team.

Site Reliability Engineering, or SRE, is a different approach to maintenance. It focuses on keeping a product running smoothly through better metrics and increased observability.

Our SRE team focuses on supporting products that need to scale, face reliability issues, or need to find the right balance of refactoring and feature development.

As thoughtbot’s first Site Reliability Engineer, I am excited about how this will strengthen delivery across all of our teams and, I hope, give product teams confidence about where to focus their efforts.

Fewer surprises

Every developer has gotten a panicked call at 2 a.m. when the whole system is down and no one can figure out what’s happened. SRE is our way of preventing that call, or at least making sure it doesn’t happen twice.

Development and operations teams have typically found it difficult to know what to monitor to get truly useful insight. They might track statistics like CPU usage or available memory, but nothing that helps them spot the problems that customers care about or that affect their experience. Most actionable information comes from customer reports, or from alarms meant to catch the same problems, but these alarms go off so frequently when nothing is wrong that they end up being ignored.

That’s a dangerous game. It assumes that because a product works now, it will always work. Code-based products are too complex for that kind of thinking. By the time the system has failed enough for users to notice, so many things might be wrong that it can cause a breakdown in trust with the end users, and turn into a sizable development effort to resolve.

SRE allows us to be more proactive, making both the customer and development team feel better.

How it works

SRE work starts with a code audit that diagnoses issues, focusing on identifying areas that are most important for monitoring. Then, we build observability tools to monitor those areas and surface that information to the development team.

Under most systems, the development team might get a note that says “This part of the site is slow.” But with observability tools in place, we can see exactly what’s causing the issue, and what “slow” actually means.

This way, in the future, we can quickly find what’s causing a slowdown. Once we find an issue, we don’t just apply a band-aid. We automate fixes to prevent it from happening again. My ultimate goal with SRE is to automate myself out of a job.

Better talks lead to better features

SRE also allows us to have a different kind of conversation around product functionality.

We do this through a few core principles. We first define our Service Level Indicators, or SLIs: the measurements that tell us how the product is performing for its users. With those, we set the acceptable level of service, called the Service Level Objective, or SLO. The SLO in turn gives us our error budget, which is the amount of failure we can accept before dropping below our SLO.

For example, we might place the SLO at 99.99%. This means that as long as the site is performing, in whatever ways we’ve defined, 99.99% of the time, the remaining 0.01% of failure is absolutely fine. This lets us make more data-driven decisions on what to focus on.
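To make that 0.01% concrete, here is a minimal Python sketch that turns an SLO into an amount of time. The 30-day window and the function name `allowed_downtime` are assumptions chosen for illustration; a real SLO may be measured over a different window, or in requests instead of minutes.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Return the total failure time an SLO tolerates over a window.

    `slo` is a fraction such as 0.9999 (that is, 99.99%).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction greater than 0 and at most 1")
    # The error budget is whatever share of the window the SLO leaves over.
    return window * (1.0 - slo)


# Assumption: a 30-day measurement window, picked only for this example.
window = timedelta(days=30)
budget = allowed_downtime(0.9999, window)
print(f"{budget.total_seconds() / 60:.2f} minutes")  # prints "4.32 minutes"
```

Under those assumptions, a 99.99% SLO leaves a little over four minutes of failure per month, which shows how quickly each extra nine shrinks the budget.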

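This also provides a smarter structure to development. If the product still has error budget left, meaning its failures are within what the SLO allows, the development team can test out new features. If those new features don’t push errors beyond the budget, we can add more. If they do, that is our signal to shift effort back to reliability work and refactoring.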
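That decision can be sketched in a few lines of Python. This is an illustration, not a description of our actual tooling: the names `BudgetStatus`, `budget_status`, and `next_focus` are hypothetical, the budget is counted in requests, and the rule that an exhausted budget means switching to reliability work is a common SRE convention assumed here.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float  # failures the SLO tolerates for this traffic
    observed_failures: int   # failures actually seen

    @property
    def remaining(self) -> float:
        """Error budget left; negative once the SLO has been missed."""
        return self.allowed_failures - self.observed_failures


def budget_status(slo: float, total_requests: int, failed_requests: int) -> BudgetStatus:
    """Compare observed failures against the budget implied by the SLO."""
    allowed = (1.0 - slo) * total_requests
    return BudgetStatus(allowed, failed_requests)


def next_focus(status: BudgetStatus) -> str:
    # Assumed policy: keep shipping while budget remains,
    # otherwise prioritise reliability work.
    return "ship features" if status.remaining > 0 else "reliability work"


# Hypothetical month: one million requests, 40 of which failed.
status = budget_status(slo=0.9999, total_requests=1_000_000, failed_requests=40)
print(next_focus(status))  # prints "ship features"
```

With a 99.99% SLO and a million requests, roughly 100 failures are acceptable, so 40 failures leaves room to experiment, while 150 would point the team back toward reliability.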

What’s next for SRE?

If you’ve got a project that could benefit from extra reliability or have an idea for a new project that you hope to scale without issues, drop us a line. We’d love to help you start out on the right path.

My team is also seeking talented engineers with a Ruby on Rails background who are interested in reliability. If the idea of automating yourself out of a job sounds good to you, check out the current openings we have at thoughtbot.