A Journey into Site Reliability Engineering

Clarissa Lima Borges

When a developer is asked “what do you most enjoy doing in your job?”, “solving problems” is usually the standard answer. We see opportunities to solve issues everywhere, including issues that affect our own productivity and speed.

I worked as a Rails developer for a few years before joining thoughtbot, and it was no different for me. On those projects, I had excellent references for good product-building practices, from conception to deployment. Even so, we began to face difficulties that are typical symptoms of small projects, which did not necessarily start out with a scalability plan, turning into large projects very quickly.

On the one hand, we were trying to deliver new functionality as soon as possible and expand projects at the same pace as the companies grew. On the other hand, we accumulated technical debt, and our agility and confidence in our deliveries became increasingly compromised as time passed.

After I joined thoughtbot’s Platform Engineering team and started learning more about SRE fundamentals, I could see several strategies, applying SRE principles, that we could have used to overcome those obstacles.

Let’s start by looking at some of the more common challenges I’ve dealt with on my teams. You’ve probably already seen some of these or are even facing them right now!

Continuous Integration bottleneck

The number of changes merged each day increased as the teams grew. Additionally, our tests could have been better optimized: even running in parallel in continuous integration, they took 20 to 30 minutes to finish, and on some projects up to 40 minutes! Deploying changes to production was constantly blocked, as were the people who needed those changes to keep working.

Accumulated errors

Addressing less urgent production errors was never prioritized over delivering new features. Errors accumulated over time, and the team had to learn an additional, unusual context: which errors could be ignored because they were not important enough to be addressed. There’s no way this could work; inevitably, we overlooked problems that needed attention. Without a strategy for solving those errors, they weren’t going anywhere, and they grew in proportion to new features.

Lack of confidence to deploy new features to production

Even though we tested every bit of new code before deployment, overall test coverage was low, especially in old code. Besides that, good testing practices weren’t spread through the team. Those factors made the team afraid to change the behavior of older code, or to make a change with an unknown impact on some other end of the application, as the flows were large and highly dependent on each other.

Another significant aggravating factor that made people feel insecure about deploying was that we constantly discovered problems in the application only after promoting changes to production, rather than in development or staging. Without question, the tests were part of the problem, but beyond that, the application had grown so complex, and so coupled to legacy code that no one felt confident touching, that developers, lacking high-level context, couldn’t imagine all the test scenarios that would ultimately impact users.

The risk of shipping bugs to production created even greater insecurity, since we knew a bug could cause very large financial losses.

Leads lost across flows

The teams knew where the application was slowest for users because of external API calls that took very long; they also knew that many users tended to give up on the application at exactly those slow moments. We could infer this was happening by identifying the last step users completed before leaving the website.

Awareness of these problems is a good first step, but we could do better by extracting more data to understand how to avoid losing users: how many users do we lose per day because of this? What leads them to give up (e.g., impatience, or the false impression that the website or their network wasn’t working)? What are the exact times of sluggishness? With that information, we could design a better experience and stop losing users.

Using SRE to address technical debt

These were not the only issues we had, but these examples help paint the whole picture of how they affected our projects (and our satisfaction developing code). All these growing flaws and obstacles were strong symptoms of technical debt. Accumulated technical debt is often the reflection of a culture that does not listen: not on purpose, but because there is no way to measure and illustrate the problems so that the people involved can finally design a plan to handle them. That happens because the concept of technical debt itself can be abstract and hard to identify.

That’s where SRE, Site Reliability Engineering, comes in: by applying its principles and practices, it is possible to gain wide visibility into failures, build scalability and risk-management strategies, and set goals to improve what currently exists, as well as to deliver new code with more confidence and speed. Personally, learning the fundamentals of SRE gave me a glimpse of how pleasant it can be to develop code for a big project without major crises.

“SRE is what happens when you ask a Software Engineer to design an operations team.” - Google SRE Book

With that in mind, here are the first five steps that could be taken to start implementing SRE practices on projects facing the problems above:

Step 1: Define SLIs and SLOs

Identifying technical debt is one of the challenges in addressing it. Issues related to uptime and deployment are frequently signs of technical debt that is actively hurting your app. SLOs are explicitly a way of measuring reliability and can be used to get concrete data on where some of your technical debt lives. As defined in Google’s SRE Book, an SLI (Service Level Indicator) is a quantitative measure of some aspect of the level of service provided, while an SLO (Service Level Objective) is the target value or target range for a service level measured by an SLI.

SLOs are critical measures for making data-driven reliability decisions and key measures in SRE practices. By using these metrics, we can have relevant availability information for users. We can, for example, track periods with lower response rates during user flows and design a strategy to improve those times.

In an example involving search, the SLI could be the time it takes to display search results to the user, and an SLO could be to return results for 99% of searches in less than 1,000 milliseconds, and for 90% of searches in less than 100 milliseconds.
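To make that concrete, here is a minimal sketch (in Python, with made-up latency numbers) of how such an SLI could be computed from a batch of observed request latencies and compared against the SLO targets:

```python
# Hypothetical latency samples (in milliseconds) for recent search requests.
latencies_ms = [42, 87, 95, 110, 230, 60, 980, 1500, 75, 99]

def sli_fraction_under(latencies, threshold_ms):
    """SLI: the fraction of requests that completed under the threshold."""
    return sum(1 for t in latencies if t < threshold_ms) / len(latencies)

# Compare the measured SLIs against the SLO targets from the example.
meets_1000ms_slo = sli_fraction_under(latencies_ms, 1000) >= 0.99  # 99% target
meets_100ms_slo = sli_fraction_under(latencies_ms, 100) >= 0.90    # 90% target
```

In this tiny sample, 9 of 10 requests finish under 1,000 ms and 6 of 10 under 100 ms, so neither target is met; in production you would compute this over a rolling window from real monitoring data rather than a hardcoded list.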

Step 2: Determine Error Budgets

The concept of an Error Budget is intrinsic to the definition of SLIs and SLOs. As the name suggests, an Error Budget is given by how far the application is allowed to deviate from the expected behavior for that metric; in other words, by how much the application can violate an SLO.

An important aspect of using error budgets is agreeing with the entire organization that once a budget is exhausted, the team responsible for that area stops feature development and fixes the problem. It may sound hard to convince the whole organization to use error budgets this way, but it shouldn’t be if your SLOs are realistic and genuinely needed to keep your application working well for users. Assuming you defined your SLOs to achieve the ideal level of service, committing to prioritize the work that ensures that level of service should be a given.

Following the example in step 1, the Error Budget would allow up to 1% of 1,000,000 queries to take 1,000 milliseconds or more to return results, and up to 10% to take 100 milliseconds or more, in a 3-month window.
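Tracking how much of that budget has been spent is simple arithmetic. A sketch in Python, using the hypothetical query counts above and a made-up number of observed slow queries:

```python
# Figures from the example: 1,000,000 queries in a 3-month window,
# with an SLO of 99% of queries returning in under 1,000 ms.
total_queries = 1_000_000
slo_target = 0.99

# The error budget: how many queries may miss the SLO in the window.
error_budget = int((1 - slo_target) * total_queries)  # 10,000 queries

# Suppose monitoring reports this many queries took 1,000 ms or more.
slow_queries_observed = 7_200

budget_remaining = error_budget - slow_queries_observed
budget_consumed = slow_queries_observed / error_budget  # fraction spent so far
```

With 7,200 of 10,000 budgeted slow queries already used, 72% of the budget is gone: a signal to slow down risky releases well before the budget is fully exhausted.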

Step 3: Create Alerts

You might imagine that the people responsible for a project always want to know when something is wrong, but good alerts don’t fire all the time! In real life, uninformative notifications that beep constantly become noise and are often ignored, especially when most of them don’t require an immediate reaction. As a result, critical and urgent alerts can go unnoticed.

So, when is the best time to alert? Accepting that things won’t always go well, we should alert when things are going too wrong for too long: when, within a given time window, SLOs are not being met and a significant number of customers are affected. It is also important to alert when the Error Budget is burning too quickly or, logically, when it has been exhausted.
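One common way to express “burning too quickly” is a burn rate: how fast errors are consuming the budget compared with a steady burn that would exhaust it exactly at the end of the SLO period. A sketch in Python, with illustrative numbers (the 14x paging threshold is a commonly cited fast-burn value, not a rule):

```python
def burn_rate(errors_in_window, window_hours, period_hours, error_budget):
    """Burn rate relative to a steady burn over the whole SLO period.
    A rate of 1.0 would spend the budget exactly at the period's end."""
    budget_for_window = error_budget * (window_hours / period_hours)
    return errors_in_window / budget_for_window

# Hypothetical: 10,000 budgeted errors over a 90-day (2,160-hour) period,
# and 90 SLO-violating requests observed in the last hour.
rate = burn_rate(errors_in_window=90, window_hours=1,
                 period_hours=2160, error_budget=10_000)

# Page someone only when the budget is burning much faster than sustainable.
should_page = rate > 14
```

At this rate the budget would be gone in under five days instead of ninety, which is exactly the “too wrong for too long” situation worth waking someone up for; slower burns can go to a ticket queue instead of a pager.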

Step 4: Refactoring tests

Refactoring your tests helps decrease CI time and regain the confidence to deploy. It is critical that the code is well tested to ensure it stays reliable going forward. Reducing CI time is not always an SRE responsibility; it depends on your goals. In this case, since it directly affects how quickly users get new features and bug fixes, it is one.

Step 5: Using Canary Releases

When an application reaches a large scale of users, releasing new code gets riskier. Even if you test a new feature thoroughly and rely on several code reviews, you have to recognize the risk of unpredictability.

What if there were a way to reduce the impact of a flawed change that could cause a catastrophe? A Canary Release is a mechanism where a new version of the application is made available to a small set of users. To do that, you route part of the traffic to the version under test, while the rest of the traffic goes to the known stable version as usual. You then monitor the metrics for the new version serving that small set of users. Once you’re confident the new version is working well, you promote it to be the latest stable version and make it available to 100% of users.

Reducing the impact of the risks brings back the confidence to deliver code fast!

Investing in SRE practices brings healthy benefits that reinforce your chances of succeeding with your application and sustaining your projects long term. The cost of adopting SRE pays for itself once problems stop burdening the team with so much effort and time.

While only a few SRE practices have been named here for resolving these common issues, countless others suit various scenarios. thoughtbot’s Platform Engineering team is happy to help you ensure your site is reliable - available and responsive to users - while maintaining a rapid pace of feature development. Read more about how you can work with the Platform Engineering team here.