---
title: A Journey into Site Reliability Engineering
teaser: 'While Rails gained a lot of popularity among companies to develop products
  quickly, technical debt and scalability issues were challenges that also gained
  space in this context. Let''s talk about some SRE fundamentals that can address
  those situations.

  '
tags: sre,devops,cloud
author: Clarissa Borges
published_on: 2022-12-08
---

When a developer is asked "what do you most enjoy doing in your job?", a
standard answer is that their biggest motivation is solving problems. We see
problems we could solve everywhere, including those that affect our own
productivity and speed.

I worked as a Rails developer for a few years before joining thoughtbot, and it
was no different for me. In those projects, I had excellent examples of good
product development practices from conception to deployment. Despite this, we
started to face difficulties that are typical of small projects - which were
not necessarily planned to scale from the start - becoming large projects very
quickly.

On the one hand, we were trying to deliver new functionality as soon as possible
and expand projects at the same pace as companies grew. On the other hand, we
accumulated technical debt and our agility and confidence in deliveries were
increasingly compromised as time passed.

After I joined thoughtbot's [Platform Engineering team](https://thoughtbot.com/devops-sre-cloud-platform) and started to learn more
about SRE fundamentals, I came across several strategies, based on SRE
principles, that we could have used to overcome these obstacles.

Let's start by looking at some of the more common challenges I've dealt with on
my teams. You've probably already seen some of these or are even facing them
right now!

### Continuous Integration bottleneck

The number of changes merged each day increased as the teams grew.
Additionally, our tests were not well optimized: even running in parallel,
continuous integration took 20 to 30 minutes to finish and, depending on the
project, up to 40 minutes! Shipping changes to production was constantly
blocked, as were the people who needed those changes to keep working.

### Accumulated errors

Addressing less urgent production errors was never prioritized over delivering
new features. Errors were accumulating over time, and the team had to learn an
additional and unusual context: which errors could be ignored because they were
not important enough to be addressed. There's no way this could work -
inevitably, we overlooked problems that needed attention. Without a strategy to
solve them, those errors were not going away, and their number grew along with
every new feature.

### Lack of confidence to deploy new features to production

Even though we tested every bit of new code before deployment, overall test
coverage was low, especially in old code. Besides that, good testing practices
were not widespread across the team. Those factors made the team afraid of
changing the behavior of older code or of making a change with an unknown
impact on some other part of the application, as the flows were large and
highly dependent on each other.

Another significant aggravating factor that made people feel insecure about
deploying was that we constantly discovered problems only after promoting
changes to production, instead of in development or staging. Without question,
the tests were a problem, but more than that, the application had grown so
complex, and so coupled to legacy code that no one felt confident touching,
that developers, lacking high-level context, couldn't imagine all the test
scenarios that would ultimately impact users.

That insecurity was even greater because we knew that a bug in production could
cause very large financial losses.

### Leads lost across flows

The teams knew where the application was slower for users because of external
API calls that took a very long time; they also knew that many users gave up on
the application mainly at those slow moments. We could get a rough idea of this
by identifying the last stage each user reached before leaving the website.

Awareness of these problems is a good first step, but we would have been better
served by more data on how to avoid losing users: how many users do we lose per
day because of this? Why do they give up on the app (e.g., impatience, or a
false impression that the website or their network wasn't working)? When
exactly does the sluggishness occur? With that information we could build a
better design to avoid losing users.

## Using SRE to address technical debt

These were not the only issues we had, but these examples help paint the whole
picture of how they affected our projects (and our satisfaction developing
code). All these growing flaws and obstacles were strong symptoms of technical
debt. The accumulation of technical debt often reflects a culture that does not
listen: not on purpose, but because it lacks the means to measure and
illustrate those problems so that the people involved can finally design a plan
to handle them. That happens because the concept of technical debt itself can
be abstract and hard to identify.

That's where SRE, Site Reliability Engineering, kicks in: by applying its
principles and practices, it is possible to gain wide visibility of failures,
build scalability and risk management strategies, set goals to improve what
currently exists, and deliver new code with more confidence and speed.
Personally, learning the fundamentals of SRE gave me a glimpse of how pleasant
it can be to develop code for a big project without major crises.

> "SRE is what happens when you ask a Software Engineer to design an operations
> team." - Google SRE Book

With that in mind, here are five first steps for implementing SRE practices in
projects that face the problems mentioned above:

### Step 1: Define SLIs and SLOs

Identifying technical debt is one of the challenges in addressing it. Issues
related to uptime and deployment are frequently signs that technical debt is
actively hurting your app. SLOs are an explicit way of measuring reliability,
and they give you concrete data on where some of your technical debt is. As
defined by Google in the
[SRE Book](https://sre.google/sre-book/service-level-objectives/), an SLI
(Service Level Indicator) is a quantitative measure of some aspect of the level
of service provided, while an SLO (Service Level Objective) is a target value
or range of values for a service level measured by an SLI.

SLOs are critical for making data-driven reliability decisions and are central
to SRE practices. With these metrics, we get relevant information about the
availability users actually experience. We can, for example, track periods with
slower response times during user flows and design a strategy to improve those
times.

In an example involving a search, the SLI could be the time it takes for the
search to display results to the user, and the SLOs could be to return results
for 99% of searches in less than 1,000 milliseconds and for 90% of searches in
less than 100 milliseconds.
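
As a rough illustration of the search example, the sketch below computes a latency SLI from a list of response times and checks it against both objectives. The names `LatencySLO`, `latency_sli`, and `meets_slo`, as well as the sample data, are assumptions made for this example; in a real project the latencies would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Target: at least `target_ratio` of requests finish under `threshold_ms`."""
    threshold_ms: float
    target_ratio: float


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI: fraction of requests that completed strictly under the threshold."""
    if not latencies_ms:
        return 1.0  # no traffic means nothing missed the objective
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(latencies_ms: list[float], slo: LatencySLO) -> bool:
    """True when the measured SLI reaches the SLO's target ratio."""
    return latency_sli(latencies_ms, slo.threshold_ms) >= slo.target_ratio


# The two objectives from the search example.
SEARCH_SLOS = [
    LatencySLO(threshold_ms=1000, target_ratio=0.99),
    LatencySLO(threshold_ms=100, target_ratio=0.90),
]

# Hypothetical sample of 100 searches: 92 fast, 7 medium, 1 slow.
sample = [80.0] * 92 + [400.0] * 7 + [1500.0]

for slo in SEARCH_SLOS:
    sli = latency_sli(sample, slo.threshold_ms)
    print(f"under {slo.threshold_ms} ms: SLI={sli:.2%}, met={meets_slo(sample, slo)}")
```

Keeping the indicator (`latency_sli`) separate from the objective (`LatencySLO`) mirrors the distinction above: the same measurement can be checked against more than one target.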

### Step 2: Determine Error Budgets

The concept of Error Budget is intrinsic to the definition of SLIs and SLOs. As
the name suggests, an Error Budget is how far the application is allowed to
deviate from the expected behavior for a given metric. In other words, it is
the amount of failure the SLO tolerates: with a 99% target, the remaining 1% is
the budget.

An important aspect of using error budgets is that the entire organization must
agree that once a budget is exhausted, the team responsible for that area stops
feature development and fixes the problem. It may sound hard to convince the
whole organization to use error budgets this way, but it shouldn't be if your
SLOs are realistic and reflect what your application needs to work well for
users. If you defined your SLOs to achieve the ideal level of service, then
committing to prioritize the work that maintains that level of service follows
naturally.

From the example in step 1, if the search received 1,000,000 queries in a
3-month window, the Error Budget would allow up to 1% of them (10,000 queries)
to take 1,000 milliseconds or more to return results to the users, and up to
10% of them (100,000 queries) to take 100 milliseconds or more.
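
To make the arithmetic concrete, here is a minimal sketch that derives the budget from the SLO targets and tracks how much of it is left. The function names and the count of slow queries already observed are hypothetical; the 1,000,000 queries and the 3-month window come from the example.

```python
def error_budget(total_requests: int, target_ratio: float) -> int:
    """Number of requests allowed to miss the objective in the window."""
    return round(total_requests * (1 - target_ratio))


def budget_remaining(total_requests: int, target_ratio: float, bad_requests: int) -> float:
    """Fraction of the budget still unspent (negative once it is overspent)."""
    budget = error_budget(total_requests, target_ratio)
    if budget == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return 1 - bad_requests / budget


QUERIES_IN_WINDOW = 1_000_000  # query volume in the 3-month window

# 99% under 1,000 ms leaves 1% of queries as budget.
slow_budget = error_budget(QUERIES_IN_WINDOW, 0.99)
# 90% under 100 ms leaves 10% of queries as budget.
medium_budget = error_budget(QUERIES_IN_WINDOW, 0.90)

# Hypothetical: 2,500 queries have already taken 1,000 ms or more.
remaining = budget_remaining(QUERIES_IN_WINDOW, 0.99, 2_500)

print(slow_budget, medium_budget, remaining)
```

A negative value from `budget_remaining` is the signal that the agreement described above applies: the budget is spent and reliability work takes priority.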

### Step 3: Create Alerts

You can imagine that people responsible for the project would always want to
know when something wrong is happening, but good alerts don't alert all the
time! In real life, uninformative notifications that beep all the time become
noisy and are often ignored, especially when most of them do not require an
immediate reaction. As a result, critical and urgent alerts can go unnoticed.

So, when is the best time to alert? Since we know that things won't always go
well, we should set alerts for when things are *going too wrong for too long*.
That happens when, in a given time window, SLOs are not being met and a
significant number of customers are affected. It is also important to alert
when the Error Budget is burning too quickly or, naturally, when it has been
exhausted.
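
One common way to express "too wrong for too long" is a burn rate checked over two windows at once, an approach described in the Google SRE Workbook. The sketch below is only an illustration of that idea: the function names, the window sizes in the comments, and the threshold of 10 are assumptions, not recommended values.

```python
def burn_rate(bad: int, total: int, target_ratio: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if not 0 < target_ratio < 1:
        raise ValueError("target_ratio must be between 0 and 1 (exclusive)")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_ratio = 1 - target_ratio
    return (bad / total) / allowed_error_ratio


def should_alert(long_window: tuple[int, int], short_window: tuple[int, int],
                 target_ratio: float, threshold: float) -> bool:
    """Alert only when both windows burn too fast.

    Each window is a (bad_requests, total_requests) pair. The long window
    shows the problem has lasted; the short one shows it is still happening.
    """
    long_rate = burn_rate(*long_window, target_ratio)
    short_rate = burn_rate(*short_window, target_ratio)
    return long_rate >= threshold and short_rate >= threshold


TARGET = 0.99     # the 99% objective from step 1
THRESHOLD = 10.0  # hypothetical: page when burning 10x faster than budgeted

# Hypothetical sustained incident: last hour and last 5 minutes both bad.
sustained = should_alert((300, 2000), (30, 150), TARGET, THRESHOLD)

# Hypothetical brief spike: the last 5 minutes are bad, but the hour is fine.
spike = should_alert((40, 2000), (30, 150), TARGET, THRESHOLD)

print(sustained, spike)
```

Requiring both windows to exceed the threshold is what keeps a short blip from paging anyone while still catching an incident that persists.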

### Step 4: Refactor Tests

Refactoring your tests helps decrease CI time and regain the confidence to
deploy. It is critical that the code is well tested to ensure that it stays
reliable going forward. Reducing CI time is not always an SRE responsibility;
it depends on your goal. In this case, because it directly affects how quickly
users get new features and bug fixes, it is.

### Step 5: Use Canary Releases

For applications with a large number of users, releasing new code is riskier.
Even if you test a new feature thoroughly and rely on several code reviews, you
have to acknowledge that some risk of unpredictable behavior remains.

What if there were a way to reduce the impact of a flawed change that could
cause a catastrophe? A Canary Release is a mechanism in which a new application
version is made available to a small set of users. To do that, you route part
of the traffic to the version being tested, while the rest of the traffic goes
to the known stable version as usual. You then monitor the metrics of the new
version for that small set of users. Once you're confident that the new version
is working well, you can promote it to be the latest stable version and make it
available to 100% of the users.
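
The routing and promotion decision can be sketched in a few lines. This is only an illustration under stated assumptions: in practice the traffic split usually lives in a load balancer or service mesh rather than in application code, and the function names, the 5% split, and the error-rate tolerance are all hypothetical.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary using a hash bucket.

    Hashing the user id keeps each user on the same version across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < canary_percent


def safe_to_promote(canary_errors: int, canary_requests: int,
                    stable_errors: int, stable_requests: int,
                    tolerance: float = 0.005) -> bool:
    """Promote only if the canary's error rate is not meaningfully worse."""
    if canary_requests == 0 or stable_requests == 0:
        return False  # not enough data to decide
    canary_rate = canary_errors / canary_requests
    stable_rate = stable_errors / stable_requests
    return canary_rate <= stable_rate + tolerance


# Hypothetical rollout: send roughly 5% of users to the new version.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(routes_to_canary(u, 5) for u in users) / len(users)
print(f"share of users on the canary: {canary_share:.1%}")

# Hypothetical metrics gathered while the canary was running.
print(safe_to_promote(canary_errors=5, canary_requests=1_000,
                      stable_errors=40, stable_requests=10_000))
```

Comparing the canary against the stable version running at the same time, rather than against a fixed number, helps separate problems caused by the new code from problems affecting the whole system.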

Reducing the impact of the risks brings back the confidence to deliver code
fast!

***

Investing in implementing SRE practices brings healthy benefits that reinforce
your chances of succeeding with your application and the long-term
sustainability of the projects. The cost of adopting SRE pays off when problems
stop burdening the team with so much effort and time.

While only a few SRE practices have been named for resolving these common
issues, countless others are suitable for various scenarios. thoughtbot's
Platform Engineering team is happy to help you to ensure your site is reliable -
available and responsive to users - while maintaining a rapid pace of feature
development. Read more about how you can work with the Platform Engineering team
[here](https://thoughtbot.com/devops-sre-cloud-platform).

## Recommended resources:

- [DevOps, SRE & Cloud Platform with Joe Ferris](https://www.giantrobots.fm/403)
- [Google SRE Book](https://sre.google/sre-book/table-of-contents)
- [Google SRE Workbook](https://sre.google/workbook/table-of-contents/)
- [School of SRE](https://linkedin.github.io/school-of-sre/)
- [thoughtbot's AWS Platform Guide](https://tbot.io/aws-platform-guide)
