Ask thoughtbot CTO - All About CI / CD

What are the goals for CI/CD?

What are the properties of a good CI/CD pipeline and how has that influenced the choices we’ve made in technology and process?

The main goal of Continuous Integration (CI) is to give people confidence that their code is working. Experienced Ruby developers encounter CI mostly through automated tests: they expect to see their tests run when they open a pull request. At thoughtbot we ask ourselves, how can we give developers the same confidence in deployments?

One way is by having the ability to pre-build container images, packages, and applications. This gives deployment pipelines a confidence boost, particularly when deploying to containerized environments.

Another big goal is the seamless deployment of changes when you merge your pull request. We want to provide transparency to developers during deployments: to alert them when deployments are happening, when they’re complete, and to give a reason if they fail.

What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?

With Continuous Integration, you’re talking about running tests and merging changes back into the main branch. It’s about making sure that everything is still running after the latest updates, and going as far as you can without actually deploying the changes.

Continuous Delivery and Continuous Deployment, however, aren’t defined consistently within the industry. At a high-level, Continuous Deployment and Continuous Delivery are both about releasing changes as they are ready. This is in contrast to specific, planned releases, e.g. the release of January 5, that are scheduled to address several issues.

It’s really interesting to consider how we can distinguish between deployment and delivery using feature flags, which can be implemented with a tool like Rollout. Let’s say, you have some feature code that is running in the background and collecting data, but the feature is not actually visible to users yet. By keeping the feature flag off, you can continually deploy the latest code without actually delivering an unfinished feature. That’s a very specific distinction between delivery and deployment and not everybody in the industry is making that same distinction.

It’s exciting to see what’s possible to make it easy for developers to know they’re building things well, whilst also giving operations people the confidence that code changes won’t break the system.

We’ll now consider how thoughtbot does CI/CD across several anonymized example projects.

First Example: Docker

Let’s start by looking at a CI build from a mono-repo project running through Docker, starting with the workflow section, where we’re doing the actual build. If you’re familiar with Heroku, a lot of what happens with Docker is similar to what happens with a push to Heroku, where the build and the deployment are joined together.

Using Heroku, when you push the latest version of the code, Heroku tries to bundle all the gems and get your application packaged into what they call a slug. Scaling an application then downloads and expands the slug to a dyno for execution.

In our example project, we’re using docker build, which is another way of building a container image. Essentially, it’s a set of instructions to take you from a base blank state image to an image that contains the application’s dependencies and latest code.
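
As a rough sketch of what those instructions look like for a Ruby app (the base image, commands, and file names are illustrative, not this project's actual Dockerfile):

```dockerfile
# Start from a base image that already contains the Ruby runtime.
FROM ruby:3.2

WORKDIR /app

# Copy in the latest application code.
COPY . .

# Install the application's dependencies into the image.
RUN bundle install

# Declare how the container should start.
CMD ["bin/rails", "server", "-b", "0.0.0.0"]
```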

The goal for both Heroku and docker build is the same: you start with the application code, and you come up with some package that can be deployed to a container. And so it’s the same as in the days when you might have compiled a JAR file for Java. Except, instead of trying to deploy to some Java VM, you’re trying to deploy to a containerized solution.

Most people associate containers heavily with Docker, but they’re actually separate technologies. Only a subset of Docker has to do with the Open Container Initiative (OCI).

The essentials are twofold: on the one hand, you need a toolchain that goes from code to a container image. OCI is the industry standard to describe how those images will be formatted. On the other hand, you need a container runtime that knows how to take one of those images and turn it into running containers. There are a few container runtimes, the most well-known, again, is Docker.

But nowadays, most platforms that are running open containers no longer run Docker. They use alternative container runtimes, e.g. containerd. Likewise, there are now some alternatives to building Dockerfiles, e.g. Podman.

And so the world of containers is much bigger than docker build and docker run. At thoughtbot, we still use docker build but we don’t use docker run. The idea of a standardized container image has really taken hold in the industry, so that’s what we’ve been using.

Second Example: Buildpacks

Sticking with that same standard, and with the same runtime, there’s an alternative process that has grown up in the industry called buildpacks, which is based on the Heroku approach.

The docker build approach is based on the idea of layers. You start with a base, and then the Dockerfile acts as a list of instructions to add layers to that base image. This gives you the final image that gets deployed as a container.

Buildpacks take a different approach: rather than working in layers, they have their own base archive, which is the root of the package. They then inspect the code and add things to the package programmatically.

The differences between docker build and buildpacks are subtle but important. For example, caching in Docker is entirely based on layers, which means that if you need to redo `bundle install`, you have to rebuild that layer and install all the gems again. Cache invalidation is notoriously tricky, but by keeping the process very straightforward, Docker makes it unlikely that you end up with unexpectedly cached binaries, which can be very difficult to debug.
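
As a sketch of how you usually work with that layer cache rather than against it, the Dockerfile from the earlier example can be reordered so the `bundle install` layer is only rebuilt when the dependency files change:

```dockerfile
FROM ruby:3.2
WORKDIR /app

# Copy only the dependency manifests first: this layer, and the bundle install
# layer below it, stay cached until Gemfile or Gemfile.lock change.
COPY Gemfile Gemfile.lock ./
RUN bundle install

# Ordinary code changes only invalidate the layers from this point down.
COPY . .

CMD ["bin/rails", "server", "-b", "0.0.0.0"]
```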

Buildpacks, on the other hand, have emphasized speed and convenience, rather than the consistency and simplicity of layered images. Heroku uses a buildpack approach, and the OCI buildpack images are modeled after the Heroku buildpacks. The idea is that a buildpack is a program that knows how to look at the code you’re pushing and turn it into an OCI image. It can maintain its own kind of cache. So, for example, instead of running `bundle install` from scratch every time, a buildpack can maintain a cache of gems between runs. It then only needs to download and install the gems that have been added since the last run, which can be really useful.
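
For a sense of what that looks like in practice, the Cloud Native Buildpacks `pack` CLI builds an OCI image straight from source, with no Dockerfile; something like the following, where the image and builder names are illustrative:

```sh
# Detect the app type, restore cached dependencies, and produce an OCI image.
pack build registry.example.com/myapp:latest --builder heroku/builder:22

# The result is a normal container image that can be pushed and run anywhere.
docker push registry.example.com/myapp:latest
```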

Take the example project we’re looking at: it’s an Angular app that has a very long compile step. A buildpack can cache the compilation artifacts in a much more meaningful way than a layer cache, because all of the compilation artifacts come out of running the steps in the layer instructions. So, if you have to redo that layer, all of it is gone, just like with `bundle install`. However, by using a buildpack, we can keep most of the compiled JavaScript assets and just recompile the files that have changed, which significantly speeds up that step.

The other advantage we’ve mentioned with buildpacks is convenience. Developers are used to `git push`, which works intuitively on Heroku. That’s because buildpacks can intelligently inspect the code that they’re running.

In contrast, a Dockerfile build is a very simple and unintelligent process: it just follows a set of instructions to add layers. There’s no cleverness in there. I’m sure you’ve seen people trying to be clever with Dockerfiles, and it inevitably leads to horrible confusion.

Buildpacks approach it from the other angle: they want to make it so that developers don’t have to remember to do things like cache their dependencies or figure out what version of Ruby is in there. In a Dockerfile, you always have to declare the base image, including the Ruby version. But a buildpack will try to deduce that by intelligently inspecting the Gemfile. And so it adds that convenience.

The trade-off is the complexity and loss of reliability from using buildpacks. Because if you are caching dependencies in between builds, it means that the builds are interdependent. Just like when you don’t clean up properly after a test, it can cause the next test to fail. The idea that one build can change the results of a future build introduces a lot of potential confusion for Continuous Integration. Heroku has put a monumental amount of effort into working out those bugs for the platforms it supports. If you’ve worked on one of the well-supported platforms, like Ruby or Node, you’ve probably had a pretty seamless experience. But if you ever end up outside the bounds of what Heroku has covered thoroughly, you start to run into weird things where one deployment succeeds while another fails because it happens to be picking up something weird from the cache. And you also start doing some arcane things like manually clearing the Heroku slug build cache.

People have generally been pulled towards the simplicity of Dockerfiles, even though they’re not as convenient as buildpacks and are, on average, slower to build. But they are very simple, so people can learn them quickly. You can read a Dockerfile and understand what it’s doing because it’s made of commands you would run yourself, and there’s no question as to what the output will be, because there are no conditionals to unravel as there are in a buildpack.

Aside from Heroku, there’s the Cloud Native Buildpacks effort, buildpacks.io, which provides Heroku-like buildpacks that build an OCI container image that will run on Docker, containerd, or CRI-O. We’re using containerd on Kubernetes: our EKS nodes run an operating system called Bottlerocket, which runs containerd. It’s a Linux-based, container-focused operating system. It’s open source and mainly used by Amazon.

Most of the major hosted repositories, including the one we use (ECR from AWS), support automatically scanning images for known vulnerabilities. They’ll look for binaries built into the image (e.g. an old version of OpenSSL that might contain a vulnerability) and warn you about it. You can even set up policies in AWS that will reject an image from the repository if it could be vulnerable.
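
For example (the repository name here is hypothetical), scan-on-push can be switched on for an ECR repository from the AWS CLI:

```sh
# Ask ECR to scan every newly pushed image for known CVEs.
aws ecr put-image-scanning-configuration \
  --repository-name myapp \
  --image-scanning-configuration scanOnPush=true
```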

There are some tools out there that will scan your Dockerfile and check whether it’s using the latest base image version, but a tool that automatically submitted pull requests for those updates would be pretty cool. One of the reasons that process isn’t as seamless for Dockerfiles is that there isn’t wide agreement on where those versions will be declared.

So one downside to the Dockerfile approach is that the Ruby version in the base image must match the version in your Gemfile. That is not automated: if they mismatch, the build simply won’t work. There are a few different ways of declaring a Ruby version, so perhaps nobody has yet worked out a dependable way to make sure each one is updated simultaneously.
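
Concretely, these are the two declarations that have to be kept in sync by hand (version numbers illustrative):

```
# Gemfile
ruby "3.2.2"

# Dockerfile
FROM ruby:3.2.2
```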

The main reason we continually update Dockerfiles is that they contain the runtime for whatever we’re deploying. If the application is being upgraded to a new version of, e.g. Ruby, the Dockerfile has to change. But Ruby is not always backward-compatible. And so, if we have a good CI process, we can’t update Ruby without also checking everything else. That helps because if we bumped the Ruby version in the Gemfile, and we bumped the Dockerfile, we’ll see if the tests run and we’ll see if the docker build still succeeds. But it’s something we need to make sure we closely coordinate with the rest of the development team. Nobody wants to have their Ruby version upgraded unexpectedly.

Third Example: GitHub Actions

Our next example project is a fulfillment app, which is deployed using GitHub Actions.

The steps for GitHub Actions are pretty similar to what we just saw in the previous example: build using Docker and push to ECR. The difference here is in the CI/CD wiring: the docker build and push of the image and the deployment to the cluster are coupled together and run in the same workflow. This means the docker build doesn’t run with every PR push; it only runs when a PR is merged. This is because we didn’t want to deploy the code to the cluster with every PR push.
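
A trimmed-down sketch of that kind of workflow trigger and build job (file, image, and account names are illustrative, and the AWS credential and ECR login steps are omitted):

```yaml
# .github/workflows/deploy.yml
name: Build and deploy
on:
  push:
    branches: [main]        # run on merge to main, not on every PR push

env:
  IMAGE: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:${{ github.sha }}

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (authenticate to AWS and log in to ECR here)
      - name: Build image
        run: docker build -t "$IMAGE" .
      - name: Push image
        run: docker push "$IMAGE"
      # (deploy to the cluster here using the manifests)
```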


This approach is similar to the Heroku approach, where we have a configuration vs. code separation. We want to maintain what we’ve done in Kubernetes, where we have an application repository that contains all the application code, but then we also have a manifest repository that contains the configuration. The manifest repository includes things like environment variables, which processes to run with which arguments, how many processes to run, how to scale them, and so on. It’s convenient because you don’t have to worry about merge conflicts and things when you’re updating the configuration. In the CI/CD process, it means deployment needs to use both of these repositories.

It was a challenge to make sure that when changes are made to the manifest repository, a deploy also gets triggered for the application repository. This workflow dispatch from the manifest repository could be improved: at the moment we just have one pipeline for everything, which keeps things simple, but we could definitely get smarter and avoid having the two different kinds of changes trigger the same workflow.

In addition, there is no built-in security model for this, so there are two permissions that need to be granted to make it work. The first is that the workflow doing the deploy needs access to the other repository: GitHub workflows automatically have access to their own repository to check out the code, but they can’t check out any other repository, even within the same organization. And there’s nothing in the UI or in the configuration that lets you grant that. The only way to do it is to manually create an access token and set it as a secret on the project, so when the workflow checks out the second repository, it does so as a different user. There’s significant security overhead there: that credential must be created and maintained, and you have to be careful not to leak it.
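
A sketch of what that checkout step ends up looking like, with the repository, secret, and path names being hypothetical:

```yaml
# Excerpt from the deploy job's steps: check out the separate manifest repository.
# The default GITHUB_TOKEN can't read other repositories, so a manually created
# access token is stored as a repository secret and passed in here.
- uses: actions/checkout@v4
  with:
    repository: example-org/myapp-manifests    # the second repository
    token: ${{ secrets.MANIFESTS_REPO_TOKEN }} # hand-made PAT, rotated manually
    path: manifests
```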

Secondly, if you want to kick off a workflow in the other repository when the second repository changes, you need a token that has permission to run workflows. And again, there’s no mechanism within GitHub Actions itself for doing this. You can put together those kinds of recipes, but it’s a bit like AWS, where the answer to everything is to just write a Lambda. With GitHub, you can always just make a Personal Access Token (PAT) and use the API. But in their documentation, they generally stress the security risks of PATs and how you should probably try to use an application or another feature whenever possible.
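
In practice that means calling GitHub’s workflow dispatch API with a PAT, along these lines (the owner, repository, and workflow file name are hypothetical):

```sh
# From the manifest repository's workflow: trigger the deploy workflow in the
# application repository, authenticating with a PAT that has workflow permissions.
curl -X POST \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer $DEPLOY_PAT" \
  https://api.github.com/repos/example-org/myapp/actions/workflows/deploy.yml/dispatches \
  -d '{"ref": "main"}'
```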

We’ve been hoping that they would build a solution for this, but as time goes on, it seems they’re pretty settled on the one repository workflow model.

Fourth Example: Single Repository combining code and manifests

In this example we have a single repository that includes both the application code and the manifest definitions. The intention is that developers interact directly with the application repository for all application code changes, and we have a directory for manifests. Within the manifest directory, we have definitions for the Kubernetes manifests, or any other manifest that has to be defined for Kubernetes. Doing things this way removes the complexity of having to use a workflow dispatch to trigger a pipeline in a separate manifest repository once the application code is built. Now let’s step through the actual workflow definition, which covers both the application and the manifests.
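
The layout looks roughly like this (directory names are illustrative):

```
myapp/
├── app/                    # application code
├── Dockerfile
├── manifests/              # Helm chart / Kubernetes manifests live next to the code
│   ├── Chart.yaml
│   ├── values.yaml
│   └── templates/
└── .github/
    └── workflows/          # CI/CD workflow definitions
```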

The first job in our workflow builds the Docker image and pushes it out to ECR, and this is only dependent on the application code. The next job, which handles deployments to EKS, triggers right after the first job is completed. We then go forward to generate the manifest files using Helm and then do a deploy to EKS, using the image that was built in the first step.
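
A sketch of that two-job structure (the image name, chart path, and cluster details are illustrative, and the AWS credential and kubeconfig setup steps are omitted):

```yaml
env:
  IMAGE: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (authenticate to AWS and log in to ECR here)
      - run: docker build -t "$IMAGE:${{ github.sha }}" .
      - run: docker push "$IMAGE:${{ github.sha }}"

  deploy:
    needs: build             # only runs after the image has been pushed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (point kubectl at the EKS cluster here)
      - name: Generate manifests and deploy
        run: |
          helm upgrade --install myapp ./manifests \
            --set image.tag=${{ github.sha }}
```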

Doing it this way is easier, because you don’t have to create permissions or tokens to give access to a different repository. However, one potential drawback is that the developers and the DevOps engineers have to make changes to the same repository, and keeping track of who made which changes might not be straightforward. On the other hand, having developers actively participate in updating manifest definition files also has benefits: developers know that if they have to make changes to the manifest definition, it’s right there in the main repository, and they have the freedom to interact with the manifest files whenever they want. So it’s a different ideology that has its advantages as well as its limitations.

One further complication with this approach is that you can’t make changes to the manifest files without also deploying any code changes that have been merged into the main branch but aren’t ready to deploy yet. This goes against the Twelve-Factor App idea of separating code from configuration. But the simplicity of this approach, and how close the configuration is to developers, are both very appealing. It’s a small change, but having the manifests in the same repository as the application code significantly increases the likelihood that you get developer engagement.

We’ve seen firsthand with this approach that when developers need to add new environment variables or secrets, it’s easier for them to do so and they are willing to do it. Then they just need to request a review from the DevOps team.

We’ve had projects using different approaches where developers have made changes to a manifest repository, but it’s been more of a leap and a learning curve. Using a separate repository makes it feel like you’re outside of your domain: that you’re in the platform engineers’ domain. Whereas this way, by keeping the manifests in the app repo, it feels like we’re coming to their house, which is really interesting.

Before we move on to our next example, let’s dive a bit deeper into how we test Helm, the tool we use to generate the manifest files in this project. We pass sample values to the Helm chart and validate the generated manifests to confirm that the chart is properly configured.

In most of our applications, we’ve been using a tool called kustomize, which lets you structurally layer changes to Kubernetes manifests. You start with a base and then gradually transform it by adding labels or patching in values. Helm takes a string-template approach, sort of like PHP or ERB, where you start with a set of templates and then interpolate values. It can also do simple constructs, like `for` loops and conditionals. So we had the idea that, in order to make it easier for developers to do common things like change environment variables, we could make a Rails Helm chart template that would handle the common cases we have. Then all the developers would have to do is update the values file. Taking a TDD approach, we also wrote tests for the Helm chart templates we were writing.
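
As a small sketch of the idea (the template excerpt and values are simplified and hypothetical), the shared chart interpolates values such as environment variables into the generated Deployment:

```yaml
# templates/deployment.yaml (excerpt): loop over whatever is in .Values.env
        env:
          {{- range $name, $value := .Values.env }}
          - name: {{ $name }}
            value: {{ $value | quote }}
          {{- end }}
```

```yaml
# values.yaml: the only file developers normally need to touch
env:
  RAILS_ENV: production
  WEB_CONCURRENCY: "2"
```

Rendering the chart with `helm template` against a sample values file then gives us concrete manifests to assert on in CI.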

We’ve been liking Helm because it’s made it very nice to do simple template construction across the various environments that we’re building for. Having just that one values file, as opposed to needing multiple different folders, has been very helpful.

Fifth Example: CodeBuild and CodePipeline

Finally, let’s take a look at another approach using AWS’s CodeBuild and CodePipeline, which is the approach used in this final example and is what we used for a long time before adopting GitHub Actions. CodePipeline links together a set of steps that should be executed when code has changed. CodeBuild is the environment for them to run in. So it’s sort of like GitHub Actions: a workflow versus a job. The one feature we really like about CodePipeline, and that we haven’t seen in other tools, is the way it manages multi-repository workflows.

You can simply declare multiple sources for your pipeline, and it will keep track of the latest version of each. Whenever either one changes, it runs the pipeline with the latest version of both repositories. So the pipeline is very clear about what’s happening with the separate source and manifest repositories: you can see which version has changed.

When it gets to the build step, it runs separate projects for the manifest and the application code, and then the deploy combines them. So the presentation and implementation are much simpler than the cross-repository approach we take with GitHub Actions. It’s built in that you can specify which pipelines can access which repositories. You don’t need to have one repository trigger a workflow on the other repository, because the pipeline lives outside the repositories and is part of the infrastructure.
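
A heavily trimmed CloudFormation sketch of that multi-source idea (the connection ARN, IAM role, repository names, and the later stages are all illustrative placeholders):

```yaml
Resources:
  DeployPipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      RoleArn: arn:aws:iam::123456789012:role/example-pipeline-role   # hypothetical role
      ArtifactStore:
        Type: S3
        Location: example-pipeline-artifacts
      Stages:
        - Name: Source
          Actions:
            # Two sources: the pipeline tracks the latest revision of each and
            # re-runs whenever either one changes.
            - Name: AppSource
              ActionTypeId: { Category: Source, Owner: AWS, Provider: CodeStarSourceConnection, Version: "1" }
              Configuration:
                ConnectionArn: arn:aws:codestar-connections:us-east-1:123456789012:connection/example
                FullRepositoryId: example-org/myapp
                BranchName: main
              OutputArtifacts: [{ Name: AppCode }]
            - Name: ManifestSource
              ActionTypeId: { Category: Source, Owner: AWS, Provider: CodeStarSourceConnection, Version: "1" }
              Configuration:
                ConnectionArn: arn:aws:codestar-connections:us-east-1:123456789012:connection/example
                FullRepositoryId: example-org/myapp-manifests
                BranchName: main
              OutputArtifacts: [{ Name: Manifests }]
        # Build stages (one per source) and a Deploy stage that consumes both
        # artifacts would follow here.
```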

It would be nice if GitHub had something like this. However, using GitHub Actions is much simpler and cuts out a lot of the complexity, with the added benefit of putting the CI/CD at developers’ fingertips.

More Questions?

If you want to talk more about your team’s continuous integration or delivery pipelines, contact the Platform Engineering team.
