Evolving CI/CD: From Manual Automation to GitHub Actions

Introduction
In a dynamic development environment, the efficiency and automation of deployment processes are key to the speed and reliability of software releases. This post details my journey in facing and overcoming CI/CD challenges at the company I work for, culminating in the adoption of GitHub Actions for a more efficient and maintainable process.
The Starting Point
At the beginning of our journey to improve our deployment processes, we faced a scenario common to many startups and growing development teams. Our application consisted of a backend written in Python and a single-page application frontend in React. The former was deployed as a Docker container running on Kubernetes, while the latter was served as a static application through S3 and CloudFront.
Initially, I was responsible for developing and maintaining both parts of the application. This “jack-of-all-trades” approach is not uncommon in the early stages of a project but presents its own challenges, especially as the team grows in size and different skill sets.
Growing Pains
With the addition of a new developer focused exclusively on the frontend, the process began to split. In the testing environment, I continued to package the backend and push it to the repository, while the new developer took on the responsibility of compiling the frontend and updating the S3 bucket. In the production environment, however, to ensure proper integration, I continued to handle both processes, compiling and packaging both parts before sending them to their respective destinations.
With the addition of a second — more junior — frontend developer to the team, we were faced with new problems: the lack of experience and a different operating system made it quite difficult to maintain the same process. The new team member worked on a Windows machine with WSL, which couldn't reliably run the tooling we had developed for our macOS and Linux environments. The time spent adjusting build and deployment scripts to work on Windows and WSL was unsustainable.
We Needed to Automate
It was at this point that I recognized the critical need to automate more of our deployment process. The initial solution was designed only for the frontend and was pretty much a cloud wrapper around the tooling that we already had.
It involved the creation of a pipeline that started with a code push to our GitHub repository. This event triggered a webhook call to a Lambda function on AWS. Once triggered, the Lambda function was responsible for instantiating a virtual machine on EC2 from a custom AMI I had built, which included all runtime dependencies and automation scripts.
To ensure isolation and reproducibility, the VM built a Docker image from the code pulled from GitHub, compiled the frontend inside it, and synced the resulting assets with our S3 bucket.
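The launcher Lambda can be sketched roughly as below. This is a simplified illustration, not our actual code: the AMI ID, instance type, and the build script path baked into the AMI are all placeholders.

```python
import json

AMI_ID = "ami-0123456789abcdef0"  # placeholder for the custom builder AMI


def build_user_data(repo_url: str, commit: str) -> str:
    """Compose the startup script the builder VM runs on boot."""
    return "\n".join([
        "#!/bin/bash",
        f"git clone {repo_url} /build",
        f"cd /build && git checkout {commit}",
        "/opt/deploy/build.sh",  # hypothetical script baked into the AMI
        "shutdown -h now",       # let the instance shut itself down when done
    ])


def handler(event, context):
    import boto3  # preinstalled in the Lambda runtime

    payload = json.loads(event["body"])  # GitHub push-webhook payload
    boto3.client("ec2").run_instances(
        ImageId=AMI_ID,
        InstanceType="c5.large",
        MinCount=1,
        MaxCount=1,
        # "terminate" makes the self-shutdown above also release the VM
        InstanceInitiatedShutdownBehavior="terminate",
        UserData=build_user_data(
            payload["repository"]["clone_url"], payload["after"]),
    )
```

The self-terminating instance keeps costs proportional to build time: nothing runs between pushes.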
Aiming for consistency, we had decided in the early stages of development to keep the backend and frontend in the same repository, so that a single commit hash could represent a version of the code base for the whole application, reducing the chance of a clash between versions due to API changes, for example.
The frontend deployment solution worked well enough that it made sense to generalize it so that the backend was built on a similar process. For this case, it involved downloading the latest code from the GitHub repository, building it into a Docker container and pushing the resulting image to the Amazon Elastic Container Registry (ECR).
After a successful upload, the script would automatically update our Kubernetes deployment configuration to use the new Docker image, ensuring that the latest version of our backend was effectively deployed.
Each part would be built in parallel, so that the frontend and the backend were deployed at about the same time, maintaining their synchronization.
Recognizing the Problem
As our project and team grew, it became evident that automating our deployment process was not just a matter of convenience, but a critical necessity to maintain the efficiency of our development workflow.
The previous solution, despite being functional, carried too much operational complexity. It required someone to maintain the AMI and all the glue logic scripts. It also required our team to have at least one person who had the knowledge and experience to do so, which could become a problem for the company in my absence.
Since this was hardly a problem unique to us, I decided to embark on a journey of simplification, looking for a tool or service we could outsource this to and avoid the hassle of maintaining our custom solution.
The Choice for GitHub Actions
The natural candidates were GitHub Actions and AWS CodePipeline, as they come from the two major vendors we were already integrated with. Since all of our infrastructure is already set up on AWS, there is a growing concern about becoming too dependent on them: if we ever needed to move to another cloud provider, that dependence would cause too much disruption to our processes and become a source of instability. For this reason, GitHub Actions started out as the preferred option. Add to that how easy and inexpensive it is, and it became a no-brainer.
With it, we were able to define workflows directly in our project’s Git repository, using simple YAML configuration tightly coupled with the specific code revision.
The workflow consists of two main jobs that run in parallel: one for the backend and another for the frontend. This approach not only saves time but also allows for finer management of the dependencies and environments of each part of the application.
Similar to what we were doing before, the backend job involves building the Docker image and pushing it to the Amazon Elastic Container Registry (ECR). To overcome the challenge of cross-compiling for arm64 architectures, we use QEMU along with docker buildx. This allowed us to maintain compatibility with arm64 infrastructures, despite the current limitations of GitHub Actions in terms of native arm64 runners.
jobs:
  backend:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout the code
        uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          platforms: linux/arm64
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ...
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Run Buildx
        run: |
          docker buildx build \
            --platform linux/arm64 \
            ...
            --push
In parallel, the frontend job compiles the static assets and sends them to Amazon S3, from where they are served to users.
jobs:
  frontend:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/setup-node@v4
        with:
          node-version: '20.11.1'
      - name: Checkout the code
        uses: actions/checkout@v4
      - name: Build Frontend
        working-directory: ./frontend
        run: |
          npm install
          node ...
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ...
      - name: Deploy to S3
        working-directory: ./frontend
        run: aws s3 sync --delete ...
But there was still an issue
In our previous process, the backend pipeline ended with a call to Kubernetes to update the current image. Two solutions crossed my mind: either appending a kubectl command to the GitHub workflow to set the new image, or associating a Lambda function on AWS to be triggered when the image got pushed to ECR.
The first solution, though seemingly simple, isn't without downsides. The first is that our ECR would have to be open to the world. This isn't so much of a problem, especially given that this is actually the default configuration for it — and one that we've used for quite some time. But going forward, I'd like to keep access to it restricted to our VPC (which we reach through a VPN). Additionally, it means yet another permission granted to an AWS role (the one used by GitHub) that is already too powerful. And on top of that, it's a series of packages (kubectl, awscli and their dependencies) that I'd have to install on the builder machine and then maintain.
Mostly due to my familiarity with adding glue logic through AWS Lambda functions, implementing one to accomplish this task didn't seem like a big deal. It does carry the drawback of lacking the automatic error reporting a GitHub Action provides, but it's also a piece that, once set up, I've never seen fail. As mentioned, it allows the use of a role with powers limited to accessing our EKS cluster, and it keeps all the communication within our VPC.
Creating this Lambda function ended up being a careful and interesting exercise. My goal was to keep the function as simple and free of external dependencies as possible. This led me to dive deep into the inner workings of AWS EKS authentication and how kubectl, the Kubernetes command-line tool, manages this integration. The investigation led to an elegant solution that authenticates with EKS using boto3 (already available in the Lambda environment) and communicates with Kubernetes directly through a REST API call.
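A condensed sketch of that idea follows. The cluster name, namespace, deployment path and region are illustrative placeholders, and error handling plus CA-certificate verification are omitted for brevity; the token construction mirrors the presigned-STS recipe that kubectl's EKS authentication uses.

```python
import base64
import json
import urllib.request


def image_patch(container: str, image: str) -> dict:
    """Strategic-merge patch that swaps a deployment's container image."""
    return {"spec": {"template": {"spec": {
        "containers": [{"name": container, "image": image}]}}}}


def eks_bearer_token(cluster_name: str, region: str) -> str:
    """Reproduce kubectl's EKS auth: a presigned STS GetCallerIdentity
    URL, base64url-encoded behind the k8s-aws-v1 prefix."""
    import boto3  # boto3/botocore ship with the Lambda runtime
    from botocore.signers import RequestSigner

    session = boto3.session.Session()
    sts = session.client("sts", region_name=region)
    signer = RequestSigner(
        sts.meta.service_model.service_id, region, "sts", "v4",
        session.get_credentials(), session.events)
    url = signer.generate_presigned_url(
        {"method": "GET",
         "url": f"https://sts.{region}.amazonaws.com/"
                "?Action=GetCallerIdentity&Version=2011-06-15",
         "body": {}, "headers": {"x-k8s-aws-id": cluster_name},
         "context": {}},
        region_name=region, expires_in=60, operation_name="")
    return "k8s-aws-v1." + base64.urlsafe_b64encode(
        url.encode()).decode().rstrip("=")


def handler(event, context):
    import boto3

    cluster = boto3.client("eks").describe_cluster(
        name="my-cluster")["cluster"]  # "my-cluster" is a placeholder
    req = urllib.request.Request(
        cluster["endpoint"]
        + "/apis/apps/v1/namespaces/default/deployments/backend",
        data=json.dumps(image_patch("backend", event["image"])).encode(),
        method="PATCH",
        headers={
            "Authorization": "Bearer " + eks_bearer_token(
                "my-cluster", "us-east-1"),
            "Content-Type": "application/strategic-merge-patch+json",
        })
    urllib.request.urlopen(req)  # CA bundle verification omitted here
```

Because the patch is a plain HTTP PATCH against the Kubernetes API, nothing beyond the standard library and the preinstalled AWS SDK is needed inside the function.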
By opting for a solution that utilizes the standard library and the internal capabilities of the Lambda environment, we managed to end up with an efficient and low-maintenance update process.
Conclusion
The migration to GitHub Actions allowed the simplification of our deployment process, reducing our workload and increasing the maintainability of the process. Configurable workflows gave us the flexibility to define specific deployment tasks for the backend and frontend, with the additional advantage of being integrated into the GitHub ecosystem, which increases the visibility of the process and facilitates collaboration.
We were able to reduce manual workload and improve the reliability and maintainability of our deployment process. At the time of this writing, we’re still struggling with the longer cross build times — an issue that we expect to see resolved as GitHub rolls out the ARM64 machines.
The journey to improve the CI/CD process reflects the importance of always learning and adapting. By adopting new tools and practices, we were not only able to solve immediate problems but also prepare our infrastructure for the future.