Our CI/CD in Istio is fairly slow, despite many hours of optimizations.

In this post, I will explore what a hypothetical ideal CI system might look like, one that optimizes test execution time while keeping costs reasonable.

This isn't meant to be a good idea - the proposed system has a lot of issues. The intent here is just to explore how we could maximize test speed at the expense of everything else.

Current state

We run about 15-20 integration test jobs, each of which consists of:

  1. Create a Kubernetes cluster
  2. Build Istio images
  3. Install Istio
  4. Run a variety of tests
  5. Repeat steps 3-4 with different setups.

We have about 45 actual tests, but we somewhat arbitrarily divide them into different jobs to keep things running quickly. We also repeat some of the same tests with different configurations (such as different Kubernetes versions or settings).

These take 5-45 minutes, depending on luck, cache state, and the job size (multi-cluster jobs are the slowest).

A trace of one of our faster jobs:

A trace of a test

With the time spent broken down:

| Step | Time | Overall |
|---|---|---|
| Total | 678s | 100.00% |
| Git clone | 36s | 5.31% |
| Image pull | 75s | 11.06% |
| Cluster setup | 75s | 11.06% |
| Image Build | 125s | 18.44% |
| Test Compile | 20s | 2.95% |
| Test Setup | 92s | 13.57% |
| Test Execution | 196s | 28.91% |
| Post Test | 30s | 4.42% |

We can see that while our total time here is about 11 minutes, only about 3 minutes (~29%) is actually spent running tests.

Our hypothetical optimum is not really 3 minutes, though. That time covers 2 test packages (about 90s each), each with many subtests, so we could break things down even further.

Current Architecture

Currently, our CI/CD platform is Prow. This isn't too important; it mostly just schedules a Kubernetes Pod for each job we run, and we have near-full control over the Pod.

Our test pods use a (large) image with everything we need in it, rather than installing dependencies on-demand. This is usually cached on the node.

The main optimization we apply, aside from roughly partitioning the tests and making each step as fast as possible, is mounting a node-local cache for Go builds and modules.
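
As a rough sketch, each job just points Go's caches at a volume mounted from the node; the mount path below is illustrative, not our actual configuration:

```bash
# Point Go's build and module caches at a node-local volume mounted
# into the test pod (e.g. a Kubernetes hostPath mount).
# The /mnt/node-cache path is an assumption for illustration.
export GOCACHE=/mnt/node-cache/go-build
export GOMODCACHE=/mnt/node-cache/go-mod

go build ./...   # warm on later runs that land on the same node
```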

In the above trace, I must have gotten a new node, so there was no cached image or build cache to rely on.

One thing that is terrible about this approach is that each test job is completely independent, so every job builds the same images, and so on. Our platform has no DAG, only a flat list of jobs to run.

Ideal state

Right off the bat, there are some obvious things we will want.

Shared builds

Caching builds is table stakes. Building only once instead of N times is the next step, so we will need a CI/CD platform with DAG execution:

```mermaid
graph LR
  Builder--write-->I{{"Image Registry"}}
  Builder-->Job1
  Builder-->Job2
```

An initial task can do all the image building. However, it will need to store the output somewhere for the jobs to read. A Docker image registry should work here.
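
As a sketch, the builder could push images keyed by the commit under test, and each downstream job pulls by the same key; the registry path and tag scheme here are hypothetical:

```bash
# Builder task: build once and publish, keyed by the commit under test.
# The registry path and tag scheme are assumptions for illustration.
docker build -t gcr.io/example-ci/istio-pilot:"${COMMIT_SHA}" .
docker push gcr.io/example-ci/istio-pilot:"${COMMIT_SHA}"

# Each test job pulls the prebuilt image instead of rebuilding it.
docker pull gcr.io/example-ci/istio-pilot:"${COMMIT_SHA}"
```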

However, this is probably worse than the current state if the jobs just run `go test ./...` afterwards. Compiling the tests has a ton of overlap with compiling the images, so while this shaves off the "build images" time, it will almost certainly increase the time of the later steps.

Instead, the builder can also build the tests (`go test -c` compiles a test package without executing it). The resulting binaries can then be handed to the test jobs, which simply execute them. The test environment wouldn't even need Go in most cases.
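
As a sketch (the package path and output name are just examples):

```bash
# On the builder: compile one test package into a standalone binary.
# The package path and output name are assumptions for illustration.
go test -c -o /out/pilot.test ./tests/integration/pilot

# On a test runner: run the prebuilt binary directly, no Go toolchain needed.
# Compiled test binaries take the usual test flags with a -test. prefix.
/out/pilot.test -test.v -test.run 'TestTraffic'
```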

Sharing state

This common builder task doesn't really speed things up on its own. It de-duplicates work, but that work was previously being done in parallel anyway.

It does have decent cost-saving potential, though, as we can run the build task on large machines and the test executors on smaller machines (which, in turn, may be a bit faster).

Using a stateful builder may help resolve these problems, though. Keeping a single (very, very large) builder alive between runs allows us to share a ton of state to speed things up:

  • Git clones (5% of execution) can be near instant, as they just need to pull the incremental change. This can also be made cheap using git worktree (see the sketch after this list).
  • Go modules and build cache can be re-used.
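
A rough sketch of the incremental clone, assuming a persistent checkout on the builder at /cache/istio and per-run worktrees (paths, variables, and refs are illustrative):

```bash
# Persistent clone kept warm on the builder between runs; only the
# incremental change for this PR needs to be fetched.
git -C /cache/istio fetch origin "refs/pull/${PR_NUMBER}/head"

# Give this run its own cheap working directory without re-cloning.
git -C /cache/istio worktree add "/work/run-${BUILD_ID}" FETCH_HEAD

# Clean up when the run finishes.
git -C /cache/istio worktree remove "/work/run-${BUILD_ID}"
```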

With this new approach, we can also parallelize some of the test setup. When a new pipeline is kicked off, in parallel:

  • We tell the builder to start building the images and test binaries
  • We spin up a few workers. They can set up immediately (spin up a cluster, etc.), then wait for the builder to be far enough along for them to actually start testing.
```mermaid
graph LR
  subgraph builder
    direction LR
    I{{"Image Registry and Test Binaries"}}
    Builder
  end
  Builder--write-->I
  subgraph "Jobs (Many)"
    direction LR
    Setup1["Setup"]-->FetchTest1["Fetch Test"]
    FetchTest1-->RunTest1["Run Test"]
    FetchTest1--Depends-->I
  end
```
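
A crude way for a worker to block on the builder is to poll the registry for the artifacts it needs; the image path here is hypothetical:

```bash
# Worker: the cluster is already up; poll until the builder has
# published the image for this commit (registry path is hypothetical).
until docker pull gcr.io/example-ci/istio-pilot:"${COMMIT_SHA}"; do
  sleep 10
done
```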

No setup

While we are pretty parallelized now -- our cluster setup happens concurrently with building -- having a slow setup phase limits us. If tests take 10s each but setup takes a minute, we probably want to have fewer test runners. If setup is fast, though, it's more efficient to scale out and have more concurrency.

On most platforms, running N tasks in sequence on a single machine costs the same as running 1 task on each of N machines, as long as each task takes the same time: for example, one machine busy for 10 minutes costs about the same as 10 machines busy for 1 minute each, but the latter finishes 10x sooner. If setup is trivial, we meet this property and can parallelize our jobs more aggressively.

Instead of running an image with our dependencies and then starting long-running processes, we can start the long-running processes once and snapshot the state when they are ready. This can be done with Docker, Firecracker, and others. It makes our startup phase nearly instant.
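
For example, with Docker's experimental, CRIU-based checkpointing we could freeze a warmed-up container and restore it later instead of redoing setup; the image and names below are placeholders:

```bash
# Start the long-running setup once and let it warm up.
# (Image name is a placeholder; checkpointing needs the experimental
# Docker daemon plus CRIU installed on the host.)
docker run -d --name warm-env example-ci/test-env:latest

# Snapshot the running process state; this stops the container.
docker checkpoint create warm-env ready

# Restore from the snapshot later, skipping the setup work.
docker start --checkpoint ready warm-env
```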

New architecture

With these pieces in place, we end up with something like this:

Possible ideal architecture

When a change needs testing:

  1. The large builder instance picks it up
  2. It clones the changes (fast, due to our changes above)
  3. Builds images and test binaries
  4. As each test binary is ready, we trigger a new test runner to execute it

Our (completely made up) test timing now looks like:

| Step | Time | Overall |
|---|---|---|
| Total | 171.0s | 100.00% |
| Git clone | 1.0s | 0.58% |
| Image pull | 0.0s | 0.00% |
| Cluster setup | 0.0s | 0.00% |
| Image Build | 5.0s | 2.92% |
| Test Compile | 5.0s | 2.92% |
| Test Setup | 45.0s | 26.32% |
| Test Execution | 100.0s | 58.48% |
| Post Test | 15.0s | 8.77% |

58% of the time is now spent running our tests (85%, if we include our application-specific test setup, which is arguably part of the test). This is up from 28%/42%.

Our total runtime is also 4x faster than before.

Going further

One thing that still bugs me about this approach is that the builders may be processing multiple changes at once. Go is not good at efficiently compiling multiple things in separate `go` invocations.

What we really want is a long-running `go build` daemon that new actions can be added to, so that work is efficiently deduplicated across calls.

At a very high level, this seems plausible. The Go toolchain works by building a DAG of actions to process, so we "just" need to merge two DAGs together.