Go is often praised for its fast build times. While they are pretty quick, they are still slow enough that I spend a lot of time waiting for them, which eventually prompted me to go down the rabbit hole of thoroughly analyzing them. This post covers all aspects of what makes Go builds fast or slow.

Throughout this blog, we will use Istio as an example of a real-world codebase.

For reference on its size:

$ tokei -t=Go
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language            Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Go                   1814       453735       358883        54151        40701
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                1814       453735       358883        54151        40701
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

About 350k lines of code.

All tests will run on an n2d-standard-48 GCE VM (vCPU: 48, RAM: 192 GB, Disk: 2TB PD_SSD).

All tests will run in docker to avoid accidental cache sharing, and to easily test reduced CPU counts.

Key Takeaways

The post is a bit long; here is the tl;dr:

Go modules

Before starting a build, we need to pull down all our dependencies:

$ docker run --rm -v $PWD:$PWD -w $PWD -it --init golang:1.20 bash -c '
time go mod download
du -sh `go env GOMODCACHE`
'
real    0m5.585s
user    0m17.749s
sys     0m9.369s
1.1G    /go/pkg/mod

This is fast. On my home network, it was 50s instead.

This is basically a network speed test, though, which isn't very interesting. It also is the simplest aspect of Go builds to cache, so from now on we will ignore it and always use caching. We will also make a simple helper:

function run() {
  docker run --rm -v $PWD:$PWD -w $PWD -e 'GOFLAGS=-buildvcs=false' -v `go env GOMODCACHE`:/go/pkg/mod -it --init golang:1.20 bash -c "$*"
}
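As a quick sanity check that the module cache is mounted correctly (a hypothetical smoke test; output omitted):

$ run 'go env GOMODCACHE && du -sh /go/pkg/mod'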

Building

Now let's get to the actual building.

$ run time go build ./pilot/cmd/pilot-discovery

real    0m33.102s
user    6m31.190s
sys     1m45.876s

So we are looking at just over 30s for a standard build. But remember, this is on a massive machine.

Scaling CPU

Most builds are not done on 48 cores; GitHub Actions' default runners only have 2 cores, for example. Let's look at how build times scale with CPUs.

We will need another helper to test limited cores. I was also interested in whether setting GOMAXPROCS would help here -- it defaults to the number of processors the machine has, but it is not aware of cgroup limits, such as the ones Docker imposes. We will create a helper for both and compare.

function run-with-cores() {
  cores="$1"
  shift
  docker run --rm -v $PWD:$PWD -w $PWD \
    --cpus="$cores" \
    -e 'GOFLAGS=-buildvcs=false' \
    -v `go env GOMODCACHE`:/go/pkg/mod \
    -it --init golang:1.20 bash -c "$*"
}
function run-with-cores-gomaxprocs() {
  cores="$1"
  maxprocs="$2"
  shift; shift;
  docker run --rm -v $PWD:$PWD -w $PWD \
    --cpus="$cores" -e GOMAXPROCS="$maxprocs" \
    -e 'GOFLAGS=-buildvcs=false' \
    -v `go env GOMODCACHE`:/go/pkg/mod \
    -it --init golang:1.20 bash -c "$*"
}

Comparing the two

$ run-with-cores 4 'time go build ./pilot/cmd/pilot-discovery'
real    2m0.627s
user    6m21.382s
sys     1m25.052s

$ run-with-cores-gomaxprocs 4 4 'time go build ./pilot/cmd/pilot-discovery'
real    1m26.253s
user    4m34.381s
sys     0m59.795s

Setting GOMAXPROCS is a huge improvement.
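This matters for CI and other containerized environments, where CPU quotas are common. Rather than hard-coding a value, one option is to derive GOMAXPROCS from the cgroup quota. Below is a minimal sketch assuming cgroup v2 (the uber-go/automaxprocs library does the same thing from inside the process):

# Read the cgroup v2 CPU quota (e.g. "400000 100000" when run with --cpus=4)
# and export a matching GOMAXPROCS. Does nothing if no quota is set.
limits=$(cat /sys/fs/cgroup/cpu.max 2>/dev/null)
quota=${limits%% *}
period=${limits##* }
if [[ -n "$quota" && "$quota" != "max" ]]; then
  export GOMAXPROCS=$(( quota / period ))
fi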

I had assumed cores == GOMAXPROCS is ideal, but we ought to verify this:

$ run-with-cores-gomaxprocs 4 1 'time go build ./pilot/cmd/pilot-discovery'
real    4m3.413s

$ run-with-cores-gomaxprocs 4 2 'time go build ./pilot/cmd/pilot-discovery'
real    1m41.406s

$ run-with-cores-gomaxprocs 4 4 'time go build ./pilot/cmd/pilot-discovery'
real    1m24.643s

$ run-with-cores-gomaxprocs 4 8 'time go build ./pilot/cmd/pilot-discovery'
real    1m38.170s

$ run-with-cores-gomaxprocs 4 16 'time go build ./pilot/cmd/pilot-discovery'
real    1m53.609s

So let's make our function always do this:

function run-with-cores() {
  cores="$1"
  shift
  docker run --rm -v $PWD:$PWD -w $PWD \
    --cpus="$cores" -e GOMAXPROCS="$cores" \
    -e 'GOFLAGS=-buildvcs=false' \
    -v `go env GOMODCACHE`:/go/pkg/mod \
    -it --init golang:1.20 bash -c "$*"
}

Finally, we can look at how build times scale with CPU:

run-with-cores 1 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real    7m12.354s

run-with-cores 2 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real    3m50.390s

run-with-cores 4 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real    2m4.813s

run-with-cores 8 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real    1m10.144s

run-with-cores 16 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real    0m44.286s

run-with-cores 32 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real    0m33.755s

Charting this against a theoretically perfect scaling (that is, using 100 cores is 100x faster than 1 core):

Compile times per core

We can see it scales pretty well as the number of cores increases, but there are diminishing returns at a certain point. We will investigate this a bit more in the Tracing section.

Really Really Big machines

I got curious and ran a build on a c3-standard-176 (yes, that is 176 cores!).

$ time go build ./pilot/cmd/pilot-discovery

real    0m25.188s
user    6m28.325s
sys     1m26.272s

A bit disappointing - barely any speed up at all compared to 48 cores.

Caching

Fortunately, Go comes with pretty solid build caching out of the box:

$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go build -o /tmp/build2 ./pilot/cmd/pilot-discovery
'
real    0m32.577s
real    0m0.810s
real    0m4.918s

We can see subsequent builds are nearly instant - especially if we write to the same file. If we do not write to the same file, we are forced to link the binary, which seems to take about 4s.
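One way to sanity-check this is go build -x, which prints every command the build runs; with a fully warm cache, only a link invocation should show up, with no compile invocations (a sketch, output omitted):

$ run '
go build -o /tmp/build ./pilot/cmd/pilot-discovery
go build -x -o /tmp/build2 ./pilot/cmd/pilot-discovery 2>&1 | grep -E "tool/.*/(compile|link)"
'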

With updates

The above test isn't terribly realistic -- normally we change code in between builds:

$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
echo "var _ = 1" >> pilot/cmd/pilot-discovery/main.go
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
echo "var _ = 2" >> pilot/cmd/pilot-discovery/main.go
time go build -o /tmp/build2 ./pilot/cmd/pilot-discovery
'
real    0m32.601s
real    0m5.017s
real    0m4.995s

Here we can see the benefit of outputting to the same file is entirely lost, showing that it is likely an optimization for completely unchanged builds. Beyond that, though, the cost is pretty minimal.

In the above test, however, we are modifying a file at the top of the dependency chain. Things look a bit different if we change a deep dependency. In our app, the log package is used almost everywhere; let's try changing that:

$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
sed -i "s/\"message\"/\"new\"/g" pkg/log/config.go
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
'

real    0m31.760s
real    0m15.268s

Here we can see quite a bit more time is spent. Most likely, nearly every package in our repo is rebuilt, while external dependencies stay cached.
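We can roughly quantify the blast radius with go list: every package whose transitive dependencies include the log package must be recompiled. A sketch (the exact numbers matter less than their magnitude):

$ run '
# Total number of packages built into the binary
go list -deps ./pilot/cmd/pilot-discovery | wc -l
# Packages in the repo whose transitive deps include the log package
go list -f "{{.ImportPath}} {{.Deps}}" ./... | grep -c istio.io/istio/pkg/log
'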

Cache Persistence

What if we are switching between branches constantly? Or, in the case of CI, caching across branches?

run '
git config --global --add safe.directory /usr/local/google/home/howardjohn/go/src/istio.io/istio
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
git checkout release-1.18
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
git checkout master
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
'
real    0m31.690s
Switched to branch 'release-1.18'

real    0m37.476s
Switched to branch 'master'

real    0m5.006s

Just as fast. At least in this simple case, switching between branches doesn't hurt caching.
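For CI, the practical takeaway is that persisting the cache across branches is worthwhile. The two directories worth persisting can be found with (paths as they appear inside the golang image):

$ run 'go env GOCACHE GOMODCACHE'
/root/.cache/go-build
/go/pkg/mod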

Build tags

What if we change build tags? Below I test a tag that is not used anywhere:

run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go build -o /tmp/build -tags=nothing ./pilot/cmd/pilot-discovery
'

real    0m31.719s
real    0m4.956s

No impact on caching, as expected.

Concurrent Builds

When caching in CI, we may be building the same things concurrently. This is common when a commit schedules multiple jobs at the same time, and they both build similar things.

How well is it cached?

We will limit to 4 cores to ensure benefits are not masked by our high concurrency.

First we will run the build on its own to get a baseline:

$ run-with-cores 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery'

real    1m29.778s
$ run-with-cores 4 '
time go build -o /dev/null ./pilot/cmd/pilot-discovery &
time go build -o /dev/null ./pilot/cmd/pilot-discovery &
time wait
'

real    2m58.492s
real    2m58.650s

Interestingly, this is far slower than I expected. But this test is flawed: both builds are running under the same 4-core CPU limit.

Instead we need to run them separately, but with a shared cache:

$ function run-with-cores-cached() {
  cores="$1"
  shift
  docker run --rm -v $PWD:$PWD -w $PWD \
    --cpus="$cores" -e GOMAXPROCS="$cores" \
    -e 'GOFLAGS=-buildvcs=false' \
    -v `go env GOMODCACHE`:/go/pkg/mod \
    -v /tmp/gocache:/root/.cache/go-build \
    --init golang:1.20 bash -c "$*"
}
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real    1m34.677s
real    1m36.572s
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery'
real    0m5.163s

And we can see there is basically no caching going on with concurrent runs. If we do another build afterward, however, we can see the cache is working - just not concurrently.

This is understandable: theoretically if each execution is taking the same steps in the same amount of time, no actions will be cached. I am surprised at how little improvement there is, though.

Staggered builds

What if we stagger the builds a bit?

$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ sleep 20
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real    1m11.095s
real    1m31.409s
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ sleep 60
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real    1m31.126s
real    0m30.614s

We can see a trend: roughly, however long we stagger the builds, the second build is that much faster. This makes sense: we can either do the compilation work ourselves in N seconds, or wait for the other process to do it (also in N seconds) and read the cached result ~instantly.

Mismatched build sizes

Above, we had the exact same CPU allocation. What if we have a mismatch?

$ run-with-cores-cached 40 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real    0m32.319s
real    0m32.959s

Again both builds finish at the same time, but now it's the duration of the faster build. So the slower builder can benefit from the faster builder.

Interestingly, this means there may be benefits to throttling some jobs. Running the same test but giving each job access to all cores results in a slower overall runtime:

$ run-with-cores-cached 48 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ run-with-cores-cached 48 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real    0m37.321s
real    0m37.571s

The difference, though, isn't that high. This may be because with 48 cores, Go's compiler isn't maxing out CPU. We would likely see different results if our host machine was smaller.

We can take this to the extreme and run a bunch of jobs at once:

$ for i in {0..15}; do
  run-with-cores-cached 48 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
done
$ wait
...
real    2m11.136s # (repeated 16x)
...

Substantially slower. On the plus side, this was one of the first times I maxed out my machine's CPU usage.

If we just have one fast job and the rest smaller, we see much better results:

$ run-with-cores-cached 36 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ for i in {0..14}; do
  run-with-cores-cached 1 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
done
$ wait
...
real    0m39.383s # (repeated 16x)
...

Build concurrency

(Thanks to @josharian for the suggestion!)

The Go compiler has a variety of flags we can tune. Among these is -c, which controls the "concurrency during compilation". This defaults to the min of GOMAXPROCS and 4. Note this differs from -p, which controls how many build operations occur concurrently (discussed more in Integration Tests); -c controls the concurrency within a single compile operation.

Since we are using Go 1.20, which doesn't yet include a bugfix for this flag, we also set GO19CONCURRENTCOMPILATION=0 as a workaround. This stops the go command from setting its own -c flag, so our manually set value is respected (otherwise, its default would always be appended after ours).

Let's try it out:

$ run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=1"' ./pilot/cmd/pilot-discovery
$ run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=4"' ./pilot/cmd/pilot-discovery
$ run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=48"' ./pilot/cmd/pilot-discovery
$ run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=256"' ./pilot/cmd/pilot-discovery

real    0m40.473s
real    0m32.385s
real    0m29.666s
real    0m31.711s

We can see that, like setting GOMAXPROCS, we get optimal results by setting the value to our actual processor count. The difference, though, is quite small compared to controlling GOMAXPROCS or -p.

Building multiple binaries

So far we have looked at building just one binary. What if we need to build multiple?

Let's try a few different approaches:

  1. Building in sequence
$ BINARIES="./istioctl/cmd/istioctl ./pilot/cmd/pilot-discovery ./pkg/test/echo/cmd/client ./pkg/test/echo/cmd/server ./samples/extauthz/cmd/extauthz ./operator/cmd/operator ./tools/bug-report"
$ run "
time {
for bin in $BINARIES; do
  go build -o /tmp/ \$bin
done
};
"

real    1m25.276s
  2. Building in parallel
$ run "
for bin in $BINARIES; do
  go build -o /tmp/ \$bin &
done
time wait
"

real    1m4.005s
  3. Building in a single go build
$ run "time go build -o /tmp/ $BINARIES"

real    0m44.410s

Building in a single go build command is substantially faster!

Based on our other analysis, this makes sense.

Building sequentially can use the cache from previous builds, but any new dependencies still need to be built, and linking is not done in parallel.

Building in parallel fares a bit better, but as we saw in Concurrent Builds, caching is limited during concurrent executions.

Using a single command allows Go to maximize concurrency.

Tests

What about compiling tests? In Go, each package generates an individual test binary.

$ go list ./... | wc -l
497

That means almost 500 binaries are generated when we run go test ./...! As a result, compile times can be especially important when running tests.
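Strictly speaking, only packages that actually define tests get a test binary; a quick way to count those (a sketch using go list; output omitted):

$ run 'go list -f "{{if or .TestGoFiles .XTestGoFiles}}{{.ImportPath}}{{end}}" ./... | grep -c .'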

There isn't a test package that is an exact apples-to-apples comparison with the binary above, but the one used below is close enough in terms of dependencies.

$ run '
time go test -c -o /tmp/build ./pilot/pkg/xds
time go test -c -o /tmp/build ./pilot/pkg/xds
'
real    0m31.007s
real    0m1.760s

This takes about the same amount of time as building the binary.

Race

The Race Detector is extremely useful for tests; we always run tests with it enabled. However, it comes at a cost:

$ run 'time go test -c -o /tmp/build -race ./pilot/pkg/xds'

real    0m44.167s

Nearly 50% longer build times.

Vet

Go automatically runs some linters on test compilation. These can be turned off:

$ run 'time go test -c -o /tmp/build -vet=off ./pilot/pkg/xds'

real    0m28.965s
user    5m8.443s
sys     1m18.009s
$ run 'time go test -c -o /tmp/build ./pilot/pkg/xds'

real    0m30.658s
user    6m49.299s
sys     1m43.606s

Here we see a small improvement in real time, but a pretty large improvement in user time; I suspect the 48 cores are masking some of the cost of vet.

Caching tests

Does building a binary warm up the cache for building a test?

$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go test -c -o /tmp/build ./pilot/pkg/xds
'

real    0m31.690s
real    0m8.980s

Yep, it does!

What about if we use -race?

$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go test -c -o /tmp/build -race ./pilot/pkg/xds
'

real    0m31.551s
real    0m43.568s

Nope, we lose the benefits here.
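If we want a warm cache for race-enabled tests, the warming build needs -race as well (a sketch; timings not measured here):

$ run '
time go build -race -o /tmp/build ./pilot/cmd/pilot-discovery
time go test -c -o /tmp/build -race ./pilot/pkg/xds
'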

Linking vs compiling

To measure the time spent linking vs compiling, we can build a custom tool that compiles without linking. This is especially helpful for verifying that all code and tests compile, so that is what we will measure below.

Go 1.21 can compile tests for multiple packages directly, but it still links them and fails if any packages share a name. Instead, we can use a workaround: compile and link the tests, but just not execute them.

run '
time go test -exec=true -vet=off ./...
time go test -exec=true -vet=off ./...
'
real    2m33.337s
real    0m51.978s

Even repeated runs are pretty slow. Given we know all compilation is cached, we can infer that linking all the test binaries takes roughly 50s.

Now without linking at all:

# Setup our tool
cat <<'EOF' > ./go-compile-without-link
#!/bin/bash

if [[ "${2}" == "-V=full" ]]; then
  "$@"
  exit 0
fi
case "$(basename ${1})" in
  link)
    # Output a dummy file
    touch "${3}"
    ;;
  *)
    "$@"
esac
EOF
chmod +x go-compile-without-link
# Now build without linking
run '
time go test -exec=true -toolexec=$PWD/go-compile-without-link -vet=off ./...
time go test -exec=true -toolexec=$PWD/go-compile-without-link -vet=off ./...
'
real    1m18.623s
real    0m3.055s

Not only is this 2x faster overall, with a warm cache it's nearly 20x faster. This is a huge win for development iteration; I regularly run the above command before committing as a lightweight check.

Warning: this may miss some link-time errors. In my experience, these are rare enough to make this still extremely useful for development cases. If you use CGO this may not be true.

Alternative linkers

Given linking is such a large contributor to slow build times, what if we used a faster linker? mold is commonly used in other languages and offers impressive improvements over traditional linkers. While not common, it can be used with Go as well.

Since the linking is on the order of seconds, we will use hyperfine to more accurately measure the timings.

$ hyperfine 'rm test; time go build -ldflags "-linkmode external -extld clang -extldflags -fuse-ld=mold"' 'rm test; time go build'
Benchmark 1: rm test; time go build -ldflags "-linkmode external -extld clang -extldflags -fuse-ld=mold"
  Time (mean ± σ):      1.880 s ±  0.038 s    [User: 2.876 s, System: 2.142 s]
  Range (min … max):    1.815 s …  1.938 s    10 runs

Benchmark 2: rm test; time go build
  Time (mean ± σ):      1.582 s ±  0.022 s    [User: 2.693 s, System: 1.043 s]
  Range (min … max):    1.549 s …  1.615 s    10 runs

Summary
  'rm test; time go build' ran
    1.19 ± 0.03 times faster than 'rm test; time go build -ldflags "-linkmode external -extld clang -extldflags -fuse-ld=mold"'

Looks like in this case, linking with mold is slightly slower.

Analyzing what contributes to build times

There are a few tools that can help analyze build times.

Custom toolexec wrapper.

Note: this option is not the best one, but I wrote this section before I found the other (better) tools.

Go can emit logs of every command it runs with -x, but the output is extremely verbose. Fortunately, we can wrap tool invocations with a script that logs what is taking time.

This can be invoked using -toolexec during builds.
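A minimal sketch of such a wrapper is below; the script name and log format here are my own, not necessarily what produced the data that follows:

cat <<'EOF' > ./tool-time
#!/bin/bash
# Run the real tool, recording the tool name, duration (ms), and last argument
# (typically a source file) to a log we can aggregate afterwards.
start=$(date +%s%N)
"$@"
code=$?
end=$(date +%s%N)
echo "$(basename "$1") $(( (end - start) / 1000000 ))ms ${*: -1}" >> /tmp/tool-times.log
exit $code
EOF
chmod +x tool-time
go build -toolexec=$PWD/tool-time ./pilot/cmd/pilot-discovery
sort -k2 -rn /tmp/tool-times.log | head   # slowest tool invocations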

Here are the results of building the same ./pilot/cmd/pilot-discovery binary. Note: since we are using multiple cores, the sum adds up to longer than the actual wall-clock runtime; it roughly corresponds to the user time from the earlier time commands.

Action         SUM of Duration (s)
link                   3.789593331
gcc                    0.022386331
compile              305.6692171
cgo                    1.526077562
asm                    2.99016495
Grand Total          313.9974393

Overall, we can see the vast majority of the time is spent on compile, so we will focus there. link is also interesting, though -- despite being only about 1% of the time, it is the least cacheable and parallelizable step.

Looking at each module is more interesting:

Location                                    SUM of Duration (s)
k8s.io/client-go                                    67.86363857
github.com/envoyproxy/go-control-plane              48.72126327
local                                               44.15472995
std                                                 25.53394494
k8s.io/api                                          23.03486699
k8s.io/apimachinery                                  7.960915542
google.golang.org/grpc                               6.152948778
google.golang.org/protobuf                           5.491289355
istio.io/client-go                                   4.632729928
istio.io/api                                         4.437340682
github.com/google/cel-go                             4.240151317
k8s.io/kube-openapi                                  3.00321029
github.com/google/gnostic                            2.668591864
golang.org/x/net                                     2.611716846
k8s.io/apiextensions-apiserver                       2.411767233
github.com/google/s2a-go                             2.291722416
github.com/gogo/protobuf                             2.07997357
github.com/google/go-containerregistry               2.050610622

There are quite a few of these, so only those that took >2s are included. local means the local module's code (Istio), and std means Go's own standard library.

What about individual files? Again, only those that are >1s:

File                                                    SUM of Duration (s)
core/v1/zz_generated.deepcopy.go                                3.901002923
src/runtime/write_err.go                                        1.755470748
envoy/config/route/v3/scoped_route.pb.validate.go               1.579764655
parser/gen/doc.go                                               1.490590828
src/net/http/transport_default_other.go                         1.475054625
utils_set.go                                                    1.436838859
envoy/config/core/v3/udp_socket_config.pb.validate.go           1.436230022
pilot/pkg/serviceregistry/kube/controller/util.go               1.399423632
pilot/pkg/xds/xdsgen.go                                         1.373964964
pilot/pkg/model/xds_cache.go                                    1.341942513
ztypes.go                                                       1.336568761
yamlprivateh.go                                                 1.220097377
proto/wrappers_gogo.go                                          1.193810348
openapiv2/document.go                                           1.122113029
openapiv3/document.go                                           1.086110353
pilot/pkg/bootstrap/webhook.go                                  1.054076742
pilot/pkg/config/kube/gateway/model.go                          1.051825111
pilot/pkg/networking/core/v1alpha3/waypoint.go                  1.000428311

Overall, I suspect the per-file data is a bit skewed; it seems unlikely that write_err.go, a 14-line file, actually takes 1.7s to compile on its own. Analyzing at the package level is likely more appropriate. Other tools handle this; see below.

Action Graph

Go provides a very useful, undocumented flag -debug-actiongraph which can emit the "action graph" generated during the build. The action graph is basically a DAG of steps the compiler takes to compile, link, run tests, etc. The -debug-actiongraph flag will emit this DAG as a JSON file, along with timing information.

$ run 'go build -debug-actiongraph=/tmp/actiongraph ./pilot/cmd/pilot-discovery'

The total output is quite large, but for some context a single element looks like this:

{
  "ID": 1500,
  "Mode": "build",
  "Package": "github.com/modern-go/concurrent",
  "Deps": [
    27,
    8,
    266,
    175,
    5,
    96,
    6,
    931,
    45,
    24
  ],
  "Objdir": "/tmp/go-build1707548832/b322/",
  "Priority": 313,
  "NeedBuild": true,
  "ActionID": "rfdTmXlzNWtJZwTp13Ge",
  "BuildID": "rfdTmXlzNWtJZwTp13Ge/OeZ8FbMdRkPAoFt_nshl",
  "TimeReady": "2023-07-05T17:13:41.827369909Z",
  "TimeStart": "2023-07-05T17:13:41.838775857Z",
  "TimeDone": "2023-07-05T17:13:41.930219542Z",
  "Cmd": [
    "/usr/local/go/pkg/tool/linux_amd64/compile -o /tmp/go-build1707548832/b322/_pkg_.a -trimpath \"/tmp/go-build1707548832/b322=>\" -p github.com/modern-go/concurrent -lang=go1.16 -complete -buildid rfdTmXlzNWtJZwTp13Ge/rfdTmXlzNWtJZwTp13Ge -goversion go1.20.5 -c=4 -nolocalimports -importcfg /tmp/go-build1707548832/b322/importcfg -pack /go/pkg/mod/github.com/modern-go/[email protected]/executor.go /go/pkg/mod/github.com/modern-go/[email protected]/go_above_19.go /go/pkg/mod/github.com/modern-go/[email protected]/log.go /go/pkg/mod/github.com/modern-go/[email protected]/unbounded_executor.go"
  ],
  "CmdReal": 87273795,
  "CmdUser": 52534000,
  "CmdSys": 8082000
}

Fortunately, the actiongraph tool can help parse this into useful data:

$ actiongraph -f /tmp/actiongraph top
  4.143s   1.41%  build k8s.io/api/core/v1
  4.077s   2.80%  link  istio.io/istio/pilot/cmd/pilot-discovery
  1.936s   3.45%  build net
  1.851s   4.08%  build runtime
  1.533s   4.61%  build github.com/google/cel-go/parser/gen
  1.518s   5.12%  build github.com/envoyproxy/go-control-plane/envoy/config/route/v3
  1.492s   5.63%  build github.com/envoyproxy/go-control-plane/envoy/config/core/v3
  1.464s   6.13%  build github.com/antlr/antlr4/runtime/Go/antlr/v4
  1.463s   6.63%  build istio.io/istio/pilot/pkg/xds
  1.463s   7.12%  build istio.io/istio/pilot/pkg/serviceregistry/kube/controller
  1.428s   7.61%  build istio.io/istio/pilot/pkg/model
  1.420s   8.09%  build net/http
$ actiongraph -f /tmp/actiongraph tree -L 2 | head -n 20
289.792s          (root)
101.954s            k8s.io
 64.735s              k8s.io/client-go
 23.827s              k8s.io/api
  6.883s              k8s.io/apimachinery
  2.853s              k8s.io/kube-openapi
  2.424s              k8s.io/apiextensions-apiserver
  0.610s              k8s.io/utils
  0.400s              k8s.io/klog
  0.222s              k8s.io/apiserver
 78.280s            github.com
 42.103s              github.com/envoyproxy
 12.069s              github.com/google
  2.473s              github.com/prometheus
  2.052s              github.com/gogo
  1.562s              github.com/lestrrat-go
  1.464s              github.com/antlr
  1.449s              github.com/klauspost
  1.331s              github.com/miekg
  0.990s              github.com/cncf

We get roughly the same information as before, but in a form that is much easier to use and query.
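If you want to try it yourself, I believe the tool can be installed with:

$ go install github.com/icio/actiongraph@latest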

Tracing

Go also has a newer -debug-trace flag which emits fairly similar information, but in the Trace Event Format. This is a really simple format, but it gives us a lot of information.

Entries in the trace look like:

[{"name":"Executing action (link istio.io/istio/pilot/cmd/pilot-discovery)","ph":"B","ts":1688578195814105.2,"pid":0,"tid":43}
,{"name":"Executing action (link istio.io/istio/pilot/cmd/pilot-discovery)","ph":"E","ts":1688578195827795.8,"pid":0,"tid":43}]

ph: B represents the "Begin" time, while ph: E represents the "End" time. tid is the goroutine.

These can be visualized to give an overview of how each goroutine is spending time. Some examples are below:

One thing we can see here is that while the 48-core run executes faster, a lot of time is spent not executing actions. Further investigation reveals this is because there is no remaining work to be done -- all cores are blocked waiting for another core to complete a dependency.

Another interesting observation: in both cases, the build of the runtime package blocks all other threads for about 2s.

The tracing view is incredibly useful to help understand not just the timing of each action, but the context as well. It's much worse for something to take 10s when it blocks all other threads, rather than running in parallel.

Profiling

Go also provides the ability to profile compilation and linking. For example: go build -ldflags='-cpuprofile /tmp/linker-cpu.prof' -gcflags='-cpuprofile /tmp/compiler-cpu.prof' ..

The resulting compiler profile was empty, though, so I suspect I did something wrong. However, the linker is what I am more interested in anyway.

Linking an example program results in this profile. As expected, nothing here stands out as low-hanging fruit - the Go linker is pretty well optimized.

Goda

goda, the "Go Dependency Analysis toolkit", is one of my favorite tools for understanding Go builds.

While it doesn't directly analyze build times, it is invaluable for understanding dependencies in a Go program. As we saw above, dependencies are a huge contributor to build times, so the two are tightly related.

While goda has a large number of functions, my favorites are tree and reach. For example, to see why a package depends on another package:

$ goda tree 'reach(./pilot/cmd/pilot-discovery:all, k8s.io/api/authorization/v1beta1)' | head
  ├ istio.io/istio/pilot/cmd/pilot-discovery
    └ istio.io/istio/pilot/cmd/pilot-discovery/app
      └ istio.io/istio/pilot/pkg/bootstrap
        ├ istio.io/istio/pilot/pkg/config/kube/crdclient
          ├ istio.io/istio/pkg/kube
            ├ istio.io/client-go/pkg/clientset/versioned
              └ k8s.io/client-go/discovery
                └ k8s.io/client-go/kubernetes/scheme
                  └ k8s.io/api/authorization/v1beta1

Real world analysis

Using the tools above, I set out to understand why our builds were slow.

Slow test execution

In analyzing real-world compilation traces during our CI, I discovered something quite odd during our unit test job. This more or less runs go test ./....

Below is a snippet of the trace:

Trace execution of unit tests

The full trace is also available, but it is 150MB.

We can see the test run for istio.io/istio/pkg/test/framework/resource/config/cleanup spans almost the entire runtime of the job, over 50s. Digging deeper reveals something even more interesting -- this package has no tests.

The root cause can be found by investigating what the test run action actually does. In general, the go command's action graph ensures actions are not run until all of their dependencies are complete. However, in order to ensure test start logs are emitted in the correct order, some serialization of test starts was added outside of this dependency mechanism. This can be found here.

The end result is that a test run, even one that would be very quick (or has no tests at all), may start and then immediately block waiting for other tests to complete. This means we end up hogging a worker (generally, there is one worker per core) while waiting on another task.

In the extreme case, the very last alphabetical package may start first, blocking an entire worker for the entire test duration.

Based on this analysis, I have opened a Go issue and a possible fix.

Integration Tests

In Istio, integration tests are separated from unit tests with a build tag integ. Additionally, tests are expected to be run with -p 1, which tells Go to only execute one package at a time.

This ensures shared resources (Kubernetes clusters, generally) are not accessed concurrently.

The integration tests, in particular, always felt a bit slow in Istio. Looking at the build tracing makes the reason obvious: -p 1 did more than we wanted!

What we wanted was to execute tests sequentially -- but this also compiles them sequentially.

This takes 8m25s.

I didn't find any great ways to fix this out of the box, but we can use some of our findings from earlier to improve things.

$ run '
time go test -exec=true -toolexec=$PWD/go-compile-without-link -vet=off -tags=integ ./tests/integration/...
time go test -exec=true -vet=off -tags=integ -p 1 ./tests/integration/...
'

Here, we precompile everything (without linking) first, to warm the build cache, then execute the tests as usual. The first step takes just 40s, while the execution takes 3m40s -- overall 2x faster than the naive approach.

However, we could still do a lot better. Most of the time is spent on linking the test binaries, which is still done sequentially. We cannot just build and link everything before we execute, either, as we saw earlier -- linking is only cached if the binary is saved.

This is somewhat solved in Go 1.21, which allows -c (saving the test binaries) when testing multiple packages. This can be awkward to use, though, as packages with the same name will conflict.

We had quite a few conflicts, which led to this hacky approach:

$ run-21 '
grep -r TestMain tests/integration -l |
  xargs -n1 dirname |
  sort |
  uniq |
  grep -v tests/integration/telemetry/stackdriver/api |
  grep -v tests/integration/security/fuzz |
  grep -v tests/integration$ |
  xargs -I{} echo './{}' |
  xargs go test -tags=integ -c -o testsout/ -vet=off 
grep -r TestMain tests/integration -l |
  xargs -n1 dirname |
  sort |
  uniq |
  grep -v tests/integration/telemetry/stackdriver/api |
  grep -v tests/integration/security/fuzz |
  grep -v tests/integration$ |
  xargs -I{} echo './{}' |
  xargs go test -p 1 -tags=integ -o testsout/ -vet=off  -exec=true
'

The results here are superb - the first run takes just 50s, and the latter is almost instant.

Kubernetes

From Analyzing Build Times, we saw the root of most of our problems is Kubernetes libraries. Just from k8s.io/client-go and a few of its dependencies (k8s.io/api and k8s.io/apimachinery), we are up to 100s compile time!

This is the standard for any project interacting with Kubernetes, which is a pretty huge segment of Go usage. Excluding forks, it is used by over 10k projects on GitHub!

It's not just compile time, either.

Simply importing this library has a surprising impact on binary size as well:

$ go mod init test
$ cat <<EOF>main.go
package main

import _ "k8s.io/client-go/kubernetes"

func main() {}
EOF
$ go mod tidy
$ go build
$ du -h main
40M     main

Yikes! 40MB for an essentially empty binary.

Kubernetes seems almost uniquely bad in this regard. On the surface, github.com/envoyproxy/go-control-plane looks pretty bad as well - at 50s, it's about half the cost of Kubernetes. But Istio explicitly imports every package in the module, which is pretty atypical. Removing this file drops the compile time for go-control-plane to 15s (there is still a decent amount of the library imported through core usage).

The same simple binary above importing github.com/envoyproxy/go-control-plane/envoy/type only comes out to 6MB. Go is also smart enough to completely elide the dependency in some cases; for example, grpc-go depends on go-control-plane, but it doesn't show up as a dependency unless the xds package is imported (uncommon for grpc-go users).

The reason Kubernetes' client fares so much worse is that it cannot be consumed in parts. I would imagine most applications do not use every version of every Kubernetes API, but the first thing a client-go user does is import a package that depends on every version of every API. This, in turn, forces the Go compiler to compile all Kubernetes APIs for any user of the library.
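Continuing with the tiny module above, we can see this directly; the clientset package alone pulls in every k8s.io/api group and version (a rough sketch; exact counts vary by version, output omitted):

# Total packages in the dependency graph of the clientset package
$ go list -deps k8s.io/client-go/kubernetes | wc -l
# How many of those are k8s.io/api/<group>/<version> packages
$ go list -deps k8s.io/client-go/kubernetes | grep -c "^k8s.io/api/"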

These APIs don't just contain (fairly massive) Go structs, either; they also include generated protobuf, deepcopy, and swagger docs.

All of this adds up to a perfect storm of slow compile times and large binary sizes.

Test execution ordering

While build ordering is pretty intuitive, tests are a bit more involved. Running go test ./... means compiling a bunch of code, linking a test binary for each package, running go vet on it, executing the test, and printing the result.

In a real world project this can be a bit overwhelming, so we will use a dummy project to analyze:

for i in a b c d; do
	mkdir -p $i
	cat <<EOF > "$i/${i}_test.go"
package $i

import (
	"testing"
	"time"
)

func TestSleep(t *testing.T) {
	if "$i" == "b" {
		time.Sleep(time.Second)
	}
}
EOF
done
GOMAXPROCS=2 go test -debug-trace=trace ./...

Results in this trace.

Here we can see a few interesting things:

  • Test execution will start in order (alphabetical).
  • A single package will be fully executed before moving onto other work (that is, we build and execute package a first, rather than building a-d before we start executing).
  • Multiple processes can make the above non-blocking, however. In our trace, we have one package that has a slow test. While that is executing, the rest of the packages are built and executed in the meantime.

Building and Testing

Istio's unit-tests job -- arguably one of the most important jobs to keep fast for rapid iteration -- has for years run, effectively:

$ go build ./...
$ go test -race ./...

This was done because we want to check the size of the release binaries to avoid regressions. At the time, it was assumed this was mostly "free".

However, from the Race detection analysis, we know this is false!

Unless the cache is already warm, this would add roughly 45s to our jobs, which run with 16 CPUs. Not the end of the world, but sometimes every second counts.

This is easily fixed by moving the check to a job which already builds the binaries.

Comparing to Rust

In contrast to Go, Rust is often criticized for "slow build times". Is it really worse than Go?

Answering this question is pretty hard, as the two are very far from apples-to-apples.

That being said, we can get some very rough data by looking at some other projects. Note that unlike Go, Rust has different development profiles. Typically development builds are used when testing locally, while a release build is used for... releases. Both times are important.

linkerd2-proxy:

$ docker run --rm -v $PWD:$PWD -w $PWD -it --init --entrypoint bash ghcr.io/linkerd/dev:v40-rust -c '
cargo fetch
time cargo build --package=linkerd2-proxy
cargo clean
time cargo build --package=linkerd2-proxy --profile release
'

Finished dev [unoptimized] target(s) in 1m 43s
Finished release [optimized] target(s) in 5m 18s

ztunnel:

$ docker run --rm -v $PWD:$PWD -w $PWD -it --init --entrypoint bash gcr.io/istio-testing/build-tools:master-4c71169512fe79b59b89664ef1b273da813d1c93 -c '
cargo fetch
time cargo build --package=ztunnel
cargo clean
time cargo build --package=ztunnel --profile release
'
Finished dev [unoptimized + debuginfo] target(s) in 40.71s
Finished release [optimized] target(s) in 2m 19s

So at a very high level, the claim holds up to some extent, especially when comparing release build times. At least comparing these projects, Rust is a bit slower - but not as slow as commonly perceived, in my opinion.