Go is often praised for its fast build times. While they are pretty quick, they are slow enough that I spend a lot of time waiting for them, enough that it prompted me to go down the rabbit hole of thoroughly analyzing them. This post covers all aspects of what makes Go builds fast or slow.
Throughout this blog, we will use Istio as an example of a real-world codebase.
For reference on its size:
$ tokei -t=Go
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Language Files Lines Code Comments Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Go 1814 453735 358883 54151 40701
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total 1814 453735 358883 54151 40701
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
About 350k lines of code.
All tests will run on an n2d-standard-48 GCE VM (vCPU: 48, RAM: 192 GB, Disk: 2TB PD_SSD).
All tests will run in docker to avoid accidental cache sharing, and to easily test reduced CPU counts.
Key Takeaways
The post is a bit long, so here is the tl;dr:
- Go's build caching is highly effective and didn't give any surprises. Just make sure your CI/CD is actually using it!
- Go provides a number of (mostly hidden) tools to thoroughly analyze build times.
- If you are building in containers (or otherwise limiting CPU usage), make sure to set GOMAXPROCS.
- Be wary of using -p 1 to limit parallel test execution.
- Consider using -vet=off on test executions, if you are already running vet in other contexts.
- If you are just checking if code compiles, consider skipping the linking step. This is especially useful when checking if tests compile.
- Use a single go build command if you need to build multiple binaries.
Go modules
Before starting a build, we need to pull down all our dependencies:
$ docker run --rm -v $PWD:$PWD -w $PWD -it --init golang:1.20 bash -c '
time go mod download
du -sh `go env GOMODCACHE`
'
real 0m5.585s
user 0m17.749s
sys 0m9.369s
1.1G /go/pkg/mod
This is fast. On my home network, it was 50s instead.
This is basically a network speed test, though, which isn't very interesting. It also is the simplest aspect of Go builds to cache, so from now on we will ignore it and always use caching. We will also make a simple helper:
function run() {
docker run --rm -v $PWD:$PWD -w $PWD -e 'GOFLAGS=-buildvcs=false' -v `go env GOMODCACHE`:/go/pkg/mod -it --init golang:1.20 bash -c "$*"
}
Building
Now let's get to the actual building.
$ run time go build ./pilot/cmd/pilot-discovery
real 0m33.102s
user 6m31.190s
sys 1m45.876s
So we are looking at just over 30s for a standard build. But remember, this is on a massive machine.
Scaling CPU
Most builds are not done on 48 cores; GitHub Actions default runners only have 2 cores, for example. Let's look at how build times scale with CPUs.
We will need another helper to test limited cores. I was also interested in whether setting GOMAXPROCS would help here -- it defaults to the number of processors the machine has, but is not aware of cgroup limits, such as the ones docker is imposing. We will create a helper for both and compare.
function run-with-cores() {
cores="$1"
shift
docker run --rm -v $PWD:$PWD -w $PWD \
--cpus="$cores" \
-e 'GOFLAGS=-buildvcs=false' \
-v `go env GOMODCACHE`:/go/pkg/mod \
-it --init golang:1.20 bash -c "$*"
}
function run-with-cores-gomaxprocs() {
cores="$1"
maxprocs="$2"
shift; shift;
docker run --rm -v $PWD:$PWD -w $PWD \
--cpus="$cores" -e GOMAXPROCS="$maxprocs" \
-e 'GOFLAGS=-buildvcs=false' \
-v `go env GOMODCACHE`:/go/pkg/mod \
-it --init golang:1.20 bash -c "$*"
}
Comparing the two
$ run-with-cores 4 'time go build ./pilot/cmd/pilot-discovery'
real 2m0.627s
user 6m21.382s
sys 1m25.052s
$ run-with-cores-gomaxprocs 4 4 'time go build ./pilot/cmd/pilot-discovery'
real 1m26.253s
user 4m34.381s
sys 0m59.795s
Setting GOMAXPROCS gives a huge improvement.
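This makes sense once you see what Go thinks it has to work with: the default GOMAXPROCS comes from the OS-reported CPU count, which ignores the CFS quota that docker's --cpus flag sets. A quick throwaway check (not part of the benchmarks) makes this visible:
$ docker run --rm --cpus=4 golang:1.20 bash -c '
cat <<EOF > /tmp/maxprocs.go
package main
import ("fmt"; "runtime")
func main() { fmt.Println(runtime.GOMAXPROCS(0)) }
EOF
go run /tmp/maxprocs.go
'
On this 48-core machine, that prints 48 despite the 4-CPU limit, so by default the toolchain schedules far more parallel work than the quota allows.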
I had assumed cores == GOMAXPROCS is ideal, but we ought to verify this:
$ run-with-cores-gomaxprocs 4 1 'time go build ./pilot/cmd/pilot-discovery'
real 4m3.413s
$ run-with-cores-gomaxprocs 4 2 'time go build ./pilot/cmd/pilot-discovery'
real 1m41.406s
$ run-with-cores-gomaxprocs 4 4 'time go build ./pilot/cmd/pilot-discovery'
real 1m24.643s
$ run-with-cores-gomaxprocs 4 8 'time go build ./pilot/cmd/pilot-discovery'
real 1m38.170s
$ run-with-cores-gomaxprocs 4 16 'time go build ./pilot/cmd/pilot-discovery'
real 1m53.609s
So let's make our function always do this:
function run-with-cores() {
cores="$1"
shift
docker run --rm -v $PWD:$PWD -w $PWD \
--cpus="$cores" -e GOMAXPROCS="$cores" \
-e 'GOFLAGS=-buildvcs=false' \
-v `go env GOMODCACHE`:/go/pkg/mod \
-it --init golang:1.20 bash -c "$*"
}
Finally, we can look at how build times scale with CPU:
run-with-cores 1 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real 7m12.354s
run-with-cores 2 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real 3m50.390s
run-with-cores 4 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real 2m4.813s
run-with-cores 8 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real 1m10.144s
run-with-cores 16 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real 0m44.286s
run-with-cores 32 'time go build -o /tmp/build ./pilot/cmd/pilot-discovery'
real 0m33.755s
Charting this against a theoretically perfect scaling (that is, using 100 cores is 100x faster than 1 core):
We can see it scales pretty well as the number of cores increases, but there are diminishing returns at a certain point. We will investigate this a bit more in the Tracing section.
Really Really Big machines
I got curious and ran a build on a c3-standard-176 (yes, that is 176 cores!).
$ time go build ./pilot/cmd/pilot-discovery
real 0m25.188s
user 6m28.325s
sys 1m26.272s
A bit disappointing - barely any speed up at all compared to 48 cores.
Caching
Fortunately, Go comes with pretty solid build caching out of the box:
$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go build -o /tmp/build2 ./pilot/cmd/pilot-discovery
'
real 0m32.577s
real 0m0.810s
real 0m4.918s
We can see subsequent builds are nearly instant - especially if we write to the same file. If we do not write to the same file, we are forced to link the binary, which seems to take about 4s.
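One rough way to confirm that the extra ~4s is the link step is to rerun with -x, which prints every command the go tool executes; with a warm cache and a new output path, the only heavyweight command left should be the link invocation:
$ run '
go build -o /tmp/build ./pilot/cmd/pilot-discovery
go build -x -o /tmp/build2 ./pilot/cmd/pilot-discovery 2>&1 | grep "/link "
'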
With updates
The above test isn't terribly realistic -- normally we change code in between builds:
$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
echo 'var _ = 1' >> pilot/cmd/pilot-discovery/main.go
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
echo 'var _ = 2' >> pilot/cmd/pilot-discovery/main.go
time go build -o /tmp/build2 ./pilot/cmd/pilot-discovery
'
real 0m32.601s
real 0m5.017s
real 0m4.995s
Here we can see the benefit of outputting to the same file is entirely lost, showing that it is likely an optimization for completely unchanged builds. Beyond that, though, the cost is pretty minimal.
In the above test, however, we are modifying a file at the top of the dependency chain.
Things look a bit different if we change a deep dependency.
In our app, the log package is used almost everywhere; let's try changing that:
$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
sed -i 's/"message"/"new"/g' pkg/log/config.go
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
'
real 0m31.760s
real 0m15.268s
Here we can see quite a bit more time is spent: nearly every package in our repo is likely rebuilt, while dependencies remain cached.
Cache Persistence
What if we are switching between branches constantly? Or, in the case of CI, caching across branches?
run '
git config --global --add safe.directory /usr/local/google/home/howardjohn/go/src/istio.io/istio
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
git checkout release-1.18
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
git checkout master
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
'
real 0m31.690s
Switched to branch 'release-1.18'
real 0m37.476s
Switched to branch 'master'
real 0m5.006s
Just as fast. At least in this simple case, switching between branches doesn't hurt caching.
Build tags
What if we change build tags? Below I test a tag that is not used anywhere:
run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go build -o /tmp/build -tags=nothing ./pilot/cmd/pilot-discovery
'
real 0m31.719s
real 0m4.956s
No impact on caching, as expected
Concurrent Builds
When caching in CI, we may be building the same things concurrently. This is common when a commit schedules multiple jobs at the same time, and they both build similar things.
How well is it cached?
We will limit to 4 cores to ensure benefits are not masked by our high concurrency.
First we will run the build on its own to get a baseline:
$ run-with-cores 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery'
real 1m29.778s
$ run-with-cores 4 '
time go build -o /dev/null ./pilot/cmd/pilot-discovery &
time go build -o /dev/null ./pilot/cmd/pilot-discovery &
time wait
'
real 2m58.492s
real 2m58.650s
Interestingly, this is far slower than I expected. This test is flawed, though, since we are running both builds under the same 4-core CPU limit.
Instead we need to run them separately, but with a shared cache:
$ function run-with-cores-cached() {
cores="$1"
shift
docker run --rm -v $PWD:$PWD -w $PWD \
--cpus="$cores" -e GOMAXPROCS="$cores" \
-e 'GOFLAGS=-buildvcs=false' \
-v `go env GOMODCACHE`:/go/pkg/mod \
-v /tmp/gocache:/root/.cache/go-build \
--init golang:1.20 bash -c "$*"
}
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real 1m34.677s
real 1m36.572s
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery'
real 0m5.163s
And we can see there is basically no caching going on with concurrent runs. If we do another build afterward, however, we can see the cache is working - just not concurrently.
This is understandable: if both executions take the same steps in the same amount of time, each finishes its work before the other's cached result is available. I am still surprised at how little improvement there is, though.
Staggered builds
What if we stagger the builds a bit?
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ sleep 20
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real 1m11.095s
real 1m31.409s
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ sleep 60
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real 1m31.126s
real 0m30.614s
We can see a trend: however long we stagger, the second build is roughly that much faster. This makes sense: we can either do the compilation work in N seconds, or wait for the other process to do it (also in N seconds) and read the cached result ~instantly.
Mismatched build sizes
Above, we had the exact same CPU allocation. What if we have a mismatch?
$ run-with-cores-cached 40 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ run-with-cores-cached 4 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real 0m32.319s
real 0m32.959s
Again both builds finish in the same time, but this time it's the duration of the faster build. So our slower builder can benefit from the faster builder.
Interestingly, this means there may be benefits to throttling some jobs. Running the same test but giving each job access to all cores results in a slower overall runtime:
$ run-with-cores-cached 48 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ run-with-cores-cached 48 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ wait
real 0m37.321s
real 0m37.571s
The difference, though, isn't that high. This may be because with 48 cores, Go's compiler isn't maxing out CPU. We would likely see different results if our host machine were smaller.
We can take this to the extreme and run a bunch of jobs at once:
$ for i in {0..15}; do
run-with-cores-cached 48 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
done
$ wait
...
real 2m11.136s # (repeated 16x)
...
Substantially slower. On the plus side, this was one of the first times I maxed out my machine's CPU usage.
If we just have one fast job and the rest smaller, we see much better results:
$ run-with-cores-cached 36 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
$ for i in {0..14}; do
run-with-cores-cached 1 'time go build -o /dev/null ./pilot/cmd/pilot-discovery' &
done
$ wait
...
real 0m39.383s # (repeated 16x)
...
Build concurrency
(Thanks to @josharian for the suggestion!)
The Go compiler has a variety of flags we can tune. Among these is -c, which controls the "concurrency during compilation". This defaults to the max of GOMAXPROCS and 4.
Note this differs from -p, which controls how many build operations occur concurrently (discussed more in Integration Tests); -c controls the concurrency within a single build operation.
Since we are using Go 1.20, which doesn't yet include a bugfix for the flag, we also set GO19CONCURRENTCOMPILATION=0 as a workaround. This stops Go from setting its own -c flag, so that it respects our manually-set one (otherwise, it would always append its own default after ours).
Let's try it out:
$ run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=1"' ./pilot/cmd/pilot-discovery
run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=4"' ./pilot/cmd/pilot-discovery
run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=48"' ./pilot/cmd/pilot-discovery
run time env GO19CONCURRENTCOMPILATION=0 go build '-gcflags=all="-c=256"' ./pilot/cmd/pilot-discovery
real 0m40.473s
real 0m32.385s
real 0m29.666s
real 0m31.711s
We can see that, like setting GOMAXPROCS, we get optimal results by setting the value to our actual processor count. The difference, though, is quite small compared to controlling GOMAXPROCS or -p.
Building multiple binaries
So far we have looked at building just one binary. What if we need to build multiple?
Let's try a few different approaches to building multiple binaries:
- Building in sequence
$ BINARIES="./istioctl/cmd/istioctl ./pilot/cmd/pilot-discovery ./pkg/test/echo/cmd/client ./pkg/test/echo/cmd/server ./samples/extauthz/cmd/extauthz ./operator/cmd/operator ./tools/bug-report"
$ run "
time {
for bin in $BINARIES; do
go build -o /tmp/ \$bin
done
};
"
real 1m25.276s
- Building in parallel
$ run "
for bin in $BINARIES; do
go build -o /tmp/ \$bin &
done
time wait
"
real 1m4.005s
- Building in a single go build
$ run "time go build -o /tmp/ $BINARIES"
real 0m44.410s
Building in a single go build command is substantially faster!
Based on our other analysis, this makes sense.
Building sequentially can access the cache of previous builds, but any new dependencies still need to be built, and linking cannot be done in parallel.
Building in parallel fares a bit better, but as we saw in Concurrent Builds, caching is limited during concurrent executions.
Using a single command allows Go to maximize concurrency.
Tests
What about compiling tests? In Go, each package generates an individual test binary.
$ go list ./... | wc -l
497
That means almost 500 binaries are generated when we run go test ./...!
As a result, compile times can be especially important when running tests.
We don't have an apples-to-apples test to compile compared to the binary, but the one used below is close enough in terms of dependencies.
$ run '
time go test -c -o /tmp/build ./pilot/pkg/xds
time go test -c -o /tmp/build ./pilot/pkg/xds
'
real 0m31.007s
real 0m1.760s
This takes about the same amount of time as building the binary.
Race
The Race Detector is extremely useful for tests; we always run tests with it enabled. However, it comes at a cost:
$ run 'time go test -c -o /tmp/build -race ./pilot/pkg/xds'
real 0m44.167s
Nearly 50% longer build times.
Vet
Go automatically runs some linters on test compilation. These can be turned off:
$ run 'time go test -c -o /tmp/build -vet=off ./pilot/pkg/xds'
real 0m28.965s
user 5m8.443s
sys 1m18.009s
$ run 'time go test -c -o /tmp/build ./pilot/pkg/xds'
real 0m30.658s
user 6m49.299s
sys 1m43.606s
Here we see a small improvement in real time, but a pretty large improvement in user time; I suspect the 48 cores are masking some of the cost of vet.
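If you do turn vet off for tests, the same checks can still run as an explicit, separate step elsewhere (for example in a dedicated lint job):
$ go vet ./...
This is what the takeaway above assumes: skip vet during test compilation only if it is already running somewhere else.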
Caching tests
Does building a binary warm up the cache for building a test?
$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go test -c -o /tmp/build ./pilot/pkg/xds
'
real 0m31.690s
real 0m8.980s
Yep, it does!
What if we use -race?
$ run '
time go build -o /tmp/build ./pilot/cmd/pilot-discovery
time go test -c -o /tmp/build -race ./pilot/pkg/xds
'
real 0m31.551s
real 0m43.568s
Nope, we lose the benefits here.
Linking vs compiling
To measure the time spent linking vs compiling, we can build a custom tool to compile without linking. This is really helpful for checking that all code/tests compile, so we will measure that below.
Go 1.21 can build all tests without running them via go test -c, but that still links, and it doesn't work if you have any packages with the same name. Instead, we can use a workaround: build the tests as usual but just not execute them.
run '
time go test -exec=true -vet=off ./...
time go test -exec=true -vet=off ./...
'
real 2m33.337s
real 0m51.978s
Even repeated runs are pretty slow. Given we know all compilation is cached, we can infer that linking all the test binaries takes roughly 50s.
Now without linking at all:
# Setup our tool
cat <<'EOF' > ./go-compile-without-link
#!/bin/bash
if [[ "${2}" == "-V=full" ]]; then
"$@"
exit 0
fi
case "$(basename ${1})" in
link)
# Output a dummy file
touch "${3}"
;;
*)
"$@"
esac
EOF
chmod +x go-compile-without-link
# Now build without linking
run '
time go test -exec=true -toolexec=$PWD/go-compile-without-link -vet=off ./...
time go test -exec=true -toolexec=$PWD/go-compile-without-link -vet=off ./...
'
real 1m18.623s
real 0m3.055s
Not only is this 2x faster overall, with a warm cache it's nearly 20x faster. This is a huge win for development iteration; I regularly run the above command before committing as a lightweight check.
Warning: this may miss some link-time errors. In my experience, these are rare enough to make this still extremely useful for development cases. If you use CGO this may not be true.
Alternative linkers
Given linking is such a large contributor to slow build times, what if we used a faster linker? mold is a linker commonly used in other language ecosystems, offering impressive improvements over traditional linkers. While not common, it can be used with Go as well.
Since linking is on the order of seconds, we will use hyperfine to measure the timings more accurately.
$ hyperfine 'rm test; time go build -ldflags "-linkmode external -extld clang -extldflags -fuse-ld=mold"' 'rm test; time go build'
Benchmark 1: rm test; time go build -ldflags "-linkmode external -extld clang -extldflags -fuse-ld=mold"
Time (mean ± σ): 1.880 s ± 0.038 s [User: 2.876 s, System: 2.142 s]
Range (min … max): 1.815 s … 1.938 s 10 runs
Benchmark 2: rm test; time go build
Time (mean ± σ): 1.582 s ± 0.022 s [User: 2.693 s, System: 1.043 s]
Range (min … max): 1.549 s … 1.615 s 10 runs
Summary
'rm test; time go build' ran
1.19 ± 0.03 times faster than 'rm test; time go build -ldflags "-linkmode external -extld clang -extldflags -fuse-ld=mold"'
Looks like in this case, linking with mold is slightly slower.
Analyzing what contributes to build times
There are a few tools that can help analyze build times.
Custom toolexec wrapper
Note: this option is not the best one, but I wrote this section before I found the other (better) tools.
Go can emit verbose logs of builds with -x, but it's really verbose.
Fortunately, we can wrap tool invocations with a script that logs what is taking time.
This can be invoked using -toolexec during builds.
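The wrapper doesn't need to be fancy. A minimal sketch of the idea (not the exact script used here; the tool-timer name and log path are arbitrary) looks something like this:
cat <<'EOF' > ./tool-timer
#!/bin/bash
# -toolexec wrapper: run the real tool (compile, link, asm, cgo, ...),
# then append its name, wall-clock duration, and arguments to a log.
start=$(date +%s%3N)
"$@"
code=$?
end=$(date +%s%3N)
echo "$(basename "$1") $((end - start))ms ${*:2}" >> /tmp/tool-times.log
exit $code
EOF
chmod +x ./tool-timer
$ run 'go build -toolexec=$PWD/tool-timer ./pilot/cmd/pilot-discovery; sort -rn -k2 /tmp/tool-times.log | head'
Aggregating a log like this per action, per module, and per file produces tables like the ones below.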
Here are the results of building the same ./pilot/cmd/pilot-discovery binary.
Note: since we are using multiple cores, the sum adds up to longer than the actual runtime; it roughly corresponds to the user time from the previous time commands.
Action | Sum of duration (s) |
---|---|
link | 3.789593331 |
gcc | 0.022386331 |
compile | 305.6692171 |
cgo | 1.526077562 |
asm | 2.99016495 |
Grand Total | 313.9974393 |
Overall we can see the vast majority of the time is spent on compile, so we will focus there. link is also interesting, though -- despite being only 1% of the time, it is the least cacheable and parallelizable.
Looking at each module is more interesting:
Location | Sum of duration (s) |
---|---|
k8s.io/client-go | 67.86363857 |
github.com/envoyproxy/go-control-plane | 48.72126327 |
local | 44.15472995 |
std | 25.53394494 |
k8s.io/api | 23.03486699 |
k8s.io/apimachinery | 7.960915542 |
google.golang.org/grpc | 6.152948778 |
google.golang.org/protobuf | 5.491289355 |
istio.io/client-go | 4.632729928 |
istio.io/api | 4.437340682 |
github.com/google/cel-go | 4.240151317 |
k8s.io/kube-openapi | 3.00321029 |
github.com/google/gnostic | 2.668591864 |
golang.org/x/net | 2.611716846 |
k8s.io/apiextensions-apiserver | 2.411767233 |
github.com/google/s2a-go | 2.291722416 |
github.com/gogo/protobuf | 2.07997357 |
github.com/google/go-containerregistry | 2.050610622 |
There are quite a few of these, so only those that took >2s are included.
local means the local module code (Istio), and std means Go's own standard library.
What about individual files? Again, only those that are >1s:
File | Sum of duration (s) |
---|---|
core/v1/zz_generated.deepcopy.go | 3.901002923 |
src/runtime/write_err.go | 1.755470748 |
envoy/config/route/v3/scoped_route.pb.validate.go | 1.579764655 |
parser/gen/doc.go | 1.490590828 |
src/net/http/transport_default_other.go | 1.475054625 |
utils_set.go | 1.436838859 |
envoy/config/core/v3/udp_socket_config.pb.validate.go | 1.436230022 |
pilot/pkg/serviceregistry/kube/controller/util.go | 1.399423632 |
pilot/pkg/xds/xdsgen.go | 1.373964964 |
pilot/pkg/model/xds_cache.go | 1.341942513 |
ztypes.go | 1.336568761 |
yamlprivateh.go | 1.220097377 |
proto/wrappers_gogo.go | 1.193810348 |
openapiv2/document.go | 1.122113029 |
openapiv3/document.go | 1.086110353 |
pilot/pkg/bootstrap/webhook.go | 1.054076742 |
pilot/pkg/config/kube/gateway/model.go | 1.051825111 |
pilot/pkg/networking/core/v1alpha3/waypoint.go | 1.000428311 |
Overall, I suspect the per-file data is a bit skewed; it seems unlikely that write_err.go, a 14-line file, is actually taking 1.7s to compile on its own. Analyzing at the package level is likely more appropriate. Other tools handle this; see below.
Action Graph
Go provides a very useful, undocumented flag, -debug-actiongraph, which can emit the "action graph" generated during the build.
The action graph is basically a DAG of steps the compiler takes to compile, link, run tests, etc.
The -debug-actiongraph flag will emit this DAG as a JSON file, along with timing information.
$ run 'go build -debug-actiongraph=/tmp/actiongraph ./pilot/cmd/pilot-discovery'
The total output is quite large, but for some context a single element looks like this:
{
"ID": 1500,
"Mode": "build",
"Package": "github.com/modern-go/concurrent",
"Deps": [
27,
8,
266,
175,
5,
96,
6,
931,
45,
24
],
"Objdir": "/tmp/go-build1707548832/b322/",
"Priority": 313,
"NeedBuild": true,
"ActionID": "rfdTmXlzNWtJZwTp13Ge",
"BuildID": "rfdTmXlzNWtJZwTp13Ge/OeZ8FbMdRkPAoFt_nshl",
"TimeReady": "2023-07-05T17:13:41.827369909Z",
"TimeStart": "2023-07-05T17:13:41.838775857Z",
"TimeDone": "2023-07-05T17:13:41.930219542Z",
"Cmd": [
"/usr/local/go/pkg/tool/linux_amd64/compile -o /tmp/go-build1707548832/b322/_pkg_.a -trimpath \"/tmp/go-build1707548832/b322=>\" -p github.com/modern-go/concurrent -lang=go1.16 -complete -buildid rfdTmXlzNWtJZwTp13Ge/rfdTmXlzNWtJZwTp13Ge -goversion go1.20.5 -c=4 -nolocalimports -importcfg /tmp/go-build1707548832/b322/importcfg -pack /go/pkg/mod/github.com/modern-go/[email protected]/executor.go /go/pkg/mod/github.com/modern-go/[email protected]/go_above_19.go /go/pkg/mod/github.com/modern-go/[email protected]/log.go /go/pkg/mod/github.com/modern-go/[email protected]/unbounded_executor.go"
],
"CmdReal": 87273795,
"CmdUser": 52534000,
"CmdSys": 8082000
}
Fortunately, the actiongraph tool can help parse this into useful data:
$ actiongraph -f /tmp/actiongraph top
4.143s 1.41% build k8s.io/api/core/v1
4.077s 2.80% link istio.io/istio/pilot/cmd/pilot-discovery
1.936s 3.45% build net
1.851s 4.08% build runtime
1.533s 4.61% build github.com/google/cel-go/parser/gen
1.518s 5.12% build github.com/envoyproxy/go-control-plane/envoy/config/route/v3
1.492s 5.63% build github.com/envoyproxy/go-control-plane/envoy/config/core/v3
1.464s 6.13% build github.com/antlr/antlr4/runtime/Go/antlr/v4
1.463s 6.63% build istio.io/istio/pilot/pkg/xds
1.463s 7.12% build istio.io/istio/pilot/pkg/serviceregistry/kube/controller
1.428s 7.61% build istio.io/istio/pilot/pkg/model
1.420s 8.09% build net/http
$ actiongraph -f /tmp/actiongraph tree -L 2 | head -n 20
289.792s (root)
101.954s k8s.io
64.735s k8s.io/client-go
23.827s k8s.io/api
6.883s k8s.io/apimachinery
2.853s k8s.io/kube-openapi
2.424s k8s.io/apiextensions-apiserver
0.610s k8s.io/utils
0.400s k8s.io/klog
0.222s k8s.io/apiserver
78.280s github.com
42.103s github.com/envoyproxy
12.069s github.com/google
2.473s github.com/prometheus
2.052s github.com/gogo
1.562s github.com/lestrrat-go
1.464s github.com/antlr
1.449s github.com/klauspost
1.331s github.com/miekg
0.990s github.com/cncf
We get roughly the same information as above, but in a form that is much easier to use and query.
Tracing
Go also has a newer -debug-trace flag which emits fairly similar info, but in the Trace Event Format.
This is a really simple format, but it gives us a lot of information.
Entries in the trace look like:
[{"name":"Executing action (link istio.io/istio/pilot/cmd/pilot-discovery)","ph":"B","ts":1688578195814105.2,"pid":0,"tid":43}
,{"name":"Executing action (link istio.io/istio/pilot/cmd/pilot-discovery)","ph":"E","ts":1688578195827795.8,"pid":0,"tid":43}]
ph: B represents the "Begin" time, while ph: E represents the "End" time. tid is the goroutine.
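Since this is plain JSON, we can also query it directly. For example, a rough jq query (assuming the trace was written to /tmp/trace) to list the ten longest actions:
$ jq 'group_by(.name)
      | map({name: .[0].name, ms: (((map(.ts) | max) - (map(.ts) | min)) / 1000)})
      | sort_by(-.ms) | .[0:10]' /tmp/trace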
These can be visualized to give an overview of how each goroutine is spending time. Some examples are below:
One thing we can see here is that while the 48-core run executes faster, there is a lot of time spent not executing actions. Further investigation reveals this is because there is no remaining work to be done -- all cores are blocked waiting for another core to complete a dependency.
Another interesting thing is that in both cases we can see build runtime blocks all other threads for about 2s.
The tracing view is incredibly useful to help understand not just the timing of each action, but the context as well. It's much worse for something to take 10s when it blocks all other threads, rather than running in parallel.
Profiling
Go also provides the ability to profile compilation and linking.
For example: go build -ldflags='-cpuprofile /tmp/linker-cpu.prof' -gcflags='-cpuprofile /tmp/compiler-cpu.prof' .
The resulting cpuprofile is empty, though, so I suspect I did something wrong. However, the linker is what I am more interested in anyway.
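The linker profile can then be opened with the usual pprof tooling:
$ go tool pprof -http=:8080 /tmp/linker-cpu.prof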
Linking an example program results in this profile. As expected, nothing here stands out as low-hanging fruit - the Go linker is pretty well optimized.
Goda
goda, the "Go Dependency Analysis toolkit", is one of my favorite tools for understanding Go builds.
While it isn't directly analyzing build times, it is invaluable for understanding dependencies in a Go program. As we saw above, dependencies are a huge contributor to build times, so the two are tightly related.
While goda has a large number of functions, my favorites are tree and reach.
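It installs like any other Go tool:
$ go install github.com/loov/goda@latest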
For example, to see why a package depends on another package:
$ goda tree 'reach(./pilot/cmd/pilot-discovery:all, k8s.io/api/authorization/v1beta1)' | head
├ istio.io/istio/pilot/cmd/pilot-discovery
└ istio.io/istio/pilot/cmd/pilot-discovery/app
└ istio.io/istio/pilot/pkg/bootstrap
├ istio.io/istio/pilot/pkg/config/kube/crdclient
├ istio.io/istio/pkg/kube
├ istio.io/client-go/pkg/clientset/versioned
└ k8s.io/client-go/discovery
└ k8s.io/client-go/kubernetes/scheme
└ k8s.io/api/authorization/v1beta1
Real world analysis
Using the tools above, I set out to understand why our builds were slow.
Slow test execution
In analyzing real-world compilation traces during our CI, I discovered something quite odd in our unit test job, which more or less runs go test ./...
Below is a snippet of the trace:
The full trace is also available, but it is 150MB.
We can see the test run for istio.io/istio/pkg/test/framework/resource/config/cleanup spans almost the entire runtime of the test job, over 50s.
Digging deeper reveals something even more interesting -- this package has no tests.
The root cause of this can be found by investigating what the test run action actually does.
In general, Go's action graph ensures actions are not run until all their dependencies are complete.
However, in order to ensure test start logs were emitted in the correct order, some serialization of test starts was added outside of this dependency mechanism.
This can be found here.
The end result is that a test run, even one that would be very quick (or has no tests at all), may start, then immediately block waiting for other tasks to complete. This means we end up hogging a worker (generally, there is one worker per core) that is doing nothing but waiting on another task.
In the extreme case, the very last alphabetical package may start first, blocking an entire worker for the entire test duration.
Based on this analysis, I have opened a Go issue and possible fix.
Integration Tests
In Istio, integration tests are separated from unit tests with a build tag, integ.
Additionally, tests are expected to be run with -p 1, which tells Go to only execute one package at a time.
This ensures shared resources (Kubernetes clusters, generally) are not accessed concurrently.
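In practice the invocation looks roughly like this (simplified):
$ go test -tags=integ -p 1 ./tests/integration/...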
The integration tests, in particular, always felt a bit slow in Istio.
Looking at the build tracing makes the reason obvious: -p 1 did more than we wanted!
What we wanted was to execute tests sequentially -- but this also compiles them sequentially.
This takes 8m25s.
I didn't find any great ways to fix this out of the box, but we can use some of our findings from earlier to improve things.
$ run '
time go test -exec=true -toolexec=$PWD/go-compile-without-link -vet=off -tags=integ ./tests/integration/...
time go test -exec=true -vet=off -tags=integ -p 1 ./tests/integration/...
'
Here, we precompile everything (without linking) first, to warm the build cache, then execute the tests as usual.
The first step takes just 40s, while the execution takes 3m40s -- overall 2x faster than the naive approach.
However, we could still do a lot better. Most of the time is spent on linking the test binaries, which is still done sequentially. We cannot just build and link everything before we execute, either, as we saw earlier -- linking is only cached if the binary is saved.
This is somewhat solved in Go 1.21, which allows -c (which saves the test binary) when testing multiple packages.
This can be awkward to use, though, as packages with the same name will conflict.
We had quite a few conflicts, which led to this hacky approach:
$ run-21 '
grep -r TestMain tests/integration -l |
xargs -n1 dirname |
sort |
uniq |
grep -v tests/integration/telemetry/stackdriver/api |
grep -v tests/integration/security/fuzz |
grep -v tests/integration$ |
xargs -I{} echo './{}' |
xargs go test -tags=integ -c -o testsout/ -vet=off
grep -r TestMain tests/integration -l |
xargs -n1 dirname |
sort |
uniq |
grep -v tests/integration/telemetry/stackdriver/api |
grep -v tests/integration/security/fuzz |
grep -v tests/integration$ |
xargs -I{} echo './{}' |
xargs go test -p 1 -tags=integ -o testsout/ -vet=off -exec=true
'
The results here are superb - the first run takes just 50s, and the latter is almost instant.
Kubernetes
From Analyzing Build Times, we saw the root of most of our problems is the Kubernetes libraries.
Just from k8s.io/client-go and a few of its dependencies (k8s.io/api and k8s.io/apimachinery), we are up to 100s of compile time!
This is the standard for any project interacting with Kubernetes, which is a pretty huge segment of Go usage. Excluding forks, it is used by over 10k projects on GitHub!
It's not just compile time, either. Simply importing this library has a surprising impact on binary size as well:
$ go mod init test
$ cat <<EOF>main.go
package main
import _ "k8s.io/client-go/kubernetes"
func main() {}
EOF
$ go mod tidy
$ go build
$ du -h main
40M main
Yikes! 40MB for an almost empty binary.
Kubernetes seems almost uniquely bad in this regard.
On the surface, github.com/envoyproxy/go-control-plane looks pretty bad as well - at 50s, it's about half the cost of Kubernetes.
But Istio is explicitly importing every package in the module, which is pretty atypical.
Removing this file drops the compile time for go-control-plane to 15s (there is still a decent amount of the library imported through core usage).
The same simple binary above importing github.com/envoyproxy/go-control-plane/envoy/type only comes out to 6MB.
Go is also smart enough to completely elide the dependency in some cases; for example, grpc-go depends on go-control-plane, but it doesn't show up as a dependency unless the xds package is imported (uncommon for grpc-go users).
The reason Kubernetes' client fares so much worse is that it cannot be consumed in parts. I would imagine most applications do not use every version of every Kubernetes API, but the first thing a client-go user does is import a package that depends on every version of every API. This, in turn, forces the Go compiler to compile all Kubernetes APIs for any user of the library.
These APIs don't just contain (fairly massive) Go structs, either, but generated protobuf, deepcopy, and swagger docs.
All of this adds up to a perfect storm of slow compile times and large binary sizes.
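To check whether your own project pays this cost, the same kind of goda query from earlier works; something like (adjust the root pattern to your module):
$ goda tree 'reach(./...:all, k8s.io/client-go/kubernetes/scheme)'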
Test execution ordering
While build order is pretty intuitive, tests are a bit more involved.
Running go test ./... compiles a bunch of code, links a test binary for each package, runs go vet on it, executes the tests, and prints the results.
In a real world project this can be a bit overwhelming, so we will use a dummy project to analyze:
for i in a b c d; do
mkdir -p $i
cat <<EOF > "$i/${i}_test.go"
package $i
import (
"testing"
"time"
)
func TestSleep(t *testing.T) {
if "$i" == "b" {
time.Sleep(time.Second)
}
}
EOF
done
GOMAXPROCS=2 go test -debug-trace=trace ./...
Results in this trace.
Here we can see a few interesting things:
- Test execution starts in order (alphabetical).
- A single package will be fully executed before moving on to other work (that is, we build and execute package a first, rather than building a through d before we start executing).
- Multiple processes can make the above non-blocking, however. In our trace, we have one package with a slow test; while that is executing, the remaining packages are built and executed in full.
Building and Testing
Istio's unit-tests job -- arguably one of the most important jobs to be fast for rapid iteration -- has for years run, effectively:
$ go build ./...
$ go test -race ./...
This was done because we want to check the size of the release binaries to avoid regressions. At the time, it was assumed this was mostly "free".
However, from the Race detection analysis, we know this is false!
Unless the cache is already warm, this would add roughly 45s to our jobs, which run with 16 CPUs. Not the end of the world, but sometimes every second counts.
This is easily fixed by moving the check to a job which already builds the binaries.
Comparing to Rust
In contrast to Go, Rust is often criticized for "slow build times". Is it really worse than Go?
Answering this question is pretty hard, as the two are very far from apples-to-apples.
That being said, we can get some very rough data by looking at some other projects. Note that unlike Go, Rust has different development profiles. Typically development builds are used when testing locally, while a release build is used for... releases. Both times are important.
$ docker run --rm -v $PWD:$PWD -w $PWD -it --init --entrypoint bash ghcr.io/linkerd/dev:v40-rust -c '
cargo fetch
time cargo build --package=linkerd2-proxy
cargo clean
time cargo build --package=linkerd2-proxy --profile release
'
Finished dev [unoptimized] target(s) in 1m 43s
Finished release [optimized] target(s) in 5m 18s
$ docker run --rm -v $PWD:$PWD -w $PWD -it --init --entrypoint bash gcr.io/istio-testing/build-tools:master-4c71169512fe79b59b89664ef1b273da813d1c93 -c '
cargo fetch
time cargo build --package=linkerd2-proxy
cargo clean
time cargo build --package=linkerd2-proxy --profile release
'
Finished dev [unoptimized + debuginfo] target(s) in 40.71s
Finished release [optimized] target(s) in 2m 19s
So at a very high level, the claim holds up to some extent, especially when comparing release build times. At least comparing these projects, Rust is a bit slower - but not as slow as commonly perceived, in my opinion.