Like most other Kubernetes controllers, Istio is written in Go and relies on the client-go library. While client-go provides an excellent low-level building block, using it directly throughout Istio's higher level code caused a variety of issues, which led us to develop our own higher level, opinionated client for Istio.

This post covers the issues we faced and how we incrementally solved them.

Background knowledge

At a high level, client-go provides a few layers for interactions with the API server:

  • kubeconfig provides a raw configuration for how to connect to the API server
  • A Client (or sometimes a Clientset, which groups several clients) talks to the API server. It is built from a kubeconfig and maintains connection pools, etc.
  • An Informer is a higher level abstraction built on a Client that creates a watch for a resource and provides cached access. These are the bread and butter of controllers.
  • An InformerFactory helps build Informers and makes it easy to share them: repeated requests for the same Informer return a cached instance.

There are plenty of other goodies in the library, but these are the core pieces we are concerned with, as sketched below.
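To make these layers concrete, here is a minimal sketch of how they are typically wired together with plain client-go; the resync interval and the choice of Pods are purely illustrative.

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func buildPodInformer(kubeconfigPath string, stop <-chan struct{}) error {
	// kubeconfig: raw configuration describing how to reach the API server.
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return err
	}
	// Clientset: a client for the API server, built from the kubeconfig.
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}
	// InformerFactory: builds and caches shared Informers for this client.
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	// Informer: watches Pods and provides cached access to them.
	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	return nil
}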

Towards a common client

After years of organic codebase growth, our usage of clients was fairly messy. Various parts of the codebase built their own Clients redundantly, which made things hard to track and less efficient at runtime due to missed connection pooling and the like. Additionally, we often created redundant informers across the codebase, leading to multiple watches of the exact same resource. These watches put strain on both Istio and Kubernetes.

In client-go, Clients are created using code generation. This, unfortunately, means third-party types need their own generated Clients. This gave rise to istio/client-go, which offers the same experience for accessing Istio's custom types (and, later, a few other projects' types we depend on) as for core types. However, it also led to a further proliferation of ad hoc Client construction and messy dependency-injection attempts to pass multiple clients around.

As an initial fix for this, we built a single consolidated type that stored all the clients we needed. This was designed to be the one-stop shop for components that need access to Kubernetes resources: any component that needs this should simply accept a kube.Client, bringing uniformity to the codebase.

More importantly, we took this opportunity to start using InformerFactories, so that informers could be shared across the codebase. What was once a tedious task (and one never done in practice) became trivial, since the entire project now shared the same kube.Client.

Here is a (slightly reduced) snippet of where we started:

type Client interface {
	// Kube returns the core kube client
	Kube() kubernetes.Interface

	// Dynamic returns the dynamic kube client
	Dynamic() dynamic.Interface

	// KubeInformer returns an informer for core kube client
	KubeInformer() informers.SharedInformerFactory

	// DynamicInformer returns an informer for dynamic client
	DynamicInformer() dynamicinformer.DynamicSharedInformerFactory

	// RunAndWait starts all informers and waits for their caches to sync.
	RunAndWait(stop <-chan struct{})
}
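As an illustration of the intent, a component can now accept the consolidated client and pull whatever it needs from the shared factories. The PodController below is hypothetical, and the import path is shown only for orientation.

import (
	listersv1 "k8s.io/client-go/listers/core/v1"

	"istio.io/istio/pkg/kube"
)

// PodController is a hypothetical component that only needs cached Pod access.
type PodController struct {
	pods listersv1.PodLister
}

func NewPodController(c kube.Client) *PodController {
	// The shared factory hands every caller the same Pod informer, so this
	// does not create an additional watch against the API server.
	return &PodController{
		pods: c.KubeInformer().Core().V1().Pods().Lister(),
	}
}

At startup, a single RunAndWait(stop) call then starts every informer registered through the factories and waits for their caches to sync.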

Better testing

One important note is that the Client is actually an interface, enabling us to make a fake implementation! Aside from simplifying code, though, this alone was not a critical improvement -- client-go, fortunately, already provides fake client implementations.

However, having this single point of usage lets us centralize workarounds, optimizations, and the like. In our tests, we had routinely run into reliability and performance issues.

Most tests wait for the client to "Sync" before they can start, which is done by polling. A 100ms poll is fine for real usage, but when we run hundreds of tests against fake clients (which are ready almost instantly), it's wasteful. With the consolidated fake client, we were able to simply busy-poll readiness. This simple change dropped some test suites' runtimes from 25s to 2s -- all without changing the test code at all.
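The idea is roughly the following sketch, assuming the client knows whether it is backed by fakes; the helper name and intervals are illustrative.

import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForCacheSync polls HasSynced for every informer. Real clients poll at a
// relaxed interval; fake clients are ready almost instantly, so we busy-poll.
func waitForCacheSync(stop <-chan struct{}, fake bool, synced ...cache.InformerSynced) bool {
	interval := 100 * time.Millisecond
	if fake {
		interval = time.Microsecond
	}
	for {
		ready := true
		for _, s := range synced {
			if !s() {
				ready = false
				break
			}
		}
		if ready {
			return true
		}
		select {
		case <-stop:
			return false
		case <-time.After(interval):
		}
	}
}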

However, we still had random test flakes hampering stability. Eventually, we tracked these down to a bug in client-go's fake implementation that could cause events to be lost. While the workaround is not simple, it fortunately can be consolidated in our new centralized client, without changes to any test code.

A higher level generic client

Since client-go was first written, Go has added generics support to the language. Generics can substantially reduce boilerplate and code generation, as well as shrink binary sizes and build times.

Because client-go lacks generics support, we built our own wrapper. It adds not only generics but also a variety of helpers that avoid common pitfalls in client usage. It looks something like this (with a variety of functions omitted):

type Reader[T controllers.Object] interface {
        Get(name, namespace string) T
        List(namespace string, selector klabels.Selector) []T
}

type Client[T controllers.Object] interface {
        Reader[T]
        Writer[T]
        Informer[T]
}

Usage -- especially creation of a Client -- is dramatically simplified:

c := kube.NewFakeClient()
deployments := kclient.New[*appsv1.Deployment](c)
for _, deploy := range deployments.List("default", klabels.Everything()) {
	fmt.Println("Deployment", deploy.Name)
}

Note the lack of error handling here; this isn't because we are brazenly ignoring the possibility of errors, but because we are listing from the internal informer cache, which is infallible.

The existence of this higher level client gave way to a variety of improvements over time.

Lazy loading custom resources

Istio watches a variety of custom resources. These don't always exist in the cluster, and when they are missing we want to simply treat them as having zero resources. If the CRD is later created, however, we need to start watching it.

Previously, every place that wanted to watch these CRDs needed to implement this logic itself, with varying degrees of correctness. With the new client, we were able to make it trivial: simply replace New() with NewDelayed().

This is done by replacing each implementation with one that calls the real client if it is available, or a dummy one if not:

func (s *delayed[T]) List(namespace string, selector klabels.Selector) []T {
    if c := s.client.Load(); c != nil {
        return (*c).List(namespace, selector)
    }
    return nil
}

Once the CRD is created, this triggers a swapping mechanism:

func (s *delayed[T]) swap(client Client[T]) {
	s.client.Swap(&client)
	for _, h := range s.handlers {
		client.AddEventHandler(h)
	}
	client.Start()
}

The client also deals with some tricky logic around ensuring HasSynced and AddEventHandler work correctly before the client is fully initialized. Fortunately, this tricky logic is consolidated in one heavily tested location rather than spread across many controllers.
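Both snippets above assume a small wrapper that holds an atomic pointer to the real client plus any handlers registered before it exists. A rough sketch of that pre-initialization behavior, glossing over the races the real implementation has to handle:

import (
	"sync"
	"sync/atomic"

	"k8s.io/client-go/tools/cache"
)

// delayed lazily wraps a real Client[T] that may not exist yet.
type delayed[T controllers.Object] struct {
	client   atomic.Pointer[Client[T]]
	mu       sync.Mutex
	handlers []cache.ResourceEventHandler
}

func (s *delayed[T]) AddEventHandler(h cache.ResourceEventHandler) {
	if c := s.client.Load(); c != nil {
		(*c).AddEventHandler(h)
		return
	}
	// No real client yet: remember the handler so swap() can replay it.
	s.mu.Lock()
	defer s.mu.Unlock()
	s.handlers = append(s.handlers, h)
}

func (s *delayed[T]) HasSynced() bool {
	if c := s.client.Load(); c != nil {
		return (*c).HasSynced()
	}
	// The CRD does not exist, so "zero resources" is already a complete view.
	return true
}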

Dynamic object filters

We have a few use cases that benefit from client-side filtering of objects. While ideally objects are filtered server-side (generally by labels), some more advanced filters must be applied locally. client-go does offer a fairly simple filtering mechanism, but it doesn't handle filters that change dynamically: if the filter changes, objects that start or stop matching it never generate add or delete events.

In the new client, this is handled automatically: when a filter changes, the appropriate add and delete events are generated for objects that newly match or no longer match it.
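For example, a controller might filter Pods down to the nodes it currently manages. The constructor and filter option below are assumptions about the API shape rather than its exact form (sets here is k8s.io/apimachinery/pkg/util/sets):

// Hypothetical usage: NewFiltered and Filter are illustrative names.
managedNodes := sets.New("node-a", "node-b")

c := kube.NewFakeClient()
pods := kclient.NewFiltered[*corev1.Pod](c, kclient.Filter{
	// Keep only Pods running on nodes we currently manage.
	ObjectFilter: func(obj any) bool {
		return managedNodes.Has(obj.(*corev1.Pod).Spec.NodeName)
	},
})
// When the set of managed nodes changes, the client generates add/delete
// events for Pods that newly match or no longer match the filter.
_ = pods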

Syncing and shutdown

Improperly handling informer startup and shutdown is an extremely common bug, impacting nearly every project, including the core of Kubernetes itself.

Controllers should generally wait until all initial state is synced before taking action, to avoid making incorrect decisions on incomplete data. In Istio's case, this could mean sending the proxy an empty configuration or similar.

Often, this is done by just checking the informer's HasSynced() method, which reports whether all of the initial data has been written to the internal cache. However, this isn't enough: we usually also need the event handlers to have run on each item first, and HasSynced doesn't cover that.

In our client, HasSynced also checks that all registered handlers have synced. This required collaborating on upstream changes to expose that information. By exposing only the safe method, we keep controllers from accidentally forgetting to check that each handler has synced.
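Upstream, this surfaced as a registration handle returned when adding a handler; in recent client-go versions, combining it with the informer's own sync check looks roughly like this:

import "k8s.io/client-go/tools/cache"

func addHandlerAndWait(inf cache.SharedIndexInformer, h cache.ResourceEventHandler, stop <-chan struct{}) error {
	// AddEventHandler returns a registration whose HasSynced reports whether
	// this specific handler has processed every item from the initial list.
	reg, err := inf.AddEventHandler(h)
	if err != nil {
		return err
	}
	// Wait for both the cache and this handler to be synced before proceeding.
	cache.WaitForCacheSync(stop, inf.HasSynced, reg.HasSynced)
	return nil
}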

Similarly, handlers should be removed when they are no longer needed. In many projects the controller lifecycle matches the binary's, so this isn't important. Istio, however, has a few controllers that may start and stop throughout the binary's lifecycle, and shutting down their handlers is critical to avoid memory leaks and wasted cycles.

client-go exposes this option on each handler as it is added, but tracking every registration is tedious. In our client, we simply track them internally and expose a single ShutdownHandlers method.
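Internally this amounts to remembering each registration and removing them all at shutdown; a sketch against plain client-go:

import "k8s.io/client-go/tools/cache"

// handlerTracker is a sketch: it remembers registrations so they can all be
// removed at once when the owning controller stops.
type handlerTracker struct {
	informer      cache.SharedIndexInformer
	registrations []cache.ResourceEventHandlerRegistration
}

func (h *handlerTracker) AddEventHandler(handler cache.ResourceEventHandler) error {
	reg, err := h.informer.AddEventHandler(handler)
	if err != nil {
		return err
	}
	h.registrations = append(h.registrations, reg)
	return nil
}

func (h *handlerTracker) ShutdownHandlers() {
	for _, reg := range h.registrations {
		// RemoveEventHandler unregisters the handler so it stops receiving
		// events, allowing the controller to be garbage collected.
		_ = h.informer.RemoveEventHandler(reg)
	}
	h.registrations = nil
}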

Test Helpers

Writing tests with client-go can be pretty obnoxious. We rarely care about errors, Contexts, *Options, and so on, yet we are required to supply them every time. Additionally, higher level constructs like GetOrCreate or CreateOrUpdateStatus are often desired; without generics, these end up copy-pasted for each type throughout the project. With the generic client, a thin test wrapper handles all of this once:

c := kube.NewFakeClient()
pods := kclient.New[*corev1.Pod](c)
testPods := clienttest.Wrap(t, pods)

pod := &corev1.Pod{
    ObjectMeta: metav1.ObjectMeta{Name: "pod"},
    Status: corev1.PodStatus{PodIP: "1.2.3.4"},
}
testPods.CreateOrUpdateStatus(pod)
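Since the wrapper is handed the testing.T, helpers like CreateOrUpdateStatus can report unexpected errors as test failures rather than returning them, keeping test bodies focused on the scenario under test.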