Kubernetes Operators are powerful tools when used right, moving complex operations from human operators into code. This sounds great, and in some cases it is, but the tradeoffs operators introduce are often overlooked by both operator developers and users. This article covers my take on when operators are useful, when they are not, and what makes a good operator.

A basic installation operator

A common feature among almost all operators is the ability to deploy Kubernetes resources. For example, let's consider a hypothetical nginx operator.

To deploy nginx to a cluster manually, you would likely need to create a variety of resources: a Deployment, Service, ServiceAccount, PodDisruptionBudget, Role, RoleBinding, and more.
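
To give a sense of scale, here is a minimal sketch of just two of those resources. The names, labels, replica count, and image tag are illustrative assumptions rather than a recommended configuration, and a production setup would add the remaining resources on top.

```yaml
# Illustrative only: names, labels, and the image tag are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-nginx
  template:
    metadata:
      labels:
        app: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        ports:
        - name: http
          containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
spec:
  # Routes traffic to the pods created by the Deployment above.
  selector:
    app: my-nginx
  ports:
  - name: http
    port: 80
    targetPort: http
```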

Faced with this complexity, it can be tempting to deploy an operator, which presents a trivial API that gets a full nginx deployment running:

apiVersion: operator.example.com/v1
kind: Nginx
metadata:
  name: my-nginx
spec: {}

However, this doesn't take into account the costs of the operator. I'd argue that in almost all cases where an operator offers only this functionality, you are better served without it.

Who operates the operator

In the above example, the Nginx API is undeniably simpler than the full suite of resources required to deploy by hand. Unfortunately, we still need to actually deploy the operator itself, which puts us back at square one. In many ways, it is worse than square one, as now we need to operate, maintain, and upgrade two deployments, rather than just one.

There are a few cases where this concern is addressed:

  • Multi-purpose operators. For example, a "Helm Operator" which renders arbitrary HelmCharts.
  • Operators you don't run. For example, if the Kubernetes platform provides them. This could be another team running an operator in the cluster or, with cloud vendors, an operator that often runs completely outside the cluster.
  • Operators in components that are already deployed. Example discussed below.

Customization

While the simple example above looks great for demos, when it comes to a production deployment, customization is going to be needed. Even just considering the Deployment resource, there are over 1000 fields exposed in the Kubernetes API.

If an operator exposes a subset of fields, its maintainers will be continually pressured to add more and more, until they end up with a mess of an API that does most (but not all) of what Deployment does, with a slightly different syntax.

An operator may decide to bypass that and directly expose the Deployment spec. However, at that point the operator quickly becomes indistinguishable from plain YAML, and its benefits disappear.

Customizing YAML is a problem already solved by a variety of tools such as kustomize and kpt. Because operators move this YAML into code, these tools can no longer be composed with it, pushing operators to poorly re-invent them.
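
As a rough sketch of what that composition looks like with plain YAML, a kustomize overlay can change a single field without any operator involved. The file names and the my-nginx Deployment here are assumptions made for illustration.

```yaml
# kustomization.yaml
# Assumes deployment.yaml and service.yaml sit next to this file and
# define a Deployment named my-nginx (hypothetical names).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
patches:
# Strategic merge patch: only the fields listed here are overridden.
- patch: |-
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-nginx
    spec:
      replicas: 5
```

Any of the Deployment's fields can be patched this way, so nobody has to decide in advance which subset to expose.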

Good Examples

While the above describes what makes a bad operator, there are some operators I feel add real value and are worth considering. Note this is not an exhaustive list, just some canonical examples.

Prometheus Operator

Prometheus is the de facto standard in Kubernetes clusters, but it doesn't actually have an amazing Kubernetes integration out of the box. Most configuration is driven by a single ConfigMap, with no way to delegate parts of the config to other namespaces, no granular RBAC, no (Kubernetes) syntax validation, and so on.

The Prometheus Operator does install Prometheus, but it's much more than the kind of operator described in A basic installation operator. In addition to installation, it provides a Kubernetes-native API (via CRDs) that exposes an improved interface on top of the Prometheus configuration.

For example, rather than modifying a ConfigMap, a native Kubernetes resource like the one below can be configured:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-app
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app
  podMetricsEndpoints:
  - port: web

This type of operator is common and useful for applications that were not designed as Kubernetes-first. An operator allows building on top of the application to provide a layer that is Kubernetes-first.

Istio Gateway API

As discussed in Who operates the operator, an operator can be useful when it lives in a component that is already deployed for other reasons.

Istio's Automated Deployment mechanism is a good example of this.

Among other components, Istio consists of a shared control plane, istiod, and a variety of gateway deployments. Previously, users had to deploy a gateway (Deployment, Service, etc.) and configure it using a Gateway CRD. These resources needed to be kept in sync and had a lot of redundancy.

In the Automated Deployment mode, the Gateway resource fully configures the gateway (Deployment and Service included), eliminating the two-step process and the need to keep things in sync.
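
As a rough sketch, a Gateway along these lines is all a user would write in this mode. The resource name and the single HTTP listener are illustrative assumptions, and the apiVersion depends on which Gateway API CRDs are installed in the cluster.

```yaml
# Illustrative only: the name and listener are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
spec:
  # The "istio" class hands this Gateway to istiod, which then
  # creates the backing Deployment and Service.
  gatewayClassName: istio
  listeners:
  - name: http
    port: 80
    protocol: HTTP
```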

This improves operations, and because istiod is already a required part of Istio, most of the concerns in A basic installation operator are mitigated.

GitOps

A commonly given reason for choosing an operator is "GitOps". GitOps means different things to different people, but for this article the important part is storing the state of the system declaratively. This is usually taken to mean "YAML in a git repo".

Operators are not necessary for GitOps.

In Kubernetes, abstractions are layered on top of each other. Users must choose which abstraction level to store in their repository.

At the most granular extreme, the repo could store Pod, Endpoints, and other low-level resources typically derived from higher level resources. You can do this, but it quickly becomes a burden. While our entire state is stored in the repository (the goal of GitOps!), any changes are extremely noisy and slow (for example, relative to the Deployment controller, which can scale up quicker than a git push).

A more reasonable next step would be storing the higher level Kubernetes resources like Service and Deployment. This is what users not using higher level tools like operators or Helm would typically do.

Another level up is using operators. This could mean storing an Nginx resource, for example. An even more extreme example would be storing an empty MyCluster resource, with the entire state of the cluster implemented inside an operator.

While the extreme MyCluster example is clearly wrong, I think the Nginx one is as well. By doing this, we are essentially moving the state of our cluster from configuration to code. From looking at an Nginx resource, I have no clue what will be created in my cluster. If you are careful about pinning the version of the nginx operator, it's mostly immutable, but the diff on a PR to change the operator will just present a version change, with no info about what changes will actually happen to the system. A simple one-line PR may actually be adding a new ClusterRole to nginx allowing it to read Secrets from the whole cluster, for example, which would be completely obfuscated by the layer of indirection.

The translation from Deployment to Pod is a similar kind of indirection, but it is a very well established and mechanical one. The claim isn't that all operators/controllers are bad, just that many are. The Deployment controller is an excellent example of a useful one.
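
For illustration, this is the sort of resource that could sit behind such a one-line version bump. It is entirely hypothetical (the name and rules are made up for this example), and a reviewer would never see it in the PR diff because it lives in the operator's code rather than in the repository.

```yaml
# Hypothetical resource a new operator version might start creating.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nginx-secret-reader
rules:
# Read access to Secrets in every namespace once bound cluster-wide.
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list", "watch"]
```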

Alternatives

Based on the above, you may agree it's best not to use an operator for GitOps, but you may also not want to be manually creating a bunch of Deployments and Services. So what can you do?

Higher level APIs can still be used without an in-cluster operator reconciling a CRD. Some examples include:

  • Rather than reconciling an Nginx resource in the cluster, keep the same resource but render it to native Kubernetes resources with a CLI.
  • Use higher level constructs like Helmfile or an Argo Application.

One pattern is to check in both an inputs/ and a rendered/ folder; reviewers can typically look at the higher level configs in inputs/, but can see the full changes in rendered/ when needed.

Alternatively, check in just the higher level configs and render them as part of CI. This keeps the repo simpler, but also masks changes, so it is typically not preferred.
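
One way to combine this layout with a higher level construct, sketched here under assumptions, is an Argo CD Application that syncs only the rendered/ folder. The repository URL, path, and namespaces below are placeholders, not values from a real setup.

```yaml
# Hypothetical values: repoURL, path, and namespaces are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-nginx
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/repo.git
    targetRevision: main
    # Only the fully rendered resources are applied to the cluster.
    path: rendered/nginx
  destination:
    server: https://kubernetes.default.svc
    namespace: web
```

What gets applied is then exactly what was reviewed, since rendered/ holds the full resources.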