Ambient and the SPOF Myth

One common question I hear about Istio ambient mode is "Isn't this just introducing a single point of failure?"

Apparently, Istio has been facing similar questions since 2017!

The short answer is no: ambient mode does not introduce a SPOF. Read on for details.

Background

First, a brief overview of the ambient architecture:

Each Kubernetes node runs its own Ztunnel proxy, and namespaces can (optionally) deploy a waypoint Deployment. The Istio docs provide much more detail.

Additionally, just to be clear: "A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working." (source).

Ztunnel

First, we will discuss ztunnel. The general argument goes like this:

All traffic on the node traverses Ztunnel.
There is only one Ztunnel per node.
If Ztunnel goes down, all my traffic on the node goes down.
Therefore, ambient introduces a SPOF into a Kubernetes cluster.

Everything here is correct up until the conclusion.

Kubernetes, like distributed systems in general, is built on the premise that we can combine a bunch of potentially unreliable components and get an availability that is greater than that of any individual part. It is fundamental to modern operations that individual nodes/VMs can and will fail, and the broader system needs to be robust to failures of individual nodes.

For instance, GCE offers "Industry-leading" VM reliability, with 99.9% uptime on an individual VM basis. If each node were a single point of failure, and node failures were independent, overall uptime would be 0.999^{# Nodes}. At 1,000 nodes, that is an abysmal 37% uptime. Clearly, users are not operating this way.
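To make the arithmetic concrete, here is a minimal sketch of that calculation. It assumes node failures are independent, and the function name is mine, purely for illustration.

```python
def all_nodes_up_probability(node_availability: float, nodes: int) -> float:
    """Probability that every node is up at the same time.

    Assumes node failures are independent. This models the (incorrect)
    premise that each node is a single point of failure for the cluster.
    """
    return node_availability ** nodes


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} nodes: {all_nodes_up_probability(0.999, n):.1%}")
```

At 1,000 nodes this works out to about 36.8%, which is the figure the argument above relies on.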

That being said, while we must accept the possibility of node failures, we still want to minimize them. To that end, Ztunnel was purpose-built with a heavy emphasis on reliability. It joins a variety of other critical infrastructure running on each node, such as:

The Linux kernel itself, which handles all network traffic (among other things)
The container runtime
The kubelet
kube-proxy or equivalent

These, like ztunnel, are all critical to a node operating successfully. Failures in any one of these can cause a localized outage on the node. In a properly designed system, node outages do not lead to cluster outages.
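The flip side can be sketched the same way. The snippet below estimates the availability of a workload whose replicas are spread across several nodes, where the workload stays up as long as at least one of those nodes is up. It assumes independent node failures and ignores everything except node availability, so treat it as a back-of-the-envelope illustration rather than a real availability model; the function name is hypothetical.

```python
def service_availability(node_availability: float, replica_nodes: int) -> float:
    """Probability that at least one of the nodes hosting a replica is up.

    Assumes each replica runs on a different node and that node failures
    are independent.
    """
    all_nodes_down = (1 - node_availability) ** replica_nodes
    return 1 - all_nodes_down


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        print(f"{replicas} node(s): {service_availability(0.999, replicas):.9f}")
```

With a single replica, the workload is only as available as its node. Each additional node multiplies the chance of a total outage by the per-node failure probability, which is why a node-scoped failure does not have to become a cluster-wide one.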

Waypoints

Waypoints are a bit different, and much easier to debunk. The general argument goes like this:

There is a single waypoint for the whole namespace.
If that waypoint goes down, the entire namespace goes down.
Therefore, ambient introduces a SPOF into a Kubernetes cluster.

This is simply a misunderstanding of how ambient works.

Waypoint proxies are not 1:1 with namespaces.

They are (generally) deployed as a Kubernetes Deployment, and are fully horizontally and vertically scalable. Because a waypoint is a standard Kubernetes Deployment, existing tooling for high availability and autoscaling, such as PodDisruptionBudget, HorizontalPodAutoscaler, VerticalPodAutoscaler, and Pod Topology Spread Constraints, works out of the box.
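As a rough illustration, the sketch below builds a PodDisruptionBudget and a HorizontalPodAutoscaler for a waypoint and prints them as a JSON list. The namespace, the waypoint name, the replica counts, the CPU target, and the pod label used in the selector are all assumptions made for this example; check the labels on your own waypoint pods before using anything like it.

```python
import json

# Hypothetical names for illustration; adjust to match your cluster.
NAMESPACE = "default"
WAYPOINT = "waypoint"
# Assumed pod label; verify with `kubectl get pods --show-labels`.
POD_LABELS = {"gateway.networking.k8s.io/gateway-name": WAYPOINT}


def pod_disruption_budget(min_available: int) -> dict:
    """Keep at least `min_available` waypoint pods during voluntary disruptions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{WAYPOINT}-pdb", "namespace": NAMESPACE},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": POD_LABELS},
        },
    }


def horizontal_pod_autoscaler(min_replicas: int, max_replicas: int, cpu_percent: int) -> dict:
    """Scale the waypoint Deployment on average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{WAYPOINT}-hpa", "namespace": NAMESPACE},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": WAYPOINT,  # assumes the Deployment shares the waypoint's name
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": cpu_percent},
                },
            }],
        },
    }


# A v1 List lets both objects travel in a single JSON document.
manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [pod_disruption_budget(1), horizontal_pod_autoscaler(2, 5, 80)],
}
print(json.dumps(manifests, indent=2))
```

Since kubectl accepts JSON as well as YAML, the output could be piped to `kubectl apply -f -`. Running at least two replicas alongside a PodDisruptionBudget means a node drain or a single pod crash should not take the waypoint offline.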

Conclusion

Overall, Istio ambient does not introduce a SPOF into a Kubernetes cluster.

Ztunnel failures are scoped to a single node, which is considered a fallible component in a cluster. It behaves the same as other node-critical infrastructure running on every cluster such as the Linux kernel, container runtime, etc.
Waypoint proxies can easily be deployed in a high-availability configuration, making them robust to failures.
