A common question around Istio's ambient mode is how it handles traffic during upgrades or restarts.
In the sidecar model, where the proxy lives 1:1 with an application, this isn't really a concern - the proxy shuts down when the application does, and always picks up the latest version when the application starts.
Ambient mode, however, has a dedicated proxy per-node ("ztunnel"), which implies that we may need to upgrade it at some point while the application is running.
Shutdown procedure
Ztunnel follows a standard rolling update process. This means when we want to bring up a new version, we...
- Bring up the new version
- Once the new version is ready, start shutting down the old one
- For some period of time, both run side-by-side
- Finally, the old version is shut down
For Ztunnel, there are two types of traffic to consider.
The first is what happens to new connections established when there are multiple instances running at the same time.
For a very brief period, both instances will be accepting connections (utilizing SO_REUSEPORT), assigned more-or-less at random by the kernel.
However, as soon as the new instance is fully ready, the old one will be notified to stop accepting new connections, leaving the new instance to handle all new connections.
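To make the SO_REUSEPORT handover concrete, here is a minimal sketch in Python. It is not Ztunnel's code (Ztunnel is not written in Python), and the helper names `reuseport_listener` and `simulate_handover` are made up for illustration. It assumes a platform that supports SO_REUSEPORT, such as Linux.

```python
import socket


def reuseport_listener(port: int = 0) -> socket.socket:
    """Open a TCP listener whose port can be shared via SO_REUSEPORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set the option before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock


def simulate_handover() -> bytes:
    """Return the reply a client gets after the old listener stops accepting."""
    blue = reuseport_listener()        # old instance; the kernel picks a free port
    port = blue.getsockname()[1]
    green = reuseport_listener(port)   # new instance joins the same port
    # At this point both listeners could be handed new connections.
    blue.close()                       # old instance stops accepting
    with socket.create_connection(("127.0.0.1", port)) as client:
        conn, _ = green.accept()       # only green is left to accept
        with conn:
            conn.sendall(b"served by green")
        reply = b""
        while chunk := client.recv(64):  # read until green closes its side
            reply += chunk
    green.close()
    return reply
```

Calling `simulate_handover()` shows that a client connecting after the old listener closes is served by the new one, without ever seeing a refused connection.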
Next is what happens to existing connections for the instance we are shutting down.
While application protocols commonly have mechanisms to notify a peer that it should stop using a connection (for example, HTTP/1.1 can send a Connection: close header and HTTP/2 can send a GOAWAY frame), Ztunnel operates at L4, and there is no such mechanism in TCP.
Instead, Ztunnel uses a grace period mechanism.
As long as there are still active connections, the old instance will continue to run and serve these connections (but, as mentioned above, will not handle any new connections).
Eventually, if the configurable grace period elapses, any remaining connections are sent a RST.
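The drain-then-reset behaviour can be sketched with plain sockets. This is an illustration only, not how Ztunnel implements it: the function names and the short `grace_period` are invented for the example, and it relies on the common trick of setting SO_LINGER with a zero timeout so that `close()` sends a RST instead of a FIN (behaviour assumed here for Linux).

```python
import socket
import struct
import time


def close_with_rst(conn: socket.socket) -> None:
    """Abort a connection: linger on, timeout 0 makes close() send a RST."""
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def drain_demo(grace_period: float = 0.1) -> str:
    """Show what an idle client sees when its connection outlives the grace period."""
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = listener.accept()
    listener.close()           # shutdown begins: no new connections are accepted
    time.sleep(grace_period)   # the existing connection is left alone meanwhile
    close_with_rst(conn)       # grace period elapsed: forcefully terminate
    try:
        client.recv(1)
        return "closed cleanly"
    except ConnectionResetError:
        return "reset"
    finally:
        client.close()
```

From the client's point of view the connection works normally throughout the grace period, and then the next read fails with a connection reset.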
Putting these together, the lifecycle of an upgrade from "blue" to "green" looks as follows:
The first bar shows which instance is actively accepting connections. We can see this transitions from "blue" to "green", with a brief period where both are accepting connections.
The second and third bars show the state of the old and new instances. The old instance starts its grace period once the new instance is ready, and eventually forcefully terminates any remaining connections.
Is my application impacted?
For most, the important part isn't the internal details of how things work, but whether their applications will be impacted. The short answer -- it depends.
If your application has connections that live longer than the grace period, they are at risk of being reset. Depending on your application, this may have various impacts -- some will handle this more gracefully than others.
One important note is that at no point will new connections fail. So if your application handles a connection termination by attempting to re-establish a new connection (as it should!), that will always work.
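As a rough sketch of what "re-establish a new connection" can look like on the application side, the helper below retries a request on a fresh connection when the old one is reset. The `Connection` protocol, `connect` factory, and `request_with_reconnect` name are hypothetical stand-ins for whatever client library you use.

```python
from typing import Callable, Protocol


class Connection(Protocol):
    """Hypothetical minimal client interface used by this sketch."""

    def request(self, payload: bytes) -> bytes: ...


def request_with_reconnect(
    connect: Callable[[], Connection],
    payload: bytes,
    max_attempts: int = 3,
) -> bytes:
    """Send a request, opening a fresh connection if the current one is reset."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    conn = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return conn.request(payload)
        except (ConnectionResetError, BrokenPipeError):
            if attempt == max_attempts:
                raise             # give up and surface the error to the caller
            conn = connect()      # new connections are still accepted during upgrades
    raise AssertionError("unreachable")
```

Blindly retrying is only safe for idempotent requests. If a request may have been partially processed before the reset, the application needs its own way to deduplicate or resume.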
How can I mitigate connection resets?
If your application cannot handle any connection resets well, there are two primary approaches you can use.
The first is to ensure the Ztunnel grace period is longer than your maximum connection age.
Many usages of long-lived connections (such as in connection pools) can configure the maximum age of a connection before they establish a new one, which can be set to a lower value if needed.
Additionally, Ztunnel's grace period can be tuned by configuring its terminationGracePeriodSeconds setting -- this value can be set quite high, even hours.
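If your client library does not offer a maximum connection age, the idea can be approximated by hand. The sketch below is an assumption-laden illustration: `MaxAgeConnection` and `grace_period_covers` are invented names, and the injected `connect`, `close`, and `clock` callables stand in for your real client and time source.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class MaxAgeConnection(Generic[T]):
    """Hand out one connection, replacing it once it reaches a maximum age."""

    def __init__(
        self,
        connect: Callable[[], T],
        close: Callable[[T], None],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect
        self._close = close
        self._max_age = max_age_seconds
        self._clock = clock
        self._conn: Optional[T] = None
        self._opened_at = 0.0

    def get(self) -> T:
        now = self._clock()
        if self._conn is not None and now - self._opened_at >= self._max_age:
            self._close(self._conn)   # retire the old connection on our own terms
            self._conn = None
        if self._conn is None:
            self._conn = self._connect()
            self._opened_at = now
        return self._conn


def grace_period_covers(max_age_seconds: float, grace_period_seconds: float) -> bool:
    """True if every connection should be retired before the grace period ends."""
    return grace_period_seconds > max_age_seconds
```

Note that this only bounds the age of a connection at the moment `get()` is called. A connection that stays checked out and in use past its maximum age is not recycled, so leave some margin between the two values.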
A more invasive, but extremely safe option, is to ensure no applications are running on a node undergoing an upgrade. This can be done by cordoning and draining the node first. For most cases, this is more trouble than it's worth, but if you treat your TCP connections as "pets", this may be worthwhile.
Wondering why a node cordon/shutdown can be 100% safe from connection terminations, while a Ztunnel upgrade cannot? The issue with Ztunnel upgrades is that there is no mechanism to notify an application to shut down gracefully or close connections. However, when an application shuts down, it will get a SIGTERM signal which it can use to do a graceful shutdown. Of course, your application must actually gracefully handle this signal to get any benefits!
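For reference, handling SIGTERM in an application can be as small as the sketch below. The `serve`, `handle_one`, and `drain` names are placeholders for your own request loop and cleanup logic, not part of any particular framework.

```python
import signal
import threading

# Set once SIGTERM arrives; the serving loop checks it between units of work.
shutting_down = threading.Event()


def _on_sigterm(signum, frame) -> None:
    shutting_down.set()


def install_sigterm_handler() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)


def serve(handle_one, drain) -> None:
    """Process work until SIGTERM is received, then close connections cleanly."""
    while not shutting_down.is_set():
        handle_one()   # hypothetical: serve one request or poll once
    drain()            # hypothetical: finish in-flight work, close connections
```

The important part is that the application, not the network, decides when each connection ends, so it can finish in-flight work and close with a normal FIN rather than being reset.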