A common phrase I hear when talking to Kubernetes users is "I just want all my traffic mTLS encrypted on Kubernetes." Occasionally, this comes with some additional constraints such as "...without the complexity of a service mesh."

It's a fair request, with many solutions offering different tradeoffs. In this post, I'll go over the options and provide recommendations.

Why an "mTLS" requirement, and not a more general "encryption" requirement? There are many reasons I have seen. These range from already having done the research on various encryption mechanisms and deciding on mTLS to not knowing other options, and everything in between. In this post I'll primarily focus on mTLS.

Why mTLS

Before we get into the best options for mTLS, we ought to understand why we are doing it.

mTLS stands for Mutual TLS. This is just like the encryption used by the vast majority of the internet (https://), but bidirectional.

In the standard public internet case, the browser authenticates the TLS certificate of the website it is connecting to. This helps prove bank.com really is operated by bank.com, and not by a man-in-the-middle (MITM) attacker. The website may also authenticate the user, but this is typically done at the application layer, rather than with TLS.

Mutual TLS is exactly the same thing, but the client also presents a certificate which is validated by the server. Browsers technically support this as well, but it's extremely rare due to ergonomic issues.
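To make the "mutual" part concrete, here is a minimal sketch using Python's standard `ssl` module. The function names and the optional file-path parameters are my own assumptions for illustration (the paths are optional only so the sketch can be exercised without real certificates); a real deployment would always load a certificate, key, and CA bundle.

```python
import ssl


def build_server_context(cert_file=None, key_file=None, ca_file=None):
    """TLS server context that also demands a certificate from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # This is what makes the TLS "mutual": the handshake fails unless the
    # client presents a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # CAs trusted to sign client certificates.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own identity.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def build_client_context(cert_file=None, key_file=None, ca_file=None):
    """TLS client context that verifies the server and presents its own certificate."""
    # PROTOCOL_TLS_CLIENT verifies the server certificate and hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # Presenting a client certificate is the only client-side difference
        # between "plain" TLS and mutual TLS.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

With ordinary TLS the server context would leave `verify_mode` at its default of `ssl.CERT_NONE`; everything else about the handshake stays the same.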

Mutual TLS inside a Kubernetes cluster (or other infrastructure) is useful for the same reasons it's useful on the web. It provides Authenticity (both peers can prove who they are), Confidentiality (eavesdroppers cannot see the data that is exchanged), and Integrity (the data is not tampered with in transit).

These properties are an important step toward achieving a zero trust security posture. Additionally, they are commonly needed to meet compliance requirements, whether those are internal company policies or government standards such as FIPS.

mTLS Options

Once you have decided you want mTLS, you'll need to pick how you want to do it. There are a few different approaches.

Do it yourself

The traditional way to do this would be to provision certificates for all your applications and modify your applications to use them. This can be tricky at small scales, and extremely challenging at large scales. A few issues come up here.

First, we need to actually manage certificate provisioning. This requires developing a naming scheme, distributing roots of trust, signing the certificates, ensuring they are rotated and up to date, and much more. Tools like cert-manager and SPIRE help solve some of these issues. In my experience, however, it's common for users adopting this approach to have existing certificate infrastructure that they need to adapt for Kubernetes.
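As a small illustration of the rotation piece, here is a sketch of the kind of check something has to run for every certificate. The `should_rotate` name and the two-thirds threshold are assumptions I picked for the example, not a rule from any particular tool.

```python
from datetime import datetime, timedelta, timezone


def should_rotate(not_before, not_after, now, fraction=2 / 3):
    """Return True once `fraction` of the certificate's lifetime has elapsed."""
    lifetime = not_after - not_before
    return now >= not_before + lifetime * fraction


# A hypothetical certificate that is valid for 24 hours.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)

print(should_rotate(issued, expires, issued + timedelta(hours=12)))  # False
print(should_rotate(issued, expires, issued + timedelta(hours=20)))  # True
```

Rotating well before expiry leaves room for retries if the signer is briefly unavailable, which is the difference between a logged warning and an outage.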

Once we have the certificate management, we also need to change our applications to start using TLS and to properly send and verify peer certificates. This can be challenging across large polyglot deployments. While it's typically straightforward to change one application to start using TLS, there are all sorts of rough edges:

  • You cannot change the entire cluster atomically. How do you handle the partial state before everything is moved over to using TLS?
  • How do you verify that all traffic is using TLS, and not just some subset? It is common for applications to have multiple ingress and egress points; did you capture all of them?
  • How are you verifying peer certificates? Is there a common logic shared across applications? Is it supported in all the languages used? Is it kept up-to-date in all applications?
  • Do all the usages remember, and correctly implement, certificate rotation? Certificate expiry is a great way to cause a 2am outage!
  • Do all applications even support doing TLS? Do they support mutual TLS? Do they support your certificate scheme? This is especially important with off-the-shelf software where we cannot modify the code.
  • How long does it take to develop and roll out a code change on every application company-wide?
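To give a feel for the "common logic" question, here is a sketch of a peer identity check in Python. It works on the dictionary shape returned by `ssl.SSLSocket.getpeercert()`; the function names and the SPIFFE-style identities are hypothetical examples of a naming scheme, not something your certificates will necessarily use.

```python
def peer_uri_identities(peer_cert):
    """Extract URI subjectAltName entries from an ssl getpeercert()-style dict."""
    return [
        value
        for kind, value in peer_cert.get("subjectAltName", ())
        if kind == "URI"
    ]


def is_peer_allowed(peer_cert, allowed_identities):
    """Allow the peer only if one of its URI identities is explicitly listed."""
    # An empty or missing certificate yields no identities, so it is denied.
    return any(
        identity in allowed_identities
        for identity in peer_uri_identities(peer_cert)
    )


# Hypothetical identities following a SPIFFE-style naming scheme.
allowed = {"spiffe://cluster.local/ns/shop/sa/cart"}
cart = {"subjectAltName": (("URI", "spiffe://cluster.local/ns/shop/sa/cart"),)}
other = {"subjectAltName": (("URI", "spiffe://cluster.local/ns/shop/sa/ads"),)}

print(is_peer_allowed(cart, allowed))   # True
print(is_peer_allowed(other, allowed))  # False
print(is_peer_allowed({}, allowed))     # False
```

This is only a dozen lines, but an equivalent has to exist, behave identically, and stay up to date in every language and framework in the fleet.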

For these reasons, it's common to use higher-level options that handle these issues.

Sidecar-based service mesh

One of the most prevalent approaches to adopting mutual TLS on Kubernetes has been the sidecar-based service mesh architecture. Istio and Linkerd are the most popular.

At a high level, rather than modifying applications to handle TLS in their code, a small network proxy is deployed alongside each application. This proxy can do all sorts of things (check the project's documentation!), but for this post the important part is that it can automatically handle mutual TLS for you. This means you can automatically get mutual TLS for all traffic between two workloads in the mesh without changing your application.

Overview of the Istio sidecar architecture

Speaking for at least Istio and Linkerd (which I am familiar with), these solutions handle all of the tricky cases listed above. Particularly notable is the migration case. Both of these will automatically support sending and accepting plaintext and TLS, depending on whether the peer supports TLS. This can later be locked down to strictly enforce mutual TLS once (or if) migration is complete.
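The idea behind accepting both kinds of traffic during a migration can be sketched in a few lines. This is a toy model I wrote for illustration, not how Istio or Linkerd implement it: the `Mode` enum and `admit` function are hypothetical, and the only protocol fact it relies on is that a TLS connection opens with a handshake record (content type `0x16`, then a version whose major byte is `0x03`).

```python
from enum import Enum


class Mode(Enum):
    PERMISSIVE = "permissive"  # accept TLS and plaintext (during migration)
    STRICT = "strict"          # accept TLS only (after migration)


def looks_like_tls(first_bytes):
    """Guess whether a connection is TLS by peeking at its first two bytes."""
    return (
        len(first_bytes) >= 2
        and first_bytes[0] == 0x16  # TLS record type: handshake
        and first_bytes[1] == 0x03  # TLS/SSL major version
    )


def admit(first_bytes, mode):
    """Decide how a proxy would treat a new inbound connection."""
    if looks_like_tls(first_bytes):
        return "tls"  # hand off to the TLS stack, which verifies the client cert
    if mode is Mode.PERMISSIVE:
        return "plaintext"
    return "reject"


print(admit(b"\x16\x03\x01\x02\x00", Mode.STRICT))  # tls
print(admit(b"GET / HTTP/1.1\r\n", Mode.PERMISSIVE))  # plaintext
print(admit(b"GET / HTTP/1.1\r\n", Mode.STRICT))      # reject
```

The useful property is that switching from permissive to strict is a policy change on the proxy, not a code change in any application.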

This method is by far the most prevalent option for broadly deploying mutual TLS in Kubernetes; I would estimate over 95% of production usage.

Despite its success, there are still some friction points. Service meshes are sometimes seen as too complicated or too expensive (in CPU/memory or latency impact). Many of these concerns are related to service meshes doing a lot of things beyond mTLS; this is great if (or when, in the future) you need this functionality, but if you just want mTLS it can be overkill.

Node-based/Ambient mesh

The additional complexity required by sidecar approaches to provide functionality unrelated to mTLS led to the development of a new architecture for service mesh: ambient mode. In this architecture, a per-node proxy is deployed which handles mutual TLS automatically. Currently, Istio is the only service mesh offering this approach, so I will focus on it.

Istio ambient was specifically designed to meet the use case of "I just want mTLS on Kubernetes." Nearly every design decision stems from this initial goal. While the full suite of service mesh functionality is still present, there is a smooth onramp from "stock Kubernetes" to "mTLS everywhere" to "full service mesh", described more in my previous post.

Because of this specialization for the mutual TLS use case, ambient mode can often be a better fit and alleviate some of the concerns of the sidecar approach. When using only the mTLS features of ambient mode:

  • Cost is dramatically reduced: in some deployments, the ambient approach may use 1% of the CPU and memory of the sidecar approach.
  • Performance is increased: while this depends on the application, it's not uncommon to see the service mesh latency overhead drop by roughly 20x. Note: this doesn't mean your application will be 20x faster, just the overhead; this could mean your application goes from 1000ms app latency + 1ms mesh latency to 1000ms app latency + 0.05ms mesh latency. As such, this likely matters only if your application is already highly performant.
  • Compatibility is increased: one of the main benefits of service mesh, beyond mTLS, is HTTP traffic management. While this is often a great feature, it also fundamentally changes functionality. Ambient mode doesn't do this by default (but allows you to opt in).
  • Complexity is reduced: because the node proxy is purpose built to provide mTLS, there isn't much accidental complexity from other features creeping in.
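To put numbers on the latency note above, here is the arithmetic worked out in Python. The 1000ms, 1ms, and 0.05ms figures are the illustrative ones from the list; the 2ms "fast application" is an extra hypothetical I added for contrast.

```python
def total_latency_ms(app_ms, mesh_ms):
    """End-to-end latency as seen by the caller."""
    return app_ms + mesh_ms


def percent_faster(before_ms, after_ms):
    """Relative end-to-end improvement, as a percentage."""
    return (before_ms - after_ms) / before_ms * 100


sidecar_ms, ambient_ms = 1.0, 0.05
print(f"mesh overhead ratio: {sidecar_ms / ambient_ms:.0f}x")  # 20x

for app_ms in (1000.0, 2.0):  # a slow application and a hypothetical fast one
    before = total_latency_ms(app_ms, sidecar_ms)
    after = total_latency_ms(app_ms, ambient_ms)
    print(f"app={app_ms}ms: {percent_faster(before, after):.3f}% faster end to end")
```

For the 1000ms application the end-to-end improvement is about 0.095%, while for the 2ms application it is about 31.7%, which is why the overhead reduction matters mostly for services that are already fast.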

Overall, ambient mode provides a simple option to deploy mTLS everywhere, without much downside. For users who just want mTLS, this is the best starting place, and offers the ability to incrementally adopt other mesh functionality if desired in the future.

CNI based approaches

One option that occasionally comes up in these discussions is to use a CNI-based approach. The problem is: no CNI currently supports mutual TLS!

So why does this come up? There are a few reasons.

  1. CNIs (usually) implement NetworkPolicy. While this doesn't provide most aspects of TLS (including encryption, authenticity, integrity, and confidentiality), it's often a part of a zero-trust network architecture, so it often gets brought up. I have another post highlighting some concerns with NetworkPolicy.
  2. CNIs often implement other network encryption mechanisms. For example, Calico offers WireGuard, and Cilium offers WireGuard and IPsec. While these are not mutual TLS, there is some overlap in functionality. I think these are often seen as more equivalent than they really are, but that is getting beyond the scope of this post; perhaps a followup post comparing these head-to-head is in order.
  3. Cilium offers a feature now called Mutual Authentication. This is often called mTLS because it was inspired by mTLS and has some overlapping ideas. However, this is not mutual TLS. This is not a pedantic semantic distinction: it is quite simply not offering TLS. This puts it in the same category as above: a non-mTLS based option that provides some functionality that will need to be evaluated.

If your requirement is "Mutual TLS", this approach is not viable. If your requirement is a bit more flexible, some of these options may meet your requirements.

Recommendations

So, wrapping up: "I just want mTLS on Kubernetes", what do I do?

For most users, Ambient mode is likely the best fit. This offers the fastest way to get mTLS deployed across a cluster, with minimal cost, complexity, and overhead.

Sidecar-based approaches are also a good choice and have substantial industry adoption. Just keep in mind these provide a ton of functionality beyond mTLS, which can be overkill depending on your use case.

Doing it yourself is extremely challenging. While this may be a good fit for organizations with a very precise set of requirements and extremely robust operations practices, it is an option of last resort.