Because I really don't like waiting for things during development, one of my top priorities when building out Ztunnel - the underlying network proxy written for Istio's new Ambient mode - was to make sure tests were fast (both to run and to write) and easy to debug.

This is tricky, because Ztunnel is pretty deeply integrated into Kubernetes in most real-world usage. While it can be run outside of Kubernetes entirely, so many of the critical code paths behave completely differently that testing solely in that manner is infeasible.

A typical Ztunnel deployment looks like this:

Overview of Ztunnel architecture

A user will run a Kubernetes cluster with many nodes. Each node runs a Ztunnel, which configures both the host networking stack and each pod's networking stack.

Additionally, Ztunnel will actually enter each pod's network namespace and send/receive traffic on its behalf. This is really weird, really cool, and makes testing really hard! (More info).

Testing faster

Spinning up a whole Kubernetes environment, rebuilding images, deploying to every node, and so on is really slow, and really hard to debug.

The gold standard here would be to run everything as a simple single binary -- just a cargo test away. This avoids complex setup and slow rebuilds, and makes debugging a breeze (sure, you can hook up a debugger to a running pod... but it's a pain).

Setting up the network

If we unravel endless layers of abstraction, Kubernetes pods are really just a few Linux namespaces and mounts glued together. Docker works great to manage this, but so does bash.

The part we care about in particular is network namespaces, which allow isolation of networking stacks. Each pod gets its own network namespace, and through various mechanisms these are hooked up to allow communicating with other pods on the same node, other nodes, and external destinations.

The good news: it's really simple to make a network namespace.

$ sudo ip netns add testing
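
Once it exists, you can run commands inside it to poke around (just an illustration, not part of the test setup itself):

$ sudo ip netns exec testing ip addr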

Ultimately, what we want is to set up a variety of network namespaces that mirror how they would look on Kubernetes. This should look like our real world architecture:

Desired network namespace setup

Building out connectivity between network namespaces is a bit trickier. Tools like cnitool can help out here (this essentially runs the same logic some Kubernetes environments use to set up networks, but as a CLI tool), but you can do it all manually as well. We chose the latter route.

Ultimately, what we came up with was a setup like this (a rough sketch of the equivalent ip commands follows the diagram below):

  • Each test gets its own network namespace, which plays the role of the root namespace. A bridge device (br0) lives there to facilitate traffic between nodes.
  • Each node sets up a veth device. One end becomes eth0 on the node, and the other is connected to br0 in the root namespace.
  • Each pod sets up a veth device. One end becomes eth0 on the pod, and the other resides in the node network namespace.
  • Routes are set up for each pod to send traffic to the node.
  • Routes are set up for each node pair to allow cross-node traffic.
Desired network connectivity setup
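
As a rough illustration of what that plumbing looks like for one node and one pod (device names and addresses here are made up, and address assignment is mostly omitted; the real code handles more details such as IP allocation and cleanup):

# Bridge in the per-test "root" namespace ties the nodes together.
ip link add br0 type bridge
ip link set br0 up

# A node: a veth pair, one end attached to br0 and the other end as eth0 inside the node namespace.
ip link add veth-node1 type veth peer name eth0 netns node1
ip link set veth-node1 master br0
ip link set veth-node1 up

# A pod on that node: another veth pair, with eth0 inside the pod and the peer left in the node namespace.
ip -n node1 link add veth-pod1 type veth peer name eth0 netns pod1
ip -n node1 link set veth-pod1 up

# Routes: the pod sends everything via the node; each node gets a route to the other nodes' pod ranges.
ip -n pod1 route add default via 10.0.1.1
ip -n node1 route add 10.0.2.0/24 via 10.244.0.2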

With the exception of the root namespace/bridge device, this is identical to how many real world Kubernetes clusters run (in the real world, the root namespace is the physical network between two machines).

You can find all the details here.

Running tests

Once we have our namespaces, we still need a way to actually use them. Fortunately, Linux allows changing the current network namespace of a thread at runtime (this will be important later). This lets us build up a foundational helper function (the real code is slightly more complex):

fn run_in_namespace<F: FnOnce()>(namespace: Namespace, f: F) {
  // Remember where we started, switch into the target namespace (setns(2)
  // under the hood), run the function, then switch back.
  let original_namespace = get_current_namespace();
  namespace.enter();
  f();
  original_namespace.enter();
}

With this, we can easily execute code from arbitrary "pods" or "nodes".

However, we still have a problem. All of our code runs in the tokio async runtime, which will schedule all of our various tasks onto physical OS threads as it pleases (similar to how the Go runtime works). Network namespaces are per thread, so this all blows up when our tasks are jumping around between threads.

Fortunately, Rust gives us a bit more flexibility over the async runtime than Go -- we can have multiple concurrently! With this, we are able to build an async-capable run_in_namespace. For each function we want to execute, we spin up a new thread and build a dedicated single-threaded async runtime to handle it:

async fn async_run_in_namespace<F, Fut>(namespace: Namespace, f: F)
where
    F: FnOnce() -> Fut + Send + 'static,
    Fut: Future<Output = ()>,
{
    thread::spawn(move || {
        run_in_namespace(namespace, || {
            // A dedicated single-threaded runtime, pinned to this thread, so
            // the task can never migrate out of the namespace.
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .unwrap();
            rt.block_on(f())
        })
    });
}

We run this function once per namespace, so the overhead here is minimal. If we wanted to run many small functions, an abstraction could be built on top to send work to the thread to be executed.
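
As a sketch of that idea (hypothetical, not code that exists in ztunnel today): a long-lived thread pinned inside the namespace, fed boxed closures over a channel, assuming the Namespace type and run_in_namespace helper from above:

use std::sync::mpsc;
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

// Hypothetical: spawn one long-lived thread pinned inside `namespace` and hand
// back a Sender; every job sent on it runs on that thread, inside that namespace.
fn namespace_executor(namespace: Namespace) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::spawn(move || {
        run_in_namespace(namespace, || {
            for job in rx {
                job();
            }
        })
    });
    tx
}

A caller would then do something like executor.send(Box::new(|| { /* runs inside the namespace */ })).unwrap().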

The last thing we need is a reasonable way to figure out how to reach each destination. While they will all get an IP assigned (based on a simple IPAM strategy in our code), we don't want each test to have to guess the IP. To handle this, we build a simple name resolver. This is just like DNS, but far simpler: for every "pod" we create, we record a mapping of name -> IP, and allow looking up the IP.
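
Conceptually, the resolver is nothing more than a map that is filled in as pods are registered (a minimal sketch with illustrative names, not the actual ztunnel test code):

use std::collections::HashMap;
use std::net::IpAddr;

// Illustrative only: record each "pod" name as it is created, look it up later.
#[derive(Default)]
struct Resolver {
    names: HashMap<String, IpAddr>,
}

impl Resolver {
    fn register(&mut self, name: &str, ip: IpAddr) {
        self.names.insert(name.to_string(), ip);
    }

    fn resolve(&self, name: &str) -> Option<IpAddr> {
        self.names.get(name).copied()
    }
}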

Putting it all together, a simple test spinning up 3 pods (client, server, and ztunnel) on a single node looks like this:

#[tokio::test]
async fn simple_test() -> anyhow::Result<()> {
  // `manager` is created by the test framework's setup (elided here).
  let ztunnel = manager.deploy_ztunnel(DEFAULT_NODE).await?;

  let server = manager
    .workload_builder("server", DEFAULT_NODE)
    .register()
    .await?;
  run_tcp_server(server)?;

  let client = manager
    .workload_builder("client", DEFAULT_NODE)
    .register()
    .await?;
  run_tcp_client(client, manager.resolve("server"))?;

  // ... some assertions here
  Ok(())
}

Dropping privileges

The above setup worked great, but came with a few problems of its own.

Basically every step of the setup requires elevated (root) privileges; this makes it tedious to get the simple cargo test case working out of the box, and is generally not desirable.

Additionally, this pollutes the host environment with a bunch of namespaces. While we have some cleanup processes, these are not 100% reliable and can leave dangling namespaces that block future executions.

The solution to having too many namespaces? More namespaces! For this, we will need more than network namespaces.

User namespaces allow us to essentially pretend to be UID 0 (root), while actually mapping that back to our original UID. The power here is that within that namespace, we can do things that would otherwise require root - in particular, creating new network namespaces.

However, one thing we cannot do is modify host-root owned files (which would be a clear permission violation). While we could probably work around this, a lot of the tools we use in our tests like to touch root-owned files. This, again, can be worked around with a mount namespace, which lets us bind-mount files we do own over the host-root owned ones, without impacting anything outside the namespace.

Putting it all together, we have something like this:

let original_uid = get_uid();
// First, drop into a new user namespace.
unshare(CloneFlags::CLONE_NEWUSER).unwrap();
// Map root inside the user namespace back to our original UID.
File::create("/proc/self/uid_map")
    .unwrap()
    .write_all(format!("0 {original_uid} 1").as_bytes())
    .unwrap();

// Set up a new network namespace.
unshare(CloneFlags::CLONE_NEWNET).unwrap();

// Set up a new mount namespace.
unshare(CloneFlags::CLONE_NEWNS).unwrap();

// Bind-mount a folder from our temporary per-test directory over /var/run/netns.
mount(
    Some(&tmp_dir.join("netns")),
    "/var/run/netns",
    None::<&str>,
    MsFlags::MS_BIND,
    None::<&str>,
)
.unwrap();

// A nice helper message to facilitate manual debugging, if needed.
let pid = get_pid();
eprintln!("Starting test in {}. Debug with `sudo nsenter --mount --net -t {pid}`", tmp_dir.display());

One catch: as mentioned above, namespace membership is per-thread, so we need to set all of this up before we spawn any additional threads.

Rust actually provides us with the ability to do that, but it means we lose the #[tokio::test] macro helper. We could write our own macro, but that is a bit of a pain. Fortunately, through linker shenanigans we can force our code to run extremely early in the process execution.
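
One way to do that (a sketch of the approach, not necessarily the exact hook ztunnel uses) is the ctor crate, which places a function in the binary's init section so it runs before main() and therefore before the test harness spawns any threads:

// Runs before main(), while the process is still single-threaded.
#[ctor::ctor]
fn namespace_setup() {
    // Do the unshare()/uid_map/bind-mount setup from above here.
    // (Function name and body are illustrative.)
}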

A similar method works for Go (see the helper library I wrote), and is actually required there, because the setup must be done before the Go runtime starts (which is long before any user code runs, typically).

Putting it all together

With all of this machinery in place, a full test with a few fake pods takes about 200ms. Everything runs in a single process, making debugging trivial. Everything is also fully isolated, so tests can be run fully in parallel (including the same test, for stress-testing to weed out test flakes).