When talking with folks about networking, there is a common belief that doing things in the kernel is superior, or even that user space implementations are not viable.

This leads to a variety of sub-optimal decision making, ranging from skipping a perfectly viable product because it is "user space" to re-inventing well established security protocols so they can be "in kernel".

In my experience, this is often wrong. I won't claim that one is always faster than the other; obviously there is nuance. However, a blanket rule that doing networking purely in the kernel is required for performance doesn't hold up.

Why would the kernel be faster?

Kernel code and user space code both compile down to the same machine code, so there is nothing inherently faster about running the same piece of code in the kernel versus user space. However, switching between the two triggers a context switch, which is fairly expensive, and sometimes requires copying data between kernel and user space.

The theory is that keeping all of this processing in the kernel avoids these context switches and copies, making it dramatically more efficient.
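To make the cost concrete, here is a rough sketch of a tiny user space TCP echo server in Go. This is my own illustration, not anything from the benchmark, and the port is a placeholder. Every Read and Write below is a syscall, so each chunk of data is copied across the kernel/user boundary and triggers a user/kernel transition.

```go
// Illustrative only: a minimal user space TCP echo server.
// Each Read copies bytes from a kernel socket buffer into user space,
// and each Write copies them back; an in-kernel path avoids both.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":9000") // placeholder port
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			buf := make([]byte, 32*1024)
			for {
				n, err := c.Read(buf) // syscall: kernel -> user copy
				if n > 0 {
					if _, werr := c.Write(buf[:n]); werr != nil { // syscall: user -> kernel copy
						return
					}
				}
				if err == io.EOF || err != nil {
					return
				}
			}
		}(conn)
	}
}
```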

One important implication is that this benefit only materializes if we can do things entirely in the kernel. For example, if we implement load balancing in the kernel but still need to do encryption in user space, it probably would have been easier to just do everything in user space: we are already paying the boundary-crossing cost, and it's much easier to write code in user space than in the kernel (where we either need to fit into existing kernel features or use eBPF).

Benchmarks

Disclaimer: like all benchmarks, this one certainly has flaws. The goal is not to be the most precise benchmark in the world, just to show some broad trends.

For this benchmark, I set up a simple test environment with two GCE c3-standard-8 machines running Linux 5.10.

These have three different network paths to reach each other:

  • Direct: simply connect directly to the other VM's IP
  • WireGuard: traffic is sent over the (kernel) WireGuard interface
  • TLS: traffic is sent to a local proxy that adds TLS and forwards to the destination. The implementation is just a tiny Go program built for this (a rough sketch of such a proxy follows below).

Then we do some tests with iperf3 and fortio.
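Since the TLS path is just a small Go forwarder, here is a minimal sketch of what such a proxy might look like. This is an assumed shape of the program, not the actual benchmark code; the listen address, upstream address, and the InsecureSkipVerify setting are placeholders (a real setup would verify the peer certificate).

```go
// Illustrative only: accept plaintext TCP locally, wrap it in TLS,
// and forward to the remote machine (which would terminate TLS on its side).
package main

import (
	"crypto/tls"
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:5201") // placeholder: local side the client connects to
	if err != nil {
		log.Fatal(err)
	}
	for {
		src, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(src net.Conn) {
			defer src.Close()
			// Placeholder remote address; InsecureSkipVerify is only for illustration.
			dst, err := tls.Dial("tcp", "10.0.0.2:15201", &tls.Config{InsecureSkipVerify: true})
			if err != nil {
				log.Print(err)
				return
			}
			defer dst.Close()
			go io.Copy(dst, src) // plaintext in -> TLS out
			io.Copy(src, dst)    // TLS in -> plaintext out
		}(src)
	}
}
```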

Throughput

First up, we have the standard throughput tests:

TCP Throughput (iperf)

While "Direct" is the clear winner here, TLS holds it's ground pretty well.

CPU efficiency

Another important measurement is how much CPU is burned during this. Measuring the total system CPU usage, we see:

  • Direct: 50%
  • TLS: 80%
  • WireGuard: 100%

Note: 100% means one core; since it's an 8-core machine, we could go over 100%.

This comparison is subtle, though, because each of these CPU costs comes with a different throughput. It is better to look at efficiency: throughput per CPU.
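To be explicit about the metric, "throughput per CPU" here just means dividing the measured throughput by the number of cores actually consumed (recall that 100% = one core). A trivial sketch of that normalization, with placeholder inputs rather than the measured results:

```go
// Illustrative only: normalize throughput by CPU consumed.
package main

import "fmt"

// gbpsPerCore divides throughput by CPU usage, where cpuPercent is the
// total system CPU as reported (100 == one fully busy core).
func gbpsPerCore(throughputGbps, cpuPercent float64) float64 {
	return throughputGbps / (cpuPercent / 100)
}

func main() {
	// Placeholder inputs only, not the benchmark results.
	fmt.Printf("%.1f Gbit/s per core\n", gbpsPerCore(10, 50))
}
```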

Throughput per CPU

So while WireGuard only had a 2x CPU overhead, once we account for efficiency it's nearly a 20x overhead.

Latency

Finally, testing latency.

Throughput is nice, but most applications are not sending GBs of data per second, and instead rely on sending small messages quickly.

I like to use HTTP load testing tools here, as HTTP is a common case for low-latency requirements. Here I used fortio.
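A latency test ultimately boils down to timing many small requests; fortio does this properly (warmup, controlled QPS, percentiles), but a rough hand-rolled sketch in Go looks like the following. The target URL is a placeholder.

```go
// Illustrative only: time a batch of small HTTP requests and report the mean.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	const url = "http://10.0.0.2:8080/" // placeholder target behind the Direct/WireGuard/TLS path
	const n = 1000
	var total time.Duration
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
		resp.Body.Close()
		total += time.Since(start)
	}
	fmt.Printf("mean latency: %v\n", total/n)
}
```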

HTTP Latency

Here we see that WireGuard and TLS perform identically.

Why not user space WireGuard?

Comparing kernel WireGuard to user space TLS is not really apples-to-apples. A user space WireGuard implementation would be more appropriate for a "kernel" vs "user space" discussion.

However, I am mostly having this discussion in the context of TLS, so it's useful for me to use TLS.

Fortunately, the folks at Tailscale have done similar tests (and optimizations!) for kernel WireGuard vs user space, so I recommend checking that out if you are interested in WireGuard.