Recently I have been spending a lot of time analyzing and optimizing memory usage in our Rust reverse-proxy, agentgateway.
One thing that repeatedly came up was a surprisingly large amount of memory allocated to innocent-looking Tokio mpsc channels.
In my naive understanding, I would have assumed the following allocation pattern:
```rust
struct BigStruct {
    data: [u8; 1024],
}

fn main() {
    // Allocates ~1024 bytes
    let _ = tokio::sync::mpsc::channel::<BigStruct>(1);
    // Allocates ~1024*1024 bytes
    let _ = tokio::sync::mpsc::channel::<BigStruct>(1024);
}
```
However, in practice both of these are wrong: they each allocate 32kb!
In our application there were two particular areas where this had a pretty meaningful performance impact.
## What's going on
To dig into this a bit more I built out a small playground to analyze allocations from mpscs.
The results were quite interesting:
| msg_size | capacity | heap_after_create | heap_after_fill |
|---|---|---|---|
| 8 | 1 | 800 B | 800 B |
| 8 | 16 | 800 B | 800 B |
| 8 | 1024 | 800 B | 9728 B |
| 128 | 1 | 4640 B | 4640 B |
| 128 | 16 | 4640 B | 4640 B |
| 128 | 1024 | 4640 B | 132608 B |
| 1024 | 1 | 33312 B | 33312 B |
| 1024 | 16 | 33312 B | 33312 B |
| 1024 | 1024 | 33312 B | 1050112 B |
Here we have:
- `msg_size`: the size of the struct `T`
- `capacity`: capacity of the bounded `mpsc` channel
- `heap_after_create`: memory allocated immediately after creation
- `heap_after_fill`: memory allocated immediately after we send `capacity` items on the channel to fill it up
A few things stand out here:
- The capacity of the channel has no impact on the initial allocation size.
- Even when we start sending messages, we don't start to grow until a certain point (spoiler: it's after 32 messages).
- With some quick math we can see the initial cost is `(msg_size * 32) + 544`. So there is a fixed cost plus a multiplier based on the message size.
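As a sanity check, that formula reproduces the `heap_after_create` column exactly (the function name here is my own, not from the playground):

```rust
// Fixed overhead plus one 32-slot block of T, per the measurements above.
// The 544-byte constant is the measured fixed cost; 32 is Tokio's block size.
fn initial_cost(msg_size: usize) -> usize {
    (msg_size * 32) + 544
}

fn main() {
    for msg_size in [8, 128, 1024] {
        // Matches the heap_after_create column: 800, 4640, 33312.
        println!("msg_size {msg_size:>4}: {} B", initial_cost(msg_size));
    }
}
```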
## Implementation details
The 32 multiplier is easy to find by looking at the Tokio source.
The channel is built up of a linked list of `Block`s.
Each `Block` stores `BLOCK_CAP` values of type `T`, where `BLOCK_CAP` is hardcoded to 32.
Because of this, allocating the mpsc allocates, more or less, a `[T; 32]`.
The remaining fixed 544 bytes comes from the rest of the channel's parts, which I didn't analyze too closely.
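To make the floor cost concrete, here is a simplified sketch of that layout. This is my own illustration, not Tokio's actual code (the real `Block` also carries atomics for concurrent access), but it shows why even an empty channel pays for 32 slots:

```rust
// Simplified sketch of the linked-list-of-blocks layout (illustrative only).
const BLOCK_CAP: usize = 32;

struct Block<T> {
    // Storage for BLOCK_CAP messages, allocated up front as one unit.
    values: [Option<T>; BLOCK_CAP],
    // Link to the next block, created once this one fills up.
    next: Option<Box<Block<T>>>,
}

fn main() {
    // Even an empty channel owns one whole block, so the minimum
    // per-channel cost scales with 32 * size_of::<T>().
    let block: Block<[u8; 1024]> = Block {
        values: std::array::from_fn(|_| None),
        next: None,
    };
    assert!(std::mem::size_of_val(&block) >= 32 * 1024);
}
```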
## Real world impact
### Many tiny channels
The first real impact for us was a channel we created for each Kubernetes Service object in the cluster, used to send events about the health of that service.
In many environments these number in the thousands.
The messages we send on each channel are small (only 24 bytes), and the throughput and latency requirements for them are very low.
As we learned above, however, each was taking 1312 bytes!
We moved this to another channel implementation, `futures-channel`, which only allocates one `T` at a time (at the cost of a throughput degradation that was not relevant to us).
The end result cut our overall memory in half in a representative test.
### Hyper connections
When using Hyper, we see a 16kb allocation per connection. When serving thousands of connections this can add up quite a bit.
Part of this is obvious: each connection has a hardcoded buffer, `INIT_BUFFER_SIZE: usize = 8192`.
However, the other 8kb comes from the same mpsc issue!
Each `SendRequest`, the API that requests are sent on, uses a channel that dispatches `http::Request`s. Each request is typically around 250 bytes; multiplying by 32 gives us our remaining 8kb.
Unlike our previous use case, this codepath is very latency- and throughput-sensitive. The upfront allocation has a meaningful impact on end-to-end benchmarks, yet we only ever need 1-2 items in the channel at a time, so the 32-slot block size is almost pure overhead.
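The per-connection arithmetic, using the figures above (the 250-byte request size is the approximation from this post, not a hyper constant):

```rust
fn main() {
    let read_buffer = 8192; // hyper's INIT_BUFFER_SIZE
    let request_size = 250; // approximate size of an http::Request on this path
    let mpsc_block = request_size * 32; // one dispatch-channel block = 8000 B
    // Roughly 16kb per idle connection before any traffic flows.
    println!("per connection: ~{} B", read_buffer + mpsc_block);
}
```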
This is tracked in Hyper issue #4057.