When building out Agentgateway, we had a desire to introduce an embedded expression language to allow users to write custom logic to be evaluated at runtime.
This is tremendously useful for a variety of use cases, such as:
- Extracting fields to log (`request.headers["user-agent"]`).
- Evaluating authorization conditions (`jwt.sub == "admin" || request.path == "/public"`).
- Manipulating fields in requests/responses (`x-llm-model: 'json(request.body).model'`).
and so on.
This provides a powerful way to allow users to customize behavior without needing custom compile-time extensions, external callouts, or complex YAML-based configuration.
However, if we are going to evaluate 100s of expressions per request, it ought to be really fast!
CEL in Rust
From some past experience, we knew that the Common Expression Language (CEL) was a very good fit for this type of use case: simple enough to be easily embedded, but powerful enough to express complex logic. Even better, there was a really great Rust implementation of CEL that was very promising - not only did it work well, but it was also incredibly simple to understand the internals (which cannot be said for the official Google implementations in other languages).
While this was great to get us started, it wasn't enough to get where we needed to be in terms of performance. We could evaluate a few expressions per request without issue, but as we scaled up to 100s of expressions, the latency quickly became unacceptable. This, in turn, led to us avoiding putting CEL in places where it would have otherwise made sense.
Beyond standard profiling, which showed CEL using on the order of 10-20% of CPU, I also designed some tests that amplified our CEL operations, repeating each one 100x, with the following results:
- Baseline: 60K QPS
- Building the CEL context 100x (see below for more info): 15K QPS
- Evaluating a fast expression 100x: 37.2K QPS
- Evaluating a slow expression 100x: 9K QPS
So we can see that there is a tremendous degradation in CEL performance as we ramp up our usage.
In this post, I will share the optimizations we made to the implementation to get to near-native speeds - without impacting user behavior at all.
Overview of the CEL evaluation process
Before we dive into the optimizations, let's first take a step back and understand how CEL evaluation works in the Rust implementation.
The core value type is represented by Value, which is similar to how you might represent JSON:
```rust
pub enum Value {
    List(Arc<Vec<Value>>),
    Map(Map),
    // Atoms
    Int(i64),
    Float(f64),
    String(Arc<String>),
    Bytes(Arc<Vec<u8>>),
    // ... some other variants
}
```
And the actual syntax tree is represented by an Expr:
```rust
pub enum Expr {
    Call(CallExpr),
    Identifier(String),
    Literal(Value),
}
```
Evaluation takes an `Expr` and a `Context` and produces a `Value`. The `Context` contains a set of functions and variables (`Map<String, Value>`) that can be used during evaluation.
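Conceptually, the evaluation loop is a recursive walk over the AST. Here is a pared-down, illustrative sketch (the real crate has many more `Expr`/`Value` variants plus proper error handling; the names here are simplified stand-ins):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Pared-down model of the evaluation described above.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Int(i64),
    String(Arc<String>),
    Map(Arc<HashMap<String, Value>>),
}

pub enum Expr {
    Literal(Value),
    Identifier(String),
    // `a.b` and `a["b"]` both lower to a member access in this sketch.
    Member(Box<Expr>, String),
}

pub struct Context {
    pub variables: HashMap<String, Value>,
}

// Walk the Expr, resolving identifiers against the Context and
// member accesses against Value::Map.
pub fn eval(expr: &Expr, ctx: &Context) -> Option<Value> {
    match expr {
        Expr::Literal(v) => Some(v.clone()),
        Expr::Identifier(name) => ctx.variables.get(name).cloned(),
        Expr::Member(base, field) => match eval(base, ctx)? {
            Value::Map(m) => m.get(field).cloned(),
            _ => None,
        },
    }
}
```

Note that every variable must already exist as an owned `Value` inside the `Context` before evaluation starts - which is exactly where the trouble begins.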
There are a few problems here that lead to performance issues in our use case:
- Variables must be materialized to `Value` types before each evaluation. For example, in a simple lookup like `request.headers["user-agent"]`, we need to take our request object (an `http::Request`) and convert it into a `Value::Map` before we can evaluate the expression. This is super expensive! To work around this, we would do the conversion only once and reuse the same `Value` throughout a request. Additionally, we would analyze which variables the expressions actually needed (for example, they may not use `request` at all), and only convert those variables to `Value` types. This was a huge improvement, but still not good enough, and it led to a lot of complexity in the codebase to track which variables were materialized and which were not.
- Values are always owned, heap-allocated objects. This doesn't matter so much while (1) is still the bottleneck, since the conversion to `Value` produces reference-counted objects that are cheap to clone. However, once we optimize (1), this becomes a big problem.
- All structures are hashmaps, which imposes a ~20ns lookup cost for every field access. This is especially bad for deeply nested expressions like `request.headers["user-agent"]`, which require multiple lookups (60ns).
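To make problem (1) concrete, here is a sketch of the kind of up-front conversion the old path had to do, with `&[(String, String)]` standing in for `http::HeaderMap` purely for illustration:

```rust
use std::collections::HashMap;

// Old-path sketch: copy every header into an owned map before any
// expression runs, even if only one key is ever read.
fn materialize_headers(headers: &[(String, String)]) -> HashMap<String, String> {
    headers
        .iter()
        // Two allocations per header, repeated on every request.
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```

With a few dozen headers and hundreds of expressions per request, this conversion dominates unless it is cached and shared for the whole request - which is precisely the complexity described above.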
Native types in CEL
The ultimate solution I wanted was to be able to do something like this:
```rust
fn evaluate(expr: &Expr, req: http::Request) -> Value {
    Value::resolve(&expr, &req)
}
```
Here we just take a reference to a native Rust type, and allow the CEL interpreter to resolve fields directly on the native type without needing to convert it to a Value first.
Ultimately we still return a Value, but we can delay the conversion to Value until the very end of the evaluation, and only convert the final result to a Value, which is a huge improvement.
Additionally, we can make that Value a borrowed reference instead of an owned value, which can make that final conversion trivial.
An approach like this should be pretty close to writing native code like `req.headers().get("user-agent")` directly in Rust, since we are just adding a thin layer of interpretation on top.
References
First, we need references! We change Value to not be owned:
```rust
pub enum Value<'a> {
    List(ListValue<'a>),
    Map(MapValue<'a>),
    Int(i64),
    Float(f64),
    String(StringValue<'a>),
    Bytes(BytesValue<'a>),
}
```
Each type (aside from the trivial ones like Int), can now be either an owned value or a borrowed reference:
```rust
pub enum StringValue<'a> {
    Borrowed(&'a str),
    Owned(Arc<str>),
}

pub enum ListValue<'a> {
    Borrowed(&'a [Value<'a>]),
    PartiallyOwned(Arc<[Value<'a>]>),
    Owned(Arc<[Value<'static>]>),
}
```
These types are similar to Cow but allow some additional optimizations to match our specific use cases.
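As a sketch of why this shape helps (the method names here are illustrative, not necessarily the crate's actual API): reads never allocate, and escaping the source lifetime only allocates for the borrowed case:

```rust
use std::sync::Arc;

// Cow-like string sketch: borrow when we can, refcount when we must.
pub enum StringValue<'a> {
    Borrowed(&'a str),
    Owned(Arc<str>),
}

impl<'a> StringValue<'a> {
    // Reading is allocation-free for both variants.
    pub fn as_str(&self) -> &str {
        match self {
            StringValue::Borrowed(s) => s,
            StringValue::Owned(s) => s,
        }
    }

    // Escaping the source lifetime (e.g. to return a Value<'static>)
    // copies only the Borrowed variant; Owned is just a refcount bump.
    pub fn into_owned(self) -> StringValue<'static> {
        match self {
            StringValue::Borrowed(s) => StringValue::Owned(Arc::from(s)),
            StringValue::Owned(s) => StringValue::Owned(s),
        }
    }
}
```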
Dynamic types
Next, we added a new type, Dynamic, to wrap a native Rust type.
For this type, we need to support any operation that CEL does, such as field access, function calls, addition, indexing, conversion to JSON, etc.
To do this, we defined a trait, but avoided making each implementation implement every single operation possible.
Instead, we only require implementations to support looking up a field (`request.headers`, etc.) and "materializing" into a proper `Value`.
For example, materializing `request` would result in a `Map` (expensive, but rare), while `request.headers["user-agent"]` would materialize to a `String` (cheap, especially now that we have references!).
Because materialization is essentially always best for simple types, we also allow auto-materialization.
```rust
pub trait DynamicType: std::fmt::Debug + Send + Sync {
    // If the value can be freely converted to a Value, do so.
    // This is anything but list/map.
    fn auto_materialize(&self) -> bool {
        false
    }
    // Convert this dynamic value into a proper value.
    fn materialize(&self) -> Value<'_>;
    // Look up a single field on the native type.
    fn field(&self, _field: &str) -> Option<Value<'_>> {
        None
    }
}
```
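To make the trait concrete, here is what implementing it by hand for a headers-like type might look like. This is a pared-down sketch: `Value` is simplified, and the `Send + Sync` and `auto_materialize` pieces are omitted:

```rust
// Simplified Value for illustration only.
#[derive(Debug, PartialEq)]
pub enum Value<'a> {
    String(&'a str),
    Map(Vec<(&'a str, &'a str)>),
}

pub trait DynamicType: std::fmt::Debug {
    // Convert the whole value into a proper Value (expensive, rare).
    fn materialize(&self) -> Value<'_>;
    // Look up a single field (cheap, common).
    fn field(&self, _field: &str) -> Option<Value<'_>> {
        None
    }
}

#[derive(Debug)]
pub struct Headers(pub Vec<(String, String)>);

impl DynamicType for Headers {
    fn materialize(&self) -> Value<'_> {
        Value::Map(
            self.0
                .iter()
                .map(|(k, v)| (k.as_str(), v.as_str()))
                .collect(),
        )
    }

    fn field(&self, field: &str) -> Option<Value<'_>> {
        // Borrow the value straight out of self: no allocation.
        self.0
            .iter()
            .find(|(k, _)| k.eq_ignore_ascii_case(field))
            .map(|(_, v)| Value::String(v))
    }
}
```

The key point is the common path: `field` returns a borrowed `Value` pointing into the native type, so a header lookup never copies anything.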
Better variables
Now that we can pass in a &http::Request directly as a variable, we have most of the pieces in place.
However, the way variables are passed in becomes problematic as well: building the `HashMap<String, Value>` of variables now becomes a bottleneck!
This was resolved by making another trait, similar to DynamicType, but for variables:
```rust
pub trait VariableResolver<'a> {
    fn resolve(&self, expr: &str) -> Option<Value<'a>>;
}
```
This means to resolve `request.headers["user-agent"]`, we essentially call:
```rust
variable_resolver.resolve("request").field("headers").field("user-agent")
```
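A resolver for the request scope might look like this sketch (the struct name, fields, and pared-down `Value` are all illustrative, not the crate's API):

```rust
// Simplified Value for illustration only.
#[derive(Debug, PartialEq)]
pub enum Value<'a> {
    String(&'a str),
    Int(i64),
}

pub trait VariableResolver<'a> {
    fn resolve(&self, name: &str) -> Option<Value<'a>>;
}

// Hypothetical per-request scope holding only borrowed data.
pub struct RequestScope<'a> {
    pub method: &'a str,
    pub path: &'a str,
}

impl<'a> VariableResolver<'a> for RequestScope<'a> {
    // No HashMap<String, Value> is built up front: each variable is
    // resolved on demand, straight off the borrowed request fields.
    fn resolve(&self, name: &str) -> Option<Value<'a>> {
        match name {
            "method" => Some(Value::String(self.method)),
            "path" => Some(Value::String(self.path)),
            _ => None,
        }
    }
}
```

Variables an expression never touches cost nothing, which replaces all the old bookkeeping about which variables to materialize.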
Throw some codegen on it
All of this was great, and some initial benchmarks showed very promising results, but there was a huge problem: implementing these traits for every single type we wanted to support was a huge amount of boilerplate, and would be a nightmare to maintain.
Fortunately, we can generate all of this with a derive macro.
Because this shares many of the same concerns as serde (like `flatten`, `rename`, `default`, etc.), we reused serde's struct attributes to keep things simple.
For example, our Request struct looked like this:
```rust
#[derive(Debug, Clone, Serialize, cel::DynamicType)]
#[serde(rename_all = "camelCase")]
pub struct RequestRef<'a> {
    /// The request's method
    #[serde(with = "http_serde::method")]
    #[dynamic(with_value = "to_value_str")]
    pub method: &'a http::Method,
    /// The request's URI
    #[serde(with = "http_serde::uri")]
    #[dynamic(with_value = "to_value_owned_string")]
    pub uri: &'a http::Uri,
    /// The request's version
    #[serde(with = "http_serde::version")]
    #[dynamic(with_value = "version_to_value")]
    pub version: http::Version,
    /// The request's headers
    #[serde(with = "http_serde::header_map")]
    pub headers: &'a http::HeaderMap,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub end_time: Option<&'a str>,
}
```
(We have a small helper to turn an `http::Request` into a `RequestRef`; it just sets up references to the request, so it is not expensive.)
The `cel::DynamicType` derive expands to something like this (showing a simple example):
```rust
pub struct BasicStruct {
    name: String,
    age: i32,
    active: bool,
}

impl DynamicType for BasicStruct {
    fn materialize(&self) -> Value<'_> {
        let mut m = ::vector_map::VecMap::with_capacity(3);
        Dynamic::materialize_into(self, &mut m);
        Value::Map(MapValue::Borrow(m))
    }
    fn field(&self, field: &str) -> Option<Value<'_>> {
        match field {
            "name" => Some(maybe_materialize(&self.name)),
            "age" => Some(maybe_materialize(&self.age)),
            "active" => Some(maybe_materialize(&self.active)),
            _ => None,
        }
    }
}
```
Putting it all together
With all of this in place, we have a full end-to-end setup to evaluate CEL expressions directly on native Rust types, without needing to convert them to `Value` first, and with the ability to borrow references to avoid unnecessary allocations.
To test this out, I wanted to compare the performance not just to the old implementation, but to doing things natively, so I setup a few benchmarks:
```rust
#[divan::bench]
fn bench_cel(b: Bencher) {
    let expr = Arc::new(Expression::new(r#"request.headers['x-example']"#).unwrap());
    let req = ::http::Request::builder()
        .method(Method::GET)
        .header("x-example", "value")
        .body(Body::empty())
        .unwrap();
    let exec = crate::cel::Executor::new_request(&req);
    b.bench(|| {
        exec.eval(&expr).unwrap();
    });
}

#[divan::bench]
fn bench_native(b: Bencher) {
    let req = ::http::Request::builder()
        .method(Method::GET)
        .header("x-example", "value")
        .body(Body::empty())
        .unwrap();
    b.bench(|| {
        divan::black_box(req.headers().get("x-example").unwrap());
    });
}
```
The results are quite surprising!
```
bench_cel       21.2 ns
bench_native    21.9 ns
```
Our CEL implementation is faster than the native approach! How can it be??
Well, the answer is that I cheated and accidentally used a poor native implementation: each lookup needs to convert the `"x-example"` string into an `http::HeaderName` for comparison;
a better approach, when we know the header name statically (which is common), is:
```rust
let hv: ::http::HeaderName = "x-example".try_into().unwrap();
b.bench(|| {
    divan::black_box(req.headers().get(&hv).unwrap());
});
```
Which completes in about 8ns on my machine.
In our CEL implementation, we automatically optimize the `request.headers['x-example']` lookup to precompute a `HeaderName` for `x-example`, which is what allows us to reach parity with the native approach.
Even without this, we are still pretty darn close to native -- just 10ns overhead!
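The shape of that optimization can be sketched with plain strings (the real code precomputes an `http::HeaderName`; here we just pre-normalize the key once, and assume stored header names are already lowercase, as the `http` crate guarantees):

```rust
// Normalize a literal header key once at expression-compile time, so
// each evaluation is a plain comparison.
pub enum HeaderKey {
    Precompiled(String), // key was a string literal in the expression
    Dynamic,             // key is only known at evaluation time
}

pub fn compile_key(literal: Option<&str>) -> HeaderKey {
    match literal {
        Some(lit) => HeaderKey::Precompiled(lit.to_ascii_lowercase()),
        None => HeaderKey::Dynamic,
    }
}

pub fn lookup<'a>(
    key: &HeaderKey,
    raw_key: Option<&str>,
    headers: &'a [(String, String)], // assumed stored lowercase
) -> Option<&'a str> {
    let owned;
    let k: &str = match key {
        HeaderKey::Precompiled(k) => k.as_str(),
        // Only the dynamic path pays the per-eval normalization cost.
        HeaderKey::Dynamic => {
            owned = raw_key?.to_ascii_lowercase();
            owned.as_str()
        }
    };
    headers
        .iter()
        .find(|(n, _)| n.as_str() == k)
        .map(|(_, v)| v.as_str())
}
```

Since expressions are compiled once and evaluated millions of times, moving any per-key work to compile time pays for itself immediately.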
Expression optimizers
I mentioned above that we have an automatic optimization for header lookups. The original CEL implementation didn't have any concept of expression optimizations, and always just traversed the raw AST.
As part of this work, I added a new optimization concept. For now, the optimizations are fairly simple but powerful:
- Any inline values are constructed once. For example, `["a", "b"]` can be precomputed; otherwise it would be rebuilt at runtime, since the AST expression and `Value` types differ.
- For things like `path.matches('^/user/[0-9]+$')` or `cidr("127.0.0.1/8").containsIP(request.headers['x-forwarded-for'])`, we can compile the regex/CIDR/etc. once when the expression is built, instead of on every evaluation.
While fairly simple, these optimizations can have a huge impact on performance, especially for more complex expressions.
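The first optimization, constant folding of inline literals, can be sketched like this (a pared-down model: the real pass also covers map literals, regexes, CIDRs, and so on):

```rust
// Sketch of constant folding: a list made only of literals is
// evaluated once at expression-compile time and replaced with a
// precomputed Value, instead of being rebuilt on every evaluation.
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    String(String),
    List(Vec<Value>),
}

pub enum Expr {
    Literal(Value),
    ListExpr(Vec<Expr>),
    Identifier(String),
}

pub fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::ListExpr(items) => {
            let folded: Vec<Expr> = items.into_iter().map(fold).collect();
            if folded.iter().all(|e| matches!(e, Expr::Literal(_))) {
                // ["a", "b"] becomes a single precomputed literal.
                let vals: Vec<Value> = folded
                    .into_iter()
                    .map(|e| match e {
                        Expr::Literal(v) => v,
                        _ => unreachable!(),
                    })
                    .collect();
                Expr::Literal(Value::List(vals))
            } else {
                Expr::ListExpr(folded)
            }
        }
        other => other,
    }
}
```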
Comparison to the old implementation
In our test suite, we have a set of real-world expressions we use as a standard benchmark suite. When compared to the original implementation, we see a huge improvement in performance, at least 5x across all cases:
| Test Case | Improvement |
|---|---|
| Regex | 589x |
| Body based routing | 16x |
| CIDR match | 5x |
| Header lookup | 7x |
| Simple access | 6x |
Even more importantly, we no longer have the expensive "build Value" step, which was the biggest bottleneck in the original implementation (there are no comparisons here, as this step is completely removed!).
Comparing a simple header lookup
| Implementation | Runtime | Allocations/operation |
|---|---|---|
| Rust (Optimized) | 21ns | 0 |
| Rust (Original) | 147ns | Didn't measure |
| Golang | 97ns | 2 |
| C++* | 347ns | Didn't measure |
* This seems unlikely to be correct, so I strongly suspect I measured it wrong.
Outside of the micro-benchmarks, I also ran a real-world test on a representative workload with about 20 expressions. End-to-end throughput rose from 250K QPS to 400K QPS (or from 86K QPS to 400K QPS if we include a regex, since regexes were so slow before)!
Overall, these changes made a huge impact on our performance and have been a big success in our adoption of CEL!