CEL is a great language (you are using it wrong)

Common Expression Language (CEL) is a great little language for embedding users' custom logic into an application. However, typically when I discuss it, I get a response something along the lines of "AHHHH!! I HATE CEL!!!"

I, too, have been in that position. However, when building Agentgateway I decided to fully embrace CEL throughout the entire stack, and the results have been great. The problems are not with CEL itself but with how it is used -- and a lot of usages are poor, giving CEL a bad reputation.

A quick intro to CEL

CEL claims to be "an expression language that's fast, portable, and safe to execute in performance-critical applications". Its general purpose is to allow users to provide short expressions that evaluate to some value (a string, a bool, etc.). Common use cases include authorization rules and filtering rules.

Some example expressions:

has(jwt.sub) || request.url.startsWith("/public")

default(request.headers["x-user-id"], "anonymous")

These are fast enough that they can run hundreds of times on a given request with minimal overhead. This means it's viable to use them liberally throughout an application.

Typically, the alternative to CEL in many of these applications is pretty terrible, making CEL even more compelling.

CEL alternatives

For example, in Envoy you can filter logs like:

filter:
  and_filter:
    filters:
      - status_code_filter:
          comparison:
            value:
              default_value: 403
              runtime_key: unused
      - header_filter:
          header:
            name: ':path'
            string_match:
              prefix: /customer/v1/privileged/

In Agentgateway (with CEL), the same filter would be response.code == 403 && request.path.startsWith("/customer/v1/privileged/") -- unquestionably simpler.

In Kubernetes, the standard way to validate configuration was through webhooks. Running these is a huge operational burden: you need to run a service, figure out certificates, etc., and the webhook is now on the hot path for configuration modifications. With CEL (which was added to Kubernetes a few years back), these rules can be defined as CEL expressions like (self.type in ['Exact','PathPrefix']) ? !self.value.endsWith('/..') : true, removing the entire webhook service requirement. (If you haven't built a Kubernetes webhook before, I can assure you it's more of a pain than it sounds.)

In Envoy, the state of the art for routing a request based on a field in the JSON request payload is to run a 5,000-line external processor server. In Agentgateway, with CEL, the same logic can be expressed with an expression like json(request.body).model == "gpt-4o" without calling out to an external service.
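To make the comparison concrete, here is roughly the logic that one-line expression stands for, written as a Python sketch. The function name routes_to_gpt4o is made up for illustration, and treating an unparsable body as a non-match is an assumption of this sketch; it is not how Agentgateway is implemented.

```python
import json


def routes_to_gpt4o(body: bytes) -> bool:
    """Illustrative mirror of: json(request.body).model == "gpt-4o"."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Assumption: a body that is not valid JSON simply does not match.
        return False
    # Only a JSON object can carry a "model" field.
    return isinstance(payload, dict) and payload.get("model") == "gpt-4o"
```

Even this small function needs to live somewhere, be deployed, and be reachable from the proxy, which is the overhead the inline expression avoids.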

CEL gone wrong

So, if CEL is so great, why does everyone seem to hate it?

In my mind, this is all due to the usage. Because CEL is embedded into applications, how it's embedded matters a great deal. It's entirely up to the application how a user can define CEL expressions, what data those expressions have access to, and what functions they can use. While CEL provides a modest standard library, most applications will benefit from a much broader set.

CEL in Kubernetes CRDs

Most CEL hate I see comes from users familiar with its usage in Kubernetes CRD validation. I want to use this as an example of how an integration's choices can lead to a poor user experience.

Note: I think CEL is a great feature in Kubernetes, and it's in fact why it gets so much scrutiny; a less useful feature would be silently ignored! This post isn't meant to disparage any of the great work done by the team.

Embedding

First, we need to look at how most developers end up interfacing with CEL expressions on CRDs: they write each CEL expression as a string inside a single-line Go comment.

This means our relatively simple expression above actually looks like this:

// +kubebuilder:validation:XValidation:message="must not end with '/..' when type one of ['Exact', 'PathPrefix']",rule="(self.type in ['Exact','PathPrefix']) ? !self.value.endsWith('/..') : true"

I'll consider the fact that this renders terribly in the post a feature, not a bug; this is exactly the kind of pain that embedding a language in a string in a comment causes.

This experience sucks. I cannot use line breaks to nicely format larger expressions. My IDE doesn't support syntax highlighting or syntax validation of code in a comment. If I want to test an expression, I really have no options.
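The layering is easy to see if you try to get the expression back out, which is the first step any external testing would need. A minimal Python sketch, assuming the marker sits on one line and carries the rule in a double-quoted rule= argument as in the example above; extract_rule is a hypothetical helper, not part of any Kubernetes tooling, and it does not unescape anything.

```python
import re

# The marker from above, as it appears in a Go source file.
MARKER = (
    "// +kubebuilder:validation:XValidation:"
    "message=\"must not end with '/..' when type one of ['Exact', 'PathPrefix']\","
    "rule=\"(self.type in ['Exact','PathPrefix']) ? !self.value.endsWith('/..') : true\""
)

# Matches rule="..." and tolerates backslash-escaped characters inside the quotes.
RULE_PATTERN = re.compile(r'rule="((?:[^"\\]|\\.)*)"')


def extract_rule(marker: str) -> str:
    """Return the raw CEL expression embedded in an XValidation marker comment."""
    match = RULE_PATTERN.search(marker)
    if match is None:
        raise ValueError("no rule= argument found in marker")
    return match.group(1)


print(extract_rule(MARKER))
```

That a regular expression is needed just to recover the expression is a fair summary of the problem.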

Costs

The next problem is the Kubernetes-specific cost limits on expressions. Sensibly, Kubernetes does not want users to be able to write incredibly complex CEL expressions that bring down the API server. To achieve this, it assigns a cost to each expression and ensures the total cost of all expressions does not exceed some limit.

This all sounds great in theory, but in practice it is a massive pain.

A major annoyance is that the cost is extremely opaque. There is no good tooling to understand the cost of a given expression, or the total budget used. Extremely subtle changes to the expression (even ones that are functionally identical) can dramatically change the cost.

This is made worse by the cost function being unstable from release to release (though the project seems to be actively working on this, and it is improving). This means that even if you run tests to ensure you are under the budget, the results may not hold unless you run them across a wide range of Kubernetes versions.

Worst of all, however, is how the costs are calculated.

Consider the following API that allows defining rules on how to match HTTP traffic (pulled from HTTPRoute):

rules:
- matches:
  - headers:
    - name: x-my-header
      value: bar

Note this is a triply nested list. When deciding the cost of CEL expressions operating on this data, Kubernetes is extremely conservative and assumes each list could have a million elements. Any CEL expression operating on a header (even something absolutely trivial like name.size() < 5) is therefore assumed to run a million cubed times -- a whopping quintillion -- so surely it must be rejected for exceeding the cost budget.

Of course, in practice, this is impossible. Well before a user hits a million elements in the list they would exceed the overall maximum object size (enforced by the size of the JSON object, not related to CEL). Even if one of the 3 lists had 1,000 elements, the worst case scenario of each nested list having 1,000 elements is unlikely; a user may have 1,000 rules, or 1,000 matches, but they are quite unlikely to have 1,000 rules each with 1,000 matches; even if they did, they would still have a hard limit due to the JSON size constraints.

What this means, in practice, is that to use CEL expressions at all, arbitrary limits must be placed on any list/map fields in an API. For existing APIs, this generally means a breaking change to reduce the limit. For all APIs, this means a rough user experience when users run into the arbitrary limits. No one likes to be told they had 9 header matches but your API decided to limit to only 8!
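The arithmetic behind both the rejection and the forced limits is just a product of list sizes. The sketch below is a deliberately simplified model, not the estimator Kubernetes actually uses: it assumes a rule on the innermost element runs once per element, takes the rough one-million figure from above at face value, and uses the cap of 8 from the example. The function name is made up for illustration.

```python
from math import prod


def worst_case_evaluations(max_items_per_level):
    """Upper bound on how often a rule on the innermost element can run.

    Simplified model: the bound is the product of the assumed maximum
    length of every enclosing list.
    """
    return prod(max_items_per_level)


# rules -> matches -> headers, each list assumed to hold up to a million items.
UNBOUNDED = worst_case_evaluations([1_000_000] * 3)

# The same three lists, each capped at 8 items.
CAPPED = worst_case_evaluations([8, 8, 8])

print(f"{UNBOUNDED:.0e}")  # 1e+18
print(CAPPED)              # 512
```

Under this model, the only lever an API author has is the cap on each list, which is why the limits end up in the API whether or not users want them.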

Evolving an API is even more problematic. Consider the API above: suppose I cap each list at 32 elements, a number that would likely put me at about 50% of the budget with a few CEL expressions (and is still low enough to annoy users!). Now I want to add a field... and the cost blows through the budget!

All possible options are nonviable:

Removing validation rules (presumably, they were there for a reason!)
Not adding the new API field
Lowering the limits more (breaking existing users, and making it more likely to annoy future users)

Functions

As previously mentioned, without custom functions, CEL is pretty limited. Kubernetes provides an ever-evolving set of additional functions offered to CEL expressions.

Unfortunately, most CRD authors need to support many Kubernetes versions, meaning it will be 2+ years before any new function is usable. During that time, it is tricky to avoid accidentally using the new functions, which can easily lead to breakages.

How to make CEL great

To make CEL as good as it can be, I would recommend the following:

Make users write CEL expressions in a reasonable input form. Do not make users write them inside of Go comments. Do not make them write them inside of JSON strings and be forced to write expressions like "request.path == \"/i/hate/escaping\"". Usage in YAML is not too bad, as users can use blocks to avoid escaping and allow newlines:

expression: |
  request.path == "/my/path" ||
  request.method == "GET"
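The escaping cost of the JSON form is easy to reproduce. A small Python sketch using only the standard library; the "expression" key is just a placeholder for whatever field an application would use.

```python
import json

expression = 'request.path == "/i/hate/escaping"'

# Embedding the expression as a JSON string forces every inner quote to be escaped.
as_json = json.dumps({"expression": expression})
print(as_json)
# {"expression": "request.path == \"/i/hate/escaping\""}

# Round-tripping recovers the original, so the escaping is pure noise for the author.
assert json.loads(as_json)["expression"] == expression
```

The YAML block form above carries the same string with no escaping at all, which is the whole argument for preferring it.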

Define a useful set of functions. These should cover things specific to the domain the application runs in, but (unfortunately) the set will likely also need some functions to fill gaps in the standard library. For example, Agentgateway provides a mapValues function to apply a function to the values in a map, as CEL only offers a map function which applies to the keys in the map.
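The gap is easier to see side by side. In the Python sketch below, cel_style_map imitates how the standard map macro treats a map (the function sees each key and the result is a list), while map_values shows the behavior assumed here for mapValues, based only on the description above: same keys, transformed values. Both helper names and the sample data are made up for illustration.

```python
def cel_style_map(m, fn):
    """Like CEL's map macro on a map: fn sees each key, the result is a list."""
    return [fn(key) for key in m]


def map_values(m, fn):
    """Assumed mapValues behavior: same keys, fn applied to each value."""
    return {key: fn(value) for key, value in m.items()}


headers = {"x-user-id": "Alice", "x-team": "Platform"}

print(cel_style_map(headers, str.upper))  # ['X-USER-ID', 'X-TEAM']
print(map_values(headers, str.lower))     # {'x-user-id': 'alice', 'x-team': 'platform'}
```

Without something like the second helper, a user who wants to normalize header values has to rebuild the map by hand.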

Document the environment. The set of functions and data available to expressions should be clearly laid out with examples.

Provide a playground and a way to test. This gives users a REPL-like experience for developing expressions, and a way to test them. While there is a general CEL Playground available, each environment runs in a different context with different functions, so its use is limited in many cases. Additionally, there is no way for a user to run automated tests against their expressions with it.

Don't play games with costs. If expressions are going to be limited in some form, such as by cost, the limits should be applied in a manner that doesn't hurt the user.

With all of these in place, CEL can be a huge enhancement to a project's capabilities and provide a great user experience.
