For years I dismissed conversations around AI workloads needing new networking infrastructure. They are, after all, "just using HTTP", which standard service meshes and gateways are quite good at.
However, I have recently come to realize I was a little right but mostly wrong. AI workloads still have all the same requirements to secure, connect, and observe traffic, but the how is different. AI workloads exacerbate these requirements, putting additional demands on the networking infrastructure.
So you still need a service mesh... you just need a better one!
AI Demands
Overall I would group AI infrastructure into three areas:
- LLM consumption (e.g. generating a completion from OpenAI)
- LLM hosting (e.g. running an LLM on your own machines)
- AI protocol usage (e.g. hosting or consuming MCP or A2A applications)
Each of these puts additional demands on the infrastructure.
Iteration speed
Service meshes have never really done much that is terribly innovative. Anything a service mesh does could be done by the application itself -- it's just really hard and tedious!
With AI workloads, this is even more important. It's one thing to spend months (or even years) rolling your own infrastructure for traditional services, but this approach is doomed with AI. With the rapid pace of the AI world, spending months getting policy and observability working for your AI workloads puts you miles behind competitors.
Worse, the space changes so fast that by the time you adopt the previous standards, a new one has emerged!
Being able to spend time on innovation, and leaving the boring parts to the infrastructure, is more important than ever.
Egress controls
Nearly all organizations have their ingress traffic under control, and these days most are also managing service-to-service traffic (i.e. some form of service mesh). However, egress controls are often non-existent or rudimentary (simple NAT, etc.).
LLM consumption is completely changing the status quo, and putting a huge demand on egress controls for a variety of reasons.
LLMs are extremely expensive, so monitoring and controlling costs is important. Additionally, these services are unreliable at best -- and you probably want to make sure your AI model didn't decide it was a good idea to leak credit card numbers, or to insult a customer.
Traditional HTTP-based approaches fall short here, as they operate on requests. LLMs operate on tokens, meaning your infrastructure ought to as well so that you can effectively track costs and control traffic. Additionally, requirements such as modifying or blocking inappropriate content (often called "guardrails") require awareness of LLM protocols.
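As a rough illustration, here is a minimal sketch (in Python) of what a token-aware egress filter might do with an OpenAI-style chat completion response: read the usage field to attribute spend, and apply a naive guardrail to the body. The pricing table, team names, and the filter_completion hook are illustrative assumptions, not any particular gateway's API.

```python
# Minimal sketch of a token-aware egress filter for OpenAI-style completions.
# Prices, team names, and the hook itself are illustrative assumptions.
import re
from collections import defaultdict

# Assumed per-1K-token prices; real pricing varies by model and provider.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # naive credit-card guardrail

spend_by_team = defaultdict(float)  # running cost per team, in dollars

def filter_completion(team: str, response: dict) -> dict:
    """Track token spend and apply a simple guardrail to a completion response."""
    usage = response.get("usage", {})
    cost = (usage.get("prompt_tokens", 0) / 1000 * PRICE_PER_1K["prompt"]
            + usage.get("completion_tokens", 0) / 1000 * PRICE_PER_1K["completion"])
    spend_by_team[team] += cost

    for choice in response.get("choices", []):
        text = choice.get("message", {}).get("content", "")
        if CARD_PATTERN.search(text):
            # Block (or redact) responses that look like they leak card numbers.
            choice["message"]["content"] = "[blocked by guardrail]"
    return response
```

The point is less the specific checks and more that none of this is possible if your infrastructure only sees opaque HTTP requests.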
Observability
While observability is pretty important for traditional infrastructure, it's even more important for AI workloads due to their non-deterministic behavior and cost.
While very few teams would be comfortable with the cost of tracing 100% of requests for traditional services -- and those that do certainly don't include the HTTP body -- this is practically a requirement when deploying AI infrastructure.
When a single request can cost almost $1, the cost to store 1 trace span is well worth the visibility.
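A rough back-of-the-envelope calculation (with illustrative prices, not any provider's actual rates) shows why: a single large-context request can plausibly approach $1, while storing even a generous trace span costs a tiny fraction of a cent.

```python
# Back-of-the-envelope comparison; all prices are illustrative assumptions.
input_tokens, output_tokens = 80_000, 2_000        # a large-context request
price_in, price_out = 10 / 1e6, 30 / 1e6           # $ per token (~$10 / $30 per 1M)
request_cost = input_tokens * price_in + output_tokens * price_out  # ~$0.86

span_bytes = 20_000                                # span with full prompt/response attached
storage_cost = span_bytes / 1e9 * 0.10             # ~$0.000002 at ~$0.10 per GB-month
print(f"request ~${request_cost:.2f}, span storage ~${storage_cost:.6f}")
```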
As AI "agents" gain more autonomy and operate more and more dynamically, understanding the flow of requests is even more critical. You really want to know why your customer-service-agent decided to call the bank-account service!
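As a sketch of what that could look like, the snippet below records a full span per LLM call using the OpenTelemetry Python API, attaching the prompt and token usage. The call_llm client and the response shape are assumptions, and the attribute names only loosely follow the emerging GenAI semantic conventions.

```python
# Sketch: recording a full trace span per LLM call with the OpenTelemetry API.
# call_llm and the attribute names are illustrative assumptions.
from opentelemetry import trace

tracer = trace.get_tracer("ai-egress")

def traced_completion(call_llm, model: str, prompt: str):
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.prompt", prompt)      # full body -- worth it at these costs
        response = call_llm(model=model, prompt=prompt)  # assumed caller-provided client
        usage = response.get("usage", {})
        span.set_attribute("gen_ai.usage.input_tokens", usage.get("prompt_tokens", 0))
        span.set_attribute("gen_ai.usage.output_tokens", usage.get("completion_tokens", 0))
        return response
```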
AI protocols
Traditional networking infrastructure generally operates on HTTP, and usually not even on HTTP bodies but just headers and paths. For whatever reason, basically all AI protocols are built around JSON-RPC and SSE streams, which traditional proxies like Envoy are not able to handle effectively.
However, the demand to control this traffic is as high as ever. For example, when an LLM autonomously invokes an MCP tool, we really want to make sure we are doing fine-grained authorization. Native support for these protocols, therefore, becomes a table-stakes requirement for networking infrastructure.
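As a sketch of what fine-grained authorization could look like at the gateway, the snippet below inspects a JSON-RPC tools/call message and checks the requested tool against a per-identity allowlist. The policy table and identity handling are illustrative assumptions, not part of the MCP specification.

```python
# Sketch of fine-grained authorization for MCP tool calls at the gateway.
# The policy table and identity extraction are illustrative assumptions;
# "tools/call" is the JSON-RPC method MCP uses for tool invocation.
import json

# Which tools each workload identity may invoke.
TOOL_POLICY = {
    "customer-service-agent": {"lookup_order", "send_email"},
    "billing-agent": {"lookup_order", "charge_card"},
}

def authorize(identity: str, raw_body: bytes) -> bool:
    """Return True if this JSON-RPC message is allowed for the caller."""
    msg = json.loads(raw_body)
    if msg.get("method") != "tools/call":
        return True  # only tool invocations are policy-checked in this sketch
    tool = msg.get("params", {}).get("name")
    return tool in TOOL_POLICY.get(identity, set())

# Example: the customer-service agent may not charge cards.
body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                   "params": {"name": "charge_card", "arguments": {"amount": 5}}}).encode()
assert authorize("customer-service-agent", body) is False
```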
Inference hosting
Typical gateways use load balancing algorithms that are optimized for quickly selecting a backend that is "good enough". When hosting your own LLMs, this is often a poor choice.
Given the astronomical cost and limited supply of GPUs, optimal utilization is critical. Spending 10ms in the load balancer to pick the optimal backend is well worth the latency if it increases GPU utilization by 10%, for example.
Projects like the Gateway API Inference Extension handle this by offering a customized, inference-aware load balancer purpose-built for inference hosting.
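To make the idea concrete, here is a deliberately simplified sketch of inference-aware endpoint picking: score each replica on queue depth and KV-cache utilization instead of round-robin. The metrics and weighting are assumptions for illustration, not the Inference Extension's actual algorithm.

```python
# Simplified illustration of inference-aware endpoint picking: rather than
# round-robin, score replicas on queue depth and KV-cache utilization.
# The metrics and weights are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int        # requests waiting on this replica
    kv_cache_util: float    # 0.0 - 1.0, fraction of KV cache in use

def pick_replica(replicas: list[Replica]) -> Replica:
    """Prefer replicas with short queues and free KV cache."""
    def score(r: Replica) -> float:
        return r.queue_depth + 10 * r.kv_cache_util  # weight is an illustrative knob
    return min(replicas, key=score)

replicas = [Replica("vllm-0", queue_depth=4, kv_cache_util=0.9),
            Replica("vllm-1", queue_depth=6, kv_cache_util=0.2),
            Replica("vllm-2", queue_depth=1, kv_cache_util=0.95)]
print(pick_replica(replicas).name)  # vllm-1: longer queue, but far more cache headroom
```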
The state of infrastructure
From the above, hopefully you will agree there is huge value in leaning on your networking infrastructure for AI workloads... the problem is, how do you do it?
Currently, the ecosystem is pretty immature here.
While "AI Gateways" or "LLM Gateways" have exploded in 2025, the majority of them are not serious, enterprise-ready offerings. Additionally, they each solve only one targeted problem; equally important to functionality is integration. You probably don't want to run a different gateway for HTTP, LLM, inference, and MCP. If you do, they really ought to work together and integrate with the rest of your infrastructure (service mesh, IDP, telemetry backends, etc.), which is unlikely.
I believe the key to success in this space will be a cohesive offering bundling these use cases together.