This workshop demonstrates how to route embedding requests through Agent Gateway to a self-hosted vLLM server running an OpenAI-compatible API. This is the Agent Gateway equivalent of a LiteLLM config like:

```yaml
- model_name: qwen3
  litellm_params:
    model: hosted_vllm//apps/ecs_mounts/data/q3.6b
    api_key: $VLLM_API_KEY
    api_base: $VLLM_HOST_URL
    mode: embedding
```

Prerequisites:

- An EKS cluster with Solo Enterprise Agent Gateway installed (a quick check follows below)
- `kubectl` configured and connected to the cluster
- A vLLM server accessible from the cluster (this workshop uses a mock server for demonstration)
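To confirm the Agent Gateway installation before starting, you can check that its CRDs are present; this simply lists CRDs from the `agentgateway.dev` API group used throughout this workshop:

```bash
# List the Agent Gateway CRDs (e.g. AgentgatewayBackend, AgentgatewayPolicy)
kubectl get crd | grep agentgateway.dev
```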
Skip this step if you already have a vLLM server running. Replace the host/port in later steps with your actual vLLM endpoint.
Deploy a mock server that implements the OpenAI-compatible /v1/embeddings endpoint:
```bash
kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mock-vllm
  labels:
    app: mock-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mock-vllm
  template:
    metadata:
      labels:
        app: mock-vllm
    spec:
      containers:
      - name: server
        image: python:3.11-slim
        command:
        - python3
        - -c
        - |
          from http.server import HTTPServer, BaseHTTPRequestHandler
          import json, hashlib

          class Handler(BaseHTTPRequestHandler):
              def do_POST(self):
                  # Log the full request so we can verify what the gateway forwards upstream
                  length = int(self.headers.get('Content-Length', 0))
                  body = self.rfile.read(length).decode() if length else ''
                  print("\n=== REQUEST ===", flush=True)
                  print(f"Path: {self.path}", flush=True)
                  print("Headers:", flush=True)
                  for k, v in self.headers.items():
                      print(f"  {k}: {v}", flush=True)
                  print(f"Body: {body[:500]}", flush=True)
                  if '/v1/embeddings' in self.path:
                      req = json.loads(body) if body else {}
                      text = req.get('input', 'default')
                      if isinstance(text, list):
                          text = text[0]
                      # Derive a deterministic 10-dimensional fake embedding from a hash of the input
                      h = hashlib.sha256(str(text).encode()).hexdigest()
                      embedding = [int(h[i:i+2], 16) / 255.0 - 0.5 for i in range(0, 20, 2)]
                      resp = json.dumps({
                          "object": "list",
                          "data": [{"object": "embedding", "index": 0, "embedding": embedding}],
                          "model": req.get("model", "qwen3"),
                          "usage": {"prompt_tokens": len(str(text).split()), "total_tokens": len(str(text).split())}
                      })
                      print(f"Response: embedding with {len(embedding)} dims", flush=True)
                  else:
                      resp = json.dumps({"error": "unknown endpoint", "path": self.path})
                  print("===============", flush=True)
                  self.send_response(200)
                  self.send_header('Content-Type', 'application/json')
                  self.end_headers()
                  self.wfile.write(resp.encode())

              def do_GET(self):
                  self.do_POST()

          print("Mock vLLM server listening on :8000", flush=True)
          HTTPServer(('', 8000), Handler).serve_forever()
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: mock-vllm
  labels:
    app: mock-vllm
spec:
  selector:
    app: mock-vllm
  ports:
  - name: http
    port: 8000
    targetPort: 8000
EOF
```

Wait for the pod to be ready:
```bash
kubectl wait --for=condition=ready pod -l app=mock-vllm \
  -n enterprise-agentgateway --timeout=60s
```
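Optionally, sanity-check the mock server from inside the cluster before wiring up the gateway. The throwaway curl pod below is just one convenient way to do this; it calls the `mock-vllm` Service directly on port 8000:

```bash
# Run a one-off curl pod that posts a test embeddings request to the mock server
kubectl run curl-test -n enterprise-agentgateway --rm -it --restart=Never \
  --image=curlimages/curl --command -- curl -s -X POST \
  http://mock-vllm:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3", "input": "hello"}'
```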
Store the vLLM API key in a Kubernetes secret. Agent Gateway will attach this as a Bearer token on upstream requests.

```bash
kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: vllm-secret
type: Opaque
stringData:
  Authorization: <your VLLM_API_KEY value>
EOF
```
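Equivalently, the same Opaque Secret can be created imperatively from the environment variable that the LiteLLM config references:

```bash
# Creates the same Secret with an Authorization key, reading the value from $VLLM_API_KEY
kubectl create secret generic vllm-secret -n enterprise-agentgateway \
  --from-literal=Authorization="$VLLM_API_KEY"
```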
Configure an OpenAI-compatible backend pointing to the vLLM server. The key settings are:

- `provider.openai.model`: the vLLM model identifier (matches the LiteLLM `model` field after the provider prefix)
- `provider.host` / `provider.port`: the vLLM server address (maps to LiteLLM `api_base`)
- `provider.path`: set to `/v1/embeddings` to route to the embeddings endpoint (maps to LiteLLM `mode: embedding`)
```bash
kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: qwen3
  namespace: enterprise-agentgateway
spec:
  ai:
    provider:
      openai:
        model: /apps/ecs_mounts/data/q3.6b
      host: mock-vllm.enterprise-agentgateway.svc.cluster.local
      port: 8000
      path: "/v1/embeddings"
  policies:
    auth:
      secretRef:
        name: vllm-secret
EOF
```

For a real vLLM server, replace `host` and `port` with your actual vLLM endpoint. If it uses HTTPS, add `policies.tls.sni: <hostname>`.
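As a sketch of that variant, assuming the same field layout as the backend above (the hostname and port are placeholders for your environment):

```yaml
spec:
  ai:
    provider:
      openai:
        model: /apps/ecs_mounts/data/q3.6b
      host: vllm.example.internal   # placeholder: your vLLM hostname
      port: 443
      path: "/v1/embeddings"
  policies:
    auth:
      secretRef:
        name: vllm-secret
    tls:
      sni: vllm.example.internal    # SNI for the TLS connection to the upstream
```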
Create an HTTPRoute that exposes the backend on the `/qwen3` path of the gateway:

```bash
kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen3
  namespace: enterprise-agentgateway
spec:
  parentRefs:
  - name: agentgateway
    namespace: enterprise-agentgateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /qwen3
    backendRefs:
    - name: qwen3
      namespace: enterprise-agentgateway
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```

By default, Agent Gateway treats all AI backend traffic as `Completions` and parses the request body for `messages`. Since embeddings requests use `input` instead of `messages`, we need to configure the route type as `Passthrough` for the embeddings path so the gateway forwards the request body as-is.
```bash
kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: qwen3-embeddings
  namespace: enterprise-agentgateway
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: qwen3
  backend:
    ai:
      routes:
        "/v1/chat/completions": "Completions"
        "/v1/embeddings": "Passthrough"
        "/v1/models": "Passthrough"
        "*": "Passthrough"
EOF
```

Verify all resources are accepted:
```bash
kubectl get agentgatewaybackend qwen3 -n enterprise-agentgateway \
  -o jsonpath='{.status.conditions[0].reason}' && echo ""
kubectl get httproute qwen3 -n enterprise-agentgateway \
  -o jsonpath='{.status.parents[0].conditions[0].reason}' && echo ""
kubectl get agentgatewaypolicy qwen3-embeddings -n enterprise-agentgateway \
  -o jsonpath='{.status.ancestors[0].conditions[0].reason}' && echo ""
```

Expected: all three show `Accepted` (or `Valid`).
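The request below assumes the gateway is reachable at `localhost:8080`. If it is not exposed outside the cluster, you can port-forward it in a separate terminal; the Service name and port here are assumptions based on the `agentgateway` Gateway referenced in the HTTPRoute, so adjust them to match your installation:

```bash
# Forward local port 8080 to the gateway Service (name/port assumed; adjust as needed)
kubectl port-forward -n enterprise-agentgateway svc/agentgateway 8080:8080
```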
Send an embeddings request:
```bash
curl -s -X POST "http://localhost:8080/qwen3" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/apps/ecs_mounts/data/q3.6b",
    "input": "The food was delicious and the waiter was friendly.",
    "encoding_format": "float"
  }' | jq
```

Expected response:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.217, -0.127, -0.437, 0.025, ...]
    }
  ],
  "model": "/apps/ecs_mounts/data/q3.6b",
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
```

Verify the upstream server received the correct path and API key:
```bash
kubectl logs -n enterprise-agentgateway -l app=mock-vllm -c server --tail=15
```

Expected:
```
=== REQUEST ===
Path: /v1/embeddings
Headers:
  content-type: application/json
  authorization: Bearer <your-api-key>
  host: mock-vllm.enterprise-agentgateway.svc.cluster.local
Body: {"model": "/apps/ecs_mounts/data/q3.6b", "input": "The food was delicious..."}
Response: embedding with 10 dims
===============
```
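If the mock server log shows no traffic, the gateway proxy's own logs are the next place to look. The Deployment name below is an assumption based on the `agentgateway` Gateway name; substitute whatever name your installation uses:

```bash
# Find the gateway proxy pod and inspect its recent logs (Deployment name assumed)
kubectl get pods -n enterprise-agentgateway
kubectl logs -n enterprise-agentgateway deploy/agentgateway --tail=50
```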
Clean up the workshop resources when you are done:

```bash
kubectl delete agentgatewaypolicy qwen3-embeddings -n enterprise-agentgateway
kubectl delete httproute qwen3 -n enterprise-agentgateway
kubectl delete agentgatewaybackend qwen3 -n enterprise-agentgateway
kubectl delete secret vllm-secret -n enterprise-agentgateway
kubectl delete deployment mock-vllm -n enterprise-agentgateway
kubectl delete svc mock-vllm -n enterprise-agentgateway
```

The LiteLLM settings map to Agent Gateway resources as follows:

| LiteLLM Config | Agent Gateway Equivalent |
|---|---|
| `model_name: qwen3` | `AgentgatewayBackend` name + HTTPRoute path `/qwen3` |
| `model: hosted_vllm//<path>` | `ai.provider.openai.model: <path>` |
| `api_key: $VLLM_API_KEY` | Secret with `policies.auth.secretRef` |
| `api_base: $VLLM_HOST_URL` | `ai.provider.host` + `ai.provider.port` |
| `mode: embedding` | `ai.provider.path: "/v1/embeddings"` + AgentgatewayPolicy with `routes: {"/v1/embeddings": "Passthrough"}` |
Key takeaways:

- OpenAI-compatible provider: use `ai.provider.openai` with a custom `host`/`port` to point to any OpenAI-compatible server (vLLM, Ollama, LM Studio, etc.)
- Path override: `ai.provider.path` controls which API endpoint the gateway rewrites to on the upstream server
- Route types via AgentgatewayPolicy: by default the gateway parses requests as chat completions; for embeddings, set the route to `Passthrough` so the request body is forwarded as-is without parsing for `messages`
- Auth injection: the gateway automatically attaches the API key from the referenced secret as a `Bearer` token on upstream requests