Self-Hosted vLLM Embeddings with Agent Gateway

This workshop demonstrates how to route embedding requests through Agent Gateway to a self-hosted vLLM server running an OpenAI-compatible API. This is the Agent Gateway equivalent of a LiteLLM config like:

- model_name: qwen3
  litellm_params:
    model: hosted_vllm//apps/ecs_mounts/data/q3.6b
    api_key: $VLLM_API_KEY
    api_base: $VLLM_HOST_URL
    mode: embedding
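
For reference, in a complete LiteLLM config.yaml this entry would normally live under a model_list block. The surrounding structure below is illustrative; only the qwen3 entry itself comes from the snippet above.

model_list:
  - model_name: qwen3
    litellm_params:
      model: hosted_vllm//apps/ecs_mounts/data/q3.6b
      api_key: $VLLM_API_KEY
      api_base: $VLLM_HOST_URL
      mode: embedding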

Prerequisites

  • An EKS cluster with Solo Enterprise Agent Gateway installed
  • kubectl configured and connected to the cluster
  • A vLLM server accessible from the cluster (this workshop uses a mock server for demonstration)
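
To confirm the gateway installation is present before starting, you can list its pods and Gateway resource. The namespace matches the one used throughout this workshop; adjust it if yours differs:

kubectl get pods -n enterprise-agentgateway
kubectl get gateway agentgateway -n enterprise-agentgateway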

Step 1: Deploy a Mock vLLM Server (for testing)

Skip this step if you already have a vLLM server running. Replace the host/port in later steps with your actual vLLM endpoint.

Deploy a mock server that implements the OpenAI-compatible /v1/embeddings endpoint:

kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mock-vllm
  labels:
    app: mock-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mock-vllm
  template:
    metadata:
      labels:
        app: mock-vllm
    spec:
      containers:
      - name: server
        image: python:3.11-slim
        command:
        - python3
        - -c
        - |
          from http.server import HTTPServer, BaseHTTPRequestHandler
          import json, sys, hashlib

          class Handler(BaseHTTPRequestHandler):
              def do_POST(self):
                  length = int(self.headers.get('Content-Length', 0))
                  body = self.rfile.read(length).decode() if length else ''
                  print(f"\n=== REQUEST ===", flush=True)
                  print(f"Path: {self.path}", flush=True)
                  print(f"Headers:", flush=True)
                  for k, v in self.headers.items():
                      print(f"  {k}: {v}", flush=True)
                  print(f"Body: {body[:500]}", flush=True)
                  if '/v1/embeddings' in self.path:
                      req = json.loads(body) if body else {}
                      text = req.get('input', 'default')
                      if isinstance(text, list):
                          text = text[0]
                      h = hashlib.sha256(str(text).encode()).hexdigest()
                      embedding = [int(h[i:i+2], 16) / 255.0 - 0.5 for i in range(0, 20, 2)]
                      resp = json.dumps({
                          "object": "list",
                          "data": [{"object": "embedding", "index": 0, "embedding": embedding}],
                          "model": req.get("model", "qwen3"),
                          "usage": {"prompt_tokens": len(str(text).split()), "total_tokens": len(str(text).split())}
                      })
                  else:
                      resp = json.dumps({"error": "unknown endpoint", "path": self.path})
                  print(f"===============", flush=True)
                  self.send_response(200)
                  self.send_header('Content-Type', 'application/json')
                  self.end_headers()
                  self.wfile.write(resp.encode())
              def do_GET(self):
                  self.do_POST()

          print("Mock vLLM server listening on :8000", flush=True)
          HTTPServer(('', 8000), Handler).serve_forever()
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: mock-vllm
  labels:
    app: mock-vllm
spec:
  selector:
    app: mock-vllm
  ports:
  - name: http
    port: 8000
    targetPort: 8000
EOF

Wait for the pod to be ready:

kubectl wait --for=condition=ready pod -l app=mock-vllm \
  -n enterprise-agentgateway --timeout=60s
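
Optionally, hit the mock server directly before involving the gateway. This uses a throwaway pod; the curlimages/curl image and the curl-test pod name are only illustrative:

kubectl run curl-test --rm -i --restart=Never -n enterprise-agentgateway \
  --image=curlimages/curl --command -- \
  curl -s -X POST http://mock-vllm:8000/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3", "input": "hello"}'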

Step 2: Create an API Key Secret

Store the vLLM API key in a Kubernetes secret. Agent Gateway will attach this as a Bearer token on upstream requests.

kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: vllm-secret
type: Opaque
stringData:
  Authorization: <your VLLM_API_KEY value>
EOF
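
Equivalently, if the key is exported in your shell as VLLM_API_KEY, you can create the secret without writing it into a manifest:

kubectl create secret generic vllm-secret -n enterprise-agentgateway \
  --from-literal=Authorization="$VLLM_API_KEY"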

Step 3: Create the AgentgatewayBackend

Configure an OpenAI-compatible backend pointing to the vLLM server. The key settings are:

  • provider.openai.model — the vLLM model identifier (matches the LiteLLM model field after the provider prefix)
  • provider.host / provider.port — the vLLM server address (maps to LiteLLM api_base)
  • provider.path — set to /v1/embeddings to route to the embeddings endpoint (maps to LiteLLM mode: embedding)

kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: qwen3
  namespace: enterprise-agentgateway
spec:
  ai:
    provider:
      openai:
        model: /apps/ecs_mounts/data/q3.6b
      host: mock-vllm.enterprise-agentgateway.svc.cluster.local
      port: 8000
      path: "/v1/embeddings"
  policies:
    auth:
      secretRef:
        name: vllm-secret
EOF

For a real vLLM server, replace host and port with your actual vLLM endpoint. If it uses HTTPS, add policies.tls.sni: <hostname>.
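
As a sketch only, a backend for a real HTTPS vLLM endpoint might look like the following; vllm.example.com and port 443 are placeholders, and the tls block follows the policies.tls.sni hint above:

apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: qwen3
  namespace: enterprise-agentgateway
spec:
  ai:
    provider:
      openai:
        model: /apps/ecs_mounts/data/q3.6b
      host: vllm.example.com   # placeholder: your vLLM hostname
      port: 443                # placeholder: your vLLM port
      path: "/v1/embeddings"
  policies:
    tls:
      sni: vllm.example.com    # only needed for HTTPS upstreams
    auth:
      secretRef:
        name: vllm-secret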

Step 4: Create an HTTPRoute

kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen3
  namespace: enterprise-agentgateway
spec:
  parentRefs:
    - name: agentgateway
      namespace: enterprise-agentgateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /qwen3
    backendRefs:
    - name: qwen3
      namespace: enterprise-agentgateway
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF

Step 5: Create an AgentgatewayPolicy with Route Types

By default, Agent Gateway treats all AI backend traffic as Completions and parses the request body for messages. Since embeddings requests use input instead of messages, we need to configure the route type as Passthrough for the embeddings path so the gateway forwards the request body as-is.

kubectl apply -n enterprise-agentgateway -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: qwen3-embeddings
  namespace: enterprise-agentgateway
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: qwen3
  backend:
    ai:
      routes:
        "/v1/chat/completions": "Completions"
        "/v1/embeddings": "Passthrough"
        "/v1/models": "Passthrough"
        "*": "Passthrough"
EOF

Verify all resources are accepted:

kubectl get agentgatewaybackend qwen3 -n enterprise-agentgateway \
  -o jsonpath='{.status.conditions[0].reason}' && echo ""
kubectl get httproute qwen3 -n enterprise-agentgateway \
  -o jsonpath='{.status.parents[0].conditions[0].reason}' && echo ""
kubectl get agentgatewaypolicy qwen3-embeddings -n enterprise-agentgateway \
  -o jsonpath='{.status.ancestors[0].conditions[0].reason}' && echo ""

Expected: all three show Accepted (or Valid).

Step 6: Test Embeddings
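
The request below assumes the gateway is reachable on localhost:8080. If it is not already exposed, port-forward it first; the Service name agentgateway and port 8080 are assumptions, so adjust them to your installation:

kubectl port-forward -n enterprise-agentgateway svc/agentgateway 8080:8080 &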

Send an embeddings request:

curl -s -X POST "http://localhost:8080/qwen3" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/apps/ecs_mounts/data/q3.6b",
    "input": "The food was delicious and the waiter was friendly.",
    "encoding_format": "float"
  }' | jq

Expected response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.217, -0.127, -0.437, 0.025, ...]
    }
  ],
  "model": "/apps/ecs_mounts/data/q3.6b",
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
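
The embeddings API also accepts a batch of inputs as a JSON array, and a real vLLM server returns one vector per element. The mock server above only embeds the first element, so treat this as an illustration of the request shape:

curl -s -X POST "http://localhost:8080/qwen3" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/apps/ecs_mounts/data/q3.6b",
    "input": ["first sentence", "second sentence"],
    "encoding_format": "float"
  }' | jq '.data | length'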

Verify the upstream server received the correct path and API key:

kubectl logs -n enterprise-agentgateway -l app=mock-vllm -c server --tail=15

Expected:

=== REQUEST ===
Path: /v1/embeddings
Headers:
  content-type: application/json
  authorization: Bearer <your-api-key>
  host: mock-vllm.enterprise-agentgateway.svc.cluster.local
Body: {"model": "/apps/ecs_mounts/data/q3.6b", "input": "The food was delicious..."}
===============

Cleanup

kubectl delete agentgatewaypolicy qwen3-embeddings -n enterprise-agentgateway
kubectl delete httproute qwen3 -n enterprise-agentgateway
kubectl delete agentgatewaybackend qwen3 -n enterprise-agentgateway
kubectl delete secret vllm-secret -n enterprise-agentgateway
kubectl delete deployment mock-vllm -n enterprise-agentgateway
kubectl delete svc mock-vllm -n enterprise-agentgateway

LiteLLM to Agent Gateway Mapping

LiteLLM config → Agent Gateway equivalent:

  • model_name: qwen3 → AgentgatewayBackend name + HTTPRoute path /qwen3
  • model: hosted_vllm//<path> → ai.provider.openai.model: <path>
  • api_key: $VLLM_API_KEY → Secret with policies.auth.secretRef
  • api_base: $VLLM_HOST_URL → ai.provider.host + ai.provider.port
  • mode: embedding → ai.provider.path: "/v1/embeddings" + AgentgatewayPolicy with routes: {"/v1/embeddings": "Passthrough"}

Key Concepts

  • OpenAI-compatible provider: Use ai.provider.openai with custom host/port to point to any OpenAI-compatible server (vLLM, Ollama, LM Studio, etc.)
  • Path override: ai.provider.path controls which API endpoint the gateway rewrites to on the upstream server
  • Route types via AgentgatewayPolicy: By default the gateway parses requests as chat completions. For embeddings, set the route to Passthrough so the request body is forwarded as-is without parsing for messages
  • Auth injection: The gateway automatically attaches the API key from the referenced secret as a Bearer token on upstream requests