@nerdalert
Last active January 30, 2026 06:30
vSR LLMInferenceService KServe Simulator Demo

$ ./deploy/openshift/deploy-to-openshift.sh --kserve --simulator --no-observability
[SUCCESS] Logged in as cluster-admin
[INFO] Creating namespace: vllm-semantic-router-system
namespace/vllm-semantic-router-system configured
[SUCCESS] Namespace ready
[INFO] Installing KServe and LLMInferenceService CRDs...
[INFO] InferenceService CRD already installed.
[INFO] LLMInferenceService CRD already installed.
[INFO] cert-manager namespace already present.
deployment.apps/cert-manager condition met
deployment.apps/cert-manager-webhook condition met
deployment.apps/cert-manager-cainjector condition met
deployment.apps/kserve-controller-manager condition met
[SUCCESS] KServe webhook service has ready endpoints
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:anyuid added: "llmisvc-controller-manager"
deployment.apps/llmisvc-controller-manager restarted
deployment.apps/llmisvc-controller-manager condition met
[SUCCESS] LLMInferenceService webhook has ready endpoints
[INFO] Ensuring LLMInferenceServiceConfig templates...
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-decode-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-decode-worker-data-parallel unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-prefill-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-prefill-worker-data-parallel unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-router-route unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-scheduler unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-template unchanged
llminferenceserviceconfig.serving.kserve.io/kserve-config-llm-worker-data-parallel unchanged
configmap/inferenceservice-config patched (no change)
[SUCCESS] All KServe CRDs already installed.
deployment.apps/llmisvc-controller-manager condition met
[INFO] Ensuring simulator service account and SCC...
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:anyuid added: "llmisvc-workload"
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "llmisvc-workload"
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "llmisvc-controller-manager"
[INFO] Deploying simulator LLMInferenceServices...
llminferenceservice.serving.kserve.io/model-a created
llminferenceservice.serving.kserve.io/model-b created
[INFO] Waiting for simulator LLMInferenceServices to be ready...
llminferenceservice.serving.kserve.io/model-a condition met
llminferenceservice.serving.kserve.io/model-b condition met
[INFO] KServe mode: Deploying semantic-router with KServe backend...

==================================================
  vLLM Semantic Router - KServe Deployment
==================================================

Configuration:
  Namespace:              vllm-semantic-router-system
  Simulator Mode:         true
  LLMInferenceService A:  model-a
  LLMInferenceService B:  model-b
  Model A Name:           Model-A
  Model B Name:           Model-B
  Embedding Model:        all-MiniLM-L12-v2
  Storage Class:          <cluster default>
  Models PVC Size:        10Gi
  Cache PVC Size:         5Gi
  Dry Run:                false

Step 1: Validating prerequisites...
✓ OpenShift CLI found
✓ Logged in as cluster-admin
✓ Namespace exists: vllm-semantic-router-system
✓ LLMInferenceService exists: model-a
✓ LLMInferenceService is ready
✓ LLMInferenceService exists: model-b
✓ LLMInferenceService is ready
Creating stable ClusterIP service for predictor: model-a
✓ Predictor service ClusterIP A: 172.30.103.62 (stable across pod restarts)
Creating stable ClusterIP service for predictor: model-b
✓ Predictor service ClusterIP B: 172.30.6.32 (stable across pod restarts)

Step 2: Generating manifests...
✓ Generated: configmap-router-config.yaml
✓ Generated: configmap-envoy-config.yaml
✓ Generated: serviceaccount.yaml
✓ Generated: pvc.yaml
✓ Generated: peerauthentication.yaml
✓ Generated: deployment.yaml
✓ Generated: service.yaml
✓ Generated: route.yaml

Step 3: Deploying to OpenShift...
serviceaccount/semantic-router unchanged
persistentvolumeclaim/semantic-router-models created
persistentvolumeclaim/semantic-router-cache created
configmap/semantic-router-kserve-config created
configmap/semantic-router-envoy-kserve-config created
Skipping PeerAuthentication (Istio CRD not found).
deployment.apps/semantic-router-kserve created
service/semantic-router-kserve created
route.route.openshift.io/semantic-router-kserve created
route.route.openshift.io/semantic-router-kserve-api created
✓ Resources deployed successfully

Step 4: Waiting for deployment to be ready...
This may take a few minutes while models are downloaded...

  Waiting for pod... (1/36)
  Waiting for pod... (2/36)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Initializing... (downloading models)
  Waiting for pod... (8/36)
  Waiting for pod... (9/36)
  Waiting for pod... (10/36)
  Waiting for pod... (11/36)
  Waiting for pod... (12/36)

  Quick status (init logs):
Downloaded sentence-transformers/all-MiniLM-L12-v2
All models downloaded successfully!
Model download complete!
total 40
drwxrwsr-x. 8 root       1001240000  4096 Jan 30 05:52 .
drwxr-xr-t. 4 root       root          33 Jan 30 05:51 ..
drwxr-sr-x. 6 1001240000 1001240000  4096 Jan 30 05:52 all-MiniLM-L12-v2
drwxr-sr-x. 3 1001240000 1001240000  4096 Jan 30 05:51 category_classifier_modernbert-base_model
drwxr-sr-x. 3 1001240000 1001240000  4096 Jan 30 05:52 jailbreak_classifier_modernbert-base_model
drwxrws---. 2 root       1001240000 16384 Jan 30 05:51 lost+found
drwxr-sr-x. 3 1001240000 1001240000  4096 Jan 30 05:51 pii_classifier_modernbert-base_model
drwxr-sr-x. 3 1001240000 1001240000  4096 Jan 30 05:52 pii_classifier_modernbert-base_presidio_token_model
Setting proper permissions...
Creating cache directories...
Model download complete!

  Waiting for pod... (13/36)
  Waiting for pod... (14/36)
  Waiting for pod... (15/36)
  Waiting for pod... (16/36)
  Waiting for pod... (17/36)
  Waiting for pod... (18/36)
  Waiting for pod... (19/36)
  Waiting for pod... (20/36)
  Waiting for pod... (21/36)
  Waiting for pod... (22/36)
  Waiting for pod... (23/36)
✓ Pod is ready: semantic-router-kserve-5696479cbd-q7kl7


✓ External URL: https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com

==================================================
  Deployment Complete!
==================================================

Next steps:

1. Set the route:
   ENVOY_ROUTE=semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com

2. Test model auto-routing:
   curl -k -X POST https://semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"auto","messages":[{"role":"user","content":"Explain the elements of a contract under common law and give a simple example."}]}'

3. View logs:
   oc logs -l app=semantic-router -c semantic-router -n vllm-semantic-router-system -f


For more information, see: semantic-router/deploy/kserve/README.md

[SUCCESS] KServe deployment complete
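
Step 4 in the output above polls for the pod with a bounded number of attempts (up to 36). The script's actual implementation is not shown here, so the following is only a minimal sketch of that bounded-retry pattern in POSIX shell. `wait_until` is a hypothetical helper, and the sleep interval is an assumption because the log does not state it.

```shell
#!/bin/sh
# wait_until MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Hypothetical helper for illustration; not taken from deploy-to-openshift.sh.
wait_until() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    echo "  Waiting... ($attempt/$max)"
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "timed out after $max attempts" >&2
  return 1
}

# Possible usage against the deployment created above (the 5-second interval is assumed):
#   wait_until 36 5 oc rollout status deployment/semantic-router-kserve \
#     -n vllm-semantic-router-system --timeout=1s
```

Bounding the attempts means a stuck init container (for example, a failed model download) surfaces as a timeout rather than an indefinite hang.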

Validation

$ ENVOY_ROUTE=semantic-router-kserve-vllm-semantic-router-system.apps.brent.pcbk.p1.openshiftapps.com
$ curl -k -X POST "https://$ENVOY_ROUTE/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{"model":"auto","messages":[{"role":"user","content":"Explain the elements of a contract under common law and give a simple example."}]}'
{
  "id": "chatcmpl-0b73f7dc-0014-4e59-84c7-8dc0e2227241",
  "created": 1769752597,
  "model": "Model-B",
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 32,
    "total_tokens": 47
  },
  "object": "chat.completion",
  "do_remote_decode": false,
  "do_remote_prefill": false,
  "remote_block_ids": null,
  "remote_engine_id": "",
  "remote_host": "",
  "remote_port": 0,
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "To be or not to be that is the question. Today it is partially cloudy and raining. Testing@, #testing 1$ ,2%,3^"
      }
    }
  ]
}
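
The request is sent with "model":"auto" and the response reports "model": "Model-B", which shows which backend the router selected. The message content appears to be canned simulator text rather than a real answer. To check only the routing decision, the model field can be pulled out of the response. This is a minimal sketch assuming a POSIX shell with sed and a pretty-printed response like the one above. `routed_model` is a hypothetical helper, and `jq -r .model` is likely the cleaner choice where jq is installed.

```shell
#!/bin/sh
# routed_model: read a chat-completion JSON response on stdin and print the
# value of its first "model" field. Hypothetical helper for illustration.
# Assumes one field per line, as in the pretty-printed response above.
routed_model() {
  sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example with a trimmed copy of the response shown above.
printf '%s\n' \
  '{' \
  '  "model": "Model-B",' \
  '  "object": "chat.completion"' \
  '}' | routed_model

# Possible usage against the live route (requires ENVOY_ROUTE to be set):
#   curl -k -s -X POST "https://$ENVOY_ROUTE/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d '{"model":"auto","messages":[{"role":"user","content":"..."}]}' | routed_model
```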