Qwen3-32B Disaggregated Serving Benchmark Results - AIConfigurator vs Actual Performance

Date: December 17, 2025
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo


1. Cluster Configuration

Hardware

  • GPU Type: NVIDIA H200 SXM
  • GPU Count: 16 (across 2 nodes)
  • Kubernetes: v1.31.9

Access

# Nebius H200 cluster context
kubectl config use-context nebius-mk8s-dynamo-h200-01
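
To sanity-check the GPU inventory before deploying, a node-capacity query along these lines can be used (the column expression follows the standard Kubernetes GPU-scheduling docs):

# Expect 8 GPUs per node across 2 nodes (16 total)
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"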

2. AIConfigurator Analysis

Command Used

aiconfigurator cli default \
  --hf_id Qwen/Qwen3-32B-FP8 \
  --total_gpus 16 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./dynamo-configs

AIConfigurator Predictions

| Configuration | GPUs | TTFT | tokens/s/gpu | Recommended Concurrency |
|---|---|---|---|---|
| Aggregated | 16 | 1,108 ms | 147.00 | 144 |
| Disaggregated | 16 | 542.58 ms | 192.04 | 192 |

Predicted Improvement: Disaggregated is 31% better in tokens/s/gpu
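
Both headline numbers follow directly from the table above:

  • Per-GPU gain: 192.04 / 147.00 ≈ 1.31, i.e. ~31% more tokens/s/gpu for disaggregated
  • Implied cluster throughput: 192.04 tok/s/gpu × 16 GPUs ≈ 3,072 tok/s (disaggregated) vs 147.00 × 16 = 2,352 tok/s (aggregated)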

Optimal Disagg Configuration (from AIC)

  • Prefill Workers: 10 replicas, TP=1, batch_size=1
  • Decode Workers: 3 replicas, TP=2, batch_size=64

3. Deployment

Namespace Setup

kubectl create namespace bhamm-aic-demo
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<token> \
  -n bhamm-aic-demo
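
A quick check that the secret landed in the right namespace:

kubectl get secret hf-token-secret -n bhamm-aic-demo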

Deployment YAML

Applied the AIConfigurator-generated k8s_deploy.yaml which creates:

# DynamoGraphDeployment structure
Frontend: 1 replica
TRTLLMPrefillWorker: 10 replicas (1 GPU each)
TRTLLMDecodeWorker: 3 replicas (2 GPUs each)
# Total: 10 + 6 = 16 GPUs
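
The manifest was applied and the rollout watched with standard kubectl; the exact path under ./dynamo-configs depends on how AIConfigurator lays out its output, so the one below is a placeholder:

# Apply the generated DynamoGraphDeployment and watch pods come up
# (replace <generated-subdir> with the actual directory written by --save_dir)
kubectl apply -n bhamm-aic-demo -f ./dynamo-configs/<generated-subdir>/k8s_deploy.yaml
kubectl get pods -n bhamm-aic-demo -w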

Pod Status

NAME                                          READY   STATUS    RESTARTS
qwen3-32b-disagg-0-frontend-f62l4             1/1     Running   0
qwen3-32b-disagg-0-trtllmdecodeworker-22bm9   0/1     Running   5
qwen3-32b-disagg-0-trtllmdecodeworker-lmnhm   0/1     Running   5
qwen3-32b-disagg-0-trtllmdecodeworker-xwsqz   0/1     Running   4
qwen3-32b-disagg-0-trtllmprefillworker-*      1/1     Running   0  (x10)

Note: Decode workers show 0/1 Ready due to a health check issue (see Known Issues).


4. Benchmark Commands

In-Cluster Benchmark Pod

apiVersion: v1
kind: Pod
metadata:
  name: benchmark-worker
  namespace: bhamm-aic-demo
spec:
  containers:
  - name: benchmark
    image: python:3.11-slim
    command: ["/bin/bash", "-c"]
    args:
      - |
        pip install aiperf aiohttp --quiet
        sleep infinity
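
One way to create the pod and confirm the tooling before profiling (the manifest file name is illustrative, and aiperf may take a minute to finish installing after the pod starts):

# benchmark-worker.yaml is an assumed file name for the pod spec above
kubectl apply -f benchmark-worker.yaml
kubectl wait pod/benchmark-worker -n bhamm-aic-demo --for=condition=Ready --timeout=300s
kubectl exec -n bhamm-aic-demo benchmark-worker -- pip show aiperf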

AIPerf Commands

Low Concurrency (baseline TTFT):

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 10 \
    --concurrency 1 \
    --isl 4000 \
    --osl 50 \
    --streaming

High Concurrency (throughput):

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 100 \
    --concurrency 192 \
    --isl 4000 \
    --osl 500 \
    --streaming
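
Medium Concurrency (presumed): the c=32 results in the next section likely came from the same invocation with only the concurrency changed; the ISL/OSL and request count below are assumptions:

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 100 \
    --concurrency 32 \
    --isl 4000 \
    --osl 500 \
    --streaming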

5. Benchmark Results

Concurrency = 1 (Baseline Latency)

| Metric | Measured | AIC Predicted | Delta |
|---|---|---|---|
| TTFT avg | 958 ms | 543 ms | +77% |
| TTFT p50 | 955 ms | - | - |
| ITL avg | 21.25 ms | 16.67 ms | +27% |
| Request Latency | 1,999 ms | - | - |

Concurrency = 192 (Recommended Load)

| Metric | Measured | AIC Predicted | Delta |
|---|---|---|---|
| TTFT avg | 22,297 ms | 543 ms | 41x |
| TTFT p50 | 22,220 ms | - | - |
| TTFT min | 1,787 ms | - | - |
| ITL avg | 9.22 ms | 16.67 ms | -45% |
| Throughput | 1,045 tok/s | ~3,072 tok/s | -66% |
| Tokens/s/gpu | 65.3 | 192 | -66% |

Concurrency = 32 (Medium Load)

| Metric | Measured |
|---|---|
| TTFT avg | 5,296 ms |
| TTFT min | 1,649 ms |
| ITL avg | 9.23 ms |
| Throughput | 721 tok/s |

6. Key Findings

1. TTFT SLA Not Met

  • Target: 600 ms
  • Actual (concurrency=1): 958 ms
  • Even at minimum load, the TTFT exceeds the SLA by 60%

2. Throughput Gap

  • Predicted: ~3,072 tokens/s (192 tok/s/gpu × 16 GPUs)
  • Actual: 1,045 tokens/s
  • Gap: 66% lower than predicted

3. ITL Performance

  • ITL at high concurrency (9.22 ms) is actually better than predicted (16.67 ms)
  • Suggests the decode phase is performing well

4. Scaling Behavior

  • TTFT degrades significantly with concurrency
  • At c=1: 958 ms → At c=192: 22,297 ms
  • Indicates prefill bottleneck or queueing

7. Possible Causes

A. Decode Worker Health Check Issue

The decode workers continuously fail internal health checks with:

ValueError: Disaggregated params are required for decode mode

This is because Dynamo's internal canary health check calls generate on decode workers, but decode workers cannot generate without prefill context. This causes:

  • Constant pod restarts (4-5 restarts observed)
  • Log spam every 10 seconds
  • Potential instability affecting performance
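
The restart loop and the failing check are visible directly from the pod events and logs, for example (pod name taken from the status listing above):

# Restart count and last termination reason
kubectl describe pod qwen3-32b-disagg-0-trtllmdecodeworker-22bm9 -n bhamm-aic-demo | grep -A 5 "Last State"
# Health-check errors in the worker log
kubectl logs qwen3-32b-disagg-0-trtllmdecodeworker-22bm9 -n bhamm-aic-demo | grep -i health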

B. KV Cache Transfer Latency

The KV cache transfer between prefill and decode workers may have higher latency than AIConfigurator's model accounts for, especially on multi-node deployments.

C. Model Calibration Differences

AIConfigurator's performance model may be calibrated on different hardware/software configurations than the actual Nebius H200 cluster.

D. Prefill Bottleneck

With 10 prefill workers (1 GPU each) vs 3 decode workers (2 GPUs each), the prefill stage may be the bottleneck at high concurrency, causing TTFT to spike.
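
One way to test this hypothesis is to sample GPU utilization on one prefill worker and one decode worker during the concurrency=192 run. The pod names below are placeholders, and this assumes nvidia-smi is available inside the worker image:

# Sustained ~100% prefill utilization with idle decode GPUs would point to a prefill bottleneck
kubectl exec -n bhamm-aic-demo <prefill-pod> -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
kubectl exec -n bhamm-aic-demo <decode-pod> -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1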

E. Container/Network Overhead

The Kubernetes deployment adds overhead compared to the bare-metal performance that AIConfigurator models.


8. Known Issues

Decode Worker Health Check Bug

Impact: Decode workers show 0/1 Ready and restart frequently
Workaround: None required for functionality; actual inference through the Frontend→Prefill→Decode path still works
Fix Needed: The internal canary health check should be disabled for disaggregated decode workers

[dynamo_runtime::health_check] Canary timer expired for generate, sending health check
ValueError: Disaggregated params are required for decode mode
[dynamo_runtime::health_check] Health check error response from generate

9. Recommendations

  1. File a bug for the decode worker health check issue in Dynamo
  2. Investigate the prefill/decode ratio - more decode workers may be needed
  3. Profile KV cache transfer latency separately
  4. Compare with an aggregated deployment to validate the relative performance claims
  5. Test with a lower ISL to see if prefill time is the main contributor (example invocation below)
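
For recommendation 5, the baseline aiperf invocation can be reused with only the ISL reduced; the value of 1000 below is arbitrary:

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 10 \
    --concurrency 1 \
    --isl 1000 \
    --osl 50 \
    --streaming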

10. Raw Data

Concurrency 192 - Full JSON Metrics

{
  "time_to_first_token": {
    "unit": "ms",
    "avg": 22296.84,
    "p50": 22219.54,
    "p99": 42980.08,
    "min": 1786.96,
    "max": 43226.83
  },
  "inter_token_latency": {
    "unit": "ms",
    "avg": 9.22,
    "p50": 9.21,
    "p99": 9.36
  },
  "output_token_throughput": {
    "unit": "tokens/sec",
    "avg": 1045.12
  },
  "request_throughput": {
    "unit": "requests/sec",
    "avg": 2.09
  }
}
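
As a consistency check on the raw numbers: 1,045.12 tok/s ÷ ~500 output tokens per request ≈ 2.09 requests/s, which matches the reported request throughput.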

Summary

| Aspect | Expected | Actual | Status |
|---|---|---|---|
| TTFT SLA (600 ms) | ✓ Meet | 958 ms @ c=1 | ❌ Violated |
| Throughput | 192 tok/s/gpu | 65 tok/s/gpu | ⚠️ 34% of target |
| ITL | 16.67 ms | 9.22 ms | ✓ Better |
| Deployment | Stable | Decode restarts | ⚠️ Health check bug |

Bottom Line: The disaggregated deployment is functional but significantly underperforms AIConfigurator predictions. The decode worker health check bug needs to be fixed, and further investigation is needed to understand the TTFT/throughput gap.
