Qwen3-32B Disaggregated Serving Benchmark Results - AIConfigurator vs Actual Performance

Date: December 17, 2025
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo


1. Cluster Configuration

Hardware

  • GPU Type: NVIDIA H200 SXM
  • GPU Count: 16 (across 2 nodes)
  • Kubernetes: v1.31.9

Access

# Nebius H200 cluster context
kubectl config use-context nebius-mk8s-dynamo-h200-01
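
To sanity-check the GPU inventory before deploying, a node-capacity query along these lines can be used (the column expression follows the standard Kubernetes GPU-scheduling docs):

# Expect 8 GPUs per node across 2 nodes (16 total)
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"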

2. AIConfigurator Analysis

Command Used

aiconfigurator cli default \
  --hf_id Qwen/Qwen3-32B-FP8 \
  --total_gpus 16 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./dynamo-configs

AIConfigurator Predictions

| Configuration | GPUs | TTFT | tokens/s/gpu | Recommended Concurrency |
|---|---|---|---|---|
| Aggregated | 16 | 1,108 ms | 147.00 | 144 |
| Disaggregated | 16 | 542.58 ms | 192.04 | 192 |

Predicted Improvement: Disaggregated is 31% better in tokens/s/gpu
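
Both headline numbers follow directly from the table above:

  • Per-GPU gain: 192.04 / 147.00 ≈ 1.31, i.e. ~31% more tokens/s/gpu for disaggregated
  • Implied cluster throughput: 192.04 tok/s/gpu × 16 GPUs ≈ 3,072 tok/s (disaggregated) vs 147.00 × 16 = 2,352 tok/s (aggregated)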

Optimal Disagg Configuration (from AIC)

  • Prefill Workers: 10 replicas, TP=1, batch_size=1
  • Decode Workers: 3 replicas, TP=2, batch_size=64

3. Deployment

Namespace Setup

kubectl create namespace bhamm-aic-demo
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<token> \
  -n bhamm-aic-demo
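
A quick check that the secret landed in the right namespace:

kubectl get secret hf-token-secret -n bhamm-aic-demo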

Deployment YAML

Applied the AIConfigurator-generated k8s_deploy.yaml which creates:

# DynamoGraphDeployment structure
Frontend: 1 replica
TRTLLMPrefillWorker: 10 replicas (1 GPU each)
TRTLLMDecodeWorker: 3 replicas (2 GPUs each)
# Total: 10 + 6 = 16 GPUs
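
The manifest was applied and the rollout watched with standard kubectl; the exact path under ./dynamo-configs depends on how AIConfigurator lays out its output, so the one below is a placeholder:

# Apply the generated DynamoGraphDeployment and watch pods come up
# (replace <generated-subdir> with the actual directory written by --save_dir)
kubectl apply -n bhamm-aic-demo -f ./dynamo-configs/<generated-subdir>/k8s_deploy.yaml
kubectl get pods -n bhamm-aic-demo -w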

Pod Status

NAME                                          READY   STATUS    RESTARTS
qwen3-32b-disagg-0-frontend-f62l4             1/1     Running   0
qwen3-32b-disagg-0-trtllmdecodeworker-22bm9   0/1     Running   5
qwen3-32b-disagg-0-trtllmdecodeworker-lmnhm   0/1     Running   5
qwen3-32b-disagg-0-trtllmdecodeworker-xwsqz   0/1     Running   4
qwen3-32b-disagg-0-trtllmprefillworker-*      1/1     Running   0  (x10)

Note: Decode workers show 0/1 Ready due to a health check issue (see Known Issues).


4. Benchmark Commands

In-Cluster Benchmark Pod

apiVersion: v1
kind: Pod
metadata:
  name: benchmark-worker
  namespace: bhamm-aic-demo
spec:
  containers:
  - name: benchmark
    image: python:3.11-slim
    command: ["/bin/bash", "-c"]
    args:
      - |
        pip install aiperf aiohttp --quiet
        sleep infinity
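
One way to create the pod and confirm the tooling before profiling (the manifest file name is illustrative, and aiperf may take a minute to finish installing after the pod starts):

# benchmark-worker.yaml is an assumed file name for the pod spec above
kubectl apply -f benchmark-worker.yaml
kubectl wait pod/benchmark-worker -n bhamm-aic-demo --for=condition=Ready --timeout=300s
kubectl exec -n bhamm-aic-demo benchmark-worker -- pip show aiperf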

AIPerf Commands

Low Concurrency (baseline TTFT):

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 10 \
    --concurrency 1 \
    --isl 4000 \
    --osl 50 \
    --streaming

High Concurrency (throughput):

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 100 \
    --concurrency 192 \
    --isl 4000 \
    --osl 500 \
    --streaming
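
Medium Concurrency (presumed): the c=32 results in the next section likely came from the same invocation with only the concurrency changed; the ISL/OSL and request count below are assumptions:

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 100 \
    --concurrency 32 \
    --isl 4000 \
    --osl 500 \
    --streaming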

5. Benchmark Results

Concurrency = 1 (Baseline Latency)

| Metric | Measured | AIC Predicted | Delta |
|---|---|---|---|
| TTFT avg | 958 ms | 543 ms | +77% |
| TTFT p50 | 955 ms | - | - |
| ITL avg | 21.25 ms | 16.67 ms | +27% |
| Request Latency | 1,999 ms | - | - |

Concurrency = 192 (Recommended Load)

| Metric | Measured | AIC Predicted | Delta |
|---|---|---|---|
| TTFT avg | 22,297 ms | 543 ms | 41x |
| TTFT p50 | 22,220 ms | - | - |
| TTFT min | 1,787 ms | - | - |
| ITL avg | 9.22 ms | 16.67 ms | -45% |
| Throughput | 1,045 tok/s | ~3,072 tok/s | -66% |
| Tokens/s/gpu | 65.3 | 192 | -66% |

Concurrency = 32 (Medium Load)

| Metric | Measured |
|---|---|
| TTFT avg | 5,296 ms |
| TTFT min | 1,649 ms |
| ITL avg | 9.23 ms |
| Throughput | 721 tok/s |

6. Key Findings

1. TTFT SLA Not Met

  • Target: 600 ms
  • Actual (concurrency=1): 958 ms
  • Even at minimum load, the TTFT exceeds the SLA by 60%

2. Throughput Gap

  • Predicted: ~3,072 tokens/s (192 tok/s/gpu × 16 GPUs)
  • Actual: 1,045 tokens/s
  • Gap: 66% lower than predicted

3. ITL Performance

  • ITL at high concurrency (9.22 ms) is actually better than predicted (16.67 ms)
  • Suggests the decode phase is performing well

4. Scaling Behavior

  • TTFT degrades significantly with concurrency
  • At c=1: 958 ms → At c=192: 22,297 ms
  • Indicates prefill bottleneck or queueing

7. Possible Causes

A. Decode Worker Health Check Issue

The decode workers continuously fail internal health checks with:

ValueError: Disaggregated params are required for decode mode

This is because Dynamo's internal canary health check calls generate on decode workers, but decode workers cannot generate without prefill context. This causes:

  • Constant pod restarts (4-5 restarts observed)
  • Log spam every 10 seconds
  • Potential instability affecting performance
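
The restart loop and the failing check are visible directly from the pod events and logs, for example (pod name taken from the status listing above):

# Restart count and last termination reason
kubectl describe pod qwen3-32b-disagg-0-trtllmdecodeworker-22bm9 -n bhamm-aic-demo | grep -A 5 "Last State"
# Health-check errors in the worker log
kubectl logs qwen3-32b-disagg-0-trtllmdecodeworker-22bm9 -n bhamm-aic-demo | grep -i health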

B. KV Cache Transfer Latency

The KV cache transfer between prefill and decode workers may have higher latency than AIConfigurator's model accounts for, especially on multi-node deployments.

C. Model Calibration Differences

AIConfigurator's performance model may be calibrated on different hardware/software configurations than the actual Nebius H200 cluster.

D. Prefill Bottleneck

With 10 prefill workers (1 GPU each) vs 3 decode workers (2 GPUs each), the prefill stage may be the bottleneck at high concurrency, causing TTFT to spike.
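
One way to test this hypothesis is to sample GPU utilization on one prefill worker and one decode worker during the concurrency=192 run. The pod names below are placeholders, and this assumes nvidia-smi is available inside the worker image:

# Sustained ~100% prefill utilization with idle decode GPUs would point to a prefill bottleneck
kubectl exec -n bhamm-aic-demo <prefill-pod> -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
kubectl exec -n bhamm-aic-demo <decode-pod> -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1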

E. Container/Network Overhead

The Kubernetes deployment adds overhead compared to the bare-metal performance that AIConfigurator models.


8. Known Issues

Decode Worker Health Check Bug

Impact: Decode workers show 0/1 Ready and restart frequently
Workaround: None required for functionality; actual inference through the Frontend→Prefill→Decode path still works
Fix Needed: The internal canary health check should be disabled for disaggregated decode workers

[dynamo_runtime::health_check] Canary timer expired for generate, sending health check
ValueError: Disaggregated params are required for decode mode
[dynamo_runtime::health_check] Health check error response from generate

9. Recommendations

  1. File a bug for the decode worker health check issue in Dynamo
  2. Investigate the prefill/decode ratio - more decode workers may be needed
  3. Profile KV cache transfer latency separately
  4. Compare with an aggregated deployment to validate the relative performance claims
  5. Test with a lower ISL to see if prefill time is the main contributor (example invocation below)
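
For recommendation 5, the baseline aiperf invocation can be reused with only the ISL reduced; the value of 1000 below is arbitrary:

kubectl exec -n bhamm-aic-demo benchmark-worker -- \
  aiperf profile \
    --url http://qwen3-32b-disagg-frontend:8000 \
    --model "Qwen/Qwen3-32B-FP8" \
    --num-requests 10 \
    --concurrency 1 \
    --isl 1000 \
    --osl 50 \
    --streaming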

10. Raw Data

Concurrency 192 - Full JSON Metrics

{
  "time_to_first_token": {
    "unit": "ms",
    "avg": 22296.84,
    "p50": 22219.54,
    "p99": 42980.08,
    "min": 1786.96,
    "max": 43226.83
  },
  "inter_token_latency": {
    "unit": "ms",
    "avg": 9.22,
    "p50": 9.21,
    "p99": 9.36
  },
  "output_token_throughput": {
    "unit": "tokens/sec",
    "avg": 1045.12
  },
  "request_throughput": {
    "unit": "requests/sec",
    "avg": 2.09
  }
}
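
As a consistency check on the raw numbers: 1,045.12 tok/s ÷ ~500 output tokens per request ≈ 2.09 requests/s, which matches the reported request throughput.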

Summary

| Aspect | Expected | Actual | Status |
|---|---|---|---|
| TTFT SLA (600 ms) | ✓ Meet | 958 ms @ c=1 | ❌ Violated |
| Throughput | 192 tok/s/gpu | 65 tok/s/gpu | ⚠️ 34% of target |
| ITL | 16.67 ms | 9.22 ms | ✓ Better |
| Deployment | Stable | Decode restarts | ⚠️ Health check bug |

Bottom Line: The disaggregated deployment is functional but significantly underperforms AIConfigurator predictions. The decode worker health check bug needs to be fixed, and further investigation is needed to understand the TTFT/throughput gap.
