Date: December 17, 2024
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo
- GPU Type: NVIDIA H200 SXM
- GPU Count: 16 (across 2 nodes)
- Kubernetes: v1.31.9
# Nebius H200 cluster context
kubectl config use-context nebius-mk8s-dynamo-h200-01

aiconfigurator cli default \
--hf_id Qwen/Qwen3-32B-FP8 \
--total_gpus 16 \
--system h200_sxm \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--save_dir ./dynamo-configs

| Configuration | GPUs | TTFT | tokens/s/gpu | Recommended Concurrency |
|---|---|---|---|---|
| Aggregated | 16 | 1,108 ms | 147.00 | 144 |
| Disaggregated | 16 | 542.58 ms | 192.04 | 192 |
Predicted Improvement: Disaggregated is 31% better in tokens/s/gpu
- Prefill Workers: 10 replicas, TP=1, batch_size=1
- Decode Workers: 3 replicas, TP=2, batch_size=64
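
As a quick sanity check on the recommended layout, the GPU accounting and the cluster-level numbers implied by AIConfigurator's per-GPU prediction work out as follows (simple arithmetic over the figures above, not AIConfigurator output):

```python
# Sanity-check arithmetic for the recommended disaggregated layout.
# All inputs come from the AIConfigurator prediction and config above.
prefill_workers, prefill_tp = 10, 1
decode_workers, decode_tp, decode_batch = 3, 2, 64

total_gpus = prefill_workers * prefill_tp + decode_workers * decode_tp
print(total_gpus)                        # 16, matching --total_gpus

# Cluster-level throughput implied by the per-GPU prediction.
print(192.04 * total_gpus)               # ~3,073 tok/s for the whole deployment

# Predicted advantage of disaggregated over aggregated serving.
print(192.04 / 147.00 - 1)               # ~0.31 -> the "31% better" figure

# Concurrent decode slots vs. the recommended concurrency.
print(decode_workers * decode_batch)     # 192, matching the recommended concurrency
```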
kubectl create namespace bhamm-aic-demo
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=<token> \
-n bhamm-aic-demo

Applied the AIConfigurator-generated k8s_deploy.yaml which creates:
# DynamoGraphDeployment structure
Frontend: 1 replica
TRTLLMPrefillWorker: 10 replicas (1 GPU each)
TRTLLMDecodeWorker: 3 replicas (2 GPUs each)
# Total: 10 + 6 = 16 GPUs

NAME                                          READY   STATUS    RESTARTS
qwen3-32b-disagg-0-frontend-f62l4             1/1     Running   0
qwen3-32b-disagg-0-trtllmdecodeworker-22bm9   0/1     Running   5
qwen3-32b-disagg-0-trtllmdecodeworker-lmnhm   0/1     Running   5
qwen3-32b-disagg-0-trtllmdecodeworker-xwsqz   0/1     Running   4
qwen3-32b-disagg-0-trtllmprefillworker-*      1/1     Running   0 (x10)
Note: Decode workers show 0/1 Ready due to a health check issue (see Known Issues).
apiVersion: v1
kind: Pod
metadata:
  name: benchmark-worker
  namespace: bhamm-aic-demo
spec:
  containers:
  - name: benchmark
    image: python:3.11-slim
    command: ["/bin/bash", "-c"]
    args:
    - |
      pip install aiperf aiohttp --quiet
      sleep infinity

Low Concurrency (baseline TTFT):
kubectl exec -n bhamm-aic-demo benchmark-worker -- \
aiperf profile \
--url http://qwen3-32b-disagg-frontend:8000 \
--model "Qwen/Qwen3-32B-FP8" \
--num-requests 10 \
--concurrency 1 \
--isl 4000 \
--osl 50 \
--streaming

High Concurrency (throughput):
kubectl exec -n bhamm-aic-demo benchmark-worker -- \
aiperf profile \
--url http://qwen3-32b-disagg-frontend:8000 \
--model "Qwen/Qwen3-32B-FP8" \
--num-requests 100 \
--concurrency 192 \
--isl 4000 \
--osl 500 \
--streaming

Results at concurrency = 1 (ISL 4000 / OSL 50):

| Metric | Measured | AIC Predicted | Delta |
|---|---|---|---|
| TTFT avg | 958 ms | 543 ms | +77% |
| TTFT p50 | 955 ms | - | - |
| ITL avg | 21.25 ms | 16.67 ms | +27% |
| Request Latency | 1,999 ms | - | - |

Results at concurrency = 192 (ISL 4000 / OSL 500):

| Metric | Measured | AIC Predicted | Delta |
|---|---|---|---|
| TTFT avg | 22,297 ms | 543 ms | +41x |
| TTFT p50 | 22,220 ms | - | - |
| TTFT min | 1,787 ms | - | - |
| ITL avg | 9.22 ms | 16.67 ms | -45% |
| Throughput | 1,045 tok/s | ~3,072 tok/s | -66% |
| Tokens/s/gpu | 65.3 | 192 | -66% |
| Metric | Measured |
|---|---|
| TTFT avg | 5,296 ms |
| TTFT min | 1,649 ms |
| ITL avg | 9.23 ms |
| Throughput | 721 tok/s |
- Target: 600 ms
- Actual (concurrency=1): 958 ms
- Even at minimum load, the TTFT exceeds the SLA by 60%
- Predicted: ~3,072 tokens/s (192 tok/s/gpu × 16 GPUs)
- Actual: 1,045 tokens/s
- Gap: 66% lower than predicted
- ITL at high concurrency (9.22 ms) is actually better than predicted (16.67 ms)
- Suggests the decode phase is performing well
- TTFT degrades significantly with concurrency
- At c=1: 958 ms → At c=192: 22,297 ms
- Indicates a prefill bottleneck or request queueing ahead of prefill (see the back-of-envelope sketch below)
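
A back-of-envelope estimate supports this reading. Everything below is taken from the measurements and predictions above, except the per-request prefill time, which is an assumption loosely anchored to the 958 ms TTFT at concurrency = 1:

```python
# Rough check of the "prefill bottleneck / queueing" interpretation.
gpus = 16
measured_tok_per_s = 1045.12              # aiperf, concurrency = 192
predicted_tok_per_s_per_gpu = 192.04      # AIConfigurator prediction

print(measured_tok_per_s / gpus)          # ~65.3 tok/s/gpu (vs 192 predicted)
print(1 - measured_tok_per_s / (predicted_tok_per_s_per_gpu * gpus))  # ~0.66 shortfall

# Queueing estimate: 192 in-flight requests fan out over 10 prefill workers.
concurrency = 192
prefill_workers = 10
prefill_time_s = 1.0                      # ASSUMPTION: ~1 s per 4000-token prefill

queue_depth = concurrency / prefill_workers   # ~19 requests per prefill worker
print(queue_depth * prefill_time_s)           # ~19 s of queueing alone -- the same
                                              # order as the measured 22.3 s TTFT avg
```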
The decode workers continuously fail internal health checks with:
ValueError: Disaggregated params are required for decode mode
This is because Dynamo's internal canary health check calls generate on decode workers, but decode workers cannot generate without prefill context. This causes:
- Constant pod restarts (4-5 restarts observed)
- Log spam every 10 seconds
- Potential instability affecting performance
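
The failure mode can be pictured with a minimal sketch. This is illustrative only (names and structure are made up, not Dynamo or TensorRT-LLM source): the canary treats the decode worker as an ordinary generator, but in disaggregated mode generate has nothing to continue without a prefill hand-off.

```python
# Illustrative-only model of the decode-worker health check failure.
class DecodeWorker:                           # hypothetical stand-in, not Dynamo code
    def generate(self, prompt, disaggregated_params=None):
        # In disaggregated serving a decode worker can only continue a request
        # whose prefill output (first token + KV-cache handle) is passed in.
        if disaggregated_params is None:
            raise ValueError("Disaggregated params are required for decode mode")
        ...

def canary_health_check(worker):              # hypothetical stand-in for the canary
    worker.generate("healthcheck")            # bare generate call, no prefill hand-off

try:
    canary_health_check(DecodeWorker())
except ValueError as e:
    print(f"health check failed: {e}")        # -> pod marked unready and restarted
```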
The KV cache transfer between prefill and decode workers may have higher latency than AIConfigurator's model accounts for, especially on multi-node deployments.
AIConfigurator's performance model may be calibrated on different hardware/software configurations than the actual Nebius H200 cluster.
With 10 prefill workers (1 GPU each) vs 3 decode workers (2 GPUs each), the prefill stage may be the bottleneck at high concurrency, causing TTFT to spike.
The Kubernetes deployment adds overhead compared to the bare-metal performance that AIConfigurator models.
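
To put a rough number on the KV-transfer hypothesis above, the per-request transfer size can be estimated as follows. The architecture figures and the KV-cache dtype are assumptions (not read from the deployment config), so treat the result as an order of magnitude only:

```python
# Rough per-request KV-cache size for a 4000-token prefill.
# ASSUMPTIONS: 64 layers, 8 KV heads, head_dim 128 (GQA) for Qwen3-32B, and a
# FP16 KV cache (2 bytes/element); an FP8 KV cache would halve the result.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_elem = 2
isl = 4000

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
kv_bytes_per_request = kv_bytes_per_token * isl

print(kv_bytes_per_token / 1024)      # ~256 KiB per token
print(kv_bytes_per_request / 2**30)   # ~1 GiB per 4000-token request
```

Even at tens of GB/s of effective inter-node bandwidth, moving on the order of 1 GiB per request adds tens of milliseconds per hand-off, which may not be fully reflected in the predicted TTFT.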
Impact: Decode workers show 0/1 Ready and restart frequently
Workaround: Actual inference through the Frontend→Prefill→Decode path still works
Fix Needed: The internal canary health check should be disabled for disaggregated decode workers
[dynamo_runtime::health_check] Canary timer expired for generate, sending health check
ValueError: Disaggregated params are required for decode mode
[dynamo_runtime::health_check] Health check error response from generate
- File bug for decode worker health check issue in Dynamo
- Investigate prefill/decode ratio - may need more decode workers
- Profile KV cache transfer latency separately
- Compare with aggregated deployment to validate relative performance claims (a small summarizing helper is sketched below)
- Test with lower ISL to see if prefill time is the main contributor
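
To make these comparisons easy to repeat, a small helper can reduce an aiperf result to the SLA-relevant numbers used in this report. This is a sketch that assumes the metrics are available in the structure shown in the raw output below; the file name is hypothetical, and the actual aiperf export may nest these fields differently:

```python
import json

GPUS = 16             # GPUs in this deployment
TTFT_SLA_MS = 600.0   # target TTFT
TPOT_SLA_MS = 16.67   # target inter-token latency

def summarize(path: str) -> dict:
    """Reduce one aiperf result file to the numbers tracked in this report."""
    with open(path) as f:
        m = json.load(f)
    ttft = m["time_to_first_token"]["avg"]
    itl = m["inter_token_latency"]["avg"]
    tok_per_s = m["output_token_throughput"]["avg"]
    return {
        "ttft_avg_ms": ttft,
        "ttft_within_sla": ttft <= TTFT_SLA_MS,
        "itl_avg_ms": itl,
        "itl_within_sla": itl <= TPOT_SLA_MS,
        "tok_per_s": tok_per_s,
        "tok_per_s_per_gpu": tok_per_s / GPUS,
    }

# Example (hypothetical file name):
# summarize("disagg_c192.json") -> tok_per_s_per_gpu ~= 65.3, ttft_within_sla False
```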
Raw aiperf metrics for the concurrency = 192 run:

{
"time_to_first_token": {
"unit": "ms",
"avg": 22296.84,
"p50": 22219.54,
"p99": 42980.08,
"min": 1786.96,
"max": 43226.83
},
"inter_token_latency": {
"unit": "ms",
"avg": 9.22,
"p50": 9.21,
"p99": 9.36
},
"output_token_throughput": {
"unit": "tokens/sec",
"avg": 1045.12
},
"request_throughput": {
"unit": "requests/sec",
"avg": 2.09
}
}

Summary vs. expectations:

| Aspect | Expected | Actual | Status |
|---|---|---|---|
| TTFT SLA (600ms) | ✓ Meet | 958ms @ c=1 | ❌ Violated |
| Throughput | 192 tok/s/gpu | 65 tok/s/gpu | ❌ 66% below |
| ITL | 16.67 ms | 9.22 ms | ✓ Better |
| Deployment | Stable | Decode restarts | ❌ Unstable |
Bottom Line: The disaggregated deployment is functional but significantly underperforms AIConfigurator predictions. The decode worker health check bug needs to be fixed, and further investigation is needed to understand the TTFT/throughput gap.