NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.
Key capabilities:
- Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
- KV-aware routing — Routes requests to workers with the highest cache hit rate
- KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
- Dynamic scaling — Adjusts worker counts based on real-time demand
- Multi-backend support — Works with TensorRT-LLM, vLLM, and SGLang
But Dynamo's flexibility creates a new challenge: how do you configure it optimally?
AIConfigurator is a performance optimization tool that recommends Dynamo deployment configurations in seconds. It uses performance models calibrated on real hardware to simulate thousands of configurations and identify promising setups for your workload and SLA requirements.
What it determines:
| Decision | Traditional Approach | With AIConfigurator |
|---|---|---|
| Aggregated vs. disaggregated? | Trial and error (days) | Instant recommendation |
| How many prefill workers? | Guesswork | Recommended count |
| How many decode workers? | Guesswork | Recommended count |
| What TP/PP sizes? | Manual testing | Recommended parallelism |
| What batch sizes? | Benchmarking | SLA-aware sizing |
Value proposition:
- Speed: 5-10 seconds vs. days of manual testing
- Informed starting point: Recommendations calibrated on real hardware profiling
- Deployment-ready: Generates Kubernetes YAML files you can apply directly
- Comparison: Shows aggregated vs. disaggregated performance side-by-side
You want to deploy Qwen3-32B-FP8 on 2 nodes of H200 GPUs (16 GPUs total). You need to meet a Time To First Token (TTFT) SLA of 600ms while maximizing throughput.
Questions you face:
- Should I use aggregated or disaggregated serving?
- How many prefill workers vs decode workers?
- What tensor parallel (TP) size should I use?
- What batch sizes will meet my SLA?
- How many replicas do I need?
Manually testing all combinations would take days. AIConfigurator solves this in seconds.
# Install AIConfigurator from source (recommended for latest features)
git clone https://github.com/ai-dynamo/aiconfigurator.git
cd aiconfigurator
pip3 install -e .
# Verify installation
aiconfigurator --help
Note: pip3 install aiconfigurator is also available for stable releases. This guide uses the source build to demonstrate the latest features from main.
Workload characteristics:
- Model: Qwen3-32B-FP8
- Input sequence length (ISL): 4000 tokens
- Output sequence length (OSL): 500 tokens
- Available GPUs: 16 H200s (2 nodes × 8 GPUs)
- TTFT target: 600ms
- Target throughput: 60 tokens/s/user (TPOT = 1000 ms / 60 tokens ≈ 16.67 ms per token)
aiconfigurator cli default \
--model QWEN3_32B \
--total_gpus 16 \
--system h200_sxm \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--save_dir ./dynamo-configs
Note: You can also use --hf_id Qwen/Qwen3-32B-FP8 to specify models by their HuggingFace ID.
What happens:
- AIConfigurator evaluates hundreds of possible configurations
- Tests both aggregated and disaggregated serving modes
- Finds configurations predicted to meet your TTFT and TPOT targets
- Recommends configurations for maximum throughput
Execution time: 5-10 seconds
AIConfigurator will display a comprehensive analysis:
********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
Input Configuration & SLA Target:
Model: QWEN3_32B
Total GPUs: 16
Best Experiment Chosen: disagg at 898.73 tokens/s/gpu (disagg 1.30x better)
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 14,379.60 tokens/s
- Per-GPU Throughput: 898.73 tokens/s/gpu
- Per-User Throughput: 81.24 tokens/s/user
- TTFT: 542.58ms
- TPOT: 12.31ms
- Request Latency: 6684.77ms
----------------------------------------------------------------------------
Key findings:
- Disaggregated serving is 30% better than aggregated for this workload
- Both meet your SLA requirements
- User throughput exceeds your target (81.24 vs 60 tokens/s/user)
AIConfigurator displays an ASCII chart showing all evaluated configurations:
QWEN3_32B Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
┌────────────────────────────────────────────────────────────────────────┐
1300.0┤ •• agg │
│ ff disagg │
│ xx disagg best │
1083.3┤ ffff │
│ f │
│ •••• fffffffffffffx │
866.7┤ •••••• f │
│ •••••• fffff │
650.0┤ •• fff │
│ ••••• f │
│ ••• f │
433.3┤ ••••• fffff │
│ ••••• ff │
│ •••••ff••••• │
216.7┤ fffffffff•••••• │
│ fffff•••••• │
│ fffff ••• │
0.0┤ │
└┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
0 60 120 180 240
tokens/s/gpu_cluster tokens/s/user
How to read this chart:
| Symbol | Meaning |
|---|---|
| •• | Aggregated configurations (dots) |
| ff | Disaggregated configurations |
| xx | Recommended disaggregated config (winner) |
What the axes mean:
- Y-axis (tokens/s/gpu_cluster): GPU efficiency - higher is better for cost
- X-axis (tokens/s/user): User experience - higher means faster responses per user
Key insight: The disagg curve (f's) sits above the agg curve (dots) at most points, indicating disagg achieves better GPU efficiency across different user throughput levels. The gold "x" marks the recommended configuration predicted to meet your SLA.
Aggregated Top Configurations:
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+
| Rank | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | parallel |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+
| 1 | 693.66 | 61.14 | 511.72 | 8673.58 | 192 | 16 (8x2) | tp2pp1 |
| 2 | 622.83 | 67.20 | 584.68 | 8010.77 | 160 | 16 (4x4) | tp4pp1 |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+
Disaggregated Top Configurations:
+------+--------------+---------------+--------+-----------------+-------------+-------------------+------------+-------+------------+-------+
| Rank | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | (p)workers | (p)bs | (d)workers | (d)bs |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+------------+-------+------------+-------+
| 1 | 898.73 | 81.24 | 542.58 | 6684.77 | 192 | 16 (10x1+3x2) | 10 | 1 | 3 | 64 |
| 2 | 746.33 | 100.63 | 542.58 | 5501.64 | 136 | 16 (4x1+1x4) | 4 | 1 | 1 | 68 |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+------------+-------+------------+-------+
Interpretation:
- Recommended disagg config uses 10 prefill workers (TP1 each) + 3 decode workers (TP2 each), i.e., 10×1 + 3×2 = 16 GPUs (the "10x1+3x2" notation in the table)
- Prefill workers have batch size 1 (tuned for latency)
- Decode workers have batch size 64 (tuned for throughput)
- Concurrency of 192 recommended for maximum utilization
- Request latency (end-to-end) is ~6.7 seconds for 500 output tokens
AIConfigurator creates a structured output directory:
ls -R ./dynamo-configs/
dynamo-configs/QWEN3_32B_isl4000_osl500_ttft600_tpot16_*/
├── agg/
│ ├── pareto.csv # All aggregated configs tested
│ ├── best_config_topn.csv # Top N aggregated configs
│ ├── config.yaml # AIC task configuration
│ └── top1/
│ ├── k8s_deploy.yaml # Ready-to-deploy DGD
│ ├── generator_config.yaml # Config used to generate files
│ └── run_0.sh # Direct deployment script
└── disagg/
├── pareto.csv # All disaggregated configs tested
├── best_config_topn.csv # Top N disaggregated configs
├── config.yaml
└── top1/
├── k8s_deploy.yaml # Ready-to-deploy DGD
├── prefill_config.yaml # Prefill engine config
├── decode_config.yaml # Decode engine config
├── generator_config.yaml # Config used to generate files
├── run_0.sh # Prefill worker deployment script
└── run_1.sh # Decode worker deployment script
The pareto.csv files contain every configuration AIConfigurator evaluated:
head -3 ./dynamo-configs/QWEN3_32B_*/disagg/pareto.csv
index,model,isl,osl,ttft,tpot,tokens/s/gpu,tokens/s/user,concurrency,(p)workers,(d)workers,...
0,QWEN3_32B,4000,500,547.98,10.22,878.32,97.85,144,8,2,...
1,QWEN3_32B,4000,500,547.98,11.96,878.32,142.56,88,12,1,...
This means you can:
- Filter by different SLA thresholds programmatically (see the sketch after this list)
- Compare trade-offs across the full configuration space
- Generate custom visualizations without re-running AIC
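For example, a quick command-line filter can pull out only the rows that satisfy a stricter SLA. This is a minimal sketch that assumes the column order shown in the header above (ttft as the 5th field, tpot as the 6th); adjust the indices if your pareto.csv layout differs:
# Keep the header plus rows with predicted TTFT < 600 ms and TPOT < 16.67 ms
awk -F, 'NR==1 || ($5 < 600 && $6 < 16.67)' \
  ./dynamo-configs/QWEN3_32B_*/disagg/pareto.csv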
Prerequisite: Your Kubernetes cluster must have the Dynamo platform installed. If you haven't set this up yet, follow the Dynamo Kubernetes Installation Guide.
The generated k8s_deploy.yaml is ready to apply directly:
# Review the configuration
cat ./dynamo-configs/QWEN3_32B_*/disagg/top1/k8s_deploy.yaml
# Deploy to your cluster
kubectl apply -f ./dynamo-configs/QWEN3_32B_*/disagg/top1/k8s_deploy.yaml
Before deploying, you may need to:
- Verify the container image:
  - Default (v0.5.0): nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.7.0
  - This should match your cluster's Dynamo version
- Update the namespace:
  - Default: None (needs to be set)
  - Change to your target namespace (e.g., dynamo)
- Configure model storage:
  - Ensure the model is available in your cluster (shared PVC, S3, or download on-demand)
  - Add volume mounts if using a shared PVC
  - For one approach to optimized caching, see Model Caching with Fluid
- Add a HuggingFace token (if needed):
  - The config references hf-token-secret
  - Create this secret in your namespace if the model requires authentication (a minimal example follows below)
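If the model requires an authenticated download from HuggingFace, one way to create the referenced secret is sketched below. The key name HF_TOKEN and the dynamo namespace are assumptions; check the generated k8s_deploy.yaml for the exact key and namespace your deployment expects.
# Create the secret referenced by the generated config
# (key name HF_TOKEN and namespace "dynamo" are assumptions; verify against k8s_deploy.yaml)
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-huggingface-token> \
  -n dynamo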
# Check deployment status
kubectl get dynamographdeployment -n dynamo
# View pods
kubectl get pods -n dynamo
# Check logs
kubectl logs -n dynamo <frontend-pod-name>
Use AIPerf to validate that deployed performance matches AIC predictions:
# Port-forward to the frontend service
kubectl port-forward -n dynamo svc/trtllm-disagg-frontend 8000:8000 &
# Run benchmark with AIC-recommended concurrency
aiperf profile \
--model Qwen/Qwen3-32B-FP8 \
--tokenizer Qwen/Qwen3-32B-FP8 \
--endpoint-type chat \
--url http://localhost:8000 \
--streaming \
--synthetic-input-tokens-mean 4000 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 500 \
--output-tokens-stddev 0 \
--concurrency 192 \
--request-count 1000 \
--warmup-request-count 100 \
--artifact-dir ./benchmark-results \
-v
What to check:
- TTFT should be close to 542.58ms (AIC prediction)
- Throughput should approach 898.73 tokens/s/gpu
- Request latency should be around 6.7 seconds for 500 output tokens
- Note: Real-world results may differ by 10-20% due to system overhead
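To quantify how far a measured number drifts from the prediction, a one-line calculation is enough. The measured value below is a placeholder, not a real benchmark result; substitute the TTFT that AIPerf reports:
# Percent deviation of a measured TTFT from the AIC prediction (542.58 ms)
# meas=600.0 is a placeholder; replace it with your AIPerf measurement
awk -v pred=542.58 -v meas=600.0 \
  'BEGIN { printf "TTFT deviation: %+.1f%%\n", 100 * (meas - pred) / pred }'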
You can quickly compare different deployment scenarios:
# Stricter TTFT SLA (300 ms)
aiconfigurator cli default \
--model QWEN3_32B \
--total_gpus 16 \
--system h200_sxm \
--isl 4000 \
--osl 500 \
--ttft 300 \
--tpot 16.67 \
--save_dir ./configs-strict-sla

# Scale out to 32 GPUs
aiconfigurator cli default \
--model QWEN3_32B \
--total_gpus 32 \
--system h200_sxm \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--save_dir ./configs-32gpu

# Larger model: QWEN3_480B on 32 GPUs
aiconfigurator cli default \
--model QWEN3_480B \
--total_gpus 32 \
--system h200_sxm \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--save_dir ./configs-480b
For more complex comparisons across different frameworks (TensorRT-LLM, vLLM, SGLang), use AIConfigurator's experiment mode with a YAML configuration file.
The default command compares aggregated vs disaggregated for a single framework. Experiment mode allows you to:
- Compare multiple frameworks (TensorRT-LLM vs vLLM vs SGLang)
- Run multiple experiments in a single command
- Fine-tune parallelism settings (TP, PP, DP, MoE EP)
- Configure quantization modes (FP8, FP4, etc.)
- Control advanced tuning parameters
Create a YAML file framework-comparison.yaml:
# Framework comparison for Qwen3-32B on 16x H200
exps:
- trtllm_disagg
- vllm_disagg
- sglang_disagg
trtllm_disagg:
mode: patch
serving_mode: disagg
model_name: QWEN3_32B
total_gpus: 16
system_name: h200_sxm
backend_name: trtllm
backend_version: "1.2.0rc2"
isl: 4000
osl: 500
ttft: 600.0
tpot: 16.67
vllm_disagg:
mode: patch
serving_mode: disagg
model_name: QWEN3_32B
total_gpus: 16
system_name: h200_sxm
backend_name: vllm
isl: 4000
osl: 500
ttft: 600.0
tpot: 16.67
sglang_disagg:
mode: patch
serving_mode: disagg
model_name: QWEN3_32B
total_gpus: 16
system_name: h200_sxm
backend_name: sglang
isl: 4000
osl: 500
ttft: 600.0
tpot: 16.67
Run the comparison:
aiconfigurator cli exp \
--yaml_path framework-comparison.yaml \
--save_dir ./framework-comparison-resultsAIConfigurator evaluates all three frameworks and shows a combined Pareto frontier:
********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
Input Configuration & SLA Target:
Model: QWEN3_32B
Total GPUs: 16
Best Experiment Chosen: vllm_disagg at 904.95 tokens/s/gpu
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 14,479.20 tokens/s
- Per-GPU Throughput: 904.95 tokens/s/gpu
- Per-User Throughput: 66.74 tokens/s/user
- TTFT: 447.50ms
- TPOT: 14.98ms
The Pareto chart shows all three frameworks together:
QWEN3_32B Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
┌────────────────────────────────────────────────────────────────────────┐
1400.0┤ •• trtllm_disagg │
│ ff vllm_disagg │
│ hh sglang_disagg │
1166.7┤ xx vllm_disagg best │
│ •••f │
│ f │
933.3┤ fffffffffxf••• │
│ ff • │
700.0┤ ffff••• │
│ f ••• │
│ f • │
466.7┤ f ••••• │
│ f •• │
│ hhhhh f •• │
233.3┤ hhhhhhhh ff •• │
│ hhhhhhhhhh ff ••••••• │
│ hhhhhhhhhhhhfffffffff •••••••••• │
0.0┤ hhhhhhhhh │
└┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
0 60 120 180 240
| Framework | Best tokens/s/gpu | TTFT | Optimal Architecture |
|---|---|---|---|
| vLLM | 904.95 | 447.50ms | 4 prefill + 1 decode (TP4) × 2 replicas |
| TensorRT-LLM | 898.73 | 542.58ms | 10 prefill + 3 decode (TP2) |
| SGLang | 172.07 | 589.69ms | 4 prefill + 3 decode (TP4) |
Key Insights:
- vLLM wins by a small margin (0.7%) - For this specific workload (Qwen3-32B, ISL=4000, OSL=500), vLLM achieves slightly higher throughput than TensorRT-LLM.
- vLLM has better TTFT - 447.50ms vs 542.58ms gives vLLM a 17% latency advantage.
- SGLang results: take with a grain of salt - SGLang modeling was recently added to AIConfigurator and we're still improving accuracy. The predictions here may not reflect actual SGLang performance.
- Architecture differs by framework - vLLM prefers fewer, larger workers while TensorRT-LLM prefers more, smaller workers.
How to Use This Information:
- If TTFT is critical: Choose vLLM (447ms vs 542ms)
- If throughput is critical: Both vLLM and TensorRT-LLM are competitive
- If you need specific features: TensorRT-LLM offers more quantization options
This comparison completes in ~10 seconds - far faster than deploying and benchmarking each framework manually!
Each experiment can include detailed worker configurations:
trtllm_disagg_advanced:
mode: patch
serving_mode: disagg
model_name: QWEN3_32B
total_gpus: 16
system_name: h200_sxm
backend_name: trtllm
backend_version: "1.2.0rc2"
isl: 4000
osl: 500
ttft: 600.0
tpot: 16.67
config:
prefill_worker_config:
tp_list: [1, 2]
pp_list: [1]
gemm_quant_mode: fp8_block
kvcache_quant_mode: fp8
decode_worker_config:
tp_list: [1, 2, 4]
pp_list: [1]
gemm_quant_mode: fp8_block
kvcache_quant_mode: fp8
replica_config:
max_prefill_worker: 16
max_decode_worker: 8
advanced_tuning_config:
prefill_max_batch_size: 1
decode_max_batch_size: 128
| Field | Description | Example Values |
|---|---|---|
| backend_name | Inference framework | trtllm, vllm, sglang |
| serving_mode | Deployment architecture | agg, disagg |
| tp_list | Tensor parallelism options | [1, 2, 4, 8] |
| pp_list | Pipeline parallelism options | [1, 2] |
| gemm_quant_mode | Matrix multiply quantization | fp8_block, fp16 |
| kvcache_quant_mode | KV cache quantization | fp8, float16 |
| moe_ep_list | MoE expert parallelism | [1, 2, 4, 8] |
Common questions experiment mode helps answer:
- Framework Selection: "Which framework is recommended for my workload?"
- Quantization Comparison: "Does FP8 vs FP16 KV cache affect my SLA?"
- Parallelism Exploration: "What TP/PP combination should I start with?"
- MoE Optimization: "How should I configure expert parallelism?"
What AIConfigurator Solved:
- Configuration Complexity: Instead of manually testing dozens of TP/PP/replica combinations, AIC recommends a starting configuration in seconds
- SLA Compliance: Automatically filtered to configurations that meet your latency requirements
- Agg vs Disagg Decision: Quantified that disaggregated serving provides 30% better throughput for this workload
- Production-Ready Output: Generated deployment-ready Kubernetes YAML files
Time Saved:
- Manual exploration: Days to weeks
- AIConfigurator: 15 seconds
- AIConfigurator Documentation: https://github.com/ai-dynamo/aiconfigurator
- Dynamo Kubernetes Deployment Guide: https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation_guide.md
- SLA-Driven Profiling: https://github.com/ai-dynamo/dynamo/blob/main/docs/benchmarks/sla_driven_profiling.md
This walkthrough demonstrates AIConfigurator's ability to simplify complex deployment decisions for disaggregated LLM serving. By automating configuration search and providing data-driven recommendations, AIConfigurator reduces deployment time from days to seconds while ensuring SLA compliance.
For questions or support: Join the Dynamo Discord or file an issue on GitHub.