AIConfigurator Walkthrough: Finding Optimal LLM Deployment Configurations

AIConfigurator: Fast-Track Your LLM Deployment on NVIDIA Dynamo

What is NVIDIA Dynamo?

NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.

Key capabilities:

  • Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
  • KV-aware routing — Routes requests to workers with the highest cache hit rate
  • KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
  • Dynamic scaling — Adjusts worker counts based on real-time demand
  • Multi-backend support — Works with TensorRT-LLM, vLLM, and SGLang

But Dynamo's flexibility creates a new challenge: how do you configure it optimally?


What is AIConfigurator?

AIConfigurator is a performance optimization tool that recommends Dynamo deployment configurations in seconds. It uses performance models calibrated on real hardware to simulate thousands of configurations and identify promising setups for your workload and SLA requirements.

What it determines:

| Decision | Traditional Approach | With AIConfigurator |
|---|---|---|
| Aggregated vs. disaggregated? | Trial and error (days) | Instant recommendation |
| How many prefill workers? | Guesswork | Recommended count |
| How many decode workers? | Guesswork | Recommended count |
| What TP/PP sizes? | Manual testing | Recommended parallelism |
| What batch sizes? | Benchmarking | SLA-aware sizing |

Value proposition:

  • Speed: 5-10 seconds vs. days of manual testing
  • Informed starting point: Recommendations calibrated on real hardware profiling
  • Deployment-ready: Generates Kubernetes YAML files you can apply directly
  • Comparison: Shows aggregated vs. disaggregated performance side-by-side

The Scenario

You want to deploy Qwen3-32B-FP8 on 2 nodes of H200 GPUs (16 GPUs total). You need to meet a Time To First Token (TTFT) SLA of 600ms while maximizing throughput.

Questions you face:

  • Should I use aggregated or disaggregated serving?
  • How many prefill workers vs decode workers?
  • What tensor parallel (TP) size should I use?
  • What batch sizes will meet my SLA?
  • How many replicas do I need?

Manually testing all combinations would take days. AIConfigurator solves this in seconds.
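
To see why, here is a rough back-of-the-envelope count of even a coarse manual search grid. The option counts are illustrative assumptions for this scenario, not AIConfigurator's actual search space:

# Back-of-the-envelope size of a coarse manual search grid.
# All counts below are illustrative assumptions, not AIConfigurator's actual space.
serving_modes = 2          # aggregated vs. disaggregated
tp_sizes = 4               # e.g. TP = 1, 2, 4, 8
pp_sizes = 2               # e.g. PP = 1, 2
worker_splits = 8          # prefill/decode worker counts that fit on 16 GPUs
batch_sizes = 6            # candidate max batch sizes

combinations = serving_modes * tp_sizes * pp_sizes * worker_splits * batch_sizes
minutes_per_benchmark = 30  # deploy + warm up + measure one configuration

print(f"{combinations} configurations "
      f"= about {combinations * minutes_per_benchmark / 60:.0f} hours of benchmarking")
# -> 768 configurations = about 384 hours of benchmarking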


Prerequisites

# Install AIConfigurator from source (recommended for latest features)
git clone https://github.com/ai-dynamo/aiconfigurator.git
cd aiconfigurator
pip3 install -e .

# Verify installation
aiconfigurator --help

Note: pip3 install aiconfigurator is also available for stable releases. This guide uses the source build to demonstrate the latest features from main.


Step 1: Define Your Requirements

Workload characteristics:

  • Model: Qwen3-32B-FP8
  • Input sequence length (ISL): 4000 tokens
  • Output sequence length (OSL): 500 tokens
  • Available GPUs: 16 H200s (2 nodes × 8 GPUs)
  • TTFT target: 600ms
  • Target throughput: 60 tokens/s/user (TPOT ≈ 16.67ms)
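
The TPOT target above is simply the inverse of the per-user token rate; a quick sanity check:

# TPOT (time per output token) is the reciprocal of per-user throughput.
target_tokens_per_sec_per_user = 60
tpot_ms = 1000 / target_tokens_per_sec_per_user
print(f"TPOT target = {tpot_ms:.2f} ms")  # 16.67 ms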

Step 2: Run AIConfigurator

aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 16 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./dynamo-configs

Note: You can also use --hf_id Qwen/Qwen3-32B-FP8 to specify models by their HuggingFace ID.

What happens:

  • AIConfigurator evaluates hundreds of possible configurations
  • Tests both aggregated and disaggregated serving modes
  • Finds configurations predicted to meet your TTFT and TPOT targets
  • Recommends configurations for maximum throughput

Execution time: 5-10 seconds


Step 3: Review the Results

AIConfigurator will display a comprehensive analysis:

********************************************************************************
*                     Dynamo aiconfigurator Final Results                      *
********************************************************************************
  Input Configuration & SLA Target:
    Model: QWEN3_32B
    Total GPUs: 16
    Best Experiment Chosen: disagg at 898.73 tokens/s/gpu (disagg 1.30x better)
  ----------------------------------------------------------------------------
  Overall Best Configuration:
    - Best Throughput: 14,379.60 tokens/s
    - Per-GPU Throughput: 898.73 tokens/s/gpu
    - Per-User Throughput: 81.24 tokens/s/user
    - TTFT: 542.58ms
    - TPOT: 12.31ms
    - Request Latency: 6684.77ms
  ----------------------------------------------------------------------------

Key findings:

  • Disaggregated serving is predicted to be 30% better than aggregated for this workload
  • Both the top aggregated and top disaggregated configurations meet the SLA targets
  • Per-user throughput exceeds the target (81.24 vs. 60 tokens/s/user)

The Pareto Frontier Visualization

AIConfigurator displays an ASCII chart showing all evaluated configurations:

           QWEN3_32B Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user     
      ┌────────────────────────────────────────────────────────────────────────┐
1300.0┤ •• agg                                                                 │
      │ ff disagg                                                              │
      │ xx disagg best                                                         │
1083.3┤      ffff                                                              │
      │          f                                                             │
      │  ••••     fffffffffffffx                                               │
 866.7┤     ••••••              f                                              │
      │           ••••••         fffff                                         │
 650.0┤                 ••            fff                                      │
      │                   •••••         f                                      │
      │                        •••      f                                      │
 433.3┤                          •••••  fffff                                  │
      │                               •••••  ff                                │
      │                                   •••••ff•••••                         │
 216.7┤                                          fffffffff••••••               │
      │                                                   fffff••••••          │
      │                                                        fffff •••       │
   0.0┤                                                                        │
      └┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
       0                60                120              180              240 
tokens/s/gpu_cluster                 tokens/s/user                              

How to read this chart:

| Symbol | Meaning |
|---|---|
| •• | Aggregated configurations (dots) |
| ff | Disaggregated configurations |
| xx | Recommended disaggregated config (winner) |

What the axes mean:

  • Y-axis (tokens/s/gpu_cluster): GPU efficiency - higher is better for cost
  • X-axis (tokens/s/user): User experience - higher means faster responses per user

Key insight: The disagg curve (f's) sits above the agg curve (dots) at most points, indicating disagg achieves better GPU efficiency across different user throughput levels. The gold "x" marks the recommended configuration predicted to meet your SLA.

Understanding the Output Tables

Aggregated Top Configurations:

+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+
| Rank | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | parallel |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+
|  1   |    693.66    |     61.14     | 511.72 |     8673.58     |     192     |    16 (8x2)       |  tp2pp1  |
|  2   |    622.83    |     67.20     | 584.68 |     8010.77     |     160     |    16 (4x4)       |  tp4pp1  |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+

Disaggregated Top Configurations:

+------+--------------+---------------+--------+-----------------+-------------+-------------------+------------+-------+------------+-------+
| Rank | tokens/s/gpu | tokens/s/user |  TTFT  | request_latency | concurrency | total_gpus (used) | (p)workers | (p)bs | (d)workers | (d)bs |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+------------+-------+------------+-------+
|  1   |    898.73    |     81.24     | 542.58 |     6684.77     |     192     |    16 (10x1+3x2)  |     10     |   1   |     3      |   64  |
|  2   |    746.33    |    100.63     | 542.58 |     5501.64     |     136     |    16 (4x1+1x4)   |     4      |   1   |     1      |   68  |
+------+--------------+---------------+--------+-----------------+-------------+-------------------+------------+-------+------------+-------+

Interpretation:

  • Recommended disagg config uses 10 prefill workers (TP1 each) + 3 decode workers (TP2 each)
  • Prefill workers have batch size 1 (tuned for latency)
  • Decode workers have batch size 64 (tuned for throughput)
  • Concurrency of 192 recommended for maximum utilization
  • Request latency (end-to-end) is ~6.7 seconds for 500 output tokens
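
The layout string in the total_gpus (used) column can be sanity-checked with simple arithmetic; the reading of the rank-2 row below is my interpretation of the notation:

# GPU accounting for the recommended disaggregated config (rank 1):
# "16 (10x1+3x2)" = 10 prefill workers at TP1 plus 3 decode workers at TP2.
prefill_workers, prefill_tp = 10, 1
decode_workers, decode_tp = 3, 2
total_gpus = prefill_workers * prefill_tp + decode_workers * decode_tp
print(total_gpus)  # 16

# Rank 2 ("16 (4x1+1x4)") describes an 8-GPU layout (4x1 + 1x4),
# presumably replicated twice to fill the same 16 GPUs.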

Step 4: Explore the Generated Files

AIConfigurator creates a structured output directory:

ls -R ./dynamo-configs/

dynamo-configs/QWEN3_32B_isl4000_osl500_ttft600_tpot16_*/
├── agg/
│   ├── pareto.csv              # All aggregated configs tested
│   ├── best_config_topn.csv    # Top N aggregated configs
│   ├── config.yaml             # AIC task configuration
│   └── top1/
│       ├── k8s_deploy.yaml         # Ready-to-deploy DGD
│       ├── generator_config.yaml   # Config used to generate files
│       └── run_0.sh                # Direct deployment script
└── disagg/
    ├── pareto.csv              # All disaggregated configs tested
    ├── best_config_topn.csv    # Top N disaggregated configs
    ├── config.yaml
    └── top1/
        ├── k8s_deploy.yaml         # Ready-to-deploy DGD
        ├── prefill_config.yaml     # Prefill engine config
        ├── decode_config.yaml      # Decode engine config
        ├── generator_config.yaml   # Config used to generate files
        ├── run_0.sh                # Prefill worker deployment script
        └── run_1.sh                # Decode worker deployment script

Understanding the Pareto CSV

The pareto.csv files contain every configuration AIConfigurator evaluated:

head -3 ./dynamo-configs/QWEN3_32B_*/disagg/pareto.csv

index,model,isl,osl,ttft,tpot,tokens/s/gpu,tokens/s/user,concurrency,(p)workers,(d)workers,...
0,QWEN3_32B,4000,500,547.98,10.22,878.32,97.85,144,8,2,...
1,QWEN3_32B,4000,500,547.98,11.96,878.32,142.56,88,12,1,...

This means you can:

  • Filter by different SLA thresholds programmatically
  • Compare trade-offs across the full configuration space
  • Generate custom visualizations without re-running AIC
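
For example, a short pandas script (assuming pandas and matplotlib are installed; column names as shown in the CSV header above) can re-filter and re-plot the same data against a different SLA without re-running AIConfigurator:

import glob
import pandas as pd
import matplotlib.pyplot as plt

# Load every disaggregated configuration AIConfigurator evaluated.
path = glob.glob("./dynamo-configs/QWEN3_32B_*/disagg/pareto.csv")[0]
df = pd.read_csv(path)

# Re-filter against a stricter SLA (500ms TTFT here, purely as an example).
sla = df[(df["ttft"] <= 500) & (df["tpot"] <= 16.67)]
best = sla.sort_values("tokens/s/gpu", ascending=False).head(5)
print(best[["tokens/s/gpu", "tokens/s/user", "ttft", "tpot", "(p)workers", "(d)workers"]])

# Custom visualization of the efficiency vs. user-experience trade-off.
plt.scatter(df["tokens/s/user"], df["tokens/s/gpu"], label="all disagg configs")
plt.scatter(sla["tokens/s/user"], sla["tokens/s/gpu"], label="meets stricter SLA")
plt.xlabel("tokens/s/user")
plt.ylabel("tokens/s/gpu")
plt.legend()
plt.show()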

Step 5: Deploy the Recommended Configuration

Prerequisite: Your Kubernetes cluster must have the Dynamo platform installed. If you haven't set this up yet, follow the Dynamo Kubernetes Installation Guide.

The generated k8s_deploy.yaml is ready to apply directly:

# Review the configuration
cat ./dynamo-configs/QWEN3_32B_*/disagg/top1/k8s_deploy.yaml

# Deploy to your cluster
kubectl apply -f ./dynamo-configs/QWEN3_32B_*/disagg/top1/k8s_deploy.yaml

Before deploying, you may need to:

  1. Verify the container image:

    • Default (v0.5.0): nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.7.0
    • This should match your cluster's Dynamo version
  2. Update the namespace (a scripted example follows this list):

    • Default: None (needs to be set)
    • Change to your target namespace (e.g., dynamo)
  3. Configure model storage:

    • Ensure model is available in your cluster (shared PVC, S3, or download on-demand)
    • Add volume mounts if using a shared PVC
    • For one approach to optimized caching, see Model Caching with Fluid
  4. Add HuggingFace token (if needed):

    • The config references hf-token-secret
    • Create this secret in your namespace if the model requires authentication
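
A minimal sketch for scripting edit 2 with PyYAML, assuming the generated file is a standard Kubernetes manifest with a metadata block; the exact DynamoGraphDeployment structure may differ by Dynamo version, so review image tags, volume mounts, and the hf-token-secret reference by hand:

import glob
import yaml  # pip install pyyaml

path = glob.glob("./dynamo-configs/QWEN3_32B_*/disagg/top1/k8s_deploy.yaml")[0]
with open(path) as f:
    docs = list(yaml.safe_load_all(f))

for doc in docs:
    if doc is None:
        continue
    # Every Kubernetes resource carries metadata; set the target namespace there.
    doc.setdefault("metadata", {})["namespace"] = "dynamo"

with open(path, "w") as f:
    yaml.safe_dump_all(docs, f, sort_keys=False)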

Step 6: Verify Deployment

# Check deployment status
kubectl get dynamographdeployment -n dynamo

# View pods
kubectl get pods -n dynamo

# Check logs
kubectl logs -n dynamo <frontend-pod-name>

Step 7: Benchmark Your Deployment

Use AIPerf to validate that deployed performance matches AIC predictions:

# Port-forward to the frontend service
kubectl port-forward -n dynamo svc/trtllm-disagg-frontend 8000:8000 &

# Run benchmark with AIC-recommended concurrency
aiperf profile \
  --model Qwen/Qwen3-32B-FP8 \
  --tokenizer Qwen/Qwen3-32B-FP8 \
  --endpoint-type chat \
  --url http://localhost:8000 \
  --streaming \
  --synthetic-input-tokens-mean 4000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 500 \
  --output-tokens-stddev 0 \
  --concurrency 192 \
  --request-count 1000 \
  --warmup-request-count 100 \
  --artifact-dir ./benchmark-results \
  -v

What to check:

  • TTFT should be close to 542.58ms (AIC prediction)
  • Throughput should approach 898.73 tokens/s/gpu
  • Request latency should be around 6.7 seconds for 500 output tokens
  • Note: Real-world results may differ by 10-20% due to system overhead
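
Once AIPerf finishes, a quick deviation check against the AIC predictions helps decide whether the gap is within expected overhead; the measured numbers below are placeholders to replace with your benchmark results:

# Predicted values come from the AIConfigurator output above;
# replace the measured values with your AIPerf results.
predicted = {"ttft_ms": 542.58, "tokens_per_s_per_gpu": 898.73}
measured = {"ttft_ms": 590.0, "tokens_per_s_per_gpu": 845.0}  # placeholders

for metric, pred in predicted.items():
    meas = measured[metric]
    deviation = (meas - pred) / pred * 100
    flag = "OK" if abs(deviation) <= 20 else "investigate"
    print(f"{metric}: predicted {pred:.2f}, measured {meas:.2f} ({deviation:+.1f}%) -> {flag}")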

Variations

You can quickly compare different deployment scenarios:

Scenario A: Stricter SLA (TTFT = 300ms)

aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 16 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 300 \
  --tpot 16.67 \
  --save_dir ./configs-strict-sla

Scenario B: More GPUs (32 GPUs)

aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 32 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./configs-32gpu

Scenario C: Different Model (Qwen3-480B)

aiconfigurator cli default \
  --model QWEN3_480B \
  --total_gpus 32 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --save_dir ./configs-480b
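
If you plan to run several of these scenarios, a small wrapper script keeps the sweeps reproducible; this is just a sketch that shells out to the same CLI flags shown above:

import subprocess

# Each scenario maps to the same CLI flags used in Scenarios A-B above.
scenarios = {
    "strict-sla": {"ttft": "300", "total_gpus": "16", "save_dir": "./configs-strict-sla"},
    "32gpu":      {"ttft": "600", "total_gpus": "32", "save_dir": "./configs-32gpu"},
}

for name, s in scenarios.items():
    cmd = [
        "aiconfigurator", "cli", "default",
        "--model", "QWEN3_32B",
        "--total_gpus", s["total_gpus"],
        "--system", "h200_sxm",
        "--isl", "4000", "--osl", "500",
        "--ttft", s["ttft"], "--tpot", "16.67",
        "--save_dir", s["save_dir"],
    ]
    print(f"Running scenario: {name}")
    subprocess.run(cmd, check=True)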

Advanced: Multi-Framework Comparison with Experiment Mode

For more complex comparisons across different frameworks (TensorRT-LLM, vLLM, SGLang), use AIConfigurator's experiment mode with a YAML configuration file.

Why Use Experiment Mode?

The default command compares aggregated vs disaggregated for a single framework. Experiment mode allows you to:

  • Compare multiple frameworks (TensorRT-LLM vs vLLM vs SGLang)
  • Run multiple experiments in a single command
  • Fine-tune parallelism settings (TP, PP, DP, MoE EP)
  • Configure quantization modes (FP8, FP4, etc.)
  • Control advanced tuning parameters

Example: Comparing TRT-LLM vs vLLM vs SGLang

Create a YAML file framework-comparison.yaml:

# Framework comparison for Qwen3-32B on 16x H200
exps:
  - trtllm_disagg
  - vllm_disagg
  - sglang_disagg

trtllm_disagg:
  mode: patch
  serving_mode: disagg
  model_name: QWEN3_32B
  total_gpus: 16
  system_name: h200_sxm
  backend_name: trtllm
  backend_version: "1.2.0rc2"
  isl: 4000
  osl: 500
  ttft: 600.0
  tpot: 16.67

vllm_disagg:
  mode: patch
  serving_mode: disagg
  model_name: QWEN3_32B
  total_gpus: 16
  system_name: h200_sxm
  backend_name: vllm
  isl: 4000
  osl: 500
  ttft: 600.0
  tpot: 16.67

sglang_disagg:
  mode: patch
  serving_mode: disagg
  model_name: QWEN3_32B
  total_gpus: 16
  system_name: h200_sxm
  backend_name: sglang
  isl: 4000
  osl: 500
  ttft: 600.0
  tpot: 16.67

Run the comparison:

aiconfigurator cli exp \
  --yaml_path framework-comparison.yaml \
  --save_dir ./framework-comparison-results

Example Output: Framework Comparison Results

AIConfigurator evaluates all three frameworks and shows a combined Pareto frontier:

********************************************************************************
*                     Dynamo aiconfigurator Final Results                      *
********************************************************************************
  Input Configuration & SLA Target:
    Model: QWEN3_32B
    Total GPUs: 16
    Best Experiment Chosen: vllm_disagg at 904.95 tokens/s/gpu
  ----------------------------------------------------------------------------
  Overall Best Configuration:
    - Best Throughput: 14,479.20 tokens/s
    - Per-GPU Throughput: 904.95 tokens/s/gpu
    - Per-User Throughput: 66.74 tokens/s/user
    - TTFT: 447.50ms
    - TPOT: 14.98ms

The Pareto chart shows all three frameworks together:

           QWEN3_32B Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user     
      ┌────────────────────────────────────────────────────────────────────────┐
1400.0┤ •• trtllm_disagg                                                       │
      │ ff vllm_disagg                                                         │
      │ hh sglang_disagg                                                       │
1166.7┤ xx vllm_disagg best                                                    │
      │      •••f                                                              │
      │          f                                                             │
 933.3┤           fffffffffxf•••                                               │
      │                      ff •                                              │
 700.0┤                        ffff•••                                         │
      │                           f   •••                                      │
      │                           f     •                                      │
 466.7┤                           f     •••••                                  │
      │                           f          ••                                │
      │ hhhhh                      f           ••                              │
 233.3┤     hhhhhhhh                ff           ••                            │
      │             hhhhhhhhhh        ff           •••••••                     │
      │                       hhhhhhhhhhhhfffffffff       ••••••••••           │
   0.0┤                                   hhhhhhhhh                            │
      └┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
       0                60                120              180              240 

Framework Comparison Summary

| Framework | Best tokens/s/gpu | TTFT | Optimal Architecture |
|---|---|---|---|
| vLLM | 904.95 | 447.50ms | 4 prefill + 1 decode (TP4) × 2 replicas |
| TensorRT-LLM | 898.73 | 542.58ms | 10 prefill + 3 decode (TP2) |
| SGLang | 172.07 | 589.69ms | 4 prefill + 3 decode (TP4) |

Key Insights:

  1. vLLM wins by a small margin (0.7%) - For this specific workload (Qwen3-32B, ISL=4000, OSL=500), vLLM achieves slightly higher throughput than TensorRT-LLM.

  2. vLLM has better TTFT - 447.50ms vs 542.58ms gives vLLM a 17% latency advantage.

  3. SGLang results: take with a grain of salt - SGLang modeling was recently added to AIConfigurator and we're still improving accuracy. The predictions here may not reflect actual SGLang performance.

  4. Architecture differs by framework - vLLM prefers fewer, larger workers while TensorRT-LLM prefers more, smaller workers.

How to Use This Information:

  • If TTFT is critical: Choose vLLM (447ms vs 542ms)
  • If throughput is critical: Both vLLM and TensorRT-LLM are competitive
  • If you need specific features: TensorRT-LLM offers more quantization options

This comparison completes in ~10 seconds - far faster than deploying and benchmarking each framework manually!

Advanced Configuration Options

Each experiment can include detailed worker configurations:

trtllm_disagg_advanced:
  mode: patch
  serving_mode: disagg
  model_name: QWEN3_32B
  total_gpus: 16
  system_name: h200_sxm
  backend_name: trtllm
  backend_version: "1.2.0rc2"
  isl: 4000
  osl: 500
  ttft: 600.0
  tpot: 16.67
  config:
    prefill_worker_config:
      tp_list: [1, 2]
      pp_list: [1]
      gemm_quant_mode: fp8_block
      kvcache_quant_mode: fp8
    decode_worker_config:
      tp_list: [1, 2, 4]
      pp_list: [1]
      gemm_quant_mode: fp8_block
      kvcache_quant_mode: fp8
    replica_config:
      max_prefill_worker: 16
      max_decode_worker: 8
    advanced_tuning_config:
      prefill_max_batch_size: 1
      decode_max_batch_size: 128

Key Configuration Fields

| Field | Description | Example Values |
|---|---|---|
| backend_name | Inference framework | trtllm, vllm, sglang |
| serving_mode | Deployment architecture | agg, disagg |
| tp_list | Tensor parallelism options | [1, 2, 4, 8] |
| pp_list | Pipeline parallelism options | [1, 2] |
| gemm_quant_mode | Matrix multiply quantization | fp8_block, fp16 |
| kvcache_quant_mode | KV cache quantization | fp8, float16 |
| moe_ep_list | MoE expert parallelism | [1, 2, 4, 8] |

Use Cases for Experiment Mode

  1. Framework Selection: "Which framework is recommended for my workload?"
  2. Quantization Comparison: "Does FP8 vs FP16 KV cache affect my SLA?"
  3. Parallelism Exploration: "What TP/PP combination should I start with?"
  4. MoE Optimization: "How should I configure expert parallelism?"

Key Takeaways

What AIConfigurator Solved:

  1. Configuration Complexity: Instead of manually testing dozens of TP/PP/replica combinations, AIC recommends a starting configuration in seconds

  2. SLA Compliance: Automatically filtered to configurations that meet your latency requirements

  3. Agg vs Disagg Decision: Quantified that disaggregated serving provides 30% better throughput for this workload

  4. Production-Ready Output: Generated deployment-ready Kubernetes YAML files

Time Saved:

  • Manual exploration: Days to weeks
  • AIConfigurator: 15 seconds

Resources


About This Guide

This walkthrough demonstrates AIConfigurator's ability to simplify complex deployment decisions for disaggregated LLM serving. By automating configuration search and providing data-driven recommendations, AIConfigurator reduces configuration time from days to seconds while filtering recommendations against your SLA targets.

For questions or support: Join the Dynamo Discord or file an issue on GitHub.
