Date: December 17, 2024
Model: Qwen/Qwen3-32B-FP8
Cluster: Nebius H200 (16 GPUs)
Framework: TensorRT-LLM via Dynamo
NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.
Key capabilities:
- Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
- KV-aware routing — Routes requests to workers with the highest cache hit rate (see the routing sketch after this list)
- KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
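To make the KV-aware routing idea concrete, here is a minimal Python sketch of prefix-aware worker selection: score each candidate worker by how many leading KV blocks of the request it already holds, then route to the best match. The `Worker` class and block-ID representation are hypothetical illustrations, not Dynamo's actual data structures or API.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """Hypothetical stand-in for a worker's KV-cache state (not Dynamo's API)."""
    name: str
    cached_blocks: set = field(default_factory=set)  # KV block IDs currently held in cache

def matched_prefix_blocks(worker: Worker, request_blocks: list) -> int:
    """Count how many leading KV blocks of the request this worker already holds."""
    hits = 0
    for block_id in request_blocks:
        if block_id not in worker.cached_blocks:
            break
        hits += 1
    return hits

def pick_worker(workers: list, request_blocks: list) -> Worker:
    """Route the request to the worker with the longest cached prefix."""
    return max(workers, key=lambda w: matched_prefix_blocks(w, request_blocks))

# Worker "a" already holds the request's first two blocks, so it wins the route.
workers = [Worker("a", {101, 102}), Worker("b", {555})]
print(pick_worker(workers, [101, 102, 103]).name)  # -> a
```

A production router would weigh this score against current worker load, but the cached-prefix count is the core of the "highest cache hit rate" heuristic.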
We tested AIConfigurator (AIC) version 0.4.0 against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.
Key Finding: AIC predicts that disaggregated serving provides a 7-8% improvement, while the guide reports a 102-148% improvement, a 10-20x difference in expected gains.
Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).
Presentation Date: November 13, 2025
Tool: AIPerf v0.3.0
Inference providers that want to contribute to open source projects are limited in what they can share while respecting user privacy and protecting proprietary prompts. Real LLM inference traces contain sensitive information, yet without realistic traces, benchmarks can't accurately reflect production workloads.
Block hashing converts tokens into cryptographic hash IDs that preserve prefix-matching patterns while protecting content, an approach based on Mooncake AI's work (USENIX FAST'25).
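As a rough illustration of the technique (assuming a fixed block size and SHA-256 chaining, which are our own simplifications rather than the exact published scheme), the sketch below converts a token-ID sequence into chained block hashes so that two traces share hash IDs exactly as far as their prefixes match:

```python
import hashlib

BLOCK_SIZE = 512  # tokens per block; illustrative value, not the published one

def block_hash_ids(token_ids: list) -> list:
    """Convert a token-ID sequence into chained block hashes.

    Each block's hash incorporates the previous block's hash, so two
    traces share hash IDs exactly up to the block where they diverge,
    preserving prefix-cache hit patterns without exposing content.
    """
    ids = []
    prev = b""
    for start in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = prev + b"|" + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).hexdigest()
        ids.append(digest)
        prev = digest.encode()
    return ids

# Two traces with an identical 512-token prefix yield the same first hash ID.
a = block_hash_ids(list(range(1024)))
b = block_hash_ids(list(range(512)) + list(range(900, 1412)))
assert a[0] == b[0] and a[1] != b[1]
```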
Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, tested with large context windows (8K input tokens, 1K output tokens).
Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled
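For context, a single request of the shape used in this benchmark can be reproduced with any OpenAI-compatible client. The base URL and environment variable names below are placeholders, and the actual benchmark was driven by AIPerf rather than this script:

```python
import os
from openai import OpenAI

# Placeholder endpoint and credentials; substitute the actual inference API details.
client = OpenAI(
    base_url=os.environ["INFERENCE_BASE_URL"],
    api_key=os.environ["INFERENCE_API_KEY"],
)

# One streaming request mirroring the benchmark shape: long input, ~1K output cap.
stream = client.chat.completions.create(
    model="aws/anthropic/bedrock-claude-sonnet-4-5-v1",
    messages=[{"role": "user", "content": "<~8K-token synthetic prompt goes here>"}],
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```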
This tutorial captures end-to-end reference flows for running AIPerf against vLLM-hosted models. Each chapter covers a specific OpenAI-compatible endpoint: how to launch the vLLM server, run the AIPerf benchmark, and interpret the results.
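Before running AIPerf, it is worth confirming that the vLLM server is up and serving the expected model. A minimal check against the standard OpenAI-compatible /v1/models endpoint (assuming vLLM's default local port of 8000) looks like this:

```python
import requests

# Assumes a vLLM OpenAI-compatible server on the default local port;
# adjust the base URL if the server was launched elsewhere.
BASE_URL = "http://localhost:8000/v1"

resp = requests.get(f"{BASE_URL}/models", timeout=10)
resp.raise_for_status()

served = [m["id"] for m in resp.json()["data"]]
print("Models served:", served)
```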
This guide walks you through using TensorRT-Cloud to run performance sweeps with the TRT LLM PyTorch backend.
--> THIS GUIDE IS PSEUDOCODE AND JUST A PRODUCT MANAGER'S SUGGESTION. WE HAVE NOT YET BUILT THIS FEATURE <---
Unlike the C++ backend, which uses ahead-of-time (AoT) compilation, TRT LLM PyTorch uses just-in-time (JIT) compilation. This means we'll be configuring runtime parameters rather than build parameters for our performance sweep.
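Since this guide is still a sketch, the following is only one plausible way to express such a runtime sweep: enumerate the cross-product of candidate runtime settings and emit one configuration per combination. The parameter names are placeholders for whatever knobs the TRT LLM PyTorch backend ultimately exposes through TensorRT-Cloud:

```python
from itertools import product

# Candidate runtime (JIT) parameters to sweep; names are placeholders, not a
# confirmed TensorRT-Cloud interface.
sweep_space = {
    "max_batch_size": [8, 16, 32],
    "max_num_tokens": [2048, 4096, 8192],
    "kv_cache_free_gpu_memory_fraction": [0.85, 0.9],
}

def sweep_configs(space: dict) -> list:
    """Expand the sweep space into one flat config dict per combination."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]

configs = sweep_configs(sweep_space)
print(f"{len(configs)} runtime configurations to benchmark")
print(configs[0])  # e.g. {'max_batch_size': 8, 'max_num_tokens': 2048, ...}
```

Each generated configuration would then be submitted as a separate benchmark run, since JIT runtime settings can be changed per run without rebuilding an engine.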