Date: 2026-02-13
Model: zai-org/GLM-5-FP8 (744B MoE, 40B active parameters)
Framework: SGLang (lmsysorg/sglang:glm5-blackwell)
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
Reference: SGLang GLM-5 Cookbook (8×H200 141GB)
AI Agent: Claude-4.6-Opus running on OpenClaw
NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI models across multi-node GPU clusters. As LLMs grow beyond what a single GPU can handle, Dynamo solves the orchestration challenge of coordinating shards, routing requests, and transferring KV cache data across distributed systems.
Key capabilities:
- Disaggregated serving — Separates prefill and decode phases for optimized GPU utilization
- KV-aware routing — Routes requests to workers with the highest cache hit rate (see the routing sketch after this list)
- KV Block Manager — Offloads KV cache to CPU, SSD, or remote memory (G2/G3/G4) for higher throughput
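The routing sketch below is a minimal Python illustration of the KV-aware routing idea, not Dynamo's actual API: each hypothetical `Worker` advertises the KV-block IDs it has cached, and the router sends a request to the worker whose cache covers the longest prefix of the request. The `Worker` class, `prefix_hits`, and the block IDs are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """Hypothetical decode worker tracking which KV-block IDs it already caches."""
    name: str
    cached_blocks: set = field(default_factory=set)

def prefix_hits(request_blocks, worker):
    """Count how many leading blocks of the request are already cached on the worker."""
    hits = 0
    for block_id in request_blocks:
        if block_id not in worker.cached_blocks:
            break
        hits += 1
    return hits

def route(request_blocks, workers):
    """Send the request to the worker with the longest cached prefix (highest hit rate)."""
    return max(workers, key=lambda w: prefix_hits(request_blocks, w))

# Example: worker "b" already holds the first two KV blocks of this request.
workers = [Worker("a", {10}), Worker("b", {10, 11}), Worker("c", set())]
print(route([10, 11, 12], workers).name)  # -> b
```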
We tested AIConfigurator (version 0.4.0) against the performance claims in the "Advanced Disagg Perf Tuning" guide and found a significant discrepancy between AIC's predictions and the guide's reported results.
Key Finding: AIC predicts disaggregated serving provides 7-8% improvement, while the guide reports 102-148% improvement - a 10-20x difference in expected gains.
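The 10-20x figure follows from dividing the guide's reported gains by AIC's predicted gains; a quick check (values taken from the finding above, the rounding to 10-20x is the report's own):

```python
aic_predicted = (7, 8)       # AIC-predicted improvement, percent
guide_reported = (102, 148)  # guide-reported improvement, percent

ratios = [g / a for g in guide_reported for a in aic_predicted]
print(f"{min(ratios):.1f}x to {max(ratios):.1f}x")  # 12.8x to 21.1x, i.e. roughly 10-20x
```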
Source Document: The guide being tested is from PR #4655 by davilu-nvidia (submitted Nov 27, 2025, currently under review and not yet merged).
Presentation Date: November 13, 2025
Tool: AIPerf v0.3.0
Inference providers that want to contribute to open source projects are limited in what they can share while respecting user privacy and protecting proprietary prompts. Real LLM inference traces contain sensitive information, but without realistic traces, benchmarks can't accurately reflect production workloads.
Block hashing converts tokens into cryptographic hash IDs that preserve prefix-matching patterns while protecting content, following the Mooncake approach (USENIX FAST'25).
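A minimal sketch of the block-hashing idea follows. It is illustrative only: the block size, salt, and function names are assumptions, not the exact Mooncake or AIPerf implementation. Token IDs are grouped into fixed-size blocks, and each block's hash is chained with its parent's hash, so two traces that share a prompt prefix produce identical leading hash IDs without exposing the tokens themselves.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block; illustrative, not a required value

def block_hash_ids(token_ids, salt=b"trace-release"):
    """Convert a token sequence into chained block hashes that preserve prefix matches."""
    hashes = []
    parent = salt  # chaining makes a block's ID depend on everything before it
    for start in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        if len(block) < BLOCK_SIZE:
            break  # partial tail blocks are typically not cached or shared
        digest = hashlib.sha256(parent + str(block).encode("utf-8")).digest()
        hashes.append(digest.hex()[:16])  # short ID for readability
        parent = digest
    return hashes

# Two requests that share the first 16 tokens get the same first hash ID.
a = block_hash_ids(list(range(32)))
b = block_hash_ids(list(range(16)) + list(range(100, 116)))
print(a[0] == b[0], a[1] == b[1])  # True False
```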
Performance benchmark of the aws/anthropic/bedrock-claude-sonnet-4-5-v1 model hosted on NVIDIA's inference API, tested with a large context window (8K input tokens, 1K output tokens).
Benchmark Date: November 3, 2025
Tool: AIPerf v0.2.0
Test Configuration: 100 requests with streaming enabled
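At the API level, a single request from this configuration looks roughly like the sketch below (OpenAI-compatible client). The base URL, environment variables, and prompt are placeholders; this stands in for the kind of request the benchmark issues 100 times, not the AIPerf tool itself.

```python
import os
import time
from openai import OpenAI

# Placeholder endpoint and credentials; substitute the actual inference API values.
client = OpenAI(base_url=os.environ["INFERENCE_BASE_URL"],
                api_key=os.environ["INFERENCE_API_KEY"])

prompt = "..."  # the benchmark sizes prompts to roughly 8K input tokens

start = time.perf_counter()
first_token_at = None
output = []

stream = client.chat.completions.create(
    model="aws/anthropic/bedrock-claude-sonnet-4-5-v1",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1024,   # ~1K output tokens
    stream=True,       # streaming enabled, as in the test configuration
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()  # time to first token (TTFT)
    output.append(delta)

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.2f}s, total latency: {end - start:.2f}s")
```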
This tutorial captures end-to-end reference flows for running AIPerf against vLLM-hosted models. Each chapter covers a specific OpenAI-compatible endpoint: how to launch the vLLM server, run the AIPerf benchmark, and interpret the results.
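Before running the benchmark, it is worth confirming that the vLLM endpoint is reachable and reading back the served model name. A minimal check, assuming the default local address and the conventional placeholder API key (both are assumptions to adjust to your launch command):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; adjust host and port to match your launch command.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The model ID reported here is the name to pass to the benchmark client.
served = [m.id for m in client.models.list().data]
print("served models:", served)

# One small smoke-test request before kicking off the full benchmark run.
resp = client.chat.completions.create(
    model=served[0],
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```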