Date: 2026-02-13
Model: zai-org/GLM-5-FP8 (744B MoE, 40B active parameters)
Framework: SGLang (lmsysorg/sglang:glm5-blackwell)
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
Reference: SGLang GLM-5 Cookbook (8×H200 141GB)
AI Agent: Claude-4.6-Opus running on OpenClaw
We benchmarked GLM-5-FP8 on a single 8×B200 node against the published SGLang cookbook results on 8×H200. Three configurations were tested: B200 baseline (no speculative decoding), B200 with EAGLE speculative decoding, and the H200 cookbook reference (EAGLE enabled).
Key findings:
- B200 with EAGLE delivers 12.8% higher throughput than H200 (1,370 vs 1,215 output tok/s)
- Decode latency matches H200 with EAGLE (TPOT 7.69ms vs 7.54ms)
- Prefill is 13% faster on B200 (TTFT 246ms vs 282ms), reflecting higher compute throughput
- EAGLE speculative decoding is critical — without it, decode latency is 2.6× worse (20ms vs 7.7ms TPOT) due to the model's MoE architecture benefiting heavily from draft-verify batching
- Accept length is consistent at ~3.5 tokens across both platforms, confirming the EAGLE draft model generalizes well
| Parameter | Value |
|---|---|
| Model | zai-org/GLM-5-FP8 |
| Architecture | 744B MoE, 40B active, FP8 quantized |
| Tensor Parallelism | 8 |
| EAGLE Spec Decode | 3 steps, topk=1, 4 draft tokens |
| Memory Fraction | 0.85 (static) |
| GPU Clocks | SM 1965 MHz, MEM 3996 MHz (locked) |
| Driver | 570.195.03 |
| Workload | Random tokens, 1K input / 1K output |
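A launch sequence consistent with these settings is sketched below. This is a hedged sketch, not the exact command used for these runs: the speculative-decoding flags follow SGLang's documented EAGLE options, and the EAGLE draft-model path is a placeholder; verify both against the glm5-blackwell image.

```bash
# Lock GPU clocks to the values in the table (SM 1965 MHz, MEM 3996 MHz).
nvidia-smi -lgc 1965
nvidia-smi -lmc 3996

# Placeholder path for the GLM-5 EAGLE draft weights; substitute the real checkpoint.
DRAFT_MODEL=/models/glm5-eagle-draft

# TP=8, 0.85 static memory fraction, EAGLE with 3 steps / topk=1 / 4 draft tokens.
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp 8 \
  --mem-fraction-static 0.85 \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path "$DRAFT_MODEL" \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --host 0.0.0.0 \
  --port 30000
```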
Latency, concurrency = 1 (random tokens, 1K input / 1K output):

| Metric | H200 (cookbook) | B200 (no EAGLE) | B200 (EAGLE) |
|---|---|---|---|
| TTFT mean (ms) | 291 | 260 | 391 |
| TTFT median (ms) | 282 | 244 | 246 |
| TPOT mean (ms) | 7.54 | 20.22 | 7.69 |
| TPOT median (ms) | 7.16 | 20.34 | 7.29 |
| ITL median (ms) | 6.81 | 20.32 | 7.01 |
| ITL P95 (ms) | 13.82 | 20.95 | 13.97 |
| ITL P99 (ms) | 17.34 | 21.18 | 27.88 |
| Output tok/s | 118 | 48 | 112 |
| Accept length | 3.48 | — | 3.52 |
| E2E mean (ms) | 3,576 | 4,071 | 3,756 |
- TTFT median is 13% faster on B200 (246ms vs 282ms) thanks to higher FP8 compute throughput on Blackwell
- With EAGLE enabled, TPOT is essentially identical (7.69ms vs 7.54ms — within 2%)
- Without EAGLE, decode is memory-bandwidth-bound and 2.6× slower — the MoE architecture dispatches to only 40B active params per token, making each decode step very lightweight and latency-sensitive
- Accept length of 3.52 means each EAGLE verification step yields ~3.5 accepted tokens on average; net of draft overhead, this is the observed ~2.6× decode speedup (20.22ms → 7.69ms TPOT)
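For reproducibility, the latency run corresponds to something like the following SGLang bench_serving invocation. This is a sketch rather than the exact command used: the prompt count is an assumption (the report does not state it), and flag names should be checked against the bench_serving version shipped in the image.

```bash
# Single-request latency run: random 1K-token prompts, 1K-token outputs.
# --num-prompts 32 is an assumed sample size; the report does not state it.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 32 \
  --max-concurrency 1
```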
Throughput, concurrency = 100 (same 1K input / 1K output random workload):

| Metric | H200 (cookbook) | B200 (no EAGLE) | B200 (EAGLE) |
|---|---|---|---|
| Output tok/s | 1,215 | 1,233 | 1,370 ✅ |
| Total tok/s | 2,435 | 2,472 | 2,747 ✅ |
| Peak output tok/s | 2,648 | 2,300 | 2,845 ✅ |
| Request throughput (req/s) | 2.43 | 2.47 | 2.74 ✅ |
| Duration (s) | 412 | 406 | 365 ✅ |
| TPOT mean (ms) | 38.7 | 77.09 | 34.46 |
| TPOT median (ms) | — | 78.85 | 31.88 |
| TTFT mean (ms) | 8,690 | 693 | 18,239 |
| TTFT median (ms) | — | 282 | 18,754 |
| ITL median (ms) | — | 45.36 | 15.36 |
| ITL P99 (ms) | — | 250.96 | 138.73 |
| E2E mean (ms) | — | 38,782 | 34,979 |
| Concurrency (effective) | ~100 | 95.6 | 95.8 |
| Accept length | — | — | 3.52 |
- B200 + EAGLE is 12.8% faster than H200 on sustained output throughput (1,370 vs 1,215 tok/s)
- Even without EAGLE, B200 slightly edges H200 (+1.5% output tok/s) — both are memory-bandwidth-bound at high concurrency
- EAGLE provides an additional 11% throughput boost on B200 over baseline (1,370 vs 1,233), and dramatically improves per-request latency (TPOT: 34ms vs 77ms)
- Higher TTFT under load with EAGLE is expected — the draft model adds prefill overhead when the system is compute-saturated
- Peak throughput of 2,845 output tok/s demonstrates excellent burst capacity
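The saturation numbers come from the same harness with the concurrency cap raised. In the sketch below, the prompt count is inferred from the reported request rate and duration (2.74 req/s × 365 s ≈ 1,000) rather than stated in the source.

```bash
# Throughput run: same 1K/1K random workload, 100 requests in flight.
# ~1,000 prompts is inferred from 2.74 req/s × 365 s; not stated in the report.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100
```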
Per-GPU sustained output throughput (c=100 output tok/s ÷ 8 GPUs):

| Config | Output tok/s/GPU |
|---|---|
| B200 + EAGLE | 171.3 |
| B200 baseline | 154.2 |
| H200 + EAGLE | 151.9 |
B200 delivers 12.8% more output tokens per GPU than H200 for this model.
| Config | TTFT (c=1) | TPOT (c=1) | Out tok/s (c=1) | Out tok/s (c=100) | Duration (c=100) |
|---|---|---|---|---|---|
| H200 + EAGLE | 282ms | 7.54ms | 118 | 1,215 | 412s |
| B200 + EAGLE | 246ms | 7.69ms | 112 | 1,370 | 365s |
| B200 baseline | 244ms | 20.22ms | 48 | 1,233 | 406s |
| B200 vs H200 (EAGLE, both) | -13% | +2% | -5% | +12.8% | -11.4% |
Bottom line: B200 matches H200 on latency and beats it by ~13% on throughput for GLM-5-FP8. EAGLE speculative decoding is a must-have — it provides 2.6× better decode latency at concurrency=1 and 11% higher throughput under load. The 744B MoE model runs comfortably on a single 8-GPU node with TP=8.
- DeepGEMM JIT precompile adds 10+ min to cold start. SGLang compiles kernels for 32K M values per unique (N, K) shape on first launch, and EAGLE adds further shapes to compile. Results are cached after the first run, but it is painful when iterating on server config. Kernels can be pre-compiled separately with `python3 -m sglang.compile_deep_gemm --model zai-org/GLM-5-FP8 --tp 8`.
- EAGLE inflates TTFT under high concurrency. Mean TTFT went from 693ms (no EAGLE) to 18.2s (EAGLE) at concurrency=100. The draft model's verification step adds prefill overhead when compute is saturated. Throughput still wins, but this is worth weighing for latency-sensitive deployments where TTFT matters more than TPOT.
- Otherwise clean. The `lmsysorg/sglang:glm5-blackwell` image, model download, TP=8 mapping, and EAGLE weights all worked out of the box (a representative container invocation is sketched below). No driver issues, no OOM, no kernel panics.
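For completeness, a container invocation along the lines below is consistent with the image and setup described in this report; the cache mount and shared-memory size are assumptions, not the exact command used for these runs.

```bash
# Assumed Hugging Face cache mount and shm size; adjust to your host layout.
docker run --rm -it --gpus all --ipc=host --shm-size 32g \
  -p 30000:30000 \
  -v /data/hf-cache:/root/.cache/huggingface \
  lmsysorg/sglang:glm5-blackwell bash
# Inside the container, run the sglang.launch_server command from the configuration section.
```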
Request:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-5-FP8",
"messages": [{"role": "user", "content": "What are you? Describe yourself in one paragraph."}],
"max_tokens": 512,
"temperature": 0.7,
"chat_template_kwargs": {"enable_thinking": false}
}'
Response:
{
"id": "5f648f2b83d644db883bc93a6abff1c9",
"object": "chat.completion",
"created": 1771051076,
"model": "zai-org/GLM-5-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I am a large language model developed by Z.ai, designed to assist with a wide range of tasks through natural language understanding and generation. Trained on extensive and diverse datasets, I aim to provide accurate, helpful, and context-aware responses to your questions. While I strive to be informative and engaging, my knowledge is based on pre-existing data up to my training cutoff, and I don't have real-time awareness or personal experiences. Feel free to ask me anything!",
"reasoning_content": null,
"tool_calls": null
},
"logprobs": null,
"finish_reason": "stop",
"matched_stop": 154827
}
],
"usage": {
"prompt_tokens": 15,
"total_tokens": 109,
"completion_tokens": 94,
"prompt_tokens_details": null,
"reasoning_tokens": 0
}
}
Generated by Thermidor 🦞 on b200-82