
OpenClaw - GLM-5-FP8 Inference Benchmark Report: 8×B200 vs 8×H200

Date: 2026-02-13
Model: zai-org/GLM-5-FP8 (744B MoE, 40B active parameters)
Framework: SGLang (lmsysorg/sglang:glm5-blackwell)
Hardware: 8× NVIDIA B200 183GB, NV18 NVSwitch, Xeon 8570 (224 cores), 2TB RAM
Reference: SGLang GLM-5 Cookbook (8×H200 141GB)
AI Agent: Claude-4.6-Opus running on OpenClaw


Executive Summary

We benchmarked GLM-5-FP8 on a single 8×B200 node against the published SGLang cookbook results on 8×H200. Three configurations were tested: B200 baseline (no speculative decoding), B200 with EAGLE speculative decoding, and the H200 cookbook reference (EAGLE enabled).

Key findings:

  • B200 with EAGLE delivers 12.8% higher throughput than H200 (1,370 vs 1,215 output tok/s)
  • Decode latency matches H200 with EAGLE (TPOT 7.69ms vs 7.54ms)
  • Prefill is 13% faster on B200 (TTFT 246ms vs 282ms), reflecting higher compute throughput
  • EAGLE speculative decoding is critical — without it, decode latency is 2.6× worse (20ms vs 7.7ms TPOT) due to the model's MoE architecture benefiting heavily from draft-verify batching
  • Accept length is consistent at ~3.5 tokens across both platforms, confirming the EAGLE draft model generalizes well

Configuration

| Parameter | Value |
|---|---|
| Model | zai-org/GLM-5-FP8 |
| Architecture | 744B MoE, 40B active, FP8 quantized |
| Tensor Parallelism | 8 |
| EAGLE Spec Decode | 3 steps, topk=1, 4 draft tokens |
| Memory Fraction | 0.85 (static) |
| GPU Clocks | SM 1965 MHz, MEM 3996 MHz (locked) |
| Driver | 570.195.03 |
| Workload | Random tokens, 1K input / 1K output |
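
For reference, a server launch consistent with this configuration would look roughly like the sketch below. The flag names follow current SGLang conventions and the EAGLE draft-model path is a placeholder, so treat this as an illustration rather than the exact command used for these runs:

```bash
# Lock clocks to the values in the table (SM 1965 MHz, MEM 3996 MHz); requires root
sudo nvidia-smi -lgc 1965,1965
sudo nvidia-smi -lmc 3996,3996

# Launch SGLang with TP=8 and EAGLE speculative decoding.
# NOTE: <eagle-draft-weights> is a placeholder; verify flag names against your SGLang build.
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp 8 \
  --mem-fraction-static 0.85 \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path <eagle-draft-weights> \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --port 30000
```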

Latency Benchmark (Concurrency = 1, 10 Prompts)

| Metric | H200 (cookbook) | B200 (no EAGLE) | B200 (EAGLE) |
|---|---|---|---|
| TTFT mean (ms) | 291 | 260 | 391 |
| TTFT median (ms) | 282 | 244 | 246 |
| TPOT mean (ms) | 7.54 | 20.22 | 7.69 |
| TPOT median (ms) | 7.16 | 20.34 | 7.29 |
| ITL median (ms) | 6.81 | 20.32 | 7.01 |
| ITL P95 (ms) | 13.82 | 20.95 | 13.97 |
| ITL P99 (ms) | 17.34 | 21.18 | 27.88 |
| Output tok/s | 118 | 48 | 112 |
| Accept length | 3.48 | — | 3.52 |
| E2E mean (ms) | 3,576 | 4,071 | 3,756 |
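
The workload matches the configuration table (random 1K-in / 1K-out prompts). A bench_serving invocation for this low-concurrency run would look roughly like the following; the flags are the standard sglang.bench_serving arguments and this is a sketch, not the verbatim command:

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 10 \
  --max-concurrency 1
```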

Analysis

  • TTFT median is 13% faster on B200 (246ms vs 282ms) thanks to higher FP8 compute throughput on Blackwell
  • With EAGLE enabled, TPOT is essentially identical (7.69ms vs 7.54ms — within 2%)
  • Without EAGLE, decode is memory-bandwidth-bound and 2.6× slower — the MoE architecture dispatches to only 40B active params per token, making each decode step very lightweight and latency-sensitive
  • Accept length of 3.52 means each EAGLE draft-verify step yields ~3.5 tokens on average; the observed decode speedup (20.22ms → 7.69ms TPOT, ~2.6×) is slightly lower because a speculative step costs more than a plain decode step

Throughput Benchmark (Concurrency = 100, 1000 Prompts)

| Metric | H200 (cookbook) | B200 (no EAGLE) | B200 (EAGLE) |
|---|---|---|---|
| Output tok/s | 1,215 | 1,233 | 1,370 |
| Total tok/s | 2,435 | 2,472 | 2,747 |
| Peak output tok/s | 2,648 | 2,300 | 2,845 |
| Request throughput (req/s) | 2.43 | 2.47 | 2.74 |
| Duration (s) | 412 | 406 | 365 |
| TPOT mean (ms) | 38.7 | 77.09 | 34.46 |
| TPOT median (ms) | — | 78.85 | 31.88 |
| TTFT mean (ms) | 8,690 | 693 | 18,239 |
| TTFT median (ms) | — | 282 | 18,754 |
| ITL median (ms) | — | 45.36 | 15.36 |
| ITL P99 (ms) | — | 250.96 | 138.73 |
| E2E mean (ms) | — | 38,782 | 34,979 |
| Concurrency (effective) | ~100 | 95.6 | 95.8 |
| Accept length | — | — | 3.52 |
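
The throughput run uses the same bench_serving sketch as the latency benchmark above, with only the prompt count and concurrency raised (again, an illustrative invocation rather than the exact command):

```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-prompts 1000 \
  --max-concurrency 100
```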

Analysis

  • B200 + EAGLE is 12.8% faster than H200 on sustained output throughput (1,370 vs 1,215 tok/s)
  • Even without EAGLE, B200 slightly edges H200 (+1.5% output tok/s) — both are memory-bandwidth-bound at high concurrency
  • EAGLE provides an additional 11% throughput boost on B200 over baseline (1,370 vs 1,233), and dramatically improves per-request latency (TPOT: 34ms vs 77ms)
  • Higher TTFT under load with EAGLE is expected — the draft model adds prefill overhead when the system is compute-saturated
  • Peak throughput of 2,845 output tok/s demonstrates excellent burst capacity

Throughput Per GPU

| Config | Output tok/s/GPU |
|---|---|
| B200 + EAGLE | 171.3 |
| B200 baseline | 154.2 |
| H200 + EAGLE | 151.9 |

B200 delivers 12.8% more output tokens per GPU than H200 for this model.


Summary

| Config | TTFT median (c=1) | TPOT mean (c=1) | Out tok/s (c=1) | Out tok/s (c=100) | Duration (c=100) |
|---|---|---|---|---|---|
| H200 + EAGLE | 282 ms | 7.54 ms | 118 | 1,215 | 412 s |
| B200 + EAGLE | 246 ms | 7.69 ms | 112 | 1,370 | 365 s |
| B200 baseline | 244 ms | 20.22 ms | 48 | 1,233 | 406 s |
| B200 vs H200 (EAGLE, both) | -13% | +2% | -5% | +12.8% | -11.4% |

Bottom line: B200 matches H200 on latency and beats it by ~13% on throughput for GLM-5-FP8. EAGLE speculative decoding is a must-have — it provides 2.6× better decode latency at concurrency=1 and 11% higher throughput under load. The 744B MoE model runs comfortably on a single 8-GPU node with TP=8.


Deployment Notes

  1. DeepGEMM JIT precompile adds 10+ min to cold start. SGLang compiles kernels for 32K M values per unique (N, K) shape on first launch, and EAGLE adds further shapes on top. The kernels are cached after the first run, but the wait is painful when iterating on server config. They can be pre-compiled separately with `python3 -m sglang.compile_deep_gemm --model zai-org/GLM-5-FP8 --tp 8` (see the sketch after this list).

  2. EAGLE inflates TTFT under high concurrency. Mean TTFT went from 693ms (no EAGLE) → 18.2s (EAGLE) at concurrency=100. The draft model's verification step adds prefill overhead when compute is saturated. Throughput still wins, but worth considering for latency-sensitive deployments where TTFT matters more than TPOT.

  3. Otherwise clean. The lmsysorg/sglang:glm5-blackwell image, model download, TP=8 mapping, and EAGLE weights all worked out of the box. No driver issues, no OOM, no kernel panics.
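
A minimal warm-start sequence for note 1, assuming the same container and flags as the configuration sketch earlier (add the EAGLE flags shown there as needed):

```bash
# One-time DeepGEMM kernel precompile (10+ min); compiled kernels are cached on disk
python3 -m sglang.compile_deep_gemm --model zai-org/GLM-5-FP8 --tp 8

# Subsequent server launches reuse the cache and skip the JIT wait
python3 -m sglang.launch_server --model-path zai-org/GLM-5-FP8 --tp 8 \
  --mem-fraction-static 0.85 --port 30000
```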


Proof of Life: Chat Completion

Request:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "What are you? Describe yourself in one paragraph."}],
    "max_tokens": 512,
    "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

Response:

```json
{
  "id": "5f648f2b83d644db883bc93a6abff1c9",
  "object": "chat.completion",
  "created": 1771051076,
  "model": "zai-org/GLM-5-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am a large language model developed by Z.ai, designed to assist with a wide range of tasks through natural language understanding and generation. Trained on extensive and diverse datasets, I aim to provide accurate, helpful, and context-aware responses to your questions. While I strive to be informative and engaging, my knowledge is based on pre-existing data up to my training cutoff, and I don't have real-time awareness or personal experiences. Feel free to ask me anything!",
        "reasoning_content": null,
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 154827
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 109,
    "completion_tokens": 94,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  }
}
```

Generated by Thermidor 🦞 on b200-82
