@Basten7
Last active September 10, 2025 14:59
Prompt Processing vs Token Generation
Classic LLM-inference trace on the GPU

Basten7 commented Aug 11, 2025

LLM-inference trace on Metal3 Build
Screenshot 2025-08-11 at 10:45:58

Warm-up spikes (left): Kernel/JIT warm-up, memory/KV-cache allocation, and first big GEMMs kick in. Short spikes and a dip are normal during graph creation and allocator growth.
High flat plateau (Prompt Processing, prefill): The model is chewing through the prompt with large matrix multiplies and full-sequence attention. That work is compute-heavy and keeps the 6900 XT busy easily, so you see ~90–100% utilization until the prompt is ingested. “pp512” most likely denotes a 512-token prompt-processing test (the llama-bench naming convention); the prompt length only affects how long this plateau lasts, not its height.
Deep notch between phases: End of prefill → start of decoding. You usually get a bubble while the runtime switches kernels, finalizes KV cache pages, synchronizes streams, and does the first softmax/sampling (often CPU-side). First-token latency shows up here.
Lower, steady plateau (Token Generation, decode): Now you’re generating one token at a time. Each step does matrix-vector products and attention against the growing KV cache. That is more memory-bound and has less parallel work per step, so utilization settles a bit lower and flatter. The tiny wiggles are most likely periodic sampling/logging.
Drop to zero: Generation stops; buffers freed.
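
To make the compute-bound vs. memory-bound contrast above concrete, here is a rough back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte moved) for a single weight projection. The hidden size of 4096, the 2-byte weights and activations, and the function name are illustrative assumptions, not values taken from the traces; real kernels, quantized weights, and caches will shift the numbers.

```python
def arithmetic_intensity(n_tokens: int, d_in: int, d_out: int,
                         bytes_per_weight: float = 2.0,
                         bytes_per_act: float = 2.0) -> float:
    """FLOPs per byte for one (n_tokens x d_in) @ (d_in x d_out) projection.

    Simplified model: weights are read once, activations are read once
    and written once. Caches and kernel fusion are ignored.
    """
    flops = 2 * n_tokens * d_in * d_out              # one multiply + one add per term
    weight_bytes = d_in * d_out * bytes_per_weight   # weight matrix read once
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_act  # input read + output written
    return flops / (weight_bytes + act_bytes)


D = 4096  # hypothetical hidden size, for illustration only

prefill = arithmetic_intensity(n_tokens=512, d_in=D, d_out=D)  # whole prompt at once
decode = arithmetic_intensity(n_tokens=1, d_in=D, d_out=D)     # one token per step

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
```

Under these assumptions, prefill performs a few hundred FLOPs for every byte it moves, while decode performs about one, because every generated token has to stream the full weight matrix again. That is consistent with the high plateau being limited by compute and the lower plateau being limited by memory bandwidth.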

Basten7 commented Aug 11, 2025

LLM-inference trace on Vulkan Build
Screenshot 2025-08-11 at 10:57:35

Basten7 commented Aug 11, 2025

LLM-inference trace on Metal3 Build at 0.1 ms
Screenshot 2025-08-11 at 11:00:29

LLM-inference trace on Vulkan Build at 0.1 ms
Screenshot 2025-08-11 at 10:59:46
