Prompt Processing vs Token Generation
Classic LLM-inference trace on the GPU



LLM-inference trace on Metal3 Build

Warm-up spikes (left): Kernel/JIT warm-up, memory/KV-cache allocation, and the first big GEMMs kick in. Short spikes and a dip are normal during graph creation and allocator growth.
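
This is also why benchmarks usually discard the first pass. A minimal, library-free sketch of that pattern (the GEMM is just a numpy stand-in for a real model forward, and the shapes are made up):

```python
import time
import numpy as np

# Toy stand-in for a transformer forward pass: one big GEMM.
def fake_forward(x, weights):
    return x @ weights

d_model = 4096
weights = np.random.randn(d_model, d_model).astype(np.float32)
prompt = np.random.randn(512, d_model).astype(np.float32)

# First call pays for allocations (and, on a real GPU backend, kernel/JIT
# compilation and graph capture). Throw its timing away.
fake_forward(prompt, weights)

# Time only the steady state, after the warm-up pass.
t0 = time.perf_counter()
for _ in range(10):
    fake_forward(prompt, weights)
print(f"steady-state pass: {(time.perf_counter() - t0) / 10 * 1e3:.1f} ms")
```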
High flat plateau — Prompt Processing (prefill): The model is chewing through the prompt in big matrix multiplies and full-sequence attention. That’s compute-heavy, so it’s easy to keep the 6900 XT busy, and you see ~90–100% util until the prompt is ingested. Your “pp512” likely means the prefill is chunked into 512-token blocks; that only affects how long this plateau lasts, not its height.
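
A rough sketch of what that chunked prefill looks like as a loop, assuming a llama.cpp-style 512-token batch; `prefill_chunked` and `toy_forward` are hypothetical names and the GEMM is a numpy stand-in, not any real runtime's API:

```python
import numpy as np

D = 1024  # toy hidden size
W = np.random.randn(D, D).astype(np.float32)

def toy_forward(block, kv_cache):
    # Stand-in for attention + MLP over one chunk: one large, compute-heavy GEMM.
    return block @ W

def prefill_chunked(prompt_embeds, forward_fn, chunk=512):
    """Feed the prompt through the model in fixed-size blocks (the "pp512" idea).

    Chunking changes how long the prefill plateau lasts, not how high it is:
    each block is still a big batched GEMM, so the GPU stays busy throughout.
    """
    kv_cache = []
    for start in range(0, prompt_embeds.shape[0], chunk):
        block = prompt_embeds[start:start + chunk]
        kv_cache.append(forward_fn(block, kv_cache))
    return np.concatenate(kv_cache, axis=0)

prompt = np.random.randn(2048, D).astype(np.float32)  # 2048-token prompt (as embeddings)
cache = prefill_chunked(prompt, toy_forward)          # processed as four 512-token blocks
print(cache.shape)                                    # (2048, 1024)
```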
Deep notch between phases: End of prefill → start of decoding. You usually get a bubble while the runtime switches kernels, finalizes KV cache pages, synchronizes streams, and does the first softmax/sampling (often CPU-side). First-token latency shows up here.
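
If you want the notch as a number rather than a shape on the graph, time the first token separately from the rest. A minimal sketch, assuming a streaming-generator interface; `fake_stream` is a made-up stand-in for a real generate() call:

```python
import time

def measure_latencies(token_iter):
    """Separate time-to-first-token (prefill + phase switch)
    from the steady inter-token latency of the decode plateau."""
    t0 = time.perf_counter()
    next(token_iter)                      # first token: pays for the whole prefill
    ttft = time.perf_counter() - t0

    t1 = time.perf_counter()
    rest = list(token_iter)               # remaining tokens: the decode plateau
    per_token = (time.perf_counter() - t1) / max(len(rest), 1)
    return ttft, per_token

# Toy stream: one slow first yield (prefill + kernel switch), then fast decode steps.
def fake_stream(n=32):
    time.sleep(0.30)
    yield 0
    for i in range(1, n):
        time.sleep(0.01)
        yield i

ttft, per_tok = measure_latencies(fake_stream())
print(f"TTFT: {ttft * 1e3:.0f} ms, per token: {per_tok * 1e3:.1f} ms")
```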
Lower, steady plateau — Token Generation (decode): Now you’re generating one token at a time. Each step does matvecs/attention against the growing KV cache. That’s more memory-bound and has less parallel work per step, so utilization settles a bit lower and flatter. Tiny wiggles = periodic sampling/logging.
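
A toy decode loop showing why this phase is memory-bound: each step is a handful of matvecs plus attention against the whole (growing) KV cache. Everything here (`decode_step`, the weight matrices, the cache shapes) is illustrative, not any particular runtime's API:

```python
import numpy as np

D = 1024                                   # toy hidden size
Wq = np.random.randn(D, D).astype(np.float32)
Wk = np.random.randn(D, D).astype(np.float32)
Wv = np.random.randn(D, D).astype(np.float32)

def decode_step(x, k_cache, v_cache):
    """One token per step: matvecs plus attention over the whole cache.

    A single row of activations streams the full weight matrices from memory,
    so each step is bandwidth-bound rather than compute-bound.
    """
    q = x @ Wq                             # (1, D) matvec
    scores = (q @ k_cache.T) / np.sqrt(D)  # attend to every cached key
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    out = attn @ v_cache                   # weighted sum over cached values
    # Append this step's key/value: the cache, and the per-step work, keep growing.
    k_cache = np.vstack([k_cache, x @ Wk])
    v_cache = np.vstack([v_cache, x @ Wv])
    return out, k_cache, v_cache

# Start from a cache left behind by prefill and generate a few tokens.
k_cache = np.random.randn(512, D).astype(np.float32)
v_cache = np.random.randn(512, D).astype(np.float32)
x = np.random.randn(1, D).astype(np.float32)
for _ in range(8):
    x, k_cache, v_cache = decode_step(x, k_cache, v_cache)
```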
Drop to zero: Generation stops; buffers freed.