Prompt Processing vs Token Generation
Classic LLM-inference trace on the GPU



LLM-inference trace on Metal3 Build

Warm-up spikes (left): Kernel/JIT warm-up, memory/KV-cache allocation, and the first big GEMMs kick in. Short spikes and a dip are normal during graph creation and allocator growth.
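
This is also why benchmarks usually discard the first pass. A minimal, library-free sketch of that pattern (the GEMM is just a numpy stand-in for a real model forward, and the shapes are made up):

```python
import time
import numpy as np

# Toy stand-in for a transformer forward pass: one big GEMM.
def fake_forward(x, weights):
    return x @ weights

d_model = 4096
weights = np.random.randn(d_model, d_model).astype(np.float32)
prompt = np.random.randn(512, d_model).astype(np.float32)

# First call pays for allocations (and, on a real GPU backend, kernel/JIT
# compilation and graph capture). Throw its timing away.
fake_forward(prompt, weights)

# Time only the steady state, after the warm-up pass.
t0 = time.perf_counter()
for _ in range(10):
    fake_forward(prompt, weights)
print(f"steady-state pass: {(time.perf_counter() - t0) / 10 * 1e3:.1f} ms")
```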
High flat plateau — Prompt Processing (prefill): The model is chewing through the prompt in big matrix multiplies and full-sequence attention. That’s compute-heavy, so it’s easy to keep the 6900 XT busy, and you see ~90–100% util until the prompt is ingested. Your “pp512” likely means the prefill is chunked into 512-token blocks; that only affects how long this plateau lasts, not its height.
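
A rough sketch of what that chunked prefill looks like as a loop, assuming a llama.cpp-style 512-token batch; `prefill_chunked` and `toy_forward` are hypothetical names and the GEMM is a numpy stand-in, not any real runtime's API:

```python
import numpy as np

D = 1024  # toy hidden size
W = np.random.randn(D, D).astype(np.float32)

def toy_forward(block, kv_cache):
    # Stand-in for attention + MLP over one chunk: one large, compute-heavy GEMM.
    return block @ W

def prefill_chunked(prompt_embeds, forward_fn, chunk=512):
    """Feed the prompt through the model in fixed-size blocks (the "pp512" idea).

    Chunking changes how long the prefill plateau lasts, not how high it is:
    each block is still a big batched GEMM, so the GPU stays busy throughout.
    """
    kv_cache = []
    for start in range(0, prompt_embeds.shape[0], chunk):
        block = prompt_embeds[start:start + chunk]
        kv_cache.append(forward_fn(block, kv_cache))
    return np.concatenate(kv_cache, axis=0)

prompt = np.random.randn(2048, D).astype(np.float32)  # 2048-token prompt (as embeddings)
cache = prefill_chunked(prompt, toy_forward)          # processed as four 512-token blocks
print(cache.shape)                                    # (2048, 1024)
```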
Deep notch between phases: End of prefill → start of decoding. You usually get a bubble while the runtime switches kernels, finalizes KV cache pages, synchronizes streams, and does the first softmax/sampling (often CPU-side). First-token latency shows up here.
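
If you want the notch as a number rather than a shape on the graph, time the first token separately from the rest. A minimal sketch, assuming a streaming-generator interface; `fake_stream` is a made-up stand-in for a real generate() call:

```python
import time

def measure_latencies(token_iter):
    """Separate time-to-first-token (prefill + phase switch)
    from the steady inter-token latency of the decode plateau."""
    t0 = time.perf_counter()
    next(token_iter)                      # first token: pays for the whole prefill
    ttft = time.perf_counter() - t0

    t1 = time.perf_counter()
    rest = list(token_iter)               # remaining tokens: the decode plateau
    per_token = (time.perf_counter() - t1) / max(len(rest), 1)
    return ttft, per_token

# Toy stream: one slow first yield (prefill + kernel switch), then fast decode steps.
def fake_stream(n=32):
    time.sleep(0.30)
    yield 0
    for i in range(1, n):
        time.sleep(0.01)
        yield i

ttft, per_tok = measure_latencies(fake_stream())
print(f"TTFT: {ttft * 1e3:.0f} ms, per token: {per_tok * 1e3:.1f} ms")
```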
Lower, steady plateau — Token Generation (decode): Now you’re generating one token at a time. Each step does matvecs/attention against the growing KV cache. That’s more memory-bound and has less parallel work per step, so utilization settles a bit lower and flatter. Tiny wiggles = periodic sampling/logging.
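
A toy decode loop showing why this phase is memory-bound: each step is a handful of matvecs plus attention against the whole (growing) KV cache. Everything here (`decode_step`, the weight matrices, the cache shapes) is illustrative, not any particular runtime's API:

```python
import numpy as np

D = 1024                                   # toy hidden size
Wq = np.random.randn(D, D).astype(np.float32)
Wk = np.random.randn(D, D).astype(np.float32)
Wv = np.random.randn(D, D).astype(np.float32)

def decode_step(x, k_cache, v_cache):
    """One token per step: matvecs plus attention over the whole cache.

    A single row of activations streams the full weight matrices from memory,
    so each step is bandwidth-bound rather than compute-bound.
    """
    q = x @ Wq                             # (1, D) matvec
    scores = (q @ k_cache.T) / np.sqrt(D)  # attend to every cached key
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    out = attn @ v_cache                   # weighted sum over cached values
    # Append this step's key/value: the cache, and the per-step work, keep growing.
    k_cache = np.vstack([k_cache, x @ Wk])
    v_cache = np.vstack([v_cache, x @ Wv])
    return out, k_cache, v_cache

# Start from a cache left behind by prefill and generate a few tokens.
k_cache = np.random.randn(512, D).astype(np.float32)
v_cache = np.random.randn(512, D).astype(np.float32)
x = np.random.randn(1, D).astype(np.float32)
for _ in range(8):
    x, k_cache, v_cache = decode_step(x, k_cache, v_cache)
```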
Drop to zero: Generation stops; buffers freed.