@galv
Created November 7, 2025 18:06
Techniques for CUDA Graph Debugging in Pytorch

This is just a list of helpful techniques I have used for debugging CUDA graph problems in pytorch.

export TORCH_SHOW_CPP_STACKTRACES=1
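For a single run, the flag can also be set inline rather than exported. A minimal sketch (the script name is a placeholder; any script that fails during CUDA graph capture or replay will do):

```bash
# With the flag set, the RuntimeError raised on the failing call includes
# the C++ stack trace in addition to the usual Python traceback.
TORCH_SHOW_CPP_STACKTRACES=1 python cuda_graph_repro.py
```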

Unfortunately, I have found that pytorch's unwinder can make mistakes, which makes the stack traces less than useful. As soon as the stack unwinder fails, all remaining frames are omitted.

Compiling with -fno-omit-frame-pointer has not helped in my experience.

Secondly, I have found that this doesn't work in distributed runs launched by torchrun. As soon as a single process fails, torchrun will try to kill all other processes. Unfortunately, since we typically want to attach gdb to only one process, and all processes typically hit the same error, another process's death will result in the process we are debugging being killed before we have a chance to inspect it.

py-bt
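py-bt is the Python-level backtrace command from CPython's gdb extensions (python-gdb.py). A rough sketch of how I attach to a stuck or trapped process and get both the Python and C++ views (the PID and the path to python-gdb.py below are placeholders):

```bash
# Attach gdb to the process, load CPython's gdb extensions if they are not
# auto-loaded, then print the Python-level backtrace followed by the plain
# C/C++ backtrace of the same thread for comparison.
gdb -p 12345 \
    -ex "source /path/to/python-gdb.py" \
    -ex "py-bt" \
    -ex "bt"
```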

The con is that some parts of pytorch can

The other con is that I don't know how to make this work for distributed runs launched via torchrun, because the debugger needs to have a tty.

export CUDA_LOG_FILE=stdout
