@galv
Created November 7, 2025 18:06
Techniques for CUDA Graph Debugging in Pytorch

This is just a list of helpful techniques I have used for debugging CUDA graph problems in pytorch.

export TORCH_SHOW_CPP_STACKTRACES=1
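For a single run, the flag can also be set inline rather than exported. A minimal sketch (the script name is a placeholder; any script that fails during CUDA graph capture or replay will do):

```bash
# With the flag set, the RuntimeError raised on the failing call includes
# the C++ stack trace in addition to the usual Python traceback.
TORCH_SHOW_CPP_STACKTRACES=1 python cuda_graph_repro.py
```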

Unfortunately, I have found that pytorch's unwinder can make mistakes, which makes the stack traces less than useful. As soon as the stack unwinder fails, all remaining frames are omitted.

Compiling with -fno-omit-frame-pointer has not helped in my experience.

Secondly, I have found that this doesn't work in distributed runs launched by torchrun. As soon as a single process fails, torchrun will try to kill all other processes. Unfortunately, since we typically want to attach gdb to only one process, and all processes typically hit the same error, another process's death will result in the process we are debugging being killed before we have a chance to inspect it.

py-bt
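py-bt is the Python-level backtrace command from CPython's gdb extensions (python-gdb.py). A rough sketch of how I attach to a stuck or trapped process and get both the Python and C++ views (the PID and the path to python-gdb.py below are placeholders):

```bash
# Attach gdb to the process, load CPython's gdb extensions if they are not
# auto-loaded, then print the Python-level backtrace followed by the plain
# C/C++ backtrace of the same thread for comparison.
gdb -p 12345 \
    -ex "source /path/to/python-gdb.py" \
    -ex "py-bt" \
    -ex "bt"
```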

The con is that some parts of pytorch can

The other con is that I don't know how to make this work for distributed runs launched via torchrun, because the debugger needs to have a tty.

export CUDA_LOG_FILE=stdout
