This is just a list of helpful techniques I have used for
Unfortunately, I have found that pytorch's stack unwinder can make mistakes, which makes the stack traces less useful: as soon as the unwinder fails, all remaining frames are omitted.
Compiling with -fno-omit-frame-pointer has not helped in my experience.
Second, I have found that this doesn't work in distributed runs launched by torchrun. As soon as a single process fails, torchrun tries to kill all the other processes. Since we typically want to attach gdb to only one process, and all processes typically hit the same error, another process's death will result in the
The con is that some parts of pytorch can
The other con is that I don't know how to make this work for distributed runs launched via torchrun, because the debug process needs a tty.