Hello there! I’m trying to train a custom LLM along the lines of Andrej Karpathy’s nanoGPT and nanochat tutorials. My issue is that the training loss and gradient norms collapse to nearly zero after around a hundred steps. I’m using the MLX framework on an M1 Max.
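For context, here are the sanity-check numbers I’m comparing against (my own arithmetic, not taken from the gist): at random init, next-token cross-entropy should start near ln(vocab_size), and on held-out natural text it shouldn’t get anywhere near zero, which is part of why the collapse looks suspicious to me.

```python
import math

# Sanity-check numbers for the loss curve (my own arithmetic,
# not taken from the gist).
vocab_size = 50304

# At random init the model is roughly uniform over the vocab, so
# cross-entropy should start near ln(vocab_size).
init_loss = math.log(vocab_size)
print(f"expected initial loss: {init_loss:.2f} nats")  # ~10.83

# A loss of, say, 0.05 implies the model assigns ~exp(-0.05) = 95%
# probability to every target token -- implausible on fresh text, so
# near-zero loss usually points at target leakage (e.g. an input/label
# shift bug) or training on the same small batch over and over.
implied_prob = math.exp(-0.05)
print(f"avg target probability implied by loss 0.05: {implied_prob:.2%}")
```

That’s why I suspect a bug rather than genuinely fast convergence.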
Code, raw logs, raw CSV data, and graphs of the training loss, validation loss, and gradient norms are all available in this GitHub gist: https://gist.github.com/iankronquist/68bc7e51178aef47dd225074e5310814#file-trainingruninfo-md
I have a rather Llama-like architecture with RoPE. Unlike Llama, I’m using GELU (like GPT-2) instead of SwiGLU in the MLP to save a few parameters on the gate matrices. I’m using an embedding dimension of 768, 12 layers, an MLP up-projection ratio of 4, and grouped-query attention with a query-to-KV head ratio of 4 (all like GPT-2 small and Llama). I’m using the GPT-2 tokenizer with a vocab size of 50304. This comes out to around 114M parameters, so it seems like I’m on the beaten path.
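To double-check the ~114M figure, here’s my back-of-the-envelope parameter count for that config. Note the assumptions baked in (no linear biases, RMSNorm, and tied input/output embeddings); they’re mine, not confirmed from the gist, and an untied LM head would add another ~39M.

```python
# Back-of-the-envelope parameter count for the config above.
# Assumes no biases, RMSNorm, and a tied input/output embedding --
# these are my assumptions, check them against the actual code.
d_model, n_layers, n_heads, kv_ratio = 768, 12, 12, 4
mlp_ratio, vocab_size = 4, 50304

head_dim = d_model // n_heads              # 64, as in GPT-2 small
kv_dim = head_dim * (n_heads // kv_ratio)  # 3 KV heads -> 192

embed = vocab_size * d_model               # shared with the LM head (tied)
attn = 2 * d_model * d_model + 2 * d_model * kv_dim    # Q, O + K, V
mlp = 2 * d_model * (mlp_ratio * d_model)              # up + down (GELU, no gate)
norms = 2 * d_model                        # two RMSNorms per block

total = embed + n_layers * (attn + mlp + norms) + d_model  # + final norm
print(f"{total / 1e6:.1f}M parameters")    # ~113.0M, close to the 114M I see
```

The small gap to 114M could just be biases or other odds and ends I’m not counting here.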