Hello there! I’m trying to train a custom LLM along the lines of Andrej Karpathy’s nanoGPT and nanochat tutorials. My issue is that the training loss and gradient norms collapse to nearly zero after around a hundred steps. I’m using the MLX framework on an M1 Max.
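For context, here are the sanity-check numbers I’m comparing against (my own arithmetic, not taken from the gist): at random init, next-token cross-entropy should start near ln(vocab_size), and on held-out natural text it shouldn’t get anywhere near zero, which is part of why the collapse looks suspicious to me.

```python
import math

# Sanity-check numbers for the loss curve (my own arithmetic,
# not taken from the gist).
vocab_size = 50304

# At random init the model is roughly uniform over the vocab, so
# cross-entropy should start near ln(vocab_size).
init_loss = math.log(vocab_size)
print(f"expected initial loss: {init_loss:.2f} nats")  # ~10.83

# A loss of, say, 0.05 implies the model assigns ~exp(-0.05) = 95%
# probability to every target token -- implausible on fresh text, so
# near-zero loss usually points at target leakage (e.g. an input/label
# shift bug) or training on the same small batch over and over.
implied_prob = math.exp(-0.05)
print(f"avg target probability implied by loss 0.05: {implied_prob:.2%}")
```

That’s why I suspect a bug rather than genuinely fast convergence.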
Code, raw logs, raw CSV data, and graphs of the training loss, validation loss, and gradient norms are all available in this GitHub gist: https://gist.github.com/iankronquist/68bc7e51178aef47dd225074e5310814#file-trainingruninfo-md
I have a rather Llama-like architecture with RoPE. Unlike Llama, I’m using GELU (like GPT-2) instead of SwiGLU in the MLP to save a few parameters on the gate matrices. I’m using an embedding dimension of 768, 12 layers, an MLP up-projection ratio of 4, and grouped-query attention with a query-to-KV head ratio of 4 (all like GPT-2 small and Llama). I’m using the GPT-2 tokenizer with a vocab size of 50304. This comes out to around 114M parameters, so it seems like I’m on the beaten path.
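To double-check the ~114M figure, here’s my back-of-the-envelope parameter count for that config. Note the assumptions baked in (no linear biases, RMSNorm, and tied input/output embeddings); they’re mine, not confirmed from the gist, and an untied LM head would add another ~39M.

```python
# Back-of-the-envelope parameter count for the config above.
# Assumes no biases, RMSNorm, and a tied input/output embedding --
# these are my assumptions, check them against the actual code.
d_model, n_layers, n_heads, kv_ratio = 768, 12, 12, 4
mlp_ratio, vocab_size = 4, 50304

head_dim = d_model // n_heads              # 64, as in GPT-2 small
kv_dim = head_dim * (n_heads // kv_ratio)  # 3 KV heads -> 192

embed = vocab_size * d_model               # shared with the LM head (tied)
attn = 2 * d_model * d_model + 2 * d_model * kv_dim    # Q, O + K, V
mlp = 2 * d_model * (mlp_ratio * d_model)              # up + down (GELU, no gate)
norms = 2 * d_model                        # two RMSNorms per block

total = embed + n_layers * (attn + mlp + norms) + d_model  # + final norm
print(f"{total / 1e6:.1f}M parameters")    # ~113.0M, close to the 114M I see
```

The small gap to 114M could just be biases or other odds and ends I’m not counting here.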