@cosminscn
Last active August 28, 2025 01:03
Distributed training notes - aug 27 25

A100 - spec 312 TFLOPS (BF16/FP16 dense, tensor cores)

40GB or 80GB HBM2/HBM2e; 40MB L2 cache
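A quick roofline sanity check on those specs. A minimal sketch, assuming ~2.0 TB/s HBM bandwidth (roughly the A100 80GB SXM figure; the 40GB part is lower):

```python
# Back-of-envelope roofline for an A100.
# Assumed numbers: 312 TFLOPS dense BF16/FP16, ~2.0 TB/s HBM bandwidth.
peak_flops = 312e12        # FLOP/s, BF16 tensor-core dense
hbm_bandwidth = 2.0e12     # bytes/s, approx. A100 80GB SXM

# Arithmetic intensity (FLOP/byte) a kernel needs to be compute-bound
break_even_intensity = peak_flops / hbm_bandwidth
print(f"break-even intensity: {break_even_intensity:.0f} FLOP/byte")  # 156
```

Anything doing fewer than ~156 FLOP per byte moved from HBM is bandwidth-bound on this part.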

Large model run

  • DeepSpeed ZeRO-3 - should read the DeepSpeed/ZeRO paper; looks like their baseline was model parallelism with batch size 2?

  • bloom blog https://huggingface.co/blog/bloom-megatron-deepspeed

  • activations

    • 12 (input / proj / attention / nonlin) x hidden_dim x local_batch x seq_len x transformer_layers x 2 bytes (activation size, fp16)
  • params

    • transformer_layers x 12 x hidden_dim^2 (8 hidden_dim^2 for the 4x MLP up/down + 4 hidden_dim^2 for attention Q/K/V/O) x 2 bytes (fp16) or 4 (fp32)

activations_per_layer / params_per_layer == local_batch x seq_len / hidden_dim == 2 x 512 / 1024 == 1? seems high?
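The accounting above as a runnable sketch, in bytes, assuming fp16 (2 bytes/element) and the toy sizes from the note (local_batch=2, seq_len=512, hidden_dim=1024; layer count is arbitrary since it cancels in the ratio):

```python
# Activation vs. parameter memory per the formulas above, fp16 assumed.
def activation_bytes(hidden_dim, local_batch, seq_len, n_layers, elem_size=2):
    # 12 ~ input / proj / attention / nonlin activations kept per layer
    return 12 * hidden_dim * local_batch * seq_len * n_layers * elem_size

def param_bytes(hidden_dim, n_layers, elem_size=2):
    # 12 * hidden_dim^2 per layer: 8 h^2 for the 4x MLP, 4 h^2 for Q/K/V/O
    return n_layers * 12 * hidden_dim * hidden_dim * elem_size

h, b, s, L = 1024, 2, 512, 24
acts = activation_bytes(h, b, s, L)
params = param_bytes(h, L)
print(acts / params)  # == local_batch * seq_len / hidden_dim == 1.0
```

The layer count and the factor of 12 cancel, leaving exactly the b*s/h ratio in the note.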

for big models only hidden_dim gets larger, so the ratio should drop?

hmm do I need to compare per layer? or assume fsdp?

FSDP

FSDP forward pass:
    for layer_i in layers:
        all-gather full weights for layer_i
        forward pass for layer_i
        discard full weights for layer_i

FSDP backward pass:
    for layer_i in layers:
        all-gather full weights for layer_i
        backward pass for layer_i
        discard full weights for layer_i
        reduce-scatter gradients for layer_i
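The communication pattern above can be simulated in a single process. A minimal sketch, assuming 2 "ranks" that each own one row-shard of every layer's weights; all-gather is modeled as concatenating shards, reduce-scatter as summing gradients and re-splitting:

```python
# Single-process simulation of the FSDP per-layer comm pattern.
import numpy as np

n_ranks = 2
layers = [np.arange(8.0).reshape(2, 4) for _ in range(3)]       # full weights
shards = [np.array_split(w, n_ranks, axis=0) for w in layers]   # what ranks store

def all_gather(layer_shards):
    # Rebuild the full weight from the per-rank shards
    return np.concatenate(layer_shards, axis=0)

def reduce_scatter(per_rank_grads):
    # Reduce (sum) the per-rank gradients, then scatter shards back
    total = sum(per_rank_grads)
    return np.array_split(total, n_ranks, axis=0)

for layer_shards in shards:
    full_w = all_gather(layer_shards)                  # materialize layer
    grads = [np.ones_like(full_w) for _ in range(n_ranks)]  # fake backward
    grad_shards = reduce_scatter(grads)                # each rank keeps one shard
    del full_w                                         # "discard full weights"
```

Each rank only ever holds its own shard plus one transiently materialized layer, which is the memory win over plain DDP.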

simple fsdp implementation with torch compile https://github.com/facebookresearch/capi/blob/main/fsdp.py

JAX course

GPU Blog
