Brief description of Burn-rs modules

Convolutional Modules

Conv1d / Conv2d / Conv3d

Apply a sliding filter to an input signal. Used for spatial data (e.g. 1D = audio, 2D = images, 3D = video).
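
A minimal sketch of the sliding-filter idea, in plain Rust rather than Burn's actual API (the function and names here are illustrative only):

```rust
/// Naive 1D convolution (cross-correlation, as conv layers compute it):
/// slide the kernel across the signal and take a dot product at each step.
fn conv1d(signal: &[f32], kernel: &[f32]) -> Vec<f32> {
    let out_len = signal.len() - kernel.len() + 1; // "valid" padding
    (0..out_len)
        .map(|i| {
            kernel
                .iter()
                .zip(&signal[i..i + kernel.len()])
                .map(|(k, x)| k * x)
                .sum()
        })
        .collect()
}

fn main() {
    // A 3-tap moving-average filter smoothing a short signal.
    let out = conv1d(&[1.0, 2.0, 3.0, 4.0, 5.0], &[1.0 / 3.0; 3]);
    println!("{:?}", out); // approximately [2.0, 3.0, 4.0]
}
```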

ConvTranspose1d / ConvTranspose2d / ConvTranspose3d

Also known as deconvolution; these perform the reverse of a convolution, upsampling the input to a larger spatial size.

DeformConv2d

A more advanced convolution that lets the sampling locations be offset by learned amounts, making it better at handling geometric transformations.

Recurrent and Sequential Modules

Gru

Gated Recurrent Unit; a simpler, computationally cheaper alternative to LSTM that uses fewer gates.

Lstm

Long Short-Term Memory; handles long-term dependencies in data.

BiLstm

Bidirectional LSTM; processes the sequence in both directions, capturing both forward and backward dependencies in the data.

GateController

Controls the flow of information through the network, determining what data to keep and what to forget.
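
A minimal sketch of the gating idea behind these modules: one GRU step on a scalar hidden state, in plain Rust rather than Burn's API, with made-up illustrative weights:

```rust
struct GruWeights { wz: f32, uz: f32, wr: f32, ur: f32, wh: f32, uh: f32 }

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// One GRU step. The gates are sigmoid outputs in (0, 1) that decide how
/// much of the old state to keep and how much history feeds the update.
fn gru_step(h: f32, x: f32, w: &GruWeights) -> f32 {
    let z = sigmoid(w.wz * x + w.uz * h); // update gate: how much to rewrite
    let r = sigmoid(w.wr * x + w.ur * h); // reset gate: how much history to use
    let h_cand = (w.wh * x + w.uh * (r * h)).tanh(); // candidate new state
    (1.0 - z) * h + z * h_cand // blend old state with the candidate
}

fn main() {
    let w = GruWeights { wz: 0.5, uz: 0.5, wr: 0.5, ur: 0.5, wh: 1.0, uh: 1.0 };
    let mut h = 0.0_f32;
    for x in [1.0, -1.0, 0.5] {
        h = gru_step(h, x, &w);
        println!("h = {h:.4}");
    }
}
```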

Attention and Transformer Modules

MultiHeadAttention
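
For reference, the scaled dot-product attention at the core of multi-head attention is

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V $$

where each head applies this to its own learned projections of the queries, keys and values, and the heads' outputs are concatenated.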

PositionalEncoding

RotaryEncoding

TransformerDecoder / TransformerDecoderLayer

TransformerEncoder / TransformerEncoderLayer

PositionWiseFeedForward

A component of a transformer that is applied to each position independently and identically.
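
Concretely, the standard form (from the original Transformer; some variants swap the ReLU for GELU) is

$$ \mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2 $$

applied to each position's vector separately, with the same weights at every position.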

Pooling Modules

These modules reduce the spatial size of a feature map.

AdaptiveAvgPool1d / AdaptiveAvgPool2d

Adapts the pooling kernel size to produce a fixed-size output regardless of the input size.

AvgPool1d / AvgPool2d

MaxPool1d / MaxPool2d

Selects the maximum value within each fixed-size window, which helps retain the most salient features.
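
A sketch of 1D max pooling in plain Rust (not Burn's API), using non-overlapping windows, i.e. stride equal to the window size:

```rust
/// Naive 1D max pooling: take the maximum over each non-overlapping window.
fn max_pool1d(input: &[f32], window: usize) -> Vec<f32> {
    input
        .chunks_exact(window)
        .map(|w| w.iter().cloned().fold(f32::NEG_INFINITY, |m, v| m.max(v)))
        .collect()
}

fn main() {
    let out = max_pool1d(&[1.0, 3.0, 2.0, 5.0, 4.0, 0.0], 2);
    println!("{:?}", out); // [3.0, 5.0, 4.0]
}
```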

Loss Functions

BinaryCrossEntropyLoss

Used for binary classification problems.

CosineEmbeddingLoss

Measures how similar (or dissimilar) two tensors are using cosine similarity.

CrossEntropyLoss

Used for multi-class classification problems.

MseLoss

Mean squared error, common for regression problems. It's simple and efficient. Because errors are squared, larger errors have a disproportionately large impact, so avoid it when the data contains many outliers: the model will skew towards fitting them.
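
For reference, with targets $y_i$ and predictions $\hat{y}_i$:

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$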

HuberLoss

Less sensitive to outliers than mean squared error. It takes a parameter ($\delta$) that defines the threshold between the quadratic and linear regimes: it behaves like MSE for small errors, but grows only linearly for large ones, so outliers don't dominate the fit.
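
With the error $e = y - \hat{y}$, the loss is quadratic inside the threshold and linear outside it:

$$ L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le \delta \\ \delta \left( |e| - \frac{1}{2} \delta \right) & \text{otherwise} \end{cases} $$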

PoissonNllLoss

Poisson Negative Log Likelihood Loss. A specialised loss function for regression where the target is count data, i.e. non-negative integers. It assumes the targets follow a Poisson distribution (the variance of the data is proportional to its mean). It strongly penalises predictions that are too small (close to zero), which is desirable for count data.
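
Up to a constant term that depends only on the target, the loss for a prediction $\hat{y}$ and count $y$ is

$$ L(\hat{y}, y) = \hat{y} - y \log \hat{y} $$

so as $\hat{y} \to 0$ with $y > 0$, the $-y \log \hat{y}$ term grows without bound, which is where the strong penalty on near-zero predictions comes from.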

Normalisation Modules

BatchNorm

GroupNorm

InstanceNorm

LayerNorm

RmsNorm
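
Apart from RmsNorm, these all share the same core computation and differ mainly in which axes the statistics are taken over (across the batch, a group of channels, each instance, or each layer):

$$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta $$

RmsNorm skips the mean subtraction and simply rescales by the root mean square, $x \cdot \gamma / \mathrm{RMS}(x)$, which is cheaper to compute.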

Other Core Modules

Interpolate1d / Interpolate2d

Perform interpolation to resize a tensor, estimating new values from existing ones. Used to scale input up or down to a desired size.

Dropout

Randomly sets some inputs to zero during training to prevent overfitting.
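
A minimal sketch of inverted dropout in plain Rust (not Burn's implementation; assumes the rand crate with its 0.8-style API):

```rust
use rand::Rng;

/// Inverted dropout: zero each element with probability `p` during training,
/// scaling survivors by 1/(1-p) so the expected activation is unchanged.
fn dropout(input: &mut [f32], p: f32, training: bool) {
    if !training {
        return; // at inference time dropout is a no-op
    }
    let mut rng = rand::thread_rng();
    let scale = 1.0 / (1.0 - p);
    for x in input.iter_mut() {
        *x = if rng.gen::<f32>() < p { 0.0 } else { *x * scale };
    }
}

fn main() {
    let mut acts = vec![1.0_f32; 8];
    dropout(&mut acts, 0.5, true);
    println!("{:?}", acts); // roughly half zeros, survivors scaled to 2.0
}
```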

Embedding

Converts integer indices to dense vectors; commonly used in natural language processing.

Linear

Common layer that performs a linear transformation of the data.
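
For a single input vector this is just $y = xW + b$; a plain-Rust sketch (not Burn's API):

```rust
/// Dense layer on one input vector: each output is a weighted sum plus bias.
fn linear(x: &[f32], w: &[Vec<f32>], b: &[f32]) -> Vec<f32> {
    // `w` has one row per output feature; each row is dotted with `x`.
    w.iter()
        .zip(b)
        .map(|(row, bias)| {
            row.iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f32>() + bias
        })
        .collect()
}

fn main() {
    // Two outputs from three inputs.
    let w = vec![vec![1.0, 0.0, -1.0], vec![0.5, 0.5, 0.5]];
    let b = vec![0.0, 1.0];
    println!("{:?}", linear(&[1.0, 2.0, 3.0], &w, &b)); // [-2.0, 4.0]
}
```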

Activation Functions

Various activation functions that introduce non-linearity, allowing the network to learn complex patterns.

Relu

Rectified Linear Unit, f(x)=max(0, x). The most popular activation function: it passes positive input through unchanged and clamps negative input to zero.

LeakyRelu

Similar to Relu, f(x)=max(ax, x) where a is a small constant such as 0.01. Used to prevent neurons from becoming inactive ("dying").

PRelu

Similar to Relu, f(x)=max(ax, x) where a is not a constant, but a learnable parameter. It's more flexible than Relu since it can learn the optimal value of a.
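
The whole Relu family differs only in how negative inputs are treated, which a few lines of plain Rust make clear (illustrative, not Burn's API):

```rust
fn relu(x: f32) -> f32 {
    x.max(0.0) // negative inputs are clamped to zero
}

fn leaky_relu(x: f32, a: f32) -> f32 {
    // `a` is a small fixed constant, e.g. 0.01; in PRelu it is learned instead.
    if x >= 0.0 { x } else { a * x }
}

fn main() {
    println!("{} {}", relu(-2.0), leaky_relu(-2.0, 0.01)); // 0 -0.02
}
```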

Sigmoid

$$ f(x) = \frac{1}{1 + e^{-x}} $$ An "S"-shaped curve; squashes input into the 0..1 range.

HardSigmoid

A piecewise-linear approximation of Sigmoid that is computationally cheaper; a common form is $f(x) = \mathrm{clamp}(x/6 + 1/2,\ 0,\ 1)$.

Tanh

$$ f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$ Similar to Sigmoid, it has an "S"-shaped curve, but squashes input into the -1..1 range.

Gelu

$$ f(x) = x \Phi(x) $$ A more modern choice; $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.

SwiGlu

A variant of the Gated Linear Unit activation that uses the Sigmoid-Weighted Linear Unit (SiLU) function.
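
In the form proposed by Shazeer (2020), roughly:

$$ \mathrm{SwiGLU}(x) = \mathrm{SiLU}(xW + b) \otimes (xV + c) $$

where $\otimes$ is element-wise multiplication and $W$, $V$ are separate learned projections.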

Unfold4d

Extracts sliding local blocks from a 4D input tensor (im2col-style unfolding).

BF16 / F16

Data types for reduced-precision computation, trading accuracy for performance and a smaller memory footprint.

Helper Types

Ignored

Tensor

Param

RunningState
