Apply a sliding filter (kernel) to an input signal. Used for spatial data (e.g. 1D = audio, 2D = images, 3D = video).
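A minimal PyTorch-style sketch (the layer sizes and shapes here are illustrative assumptions): a 2D convolution slides 3×3 filters over a batch of single-channel 32×32 inputs, producing 8 feature maps.

```python
import torch
import torch.nn as nn

# 3x3 filters, padding=1 keeps the spatial size unchanged
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(4, 1, 32, 32)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                  # torch.Size([4, 8, 32, 32])
```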
Often loosely called deconvolution; performs roughly the reverse of a convolution, upsampling the input to a larger spatial size.
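A minimal sketch under assumed shapes: with stride 2, a transposed convolution roughly doubles the spatial size of a feature map.

```python
import torch
import torch.nn as nn

# stride=2 with a 2x2 kernel doubles height and width
up = nn.ConvTranspose2d(in_channels=8, out_channels=4, kernel_size=2, stride=2)
x = torch.randn(1, 8, 16, 16)
y = up(x)
print(y.shape)                  # torch.Size([1, 4, 32, 32])
```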
A more advanced convolution that lets the kernel's sampling locations be offset by learned amounts, making it better at handling geometric transformations.
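One way to wire this up, sketched with torchvision's `DeformConv2d` (the channel counts and shapes are assumptions): a separate convolution predicts per-location sampling offsets for the 3×3 kernel.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

x = torch.randn(1, 8, 16, 16)
# 2 offsets (dy, dx) per kernel position -> 2 * 3 * 3 = 18 channels
offset_pred = nn.Conv2d(8, 18, kernel_size=3, padding=1)
deform = DeformConv2d(8, 16, kernel_size=3, padding=1)
y = deform(x, offset_pred(x))
print(y.shape)                  # torch.Size([1, 16, 16, 16])
```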
Gated Recurrent Unit
Long Short-Term Memory; handles long-term dependencies in sequential data.
Bidirectional LSTM, which captures both forward and backward dependencies in a sequence.
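A small sketch with assumed sizes: `bidirectional=True` turns an `nn.LSTM` into a BiLSTM, doubling the feature dimension of the output. `nn.GRU` exposes the same interface, minus the cell state.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True,
               bidirectional=True)
x = torch.randn(4, 7, 10)       # (batch, seq_len, features)
out, (h_n, c_n) = lstm(x)
print(out.shape)                # torch.Size([4, 7, 40])  (20 * 2 directions)
```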
Controls the flow of information through the network, determining what data to keep and what to forget.
A component of a transformer that applies the same feed-forward transformation to each position separately.
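A minimal sketch of a position-wise feed-forward block with assumed dimensions: the same two-layer MLP is applied to every position of the sequence.

```python
import torch
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)
x = torch.randn(4, 16, 512)     # (batch, positions, d_model)
y = ffn(x)                      # applied independently to each position
print(y.shape)                  # torch.Size([4, 16, 512])
```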
These modules reduce the spatial size of a feature map.
Adapts the pooling kernel size to produce a fixed-size output regardless of the input size.
Selects the maximum value within each fixed-size pooling window, which helps retain the most salient features.
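A short sketch (shapes assumed) contrasting the two pooling styles above: a fixed 2×2 max pool halves the spatial size, while adaptive average pooling produces a fixed 1×1 output whatever the input size.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
print(nn.MaxPool2d(kernel_size=2)(x).shape)          # torch.Size([1, 8, 16, 16])
print(nn.AdaptiveAvgPool2d(output_size=1)(x).shape)  # torch.Size([1, 8, 1, 1])
```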
Used for binary classification problems.
Measures similarity between two tensors.
Used for multi-class classification problems.
Mean squared error, common for regression problems. It's simple and efficient. Because errors are squared, larger errors have a disproportionately larger impact, so you shouldn't use it when there are many outliers, since it will skew the model towards fitting them.
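For illustration (the values are made up), MSE averages the squared differences between predictions and targets.

```python
import torch
import torch.nn as nn

pred   = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])
loss = nn.MSELoss()(pred, target)   # mean of (0.5^2 + 0.5^2 + 0^2) ≈ 0.1667
print(loss)
```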
Less sensitive to outliers than mean squared error. It takes a parameter (typically called delta) that sets the error threshold at which the loss switches from quadratic to linear.
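A sketch assuming this refers to a Huber-style loss such as PyTorch's `nn.HuberLoss`: errors below `delta` are penalised quadratically, while the outlier below only contributes linearly.

```python
import torch
import torch.nn as nn

pred   = torch.tensor([0.0, 0.0])
target = torch.tensor([0.5, 10.0])          # one small error, one outlier
print(nn.HuberLoss(delta=1.0)(pred, target))  # outlier penalised linearly, not quadratically
```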
Poisson Negative Log Likelihood Loss. A specialised loss function for regression where the target is "count" data, i.e. non-negative integers. It assumes the target follows a Poisson distribution (where the variance of the data equals its mean). It strongly penalises predictions that are too small (i.e. close to zero), which is desirable for count data.
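A small sketch with made-up values; with `log_input=True` (the PyTorch default) the predictions are interpreted as log-rates.

```python
import torch
import torch.nn as nn

log_rate = torch.tensor([0.0, 1.0, 2.0])   # predicted log of the expected count
counts   = torch.tensor([1.0, 3.0, 7.0])   # observed counts
print(nn.PoissonNLLLoss(log_input=True)(log_rate, counts))
```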
Performs interpolation to resize a tensor, estimating new values from existing ones. Used to scale an input up or down to a desired size.
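A minimal sketch (shapes assumed): bilinear interpolation upscales a feature map to a target size.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 16)
y = F.interpolate(x, size=(32, 32), mode="bilinear", align_corners=False)
print(y.shape)                  # torch.Size([1, 3, 32, 32])
```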
Randomly sets some inputs to zero during training to prevent overfitting.
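A quick sketch: dropout only zeroes values in training mode and rescales the survivors; in evaluation mode it is a no-op.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)
drop.train()
print(drop(x))                  # roughly half the values zeroed, rest scaled by 1/(1-p)
drop.eval()
print(drop(x))                  # unchanged: dropout is disabled at evaluation time
```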
Converts integer indices to dense vectors; commonly used in natural language processing.
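A minimal sketch with an assumed vocabulary size: each integer id indexes a learnable 64-dimensional vector.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=64)
token_ids = torch.tensor([[1, 5, 42]])   # (batch, seq_len)
print(emb(token_ids).shape)              # torch.Size([1, 3, 64])
```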
Common layer that performs a linear transformation (plus an optional bias) on the input data.
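A minimal sketch with assumed sizes: y = xWᵀ + b, mapping 128 input features to 10 outputs.

```python
import torch
import torch.nn as nn

fc = nn.Linear(in_features=128, out_features=10)
x = torch.randn(4, 128)
print(fc(x).shape)              # torch.Size([4, 10])
```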
Various activation functions that allow the network to learn complex, non-linear patterns.
Rectified Linear Unit, f(x) = max(0, x). The most popular activation function: it passes positive inputs through unchanged and clamps negative inputs to zero.
Similar to ReLU, f(x) = max(ax, x) where a is a small constant such as 0.01. Used to prevent neurons from becoming inactive ("dying").
Similar to ReLU, f(x) = max(ax, x), but a is a learnable parameter rather than a constant. It's more flexible than ReLU since it can learn the optimal value of a.
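A small sketch comparing the three variants above on a negative input (the LeakyReLU slope and PReLU initial value are assumptions):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 3.0])
print(nn.ReLU()(x))                           # tensor([0., 3.])
print(nn.LeakyReLU(negative_slope=0.01)(x))   # tensor([-0.0200, 3.0000])
print(nn.PReLU(init=0.25)(x))                 # -2 * 0.25 = -0.5; the slope is learnable
```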
$$ f(x) = \frac{1}{1 + e^{-x}} $$ An "S"-shaped curve that squashes the input into the (0, 1) range.
Computationally cheaper than Sigmoid.
$$ f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} $$ Scales the input by the sigmoid of x.
A variant of the Gated Linear Unit activation that uses the Sigmoid-Weighted Linear Unit (SiLU) function.
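A minimal sketch of one common SwiGLU-style feed-forward layer (the class name, projection names, and sizes are assumptions): one projection is passed through SiLU and used to gate a second projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) acts as a gate on up(x)
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)
print(SwiGLU(512, 1024)(x).shape)   # torch.Size([2, 16, 512])
```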
Datatypes for reduced-precision arithmetic, used to improve performance and reduce memory footprint.
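A short sketch: casting a tensor to a reduced-precision dtype such as bfloat16 halves its per-element memory.

```python
import torch

x = torch.randn(4, 128)
print(x.dtype)                        # torch.float32 (4 bytes per element)
x16 = x.to(torch.bfloat16)
print(x16.dtype, x16.element_size())  # torch.bfloat16 2
```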