These are the implicit assumptions (inductive biases) that each type of neural network layer makes about the data it operates on.
Convolutional Networks
- Hierarchy: the signal is made up of simpler primitives that can be combined into more complex ones
- Translational equivariance: a translation of a pattern in the input results in a similar translation in the output (sketched below)
- Parameter sharing: the same parameters are reused at every location in the input
- Locality: pixels next to each other are more closely related than pixels farther apart
When to use
- Time series data (where events close together in the sequence matter more)
- Images
- 3D data (but transformers are pretty close)
When not to use
- Tabular data
- Data where order doesn’t matter, i.e. where the arrangement of adjacent signals carries no meaning
- Data where global relationships matter more than local ones
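A minimal PyTorch sketch of the equivariance and parameter-sharing assumptions (kernel size, channels, and the 5-pixel shift are all made up for illustration): the same kernel slides over every location, so shifting the input shifts the output.

```python
import torch
import torch.nn as nn

# One conv layer: a single set of 3x3 kernels is reused at every
# location (parameter sharing).
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)

x = torch.randn(1, 1, 32, 32)                  # toy single-channel "image"
x_shifted = torch.roll(x, shifts=5, dims=-1)   # translate 5 pixels right

y = conv(x)
y_shifted = conv(x_shifted)

# Translational equivariance: conv(shift(x)) == shift(conv(x)),
# exactly true away from the borders (zero-padding and roll wrap-around
# disturb the edges, so we compare interior columns only).
interior = slice(6, -6)
print(torch.allclose(torch.roll(y, shifts=5, dims=-1)[..., interior],
                     y_shifted[..., interior], atol=1e-6))  # True
```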
Fully Connected Layer
- No structure assumptions: makes no assumptions about the inherent structure of the data; any structure has to be learned
- Global receptive field: sees all of the input at once
- Position dependent: the position of a pattern matters (not translationally equivariant; sketched below)
When to use
- Final classification layers, where we should consider all the features we’ve encoded as a whole
- Tabular data, where there is no spatial / temporal structure to exploit
- Low-dimensional latent spaces, like encoding data into compact representations (e.g. word embeddings)
- Coordinate networks: networks that take coordinates as inputs and output some value (e.g. NeRF)
When not to use
- High-dimensional latent spaces: too many parameters
- When translational equivariance is desired: use a conv instead
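A minimal sketch of these properties in PyTorch (layer sizes are made up): an MLP head sees every feature at once, but shuffling the input features changes the output, since each feature gets its own weights.

```python
import torch
import torch.nn as nn

# A small MLP head, e.g. final classification over encoded features
# or a model for tabular data: every input connects to every output.
mlp = nn.Sequential(
    nn.Linear(64, 128),   # global receptive field: all 64 features at once
    nn.ReLU(),
    nn.Linear(128, 10),   # 10-class logits
)

x = torch.randn(8, 64)    # batch of 8 feature vectors
logits = mlp(x)           # shape (8, 10)

# Position dependent: permuting the input features changes the output,
# because there is no parameter sharing across positions.
perm = torch.randperm(64)
print(torch.equal(mlp(x), mlp(x[:, perm])))  # False (almost surely)
```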
Recurrent Neural Nets
- Sequential / temporal data: our understanding of the data evolves as we step through it
- Markovian process: the current state depends on the previous state, not the full history (vanishing gradients make distant history fade)
- Temporal translational equivariance: a translation of the sequence in time results in a similar translation in the output (CNNs have the same property, for both space and time)
- Causality: can only look backwards (and we assume backwards is all we need)
When to use
- Sequential data, like text or an action unfolding through time
- Variable-length data: RNNs consume data step by step, so the length doesn’t matter (though they lose memory of the early parts of long sequences)
- Online / streaming processing: when the network must run on a stream of data as it arrives (sketched below)
- When past context is needed and future context doesn’t matter
When not to use
- Non-sequential data
- Fixed-length data: nine times out of ten, transformers perform better
- Long-range dependencies: when the model needs context from the start of the sequence (the RNN will forget it)
- Data where order doesn’t matter
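A minimal sketch of the streaming / variable-length use case in PyTorch (the GRU and all sizes are arbitrary choices): the hidden state is the only memory carried forward, so the network can run step by step on a stream of any length.

```python
import torch
import torch.nn as nn

# A GRU run one timestep at a time, as in an online setting.
rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

h = None  # hidden state; None means "start from zeros"
stream = (torch.randn(1, 1, 16) for _ in range(100))  # any length works

for x_t in stream:
    # Causality: y_t depends only on x_t and the past, summarized in h.
    y_t, h = rnn(x_t, h)

print(y_t.shape, h.shape)  # (1, 1, 32) and (1, 1, 32)
```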
Self-Attention Layers
- Almost no assumptions: very generalizable, but needs more data
- Order does not matter inherently: the layer ignores order, so you have to make it learn order, e.g. with positional encodings (sketched below)
- Global receptive field: data far away and data close by are equally likely to be weighted heavily
- Context dependence: the meaning of each element heavily depends on its context
- Variable length: the attention layer itself handles variable-length inputs, but to speed up training we usually batch and pad the inputs
When to use
- Data that depends on data far away (or on the entire context)
- Sets and point clouds, since attention is inherently unordered
- When you have LOTS of data
When not to use
- Too little data
- Strong spatial / temporal patterns exist
- Computational constraints: attention is quadratic in sequence length
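A minimal sketch with PyTorch’s MultiheadAttention (one head, made-up sizes) showing the global receptive field, and that without positional encodings the layer has no notion of order:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=1, batch_first=True)

x = torch.randn(2, 10, 32)   # 2 sequences of 10 tokens each
y, weights = attn(x, x, x)   # self-attention: every token attends to every token

print(weights.shape)         # (2, 10, 10): a full token-to-token weight
                             # matrix, i.e. a global receptive field

# Permutation equivariance: shuffle the tokens and the outputs are the
# same values, just shuffled the same way. Order must be injected
# explicitly (e.g. positional encodings) for the layer to use it.
perm = torch.randperm(10)
y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True
```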
Pooling Layers (MaxPool, AvgPool)
- We can compress the data without losing too much information
- Small shifts in the data don’t matter much (a degree of translation invariance)
- When you want to build hierarchy more explicitly
- MaxPool: the most prominent feature in each window wins
- AvgPool: the features in each window should be aggregated; their average is representative (contrasted in the sketch below)
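A minimal sketch contrasting the two on a toy 4x4 feature map: each 2x2 window is compressed to one number, so the map shrinks and small jitter within a window is absorbed.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 0., 0.],
                    [3., 4., 0., 0.],
                    [0., 0., 5., 6.],
                    [0., 0., 7., 8.]]]])  # shape (1, 1, 4, 4)

# MaxPool: the most prominent activation in each 2x2 window wins.
print(nn.MaxPool2d(2)(x))  # [[4., 0.], [0., 8.]]

# AvgPool: each window is summarized by its mean.
print(nn.AvgPool2d(2)(x))  # [[2.5, 0.], [0., 6.5]]
```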
Batch Normalization
- The statistical characteristics of a batch roughly match those of the entire dataset
- Normalizing activations is needed and helps training
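A minimal sketch of the batch-statistics assumption (feature and batch sizes made up): the normalization is computed across the batch dimension, so it implicitly trusts the batch to represent the dataset.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=8)

x = torch.randn(32, 8)   # 32 samples, 8 features
y = bn(x)                # each feature normalized with BATCH statistics

print(y.mean(dim=0))     # ~0 for every feature (mean across the batch)
print(y.std(dim=0))      # ~1 for every feature
```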
Layer Norm
- Each sample’s statistics should be computed on its own, independently of the rest of the batch
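A minimal sketch of the contrast with BatchNorm (same made-up sizes): statistics are computed per sample, across its features, so the output is independent of who else is in the batch.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=8)

x = torch.randn(32, 8)
y = ln(x)                # each SAMPLE normalized with its own statistics

print(y.mean(dim=1))     # ~0 for every sample (mean across features)
print(torch.allclose(ln(x[:1]), y[:1], atol=1e-6))  # True: batch-independent
```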
Dropout
- No single neuron matters: the network is forced to learn redundant representations rather than relying on any one unit
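A minimal sketch (p=0.5 is an arbitrary choice): during training, random neurons are zeroed and the survivors rescaled; at eval time the layer is a pass-through.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

drop.train()
print(drop(x))  # ~half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))  # all ones: dropout is disabled at inference
```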
