These are the implicit assumptions that each type of neural network layer makes about the data it is dealing with.

Convolutional Networks

  • Hierarchy: the signal is assumed to be built from simpler primitives that can be combined into more complex patterns
  • Translational Equivariance: a translation of a pattern in the input results in a corresponding translation in the output (sketched below)
  • Parameter Sharing: the same parameters (kernel weights) are reused at every location in the input
  • Locality: pixels beside each other are more closely related than pixels farther apart
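
A minimal sketch (assuming PyTorch) of parameter sharing and translational equivariance: the same kernel is applied everywhere, so shifting the input shifts the output.

```python
# Sketch: shifting the input of a conv layer shifts its output (away from the borders).
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 16, 16)                  # one 16x16 single-channel image
x_shifted = torch.roll(x, shifts=2, dims=-1)   # translate 2 pixels to the right

y, y_shifted = conv(x), conv(x_shifted)

# Ignoring border / wrap-around columns, the shifted output matches the output of the shifted input.
print(torch.allclose(torch.roll(y, shifts=2, dims=-1)[..., 4:-4], y_shifted[..., 4:-4], atol=1e-5))
```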

When to use

  • Time series data (where nearby points in the sequence matter more than distant ones)
  • Images
  • 3D data (though transformers are getting pretty close)

When not to use

  • Tabular data
  • Data where order doesn’t matter, i.e. where the arrangement of adjacent signals carries no meaning
  • Data where global relationships matter more than local ones

Fully Connected Layer

  • No structure assumptions: makes no assumptions about the inherent structure of the data; any structure has to be learned
  • Global receptive field: every output unit sees all of the input at once
  • Position dependent: the position of a pattern matters (not translationally equivariant; see the sketch below)
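
A minimal sketch (assuming PyTorch) of position dependence: permuting the input features of a fully connected layer changes its output, because there is no weight sharing across positions.

```python
# Sketch: a Linear layer is position dependent, so reordering the input features changes the output.
import torch
import torch.nn as nn

fc = nn.Linear(in_features=8, out_features=4, bias=False)

x = torch.randn(1, 8)
perm = torch.randperm(8)

y = fc(x)
y_perm = fc(x[:, perm])            # same values, different positions

print(torch.allclose(y, y_perm))   # almost surely False: position matters
```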

When to use

  • Final classification layers, where all of the encoded features should be considered as a whole
  • Tabular data, where there is no notion of spatial / temporal structure in the data
  • Low-dimensional latent spaces, like encoding data into embeddings (word embeddings?)
  • Coordinate networks: networks that take coordinates as inputs and output some value (e.g. NeRF); a sketch follows this list
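
A hypothetical coordinate-network sketch (assuming PyTorch); the architecture and sizes here are illustrative, not taken from NeRF: an MLP that maps a 2D coordinate to a value.

```python
# Sketch: an MLP that takes an (x, y) coordinate and outputs a single value.
import torch
import torch.nn as nn

coord_mlp = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),              # predicted value at that coordinate
)

# Query the network on a 32x32 grid of (x, y) positions.
xs = torch.linspace(0, 1, 32)
grid = torch.stack(torch.meshgrid(xs, xs, indexing="ij"), dim=-1).reshape(-1, 2)
values = coord_mlp(grid)           # shape: (32 * 32, 1)
```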

When not to use

  • High-dimensional latent spaces: too many parameters (see the comparison below)
  • When translation equivariance is desired: use a convolution instead
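
A back-of-the-envelope comparison (assuming PyTorch; the layer sizes are arbitrary) of why fully connected layers blow up on high-dimensional inputs:

```python
# Sketch: parameter counts for a fully connected layer vs. a convolution on a 224x224x3 input.
import torch.nn as nn

fc = nn.Linear(224 * 224 * 3, 256)        # every input value connects to every unit
conv = nn.Conv2d(3, 256, kernel_size=3)   # a 3x3 kernel shared over all locations

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # ~38.5 million parameters
print(count(conv))  # ~7 thousand parameters
```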

Recurrent Neural Nets

  • Data is sequential / temporal: our understanding of the data evolves as we step through it in order
  • Markovian process: the current state depends on the previous state, not on the full history (in practice, vanishing gradients erase context that is too far back); see the sketch after this list
  • Temporal Translational Equivariance: a shift of the sequence in time results in a corresponding shift in the RNN’s output (CNNs have the same property, but for both space and time)
  • Causality: can only look backwards (and we assume that looking backwards in time is all we need)
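
A minimal sketch (assuming PyTorch) of the Markovian, step-by-step view: the state at time t is a function of the input at time t and the state at time t-1 only.

```python
# Sketch: a GRU cell carries a hidden state forward one step at a time.
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=8, hidden_size=16)

seq = torch.randn(20, 1, 8)   # 20 time steps, batch of 1
h = torch.zeros(1, 16)        # initial state

for x_t in seq:               # process the sequence causally, one step at a time
    h = cell(x_t, h)          # h_t = f(x_t, h_{t-1})
```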

When to use

  • Sequential data, like text or an action unfolding through time
  • Data of variable length: an RNN processes the data step by step, so the length doesn’t matter (though it gradually loses memory of the earliest steps); see the packing sketch below
  • Online / streaming processing: when the network has to operate online on a stream of data
  • Past context is needed and future context doesn’t matter
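
A sketch (assuming PyTorch) of batching variable-length sequences by padding and packing them, so the RNN skips the padded steps:

```python
# Sketch: pad variable-length sequences into one batch, then pack them for the RNN.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]   # different lengths
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)                      # shape: (3, 7, 8)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

output, h_n = rnn(packed)   # h_n holds the last real (non-padded) state of each sequence
```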

When not to use

  • Non-sequential data
  • Data of fixed length: nine times out of ten, transformers perform better
  • Long-range dependencies: when the RNN needs context from the start of the sequence (it will have forgotten it)
  • Data where order doesn’t matter

Self-Attention Layers

  • Almost no assumptions: very general, but needs more data to learn the structure itself
  • Order does not matter: inherently, the layer ignores order; you have to make it learn order, which is usually done with positional encodings (sketched below)
  • Global receptive field: data far away and data close by can be weighted equally strongly or weakly
  • Each element’s meaning heavily depends on its context
  • Variable length: the attention layer itself can handle variable lengths, but to speed up training we usually batch and pad the inputs
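
A minimal sketch (assuming PyTorch) of the “order does not matter” point: without positional encodings, self-attention is permutation equivariant, so reordering the tokens just reorders the output.

```python
# Sketch: permuting the input tokens of self-attention permutes the output the same way.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

x = torch.randn(1, 5, 16)     # batch of 1, 5 tokens, no positional encodings added
perm = torch.randperm(5)

y, _ = attn(x, x, x)          # self-attention: queries = keys = values
y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

print(torch.allclose(y[:, perm], y_perm, atol=1e-5))   # True: the layer has no notion of order
```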

When to use

  • Data that depends on data far away (or on the entire context)
  • Sets and point clouds (no inherent ordering)
  • When you have lots of data

When not to use

  • Too little data
  • Strong spatial / temporal patterns exist
  • Computational constraints (attention cost grows quadratically with sequence length)

Pooling Layers (MaxPool, AvgPool)

  • The data can be compressed (downsampled) without losing too much information
  • Small shifts in the data don’t matter too much
  • Useful when you want to build hierarchy more explicitly
  • MaxPool: the most prominent feature in each window wins
  • AvgPool: aggregating the features in each window gives the best summary (both sketched below)
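
A minimal sketch (assuming PyTorch) of the two behaviours on a tiny input:

```python
# Sketch: max pooling keeps the strongest value per window, average pooling summarizes the window.
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 0., 0.],
                    [3., 4., 0., 0.],
                    [0., 0., 5., 6.],
                    [0., 0., 7., 8.]]]])    # shape (1, 1, 4, 4)

print(nn.MaxPool2d(kernel_size=2)(x))       # [[4., 0.], [0., 8.]] : most prominent feature wins
print(nn.AvgPool2d(kernel_size=2)(x))       # [[2.5, 0.], [0., 6.5]] : window averages
```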

Batch Normalization

  • The statistical characteristics of each batch are roughly similar to the statistical characteristics of the entire dataset
  • Normalizing activations is needed and helps training
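
A minimal sketch (assuming PyTorch) of why the batch statistics matter: in training mode each feature is normalized with the current batch’s mean and variance, while running estimates are kept for eval time.

```python
# Sketch: BatchNorm normalizes each feature using statistics computed over the batch.
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)

x = torch.randn(32, 4) * 5 + 10   # a batch with mean ~10 and std ~5
y = bn(x)                         # training mode: uses this batch's statistics

print(y.mean(dim=0))              # ~0 per feature
print(y.std(dim=0))               # ~1 per feature
print(bn.running_mean)            # running estimate, used once bn.eval() is set
```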

Layer Norm

  • The statistics of each sample within a layer should be considered separately from the rest of the batch
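
A minimal sketch (assuming PyTorch): LayerNorm normalizes each sample over its own features, so the result does not depend on the rest of the batch.

```python
# Sketch: LayerNorm gives the same answer whether a sample is alone or inside a batch.
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=4)

x = torch.randn(32, 4)
print(torch.allclose(ln(x)[0], ln(x[:1])[0], atol=1e-6))   # True: statistics are per sample
```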

Dropout

  • No single neuron should matter; the network shouldn’t be able to rely on any one unit
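
A minimal sketch (assuming PyTorch): during training, Dropout randomly zeroes activations (and rescales the rest), so the network cannot rely on any single neuron.

```python
# Sketch: Dropout zeroes a random subset of activations in training mode, and is a no-op in eval mode.
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)

x = torch.ones(1, 8)
print(drop(x))    # training mode: roughly half the entries are 0, the rest are scaled to 2
drop.eval()
print(drop(x))    # eval mode: identity, all ones
```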