# What is `torch.nn` Really?

An exploration of what `torch.nn` actually is, and the motivation behind each of its modules existing in the first place.

# MNIST Setup

The document I am referencing is https://docs.pytorch.org/tutorials/beginner/nn_tutorial.html; it uses the MNIST dataset to make its point.

```python
from pathlib import Path
import requests

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"

PATH.mkdir(parents=True, exist_ok=True)

URL = "https://github.com/pytorch/tutorials/raw/main/_static/"
FILENAME = "mnist.pkl.gz"

# Download the pickled MNIST data if it is not already on disk
if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    (PATH / FILENAME).open("wb").write(content)
```
```python
import pickle
import gzip

# The dataset is stored in pickle format, Python's object-serialization format
with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")
```
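
A quick, optional sanity check on what was just unpickled (this cell is my own, not from the tutorial); these should be plain NumPy arrays of flattened 28x28 images and their labels:

```python
# Shapes should match the tensor shapes printed later: (50000, 784) images, (50000,) labels
print(type(x_train), x_train.shape, x_train.dtype)
print(type(y_train), y_train.shape, y_train.dtype)
```
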
```python
from matplotlib import pyplot
import numpy as np

# The dataset is just a bunch of 28x28-pixel handwritten digits; about as basic as OCR gets
pyplot.imshow(x_train[88].reshape((28, 28)), cmap="gray")
pyplot.show()
```

**Output:**

```
<Figure size 640x480 with 1 Axes>
```

```python
import torch
from functools import partial

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# PyTorch operates on its own tensor datatype, so convert the NumPy arrays to tensors
# (and move them to the GPU if one is available)
x_train, y_train, x_valid, y_valid = map(
    partial(torch.as_tensor, device=device),
    (x_train, y_train, x_valid, y_valid)
)

print(x_train.shape, x_train.device)
print(y_train.shape, y_train.device)
print(x_valid.shape, x_valid.device)
print(y_valid.shape, y_valid.device)
```

**Output:**

```
torch.Size([50000, 784]) cuda:0
torch.Size([50000]) cuda:0
torch.Size([10000, 784]) cuda:0
torch.Size([10000]) cuda:0
```

# Gradients

One powerful thing about PyTorch is that it can calculate gradients automatically. This is done with `requires_grad=True`.
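
Before using it on the model below, here is a toy illustration (my own, not from the tutorial) of what autograd does:

```python
# autograd records operations on tensors with requires_grad=True and
# replays them in reverse when .backward() is called
a = torch.tensor([2.0, 3.0], requires_grad=True)
b = (a ** 2).sum()   # b = a0**2 + a1**2
b.backward()         # fills a.grad with db/da = 2 * a
print(a.grad)        # tensor([4., 6.])
```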

```python
import math

# "It works because it works" type initialization of the model weights (called Xavier initialization).
# Well, actually it has some statistical backing: to limit the chances of exploding and vanishing
# gradients, we keep the variance of the activations and gradients the same across layers.
weights = torch.randn(size=(784, 10), device=device) / math.sqrt(784)  # one-shot MNIST
weights.requires_grad_()  # needed because "/ math.sqrt(784)" made a new tensor!!!!!
bias = torch.zeros(10, device=device, requires_grad=True)
```
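
A minimal illustration (my own) of why that in-place `requires_grad_()` call is needed:

```python
# torch.randn(...) does not track gradients by default, and the division
# produces yet another plain tensor, so the result starts out untracked
w = torch.randn(784, 10, device=device) / math.sqrt(784)
print(w.requires_grad)             # False -> autograd would ignore updates to it
w.requires_grad_()                 # turn tracking on in place, keeping w a leaf tensor
print(w.requires_grad, w.is_leaf)  # True True
```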

## Small Aside on Xavier Initialization

**Problem:** keep the variance constant across layers during the forward/backward pass.

**Forward pass:** for $y = \sum_{i=1}^{n_{in}} w_i x_i$, assuming independent, zero-mean terms:

$$\text{Var}(y) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$$

**Goal:** $\text{Var}(y) = \text{Var}(x)$, which requires

$$n_{in} \cdot \text{Var}(w) = 1 \implies \text{Var}(w) = \frac{1}{n_{in}}$$

**Backward pass:** the same argument applied to the gradients flowing backwards gives

$$n_{out} \cdot \text{Var}(w) = 1 \implies \text{Var}(w) = \frac{1}{n_{out}}$$

**Compromise:** Xavier initialization averages the two constraints,

$$\text{Var}(w) = \frac{2}{n_{in} + n_{out}}$$

and draws the weights as

$$W_{ij} \sim \mathcal{N}\left(\mu = 0,\ \sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}\right)$$

The code above uses the simpler forward-only variant: a standard normal scaled by

$$\frac{1}{\sqrt{n_{in}}} = \frac{1}{\sqrt{784}}$$
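
For reference, `torch.nn.init` ships helpers for both flavours; a minimal sketch (these tensors are throwaways, not used by the model below):

```python
from torch import nn

# Xavier/Glorot: Var(w) = 2 / (n_in + n_out), i.e. std = sqrt(2 / (784 + 10))
w_xavier = torch.empty(784, 10, device=device)
nn.init.xavier_normal_(w_xavier)

# The forward-only 1/sqrt(n_in) scaling used in the code above, via nn.init
w_simple = torch.empty(784, 10, device=device)
nn.init.normal_(w_simple, mean=0.0, std=1 / math.sqrt(784))

print(w_xavier.std().item(), w_simple.std().item())
```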

```python
# Let's define some raw Python functions on these tensors
def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def model(x_batch):
    return log_softmax(x_batch @ weights + bias)
```

This is a single-layer model that takes in all the input pixels and outputs a value for each of the 10 classes (to which a softmax is then applied to get a final classification score).

```python
batch_size = 64
x_batch = x_train[:batch_size]
preds = model(x_batch)
preds[0], preds.shape
```

**Output:**

```
(tensor([-2.6786, -2.4870, -2.2257, -2.0464, -2.1061, -2.3727, -1.7868, -2.7828, -2.5606, -2.4143],
        device='cuda:0', grad_fn=<SelectBackward0>), torch.Size([64, 10]))
```

```python
# Implement negative log-likelihood
def nll(input, target):
    return -input[range(target.shape[0]), target].mean()

loss_func = nll

y_batch = y_train[:batch_size]
print(loss_func(preds, y_batch))
```

**Output:**

```
tensor(2.3618, device='cuda:0', grad_fn=<NegBackward0>)
```

```python
# Implement accuracy
def accuracy(out, y_batch):
    preds = torch.argmax(out, dim=1)
    return (preds == y_batch).float().mean()

print(accuracy(preds, y_batch))
```

**Output:**

```
tensor(0.1250, device='cuda:0')
```

Using what has been written from scratch, we can build a training loop.

```python
# Training loop
lr = 0.5
epochs = 2
n, c = x_train.shape
training_loss = []

for e in range(epochs):
    for i in range((n - 1) // batch_size + 1):
        # Build a batch
        start_i = i * batch_size
        end_i = start_i + batch_size
        x_batch = x_train[start_i:end_i]
        y_batch = y_train[start_i:end_i]

        # Run forward
        pred = model(x_batch)
        loss = loss_func(pred, y_batch)
        loss.backward()

        # Update using torch.no_grad() (updating via the .data attribute no longer works here)
        with torch.no_grad():
            weights -= weights.grad * lr
            bias -= bias.grad * lr
            weights.grad.zero_()
            bias.grad.zero_()

# It's a shabby model, but you can see that the loss decreased and the accuracy increased
print(loss_func(model(x_batch), y_batch), accuracy(model(x_batch), y_batch))
```

**Output:**

```
tensor(0.0651, device='cuda:0', grad_fn=<NegBackward0>) tensor(1., device='cuda:0')
```

# Refactoring the Code

The raw computation can be quite a hassle, especially if we want to build more complex networks or share modules with other people. As a result, PyTorch abstracts a lot of these raw Python operations away so that you can have a better time.

## `torch.nn.functional`

This contains all the functions of the `torch.nn` library, including a handful of activation and loss functions.

- our `negative log likelihood` function can be replaced with `F.cross_entropy`
- our `log softmax` function is also absorbed by `F.cross_entropy` (it combines log softmax and negative log likelihood in a single call)

```python
import torch.nn.functional as F

loss_func = F.cross_entropy

# Redefine the model to no longer call log_softmax (cross_entropy handles it for us)
def model(x_batch):
    return x_batch @ weights + bias

print(loss_func(model(x_batch), y_batch), accuracy(model(x_batch), y_batch))
```

**Output:**

```
tensor(0.0651, device='cuda:0', grad_fn=<NllLossBackward0>) tensor(1., device='cuda:0')
```

## `nn.Module`

Speaking of the model, we can abstract it away from the training loop to make things clearer and more concise. `nn.Module` is a base class that lets you define the weights, bias, and forward propagation of a model. It also has a number of useful methods like `parameters()`, `state_dict()`, and `zero_grad()`.
```python
from torch import nn

class MNISTLogistic(nn.Module):
    def __init__(self, device):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(size=(784, 10), device=device) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10, device=device))  # bias initialized as zeros

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weights + self.bias

model = MNISTLogistic(device=device)
print(loss_func(model(x_batch), y_batch))
```

**Output:**

```
tensor(2.3749, device='cuda:0', grad_fn=<NllLossBackward0>)
```

Because `nn.Module` provides `.parameters()` and `.zero_grad()`, the earlier manual weight/bias updates and per-parameter gradient zeroing can now be abstracted into...

```python
def fit():
    for e in range(epochs):
        for i in range((n - 1) // batch_size + 1):
            # Build a batch
            start_i = i * batch_size
            end_i = start_i + batch_size
            x_batch = x_train[start_i:end_i]
            y_batch = y_train[start_i:end_i]

            # Run forward
            pred = model(x_batch)
            loss = loss_func(pred, y_batch)
            loss.backward()

            # Just update the parameters and zero the grad for the whole model
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

fit()
print(loss_func(model(x_batch), y_batch))
```

**Output:**

```
tensor(0.0654, device='cuda:0', grad_fn=<NllLossBackward0>)
```

## `nn.Linear` (and other layers)

Instead of manually defining weights and bias, there's a useful class, `nn.Linear`, that makes them for us. PyTorch has a bunch of other useful layers that can be used the same way.

```python
class MNISTLogistic(nn.Module):
    def __init__(self, device=None):
        super().__init__()
        self.linear = nn.Linear(in_features=784, out_features=10, device=device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

model = MNISTLogistic(device)
print(loss_func(model(x_batch), y_batch))
fit()
print(loss_func(model(x_batch), y_batch))
```

**Output:**

```
tensor(2.3214, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.2296, device='cuda:0', grad_fn=<NllLossBackward0>)
```

## `torch.optim`

This abstracts away optimization. We can call `step()` instead of manually updating all of our parameters with their gradients.

```python
from torch import optim

def get_model():
    model = MNISTLogistic(device)
    return model, torch.optim.SGD(model.parameters(), lr=lr)  # YOU PASS IN YOUR MODEL PARAMETERS TO THE OPTIMIZER

model, opt = get_model()

for e in range(epochs):
    for i in range((n - 1) // batch_size + 1):
        # Build a batch
        start_i = i * batch_size
        end_i = start_i + batch_size
        x_batch = x_train[start_i:end_i]
        y_batch = y_train[start_i:end_i]

        # Predict
        pred = model(x_batch)
        loss = loss_func(pred, y_batch)

        opt.zero_grad()
        loss.backward()
        opt.step()

print(loss_func(model(x_batch), y_batch))
```

**Output:**

```
tensor(0.0829, device='cuda:0', grad_fn=<NllLossBackward0>)
```

## Dataset

PyTorch provides an abstract `Dataset` class to manage data. `TensorDataset` provides direct index-based lookup of the data.

```python
from torch.utils.data import TensorDataset

train_ds = TensorDataset(x_train, y_train)

model, opt = get_model()

for e in range(epochs):
    for i in range((n - 1) // batch_size + 1):
        # A TensorDataset can be sliced straight into an (x, y) batch
        x_batch, y_batch = train_ds[i * batch_size: i * batch_size + batch_size]

        pred = model(x_batch)
        loss = loss_func(pred, y_batch)

        opt.zero_grad()
        loss.backward()
        opt.step()

print(loss_func(model(x_batch), y_batch))
```

**Output:**

```
tensor(0.0830, device='cuda:0', grad_fn=<NllLossBackward0>)
```

## DataLoader

Works in tandem with `Dataset` to allow for batched loading of data.
```python
from torch.utils.data import DataLoader

train_dl = DataLoader(train_ds, batch_size=batch_size)

for e in range(epochs):
    for x_batch, y_batch in train_dl:  # very nice :)
        pred = model(x_batch)
        loss = loss_func(pred, y_batch)

        opt.zero_grad()
        loss.backward()
        opt.step()

print(loss_func(model(x_batch), y_batch))
```

**Output:**

```
tensor(0.0662, device='cuda:0', grad_fn=<NllLossBackward0>)
```

# Add Validation

There hasn't been any validation yet, so we should implement it.

```python
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=batch_size)

val_ds = TensorDataset(x_valid, y_valid)
val_dl = DataLoader(val_ds, batch_size=batch_size)

for e in range(epochs):
    model.train()
    for x_b, y_b in train_dl:
        pred = model(x_b)
        loss = loss_func(pred, y_b)

        opt.zero_grad()
        loss.backward()
        opt.step()

    model.eval()
    with torch.inference_mode():
        val_loss = sum(loss_func(model(x_b), y_b) for x_b, y_b in val_dl) / len(val_dl)

    print(e, val_loss)
```

**Output:**

```
0 tensor(0.2795, device='cuda:0')
1 tensor(0.2786, device='cuda:0')
```

# Create `fit()` and `get_data()`

This is more of a design decision. If you want, you can abstract everything down to roughly three functions.

```python
def batch_loss(model, loss_func, x_b, y_b):
    pred = model(x_b)
    return loss_func(pred, y_b)

def get_data(train_ds, val_ds):
    return (
        DataLoader(train_ds, batch_size=batch_size, shuffle=True),
        DataLoader(val_ds, batch_size=batch_size * 2),
    )

def fit(epochs, model, loss_func, opt, train_dl, val_dl):
    for e in range(epochs):
        model.train()
        for x_b, y_b in train_dl:
            loss = batch_loss(model, loss_func, x_b, y_b)

            opt.zero_grad()
            loss.backward()
            opt.step()

        model.eval()
        with torch.inference_mode():
            val_loss = sum(batch_loss(model, loss_func, x_b, y_b) for x_b, y_b in val_dl) / len(val_dl)

        print(e, val_loss)

train_dl, val_dl = get_data(train_ds, val_ds)
fit(10, model, loss_func, opt, train_dl, val_dl)
```

**Output:**

```
0 tensor(0.2233, device='cuda:0')
1 tensor(0.3145, device='cuda:0')
2 tensor(0.2947, device='cuda:0')
3 tensor(0.2699, device='cuda:0')
4 tensor(0.3732, device='cuda:0')
5 tensor(0.6905, device='cuda:0')
6 tensor(0.2980, device='cuda:0')
7 tensor(0.2964, device='cuda:0')
8 tensor(0.3577, device='cuda:0')
9 tensor(0.2735, device='cuda:0')
```

# Making this thing actually work

The whole model is a single linear layer at the moment; let's actually set it up for success. One way is to introduce convolutional layers to the network. These learn kernels (filters) that are convolved (a rolling dot-product window) over the image to recognize patterns. These patterns and their locations then provide a much better basis for classification in vision tasks generally.
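
Before writing the full model, here is a minimal sketch (my own, not from the tutorial) of what one of these stride-2 convolutions does to the spatial dimensions; the layer below is a throwaway:

```python
# One 3x3 convolution with stride 2 and padding 1 halves each spatial dimension:
# 28 -> 14 -> 7 -> 4, which is why the model below ends with a 4x4 average pool.
x = torch.randn(1, 1, 28, 28, device=device)  # (batch, channels, H, W)
conv = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1, device=device)
print(conv(x).shape)                          # torch.Size([1, 16, 14, 14])
```
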
```python
class MNISTCNN(nn.Module):
    def __init__(self, device=None):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1, device=device)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1, device=device)
        self.conv3 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1, device=device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.view(-1, 1, 28, 28)  # -1 means "infer this dimension automatically"
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.avg_pool2d(x, 4)
        return x.view(-1, x.size(1))

# Momentum is a variation of SGD that is aware of past updates and uses them to help the current update (hence "momentum")
model = MNISTCNN(device)
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

fit(10, model, loss_func, opt, train_dl, val_dl)
```

**Output:**

```
0 tensor(0.5407, device='cuda:0')
1 tensor(0.3209, device='cuda:0')
2 tensor(0.2648, device='cuda:0')
3 tensor(0.2313, device='cuda:0')
4 tensor(0.1830, device='cuda:0')
5 tensor(0.1886, device='cuda:0')
6 tensor(0.1700, device='cuda:0')
7 tensor(0.1602, device='cuda:0')
8 tensor(0.1453, device='cuda:0')
9 tensor(0.1497, device='cuda:0')
```

# `nn.Sequential`

This is just a quicker way to define a model when the forward pass is a plain chain of layers; I don't really see the appeal myself.

```python
class Lambda(nn.Module):
    """Wraps an arbitrary function so it can sit inside nn.Sequential."""
    def __init__(self, func):
        super().__init__()
        self.func = func

    def forward(self, x):
        return self.func(x)

def preprocess(x):
    return x.view(-1, 1, 28, 28)

model = nn.Sequential(
    Lambda(preprocess),
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1, device=device),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1, device=device),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1, device=device),
    nn.ReLU(),
    nn.AvgPool2d(4),
    Lambda(lambda x: x.view(x.size(0), -1)),
)
opt = optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

fit(10, model, loss_func, opt, train_dl, val_dl)
```

**Output:**

```
0 tensor(0.5411, device='cuda:0')
1 tensor(0.3231, device='cuda:0')
2 tensor(0.2601, device='cuda:0')
3 tensor(0.2159, device='cuda:0')
4 tensor(0.2040, device='cuda:0')
5 tensor(0.1843, device='cuda:0')
6 tensor(0.1636, device='cuda:0')
7 tensor(0.1547, device='cuda:0')
8 tensor(0.1656, device='cuda:0')
9 tensor(0.1423, device='cuda:0')
```
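
As a quick sanity check on the final model (this cell is my own addition and assumes the `accuracy()` helper, `model`, and `val_dl` from earlier are still in scope):

```python
# Average classification accuracy over the validation DataLoader
model.eval()
with torch.inference_mode():
    accs = [accuracy(model(x_b), y_b) for x_b, y_b in val_dl]
print(torch.stack(accs).mean())
```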