**BatchNorm**

Normalizes each feature to zero mean and unit variance ($\mu = 0$, $\sigma = 1$) across the batch:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{(batch mean)}, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \quad \text{(batch variance)}$$

**LayerNorm**

Normalizes across all features within each sample independently:

$$\hat{x} = \frac{x - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

$$\mu_L = \frac{1}{H}\sum_{i=1}^{H} x_i \quad \text{(mean over features)}, \qquad \sigma_L^2 = \frac{1}{H}\sum_{i=1}^{H} (x_i - \mu_L)^2 \quad \text{(variance over features)}$$
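The only difference between the two is the axis the statistics are computed over: BatchNorm averages over the batch dimension (one mean/variance per feature), LayerNorm over the feature dimension (one mean/variance per sample). A minimal NumPy sketch of both formulas (function names and the `(batch, features)` input shape are illustrative assumptions, not from the source):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Statistics over axis 0 (the batch): mu_B and sigma_B^2 per feature.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics over axis -1 (the H features): mu_L and sigma_L^2 per sample.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # 4 samples, 8 features
gamma, beta = np.ones(8), np.zeros(8)

bn = batch_norm(x, gamma, beta)   # each feature column: ~zero mean, ~unit variance
ln = layer_norm(x, gamma, beta)   # each sample row:    ~zero mean, ~unit variance
```

With $\gamma = 1$ and $\beta = 0$ the output is purely the normalized $\hat{x}$; in training, $\gamma$ and $\beta$ are learned so the layer can recover any affine rescaling of its input.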