Z2H Video 4, exercise E01 [Post #23, Day 46]
February 20, 2025 • 480 words
E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn't train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
I initialized the weights to 0 in the Linear module with self.weight = torch.zeros(fan_in, fan_out). The biases were already initialized to 0.
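For reference, the only change is in the constructor; a sketch of what my module looks like (roughly the Linear class from the video, with the zero init swapped in):

import torch

class Linear:
    # roughly the Linear module from the video, with my change:
    # the weights start at zero instead of the scaled random init
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.zeros(fan_in, fan_out)   # was torch.randn(...) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])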
I ran training and got this output:
0/ 200000: 3.2958
10000/ 200000: 2.7882
20000/ 200000: 2.7518
30000/ 200000: 2.8506
40000/ 200000: 2.8221
50000/ 200000: 2.8987
60000/ 200000: 2.6038
70000/ 200000: 2.6751
80000/ 200000: 3.0131
90000/ 200000: 2.7753
100000/ 200000: 2.6079
110000/ 200000: 2.5273
120000/ 200000: 2.6690
130000/ 200000: 2.6426
140000/ 200000: 2.8423
150000/ 200000: 2.5484
160000/ 200000: 2.9788
170000/ 200000: 2.9535
180000/ 200000: 2.8622
190000/ 200000: 2.8513
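One thing I notice in this log: the very first loss, 3.2958, is exactly what a uniform guess over the 27 characters gives, which makes sense if the logits start out at zero. A quick check:

import torch
# with the weights and biases at zero, the logits start out at zero, so the
# model predicts a uniform distribution over the 27 characters
print(-torch.log(torch.tensor(1 / 27)))  # tensor(3.2958)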
My MacBook was cranking for this one: I heard the fan come on, and training took about four minutes.
So it looks to me like the neural net isn't really training: the loss values bounce around quite a bit rather than decreasing steadily. The sample names the model produces are not good either:
narmahxaae.
hlrihkimrs.
reaty.
hnaassnejr.
hnenfamesahc.
iaeei.
.
elmaia.
ceaiiv.
e.
lein.
h.
.
m.
.
oin.
eeijn.
s.
lilea.
.
After full training, the histogram plots each show just a single sharp spike, with all the layers perfectly overlapping, and the update-to-data ratio plot looks like a single horizontal line.
Let me see what these look like after the first epoch.
Ok, the diagnostic plots look the same.
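A further check I could add: print the per-parameter gradient norms after one backward pass, to see which tensors are getting any gradient at all. A minimal sketch, assuming the parameters list and training loop from the video code:

# run right after loss.backward() in the training loop;
# `parameters` is the flat list of parameter tensors gathered from the layers
for i, p in enumerate(parameters):
    gnorm = 0.0 if p.grad is None else p.grad.norm().item()
    print(f'param {i:2d} | shape {str(tuple(p.shape)):>12s} | grad norm {gnorm:.3e}')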
E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc., has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W, b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference, i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! Pretty cool.
I don't understand what "folded into" means.
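To try to make sense of it, here is a toy check of the algebra I think the exercise is pointing at: at inference time a batchnorm is just a per-unit scale and shift, so it can be absorbed into the preceding Linear layer's W and b. A sketch with random tensors (my own construction, not from the video):

import torch

torch.manual_seed(42)
fan_in, fan_out, batch = 10, 20, 32
x = torch.randn(batch, fan_in)
W = torch.randn(fan_in, fan_out)

# stand-ins for the batchnorm's gamma/beta and running statistics at inference
gamma = torch.randn(fan_out)
beta = torch.randn(fan_out)
mean = torch.randn(fan_out)
var = torch.rand(fan_out) + 0.5
eps = 1e-5

# inference-time forward pass: Linear followed by batchnorm
y_bn = gamma * ((x @ W) - mean) / torch.sqrt(var + eps) + beta

# "fold" the batchnorm into the linear layer
scale = gamma / torch.sqrt(var + eps)   # one scale factor per output unit
W2 = W * scale                          # rescales each column of W
b2 = beta - mean * scale

y_folded = x @ W2 + b2
print(torch.allclose(y_bn, y_folded, atol=1e-6))  # True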
I set up a small 3-layer MLP with batchnorms, like this:
layers = [
    Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), # Tanh(),
    Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), # Tanh(),
    Linear(n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]
I trained the network for 1000 iterations; the loss at that point is 3.3156.
I am unsure how to complete the rest of the question.
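For when I come back to this: following the same algebra as the toy check above, I think the fold for one Linear + BatchNorm1d pair would look roughly like this, assuming the attribute names from the video's classes (weight, gamma, beta, running_mean, running_var, eps). I have not verified this against my trained network:

import torch

def fold_bn_into_linear(linear, bn):
    # sketch: linear.weight has shape (fan_in, fan_out); the batchnorm's
    # gamma, beta, running_mean, running_var all have shape (fan_out,)
    scale = bn.gamma / torch.sqrt(bn.running_var + bn.eps)
    W2 = linear.weight * scale              # rescale each output column
    b2 = bn.beta - bn.running_mean * scale
    return W2, b2

# e.g. for the first pair in the `layers` list above:
# W2, b2 = fold_bn_into_linear(layers[0], layers[1])
# at inference, x @ W2 + b2 should match layers[1](layers[0](x)) with the
# batchnorm in eval mode (i.e. using its running statistics)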