Z2H Video 4, exercise E01 [Post #23, Day 46]
February 20, 2025 • 480 words
E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn't train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
I initialized the weights to 0 in the Linear module with self.weight = torch.zeros(fan_in, fan_out). The biases were already initialized to 0.
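For reference, the only change is in the constructor; a sketch of what my module looks like (roughly the Linear class from the video, with the zero init swapped in):

import torch

class Linear:
    # roughly the Linear module from the video, with my change:
    # the weights start at zero instead of the scaled random init
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.zeros(fan_in, fan_out)   # was torch.randn(...) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])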
I ran training and got this output:
0/ 200000: 3.2958
10000/ 200000: 2.7882
20000/ 200000: 2.7518
30000/ 200000: 2.8506
40000/ 200000: 2.8221
50000/ 200000: 2.8987
60000/ 200000: 2.6038
70000/ 200000: 2.6751
80000/ 200000: 3.0131
90000/ 200000: 2.7753
100000/ 200000: 2.6079
110000/ 200000: 2.5273
120000/ 200000: 2.6690
130000/ 200000: 2.6426
140000/ 200000: 2.8423
150000/ 200000: 2.5484
160000/ 200000: 2.9788
170000/ 200000: 2.9535
180000/ 200000: 2.8622
190000/ 200000: 2.8513
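One thing I notice in this log: the very first loss, 3.2958, is exactly what a uniform guess over the 27 characters gives, which makes sense if the logits start out at zero. A quick check:

import torch
# with the weights and biases at zero, the logits start out at zero, so the
# model predicts a uniform distribution over the 27 characters
print(-torch.log(torch.tensor(1 / 27)))  # tensor(3.2958)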
My MacBook was cranking for this one: I heard the fan come on, and training took about four minutes.
So it looks to me like the neural net isn't really training: the loss values bounce around quite a bit rather than decreasing steadily. The sample names the model produces are not good either:
narmahxaae.
hlrihkimrs.
reaty.
hnaassnejr.
hnenfamesahc.
iaeei.
.
elmaia.
ceaiiv.
e.
lein.
h.
.
m.
.
oin.
eeijn.
s.
lilea.
.
After full training, the histogram plots each show just a single sharp spike, with all the layers perfectly overlapping, and the update-to-data ratio plot looks like a single horizontal line.
Let me see what these look like after the first epoch.
Ok, the diagnostic plots look the same.
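A further check I could add: print the per-parameter gradient norms after one backward pass, to see which tensors are getting any gradient at all. A minimal sketch, assuming the parameters list and training loop from the video code:

# run right after loss.backward() in the training loop;
# `parameters` is the flat list of parameter tensors gathered from the layers
for i, p in enumerate(parameters):
    gnorm = 0.0 if p.grad is None else p.grad.norm().item()
    print(f'param {i:2d} | shape {str(tuple(p.shape)):>12s} | grad norm {gnorm:.3e}')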
E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc., has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W, b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference, i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! Pretty cool.
I don't understand what "folded into" means.
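To try to make sense of it, here is a toy check of the algebra I think the exercise is pointing at: at inference time a batchnorm is just a per-unit scale and shift, so it can be absorbed into the preceding Linear layer's W and b. A sketch with random tensors (my own construction, not from the video):

import torch

torch.manual_seed(42)
fan_in, fan_out, batch = 10, 20, 32
x = torch.randn(batch, fan_in)
W = torch.randn(fan_in, fan_out)

# stand-ins for the batchnorm's gamma/beta and running statistics at inference
gamma = torch.randn(fan_out)
beta = torch.randn(fan_out)
mean = torch.randn(fan_out)
var = torch.rand(fan_out) + 0.5
eps = 1e-5

# inference-time forward pass: Linear followed by batchnorm
y_bn = gamma * ((x @ W) - mean) / torch.sqrt(var + eps) + beta

# "fold" the batchnorm into the linear layer
scale = gamma / torch.sqrt(var + eps)   # one scale factor per output unit
W2 = W * scale                          # rescales each column of W
b2 = beta - mean * scale

y_folded = x @ W2 + b2
print(torch.allclose(y_bn, y_folded, atol=1e-6))  # True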
I set up a small 3-layer MLP with batchnorms, like this:
layers = [
    Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), # Tanh(),
    Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), # Tanh(),
    Linear(n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]
I trained the network for 1000 iterations; the loss at that point is 3.3156.
I am unsure how to complete the rest of the question.
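For when I come back to this: following the same algebra as the toy check above, I think the fold for one Linear + BatchNorm1d pair would look roughly like this, assuming the attribute names from the video's classes (weight, gamma, beta, running_mean, running_var, eps). I have not verified this against my trained network:

import torch

def fold_bn_into_linear(linear, bn):
    # sketch: linear.weight has shape (fan_in, fan_out); the batchnorm's
    # gamma, beta, running_mean, running_var all have shape (fan_out,)
    scale = bn.gamma / torch.sqrt(bn.running_var + bn.eps)
    W2 = linear.weight * scale              # rescale each output column
    b2 = bn.beta - bn.running_mean * scale
    return W2, b2

# e.g. for the first pair in the `layers` list above:
# W2, b2 = fold_bn_into_linear(layers[0], layers[1])
# at inference, x @ W2 + b2 should match layers[1](layers[0](x)) with the
# batchnorm in eval mode (i.e. using its running statistics)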