Z2H Video 4, round 2, finished watching [Post #22, Day 45]

First, revisiting this exercise question from the previous video, based on new insights from Video 4:

E02: I was not careful with the initialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?

I was on the right track before, cleaned up answers:

(1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve?

You would get -ln(1/27) ≈ 3.30.

After our first pass through the network with our original initialization, we get an initial loss of 27.8817.
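
As a quick sanity check, here is a minimal sketch (not the actual notebook code) computing that expected uniform loss directly:

```python
import torch

# Expected cross-entropy loss if the model assigns a uniform
# probability of 1/27 to each of the 27 possible characters.
vocab_size = 27
uniform_loss = -torch.log(torch.tensor(1.0 / vocab_size))
print(uniform_loss.item())  # ~3.2958
```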

(2) Can you tune the initialization to get a starting loss that is much more similar to (1)?

We get close to this loss after one forward pass through the network by initializing the W2 weights close to zero (multiplying them by 0.01 at initialization) and setting the b2 biases to 0.
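
For reference, a minimal sketch of that output-layer initialization (the hidden size and generator seed here are my assumptions, not necessarily what's in the notebook):

```python
import torch

g = torch.Generator().manual_seed(2147483647)
n_hidden, vocab_size = 200, 27

# Scale the output layer down so the initial logits are near zero;
# the softmax is then roughly uniform and the first loss is ~3.30.
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.zeros(vocab_size)
```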

This was the first improvement, but the next lurking problem was tanh saturation: we need to keep the pre-activations feeding into the tanh neurons out of the "flat" zones where the output is pinned at -1 or 1. If a tanh output t is at -1 or 1, its local derivative (1 - t**2) is 0, so in the backward pass the gradient flowing through that neuron gets zeroed out, and if that happens for every example in the batch we have a "dead neuron" that never learns.
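
A quick way to see how much saturation is going on (a sketch, with random numbers standing in for the real pre-activations):

```python
import torch
import matplotlib.pyplot as plt

# Fake pre-activations, deliberately too large so that tanh saturates.
hpreact = torch.randn(32, 200) * 5.0
h = torch.tanh(hpreact)

# Fraction of activations sitting in the flat zones of tanh.
print((h.abs() > 0.99).float().mean().item())

# White pixels mark (example, neuron) pairs where the local gradient
# (1 - t**2) is essentially zero; a column that stays white across
# batches would be a dead neuron.
plt.imshow(h.abs() > 0.99, cmap='gray', interpolation='nearest')
plt.show()
```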


Ok, I think more is clear to me now from Video 4, here is my breakdown of the contents of the video:

  1. Smart initialization, Kaiming init, avoid hockey stick shape in plot of loss versus iteration, don't waste training iterations at the beginning squashing down large parameter numbers
  2. Batch normalization, the first kind of normalization introduced to neural nets
  3. Residual neural network (resnet) architecture (convolution layer, batch normalization layer, ReLU layer stacked in repeating blocks), commonly used for image classification
  4. A look at torch.nn layers documentation online
  5. Our code now PyTorch-ified, wrapped into layer modules (Linear, BatchNorm1d, Tanh), so we can stack layers like LEGO blocks to build neural nets (see the sketch after this list)
  6. Tracking stats, diagnostic tools to understand whether the neural net is "in a good state dynamically": histograms of the forward pass activations, the backward pass gradients, and the weights being updated by stochastic gradient descent (also looking at their means, standard deviations, and gradient-to-data ratios), plus a fourth diagnostic plot of the update-to-data ratio (a rough heuristic is that this ratio should sit around 1e-3, or -3 on a log10 scale; if it sits well above that line the learning rate is probably too big, and if it sits well below it the learning rate is probably too low)
  7. Batch normalization makes the neural net much more robust to how the weights are initialized
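
To make items 1, 5 and 6 concrete, here is a condensed sketch in the spirit of the video's PyTorch-ified code, ending with the update-to-data ratio diagnostic. The layer sizes, the single toy training step, and the random data are my own assumptions, not the real notebook:

```python
import torch
import torch.nn.functional as F

class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        # Kaiming-style init: divide by sqrt(fan_in) so activations keep
        # roughly unit variance as they flow through the layer.
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5
        self.bias = torch.zeros(fan_out) if bias else None
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out = self.out + self.bias
        return self.out
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum, self.training = eps, momentum, True
        self.gamma, self.beta = torch.ones(dim), torch.zeros(dim)  # trained
        self.running_mean, self.running_var = torch.zeros(dim), torch.ones(dim)
    def __call__(self, x):
        if self.training:
            xmean, xvar = x.mean(0, keepdim=True), x.var(0, keepdim=True)
        else:
            xmean, xvar = self.running_mean, self.running_var
        self.out = self.gamma * (x - xmean) / torch.sqrt(xvar + self.eps) + self.beta
        if self.training:  # keep running estimates for use at inference time
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out
    def parameters(self):
        return [self.gamma, self.beta]

class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    def parameters(self):
        return []

# Stack the layer modules like LEGO blocks.
layers = [Linear(30, 200), BatchNorm1d(200), Tanh(), Linear(200, 27)]
parameters = [p for layer in layers for p in layer.parameters()]
for p in parameters:
    p.requires_grad = True

# One toy training step on random data, just to show the diagnostic.
x, y = torch.randn(32, 30), torch.randint(0, 27, (32,))
for layer in layers:
    x = layer(x)
loss = F.cross_entropy(x, y)
for p in parameters:
    p.grad = None
loss.backward()

lr = 0.1
with torch.no_grad():
    for p in parameters:
        p -= lr * p.grad
    # Update-to-data ratio: around 1e-3 (-3 in log10) is a reasonable
    # ballpark; much higher suggests the learning rate is too big.
    print([((lr * p.grad).std() / p.data.std()).log10().item()
           for p in parameters if p.ndim == 2])
```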

At the end of the video, Andrej said he thinks our neural net performance is no longer bottlenecked by the optimization, but now by our context length, and we should look at more powerful architectures like recurrent neural networks and transformers.

But we are getting to the cutting edge of neural nets (as of when this video was made in 2022, anyway). The field hasn't "solved" the best way to initialize a neural net, backpropagate, or make parameter updates; there is plenty more research to be done in these areas.

My plan for tomorrow is to work on the Video 4 exercises, here's a preview:

E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn't train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.

E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc., has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W,b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference, i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! Pretty cool.
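
To preview just the folding algebra (not the full exercise, and using torch.nn modules as stand-ins for our hand-rolled layers, with made-up "trained" statistics):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A Linear -> BatchNorm1d pair, pretending it has already been trained
# (random gamma/beta/running stats stand in for trained values).
lin = nn.Linear(20, 50)
bn = nn.BatchNorm1d(50)
with torch.no_grad():
    bn.weight.uniform_(0.5, 1.5)        # gamma
    bn.bias.uniform_(-0.5, 0.5)         # beta
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
lin.eval(); bn.eval()

# Fold:  gamma * (Wx + b - mean) / sqrt(var + eps) + beta
#      = (scale * W) x + (scale * (b - mean) + beta),  scale = gamma / sqrt(var + eps)
folded = nn.Linear(20, 50)
with torch.no_grad():
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    folded.weight.copy_(lin.weight * scale[:, None])
    folded.bias.copy_((lin.bias - bn.running_mean) * scale + bn.bias)
folded.eval()

# The folded Linear reproduces the Linear + BatchNorm forward pass at inference.
x = torch.randn(8, 20)
with torch.no_grad():
    print(torch.allclose(bn(lin(x)), folded(x), atol=1e-5))  # expect True
```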
