Sidenote: Z2H chapter progression
January 28, 2025
I am making this note in an effort to keep everything organized in my mind. These are the chapters from the notes section of each video.
Video 1 – The spelled-out intro to neural networks and backpropagation: building micrograd
- 00:00:00 intro
- 00:00:25 micrograd overview
- 00:08:08 derivative of a simple function with one input
- 00:14:12 derivative of a function with multiple inputs
- 00:19:09 starting the core Value object of micrograd and its visualization
- 00:32:10 manual backpropagation example #1: simple expression
- 00:51:10 preview of a single optimization step
- 00:52:52 manual backpropagation example #2: a neuron
- 01:09:02 implementing the backward function for each operation
- 01:17:32 implementing the backward function for a whole expression graph
- 01:22:28 fixing a backprop bug when one node is used multiple times
- 01:27:05 breaking up a tanh, exercising with more operations
- 01:39:31 doing the same thing but in PyTorch: comparison
- 01:43:55 building out a neural net library (multi-layer perceptron) in micrograd
- 01:51:04 creating a tiny dataset, writing the loss function
- 01:57:56 collecting all of the parameters of the neural net
- 02:01:12 doing gradient descent optimization manually, training the network
- 02:14:03 summary of what we learned, how to go towards modern neural nets
- 02:16:46 walkthrough of the full code of micrograd on github
- 02:21:10 real stuff: diving into PyTorch, finding their backward pass for tanh
- 02:24:39 conclusion
- 02:25:20 outtakes :)
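
To anchor the Video 1 chapters in code, here is a minimal sketch of the kind of scalar autograd object the video builds. This is my own shorthand, not the actual micrograd source: a `Value` that stores data and a gradient, remembers its children, and backpropagates through `+`, `*`, and `tanh` in topological order (including the gradient accumulation that fixes the bug when a node is used multiple times).

```python
import math

class Value:
    """Tiny scalar autograd node, in the spirit of micrograd (sketch, not the real library)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # how to route this node's gradient to its children
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1; += accumulates across reuse
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad   # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # build a topological order of the expression graph, then apply the chain rule
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# usage: a single neuron-like expression, in the style of the manual backprop examples
x1, w1 = Value(2.0), Value(-3.0)
x2, w2 = Value(0.0), Value(1.0)
b = Value(6.88)
o = (x1 * w1 + x2 * w2 + b).tanh()
o.backward()
print(o.data, x1.grad, w1.grad)
```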
Video 2 – The spelled-out intro to language modeling: building makemore
- 00:00:00 intro
- 00:03:03 reading and exploring the dataset
- 00:06:24 exploring the bigrams in the dataset
- 00:09:24 counting bigrams in a python dictionary
- 00:12:45 counting bigrams in a 2D torch tensor ("training the model")
- 00:18:19 visualizing the bigram tensor
- 00:20:54 deleting spurious (S) and (E) tokens in favor of a single . token
- 00:24:02 sampling from the model
- 00:36:17 efficiency! vectorized normalization of the rows, tensor broadcasting
- 00:50:14 loss function (the negative log likelihood of the data under our model)
- 01:00:50 model smoothing with fake counts
- 01:02:57 PART 2: the neural network approach: intro
- 01:05:26 creating the bigram dataset for the neural net
- 01:10:01 feeding integers into neural nets? one-hot encodings
- 01:13:53 the "neural net": one linear layer of neurons implemented with matrix multiplication
- 01:18:46 transforming neural net outputs into probabilities: the softmax
- 01:26:17 summary, preview to next steps, reference to micrograd
- 01:35:49 vectorized loss
- 01:38:36 backward and update, in PyTorch
- 01:42:55 putting everything together
- 01:47:49 note 1: one-hot encoding really just selects a row of the next Linear layer's weight matrix
- 01:50:18 note 2: model smoothing as regularization loss
- 01:54:31 sampling from the neural net
- 01:56:16 conclusion
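
For the Video 2 chapters, here is a compressed sketch of the count-based bigram model from the first half: counting into a 2D tensor, row normalization via broadcasting with +1 "fake count" smoothing, the average negative log likelihood, and sampling. The variable names loosely follow the video's conventions (`N`, `P`, `stoi`, `itos`), but the three-word list is just a stand-in for the real names dataset.

```python
import torch

# toy stand-in for the names dataset
words = ["emma", "olivia", "ava"]

# character vocabulary, with '.' as the single start/end token
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}
V = len(stoi)

# "training": N[i, j] counts how often character j follows character i
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# rows -> probabilities; the +1 is model smoothing with fake counts
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)   # broadcasting: (V, V) / (V, 1)

# loss: average negative log likelihood of the data under the model
log_likelihood, n = 0.0, 0
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[ch1], stoi[ch2]])
        n += 1
print("average nll:", (-log_likelihood / n).item())

# sample one "name" from the model
g = torch.Generator().manual_seed(2147483647)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, replacement=True, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))
```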
Video 3 – Building makemore Part 2: MLP
- 00:00:00 intro
- 00:01:48 Bengio et al. 2003 (MLP language model) paper walkthrough
- 00:09:03 (re-)building our training dataset
- 00:12:19 implementing the embedding lookup table
- 00:18:35 implementing the hidden layer + internals of torch.Tensor: storage, views
- 00:29:15 implementing the output layer
- 00:29:53 implementing the negative log likelihood loss
- 00:32:17 summary of the full network
- 00:32:49 introducing F.cross_entropy and why
- 00:37:56 implementing the training loop, overfitting one batch
- 00:41:25 training on the full dataset, minibatches
- 00:45:40 finding a good initial learning rate
- 00:53:20 splitting up the dataset into train/val/test splits and why
- 01:00:49 experiment: larger hidden layer
- 01:05:27 visualizing the character embeddings
- 01:07:16 experiment: larger embedding size
- 01:11:46 summary of our final code, conclusion
- 01:13:24 sampling from the model
- 01:14:55 Google Colab (new!!) notebook advertisement
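
And for the Video 3 chapters, a rough sketch of the MLP pipeline they walk through: embedding lookup table, tanh hidden layer, output logits, `F.cross_entropy`, and a minibatch gradient descent loop. The shapes are assumptions roughly matching the video (context of 3 characters, 2-dimensional embeddings, 100 hidden units), and the data here is random placeholder tensors rather than the real character dataset.

```python
import torch
import torch.nn.functional as F

# assumed sizes: 27 characters, 3-character context, 2-D embeddings, 100 hidden units
V, block_size, emb_dim, hidden = 27, 3, 2, 100

g = torch.Generator().manual_seed(42)
C  = torch.randn((V, emb_dim), generator=g)                   # embedding lookup table
W1 = torch.randn((block_size * emb_dim, hidden), generator=g) # hidden layer
b1 = torch.randn(hidden, generator=g)
W2 = torch.randn((hidden, V), generator=g)                    # output layer
b2 = torch.randn(V, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

# random placeholder (context, next-character) pairs standing in for the real dataset
X = torch.randint(0, V, (1000, block_size), generator=g)
Y = torch.randint(0, V, (1000,), generator=g)

for step in range(200):
    # minibatch of 32 examples
    ix = torch.randint(0, X.shape[0], (32,), generator=g)
    # forward pass: embed, flatten with .view, tanh hidden layer, output logits
    emb = C[X[ix]]                                    # (32, block_size, emb_dim)
    h = torch.tanh(emb.view(-1, block_size * emb_dim) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Y[ix])             # softmax + negative log likelihood in one call
    # backward pass and manual gradient descent update
    for p in parameters:
        p.grad = None
    loss.backward()
    lr = 0.1
    for p in parameters:
        p.data += -lr * p.grad

print(loss.item())
```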