Z2H Video 2, building makemore [Post #8, Day 10]
January 15, 2025 • 1,311 words
Yesterday I started watching Video 2 in Andrej Karpathy's Zero to Hero series, titled The spelled-out intro to language modeling: building makemore. makemore is another library created by Andrej that makes more of the things that you give it. So far we have been building a character-level language model, that is, a model that predicts the next character in a sequence of text, one character at a time. We are building a bigram language model, which means we are working with two characters at a time: the one we have as input, and the next one, which we are trying to predict. We are using a data set of 32,033 names and are generating new names based on the structure of all the existing names. We are using PyTorch.
From memory (and looking at my JupyterLab notebook), this is what we have done so far (at 00:39:21 into the video):
- Did some preliminary analysis on the data set, length (32,033 words), min word length (2 letters), max word length (15 letters), found how many occurrences of each bigram (e.g. ar, le, en, la), also added characters (special tokens) to denote start and end of word
- Imported PyTorch and created an empty 27 by 27 PyTorch tensor (2-dimensional array) filled with zeros
- Created a mapping of the 26 letters of the alphabet to integers (a to 1, b to 2, etc.) called `stoi` (string to integer), and added one item, '.' mapped to 0, to denote start or end of word
- Filled the PyTorch tensor with counts for all possible bigrams
- Created visualization of tensor showing all bigrams and for each bigram – number of times (counts) that bigram appears in the data set
- For the top row of the tensor (.., .a, .b, .c, etc.), computed the probability of each bigram occurring by dividing the number of occurrences of that bigram by the sum of occurrences of all bigrams in the row. In other words, the probability that . comes after . (which is zero, because there are zero occurrences of this in the data set), that a comes after . (i.e. .a), that b comes after . (i.e. .b), etc.
- Used `torch.multinomial`, which takes probabilities as input and spits out integers based on those probabilities, as many integers as you specify (`num_samples`). So if the probability of 0 is 60%, the probability of 1 is 30%, and the probability of 2 is 10%, and you ask for 10 samples, you could get something like [0, 0, 2, 1, 1, 0, 1, 0, 0, 0], which we then convert back to letters using our `itos` (integer to string) mapping
- We then used this methodology to generate new names: the first letter comes from a lookup into our probabilities table (our 27 x 27 tensor), and each subsequent letter is sampled from the row of the letter before it (see the sketch just after this list)
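To cement this in my head, here is a minimal sketch of the counting version from memory (I'm assuming the names live in a names.txt file with one name per line, which I believe is how the video's data set is laid out):

```python
import torch

# Read the names; assumed file layout: one name per line.
words = open("names.txt", "r").read().splitlines()

# Character vocabulary: '.' at index 0 marks both start and end of a word.
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}

# Fill a 27 x 27 tensor with bigram counts.
N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# Normalize each row into a probability distribution.
P = N.float()
P = P / P.sum(1, keepdim=True)

# Sample new names: start at '.', repeatedly sample the next character
# from the current character's row until we hit '.' again.
g = torch.Generator().manual_seed(2147483647)
for _ in range(5):
    out = []
    ix = 0
    while True:
        ix = torch.multinomial(P[ix], num_samples=1, replacement=True, generator=g).item()
        if ix == 0:
            break
        out.append(itos[ix])
    print("".join(out))
```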
My question at the start of today was: is there any neural network involved in this? Andrej said we "trained" a bigram language model, but we trained just by counting frequencies of letter pairings in our data set. So no, this example does not include a neural network. Our probability tensor `P` was really where we kept the parameters of our bigram language model; it was our "training". Then we ran our model by iteratively sampling the next character and creating names.
Andrej said to read the Broadcasting semantics page of the PyTorch documentation, and to treat it with respect: it's not something to play fast and loose with. Really respect it, watch some tutorials about it, and be careful with it, because you can quickly run into bugs.
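The concrete place where this matters in our example is normalizing the counts tensor row by row: whether or not you pass keepdim=True to sum silently changes how the division broadcasts. A quick sketch with random counts just to illustrate:

```python
import torch

N = torch.randint(1, 10, (27, 27)).float()  # stand-in counts, just for illustration

# With keepdim=True the row sums have shape (27, 1), which broadcasts across
# columns, so each row is divided by its own total and sums to 1.
P_good = N / N.sum(1, keepdim=True)

# Without keepdim the row sums have shape (27,), which broadcasts as a row,
# so each element gets divided by the wrong row's total. No error is raised.
P_bad = N / N.sum(1)

print(P_good[0].sum())  # ~1.0
print(P_bad[0].sum())   # generally not 1.0
```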
In-place operations are faster: instead of `a = a + 1`, make it `a += 1`, which (for a PyTorch tensor) doesn't create new memory under the hood.
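A tiny check of what "doesn't create new memory" means for a tensor (data_ptr() tells you where the underlying storage lives):

```python
import torch

a = torch.zeros(1000)
ptr = a.data_ptr()
a = a + 1                    # builds a brand-new tensor
print(a.data_ptr() == ptr)   # False: new memory was allocated

a = torch.zeros(1000)
ptr = a.data_ptr()
a += 1                       # in-place: reuses the same storage
print(a.data_ptr() == ptr)   # True
```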
The second half of Video 2 is modeling this bigram example as a neural net now!
Data type, i.e. `dtype`, is important to check in Python. You can do `x.dtype` and get back e.g. `float32`, `int64`, etc.
I have finished going through Video 2. We indeed built a simple neural network, I think with 27 input neurons and 27 output neurons (a single linear layer), but I need to confirm. We built the same tensor as in the bigram model, but this time, instead of filling it with the counts computed from the data set (e.g. a count being how many times r comes after a), we initialized it with random weights. These were then iterated on using our gradient descent process to arrive at the same tensor that was computed in the bigram model example! So we got the same results: the same loss and the same five example names (using a seeded random generator). The key is that it is now easier to complexify our neural network to get better and better results, whereas if we wanted to complexify our bigram model architecture we would have to keep making that counts tensor bigger and bigger (instead of just bigrams, putting in trigrams and 'higher-grams') to the point of computational inefficiency.
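For my own reference, here is a rough sketch of that second half from memory (reusing the same names.txt assumption; the learning rate and number of iterations are ballpark, not necessarily what the video used):

```python
import torch
import torch.nn.functional as F

# Same setup as the counting sketch: read names, build the stoi mapping.
words = open("names.txt", "r").read().splitlines()
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0

# Training pairs: each bigram becomes (input character index, next character index).
xs, ys = [], []
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs, ys = torch.tensor(xs), torch.tensor(ys)
num = xs.nelement()

# A single 27 x 27 layer of weights, initialized randomly instead of with counts.
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

for _ in range(100):
    # Forward pass: one-hot encode inputs, multiply by W to get logits,
    # exponentiate and normalize (softmax) to get probabilities.
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)

    # Negative log likelihood of the correct next characters,
    # plus a small regularization term on the weights.
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01 * (W ** 2).mean()

    # Backward pass and gradient descent update.
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad

print(loss.item())
```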
I learned that regularization plays the same role as adding smoothing in our bigram model example, like replacing zeroes in the counts tensor with ones, so that a bigram like "jq" at least has some probability and doesn't cause our loss to blow up (or something like that).
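If I've got the parallel right, the two tricks line up something like this (the +1 and the 0.01 strength are just illustrative values):

```python
import torch

N = torch.randint(0, 10, (27, 27)).float()  # stand-in counts

# Counting model: add fake counts so no bigram (e.g. "jq") has zero probability,
# which would otherwise make -log(0) blow up to infinity.
P_smoothed = (N + 1) / (N + 1).sum(1, keepdim=True)

# Neural-net model: add an L2 penalty on the weights to the loss; pulling W
# toward zero flattens the logits, which smooths the output probabilities
# in a similar way.
W = torch.randn((27, 27))
reg_term = 0.01 * (W ** 2).mean()
```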
We used negative log likelihood for our loss quantification because this is a classification problem, compared to using mean squared error in Video 1 since it was a regression problem.
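To pin the computation down for myself: the loss is the average of -log(p) over the probabilities the model assigned to the correct next characters (the numbers below are made up):

```python
import torch

# Probabilities the model assigned to the *correct* next character
# for a few example bigrams (made-up values, just to show the computation).
probs = torch.tensor([0.04, 0.20, 0.01, 0.35])

# The log likelihood is the log of the product of these probabilities,
# i.e. the sum of their logs; negate and average so that lower is better.
nll = -probs.log().mean()
print(nll)  # 0.0 would mean every correct character got probability 1
```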
Things I still need to iron out:
- What a "logit" is
- Again, want to go through the manual backpropagation with computational graph building one more time, my understanding is PyTorch builds those computational graphs for forward pass and backpropagation under the hood
- Make sure I'm solid on negative log likelihood, how to compute and its meaning
And here are some exercises Andrej listed that I can work on after completing Video 2:
- E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?
- E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
- E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
- E04: we saw that our 1-hot vectors merely select a row of `W`, so producing these vectors explicitly feels wasteful. Can you delete our use of `F.one_hot` in favor of simply indexing into rows of `W`?
- E05: look up and use `F.cross_entropy` instead. You should achieve the same result. Can you think of why we'd prefer to use `F.cross_entropy` instead?
- E06: meta-exercise! Think of a fun/interesting exercise and complete it.
My plan is to keep chugging along with the videos. I think they are great and Andrej is a great teacher. I plan to continue circling back on previously learned concepts to solidify them in my mind. I'm also enjoying keeping up with the Eureka Labs Discord channel. It looks like someone recommended getting a few more videos in before starting the Video 2 exercises, since the concepts they need get covered later. It would be cool if I could figure them out on my own first! But I think I will continue ahead with the videos.