Z2H Video 2, exercises complete [Post #17, Day 36]
February 10, 2025 • 1,301 words
Coming back to my work after a nice weekend visiting family near Kingston, New York.
E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?
I trained a trigram language model using the counting method and evaluated the loss. The loss for the bigram model was 2.45; the loss for my trigram model is 2.21, so the trigram model improved over the bigram model. There are some further improvements I could make to my trigram model too: it currently contains trigrams that are not actually possible, like '...', '.a.', '.b.', etc., since there are no empty or one-letter names in the data set. I could remove these from my trigram model to perhaps improve it further.
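For reference, here is a minimal sketch of the counting approach, assuming the names.txt file from the video and a 27-character vocabulary ('.' plus a–z). The variable names and the +1 smoothing are my own choices, not necessarily exactly what I had in my notebook:

```python
import torch

words = open('names.txt', 'r').read().splitlines()

# character vocabulary: '.' as the start/end token plus a-z
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

# count every trigram (ch1, ch2 -> ch3) in the data set
N = torch.zeros((27, 27, 27), dtype=torch.int32)
for w in words:
    cs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(cs, cs[1:], cs[2:]):
        N[stoi[ch1], stoi[ch2], stoi[ch3]] += 1

# normalize counts into probabilities (with +1 smoothing)
P = (N + 1).float()
P /= P.sum(dim=2, keepdim=True)

# average negative log likelihood over the whole data set
log_likelihood = 0.0
n = 0
for w in words:
    cs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(cs, cs[1:], cs[2:]):
        log_likelihood += torch.log(P[stoi[ch1], stoi[ch2], stoi[ch3]]).item()
        n += 1
print(f'loss = {-log_likelihood / n:.4f}')
```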
I have now also trained a neural net model. The loss I get after several hundred rounds of training is 2.24, so slightly above (i.e. worse than) the trigram model built with the counting method. There are still some tweaks I can make: adjust the learning rate and the regularization term, run more iterations, etc. When I sample from my neural net model, the words (i.e. names) are a little strange; some seem ok, but others are not really words, so perhaps there is a bug somewhere in my code. Here are some of the names output by my neural net model (a sketch of the training setup follows the list):
- Da
- Jace
- Ari
- Emy
- Horron
- Criah
- Ellonw
- Jick (funny)
- Ron
- Ely
- Mafiorxyvi (weird and unwordlike)
- Zana
- Leilie
- Sudulaylphadt (weird and unwordlike)
- Iob
- Yukamei
- Mibduwon
- Wellie
- Lucla
- Abillypkzrcenazamarripb (what happened for this one??)
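As promised above, here is roughly the shape of the neural net version. This is a reconstruction rather than a copy-paste of my notebook: the two-character context encoded as a concatenated pair of one-hot vectors (54 inputs), the learning rate of 50, the 0.01 regularization strength, and the 300 iterations are all assumptions for the sketch. It reuses `words` and `stoi` from the counting sketch above.

```python
import torch
import torch.nn.functional as F

# build the trigram training examples: context (ch1, ch2) -> target ch3
xs, ys = [], []
for w in words:  # 'words' and 'stoi' as in the counting sketch above
    cs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(cs, cs[1:], cs[2:]):
        xs.append((stoi[ch1], stoi[ch2]))
        ys.append(stoi[ch3])
xs = torch.tensor(xs)  # shape (num_examples, 2)
ys = torch.tensor(ys)  # shape (num_examples,)

g = torch.Generator().manual_seed(2147483647)  # placeholder seed
W = torch.randn((54, 27), generator=g, requires_grad=True)  # 2 x 27 inputs -> 27 outputs

for k in range(300):
    # forward pass: two one-hot context vectors concatenated into a 54-dim input
    xenc = F.one_hot(xs, num_classes=27).float().view(-1, 54)
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean() + 0.01 * (W**2).mean()

    # backward pass and parameter update
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad

print(loss.item())
```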
Ok, another interesting thing: the names above were from a random name sampling without seeding the PyTorch Generator. Now when I seed it with the seed I've been using for everything else, I get this list of 20 names:
- Cexzdfzjglkuriana
- Kaydemmilistona
- Noluwan
- Ka
- Da
- Samiyah
- Javer
- Gotai
- Moriellavojkwu
- Eda
- Kaley
- Maside
- En
- Aviony
- Fobspehlynne
- Vtahlas
- Kashrxdleenlen
- Al
- Isan
- Jaridynne
Interestingly, this name list is similar to the names output by my counting-based trigram model, but the two lists are not identical, whereas I think they were identical for the bigram name sampling we did along with Andrej's video. I wonder if this is an indication of a bug in my code.
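For completeness, the sampling loop looks more or less like this; the seed shown is just a placeholder for whichever seed you have been using, and `itos` is the inverse of `stoi` from the counting sketch:

```python
# sample 20 names from the trained trigram net with a seeded generator
g = torch.Generator().manual_seed(2147483647)  # placeholder seed

with torch.no_grad():
    for _ in range(20):
        out = []
        ix1, ix2 = 0, 0  # start the context with the '.' token
        while True:
            xenc = F.one_hot(torch.tensor([[ix1, ix2]]), num_classes=27).float().view(1, 54)
            logits = xenc @ W
            p = logits.exp()
            p = p / p.sum(1, keepdim=True)
            ix3 = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
            if ix3 == 0:  # sampled the end token, so the name is finished
                break
            out.append(itos[ix3])
            ix1, ix2 = ix2, ix3
        print(''.join(out))
```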
E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
I have split up the data set randomly into an 80% training (train) set, a 10% development or validation (dev) set, and a 10% testing (test) set. I used Claude to help me write a function that divides up the data set. During this process I discovered that there are 2,539 words in the names data set that appear twice, so I removed all the repeats (I'm not sure of the exact implications of this for the training of the neural net).
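The split function ended up being something like the sketch below. This is a reconstruction rather than the exact Claude-assisted code; the function name and the seed of 42 are just values I'm using for the example:

```python
import random

def split_dataset(words, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle the de-duplicated words and split them 80/10/10."""
    words = list(dict.fromkeys(words))  # drop repeats, keep first occurrences
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    rng.shuffle(words)
    n1 = int(train_frac * len(words))
    n2 = int((train_frac + dev_frac) * len(words))
    return words[:n1], words[n1:n2], words[n2:]

train_words, dev_words, test_words = split_dataset(words)
print(len(train_words), len(dev_words), len(test_words))
```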
I trained my trigram model on the training set only. During training I monitored the loss on the training set and on the dev set; they stayed approximately equal throughout training, which I believe is a good sign. After training, I computed the loss on the test set with a single forward pass. After 250 training iterations I had a train loss of 2.29, a dev loss of 2.31, and a test loss of 2.31. I think these numbers do not raise any alarms about my neural network or its training.
I also found that the loss is lower when using the data set with no repeats. I ran 100 training iterations on my bigram model using the original data set and got a loss of 2.4901, then re-initialized the neural net, ran 100 training iterations using the data set with the repeats removed, and got a loss of 2.4896. Not a whole lot different, but slightly improved.
I then trained my bigram neural net model on the training set only. I got the following loss values after 100 passes through the network: train loss 2.4889, dev loss 2.4937, test loss 2.4928. These all seem ok; they are all similar, as was the case for the trigram neural net loss results.
E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
For now I have just done this by manual trial and error. My training forward pass does not include a regularization term in the loss computation, but my dev forward pass does. So far I have tried 0.1, 0.01, and 0.001 for the regularization strength. With a strength of 0.1, the train loss is less than the dev loss on each pass through the neural net; with a strength of 0.001, the train loss is greater than the dev loss on each iteration. I have only tested this on the first 10 passes of the neural net. There is surely a more effective method for tuning the regularization strength, but this is my first caveman attempt (a slightly more systematic sketch is below).
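A slightly less caveman approach would be to sweep a handful of candidate strengths and compare train vs. dev loss after the same number of passes. The sketch below assumes trigram example tensors `xs_train`/`ys_train`/`xs_dev`/`ys_dev` built from the splits above, puts the regularization term in the training loss (which I believe is the usual placement), and uses candidate values, iteration count, and learning rate that are just placeholders:

```python
import torch
import torch.nn.functional as F

def forward_loss(W, xs, ys, reg_strength=0.0):
    """Mean negative log likelihood for the trigram net, plus an optional regularization term."""
    xenc = F.one_hot(xs, num_classes=27).float().view(-1, 54)
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    nll = -probs[torch.arange(len(ys)), ys].log().mean()
    return nll + reg_strength * (W**2).mean()

# sweep a few candidate strengths and compare train vs. dev loss
for reg_strength in [1.0, 0.1, 0.01, 0.001]:
    g = torch.Generator().manual_seed(2147483647)
    W = torch.randn((54, 27), generator=g, requires_grad=True)
    for _ in range(100):
        loss = forward_loss(W, xs_train, ys_train, reg_strength)
        W.grad = None
        loss.backward()
        W.data += -50 * W.grad
    with torch.no_grad():
        train_loss = forward_loss(W, xs_train, ys_train).item()
        dev_loss = forward_loss(W, xs_dev, ys_dev).item()
    print(f'{reg_strength=}  {train_loss=:.4f}  {dev_loss=:.4f}')
```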
I did a training run of 100 passes with the regularization strength equal to 0.001. Train loss is 2.38, dev loss is 2.40, and test loss is 2.41. Maybe I was only supposed to apply that updated regularization strength for the test set forward pass loss calculation?
E04: we saw that our 1-hot vectors merely select a row of `W`, so producing these vectors explicitly feels wasteful. Can you delete our use of `F.one_hot` in favor of simply indexing into rows of `W`?
I used Claude to help me with this one. It was a very simple adjustment: instead of the one-hot method, I can index into the `W` matrix to compute the logits using `W[xs]`. Very simple, and it sped up the calculations considerably.
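For the bigram case, the equivalence looks like the sketch below: a one-hot row vector multiplied by `W` just selects one row of `W`, so indexing gives the same logits without materializing the one-hot matrix. The example indices are arbitrary:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)  # bigram weight matrix
xs = torch.tensor([0, 5, 13, 13, 1])    # example input indices ('.emma')

# original approach: build explicit one-hot vectors, then matrix-multiply
xenc = F.one_hot(xs, num_classes=27).float()
logits_onehot = xenc @ W

# faster: index directly into the rows of W
logits_indexed = W[xs]

print(torch.allclose(logits_onehot, logits_indexed))  # True
```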
E05: look up and use `F.cross_entropy` instead. You should achieve the same result. Can you think of why we'd prefer to use `F.cross_entropy` instead?
Done! I went to the PyTorch documentation and looked up `cross_entropy`. I found that it takes two inputs: logits and targets (i.e. labels, i.e. `ys` in our case). I ran a cell block with the original code for computing the loss and with `F.cross_entropy` for computing the loss, and I got the same result (after first removing the regularization term from the original loss calculation). I also see that one of the parameters of `F.cross_entropy` is `label_smoothing`; perhaps that is how to add in regularization.
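Here is the kind of side-by-side check I mean, on some made-up logits. `F.cross_entropy` takes the raw logits and target indices and computes the same mean negative log likelihood, but it works from the logits directly instead of exponentiating first, so it is more numerically stable and a single fused call:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
logits = torch.randn((5, 27), generator=g)  # made-up logits for 5 examples
ys = torch.tensor([5, 13, 13, 1, 0])        # made-up target indices

# original: softmax by hand, then pick out each target's probability
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss_manual = -probs[torch.arange(5), ys].log().mean()

# built-in: works directly from the logits
loss_ce = F.cross_entropy(logits, ys)

print(loss_manual.item(), loss_ce.item())  # these should match
```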
Now on to Video 3! Yay exciting 😊
E06: meta-exercise! Think of a fun/interesting exercise and complete it.
This is a tad lame, but for now my extra exercise was some further analysis of the names.txt data set: I discovered that it contains several repeated names. I then removed these repeats and found that doing so slightly improved the performance of my bigram neural net model. So perhaps the greater message here is that a higher-quality (cleaned-up) data set leads to better model performance.
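The duplicate check itself was only a few lines, roughly:

```python
from collections import Counter

words = open('names.txt', 'r').read().splitlines()
counts = Counter(words)
repeats = {w: c for w, c in counts.items() if c > 1}
print(len(repeats), 'names appear more than once')

# keep one copy of each name, preserving the original order
unique_words = list(dict.fromkeys(words))
print(len(words), '->', len(unique_words))
```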
I'll see if I can think of something more fun to do!