Z2H Video 3, round 2, finished watching [Post #18, Day 38]

I have finished working my way through Video 3 the second time.

I understood most of what was presented. I am still not solid on what the embedding vectors are, or on how the number of dimensions affects them and the performance of the neural network model.
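To pin down at least the mechanics, here is a minimal sketch of what the embedding table is in this model, assuming the makemore setup from the video (27-character vocabulary, lookup table C); the "number of dimensions" is just the width of each row:

import torch

vocab_size = 27  # '.' plus 'a' through 'z'
n_embd = 10      # each character becomes a 10-dimensional vector

C = torch.randn((vocab_size, n_embd))
X = torch.tensor([[5, 13, 13]])  # a context block of 3 character indices
emb = C[X]                       # embedding lookup
print(emb.shape)                 # torch.Size([1, 3, 10])

More dimensions per row means more capacity to separate characters, at the cost of more parameters; beyond that I still need to build intuition.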

Now time for some exercises.


E01: Tune the hyperparameters of the training to beat my best validation loss of 2.2

The parameters I tried adjusting were:

  • block size, i.e., the context length
  • number of embedding vector dimensions
  • number of neurons in hidden layer
  • minibatch size
  • number of training iterations

I also tried removing name repeats in the data set.
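For reference, here is a minimal sketch of where each of these knobs lives in the setup, assuming the makemore MLP from the video; variable names follow the lecture notebook, and the values shown are the lecture defaults as I recall them:

import torch

block_size = 3       # context length
n_embd     = 10      # embedding vector dimensions
n_hidden   = 200     # neurons in hidden layer
batch_size = 32      # minibatch size
max_steps  = 200000  # training iterations

g  = torch.Generator().manual_seed(2147483647)
C  = torch.randn((27, n_embd), generator=g)
W1 = torch.randn((block_size * n_embd, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True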

Here is my experiments log:

# experiment log

# andrej reference values
# train loss = 2.13, val loss = 2.16, test loss = 2.16

# block size = 4
# train loss = 2.15, val loss = 2.19, test loss = 2.19
# determination: worse

# block size = 5
# train loss = 2.20, val loss = 2.23, test loss = 2.23
# determination: worse

# removed repeats from data set
# train loss = 2.13, val loss = 2.16, test loss = 2.18
# determination: worse, interesting

# embedding vector dimensions = 15
# train loss = 2.09, val loss = 2.15, test loss = 2.14
# determination: better

# embedding vector dimensions = 20
# train loss = 2.07, val loss = 2.14, test loss = 2.15
# determination: better

# hidden layer neurons = 300
# train loss = 2.10, val loss = 2.18, test loss = 2.17
# determination: worse

# minibatch size = 64
# train loss = 2.11, val loss = 2.15, test loss = 2.15
# determination: better

# training iterations = 300000
# train loss = 2.12, val loss = 2.18, test loss = 2.18
# determination: worse

# minibatch size = 128
# train loss = 2.12, val loss = 2.16, test loss = 2.16
# determination: same

# embedding vector dimensions = 15, minibatch size = 64
# train loss = 2.08, val loss = 2.14, test loss = 2.1419
# determination: better ***BEST SO FAR***

# repeat experiment, embedding vector dimensions = 15
# train loss = 2.08, val loss = 2.15, test loss = 2.17
# determination: worse, unexpected since this was a repeat experiment, demonstrating non-reproducibility
# could be due to the random shuffle during data set splitting,
# and the initializations are random too, though maybe not if the generator is seeded?

# repeat experiment again, embedding vector dimensions = 15
# train loss = 2.08, val loss = 2.17, test loss = 2.15
# determination: same

# repeat experiment again without reshuffling data set split, embedding vector dimensions = 15
# train loss = 2.09, val loss = 2.17, test loss = 2.15
# determination: same
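
Since the repeat runs did not reproduce, here is a sketch of the three places randomness can sneak in, assuming the lecture-style pipeline; seeding all of them should make repeat runs give identical losses:

import random
import torch

# 1) the shuffle before the train/val/test split
words = open('names.txt', 'r').read().splitlines()  # assumes the names data set file
random.seed(42)
random.shuffle(words)

# 2) parameter initialization: pass the same seeded generator to every torch.randn
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 10), generator=g)

# 3) minibatch sampling inside the training loop, e.g.
# ix = torch.randint(0, Xtr.shape[0], (64,), generator=g)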

So for the best result I had:

# embedding vector dimensions = 15, minibatch size = 64
# train loss = 2.08, val loss = 2.14, test loss = 2.1419
# determination: better ***BEST SO FAR***

This is marginally better than Andrej's result.

I tried running these experiments in the Google Colab notebook, but the processing took forever. Is that because the computations run on a remote server rather than on my laptop?

I just checked my laptop specs:

  • Model name: MacBook Pro (14-inch)
  • Model release date: November 2023
  • Chip: Apple M3
  • Memory (RAM): 8 GB
  • Storage: 500 GB
  • Number of cores: 8 (4 performance and 4 efficiency)
  • Current operating system: macOS Sequoia, Version 15.3
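
One thing worth checking, though I have not tried it yet: on Apple silicon, PyTorch can run on the laptop's GPU through the MPS backend, which may be faster than Colab's free CPU tier. A quick check:

import torch

# see whether the Apple-silicon GPU backend is available
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
print(f'using device: {device}')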

E02: I was not careful with the initialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?

So first I initialized all parameters contained within C, W1, b1, W2, and b2 to 1. I computed the loss without running any training and got train loss = 3.30, val loss = 3.30, and test loss = 3.30. I then sampled 20 names just to see what they looked like, and yeah, gibberish.

I then ran a training run with the standard hyperparameters and got train loss = 2.84, val loss = 2.84, and test loss = 2.84.

I don't understand the second part of this question and it makes me think I may have misunderstood the first part.

Ok, I am revisiting this the next day after having a thought about it this morning. With 27 possible next characters, a uniform probability for each character would be 1/27 (so the 27 probabilities sum to 1). So I undid what I did before, initializing all parameters to 1 (I don't think that was correct), and let them randomly initialize again. Actually, I don't think even that is needed. I simply filled my prob tensor with 1/27 (~0.037) and then computed the loss: loss = -prob[torch.arange(182625), Ytr].log().mean(). I hardcoded 182625, which I believe is the number of training examples in the data set split. The computed loss was 3.30. I'm not sure how to apply this to a training run, and I'm still not sure how to do part 2 of the question.

Just got some more insight. I'm watching Video 7 now (I skipped ahead) and Andrej mentioned that the expected loss is -ln(1/27), which equals 3.30. Yes, exactly what I computed above! So I guess this is kind of a starting point for the loss, and it should go down from there with effective training of the neural network.
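
Putting both parts together, here is a minimal sketch under the same makemore assumptions; the part (2) trick of shrinking the output layer is a common one, not something I tried above:

import math
import torch
import torch.nn.functional as F

# (1) expected loss under perfectly uniform predictions over 27 characters
print(-math.log(1 / 27))  # 3.2958...

# all-equal logits give the same number through cross-entropy
logits = torch.zeros(4, 27)  # any batch where every logit is equal
print(F.cross_entropy(logits, torch.tensor([0, 5, 13, 26])))  # also ~3.2958

# (2) one common trick (not tried above): scale the output layer down at
# initialization so the starting logits are near zero, which puts the
# starting loss near -ln(1/27) instead of much higher
g  = torch.Generator().manual_seed(2147483647)
W2 = torch.randn((200, 27), generator=g) * 0.01
b2 = torch.zeros(27)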


E03: Read the Bengio et al 2003 paper (link above), implement and try any idea from the paper. Did it work?

I plan to read the full paper tomorrow and will try to glean an idea from it to implement in my model. I started reading this paper while I was in Vermont, but I wasn't totally focused, and it was also earlier in my learning, so a lot went over my head. Maybe I will understand more this time.

Checking back in: I just finished reading the paper on 19 February. I don't know yet which idea from the paper I could try implementing; maybe something will come to me. Making the character embeddings higher-dimensional? I have already experimented with that. One thing they did was combine an n-gram model with the neural network model, so maybe I could try that. Another thing they mentioned was using existing knowledge when initializing the embeddings, so maybe I could try that, like somehow specifying a relationship between vowels and consonants? (A rough sketch of that last idea follows below.)
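
Here is that rough sketch, purely hypothetical, assuming the 27-character vocab with '.' at index 0 as in the lecture:

import torch

g = torch.Generator().manual_seed(2147483647)
n_embd = 10
C = torch.randn((27, n_embd), generator=g)

# indices 1..26 for 'a'..'z', 0 for '.'
stoi = {ch: i + 1 for i, ch in enumerate('abcdefghijklmnopqrstuvwxyz')}
vowel_ix = [stoi[ch] for ch in 'aeiou']

# pull the vowel embeddings toward their common mean so they start out
# near each other: a crude way to bake prior knowledge into the init
vowel_mean = C[vowel_ix].mean(dim=0, keepdim=True)
C[vowel_ix] = 0.5 * C[vowel_ix] + 0.5 * vowel_mean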
