Sidenote: AI reading log
January 8, 2025
Here is where I keep track of all of my reading and of the educational videos I watch online.
To-read/watch:
- https://karpathy.github.io/neuralnets/
- https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://www.deeplearningbook.org/
- Andrej's blog posts, one about Software 2.0, one about bitcoin, etc.
- At Andrej's instruction, read PyTorch Broadcasting semantics page
- https://karpathy.github.io/2015/11/14/ai/
- Nando de Freitas writings
- Other deeplearning.ai courses like Attention in Transformers: Concepts and Code in PyTorch, taught by Josh Starmer, founder and CEO of StatQuest
- General Wikipedia deep dive, "neural networks"
- Stanford CS231n course notes https://cs231n.stanford.edu/
- Andrej Karpathy, Neural Networks: Zero to Hero
- AlexNet
Noted by Andrej Karpathy as good resources:
- http://d2l.ai/ looks quite good and up to date based on quick skim
- https://github.com/fastai/fastbook I like Jeremy and his focus on code, though sometimes the lessons can feel like an advertisement for the fastai library.
- http://cs231n.stanford.edu/ and its notes (not biased at all :))
Keep in mind:
- Most common neural net mistakes: https://x.com/karpathy/status/1013244313327681536
Read 0 @ 11:55 on 19/02/25 [read full]
A neural probabilistic language model
By Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, 2003
- I struggled through this one; for the first half I had ChatGPT walk me through it sentence by sentence, which was very helpful and gave me the idea for my new app. It was especially helpful for understanding the presented equations, which usually make my mind go blank when I see them
- This paper compared the older n-gram approaches to language modeling with a neural network approach. A key aspect is the word embeddings (also called feature vectors), which let words carry information about their semantics and their relation to other words
- The new neural model was tested on two text corpora (evaluation data sets) and achieved lower perplexity values (perplexity = e^(average NLL loss); see the short sketch after these notes) compared to the state-of-the-art n-gram models
- I started my first attempt at reading this while in Vermont a while ago, but this time started from the beginning and went all the way through over the course of two days (yesterday and today)
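A minimal sketch of the perplexity relation noted above, using a handful of made-up per-token probabilities (nothing here comes from the paper itself):

```python
import math

# Made-up probabilities that a language model assigns to the correct
# next token at four positions (for illustration only).
token_probs = [0.20, 0.05, 0.40, 0.10]

# Average negative log-likelihood (NLL) loss over the tokens.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is e raised to the average NLL loss.
perplexity = math.exp(avg_nll)

print(f"average NLL loss: {avg_nll:.3f}, perplexity: {perplexity:.2f}")
```

Lower perplexity means the model is, on average, less "surprised" by the evaluation text, which is why the neural model beating the n-gram baselines on this metric matters.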
Read 0 @ 17:57 on 07/01/25 [read full]
Deep learning
By Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, 2015
- Figure 1 (multilayer neural networks and backpropagation) is almost an exact match with the Graham Ganssle example I studied, even down to using the same variable letters, except that this paper has two hidden layers rather than one. The neurons are consistently referred to as units; I was also calling them nodes, though I'm not sure where I got that name from
- There were many parallels with what I learned during my MLP coding foray, like using backpropagation to update the weights. It's good to have consistency of concepts; it helps me solidify them as part of the framework in my mind
- The authors believe unsupervised learning is the most promising way forward, "human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object"
- One takeaway is that a combination of techniques is used to get the best outcomes, like combining convolutional neural networks with recurrent neural networks plus an additional memory function
- This was another challenging one; I found the paper on Geoffrey Hinton's website, and it was published in Nature
Read 0 @ 13:11 on 06/01/25 [read full]
Neural Networks
By Graham Ganssle, 2018
- Split the data set: 80% for training, 20% for validation
- Weights and biases are continually updated with each iteration (each epoch, and I think actually with each sample running through the network too) to make the predicted answer match the actual answer during training
- Input layer, hidden layer(s), output layer
- Back-propagation was the key concept to start the deep-learning revolution
- A neural network, stripped down to its most basic form, is simply multiply and add (see the sketch after these notes)
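To convince myself that "multiply and add" really is the core operation, here is a minimal sketch of a forward pass through a tiny network with one hidden layer. The weights, biases, and sigmoid activation are all made up for illustration; this is not Ganssle's actual example.

```python
import math

def sigmoid(x):
    # Squashing nonlinearity applied after the multiply-and-add step.
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # Each unit: multiply every input by a weight, add them up, add a bias.
    return [sigmoid(sum(w * x for w, x in zip(unit_w, inputs)) + b)
            for unit_w, b in zip(weights, biases)]

# Made-up numbers: 2 inputs -> 3 hidden units -> 1 output unit.
x = [0.5, -1.2]
hidden_w = [[0.1, 0.4], [-0.3, 0.8], [0.7, -0.2]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.2, -0.5, 0.9]]
output_b = [0.05]

h = layer(x, hidden_w, hidden_b)
y = layer(h, output_w, output_b)
print(y)  # predicted output for this single sample
```

During training, the predicted output would be compared to the actual answer, and backpropagation would nudge each weight and bias after every iteration, which is the updating described in the notes above.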
Read 0 @ 15:35 on 04/01/25
Transformer (deep learning architecture) [Wikipedia]
- Multi-headed self-attention allows the signal for key tokens to be amplified and the signal for less important tokens to be diminished
- The Transformer model powers systems like Google DeepMind's Gemini (formerly called Bard) and OpenAI's ChatGPT
- The Transformer has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and bidirectional encoder representations from transformers (BERT)
- Text is converted to numerical representations called tokens
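A toy illustration of that last note, where the "tokens" are just single characters mapped to made-up integer ids (real tokenizers, like the ones used by GPT models, learn subword vocabularies):

```python
# Toy character-level "tokenizer": map each character to an integer id.
text = "attention"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

tokens = [vocab[ch] for ch in text]
print(vocab)   # {'a': 0, 'e': 1, 'i': 2, 'n': 3, 'o': 4, 't': 5}
print(tokens)  # the text as a list of numerical token ids
```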
Read 0 @ 15:19 on 04/01/25 [read full]
Attention is all you need
By Ashish Vaswani et al., 2017
- Challenging read, I would estimate I understand ~10%
- The Transformer architecture, with its self-attention mechanism, outperforms recurrent and convolutional neural network (RNN, CNN) models
- Best performance to date (at publication) on the WMT 2014 translation benchmark (newstest2014), with a significantly shorter training period
- They used eight NVIDIA P100 GPUs to train their model (looks like they are ~$500 USD ea. online)
- A standard benchmark for language-model performance is machine translation, e.g. English to German
- The Transformer replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention (sketched below)
- This is the seminal paper on the Transformer, the first sequence transduction model based entirely on attention
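A minimal sketch of the scaled dot-product self-attention at the core of the paper, written with PyTorch. The shapes and random inputs are made up, and a real multi-head layer would add learned linear projections for each head and concatenate the results:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Scores: how strongly each query token attends to each key token.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Softmax turns scores into weights: important tokens are amplified,
    # less important ones are diminished.
    weights = torch.softmax(scores, dim=-1)
    # The output is a weighted sum of the value vectors.
    return weights @ v

# Made-up example: 4 tokens, each represented by an 8-dimensional vector.
x = torch.randn(4, 8)
# In self-attention, queries, keys, and values all come from the same
# sequence (a real layer first applies learned linear projections to x).
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([4, 8])
```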