Sidenote: AI reading log
January 8, 2025
Here is where I keep track of all of my reading and of the educational videos I watch online.
To-read/watch:
- https://karpathy.github.io/neuralnets/
- https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://www.deeplearningbook.org/
- Andrej's blog posts, one about Software 2.0, one about bitcoin, etc.
- At Andrej's instruction, read PyTorch Broadcasting semantics page
- https://karpathy.github.io/2015/11/14/ai/
- Nando de Freitas writings
- Other deeplearning.ai courses like Attention in Transformers: Concepts and Code in PyTorch, taught by Josh Starmer, founder and CEO of StatQuest
- General Wikipedia deep dive, "neural networks"
- Stanford CS231n course notes https://cs231n.stanford.edu/
- Andrej Karpathy, Neural Networks: Zero to Hero
- AlexNet
Noted by Andrej Karpathy as good resources:
- http://d2l.ai/ looks quite good and up to date based on quick skim
- https://github.com/fastai/fastbook I like Jeremy and his focus on code, though sometimes the lessons can feel like an advertisement for the fastai library.
- http://cs231n.stanford.edu/ and its notes (not biased at all :))
Keep in mind:
- Most common neural net mistakes: https://x.com/karpathy/status/1013244313327681536
Read 0 @ 11:55 on 19/02/25 [read full]
A neural probabilistic language model
By Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, 2003
- I struggled through this one; for the first half I had ChatGPT walk me through it sentence by sentence, which was very helpful and gave me the idea for my new app. It was especially helpful for understanding the presented equations, which usually make my mind go blank when I see them
- This paper compared the older n-gram approaches to language modeling with a neural network approach. A key aspect is the word embeddings (also called feature vectors), which let words carry information about their semantics and their relation to other words
- The new neural model was tested on two text corpora (evaluation data sets) and achieved lower perplexity values (perplexity = e^(average NLL loss); see the short sketch after these notes) compared to the state-of-the-art n-gram models
- I started my first attempt at reading this while in Vermont a while ago, but this time started from the beginning and went all the way through over the course of two days (yesterday and today)
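A minimal sketch of the perplexity relation noted above, using a handful of made-up per-token probabilities (nothing here comes from the paper itself):

```python
import math

# Made-up probabilities that a language model assigns to the correct
# next token at four positions (for illustration only).
token_probs = [0.20, 0.05, 0.40, 0.10]

# Average negative log-likelihood (NLL) loss over the tokens.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is e raised to the average NLL loss.
perplexity = math.exp(avg_nll)

print(f"average NLL loss: {avg_nll:.3f}, perplexity: {perplexity:.2f}")
```

Lower perplexity means the model is, on average, less "surprised" by the evaluation text, which is why the neural model beating the n-gram baselines on this metric matters.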
Read 0 @ 17:57 on 07/01/25 [read full]
Deep learning
By Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, 2015
- Figure 1 (multilayer neural networks and backpropagation) is almost an exact match with the Graham Ganssle example I studied, even down to using the same variable letters, except that this paper has two hidden layers rather than one. The neurons are consistently referred to as units; I was also calling them nodes, though I'm not sure where I got that name from
- There were many parallels with what I learned during my MLP coding foray, like using backpropagation to update the weights. It's good to have consistency of concepts; it helps me solidify them as part of the framework in my mind
- The authors believe unsupervised learning is the most promising way forward, "human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object"
- One takeaway is that a combination of techniques is used to get the best outcomes, like combining convolutional neural networks with recurrent neural networks plus an additional memory function
- This was another challenging one; I found the paper on Geoffrey Hinton's website, and it was published in Nature
Read 0 @ 13:11 on 06/01/25 [read full]
Neural Networks
By Graham Ganssle, 2018
- Split the data set: 80% for training, 20% for validation
- Weights and biases are continually updated with each iteration (each epoch, and I think actually with each sample running through the network too) to make the predicted answer match the actual answer during training
- Input layer, hidden layer(s), output layer
- Back-propagation was the key concept to start the deep-learning revolution
- A neural network, stripped down to its most basic form, is simply multiply and add (see the sketch after these notes)
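To convince myself that "multiply and add" really is the core operation, here is a minimal sketch of a forward pass through a tiny network with one hidden layer. The weights, biases, and sigmoid activation are all made up for illustration; this is not Ganssle's actual example.

```python
import math

def sigmoid(x):
    # Squashing nonlinearity applied after the multiply-and-add step.
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # Each unit: multiply every input by a weight, add them up, add a bias.
    return [sigmoid(sum(w * x for w, x in zip(unit_w, inputs)) + b)
            for unit_w, b in zip(weights, biases)]

# Made-up numbers: 2 inputs -> 3 hidden units -> 1 output unit.
x = [0.5, -1.2]
hidden_w = [[0.1, 0.4], [-0.3, 0.8], [0.7, -0.2]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.2, -0.5, 0.9]]
output_b = [0.05]

h = layer(x, hidden_w, hidden_b)
y = layer(h, output_w, output_b)
print(y)  # predicted output for this single sample
```

During training, the predicted output would be compared to the actual answer, and backpropagation would nudge each weight and bias after every iteration, which is the updating described in the notes above.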
Read 0 @ 15:35 on 04/01/25
Transformer (deep learning architecture) [Wikipedia]
- Multi-headed self-attention allows the signal for key tokens to be amplified and the signal for less important tokens to be diminished
- The Transformer model powers systems like Google DeepMind's Gemini (formerly called Bard) and OpenAI's ChatGPT
- The Transformer has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and bidirectional encoder representations from transformers (BERT)
- Text is converted to numerical representations called tokens
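A toy illustration of that last note, where the "tokens" are just single characters mapped to made-up integer ids (real tokenizers, like the ones used by GPT models, learn subword vocabularies):

```python
# Toy character-level "tokenizer": map each character to an integer id.
text = "attention"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

tokens = [vocab[ch] for ch in text]
print(vocab)   # {'a': 0, 'e': 1, 'i': 2, 'n': 3, 'o': 4, 't': 5}
print(tokens)  # the text as a list of numerical token ids
```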
Read 0 @ 15:19 on 04/01/25 [read full]
Attention is all you need
By Ashish Vaswani et al., 2017
- Challenging read, I would estimate I understand ~10%
- The Transformer architecture, with its self-attention mechanism, outperforms recurrent and convolutional neural network (RNN, CNN) models
- Best performance to date (at publication) on the WMT 2014 translation benchmark (newstest2014), with a significantly shorter training period
- They used eight NVIDIA P100 GPUs to train their model (looks like they are ~$500 USD ea. online)
- A standard benchmark for language-model performance is machine translation, e.g. English to German
- The Transformer replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention (sketched below)
- This is the seminal paper on the Transformer, the first sequence transduction model based entirely on attention
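A minimal sketch of the scaled dot-product self-attention at the core of the paper, written with PyTorch. The shapes and random inputs are made up, and a real multi-head layer would add learned linear projections for each head and concatenate the results:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Scores: how strongly each query token attends to each key token.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Softmax turns scores into weights: important tokens are amplified,
    # less important ones are diminished.
    weights = torch.softmax(scores, dim=-1)
    # The output is a weighted sum of the value vectors.
    return weights @ v

# Made-up example: 4 tokens, each represented by an 8-dimensional vector.
x = torch.randn(4, 8)
# In self-attention, queries, keys, and values all come from the same
# sequence (a real layer first applies learned linear projections to x).
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([4, 8])
```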