Digging into the details of my MLP [Post #3, Day 2]
January 7, 2025 • 678 words
I have worked my way sequentially through the forward pass of my multilayer perceptron (MLP) neural network. It involved matrix multiplication and the use of the sigmoid function to compute the activation value for each hidden layer node (i.e. neuron). I use the print function to check values as they flow through my neural network, and then I verify the computations against hand calculations and a spreadsheet.
My neural network (from Graham Ganssle's example) has seven input features (VP, VS, and rho for the upper and lower soil layers at each sample, plus the angle of incidence), which can also be called input nodes (or neurons); one hidden layer with 300 units (i.e. neurons); and one output layer with a single output node (neuron), whose value is the reflectivity for a P-P reflection at an interface in the subsurface. The weights are randomly initialized and the biases start at zero. The weight matrix going into the hidden layer (W1) is a 300-row by 7-column matrix, the bias vector at the hidden layer (b1) has 300 values (one for each node), and the weight matrix going into the output layer (W2) is a vector with 300 values.
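The shapes above can be sketched in a few lines of NumPy. This is a minimal forward-pass sketch under my own assumptions (small random initial weights, a linear output unit, and a random input vector standing in for the seven real features); it is not Graham Ganssle's actual code.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Shapes from the post: 7 input features, 300 hidden units, 1 output
n_in, n_hidden = 7, 300
W1 = rng.standard_normal((n_hidden, n_in)) * 0.01  # 300 x 7, random init
b1 = np.zeros(n_hidden)                            # biases start at zero
W2 = rng.standard_normal(n_hidden) * 0.01          # vector of 300 values
b2 = 0.0

def forward(x):
    # x: vector of 7 features (VP, VS, rho for two layers, plus angle)
    z1 = W1 @ x + b1          # (300,) pre-activations at the hidden layer
    a1 = sigmoid(z1)          # (300,) hidden activations
    y  = W2 @ a1 + b2         # scalar output: predicted reflectivity
    return z1, a1, y

x = rng.standard_normal(n_in)  # placeholder input, not real well-log data
z1, a1, y_hat = forward(x)
```

Printing `z1`, `a1`, and `y_hat` at each stage is exactly the kind of value-checking described above.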
My next goals are to work sequentially through the backpropagation path of the network, then create two new neural networks: one to map N to 2N (simple multiplication, equivalent to building a linear model), and one to map N to N² (nonlinear). These both take one input value and produce one output value. I also want to work my way through Andrej Karpathy's Neural Networks: Zero to Hero YouTube Playlist. I know he is also working on an AI course at Eureka Labs called LLM101n; I bet that will be great when it's released.
Checking back in now as I am working through the backpropagation process.
Questions I have at this point in time include:
- Why is the sigmoid (logistic) function used to compute the activation values? Is it something about introducing nonlinearity?
- Why is the derivative then used for backpropagation?
- I multiply the loss by the activation value to compute the gradient of the weight (the W2 gradient). Why? And what exactly is the gradient of a weight? The new weight (new W2) is then the original weight minus the learning rate (in my case 0.001) multiplied by the gradient of the weight.
- The gradient of the bias (b2) is just the loss. Why?
- What is the difference conceptually between a weight and a bias?
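One concrete thing that can be checked in code right away is the sigmoid's derivative, which comes up in two of the questions above. The derivative of the logistic function has the closed form s * (1 - s), and a finite-difference check (my own sketch, not from the original code) confirms it numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # The derivative has the convenient closed form s * (1 - s),
    # reusing the forward-pass value s, which makes it cheap to
    # evaluate during backpropagation
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare the closed form against a numerical (central-difference) slope
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(abs(sigmoid_prime(z) - numeric) < 1e-9)  # True (within tolerance)
```

This doesn't answer *why* the derivative appears in backpropagation, but it does verify the formula the code relies on.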
I think I can use Claude in Cursor to help me answer some of these questions.
This first backpropagation step goes from the output error (loss) to the W2 weights and b2 bias, before moving back past the hidden layer. The second step then updates the W1 weights and b1 biases. This second part uses the chain rule: the loss multiplied by the W2 weights multiplied by the derivative of the activation values. So the result of backpropagation is that the two sets of weights and biases (W1, b1 and W2, b2) have been updated based on the error the network found in the forward pass, which means my network has learned something.
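The two backpropagation steps described above can be sketched as follows. This is my own reconstruction, assuming a squared-error objective where the "loss" signal is simply the raw output error (prediction minus target), which matches the rules described in the post (W2 gradient = loss times activation, b2 gradient = loss, then the chain rule back through the hidden layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, lr = 7, 300, 0.001   # learning rate 0.001, as in the post

W1 = rng.standard_normal((n_hidden, n_in)) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal(n_hidden) * 0.01
b2 = 0.0

x = rng.standard_normal(n_in)   # placeholder sample
target = 0.1                    # hypothetical known reflectivity

# --- forward pass ---
z1 = W1 @ x + b1
a1 = sigmoid(z1)
y_hat = W2 @ a1 + b2

# --- backward pass ---
err = y_hat - target              # the "loss" signal at the output

# Step 1: gradients at the output layer
dW2 = err * a1                    # loss times activation value
db2 = err                         # bias gradient is just the loss

# Step 2: chain rule back past the hidden layer
delta1 = err * W2 * a1 * (1 - a1) # loss * W2 * sigmoid derivative
dW1 = np.outer(delta1, x)         # 300 x 7, matching W1
db1 = delta1

# Gradient-descent updates: new weight = old weight - lr * gradient
W2 -= lr * dW2
b2 -= lr * db2
W1 -= lr * dW1
b1 -= lr * db1
```

Note that `a1 * (1 - a1)` is the sigmoid derivative evaluated at the hidden pre-activations, reused from the forward pass.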
So to train and use my MLP to make predictions, I first have to train it on data with known outputs, so I can compare my MLP's outputs to the known values and update the weights and biases with each epoch (and, more granularly, with each sample run within each epoch). There is also a step that randomly shuffles the training data in each epoch to prevent the network from learning order bias. Once I have a trained MLP, I can evaluate it on the subset of the data I set aside for validation, which my MLP hasn't seen yet. If the validation looks good (I'm not sure exactly how to quantify this yet), I can then apply my MLP (with the weights and biases computed during training) to a new data set for which I don't have solutions, and predict them.
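The whole train/shuffle/validate loop can be tried end to end on the N-to-N² toy problem mentioned earlier. Everything below is my own sketch: the hidden size, learning rate, epoch count, data range, and 80/20 split are all assumptions, and mean squared error is one reasonable way to quantify "the validation looks good":

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: learn to map N to N^2; inputs kept in [0, 1] so the
# targets stay in a range the sigmoid hidden layer handles easily
X = rng.uniform(0, 1, size=200)
Y = X ** 2

# Hold out 20% for validation; the network never trains on these
split = int(0.8 * len(X))
X_train, Y_train = X[:split], Y[:split]
X_val, Y_val = X[split:], Y[split:]

# One input, one hidden layer, one output (sizes are my guesses)
n_hidden, lr, epochs = 20, 0.1, 1500
W1 = rng.standard_normal(n_hidden)
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal(n_hidden) * 0.1
b2 = 0.0

for epoch in range(epochs):
    # Shuffle each epoch to prevent learning order bias
    for i in rng.permutation(len(X_train)):
        x, t = X_train[i], Y_train[i]
        # forward pass
        a1 = sigmoid(W1 * x + b1)
        y = W2 @ a1 + b2
        # backward pass and per-sample updates
        err = y - t
        delta1 = err * W2 * a1 * (1 - a1)
        W2 -= lr * err * a1
        b2 -= lr * err
        W1 -= lr * delta1 * x
        b1 -= lr * delta1

# Quantify validation quality with mean squared error on unseen data
val_pred = np.array([W2 @ sigmoid(W1 * v + b1) + b2 for v in X_val])
val_mse = np.mean((val_pred - Y_val) ** 2)
```

If `val_mse` is small, the trained weights and biases can then be applied to brand-new inputs with no known answers, which is the prediction stage described above.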