2024-12-22 at 16:21 3b1b
the cost function is the average of the per-example costs over all training examples
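in symbols (my notation: C_k = cost on training example k, n = number of training examples):

$$C(w) = \frac{1}{n}\sum_{k=1}^{n} C_k(w)$$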
backprop gives you the gradient of the cost C(w1, w2, ...) with respect to every weight and bias
(how?)
but to calculate that exactly, you'd need to go through all the examples
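(because the gradient of an average is the average of the per-example gradients:

$$\nabla C(w) = \frac{1}{n}\sum_{k=1}^{n} \nabla C_k(w)$$

so the exact gradient means one backprop pass per example, all 50,000 of them, before taking a single step)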
instead, we use only a few examples at a time and compute not the exact gradient but a stochastic estimate of it, running backprop on just those few examples. This makes sense because it's also how humans learn: we don't retrain on 50,000 examples before adjusting our strategy/intuition/whatever (whether consciously or subconsciously). We adjust as we go, just not after every single example. For an MNIST-type situation, we need maybe 10-50+ examples before adjusting. (For situations like reasoning through physics practice problems, we only need about one example before adjusting, but that's an entirely different kind of learning.)
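a quick numpy sketch of the minibatch idea (my own toy example, not from the video; linear regression instead of MNIST to keep it tiny):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "dataset": 50,000 examples, true weights we want to recover
n, d = 50_000, 5
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad(w, Xb, yb):
    # gradient of the mean squared error over just the batch (Xb, yb):
    # C(w) = mean_k (x_k . w - y_k)^2  =>  grad C = (2/m) X^T (X w - y)
    m = len(yb)
    return (2.0 / m) * Xb.T @ (Xb @ w - yb)

w = np.zeros(d)
lr, batch_size = 0.1, 32  # hyperparameters picked arbitrarily

for step in range(1000):
    # stochastic gradient: estimated from 32 random examples, not all 50,000
    idx = rng.integers(0, n, size=batch_size)
    w -= lr * grad(w, X[idx], y[idx])

print("distance from true weights:", np.linalg.norm(w - true_w))
```

each step is noisy (it only saw 32 examples) but unbiased, so on average it points downhill and w still converges. that's the "adjust as we go, but not off a single example" idea.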