Stuck at a Partial Derivative for 6 Years?

My lack of understanding of back propagation had stopped me from implementing a neural network, until this week

17Mar26

Roar. Bears, I've been reading a book on basic deep learning this week. When reading about a specific category of machine learning like reinforcement learning (RL), it's easy to focus on the algorithm and forget about the software library used to implement it and the design of the training loop. These are all part of the code needed to train an RL agent. And let's not forget the interaction with the environment we talked about before, or the code to monitor the training process and save the model, which I have yet to read about. Training a neural network can take anything from half an hour to days. You want to catch mistakes early, including forgetting to save the model from memory at the end.

The book I read took an approach that I have always commended. It started with implementing a perceptron, then a one-layer neural network, using Python only, without any extra library. The perceptron implementation had a Scikit-Learn-like API, with the fit method updating the model's weights and bias. If you recall how a perceptron works, these terms should be familiar. The neural network's API looked like that of PyTorch, with a forward method to calculate the output with activation, and a backward method to calculate the gradient, which would be used to update the weights of the network. Of course, it was not a toy implementation: the book then went on to train both models and got good accuracy. It is only at this point that we can say we kind of understand the inner workings of the perceptron and the one-layer neural network.
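
To make the perceptron part concrete, here is a minimal sketch of what such a pure-Python implementation might look like. It is not the book's code: the class name, the parameters (learning_rate, epochs), the predict_one helper and the AND-gate data are all my own assumptions, chosen only to show a fit method that updates weights and bias.

```python
class Perceptron:
    """Illustrative perceptron with a Scikit-Learn-like fit/predict API.

    A sketch under assumptions, not the book's implementation.
    """

    def __init__(self, n_features, learning_rate=0.1, epochs=20):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.epochs = epochs

    def predict_one(self, x):
        # Weighted sum plus bias, followed by a step function.
        z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return 1 if z > 0 else 0

    def predict(self, X):
        return [self.predict_one(x) for x in X]

    def fit(self, X, y):
        for _ in range(self.epochs):
            for x, target in zip(X, y):
                # error is 0 when the prediction is right, so nothing changes.
                error = target - self.predict_one(x)
                self.weights = [w + self.learning_rate * error * xi
                                for w, xi in zip(self.weights, x)]
                self.bias += self.learning_rate * error
        return self


# Logical AND is linearly separable, so a perceptron can learn it.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
model = Perceptron(n_features=2).fit(X, y)
predictions = model.predict(X)
```

Note that there is no derivative anywhere in this update rule, which is probably why the perceptron never scared me the way the backward method did.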

To be clear, there are many such tutorials online, and even on YouTube, where people implement neural networks from scratch. I had thought about making such a video, but was stopped by the backward method. When training a neural network, after calculating a loss value from the network output and the loss function, we need to adjust the model weights so that the result gets better, i.e., the loss gets lower. To choose such a set of weights, we need to calculate what's called the gradient of the loss function at the point given by the current model weights. The algorithm for that is called back propagation. The gradient is made of the partial derivatives of the loss with respect to each weight. If this already sounds confusing, let me tell you some more tales. The loss function, when plotted in 3D (which assumes there are only two weights, with the loss as the height), looks literally like a mountain, with a "major" valley but also small peaks and valleys along the way.
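
The idea of "following the gradient downhill" can be sketched in a few lines. Everything here is made up for illustration: the loss is a simple bowl over two hypothetical weights, not a real network's loss, and the partial derivatives are estimated by nudging one weight at a time (finite differences), which is not back propagation, just the most literal reading of what a partial derivative is.

```python
def loss(w1, w2):
    # Hypothetical bowl-shaped loss; its minimum sits at w1 = 3, w2 = -1.
    return (w1 - 3) ** 2 + (w2 + 1) ** 2


def numerical_gradient(f, w1, w2, eps=1e-6):
    # One partial derivative per weight: nudge that weight, hold the other fixed.
    d_w1 = (f(w1 + eps, w2) - f(w1 - eps, w2)) / (2 * eps)
    d_w2 = (f(w1, w2 + eps) - f(w1, w2 - eps)) / (2 * eps)
    return d_w1, d_w2


w1, w2 = 0.0, 0.0
learning_rate = 0.1
initial_loss = loss(w1, w2)

for _ in range(100):
    g1, g2 = numerical_gradient(loss, w1, w2)
    # Step against the gradient, i.e. downhill on the loss surface.
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2

final_loss = loss(w1, w2)
```

A real loss surface has the small peaks and valleys mentioned above, so the walk downhill is not guaranteed to end in the "major" valley the way it does on this smooth bowl.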

What would that loss function look like? How can I possibly calculate the partial derivatives of such a function? These were the thoughts in my head. For six whole years, I've been frightened by this math and never set out to implement a neural network. And remember, six years ago the idea of automatic differentiation had just been implemented in Python libraries in the labs and was not in any framework yet, so even machine learning engineers at that time needed to calculate the partial derivatives themselves. The only difference is, they did it.

Six years later, earlier this week, I did it, too, for a mean squared error loss with respect to the weights of a perceptron and a single-layer neural network.
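
For the curious, here is roughly what that calculation can look like in code. This is a sketch under my own assumptions, not the book's model: a single linear neuron with no activation function, made-up data, and function names (forward, backward, mse) borrowed from the API style described earlier. With predictions p_i = w . x_i + b and loss L = (1/n) * sum_i (p_i - t_i)^2, the partial derivatives are dL/dw_j = (2/n) * sum_i (p_i - t_i) * x_ij and dL/db = (2/n) * sum_i (p_i - t_i).

```python
def forward(weights, bias, X):
    # Linear neuron: weighted sum plus bias, no activation (an assumption).
    return [sum(w * xi for w, xi in zip(weights, x)) + bias for x in X]


def mse(predictions, targets):
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


def backward(weights, bias, X, targets):
    # dL/dw_j = (2/n) * sum_i (p_i - t_i) * x_ij
    # dL/db   = (2/n) * sum_i (p_i - t_i)
    n = len(targets)
    errors = [p - t for p, t in zip(forward(weights, bias, X), targets)]
    grad_w = [2 / n * sum(e * x[j] for e, x in zip(errors, X))
              for j in range(len(weights))]
    grad_b = 2 / n * sum(errors)
    return grad_w, grad_b


# Made-up data and starting weights, purely for illustration.
X = [[1.0, 2.0], [2.0, 0.5], [3.0, -1.0]]
targets = [1.0, 2.0, 3.0]
weights, bias = [0.5, -0.3], 0.1

grad_w, grad_b = backward(weights, bias, X, targets)

# Sanity check: compare the first weight's gradient with a finite difference.
eps = 1e-6
loss_plus = mse(forward([weights[0] + eps, weights[1]], bias, X), targets)
loss_minus = mse(forward([weights[0] - eps, weights[1]], bias, X), targets)
numeric_grad_w0 = (loss_plus - loss_minus) / (2 * eps)
```

If the formula is right, grad_w[0] and numeric_grad_w0 agree to several decimal places, which is a cheap way to catch a wrong sign or a missing factor of 2.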

The partial derivatives turned out to be much simpler than I had imagined for the whole six years. Still, I wondered how this back propagation thing is done for multi-layer neural networks, each having a very complex activation function. It wasn't until I found some lecture slides about back propagation that I realised that, for six whole years, I'd been thinking about the whole problem wrong. For those interested, I'll put a link to the slides here. Spoiler: I was doing it as if in a calculus class, while there was a smarter way for computer scientists. I'll write a notebook about what I did and what the smarter way is as well, but the slides actually say it all already.