
Reading Time: 7 Minutes
Hey there, welcome back to the journey of mastering neural networks. In this post of the neural network series, we'll explore back-propagation in detail, with all the math behind the logic. If you have not gone through the previous post on mastering neural networks, I highly encourage you to do so, so that you can fully comprehend the content and develop proper intuition from this post.
Neural networks in computing have gained significant attention and are being utilized in a wide range of applications worldwide. This technology has enabled businesses to grow exponentially in the competitive market and has facilitated the emergence of many new ventures.
If you are curious about how these technologies work, it is important to understand the elementary concepts that fuel massive technologies like Large Language Models. The core of neural networks lies in the study of the human brain, which they attempt to mimic.
Just as the human brain functions through millions of neurons and billions of interconnections between them, an artificial neural network aims at creating multiple neurons and training them with data to learn from it, rather than requiring a programmer to code each logic and pathway for a solution or expected output based on given parameters/features.
A simple, basic single-layer neural network can be built using only a forward-propagation approach, where random weights and biases are initialised and a simple activation function is used to train the model/network. The problem is that such a network can't solve complex problems or work with huge volumes of data.
This is exactly why back-propagation is required: it leverages multiple neural layers and propagates the error, estimated as the delta between the actual and generated output, backwards through the network, altering the weights and biases of every layer. Through this process, the network learns and updates its parameters from the data, much as humans understand a concept and store it in memory.

The Idea Behind Back Propagation
Image Credits: Marvel Entertainment
Back-propagation is a crucial step in training the network. It is just an additional step added to a simple forward-propagation network to make it scalable to complex and dynamic problems. The steps below also serve as a bird's-eye view of what we are going to learn in this post:
- Forward Pass: The input data is passed through the network, and predictions are made.
- Loss Calculation: The difference (error) between the predictions and the actual values is calculated using a loss function.
- Backward Pass (Back-propagation):
  - The error is propagated back through the network.
  - Gradients (derivatives) of the loss with respect to each weight are computed.
- Weight Update: The weights are adjusted in the direction that reduces the error, typically using gradient descent.
By repeating these steps, the network iteratively improves its predictions by minimising the error. Back propagation efficiently computes the gradients needed for this weight adjustment process.
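The four steps above can be sketched on a single sigmoid neuron. All the numbers here (input, target, starting weight, learning rate) are illustrative assumptions, not values from the network in this post:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 0.8        # one input and its expected output (assumed)
w, b, lr = 0.5, 0.0, 0.5    # starting weight, bias, learning rate (assumed)

losses = []
for epoch in range(50):
    # 1. Forward pass
    out = sigmoid(w * x + b)
    # 2. Loss calculation (squared error)
    loss = 0.5 * (target - out) ** 2
    losses.append(loss)
    # 3. Backward pass: derivative of the loss w.r.t. the pre-activation
    delta = (out - target) * out * (1 - out)
    # 4. Weight update: step opposite the gradient
    w -= lr * delta * x
    b -= lr * delta

print(losses[0] > losses[-1])  # the loss shrinks as the steps repeat
```

Repeating the loop is exactly the "iterative improvement" described above: each pass nudges the weight a little further downhill on the loss.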
Since we are working on a classification problem, the output values represent the probabilities of the input belonging to each class. The values in the last layer represent the probability of the input belonging to class 0 and class 1, respectively.
For educational purposes, we will consider a simple two-layer network (one input and one hidden layer) with two neurons in each layer. The weights and bias for each neuron are randomly generated.

For the purpose of simple calculation, we assume that the biases for the neurons within each layer are the same.
Next, we perform forward propagation on the network. Here is an illustrative calculation. Each neuron's output after a forward pass has to go through an activation function, which normalises the calculation to a fixed bound so that the network can learn without noise. The activation function used here is the sigmoid function (whose range is 0 to 1), because we are dealing with a binary output problem. Different activation functions such as ReLU and tanh can be used when dealing with more complex problems.


This is the graph of the sigmoid function; it transforms any given input into a value within the range 0 to 1.
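In code, the sigmoid is a one-liner:

```python
import math

def sigmoid(z):
    """Squash any real-valued input into the open range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5, the midpoint of the curve
print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
```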

Here's how the network looks after forward propagation is complete and all the neurons are activated.

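A minimal sketch of that forward pass in Python. The article's weights are randomly generated, so the specific values below are placeholder assumptions; only the structure (two inputs, two hidden neurons, two outputs, one shared bias per layer, sigmoid activation) mirrors the setup above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.05, 0.10]                          # inputs
w_hidden = [[0.15, 0.20], [0.25, 0.30]]   # one row of weights per hidden neuron (assumed)
b_hidden = 0.35                           # shared bias in the hidden layer (assumed)
w_out = [[0.40, 0.45], [0.50, 0.55]]      # one row per output neuron (assumed)
b_out = 0.60                              # shared bias in the output layer (assumed)

def layer(inputs, weights, bias):
    # weighted sum plus bias, pushed through the sigmoid, per neuron
    return [sigmoid(sum(w * i for w, i in zip(row, inputs)) + bias)
            for row in weights]

hidden = layer(x, w_hidden, b_hidden)
output = layer(hidden, w_out, b_out)
print(hidden)  # activated hidden-layer values
print(output)  # class "probabilities" before any training
```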
Now we have to calculate the error for each neuron and back-propagate it in order to change the weights and biases in a meaningful manner.

To adjust the network's weights, we use a method called gradient descent. Here's how gradients are involved in this process:
- Loss Function: For each input, the network makes a prediction, and the loss function calculates the error between the predicted value and the actual value.
- Gradient Calculation: The gradient is the derivative of the loss function with respect to the network’s weights. It tells us how much the loss function would change if we change the weights by a small amount. Think of the gradient as a slope of a hill – it points in the direction of the steepest ascent.
- Gradient Descent: Since we want to minimise the loss, we move in the opposite direction of the gradient. This is like walking downhill to reach the lowest point of the hill. By adjusting the weights in small steps in the direction opposite to the gradient, the loss gradually decreases.
- Iterative Process: This process of calculating the gradients and updating the weights is repeated for many iterations. Each iteration helps the network learn a bit more and reduce the error further.
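The "walking downhill" idea can be seen in isolation on a one-dimensional bowl-shaped loss (a toy function chosen for illustration, not part of the network):

```python
# Toy loss L(w) = (w - 3)^2, with gradient dL/dw = 2 * (w - 3).
# Stepping opposite the gradient walks w toward the minimum at w = 3.
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)   # slope of the hill at the current w
    w -= lr * grad       # small step in the downhill direction
print(round(w, 4))       # converges to 3.0
```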
Error calculation part:

The error for a neuron in the hidden layer is calculated by taking the weighted sum of the error terms of the next layer (using the weights connecting that neuron forward) and multiplying it by the neuron's local gradient.
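That rule can be sketched as follows; the downstream error terms, forward weights, and activation value are assumed placeholder numbers:

```python
def sigmoid_derivative(out):
    # derivative of the sigmoid, expressed in terms of its own output
    return out * (1 - out)

output_deltas = [0.138, -0.038]    # error terms of the two output neurons (assumed)
weights_to_outputs = [0.40, 0.50]  # weights from this hidden neuron forward (assumed)
hidden_activation = 0.59           # this hidden neuron's activated output (assumed)

# Hidden-layer error: weighted sum of the downstream error terms,
# scaled by the neuron's local sigmoid gradient.
hidden_delta = sum(d * w for d, w in zip(output_deltas, weights_to_outputs)) \
               * sigmoid_derivative(hidden_activation)
print(hidden_delta)
```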


Finally, we update each neuron's weight by subtracting learning_rate*error*respective_activated_output from the original weight. (We use a learning rate to regulate how strongly the network adjusts its weights in response to the loss gradient, i.e. how quickly the network updates its learned parameters.)
In this context, η(eta) represents the learning rate, which I have set to 0.1 for simplicity in calculation. The learning rate controls how quickly the network learns from the data. A higher learning rate enables faster learning but comes with potential drawbacks: it may skip over essential details or learn incorrect information. Thus, selecting an appropriate learning rate is crucial.
To determine the optimal learning rate, we typically employ a trial-and-error approach, adjusting the learning rate and observing the error reduction in the network’s performance until a satisfactory level of improvement is achieved. This iterative process helps to balance the speed of learning with the accuracy and reliability of the model.
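The update rule with η can be sketched numerically; the error term, activation, and old weight below are assumed placeholder values:

```python
eta = 0.1                # learning rate, as set in this post
delta = 0.138            # error term at the neuron (assumed)
activated_output = 0.593 # activated output feeding into this weight (assumed)
w_old = 0.40             # current value of the weight (assumed)

# New weight: old weight minus eta * error * activated output
w_new = w_old - eta * delta * activated_output
print(round(w_new, 6))   # 0.391817
```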


The error term is propagated into tuning the biases as well.

Finally, we have the computed neural network with the updated weights and biases. Does this mean the network is ready for prediction? Absolutely not. The same procedure must be repeated for multiple iterations, known as epochs in deep learning. Only after the network has sufficiently reduced its overall loss through these epochs will the model be ready to predict with good accuracy. This is a sample plot that I made by training the network over 1000 epochs->

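As a sketch, here is the full loop (forward pass, error terms, weight updates) repeated over 1000 epochs on a single training sample. The weights, inputs, and targets are illustrative assumptions, and the bias updates are omitted for brevity:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.05, 0.10]                     # inputs
targets = [0.01, 0.99]               # expected class outputs
wh = [[0.15, 0.20], [0.25, 0.30]]    # hidden-layer weights (assumed)
wo = [[0.40, 0.45], [0.50, 0.55]]    # output-layer weights (assumed)
bh, bo, eta = 0.35, 0.60, 0.5        # biases and learning rate (assumed)

def forward():
    h = [sigmoid(sum(w * i for w, i in zip(row, x)) + bh) for row in wh]
    o = [sigmoid(sum(w * v for w, v in zip(row, h)) + bo) for row in wo]
    return h, o

losses = []
for epoch in range(1000):
    h, o = forward()
    losses.append(sum(0.5 * (t - y) ** 2 for t, y in zip(targets, o)))
    # output-layer error terms
    d_o = [(y - t) * y * (1 - y) for y, t in zip(o, targets)]
    # hidden-layer error terms: weighted sum of downstream deltas
    d_h = [sum(d_o[k] * wo[k][j] for k in range(2)) * h[j] * (1 - h[j])
           for j in range(2)]
    # gradient-descent weight updates
    for k in range(2):
        for j in range(2):
            wo[k][j] -= eta * d_o[k] * h[j]
    for j in range(2):
        for i in range(2):
            wh[j][i] -= eta * d_h[j] * x[i]

h, o = forward()
print(losses[0], losses[-1])  # loss at the first epoch vs the last
print(o)                      # prediction moves toward the targets
```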
For those who are curious, here's a prediction with the available parameters, given 0.05 and 0.1 as input->

Final Output after One Iteration:

The predicted output is very poor: there is an 87% chance of belonging to class 0 and an 89% chance of belonging to class 1. Once we train the network through multiple iterations, we can get close to the expected output. Here's the output after training the network for a million epochs->
Predicted output for input [0.05, 0.10]: (0.01000000000034705, 0.9899999999999923)
This doesn't mean that all training needs a million epochs; some networks train well in just over 100 epochs. I ran a million epochs only to show how far accuracy can improve. Since we are solving a classification problem, as few as 100 epochs is enough to get an output of (0.307790442796199, 0.84206790046583), which is sufficient to classify the input.
For those who are curious on how we got the derivative of sigmoid function, here’s the calculation:


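A quick numerical check of the identity σ'(z) = σ(z)(1 − σ(z)), comparing the analytic form against a central finite difference at an arbitrary point:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, eps = 0.7, 1e-6  # arbitrary test point and step size
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic) < 1e-8)  # True: the two agree
```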
Thank you for joining me on this journey. I hope you've gained some insights from this post. We have gone through one of the simplest networks associated with back-propagation in order to develop the intuition behind it in the easiest way.
In the next post we'll build the same logic we saw here from scratch in Python, and later we'll develop a generalised version that can adapt to dynamic parameter sizes, data of variable volumes, and different activation functions. Do drop any doubts in the comments below. I'd also love to hear your thoughts and suggestions on this post, so don't hesitate to put them down in the comments.
So stay tuned for the upcoming post that translates the math and logic covered here into Python code. Subscribe to sapiencespace and enable notifications to get regular insights.