
Reading Time: 12 Minutes
In the previous post of this series, we have seen the process of translating the math and logic behind neural networks into efficient python code with the example of a very simple classification problem. Now we are going a notch up to build a robust network that incorporates industry standard techniques to solve complex problems.
So to start getting used to those standards and techniques, we’ll begin with a very simple problem of identifying the class/species of an iris flower with the given features. The iris dataset is a classic and elementary dataset in the field of machine learning, it is been there since 1988 and primarily used for educative purposes and learning. Featured image credit – UCI machine learning
The aim of this post is to build a network in python that predicts the class of iris with a decent accuracy and visualise its results.
First we begin by importing the basic libraries needed to process the data. Here I am using the load_iris method from the sklearn.datasets module, but you can directly download the data as a .csv file from the above link and use pandas read_csv to populate the environment with data.
Then we define our activation functions and their derivatives for calculation.
import numpy as np
import matplotlib.pyplot as plt
# Activation functions
def relu(x):
return np.maximum(0, x)
def relu_derivative(x):
return (x > 0).astype(float)
def softmax(x):
exp_x = np.exp(x - np.max(x)) # stability improvement
return exp_x / exp_x.sum(axis=1, keepdims=True)
Ok, this is a totally new set of functions as compared to the previous post. Let’s break them down one by one. The sigmoid function is replaced with ReLU(Rectified Linear unit), because sigmoid just has its output range from 0 to 1 which makes it incapable of training complex networks and the gradient of the sigmoid function becomes very small when the input is far from zero (either very positive or very negative). This causes the vanishing gradient problem, making it difficult to train deep networks because the gradients used for updating weights become very small.
The softmax function is used to convert the raw scores into meaning probabilities. In our application, the generated outputs at the last layer is a 150×3 matrix each row consisting of scores. For example, on our first iteration we get the following scores->
array([[ 0.93424973, 0.1273886 , 0.39534845],
[ 2.70475743, 2.58435342, 2.42502007],
[ 3.32404947, 2.6119012 , 2.67289137],…….
This is a plot that visualises the usefulness of softmax->
a = np.array([[ 3.32404947, 2.6119012 , 2.67289137]])
print(softmax(a))
plt.scatter(x=np.array([1,2,3]), y=a, label='Unnormalised scores')
plt.scatter(x=np.array([1,2,3]), y=softmax(a), color='green', label='Softmax')
plt.legend()

Since theses numbers are boundless and not normalised, the network cannot learn any useful information. It might learn if you are using a very complex architecture, but it is just a waste of compute and resources. If you are interested in the calculation, here’s a snippet for you=>

The notion behind softmax is that, larger input values become much larger after exponentiation, while smaller input values become relatively smaller. This process helps in making the highest value in the input vector dominate more prominently, which is often desirable in classification tasks where you want the most likely class to stand out.
Let’s first prepare the data, basic forward and backward pass function to visualise the computations.
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder, StandardScaler
iris = load_iris()
X = iris.data
y = iris.target.reshape(-1, 1) # strecthes it into a nx1 matrix
# One-hot encode the target variable
encoder = OneHotEncoder(sparse=False) # creates encoding in matrix form, example element ->[1. 0. 0.]
y = encoder.fit_transform(y)
# Normalize the input features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Initialize weights and biases
np.random.seed(2346827) # this is set to reproduce the same result durinf
weights1 = np.random.rand(4, 5) # 4 input features to 5 hidden neurons
weights2 = np.random.rand(5, 3) # 5 hidden neurons to 3 output neurons (3 classes)
bias1 = np.random.rand(1, 5) * 0 # initially the bias is zero because we don't want to loose useful information by vanishing gradient of large numbers at first iteration. (you could however multiply it with numbers like 0.1 or 0.01 to keep some entropy).
bias2 = np.random.rand(1, 3)
inp = np.array(X)
expected_output = y
For the sake of simplicity, I directly use the load_iris() method from sklearn to get the data.
Once we get the data, we have to encode the targets, it currently has three targets (Setosa, Versicolor, and Virginica). We can achieve this by using the One hot encoding method, it is named so because it creates a 1 x n matrix (n being the number of categories), assigns an index to each element, flags that index to 1 and keeps the rest 0 when a specific element turns up for a row. If sparse is set to true, then only the indices are stored instead of the whole matrix to save memory.
Then we use StandardScaler to scale the x values so that the network can learn meaningful stuff easily. After that, it is just the usual code to initialise the weights and biases.
epochs = 2600
learning_rate = 3e-6
lossi = []
spl_loss = []
for epoch in range(epochs):
# forward propagation
hidden_input = np.dot(inp, weights1) + bias1 # 150x4 * 4*5 = 150x5 matrix + 1x5 matrix(bias) is broadcasted
hidden_output = relu(hidden_input)
output_input = np.dot(hidden_output, weights2) + bias2
break # we break to visualise the initial results
Here’s a visualisation of the activations and the ReLU passes parameters->
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
plt.hist(hidden_input); # ; is added to avoid printing any additional text associated with the plot


The right side plot is the one with the x values not scaled, as you can see there are a quite a few activations are very low and information is lost. In large datasets, not scaling them can turn out to be a huge disadvantage during training, as a lot of useful information may be left out.
This is the plot of the values after passing through the ReLU function (the values below zero are eliminated and only the positive activations are kept)
plt.hist(hidden_output);


The right side is the visualisation of the ReLU function.
Yes, setting all negative values to zero does mean that information from negative values is lost. However, the network’s architecture, training process are designed to manage and compensate for this loss. Many architectures incorporate batch normalisation, dropout, and residual connections, to help preserve useful information and maintain gradient flow. But since we are training for a relatively simple task, we won’t be going for the above described methods. We will set the learning rate, epochs, sit back and relax to see the network do the magic.
Now lets add backpropagation logic to the code and visualise the hidden_input passing through the ReLU’s gradient / derivative.
Here’s a graph of the derivative of ReLU, it simply 1 or 0 in the output range for the gradient.

""" add this code to the loop above for back propagation->
output_error = output - expected_output
just_der = relu_derivative(hidden_output)
hidden_error = np.dot(output_error, weights2.T) * relu_derivative(hidden_output) """
plt.figure(figsize=(20,7))
plt.plot(just_der);

Here you can see that the output is only either 1 or 0.
Now, that we understand the intricacies of the network’s initialization, lets put the entire together and fire up the network.
Just a side note before we get into action, we should know where our network should stop and at which loss it should stop. So, for classification tasks, we have the negative log likelihood loss which basically tell’s us what should be our expected loss at the end of training.
The formula is -log base 10(1/total_num_of_classes). In our case the value is np.log10(1/3) which is 0.47712125471966244. We don’t need to exactly get to the same number to achieve a good model, a loss anywhere near it will do good. But what happens if you break the benchmark and maybe get the loss of 0.23 at the end of training by setting up a robust architecture? Do you get a state of the art model? No. Unfortunately, you have overfitted the network, which means that the model will perform very bad on the unseen data and this is something that we never want.
So, keeping this base information in our mind, let’s get into training.
epochs = 2800 # how did I know this ? I had to experiment mulitple times by changing the epochs and learning rate. If the lr is too high then there is not use of running large number of epochs but the network won't learn anything (it's like reving up the engine on 1st gear for a long distance). So finding the epochs and lr is very important
learning_rate = 3e-5
lossi = []
spl_loss = []
for epoch in range(epochs):
# forward propagation
hidden_input = np.dot(inp, weights1) + bias1 # 150x4 * 4*5 = 150x5 matrix + 1x5 matrix(bias) is broadcasted
hidden_output = relu(hidden_input)
output_input = np.dot(hidden_output, weights2) + bias2
output = softmax(output_input)
output_error = output - expected_output
just_der = relu_derivative(hidden_output)
# Back prop: Update weights and biases
hidden_error = np.dot(output_error, weights2.T) * relu_derivative(hidden_output)
weights2 -= learning_rate * np.dot(hidden_output.T, output_error)
bias2 -= learning_rate * np.sum(output_error, axis=0, keepdims=True)
weights1 -= learning_rate * np.dot(inp.T, hidden_error)
bias1 -= learning_rate * np.sum(hidden_error, axis=0, keepdims=True)
loss = -np.mean(np.sum(expected_output * np.log(output + 1e-9), axis=1)) # 1e-9 (small) value added to avoid taking the log of zero, which is undefined.
# The summation along axis=1 collapses the individual class probabilities into a single value for each sample, which represents the log probability of the correct class.
lossi.append(loss)
spl_loss.append(round(loss, 2))
if epoch % 100 == 0:
print(f'Epoch {epoch}, Loss: {loss}')
Note: The loss variable is initialized solely for the purpose of understanding how the model learns over time. The actual adjustment in this example takes place by applying the derivative of ReLU during the back-propagation step.
The loss function used here is know as Cross Entropy Loss, it’s mainly used for classification tasks. It measures the difference between two probability distributions: the true labels (expected output) and the predicted probabilities (output). Here the log(output) is taken to penalise confident but wrong predictions.
By averaging the loss over all samples, we normalise the loss value to ensure it is not dependent on the number of samples. Without averaging, the total loss would increase with the number of samples, making it difficult to compare losses across different batch sizes or datasets.
An element wise multiplication is performed amongst the expected and generated output, since the expected output is one-hot encoded, this operation effectively picks the log probability of the correct class for each sample. The mean is taken to get a single scalar value for the prediction.
Taking the negative log transforms small probabilities into large penalties. This ensures that the model is punished more severely for being confidently wrong, driving the learning process towards making accurate and confident predictions.
To understand this more clearly let’s take an example of a guessing game where there are three coloured balls (red, blue and green) and the goal is to predict the colour of a randomly chosen ball by your friend.
Conditions of the game:
- Guessing with Confidence:
- If you are very confident that the ball will be red, you might guess red with 90% confidence, blue with 5% confidence, and green with 5% confidence.
- If you are not very sure, you might guess red with 40% confidence, blue with 30% confidence, and green with 30% confidence.
2. Winning and Losing Points:
- If the ball is indeed red, you win points. The more confident you were (higher percentage), the more points you win.
- If the ball is blue or green, you lose points. The more confident you were about the wrong colour (higher percentage), the more points you lose.
Log(1) is 0, so as you make confident predictions, you will be provided with more score, but if you make wrong predictions with high confidence, say that the current chosen ball is green with 90% surety and red with 5% and blue with 5%, but actual ball is red then log10(0.05) is -1.3010299956639813 (a very bad score).
This is the combined graph of expected outputs, first iteration’s output and it’s log =>
plt.figure(figsize=(20,9))
plt.plot(expected_output)
plt.plot(output)
plt.plot(np.log10(output))
plt.legend(iris.target_names, loc='upper left')

The flat lines (i.e. blue, orange and green) are 1 or 0 as each 50 examples in the dataset belongs to only one class, that is why we can observe the flatness.
If you observe the black line, you can see that the model is predicting that it is setosa with about 55% confidence and the log value is also less, which means that there is less penalising for the correct prediction. The versicolor and virginica is predicted to be correct with 30% and 23% accuracy. You can see how less confident predictions are penalised by setting them to lesser values.

If you move just a few points to the right from the black line, you could observe that there are some purple lines (virginica) below brown line (versicolour). Here the model is less confident that it will be virginica and more confident that it will be versicolour, so the log assigns a heavy penalty for the model. You can see that the grey plot (virginica’s log10 output) is much lower.
Here is the plot after complete training with 2800 epochs=>

The pink lines are given lots of penality in the last 50 examples, as all of the last 50 rows belong to the virginica class and the model has predicted it is setosa with high confidence. The same thinking approach applies for the rest.
But we can’t use this to gauge the performance of the model, therefore we use the Cross Entropy loss function to observe the loss and make calculations for the accuracy.
This is the loss plot for our model, which is finally settling at a loss of 0.47801, the right side plot is the step plot of how much epochs does the model take to decrease the loss by two significant decimal points.


You can also use histogram to visualise the right plot.

Here you can see that the model is gradually taking too much of epochs to learn from data when the loss drops to 1.0. Since the learning rate might be too high after certain learning we are able to see this pattern, this is absolutely normal and healthy as long as the loss decrease curve is kind of like the above one. But when we are dealing with huge amounts of data or when we use complex architectures, we can use a stochastic approach to decrease the learning rate after certain epochs (this can be done manually or a scheduler from torch can be imported) and create good initialisation of weights and biases for the activations.
Here we don’t go for scheduler techniques as we have a relatively very simple network and small dataset.
Just a quick note, you have to split the data into train and test batches before training so that you can verify the model’s performance on unseen data and whether the model doesn’t overfit. I didn’t split it earlier because I wanted to show how the species are evenly spread by 50 consecutive intervals each, if I split it earlier, then the visualisations would have been very difficult to understand.
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
inp = X_train
expected_output = y_train
Here’s a visualisation of the model’s prediction with respect to the test data=>
def predict(new_input):
hidden_input = np.dot(new_input, weights1) + bias1
hidden_output = relu(hidden_input)
output_input = np.dot(hidden_output, weights2) + bias2
output = softmax(output_input)
return output
predictions = predict(X_test)
predicted_classes = np.argmax(predictions, axis=1)
actual_classes = np.argmax(y_test, axis=1)
# Accuracy
accuracy = np.mean(predicted_classes == actual_classes)
print(f'Accuracy: {accuracy * 100:.2f}%') # I am getting an accuracy of 96.67 %
import seaborn as sns
import pandas as pd
data = pd.DataFrame({'Actual': actual_classes+1, 'Predicted': predicted_classes+1})
heatmap_data = pd.crosstab(index=data['Actual'], columns=data['Predicted'])
plt.figure(figsize=(10,8))
sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')

The results are really good with 1 exception. This seems good since we are not dealing with large datasets and not using complex architecture.
So, that it ! Congratulations on reaching till here, you are amazing. Don’t stop here by closing this tab, run this code, tweak its parameters and experiment with the activations. By doing so you’ll get to understand what we have seen in this post in a more detailed manner.
Thank you for reading all along, in the next part of this series we’ll be building the same code using pytorch for efficiency and then we’ll solve a more complex problem with pytorch. Until then, happy coding !!!!
Subscribe to sapiencespace and enable notifications to get regular insights.
Click here to view similar insights.