Art of Training Neural Networks – 18 Secrets to Mastering Neural Networks


Reading Time: 10 Minutes

Since I started learning and implementing neural networks, I have run into many different situations and hurdles while setting up and training them.

After a good amount of research, and after learning from the master Andrej Karpathy himself through his YouTube videos and blog posts, I picked up a lot of nuanced and essential information that we need to keep in mind while training networks.

In my previous post on predicting the iris class using a neural network, I mentioned at the end that the next post would be about leveraging PyTorch to build the network with less code and more efficiency. But I felt this is the right time to introduce you to the not-so-fun aspects of training neural networks and how to tackle these hurdles, so that you can avoid any snag during training.

I would like to mention that this post is not to be mistaken for a guide to building neural networks or a tutorial on their fundamentals. You can find information on neural networks, and build them hands-on, in the neural networks series on sapiencespace.

Andrej has already published a blog post named A Recipe for Training Neural Networks, in which he covers all the details associated with training and getting good results; it is useful for everyone from beginners to experts. This post is not a gist or summary of his blog post; it is a 101 of training neural networks drawn from my own learnings and experience.

I am not an expert in neural networks and there is a lot more for me to learn, but trust me, the information I gathered through my journey will be very helpful for people who are starting out with deep learning. Had I known all of this earlier, it would have really helped me understand the fundamentals of training, and it would have saved me countless hours of fixing errors and building efficient architectures.

Here’s the Art of training Neural Networks for you.


1. Know your destination

You have to know where you want to stop your training and evaluate the model. In addition, you should know whether you are travelling in the right direction during training. Absurd losses, or losses that have shot far past the destination loss, point to anomalies, overfitting and poor performance in the final model.

For classification tasks, use the negative log likelihood to know where to stop training. For example, suppose I have 3 classes in my target variable to predict. Regardless of the model type or its architecture, the goal is to reach a loss of 0.477, i.e. -log10(1/number of classes), in order to get good performance from the model. (Note that frameworks such as PyTorch compute cross-entropy with natural logarithms, so the same reference value works out to -ln(1/3) ≈ 1.099.)
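As a minimal sketch of this idea, assuming a small PyTorch classifier and dummy data (the model, batch size and feature count are illustrative, not from the original post), the snippet below computes the random-guessing reference loss for 3 classes and compares it with the loss of a freshly initialised model:

```python
import math

import torch
import torch.nn as nn

num_classes = 3

# Reference loss of a model that guesses uniformly at random.
# With base-10 logs this is -log10(1/3) ≈ 0.477; PyTorch's
# CrossEntropyLoss uses natural logs, so expect -ln(1/3) ≈ 1.099.
reference_loss = -math.log(1.0 / num_classes)

# Hypothetical, freshly initialised classifier on 4 input features.
model = nn.Linear(4, num_classes)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 4)                    # dummy feature batch
y = torch.randint(0, num_classes, (32,))  # dummy labels

initial_loss = criterion(model(x), y).item()
print(f"reference ≈ {reference_loss:.3f}, measured {initial_loss:.3f}")
```

If the very first measured loss is far away from the reference value, the initialisation (or the labels) deserves a second look before any serious training begins.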

When it comes to regression tasks, the goal is to drive the error as low as possible. To calculate the error we have loss functions, which we will look at in detail later in the post.

NOTE: Don’t forget to split your data into train and test sets before building the network. You may train an amazing network, but you will not be able to gauge its performance if you don’t have test or validation data at the end of training. Once its accuracy reaches a desirable state and the model is ready for deployment, you can then train on the entire dataset.
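A quick sketch of such a split, assuming scikit-learn is available and using its bundled iris dataset purely as a stand-in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in for your own dataset

# Hold out 20% for evaluation; stratify preserves the class balance
# in both splits so the test metrics stay representative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```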


2. Don’t start with Complex Architecture

Don’t start with too complex an architecture; it may ruin the training process by overfitting your data or by making confidently wrong predictions.

If the loss follows a hockey-stick pattern (a very high loss at the start of training that then drops sharply), make sure your activations and initialisation are proper – try batch normalisation, dropout, and scaling the initial weights of the output layer down (or setting its initial bias to zero or to small noise).
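As one hedged way to do this in PyTorch – assuming a hypothetical 3-class classifier, and interpreting the advice above as shrinking the output layer's initial weights and zeroing its bias – the sketch below tames the initial loss spike by making the first predictions close to uniform rather than confidently wrong:

```python
import torch
import torch.nn as nn

# Hypothetical 3-class classifier on 4 features.
model = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)

# Tame the initial loss spike: zero the output layer's bias and
# shrink its weights so the first predictions are close to uniform
# rather than confidently wrong.
output_layer = model[-1]
with torch.no_grad():
    output_layer.bias.zero_()
    output_layer.weight.mul_(0.01)
```

With this kind of initialisation, the first measured loss should sit near the reference value from section 1 instead of starting from a spike.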


3. Choose the Activation Function Wisely

The sigmoid function cannot be completely replaced by non-saturating activations like ReLU (a sigmoid is still the natural choice for a binary output, for instance), but sigmoid in the hidden layers tends to cause vanishing gradients and is computationally more expensive. Each type of task requires its own activation function, so choose wisely.
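A small sketch of common pairings in PyTorch – the layer sizes are placeholders, and the loss pairings reflect standard practice rather than anything prescribed in this post:

```python
import torch.nn as nn

# Hidden layers: ReLU is cheap and avoids saturating gradients.
hidden = nn.Sequential(nn.Linear(16, 32), nn.ReLU())

# Binary classification: one raw logit + BCEWithLogitsLoss, which
# applies the sigmoid internally and is numerically more stable than
# an explicit nn.Sigmoid() followed by nn.BCELoss().
binary_head, binary_loss = nn.Linear(32, 1), nn.BCEWithLogitsLoss()

# Multi-class classification: raw logits + CrossEntropyLoss,
# which applies log-softmax internally.
multi_head, multi_loss = nn.Linear(32, 3), nn.CrossEntropyLoss()

# Regression: linear output (no activation) + MSE or Huber loss.
reg_head, reg_loss = nn.Linear(32, 1), nn.MSELoss()
```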


4. Data Data Data

Bad data = bad results. Make sure you have a good quantity and quality of data, with some contextual meaning, before you prepare the tensors for training. Rough minimum amounts of data required to train neural networks, task-wise:

  1. Regression and binary classification tasks – a few hundred to a thousand examples.
  2. Complex classification tasks with more than 2 classes – a few thousand to a hundred thousand.
  3. Image recognition, classification and pattern analysis – a few hundred thousand.
  4. Natural Language Processing (NLP) tasks – a few hundred thousand to millions or billions.
5. Choose Your Feature Set Wisely

Choose your feature set wisely before training your network. It is very alluring to directly split your data and fit it into the model, but weak correlations in the data, or features chosen without a clear purpose, can later lead to poor performance of the model.


6. Correlation of Features

If the data is images or audio, you can look at or listen to examples to understand the nature of the data and the relationships within it. When you are dealing with tabular data, perform EDA (exploratory data analysis) and visualisations to understand the data. You can’t follow quite the same approach for complex tasks like Natural Language Processing (though you can get a rough view of data quality with a word cloud); there, opt instead to clean the data properly and remove any gibberish.
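For tabular data, a minimal EDA sketch along these lines – assuming pandas, seaborn and scikit-learn are installed, and again using the bundled iris dataset as a placeholder for your own table:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load a small tabular dataset into a DataFrame for quick EDA;
# iris stands in here for whatever table you actually have.
df = load_iris(as_frame=True).frame  # features plus a 'target' column

print(df.describe())  # ranges, means, and a hint of odd values

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Feature correlations")
plt.show()
```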


7. Balance of Classes in the Target Variable

Please ensure that the classes in your target variable are balanced. Skewness in the distribution can lead to the model learning biased information and failing to identify the minority classes. Use plots like countplots, bar plots and histograms to visualise the distribution, and act on it with techniques like under-sampling, over-sampling, SMOTE (synthetic minority oversampling technique), etc.
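A hedged sketch of checking the distribution and oversampling with SMOTE – it assumes the imbalanced-learn package is installed, and the skewed dataset is synthetic, generated only for illustration:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed
from sklearn.datasets import make_classification

# Synthetic 3-class dataset with a heavily skewed target, for illustration.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6,
    n_classes=3, weights=[0.8, 0.15, 0.05], random_state=42,
)
print(pd.Series(y).value_counts())  # inspect the skew first

# Synthetically oversample the minority classes.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(pd.Series(y_res).value_counts())
```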


8. Increase in Epochs for a Loss Step to Decrease

Look out for an increase in the number of epochs it takes for the loss to step down. If the loss becomes stagnant at a certain point, it could be due to the following reasons:
1. The architecture is very simple, with only a few neurons and layers.
2. The model’s capacity is such that it can only learn up to that loss.
3. The learning rate is too high for the model after a certain number of epochs (set a learning-rate scheduler – see the sketch after this list).
4. The data doesn’t have any contextual meaning or features to learn from.
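For the third point, a minimal scheduler sketch in PyTorch – the model, data and scheduler settings are all illustrative, and in practice you would monitor a validation loss rather than the training loss:

```python
import torch
import torch.nn as nn

# Hypothetical 3-class classifier and dummy data.
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Halve the learning rate whenever the monitored loss stops improving
# for 5 epochs; in practice you would monitor a validation loss.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

x, y = torch.randn(64, 4), torch.randint(0, 3, (64,))

for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # pass the monitored metric each epoch

print("final learning rate:", optimizer.param_groups[0]["lr"])
```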


9. Final Hidden Layer

Make sure that the number of neurons in the final hidden layer is not too much larger than the number of neurons in the output layer, as this can lead to increased computation, slower convergence, or less accurate predictions.


10. Transition of Neurons Between Layers

Make sure that the transition in the number of neurons from layer to layer is smooth and has no drastic jumps. This will help the model learn the information and correct its mistakes in a consistent manner.
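As an illustration of what a smooth taper might look like – the widths below are placeholders, not a recommendation from the original post:

```python
import torch.nn as nn

# Layer widths taper gradually (128 -> 64 -> 32 -> 16 -> 3)
# instead of jumping straight from 128 down to 3.
model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32),  nn.ReLU(),
    nn.Linear(32, 16),  nn.ReLU(),
    nn.Linear(16, 3),   # output layer for 3 classes
)
```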


11. Choose Loss Functions and Optimisers According to the Task

If an inappropriate loss function is used, training won’t be smooth and may converge early without achieving the desired result.

Here is a quick list of some of the most popular and widely used loss functions and evaluation metrics (a short selection sketch in PyTorch follows the lists):

Regression Tasks:

  1. MAE – Mean Absolute Error (also known as L1 Loss): Measures the average magnitude of errors in a set of predictions, without considering their direction. Useful when you want a metric that is robust to outliers, as it treats all errors equally.
  2. MSE – Mean Square Error (also known as L2 Loss): Measures the average of the squares of the errors or deviations. Emphasises larger errors more than MAE, making it useful when you want to penalise larger errors more heavily.
  3. RMSE – Root Mean Square Error: The square root of the average of squared errors. Provides a measure of error in the same units as the original data, making it more interpretable. Like MSE, it penalises larger errors more heavily.
  4. MAPE – Mean Absolute Percentage Error: Measures the average magnitude of errors as a percentage of the actual values. Useful for understanding the error in terms of relative percentage, which is helpful for comparing accuracy across datasets with different scales.
  5. MSLE – Mean Square Logarithmic Error: Measures the mean squared difference between the true and predicted values (after taking the log of both). Useful when you care more about the relative differences between predictions and true values, particularly when dealing with exponential growth or skewed data.
  6. Huber Loss: Combines the best properties of MAE and MSE; it is less sensitive to outliers than MSE and more sensitive than MAE. Useful when you want to balance the robustness to outliers with the penalisation for larger errors.

Classification Tasks (cross-entropy is the training loss; the rest are evaluation metrics computed on predictions):

  1. Cross-Entropy Loss (Log Loss): Measures the performance of a classification model whose output is a probability value between 0 and 1. It is useful because it penalises incorrect classifications more heavily when the confidence is higher, promoting well-calibrated probability outputs.
  2. Accuracy: Measures the proportion of correct predictions out of all predictions. It is useful for providing a simple and intuitive measure of performance when the class distribution is balanced.
  3. Precision: Measures the proportion of true positive predictions among all positive predictions. It is useful when the cost of false positives is high, focusing on the accuracy of the positive class.
  4. Recall (Sensitivity or True Positive Rate): Measures the proportion of true positive predictions among all actual positives. It is useful when the cost of false negatives is high, focusing on capturing as many positives as possible.
  5. F1 Score: Harmonic mean of precision and recall. It is useful when you need a balance between precision and recall, especially in cases of imbalanced classes.
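A hedged sketch of how these map onto PyTorch losses and scikit-learn metrics – the labels and predictions are made up for illustration, and MAPE/MSLE are omitted because PyTorch has no built-in modules for them:

```python
import torch.nn as nn
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Training losses, chosen per task.
regression_losses = {
    "MAE (L1)": nn.L1Loss(),
    "MSE (L2)": nn.MSELoss(),
    "Huber":    nn.HuberLoss(delta=1.0),
}
classification_loss = nn.CrossEntropyLoss()  # use BCEWithLogitsLoss for binary tasks

# Evaluation metrics, computed on predictions (not back-propagated).
y_true = [0, 1, 2, 2, 1]  # dummy labels
y_pred = [0, 1, 2, 1, 1]  # dummy predictions
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
```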

12. Loss Function During Evaluation

Use the same loss function during evaluation on the test data, and don’t accidentally switch to a different one. This might sound trivial, but always ensure that the evaluation phase uses the same loss function as the training process; any other setting can lead to errors or a misleading indication of the model’s performance.


13. Null values in the dataset

Make sure that there are absolutely no NaN or null values in your training data, as null activations can spoil your model and turn the loss into NaN. I spent far too much time interrogating the reason for a NaN loss in my perfectly crafted network, only to find that there were a few null values in one of the feature columns. So EDA is highly crucial.
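A quick sketch of the check – the file name is hypothetical, and median imputation is only one of several reasonable ways to handle the nulls:

```python
import pandas as pd
import torch

df = pd.read_csv("train.csv")  # hypothetical training data

# Count nulls per column before building any tensors.
print(df.isna().sum())

# One simple option: impute numeric nulls with the column median
# (dropping the offending rows is another).
df = df.fillna(df.median(numeric_only=True))

# Double-check again at the tensor level.
x = torch.tensor(df.select_dtypes("number").values, dtype=torch.float32)
assert not torch.isnan(x).any(), "NaNs survived preprocessing!"
```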


14. Evaluate Results

Use heat maps or other visual analysis to inspect the results. Don’t just stop at getting a good accuracy and loss at the end of training; the model could have overfitted, or it could have failed to learn one of the target classes because of insignificant features or too little volume in the training data.
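A minimal confusion-matrix heat map along these lines, assuming scikit-learn, seaborn and matplotlib are installed; the labels and predictions are dummy values standing in for a real test set:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Dummy test-set labels and model predictions, for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1, 0, 2]

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["class 0", "class 1", "class 2"],
            yticklabels=["class 0", "class 1", "class 2"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```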


15. Use GPU to train large complex networks

A GPU (graphics processing unit) can speed up your training process many times over and uses the machine’s resources efficiently. Generally, you can opt for services like Google Colab’s premium tier, Paperspace, etc. You can also use a physical GPU in your machine or purchase an additional one; the setup can be excruciating and tiring, but the effort will be worth it when a local GPU is always available for your deep learning tasks. Stay tuned for a post completely dedicated to setting up an NVIDIA GPU for PyTorch and TensorFlow.
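In PyTorch, switching to whatever accelerator is available takes only a few lines; the model below is a placeholder, and the MPS branch assumes a reasonably recent PyTorch build:

```python
import torch
import torch.nn as nn

# Prefer a CUDA GPU if available, otherwise Apple's MPS backend
# (recent PyTorch builds only), otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = nn.Linear(4, 3).to(device)      # hypothetical model
batch = torch.randn(32, 4).to(device)   # data must live on the same device
print(device, model(batch).shape)
```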


16. Network Architecture

Just a friendly reminder before hitting the train button: please verify that your model’s architecture is apt for the given task and that all the layers work as intended on a small portion of the entire dataset before training on all of it (a minimal sketch of this sanity check follows the list below). Here are some ground rules to keep in mind while designing the architecture:

  1. Small Networks: A smaller network with fewer parameters requires less data to avoid overfitting.
  2. Large Networks: Deep networks with many layers and parameters require more data to generalise well and avoid overfitting.
  3. Hyper-parameter Tuning: Play with hyper-parameters such as the number of neurons, number of layers, epochs, type of scheduler, etc., to settle on the best parameters for efficient training.
  4. Learning Rate: Try different learning rates. A learning rate that’s too high can cause a shaky loss curve; in contrast, a very low learning rate can make the model not learn anything at all.
  5. Batch Size: Experiment with different batch sizes. Smaller batch sizes can lead to more noise in the gradient estimate but might help in finding better minima.
  6. Batch Normalisation: Use batch normalisation in your network if a lot of your activations are vanishing and the loss is quite jittery.
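One common way to run that sanity check is to confirm the network can overfit a tiny slice of data; the sketch below uses random tensors as a stand-in for that slice and an illustrative architecture:

```python
import torch
import torch.nn as nn

# Hypothetical model; random tensors stand in for a small data slice.
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

x_small = torch.randn(16, 4)
y_small = torch.randint(0, 3, (16,))

# A healthy architecture should overfit 16 samples almost perfectly;
# if the loss won't go near zero here, fix the model before full training.
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(x_small), y_small)
    loss.backward()
    optimizer.step()

print("final small-batch loss:", loss.item())
```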

17. Practical Considerations

  • Overfitting: Be cautious of overfitting, especially with smaller datasets. Techniques like regularisation, dropout, and early stopping can help mitigate this (an early-stopping sketch follows this list).
  • Validation: Always set aside a portion of your data for validation to ensure your model is generalising well.
  • Experimentation: Start with what you have and iteratively improve. Collect more data if necessary and feasible.
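A hedged early-stopping sketch in PyTorch – the model, data, patience value and checkpoint file name are all illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical model with dummy train and validation batches.
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_tr, y_tr = torch.randn(128, 4), torch.randint(0, 3, (128,))
x_val, y_val = torch.randn(32, 4), torch.randint(0, 3, (32,))

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    optimizer.zero_grad()
    criterion(model(x_tr), y_tr).backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # no improvement for `patience` epochs
            print(f"early stop at epoch {epoch}")
            break
```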

18. Pre-Trained Models

Use pre-trained models to fine-tune for a specific task if you have a small dataset. It saves a lot of time and resources and helps to achieve your tasks in the most efficient way.


That’s a wrap, folks! Thank you for reading all the way through; I hope you picked up some valuable lessons and takeaways that you can leverage to train neural networks. Stay tuned for the next post on using PyTorch to code neural networks. Make sure to drop your thoughts and ideas in the comment section below!

Subscribe to sapiencespace and enable notifications to get regular insights.

Click here to view similar insights.
