
Gradient descent is one of the core concepts in ML that confused me for a long time, but I have finally managed to build a proper intuition around it. You are on the right page if you have difficulty understanding gradient descent or want to deepen your understanding of it.
In the first ever post on this blog, I explained linear regression using the Ordinary Least Squares method. In that method, we define a hypothesis function for the line and partially differentiate the sum of squared residuals, i.e. the sum of (actual dependent variable – predicted dependent variable)^2, which is essentially the mean squared error (MSE) cost function, with respect to the slope and intercept to derive the final equations. In this post we will look at a simple yet powerful technique known as gradient descent, which solves the linear regression problem with very simple math.
Prerequisite: basic knowledge of differentiation and partial differentiation.
We take the example of a sales dataset in which we have a dependent variable named Sales and three independent variables that influence it, namely TV, Radio, and Newspaper.

By visual inspection we can see that Sales and TV are strongly correlated, so we can predict Sales based on TV.
In gradient descent, we follow a very similar approach, but instead of solving the equations directly we initialise random values for the intercept and slope in the hypothesis function. We plug every data point into the partially differentiated equations and sum the results to get a gradient for the intercept and one for the slope, multiply each gradient by a learning rate to get a step size, and then update each parameter: new intercept = old intercept - step size, and likewise for the slope. That is the entire process of gradient descent. I know it can be a bit intimidating, so let's break it down step by step with simple equations and Python code.


We need to find and fit a line as shown above in order to predict a sales figure for a given TV value. To train the model to find this line (y = mx + c), we need to find the slope (m) and the intercept (c).
We use the sum of squared residuals to find the optimised values for the slope and intercept.
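Writing this out (x_i and y_i are my shorthand for the i-th TV and Sales values, and n is the number of data points), the quantity we minimise is:
SSR(m, c) = \sum_{i=1}^{n} \left( y_i - (m x_i + c) \right)^2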

The Math Behind the Magic of Gradient Descent

For the next step, we partially differentiate the sum of squared residuals (the MSE cost) with respect to c and m. These derivatives tell us, for the current values of c and m, which direction reduces the cost, and following them downhill takes us towards its minimum. Next, we plug each value from the dataset into the partial derivative equations.
We have to use all the data points, plugging them in with c = 0 and m = 1 as the initial random values for this worked example.
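For reference, applying the chain rule to the SSR formula above gives the two partial derivatives:
\frac{\partial\,SSR}{\partial c} = -2 \sum_{i=1}^{n} \left( y_i - (m x_i + c) \right)
\frac{\partial\,SSR}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - (m x_i + c) \right)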


The same procedure is carried out for the partial derivative of the sum of squared residuals with respect to m.
All the plugged-in values are summed to give one total for the intercept and one for the slope. Each total (the gradient) is multiplied by the learning rate to compute a step size for c and for m, and each step size is then subtracted from the previous c and m values.
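In symbols, writing the learning rate as α (the code below calls it lr), one update of the two parameters is:
c_{new} = c_{old} - \alpha \cdot \frac{\partial\,SSR}{\partial c} \qquad m_{new} = m_{old} - \alpha \cdot \frac{\partial\,SSR}{\partial m}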

All of this is carried out iteratively over a certain number of epochs so that the model learns the optimised slope and intercept.
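Putting those pieces together, a whole training run looks roughly like the sketch below. This is a minimal, illustrative version that assumes x and y are NumPy arrays; the full version on the advertising data follows in the next section.
import numpy as np
def fit_line(x, y, lr=0.0001, epochs=80):
    m, c = 0.0, 0.0                          # initial guesses for slope and intercept
    for _ in range(epochs):
        residuals = y - (m * x + c)          # actual minus predicted values
        grad_c = -2 * residuals.sum()        # partial derivative of SSR w.r.t. c
        grad_m = -2 * (x * residuals).sum()  # partial derivative of SSR w.r.t. m
        c -= lr * grad_c                     # step size = learning rate * gradient
        m -= lr * grad_m
    return m, c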

The Source Code
Preparing the Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
ad = pd.read_csv("advertising.csv") # I have removed the ID column
ad.head()
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 12.0
3 151.5 41.3 58.5 16.5
4 180.8 10.8 58.4 17.9
Instead of scatterplots to see how the dependent variable is correlated with each independent variable, we can also do a heatmap analysis to understand the relationships between the variables.
sns.heatmap(ad.corr(), cmap='jet', annot=True)
plt.show()

It is clearly visible that the Sales and TV variables are strongly correlated in the positive direction. To learn more about correlation, covariance and heatmaps, click here to go to the page on sapiencespace explaining them in detail.
Now we take only the variables we need and rename them for convenience.
Building the Model
df = ad[['TV','Sales']].copy()
df = df.rename(columns={'TV':'x','Sales':'y'})
# Standardize both x and y because they are on very different scales (TV values are much larger than Sales)
df['x_scaled'] = (df['x'] - df['x'].mean()) / df['x'].std()
df['y_scaled'] = (df['y'] - df['y'].mean()) / df['y'].std()
# Initialize parameters
c, m = 0, 0.02   # Starting guesses for intercept and slope
lr = 0.0001      # Learning rate
epochs = 80      # Number of iterations
# Gradient descent loop
for i in range(epochs):
    # Compute predictions using scaled x
    y_pred = c + m * df['x_scaled']
    # Compute residuals between scaled y and predicted values
    residuals = df['y_scaled'] - y_pred
    # Compute partial derivatives (gradients)
    par_der_c = -2 * residuals.sum()                     # Derivative w.r.t. intercept
    par_der_m = -2 * (df['x_scaled'] * residuals).sum()  # Derivative w.r.t. slope
    # Update intercept and slope
    c -= lr * par_der_c
    m -= lr * par_der_m
# Convert the learned slope and intercept back to the original scale
m_orig = m * (df['y'].std() / df['x'].std())
c_orig = df['y'].mean() - m_orig * df['x'].mean()  # from y = mx + c
# Final learned values for intercept and slope in the original scale
print(f"Final values -> Intercept (c): {c_orig}, Slope (m): {m_orig}")
Final values -> Intercept (c): 7.27245235166918, Slope (m): 0.05546477046955886
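As an optional sanity check (my own addition, not part of the original walkthrough), you can compare these numbers with NumPy's closed-form least-squares fit; the two should roughly agree, and any small gap simply means gradient descent needs more epochs to converge fully.
# Compare with NumPy's built-in least-squares line fit
slope_np, intercept_np = np.polyfit(df['x'], df['y'], 1)
print(f"np.polyfit -> Intercept: {intercept_np}, Slope: {slope_np}")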
Here’s a plot of the learned line on the original scatterplot of Sales against TV.
sns.set_style('darkgrid')
plt.scatter(df['x'], df['y'])
plt.plot(df['x'], c_orig + m_orig * df['x'], 'r')  # learned line: y = m_orig*x + c_orig
plt.xlabel('TV')
plt.ylabel('Sales')
sns.despine()
plt.show()

Here’s a visualisation of how the intercept and slope values change over multiple epochs of gradient descent on the given data.
The slope m describes how much y changes when x changes by one unit. But when x and y are standardised, one unit in the standardised x corresponds to a different amount of change in the original x, and similarly for y. So to convert the slope back to the original scale, we multiply the learned slope by the ratio of the standard deviations.
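For the curious, here is that conversion written out. In this particular run the intercept learned on the standardised data stays at zero (it starts at zero, and since standardised x and y both have mean zero, its gradient there is zero too), so the standardised fit is y_scaled = m_scaled * x_scaled. Substituting the definitions of the scaled variables and rearranging gives:
\frac{y - \bar{y}}{s_y} = m_{scaled} \cdot \frac{x - \bar{x}}{s_x}
\Rightarrow\quad y = \underbrace{m_{scaled}\,\frac{s_y}{s_x}}_{m_{orig}}\, x \;+\; \underbrace{\bar{y} - m_{orig}\,\bar{x}}_{c_{orig}}
which is exactly what the two conversion lines in the code compute.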
That’s a wrap, thank you for reading all along. I hope you got to learn some valuable concepts in a very simple manner. Try to program this logic on your own, implement the code on any new dataset that has correlated features, and post your doubts, if any, in the comment section. This approach of partially differentiating the cost function built from the hypothesis, with any number of parameters, and then stepping towards the minimum carries over to many other use cases.
The next post, on Stochastic Gradient Descent, is set to roll out very soon, so stay tuned and happy coding. Comment below if you are interested in learning how to make the animated plot, and I will include the code and explanation for it in the next post.
Cover picture and title image credits – Unsplash content creators
Click here to view similar insights.