Hassle-Free Gradient Descent with PyTorch – A 101


Reading Time: 4 Minutes

Uncover the magic of gradient descent without the hassle of manually calculating the derivatives needed for back-propagation.

The core idea of gradient descent is to minimize a loss function with respect to the parameters involved in the equation. In the previous post on gradient descent on sapiencespace, we covered the fundamentals of choosing a loss function for linear regression and mathematically deriving its derivatives.
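As a quick refresher, for a simple linear model y = m*x + c with a mean-squared-error loss, the hand-derived gradients from that post look roughly like this (a minimal NumPy sketch of the manual approach, shown only for contrast with the automatic version below):

import numpy as np

def manual_gradients(x, y_true, m, c):
    # Predictions of the linear model y = m*x + c
    y_pred = m * x + c
    residual = y_true - y_pred
    # Derivatives of the mean squared error with respect to m and c
    dL_dm = -2 * np.mean(x * residual)
    dL_dc = -2 * np.mean(residual)
    return dL_dm, dL_dc

With autograd, PyTorch takes care of exactly this step for us.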

Here we will follow a similar approach to the previous post, but as the title suggests, there is no stress of manually calculating the gradients. The first step is to set up the feature and target variables. The goal is to predict a target variable based on a single feature x. As in the previous post, we use an advertising sales dataset with a dependent variable named Sales and three independent variables that influence it: TV, Radio, and Newspaper.

import pandas as pd
import torch
import seaborn as sns
import matplotlib.pyplot as plt

# Load the advertising dataset
df = pd.read_csv("advertising.csv")

# Visualize how each feature relates to Sales
sns.pairplot(df, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=4, kind='scatter')
plt.show()
[Pair plot of Sales against TV, Radio, and Newspaper]

Since Sales and TV are strongly correlated and show a clear linear pattern, we will use them as the training variables.

# Standardize both x and y so that they are on a comparable scale
# (TV spend is in the hundreds while Sales is in the tens)
df['x_scaled'] = (df['TV'] - df['TV'].mean()) / df['TV'].std()
df['y_scaled'] = (df['Sales'] - df['Sales'].mean()) / df['Sales'].std()

# Training data (x, y_true)
x = torch.tensor(df['x_scaled'].values, dtype=torch.float32)  # Feature
y_true = torch.tensor(df['y_scaled'].values, dtype=torch.float32)  # Target

Next, we need to decide on our residual function and loss function in order to find the optimal line for the given data.

Loss function – mean squared error (the average of the squared residuals)

Residual function – the difference between the actual and predicted values
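
In PyTorch these two definitions translate directly into tensor operations. A minimal sketch, using the x and y_true tensors created above and placeholder values for the slope and intercept (the real parameters are randomly initialized in the next step):

# Illustrative values for slope and intercept
m_demo = torch.tensor(0.5)
c_demo = torch.tensor(0.1)

# Residuals: actual minus predicted values
residuals = y_true - (m_demo * x + c_demo)

# Mean squared error loss
loss_demo = torch.mean(residuals ** 2)
print(loss_demo)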

The parameters – slope and intercept – are randomly initialized, and the learning rate hyperparameter is set.

# Parameters
m = torch.randn(1, requires_grad=True, dtype=torch.float32)
c = torch.randn(1, requires_grad=True, dtype=torch.float32) 
learning_rate = 1e-2

Main Training Loop

# Storage for visualization
epochs = 80
loss_history = []
m_history = []
c_history = []

# Training loop
for epoch in range(epochs):
    # Forward pass
    y_pred = c + m * x
    loss = torch.mean((y_true - y_pred) ** 2)
    loss_history.append(loss.item())  # Convert to Python float for plotting

    # Backward pass
    loss.backward()  # once this is called, there is no need to manually derive the gradients of the loss with respect to each parameter

    # Update parameters
    with torch.no_grad():
        m -= learning_rate * m.grad
        c -= learning_rate * c.grad

        # Zero gradients
        m.grad.zero_()
        c.grad.zero_()

    # Store parameters
    m_history.append(m.item())  # .item() returns a plain Python float, so .detach().numpy() is not needed
    c_history.append(c.item())

The parameters are updated inside a torch.no_grad() block, which disables gradient tracking. PyTorch builds a computational graph during the forward pass, and calling loss.backward() traverses that graph to compute the gradients automatically using the chain rule.

Using no_grad() does not change the values stored in .grad for the parameters; it simply tells PyTorch not to track the update computations (parameter minus learning rate times gradient) in the computational graph. If PyTorch tracked these updates, it would consume unnecessary memory and could lead to incorrect gradient computations in the next iteration.

The gradients are then zeroed out so that they do not accumulate. If they are not zeroed, the next backward pass will add the new gradients to the previous ones, leading to incorrect updates.
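
You can see this accumulation behaviour with a tiny standalone experiment (a sketch unrelated to the sales data, using a single scalar tensor):

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
print(w.grad)  # tensor(3.) – gradient of 3*w with respect to w

(3 * w).backward()
print(w.grad)  # tensor(6.) – the second backward pass added to the first

w.grad.zero_()
print(w.grad)  # tensor(0.) – back to a clean slate for the next iteration

This is also why the optimizers in torch.optim expose a zero_grad() method, which plays the same role as the manual zero_() calls in our loop.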

Visualizing the results

# Visualization
import seaborn as sns
sns.set_style('darkgrid')

fig, axs = plt.subplots(3, 1, figsize=(8, 12))

# Loss curve
axs[0].plot(range(epochs), loss_history, label="Loss")
axs[0].set_title("Loss over Epochs")
axs[0].set_xlabel("Epoch")
axs[0].set_ylabel("Loss")
axs[0].legend()

# Parameter updates
axs[1].plot(range(epochs), m_history, label="Slope (m)")
axs[1].plot(range(epochs), c_history, label="Intercept (c)")
axs[1].set_title("Parameter Updates over Epochs")
axs[1].set_xlabel("Epoch")
axs[1].set_ylabel("Parameter Value")
axs[1].legend()

# Convert the slope and intercept back to the original (unscaled) units
m_orig = m_history[-1] * (df['Sales'].std() / df['TV'].std())
c_orig = df['Sales'].mean() - m_orig * df['TV'].mean()  # from y = m*x + c, with the fitted intercept on the scaled data close to zero

print(m_orig, c_orig)
y_pred_np = m_orig * df['TV'] + c_orig

axs[2].scatter(df['TV'], df['Sales'], label="True Values")
axs[2].plot(df['TV'], y_pred_np, color='r', label="Predictions")
axs[2].set_title("Predictions vs. True Values")
axs[2].set_xlabel("Input (x)")
axs[2].set_ylabel("Output (y)")
axs[2].legend()

sns.despine()
plt.tight_layout()
plt.show()
[Loss curve, parameter trajectories, and the fitted line against the true data]

The final slope and intercept values are 0.051474068850748574 and 7.561624231013808 respectively. These numbers are very close to the ones obtained with the previous approaches explored on this blog.
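
As a quick sanity check (an optional sketch, assuming scikit-learn is installed), the same coefficients can be recovered with an ordinary least-squares fit on the unscaled data:

from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the raw TV and Sales columns
lr = LinearRegression()
lr.fit(df[['TV']], df['Sales'])

print(lr.coef_[0], lr.intercept_)  # should land close to the gradient descent values above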

One quick note: as always, I didn't land on the right learning rate and number of epochs on the first try. Choosing a learning rate is about understanding the complexity of the model and the data it trains on. An extremely low or high learning rate will not help the model learn anything useful, so it comes down to experimenting and finding the right values.

Here is a small clip to help you understand how learning rates can influence the predictions, the parameter updates, and the loss curve.

The flat segments in the loss history occur when the learning rate is too small and the model effectively stops learning. A very small learning rate makes the parameter updates so negligible that they appear to have no effect.
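
If you want to reproduce that experiment yourself, a minimal sketch is to wrap the training loop in a function and run it for a few candidate learning rates (train_linear_model here is a hypothetical helper that simply repackages the loop from above and returns its loss history):

def train_linear_model(x, y_true, learning_rate, epochs=80):
    # Same loop as above, repackaged so it can be rerun with different settings
    m = torch.randn(1, requires_grad=True, dtype=torch.float32)
    c = torch.randn(1, requires_grad=True, dtype=torch.float32)
    loss_history = []
    for _ in range(epochs):
        y_pred = c + m * x
        loss = torch.mean((y_true - y_pred) ** 2)
        loss_history.append(loss.item())
        loss.backward()
        with torch.no_grad():
            m -= learning_rate * m.grad
            c -= learning_rate * c.grad
            m.grad.zero_()
            c.grad.zero_()
    return loss_history

# Compare loss curves for a few learning rates
for lr in [1e-4, 1e-2, 1e-1]:
    plt.plot(train_linear_model(x, y_true, lr), label=f"lr={lr}")
plt.legend()
plt.show()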

That's a wrap, thank you for reading all along. I hope you got to learn some valuable concepts in a simple manner. Try to program this logic on your own, apply the PyTorch code to other problems, and post any doubts in the comment section. I also highly encourage you to modify the code and implement stochastic gradient descent and mini-batch gradient descent to reinforce the fundamentals learnt from this post.
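
As a starting point for the mini-batch variant, one possible modification (a sketch, with batch_size chosen arbitrarily) is to shuffle the data each epoch and update the parameters once per batch instead of once per epoch:

batch_size = 32  # chosen arbitrarily for this sketch
n = x.shape[0]

# Re-initialize the parameters so the mini-batch run starts fresh
m = torch.randn(1, requires_grad=True, dtype=torch.float32)
c = torch.randn(1, requires_grad=True, dtype=torch.float32)

for epoch in range(epochs):
    # Shuffle the data at the start of every epoch
    perm = torch.randperm(n)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        x_batch, y_batch = x[idx], y_true[idx]

        # Forward and backward pass on the mini-batch only
        y_pred = c + m * x_batch
        loss = torch.mean((y_batch - y_pred) ** 2)
        loss.backward()

        # Update once per batch instead of once per epoch
        with torch.no_grad():
            m -= learning_rate * m.grad
            c -= learning_rate * c.grad
            m.grad.zero_()
            c.grad.zero_()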

Subscribe to sapiencespace and enable notifications to get regular insights.

Cover picture and title image credits – unsplash content creators

Click here to view similar insights.
