Stochastic Gradient Descent – SGD – Simpler Gradient Descent

Reading Time: 3 Minutes

In the previous post, we saw what Gradient Descent is and how it can be implemented with the help of a few lines of code. When I was first learning the basics of AI, I used to think that Stochastic Gradient Descent was something much more complicated than Gradient Descent. But in reality, it is simpler. The only difference is that, instead of summing the residuals over all the data points and taking their derivatives, we pick a single random data point at each iteration and compute the residual, the step size and the updated values of the unknown parameters from that point alone.

In this post we will solve the same problem of finding the regression line for the sales dataset. If you have not seen it yet, I highly recommend having a glance at the previous post explaining Gradient Descent to understand the context of what we will be dealing with in Stochastic Gradient Descent.

In simple terms, ‘Stochastic’ basically means randomness. And that is exactly what we will be doing in this post: instead of computing the residuals for all the data points, we will randomly select one point per iteration of the training loop. Each update becomes much cheaper to compute, at the cost of being a little noisier.

Steps for Stochastic Gradient Descent:

  1. Random Sampling: At each iteration, you randomly select a single data point.
  2. Compute Residuals: Instead of computing the residuals for the entire dataset, compute the residual only for that selected data point.
  3. Update Parameters: Use the same gradient update logic, but with gradients computed from just that one data point. This updates the slope (m) and intercept (c) more frequently, with smaller but noisier updates (a single update step is sketched just below).
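
To make the three steps concrete, here is what one stochastic update looks like in isolation (a minimal sketch; the function name sgd_step is illustrative, and the full training loop on the sales data appears later in the post):

def sgd_step(x_i, y_i, c, m, lr):
    # Step 1 happens outside this function: (x_i, y_i) is the randomly sampled point
    y_pred = c + m * x_i                    # prediction for that single point
    residual = y_i - y_pred                 # Step 2: residual for that point only
    c = c - lr * (-2 * residual)            # Step 3: update intercept
    m = m - lr * (-2 * x_i * residual)      # Step 3: update slope
    return c, m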

Here is a quick comparison of Gradient Descent and Stochastic Gradient Descent.

Method | Time Complexity | Space Complexity | Compute Usage
--- | --- | --- | ---
Gradient Descent | O(n⋅d) per iteration | O(n⋅d) | More accurate updates, but high compute cost
Stochastic Gradient Descent | O(d) per iteration | O(d) | Faster per iteration, but more iterations needed

Coding SGD


import random

# Assumes df is the sales DataFrame from the previous post, with columns 'x' and 'y'
# Standardize both x and y
df['x_scaled'] = (df['x'] - df['x'].mean()) / df['x'].std()
df['y_scaled'] = (df['y'] - df['y'].mean()) / df['y'].std()

# Initialize parameters
c, m = 0, 0.2  # Starting guess for intercept and slope
lr = 0.0001  # Learning rate
epochs = 10000  # Number of iterations

for i in range(epochs):
    # Selecting a random point in the dataframe
    rand_index = random.randint(0, len(df)-1) 
    x_i = df['x_scaled'].iloc[rand_index]
    y_i = df['y_scaled'].iloc[rand_index]
    
    # Compute predictions using the randomly selected x
    y_pred = c + m * x_i
    
    # Compute residuals between randomly selected y and predicted values
    residual = y_i - y_pred
    
    # Compute partial derivatives (gradients)
    par_der_c = -2 * residual  # Derivative w.r.t intercept
    par_der_m = -2 * x_i * residual  # Derivative w.r.t slope
    
    # Update intercept and slope
    c -= lr * par_der_c
    m -= lr * par_der_m
    

# Convert the learned slope and intercept back to the original scale
m_orig = m * (df['y'].std() / df['x'].std())
c_orig = df['y'].mean() - m_orig * df['x'].mean()

# Final learned values for intercept and slope in the original scale
print(f"Final values -> Intercept (c): {c_orig}, Slope (m): {m_orig}")

Here I have set the number of epochs quite high because the chosen initial values of c and m and the small learning rate require it. Increasing the learning rate by a factor of 10 cuts the requirement from 10,000 epochs down to about 1,000, reducing the computation by a factor of 10.

But the model can also be trained with far fewer epochs by choosing different parameters: c, m = 0, 1 with lr = 0.01 and 80 epochs gives the answer very quickly. So it is all a matter of understanding each parameter and then configuring them according to the problem. To learn more about mastering the training of AI models, check out my blog post on the art of training neural networks.
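
If you want to experiment with these settings yourself, one convenient way (just a sketch; the helper name run_sgd is mine, not part of the original code) is to wrap the training loop above in a function and pass the hyperparameters in:

def run_sgd(df, c=0.0, m=1.0, lr=0.01, epochs=80):
    # Runs SGD on the standardized columns and returns the fitted intercept and slope
    for _ in range(epochs):
        i = random.randint(0, len(df) - 1)   # pick a random data point
        x_i = df['x_scaled'].iloc[i]
        y_i = df['y_scaled'].iloc[i]
        residual = y_i - (c + m * x_i)
        c -= lr * (-2 * residual)
        m -= lr * (-2 * x_i * residual)
    return c, m

# The faster configuration mentioned above
c_scaled, m_scaled = run_sgd(df, c=0, m=1, lr=0.01, epochs=80)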

Finally, after training, we get a line that fits our data well for the regression task.

[Figure: regression line fitted with SGD]

In the next post, we’ll explore mini-batch Gradient Descent, which strikes a balance between Gradient Descent and Stochastic Gradient Descent in terms of accuracy and computational efficiency. In mini-batch Gradient Descent, the dataset is divided into small batches of data points, each typically containing multiple random points, and at each iteration the gradient is computed using only one such batch.
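
As a small preview (a sketch only, reusing the standardized columns and the c, m, lr and epochs values from above; the batch size of 8 is an arbitrary illustration), a mini-batch update replaces the single random index with a random batch and averages the gradients over it:

batch_size = 8  # illustrative choice

for _ in range(epochs):
    batch = df.sample(batch_size)            # random mini-batch of points
    x_b = batch['x_scaled']
    y_b = batch['y_scaled']
    residuals = y_b - (c + m * x_b)
    c -= lr * (-2 * residuals.mean())        # gradient averaged over the batch
    m -= lr * (-2 * (x_b * residuals).mean())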

Thank you for reading all along. I hope you got to learn some core concepts in a very simple manner. Post your thoughts or doubts, if any, in the comment section below. Stay tuned and subscribe to sapiencespace for more amazing content.

Cover picture and title image credits – unsplash content creators

Click here to view similar insights.

