Master the Magic of Mini Batch Gradient Descent


Reading Time: 3 Minutes

Mini-batch gradient descent strikes a balance between batch gradient descent and stochastic gradient descent. The balance is so neat that it can look like magic, so let’s uncover the logic with a simple framework and understand its internals.

One of the biggest advantages of mini-batch gradient descent is that it averages the gradients over small batches drawn from the dataset, instead of updating on a single randomly chosen row (as in stochastic gradient descent) or sweeping the full dataset for every update across many epochs (as in batch gradient descent). A short sketch after the list below makes this averaging concrete.

I highly recommend having a look at the previous two posts explaining gradient descent; they lay the foundation and the skeleton of what we will implement in this post. Links to the previous posts:

Batch gradient averaging serves some very important purposes:

  1. Smooth updates: it mitigates the harsh impact that any individual sample’s gradient could have, leading to more stable and consistent updates.
  2. Handling outliers: it helps avoid erratic parameter updates caused by outlier data points that cannot simply be removed because they still play an important role in the data’s dynamics.
  3. Approximation of the full-batch gradient: each batch gradient approximates the full-batch gradient, so meaningful progress is made with less data per update, saving compute and time.
  4. Variance reduction: averaging makes the optimisation path less noisy and keeps a single extreme gradient from dominating an update.
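
To make the averaging concrete, here is a minimal NumPy sketch on a toy dataset (the arrays, the seed and the batch size are illustrative, not the advertising data used later). It shows that a mini-batch gradient is simply the mean of the per-sample gradients, sitting between a noisy single-sample estimate and the exact full-batch gradient:

import numpy as np

# Toy data: y = 2x + 1 plus a little noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

m, c = 0.0, 0.0  # current slope and intercept guesses

# Per-sample gradient of the squared residual (y_i - (c + m*x_i))^2 w.r.t. the slope
residual = y - (c + m * x)
grad_m_per_sample = -2 * x * residual  # one gradient value per sample

# One sample gives a noisy estimate; a mini-batch of 5 averages it out
batch_idx = rng.choice(len(x), size=5, replace=False)
print(f"one sample: {grad_m_per_sample[batch_idx[0]]:.2f}, "
      f"mini-batch of 5: {grad_m_per_sample[batch_idx].mean():.2f}, "
      f"full batch: {grad_m_per_sample.mean():.2f}")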

In this post, we’ll follow the same approach as the earlier ones: to find the regression line, we partially differentiate the sum of squared residuals with respect to the slope and the intercept. This time, the dataset is randomly shuffled at the start of each epoch, batches are extracted from the shuffled data, and a gradient descent update of the slope and intercept is performed on every batch.
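
For reference, these are the mini-batch gradients that the code below implements, written here for the squared residuals averaged over a batch of B samples (the loss L and the symbol B are notation introduced for this post; they do not appear in the code):

\[
L = \frac{1}{B}\sum_{i=1}^{B}\bigl(y_i - (c + m x_i)\bigr)^2,\qquad
\frac{\partial L}{\partial c} = -\frac{2}{B}\sum_{i=1}^{B}\bigl(y_i - (c + m x_i)\bigr),\qquad
\frac{\partial L}{\partial m} = -\frac{2}{B}\sum_{i=1}^{B} x_i\bigl(y_i - (c + m x_i)\bigr)
\]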


Python Code for Mini Batch Gradient Descent

Import the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading and preprocessing the data

The CSV file was downloaded from Kaggle.

ad = pd.read_csv("advertising.csv")
df = ad[['TV', 'Sales']].copy()  # copy so the scaled columns added later don't modify `ad`
df = df.rename(columns={'TV': 'x', 'Sales': 'y'})

Code to batch the data and compute the gradients

# Standardize both x and y so one learning rate works for both parameters
df['x_scaled'] = (df['x'] - df['x'].mean()) / df['x'].std()
df['y_scaled'] = (df['y'] - df['y'].mean()) / df['y'].std()

# Initialize parameters and hyperparameters
c, m = 0, 0.02  # starting guesses for intercept and slope
lr = 0.001      # learning rate
epochs = 30     # passes over the dataset (matches the output shown below)
batch = 5       # mini-batch size

for i in range(epochs):
    # Shuffle the dataset at the start of each epoch
    df = df.sample(frac=1).reset_index(drop=True)

    for j in range(0, len(df), batch):
        x_b = df['x_scaled'].iloc[j : j + batch]
        y_b = df['y_scaled'].iloc[j : j + batch]

        y_pred = c + m * x_b
        residual = y_b - y_pred

        # Partial derivatives (gradients), averaged over the mini-batch
        par_der_c = -2 * residual.mean()          # derivative of the mean squared residual w.r.t. the intercept
        par_der_m = -2 * (x_b * residual).mean()  # derivative of the mean squared residual w.r.t. the slope

        # Update intercept and slope
        c -= lr * par_der_c
        m -= lr * par_der_m

    print(f"Epoch {i}: c = {c}, m = {m}")

# Convert the learned slope and intercept back to the original scale
m_orig = m * (df['y'].std() / df['x'].std())
c_orig = df['y'].mean() + c * df['y'].std() - m_orig * df['x'].mean()  # c stays ~0 because both variables are standardized

# Final learned values for intercept and slope in the original scale
print(f"Final values -> Intercept (c): {c_orig}, Slope (m): {m_orig}")

Output->
Epoch 0: c = -5.272691181568167e-05, m = 0.08756708909727037
Epoch 1: c = -0.0001560078925206223, m = 0.14984546502616755
...
Epoch 28: c = 4.819179878551715e-05, m = 0.8141059956418741
Epoch 29: c = 9.73543426171841e-06, m = 0.8207224705675141
Final values -> Intercept (c): 7.7031921406993265, Slope (m): 0.05051130019756642
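
As a quick sanity check (not part of the original walkthrough), the learned parameters can be compared with an ordinary least-squares fit. Assuming df, c_orig and m_orig from the code above are still in scope, np.polyfit gives the closed-form slope and intercept directly:

# Closed-form least-squares fit on the original (unscaled) columns
m_ols, c_ols = np.polyfit(df['x'], df['y'], 1)
print(f"OLS fit       -> Intercept: {c_ols:.4f}, Slope: {m_ols:.4f}")
print(f"Mini-batch GD -> Intercept: {c_orig:.4f}, Slope: {m_orig:.4f}")

The two should agree closely; any remaining gap shrinks as the learning rate and number of epochs are tuned.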

Graph of the line plotted over the data points->

sns.set_style('darkgrid')

plt.scatter(df['x'], df['y'])
plt.plot(df['x'], c_orig + m_orig * df['x'], 'r')  # regression line from the learned parameters
plt.xlabel('TV')
plt.ylabel('Sales')
sns.despine()
plt.show()

The small fluctuations introduced by the mini-batches can also help a model escape shallow local minima, which tends to improve how well it generalises to the data.

In conclusion, mini-batch gradient descent combines the benefits of stochastic gradient descent (SGD) and full-batch gradient descent: it reduces the computational load per update while providing smoother, more stable updates, enabling faster convergence overall. This is exactly what makes mini-batch gradient descent practical for training modern models such as large language models.

That’s a wrap folks, thank you for reading all the way through. I hope you took away some valuable lessons from this post. I highly recommend running the code yourself and checking out the math behind the partial derivatives; it builds a solid framework for tackling any optimisation or machine learning problem by understanding the pattern in the data, the mathematical function associated with it, and its tunable parameters.

Post your thoughts or doubts, if any, in the comment section below. Stay tuned and subscribe to sapiencespace for more amazing content.

Click here to view similar insights.

