
Linear regression is one of the most basic concepts in machine learning, and it falls under the category of supervised learning.
For those who don’t know what supervised learning is: it is a machine learning process in which the algorithm learns from training data rather than from instructions that are separately programmed to solve a task. In supervised learning, the algorithm is given a labelled data set in which the data is well structured into inputs and outputs (target variables).
The main objective of a supervised learning model is to develop a mapping function that can accurately predict the target variable. Such a model can be used for both regression (when the target variable is continuous) and classification (when the target variable is categorical) tasks.
Logic behind Linear Regression
Before fitting a line, linear regression makes a few assumptions about the data (a quick way to check them in code follows the list):
- Linearity: A change in an independent variable is associated with a constant change in the dependent variable.
- Independence of Errors: The errors (residuals), which are the differences between the observed and predicted values, should be independent of each other.
- Normality of Errors: The errors should be normally distributed. This assumption typically applies to the errors, not the dependent variable itself.
- No or Little Multicollinearity: When there are multiple independent variables in the model, they should not be highly correlated with each other. Multicollinearity can make it difficult to interpret the individual effects of predictors.
- No Endogeneity: The independent variables should not be affected by the dependent variable. In other words, the model assumes a causal relationship from predictors to the dependent variable and not the other way around.
- No Outliers: Extreme values or outliers in the data should be minimised or appropriately handled. Outliers can unduly influence the estimated coefficients and affect model performance.
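These assumptions are usually verified after a model has been fitted. As a minimal sketch (not part of the original post), assuming a fitted statsmodels result res and a predictor DataFrame X, a few quick diagnostics could look like this:

from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

resid = res.resid                                # residuals of the (hypothetical) fitted model
print(stats.shapiro(resid))                      # normality of errors
print(durbin_watson(resid))                      # values near 2 suggest independent errors
vif = [variance_inflation_factor(X.values, i)    # one VIF per predictor
       for i in range(X.shape[1])]
print(vif)                                       # large VIFs flag multicollinearity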
Once the above conditions are met and the data is ready for the model, our end goal is to find the equation of a line which fits like this →

The red line is our objective, and we need to develop a hypothesis function in order to obtain it. We will be making use of the ordinary least squares (OLS) method in this post.
First we need to define our hypothesis function, which is the equation y = mx + c + E, where m is the slope, c is the intercept and E is the error term.

The dataset I have used for this project is available here: Click here
From this dataset, we have to build a linear regression model that predicts sales based on the amount spent on a specific media platform, which could be TV, newspaper or radio.
To find the best column for the independent variable, after EDA (exploratory data analysis) we make a simple scatter plot of each variable against Sales and get the following results:

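The plots above can be reproduced with a few lines of matplotlib. This is only a sketch: it assumes the data has been read into a pandas DataFrame called ad (as in the code section below) and that the columns are named TV, Radio, Newspaper and Sales.

import pandas as pd
import matplotlib.pyplot as plt

ad = pd.read_csv('advertising.csv')                # filename assumed
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ['TV', 'Radio', 'Newspaper']):
    ax.scatter(ad[col], ad['Sales'], alpha=0.6)    # spend on one platform vs Sales
    ax.set_xlabel(col)
axes[0].set_ylabel('Sales')
plt.show()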
From the above visualisations, we can see that there is a linear correlation between the Sales and TV columns.
So TV is our independent variable and Sales is the dependent variable which we want to predict.
The actual y value minus the predicted value is the residual, and the OLS method aims at minimising the sum of the squared residuals.
This equation is the general linear regression model; from here we will derive the m, c and E parameters.

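Written out, the model for each of the n observations and the quantity that OLS minimises are as follows (the error E for observation i is written as ε_i):

y_i = m x_i + c + \varepsilon_i, \qquad i = 1, \dots, n

S(m, c) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - m x_i - c)^2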
Now we have to minimise the above expression to obtain the c and m parameters; essentially, we are aiming to minimise the sum of the squared errors E_i. The natural tool for minimisation is differentiation, and since we are dealing with two unknown parameters, m and c, we use partial differentiation with respect to c and m respectively.

First we partially differentiate the above equation with respect to c, then m.



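For readers following along, the two steps can be summarised as follows: setting each partial derivative of S(m, c) to zero and solving (substituting the expression for c into the second equation) gives the least squares estimators.

\frac{\partial S}{\partial c} = -2 \sum_{i=1}^{n} (y_i - m x_i - c) = 0 \;\Rightarrow\; c = \bar{y} - m\bar{x}

\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i (y_i - m x_i - c) = 0 \;\Rightarrow\; m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}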
The last step is performed to make the equation more intuitive: the numerator denotes the covariance of the two variables and the denominator represents the variance of x (the independent variable), so the ratio tells us how much y changes per unit change in x. Additionally, this form of representation makes the computation efficient and fast.
From this we can obtain m and c for our regression line. Now let’s get into the code →

The Program/Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

ad = pd.read_csv('advertising.csv')  # read in the advertising dataset linked above (filename assumed)
df = ad[['TV','Sales']]              # keep only the two columns we need
df.head()
|   | TV    | Sales |
|---|-------|-------|
| 0 | 230.1 | 22.1  |
| 1 | 44.5  | 10.4  |
| 2 | 17.2  | 12.0  |
| 3 | 151.5 | 16.5  |
| 4 | 180.8 | 17.9  |
Let’s keep the variables as x and y for a simplified view:
df = df.rename(columns={'TV':'x','Sales':'y'})
df["x-x'"] = df.x - np.mean(df.x)              # deviations of x from its mean
df["y-y'"] = df.y - np.mean(df.y)              # deviations of y from its mean
df["(y-y')*(x-x')"] = df["y-y'"]*df["x-x'"]    # cross products (numerator terms)
df["(x-x')^2"] = df["x-x'"]**2                 # squared deviations (denominator terms)
df.head()
|   | x     | y    | x-x'      | y-y'    | (y-y')*(x-x') | (x-x')^2     |
|---|-------|------|-----------|---------|---------------|--------------|
| 0 | 230.1 | 22.1 | 83.0575   | 6.9695  | 578.869246    | 6898.548306  |
| 1 | 44.5  | 10.4 | -102.5425 | -4.7305 | 485.077296    | 10514.964306 |
| 2 | 17.2  | 12.0 | -129.8425 | -3.1305 | 406.471946    | 16859.074806 |
| 3 | 151.5 | 16.5 | 4.4575    | 1.3695  | 6.104546      | 19.869306    |
| 4 | 180.8 | 17.9 | 33.7575   | 2.7695  | 93.491396     | 1139.568806  |
m = sum(df["(y-y')*(x-x')"])/sum(df["(x-x')^2"])
m
0.055464770469558805
c = np.mean(df.y) - m*np.mean(df.x)  # intercept: mean(y) minus m times mean(x)
c
6.974821488229896
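As a quick sanity check (not part of the original walkthrough), numpy can fit the same line directly: np.polyfit with degree 1 performs a least squares fit and returns the slope and intercept, which should agree with the values computed above.

m_check, c_check = np.polyfit(df.x, df.y, 1)   # degree-1 least squares fit: slope, intercept
print(m_check, c_check)                        # expect roughly 0.0555 and 6.97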
So our final equation is y = 6.974 + 0.0554*x.
If we plot this line on a graph we would get something like this →
x = np.linspace(0, 300, 100)     # span the range of TV spend in the data
plt.plot(x, 6.974 + 0.0554*x)    # the fitted line y = c + m*x
plt.show()

Instead of doing all the calculations manually, we can use a module called statsmodels.api.
First we need to split the dataset into train and test sets.
X = ad['TV']
y = ad['Sales']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)
lr = sm.OLS(y_train, X_train_sm).fit()  # OLS = ordinary least squares
lr.params
const 6.926214
TV 0.055278
dtype: float64
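Beyond the raw coefficients, statsmodels can also print a full regression summary (R-squared, standard errors, t-statistics and p-values), which is handy for judging the quality of the fit:

print(lr.summary())   # full OLS regression report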
We can see that the values obtained are very close to those from the manual calculation; small differences are expected because the model here is fitted on the 70% training split rather than on the full dataset.
sns.set_style('whitegrid')
plt.scatter(X_test, y_test)                                          # held-out test points
plt.plot(X_test, lr.params['const'] + lr.params['TV']*X_test, 'r')   # line from the fitted coefficients
plt.xlabel('TV')
plt.ylabel('Sales')
plt.title('Linear Regression')
plt.show()

The error term which we mentioned at the beginning can be quantified using the root mean square error (RMSE), which is basically the square root of the sum of (predicted value − actual value)^2 divided by n. This error can be minimised further using optimisation techniques.
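A minimal sketch of how that error can be computed for the model fitted above, reusing the variable names from the code in this post:

X_test_sm = sm.add_constant(X_test)              # add the intercept column, matching the training design
y_pred = lr.predict(X_test_sm)                   # predicted Sales on the held-out test set
rmse = np.sqrt(np.mean((y_test - y_pred)**2))    # root mean square error
print(rmse)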
That’s a wrap: we have covered the math and the code behind linear regression using the OLS method. If you have any questions or need further clarification, please feel free to ask in the comment section below. Your curiosity and engagement are highly valued.
Thank you for reading along. Subscribe to sapiencespace and enable notifications to get regular insights.
Click here to view all the concepts related to machine learning.