#1 Linear Regression Simplified


Reading Time: 7 minutes

Linear regression is one of the most basic concepts in machine learning. It falls under the category of supervised learning.

For those who don’t know what supervised learning is: it is a machine learning approach in which the algorithm learns from training data rather than from explicitly programmed, task-specific instructions. In supervised learning, the algorithm is given a labelled data set in which the data is clearly structured into inputs and outputs (target variables).

The main objective of a supervised learning model is to develop a mapping function that can accurately predict the target variable. This approach can be used for both regression (when the target variable is continuous) and classification (when the target variable is categorical) tasks.

Logic behind Linear Regression

Linear regression rests on a few key assumptions about the data; a short sketch for checking some of them follows the list below.

  1. Linearity: A change in an independent variable is associated with a constant change in the dependent variable.
  2. Independence of Errors: The errors (residuals), which are the differences between the observed and predicted values, should be independent of each other.
  3. Normality of Errors: The errors should be normally distributed. This assumption typically applies to the errors, not the dependent variable itself.
  4. No or Little Multicollinearity: When there are multiple independent variables in the model, they should not be highly correlated with each other. Multicollinearity can make it difficult to interpret the individual effects of predictors.
  5. No Endogeneity: The independent variables should not be affected by the dependent variable. In other words, the model assumes a causal relationship from predictors to the dependent variable and not the other way around.
  6. No Outliers: Extreme values or outliers in the data should be minimised or appropriately handled. Outliers can unduly influence the estimated coefficients and affect model performance.
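
As mentioned above, here is a minimal sketch of how assumptions 3 and 4 could be checked in Python. It assumes an array residuals (observed minus predicted y) and a DataFrame of predictors X; both names are illustrative and are not part of this article's own code.

from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 3. Normality of errors: Shapiro-Wilk test on the residuals
stat, p_value = stats.shapiro(residuals)
print(p_value)  # a large p-value gives no evidence against normality

# 4. Multicollinearity: variance inflation factor (VIF) for each predictor
X_const = sm.add_constant(X)
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns) if col != 'const'}
print(vif)  # values above roughly 5-10 suggest strong multicollinearity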

Once the above conditions are met and the data is ready for the model, our end goal is to find the equation of a line that fits the data like this →

The red line is our objective, and we need to develop a hypothesis function in order to obtain it. We will be making use of the ordinary least squares (OLS) method in this post.

First we need to define our hypothesis function, which is the equation y = mx + c + E, where m is the slope, c is the intercept and E is the error term.
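
As a tiny sketch, the prediction part of this hypothesis can be written as a Python function; the name hypothesis is just illustrative:

def hypothesis(x, m, c):
    # predicted y for a given x, with slope m and intercept c (the error E is whatever is left over)
    return m * x + c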


The dataset which I have used for this project is available here: Click here

From this dataset, we have to build a linear regression model that predicts sales based on the amount spent on a specific media platform: TV, Radio or Newspaper.

To find the best column to use as the independent variable, we perform EDA (exploratory data analysis): a simple scatter plot of each variable against Sales gives the following results:

[Figure: scatter plots of Sales against TV, Radio and Newspaper spend]
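
A rough sketch of how such scatter plots could be produced, assuming the full dataset is already loaded into a DataFrame called ad with TV, Radio, Newspaper and Sales columns:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ['TV', 'Radio', 'Newspaper']):
    ax.scatter(ad[col], ad['Sales'])  # spend on one platform against sales
    ax.set_xlabel(col)
axes[0].set_ylabel('Sales')
plt.show()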

From the above visualisations, we can see that there is a linear correlation between the Sales and TV columns.

So TV is our independent variable and Sales is the dependent variable which we want to predict.

The difference between the actual y value and the predicted value is the residual, and the OLS method aims at minimising the sum of the squared residuals.

The equation below is the general linear regression model; from it we will derive the m and c parameters.

y_i = m*x_i + c + E_i,   for each observation i = 1, ..., n

Now, we have to minimise the errors E_i in the above equation to obtain the c and m parameters; concretely, we minimise the sum of squared errors shown below. The natural tool for minimisation is differentiation, and since we are dealing with two unknown parameters, m and c, we apply partial differentiation with respect to each of them.

S(m, c) = Σ E_i^2 = Σ (y_i - m*x_i - c)^2

First we partially differentiate the above equation with respect to c, then m.

∂S/∂c = -2 Σ (y_i - m*x_i - c) = 0, which gives c = ȳ - m*x̄ (where x̄ and ȳ are the means of x and y)

∂S/∂m = -2 Σ x_i (y_i - m*x_i - c) = 0

Substituting c and rearranging gives m = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)^2

The last step is performed to make the equation more intuitive: the numerator is (up to a factor of n) the covariance of the two variables and the denominator is the variance of x (the independent variable), so their ratio m tells how much y changes per unit change in x. This form of the expression also makes the computation efficient and fast.
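
As a quick sanity check (a sketch, assuming x and y are NumPy arrays or pandas Series holding the independent and dependent variables), the same slope and intercept can be obtained directly from NumPy's covariance and variance routines:

import numpy as np

# slope = covariance(x, y) / variance(x); use the same ddof in both calls
m = np.cov(x, y, ddof=0)[0, 1] / np.var(x, ddof=0)
c = np.mean(y) - m * np.mean(x)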

From this we can obtain the m and c for our regression line. Now let's get into coding →


The Program/Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# load the advertising dataset into a DataFrame called ad (filename assumed; adjust to your copy)
ad = pd.read_csv('advertising.csv')

df = ad[['TV','Sales']]
df.head()
      TV  Sales
0  230.1   22.1
1   44.5   10.4
2   17.2   12.0
3  151.5   16.5
4  180.8   17.9

Let’s rename the variables to x and y for a simpler view:

df = df.rename(columns={'TV':'x','Sales':'y'})
df["x-x'"] = df.x - np.mean(df.x)
df["y-y'"] = df.y - np.mean(df.y)

df["(y-y')*(x-x')"] = df["y-y'"]*df["x-x'"]
df["(x-x')^2"] = df["x-x'"]**2
df.head()
       x     y       x-x'    y-y'  (y-y')*(x-x')      (x-x')^2
0  230.1  22.1    83.0575  6.9695     578.869246   6898.548306
1   44.5  10.4  -102.5425 -4.7305     485.077296  10514.964306
2   17.2  12.0  -129.8425 -3.1305     406.471946  16859.074806
3  151.5  16.5     4.4575  1.3695       6.104546     19.869306
4  180.8  17.9    33.7575  2.7695      93.491396   1139.568806
m = sum(df["(y-y')*(x-x')"])/sum(df["(x-x')^2"])
m

0.055464770469558805

c = np.mean(df.y) - m*np.mean(df.x)
c

6.974821488229896

So our final equation is y = 6.974 + 0.0554*x.
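
For example, a TV advertising spend of 100 (in the units of the dataset) gives a predicted sales value of roughly 6.974 + 0.0554*100 ≈ 12.5.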

If we plot the line on the graph, we get something like this →

x = np.linspace(0, 300, 100)  # TV spend in the dataset roughly spans 0 to 300

plt.plot(x, 6.974 + 0.0554*x)
plt.show()
[Figure: plot of the fitted regression line]

Instead of doing all the calculations manually, we can use the statsmodels.api module.

First, we need to split the dataset into training and test sets:

X = ad['TV']
y = ad['Sales']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)
lr = sm.OLS(y_train, X_train_sm).fit() # OLS stands for ordinary least squares

lr.params

const 6.926214

TV 0.055278

dtype: float64

We can see that the values obtained from the manual calculation are very close to these.
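
To apply the fitted model to the held-out data, the test set needs the same constant column before calling predict (a short sketch using the variables defined above):

X_test_sm = sm.add_constant(X_test)  # add the intercept column, as we did for the training set
y_pred = lr.predict(X_test_sm)       # predicted Sales for the test TV values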

sns.set_style('whitegrid')
plt.scatter(X_test, y_test)
plt.plot(X_test, lr.params['const'] + lr.params['TV']*X_test, 'r')  # line drawn from the fitted parameters
plt.xlabel('TV')
plt.ylabel('Sales')
plt.title('Linear Regression')
plt.show()
[Figure: regression line plotted over the test-set scatter of Sales against TV]

The error term mentioned at the beginning can be quantified using the root mean square error (RMSE), which is the square root of the mean of the squared differences between the predicted and actual y values (the sum of squared differences divided by n, then square-rooted). This error can be minimised further using optimisation techniques.
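
For example, using the test-set predictions y_pred from the sketch above, the RMSE can be computed like this:

rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))  # root mean squared error on the test set
print(rmse)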

That’s a wrap: we have covered the math and the code behind linear regression using the OLS method. If you have any questions or need further clarification, please feel free to ask in the comment section below. Your curiosity and engagement are highly valued.

Thank you for reading all along. Subscribe to sapiencespace and enable notifications to get regular insights.

Click here to view all the concepts related to machine learning.
