
Linear regression is one of the most basic concepts in machine learning, and it falls under the category of supervised learning.
For those who don’t know what supervised learning is: it is a machine learning process in which the algorithm learns from training data rather than from instructions that are separately programmed to solve a task. In supervised learning, the algorithm is given a labelled data set in which the data is well structured into inputs and outputs (target variables).
The main objective of a supervised learning model is to develop a mapping function that can accurately predict the target variable. Such a model can be used for both regression (when the target variable is continuous) and classification (when the target variable is categorical) tasks.
Logic behind Linear Regression
Before fitting a line, linear regression makes a few assumptions about the data (a quick way to check them in code follows the list):
- Linearity: A change in an independent variable is associated with a constant change in the dependent variable.
- Independence of Errors: The errors (residuals), which are the differences between the observed and predicted values, should be independent of each other.
- Normality of Errors: The errors should be normally distributed. This assumption typically applies to the errors, not the dependent variable itself.
- No or Little Multicollinearity: When there are multiple independent variables in the model, they should not be highly correlated with each other. Multicollinearity can make it difficult to interpret the individual effects of predictors.
- No Endogeneity: The independent variables should not be affected by the dependent variable. In other words, the model assumes a causal relationship from predictors to the dependent variable and not the other way around.
- No Outliers: Extreme values or outliers in the data should be minimised or appropriately handled. Outliers can unduly influence the estimated coefficients and affect model performance.
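These assumptions are usually verified after a model has been fitted. As a minimal sketch (not part of the original post), assuming a fitted statsmodels result res and a predictor DataFrame X, a few quick diagnostics could look like this:

from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

resid = res.resid                                # residuals of the (hypothetical) fitted model
print(stats.shapiro(resid))                      # normality of errors
print(durbin_watson(resid))                      # values near 2 suggest independent errors
vif = [variance_inflation_factor(X.values, i)    # one VIF per predictor
       for i in range(X.shape[1])]
print(vif)                                       # large VIFs flag multicollinearity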
Once the above conditions are met and the data is ready for the model, our end goal is to find the equation of a line which fits like this →

The red line is our objective, and we need to develop a hypothesis function in order to obtain it. We will be making use of the ordinary least squares (OLS) method in this post.
First we need to define our hypothesis function, which is the equation y = mx + c + E, where m is the slope, c is the intercept and E is the error term.

The dataset I have used for this project is available here: Click here
From this dataset, we have to build a linear regression model that predicts sales based on the amount spent on a specific media platform, which could be TV, newspaper or radio.
To find the best column for the independent variable, after EDA (exploratory data analysis) we make a simple scatter plot of each variable against Sales and get the following results:

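The plots above can be reproduced with a few lines of matplotlib. This is only a sketch: it assumes the data has been read into a pandas DataFrame called ad (as in the code section below) and that the columns are named TV, Radio, Newspaper and Sales.

import pandas as pd
import matplotlib.pyplot as plt

ad = pd.read_csv('advertising.csv')                # filename assumed
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, ['TV', 'Radio', 'Newspaper']):
    ax.scatter(ad[col], ad['Sales'], alpha=0.6)    # spend on one platform vs Sales
    ax.set_xlabel(col)
axes[0].set_ylabel('Sales')
plt.show()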
From the above visualisations, we can see that there is a linear correlation between the Sales and TV columns.
So TV is our independent variable and Sales is the dependent variable which we want to predict.
The actual y value minus the predicted value is the residual, and the OLS method aims at minimising the sum of the squared residuals.
This equation is the general linear regression model; from here we will derive the m, c and E parameters.

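Written out, the model for each of the n observations and the quantity that OLS minimises are as follows (the error E for observation i is written as ε_i):

y_i = m x_i + c + \varepsilon_i, \qquad i = 1, \dots, n

S(m, c) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - m x_i - c)^2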
Now we have to minimise the above expression to obtain the c and m parameters; essentially, we are aiming to minimise the sum of the squared errors E_i. The natural tool for minimisation is differentiation, and since we are dealing with two unknown parameters, m and c, we use partial differentiation with respect to c and m respectively.

First we partially differentiate the above equation with respect to c, then m.



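For readers following along, the two steps can be summarised as follows: setting each partial derivative of S(m, c) to zero and solving (substituting the expression for c into the second equation) gives the least squares estimators.

\frac{\partial S}{\partial c} = -2 \sum_{i=1}^{n} (y_i - m x_i - c) = 0 \;\Rightarrow\; c = \bar{y} - m\bar{x}

\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i (y_i - m x_i - c) = 0 \;\Rightarrow\; m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}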
The last step is performed to make the equation more intuitive: the numerator denotes the covariance of the two variables and the denominator represents the variance of x (the independent variable), so the ratio tells us how much y changes per unit change in x. Additionally, this form of representation makes the computation efficient and fast.
From this we can obtain m and c for our regression line. Now let’s get into the code →

The Program/Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

ad = pd.read_csv('advertising.csv')  # read in the advertising dataset linked above (filename assumed)
df = ad[['TV','Sales']]              # keep only the two columns we need
df.head()
|   | TV    | Sales |
|---|-------|-------|
| 0 | 230.1 | 22.1  |
| 1 | 44.5  | 10.4  |
| 2 | 17.2  | 12.0  |
| 3 | 151.5 | 16.5  |
| 4 | 180.8 | 17.9  |
Let’s keep the variables as x and y for a simplified view:
df = df.rename(columns={'TV':'x','Sales':'y'})
df["x-x'"] = df.x - np.mean(df.x)              # deviations of x from its mean
df["y-y'"] = df.y - np.mean(df.y)              # deviations of y from its mean
df["(y-y')*(x-x')"] = df["y-y'"]*df["x-x'"]    # cross products (numerator terms)
df["(x-x')^2"] = df["x-x'"]**2                 # squared deviations (denominator terms)
df.head()
|   | x     | y    | x-x'      | y-y'    | (y-y')*(x-x') | (x-x')^2     |
|---|-------|------|-----------|---------|---------------|--------------|
| 0 | 230.1 | 22.1 | 83.0575   | 6.9695  | 578.869246    | 6898.548306  |
| 1 | 44.5  | 10.4 | -102.5425 | -4.7305 | 485.077296    | 10514.964306 |
| 2 | 17.2  | 12.0 | -129.8425 | -3.1305 | 406.471946    | 16859.074806 |
| 3 | 151.5 | 16.5 | 4.4575    | 1.3695  | 6.104546      | 19.869306    |
| 4 | 180.8 | 17.9 | 33.7575   | 2.7695  | 93.491396     | 1139.568806  |
m = sum(df["(y-y')*(x-x')"])/sum(df["(x-x')^2"])
m
0.055464770469558805
c = np.mean(df.y) - m*np.mean(df.x)  # intercept: mean(y) minus m times mean(x)
c
6.974821488229896
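As a quick sanity check (not part of the original walkthrough), numpy can fit the same line directly: np.polyfit with degree 1 performs a least squares fit and returns the slope and intercept, which should agree with the values computed above.

m_check, c_check = np.polyfit(df.x, df.y, 1)   # degree-1 least squares fit: slope, intercept
print(m_check, c_check)                        # expect roughly 0.0555 and 6.97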
So our final equation is y = 6.974 + 0.0554*x.
If we plot this line on a graph we would get something like this →
x = np.linspace(0, 300, 100)     # span the range of TV spend in the data
plt.plot(x, 6.974 + 0.0554*x)    # the fitted line y = c + m*x
plt.show()

Instead of doing all the calculations manually, we can use a module called statsmodels.api.
First we need to split the dataset into train and test sets.
X = ad['TV']
y = ad['Sales']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)
lr = sm.OLS(y_train, X_train_sm).fit()  # OLS = ordinary least squares
lr.params
const 6.926214
TV 0.055278
dtype: float64
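Beyond the raw coefficients, statsmodels can also print a full regression summary (R-squared, standard errors, t-statistics and p-values), which is handy for judging the quality of the fit:

print(lr.summary())   # full OLS regression report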
We can see that the values obtained are very close to those from the manual calculation; small differences are expected because the model here is fitted on the 70% training split rather than on the full dataset.
sns.set_style('whitegrid')
plt.scatter(X_test, y_test)                                          # held-out test points
plt.plot(X_test, lr.params['const'] + lr.params['TV']*X_test, 'r')   # line from the fitted coefficients
plt.xlabel('TV')
plt.ylabel('Sales')
plt.title('Linear Regression')
plt.show()

The error term which we mentioned at the beginning can be quantified using the root mean square error (RMSE), which is basically the square root of the sum of (predicted value − actual value)^2 divided by n. This error can be minimised further using optimisation techniques.
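A minimal sketch of how that error can be computed for the model fitted above, reusing the variable names from the code in this post:

X_test_sm = sm.add_constant(X_test)              # add the intercept column, matching the training design
y_pred = lr.predict(X_test_sm)                   # predicted Sales on the held-out test set
rmse = np.sqrt(np.mean((y_test - y_pred)**2))    # root mean square error
print(rmse)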
That’s a wrap: we have covered the math and the code behind linear regression using the OLS method. If you have any questions or need further clarification, please feel free to ask in the comment section below. Your curiosity and engagement are highly valued.
Thank you for reading along. Subscribe to sapiencespace and enable notifications to get regular insights.
Click here to view all the concepts related to machine learning.