
Reading Time: 5 Minutes
Since we are exploring core concepts like ridge regression at sapiencespace, I felt it is important for us to understand how linear regression applies in n-dimensional space, which is the case in real-world use. So to start, let's understand 3D linear regression, i.e. regression that uses 2 independent variables to predict a target variable with simple OLS. Keep reading for a bonus section at the end !!!
Here our goal is to find a plane that fits the given data, and thereby effectively predict a dependent variable given a few independent variables.


The mathematical derivation is very similar to the first ever post on this blog, on linear regression. Here we partially differentiate the cost function w.r.t. the 3 least square estimates -> m1, m2 and c.
Since this problem is projected into 3-D space, let's visualize and understand the plane we want to fit with the help of a graph->
y = m1*x + m2*z + c

graph generated in desmos.com/3d

Source: wikipedia
Here’s a small video clip to clear up any confusion and show how the plane changes as the parameters (least square estimators) are modified.
Now that we understand how changes in the parameters affect the position of the plane we want to fit, we can proceed with finding the plane that best fits the data spread in 3D space, with the help of math.

Logic behind OLS for 3-D space
Similar to our previous posts, we will perform the same operation of differentiation to find the optimal values for the least square estimates/parameters (the values we want to find from the training data to get a generalized equation that can be used for predictions on unseen data).
The derivative (gradient) tells us how much the loss function changes when we slightly change a least square estimate/parameter. Rule of thumb:
- If the gradient is positive, increasing the parameter increases the loss → we need to decrease the parameter.
- If the gradient is negative, increasing the parameter decreases the loss → we need to increase the parameter.
The minimum of the loss function occurs where the gradient is zero, because at that point no small change in the parameters can reduce the loss any further, i.e. the gap between the actual and predicted values is as small as it can get. And this is exactly what we want to achieve: predictions that are as accurate as possible.
In real-world applications, we refer to the independent variables on the x and z axes as features. So, I have renamed x to x1 and z to x2. You don't need to derive every equation in a real-world application, as there are libraries and APIs that do the task for you. But you can follow this post to understand the logic behind OLS.
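For reference, here is the cost function we will minimize, written in the same notation as the plane equation above; it is simply the sum of squared errors over the training data:

J(m1, m2, c) = Σ [ y_i - (m1*x1_i + m2*x2_i + c) ]²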

Now that our cost function is set: y_i is the actual dependent value in the training data, and the term in the brackets is the predicted value (y_hat). Please refer to the bottom of the post to see how the simplifications are done.
Partially Differentiating w.r.t c
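In plain notation, setting this partial derivative to zero works out to:

∂J/∂c = -2 * Σ [ y_i - m1*x1_i - m2*x2_i - c ] = 0
→ c = y_mean - m1*x1_mean - m2*x2_mean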

Partially Differentiating w.r.t m1
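In plain notation:

∂J/∂m1 = -2 * Σ x1_i * [ y_i - m1*x1_i - m2*x2_i - c ] = 0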

Partially Differentiating w.r.t m2
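And similarly:

∂J/∂m2 = -2 * Σ x2_i * [ y_i - m1*x1_i - m2*x2_i - c ] = 0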

Now that we have combined the equations and simplified them, let's define some variables to make writing the final solution easier.
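In terms of the mean-centered data, the variables are (they match the Sa … Se computed in the code below):

Sa = Σ (x1_i - x1_mean)²
Sb = Σ (x2_i - x2_mean)²
Sc = Σ (x1_i - x1_mean)*(x2_i - x2_mean)
Sd = Σ (x1_i - x1_mean)*(y_i - y_mean)
Se = Σ (x2_i - x2_mean)*(y_i - y_mean)

With these, the two derivative equations for m1 and m2 reduce to a 2x2 system:

m1*Sa + m2*Sc = Sd
m1*Sc + m2*Sb = Se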

Sample Dataset used->
import numpy as np
import pandas as pd

X1 = np.linspace(0, 10, 100) # Feature 1 - X
X2 = X1 + np.random.normal(0, 0.5, size=len(X1)) # Feature 2 - Z
y = 4 * X1 + 2 * X2 + np.random.normal(0, 5, size=len(X1))
df = pd.DataFrame({'x1': X1, 'x2': X2, 'y': y})
ols_df = df.copy() # work on a copy so the original df is left unchanged
x1_mean = np.mean(ols_df["x1"])
x2_mean = np.mean(ols_df["x2"])
y_mean = np.mean(ols_df["y"])
ols_df["x1-x1_mean"] = ols_df["x1"] - x1_mean
ols_df["x2-x2_mean"] = ols_df["x2"] - x2_mean
ols_df["y-y_mean"] = ols_df["y"] - y_mean
Sa = sum(ols_df["x1-x1_mean"]**2)
Sb = sum(ols_df["x2-x2_mean"]**2)
Sc = sum(ols_df["x1-x1_mean"]*ols_df["x2-x2_mean"])
Sd = sum(ols_df["x1-x1_mean"]*ols_df["y-y_mean"])
Se = sum(ols_df["x2-x2_mean"]*ols_df["y-y_mean"])
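Solving that 2x2 system (for example with Cramer's rule) gives the estimates in closed form, which is exactly what the code below computes:

m1 = (Sd*Sb - Sc*Se) / (Sa*Sb - Sc²)
m2 = (Sa*Se - Sc*Sd) / (Sa*Sb - Sc²)
c = y_mean - m1*x1_mean - m2*x2_mean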



Now, we can plug our transformed data into these equations to finally find the best-fitting plane.
m1 = (Sd*Sb - Sc*Se) / (Sa*Sb - Sc**2)
m2 = (Sc*Sd - Sa*Se) / (Sc**2 - Sa*Sb)
c = y_mean - m1 * np.mean(ols_df['x1']) - m2 * np.mean(ols_df['x2'])
print(m1, m2, c)
5.458322306778714 0.561682391142402 -0.06318269164918355
Visualization of the predicted plane on the data->


But how accurate is the prediction?
R-squared metric to check the accuracy
R-squared, also known as the coefficient of determination, is used to evaluate the accuracy of Ordinary Least Squares (OLS) regression models.
R² measures how much of the total variance is actually explained by your model. We cannot rely only on the variance or sum of squares of the residuals, as we need a bounded metric. So we divide the sum of squares of the residuals (SSres) by the total sum of squares (SStotal): this ratio is the proportion of the total variance the model fails to explain (the leftover error). Subtracting it from 1 then gives the proportion the model does explain.
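In formula form (y_hat_i is the model's prediction for row i):

R² = 1 - SSres / SStotal, where SSres = Σ (y_i - y_hat_i)² and SStotal = Σ (y_i - y_mean)²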

R-squared values closer to 1 indicate higher accuracy; if it is less than 0, the model is worse than just using the mean as the prediction for any given input value.
I got a 0.93 score for this prediction, which is a great fit given the sparse data.
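As a quick sanity check, here is one way to compute this score from the manual fit above (it reuses m1, m2, c, y_mean and ols_df from the earlier code; the exact value will vary from run to run because of the random noise):

y_pred = m1 * ols_df["x1"] + m2 * ols_df["x2"] + c   # predictions from our fitted plane
ss_res = sum((ols_df["y"] - y_pred) ** 2)            # variance the model fails to explain
ss_total = sum((ols_df["y"] - y_mean) ** 2)          # total variance in y
r_squared = 1 - ss_res / ss_total
print(r_squared)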
Bonus Section
What if I told you that you can implement all of this math and code with just a few simple lines instead? YES !!! The statsmodels.api library does exactly that; here's the code:
import statsmodels.api as sm
X = sm.add_constant(df[['x1', 'x2']]) # put all your features/independent features here
y = df['y'] # target/dependent variable
model = sm.OLS(y, X).fit()
print(model.summary())

From the summary output, it is clear that we have obtained the same results as our implementation above. We will look into the other metrics in an upcoming post, to keep the reading time of this post in check.
The coefficients can be accessed in this manner ->
c_new = model.params['const']
m1_new = model.params['x1']
m2_new = model.params['x2']
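If you also want the R² reported by statsmodels (it should closely match the value we computed by hand), it is available directly on the fitted results object:

print(model.rsquared)   # coefficient of determination from statsmodels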
Math simplifications for reference
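As a sketch of the key step: substitute c = y_mean - m1*x1_mean - m2*x2_mean (from the derivative w.r.t c) into the derivative w.r.t m1. The residual inside the brackets becomes

(y_i - y_mean) - m1*(x1_i - x1_mean) - m2*(x2_i - x2_mean)

and because these residuals sum to zero, multiplying by x1_i is equivalent to multiplying by (x1_i - x1_mean). Setting the sum to zero then gives

Σ (x1_i - x1_mean) * [ (y_i - y_mean) - m1*(x1_i - x1_mean) - m2*(x2_i - x2_mean) ] = 0
→ m1*Sa + m2*Sc = Sd

The same step with x2 gives m1*Sc + m2*Sb = Se, which is the 2x2 system solved above.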


That’s a wrap, thank you for reading all along. I highly recommend going through the derivations a few times, as it will help you develop a firm understanding of how things work behind libraries like statsmodels. To summarize, the process is:
- Find a business problem with some correlated features to make predictions.
- Define an appropriate cost function (check out my post on art of training neural networks on sapiencespace to know about cost functions).
- Perform partial differentiation w.r.t. each least square estimate, or go for a library such as statsmodels directly. The former approach will be very helpful when you are dealing with deep learning problems and handling libraries like PyTorch and TensorFlow.
- Plug in the processed data (after normalization, if required) and get your predictions !!!
Click here for more such interesting AI concepts.
cover and title image credits: unsplash content creators