
Reading Time: 8 Minutes
When dealing with huge datasets, or any data with multiple continuous attributes, you may need to perform EDA (exploratory data analysis), run predictive analysis, or feed the data into a machine learning model. At some point it becomes essential to understand the relationships between the attributes, and heatmaps are an amazingly simple data visualization tool for doing exactly that.
Let’s say we have a dataset consisting of a few attributes related to weather, such as the maximum temperature on a given day, global radiation, precipitation, and so on. To understand how each attribute relates to the others, we can use a data visualization technique called the heatmap.
This is what the visualisation looks like when you render it →

I know it’s a lot to take in; the numbers and colours scared me too when I first used heatmaps. We’ll discuss the colours later, so for the moment ignore the colours, the bar at the side, and the attribute names placed along the left and bottom of the map.
Let’s direct our focus towards the numbers inside the squares. By observing the grid for a few seconds, you can easily spot that all the numbers lie in the range of -1 to 1. To understand what they actually mean, we need to delve into two fundamental concepts in statistics: COVARIANCE and CORRELATION.
Correlation and covariance are fundamental statistical tools used to quantify the relationship between variables. Covariance measures the joint variability of two random variables. The formula for covariance stems from the concept of expected values and deviations from those expectations.

It indicates whether two variables tend to move together (positive covariance), move in opposite directions (negative covariance), or have no discernible pattern (zero covariance). Here x and y with a bar over them represent their mean values.
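The definition above can be sketched in a few lines of Python. The data here is made up purely for illustration; the point is that the average product of deviations from the means is the covariance, and NumPy’s `np.cov` (with `bias=True` for the population denominator) agrees with the manual calculation.

```python
import numpy as np

# Hypothetical sample data, just for illustration
x = [2.1, 2.5, 3.6, 4.0]
y = [8.0, 10.0, 12.0, 14.0]

x_bar = np.mean(x)
y_bar = np.mean(y)

# Population covariance: the average product of deviations from the means
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
print(cov_xy)  # positive, so x and y tend to move together

# np.cov uses the sample (n-1) denominator by default; bias=True matches ours
print(np.cov(x, y, bias=True)[0, 1])
```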
Correlation, on the other hand, is a measure very similar to covariance, but the values that covariance yields are rescaled into the range of -1 to 1. This makes it easy for statisticians to judge the strength of a relationship along with its direction, i.e. whether the variables are directly or inversely proportional.
If the sign is negative, the variables are inversely proportional: when one goes up, the other comes down, and vice versa. If the sign is positive, they are directly proportional: when one goes up, the other goes up with it.
The number we get out of the correlation function is known as the correlation coefficient.
Now the image you saw earlier will make more sense: sunshine and global_radiation have a correlation coefficient of 0.85, which means both are strongly related in the same direction, and that actually makes sense from a general perspective.
So to boil it down, correlation is just a function that takes the covariance of two variables and scales it into the range of -1 to 1 by dividing it by the product of their standard deviations.

For those who need quick intuition about standard deviation: it measures how much the data is spread out from its mean. A larger standard deviation indicates greater variability or dispersion of data points from the mean, while a smaller one suggests that the data points cluster close to the mean. It’s a widely used measure in statistics for understanding the distribution and variability of a dataset.
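A tiny sketch makes this concrete. The two datasets below are invented for illustration; they share the same mean, but the second is far more spread out, and the manual population formula matches `np.std`.

```python
import numpy as np

# Two hypothetical datasets with the same mean (10) but different spread
tight = [9, 10, 10, 11]
wide = [2, 8, 12, 18]

# Population standard deviation: root of the mean squared deviation
def std(values):
    mu = sum(values) / len(values)
    return (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5

print(std(tight))  # small: points hug the mean
print(std(wide))   # large: points are dispersed
print(np.std(tight))  # np.std defaults to the same population formula
```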
Now let’s take an example of two random variables and calculate their correlation coefficient.
import pandas as pd
import numpy as np
x = [52,54,56,52,54,60,59,56]
y = [67,68,65,68,63,62,58,61]
df = pd.DataFrame(list(zip(x,y)))
df.columns = ['x','y']
df
– | x | y |
---|---|---|
0 | 52 | 67 |
1 | 54 | 68 |
2 | 56 | 65 |
3 | 52 | 68 |
4 | 54 | 63 |
5 | 60 | 62 |
6 | 59 | 58 |
7 | 56 | 61 |
mean_x = sum(x)/len(x)
mean_y = sum(y)/len(y)
df['x-mean_x'] = df['x'] - mean_x
df['y-mean_y'] = df['y'] - mean_y
df
– | x | y | x-mean_x | y-mean_y |
---|---|---|---|---|
0 | 52 | 67 | -3.375 | 3.0 |
1 | 54 | 68 | -1.375 | 4.0 |
2 | 56 | 65 | 0.625 | 1.0 |
3 | 52 | 68 | -3.375 | 4.0 |
4 | 54 | 63 | -1.375 | -1.0 |
5 | 60 | 62 | 4.625 | -2.0 |
6 | 59 | 58 | 3.625 | -6.0 |
7 | 56 | 61 | 0.625 | -3.0 |
df['(x-mean_x)*(y-mean_y)'] = df['x-mean_x']*df['y-mean_y']
df
– | x | y | x-mean_x | y-mean_y | (x-mean_x)*(y-mean_y) |
---|---|---|---|---|---|
0 | 52 | 67 | -3.375 | 3.0 | -10.125 |
1 | 54 | 68 | -1.375 | 4.0 | -5.500 |
2 | 56 | 65 | 0.625 | 1.0 | 0.625 |
3 | 52 | 68 | -3.375 | 4.0 | -13.500 |
4 | 54 | 63 | -1.375 | -1.0 | 1.375 |
5 | 60 | 62 | 4.625 | -2.0 | -9.250 |
6 | 59 | 58 | 3.625 | -6.0 | -21.750 |
7 | 56 | 61 | 0.625 | -3.0 | -1.875 |
cov = sum(df['(x-mean_x)*(y-mean_y)'])/len(x)
cov
-7.5
This -7.5 indicates that x and y are inversely proportional, but we can’t extract much information from the number beyond its sign, so we compute the correlation value.
sd_x = np.std(x)
sd_y = np.std(y)
denominator = sd_x * sd_y
corr = cov/denominator
corr
-0.795242772487544
From this output we learn that x and y have quite a strong inverse relationship.
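As a sanity check, NumPy can reproduce the whole manual calculation in one call: `np.corrcoef` returns the full correlation matrix, and the off-diagonal entry is the coefficient we just derived by hand.

```python
import numpy as np

# The same x and y used in the worked example above
x = [52, 54, 56, 52, 54, 60, 59, 56]
y = [67, 68, 65, 68, 63, 62, 58, 61]

# np.corrcoef returns a 2x2 matrix; [0, 1] is the coefficient of x with y
corr_matrix = np.corrcoef(x, y)
print(corr_matrix[0, 1])  # ≈ -0.7952, matching the manual result
```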
But when we are dealing with a dataset that has many attributes and lots of rows, we can’t manually build dataframes and do all the calculations. Instead, there is a simple function that does all the work and returns the correlation coefficient of every pair of attributes in a dataframe, which can be interpreted as a matrix.
This function from the pandas module is the .corr() method; the syntax is dataframe_name.corr().
Once we have the correlation matrix, we can view the values by printing the dataframe. But to make it more visually appealing and easier to interpret, we use a HEATMAP, which overlays the matrix with a colour gradient so extreme correlations are easy to spot.
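Before moving to the weather data, here is .corr() on a tiny made-up frame (the column names and values are invented for illustration). Column c is a perfect linear decrease as a rises, so their coefficient comes out as exactly -1.0.

```python
import pandas as pd

# A tiny hypothetical frame: c falls linearly as a rises, b is unrelated noise
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [3, 1, 4, 1, 5],
    'c': [10, 8, 6, 4, 2],
})

# .corr() computes pairwise Pearson coefficients for every numeric column
matrix = df.corr()
print(matrix)
# a vs c is perfectly inverse, so matrix.loc['a', 'c'] is -1.0
```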
Now let’s explore the Python program to generate a heatmap.

The Program
1. Import the necessary modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
2. Read the csv file into a data frame and extract only the necessary features.
The dataset is available on Kaggle; you can go to it by clicking here
data = pd.read_csv("london_weather.csv")
data.head()
– | date | cloud_cover | sunshine | global_radiation | max_temp | mean_temp | min_temp | precipitation | pressure | snow_depth |
---|---|---|---|---|---|---|---|---|---|---|
0 | 19790101 | 2.0 | 7.0 | 52.0 | 2.3 | -4.1 | -7.5 | 0.4 | 101900.0 | 9.0 |
1 | 19790102 | 6.0 | 1.7 | 27.0 | 1.6 | -2.6 | -7.5 | 0.0 | 102530.0 | 8.0 |
2 | 19790103 | 5.0 | 0.0 | 13.0 | 1.3 | -2.8 | -7.2 | 0.0 | 102050.0 | 4.0 |
3 | 19790104 | 8.0 | 0.0 | 13.0 | -0.3 | -2.6 | -6.5 | 0.0 | 100840.0 | 2.0 |
4 | 19790105 | 6.0 | 2.0 | 29.0 | 5.6 | -0.8 | -1.4 | 0.0 | 102250.0 | 1.0 |
data.info()
RangeIndex: 15341 entries, 0 to 15340
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   date              15341 non-null  int64
 1   cloud_cover       15322 non-null  float64
 2   sunshine          15341 non-null  float64
 3   global_radiation  15322 non-null  float64
 4   max_temp          15335 non-null  float64
 5   mean_temp         15305 non-null  float64
 6   min_temp          15339 non-null  float64
 7   precipitation     15335 non-null  float64
 8   pressure          15337 non-null  float64
 9   snow_depth        13900 non-null  float64
dtypes: float64(9), int64(1)
memory usage: 1.2 MB
3. We don’t require date for this analysis, so we remove it.
new = data.drop(['date'], axis=1)
4. Now we generate the correlation matrix.
new.corr()
– | cloud_cover | sunshine | global_radiation | max_temp | mean_temp | min_temp | precipitation | pressure | snow_depth |
---|---|---|---|---|---|---|---|---|---|
cloud_cover | 1.000000 | -0.738291 | -0.485973 | -0.212224 | -0.110556 | 0.048838 | 0.235269 | -0.241955 | -0.001256 |
sunshine | -0.738291 | 1.000000 | 0.852632 | 0.472182 | 0.396535 | 0.219082 | -0.231636 | 0.226943 | -0.034222 |
global_radiation | -0.485973 | 0.852632 | 1.000000 | 0.690946 | 0.635432 | 0.478119 | -0.162668 | 0.150078 | -0.061781 |
max_temp | -0.212224 | 0.472182 | 0.690946 | 1.000000 | 0.912200 | 0.810514 | -0.071799 | 0.100455 | -0.130594 |
mean_temp | -0.110556 | 0.396535 | 0.635432 | 0.912200 | 1.000000 | 0.955593 | -0.010462 | 0.004764 | -0.154945 |
min_temp | 0.048838 | 0.219082 | 0.478119 | 0.810514 | 0.955593 | 1.000000 | 0.037233 | -0.074274 | -0.157882 |
precipitation | 0.235269 | -0.231636 | -0.162668 | -0.071799 | -0.010462 | 0.037233 | 1.000000 | -0.349456 | -0.001352 |
pressure | -0.241955 | 0.226943 | 0.150078 | 0.100455 | 0.004764 | -0.074274 | -0.349456 | 1.000000 | -0.021229 |
snow_depth | -0.001256 | -0.034222 | -0.061781 | -0.130594 | -0.154945 | -0.157882 | -0.001352 | -0.021229 | 1.000000 |
5. Generate the heatmap using the seaborn library:
sns.heatmap(new.corr(), annot=True, cmap='Wistia') # setting annot to True will show the correlation coefficients

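A couple of optional tweaks are worth knowing about. The sketch below uses random demo data (since the weather CSV may not be at hand) to show two common refinements: pinning the colour scale to the full -1 to 1 range with vmin/vmax, and hiding the redundant upper triangle with a mask, since a correlation matrix is symmetric. The 'coolwarm' palette is just one choice among many.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Demo correlation matrix from random data (4 variables, 50 observations)
rng = np.random.default_rng(0)
demo = np.corrcoef(rng.normal(size=(4, 50)))

# Mask everything strictly above the diagonal: the matrix is symmetric,
# so the upper triangle repeats the lower one
mask = np.triu(np.ones_like(demo, dtype=bool), k=1)

# vmin/vmax fix the colour scale to the full correlation range
sns.heatmap(demo, annot=True, fmt='.2f', vmin=-1, vmax=1,
            cmap='coolwarm', mask=mask)
plt.show()
```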
So that is how you generate a heatmap and read the relationships between attributes straight off the correlation coefficients.
That’s a wrap. Thank you for reading all the way; your curiosity and engagement are highly valued. Post your comments or doubts in the comment section below, and subscribe to sapiencespace with notifications enabled to get regular insights.