Examine Correlations in the Dataset


Data Preparation

We will use a widely used dataset, California Housing Prices, to demonstrate two common ways of examining the correlations between pairs of attributes in a dataset.

The first step is to download the data from Kaggle and read it with pandas.

import pandas as pd

housing = pd.read_csv('housing.csv')
housing.head()

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY

housing.plot(kind='scatter', x='longitude',y='latitude',alpha=0.1)

high density area

%matplotlib inline
import matplotlib.pyplot as plt

# Marker size (s) scales with the district population; color (c) encodes the median house value.
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='Population',
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()

population vs location density

Method 1: Standard Correlation Coefficient

If the dataset is not too large, it is easy to calculate the standard correlation coefficient (also called Pearson's r) between every pair of attributes with the corr() method.

The correlation coefficient ranges from -1 to 1. A value close to 1 indicates a strong positive correlation, and a value close to -1 indicates a strong negative correlation. Coefficients close to zero indicate that there is no linear correlation.

The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up or down”). It may completely miss out on nonlinear relationships.
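The point about nonlinear relationships can be checked with a small sketch. The pearson_r helper and the toy data below are illustrative assumptions, not part of the housing dataset or the pandas API; the helper simply follows the textbook formula for Pearson's r, which is the method corr() uses by default.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): x values are symmetric around zero.
x = [i / 10 for i in range(-50, 51)]
y_linear = [2 * v + 1 for v in x]   # increases linearly with x
y_negative = [-3 * v for v in x]    # decreases linearly with x
y_squared = [v ** 2 for v in x]     # fully determined by x, but not linearly

r_linear = pearson_r(x, y_linear)
r_negative = pearson_r(x, y_negative)
r_squared = pearson_r(x, y_squared)

print(f"linear:    {r_linear:+.3f}")
print(f"negative:  {r_negative:+.3f}")
print(f"quadratic: {r_squared:+.3f}")
```

The two linear cases come out at essentially +1 and -1, while the quadratic case is essentially 0 even though y depends entirely on x. A near-zero coefficient therefore rules out a linear trend only, not a relationship in general.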

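# ocean_proximity is a text column, so restrict the calculation to numeric columns.
# numeric_only requires pandas 1.5 or later; without it, recent pandas versions
# raise an error here instead of silently skipping the text column.
corr_matrix = housing.corr(numeric_only=True)
corr_matrix['median_house_value'].sort_values(ascending=False)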
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
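The output above covers a single column of the matrix. To look at every pair of attributes at once, one option is to flatten the upper triangle of the correlation matrix into a sorted list. The following is a minimal sketch on synthetic data: the toy DataFrame and its values are made up and only borrow a few column names from the housing data, and numeric_only=True assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing DataFrame (values are made up).
rng = np.random.default_rng(0)
n = 500
income = rng.normal(4.0, 1.5, n)
toy = pd.DataFrame({
    "median_income": income,
    "median_house_value": 50_000 * income + rng.normal(0, 40_000, n),
    "housing_median_age": rng.uniform(1, 52, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], n),  # text column
})

# Skip the text column when computing the correlations.
corr = toy.corr(numeric_only=True)

# Keep only the entries above the diagonal, so each pair appears once
# and the trivial self-correlations (always 1.0) are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which a plain descending sort would push to the bottom of the list.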

Method 2: Pandas scatter_matrix Function

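Pandas’ scatter_matrix function plots every numerical attribute against every other numerical attribute, which helps to spot correlations between attributes visually. Because n attributes produce an n × n grid of plots, we plot only a few attributes that look most correlated with the median house value. By default, the diagonal shows a histogram of each attribute instead of plotting it against itself.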

from pandas.plotting import scatter_matrix

attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
scatter_matrix(housing[attributes], figsize=(12,8))

scatter matrix
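When many points overlap, the individual panels can be hard to read. The sketch below shows a few scatter_matrix options that may help: alpha for point transparency, diagonal to choose what is drawn on the diagonal, and hist_kwds to pass settings to the histograms. The data is synthetic and only reuses some housing column names, so the resulting plot does not describe the real dataset.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for a few housing columns (values are made up).
rng = np.random.default_rng(42)
n = 300
income = rng.normal(4.0, 1.5, n)
toy = pd.DataFrame({
    "median_income": income,
    "median_house_value": 50_000 * income + rng.normal(0, 40_000, n),
    "housing_median_age": rng.uniform(1, 52, n),
})

# alpha makes dense regions stand out; hist_kwds is forwarded to the
# histograms drawn on the diagonal.
axes = scatter_matrix(
    toy,
    figsize=(9, 6),
    alpha=0.2,
    diagonal="hist",
    hist_kwds={"bins": 30},
)

# The function returns a 2-D array of Axes, one per pair of columns.
print(axes.shape)
```

The returned array can be indexed like the grid itself, for example axes[1, 0] is the panel with median_house_value on the y-axis and median_income on the x-axis, which is convenient for adjusting a single panel afterwards.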

Reference:

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow


Author: wenvenn
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: wenvenn.