Imputation of missing values using Scikit-Learn


Data Preparation

This note uses the widely used California Housing Prices dataset to walk through two ways of handling missing values with Scikit-Learn: univariate imputation and multivariate imputation.

The first step is to download the data from Kaggle and read it with Pandas.

import pandas as pd

housing = pd.read_csv('housing.csv')
housing.head()

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY

housing.isnull().sum()
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Univariate feature imputation

This method imputes values in the i-th feature dimension using only non-missing values in that feature dimension. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located.
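
As a rough illustration of the strategies listed above, the sketch below fits SimpleImputer on a small made-up matrix (the values are not from the housing data) and prints the per-column fill value each strategy learns. The names `X`, `results` and `X_constant` are just illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (made up for illustration): one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [2.0, 40.0],
    [9.0, 20.0],
])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputer.fit(X)
    # statistics_ holds the fill value learned from each column's non-missing entries.
    results[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_)

# strategy="constant" does not look at the column statistics; it uses fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
X_constant = constant_imputer.fit_transform(X)
print(X_constant)
```

Each column is handled independently here: the fill value for the first column never depends on what is in the second.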

import numpy as np
from sklearn.impute import SimpleImputer
s_imp = SimpleImputer(missing_values=np.nan, strategy='median')
s_imp.fit(housing.drop("ocean_proximity", axis=1))
SimpleImputer(strategy='median')
s_imp.statistics_
array([-1.1849e+02,  3.4260e+01,  2.9000e+01,  2.1270e+03,  4.3500e+02,
        1.1660e+03,  4.0900e+02,  3.5348e+00,  1.7970e+05])
housing.drop("ocean_proximity", axis=1).median().values
array([-1.1849e+02,  3.4260e+01,  2.9000e+01,  2.1270e+03,  4.3500e+02,
        1.1660e+03,  4.0900e+02,  3.5348e+00,  1.7970e+05])
# transform() returns a plain NumPy array, so it has to be wrapped back into a Pandas DataFrame.
housing_s = pd.DataFrame(s_imp.transform(housing.drop("ocean_proximity", axis=1)), columns=housing.drop("ocean_proximity", axis=1).columns)
housing_s.isnull().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
housing_s['total_bedrooms'].median()
435.0
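
The NumPy-to-DataFrame step above could also be wrapped in a small helper so that the column list is only computed once. This is a minimal sketch: `impute_numeric` is a hypothetical function (not part of Scikit-Learn), and the tiny `toy` frame only stands in for the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_numeric(df, imputer):
    """Fit `imputer` on the numeric columns of `df` and return a DataFrame.

    Hypothetical helper for illustration; assumes no numeric column is
    entirely missing, since such columns may be dropped by the imputer.
    """
    numeric = df.select_dtypes(include="number")
    filled = imputer.fit_transform(numeric)
    # Reattach the column names and index that are lost in the NumPy array.
    return pd.DataFrame(filled, columns=numeric.columns, index=numeric.index)


# Small made-up frame standing in for the housing data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

toy_filled = impute_numeric(toy, SimpleImputer(strategy="median"))
print(toy_filled)
```

Selecting columns by dtype avoids hard-coding "ocean_proximity", which may or may not be preferable depending on how many non-numeric columns the data has.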

Multivariate feature imputation

As its name suggests, multivariate imputation uses the entire set of available feature dimensions to estimate the missing values. The algorithm models each feature with missing values as a function of the other features, and uses that estimate for imputation.
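
To make the idea concrete, here is a small sketch on made-up data in which the second column is roughly twice the first. The arrays `X_train` and `X_new` are assumptions for illustration only; the imputer settings mirror the ones used in this post.

```python
import numpy as np
# This import is required to unlock IterativeImputer, which is still experimental.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up training data: column 1 is roughly 2 * column 0.
X_train = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [np.nan, 3.0],
    [7.0, np.nan],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)

# New rows with one value missing each; the other column drives the estimate.
X_new = np.array([[6.0, np.nan], [np.nan, 6.0]])
X_filled = imputer.transform(X_new)
print(X_filled)
```

Unlike a column median, the filled values here should depend on the observed value in the same row, which is the point of the multivariate approach.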

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
m_imp = IterativeImputer(max_iter=10, random_state=0)
m_imp.fit(housing.drop("ocean_proximity", axis=1))
IterativeImputer(random_state=0)
housing_m = pd.DataFrame(m_imp.transform(housing.drop("ocean_proximity", axis=1)), columns=housing.drop("ocean_proximity", axis=1).columns)
housing_m.isnull().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
housing_m['total_bedrooms'].median()
435.0
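
The overall median is the same for both imputers, so it says little about how the filled values differ. One way to see the difference is to look only at the rows that were originally missing. The sketch below does this on synthetic data (a made-up `toy` frame in which bedrooms roughly track rooms), not on the housing file itself.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data: bedrooms roughly track rooms.
rooms = rng.uniform(500, 5000, size=200)
bedrooms = 0.2 * rooms + rng.normal(0, 20, size=200)
toy = pd.DataFrame({"total_rooms": rooms, "total_bedrooms": bedrooms})

# Hide 20 bedroom values and remember where they were.
missing_idx = rng.choice(200, size=20, replace=False)
toy.loc[missing_idx, "total_bedrooms"] = np.nan
was_missing = toy["total_bedrooms"].isnull().to_numpy()

filled_s = SimpleImputer(strategy="median").fit_transform(toy)
filled_m = IterativeImputer(max_iter=10, random_state=0).fit_transform(toy)

# Column 1 is total_bedrooms; keep only the rows that were imputed.
imputed_s = filled_s[was_missing, 1]
imputed_m = filled_m[was_missing, 1]

print("univariate unique values:", np.unique(imputed_s).size)
print("multivariate unique values:", np.unique(imputed_m).size)
```

The univariate imputer fills every gap with one and the same number, while the multivariate imputer produces a different estimate per row based on total_rooms.
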
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20,15))

s_heights, s_bins = np.histogram(housing_s['total_bedrooms'], bins=25)
m_heights, m_bins = np.histogram(housing_m['total_bedrooms'], bins=s_bins)

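# Both histograms share the same bin edges, so one third of a bin width leaves room for two bars per bin.
width = (s_bins[1] - s_bins[0]) / 3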

ax.bar(s_bins[:-1], s_heights, width=width, facecolor='cornflowerblue')
ax.bar(m_bins[:-1]+width, m_heights, width=width, facecolor='seagreen')
<BarContainer object of 25 artists>

Univariate vs multivariate imputation visualization

import seaborn as sns


df = pd.concat([housing_s['total_bedrooms'], housing_m['total_bedrooms']], axis=1, keys=['total_bedrooms_s', 'total_bedrooms_m'])
sns.set(rc={'figure.figsize':(20,15)})
sns.histplot(df.melt(), x='value', hue='variable', multiple='dodge', shrink=.75, bins=25);

Univariate vs multivariate imputation visualization by Seaborn


Author: wenvenn
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: wenvenn.