Imputation of missing values using Scikit-Learn

Data Preparation

This note uses the widely used California Housing Prices dataset to show two ways of handling missing values with Scikit-Learn: univariate imputation and multivariate imputation.

The first step is to download the data from Kaggle and read it by using Pandas.

import pandas as pd

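housing = pd.read_csv('housing.csv')
housing.head()

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY

# Count the missing values in each column.
housing.isnull().sum()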
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Univariate feature imputation

This method imputes values in the i-th feature dimension using only non-missing values in that feature dimension. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located.
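Before applying this to the housing data, a small sketch on a made-up array may help show what each strategy does. The array `X` and the `results` dictionary are illustrative names only and are not part of the housing dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (an assumption for illustration): each column has one missing value.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [2.0, 30.0],
    [np.nan, 30.0],
    [7.0, 50.0],
])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Each column is filled using only its own non-missing values.
    results[strategy] = imp.fit_transform(X)
    print(strategy, imp.statistics_)

# A constant fill ignores the column statistics entirely.
const_imp = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
results["constant"] = const_imp.fit_transform(X)
```

The first column has the non-missing values 1, 2, 2 and 7, so its missing entry becomes 3 with the mean and 2 with either the median or the most frequent value.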

import numpy as np
from sklearn.impute import SimpleImputer
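s_imp = SimpleImputer(missing_values=np.nan, strategy='median')
# The imputer only handles numeric columns here, so drop the categorical one before fitting.
s_imp.fit(housing.drop("ocean_proximity", axis=1))
# The learned per-column medians are stored in statistics_.
s_imp.statistics_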
array([-1.1849e+02,  3.4260e+01,  2.9000e+01,  2.1270e+03,  4.3500e+02,
        1.1660e+03,  4.0900e+02,  3.5348e+00,  1.7970e+05])
housing.drop("ocean_proximity", axis=1).median().values
array([-1.1849e+02,  3.4260e+01,  2.9000e+01,  2.1270e+03,  4.3500e+02,
        1.1660e+03,  4.0900e+02,  3.5348e+00,  1.7970e+05])
# transform() returns a plain NumPy array, so wrap it in a Pandas DataFrame again.
housing_s = pd.DataFrame(s_imp.transform(housing.drop("ocean_proximity", axis=1)), columns=housing.drop("ocean_proximity", axis=1).columns)
housing_s.isnull().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
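Rebuilding the DataFrame from column names alone resets the row index, and the dropped categorical column still has to be added back by hand. Here is one possible sketch of a variant that keeps the original index and rejoins the non-numeric column. The small DataFrame `df` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mini dataset with a non-default index and one text column.
df = pd.DataFrame(
    {
        "rooms": [3.0, np.nan, 5.0],
        "income": [1.0, 2.0, 3.0],
        "proximity": ["INLAND", "NEAR BAY", "INLAND"],
    },
    index=[10, 20, 30],
)

# Select the numeric columns once instead of repeating drop(...).
numeric = df.select_dtypes(include="number")

imp = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imp.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,  # keep the original row labels
)

# Put the untouched categorical column back next to the imputed values.
result = imputed.join(df[["proximity"]])
print(result)
```

Keeping the index matters as soon as the imputed frame is joined or compared with the original one, because Pandas aligns rows by label rather than by position.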

Multivariate feature imputation

As the name suggests, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values. Scikit-Learn's IterativeImputer models each feature with missing values as a function of the other features and uses that estimate for imputation.
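To make the difference from univariate imputation concrete, here is a minimal sketch on a made-up array in which the second column is exactly twice the first. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy data (not from the housing dataset): column 1 = 2 * column 0.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# Univariate: uses only the other values in column 1.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: predicts column 1 from column 0.
iter_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print("mean imputation:     ", mean_filled[3, 1])
print("iterative imputation:", iter_filled[3, 1])
```

The mean of the observed values 2, 4, 6 and 10 is 5.5, which ignores the row's own first feature. The iterative imputer can exploit the relation between the columns, so its estimate should land close to 8.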

# IterativeImputer is still experimental, so it has to be enabled explicitly before it can be imported.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
m_imp = IterativeImputer(max_iter=10, random_state=0)
m_imp.fit(housing.drop("ocean_proximity", axis=1))
housing_m = pd.DataFrame(m_imp.transform(housing.drop("ocean_proximity", axis=1)), columns=housing.drop("ocean_proximity", axis=1).columns)
housing_m.isnull().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20,15))

s_heights, s_bins = np.histogram(housing_s['total_bedrooms'], bins=25)
m_heights, m_bins = np.histogram(housing_m['total_bedrooms'], bins=s_bins)

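# Both histograms share the same bin edges, so draw the bars side by side.
width = (s_bins[1] - s_bins[0])/3
ax.bar(s_bins[:-1], s_heights, width=width, facecolor='cornflowerblue')
ax.bar(m_bins[:-1]+width, m_heights, width=width, facecolor='seagreen')
<BarContainer object of 25 artists>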

Univariate vs multivariate imputation visualization

import seaborn as sns

df = pd.concat([housing_s['total_bedrooms'], housing_m['total_bedrooms']], axis=1, keys=['total_bedrooms_s', 'total_bedrooms_m'])
sns.histplot(df.melt(), x='value', hue='variable', multiple='dodge', shrink=.75, bins=25);

Univariate vs multivariate imputation visualization by Seaborn
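The histograms compare whole columns, and most of the values in them are observed data that both imputers leave unchanged. One way to look only at the imputed entries is to hide some known values and measure how far each imputer lands from them. The sketch below does this on synthetic data, not on the housing dataset. The linear relation between `rooms` and `bedrooms`, the noise level and the number of hidden rows are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500

# Synthetic, strongly related features (an assumption, not real housing data).
rooms = rng.uniform(500, 5000, size=n)
bedrooms = 0.2 * rooms + rng.normal(0, 20, size=n)
X_true = np.column_stack([rooms, bedrooms])

# Hide 50 known bedroom values so the imputed values can be checked.
X_missing = X_true.copy()
missing_rows = rng.choice(n, size=50, replace=False)
X_missing[missing_rows, 1] = np.nan

X_s = SimpleImputer(strategy="median").fit_transform(X_missing)
X_m = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)

# Mean absolute error on the hidden entries only.
mae_s = np.abs(X_s[missing_rows, 1] - X_true[missing_rows, 1]).mean()
mae_m = np.abs(X_m[missing_rows, 1] - X_true[missing_rows, 1]).mean()
print(f"median imputation MAE:    {mae_s:.1f}")
print(f"iterative imputation MAE: {mae_m:.1f}")
```

On data built like this the multivariate error should be clearly smaller, but that follows from how strongly the two features are related. It is not a statement about the housing data, where the same check would have to be run on rows with known total_bedrooms.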

Author: wenvenn
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: wenvenn.