Data Preparation
This note uses the widely used California Housing Prices dataset to walk through several ways to handle text and categorical attributes (or labels), since many machine learning algorithms prefer to work with numerical data.
The first step is to download the data from Kaggle and read it with pandas.
import pandas as pd
housing = pd.read_csv('housing.csv')
housing.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing.dtypes
longitude float64
latitude float64
housing_median_age float64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64
ocean_proximity object
dtype: object
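Before choosing an encoder, it helps to confirm which columns are non-numeric and how many distinct categories they hold. A minimal sketch, using a small synthetic sample in place of the full dataset:

```python
import pandas as pd

# Small synthetic sample standing in for the full California Housing data
housing = pd.DataFrame({
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY', '<1H OCEAN']
})

# value_counts() lists the distinct categories and their frequencies,
# which is useful before deciding on an encoding strategy
counts = housing['ocean_proximity'].value_counts()
print(counts)
```

On the real dataset, `housing['ocean_proximity'].value_counts()` would reveal the five categories and, for instance, that ISLAND is very rare (5 rows).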
Encode categorical attributes
OrdinalEncoder can be used to convert categorical features to integers (0 to n_categories - 1).
from sklearn.preprocessing import OrdinalEncoder
o_encoder = OrdinalEncoder()
housing_o_encoded = o_encoder.fit_transform(housing[['ocean_proximity']])
import numpy as np
np.unique(housing_o_encoded, return_counts=True)
(array([0., 1., 2., 3., 4.]),
array([9136, 6551, 5, 2290, 2658], dtype=int64))
np.unique(housing[['ocean_proximity']], return_counts=True)
(array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object),
array([9136, 6551, 5, 2290, 2658], dtype=int64))
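The mapping from integer code back to category is stored in the encoder's `categories_` attribute (categories are sorted alphabetically by default), and `inverse_transform` recovers the original strings. A small sketch on a synthetic sample:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Synthetic sample with three of the five categories
housing = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND', 'ISLAND']})

o_encoder = OrdinalEncoder()
encoded = o_encoder.fit_transform(housing[['ocean_proximity']])

# categories_ holds the learned (sorted) category list; integer i maps to categories_[0][i]
print(o_encoder.categories_[0])
# inverse_transform maps the integer codes back to the original strings
print(o_encoder.inverse_transform(encoded))
```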
One issue with OrdinalEncoder is that machine learning algorithms will assume two nearby values are more similar than two distant values. That is not the case for 'ocean_proximity' in this dataset (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To avoid this issue, OneHotEncoder can be used instead: it creates one binary attribute per category, equal to 1 (hot) for the matching category and 0 (cold) for all the others.
from sklearn.preprocessing import OneHotEncoder
oh_encoder = OneHotEncoder()
housing_oh_encoded = oh_encoder.fit_transform(housing[['ocean_proximity']])
housing_oh_encoded.toarray().shape
(20640, 5)
housing[['ocean_proximity']].shape
(20640, 1)
np.unique(housing_oh_encoded.toarray(), return_counts=True, axis=0)
(array([[0., 0., 0., 0., 1.],
[0., 0., 0., 1., 0.],
[0., 0., 1., 0., 0.],
[0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0.]]),
array([2658, 2290, 5, 6551, 9136], dtype=int64))
OneHotEncoder also has limitations, especially when a categorical attribute has a large number of possible categories (e.g., post code, profession, species). In that case, one-hot encoding introduces a large number of input features, which can slow down training and degrade performance. It may then be better to replace the categorical attribute with a useful numerical feature related to the categories: for example, 'ocean_proximity' could be replaced with the distance to the ocean.
Encode categorical labels
LabelEncoder can be used to encode target labels with values between 0 and n_classes - 1. It works more or less the same as OrdinalEncoder, but it is meant for a 1D target array rather than 2D feature columns.
from sklearn.preprocessing import LabelEncoder
l_encoder = LabelEncoder()
# LabelEncoder expects a 1D array, so pass the Series (passing a
# single-column DataFrame triggers a shape warning)
housing_l_encoded = l_encoder.fit_transform(housing['ocean_proximity'])
np.unique(housing_l_encoded, return_counts=True)
(array([0, 1, 2, 3, 4]), array([9136, 6551, 5, 2290, 2658], dtype=int64))
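As with OrdinalEncoder, the learned classes are stored on the fitted encoder (in `classes_`) and the codes can be mapped back with `inverse_transform`. A sketch on a synthetic sample:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic sample with three of the five categories
housing = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND', 'ISLAND']})

l_encoder = LabelEncoder()
labels = l_encoder.fit_transform(housing['ocean_proximity'])

# classes_ holds the learned classes in sorted order
print(l_encoder.classes_)
# inverse_transform recovers the original string labels
print(l_encoder.inverse_transform(labels))
```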
Reference:
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow