Analysis with text and categorical attributes or labels


Data Preparation

This post uses the widely used California Housing Prices dataset to note down several ways to handle text and categorical attributes or labels, since most machine learning algorithms prefer to work with numerical data.

The first step is to download the data from Kaggle and read it with Pandas.

import pandas as pd

housing = pd.read_csv('housing.csv')
housing.head()

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY
housing.dtypes
longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object
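
Only 'ocean_proximity' is non-numerical. Before encoding it, a standard Pandas call gives a quick look at its categories and how often each one occurs (the counts below match the np.unique output shown later):

# Count how many districts fall into each ocean_proximity category
housing['ocean_proximity'].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5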

Encode categorical attributes

OrdinalEncoder can be used to convert categorical features to integers (from 0 to n_categories - 1).

from sklearn.preprocessing import OrdinalEncoder

o_encoder = OrdinalEncoder()
housing_o_encoded = o_encoder.fit_transform(housing[['ocean_proximity']])
import numpy as np
np.unique(housing_o_encoded, return_counts=True)
(array([0., 1., 2., 3., 4.]),
 array([9136, 6551,    5, 2290, 2658], dtype=int64))
np.unique(housing[['ocean_proximity']], return_counts=True)
(array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object),
 array([9136, 6551,    5, 2290, 2658], dtype=int64))
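
The mapping from integer code to original category is stored in the fitted encoder's categories_ attribute (a list with one array per encoded column), which makes the counts above easier to read:

# Categories are listed in the order of their integer codes: 0 = '<1H OCEAN', 1 = 'INLAND', ...
o_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]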

One issue with OrdinalEncoder is that machine learning algorithms will assume two nearby values are more similar than two distant values. This is not the case for 'ocean_proximity' in this dataset: for example, categories 0 ('<1H OCEAN') and 4 ('NEAR OCEAN') are clearly more similar than categories 0 and 1 ('INLAND'). To handle this issue, OneHotEncoder creates one binary attribute per category, so that for each sample only the attribute of its category is equal to 1 (hot), while the others are 0 (cold).

from sklearn.preprocessing import OneHotEncoder

oh_encoder = OneHotEncoder()
housing_oh_encoded = oh_encoder.fit_transform(housing[['ocean_proximity']])
housing_oh_encoded.toarray().shape
(20640, 5)
housing[['ocean_proximity']].shape
(20640, 1)
np.unique(housing_oh_encoded.toarray(), return_counts=True, axis=0)
(array([[0., 0., 0., 0., 1.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.],
        [1., 0., 0., 0., 0.]]),
 array([2658, 2290,    5, 6551, 9136], dtype=int64))
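
To keep the encoded data readable, the one-hot matrix can be wrapped back into a DataFrame with named columns. The sketch below assumes a scikit-learn version that provides get_feature_names_out (older versions expose get_feature_names instead):

# Wrap the one-hot matrix in a DataFrame with one named column per category
housing_oh_df = pd.DataFrame(
    housing_oh_encoded.toarray(),
    columns=oh_encoder.get_feature_names_out(['ocean_proximity']),
    index=housing.index,
)
housing_oh_df.head()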

OneHotEncoder also has limitations, especially when a categorical attribute has a large number of possible categories (e.g., post code, profession, species). In that case, OneHotEncoder introduces a large number of input features, which can slow down training and degrade performance. It is then probably better to replace the categorical attribute with useful numerical features related to the categories: for example, 'ocean_proximity' could be replaced with the distance to the ocean, as sketched below.
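
The dataset itself does not contain actual distances, so the following is only an illustrative sketch: each category is mapped to a hand-picked, made-up approximate distance; a real feature would be computed from the district coordinates and the coastline.

# Illustrative only: the distances below are invented for the sketch,
# not derived from the data.
approx_distance_km = {
    'ISLAND': 0.0,
    'NEAR OCEAN': 2.0,
    'NEAR BAY': 5.0,
    '<1H OCEAN': 20.0,
    'INLAND': 100.0,
}
housing['ocean_distance_km'] = housing['ocean_proximity'].map(approx_distance_km)
housing[['ocean_proximity', 'ocean_distance_km']].head()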

Encode categorical labels

LabelEncoder can be used to encode target labels with values between 0 and n_classes - 1. It works more or less the same as OrdinalEncoder, but it is intended for labels (y) rather than input features (X).

from sklearn.preprocessing import LabelEncoder
# LabelEncoder expects a 1D array of labels; passing a one-column DataFrame
# raises a DataConversionWarning, which is silenced here.
import warnings
warnings.filterwarnings('ignore')

l_encoder = LabelEncoder()
housing_l_encoded = l_encoder.fit_transform(housing[['ocean_proximity']])
np.unique(housing_l_encoded, return_counts=True)
(array([0, 1, 2, 3, 4]), array([9136, 6551,    5, 2290, 2658], dtype=int64))
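
As a quick sketch of going back from integers to the original text labels, the fitted encoder's classes_ attribute stores the category for each integer code, and inverse_transform reverses the encoding (the first five districts are all 'NEAR BAY', as seen in the head of the dataset):

# Mapping from integer code to original label
l_encoder.classes_
array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)

# Decode the first five encoded labels back to their text form
l_encoder.inverse_transform(housing_l_encoded[:5])
array(['NEAR BAY', 'NEAR BAY', 'NEAR BAY', 'NEAR BAY', 'NEAR BAY'],
      dtype=object)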

Reference:

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow
