Data Preparation
This note uses the widely used California Housing Prices dataset to walk through several ways to handle text and categorical attributes (or labels), since many machine learning algorithms prefer to work with numerical data.
The first step is to download the data from Kaggle and read it with pandas.
import pandas as pd
housing = pd.read_csv('housing.csv')
housing.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing.dtypes
longitude float64
latitude float64
housing_median_age float64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64
ocean_proximity object
dtype: object
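Before choosing an encoder, it helps to confirm which columns are non-numeric and how many distinct categories they hold. A minimal sketch, using a small synthetic sample in place of the full dataset:

```python
import pandas as pd

# Small synthetic sample standing in for the full California Housing data
housing = pd.DataFrame({
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY', '<1H OCEAN']
})

# value_counts() lists the distinct categories and their frequencies,
# which is useful before deciding on an encoding strategy
counts = housing['ocean_proximity'].value_counts()
print(counts)
```

On the real dataset, `housing['ocean_proximity'].value_counts()` would reveal the five categories and, for instance, that ISLAND is very rare (5 rows).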
Encode categorical attributes
OrdinalEncoder can be used to convert categorical features to integers (0 to n_categories - 1).
from sklearn.preprocessing import OrdinalEncoder
o_encoder = OrdinalEncoder()
housing_o_encoded = o_encoder.fit_transform(housing[['ocean_proximity']])
import numpy as np
np.unique(housing_o_encoded, return_counts=True)
(array([0., 1., 2., 3., 4.]),
array([9136, 6551, 5, 2290, 2658], dtype=int64))
np.unique(housing[['ocean_proximity']], return_counts=True)
(array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object),
array([9136, 6551, 5, 2290, 2658], dtype=int64))
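The mapping from integer code back to category is stored in the encoder's `categories_` attribute (categories are sorted alphabetically by default), and `inverse_transform` recovers the original strings. A small sketch on a synthetic sample:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Synthetic sample with three of the five categories
housing = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND', 'ISLAND']})

o_encoder = OrdinalEncoder()
encoded = o_encoder.fit_transform(housing[['ocean_proximity']])

# categories_ holds the learned (sorted) category list; integer i maps to categories_[0][i]
print(o_encoder.categories_[0])
# inverse_transform maps the integer codes back to the original strings
print(o_encoder.inverse_transform(encoded))
```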
One issue with OrdinalEncoder is that machine learning algorithms will assume two nearby values are more similar than two distant values. That is not the case for 'ocean_proximity' in this dataset (for example, categories 0 and 4 are clearly more similar than categories 0 and 1). To avoid this issue, OneHotEncoder can be used instead: it creates one binary attribute per category, equal to 1 (hot) for the matching category and 0 (cold) for all the others.
from sklearn.preprocessing import OneHotEncoder
oh_encoder = OneHotEncoder()
housing_oh_encoded = oh_encoder.fit_transform(housing[['ocean_proximity']])
housing_oh_encoded.toarray().shape
(20640, 5)
housing[['ocean_proximity']].shape
(20640, 1)
np.unique(housing_oh_encoded.toarray(), return_counts=True, axis=0)
(array([[0., 0., 0., 0., 1.],
[0., 0., 0., 1., 0.],
[0., 0., 1., 0., 0.],
[0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0.]]),
array([2658, 2290, 5, 6551, 9136], dtype=int64))
OneHotEncoder also has limitations, especially when a categorical attribute has a large number of possible categories (e.g., post code, profession, species). In that case, one-hot encoding introduces a large number of input features, which can slow down training and degrade performance. It may then be better to replace the categorical attribute with a useful numerical feature related to the categories: for example, 'ocean_proximity' could be replaced with the distance to the ocean.
Encode categorical labels
LabelEncoder can be used to encode target labels with values between 0 and n_classes - 1. It works more or less the same as OrdinalEncoder, but it is meant for a 1D target array rather than 2D feature columns.
from sklearn.preprocessing import LabelEncoder
l_encoder = LabelEncoder()
# LabelEncoder expects a 1D array, so pass the Series (passing a
# single-column DataFrame triggers a shape warning)
housing_l_encoded = l_encoder.fit_transform(housing['ocean_proximity'])
np.unique(housing_l_encoded, return_counts=True)
(array([0, 1, 2, 3, 4]), array([9136, 6551, 5, 2290, 2658], dtype=int64))
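As with OrdinalEncoder, the learned classes are stored on the fitted encoder (in `classes_`) and the codes can be mapped back with `inverse_transform`. A sketch on a synthetic sample:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic sample with three of the five categories
housing = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND', 'ISLAND']})

l_encoder = LabelEncoder()
labels = l_encoder.fit_transform(housing['ocean_proximity'])

# classes_ holds the learned classes in sorted order
print(l_encoder.classes_)
# inverse_transform recovers the original string labels
print(l_encoder.inverse_transform(labels))
```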
Reference:
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow