Dealing with Categorical data

Data preprocessing is one of the most time-consuming parts of any ML application. The data here is categorical, and there are several ways to pre-process categorical data. I have used LabelEncoder, a class available in scikit-learn's preprocessing module, which maps each distinct label to an integer code.

For more details, refer to the scikit-learn documentation for LabelEncoder.

In [21]:
import numpy as np
import pandas as pd

Load Data

In [22]:
Traindata = pd.read_csv('data_1.csv',header = None)
In [24]:
rawdata = Traindata
Traindata
Out[24]:
       0  1     2
0    abd  0  op3e
1    pqr  1  i93e
2    lmn  3  op3e
3   d3e9  4  klij
4    pqr  2  klij
5   d3e9  4  klij
6    abd  3  op3e
7   d3e9  4  i93e
8    abd  2  i93e
9   d3e9  1  i93e
10   pqr  0  op3e

Importing Library

In [25]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
In [26]:
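# Cast everything to str: recent scikit-learn versions reject a mix of int and str labels
Traindata = np.array(Traindata).astype(str)
a = Traindata
m, n = a.shape
# Flatten the whole table into a single 1-D vector of labels
b = a.reshape(m * n)
r1, c1 = Traindata.shape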

Note: LabelEncoder accepts only a vector (a 1-D array of labels). That is why the whole table is reshaped into the single vector b before being passed to le.fit().

In [27]:
le.fit(b)
Out[27]:
LabelEncoder()
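
The note above can be checked with a minimal sketch. The three-row `toy` table and the name `toy_le` are hypothetical stand-ins, not the contents of data_1.csv; the sketch only assumes that `fit` rejects a 2-D array with more than one column and accepts the flattened vector.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical stand-in table (not the real data_1.csv), all labels as strings
toy = np.array([["abd", "0", "op3e"],
                ["pqr", "1", "i93e"],
                ["abd", "1", "op3e"]])

toy_le = preprocessing.LabelEncoder()

# Passing the 2-D table directly is rejected
try:
    toy_le.fit(toy)
    rejected_2d = False
except ValueError:
    rejected_2d = True

# Flattening to a 1-D vector works; one encoder now covers every column
toy_le.fit(toy.ravel())
n_classes = len(toy_le.classes_)
print(rejected_2d, n_classes)
```

Fitting on the flattened vector means all columns share one set of codes, which is the design used in this notebook.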

Encoding each column (because every entry in a given column belongs to the same feature)

In [28]:
# Encode one column at a time with the encoder fitted on all labels
TRAINDATA = np.zeros((r1, c1))
for i in range(n):
    TRAINDATA[:, i] = le.transform(Traindata[:, i])

Pre-processed data

In [29]:
TRAINDATA
Out[29]:
array([[  5.,   0.,  10.],
       [ 11.,   1.,   7.],
       [  9.,   3.,  10.],
       [  6.,   4.,   8.],
       [ 11.,   2.,   8.],
       [  6.,   4.,   8.],
       [  5.,   3.,  10.],
       [  6.,   4.,   7.],
       [  5.,   2.,   7.],
       [  6.,   1.,   7.],
       [ 11.,   0.,  10.]])
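
To read the codes in this array, it helps to look at the label-to-code mapping. The sketch below is an illustration under one assumption: the unique labels are copied by hand from the raw table and written as strings, and `demo_le` and `mapping` are names introduced here. A fitted encoder stores its sorted unique labels in `classes_`, and a label's code is its position in that array.

```python
import numpy as np
from sklearn import preprocessing

# Unique labels copied from the raw table, all written as strings (assumption)
labels = np.array(["0", "1", "2", "3", "4",
                   "abd", "pqr", "lmn", "d3e9",
                   "op3e", "i93e", "klij"])

demo_le = preprocessing.LabelEncoder().fit(labels)

# classes_ is sorted; the index of a label in classes_ is its integer code
mapping = {str(label): int(code) for code, label in enumerate(demo_le.classes_)}
print(mapping)
```

Under that assumption, the digits take codes 0 to 4 and the text labels follow in alphabetical order, which agrees with the array shown above (for example, abd is 5 and pqr is 11).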

Comparing Raw data and Pre-processed data

In [33]:
# Raw data
rawdata
Out[33]:
       0  1     2
0    abd  0  op3e
1    pqr  1  i93e
2    lmn  3  op3e
3   d3e9  4  klij
4    pqr  2  klij
5   d3e9  4  klij
6    abd  3  op3e
7   d3e9  4  i93e
8    abd  2  i93e
9   d3e9  1  i93e
10   pqr  0  op3e
In [32]:
# Preprocessed data
TRAINDATA
Out[32]:
array([[  5.,   0.,  10.],
       [ 11.,   1.,   7.],
       [  9.,   3.,  10.],
       [  6.,   4.,   8.],
       [ 11.,   2.,   8.],
       [  6.,   4.,   8.],
       [  5.,   3.,  10.],
       [  6.,   4.,   7.],
       [  5.,   2.,   7.],
       [  6.,   1.,   7.],
       [ 11.,   0.,  10.]])
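
The comparison can also be done programmatically by decoding the integer codes back to labels with `inverse_transform` and checking that the original table is recovered. This is a sketch on a hypothetical three-row table (`toy`, `toy_le`, `encoded` and `decoded` are names introduced here), not the notebook's actual data.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical stand-in table (not the real data_1.csv), all labels as strings
toy = np.array([["abd", "0", "op3e"],
                ["pqr", "1", "i93e"],
                ["abd", "1", "op3e"]])

# One encoder fitted on every entry of the table
toy_le = preprocessing.LabelEncoder().fit(toy.ravel())

# Encode column by column; integer dtype so the codes can be decoded later
encoded = np.zeros(toy.shape, dtype=int)
for i in range(toy.shape[1]):
    encoded[:, i] = toy_le.transform(toy[:, i])

# Decode each column and stack the columns back into a table
decoded = np.column_stack([toy_le.inverse_transform(encoded[:, i])
                           for i in range(toy.shape[1])])

roundtrip_ok = bool((decoded == toy).all())
print(encoded)
print(roundtrip_ok)
```

Note that the codes are stored as integers in this sketch; `inverse_transform` indexes into `classes_`, so a float array such as the one produced by `np.zeros` without a dtype would need to be cast with `.astype(int)` first.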
