PreProcessing
Missing values on numerical data
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
imp=SimpleImputer()
X_train = [[np.nan, 1, 2], [3, np.nan, 4], [5, np.nan, 6]]
X_test = [[np.nan, 10, 10], [120, np.nan, 600], [10, np.nan, 30]]
X_train= imp.fit_transform(X_train)
X_test=imp.transform(X_test)
print(X_train)
print(X_test)
Output:
[[4. 1. 2.]
[3. 1. 4.]
[5. 1. 6.]]
[[ 4. 10. 10.]
[120. 1. 600.]
[ 10. 1. 30.]]
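SimpleImputer defaults to strategy="mean", which is why the NaNs above were filled with each column's training mean. Other strategies are available; a minimal sketch using the same training data:

```python
from sklearn.impute import SimpleImputer
import numpy as np

X = [[np.nan, 1, 2], [3, np.nan, 4], [5, np.nan, 6]]

# Fill each NaN with the column median instead of the mean
imp_median = SimpleImputer(strategy="median")
print(imp_median.fit_transform(X))

# Fill with the most frequent value per column (this one also works for strings)
imp_mode = SimpleImputer(strategy="most_frequent")
print(imp_mode.fit_transform(X))
```

For this small dataset the median happens to equal the mean in every column, but on skewed data with outliers the two can differ a lot, and the median is usually the more robust choice.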
Encoding on categorical data
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit_transform(["paris", "paris", "tokyo", "amsterdam"])
Output:
array([1, 1, 2, 0])
There is a problem here: the model would conclude that tokyo (2) is greater than paris (1), which in turn is greater than amsterdam (0). That is not the case; these are nominal categories with no natural order, so they cannot be compared. Integer encoding like this is only appropriate for genuinely ordered categories, such as sizes: small, medium, large.
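For ordered categories like sizes, an integer encoding really is meaningful. A small sketch using scikit-learn's OrdinalEncoder, where we pass the categories explicitly so the order small < medium < large is fixed rather than alphabetical:

```python
from sklearn.preprocessing import OrdinalEncoder

# Sizes ARE ordered, so mapping them to increasing integers is meaningful.
sizes = [["small"], ["large"], ["medium"], ["small"]]
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(enc.fit_transform(sizes))
# small -> 0.0, medium -> 1.0, large -> 2.0
```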
We should prevent the machine learning model from reading an order into the city labels. For this we use dummy variables, which can be created in two ways.
using get_dummies from pandas
dummy=pd.get_dummies(["paris", "paris", "tokyo", "amsterdam"])
dummy
Output:
amsterdam paris tokyo
0 0 1 0
1 0 1 0
2 0 0 1
3 1 0 0
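One detail worth knowing: with k categories, only k-1 dummy columns are needed, since the last one is implied by the others. pandas supports this via the drop_first parameter, which helps avoid perfectly correlated columns in linear models:

```python
import pandas as pd

cities = ["paris", "paris", "tokyo", "amsterdam"]
# drop_first=True drops the first (alphabetical) category's column;
# a row of all zeros then means "amsterdam"
dummies = pd.get_dummies(cities, drop_first=True)
print(dummies)
```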
using OneHotEncoder from sklearn
from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder()
OneHotEncoder expects a 2-D array with one sample per row, so each city goes in its own inner list
cat=[["paris"], ["paris"], ["tokyo"], ["amsterdam"]]
cate=ohe.fit_transform(cat).toarray()
print(cate)
Output:
[[0. 1. 0.]
[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]]
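A related pitfall: a category that appears at prediction time but was never seen during fit. By default OneHotEncoder raises an error; its handle_unknown option lets you encode unseen categories as an all-zero row instead. A short sketch:

```python
from sklearn.preprocessing import OneHotEncoder

cities = [["paris"], ["paris"], ["tokyo"], ["amsterdam"]]
# handle_unknown="ignore" encodes an unseen category as all zeros
# instead of raising an error at transform time
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(cities)
print(ohe.transform([["berlin"]]).toarray())  # [[0. 0. 0.]]
```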
Splitting the data into train and test
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
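For classification problems, especially with imbalanced classes, it is worth passing stratify so both splits keep the same class proportions. A minimal sketch:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape((10, 2))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# stratify=y keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
print(sorted(y_te))  # [0, 0, 1, 1]
```

Without stratify, a random split could put, say, three samples of one class and one of the other into the test set; random_state just makes the split reproducible.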
Feature Scaling
Feature scaling matters for many ML models: models based on Euclidean distance (such as k-NN and k-means) are otherwise dominated by the feature with the largest scale, and gradient-based models converge faster on scaled data.
X_train and X_test should be scaled on the same basis, so we call fit_transform on the training set and only transform on the test set.
The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data.
Now, in a real application, the new, unseen data could be just 1 data point that we want to classify. (How do we estimate mean and standard deviation if we have only 1 data point?) That’s an intuitive case to show why we need to keep and use the training data parameters for scaling the test set.
from sklearn.preprocessing import StandardScaler
X_train = [[0, 0], [0, 0], [2, 10], [91, 199]]
X_test=[[187,190], [91, 19]]
scaler = StandardScaler()
X_t= scaler.fit_transform(X_train)
X_tes=scaler.transform(X_test)
print(X_t)
print(X_tes)
Output:
[[-0.59426437 -0.61597805]
[-0.59426437 -0.61597805]
[-0.54314485 -0.49808751]
[ 1.73167358 1.73004362]]
[[ 4.18541032 1.62394213]
[ 1.73167358 -0.39198603]]
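StandardScaler is not the only option. MinMaxScaler maps each training feature into [0, 1]; the same fit-on-train, transform-on-test rule applies, so test values outside the training range land outside [0, 1]. A sketch on the same data:

```python
from sklearn.preprocessing import MinMaxScaler

X_train = [[0, 0], [0, 0], [2, 10], [91, 199]]
X_test = [[187, 190], [91, 19]]

# Each feature is rescaled using the min and max seen in X_train only
mm = MinMaxScaler()
print(mm.fit_transform(X_train))
print(mm.transform(X_test))  # 187 > train max of 91, so it maps above 1.0
```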
