Data Preprocessing in Python


1
Data Preprocessing in Python
Ahmedul Kabir
TA, CS 548, Spring 2015
2
Preprocessing Techniques Covered
Standardization and Normalization
Missing value replacement
Resampling
Discretization
Feature Selection
Dimensionality Reduction: PCA
3
Python Packages/Tools for Data Mining
Scikit-learn
Orange
Pandas
MLPy
MDP
PyBrain … and many more
4
Some Other Basic Packages
 NumPy and SciPy
 Fundamental packages for scientific computing with Python
 Contain a powerful n-dimensional array object
 Provide useful linear algebra, random number generation, and other capabilities
 Pandas
 Contains useful data structures and algorithms
 Matplotlib
 Contains functions for plotting/visualizing data
5
Standardization and Normalization
 Standardization: to transform data so that it has zero mean and unit variance.
Also called scaling
 Use function sklearn.preprocessing.scale()
 Parameters:
 X: Data to be scaled
 with_mean: Boolean. Whether to center the data (make it zero mean)
 with_std: Boolean. Whether to scale the data to unit standard deviation
 Normalization: to transform data so that each sample (row) has unit norm.
(To scale features to the [0, 1] range instead, use sklearn.preprocessing.MinMaxScaler.)
 Use function sklearn.preprocessing.normalize()
 Parameters:
 X: Data to be normalized
 norm: which norm to use: l1 or l2
 axis: whether to normalize by row or column
6
Example code of Standardization/Scaling
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
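Slide 5 also mentions normalize() and the [0, 1] range; the two are different operations, so a short sketch may help keep them apart. This assumes a recent scikit-learn; normalize() rescales samples (rows) to unit norm, while MinMaxScaler rescales features (columns) into [0, 1].

```python
from sklearn import preprocessing
import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# normalize(): each row (sample) is rescaled to unit L2 norm
X_l2 = preprocessing.normalize(X, norm='l2')
print(np.linalg.norm(X_l2, axis=1))   # every row norm is now 1.0

# MinMaxScaler: each column (feature) is mapped into the [0, 1] range
X_minmax = preprocessing.MinMaxScaler().fit_transform(X)
print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # per-column 0s and 1s
```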
7
Missing Value Replacement
 In scikit-learn, this is referred to as “Imputation”
 Use class sklearn.preprocessing.Imputer
 Important parameters:
 strategy: What to replace the missing value with: mean / median / most_frequent
 axis: Whether to impute along columns (0) or rows (1)
 Attribute:
 statistics_ : The fill value computed for each feature
 Important methods:
 fit(X[, y]): Fit the imputer on X.
 transform(X): Replace all the missing values in X.
8
Example code for Replacing Missing Values
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.     2.   ]
 [ 6.     3.666...]
 [ 7.     6.   ]]
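Note that the Imputer class above reflects the scikit-learn API of this deck's era; it was later removed (in version 0.22) in favor of sklearn.impute.SimpleImputer. The same example with the newer class, as a sketch assuming scikit-learn >= 0.20:

```python
import numpy as np
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
print(imp.statistics_)        # column means learned during fit: [4.  3.666...]

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))       # NaNs replaced by the corresponding column mean
```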
9
Resampling
 Use function sklearn.utils.resample
 Important parameters:
 n_samples: No. of samples to keep
 replace: Boolean. Whether to resample with or without replacement
 Returns a sequence of resampled views of the collections. The
original arrays are not impacted.
 Another useful function is sklearn.utils.shuffle
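A minimal sketch of both utilities, assuming a recent scikit-learn. When several arrays are passed, resample() and shuffle() apply the same row selection to each, which keeps features and labels aligned:

```python
import numpy as np
from sklearn.utils import resample, shuffle

X = np.arange(10).reshape(5, 2)        # five samples, two features
y = np.array([0, 0, 1, 1, 1])

# Draw 3 samples without replacement; X and y are resampled consistently
X_r, y_r = resample(X, y, n_samples=3, replace=False, random_state=42)
print(X_r.shape)                       # (3, 2)

# shuffle() permutes the rows of each array in the same random order
X_s, y_s = shuffle(X, y, random_state=42)
print(X_s.shape)                       # (5, 2); original X is untouched
```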
10
Discretization
 Scikit-learn doesn’t have a direct class that performs
discretization.
 Can be performed with the cut and qcut functions available
in pandas.
 Orange has discretization functions in
Orange.feature.discretization
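A short sketch of the two pandas functions, assuming a recent pandas (recent scikit-learn versions also added sklearn.preprocessing.KBinsDiscretizer for this). cut() bins by fixed value boundaries; qcut() bins by quantiles so each bin holds roughly the same number of samples:

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 45, 70, 88])

# cut(): discretize using explicit bin edges
groups = pd.cut(ages, bins=[0, 18, 65, 100],
                labels=['child', 'adult', 'senior'])
print(list(groups))   # ['child', 'child', 'adult', 'adult', 'senior', 'senior']

# qcut(): discretize into 2 equal-frequency bins (split at the median)
halves = pd.qcut(ages, q=2, labels=['lower', 'upper'])
print(list(halves))   # ['lower', 'lower', 'lower', 'upper', 'upper', 'upper']
```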
11
Feature Selection
 The sklearn.feature_selection module implements feature selection
algorithms.
 Some classes in this module are:
 GenericUnivariateSelect: Univariate feature selector based on statistical tests.
 SelectKBest: Select features according to the k highest scores.
 RFE: Feature ranking with recursive feature elimination.
 VarianceThreshold: Feature selector that removes all low-variance features.
 Scikit-learn does not have a CFS (Correlation-based Feature Selection) implementation,
but RFE works in a somewhat similar fashion.
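A minimal sketch of two of the selectors listed above on the iris data, assuming a recent scikit-learn (the f_classif ANOVA score function is one of several scoring options SelectKBest accepts):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X, y = load_iris(return_X_y=True)      # 150 samples, 4 features

# Keep the 2 features with the highest ANOVA F-scores against the labels
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)
print(X_best.shape)                    # (150, 2)

# Unsupervised alternative: drop features whose variance is below a threshold
X_var = VarianceThreshold(threshold=0.5).fit_transform(X)
print(X_var.shape)                     # fewer columns, same rows
```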
12
Dimensionality Reduction: PCA
 The sklearn.decomposition module includes matrix decomposition
algorithms, including PCA
 sklearn.decomposition.PCA class
 Important parameters:
 n_components: No. of components to keep
 Important attributes:
 components_ : Principal axes in feature space, ordered by explained variance
 explained_variance_ratio_ : Percentage of variance explained by each of the selected components
 Important methods:
 fit(X[, y]): Fit the model with X.
 score_samples(X): Return the log-likelihood of each sample.
 transform(X): Apply the dimensionality reduction on X.
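A minimal PCA sketch on the iris data, assuming a recent scikit-learn. fit_transform() combines the fit and transform steps listed above:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 samples, 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # project the 4-D data onto 2 components
print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance kept per component
```

For iris, the first two components retain well over 90% of the total variance, which is why a 2-D projection of this dataset is still informative.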
13
Other Useful Information
 Generate a random permutation of the integers 0 … n-1:
numpy.random.permutation(n)
 You can randomly generate some toy datasets using Sample generators in
sklearn.datasets
 Scikit-learn doesn’t directly handle categorical/nominal attributes well. In
order to use them in the dataset, some sort of encoding needs to be
performed.
 One good way to encode categorical attributes: if there are n categories,
create n dummy binary variables representing each category.
 Can be done easily using the sklearn.preprocessing.OneHotEncoder class.
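A sketch of the dummy-variable encoding described above, assuming scikit-learn >= 0.20 (which accepts string categories directly; older versions required integer-encoded input first):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with 3 distinct categories
colors = np.array([['red'], ['green'], ['blue'], ['green']])

enc = OneHotEncoder()                         # returns a sparse matrix by default
onehot = enc.fit_transform(colors).toarray()  # densify for printing
print(enc.categories_)    # the categories found, in the column order used
print(onehot.shape)       # (4, 3): one binary dummy column per category
```

Each row contains exactly one 1, marking that sample's category.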
14
References
 Preprocessing Modules: http://scikit-learn.org/stable/modules/preprocessing.html
 Video Tutorial: http://conference.scipy.org/scipy2013/tutorial_detail.php?id=107
 Quick Start Tutorial: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
 User Guide: http://scikit-learn.org/stable/user_guide.html
 API Reference: http://scikit-learn.org/stable/modules/classes.html
 Example Gallery: http://scikit-learn.org/stable/auto_examples/index.html