Summer Course: Data Mining
Feature Selection, Dimensionality Reduction, and Clustering
Presenter: Georgi Nalbantov
August 2009
Structure
Feature Selection
Filtering approach
Wrapper approach
Embedded methods
Dimensionality Reduction
Principal Components Analysis (PCA)
Nonlinear PCA (Kernel PCA, CatPCA)
Multi-Dimensional Scaling (MDS)
Homogeneity Analysis
Clustering
Density estimation and clustering
K-means clustering
Hierarchical clustering
Clustering with Support Vector Machines (SVMs)
Feature Selection, Dimensionality Reduction, and
Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Feature Selection
In the presence of millions of features/attributes/inputs/variables,
select the most relevant ones.
Advantages: build better, faster, and easier to understand learning
machines.
(Figure: the n × m data matrix X, reduced to its m' most relevant feature columns.)
Feature Selection
Goal: select the two best
features individually
Any reasonable objective J will
rank the features
J(x1) > J(x2) = J(x3) > J(x4)
Thus, the chosen features would be {x1, x2} or {x1, x3}.
However, x4 is the only feature that provides complementary information to x1.
Feature Selection
Filtering approach:
ranks features or feature subsets independently of the predictor
(classifier).
…using univariate methods: consider one variable at a time
…using multivariate methods: consider more than one variable at a time
Wrapper approach:
uses a classifier to assess (many) features or feature subsets.
Embedded approach:
uses a classifier to build a (single) model with a subset of features
that are internally selected.
Feature Selection: univariate filtering approach
Issue: determine the relevance of a given single feature.
(Figure: class-conditional densities P(Xi | Y) of a single feature xi for the two classes, with their means and standard deviations.)
Feature Selection: univariate filtering approach
Issue: determine the relevance of a given single feature.
Under independence: P(X, Y) = P(X) P(Y)
Measure of dependence (Mutual Information):
MI(X, Y) = ∫∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY = KL( P(X, Y) || P(X) P(Y) )
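As an illustration, here is a minimal Python sketch of MI-based univariate filtering, assuming scikit-learn is available (the dataset and settings are illustrative only):

```python
# Sketch: rank features by estimated mutual information with the class label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                   # most informative feature first
print("MI estimates:", np.round(mi, 3))
print("feature ranking:", ranking)
```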
Feature Selection: univariate filtering approach
Correlation and MI
Note: correlation is a measure of linear dependence only, whereas mutual information also captures nonlinear dependence.
Feature Selection: univariate filtering approach
Correlation and MI under the Gaussian distribution: for jointly Gaussian variables with correlation ρ, MI(X, Y) = -(1/2) log(1 - ρ²), so MI is a monotonically increasing function of |ρ|.
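A small numerical check of this relation (a simulation sketch with a crude histogram plug-in estimate of MI; the sample size and bin count are arbitrary):

```python
# Check MI = -0.5 * log(1 - rho^2) for a bivariate Gaussian by simulation.
import numpy as np

rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
x, y = np.random.default_rng(0).multivariate_normal([0, 0], cov, 200_000).T

# crude plug-in MI estimate from a 2-D histogram
pxy, xe, ye = np.histogram2d(x, y, bins=60, density=True)
pxy *= np.diff(xe)[0] * np.diff(ye)[0]      # densities -> cell probabilities
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
nz = pxy > 0
mi_hat = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))

print("closed form :", -0.5 * np.log(1 - rho**2))   # about 0.51
print("histogram MI:", mi_hat)                      # roughly similar (biased upward)
```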
Feature Selection: univariate filtering approach.
Criteria for measuring dependence.
Feature Selection: univariate filtering approach
(Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = -1) of a single feature xi, with class means m+ and m- and standard deviations s+ and s-.)
Feature Selection: univariate filtering approach
P(Xi | Y = 1) = P(Xi | Y = -1): the feature xi carries no information about the class.
P(Xi | Y = 1) ≠ P(Xi | Y = -1): the feature xi is informative about the class.
(Figure: overlapping versus separated class-conditional densities along xi, with standard deviations s+ and s-.)
Feature Selection: univariate filtering approach
T-test: is the distance between the class means significant?
• Normally distributed classes with equal variance σ²; σ² is unknown and is estimated from the data as σ²_within.
• Null hypothesis H0: μ+ = μ-.
• T statistic: if H0 is true, then
  t = (μ+ - μ-) / ( σ_within · sqrt(1/m+ + 1/m-) ) ~ Student(m+ + m- - 2 d.f.),
  where m+ and m- denote the number of examples in the two classes.
(Figure: class-conditional densities along xi with standard deviations σ+ and σ-.)
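A minimal sketch of this univariate filter, assuming SciPy and scikit-learn (toy data; the pooled, equal-variance t-test matches the slide's assumptions):

```python
# Sketch: two-sample t-test per feature, then rank features by |t|.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=8, n_informative=2,
                           random_state=0)          # binary labels y in {0, 1}

t_stats, p_values = ttest_ind(X[y == 1], X[y == 0], axis=0)  # per-feature test
ranking = np.argsort(np.abs(t_stats))[::-1]                   # largest |t| first
print("p-values:", np.round(p_values, 4))
print("feature ranking by |t|:", ranking)
```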
Feature Selection: multivariate filtering approach
Guyon-Elisseeff, JMLR 2004; Springer 2006
Feature Selection: search strategies
Kohavi-John, 1997
N features, 2^N possible feature subsets!
Feature Selection: search strategies
Forward selection or backward elimination (a greedy forward-selection sketch follows this list).
Beam search: keep the k best paths at each step.
GSFS (generalized sequential forward selection): when (n-k) features are left, try all subsets of g features; more trainings at each step, but fewer steps.
PTA(l, r): plus l, take away r; at each step, run SFS l times, then SBS r times.
Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
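A minimal sketch of greedy forward selection, assuming scikit-learn; the classifier, the cross-validated accuracy score, and the target subset size are illustrative choices:

```python
# Sketch: sequential forward selection (SFS) wrapped around a classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

def subset_score(features):
    """Cross-validated accuracy of the classifier on a candidate feature subset."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                                   # greedily pick 3 features
    best = max(remaining, key=lambda f: subset_score(selected + [f]))
    selected.append(best)
    remaining.remove(best)
    print("selected:", selected, "score:", round(subset_score(selected), 3))
```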
Feature Selection: filters vs. wrappers vs. embedding
Main goal: rank subsets of useful features
Feature Selection: feature subset assessment (wrapper)
N variables/features, M samples.
Split the data into 3 sets: training, validation, and test (of sizes m1, m2, and m3).
1) For each feature subset, train
predictor on training data.
2) Select the feature subset that performs best on validation data.
Repeat and average if you want to
reduce variance (cross-validation).
3) Test on test data.
Danger of over-fitting with intensive search!
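A sketch of this assessment scheme with a single train/validation/test split (the candidate subsets and the SVM classifier are hypothetical placeholders; in practice the candidates come from a search strategy):

```python
# Sketch: wrapper-style subset assessment on a train/validation/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           random_state=1)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=1)

candidates = [[0, 1], [0, 1, 2], [2, 5, 7]]          # hypothetical feature subsets
best = max(candidates,
           key=lambda s: SVC().fit(X_tr[:, s], y_tr).score(X_val[:, s], y_val))

final = SVC().fit(X_tr[:, best], y_tr)               # refit on the best subset
print("best subset:", best, "test accuracy:", final.score(X_te[:, best], y_te))
```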
Feature Selection via Embedded Methods:
L1-regularization
The L1 penalty sum(|beta_j|) shrinks some coefficients exactly to zero, so the corresponding features are dropped.
(Figure: coefficient paths plotted against sum(|beta|).)
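A minimal sketch of embedded selection via the L1 penalty, assuming scikit-learn (the regularization strength C is illustrative):

```python
# Sketch: L1-penalized logistic regression; zero coefficients drop features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)

# C controls the strength of the sum(|beta|) penalty (smaller C = sparser model)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])          # features with non-zero coefficients
print("selected features:", kept)
```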
Feature Selection: summary
Linear methods – univariate: T-test or AUC feature ranking; multivariate: RFE with a linear SVM or LDA.
Non-linear methods – univariate: mutual information feature ranking; multivariate: nearest neighbors, neural nets, trees, SVM.
Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features.
Feature Selection:            X1, …, Xm  →  Xk1, …, Xkp
Dimensionality Reduction:     X1, …, Xm  →  f1(X1, …, Xm), …, fp(X1, …, Xm)
Dimensionality Reduction:
(Linear) Principal Components Analysis
PCA finds a linear mapping of dataset X to a dataset X' of lower dimensionality, such that the variance of X that is retained in X' is maximal.
In the illustration, dataset X is mapped to dataset X', here of the same dimensionality. The first dimension of X' (= the first principal component) is the direction of maximal variance; the second principal component is orthogonal to the first.
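A minimal PCA sketch via the singular value decomposition of the centered data matrix (toy data; two retained components):

```python
# Sketch: linear PCA by SVD of the centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

Xc = X - X.mean(axis=0)                    # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = s**2 / (X.shape[0] - 1)    # variance along each principal component
X_reduced = Xc @ Vt[:2].T                  # project onto the first two components

print("explained variance:", np.round(explained_var, 2))
print("reduced shape:", X_reduced.shape)
```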
Dimensionality Reduction:
Nonlinear (Kernel) Principal Components Analysis
Start from the original dataset X.
Map X to a higher-dimensional space and carry out linear PCA in that space.
(If necessary,) map the resulting principal components back to the original space.
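A sketch of this idea with scikit-learn's KernelPCA (the RBF kernel and gamma value are illustrative choices):

```python
# Sketch: kernel PCA with an RBF kernel, plus an approximate mapping back.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10,
                 fit_inverse_transform=True)      # enables the pre-image mapping
X_kpca = kpca.fit_transform(X)                    # nonlinear principal components
X_back = kpca.inverse_transform(X_kpca)           # approximate map back to input space

print(X_kpca.shape, X_back.shape)
```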
Dimensionality Reduction:
Multi-Dimensional Scaling
MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower-dimensional (for example, two-dimensional) space.
MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.
The error of the fit is measured using a so-called "stress" function; different choices of stress function are possible.
Dimensionality Reduction:
Multi-Dimensional Scaling
Raw stress function (identical to PCA):
  Stress_raw(Y) = Σ_{i<j} ( d_ij - || y_i - y_j || )²
Sammon cost function:
  Stress_Sammon(Y) = ( 1 / Σ_{i<j} d_ij ) · Σ_{i<j} ( d_ij - || y_i - y_j || )² / d_ij
where d_ij is the distance between observations i and j in the original space and y_i, y_j are their low-dimensional representations.
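A small sketch evaluating both stress functions for an arbitrary candidate configuration Y (random toy data; any embedding could be plugged in):

```python
# Sketch: raw stress and Sammon stress of a 2-D configuration Y versus
# the pairwise distances of the original data X.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))          # original high-dimensional points
Y = rng.normal(size=(50, 2))           # candidate 2-D configuration to evaluate

d_high = pdist(X)                      # pairwise distances in the original space
d_low = pdist(Y)                       # pairwise distances in the embedding

raw_stress = np.sum((d_high - d_low) ** 2)
sammon = np.sum((d_high - d_low) ** 2 / d_high) / d_high.sum()
print("raw stress:", round(raw_stress, 2), "Sammon stress:", round(sammon, 4))
```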
Dimensionality Reduction:
Multi-Dimensional Scaling (Example)
(Example: an input data table and the resulting low-dimensional MDS output configuration.)
Dimensionality Reduction:
Homogeneity analysis
Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.
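As a rough stand-in (not the homals alternating-least-squares algorithm itself), one can one-hot encode the categorical variables and take the SVD of the centered indicator matrix, which is closely related to multiple correspondence analysis:

```python
# Sketch: SVD of the centered indicator matrix of a small categorical dataset.
import numpy as np

X_cat = np.array([["a", "x"], ["b", "x"], ["a", "y"], ["c", "y"], ["b", "x"]])

blocks = []                               # build the one-hot indicator matrix
for j in range(X_cat.shape[1]):
    cats = np.unique(X_cat[:, j])
    blocks.append((X_cat[:, j][:, None] == cats[None, :]).astype(float))
G = np.hstack(blocks)

Gc = G - G.mean(axis=0)                   # center the indicator matrix
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
scores = U[:, :2] * s[:2]                 # 2-D object scores
print(np.round(scores, 3))
```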
Clustering:
Clustering vs. classification vs. regression
(Figure: three scatter plots in the (X1, X2) plane illustrating clustering of unlabeled points, classification of +/- labeled points, and regression.)
Typical methods per task:
Clustering: k-th nearest neighbour, Parzen window, unfolding, conjoint analysis, Cat-PCA
Classification: linear discriminant analysis, QDA, logistic regression (logit), decision trees, LS-SVM, NN, SVs
Regression: classical linear regression, ridge regression, NN, CART
Clustering
Clustering is an unsupervised learning technique.
Task: organize objects into groups whose members are similar in some way.
Clustering finds structure in a collection of unlabeled data.
A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
Density estimation and clustering
Bayesian separation curve (optimal)
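A sketch of the density-based view with a Gaussian mixture model, assuming scikit-learn (the number of components and the toy data are illustrative):

```python
# Sketch: estimate the density with a Gaussian mixture, then cluster by the
# most probable mixture component.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # density model
labels = gmm.predict(X)                   # assign each point to a component
print("average log-likelihood:", round(gmm.score(X), 3))
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```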
Clustering:
K-means clustering
Minimizes the sum of the squared distances to the cluster centers (reconstruction error).
Iterative process:
  Given the current centers, estimate the cluster assignments (construct the Voronoi partition).
  Given the new cluster assignments, set each cluster center to the center of mass of its points.
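A minimal NumPy sketch of this loop (toy data, fixed number of iterations, no convergence test or restart handling):

```python
# Sketch: k-means by alternating nearest-center assignment and mean update.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [3, 3], [0, 3])])   # three toy blobs

k = 3
centers = X[rng.choice(len(X), k, replace=False)]       # random initial centers
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)                            # Voronoi assignment
    centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])

error = ((X - centers[assign]) ** 2).sum()               # reconstruction error
print("centers:\n", np.round(centers, 2))
print("sum of squared distances:", round(error, 2))
```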
Clustering:
K-means clustering
(Figure: steps 1-4 of the k-means iteration on an example dataset.)
Clustering:
Hierarchical clustering
(Figure: dendrogram.)
Clustering based on (dis)similarities.
Multilevel clustering: level 1 has n clusters, level n has one cluster.
Agglomerative HC: starts with n clusters and merges them iteratively.
Divisive HC: starts with one cluster and divides it iteratively.
Disadvantage: a wrong merge or split cannot be undone.
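A minimal sketch of agglomerative clustering with SciPy (toy data; the linkage method and cut level are illustrative):

```python
# Sketch: build the merge tree with `linkage`, then cut it into 2 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])

Z = linkage(X, method="average")                  # the full dendrogram structure
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at 2 clusters
print("cluster sizes:", np.bincount(labels)[1:])
```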
Clustering:
Nearest Neighbor algorithm for hierarchical clustering
1. Nearest Neighbor, Level 2, k = 7 clusters.
2. Nearest Neighbor, Level 3, k = 6 clusters.
3. Nearest Neighbor, Level 4, k = 5 clusters.
Clustering:
Nearest Neighbor algorithm for hierarchical clustering
4. Nearest Neighbor, Level 5, k = 4 clusters.
5. Nearest Neighbor, Level 6, k = 3 clusters.
6. Nearest Neighbor, Level 7, k = 2 clusters.
Clustering:
Nearest Neighbor algorithm for hierarchical clustering
7. Nearest Neighbor, Level 8, k = 1 cluster.
Clustering:
Similarity measures for hierarchical clustering
Clustering:
Similarity measures for hierarchical clustering
Pearson Correlation: Trend Similarity
b = 0.5·a,  c = a + 0.2
C_pearson(a, b) = 1
C_pearson(a, c) = 1
C_pearson(b, c) = 1
Clustering:
Similarity measures for hierarchical clustering
Euclidean distance:
d(x, y) = sqrt( Σ_{n=1}^{N} (x_n - y_n)² ),  for x = (x_1, …, x_N) and y = (y_1, …, y_N)
Clustering:
Similarity measures for hierarchical clustering
Cosine correlation:
C_cosine(x, y) = ( Σ_{i=1}^{N} x_i·y_i ) / ( ||x||·||y|| ) = (x · y) / (||x|| ||y||),
for x = (x_1, …, x_N) and y = (y_1, …, y_N), with -1 ≤ C_cosine(x, y) ≤ +1
Clustering:
Similarity measures for hierarchical clustering
Cosine Correlation: Trend + Mean Distance
b = 0.5·a,  c = a + 0.2
C_cosine(a, b) = 1
C_cosine(a, c) = 0.9622
C_cosine(b, c) = 0.9622
Clustering:
Similarity measures for hierarchical clustering
b = 0.5·a,  c = a + 0.2
C_pearson(a, b) = 1      d(a, b) = 2.8025      C_cosine(a, b) = 1
C_pearson(a, c) = 1      d(a, c) = 1.5875      C_cosine(a, c) = 0.9622
C_pearson(b, c) = 1      d(b, c) = 3.2211      C_cosine(b, c) = 0.9622
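For completeness, a tiny sketch computing the three measures for arbitrary vectors (two hypothetical profiles, not the original a, b, c data):

```python
# Sketch: Pearson correlation, Euclidean distance and cosine similarity.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.5 * x + 0.2                                   # scaled and shifted profile

pearson = np.corrcoef(x, y)[0, 1]                   # trend similarity
euclid = np.linalg.norm(x - y)                      # absolute distance
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # trend + mean

print(round(pearson, 4), round(euclid, 4), round(cosine, 4))
```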
Clustering:
Similarity measures for hierarchical clustering
Similar?
C_pearson(a, b) = 0.1175      d(a, b) = 0.0279      C_cosine(a, b) = 0.7544
C_pearson(a, c) = 0.1244      d(a, c) = 0.0255      C_cosine(a, c) = 0.8092
C_pearson(b, c) = 0.1779      d(b, c) = 0.0236      C_cosine(b, c) = 0.844
Clustering:
Grouping strategies for hierarchical clustering
(Figure: three clusters C1, C2, C3.)
Which pair of clusters should be merged?
Clustering:
Grouping strategies for hierarchical clustering
Single linkage:
Dissimilarity between clusters C1 and C2 = the minimum dissimilarity between the members of the two clusters.
Tends to generate "long chains".
Clustering:
Grouping strategies for hierarchical clustering
Complete linkage:
Dissimilarity between clusters C1 and C2 = the maximum dissimilarity between the members of the two clusters.
Tends to generate "clumps".
Clustering:
Grouping strategies for hierarchical clustering
Average linkage:
Dissimilarity between clusters C1 and C2 = the average distance over all pairs of objects (one from each cluster).
Clustering:
Grouping strategies for hierarchical clustering
Average group linkage:
Dissimilarity between clusters C1 and C2 = the distance between the two cluster means.
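The four grouping strategies above roughly correspond to the `method` options of SciPy's `linkage`; a small comparison sketch on toy data:

```python
# Sketch: compare single, complete, average and centroid linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (25, 2)), rng.normal(3, 0.4, (25, 2))])

for method in ("single", "complete", "average", "centroid"):
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    print(method, "-> cluster sizes:", np.bincount(labels)[1:])
```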
Clustering:
Support Vector Machines for clustering
The noise-free case.
Objective function: minimize R² over the sphere radius R and centre a, subject to ||Φ(x_j) - a||² ≤ R² for every point x_j, i.e. find the smallest sphere enclosing the mapped data in feature space.
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering:
Support Vector Machines for clustering
The noisy case.
Objective function: minimize R² + C·Σ_j ξ_j over R, a and the slack variables ξ_j ≥ 0, subject to ||Φ(x_j) - a||² ≤ R² + ξ_j, so that some points may lie outside the sphere.
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering:
Support Vector Machines for clustering
The noisy case (II)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering:
Support Vector Machines for clustering
The noisy case (III)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
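A hedged sketch related to the first step of support vector clustering: a one-class SVM with an RBF kernel finds the enclosing contours of the data in feature space; the cluster-labelling step of Ben-Hur et al. (checking whether line segments between points stay inside the contour) is not implemented here:

```python
# Sketch: enclosing contours via a one-class SVM (related to the SVC sphere).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import OneClassSVM

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

model = OneClassSVM(kernel="rbf", gamma=5.0, nu=0.05).fit(X)  # nu ~ outlier share
inside = model.predict(X) == 1            # points inside the learned contour
print("support vectors:", len(model.support_),
      "points inside contour:", int(inside.sum()))
```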
Conclusion / Summary / References
Feature Selection
  Filtering approach – Kohavi and John, 1997
  Wrapper approach – Kohavi and John, 1997
  Embedded methods – I. Guyon et al., 2006
Dimensionality Reduction
  Principal Components Analysis (PCA) – http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
  Nonlinear PCA (Kernel PCA, CatPCA) – Schölkopf et al., 2001; Gifi, 1990
  Multi-Dimensional Scaling (MDS) – Borg and Groenen, 2005
  Homogeneity Analysis – Gifi, 1990
Clustering
  Density estimation and clustering – Hastie et al., 2001
  K-means clustering – MacQueen, 1967
  Hierarchical clustering – http://www.autonlab.org/tutorials/kmeans11.pdf
  Clustering with Support Vector Machines (SVMs) – Ben-Hur, Horn, Siegelmann and Vapnik, 2001