Technical Presentation for Internal FC Use only


Summer Course: Data Mining
Feature Selection,
Dimensionality Reduction, and
Clustering
Presenter: Georgi Nalbantov
August 2009
Structure

Feature Selection
- Filtering approach
- Wrapper approach
- Embedded methods

Dimensionality Reduction
- Principal Components Analysis (PCA)
- Nonlinear PCA (Kernel PCA, CatPCA)
- Multi-Dimensional Scaling (MDS)
- Homogeneity Analysis

Clustering
- Density estimation and clustering
- K-means clustering
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs)
Feature Selection, Dimensionality Reduction, and
Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Feature Selection
In the presence of millions of features/attributes/inputs/variables,
select the most relevant ones.
Advantages: build better, faster, and easier to understand learning
machines.
[Figure: data matrix X with n samples and m features, reduced to m' selected features.]
Feature Selection

Goal: select the two best features individually.
Any reasonable objective J will rank the features: J(x1) > J(x2) = J(x3) > J(x4).
Thus, the features chosen are [x1, x2] or [x1, x3].
However, x4 is the only feature that provides complementary information to x1.
Feature Selection

Filtering approach:
- ranks features or feature subsets independently of the predictor (classifier)
  - using univariate methods: consider one variable at a time
  - using multivariate methods: consider more than one variable at a time

Wrapper approach:
- uses a classifier to assess (many) features or feature subsets

Embedded approach:
- uses a classifier to build a (single) model with a subset of features that are internally selected
Feature Selection: univariate filtering approach

Issue: determine the relevance of a given single feature.
[Figure: class-conditional densities P(Xi | Y) of a single feature xi, with class means μ and standard deviations σ.]
Feature Selection: univariate filtering approach

Issue: determine the relevance of a given single feature.
Under independence: P(X, Y) = P(X) P(Y).
Measure of dependence (Mutual Information):
MI(X, Y) = ∫∫ P(X, Y) log [ P(X, Y) / ( P(X) P(Y) ) ] dX dY = KL( P(X, Y) || P(X) P(Y) )
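As a rough illustration of this measure (not part of the original slides), here is a minimal numpy sketch that estimates MI from a 2-D histogram; the bin count and the synthetic data are arbitrary assumptions.

import numpy as np

def mutual_information(x, y, bins=10):
    # Histogram-based estimate of MI(X, Y) = KL( P(X,Y) || P(X)P(Y) )
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                # empirical joint distribution P(X, Y)
    px = pxy.sum(axis=1, keepdims=True)      # marginal P(X)
    py = pxy.sum(axis=0, keepdims=True)      # marginal P(Y)
    nz = pxy > 0                             # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
print(mutual_information(x, 2 * x + rng.normal(size=1000)))  # dependent pair: clearly > 0
print(mutual_information(x, rng.normal(size=1000)))          # independent pair: close to 0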
Feature Selection: univariate filtering approach

Correlation and MI
Note: Correlation is a measure of linear dependence
Feature Selection: univariate filtering approach

Correlation and MI under the Gaussian distribution
Feature Selection: univariate filtering approach.
Criteria for measuring dependence.
Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = 1) of feature xi, with class means μ-, μ+ and standard deviations σ-, σ+.]
Feature Selection: univariate filtering approach
A feature xi carries no information about the class when the class-conditional densities coincide:
P(Xi | Y = 1) = P(Xi | Y = -1)  (uninformative feature)  vs.  P(Xi | Y = 1) ≠ P(Xi | Y = -1)  (informative feature)
[Figure: the two class-conditional densities of xi for Y = 1 and Y = -1, overlapping in the first case and separated in the second.]
Feature Selection: univariate filtering approach
T-test: is the distance between the class means μ+ and μ- of feature xi significant?
- Normally distributed classes, equal variance σ² unknown; estimated from the data as σ²_within.
- Null hypothesis H0: μ+ = μ-
- T statistic: if H0 is true, then
  t = (μ+ - μ-) / ( σ_within · sqrt(1/m+ + 1/m-) )  ~  Student(m+ + m- - 2 d.f.),
  where m+ and m- are the two class sample sizes.
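A small scipy sketch of this test on synthetic class-conditional samples; the data, means, and sample sizes are illustrative assumptions, not values from the slides.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x_pos = rng.normal(loc=1.0, scale=1.0, size=40)   # feature values for class y = +1 (m+ = 40)
x_neg = rng.normal(loc=0.0, scale=1.0, size=35)   # feature values for class y = -1 (m- = 35)

# Two-sample t-test with pooled within-class variance, m+ + m- - 2 degrees of freedom
t, p = stats.ttest_ind(x_pos, x_neg, equal_var=True)
print(f"t = {t:.3f}, p = {p:.4f}")   # a small p-value rejects H0: mu+ = mu-, so keep the feature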
Feature Selection: multivariate filtering approach
Guyon-Elisseeff, JMLR 2004; Springer 2006
Feature Selection: search strategies
Kohavi-John, 1997
N features, 2^N possible feature subsets!
Feature Selection: search strategies
- Forward selection or backward elimination (a forward-selection sketch follows this list).
- Beam search: keep the k best paths at each step.
- GSFS: generalized sequential forward selection – when (n-k) features are left, try all subsets of g features. More trainings at each step, but fewer steps.
- PTA(l,r): plus l, take away r – at each step, run SFS l times then SBS r times.
- Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
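A hedged sketch of plain forward selection (SFS) using cross-validated accuracy as the objective J; the logistic-regression classifier and the scoring choice are assumptions for illustration, not prescribed by the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=5):
    # Greedy SFS: at each step add the single feature that most improves CV accuracy
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
    # usage: forward_selection(X, y) with X an (n_samples, n_features) array and binary y

Backward elimination, PTA and floating search follow the same pattern with different add/remove schedules.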
Feature Selection: filters vs. wrappers vs. embedding

Main goal: rank subsets of useful features
Feature Selection: feature subset assessment (wrapper)
[Figure: data matrix of M samples by N variables/features, split row-wise into blocks of sizes m1, m2, m3.]
Split the data into 3 sets: training, validation, and test set.
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset which performs best on the validation data.
   (Repeat and average if you want to reduce variance: cross-validation.)
3) Test on the test data.
Danger of over-fitting with intensive search! (A sketch of this protocol follows.)
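A minimal sketch of this wrapper protocol, assuming synthetic data, exhaustive 2-feature subsets, and logistic regression as the wrapped classifier; all of these choices are illustrative assumptions.

import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)          # in this toy data only features 0 and 3 matter

# Split data into training, validation, and test sets
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1) train a predictor per feature subset, 2) keep the subset that is best on validation data
best_subset, best_acc = None, -1.0
for subset in combinations(range(X.shape[1]), 2):
    clf = LogisticRegression().fit(X_tr[:, subset], y_tr)
    acc = clf.score(X_va[:, subset], y_va)
    if acc > best_acc:
        best_subset, best_acc = subset, acc

# 3) report performance once, on the untouched test set
clf = LogisticRegression().fit(X_tr[:, best_subset], y_tr)
print(best_subset, clf.score(X_te[:, best_subset], y_te))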
Feature Selection via Embedded Methods:
L1-regularization
Penalty: add sum(|beta_j|), the L1 norm of the coefficient vector, to the training objective. The penalty shrinks many coefficients exactly to zero, so the features with non-zero coefficients are selected as a by-product of model fitting.
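A brief sketch of embedded selection with the Lasso; the synthetic data, the truly relevant features (0 and 3), and the penalty strength alpha are assumptions for illustration.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)

# The L1 penalty alpha * sum(|beta|) drives irrelevant coefficients exactly to zero
model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))
print("selected features:", np.flatnonzero(model.coef_))   # -> [0 3]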
Feature Selection: summary
            | Univariate                          | Multivariate
Linear      | T-test, AUC, feature ranking        | RFE with linear SVM or LDA
Non-linear  | Mutual information feature ranking  | Nearest Neighbors, Neural Nets, Trees, SVM
Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features.

Feature Selection:          X1, …, Xm  →  Xk1, …, Xkp
Dimensionality Reduction:   X1, …, Xm  →  f1(X1, …, Xm), …, fp(X1, …, Xm)
Dimensionality Reduction:
(Linear) Principal Components Analysis

PCA finds a linear mapping of dataset X to a dataset X' of lower dimensionality, such that the variance of X that is retained in X' is maximal.
[Figure] Dataset X is mapped to dataset X', here of the same dimensionality. The first dimension in X' (= the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
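A compact numpy sketch of linear PCA via the covariance eigendecomposition; the synthetic data and the choice of two retained components are arbitrary assumptions.

import numpy as np

def pca(X, n_components=2):
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1]          # sort directions by explained variance
    W = eigvecs[:, order[:n_components]]       # top principal directions
    return Xc @ W                              # X' = projection onto the principal components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X).shape)   # (100, 2)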
Dimensionality Reduction:
Nonlinear (Kernel) Principal Components Analysis
1. Original dataset X.
2. Map X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space.
3. (If necessary,) map the resulting principal components back to the original space.
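A short scikit-learn sketch of this idea on a toy dataset; the RBF kernel and its gamma value are illustrative assumptions, not choices made in the slides.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: no linear direction captures the ring structure,
# but linear PCA in the (implicit) RBF feature space does.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
print(X_kpca.shape)   # (300, 2)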
Dimensionality Reduction:
Multi-Dimensional Scaling
MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower-dimensional (for example, two-dimensional) space.
MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.
- Error of the fit is measured using a so-called "stress" function
- Different choices of stress function are possible
Dimensionality Reduction:
Multi-Dimensional Scaling

Raw stress function (identical to PCA):

Sammon cost function:
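The formulas themselves did not survive extraction; the standard forms of these two criteria, assuming d_ij is the distance between observations i and j in the original space and y_i, y_j are their low-dimensional images, are:

\sigma_{\text{raw}}(Y) = \sum_{i<j} \left( d_{ij} - \lVert y_i - y_j \rVert \right)^2
\qquad
\sigma_{\text{Sammon}}(Y) = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left( d_{ij} - \lVert y_i - y_j \rVert \right)^2}{d_{ij}}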
Dimensionality Reduction:
Multi-Dimensional Scaling (Example)
[Figure: example input data and the resulting two-dimensional MDS output configuration.]
Dimensionality Reduction:
Homogeneity analysis

Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.
Clustering:
Clustering compared with classification and regression
[Figure: three scatter-plot panels over features X1 and X2, one each for clustering (unlabeled points), classification (labelled + / - points), and regression.]
Typical methods per task:
- Clustering / density estimation: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA
- Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
- Regression: Classical Linear Regression, Ridge Regression, NN, CART
Clustering

- Clustering is an unsupervised learning technique.
- Task: organize objects into groups whose members are similar in some way.
- Clustering finds structure in a collection of unlabeled data.
- A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.
Density estimation and clustering
Bayesian separation curve (optimal)
Clustering:
K-means clustering

- Minimizes the sum of the squared distances to the cluster centers (reconstruction error).
- Iterative process (a numpy sketch follows):
  - Estimate current assignments (construct the Voronoi partition of the data around the current centers).
  - Given the new cluster assignments, set each cluster center to the center-of-mass of its points.
Clustering:
K-means clustering
[Figure: four snapshots (Step 1 – Step 4) of the K-means iterations on example data.]
Clustering:
Hierarchical clustering
[Figure: dendrogram]
- Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster.
- Agglomerative HC: starts with n clusters and combines clusters iteratively.
- Divisive HC: starts with one cluster and divides iteratively.
- Disadvantage: a wrong division cannot be undone.
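A small scipy sketch of agglomerative clustering on synthetic data; the linkage method and the cut at two clusters are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method="average")                   # iteratively merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) plots the full merge tree if matplotlib is available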
Clustering:
Nearest Neighbor algorithm for hierarchical clustering
1. Nearest Neighbor, Level 2, k = 7 clusters.
2. Nearest Neighbor, Level 3, k = 6 clusters.
3. Nearest Neighbor, Level 4, k = 5 clusters.
Clustering:
Nearest Neighbor algorithm for hierarchical clustering
4. Nearest Neighbor, Level 5, k = 4 clusters.
5. Nearest Neighbor, Level 6, k = 3 clusters.
6. Nearest Neighbor, Level 7, k = 2 clusters.
Clustering:
Nearest Neighbor algorithm for hierarchical clustering
7. Nearest Neighbor, Level 8, k = 1 cluster.
Clustering:
Similarity measures for hierarchical clustering
Clustering:
Similarity measures for hierarchical clustering

Pearson Correlation: Trend Similarity

b = 0.5 · a
c = a + 0.2

C_pearson(a, b) = 1
C_pearson(a, c) = 1
C_pearson(b, c) = 1
Clustering:
Similarity measures for hierarchical clustering

Euclidean Distance

d(x, y) = sqrt( Σ_{n=1..N} (x_n - y_n)² ),   for vectors x = (x_1, …, x_N) and y = (y_1, …, y_N).
Clustering:
Similarity measures for hierarchical clustering

Cosine Correlation

C_cosine(x, y) = ( Σ_{i=1..N} x_i · y_i ) / ( ||x|| · ||y|| ),   for vectors x = (x_1, …, x_N) and y = (y_1, …, y_N).

-1 ≤ Cosine Correlation ≤ +1
Clustering:
Similarity measures for hierarchical clustering

Cosine Correlation: Trend + Mean Distance

b = 0.5 · a
c = a + 0.2

C_cosine(a, b) = 1
C_cosine(a, c) = 0.9622
C_cosine(b, c) = 0.9622
Clustering:
Similarity measures for hierarchical clustering


b = 0.5 · a
c = a + 0.2

C_pearson(a, b) = 1        d(a, b) = 2.8025        C_cosine(a, b) = 1
C_pearson(a, c) = 1        d(a, c) = 1.5875        C_cosine(a, c) = 0.9622
C_pearson(b, c) = 1        d(b, c) = 3.2211        C_cosine(b, c) = 0.9622
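These relations can be checked numerically; the sketch below builds random vectors with the same constructions b = 0.5·a and c = a + 0.2. The vector length is an arbitrary assumption, so the distance values will differ from the ones on the slide.

import numpy as np
from scipy.spatial.distance import cosine, euclidean
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = 0.5 * a        # same trend, different scale
c = a + 0.2        # same trend, shifted mean

for name, (u, v) in {"(a,b)": (a, b), "(a,c)": (a, c), "(b,c)": (b, c)}.items():
    print(name,
          round(pearsonr(u, v)[0], 4),    # Pearson correlation: 1.0 for all three pairs
          round(euclidean(u, v), 4),      # Euclidean distance: differs per pair
          round(1 - cosine(u, v), 4))     # cosine similarity: penalizes the mean shift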
Clustering:
Similarity measures for hierarchical clustering
Similar?

C_pearson(a, b) = 0.1175       d(a, b) = 0.0279       C_cosine(a, b) = 0.7544
C_pearson(a, c) = 0.1244       d(a, c) = 0.0255       C_cosine(a, c) = 0.8092
C_pearson(b, c) = 0.1779       d(b, c) = 0.0236       C_cosine(b, c) = 0.844
Clustering:
Grouping strategies for hierarchical clustering
[Figure: three clusters C1, C2, C3]
Merge which pair of clusters?
Clustering:
Grouping strategies for hierarchical clustering
Single Linkage
[Figure: clusters C1 and C2 joined by their closest pair of members.]
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
Tends to generate "long chains".
Clustering:
Grouping strategies for hierarchical clustering
Complete Linkage
[Figure: clusters C1 and C2 joined by their farthest pair of members.]
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
Tends to generate "clumps".
Clustering:
Grouping strategies for hierarchical clustering
Average Linkage
[Figure: clusters C1 and C2 with all between-cluster pairs considered.]
Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).
Clustering:
Grouping strategies for hierarchical clustering
Average Group Linkage
[Figure: clusters C1 and C2 joined through their cluster means.]
Dissimilarity between two clusters = distance between the two cluster means.
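A short scipy sketch comparing these grouping strategies on the same synthetic data; the data and the two-cluster cut are illustrative assumptions, and scipy's "centroid" method is used here as the closest match to average group linkage.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                      # cluster dissimilarity per strategy
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])             # resulting cluster sizes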
Clustering:
Support Vector Machines for clustering
The not-noisy case
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
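The objective function itself did not survive extraction. In Ben-Hur et al. (2001) the not-noisy case seeks the smallest sphere (centre a, radius R) in feature space enclosing all mapped points; the standard form, reconstructed here rather than copied from the slide, is:

\min_{R,\,a} \; R^2 \quad \text{s.t.} \quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 \;\; \text{for all } j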
Clustering:
Support Vector Machines for clustering
The noisy case
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
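Again the formula is missing from the extraction; the standard soft (noisy) version adds slack variables ξ_j with penalty parameter C:

\min_{R,\,a,\,\xi} \; R^2 + C \sum_j \xi_j \quad \text{s.t.} \quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 + \xi_j, \;\; \xi_j \ge 0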
Clustering:
Support Vector Machines for clustering
The noisy case (II)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Clustering:
Support Vector Machines for clustering
The noisy case (III)
Objective function:
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
Conclusion / Summary / References

Feature Selection
- Filtering approach
- Wrapper approach
- Embedded methods
References: Kohavi and John, 1996; I. Guyon et al., 2006

Dimensionality Reduction
- Principal Components Analysis (PCA)
- Nonlinear PCA (Kernel PCA, CatPCA)
- Multi-Dimensional Scaling (MDS)
- Homogeneity Analysis
References: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf; Schoelkopf et al., 2001; Gifi, 1990; Borg and Groenen, 2005

Clustering
- Density estimation and clustering
- K-means clustering
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs)
References: Hastie et al., 2001; MacQueen, 1967; http://www.autonlab.org/tutorials/kmeans11.pdf; Ben-Hur, Horn, Siegelmann and Vapnik, 2001