On Combining Principal Components with Parametric LDA


ADMKD’05 Tallinn, Estonia September 15-16, 2005
On Combining Principal Components with
Parametric LDA-based Feature Extraction
for Supervised Learning:
(kNN, Naïve Bayes, and C4.5)
Mykola Pechenizkiy, Seppo Puuronen
Department of Computer Science
University of Jyväskylä
Finland
Alexey Tsymbal
Department of Computer Science
Trinity College Dublin
Ireland
Contents
• DM and KDD background
  – KDD as a process
  – DM strategy
• Classification
  – Curse of dimensionality and indirectly relevant features
  – Dimensionality reduction
    • Feature Selection (FS)
    • Feature Extraction (FE)
• Feature Extraction for Classification
  – Conventional PCA
  – Class-conditional FE: parametric and non-parametric
  – Combining principal components (PCs) and linear discriminants (LDs)
• Experimental Results
  – 3 FE strategies, 3 classifiers, 21 UCI datasets
• Conclusions and Further Research
What is Data Mining?
Data mining, or knowledge discovery, is the process of finding previously unknown and potentially interesting patterns and relations in large databases (Fayyad, KDD'96).
Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases (John 1997).
It is the intersection of many fields: statistics, AI, machine learning, databases, neural networks, pattern recognition, econometrics, etc.
Knowledge discovery as a process
[Figure: the KDD process, from data selection and preprocessing through transformation and data mining to interpretation and evaluation of patterns]
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
The task of classification
J classes, n training observations, p features.
Given a training set of n instances (xi, yi), where xi holds the attribute values and yi is the class label, the goal of classification is: given a new instance x0, predict its class y0.
[Diagram: training set → CLASSIFICATION → class membership of the new instance]
Examples:
- prognosis of breast cancer recurrence;
- diagnosis of thyroid diseases;
- heart attack prediction, etc.
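To make the task concrete, here is a minimal sketch of a nearest-neighbor classifier of the kind used later in the experiments (3NN). The function name and the toy data are illustrative, not from the paper.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training
    instances (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: n = 6 training instances, p = 2 features, J = 2 classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0])))  # -> 1
```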
Goals of Feature Extraction
Improvement of representation space
[Diagram: constructive induction improves the representation space by selecting the most representative instances, selecting the most relevant features, and partitioning the data into representations of instances of classes y1, ..., yk]
• Feature extraction (FE) is a dimensionality reduction technique that extracts a subset of new features from the original set by means of some functional mapping, keeping as much information in the data as possible (Fukunaga 1990).
Feature selection or transformation
• Features can be (and often are) correlated
  – FS techniques that merely assign weights to individual features are insensitive to interacting or correlated features.
• Data are often not homogeneous
  – For some problems a feature subset may be useful in one part of the instance space and at the same time useless or even misleading in another part of it.
  – Therefore it may be difficult or even impossible to remove irrelevant and/or redundant features from a dataset and leave only the useful ones by means of feature selection.
• That is why transforming the given representation before weighting the features is often preferable.
FE for Classification
Principal Component Analysis
• PCA extracts a lower-dimensional space by analyzing the covariance structure of multivariate statistical observations.
• The main idea is to determine the features that explain as much of the total variation in the data as possible with as few of these features as possible.
PCA has the following properties:
(1) it maximizes the variance of the extracted features;
(2) the extracted features are uncorrelated;
(3) it finds the best linear approximation (in the mean-square sense);
(4) it maximizes the information contained in the extracted features.
The Computation of the PCA
1) Calculate the covariance matrix S from the input data.
2) Compute the eigenvalues and eigenvectors of S and sort them in descending order with respect to the eigenvalues.
3) Form the transition matrix by taking the predefined number of components (eigenvectors).
4) Finally, multiply the original feature space by the obtained transition matrix, which yields a lower-dimensional representation.
• The cumulative percentage of variance explained by the principal axes is commonly used as a threshold that defines the number of components to be chosen (see the sketch below).
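A minimal NumPy sketch of these four steps, assuming a rows-as-instances data matrix; the function name and the 85% variance threshold are illustrative.

```python
import numpy as np

def pca_transform(X, var_threshold=0.85):
    """PCA via the four steps above: covariance matrix, eigendecomposition
    sorted by descending eigenvalue, transition matrix cut off at the
    cumulative-variance threshold, projection to the new space."""
    Xc = X - X.mean(axis=0)               # center the data
    S = np.cov(Xc, rowvar=False)          # 1) covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)  # 2) eigenvalues/eigenvectors...
    order = np.argsort(eigvals)[::-1]     #    ...sorted in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.searchsorted(explained, var_threshold)) + 1  # threshold rule
    W = eigvecs[:, :m]                    # 3) transition matrix
    return Xc @ W                         # 4) lower-dimensional representation

X = np.random.default_rng(0).normal(size=(100, 4))
print(pca_transform(X).shape)             # (100, m)
```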
FT example “Heart Disease”
[Chart: three extracted features for the "Heart Disease" data, shown as linear combinations of the original features:
  -0.7·Age + 0.1·Sex - 0.43·RestBP + 0.57·MaxHeartRate
  -0.01·Age + 0.78·Sex - 0.42·RestBP - 0.47·MaxHeartRate
  0.1·Age - 0.6·Sex - 0.73·RestBP - 0.33·MaxHeartRate
 one axis shows variance covered (60-100%, here 87%), the other 3NN accuracy (67%)]
PCA for Classification
PCA gives high weights to features with higher variabilities
disregarding whether they are useful for classification or not.
[Figure: PCA for classification in the (x1, x2) plane: a) effective work of PCA; b) the case where an irrelevant principal component was chosen from the classification point of view]
Class-conditional Eigenvector-based FE
The usual decision is to use some class separability criterion based on a family of functions of scatter matrices: the within-class, the between-class, and the total covariance matrices.
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

Simultaneous Diagonalization Algorithm:
• Eigendecomposition of $S_W$: $(\Lambda, \Phi) = \mathrm{eig\_decomposition}(S_W)$.
• Transformation of X to Y: $Y = \Lambda^{-1/2} \Phi^T X$, where $\Lambda$ and $\Phi$ are the eigenvalue and eigenvector matrices of $S_W$.
• Computation of $S_B$ in the obtained Y space.
• Selection of the m eigenvectors of $S_B$ which correspond to the m largest eigenvalues.
• Computation of the new feature space $Z = \Psi_m^T Y$, where $\Psi_m$ is the matrix of selected eigenvectors.
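A sketch of the simultaneous diagonalization steps in NumPy, assuming S_W is nonsingular. The demo matrices are synthetic; the scatter matrices defined on the next slide can supply real S_W and S_B.

```python
import numpy as np

def simultaneous_diagonalization(S_W, S_B, m):
    """Given within- and between-class covariance matrices, extract m
    features maximizing J(w) = (w' S_B w) / (w' S_W w)."""
    lam, Phi = np.linalg.eigh(S_W)             # eigendecomposition of S_W
    T = Phi @ np.diag(1.0 / np.sqrt(lam))      # whitening: Y = Lambda^{-1/2} Phi^T x
    S_B_y = T.T @ S_B @ T                      # S_B in the Y space
    mu, Psi = np.linalg.eigh(S_B_y)
    Psi_m = Psi[:, np.argsort(mu)[::-1][:m]]   # m largest eigenvalues of S_B
    return T @ Psi_m                           # projection matrix: z = (T Psi_m)^T x

# Demo with random symmetric positive-definite matrices.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 4, 4))
S_W, S_B = A @ A.T + 4 * np.eye(4), B @ B.T
W = simultaneous_diagonalization(S_W, S_B, m=2)
print(W.shape)  # (4, 2): maps a 4-d instance to 2 extracted features
```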
Parametric Eigenvalue-based FE
The within-class covariance matrix shows the scatter of
samples around their respective class expected
vectors:
$$S_W = \sum_{i=1}^{c} n_i \sum_{j=1}^{n_i} (x_j^{(i)} - m^{(i)})(x_j^{(i)} - m^{(i)})^T$$

The between-class covariance matrix shows the scatter of the expected vectors around the mixture mean:

$$S_B = \sum_{i=1}^{c} n_i (m^{(i)} - m)(m^{(i)} - m)^T$$

where c is the number of classes, n_i is the number of instances in class i, x_j^{(i)} is the j-th instance of class i, m^{(i)} is the mean vector of the instances of class i, and m is the mean vector of all the input data.
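The two matrices translate directly into NumPy. A hedged sketch: the n_i weighting follows the formulas above (some texts omit it), and the function name is illustrative.

```python
import numpy as np

def scatter_matrices(X, y):
    """S_W and S_B exactly as defined above (note the n_i weighting)."""
    classes, counts = np.unique(y, return_counts=True)
    m_all = X.mean(axis=0)                    # mixture mean m
    p = X.shape[1]
    S_W, S_B = np.zeros((p, p)), np.zeros((p, p))
    for cls, n_i in zip(classes, counts):
        X_i = X[y == cls]
        m_i = X_i.mean(axis=0)                # class mean m^(i)
        D = X_i - m_i
        S_W += n_i * D.T @ D                  # n_i * sum_j (x - m^(i))(x - m^(i))^T
        d = (m_i - m_all).reshape(-1, 1)
        S_B += n_i * d @ d.T                  # n_i (m^(i) - m)(m^(i) - m)^T
    return S_W, S_B
```

Feeding these matrices into the simultaneous diagonalization sketch above yields the parametric LDs.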
Nonparametric Eigenvalue-based FE
Tries to increase the number of degrees of freedom in the between-class covariance matrix by measuring the between-class covariances on a local basis; the k-nearest neighbor (kNN) technique is used for this purpose.
$$S_B = \sum_{i=1}^{c} \sum_{\substack{j=1 \\ j \ne i}}^{c} \sum_{k=1}^{n_i} w_{ik} \, (x_k^{(i)} - m_{ik}^{(j*)})(x_k^{(i)} - m_{ik}^{(j*)})^T$$

where m_{ik}^{(j*)} is the mean vector of the nNN nearest neighbors in class j to the instance x_k^{(i)}. The coefficient w_{ik} is a weighting coefficient that shows the importance of each summand:
• it assigns more weight to those elements of the matrix which involve instances lying near the class boundaries, which are more important for classification.

$$w_{ik} = \frac{\min_j \{ d^{\alpha}(x_k^{(i)}, x_{nNN}^{(j)}) \}}{\sum_{j=1}^{c} d^{\alpha}(x_k^{(i)}, x_{nNN}^{(j)})}$$

where d(x_k^{(i)}, x_{nNN}^{(j)}) is the distance from x_k^{(i)} to its nNN-th nearest neighbor in class j, and α controls how fast the weight falls off with distance.
Sb: Parametric vs Nonparametric
[Figure: differences in the between-class covariance matrix calculation for the nonparametric (left) and parametric (right) approaches, for the two-class case]
Combining PCs and LDs for SL
[Diagram: the training data go through PCA (yielding PCs) and parametric LDA (yielding LDs); the combined PCs+LDs are used to train an SL classifier; the test data are transformed in the same way and classified, and accuracy is measured]
• Improvement of the parametric class-conditional LDA-based approach by adding a few principal components (PCs) to the linear discriminants (LDs) for further supervised learning (SL); a sketch follows below.
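A sketch of this pipeline using scikit-learn stand-ins. PCA and LinearDiscriminantAnalysis here are not the paper's WEKA implementations, and sklearn's LDA differs in details from the parametric variant above; the function name is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def combined_score(X_train, y_train, X_test, y_test, n_pcs=3):
    """Concatenate LDs (at most #classes - 1 of them) with the first n_pcs
    PCs, both fitted on the training split only, then train and score 3NN."""
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    pca = PCA(n_components=n_pcs).fit(X_train)
    Z_train = np.hstack([lda.transform(X_train), pca.transform(X_train)])
    Z_test = np.hstack([lda.transform(X_test), pca.transform(X_test)])
    clf = KNeighborsClassifier(n_neighbors=3).fit(Z_train, y_train)
    return clf.score(Z_test, y_test)
```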
Experimental Settings
• 21 datasets with different characteristics, taken from the UCI machine learning repository.
• 3 classifiers: 3-nearest neighbor classification (3NN), the Naïve Bayes (NB) learning algorithm, and C4.5 decision tree learning (C4.5).
  – The classifiers were used from the WEKA library with their default settings.
• 3 approaches with each classifier:
  – PCA with the classifier
  – parametric LDA with the classifier
  – PCA+LDA with the classifier
• After PCA we took the 3 main PCs. We took all the LDs (features extracted by parametric LDA), as their number was always equal to #classes - 1.
• 30 test runs of Monte Carlo cross-validation were made for each dataset to evaluate the classification accuracy. In each run, the data were split into a training set (70%) and a test set (30%) by stratified random sampling, keeping the class distributions approximately the same. (A sketch of this protocol follows the list.)
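A sketch of the evaluation protocol, again with scikit-learn rather than the original WEKA setup; `evaluate` could be the combined_score function shown earlier.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def monte_carlo_accuracy(X, y, evaluate, n_runs=30, test_size=0.30, seed=0):
    """30 stratified 70/30 train/test splits; `evaluate` returns the
    accuracy for one split, e.g. combined_score(X_tr, y_tr, X_te, y_te)."""
    sss = StratifiedShuffleSplit(n_splits=n_runs, test_size=test_size,
                                 random_state=seed)
    accs = [evaluate(X[tr], y[tr], X[te], y[te]) for tr, te in sss.split(X, y)]
    return np.mean(accs), np.std(accs)
```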
Ranking of the FE approaches
according to the results on 21 UCI data sets
[Bar chart, kNN: counts (0-12) of ++ / + / = / - / -- outcomes across the 21 datasets for the comparisons PAR vs PCA, PAR+PCA vs PAR, and PAR+PCA vs PCA]
Ranking of the FE approaches
according to the results on 21 UCI data sets
[Bar chart, Naïve Bayes: counts (0-12) of ++ / + / = / - / -- outcomes across the 21 datasets for the comparisons PAR vs PCA, PAR+PCA vs PAR, and PAR+PCA vs PCA]
Ranking of the FE approaches
according to the results on 21 UCI data sets
[Bar chart, C4.5: counts (0-12) of ++ / + / = / - / -- outcomes across the 21 datasets for the comparisons PAR vs PCA, PAR+PCA vs PAR, and PAR+PCA vs PCA]
Accuracy of classifiers, averaged over 21
datasets.
[Bar chart: average accuracy (0.70-0.79) of 3NN, NB, and C4.5 with PCA, PAR, and PAR+3PCs features]
State transition diagram
NB: "PAR vs PCA" (I) → "PAR+PCA vs PCA" (II)
effect = +4·1 + 2·2 + 3·1 + 3·1 - 4·1 - 1·1 - 2·2 = +5
[State transition diagram: for each category (++, +, =, -, --), the number of datasets falling into it in comparison I and the numbers moving to each category in comparison II]
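The effect score appears to be the sum over datasets of (steps moved on the -- ... ++ scale) × (number of datasets making that move). Below is a sketch under that assumption; the transition counts in the demo are hypothetical, chosen only to reproduce the NB example above.

```python
def transition_effect(transitions):
    """Sum of (rank change on the --/-/=/+/++ scale) x (dataset count),
    matching the NB example: +4*1 + 2*2 + 3*1 + 3*1 - 4*1 - 1*1 - 2*2 = +5."""
    rank = {'--': 0, '-': 1, '=': 2, '+': 3, '++': 4}
    return sum(n * (rank[new] - rank[old]) for old, new, n in transitions)

# Hypothetical reading of the seven NB transitions from the example above.
nb = [('--', '++', 1), ('-', '+', 2), ('--', '+', 1), ('-', '++', 1),
      ('++', '--', 1), ('+', '=', 1), ('+', '-', 2)]
print(transition_effect(nb))  # +5
```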
Effect of combining PCs with LDs according
to state transition diagrams
            PAR+PCA vs PCA    PAR+PCA vs PAR
kNN         -10               +5
NB          +11               +14
C4.5        +15               +17
When the combination of PCs and LDs is practically useful
[Bar chart: counts (0-12) of ++ / + / = / - / -- outcomes for 3NN, NB, and C4.5]
Conclusions
• "The curse of dimensionality" is a serious problem in ML/DM/KDD:
  – classification accuracy decreases and processing time increases dramatically as the number of features grows;
  – FE is a common way to cope with this problem: before applying a learning algorithm, the space of instances is transformed into a new space of lower dimensionality, trying to preserve the distances among instances and the class separability.
• A classical approach here (one that takes class information into account) is Fisher's LDA, which
  – tries to minimize the within-class covariance and to maximize the between-class covariance in the extracted features;
  – is well studied, commonly used, and often provides informative features for classification, but:
    • it extracts no more than #classes - 1 features;
    • it often fails to provide reasonably good classification accuracy even with fairly simple datasets, where the intrinsic dimensionality exceeds that number.
• A number of ways to solve this problem have been considered:
  – many approaches suggest non-parametric variations of LDA (rather time-consuming), which lead to greater numbers of extracted features;
  – dataset partitioning and local FE.
Conclusions (cont.)
• In this paper we consider an alternative way to improve LDA-based FE for classification:
  – combining the extracted LDs with a few PCs.
• Our experiments with the combination of LDs and PCs have shown that the discriminating power of LDA features can be improved by PCs on many datasets and with many learning algorithms.
  – The best performance is exhibited with C4.5:
    • a possible explanation for the good behaviour with C4.5 is that decision trees use implicit feature selection, and thus implicitly select the LDs and/or PCs useful for classification out of the combined set of features, discarding the less relevant and duplicate ones. Moreover, this feature selection is local.
Thank You!
Mykola Pechenizkiy
Department of Computer Science and Information Systems,
University of Jyväskylä, FINLAND
E-mail: [email protected]
Tel. +358 14 2602472
Mobile: +358 44 3851845
Fax: +358 14 2603011
www.cs.jyu.fi/~mpechen
Acknowledgments:
• ADMKD Reviewers
• COMAS Graduate School of the University of Jyväskylä, Finland
• Science Foundation Ireland
• WEKA software library
• UCI datasets