Feature Extraction for Supervised Learning in Knowledge Discovery


Mykola Pechenizkiy

Feature Extraction for Supervised Learning in Knowledge Discovery Systems

Supervisors:
– Prof. Seppo Puuronen, JYU
– Dr. Alexey Tsymbal, TCD
– Prof. Tommi Kärkkäinen, JYU

Reviewers:
– Prof. Ryszard Michalski, GMU
– Prof. Peter Kokol, UM

Opponent:
– Dr. Kari Torkkola, Motorola Labs

Public examination of the dissertation: JYU, Agora Building, Auditorium 2, December 20, 2005, 12:00
Outline

 DM and KDD background
– KDD as a process
– DM strategy
 Classification
– Curse of dimensionality and indirectly relevant features
– Feature extraction (FE) as dimensionality reduction
 Feature Extraction for Classification
– Conventional Principal Component Analysis
– Class-conditional FE: parametric and non-parametric
 Research Questions
 Research Methods
 Contributions
Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
The task of classification

J classes, n training observations, p features.

Given n training instances (xi, yi), where xi are the attribute values and yi is the class label, the goal is: given a new instance x0, predict its class y0.

[Figure: the training set is fed to a classifier, which assigns a class membership to a new instance to be classified.]

Examples:
– diagnosis of thyroid diseases;
– heart attack prediction, etc.
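The classification task above can be sketched with a minimal nearest-neighbour classifier; the toy data and the choice of 3NN (used later in the talk) are illustrative only, not the thesis's experimental setup.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=3):
    """Predict the class of x0 by majority vote among its k nearest training instances."""
    dists = np.linalg.norm(X_train - x0, axis=1)      # Euclidean distance to all n instances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority class among the neighbours

# Toy training set: n = 4 instances, p = 2 features, J = 2 classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0])))       # -> 1
```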
Improvement of the Representation Space

 Curse of dimensionality: a drastic increase in computational complexity and classification error for data with a large number of dimensions
 Indirectly relevant features
FE example: “Heart Disease”

The extracted features are linear combinations of the original attributes, e.g.:

-0.7·Age + 0.1·Sex - 0.43·RestBP + 0.57·MaxHeartRate
-0.01·Age + 0.78·Sex - 0.42·RestBP - 0.47·MaxHeartRate
0.1·Age - 0.6·Sex - 0.73·RestBP - 0.33·MaxHeartRate

[Chart: the three extracted components cover 87% of the variance vs. 100% for the full original feature set; 3NN accuracy is 67% on the extracted features vs. 60% on the original ones.]
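The extraction step behind such component loadings can be sketched with a plain eigendecomposition of the covariance matrix; the data here is a random synthetic stand-in for the four Heart Disease attributes, so the printed variance fraction is illustrative, not the slide's 87%.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for four attributes (Age, Sex, RestBP, MaxHeartRate)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features

Xc = X - X.mean(axis=0)                     # centre the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column of eigvecs holds the loadings of one component,
# i.e. the coefficients of a linear combination like those on the slide.
k = 3
covered = eigvals[:k].sum() / eigvals.sum() # fraction of variance covered by k components
Z = Xc @ eigvecs[:, :k]                     # the transformed (extracted) features
print(Z.shape, round(float(covered), 2))
```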
How to construct a good RS for SL?

[Diagram: original features vs. extracted features as representations of instances of classes y1…yk; selecting the most representative instances; selecting the most relevant features; partitions. The research questions RQ1–RQ7 are annotated on the diagram:]

RQ1 – How important is it to use class information in the FE process?
RQ2 – Is FE a data- or hypothesis-driven constructive induction?
RQ3 – Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier?
RQ4 – Which features – original, extracted or both – are useful for SL?
RQ5 – How many extracted features are useful for SL?
RQ6 – How to cope with the presence of contextual features in data, and data heterogeneity?
RQ7 – What is the effect of sample reduction on the performance of FE for SL?
Research Problem

 Studying both the theoretical background and the practical aspects of FE for SL in KDSs

Research Method

 A multimethodological approach to the construction of a DM artifact (following Nunamaker et al., 1990–91): Theory Building, Development, Observation, Experimentation

Main Contribution

 Many-sided analysis of the research problem
 Ensemble of relatively small contributions
Further Research

[Diagram: a knowledge discovery system architecture in which a KDDManager coordinates a data generator, data preprocessors, instance manipulators, feature manipulators, ML algorithms/classifiers, evaluators, postprocessors/visualisers and a GUI, supported by meta-level components (Meta-Data, Meta-learning, Meta-Model, ES, KB). Open questions annotated on the diagram:]

– When is FE useful for SL?
– What is the effect of FE on the interpretability of results and the transparency of SL?
– How to help in decision making on the selection of the appropriate DM strategy for the problem under consideration?
Additional Slides …

 Further slides for a step-by-step analysis of the research questions and the corresponding contributions
Research Questions:

RQ1 – How important is it to use class information in the FE process?
RQ2 – Is FE a data- or hypothesis-driven constructive induction?
RQ3 – Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier?
RQ4 – Which features – original, extracted or both – are useful for SL?
RQ5 – How many extracted features are useful for SL?
Research Questions (cont.):

RQ6 – How to cope with the presence of contextual features in data, and data heterogeneity?
RQ7 – What is the effect of sample reduction on the performance of FE for SL?
RQ8 – When is FE useful for SL?
RQ9 – What is the effect of FE on the interpretability of results and the transparency of SL?
RQ10 – How to make a decision about the selection of the appropriate DM strategy for the problem under consideration?
RQ1: Use of class information in FE

[Figure: two scatter plots in the (x1, x2) plane showing the principal components PC(1) and PC(2) for cases (a) and (b).]

Use of class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on, or even deteriorates, the accuracy. There is no superior technique, but nonparametric approaches are more stable across various dataset characteristics.

 Tsymbal A., Puuronen S., Pechenizkiy M., Baumgarten M., Patterson D. 2002. Eigenvector-based Feature Extraction for Classification (Article I, FLAIRS’02)
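The contrast between variance-based and class-conditional extraction can be illustrated on synthetic data: two elongated classes whose direction of largest variance is orthogonal to the direction that separates them. The Fisher-direction computation below is a generic parametric sketch, not the thesis's exact parametric/non-parametric procedures.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two elongated classes: variance is largest along x1, but the classes differ along x2.
n = 200
X0 = rng.normal(size=(n, 2)) * [5.0, 0.3] + [0.0, -1.0]
X1 = rng.normal(size=(n, 2)) * [5.0, 0.3] + [0.0, +1.0]
X = np.vstack([X0, X1]); y = np.r_[np.zeros(n), np.ones(n)]

# Variance-only criterion: leading eigenvector of the total covariance (PCA)
pc1 = np.linalg.eigh(np.cov(X - X.mean(0), rowvar=False))[1][:, -1]

# Class-conditional (parametric, Fisher) criterion: Sw^{-1} (m1 - m0)
m0, m1 = X0.mean(0), X1.mean(0)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)

def acc(direction):
    """Accuracy of a simple threshold on the 1-D extracted feature X @ direction."""
    z = X @ direction
    pred = (z > z.mean()).astype(float)
    return max(np.mean(pred == y), np.mean(pred != y))   # label order is arbitrary

print(round(acc(pc1), 2), round(acc(w), 2))  # PC1 ignores the classes; w separates them
```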
RQ2: Is FE a data- or hypothesis-driven CI?

[Diagram: on the training set, a search for the most appropriate FE technique produces an FE model that transforms the train set; a search for the most appropriate SL technique then produces an SL model; both models are applied to the test set to obtain predictions.]

The ranking of different FE techniques according to the accuracy results of a given SL technique can vary a lot across datasets. Different FE techniques also behave differently when integrated with different SL techniques. Thus, the selection of the FE method is not independent of the selection of the classifier.

 Pechenizkiy M. 2005. Impact of the Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5 (Article III, AI’05)
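The key point of the pipeline above is that the FE model is fit on the training set only and then applied unchanged to the test set. A minimal sketch of that discipline, with a trivial "no FE vs. first-PC" search and a nearest-mean classifier standing in for the FE and SL techniques of the talk (the data and split are synthetic assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3)); X[:, 0] *= 3.0            # feature 0 carries most variance
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)
tr, te = np.arange(80), np.arange(80, 120)               # simple train/test split

def fe_none(Xtr, Xte):                                   # no feature extraction
    return Xtr, Xte

def fe_pca1(Xtr, Xte):                                   # FE model fit on TRAIN only
    mu = Xtr.mean(0)
    v = np.linalg.eigh(np.cov(Xtr - mu, rowvar=False))[1][:, -1:]
    return (Xtr - mu) @ v, (Xte - mu) @ v                # same model applied to test

def nearest_mean(Xtr, ytr, Xte):                         # classify by closest class centroid
    means = np.array([Xtr[ytr == c].mean(0) for c in (0, 1)])
    d = np.linalg.norm(Xte[:, None, :] - means[None], axis=2)
    return d.argmin(axis=1)

accs = {}
for name, fe in [("none", fe_none), ("pca1", fe_pca1)]:
    Ztr, Zte = fe(X[tr], X[te])
    accs[name] = float(np.mean(nearest_mean(Ztr, y[tr], Zte) == y[te]))
print(accs)
```

On another dataset the ranking of the two FE options could flip, which is exactly the slide's point.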
RQ3: FE for Dynamic Integration of Classifiers

Notation: S – size of the ensemble; N – number of features; TS – training subset; BC – base classifier; NN – nearest neighborhood. (Article VIII, Pechenizkiy et al., 2005)

[Diagram, training phase: the data set is divided into training, validation and test sets. RSM(S, N) draws S training subsets TS1 … TSS; feature extraction (PCA, parametric, non-parametric) yields transformation models and transformed training sets; each subset trains a base classifier BC1 … BCS; local accuracy estimates on the validation set form the Meta-Data.]

[Diagram, application phase: for a new instance, search for its nearest neighbours; meta-learning (WNN) predicts, for each NN, the local errors of every BC; dynamic integration then combines the base classifiers via Dynamic Selection, Dynamic Voting, or Dynamic Voting with Selection, with feature subset refinement.]
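The training/application phases above can be sketched end to end for the Dynamic Selection variant; nearest-mean base classifiers and an unweighted local-accuracy vote stand in for the actual BCs and the WNN weighting, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
tr, va, te = np.arange(100), np.arange(100, 160), np.arange(160, 200)

def nearest_mean_fit(Xtr, ytr):
    return np.array([Xtr[ytr == c].mean(0) for c in (0, 1)])

def nearest_mean_predict(means, Xq):
    return np.linalg.norm(Xq[:, None] - means[None], axis=2).argmin(1)

# RSM(S, N): S base classifiers, each trained on a random subset of N features
S, N = 5, 3
subsets = [rng.choice(6, size=N, replace=False) for _ in range(S)]
models = [nearest_mean_fit(X[tr][:, fs], y[tr]) for fs in subsets]

# Meta-Data: correctness of each base classifier on each validation instance
correct = np.array([nearest_mean_predict(m, X[va][:, fs]) == y[va]
                    for m, fs in zip(models, subsets)])     # shape (S, |va|)

def dynamic_selection(x0, k=7):
    """Pick the BC with the best estimated accuracy among x0's validation neighbours."""
    nn = np.argsort(np.linalg.norm(X[va] - x0, axis=1))[:k]
    local_acc = correct[:, nn].mean(axis=1)                 # local accuracy per BC
    best = int(np.argmax(local_acc))
    return nearest_mean_predict(models[best], x0[None, subsets[best]])[0]

preds = np.array([dynamic_selection(x) for x in X[te]])
print(round(float(np.mean(preds == y[te])), 2))
```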
RQ4: How to construct a good RS for SL?

Which features – original, extracted or both – are useful for SL?

[Diagram: original and extracted representations of instances of classes y1 … yk; selecting the most representative instances; selecting the most relevant features; partitions.]

Combining the original features with the extracted features can be beneficial for SL on many datasets, especially when tree-based inducers like C4.5 are used for classification.

 Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics (Article II, CBMS’2004)
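The "original + extracted" representation space amounts to concatenating the two feature sets; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))                 # original features
mu = X.mean(0)
vals, vecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
pcs = (X - mu) @ vecs[:, ::-1][:, :2]        # two leading extracted features

X_combined = np.hstack([X, pcs])             # original + extracted representation space
print(X_combined.shape)                      # -> (50, 6)
```

A learner fed X_combined can then pick whichever mix of original and extracted features is useful, which is what makes the combination attractive for tree-based inducers.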
RQ4: How to construct a good RS for SL? (cont.)

[Diagram: PCA is applied to the training data to obtain PCs, and parametric LDA to obtain LDs; a classifier is trained on the PCs, the LDs, or the combined PCs + LDs representation; the test data are transformed with the same models and classified.]

[Chart: accuracies (roughly 0.70–0.80) and numbers of features for 3NN, NB and C4.5 under the PCs, LDs and combined PCs + LDs representations.]

 Pechenizkiy M., Tsymbal A., Puuronen S. 2005. On Combining Principal Components with Parametric LDA-based Feature Extraction for Supervised Learning (Article III, FCDS)
RQ5: How many extracted features are useful?

 Criteria for selecting the most useful transformed features are often based on the variance accounted for by the features to be selected: keep all the components whose corresponding eigenvalues are significantly greater than one, λi > 1 + 2·√((#features − 1) / (#instances − 1)).
 A ranking procedure: select the principal components that have the highest correlations with the class attribute.
 Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics (Article II, CBMS’2004)
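Assuming the eigenvalue threshold stated above, λ > 1 + 2·√((p − 1)/(n − 1)) for standardized data, the selection rule is a one-liner; the example eigenvalues are made up for illustration.

```python
import numpy as np

def n_useful_components(eigvals, p, n):
    """Count components whose eigenvalue is 'significantly greater than one':
    lambda_i > 1 + 2*sqrt((p - 1)/(n - 1)), for p features and n instances."""
    threshold = 1.0 + 2.0 * np.sqrt((p - 1) / (n - 1))
    return int(np.sum(np.asarray(eigvals) > threshold))

# Correlation-matrix eigenvalues of a hypothetical dataset: p = 6 features, n = 100 instances
eigvals = [3.1, 1.6, 0.9, 0.2, 0.1, 0.1]
print(n_useful_components(eigvals, p=6, n=100))   # threshold ~1.45, so 2 components are kept
```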
RQ6: How to cope with data heterogeneity?

[Diagram: the training data are split into natural clusters (Cluster1 … Clustern); dimensionality reduction (DR) and supervised learning (SL) are applied locally within each cluster, yielding classifiers C1 … Cn; test instances are routed to their cluster’s classifier, and accuracy is estimated per cluster.]

 Pechenizkiy M., Tsymbal A., Puuronen S. 2005. Supervised Learning and Local Dimensionality Reduction within Natural Clusters: Biomedical Data Analysis (T-ITB, “Mining Biomedical Data”)
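The per-cluster pipeline can be sketched as "route to the nearest cluster, then apply that cluster's DR model"; the two well-separated synthetic clusters and the centroid-based routing are simplifying assumptions standing in for the actual clustering step.

```python
import numpy as np

rng = np.random.default_rng(5)
# Two "natural clusters" of heterogeneous data
A = rng.normal(size=(60, 4)) + 5
B = rng.normal(size=(60, 4)) - 5

centroids = np.array([A.mean(0), B.mean(0)])     # stand-in for a clustering step

def local_pca_model(Xc, k=2):
    """Fit a dimensionality-reduction (DR) model on one cluster only."""
    mu = Xc.mean(0)
    vecs = np.linalg.eigh(np.cov(Xc - mu, rowvar=False))[1][:, ::-1][:, :k]
    return mu, vecs

models = [local_pca_model(A), local_pca_model(B)]

def transform(x):
    """Route an instance to its nearest cluster, then apply that cluster's DR model."""
    c = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    mu, vecs = models[c]
    return c, (x - mu) @ vecs

c, z = transform(np.full(4, 4.8))
print(c, z.shape)                                # routed to cluster 0, reduced to 2-D
```

A local SL model per cluster would then be trained and applied to the reduced representation in the same routed fashion.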
RQ7: What is the effect of sample reduction?

[Diagram: two sampling schemes feed FE + SL. In the first, a kd-tree is built over the data and the sample is drawn from its leaves N1 … Nn. In the second, stratified random sampling draws from each class c (c = 1 … C) a sample of size Sc = Nc · p / 100, where Nc is the number of instances of class c and p is the sampling percentage.]

 Pechenizkiy M., Puuronen S., Tsymbal A. 2005. The Impact of Sample Reduction on PCA-based Feature Extraction for Naïve Bayes Classification (Article V, ACM SAC’06: DM Track)
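The stratified scheme, Sc = Nc · p / 100 per class, is easy to make concrete; the class sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
y = np.array([0] * 80 + [1] * 20)            # N_1 = 80, N_2 = 20 instances per class

def stratified_sample(y, p, rng):
    """Draw S_c = N_c * p / 100 instances from each class c, preserving class proportions."""
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        s_c = int(round(len(members) * p / 100.0))
        idx.append(rng.choice(members, size=s_c, replace=False))
    return np.concatenate(idx)

sample = stratified_sample(y, p=25, rng=rng)
print(len(sample), np.bincount(y[sample]))   # -> 25 [20  5]
```

FE (e.g. PCA) and SL would then be run on the reduced sample instead of the full data set.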
RQ8: When is FE useful for SL?

 The Kaiser–Meyer–Olkin (KMO) criterion accounts for total and partial correlation:

KMO = Σ_{i≠j} r_ij² / ( Σ_{i≠j} r_ij² + Σ_{i≠j} a_ij² ),

where r_ij is the correlation between features i and j, and a_ij·X(i,j) = −R_ij / √(R_ii · R_jj) is their partial correlation given the remaining features, R being the inverse of the correlation matrix.

General recommendation: IF KMO > 0.5 THEN apply PCA. This rarely works in the context of SL.
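The KMO measure as written above can be computed directly from the correlation matrix and its inverse; the one-factor synthetic data is an assumption chosen so that the criterion comes out high.

```python
import numpy as np

def kmo(X):
    """Kaiser-Meyer-Olkin measure: total correlations r_ij vs. partial correlations a_ij."""
    R = np.corrcoef(X, rowvar=False)           # correlation matrix
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    A = -Rinv / np.outer(d, d)                 # partial correlations -Rinv_ij / sqrt(Rinv_ii * Rinv_jj)
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(R, 0.0)                   # the sums run over i != j only
    r2, a2 = (R ** 2).sum(), (A ** 2).sum()
    return r2 / (r2 + a2)

rng = np.random.default_rng(7)
latent = rng.normal(size=(300, 1))
X = latent + 0.5 * rng.normal(size=(300, 4))   # four noisy copies of one factor
print(kmo(X) > 0.5)                            # high shared correlation -> PCA recommended
```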
RQ9: What is the effect of FE on interpretability?

Interpretability refers to whether a classifier is easy to understand:
– rule-based classifiers, like decision trees and association rules, are very easy to interpret;
– neural networks and other connectionist, “black-box” classifiers have low interpretability.

FE enables:
• new concepts – new understanding;
• summarising the information from a large number of features into a limited number of components;
• the transformation formulae, which provide information about the importance of the original features;
• a better RS – a better neighbourhood – better interpretability by analogy with similar medical cases;
• visual analysis by projecting the data onto 2D or 3D plots.

 Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics (Article II, CBMS’2004)
RQ9: Feature Extraction & Interpretability (cont.)

Objectivity of interpretability:
 The assessment of interpretability relies on the user’s perception of the classifier.
 The assessment of an algorithm’s practicality depends heavily on the user’s background, preferences and priorities.
 Most of the characteristics related to practicality can be described only by reporting users’ subjective evaluations.

Thus,
– interpretability issues are disputable and difficult to evaluate,
– many conclusions on interpretability are relative and subjective.

Collaboration between DM researchers and domain experts is needed for further analysis of interpretability issues.

 Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics (Article II, CBMS’2004)
RQ10: Framework for DM Strategy Selection

[Diagram: the proposed knowledge discovery system architecture — a KDDManager coordinates a data generator, data preprocessors, instance manipulators, feature manipulators, ML algorithms/classifiers, evaluators, postprocessors/visualisers and a GUI, supported by meta-level components (Meta-Data, Meta-learning, Meta-Model, ES, KB).]

 Pechenizkiy M. 2005. DM strategy selection via empirical and constructive induction (Article IX, DBA’05)
Additional Slides …
Meta-Learning

[Diagram: a collection of data sets and a collection of techniques are evaluated against performance criteria to populate a meta-learning space and a knowledge repository, from which a meta-model is induced; for a new data set, the meta-model suggests a technique, and the evaluation of that suggestion feeds back into the repository.]
New Research Framework for DM Research

[Diagram: DM research sits between the Environment (business needs; people, organizations, technology) and the Knowledge Base (foundations, design knowledge). Develop/Build and Justify/Evaluate activities, linked by Assess–Refine cycles, ensure Relevance via (un-)successful applications in the appropriate environment and Rigor via applicable knowledge, contributing back to the knowledge base.]
New Research Framework for DM Research

[Diagram, detailed: the Environment comprises Business (needs), People (roles, capabilities, characteristics), Organizations (strategy, structure & culture, processes) and Technology (infrastructure, applications, communications architecture, development capabilities). DM Research develops/builds theories and artifacts and justifies/evaluates them (analytical, case study, experimental, field study, simulation) in Assess–Refine cycles; (un-)successful applications in the appropriate environment provide Relevance, and applicable knowledge provides Rigor. The Knowledge Base supplies foundations (base-level theories, frameworks, models, instantiations, validation criteria) and design knowledge (methodologies, validation criteria – not instantiations of models but KDD processes, services, systems), and receives contributions back.]

… following the framework of Hevner et al.
Some Multidisciplinary Research

 Pechenizkiy M., Puuronen S., Tsymbal A. 2005. Why Data Mining Does Not Contribute to Business? In: C. Soares et al. (Eds.), Proc. of Data Mining for Business Workshop, DMBiz (ECML/PKDD’05), Porto, Portugal, pp. 67-71.
 Pechenizkiy M., Puuronen S., Tsymbal A. 2005. Competitive advantage from Data Mining: Lessons learnt in the Information Systems field. In: IEEE Workshop Proc. of DEXA’05, 1st Int. Workshop on Philosophies and Methodologies for Knowledge Discovery PMKD’05, IEEE CS Press, pp. 733-737 (invited paper).
 Pechenizkiy M., Puuronen S., Tsymbal A. 2005. Does the relevance of data mining research matter? (resubmitted as a book chapter to) Foundations of Data Mining, Springer.
 Pechenizkiy M., Tsymbal A., Puuronen S. 2005. Knowledge Management Challenges in Knowledge Discovery Systems. In: IEEE Workshop Proc. of DEXA’05, 6th Int. Workshop on Theory and Applications of KM, TAKMA’05, IEEE CS Press, pp. 433-437.
Some Applications:

 Pechenizkiy M., Tsymbal A., Puuronen S., Shifrin M., Alexandrova I. 2005. Knowledge Discovery from Microbiology Data: Many-sided Analysis of Antibiotic Resistance in Nosocomial Infections. In: K.D. Althoff et al. (Eds.) Post-Conference Proc. of 3rd Conf. on Professional Knowledge Management: Experiences and Visions, LNAI 3782, Springer Verlag, pp. 360-372.
 Pechenizkiy M., Tsymbal A., Puuronen S. 2005. Supervised Learning and Local Dimensionality Reduction within Natural Clusters: Biomedical Data Analysis (T-ITB, “Mining Biomedical Data”).
 Tsymbal A., Pechenizkiy M., Cunningham P., Puuronen S. 2005. Dynamic Integration of Classifiers for Handling Concept Drift (submitted to Special Issue on Application of Ensembles, Information Fusion, Elsevier).
Contact Info

MS PowerPoint slides of recent talks and full texts of selected publications are available online at: www.cs.jyu.fi/~mpechen

Mykola Pechenizkiy
Department of Computer Science and Information Systems,
University of Jyväskylä, FINLAND
E-mail: [email protected]
Tel. +358 14 2602472
Mobile: +358 44 3851845
Fax: +358 14 2603011
www.cs.jyu.fi/~mpechen

THANK YOU!