Variable and Feature Selection in Machine Learning (Review)


Computational Genomics and
Proteomics
Lecture 9
CGP Part 2: Introduction
Based on: Martin Bachler slides
Overview CGP part 2
• Lesson 9: Introduction, description of
assignment, short overview papers
• Lesson 10: Mass Spectrometric Data Analysis
(Guest lecturer: Thang Pham, VUmc)
• Lesson 11: Team+E.M. meeting 1
• Lesson 12: Team+E.M. meeting 2
• Lesson 13: Presentation/Evaluation Part 2
CGP part 2
• Main focus of CGP2: Feature Selection
• Three papers available for teams’ study
– Each team (pair of students) chooses one paper
– Critical study of topic/method of the paper (meeting with E.M.)
– Proposal of method changes (meeting with E.M.)
– Public presentation, discussion, evaluation (all teams)
• One paper review on Feature Selection techniques
in Bioinformatics:
– Saeys Y, Inza I, Larrañaga P. A review of feature selection
techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507-17.
Papers
• K. Jong, E. Marchiori, M. Sebag, A. van der Vaart.
Feature Selection in Proteomic Pattern Data with Support
Vector Machines. CIBCB, pp. 41-48, IEEE, 2004.
• K. Ye, A. Feenstra, A. IJzerman, J. Heringa, E. Marchiori.
Multi-RELIEF: a method to recognize specificity determining
residues from multiple sequence alignments using a Machine
Learning approach for feature weighting.
Bioinformatics, 2007.
• K. Ye, E.W. Lameijer, M.W. Beukers, and A.P. IJzerman
(2006) A two-entropies analysis to identify functional
positions in the transmembrane region of class A G protein-coupled
receptors. Proteins, Structure, Function and Bioinformatics 63, 1018-1030.
Students’ Evaluation
• Based on:
– Discussions
– Report describing topic/method and proposed
changes
– Presentation
Problem: Where to focus attention?
• A universal problem of intelligent (learning)
agents is where to focus their attention.
• What aspects of the problem at hand are
important/necessary to solve it?
• Discriminate between the relevant and
irrelevant parts of experience.
What is Feature selection?
• Feature selection:
Problem of selecting some subset of a
learning algorithm’s input variables upon
which it should focus attention, while ignoring
the rest
(DIMENSIONALITY REDUCTION)
• Humans/animals do that constantly!
Feature Selection in ML?
Why even think about Feature Selection in ML?
- The information about the target class is inherent in the
variables!
- Naive theoretical view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
- Also:
Optimization is (usually) good, so why not try to optimize the
input coding?
Feature Selection in ML? YES!
- Many explored domains have hundreds to tens of
thousands of variables/features with many irrelevant and
redundant ones!
- In domains with many features the underlying probability
distribution can be very complex and very hard to
estimate (e.g. dependencies between variables) !
- Irrelevant and redundant features can "confuse" learners!
- Limited training data!
- Limited computational resources!
- Curse of dimensionality!
Curse of dimensionality
[Figure: 3-D scatter plot of positive and negative examples over features x1, x2, x3]
Curse of dimensionality
• The required number of samples (to achieve the
same accuracy) grows exponentially with the
number of variables!
• In practice: number of training examples is fixed!
=> the classifier’s performance usually will degrade
for a large number of features!
In many cases the information
that is lost by discarding variables
is made up for by a more
accurate mapping/sampling in the
lower-dimensional space!
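A minimal sketch of this effect (assuming numpy and scikit-learn; all data and parameters below are illustrative): a k-nearest-neighbour classifier trained on a fixed number of samples degrades as irrelevant features are added.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_samples = 100                                   # fixed number of training examples
y = rng.integers(0, 2, n_samples)                 # binary class labels
signal = y[:, None] + 0.5 * rng.standard_normal((n_samples, 2))  # 2 informative features

for n_noise in (0, 10, 100, 1000):
    noise = rng.standard_normal((n_samples, n_noise))  # irrelevant features
    X = np.hstack([signal, noise])
    acc = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
    print(f"{2 + n_noise:5d} features: CV accuracy = {acc:.2f}")
```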
Example for ML-Problem
Gene selection from microarray data
– Variables:
gene expression coefficients corresponding to the
amount of mRNA in a patient's sample (e.g. tissue
biopsy)
– Task: Separate healthy patients from cancer patients
– Usually there are only about 100 examples (patients)
available for training and testing (!!!)
– Number of variables in the raw data: 6,000 – 60,000
– Does this work? ([8])
[8] C. Ambroise, G.J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS Vol. 99, 6562-6566 (2002).
Example for ML-Problem
Text-Categorization
- Documents are represented by a vector of
dimension the size of the vocabulary containing
word frequency counts
- Vocabulary ~ 15,000 words (i.e. each document
is represented by a 15,000-dimensional vector; see the sketch below)
- Typical tasks:
- Automatic sorting of documents into web-directories
- Detection of spam-email
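A minimal sketch of this bag-of-words representation, assuming scikit-learn; the two documents are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "buy cheap pills now",           # spam-like document
    "meeting agenda for next week",  # regular document
]
vectorizer = CountVectorizer()        # builds the vocabulary from the corpus
X = vectorizer.fit_transform(docs)    # one row per document, one column per word
print(X.shape)                        # (2, vocabulary size); ~15,000 columns on a realistic corpus
print(vectorizer.get_feature_names_out())
```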
Motivation
• Especially when dealing with a large number
of variables there is a need for
dimensionality reduction!
• Feature Selection can significantly improve a
learning algorithm’s performance!
Approaches
• Wrapper
– feature selection takes into account the contribution to
the performance of a given type of classifier
• Filter
– feature selection is based on an evaluation criterion for
quantifying how well feature (subsets) discriminate the
two classes
• Embedded
– feature selection is part of the training procedure of a
classifier (e.g. decision trees)
Embedded methods
• Attempt to jointly or simultaneously train both a
classifier and a feature subset
• Often optimize an objective function that jointly
rewards accuracy of classification and
penalizes use of more features.
• Intuitively appealing
Example: tree-building algorithms
Adapted from J. Fridlyand
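A minimal sketch of an embedded method, assuming scikit-learn and synthetic data: a tree-based learner (here a random forest) produces feature importances as a by-product of training the classifier itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))           # 20 candidate features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # only the first two features are relevant

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]   # importances come with the model
print("features ranked by importance:", ranking[:5])
```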
Approaches to Feature Selection
Filter Approach:
Input Features → Feature Selection by Distance Metric Score → Train Model → Model
Wrapper Approach:
Input Features → Feature Selection Search → Feature Set → Train Model → Model
(the importance of features given by the model feeds back into the selection search)
Adapted from Shin and Jasso
Filter methods
R^p → Feature selection → R^s → Classifier design (s << p)
• Features are scored independently and the top s are used by
the classifier (a minimal sketch follows below)
• Score: correlation, mutual information, t-statistic, F-statistic,
p-value, tree importance statistic, etc.
Easy to interpret. Can provide some insight into the disease
markers.
Adapted from J. Fridlyand
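A minimal sketch of such a filter, assuming scikit-learn and synthetic data: each of the p features is scored independently with an F-statistic, the top s are kept, and the classifier is designed on the reduced space.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 5000))            # p = 5000 features, 80 samples
y = rng.integers(0, 2, 80)
X[:, :10] += 2 * y[:, None]                    # make the first 10 features informative

s = 10
selector = SelectKBest(score_func=f_classif, k=s)  # score each feature independently
X_s = selector.fit_transform(X, y)                 # keep the s top-scoring features
clf = LogisticRegression().fit(X_s, y)             # classifier designed on R^s, s << p
print("selected feature indices:", selector.get_support(indices=True))
```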
Problems with filter methods
• Redundancy in selected features: features are
considered independently and not measured on the
basis of whether they contribute new information
• Interactions among features generally can not be
explicitly incorporated (some filter methods are
smarter than others)
• Classifier has no say in what features should be used:
some scores may be more appropriate in conjunction
with some classifiers than others.
Adapted from J. Fridlyand
Dimension reduction: a variant on a filter
method
• Rather than retain a subset of s features, perform dimension
reduction by projecting features onto s principal components of
variation (e.g. PCA etc)
• Problem is that we are no longer dealing with one feature at a
time but rather a linear or possibly more complicated
combination of all features. It may be good enough for a black
box but how does one build a diagnostic chip on a
“supergene”? (even though we don’t want to confuse the tasks)
• Those methods tend not to work better than simple filter
methods.
Adapted from J. Fridlyand
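A minimal sketch of this variant, assuming scikit-learn and synthetic data: the features are projected onto s principal components, each of which ("supergene") mixes all original features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 5000))   # p = 5000 original features
y = rng.integers(0, 2, 80)

s = 10
pca = PCA(n_components=s)
Z = pca.fit_transform(X)              # each component is a linear combination of all features
clf = LogisticRegression().fit(Z, y)  # classifier built on the s "supergenes"
print(pca.components_.shape)          # (s, p): every original feature loads on every component
```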
Wrapper methods
R^p → Feature selection → R^s → Classifier design (s << p)
• Iterative approach: many feature subsets are scored based
on classification performance and the best is used.
• Selection of subsets: forward selection, backward selection,
forward-backward selection, tree harvesting, etc.
Adapted from J. Fridlyand
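A minimal sketch of a wrapper with greedy forward selection, assuming scikit-learn and synthetic data: every candidate subset is scored by the cross-validated performance of the classifier itself.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
y = (X[:, 3] - X[:, 7] > 0).astype(int)     # two relevant features among 30

clf = KNeighborsClassifier()
selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):                          # greedily add up to 5 features
    scores = [(cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean(), j)
              for j in remaining]           # score every candidate subset with the classifier
    best_score, best_j = max(scores)
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"added feature {best_j}, CV accuracy = {best_score:.2f}")
```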
Problems with wrapper methods
• Computationally expensive: for each feature
subset to be considered, a classifier must be
built and evaluated
• No exhaustive search is possible (2^p subsets
to consider): generally greedy algorithms only.
• Easy to overfit.
Adapted from J. Fridlyand
Example: Microarray Analysis
• "Labeled" cases: 38 bone marrow samples (27 AML, 11 ALL),
each containing 7129 gene expression values
• Train a model on these cases (using Neural Networks, Support Vector
Machines, Bayesian nets, etc.)
• The trained model, based on key genes, classifies 34 new
unlabeled bone marrow samples as AML or ALL
Microarray Data Challenges:
• Few samples for analysis (38 labeled)
• Extremely high-dimensional data (7129 gene
expression values per sample)
• Noisy data
• Complex underlying mechanisms, not fully
understood
Some genes are more useful than others
for building classification models
Example: genes 36569_at and 36495_at are useful
[Figure: expression values of genes 36569_at and 36495_at, AML vs. ALL samples]
Some genes are more useful than others
for building classification models
Example: genes 37176_at and 36563_at not useful
Importance of Feature (Gene)
Selection
• Majority of genes are not directly related to leukemia
• Having a large number of features enhances the model’s
flexibility, but makes it prone to overfitting
• Noise and the small number of training samples make this even
more likely
• Some types of models, like kNN, do not scale well with many
features
Classification: CV error
• Training error
– Empirical error on the N training samples
• Error on an independent test set
– Test error
• Cross-validation (CV) error
– Split the N samples: 1/n of the samples for testing,
(n-1)/n of the samples for training
– Count errors on the test part; summarize the CV error rate over the splits
– Schemes: Leave-one-out (LOO), N-fold CV
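A minimal sketch of these estimates, assuming scikit-learn and synthetic data: the empirical (training) error versus the leave-one-out CV error.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))     # N = 40 samples
y = rng.integers(0, 2, 40)

clf = KNeighborsClassifier(n_neighbors=3)
train_error = 1 - clf.fit(X, y).score(X, y)                          # empirical error
loo_error = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()  # LOO CV error
print(f"training error = {train_error:.2f}, LOO CV error = {loo_error:.2f}")
```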
Two schemes of cross validation
CV1: from the N samples, within each LOO fold, train and test
both the feature selector and the classifier; count errors.
CV2: perform feature selection once on all N samples, then
within each LOO fold train and test the classifier; count errors.
Difference between CV1 and CV2
• CV1: gene selection within LOOCV
• CV2: gene selection before LOOCV
• CV2 can yield an optimistic estimate of the true classification error
• CV2 used in the paper by Golub et al.:
– 0 training errors
– 2 CV errors (5.26%)
– 5 test errors (14.7%)
– CV error different from test error!
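A minimal sketch contrasting the two schemes on pure-noise data (true error 50%), assuming scikit-learn: selecting genes before LOOCV (CV2) leaks information from the test samples and looks optimistic, while selecting within every fold (CV1) does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2000))   # pure noise: no real class structure
y = rng.integers(0, 2, 40)
loo = LeaveOneOut()

# CV2: gene selection before LOOCV (uses the test samples during selection)
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
cv2_error = 1 - cross_val_score(LogisticRegression(), X_sel, y, cv=loo).mean()

# CV1: gene selection repeated within every LOOCV fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
cv1_error = 1 - cross_val_score(pipe, X, y, cv=loo).mean()

print(f"CV2 (optimistic) error = {cv2_error:.2f}, CV1 error = {cv1_error:.2f}")
```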
Significance of classification results
• Permutation test:
– Permute class label of samples
– LOOCV error on data with permuted labels
– Repeat the process a large number of times
– Compare with LOOCV error on original data:
• P-value = (# times LOOCV on permuted data <= LOOCV on original data) /
total # of permutations considered
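A minimal sketch of this permutation test, assuming scikit-learn and synthetic data; the classifier and the number of permutations are illustrative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def loocv_error(X, y):
    clf = KNeighborsClassifier(n_neighbors=3)
    return 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = rng.integers(0, 2, 30)

observed = loocv_error(X, y)                 # LOOCV error on the original labels
n_perm = 200                                 # number of permutations considered
perm_errors = [loocv_error(X, rng.permutation(y)) for _ in range(n_perm)]
p_value = np.mean([e <= observed for e in perm_errors])   # fraction at least as good
print(f"LOOCV error = {observed:.2f}, permutation p-value = {p_value:.3f}")
```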
Feature Selection techniques in a
nutshell
WHY?
WHAT?
HOW?
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507-17.
Now it is the teams' work!
• Meetings: 29/11, 6/12:
– each team comes to discuss with E.M. once a
week for 30 minutes
• In the last lecture (13 Dec):
– Report due date
– Public talk by each team
See you on the 29/11!