
Feature Selection and Its Application
in Genomic Data Analysis
March 9, 2004
Lei Yu
Arizona State University
Outline
Introduction to feature selection
  Motivation
  Problem statement
  Key research issues
Application in genomic data analysis
  Overview of data mining for microarray data
  Gene selection
  A case study
Current research directions
Motivation
An active field in
  Pattern recognition
  Machine learning
  Data mining
  Statistics
[Figure: feature selection searches for the subset of features F1, ..., FN with the best goodness with respect to the class C]

A typical data set: M instances, N features, and a class label

       F1   F2   ...  FN  | C
  I1   f11  f12  ...  f1N | c1
  I2   f21  f22  ...  f2N | c2
  ...  ...  ...  ...  ... | ...
  IM   fM1  fM2  ...  fMN | cM
Reducing dimensionality
Improving learning efficiency
Increasing predictive accuracy
Reducing complexity of learned results
Problem Statement
A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
Selecting a minimum subset G such that P(C|G) is equal to, or as close as possible to, P(C|F) (Koller and Sahami, 1996)
An Example for the Problem
Data set:

  F1  F2  F3  F4  F5 | C
  0   0   1   0   1  | 0
  0   1   0   0   1  | 1
  1   0   1   0   1  | 1
  1   1   0   0   1  | 1
  0   0   1   1   0  | 0
  0   1   0   1   0  | 1
  1   0   1   1   0  | 1
  1   1   0   1   0  | 1
Five Boolean features
  C = F1 ∨ F2
  F3 = ¬F2, F5 = ¬F4
Optimal subset: {F1, F2} or {F1, F3}
Searching for an optimal subset is combinatorial in nature (see the sketch below)
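
The search this implies can be made concrete. Below is a minimal Python sketch (my illustration, not from the slides) that enumerates subsets of the toy data set, smallest first, and keeps those consistent with the class; even with N = 5 it may inspect up to 2^5 - 1 candidates, which is exactly the combinatorial problem:

    from itertools import combinations

    # Toy data set from this slide: five Boolean features, C = F1 OR F2,
    # with the redundancies F3 = NOT F2 and F5 = NOT F4.
    X = [
        (0, 0, 1, 0, 1), (0, 1, 0, 0, 1), (1, 0, 1, 0, 1), (1, 1, 0, 0, 1),
        (0, 0, 1, 1, 0), (0, 1, 0, 1, 0), (1, 0, 1, 1, 0), (1, 1, 0, 1, 0),
    ]
    C = [0, 1, 1, 1, 0, 1, 1, 1]

    def consistent(subset):
        """True if no two instances agree on all selected features yet
        disagree on the class label."""
        seen = {}
        for row, label in zip(X, C):
            key = tuple(row[i] for i in subset)
            if seen.setdefault(key, label) != label:
                return False
        return True

    # Enumerate subsets smallest-first: exhaustive, hence exponential in N.
    for size in range(1, 6):
        found = [s for s in combinations(range(5), size) if consistent(s)]
        if found:
            print([tuple("F%d" % (i + 1) for i in s) for s in found])
            break  # prints [('F1', 'F2'), ('F1', 'F3')]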
Subset Search
[Figure: an example of the search space over feature subsets (Kohavi and John, 1997)]
Evaluation Measures
Wrapper model
  Relying on a predetermined classification algorithm
  Using predictive accuracy as the goodness measure
  High accuracy, but computationally expensive
Filter model
  Separating feature selection from classifier learning
  Relying on general characteristics of the data (distance, correlation, consistency)
  No bias toward any learning algorithm, and fast
The two models are contrasted in the sketch below.
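
A minimal sketch of the two evaluation styles on synthetic data (my illustration; the scikit-learn learner, the candidate subset, and the correlation score are arbitrary choices, not named by the slides):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))              # 100 instances, 5 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # class depends on F1, F2 only
    subset = [0, 1]                            # candidate subset to evaluate

    # Wrapper: goodness = predictive accuracy of a fixed learning algorithm.
    wrapper_score = cross_val_score(
        LogisticRegression(), X[:, subset], y, cv=5).mean()

    # Filter: goodness from a general data characteristic; here, the mean
    # absolute correlation of the selected features with the class.
    filter_score = np.mean(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])

    print(f"wrapper goodness (CV accuracy): {wrapper_score:.2f}")
    print(f"filter goodness (mean |corr|):  {filter_score:.2f}")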
A Framework for Algorithms
[Figure: Original Set → Subset Generation → Candidate Subset → Subset Evaluation → Current Best Subset; if the stopping criterion is not met, generate the next candidate subset; once it is met, output the Selected Subset]

One common instantiation of this loop is sketched below.
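
The sketch (my illustration, with a caller-supplied goodness function) maps each stage of the framework onto code using greedy forward search, one of many possible generation strategies:

    def forward_select(n_features, evaluate, max_size):
        """Greedy forward search through the framework's loop: generate
        candidate subsets, evaluate them, keep the current best subset,
        and stop when the criterion (size limit or no improvement) is met."""
        best, best_score = [], float("-inf")
        while len(best) < max_size:                      # stopping criterion
            candidates = [best + [f] for f in range(n_features)
                          if f not in best]              # subset generation
            scores = {tuple(c): evaluate(c)
                      for c in candidates}               # subset evaluation
            top = max(scores, key=scores.get)
            if scores[top] <= best_score:                # no improvement: stop
                return best
            best, best_score = list(top), scores[top]    # current best subset
        return best

    # Usage: forward_select(5, evaluate=my_goodness, max_size=3), where
    # my_goodness is any wrapper or filter measure from the previous slide.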
Feature Ranking
Weighting and ranking individual features, then selecting the top-ranked ones (sketched below)
Advantages
  Efficient: O(N) in terms of dimensionality N
  Easy to implement
Disadvantages
  Hard to determine the threshold (how many top features to keep)
  Unable to consider correlation between features
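
A minimal sketch of ranking-based selection (absolute correlation with the class is one possible weighting; the slides do not fix one):

    import numpy as np

    def rank_and_select(X, y, k):
        """Weight each feature individually, rank, and keep the top k.
        One scoring pass per feature: O(N) in the dimensionality N."""
        scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                           for j in range(X.shape[1])])
        return np.argsort(-scores)[:k]  # note: a near-duplicate of the best
                                        # feature would also rank at the top,
                                        # since features are scored in isolation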
Applications of Feature Selection
Text categorization
  Yang and Pedersen, 1997 (CMU)
  Forman, 2003 (HP Labs)
Image retrieval
  Swets and Weng, 1995 (MSU)
  Dy et al., 2003 (Purdue University)
Gene expression microarray data analysis
  Xing et al., 2001 (UC Berkeley)
  Lee et al., 2003 (Texas A&M)
Customer relationship management
  Ng and Liu, 2000 (NUS)
Intrusion detection
  Lee et al., 2000 (Columbia University)
Microarray Technology
Enabling the expression levels of thousands or tens of thousands of genes to be measured simultaneously in a single experiment
Providing new opportunities and challenges for data mining
  Gene        Value
  M23197_at   261
  U66497_at   88
  M92287_at   4778
  ...         ...
Two Ways to View Microarray Data
  Sample     M23197_at  U66497_at  M92287_at  ...  Class
  Sample 1   261        88         4778       ...  ALL
  Sample 2   101        74         2700       ...  ALL
  Sample 3   1450       34         498        ...  AML
  ...        ...        ...        ...        ...  ...
Data Mining Tasks
Data points are either genes or samples.
When data points are genes
  Clustering: grouping similar genes together to find co-regulated genes
When data points are samples
  Clustering: grouping similar samples together to find classes or subclasses
  Classification: building a classifier to predict the classes of new samples
Both views are illustrated in the sketch below.
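
A toy sketch of the two views of the same expression matrix (scikit-learn and the particular estimators are my choices; the slides name none):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    expr = rng.random((100, 20))        # toy matrix: 100 genes x 20 samples
    classes = rng.integers(0, 2, 20)    # toy sample labels, e.g. ALL=0, AML=1

    # Data points are genes: cluster the rows to find co-regulated groups.
    gene_groups = KMeans(n_clusters=5, n_init=10,
                         random_state=0).fit_predict(expr)

    # Data points are samples: treat the columns as instances and build a
    # classifier that predicts the class of a new sample.
    clf = KNeighborsClassifier(n_neighbors=3).fit(expr.T, classes)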
Gene Selection
Data characteristics in sample classification
  High dimensionality (thousands of genes)
  Small sample size (often fewer than 100 samples)
Problems
  Curse of dimensionality
  Overfitting the training data
Traditional gene selection methods
  Within the filter model
  Gene ranking
A Case Study (Golub et al., 1999)
Leukemia data
  7129 genes, 72 samples
  Training set: 38 samples (27 ALL, 11 AML)
  Test set: 34 samples (20 ALL, 14 AML)
Normalization: mean 0, standard deviation 1
Gene ranking by a correlation measure (sketched below)
[Figure: heatmap of normalized expression values (-3 to 3) of selected genes across ALL and AML samples]
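
The correlation measure Golub et al. used is the signal-to-noise statistic P(g) = (mu1 - mu2) / (sd1 + sd2), computed per gene between the two classes. A minimal sketch of ranking genes by it (the code is my illustration; the 50-gene cut-off follows the paper):

    import numpy as np

    def signal_to_noise(expr, y):
        """Golub et al.'s per-gene statistic on a genes x samples matrix:
        (mean difference between classes) / (sum of within-class std devs)."""
        a, b = expr[:, y == 0], expr[:, y == 1]   # e.g., ALL vs AML columns
        return ((a.mean(axis=1) - b.mean(axis=1))
                / (a.std(axis=1) + b.std(axis=1)))

    # Rank genes by |P(g)| and keep a fixed number (50 in the paper):
    # top = np.argsort(-np.abs(signal_to_noise(expr, labels)))[:50]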
Case Study (continued)
Performance of selected genes
  Accuracy on training set: 36 out of 38 (94.74%) correctly classified
  Accuracy on test set: 29 out of 34 (85.29%) correctly classified
Limitations
  Domain knowledge is required to determine the number of genes selected
  Unable to remove redundant genes
Feature/Gene Redundancy
Examining redundant genes
  Two heads are not necessarily better than one
  Effects of redundant genes
How to handle redundancy
  A continuing challenge
  Some recent work (the FCBF criterion is sketched below):
    MRMR (Maximum Relevance Minimum Redundancy) (Ding and Peng, CSB-2003)
    FCBF (Fast Correlation-Based Filter) (Yu and Liu, ICML-2003)
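
A schematic of the FCBF criterion (Yu and Liu, 2003): a gene is dropped as redundant when some already-selected, more relevant gene correlates with it at least as strongly as it correlates with the class. Absolute Pearson correlation stands in below for the symmetrical uncertainty measure used in the paper:

    import numpy as np

    def redundancy_aware_select(X, y, relevance_threshold=0.1):
        """Rank genes by relevance to the class, then keep a gene only if
        no already-selected gene makes it redundant."""
        rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                        for j in range(X.shape[1])])
        selected = []
        for j in np.argsort(-rel):                 # most relevant first
            if rel[j] < relevance_threshold:       # below threshold: stop
                break
            redundant = any(
                abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) >= rel[j]
                for s in selected)
            if not redundant:
                selected.append(j)
        return selected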
Research Directions
Feature selection for unlabeled data
  Shares the basic search and evaluation issues with the labeled case
  Differs in how goodness is measured without class labels
Dealing with different data types
  Nominal, discrete, and continuous features
  Discretization (a simple scheme is sketched below)
Dealing with large-size data
Comparative study and intelligent selection of feature selection methods
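
A small sketch of equal-width discretization, one simple way to map continuous values to discrete codes (NumPy, my illustration; the slides name no particular scheme):

    import numpy as np

    # Discretize a continuous feature into three equal-width intervals.
    x = np.array([0.10, 0.35, 0.40, 0.80, 0.95])
    cuts = np.linspace(x.min(), x.max(), 4)[1:-1]  # two interior cut points
    codes = np.digitize(x, cuts)                   # -> array([0, 0, 1, 2, 2])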
References
G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. ICML, 1994.
L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML, 2003.
T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999.
C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. CSB, 2003.
J. Shavlik and D. Page. Machine learning and genetic microarrays. ICML 2003 tutorial. http://www.cs.wisc.edu/~dpage/ICML-2003-Tutorial-ShavlikPage.ppt