Transcript PPT

Classification and Feature
Selection Algorithms for Multi-class
CGH data
Jun Liu, Sanjay Ranka, Tamer Kahveci
http://www.cise.ufl.edu
1
Gene copy number
• The number of copies of genes can vary from person to person.
  – ~0.4% of the gene copy numbers are different for pairs of people.
• Variations in copy numbers can alter resistance to disease.
  – EGFR copy number can be higher than normal in non-small cell lung cancer.
[Figure: lung images (ALA), healthy vs. cancer]
2
Comparative Genomic Hybridization (CGH)
3
Raw and smoothed CGH data
4
Example CGH dataset
862 genomic intervals in the Progenetix database
5
Problem description
• Given a new sample, which class does this sample belong to?
• Which features should we use to make this decision?
6
Outline
• Support Vector Machine (SVM)
• SVM for CGH data
• Maximum Influence Feature Selection algorithm
• Results
7
SVM in a nutshell
Support Vector Machine (SVM)
SVM for CGH data
Maximum Influence Feature Selection algorithm
Results
8
Classification with SVM
• Consider a two-class, linearly separable classification problem.
• Many decision boundaries!
• The decision boundary should be as far away from the data of both classes as possible.
  – We should maximize the margin, m.
[Figure: Class 1 and Class 2 separated by a decision boundary with margin m]
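For reference, the margin maximization on this slide corresponds to the standard SVM primal problem; the formulation below is the textbook version, not copied from the slide.

```latex
% For a separating hyperplane  w^T x + b = 0  with labels y_i in {+1, -1},
% requiring  y_i (w^T x_i + b) >= 1  gives a margin  m = 2 / \lVert w \rVert,
% so maximizing the margin is equivalent to solving:
\[
  \min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
  \quad\text{subject to}\quad
  y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1,\qquad i = 1,\dots,n.
\]
```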
9
SVM Formulation
• Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi.
• Maximize J over the αi, where the term xiᵀxj measures the similarity between xi and xj.
• The decision boundary can then be constructed from the optimal αi (see below).
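The formula images did not survive extraction; the standard (hard-margin) dual that "maximize J over αi" refers to, together with the resulting decision boundary, is reproduced below for completeness.

```latex
% Dual problem (maximize J over the alpha_i; the term x_i^T x_j is the
% similarity between x_i and x_j):
\[
  \max_{\alpha}\ J(\alpha)
    = \sum_{i=1}^{n} \alpha_i
      - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n}
        \alpha_i \alpha_j\, y_i y_j\, x_i^{\top} x_j
  \quad\text{s.t.}\quad
  \alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.
\]
% Decision boundary constructed from the optimal alpha_i:
\[
  f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{n} \alpha_i\, y_i\, x_i^{\top} x + b \Bigr).
\]
```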
10
SVM for CGH data
Support Vector Machine (SVM)
SVM for CGH data
Maximum Influence Feature Selection algorithm
Results
11
Pairwise similarity measures
• Raw measure
– Count the number of genomic intervals at which both samples have a gain (or both have a loss).
[Example figure: Raw = 3]
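A minimal Python sketch of the Raw measure, assuming each sample is a vector over {1, 0, -1} for gain, no change, and loss; the function name is mine, not from the paper.

```python
def raw_similarity(x, y):
    """Count genomic intervals where both samples show a gain (1, 1)
    or both show a loss (-1, -1)."""
    return sum(1 for a, b in zip(x, y) if a == b and a != 0)

# Toy example over six genomic intervals:
X = [0, 1, 1, 0, 1, -1]
Y = [0, 1, 0, -1, -1, -1]
print(raw_similarity(X, Y))  # -> 2
```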
12
SVM based on Raw kernel
• Using SVM with the Raw kernel amounts to solving the same quadratic program, maximizing J over the αi with the Raw kernel Raw(xi, xj) used in place of the inner product xiᵀxj.
• The resulting decision function likewise uses Raw(xi, x) in place of xiᵀx.
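A sketch of how the Raw kernel can be plugged into an off-the-shelf SVM via a precomputed kernel matrix; scikit-learn is used here only for illustration and is my assumption, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def raw_kernel(A, B):
    """Raw kernel matrix: entry (i, j) counts intervals where sample A[i]
    and sample B[j] agree on a gain or agree on a loss."""
    K = np.zeros((len(A), len(B)))
    for i, x in enumerate(A):
        for j, y in enumerate(B):
            K[i, j] = np.sum((x == y) & (x != 0))
    return K

# Toy data: rows are samples over {1, 0, -1}, with two class labels.
X_train = np.array([[1, 1, 0, -1], [1, 0, 0, -1], [0, -1, 1, 0], [0, -1, 1, 1]])
y_train = np.array([1, 1, -1, -1])
X_test = np.array([[1, 1, 0, 0]])

clf = SVC(kernel="precomputed")
clf.fit(raw_kernel(X_train, X_train), y_train)    # Gram matrix of the training data
print(clf.predict(raw_kernel(X_test, X_train)))   # kernel against training samples
```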
13
Is Raw kernel valid?
• Not every similarity function can serve as a kernel. This requires the underlying kernel matrix M to be "positive semi-definite".
• M is positive semi-definite if vᵀMv ≥ 0 for all vectors v.
14
Is Raw kernel valid?
• Proof: define a mapping Φ: a ∈ {1, 0, -1}^m → b ∈ {0, 1}^(2m), where
  – Φ(gain)      = Φ(1)  = 01
  – Φ(no-change) = Φ(0)  = 00
  – Φ(loss)      = Φ(-1) = 10
• Then Raw(X, Y) = Φ(X)ᵀ Φ(Y).
• Example:
     X = ( 0   1   1   0   1  -1)          Raw(X, Y) = 2
     Y = ( 0   1   0  -1  -1  -1)
  Φ(X) = (00  01  01  00  01  10)          Φ(X)ᵀ Φ(Y) = 2
  Φ(Y) = (00  01  00  10  10  10)
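The mapping in the proof can be checked mechanically; a small Python sketch of Φ (names mine) reproduces the example above.

```python
import numpy as np

ENCODING = {1: (0, 1), 0: (0, 0), -1: (1, 0)}  # gain -> 01, no-change -> 00, loss -> 10

def phi(x):
    """Map a {1, 0, -1}-vector of length m to a {0, 1}-vector of length 2m."""
    return np.array([bit for v in x for bit in ENCODING[v]])

X = [0, 1, 1, 0, 1, -1]
Y = [0, 1, 0, -1, -1, -1]
raw = sum(1 for a, b in zip(X, Y) if a == b and a != 0)
print(raw, phi(X) @ phi(Y))  # both equal 2
```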
15
Raw Kernel is valid!
• The Raw kernel can be written as Raw(X, Y) = Φ(X)ᵀ Φ(Y).
• Define the 2m-by-n matrix whose columns are Φ(x1), ..., Φ(xn), and let M denote the kernel matrix of Raw.
• Therefore M is positive semi-definite (see the argument below).
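The matrix on the slide was lost to extraction; the argument presumably runs as follows (the name A for the 2m-by-n matrix is mine).

```latex
% Stack the mapped samples as columns:  A = [\,\Phi(x_1)\ \cdots\ \Phi(x_n)\,]  (2m x n).
% The Raw kernel matrix is then  M = A^T A,  so for every vector v:
\[
  v^{\top} M v = v^{\top} A^{\top} A\, v = (A v)^{\top}(A v) = \lVert A v \rVert^{2} \ge 0,
\]
% hence M is positive semi-definite and Raw is a valid kernel.
```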
16
MIFS algorithm
Support Vector Machine (SVM)
SVM for CGH data
Maximum Influence Feature Selection algorithm
Results
17
MIFS for multi-class data
One-versus-all SVM
[Figure: MIFS illustrated on a toy example. Each one-versus-all SVM ranks the features by their contribution, from high to low. For every feature, its ranks across the classifiers are collected and sorted; the features are then sorted by these rank vectors, and the most promising feature (Feature 4 in the example) is inserted into the feature set.]
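A hedged sketch of one MIFS-style selection step as I read the figure: each one-versus-all SVM ranks the features, the ranks of each feature are sorted, and the feature with the best sorted rank vector is added. The contribution measure (absolute SVM weight) and the use of scikit-learn are assumptions, not details from the slide.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mifs_step(X, y, selected):
    """Pick the next feature: rank features per one-versus-all SVM,
    sort each feature's ranks, choose the lexicographically best vector."""
    n_features = X.shape[1]
    ranks = []
    for c in np.unique(y):
        clf = LinearSVC(dual=False).fit(X, (y == c).astype(int))
        order = np.argsort(-np.abs(clf.coef_[0]))      # most contributing feature first
        rank_of = np.empty(n_features, dtype=int)
        rank_of[order] = np.arange(1, n_features + 1)  # rank 1 = highest contribution
        ranks.append(rank_of)
    ranks = np.vstack(ranks)                           # one row per class
    candidates = [f for f in range(n_features) if f not in selected]
    return min(candidates, key=lambda f: tuple(sorted(ranks[:, f])))

# Usage sketch: repeatedly call mifs_step and append the chosen feature.
# selected = []
# selected.append(mifs_step(X, y, selected))
```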
18
Results
Support Vector Machine (SVM)
SVM for CGH data
Maximum Influence Feature Selection algorithm
Results
19
Dataset Details
Data taken from the Progenetix database
20
Datasets
Dataset sizes by similarity level and number of cancer classes:

#cancers   best   good   fair   poor
   2        478    466    351    373
   4       1160    790    800    800
   6       1100    850    880    810
   8       1000    830    750    760
21
Experimental results
• Comparison of linear and Raw kernels
[Chart: predictive accuracy (y-axis, 0.3 to 0.9) of the linear and Raw kernels on the sixteen datasets (best/good/fair/poor similarity × 2/4/6/8 cancer classes)]
On average, the Raw kernel improves predictive accuracy by 6.4% over the linear kernel across the sixteen datasets.
22
Experimental results
[Chart: classification accuracy (y-axis, 0.5 to 0.8) versus number of selected features (4 to 862) for MIFS, MRMR (Ding and Peng, 2005), SVM-RFE (Fu and Fu-Liu, 2005), and using all features. Using 40 features yields accuracy comparable to using all features; using 80 features yields accuracy comparable to or better than using all features.]
23
Using MIFS for feature selection
• Results testing the hypothesis that 40 features are enough and 80 features are better
24
A Web Server for Mining CGH Data
http://cghmine.cise.ufl.edu:8007/CGH/Default.html
25
Thank you
26
Appendix
27
Minimum Redundancy and
Maximum Relevance (MRMR)
• Relevance V is defined as the average mutual information between features and class labels.
• Redundancy W is defined as the average mutual information between all pairs of features.
• Incrementally select features by maximizing (V / W) or (V − W).

Example data (samples x1-x6, four features, class label Y):

        Feature 1  Feature 2  Feature 3  Feature 4    Y
  x1        0          1          1          0        1
  x2        0          1          1          0        1
  x3        0          1          1          0        1
  x4        0          0          0          1        1
  x5        0          0          0          0       -1
  x6        0          0          0          0       -1
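An illustrative MRMR sketch using discrete mutual information and the (V − W) criterion; this follows the description above rather than Ding and Peng's exact implementation, and the use of scikit-learn's mutual_info_score is an assumption.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Incrementally select k features maximizing relevance minus redundancy."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]            # start with the most relevant feature
    while len(selected) < k:
        def score(j):
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            return relevance[j] - redundancy          # the (V - W) criterion
        remaining = [j for j in range(n_features) if j not in selected]
        selected.append(max(remaining, key=score))
    return selected

# On the toy table above, features 2 and 3 carry the same information, so once
# one of them is selected the other is penalized by the redundancy term.
```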
28
Support Vector Machine Recursive
Feature Elimination (SVM-RFE)
Flowchart:
1. Train a linear SVM based on the current feature set.
2. Compute the weight vector.
3. Compute the ranking coefficient wi² for the ith feature.
4. Remove the feature with the smallest ranking coefficient.
5. If the feature set is empty, stop; otherwise return to step 1.
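A compact rendition of the SVM-RFE loop in the flowchart; the linear SVM and one-feature-at-a-time removal follow the flowchart, while the scikit-learn usage is my assumption rather than Fu and Fu-Liu's code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y):
    """Return feature indices in the order they are eliminated
    (least important first)."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:                                  # "is the feature set empty?"
        clf = LinearSVC(dual=False).fit(X[:, remaining], y)
        scores = np.sum(clf.coef_ ** 2, axis=0)       # ranking coefficient w_i^2
        worst = int(np.argmin(scores))                # smallest ranking coefficient
        eliminated.append(remaining.pop(worst))
    return eliminated
```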
29
Pairwise similarity measures
• Sim measure
  – A segment is a contiguous block of aberrations of the same type.
  – Count the number of overlapping segment pairs.
[Example figure: Sim = 2]
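A sketch of the Sim measure under my reading of the slide: segments are maximal runs of the same aberration type, and a pair counts when two same-type segments overlap in position. The function names and these details are assumptions.

```python
def segments(x):
    """Maximal runs of identical non-zero aberrations, as (start, end, type)."""
    segs, i = [], 0
    while i < len(x):
        if x[i] == 0:
            i += 1
            continue
        j = i
        while j + 1 < len(x) and x[j + 1] == x[i]:
            j += 1
        segs.append((i, j, x[i]))
        i = j + 1
    return segs

def sim(x, y):
    """Count pairs of same-type segments from x and y that overlap."""
    return sum(1
               for s1, e1, t1 in segments(x)
               for s2, e2, t2 in segments(y)
               if t1 == t2 and s1 <= e2 and s2 <= e1)
```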
30
Non-linear Decision Boundary
• How do we generalize SVM when the two-class classification problem is not linearly separable?
• Key idea: transform xi to a higher-dimensional space to "make life easier".
  – Input space: the space where the points xi are located.
  – Feature space: the space of the f(xi) after transformation.
[Figure: points mapped by f(.) from the input space to the feature space, where a linear decision boundary can be found]
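A standard one-dimensional illustration of the idea (mine, not from the slide): data that are not linearly separable in the input space become separable after a non-linear map.

```latex
% In one dimension, points labeled +1 when |x| > 1 and -1 otherwise cannot be
% separated by a single threshold.  Mapping them with
\[
  f(x) = \bigl(x,\; x^{2}\bigr)
\]
% places them in the plane, where the horizontal line  x^2 = 1  is a linear
% decision boundary separating the two classes.
```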
31