Mathematical Programming in Support Vector Machines


Support Vector Machine Data Mining
Olvi L. Mangasarian
with
Glenn M. Fung, Jude W. Shavlik
& Collaborators at ExonHit – Paris
Data Mining Institute
University of Wisconsin - Madison
What is a Support Vector Machine?
• An optimally defined surface
• Linear or nonlinear in the input space
• Linear in a higher dimensional feature space
• Implicitly defined by a kernel function: K(A, B) → C (a minimal sketch follows)
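To make the kernel idea concrete, here is a minimal sketch (not from the talk) of a Gaussian kernel matrix K(A, B) computed between the rows of two data matrices; the bandwidth parameter mu is an illustrative assumption, not a value used in the presentation.

    import numpy as np

    def gaussian_kernel(A, B, mu=0.1):
        """Gaussian kernel matrix: K[i, j] = exp(-mu * ||A_i - B_j||^2)."""
        sq_dists = (np.sum(A**2, axis=1)[:, None]
                    - 2.0 * A @ B.T
                    + np.sum(B**2, axis=1)[None, :])
        return np.exp(-mu * sq_dists)

    # Example: kernel between 5 points and 4 points in 3-dimensional input space
    A = np.random.rand(5, 3)
    B = np.random.rand(4, 3)
    K = gaussian_kernel(A, B)   # shape (5, 4): linear in the feature space, nonlinear in the input space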
What are Support Vector Machines Used For?
• Classification
• Regression & Data Fitting
• Supervised & Unsupervised Learning
Principal Topics
• Knowledge-based classification
  – Incorporate expert knowledge into a classifier
• Breast cancer prognosis & chemotherapy
  – Classify patients on the basis of distinct survival curves
  – Isolate a class of patients that may benefit from chemotherapy
• Multiple Myeloma detection via gene expression measurements
• Drug discovery based on gene macroarray expression
  – Joint work with ExonHit
Support Vector Machines
Maximize the Margin between Bounding Planes
[Figure: the two classes A+ and A− are separated by the bounding planes x'w = γ + 1 and x'w = γ − 1, with normal w; the margin between the bounding planes is 2/||w||. A sketch of the corresponding linear program follows.]
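As a hedged sketch of the optimization behind this picture (not shown on this slide), the standard 1-norm SVM of Bradley & Mangasarian, referenced later as SVM1, can be written roughly as the linear program

    min over (w, γ, y):   ν e'y + ||w||_1
    subject to:           D(Aw − eγ) + y ≥ e,   y ≥ 0

where A is the m × n matrix of training points, D is the diagonal matrix of ±1 class labels, e is a vector of ones, y is a vector of nonnegative slack variables, and ν > 0 trades off classification error against margin. Minimizing a norm of w maximizes the margin between the bounding planes; the 1-norm in particular drives many components of w to zero, which is what gives the feature selection used later in the talk.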
Principal Topics
Knowledge-based classification (NIPS*2002)

Conventional Data-Based SVM

Knowledge-Based SVM via Polyhedral Knowledge Sets

Incorporating Knowledge Sets into an SVM Classifier
• Suppose that the knowledge set {x | Bx ≤ b} belongs to the class A+. Hence it must lie in the halfspace:
  {x | x'w ≥ γ + 1}
• We therefore have the implication:
  Bx ≤ b  ⇒  x'w ≥ γ + 1
• This implication is equivalent to a set of constraints that can be imposed on the classification problem, as sketched below.
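One way to see how this works (the algebra is not spelled out on the slide; it follows the knowledge-based SVM construction of Fung, Mangasarian & Shavlik via linear-programming duality): assuming the knowledge set {x | Bx ≤ b} is nonempty, the implication Bx ≤ b ⇒ x'w ≥ γ + 1 holds if and only if there exists a multiplier vector u such that

    B'u + w = 0,   b'u + γ + 1 ≤ 0,   u ≥ 0.

These are linear conditions in (w, γ, u), so they, or slack-relaxed versions of them when the knowledge need only be satisfied approximately, can be appended as constraints to the SVM linear program sketched earlier.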
Numerical Testing
The Promoter Recognition Dataset
• Promoter: short DNA sequence that precedes a gene sequence.
• A promoter consists of 57 consecutive DNA nucleotides belonging to {A, G, C, T}.
• Important to distinguish between promoters and nonpromoters.
• This distinction identifies starting locations of genes in long uncharacterized DNA sequences.
The Promoter Recognition Dataset
Numerical Representation
• Simple “1 of N” mapping scheme for converting nominal attributes into a real-valued representation (a sketch follows this slide)
• Not the most economical representation, but commonly used
• Feature space mapped from the 57-dimensional nominal space to a real-valued 57 × 4 = 228-dimensional space: 57 nominal values → 57 × 4 = 228 binary values
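A minimal sketch of this 1-of-N encoding in Python (the ordering of the four indicators and the helper name are illustrative assumptions, not taken from the slides):

    import numpy as np

    NUCLEOTIDES = "AGCT"   # each nucleotide maps to one of four binary indicators

    def encode_promoter(sequence):
        """Map a 57-nucleotide string to a 57 x 4 = 228-dimensional binary vector."""
        assert len(sequence) == 57
        x = np.zeros(57 * 4)
        for i, base in enumerate(sequence):
            x[4 * i + NUCLEOTIDES.index(base)] = 1.0
        return x

    x = encode_promoter("A" * 20 + "G" * 17 + "C" * 10 + "T" * 10)
    print(x.shape)   # (228,)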
Promoter Recognition Dataset
Prior Knowledge Rules
• Prior knowledge consists of the following 64 rules:

  (R1 ∨ R2 ∨ R3 ∨ R4)  ∧  (R5 ∨ R6 ∨ R7 ∨ R8)  ∧  (R9 ∨ R10 ∨ R11 ∨ R12)  ⇒  PROMOTER

  (one rule from each of the three groups of four, giving 4 × 4 × 4 = 64 conjunctive rules)
Promoter Recognition Dataset
Sample Rules
R4 : (p−36 = T) ∧ (p−35 = T) ∧ (p−34 = G) ∧ (p−33 = A) ∧ (p−32 = C);
R8 : (p−12 = T) ∧ (p−11 = A) ∧ (p−07 = T);
R10 : (p−45 = A) ∧ (p−44 = A) ∧ (p−41 = A);

where pj denotes the position of a nucleotide with respect to a meaningful reference point, starting at position p−50 and ending at position p7.

Then:
R4 ∧ R8 ∧ R10  ⇒  PROMOTER
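A hedged sketch of how such a rule can be turned into a polyhedral knowledge set {x | Bx ≤ b} over the 228-dimensional binary representation (the talk does not show this step in code; the position list, which assumes position 0 is skipped so that the count comes out to 57, and the helper names are illustrative assumptions): the condition p−36 = T says that the binary feature for T at position −36 equals 1, which is the linear inequality −x_j ≤ −1.

    import numpy as np

    NUCLEOTIDES = "AGCT"
    # 57 positions: p-50 ... p-1 and p1 ... p7 (assumed; position 0 skipped)
    POSITIONS = list(range(-50, 0)) + list(range(1, 8))

    def feature_index(position, base):
        """Index of the binary feature 'nucleotide <base> at position <position>'."""
        return 4 * POSITIONS.index(position) + NUCLEOTIDES.index(base)

    def rule_to_knowledge_set(rule, dim=228):
        """Turn a conjunction such as R4 = [(-36, 'T'), (-35, 'T'), ...] into (B, b)
        so that {x : Bx <= b} requires each named feature to be at least 1."""
        B = np.zeros((len(rule), dim))
        b = np.zeros(len(rule))
        for row, (pos, base) in enumerate(rule):
            B[row, feature_index(pos, base)] = -1.0   # -x_j <= -1, i.e. x_j >= 1
            b[row] = -1.0
        return B, b

    R4 = [(-36, "T"), (-35, "T"), (-34, "G"), (-33, "A"), (-32, "C")]
    B, b = rule_to_knowledge_set(R4)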
The Promoter Recognition Dataset
Comparative Algorithms
• KBANN: Knowledge-based artificial neural network [Shavlik et al.]
• BP: Standard back propagation for neural networks [Rumelhart et al.]
• O’Neill’s Method: Empirical method suggested by biologist O’Neill [O’Neill]
• NN: Nearest neighbor with k = 3 [Cost et al.]
• ID3: Quinlan’s decision tree builder [Quinlan]
• SVM1: Standard 1-norm SVM [Bradley et al.]
The Promoter Recognition Dataset
Comparative Test Results
Principal Topics
Breast cancer prognosis & chemotherapy
Kaplan-Meier Curves for Overall Patients:
With & Without Chemotherapy
Breast Cancer Prognosis & Chemotherapy
Good, Intermediate & Poor Patient Groupings
(6 Input Features: 5 Cytological, 1 Histological)
(Clustering: Utilizes 2 Histological Features & Chemotherapy)
[Flowchart] 253 patients (113 NoChemo, 140 Chemo):
1. Define two extreme groups by rules:
   Good1: Lymph = 0 AND Tumor < 2
   Poor1: Lymph >= 5 OR Tumor >= 4
2. Compute the median of each group using the 6 features to obtain initial cluster centers.
3. Cluster the 113 NoChemo patients and the 140 Chemo patients separately, using the k-Median algorithm with the medians of Good1 & Poor1 as initial centers (a sketch of the algorithm follows this slide).
4. Resulting groupings:
   Good:         69 NoChemo Good
   Intermediate: 44 NoChemo Poor + 67 Chemo Good
   Poor:         73 Chemo Poor
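A minimal sketch of the k-Median clustering step, assuming the standard 1-norm variant of k-means (points are assigned by 1-norm distance and centers are updated as coordinate-wise medians); the placeholder data and initial centers are illustrative, not the clinical features from the talk.

    import numpy as np

    def k_median(X, centers, n_iter=100):
        """k-Median clustering: assign by 1-norm distance, update centers as
        coordinate-wise medians of each cluster."""
        centers = np.array(centers, dtype=float)
        for _ in range(n_iter):
            dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
            labels = dists.argmin(axis=1)
            new_centers = np.array([
                np.median(X[labels == k], axis=0) if np.any(labels == k) else centers[k]
                for k in range(len(centers))
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    # Example: cluster the 113 NoChemo patients (placeholder data, 6 features)
    X = np.random.rand(113, 6)
    init = [X[:10].mean(axis=0), X[-10:].mean(axis=0)]   # stand-ins for the Good1 / Poor1 medians
    labels, centers = k_median(X, init)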
Kaplan-Meier Survival Curves
for Good, Intermediate & Poor Patients
82.7% Classifier Correctness via 3 SVMs
Kaplan-Meier Survival Curves for Intermediate Group
Note Reversed Role of Chemotherapy
Multiple Myeloma Detection
• Multiple Myeloma is cancer of the plasma cell
• Plasma cells normally produce antibodies
• Out-of-control plasma cells produce tumors
• When tumors appear in multiple sites, they are called Multiple Myeloma
Dataset
• 105 patients: 74 with MM, 31 healthy
• Each patient is represented by 7008 gene measurements taken from plasma cell samples
• For each one of the 7008 gene measurements:
  – Absolute Call (AC): Absent (A), Marginal (M) or Present (P)
  – Average Difference (AD): positive or negative number
Multiple Myeloma Data Representation
• Absolute Call encoded as three binary values: A → 1 0 0, M → 0 1 0, P → 0 0 1
• AC (A/M/P): 7008 × 3 = 21,024 binary values
• AD: 7008 real values
• Total = 28,032 features per patient
• 105 patients: 74 MM + 31 healthy
• 105 × 28,032 data matrix A
Multiple Myeloma 1-Norm SVM Linear Classifier
• Leave-one-out correctness (looc) = 100% (a sketch of the leave-one-out protocol follows this slide)
• Average number of features used = 7 per fold
• Total computing time for 105 folds = 7892 sec.
• Overall number of features used in 105 folds = 7
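A hedged sketch of this leave-one-out protocol using an L1-regularized linear classifier; scikit-learn's LinearSVC with an L1 penalty is used here as a stand-in for the 1-norm SVM linear program actually solved in the talk, and the data array is a random placeholder with the dimensions from the slide.

    import numpy as np
    from sklearn.svm import LinearSVC

    def leave_one_out(A, y, C=1.0):
        """Leave-one-out correctness of an L1-regularized linear classifier,
        tracking how many features receive nonzero weight in each fold."""
        correct, features_per_fold = 0, []
        for i in range(len(y)):
            mask = np.arange(len(y)) != i
            clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C)
            clf.fit(A[mask], y[mask])
            correct += int(clf.predict(A[i:i + 1])[0] == y[i])
            features_per_fold.append(int(np.count_nonzero(clf.coef_)))
        return correct / len(y), float(np.mean(features_per_fold))

    A = np.random.rand(105, 28032)              # placeholder 105 x 28,032 data matrix
    y = np.array([1] * 74 + [-1] * 31)          # 74 MM, 31 healthy
    looc, avg_features = leave_one_out(A, y)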
Breast Cancer Treatment Response
Joint with ExonHit - Paris (Curie Dataset)
• 35 patients treated by a drug cocktail
• 9 partial responders; 26 nonresponders
• 25 gene expressions out of 692, selected by Arnaud Zeboulon
• Most patients had 3 replicate measurements
• 1-Norm SVM classifier selected 14 out of 25 gene expressions
  – Leave-one-out correctness was 80%
• Greedy combinatorial approach selected 5 genes out of 14 (a sketch follows this slide)
• Separating plane obtained in 5-dimensional gene-expression space
  – Replicates of all patients except one used in training
  – Average of replicates of the patient left out used for testing
  – Leave-one-out correctness was 33 out of 35, or 94.2%
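A hedged sketch of a greedy combinatorial gene selection of this kind, implemented as forward selection scored by leave-one-out correctness of a linear SVM; the scoring choice, helper names, and data are illustrative assumptions, since the talk does not spell out the procedure.

    import numpy as np
    from sklearn.svm import LinearSVC

    def loo_correctness(A, y):
        """Leave-one-out correctness of a linear SVM on the given feature subset."""
        hits = 0
        for i in range(len(y)):
            mask = np.arange(len(y)) != i
            clf = LinearSVC(dual=False).fit(A[mask], y[mask])
            hits += int(clf.predict(A[i:i + 1])[0] == y[i])
        return hits / len(y)

    def greedy_select(A, y, n_genes=5):
        """Greedily add the gene whose inclusion gives the best leave-one-out correctness."""
        selected, remaining = [], list(range(A.shape[1]))
        while len(selected) < n_genes:
            best = max(remaining, key=lambda j: loo_correctness(A[:, selected + [j]], y))
            selected.append(best)
            remaining.remove(best)
        return selected

    A = np.random.rand(35, 14)                  # placeholder: 35 patients, 14 candidate genes
    y = np.array([1] * 9 + [-1] * 26)           # 9 partial responders, 26 nonresponders
    genes = greedy_select(A, y, n_genes=5)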
Separation of Convex Hull of Replicates of:
10 Synthetic Nonresponders & 4 Synthetic Partial Responders
Linear Classifier in 3-Gene Space
35 Patients with 93 Replicates
26 Nonresponders & 9 Partial Responders
Conclusion
• New approaches for SVM-based classification
• Algorithms capable of classifying data with few examples in very large dimensional spaces
  – Typical of microarray classification problems
• Classifiers based on abstract prior knowledge as well as conventional datasets
• Identification of breast cancer patients that can benefit from chemotherapy
• Useful tool for drug discovery