Support Vector Machines:
Hype or Hallelujah?
Kristin Bennett
Math Sciences Dept.
Rensselaer Polytechnic Institute
http://www.rpi.edu/~bennek
M2000, October 2-4, 2000
Outline
Support Vector Machines for Classification
Linear Discrimination
Nonlinear Discrimination
Extensions
Application in Drug Design
Hallelujah
Hype
Support Vector Machines
(SVM)
A methodology for inference based on Vapnik’s Statistical Learning Theory.
Key Ideas:
“Maximize Margins”
“Do the Dual”
“Construct Kernels”
Best Linear Separator?

[Sequence of figures: the same two classes of points shown with different candidate separating lines.]
Find Closest Points in Convex Hulls

[Figure: the convex hulls of the two classes, with closest points c and d.]
Plane Bisects Closest Points

The separating plane is $x \cdot w = b$, with normal $w = c - d$; it bisects the segment joining the closest points c and d.
Find using quadratic program

$$
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\,\| c - d \|^2, \qquad
c = \sum_{i \in \text{Class }1} \alpha_i x_i, \qquad
d = \sum_{i \in \text{Class }-1} \alpha_i x_i \\
\text{s.t.}\quad & \sum_{i \in \text{Class }1} \alpha_i = 1, \qquad
\sum_{i \in \text{Class }-1} \alpha_i = 1, \qquad
\alpha_i \ge 0, \quad i = 1,\ldots,\ell
\end{aligned}
$$
Many existing and new solvers.
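As an illustration of "many existing and new solvers", here is a minimal sketch that solves this closest-points QP with scipy's SLSQP routine; the toy data and variable names are assumptions made for the example, not part of the talk.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-D data: rows of X are points, y gives the class (+1 or -1).
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],   # Class +1
              [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])  # Class -1
y = np.array([1, 1, 1, -1, -1, -1])
pos, neg = (y == 1), (y == -1)
n = len(y)

def objective(alpha):
    # 1/2 ||c - d||^2, where c and d are convex combinations of each class
    c = X[pos].T @ alpha[pos]
    d = X[neg].T @ alpha[neg]
    return 0.5 * np.sum((c - d) ** 2)

constraints = [
    {"type": "eq", "fun": lambda a: np.sum(a[pos]) - 1.0},  # multipliers of Class +1 sum to 1
    {"type": "eq", "fun": lambda a: np.sum(a[neg]) - 1.0},  # multipliers of Class -1 sum to 1
]
bounds = [(0.0, None)] * n                                  # alpha_i >= 0
alpha0 = np.where(pos, 1.0 / pos.sum(), 1.0 / neg.sum())    # feasible starting point

res = minimize(objective, alpha0, bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x
c = X[pos].T @ alpha[pos]
d = X[neg].T @ alpha[neg]
w = c - d                                  # normal of the bisecting plane (next slide)
b = 0.5 * (np.dot(w, c) + np.dot(w, d))    # plane x.w = b passes through the midpoint of cd
print(w, b)
```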
Best Linear Separator:
Supporting Plane Method

Maximize the distance between two parallel supporting planes,
$$
x \cdot w + b = +1 \qquad \text{and} \qquad x \cdot w + b = -1.
$$
Distance = "Margin" = $\dfrac{2}{\|w\|}$
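For completeness, the margin value follows from the point-to-plane distance formula, a standard step the slide does not spell out: for any point $x_0$ on the plane $x \cdot w + b = -1$,
$$
\operatorname{dist}\bigl(x_0,\; \{x : x \cdot w + b = +1\}\bigr)
= \frac{|x_0 \cdot w + b - 1|}{\|w\|}
= \frac{|-1 - 1|}{\|w\|}
= \frac{2}{\|w\|}.
$$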
Maximize margin using quadratic program

$$
\begin{aligned}
\min_{w,b}\quad & \tfrac{1}{2}\,\|w\|^2 \\
\text{s.t.}\quad & x_i \cdot w + b \ge +1, \quad i \in \text{Class }1 \\
& x_i \cdot w + b \le -1, \quad i \in \text{Class }-1
\end{aligned}
$$
Dual of Closest Points Method is Support Plane Method

Closest points (dual form):
$$
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\,\Bigl\|\sum_{i=1}^{\ell} y_i \alpha_i x_i\Bigr\|^2 \\
\text{s.t.}\quad & \sum_{i \in \text{Class }1} \alpha_i = 1, \qquad \sum_{i \in \text{Class }-1} \alpha_i = 1, \qquad \alpha_i \ge 0
\end{aligned}
$$

Supporting planes (primal form):
$$
\begin{aligned}
\min_{w,b}\quad & \tfrac{1}{2}\,\|w\|^2 \\
\text{s.t.}\quad & y_i\,(x_i \cdot w + b) \ge 1, \quad i = 1,\ldots,\ell
\end{aligned}
$$

where
$$
y_i = \begin{cases} +1 & i \in \text{Class }1 \\ -1 & i \in \text{Class }-1 \end{cases}
$$

Solution only depends on support vectors:
$$
w = \sum_{i=1}^{\ell} y_i \alpha_i x_i
$$
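The expression for $w$ comes from a standard Lagrangian step that the slide leaves implicit (a sketch, not part of the talk): for the supporting-plane problem,
$$
L(w,b,\alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{\ell} \alpha_i \bigl[\, y_i (x_i \cdot w + b) - 1 \,\bigr],
\qquad
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{\ell} y_i \alpha_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{\ell} y_i \alpha_i = 0.
$$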
Statistical Learning Theory

Misclassification error and the function complexity bound the generalization error.
Maximizing margins minimizes complexity.
"Eliminates" overfitting.
Solution depends only on the Support Vectors, not on the number of attributes.
Margins and Complexity

A skinny margin is more flexible, and thus more complex.
Margins and Complexity

A fat margin is less complex.
Linearly Inseparable Case

The convex hulls intersect, so the same argument won't work.
Reduced Convex Hulls Don't Intersect

Replace each convex hull by a reduced convex hull:
$$
\Bigl\{\, \sum_{i \in \text{Class}} \alpha_i x_i \;:\; \sum_{i \in \text{Class}} \alpha_i = 1,\; 0 \le \alpha_i \le D \,\Bigr\}
$$
Reduce by adding an upper bound $D < 1$ on each multiplier.
Find Closest Points Then Bisect

$$
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\,\Bigl\| \sum_{i \in \text{Class }1} \alpha_i x_i - \sum_{i \in \text{Class }-1} \alpha_i x_i \Bigr\|^2 \\
\text{s.t.}\quad & \sum_{i \in \text{Class }1} \alpha_i = 1, \qquad \sum_{i \in \text{Class }-1} \alpha_i = 1, \\
& 0 \le \alpha_i \le D
\end{aligned}
$$

No change except for D. D determines the number of Support Vectors, as in the sketch below.
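Concretely, continuing the closest-points sketch shown earlier, only the bounds on the multipliers change; the value of D here is an illustrative assumption.

```python
D = 0.5                          # assumed upper bound on each alpha_i
bounds = [(0.0, D)] * n          # was (0.0, None) in the separable sketch
res = minimize(objective, alpha0, bounds=bounds,
               constraints=constraints, method="SLSQP")
```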
Linearly Inseparable Case:
Supporting Plane Method

Just add a non-negative error vector z.

$$
\begin{aligned}
\min_{w,b,z}\quad & \tfrac{1}{2}\,\|w\|^2 + C \sum_{i=1}^{\ell} z_i \\
\text{s.t.}\quad & y_i\,(x_i \cdot w + b) + z_i \ge 1, \\
& z_i \ge 0, \quad i = 1,\ldots,\ell
\end{aligned}
$$
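A minimal sketch of this soft-margin primal written with cvxpy as the QP solver; the toy data and the value of C are assumptions made for the illustration.

```python
import numpy as np
import cvxpy as cp

# Toy data: two slightly overlapping classes in 2-D.
X = np.array([[0.0, 0.0], [1.0, 0.5], [2.2, 2.1],
              [3.0, 3.0], [4.0, 3.5], [1.0, 1.2]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape
C = 10.0                                    # error/margin trade-off (assumed)

w = cp.Variable(d)
b = cp.Variable()
z = cp.Variable(n)                          # non-negative error vector

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(z))
constraints = [cp.multiply(y, X @ w + b) + z >= 1, z >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```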
Dual of Closest Points Method is Support Plane Method

Reduced closest points (dual form):
$$
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2}\,\Bigl\|\sum_{i=1}^{\ell} y_i \alpha_i x_i\Bigr\|^2 \\
\text{s.t.}\quad & \sum_{i \in \text{Class }1} \alpha_i = 1, \qquad \sum_{i \in \text{Class }-1} \alpha_i = 1, \\
& D \ge \alpha_i \ge 0, \quad i = 1,\ldots,\ell
\end{aligned}
$$

Supporting planes (primal form):
$$
\begin{aligned}
\min_{w,b,z}\quad & \tfrac{1}{2}\,\|w\|^2 + C \sum_{i=1}^{\ell} z_i \\
\text{s.t.}\quad & y_i\,(x_i \cdot w + b) + z_i \ge 1, \qquad z_i \ge 0, \quad i = 1,\ldots,\ell
\end{aligned}
$$

Solution only depends on support vectors:
$$
w = \sum_{i=1}^{\ell} y_i \alpha_i x_i
$$
Nonlinear Classification

$$
x = (a,\, b), \qquad x \cdot w = w_1 a + w_2 b
$$
$$
\Phi(x) = (a,\, b,\, ab,\, a^2,\, b^2)
$$
$$
\Phi(x) \cdot w = w_1 a + w_2 b + w_3\, ab + w_4\, a^2 + w_5\, b^2
$$
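A tiny sketch of this explicit feature map in code; the function name, weights, and data point are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Map a 2-D point (a, b) to the 5-D feature space (a, b, ab, a^2, b^2)."""
    a, b = x
    return np.array([a, b, a * b, a ** 2, b ** 2])

x = np.array([2.0, 3.0])
w = np.array([1.0, -1.0, 0.5, 0.2, -0.3])   # weights in feature space (assumed)
print(phi(x) @ w)   # linear discriminant in 5-D = quadratic surface in the original 2-D
```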
Nonlinear Classification: Map to higher dimensional space

IDEA: Map each point to a higher dimensional feature space and construct a linear discriminant in that higher dimensional space.

Define $\Phi(x) : \mathbb{R}^n \rightarrow \mathbb{R}^{n'}$, with $n' \gg n$.

Dual SVM becomes:
$$
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} y_i y_j \alpha_i \alpha_j\, \Phi(x_i) \cdot \Phi(x_j) \;-\; \sum_{i=1}^{\ell} \alpha_i \\
\text{s.t.}\quad & \sum_{i=1}^{\ell} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \quad i = 1,\ldots,\ell
\end{aligned}
$$
Generalized Inner Product

By Hilbert-Schmidt Kernels (Courant and Hilbert 1953):
$$
\Phi(u) \cdot \Phi(v) = K(u, v)
$$
for certain $\Phi$ and $K$, e.g.

Degree d polynomial: $K(u,v) = (u \cdot v + 1)^d$
Radial Basis Function Machine: $K(u,v) = \exp\!\bigl(-\,\|u - v\|^2 / (2\sigma^2)\bigr)$
Two-Layer Neural Network: $K(u,v) = \operatorname{sigmoid}\bigl(\kappa\,(u \cdot v) + c\bigr)$
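These three kernels are easy to write directly; the sketch below assumes NumPy, tanh as the sigmoid, and illustrative parameter values.

```python
import numpy as np

def poly_kernel(u, v, d=2):
    """Degree-d polynomial kernel."""
    return (np.dot(u, v) + 1.0) ** d

def rbf_kernel(u, v, sigma=1.0):
    """Radial basis function (Gaussian) kernel."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(u, v, kappa=1.0, c=-1.0):
    """Sigmoid (two-layer neural network) kernel; not positive definite for all parameters."""
    return np.tanh(kappa * np.dot(u, v) + c)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```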
Final Classification via Kernels

The Dual SVM becomes:
$$
\begin{aligned}
\min_{\alpha}\quad & \tfrac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} y_i y_j \alpha_i \alpha_j\, K(x_i, x_j) \;-\; \sum_{i=1}^{\ell} \alpha_i \\
\text{s.t.}\quad & \sum_{i=1}^{\ell} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \quad i = 1,\ldots,\ell
\end{aligned}
$$
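A minimal sketch of solving this kernelized dual numerically; the toy data, the RBF kernel, and the use of scipy's SLSQP solver are illustrative assumptions rather than part of the talk.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data and labels (assumed for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
Q = np.outer(y, y) * K                                   # Q[i, j] = y_i y_j K(x_i, x_j)

def dual_objective(a):
    return 0.5 * a @ Q @ a - a.sum()

constraints = [{"type": "eq", "fun": lambda a: y @ a}]   # sum_i y_i alpha_i = 0
bounds = [(0.0, None)] * n                               # alpha_i >= 0
res = minimize(dual_objective, np.zeros(n), bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x
print(alpha)
```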
Final SVM Algorithm

Solve the Dual SVM QP.
Recover the primal variable b.
Classify a new x:
$$
f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{\ell} \alpha_i y_i K(x, x_i) + b \Bigr)
$$
Solution only depends on the support vectors ($\alpha_i > 0$).
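Continuing the dual sketch above, the remaining two steps of the algorithm look roughly like this; recovering b from a single support vector is one common choice, assumed here for simplicity.

```python
# Support vectors are the points with alpha_i > 0 (up to a numerical tolerance).
sv = np.where(alpha > 1e-6)[0]

# Recover b from a support vector s: y_s * (sum_i alpha_i y_i K(x_s, x_i) + b) = 1.
s = sv[0]
b = y[s] - sum(alpha[i] * y[i] * K[s, i] for i in sv)

def classify(x_new):
    """f(x) = sign( sum_i alpha_i y_i K(x, x_i) + b ), summing over support vectors only."""
    val = sum(alpha[i] * y[i] * rbf_kernel(x_new, X[i]) for i in sv) + b
    return int(np.sign(val))

print(classify(np.array([0.5, 0.5])), classify(np.array([3.5, 3.5])))
```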
Support Vector Machines
(SVM)
Key Formulation Ideas:
“Maximize Margins”
“Do the Dual”
“Construct Kernels”
Generalization Error Bounds
Practical Algorithms
SVM Extensions
Regression
Variable Selection
Boosting
Density Estimation
Unsupervised Learning
Novelty/Outlier Detection
Feature Detection
Clustering
Example in Drug Design

Goal: predict the bio-reactivity of molecules in order to decrease drug development time.
Target: predict the logarithm of the inhibition concentration for site "A" on the Cholecystokinin (CCK) molecule.
Constructs a quantitative structure-activity relationship (QSAR) model.
SVM Regression: ε-insensitive loss function

[Figure: the regression function $(x \cdot w + b)$ with an ε-tube around it. Points with $|(x \cdot w + b) - y| \le \varepsilon$ incur no loss; the bounding lines are $(x \cdot w + b) - y = +\varepsilon$ and $(x \cdot w + b) - y = -\varepsilon$.]
SVM Minimizes Underestimate + Overestimate

$$
\begin{aligned}
\min_{w,b,z,z^*}\quad & C \sum_{i=1}^{\ell} \bigl( z_i + z_i^* \bigr) + \|w\|_1 \\
\text{s.t.}\quad & \Bigl( \sum_{j=1}^{\ell} \alpha_j K(x_i, x_j) + b \Bigr) - y_i \le z_i + \varepsilon \\
& y_i - \Bigl( \sum_{j=1}^{\ell} \alpha_j K(x_i, x_j) + b \Bigr) \le z_i^* + \varepsilon \\
& z_i,\, z_i^* \ge 0, \quad i = 1,\ldots,\ell
\end{aligned}
$$
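The ε-insensitive loss behind these two regression slides can be written in a couple of lines; the function name and the value of ε below are illustrative assumptions.

```python
import numpy as np

def eps_insensitive_loss(pred, y, eps=0.1):
    """Zero inside the eps-tube, linear outside: max(|pred - y| - eps, 0)."""
    return np.maximum(np.abs(pred - y) - eps, 0.0)

print(eps_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                           np.array([1.05, 2.5, 2.0]), eps=0.1))
```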
LCCKA Problem

Training data: 66 molecules.
The 323 original attributes are wavelet coefficients of TAE Descriptors.
A subset of 39 attributes was selected by a linear 1-norm SVM (with no kernels).
For details, see the DDASSL project link off of http://www.rpi.edu/~bennek.
Testing set results reported.
LCCK Prediction

[Scatter plot: LCCKA test set estimates, predicted value vs. true value on a 0 to 1 scale; Q² = 0.25.]
Many Other Applications
Speech Recognition
Database Marketing
Quark Flavors in High Energy Physics
Dynamic Object Recognition
Knock Detection in Engines
Protein Sequence Problem
Text Categorization
Breast Cancer Diagnosis
See: http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Hallelujah!
Generalization theory and practice meet
General methodology for many types of problems
Same Program + New Kernel = New method
No problems with local minima
Few model parameters. Selects capacity.
Robust optimization methods.
Successful Applications
BUT…
HYPE?
Will SVMs beat my best hand-tuned method Z for X?
Do SVMs scale to massive datasets?
How to choose C and the kernel?
What is the effect of attribute scaling?
How to handle categorical variables?
How to incorporate domain knowledge?
How to interpret results?
Support Vector Machine Resources
http://www.support-vector.net/
http://www.kernel-machines.org/
Links off my web page:
http://www.rpi.edu/~bennek