
Nycomed Chair for
Bioinformatics and Information Mining
Kernel Methods for Classification
From Theory to Practice
14. Sept 2009
Iris Adä, Michael Berthold, Ulrik Brandes, Martin Mader, Uwe Nagel
Goals of the Tutorial
At lunch time on Tuesday, you will
• Have learned about Linear Classifiers and SVMs
• Have improved a kernel-based classifier
• Know what Finnish looks like
• Have a hunch what a kernel is
• Have had a chance at winning a trophy.
Outline – Monday (13:15 – 23:30)
• The Theory:
– Motivation: Learning Classifiers from Data
– Linear Classifiers
• Delta Learning Rule
– Kernel Methods & Support Vector Machines
• Dual Representation
• Maximal Margin
• Kernels
• The Environment:
– KNIME: A short Intro
• Practical Stuff:
– How to develop nodes in KNIME
– Install on your laptop(s)
• You work, we rest:
– Invent a new (and better) Kernel
• Dinner
• (Invent an even better Kernel…)
Outline – Tuesday (9:00 – 12:00)
• ~9:00 – 11:00: refine your kernel
• 11:00: score test data set
• 11:13: winning kernel(s) presented
• 12:00: Lunch and Award “Ceremony”
Learning Models
Assumptions:
• no major influence of non-observed inputs
[Diagram: a System receives observed inputs as well as other, non-observed inputs and produces observed outputs; the observed inputs and outputs form the Data from which the Model is learned.]
Predicting Outcomes
Assumptions:
• static system
[Diagram: new inputs are fed into the learned Model, which produces predicted outputs.]
Learning Classifiers from Data
• Training data consists of input with labels, e.g.
– Credit card transactions (fraud: yes/no)
– Handwritten letters (“A”, …, “Z”)
– Drug candidate classification (toxic, non-toxic)
– …
• Multi-class classification problems can be reduced to binary
yes/no classifications
• Many, many algorithms around.
Why?
– Choice of algorithm influences generalization capability
– There is no best algorithm for all classification problems
Linear Discriminant
• Simple linear, binary classifier:
$$f(\vec{x}) = \langle \vec{w}, \vec{x}\rangle + b = \sum_{i=1}^{n} x_i w_i + b$$
– Class A if f(x) positive
– Class B if f(x) negative
e.g. $h(\vec{x}) = \operatorname{sgn}(f(\vec{x}))$ is the decision function.
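To make this concrete, here is a minimal Python sketch of such a linear discriminant (my own illustration, not code from the slides); the weight vector `w` and bias `b` are hypothetical placeholder values.

```python
import numpy as np

def f(x, w, b):
    """Linear discriminant score: <w, x> + b."""
    return np.dot(w, x) + b

def h(x, w, b):
    """Decision function: +1 for class A, -1 for class B."""
    return np.sign(f(x, w, b))

# Hypothetical weight vector and bias:
w = np.array([1.0, -2.0])
b = 0.5
print(h(np.array([3.0, 1.0]), w, b))   # 1.0  -> class A
print(h(np.array([0.0, 1.0]), w, b))   # -1.0 -> class B
```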
Linear Discriminant
$$f(\vec{x}) = \langle \vec{w}, \vec{x}\rangle + b = \sum_{i=1}^{n} x_i w_i + b = b + \|\vec{x}\|\,\|\vec{w}\|\cos\angle(\vec{w}, \vec{x})$$
[Figure: the weight vector $\vec{w}$ stands orthogonal to the separating hyperplane; points with $\cos\angle(\vec{w}, \vec{x}) > 0$ lie on one side, points with $\cos\angle(\vec{w}, \vec{x}) < 0$ on the other.]
• Linear discriminants represent hyperplanes in feature space
Primal Perceptron
• Rosenblatt (1959) introduced a simple learning algorithm for
linear discriminants (the “perceptron”):
Rosenblatt Algorithm
• Algorithm is
– On-line (pattern by pattern approach)
– Mistake driven (updates only in case of wrong classification)
• Convergence is guaranteed if a hyperplane exists which
classifies all training data correctly (i.e., the data is linearly separable)
• Learning rule (with learning rate $\eta$ and $R = \max_i \|\vec{x}_i\|$):
$$\text{IF } y_i\left(\langle \vec{x}_i, \vec{w}\rangle + b\right) \le 0 \text{ THEN } \begin{cases} \vec{w}(t+1) = \vec{w}(t) + \eta\, y_i\, \vec{x}_i \\ b(t+1) = b(t) + \eta\, y_i\, R^2 \end{cases}$$
• One observation:
– The weight vector (if initialized to zero) is simply a weighted sum of the input vectors
(b is even more trivial)
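A compact Python sketch of this primal update loop (my own rendering of the rule above, assuming labels in {-1, +1} and a learning rate `eta`):

```python
import numpy as np

def train_primal_perceptron(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron sketch: X is an (m, n) array of patterns,
    y an (m,) array of labels in {-1, +1}."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    R2 = np.max(np.sum(X**2, axis=1))              # R^2 = max_i ||x_i||^2
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:  # mistake driven: update only if wrong
                w += eta * y[i] * X[i]             # w(t+1) = w(t) + eta * y_i * x_i
                b += eta * y[i] * R2               # b(t+1) = b(t) + eta * y_i * R^2
                mistakes += 1
        if mistakes == 0:                          # converged on linearly separable data
            break
    return w, b
```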
Dual Representation
• Weight vector is a weighted sum of input vectors:
$$\vec{w} = \sum_{j=1}^{m} \alpha_j\, y_j\, \vec{x}_j$$
• “difficult” training patterns have larger alpha
• “easier” ones have smaller or zero alpha
Dual Representation
Dual Representation of the linear discriminant function:
$$f(\vec{x}) = \langle \vec{w}, \vec{x}\rangle + b = \sum_{j=1}^{m} \alpha_j\, y_j\, \langle \vec{x}_j, \vec{x}\rangle + b$$
Dual Representation
• Dual Representation of Learning Algorithm
Dual Representation
• Learning Rule
$$\text{IF } y_i\left(\sum_{j=1}^{n} \alpha_j\, y_j\, \langle \vec{x}_i, \vec{x}_j\rangle + b\right) \le 0 \text{ THEN } \begin{cases} \alpha_i(t+1) = \alpha_i(t) + 1 \\ b(t+1) = b(t) + y_i\, R^2 \end{cases}$$
• Harder-to-learn examples have larger alpha (higher information
content)
• The information about the training examples enters the algorithm only
through the inner products (which we could pre-compute!)
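As an illustration (a sketch under the same assumptions as above, not the tutorial's reference code), the dual update loop with all inner products pre-computed once:

```python
import numpy as np

def train_dual_perceptron(X, y, max_epochs=100):
    """Dual perceptron sketch: learns one alpha per training pattern."""
    m = X.shape[0]
    alpha, b = np.zeros(m), 0.0
    G = X @ X.T                                  # pre-computed inner products <x_i, x_j>
    R2 = np.max(np.diag(G))                      # R^2 = max_i <x_i, x_i>
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (np.sum(alpha * y * G[i]) + b) <= 0:
                alpha[i] += 1.0                  # harder patterns accumulate larger alpha
                b += y[i] * R2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```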
Dual Representation in other spaces
• All we need for training:
– Computation of inner products of all training examples
• If we train in a different space:
– Computation of inner products in the projected space
Kernel Functions
• A kernel function K allows us to compute the inner product of two
points $\vec{x}$ and $\vec{y}$ in the projected space without ever entering that
space: $K(\vec{x}, \vec{y}) = \langle \Phi(\vec{x}), \Phi(\vec{y})\rangle$
…in Kernel Land…
• The discriminant function in our projected space:
$$f(\vec{x}) = \sum_{j=1}^{n} \alpha_j\, y_j\, \langle \Phi(\vec{x}), \Phi(\vec{x}_j)\rangle + b$$
• And, using a kernel:
$$f(\vec{x}) = \sum_{j=1}^{n} \alpha_j\, y_j\, K(\vec{x}, \vec{x}_j) + b$$
The Gram Matrix
• All data necessary for
– The decision function
– The training of the coefficients
can be pre-computed and stored in a Kernel or Gram Matrix: $G_{ij} = K(\vec{x}_i, \vec{x}_j)$
(If $G$ is symmetric and positive semi-definite for every choice of input points, then $K(\cdot,\cdot)$ is a Kernel.)
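A short sketch (my own) of pre-computing such a Gram Matrix for an arbitrary kernel function:

```python
import numpy as np

def gram_matrix(X, kernel):
    """G[i, j] = K(x_i, x_j) for all pairs of training patterns."""
    m = X.shape[0]
    G = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            G[i, j] = G[j, i] = kernel(X[i], X[j])   # exploit symmetry
    return G

# Sanity check of positive semi-definiteness (up to numerical noise):
# np.all(np.linalg.eigvalsh(gram_matrix(X, kernel)) >= -1e-9)
```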
Kernels
• A simple kernel is
$$K(\vec{x}, \vec{y}) = \langle \vec{x}, \vec{y}\rangle^2$$
• And the corresponding projected space:
$$\Phi(x_1, x_2) = \Phi(\vec{x}) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2\right)$$
• Why?
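A quick numerical check (my own sketch) answers the “why”: the inner product of the projected points equals the squared inner product of the original points, since $\langle \Phi(\vec{x}), \Phi(\vec{y})\rangle = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = \langle \vec{x}, \vec{y}\rangle^2$.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = <x, y>^2 in two dimensions."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(x, y) ** 2)        # kernel evaluated in the original space: 1.0
print(np.dot(phi(x), phi(y)))   # inner product in the projected space: also 1.0
```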
Kernels
• A few (slightly less) simple kernels are
$$K(\vec{x}, \vec{y}) = \langle \vec{x}, \vec{y}\rangle^d$$
• And the corresponding projected spaces are of dimension
$$\binom{n + d - 1}{d}$$
…computing the inner products in the projected space becomes
pretty expensive rather quickly…
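To see how quickly, a small sketch evaluating the binomial coefficient above for a few (hypothetical) input dimensions n and degrees d:

```python
from math import comb

# Number of degree-d monomials in n variables: C(n + d - 1, d)
for n, d in [(2, 2), (10, 3), (100, 5)]:
    print(n, d, comb(n + d - 1, d))
# 2 2 3            (the three features of the previous slide's example)
# 10 3 220
# 100 5 91962520   (already ~9.2e7 explicit features)
```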
Kernels
• Gaussian Kernel:
$$K(\vec{x}, \vec{y}) = \exp\left(-\frac{\|\vec{x} - \vec{y}\|^2}{2\sigma^2}\right)$$
• Polynomial Kernel of degree d:
$$K(\vec{x}, \vec{y}) = \left(\langle \vec{x}, \vec{y}\rangle + 1\right)^d$$
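Both kernels are straightforward to write down directly; a minimal Python sketch (parameter defaults are illustrative, not prescribed by the slides):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (<x, y> + 1)^d"""
    return (np.dot(x, y) + 1.0) ** d
```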
Why?
• Great:
we can also apply Rosenblatt’s algorithm to other spaces
implicitly.
• So what?
Transformations…
[Figure: data shown in the original coordinates $x_1$, $x_2$ and in the transformed coordinates $\Phi_1(\vec{x})$, $\Phi_2(\vec{x})$.]
Polynomial Kernel
[Figure: data mapped by the Polynomial Kernel's feature transformation, plotted against $\Phi_1(\vec{x})$ and $\Phi_2(\vec{x})$.]
Gauss Kernel
Kernels
• Note that we do not need to know the projection Φ,
it is sufficient to prove that K(.) is a Kernel.
• A few notes:
– Kernels are modular and closed under composition: we can build new Kernels from
existing ones (see the sketch after this list)
– Kernels can be defined over non-numerical objects
• text: e.g. string matching kernels
• images, trees, graphs, …
• Note also: a good Kernel is crucial
– If the Gram Matrix is (nearly) diagonal, classification is easy but useless: every example is similar only to itself
– No Free Kernel: with too many irrelevant attributes, the Gram Matrix again tends towards a diagonal.
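As referenced in the list above, a brief sketch of kernel composition (relying on the standard closure properties: sums, products, and positive scalings of kernels are again kernels):

```python
def kernel_sum(k1, k2):
    """The sum of two kernels is again a kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)

def kernel_product(k1, k2):
    """The pointwise product of two kernels is again a kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

def kernel_scale(k, c):
    """Scaling a kernel by a constant c > 0 preserves the kernel property."""
    return lambda x, y: c * k(x, y)

# e.g. combine the Gaussian and polynomial kernels sketched earlier:
# k_new = kernel_sum(gaussian_kernel, lambda x, y: polynomial_kernel(x, y, d=3))
```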
Finding Linear Discriminants
• Finding the hyperplane (in any space) still leaves lots of room
for variations – does it?
• We can define “margins” of individual training examples:
$$\gamma_i = y_i\left(\langle \vec{w}, \vec{x}_i\rangle + b\right)$$
(appropriately normalized, this is a “geometrical” margin)
• The margin of a hyperplane (with respect to a training set) is the smallest of these:
$$\gamma = \min_{i=1,\dots,n} \gamma_i$$
• And the maximum margin hyperplane is the one that maximizes this margin over all
possible hyperplanes.
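A small sketch (mine, not from the slides) of both quantities: the per-example functional margins and the normalized, geometrical margin of a given hyperplane:

```python
import numpy as np

def functional_margins(X, y, w, b):
    """gamma_i = y_i * (<w, x_i> + b) for every training example."""
    return y * (X @ w + b)

def geometric_margin(X, y, w, b):
    """Margin of the hyperplane w.r.t. the training set, normalized by ||w||."""
    return np.min(functional_margins(X, y, w, b)) / np.linalg.norm(w)
```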
(maximum) Margin of a Hyperplane
Support Vector Machines
• Dual Representation
– Classifier as a weighted sum over inner products of the training patterns
(or only the support vectors) and the new pattern.
– Training works analogously.
• Kernel-induced feature space
– Transformation into a higher-dimensional space
(where we will hopefully be able to find a linear separating hyperplane).
– Representation of the solution through a few support vectors (α > 0).
• Maximum Margin Classifier
– Reduction of Capacity (Bias) via maximization of margin
(and not via reduction of degrees of freedom).
– Efficient parameter estimation: see IDA book.
Soft and Hard Margin Classifiers
• What can we do if no linear separating hyperplane exists?
• Instead of insisting on a hard margin, allow minor violations (see the formulation sketched below)
– Introduce (positive) slack variables: patterns with non-zero slack are allowed to lie
inside the margin
– Misclassifications are allowed when the slack exceeds 1, i.e. when the functional margin becomes negative.
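For reference, the usual textbook form of this soft-margin idea (the slides do not spell it out; C is a trade-off constant between margin width and total slack):

$$\min_{\vec{w},\, b,\, \vec{\xi}} \;\; \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{m} \xi_i
\qquad \text{s.t.} \quad y_i\left(\langle \vec{w}, \vec{x}_i\rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$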
Kernel Methods: Summary
• Main idea of Kernel Methods:
– Embed data into a suitable vector space
– Find linear classifier (or other linear pattern of interest) in the new space
• Needed: Mapping (implicit or explicit)
$$\Phi : \vec{x} \in X \mapsto \Phi(\vec{x}) \in F$$
• Key Assumptions
– Information about relative position is often all that is needed by learning
methods
– The inner products between points in the projected space can be
computed in the original space using special functions (kernels).
Support Vector Machines
• Powerful classifier
• Computation of the optimal classifier is possible
• Choice of kernel is critical
KNIME
• Coffee Break.
• And then:
– KNIME, the Konstanz Information Miner
– SVMs (and other classifiers) in KNIME