Transcript KernelIntro
Welcome to the Kernel-Class
• My name: Max (Welling)
• Book: there will be class-notes/slides.
• Homework: reading material, some exercises, some MATLAB implementations.
• I like an active attitude in class: ask questions! Respond to my questions.
• Enjoy.
Primary Goal
• What is the primary goal of:
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics
Automatic detection of non-coincidental structure in data.
Desiderata
• Robust algorithms are insensitive to outliers and wrong model assumptions.
• Stable algorithms generalize well to unseen data.
• Computationally efficient algorithms are necessary to handle
large datasets.
Supervised & Unsupervised Learning
• supervised: classification, regression
• unsupervised: clustering, dimensionality reduction, ranking, outlier detection
• primal vs. dual problems: generalized linear models, $f(x) = w^T \phi(x) + b$, vs. kernel classification, $f(x) = \sum_i \alpha_i K(x, x_i)$ (this is like nearest-neighbor classification; a sketch follows below)
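As a toy illustration (my own sketch, not from the slides): with the ad-hoc choice $\alpha_i = y_i$ and an RBF kernel, the dual form reduces to a kernel-weighted vote over training points, which is exactly the nearest-neighbor flavor mentioned above. An SVM would instead learn the $\alpha_i$.

```python
import numpy as np

def rbf(x, y, c=1.0):
    # RBF kernel: similarity decays with squared distance
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

def dual_classify(x, X_train, alpha, kernel=rbf):
    # dual form f(x) = sum_i alpha_i K(x, x_i): a kernel-weighted
    # vote over the training points, like nearest-neighbor voting
    f = sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))
    return np.sign(f)

# toy data: two classes in 2-D; alpha_i = y_i is an ad-hoc choice
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(dual_classify(np.array([1.8, 2.2]), X, alpha=y))  # -> 1.0
```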
Feature Spaces
$\phi: x \to \phi(x)$, $\mathbb{R}^d \to F$: a non-linear mapping to the feature space $F$, which can be
1. a high-dimensional space
2. an infinite-dimensional countable space: $\ell_2$
3. a function space (Hilbert space)
example: $\phi(x, y) = (x^2, y^2, \sqrt{2}\, x y)$
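A quick numerical check (my own sketch) that this quadratic map corresponds to the degree-2 polynomial kernel, i.e. $\langle \phi(a), \phi(b) \rangle = \langle a, b \rangle^2$:

```python
import numpy as np

def phi(v):
    # the explicit feature map from this slide: (x, y) -> (x^2, y^2, sqrt(2)*x*y)
    x, y = v
    return np.array([x ** 2, y ** 2, np.sqrt(2) * x * y])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(a) @ phi(b))   # inner product in feature space F: 1.0
print((a @ b) ** 2)      # squared inner product in input space: 1.0
```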
Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.
Kernel Trick:
Replace $\phi: x \to \phi(x)$, so the Gram matrix becomes a kernel evaluation:
$$G_{ij} = \langle x_i, x_j \rangle \;\longrightarrow\; G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$$
If we use algorithms that only depend on the Gram-matrix, G, then we never have to know (compute) the actual features. This is the crucial point of kernel methods.
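A minimal NumPy sketch of the trick, reusing the quadratic feature map from the previous slide: the Gram matrix computed from explicit features equals the one computed directly from the kernel $K(x_i, x_j) = \langle x_i, x_j \rangle^2$.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(5, 2))  # five 2-D points

# route 1: compute phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2) explicitly
Phi = np.stack([np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2]) for x1, x2 in X])
G_explicit = Phi @ Phi.T

# route 2: kernel trick -- evaluate K(x_i, x_j) = <x_i, x_j>^2 directly,
# never constructing the features
G_kernel = (X @ X.T) ** 2

print(np.allclose(G_explicit, G_kernel))  # True: identical Gram matrices
```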
Properties of a Kernel
Definition: A finitely positive semi-definite function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function of its arguments for which every matrix formed by restriction to a finite subset of points is positive semi-definite:
$$\alpha^T K \alpha \ge 0 \quad \text{for all } \alpha.$$
Theorem: A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be written as $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a feature map $x \to \phi(x) \in F$, iff $k(x, y)$ satisfies the positive semi-definiteness property.
Relevance: we can now check whether $k(x, y)$ is a proper kernel using only properties of $k(x, y)$ itself, i.e. without the need to know the feature map!
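To illustrate the finite-subset test numerically (a sketch with an arbitrary point set and the RBF kernel): the restricted Gram matrix should have no negative eigenvalues, up to floating-point tolerance.

```python
import numpy as np

def rbf_gram(X, c=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / c) on a finite set of points
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

X = np.random.default_rng(1).normal(size=(20, 3))  # arbitrary finite subset
eigvals = np.linalg.eigvalsh(rbf_gram(X))          # symmetric -> eigvalsh
print(eigvals.min() >= -1e-10)  # True: alpha^T K alpha >= 0 for all alpha
```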
Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel-algorithm.
some kernels:
- $k(x, y) = e^{-\|x - y\|^2 / c}$ (RBF)
- $k(x, y) = \langle x, y \rangle^d$ (polynomial)
- $k(x, y) = \tanh(\langle x, y \rangle)$
- $k(x, y) = \dfrac{1}{\|x - y\|^2 + c^2}$
some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
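A sketch of this modularity (kernel ridge regression stands in for the "kernel regression" entry above; the data and parameters are made up): the same Gram-matrix-based algorithm runs unchanged with either kernel.

```python
import numpy as np

# two kernels from the list above, written as interchangeable functions
def rbf(X, Y, c=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def poly(X, Y, d=3):
    return (X @ Y.T) ** d

def kernel_ridge(kernel, X, y, X_test, lam=1e-3):
    # one Gram-matrix-based algorithm; any kernel plugs in unchanged
    alpha = np.linalg.solve(kernel(X, X) + lam * np.eye(len(X)), y)
    return kernel(X_test, X) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0])
for k in (rbf, poly):  # swap the kernel, keep the algorithm
    print(k.__name__, kernel_ridge(k, X, y, np.array([[0.5]])))
```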
Goodies and Baddies
Goodies:
• Kernel algorithms are typically constrained convex optimization problems solved with either spectral methods or convex optimization tools.
• Efficient algorithms do exist in most cases.
• The similarity to linear methods facilitates analysis. There are strong
generalization bounds on test error.
Baddies:
• You need to choose the appropriate kernel.
• Kernel learning is prone to over-fitting.
• All information must go through the kernel-bottleneck.
Regularization
• regularization is very important!
• regularization parameters are determined by out-of-sample measures (cross-validation, leave-one-out); a sketch follows below.
• Demo: Trevor Hastie.
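A sketch of picking the regularization parameter by leave-one-out, using kernel ridge regression with an RBF kernel (toy data and an illustrative grid of candidate values, all my own assumptions):

```python
import numpy as np

def rbf(X, Y, c=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def loo_error(lam, X, y):
    # leave-one-out: refit n times, each time predicting the held-out point
    n, errs = len(X), []
    for i in range(n):
        tr = np.arange(n) != i
        alpha = np.linalg.solve(rbf(X[tr], X[tr]) + lam * np.eye(n - 1), y[tr])
        pred = rbf(X[i:i + 1], X[tr]) @ alpha
        errs.append((pred[0] - y[i]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(25, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=25)
best = min([1e-4, 1e-2, 1.0], key=lambda lam: loo_error(lam, X, y))
print("chosen regularization:", best)
```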
Learning Kernels
• All information is tunneled through the Gram-matrix information
bottleneck.
• The real art is to pick an appropriate kernel.
e.g. take the RBF kernel: $k(x, y) = e^{-\|x - y\|^2 / c}$
if c is very small: G=I (all data are dissimilar): over-fitting
if c is very large: G=1 (all data are very similar): under-fitting
We need to learn the kernel. Here are some ways to combine kernels to improve them (a sketch follows below):
- conic combinations: $k(x, y) = \alpha\, k_1(x, y) + \beta\, k_2(x, y)$ with $\alpha, \beta \ge 0$ (valid kernels $k_1, k_2$ span a cone)
- products: $k(x, y) = k_1(x, y)\, k_2(x, y)$, so any polynomial of kernels with positive coefficients is again a kernel
- composition with a feature map: $k(x, y) = k_1(\phi(x), \phi(y))$
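A small numerical illustration of both points (my own sketch, with an arbitrary point set): the two extreme regimes of the RBF bandwidth $c$, and a conic combination of two RBF kernels remaining positive semi-definite.

```python
import numpy as np

def rbf_gram(X, c):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

X = np.random.default_rng(4).normal(size=(4, 2))
print(np.round(rbf_gram(X, c=1e-3), 2))  # ~ I: each point similar only to itself
print(np.round(rbf_gram(X, c=1e3), 2))   # ~ all-ones: every point looks alike

# a conic combination of two kernels is again a kernel (no negative eigenvalues)
K = 0.5 * rbf_gram(X, 1.0) + 2.0 * rbf_gram(X, 10.0)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```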