Introduction to Kernels
(chapters 1, 2, 3, 4)
Max Welling
October 1 2004
Introduction
• What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics
Answer: automatic detection of non-coincidental structure in data.
• Desiderata:
- Robust algorithms: insensitive to outliers and wrong model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: scale to large datasets.
Let’s Learn Something
What is the common characteristic (structure) among the following
statistical methods?
1. Principal Components Analysis
2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
Answer:
We consider linear combinations of the input vector:
$$f(x) = w^T x$$
Linear algorithms are very well understood and enjoy strong guarantees
(convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?
Feature Spaces
$$\Phi: x \to \Phi(x), \qquad \mathbb{R}^d \to F \quad \text{(non-linear mapping to } F\text{)}$$
F can be:
1. a high-dimensional space
2. an infinite-dimensional countable space ($\ell_2$)
3. a function space (Hilbert space)
Example:
$$\Phi(x, y) = (x^2, y^2, \sqrt{2}\, x y)$$
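A quick numerical check of this example (a minimal NumPy sketch; the helper name `phi` and the test points are my own choices): the feature-space inner product of the mapped points equals the squared input-space inner product, which is exactly the kernel this map induces.

```python
import numpy as np

# Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), as in the example above
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(a), phi(b)))      # inner product after the non-linear map
print(np.dot(a, b) ** 2)           # <a, b>^2 computed directly in input space
```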
Ridge Regression (duality)
Problem:
$$\min_w \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2$$
(targets $y_i$, inputs $x_i$, regularization $\lambda \|w\|^2$)
Solution:
$$w = (X^T X + \lambda I_d)^{-1} X^T y \qquad (d \times d \text{ inverse})$$
$$\;\; = X^T (X X^T + \lambda I_\ell)^{-1} y = X^T (G + \lambda I_\ell)^{-1} y \qquad (\ell \times \ell \text{ inverse})$$
with Gram matrix $G_{ij} = \langle x_i, x_j \rangle$.
Dual representation:
$$w = \sum_{i=1}^{\ell} \alpha_i x_i \qquad \text{(a linear combination of the data)}$$
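A minimal NumPy sketch verifying that the primal and dual solutions above coincide (the data sizes and the value of $\lambda$ are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d, lam = 50, 3, 0.1
X = rng.normal(size=(ell, d))
y = rng.normal(size=ell)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (X^T X + lam I_d)^{-1} X^T y
G = X @ X.T                                                       # Gram matrix G_ij = <x_i, x_j>
alpha = np.linalg.solve(G + lam * np.eye(ell), y)                 # dual coefficients
w_dual = X.T @ alpha                                              # w = sum_i alpha_i x_i
print(np.allclose(w_primal, w_dual))                              # True
```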
Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.
Kernel Trick:
Replace the inputs by features, $\Phi: x \to \Phi(x)$, so that
$$G_{ij} = \langle x_i, x_j \rangle \;\to\; G_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = K(x_i, x_j) \qquad \text{(the kernel)}$$
If we use algorithms that only depend on the Gram matrix, G,
then we never have to know (compute) the actual features $\Phi(x)$.
This is the crucial point of kernel methods.
Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel-algorithm.
some kernels:
$$k(x, y) = e^{-\|x - y\|^2 / c}$$
$$k(x, y) = \langle x, y \rangle^d$$
$$k(x, y) = \tanh(\langle x, y \rangle)$$
$$k(x, y) = \frac{1}{\|x - y\|^2 + c^2}$$
some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
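To illustrate the modularity idea, here is a small NumPy sketch (the helper names `rbf_kernel`, `poly_kernel`, `kernel_ridge`, the kernel parameters, and the data are my own choices): two of the kernels above are plugged, unchanged, into the same dual ridge-regression routine.

```python
import numpy as np

def rbf_kernel(X, Y, c=1.0):
    # k(x, y) = exp(-||x - y||^2 / c)
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / c)

def poly_kernel(X, Y, d=3):
    # k(x, y) = <x, y>^d
    return (X @ Y.T) ** d

def kernel_ridge(K, y, lam=1e-2):
    # dual ridge regression: alpha = (K + lam I)^{-1} y
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

for kernel in (rbf_kernel, poly_kernel):            # swap kernels freely: modularity
    K = kernel(X, X)
    alpha = kernel_ridge(K, y)
    fit = K @ alpha                                  # f(x_j) = sum_i alpha_i k(x_i, x_j)
    print(kernel.__name__, "training MSE:", np.mean((fit - y) ** 2))
```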
What is a proper kernel?
Definition: A finitely positive semi-definite function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is a symmetric function of its arguments for which the matrices formed
by restriction to any finite subset of points are positive semi-definite:
$$\alpha^T K \alpha \ge 0 \quad \forall \alpha.$$
Theorem: A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be written
as $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi(x)$ is a feature map
$x \to \Phi(x) \in F$, iff $k(x, y)$ satisfies the semi-definiteness property.
Relevance: We can now check whether k(x, y) is a proper kernel using
only properties of k(x, y) itself, i.e. without the need to know the feature map!
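As an illustration of this check (a sketch using an arbitrarily chosen RBF kernel, random finite subsets, and a small numerical tolerance), we can verify $\alpha^T K \alpha \ge 0$ by confirming that the eigenvalues of the kernel matrix are non-negative:

```python
import numpy as np

def rbf(X, c=1.0):
    # k(x, y) = exp(-||x - y||^2 / c), evaluated on all pairs of rows of X
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-d2 / c)

rng = np.random.default_rng(0)
for trial in range(5):
    n = int(rng.integers(2, 50))                   # a random finite subset of points
    X = rng.normal(size=(n, 3))
    min_eig = np.linalg.eigvalsh(rbf(X)).min()
    assert min_eig > -1e-8, "kernel matrix is not positive semi-definite"
print("all sampled kernel matrices were positive semi-definite")
```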
Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very
special feature map (note that more than one feature map may give rise to the same kernel):
$$\Phi: x \to \Phi(x) = k(x, \cdot)$$
i.e. we map to a function space.
Definition of the function space:
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot), \qquad \text{any } m, \{x_i\}$$
$$\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j) \qquad \text{for } g(\cdot) = \sum_{j=1}^{m'} \beta_j k(x_j, \cdot)$$
$$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \ge 0 \qquad \text{(finitely positive semi-definite)}$$
Reproducing property:
$$\langle f, \Phi(x) \rangle = \langle f, k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i \langle k(x_i, \cdot), k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x)$$
In particular, $\langle \Phi(x), \Phi(y) \rangle = k(x, y)$.
Mercer’s Theorem
Theorem: If X is compact and k(x, y) is a symmetric continuous function such that
$$T_k f := \int k(\cdot, x) f(x) \, dx$$
is a positive semi-definite operator ($T_k \ge 0$), i.e.
$$\int\!\!\int k(x, y) f(x) f(y) \, dx \, dy \ge 0 \quad \forall f \in L_2(X),$$
then there exists an orthonormal basis of eigen-functions $\phi_i$ with eigenvalues $\lambda_i \ge 0$ such that
$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i \, \phi_i(x) \, \phi_i(y).$$
Hence: k(x,y) is a proper kernel.
Note: Here we construct feature vectors in $L_2$, whereas the RKHS
construction was in a function space.
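A discrete analogue of Mercer's theorem, as a hedged sketch (the grid, kernel width, and sizes are arbitrary demo choices): on a finite grid the Gram matrix of a continuous positive semi-definite kernel has non-negative eigenvalues, and the matrix is recovered from its eigen-decomposition.

```python
import numpy as np

x = np.linspace(-3, 3, 200)[:, None]              # a finite grid standing in for compact X
K = np.exp(-(x - x.T) ** 2 / 1.0)                 # RBF kernel k(x, y) = exp(-(x - y)^2 / c), c = 1

lam, V = np.linalg.eigh(K)                        # eigenvalues / "eigen-functions" on the grid
print("smallest eigenvalue:", lam.min())          # non-negative up to round-off
K_rec = (V * lam) @ V.T                           # K = sum_i lam_i v_i v_i^T
print("max reconstruction error:", np.abs(K - K_rec).max())
```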
Learning Kernels
• All information is tunneled through the Gram-matrix information
bottleneck.
• The real art is to pick an appropriate kernel.
e.g. take the RBF kernel: $k(x, y) = e^{-\|x - y\|^2 / c}$
if c is very small: G ≈ I (all data look dissimilar): over-fitting
if c is very large: G ≈ 1 (the all-ones matrix; all data look very similar): under-fitting
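A small numerical illustration of these two limits (the data and the values of c are arbitrary demo choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T   # pairwise ||x - y||^2

print(np.round(np.exp(-d2 / 1e-3), 2))   # very small c: G is (numerically) the identity
print(np.round(np.exp(-d2 / 1e3), 2))    # very large c: G is close to the all-ones matrix
```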
We need to learn the kernel. Here are some ways to combine
kernels to improve them:
$$k(x, y) = \alpha\, k_1(x, y) + \beta\, k_2(x, y), \qquad \alpha, \beta \ge 0 \quad \text{(the kernels form a cone)}$$
$$k(x, y) = k_1(x, y)\, k_2(x, y)$$
$$k(x, y) = k_1(\Phi(x), \Phi(y))$$
(and, more generally, any polynomial of kernels with positive coefficients is again a kernel)
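A numerical spot-check of these closure rules, assuming the reconstruction above (not a proof; the kernels, weights, and data are arbitrary demo choices): the conic combination and the elementwise product of two kernel matrices should again be positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T

K1 = np.exp(-d2 / 2.0)                       # RBF kernel
K2 = (X @ X.T) ** 3                          # polynomial kernel <x, y>^3
for K in (0.7 * K1 + 0.3 * K2, K1 * K2):     # conic combination and elementwise product
    print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```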
Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance:
cross-validation, Bayesian methods, generalization bounds, ...
Call $\hat{E}_S[f(x)] = 0$ a pattern in a sample S.
Is this pattern also likely to be present in new data: $E_P[f(x)] = 0$?
We can use concentration inequalities (McDiarmid's theorem)
to prove that:
Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from P and define
the sample mean of f(x) as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$. Then it follows that
$$P\left( \|\bar{f} - E_P[f]\| \le \frac{R}{\sqrt{\ell}} \left( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \right) \right) \ge 1 - \delta, \qquad R = \sup_x \|f(x)\|.$$
(The probability that the sample mean and the population mean differ by less than $\epsilon = \frac{R}{\sqrt{\ell}}(2 + \sqrt{2\ln(1/\delta)})$ is more than $1 - \delta$, independent of P!)
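A Monte Carlo sanity check of this bound for a simple scalar case (a sketch assuming $f(x) = x$ with $x \sim \mathrm{Uniform}[0, 1]$, so $R = 1$; the number of repetitions is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, delta, R = 100, 0.05, 1.0
bound = (R / np.sqrt(ell)) * (2.0 + np.sqrt(2.0 * np.log(1.0 / delta)))

true_mean = 0.5                                        # E_P[f] for f(x) = x, x ~ Uniform[0, 1]
deviations = np.array([
    abs(rng.uniform(0.0, 1.0, size=ell).mean() - true_mean)
    for _ in range(2000)                               # repeat the sampling experiment
])
print("bound:", round(bound, 3),
      "fraction of runs within bound:", (deviations <= bound).mean())   # should be >= 1 - delta
```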
Rademacher Complexity
Problem: we only checked the generalization performance for a
single fixed pattern f(x).
What if we want to search over a function class F?
Intuition: we need to incorporate the complexity of this function class.
Rademacher complexity captures the ability of the function class to
fit random noise ($\sigma_i = \pm 1$, uniformly distributed):
$$\hat{R}_{\ell}(F) = E_{\sigma} \left[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \;\Big|\; x_1, \ldots, x_{\ell} \right] \qquad \text{(empirical RC)}$$
$$R_{\ell}(F) = E_S E_{\sigma} \left[ \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \right]$$
[Figure: two functions $f_1$, $f_2$ fitting random noise at the sample points $x_i$.]
Generalization Bound
Theorem: Let f be a function in F which maps to [0,1] (e.g. loss functions).
Then, with probability at least $1 - \delta$ over random draws of size $\ell$,
every f satisfies:
$$E_P[f(x)] \le \hat{E}_{data}[f(x)] + R_{\ell}(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}} \le \hat{E}_{data}[f(x)] + \hat{R}_{\ell}(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$$
Relevance: The expected pattern E[f] = 0 will also be present in a new
data set if the last two terms are small:
- the complexity of the function class F is small
- the number $\ell$ of training data is large
Linear Functions (in feature space)
Consider the function class
$$F_B = \{ f: x \to \langle w, \Phi(x) \rangle \mid \|w\| \le B \}$$
with $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, and a sample $S = \{x_1, \ldots, x_{\ell}\}$.
Then the empirical RC of $F_B$ is bounded by:
$$\hat{R}_{\ell}(F_B) \le \frac{2B}{\ell} \sqrt{\mathrm{tr}(K)}$$
Relevance: Since $\{ x \to \sum_{i=1}^{\ell} \alpha_i k(x_i, x) \mid \alpha^T K \alpha \le B^2 \} \subseteq F_B$, it follows that
if we control the norm $\alpha^T K \alpha = \|w\|^2$ in kernel algorithms, we control
the complexity of the function class (regularization).
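A Monte Carlo sketch of this bound for the linear kernel (B, the data, and the number of sign draws are arbitrary choices; for the linear class the supremum has the closed form $B \,\|\sum_i \sigma_i x_i\|$):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, B = 100, 2.0
X = rng.normal(size=(ell, 3))                       # linear kernel: Phi(x) = x
K = X @ X.T

# sup over ||w|| <= B of |<w, sum_i sigma_i x_i>| equals B * ||sum_i sigma_i x_i||
sigma = rng.choice([-1.0, 1.0], size=(5000, ell))   # Rademacher signs
rc_estimate = np.mean(2 * B / ell * np.linalg.norm(sigma @ X, axis=1))
rc_bound = 2 * B / ell * np.sqrt(np.trace(K))
print("MC estimate:", rc_estimate, "<= bound:", rc_bound)
```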
Margin Bound (classification)
Theorem: Choose c > 0 (the margin).
F: $f(x, y) = -y\, g(x)$, with $y = \pm 1$
$S = \{(x_1, y_1), \ldots, (x_{\ell}, y_{\ell})\}$: an IID sample
$\delta \in (0, 1)$: the probability of violating the bound.
$$P_P[y \ne \mathrm{sign}(g(x))] \le \frac{1}{c\,\ell} \sum_{i=1}^{\ell} \xi_i + \frac{4}{c\,\ell} \sqrt{\mathrm{tr}(K)} + 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}} \qquad \text{(prob. of misclassification)}$$
$$\xi_i = (c - y_i g(x_i))_+ \quad \text{(slack variables)}, \qquad (f)_+ = f \text{ if } f \ge 0 \text{ and } 0 \text{ otherwise}$$
Relevance: We bound our classification error on new samples. Moreover, we have a
strategy to improve generalization: choose the margin c as large as possible such
that all samples are correctly classified: $\xi_i = 0$ (e.g. support vector machines).
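To make the bound concrete, here is a toy evaluation of its three terms for a fixed linear classifier (the classifier w, the data, c, and δ are all arbitrary demo choices, and the formula used is the reconstruction above; for small samples the bound is loose and can exceed 1):

```python
import numpy as np

rng = np.random.default_rng(1)
ell, c, delta = 200, 0.5, 0.05
X = rng.normal(size=(ell, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=ell))       # labels roughly follow the first coordinate
g = X @ np.array([1.0, 0.0])                            # a fixed linear classifier g(x) = <w, x>

xi = np.maximum(0.0, c - y * g)                         # slack variables (c - y_i g(x_i))_+
K = X @ X.T                                             # linear-kernel Gram matrix
bound = (xi.sum() / (c * ell)
         + 4.0 / (c * ell) * np.sqrt(np.trace(K))
         + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * ell)))
print("training error:", np.mean(y != np.sign(g)), " margin bound:", round(bound, 3))
```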