Transcript Overview

Overview of Kernel Methods (Part 2)
Steve Vincent
November 17, 2004
Overview

Kernel methods offer a modular framework.
In a first step, a dataset is processed into a kernel matrix. The data can be of various, and even heterogeneous, types.
In a second step, a variety of kernel algorithms can be used to analyze the data, using only the information contained in the kernel matrix.
What will be covered today

PCA vs. Kernel PCA
  Algorithm
  Example
  Comparison
Text-Related Kernel Methods
  Bag of Words
  Semantic Kernels
  String Kernels
  Tree Kernels
PCA algorithm
1. Subtract the mean from all the data points.
2. Compute the covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$.
3. Diagonalize S to get its eigenvalues and eigenvectors.
4. Retain the c eigenvectors corresponding to the c largest eigenvalues, such that $\sum_{j=1}^{c}\lambda_j \big/ \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured.
5. Project the data points on the eigenvectors:
   $a_n = E^T\left(x_n - \frac{1}{N}\sum_{i=1}^{N} x_i\right)$
   where E is the matrix of retained eigenvectors.
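To make these steps concrete, here is a minimal NumPy sketch of the algorithm above (this is not the MATLAB code referenced later in the talk; the function name `pca` and the `variance_to_capture` parameter are my own):

```python
import numpy as np

def pca(X, variance_to_capture=0.95):
    """Plain PCA on a d x N matrix X whose columns are the data points."""
    d, N = X.shape
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                                # 1. subtract the mean
    S = (Xc @ Xc.T) / N                          # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)         # 3. diagonalize S
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()   # 4. retain c eigenvectors capturing
    c = int(np.searchsorted(ratio, variance_to_capture)) + 1   #    the desired variance
    E = eigvecs[:, :c]
    A = E.T @ Xc                                 # 5. project onto the eigenvectors
    return A, E, mean
```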
Kernel PCA algorithm (1)
1. Given N data points in d dimensions, let X = [x1 | x2 | ... | xN], where each column represents one data point.
2. Subtract the mean from all the data points.
3. Choose an appropriate kernel k.
4. Form the N x N Gram matrix K with entries K_ij = k(x_i, x_j).
5. Form the modified Gram matrix
   $\tilde{K} = \left(I - \frac{\mathbf{1}_{N\times N}}{N}\right) K \left(I - \frac{\mathbf{1}_{N\times N}}{N}\right)^T$
   where $\mathbf{1}_{N\times N}$ is an N x N matrix with all entries equal to 1.
Kernel PCA algorithm (2)
6. Diagonalize $\tilde{K}$ to get its eigenvalues $\lambda_n$ and its eigenvectors $a_n$.
7. Normalize: $a_n \leftarrow a_n / \sqrt{\lambda_n}$.
8. Retain the c eigenvectors corresponding to the c largest eigenvalues, such that $\sum_{j=1}^{c}\lambda_j \big/ \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured.
9. Project the data points on the eigenvectors:
   $y = a^T \left(I - \frac{\mathbf{1}_{N\times N}}{N}\right) \left(\begin{bmatrix} k(x_1, x) \\ \vdots \\ k(x_N, x) \end{bmatrix} - K\,\frac{\mathbf{1}_{N}}{N}\right)$
   where $\mathbf{1}_N$ is the N-vector with all entries equal to 1.
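The following NumPy sketch walks through steps 4 to 9 for a Gaussian RBF kernel (the bandwidth parameter `sigma` and all function names are my own choices, not taken from the talk):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF Gram matrix between the columns of X (d x N) and Y (d x M)."""
    sq = (X**2).sum(0)[:, None] + (Y**2).sum(0)[None, :] - 2.0 * X.T @ Y
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    """Kernel PCA on the columns of X."""
    N = X.shape[1]
    K = rbf_kernel(X, X, sigma)                  # 4. Gram matrix
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix I - (1/N) * ones
    K_tilde = J @ K @ J                          # 5. modified Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K_tilde)   # 6. diagonalize the modified Gram matrix
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, A = eigvals[order], eigvecs[:, order]
    A = A / np.sqrt(lam)                         # 7. normalize a_n <- a_n / sqrt(lambda_n)
    Y = K_tilde @ A                              # 9. projections of the training points
    return Y, A, lam
```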
Data Mining Problem

Data Source:
  Computational Intelligence and Learning Cluster Challenge 2000,
  http://www.wi.leidenuniv.nl/~putten/library/cc2000/index.html
  Supplied by Sentient Machine Research, http://www.smr.nl
Problem Definition:
  Given data that incorporates both socio-economic and insurance policy ownership attributes, can we derive models that help determine the factors or attributes that influence or signify individuals who purchase a caravan insurance policy?
Data Selection

5,822 records for training
4,000 records for evaluation
86 attributes:
  Attributes 1 through 43: socio-demographic data derived from zip code areas
  Attributes 44 through 85: product ownership for customers
  Attribute 86: purchased caravan insurance
Data Transformation and Reduction
Principal Components Analysis (PCA) [from MATLAB]:

  # Attributes   % Variance
  25             73.29
  30             80.66
  35             86.53
  40             91.25
  45             94.94
  50             97.26
  55             98.20
  60             98.86
  65             99.35
  70             99.65
  75             99.83
  80             99.96
Relative Performance

PCA run time: 6.138
Kernel PCA run time: 5.668
Used the Radial Basis Function kernel:
  $K(u, v) = \exp\!\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$
MATLAB code for the PCA and Kernel PCA algorithms can be supplied if needed.
Modeling, Test and Evaluation
Legend: a = no, b = yes (rows are the actual class, columns the predicted class)

Manually Reduced Dataset:
  Naïve Bayes, overall 82.79% correctly classified
            a       b
  a      3155     544    14.71% false positives
  b       132      98    42.61% of class b correctly classified

PCA Reduced Dataset:
  Naïve Bayes, overall 88.45% correctly classified
            a       b
  a      3507     255     6.77% false positives
  b       207      31    13.03% of class b correctly classified

Kernel PCA Reduced Dataset:
  Naïve Bayes, overall 82.22% correctly classified
            a       b
  a      3238     541    14.3% false positives
  b       175      74    29.7% of class b correctly classified
Overall Results

KPCA and PCA had similar time performance.
KPCA gives results much closer to the manually reduced dataset.
Future Work:
  Examine other kernels
  Vary the parameters for the kernels
  Use other data mining algorithms
‘Bag of words’ kernels (1)

A document is seen as a vector d, indexed by all the elements of a (controlled) dictionary; each entry is equal to the number of occurrences of the corresponding term.
A training corpus is therefore represented by a term-document matrix, noted D = [d1 d2 ... dm-1 dm].
From this basic representation, we will apply a sequence of successive embeddings, resulting in a global (valid) kernel with all desired properties.
BOW kernels (2)

Properties:
  All order information is lost (syntactic relationships, local context, ...).
  The feature space has dimension N (the size of the dictionary).
  Similarity is basically defined by
    k(d1, d2) = d1 . d2 = d1^T d2
  or, normalized (cosine similarity),
    $\hat{k}(d_1, d_2) = \frac{k(d_1, d_2)}{\sqrt{k(d_1, d_1)\,k(d_2, d_2)}}$
  Efficiency is provided by sparsity (and a sparse dot-product algorithm): O(|d1| + |d2|).
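A small sketch of the bag-of-words kernel and its cosine-normalized version (whitespace tokenization and the function names are my own simplifications):

```python
from collections import Counter
import math

def bow_kernel(d1, d2):
    """Dot product of term-count vectors; iterating over the smaller
    document's dictionary gives the sparse O(|d1| + |d2|) behaviour."""
    c1, c2 = Counter(d1.lower().split()), Counter(d2.lower().split())
    if len(c1) > len(c2):
        c1, c2 = c2, c1
    return sum(n * c2[t] for t, n in c1.items())

def normalized_bow_kernel(d1, d2):
    """Cosine similarity: k(d1, d2) / sqrt(k(d1, d1) * k(d2, d2))."""
    return bow_kernel(d1, d2) / math.sqrt(bow_kernel(d1, d1) * bow_kernel(d2, d2))
```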
Latent concept Kernels

Basic idea: [diagram] documents are first mapped into the term space (size t) by a mapping F1, and then into a much smaller concept space (size k << t) by a mapping F2; the question is how to compute K(d1, d2) in that concept space.
Semantic Kernels (1)

k(d1,d2)=(d1)SS’(d2)’


where S is the semantic matrix
S can be defined as S=RP where



R is a diagonal matrix giving the term weightings or relevances
P is the proximity matrix defining the semantic spreading between
the different terms of the corpus
The measure for the inverse document frequency for a
term t is given by:
  

w(t )  ln 
 df (t ) 

l=# of documents
df(t)=# of documents
containing term t
The matrix R is diagonal with entries:
Rtt=w(t)
16
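A minimal sketch of these definitions, assuming a term-document count matrix `D` (terms as rows); with the proximity matrix P set to the identity this reduces to plain TF-IDF weighting (the function names are mine):

```python
import numpy as np

def idf_weights(D):
    """w(t) = ln(l / df(t)), where l is the number of documents and df(t)
    the number of documents containing term t (assumed to be > 0)."""
    l = D.shape[1]
    df = (D > 0).sum(axis=1)
    return np.log(l / df)

def semantic_kernel_gram(D, P=None):
    """Gram matrix of k(d1, d2) = phi(d1) S S' phi(d2)' with S = R P."""
    R = np.diag(idf_weights(D))                  # R_tt = w(t)
    P = np.eye(D.shape[0]) if P is None else P   # proximity matrix (identity by default)
    S = R @ P                                    # semantic matrix
    Phi = D.T                                    # one row of term counts per document
    return Phi @ S @ S.T @ Phi.T
```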
Semantic Kernels (2)

The associated kernel is
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, R R' \, \phi(d_2)'$
For the proximity matrix P, the associated kernel is
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, P P' \, \phi(d_2)' = \sum_{i,j} \phi(d_1)_i \, Q_{ij} \, \phi(d_2)_j$
  where Q_ij encodes the amount of semantic relation between terms i and j.
Semantic Kernels (3)

Most natural method of incorporating
semantic information is be inferring the
relatedness of terms from an external
source of domain knowledge


Example: WordNet Ontology
Semantic Distance


Path length of hierarchical tree
Information content
18
Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)

Singular Value Decomposition (SVD):
  D' = U S V'
  where S is a diagonal matrix of the same dimensions as D, and U and V are unitary matrices whose columns are the eigenvectors of D'D and DD' respectively.
LSI projects the documents into the space spanned by the first k columns of U, using the new k-dimensional vectors for subsequent processing:
  d → φ(d) U_k
  where U_k is the matrix containing the first k columns of U.
Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)

The new kernel becomes that of kernel PCA:
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, U_k U_k' \, \phi(d_2)'$
LSK is implemented by projecting onto the features
  $\phi(d)\, U_k = \left( \lambda_i^{-1/2} \sum_{j=1}^{m} (v_i)_j \, k(d_j, d) \right)_{i=1}^{k}$
  where k is the base kernel and $(\lambda_i, v_i)$ are the eigenvalue-eigenvector pairs of the kernel matrix.
The LSKs can be represented with the proximity matrix $P = U_k U_k'$.
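A short NumPy sketch of the LSI projection and the resulting latent semantic kernel, assuming `X` is a document-by-term count matrix so that φ(d) is a row vector (the function names are mine):

```python
import numpy as np

def lsi_projection(X, k):
    """Project documents onto the first k columns of U, where X' = U S V'."""
    U, S, Vt = np.linalg.svd(X.T, full_matrices=False)
    U_k = U[:, :k]            # first k columns of U (directions in term space)
    return X @ U_k            # row i is the k-dimensional representation of document i

def lsk_gram(X, k):
    """Latent semantic kernel: k(d1, d2) = phi(d1) U_k U_k' phi(d2)'."""
    Z = lsi_projection(X, k)
    return Z @ Z.T
```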
String and Sequence

An alphabet is a finite set Σ of |Σ| symbols.
  A string s = s1...s|s| is any finite sequence of symbols from Σ, including the empty sequence.
  We denote by Σ^n the set of all finite strings of length n.
String matching implies contiguity; sequence matching only implies order.
p-spectrum kernel (1)

Features of s = the p-spectrum of s = the histogram of all (contiguous) substrings of length p.
The feature space is indexed by all elements of Σ^p.
φ_u^p(s) = number of occurrences of u in s.
The associated kernel is defined as
  $k_p(s, t) = \sum_{u \in \Sigma^p} \phi_u^p(s)\, \phi_u^p(t)$
p-spectrum kernel example

Example: 3-spectrum kernel
  s = "statistics" and t = "computation"
  The two strings contain the following substrings of length 3:
    s: "sta", "tat", "ati", "tis", "ist", "sti", "tic", "ics"
    t: "com", "omp", "mpu", "put", "uta", "tat", "ati", "tio", "ion"
  The common substrings are "tat" and "ati", so the inner product is k(s, t) = 2.
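A direct implementation of this kernel, which reproduces the value 2 for the example above (the function name is mine):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """p-spectrum kernel: inner product of the histograms of length-p substrings."""
    phi_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    phi_t = Counter(t[j:j + p] for j in range(len(t) - p + 1))
    return sum(count * phi_t[u] for u, count in phi_s.items())

print(p_spectrum_kernel("statistics", "computation", 3))   # -> 2 ("tat" and "ati")
```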
p-spectrum Kernels Recursion

The k-suffix kernel is defined by
  $k_k^S(s, t) = \begin{cases} 1 & \text{if } s = s_1 u,\ t = t_1 u \text{ for some } u \in \Sigma^k \\ 0 & \text{otherwise} \end{cases}$
The p-spectrum kernel can be evaluated using the equation
  $k_p(s, t) = \sum_{i=1}^{|s|-p+1} \sum_{j=1}^{|t|-p+1} k_p^S\big(s(i : i+p),\ t(j : j+p)\big)$
in O(p |s| |t|) operations.
The evaluation of one row of the table for the p-suffix kernel corresponds to performing a search in the string t for the p-suffix of a prefix in s.
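For illustration, the double sum can be transcribed directly, with the p-suffix kernel reduced to a test of whether two length-p windows coincide (the names are mine, and this is not the table-based O(p |s| |t|) implementation itself); it reproduces the 3-spectrum value of 2 from the earlier example:

```python
def p_suffix_kernel(s, t, p):
    """k^S_p(s, t): 1 if s and t are long enough and share the same length-p suffix."""
    return 1 if len(s) >= p and len(t) >= p and s[-p:] == t[-p:] else 0

def p_spectrum_via_suffix(s, t, p):
    """Evaluate the p-spectrum kernel as a double sum of p-suffix kernels
    over all length-p windows of s and t."""
    return sum(p_suffix_kernel(s[i:i + p], t[j:j + p], p)
               for i in range(len(s) - p + 1)
               for j in range(len(t) - p + 1))

print(p_spectrum_via_suffix("statistics", "computation", 3))   # -> 2
```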
All-subsequences kernels

The feature mapping is defined by all contiguous or non-contiguous subsequences of a string.
The feature space is indexed by all elements of Σ* = {ε} ∪ Σ ∪ Σ² ∪ Σ³ ∪ ...
φ_u(s) = number of occurrences of u as a (non-contiguous) subsequence of s.
Explicit computation rapidly becomes infeasible (exponential in |s|, even with a sparse representation).
Recursive implementation

Consider the addition of one extra symbol a to s: common subsequences of (sa, t) are either in s, or must end with the symbol a (in both sa and t).
Mathematically,
  $k(s, \varepsilon) = 1$
  $k(sa, t) = k(s, t) + \sum_{j : t_j = a} k(s, t(1 : j-1))$
This gives a complexity of O(|s| |t|²).
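A memoised transcription of this recursion (function names are mine); note that the count includes the empty subsequence:

```python
from functools import lru_cache

def all_subsequences_kernel(s, t):
    """k(s, eps) = 1;  k(sa, t) = k(s, t) + sum over j with t_j = a of k(s, t(1:j-1))."""
    @lru_cache(maxsize=None)
    def k(s, t):
        if not s:                      # only the empty subsequence is shared
            return 1
        head, a = s[:-1], s[-1]
        total = k(head, t)             # common subsequences not using the final symbol a
        for j, c in enumerate(t):
            if c == a:                 # common subsequences ending with a in both strings
                total += k(head, t[:j])
        return total
    return k(s, t)

print(all_subsequences_kernel("gat", "cata"))   # -> 5 (eps, "a" twice, "t", "at")
```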
Fixed-length subsequence kernels

The feature space is indexed by all elements of Σ^p.
φ_u(s) = number of occurrences of the p-gram u as a (non-contiguous) subsequence of s.
Recursive implementation (creates a series of p tables):
  $k_0(s, \varepsilon) = 1$
  $k_p(s, \varepsilon) = 0 \quad \text{for } p > 0$
  $k_p(sa, t) = k_p(s, t) + \sum_{j : t_j = a} k_{p-1}(s, t(1 : j-1))$
Complexity: O(p |s| |t|), but we get the l-length subsequence kernels (l ≤ p) for free, so it is easy to compute $k(s, t) = \sum_l a_l k_l(s, t)$.
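The same idea restricted to subsequences of length exactly p, again as a memoised sketch with names of my own:

```python
from functools import lru_cache

def fixed_length_subseq_kernel(s, t, p):
    """Inner product of the counts of length-p (non-contiguous) subsequences."""
    @lru_cache(maxsize=None)
    def k(p, s, t):
        if p == 0:
            return 1                   # k_0 = 1: only the empty subsequence
        if not s or not t:
            return 0                   # k_p(s, eps) = 0 for p > 0
        head, a = s[:-1], s[-1]
        total = k(p, head, t)
        for j, c in enumerate(t):
            if c == a:
                total += k(p - 1, head, t[:j])
        return total
    return k(p, s, t)

print(fixed_length_subseq_kernel("gap", "gnap", 3))   # -> 1 (the common 3-gram "gap")
```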
Gap-weighted subsequence kernels (1)

The feature space is indexed by all elements of Σ^p.
φ_u(s) = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, the weight being length-penalizing: $\lambda^{\text{length}(u)}$. [NB: the length includes both matching symbols and gaps.]
Example (1):
  The string "gon" occurs as a subsequence of the strings "gone", "going" and "galleon", but we consider the first occurrence as more important since it is contiguous, while the final occurrence is the weakest of all three.
Gap-weighted subsequence kernels (2)

Example (2):
  D1: ATCGTAGACTGTC
  D2: GACTATGC
  $\phi(D_1)_{CAT} = 2\lambda^8 + 2\lambda^{10}$ and $\phi(D_2)_{CAT} = \lambda^4$
  $k(D_1, D_2)_{CAT} = 2\lambda^{12} + 2\lambda^{14}$
Naturally built as a dot product → a valid kernel.
For an alphabet of size 80, there are 512,000 trigrams.
For an alphabet of size 26, there are about 12 x 10^6 5-grams.
Gap-weighted subsequence kernels (3)

It is hard to perform the explicit expansion and dot product.
There is an efficient recursive formulation (dynamic-programming type), whose complexity is O(k |D1| |D2|).
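As an illustration, here is a memoised sketch in the style of the standard (Lodhi et al.) recursion for this kernel; it is a direct transcription rather than the optimized O(k |D1| |D2|) dynamic program, and `lam` (the decay factor λ) and `p` (the subsequence length) are parameter names of my own:

```python
from functools import lru_cache

def gap_weighted_kernel(s, t, p, lam=0.5):
    """k_p(s, t) = sum over common length-p subsequences u of
    lam^(span of u in s) * lam^(span of u in t)."""

    @lru_cache(maxsize=None)
    def k_prime(i, s, t):
        # Auxiliary kernel: the decay is counted from the start of the match
        # all the way to the ends of s and t.
        if i == 0:
            return 1.0
        if len(s) < i or len(t) < i:
            return 0.0
        head, x = s[:-1], s[-1]
        total = lam * k_prime(i, head, t)
        for j, c in enumerate(t):
            if c == x:
                total += k_prime(i - 1, head, t[:j]) * lam ** (len(t) - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(s, t):
        if len(s) < p or len(t) < p:
            return 0.0
        head, x = s[:-1], s[-1]
        total = k(head, t)
        for j, c in enumerate(t):
            if c == x:
                total += k_prime(p - 1, head, t[:j]) * lam ** 2
        return total

    return k(s, t)

# phi("GACTATGC")_CAT = lam^4, and the only length-3 subsequence of "CAT" is "CAT"
# itself (weight lam^3), so k("CAT", "GACTATGC") should equal lam^7:
lam = 0.5
print(gap_weighted_kernel("CAT", "GACTATGC", 3, lam), lam ** 7)
```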
Word Sequence Kernels (1)

Here "words" are considered as symbols.
  Meaningful symbols → more relevant matching.
  Linguistic preprocessing can be applied to improve performance.
  Shorter sequence sizes → improved computation time.
  But increased sparsity (documents are more "orthogonal").
Motivation: the noisy stemming hypothesis (important N-grams approximate stems), confirmed experimentally in a categorization task.
Word Sequence Kernels (2)

Link between Word Sequence Kernels and other methods:
  For k = 1, WSK is equivalent to the basic "Bag of Words" approach.
  For λ = 1, there is a close relation to the polynomial kernel of degree k, but WSK takes order into account.
Extensions of WSK:
  Symbol-dependent decay factors (a way to introduce the IDF concept, dependence on the POS, stop words).
  Different decay factors for gaps and matches (e.g. noun < adj for gaps; noun > adj for matches).
  Soft matching of symbols (e.g. based on a thesaurus, or on a dictionary if we want cross-lingual kernels).
Tree Kernels

Applications: categorization [one doc = one tree], parsing (disambiguation) [one doc = multiple trees].
Tree kernels constitute a particular case of more general kernels defined on discrete structures (convolution kernels). Intuitively, the philosophy is
  to split the structured objects into parts,
  to define a kernel on the "atoms" and a way to recursively combine the kernels over parts to get the kernel over the whole.
Feature space definition: one feature for each possible proper subtree in the training data; the feature value is the number of occurrences.
A subtree is defined as any part of the tree which includes more than one node, with the restriction that no "partial" rule production is allowed.
Trees in Text: example

Example: [figure] a parse tree for the sentence "John loves Mary" (node labels S, NP, VP, N, V; caption "A Parse Tree"), together with a few among the many subtrees of this tree.
Tree Kernels: algorithm

Kernel = dot product in this high-dimensional feature space.
Once again, there is an efficient recursive algorithm (in polynomial time, not exponential!).
Basically, it compares the productions of all possible pairs of nodes (n1, n2), with n1 ∈ T1 and n2 ∈ T2; if the production is the same, the number of common subtrees rooted at both n1 and n2 is computed recursively, considering the number of common subtrees rooted at the common children.
Formally, let k_co-rooted(n1, n2) = number of common subtrees rooted at both n1 and n2; then
  $k(T_1, T_2) = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} k_{\text{co-rooted}}(n_1, n_2)$
All sub-tree kernel

k_co-rooted(n1, n2) = 0 if n1 or n2 is a leaf.
k_co-rooted(n1, n2) = 0 if n1 and n2 have different productions or, if labeled, different labels.
Otherwise,
  $k_{\text{co-rooted}}(n_1, n_2) = \prod_{\text{children } i} \big(1 + k_{\text{co-rooted}}(ch(n_1, i), ch(n_2, i))\big)$
"Production" is left intentionally ambiguous, to include both unlabelled and labeled trees.
Complexity is O(|T1| · |T2|).
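A small sketch of this recursion on a minimal tree type (the `Node` class and function names are mine; to actually reach the stated O(|T1| · |T2|) bound, k_co_rooted would additionally be memoised over node pairs):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def production(n: Node) -> Tuple[str, Tuple[str, ...]]:
    """A node's production: its label together with its children's labels."""
    return (n.label, tuple(c.label for c in n.children))

def k_co_rooted(n1: Node, n2: Node) -> int:
    """Number of common subtrees rooted at both n1 and n2."""
    if not n1.children or not n2.children:   # leaves contribute 0
        return 0
    if production(n1) != production(n2):     # different productions contribute 0
        return 0
    result = 1
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1 + k_co_rooted(c1, c2)
    return result

def nodes(t: Node):
    yield t
    for c in t.children:
        yield from nodes(c)

def tree_kernel(t1: Node, t2: Node) -> int:
    """k(T1, T2) = sum over all node pairs of k_co_rooted(n1, n2)."""
    return sum(k_co_rooted(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))
```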
References

J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004 (Chapters 10 and 11).
J. Tian, "PCA/Kernel PCA for Image Denoising", September 16, 2004.
T. Gärtner, "A Survey of Kernels for Structured Data", ACM SIGKDD Explorations Newsletter, July 2003.
N. Cristianini, "Latent Semantic Kernels", Proceedings of ICML-01, 18th International Conference on Machine Learning, 2001.