
Mathematical Models of Learning
Learning Kernel Classifiers
Chapter 4
Part I
Dong-Yeon Cho
Contents

- Generative vs. Discriminative Models
- PAC and VC Frameworks
- Classical PAC and VC Analysis
- Growth Function and VC Dimension
- Structural Risk Minimization
© 2005 SNU CSE Biointelligence Lab
Generative vs. Discriminative Models (1/3)

- Generative (or parametric) statistical approach
  - Our model is a parametric family of probability measures over Z = X × Y.
  - Decision function for the case of zero-one loss:

      h(x) = argmax_{y ∈ Y} P_{Y|X=x}(y)    (4.1)

  - This function has minimal expected risk.
  - "Generative": the model contains different descriptions of the generation of the training sample.
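A minimal sketch of the decision rule (4.1) under zero-one loss; the discrete conditional distribution below is a made-up toy example, not from the lecture:

```python
# Bayes-optimal decision under zero-one loss (equation (4.1)): predict the
# class with the largest conditional probability P(Y = y | X = x).
# The distribution values are illustrative assumptions.
P_y_given_x = {
    "x1": {"A": 0.7, "B": 0.3},
    "x2": {"A": 0.2, "B": 0.8},
}

def bayes_decision(x):
    cond = P_y_given_x[x]
    return max(cond, key=cond.get)  # argmax over y of P(Y = y | X = x)

print(bayes_decision("x1"))  # prints "A"
print(bayes_decision("x2"))  # prints "B"
```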
Generative vs. Discriminative Models (2/3)

- Discriminative (or machine learning) approach
  - The model is given by a hypothesis space H ⊆ Y^X.
  - Each hypothesis h ∈ H is a model of the conditional distribution of classes y ∈ Y given objects x ∈ X.
  - The model is a subset of the more general model used in classical statistics.
  - "Discriminative": the model H consists of different descriptions of the discrimination of the sample.
Generative vs. Discriminative Models (3/3)

- Estimator
  - A machine learning method selects one hypothesis A(z) ∈ H given a training sample z ∈ Z^m.
  - The corresponding selection mechanism of a probability measure given the training sample z is called an estimator.
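As a sketch, an estimator in this sense is just a map from training samples to hypotheses; empirical risk minimization over a toy finite hypothesis space of threshold classifiers (an illustrative choice, not the lecture's example) is the simplest instance:

```python
# An estimator A maps a training sample z in Z^m to one hypothesis A(z) in H.
# Toy hypothesis space: threshold classifiers h_t(x) = 1 if x >= t else 0.
H = [0.0, 0.25, 0.5, 0.75, 1.0]  # each t stands for the hypothesis h_t

def emp_risk(t, z):
    # fraction of training examples misclassified by h_t
    return sum(1 for x, y in z if (x >= t) != (y == 1)) / len(z)

def A(z):
    # ERM estimator: select the hypothesis with minimal training error
    return min(H, key=lambda t: emp_risk(t, z))

z = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
print(A(z), emp_risk(A(z), z))  # t = 0.5 separates this sample perfectly
```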
Conceptual Difference - Type of Convergence Results (1/2)

- Convergence of the estimated measure
  - The estimated measure converges to the true measure w.r.t. some metric ρ in the space P of measures.
- Convergence of the expected risk of the learned function

    R[A(z)] → inf_{h ∈ H} R[h]    (4.4)

  - This convergence is a special case of the convergence of the probability measures when identifying Q and H and using ρ(h, h') = |R[h] − R[h']|.
Conceptual Difference - Type of Convergence Results (2/2)

- The interesting question
  - Does the convergence of probability measures always imply a convergence of risks when using equation (4.1), regardless of the metric ρ?
Examples

- If we are interested in the convergence of the expected risk, we should not resort to the convergence of probability measures, because the latter might not imply the former or might be a weaker convergence than required.
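A small numeric illustration of this point (the toy distribution is assumed for illustration): an estimated conditional measure can be far from the true one in total variation while inducing exactly the Bayes classifier, so its expected risk is already optimal.

```python
# True and estimated conditional probabilities P(Y=1 | X=x) on four
# equally likely inputs.  The estimate is far off as a measure,
# but it preserves every argmax, so the plug-in classifier is Bayes.
xs = ["a", "b", "c", "d"]
p_true = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2}
p_est  = {"a": 0.6, "b": 0.51, "c": 0.4, "d": 0.49}

# average absolute deviation between the two conditional measures
tv = sum(abs(p_true[x] - p_est[x]) for x in xs) / len(xs)

def risk(p_decide):
    # expected zero-one risk of the plug-in rule "predict 1 iff p >= 0.5",
    # evaluated under the true measure, with X uniform on xs
    total = 0.0
    for x in xs:
        pred1 = p_decide[x] >= 0.5
        total += (1 - p_true[x]) if pred1 else p_true[x]
    return total / len(xs)

print(tv)                         # the measures are clearly apart
print(risk(p_est), risk(p_true))  # yet the risks coincide (both Bayes)
```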
PAC and VC Framework (1/3)

- Generalization error of an algorithm A:

    R[A(z)] − inf_{h ∈ H} R[h]

- VC (Vapnik-Chervonenkis) or PAC (Probably Approximately Correct) framework
  - For ERM: uniform convergence of training errors to expected errors over all hypotheses.
PAC and VC Framework (2/3)

- Difference between PAC and VC
  - PAC considers only data distributions P_Z for which R[h*] = 0 for some h* ∈ H.
PAC and VC Framework (3/3)
Classical PAC and VC Analysis

- To bound the probability of bad training samples
  - Zero-one loss
  - "Bad" means there exists a hypothesis h ∈ H for which the deviation between the empirical risk and the expected risk is larger than some prespecified ε ∈ [0, 1].
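For a finite hypothesis space this probability is classically bounded by combining Hoeffding's inequality with the union bound, P(∃h ∈ H : |R[h] − R_emp[h, z]| > ε) ≤ 2|H|·exp(−2mε²); a quick numeric sketch (the values of |H|, ε, δ are illustrative):

```python
import math

def uniform_bound(card_H, m, eps):
    # union bound over H combined with Hoeffding's inequality:
    # P(exists h: |R[h] - R_emp[h,z]| > eps) <= 2|H| exp(-2 m eps^2)
    return 2 * card_H * math.exp(-2 * m * eps * eps)

def sample_size(card_H, eps, delta):
    # smallest m that makes the bound above at most delta
    return math.ceil(math.log(2 * card_H / delta) / (2 * eps * eps))

m = sample_size(card_H=1000, eps=0.05, delta=0.01)
print(m, uniform_bound(1000, m, 0.05))  # the bound is at most delta = 0.01
```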
Infinite Number of Hypotheses (1/4)

- The approach can be decomposed into three steps.
  - Basic lemma (symmetrization by a ghost sample)
    - Consider a double sample (z, z~) ∈ Z^{2m} drawn iid. Then

        P(|R[h] − R_emp[h, z]| > ε) ≤ 2·P(|R_emp[h, z] − R_emp[h, z~]| > ε/2)

  - Symmetrization by permutation (or conditioning)
    - Any permutation π: {1, …, 2m} → {1, …, 2m} and logical formula Υ: Z^{2m} → {true, false}
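For a single hypothesis whose zero-one losses are Bernoulli(p), both sides of the basic lemma can be computed exactly from the binomial distribution; this sketch (with illustrative m, p, ε) checks the inequality numerically:

```python
import math

def binom_pmf(m, p, k):
    # probability that a Binomial(m, p) variable equals k
    return math.comb(m, k) * p**k * (1 - p)**(m - k)

def ghost_sample_check(m=50, p=0.5, eps=0.2):
    # Left side: P(|R[h] - R_emp[h,z]| > eps) for a single hypothesis h
    # with R[h] = p, where m * R_emp[h,z] ~ Binomial(m, p).
    lhs = sum(binom_pmf(m, p, k) for k in range(m + 1)
              if abs(k / m - p) > eps)
    # Right side: 2 * P(|R_emp[h,z] - R_emp[h,z~]| > eps/2) for an
    # independent ghost sample z~ of the same size m.
    rhs = 2 * sum(binom_pmf(m, p, j) * binom_pmf(m, p, k)
                  for j in range(m + 1) for k in range(m + 1)
                  if abs(j / m - k / m) > eps / 2)
    return lhs, rhs

lhs, rhs = ghost_sample_check()
print(lhs, rhs)  # lhs is much smaller than rhs for these parameters
```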
Infinite Number of Hypotheses (2/4)

- Bounding the number of permutations
  - On the double sample there are at most 2^{2m} different hypotheses w.r.t. the empirical risks R_emp[h', z] and R_emp[h', z~].
  - N_H(2m): the maximum number of such equivalence classes.
  - We can again use a combination of the union bound and Hoeffding's inequality to bound the generalization error.
  - The cardinality |H| of the hypothesis space in the finite case has been replaced by the number N_H(2m).
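The number N_H(m) can be computed by brute force for simple classes; for one-sided threshold classifiers on the line (an illustrative choice), m distinct points admit only m + 1 distinct labelings, far fewer than 2^m:

```python
def distinct_labelings(points):
    # N_H(m) for H = { h_t : h_t(x) = 1 if x >= t else 0 }: enumerate the
    # labelings realized as the threshold sweeps across the sorted points.
    pts = sorted(points)
    thresholds = [pts[0] - 1.0] + [x + 1e-9 for x in pts]
    labelings = {tuple(1 if x >= t else 0 for x in pts) for t in thresholds}
    return len(labelings)

m = 6
points = [0.3, 1.1, 2.5, 3.0, 4.2, 5.9]
print(distinct_labelings(points), 2 ** m)  # m + 1 = 7 versus 2^m = 64
```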
Infinite Number of Hypotheses (3/4)
Infinite Number of Hypotheses (4/4)

- Confidence intervals
- Some interesting conclusions
  - If the hypothesis space is too rich, this is a meaningless bound, as 0 ≤ R[h] ≤ 1.
  - We can tighten the bounds by an order of magnitude if we can achieve zero training error.
  - In the case of finite cardinality of the hypothesis space: as the size |H| of decision trees often grows exponentially in m, techniques like pruning effectively limit |H| and thus guarantee a small generalization error.
Growth Function and VC Dimension (1/6)

- The growth function depends only on the sample size m and the hypothesis space H.
- Tight upper bounds on the growth function
  - It is generally not possible to determine the exact value of the growth function for an arbitrary hypothesis space and any m.
Growth Function and VC Dimension (2/6)

- N_H(m) is also called the shatter coefficient of

    A_H = { {(x, y) ∈ Z | l_{0-1}(h(x), y) = 1} | h ∈ H }

- VC dimension
  - The VC dimension ϑ of A_H is the largest natural number such that there exists a sample of size ϑ which can be subdivided in all 2^ϑ different ways by (set) intersection with A_H.
Growth Function and VC Dimension (3/6)

- For all training sample sizes m ≥ ϑ, the growth function is sublinear in m:

    G_H(m) ≤ ϑ·(ln(m/ϑ) + 1)    (4.19)
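The bound (4.19) follows from the Sauer–Shelah lemma, N_H(m) ≤ Σ_{i≤ϑ} C(m, i), by taking logarithms; a quick numeric check (with illustrative ϑ and m):

```python
import math

def ln_growth_upper(m, d):
    # ln N_H(m) <= ln sum_{i=0}^{d} C(m, i)   (Sauer-Shelah lemma)
    return math.log(sum(math.comb(m, i) for i in range(d + 1)))

def bound_4_19(m, d):
    # G_H(m) <= d * (ln(m / d) + 1), valid for m >= d
    return d * (math.log(m / d) + 1)

d, m = 5, 100
print(ln_growth_upper(m, d), bound_4_19(m, d))  # the first is below the second
```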
Growth Function and VC Dimension (4/6)
Growth Function and VC Dimension (5/6)

- Curse of dimensionality
  - An important property of the VC dimension is that it does not necessarily coincide with the number of parameters used.
Growth Function and VC Dimension (6/6)

- Example 4.13 (VC dimension and parameters)
  - The VC dimension is one: all functions are monotonically increasing and have exactly one zero.
  - The VC dimension of linear classifiers equals the number n of parameters.
  - The VC dimension is infinite though we have only one parameter.
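The standard one-parameter family with infinite VC dimension is h_θ(x) = sign(sin(θx)) (this family is assumed here; the transcript does not name it). The points x_i = 10^{−i} can be shattered for every n by choosing θ from the desired labels:

```python
import itertools
import math

def theta_for_labels(labels):
    # classical construction: for points x_i = 10^-(i+1) and desired labels
    # y_i in {0, 1}, theta = pi * (1 + sum_i (1 - y_i) * 10^(i+1)) realizes them
    return math.pi * (1 + sum((1 - y) * 10 ** (i + 1)
                              for i, y in enumerate(labels)))

def classify(theta, x):
    # h_theta(x) = 1 iff sin(theta * x) > 0
    return 1 if math.sin(theta * x) > 0 else 0

n = 4
xs = [10 ** -(i + 1) for i in range(n)]
shattered = all(
    [classify(theta_for_labels(lab), x) for x in xs] == list(lab)
    for lab in itertools.product([0, 1], repeat=n)
)
print(shattered)  # True: all 2^4 labelings are realized by a single parameter
```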
Structural Risk Minimization (1/4)

- Minimization of the VC dimension
  - Ideally, we would like to make the VC dimension itself a quantity that can be minimized by a learning algorithm.
  - A minimization of the VC dimension in parallel to the training error is theoretically not justified.
- Structural risk minimization (SRM)
  - One possible method of overcoming this problem.
  - The idea of SRM is to define a structure S = {H_1, …, H_s} of hypothesis spaces and to compute the set of hypotheses h_i which minimize the training error in each hypothesis space H_i.
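As a sketch (toy nested finite hypothesis spaces and a Hoeffding-style penalty; the specific sizes and errors are illustrative assumptions), SRM selects the structure element minimizing training error plus a complexity-dependent confidence term:

```python
import math

def srm_select(train_errors, cards, m, delta=0.05):
    # train_errors[i]: training error of the ERM hypothesis h_i in H_i;
    # cards[i]: cardinality |H_i| of the nested spaces H_1 c ... c H_s.
    # Guaranteed risk (Hoeffding-style): R_emp + sqrt(ln(2|H_i|/delta)/(2m)).
    def bound(i):
        return train_errors[i] + math.sqrt(
            math.log(2 * cards[i] / delta) / (2 * m))
    bounds = [bound(i) for i in range(len(cards))]
    return min(range(len(cards)), key=bound), bounds

# Larger spaces fit the sample better but pay a larger complexity penalty.
train_errors = [0.20, 0.10, 0.05, 0.05]   # non-increasing along the structure
cards = [10, 100, 10_000, 10_000_000]
i_star, bounds = srm_select(train_errors, cards, m=200)
print(i_star, bounds)  # the middle space wins the trade-off here
```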
Structural Risk Minimization (2/4)
Structural Risk Minimization (3/4)

- This simple lemma is directly applicable to Theorem 4.7.
Structural Risk Minimization (4/4)

- We can simply stop increasing the complexity as soon as we have found a hypothesis space H_i containing a hypothesis with zero training error, at the price of an additional −ln(p_i) term in the bound.