Theory of VC bounds


VC Bound
presented by Ruihua Cheng
Introduction
Empirical observations: $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$;
for simplicity, $X, Y \subseteq \mathbb{R}$.
Assume the data are generated by sampling from an
unknown underlying distribution $P(x, y)$.
Definition: the risk, i.e., the expected loss on the test data,
$$R[f] = E[R_{\mathrm{test}}[f]] = E[c(x, y, f(x))] = \int_{X \times Y} c(x, y, f(x)) \, dP(x, y). \quad (1)$$
Here, $c(x, y, f(x))$ is the loss function. In the case of
pattern recognition, where $y \in \{\pm 1\}$, a common choice is
the misclassification error,
$$c(x, y, f(x)) = \tfrac{1}{2} \, |f(x) - y|.$$
Empirical risk: use the training sample to approximate the risk (1) by a finite sum,
$$R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)). \quad (2)$$
Empirical risk minimization (ERM) induction principle:
choose an $f$ that minimizes $R_{\mathrm{emp}}[f]$.
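As a concrete illustration, here is a minimal Python sketch of ERM over a small finite function class of 1-D threshold classifiers; the data, thresholds, and helper names are hypothetical choices, and the loss is the misclassification error $\tfrac{1}{2}|f(x) - y|$ from above.

import numpy as np

def empirical_risk(f, X, y):
    # R_emp[f] = (1/m) * sum_i c(x_i, y_i, f(x_i)) with c = 0.5*|f(x) - y|, cf. (2)
    return np.mean(0.5 * np.abs(f(X) - y))

def erm(function_class, X, y):
    # ERM induction principle: pick the function with the smallest empirical risk
    return min(function_class, key=lambda f: empirical_risk(f, X, y))

# Toy sample with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.where(X > 0.1, 1, -1)

# A finite class of threshold classifiers f_t(x) = sgn(x - t)
thresholds = np.linspace(-1, 1, 21)
function_class = [lambda x, t=t: np.where(x > t, 1, -1) for t in thresholds]

f_m = erm(function_class, X, y)
print("empirical risk of the ERM solution:", empirical_risk(f_m, X, y))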
Convergence of Means to Expectations
Law of large numbers: the empirical mean of an i.i.d. sample converges to the expectation as the sample size grows.
Detailed analysis: Chernoff's bound quantifies the rate of this convergence.
Chernoff's Bound
Characterize how the empirical mean converges to the expected value of a random variable $\xi$:
$$P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - E[\xi] \right| \ge \varepsilon \right\} \le 2 \exp(-2 m \varepsilon^2). \quad (3)$$
Here, $P$ refers to the probability of getting a sample $\xi_1, \ldots, \xi_m$ with the property $\left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - E[\xi] \right| \ge \varepsilon$.
Chernoff's bound, translated back into machine learning terminology: the probability of obtaining an $m$-sample on which the training error and the test error differ by more than $\varepsilon > 0$ is bounded by
$$P\{ |R_{\mathrm{emp}}[f] - R[f]| \ge \varepsilon \} \le 2 \exp(-2 m \varepsilon^2). \quad (4)$$
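A small Monte Carlo sketch (in Python, with a hypothetical distribution and a single fixed classifier) of what (4) asserts: the frequency of large deviations between training and test error stays below the Chernoff bound.

import numpy as np

rng = np.random.default_rng(1)
m, eps, trials = 200, 0.05, 20000

# Hypothetical setup: x ~ Uniform(-1, 1), y = sgn(x) flipped with prob. 0.1;
# the fixed classifier is f(x) = sgn(x), so its true risk R[f] is the flip rate.
true_risk = 0.1

deviations = 0
for _ in range(trials):
    x = rng.uniform(-1, 1, size=m)
    flip = rng.random(m) < true_risk
    y = np.where(flip, -np.sign(x), np.sign(x))
    emp_risk = np.mean(np.sign(x) != y)          # R_emp[f] on this m-sample
    deviations += abs(emp_risk - true_risk) >= eps

print("empirical P{|R_emp - R| >= eps}:", deviations / trials)
print("Chernoff bound 2*exp(-2*m*eps^2):", 2 * np.exp(-2 * m * eps**2))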
Chernoff's Bound (cont'd)
Inequality (4) refers to one fixed $f$.
But the learning machine can implement many functions, and for a specific sample, inequality (4) does not rule out cases where the deviation is large for some $f$ that the learning machine can implement.
Hence (4) is not suitable as a bound on the test error of a learning algorithm that uses empirical risk minimization.
We need a uniform version of the law of large numbers.
Goal: (4) should hold uniformly over all functions $f$ that the learning machine can implement, denoted $f \in F$.
Uniform Convergence and Consistency
Denote the minimizer of $R$ by $f_{\mathrm{opt}}$ and the minimizer of $R_{\mathrm{emp}}$ by $f_m$.
We have
$$R[f_m] - R[f_{\mathrm{opt}}] \ge 0, \qquad R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R_{\mathrm{emp}}[f_m] \ge 0.$$
For consistency, we would like the left-hand sides of both to converge to 0 in probability.
If the sum of the two converges to 0, we are done.
The sum of the two left-hand sides satisfies
$$0 \le R[f_m] - R[f_{\mathrm{opt}}] + R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R_{\mathrm{emp}}[f_m]
= R[f_m] - R_{\mathrm{emp}}[f_m] + R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R[f_{\mathrm{opt}}]
\le \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) + (R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R[f_{\mathrm{opt}}]).$$
Second term of the right-hand side: $f_{\mathrm{opt}}$ is fixed (independent of the training sample), hence by Chernoff's bound, for all $\varepsilon > 0$,
$$\lim_{m \to \infty} P\{ |R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R[f_{\mathrm{opt}}]| \ge \varepsilon \} = 0. \quad (5)$$
Uniform Convergence
If the first term on the right-hand side also converges to zero (in probability), i.e.,
$$\lim_{m \to \infty} P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} = 0 \quad \text{for all } \varepsilon > 0,$$
then
$$R[f_m] - R[f_{\mathrm{opt}}] \to 0, \qquad R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R_{\mathrm{emp}}[f_m] \to 0$$
in probability. In this case, empirical risk minimization can be seen to be consistent.
Theorem (Vapnik & Chervonenkis)
Necessary and sufficient condition for nontrivial consistency of empirical risk minimization (ERM):
one-sided convergence, uniformly over all functions that can be implemented by the learning machine,
$$\lim_{m \to \infty} P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} = 0 \quad \text{for all } \varepsilon > 0.$$
Note: this takes into account the whole set of functions that can be implemented by the learning machine.
How to derive a VC bound
Take a look at
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \}$$
to explore the properties of sets of functions which ensure uniform convergence of the risk.
If the function class $F$ contains only one function, then Chernoff's bound suffices:
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} \le 2 \exp(-2 m \varepsilon^2).$$
If there are finitely many functions, we use the union bound.
Even if there are infinitely many, on any finite sample there are effectively only finitely many (symmetrization and capacity concepts).
The case of two functions
Suppose $F = \{f_1, f_2\}$. Rewrite
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} = P(C_1 \cup C_2),$$
where
$$C_i := \{ (x_1, y_1), \ldots, (x_m, y_m) \mid R[f_i] - R_{\mathrm{emp}}[f_i] > \varepsilon \}$$
denotes the event that the risks of $f_i$ differ by more than $\varepsilon$.
The right-hand side satisfies
$$P(C_1 \cup C_2) \le P(C_1) + P(C_2).$$
Hence by Chernoff's bound,
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} \le P(C_1) + P(C_2) \le 2 \cdot 2 \exp(-2 m \varepsilon^2).$$
Similarly, if $F = \{f_1, \ldots, f_n\}$, we have
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} = P(C_1 \cup \cdots \cup C_n) \le \sum_{i=1}^{n} P(C_i) \le n \cdot 2 \exp(-2 m \varepsilon^2).$$
The case of an infinite function class
By a lemma (Vapnik & Chervonenkis), for $m \varepsilon^2 \ge 2$,
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} \le 2 \, P\{ \sup_{f \in F} (R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]) > \varepsilon/2 \}.$$
Here, the first $P$ refers to the distribution of i.i.d. samples of size $m$, while the second $P$ refers to i.i.d. samples of size $2m$. In the latter case, $R_{\mathrm{emp}}$ measures the loss on the first half of the sample, while $R'_{\mathrm{emp}}$ measures it on the second half.
This indicates that if the empirical error rates on two independent $m$-samples are close to each other, then they should also be close to the true error rate.
Shattering Coefficient
The problem is then translated into considering the maximum size of $F$ on $2m$ points; call it $N(F, 2m)$.
$N(F, 2m)$ = the maximum number of different outputs $(y_1, \ldots, y_{2m})$ that the function class can generate on $2m$ points; in other words, the maximum number of different ways in which the function class can separate $2m$ points into two classes.
$$N(F, 2m) \le 2^{2m}.$$
If $N(F, 2m) = 2^{2m}$, then the function class is said to shatter the $2m$ points.
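A brute-force Python sketch of the shattering coefficient for a simple hypothetical class, the 1-D thresholds $f_t(x) = \mathrm{sgn}(x - t)$: enumerate the labelings the class realizes on a fixed set of points and count the distinct ones; thresholds realize only $n + 1$ of the $2^n$ possible labelings.

import numpy as np

def shattering_coefficient(points, thresholds):
    # Number of distinct label vectors (f_t(x_1), ..., f_t(x_n)) over all thresholds t
    labelings = {tuple(np.where(points > t, 1, -1)) for t in thresholds}
    return len(labelings)

points = np.sort(np.random.default_rng(2).uniform(-1, 1, size=6))
# Thresholds between consecutive points, plus one below and one above all points,
# suffice to realize every labeling this class can produce on these points.
thresholds = np.concatenate(([-2.0], (points[:-1] + points[1:]) / 2, [2.0]))

print("N(F, 6) for thresholds:", shattering_coefficient(points, thresholds))
print("2^6 =", 2 ** 6)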
Using symmetrization, the shattering coefficient, and the union bound, and conditioning on the $2m$-sample $Z_{2m} = ((x_1, y_1), \ldots, (x_{2m}, y_{2m}))$ having been drawn, we get
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \}
\le 2 \, P\{ \sup_{f \in F} (R_{\mathrm{emp}}[f \mid Z_{2m}] - R'_{\mathrm{emp}}[f \mid Z_{2m}]) > \varepsilon/2 \}
\le 2 \sum_{n=1}^{N(F, 2m)} P\{ R_{\mathrm{emp}}[f_n \mid Z_{2m}] - R'_{\mathrm{emp}}[f_n \mid Z_{2m}] > \varepsilon/2 \}. \quad (6)$$
Apply a corollary of Chernoff's bound to each term:
$$P\left\{ \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i > \varepsilon \right\} \le 2 \exp(-m \varepsilon^2 / 2).$$
The resulting inequality is
$$P\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \varepsilon \} \le 4 \, E_{2m}(N(F, 2m)) \exp(-m \varepsilon^2 / 8), \quad (7)$$
where the expectation is taken over the random drawing of $Z_{2m}$.
Rewrite the bound: set the right-hand side of inequality (7) equal to some $\delta > 0$ and solve for $\varepsilon$. The result: with probability at least $1 - \delta$,
$$R[f] \le R_{\mathrm{emp}}[f] + \sqrt{ \frac{8}{m} \left( \ln E_{2m}(N(F, 2m)) + \ln \frac{4}{\delta} \right) }. \quad (8)$$
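The inversion step behind (8) can be written out explicitly (a short derivation consistent with (7)): setting
$$4 \, E_{2m}(N(F, 2m)) \exp(-m \varepsilon^2 / 8) = \delta$$
and solving for $\varepsilon$ gives
$$\frac{m \varepsilon^2}{8} = \ln E_{2m}(N(F, 2m)) + \ln \frac{4}{\delta}, \qquad \varepsilon = \sqrt{ \frac{8}{m} \left( \ln E_{2m}(N(F, 2m)) + \ln \frac{4}{\delta} \right) },$$
so with probability at least $1 - \delta$ the deviation $R[f] - R_{\mathrm{emp}}[f]$ is at most this $\varepsilon$, which is (8).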
VC Dimension
Growth function:
$$G_F(m) = \ln N(F, m). \quad (9)$$
If $G_F(m) = m \ln 2$, that means $m$ points can be chosen such that, by using functions of the learning machine, they can be separated in all $2^m$ possible ways.
But there are cases where $F$ is not that rich for a given sample size; in other words, there exists some maximal $m$ for which $G_F(m) = m \ln 2$ is satisfied. This number is the VC dimension, denoted by $h$.
By a theorem, for $m > h$,
$$G_F(m) \le h \left( \ln \frac{m}{h} + 1 \right).$$
Therefore,
$$\ln N(F, m) = G_F(m) \le h \left( \ln \frac{m}{h} + 1 \right),$$
and in particular $\ln E_{2m}(N(F, 2m)) \le G_F(2m) \le h \left( \ln \frac{2m}{h} + 1 \right)$.
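A small numeric comparison (with an arbitrary choice of $h$ and values of $m > h$) of the trivial value $m \ln 2$ with the bound $h(\ln\frac{m}{h} + 1)$: beyond the VC dimension, the growth function can increase only logarithmically in $m$.

import math

h = 10
for m in (100, 1000, 10000, 100000):
    trivial = m * math.log(2)                       # ln(2^m), i.e., full shattering
    vc_bound = h * (math.log(m / h) + 1)            # bound on G_F(m) for m > h
    print(f"m = {m:>7}:  m*ln2 = {trivial:9.1f}   h*(ln(m/h)+1) = {vc_bound:7.1f}")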
So inequality (8) becomes
$$R[f] \le R_{\mathrm{emp}}[f] + \sqrt{ \frac{8}{m} \left( \ln E_{2m}(N(F, 2m)) + \ln \frac{4}{\delta} \right) }
\le R_{\mathrm{emp}}[f] + \sqrt{ \frac{8}{m} \left( h \left( \ln \frac{2m}{h} + 1 \right) + \ln \frac{4}{\delta} \right) }. \quad (10)$$
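A sketch of a calculator for the right-hand side of (10); the values of $R_{\mathrm{emp}}[f]$, $h$, $m$, and $\delta$ below are hypothetical and only illustrate how the bound shrinks as $m$ grows.

import math

def vc_bound(emp_risk, h, m, delta):
    # R_emp[f] + sqrt((8/m) * (h*(ln(2m/h) + 1) + ln(4/delta))), as in (10)
    capacity = h * (math.log(2 * m / h) + 1) + math.log(4 / delta)
    return emp_risk + math.sqrt(8 * capacity / m)

for m in (1_000, 10_000, 100_000):
    print(f"m = {m:>7}: R[f] <= {vc_bound(emp_risk=0.05, h=20, m=m, delta=0.05):.3f}")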
Picture for the bound (figure omitted; h denotes the VC dimension).
VC-Dimension: Example
Finding a Good Function Class
Theorem for VC dimension (canonical hyperplanes): if the points $x_1, \ldots, x_r$, with $\|x_i\| \le R$, are shattered by canonical hyperplane decision functions $f(x) = \mathrm{sgn}\langle w, x \rangle$ with $\|w\| \le \Lambda$, then $r \le R^2 \Lambda^2$.
Proof:
That means: given the knowledge that $x_1, \ldots, x_r$ are shattered, no matter what set of labels $y_1, \ldots, y_r$ is chosen, we can find a function that the learning machine can implement which reproduces this set of labels.
If the number $r$ of points shattered by canonical hyperplanes is constrained by the above inequality, then the VC dimension $h$ also satisfies it, since $h$ corresponds to the maximum number of points that can be shattered.
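A sketch of the standard argument behind this constraint (the omitted slide may present it differently): if $x_1, \ldots, x_r$ are shattered, then for every labeling $(y_1, \ldots, y_r)$ there is a $w$ with $\|w\| \le \Lambda$ and $y_i \langle w, x_i \rangle \ge 1$ for all $i$, hence
$$r \le \left\langle w, \sum_{i=1}^{r} y_i x_i \right\rangle \le \Lambda \left\| \sum_{i=1}^{r} y_i x_i \right\|.$$
Averaging over uniformly random labels,
$$E_y \left\| \sum_{i=1}^{r} y_i x_i \right\|^2 = \sum_{i=1}^{r} \|x_i\|^2 \le r R^2,$$
so some labeling satisfies $\| \sum_i y_i x_i \| \le \sqrt{r} R$. Combining the two displays gives $r \le \Lambda \sqrt{r} R$, i.e., $r \le R^2 \Lambda^2$.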
Margin Error Bound
Theorem (Margin Error Bound):
Consider the set of decision functions $f(x) = \mathrm{sgn}\langle w, x \rangle$ with $\|w\| \le \Lambda$ and $\|x\| \le R$, and let $\nu$ denote the margin error, i.e., the fraction of training examples with margin smaller than $\rho$.
For all distributions $P$ generating the data, with probability at least $1 - \delta$ over the drawing of the $m$ training samples, for any $\rho > 0$ and $\delta \in (0, 1)$, the probability that a test sample drawn from $P$ will be misclassified is bounded from above by
$$\nu + \sqrt{ \frac{c}{m} \left( \frac{R^2 \Lambda^2}{\rho^2} \ln^2 m + \ln \frac{1}{\delta} \right) }, \quad (11)$$
where $\nu$ is the margin error and $c$ is a constant.
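A Python sketch of the quantities appearing in (11) for a hypothetical linear decision function: the margin error $\nu$ and the capacity term $R^2 \Lambda^2 / \rho^2$ (the constant $c$ is left symbolic).

import numpy as np

rng = np.random.default_rng(3)
m = 500
X = rng.normal(size=(m, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
w = np.array([1.0, 0.5])                      # some decision function sgn<w, x>

rho = 0.2
margins = y * (X @ w)                         # functional margins y_i <w, x_i>
nu = np.mean(margins < rho)                   # margin error: fraction with margin < rho
R = np.max(np.linalg.norm(X, axis=1))         # radius of the data, ||x|| <= R
Lam = np.linalg.norm(w)                       # take Lambda = ||w||

print(f"margin error nu               = {nu:.3f}")
print(f"capacity term R^2*L^2/rho^2   = {(R * Lam / rho) ** 2:.1f}")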
Proof:
Recall inequality (10):
$$R[f] \le R_{\mathrm{emp}}[f] + \sqrt{ \frac{8}{m} \left( \ln E_{2m}(N(F, 2m)) + \ln \frac{4}{\delta} \right) }
\le R_{\mathrm{emp}}[f] + \sqrt{ \frac{8}{m} \left( h \left( \ln \frac{2m}{h} + 1 \right) + \ln \frac{4}{\delta} \right) }.$$
From the last theorem, the condition satisfied by $r$ shattered points is $(r \cdot 1)^2 \le r R^2 \Lambda^2$. Here, with the 1 replaced by $\rho$, we derive
$$(r \rho)^2 \le r R^2 \Lambda^2, \quad \text{i.e.,} \quad r \le \frac{R^2 \Lambda^2}{\rho^2}, \quad \text{and therefore} \quad h \le \frac{R^2 \Lambda^2}{\rho^2}. \quad (12)$$
Combining (10) and (12), the result of the theorem follows after bounding the remaining terms.