Transcript: Lecture 19

1. Stat 231. A.L. Yuille. Fall 2004.

Sketch of Proof of VC Convergence Bound.
Lecture notes for Stat 231: Pattern Recognition and Machine Learning
2. Expected Empirical Risk

The empirical risk for classifier a is its average loss on the N training samples.
There is a distribution that generates the samples, but we don't know it.
The loss function takes binary values (0 or 1).
By the Law of Large Numbers, the expected value of the empirical risk for classifier a is equal to the true risk of classifier a.
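In standard notation, a sketch of these quantities (the symbols P, L, f, R_emp, and R are this write-up's choices): the samples (x_i, y_i), i = 1, ..., N, are drawn i.i.d. from an unknown distribution P(x, y), and

  R_{emp}(a) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i; a)), \qquad L \in \{0, 1\},
  R(a) = \mathbb{E}_{P}[ L(y, f(x; a)) ],
  \mathbb{E}_{P}[ R_{emp}(a) ] = R(a).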
3. Fluctuations from Empirical Risk

We can bound the probability of large fluctuations of the empirical risk away from the true risk (Sanov's Theorem, lecture 6 notes, on the probability of rare events; Chernoff bound).
Note: the bound is independent of the classifier and independent of the distribution.
If we only have a fixed, finite number of classifiers, then the union bound lets us bound the probability that all the classifiers have their empirical risk close to their true risk.
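A sketch of the two bounds just described, in the usual Hoeffding/Chernoff form for a 0/1 loss (the constants and the symbol M for the number of classifiers are this write-up's choices):

  P( |R_{emp}(a) - R(a)| > \epsilon ) \le 2 e^{-2 N \epsilon^2},
  P( \max_{a = 1, \dots, M} |R_{emp}(a) - R(a)| > \epsilon ) \le 2 M e^{-2 N \epsilon^2}.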
4. Infinite Number of Classifiers

But if we have an infinite number of classifiers, e.g. the set of linear hyperplanes, then we don't know that all the classifiers will have small fluctuations of their risks.
The probability that any one classifier has a big risk fluctuation is small, but there are an infinite number of them, so one of them may have a gigantic risk fluctuation.
Vapnik's Idea: for a finite number of samples, the number of classifiers is effectively finite.
5. Infinite to Finite

You care about the worst fluctuation over all classifiers: the probability that the empirical risk of some classifier differs from its true risk by more than a given tolerance. Vapnik relates this to the probability that the empirical risks on two independent sample sets differ, where the subscripts N, N1, N2 denote empirical risks computed on different sets of N samples.
Intuition: if the empirical risk on N samples is close to the true risk, then it must be close to the empirical risk of any other choice of N samples. Conversely, if the two empirical risks are close, then they must be close to the true risk (cross-validation).
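A sketch of this symmetrization step in one common form (the factor 2 and the tolerance \epsilon/2 are the usual choices, valid once N\epsilon^2 is not too small; exact constants vary between presentations):

  P( \sup_a |R_N(a) - R(a)| > \epsilon ) \le 2 \, P( \sup_a |R_{N_1}(a) - R_{N_2}(a)| > \epsilon / 2 ).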
6. Finite Set of Samples

So we only have to bound the probability of large fluctuations on a set of 2N samples.
There are an infinite number of classifiers, but at most 2^{2N} possible dichotomies of the 2N samples.
So the effective number of classifiers must be smaller than 2^{2N}.
But this is still a very large number of classifiers. The union bound over all 2^{2N} dichotomies gives a probabilistic bound which tends to infinity with N and is useless for small tolerances.
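As a sketch of why this crude count fails (reusing the Hoeffding-style exponent above, which is an assumption of this write-up):

  2^{2N} \cdot 2 e^{-2 N \epsilon^2} = 2 \exp( N (2 \ln 2 - 2 \epsilon^2) ),

which diverges as N grows unless \epsilon^2 > \ln 2, i.e. unless the tolerance is close to 1.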
7. Shattering Coefficient

Define the shattering coefficient: the maximum number of functions in the classifier set that can be distinguished by their values on the samples (like the number of dichotomies).
Claim: the probability of a large worst-case fluctuation is then bounded by the expected shattering coefficient times an exponentially decaying factor, where the expectation is taken over random samples from the generating distribution (see the sketch below).
So we can bound the max fluctuation, provided the expected shattering coefficient doesn't grow exponentially with N.
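A sketch of the claim, writing the shattering coefficient of the classifier set F on a sample z_1, ..., z_{2N} as N_F(z_1, ..., z_{2N}) (this notation and the placeholder constants c_1, c_2 are this write-up's choices; the exact constants depend on the derivation):

  P( \sup_a |R_{N_1}(a) - R_{N_2}(a)| > \epsilon ) \le c_1 \, \mathbb{E}[ N_F(z_1, \dots, z_{2N}) ] \, e^{-c_2 N \epsilon^2}.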
8. Introduce a Confidence Level

Set the probability bound equal to a small confidence level and solve for the tolerance. Then, with probability greater than one minus that confidence level, the true risk of every classifier is within the tolerance of its empirical risk (see the sketch below).
The annealed entropy (the log of the expected shattering coefficient) is bounded by the growth function (the log of the maximum shattering coefficient).
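A sketch of this step, using the bound from the previous slide with placeholder constants c_1, c_2 and a confidence level \eta (symbols assumed here):

  H_{ann}(2N) = \ln \mathbb{E}[ N_F(z_1, \dots, z_{2N}) ] \le G(2N) = \ln \max_{z_1, \dots, z_{2N}} N_F(z_1, \dots, z_{2N}).

Setting c_1 \exp( H_{ann}(2N) - c_2 N \epsilon^2 ) = \eta and solving gives

  \epsilon(N, \eta) = \sqrt{ ( H_{ann}(2N) + \ln(c_1 / \eta) ) / ( c_2 N ) },

so with probability at least 1 - \eta, R(a) \le R_{emp}(a) + \epsilon(N, \eta) for every classifier a.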
9. Growth Function to VC Dimension

Growth Function: the log of the maximum number of dichotomies that the classifier set can realize on any set of N samples.
Vapnik shows that either the growth function equals N log 2 for every N, or there is a maximum N for which this holds. That maximum N is the VC dimension h (shattering means realizing all dichotomies).
Then, for N greater than h, the growth function grows only logarithmically in N (see the sketch below).
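In formulas (the notation G(N), N_F, and h is assumed here; the final inequality is the standard Vapnik-Chervonenkis/Sauer bound):

  G(N) = \ln \max_{z_1, \dots, z_N} N_F(z_1, \dots, z_N).

Either G(N) = N \ln 2 for all N, or there is a largest such N, the VC dimension h, and then for N > h:

  G(N) \le h ( 1 + \ln(N / h) ).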
10. VC Proof Summary

For each classifier, use Sanov's Theorem (Chernoff bound) to bound the probability of fluctuations of the risk (rare events).
But we need to bound the biggest fluctuation over all the classifiers (typically infinitely many), where a direct union bound no longer applies.
Reduce the problem to bounding fluctuations of the empirical risk on 2N datapoints.
This reduces the problem to an exponential number of effective classifiers, which is still too many.
But if N is sufficiently large that our classifiers can't shatter the data, then we can define the VC dimension and bound the fluctuations.
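Putting the steps together gives a bound of the familiar VC form (the constants c_1, c_2 are placeholders; Vapnik's exact constants depend on the version of the derivation): with probability at least 1 - \eta, for every classifier a,

  R(a) \le R_{emp}(a) + \sqrt{ ( h (1 + \ln(2N / h)) + \ln(c_1 / \eta) ) / ( c_2 N ) }.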