Transcript week3_1

Machine Learning
Week 3
Lecture 1
Programming Competition
http://users-cs.au.dk/jasn/algocomp/programming_contest.pdf
Hand In
Student Comparison
After 5 minutes:
Train:
MCC: 0.9481 0.9458 0.8737 0.8686 0.9017 0.8407 0.9316 0.9088 0.8493 0.8644
[[5693 0 23 19 13 44 48 6 68 9]
[ 1 6508 33 33 6 38 7 19 87 10]
[ 47 64 5192 100 123 23 103 108 163 35]
[ 27 32 142 5355 5 252 39 70 141 68]
[ 12 31 45 6 5378 8 61 13 43 245]
[ 91 46 44 220 70 4522 109 29 210 80]
[ 42 18 55 7 54 77 5605 3 54 3]
[ 42 63 108 29 76 10 4 5718 18 197]
[ 29 139 84 154 25 146 48 31 5095 100]
[ 38 31 37 91 211 43 4 192 63 5239]]
Test:
MCC: 0.9525 0.9643 0.8855 0.8848 0.9060 0.8585 0.9308 0.9068 0.8543 0.8777
[[ 957 0 2 2 0 5 9 1 4 0]
[ 0 1102 2 4 0 2 4 1 20 0]
[ 11 4 903 15 15 1 14 18 42 9]
[ 4 0 21 909 0 32 3 11 20 10]
[ 1 2 6 1 913 1 11 1 8 38]
[ 11 2 5 41 11 750 16 11 38 7]
[ 13 3 4 3 15 11 906 1 2 0]
[ 3 12 25 7 9 0 0 936 3 33]
[ 9 8 9 25 8 20 12 15 859 9]
[ 11 8 5 11 42 10 0 20 8 894]]
Today
• Recap
• Learning with infinite hypothesis sets
• Bias Variance
• The end of learning theory…
Recap
The Test Set – Take a look at Eout
Fixed hypothesis h, N independent data points, and any ε>0
• Split your data into two parts, D-train and D-test
• Train on D-train and select hypothesis h
• Test h on D-test, giving the error E_test(h)
• Apply the Hoeffding bound to E_test(h): P[ |E_test(h) − E_out(h)| > ε ] ≤ 2e^(−2ε²N) (numeric sketch below)
• D-test cannot be used for anything else
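A quick numeric illustration (not from the slides; N, ε and δ are made-up example values) of how tight the test-set Hoeffding bound is:

% Hoeffding bound for a single fixed hypothesis h evaluated on D-test:
% P[ |E_test(h) - E_out(h)| > epsilon ] <= 2*exp(-2*epsilon^2*N)
N       = 1000;                       % size of D-test (example value)
epsilon = 0.05;                       % tolerance (example value)
bound = 2*exp(-2*epsilon^2*N)         % upper bound on the deviation probability
% Equivalently, with probability at least 1 - delta:
% |E_test(h) - E_out(h)| <= sqrt(log(2/delta)/(2*N))
delta = 0.05;
width = sqrt(log(2/delta)/(2*N))      % half-width of the confidence interval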
Generalization Bound
Goal: Extend to infinite hypothesis spaces
Dichotomies
Fixed set of N points X = (x1,…,xN) and hypothesis set H.
A dichotomy is a bit string of length N: the labels (h(x1),…,h(xN)) ∈ {−1,+1}^N.
Each h ∈ H gives a dichotomy, but many h give the same one.
Growth, Shattering, Breaks
The growth function m_H(N) is the maximum number of dichotomies H can generate on any N points.
If H generates all 2^N dichotomies on (x1,…,xN), then we say that H shatters (x1,…,xN).
If no data set of size k can be shattered by H, then k is a break point for H.
Revisit Examples
• Positive Rays h(x) = sign(x − a): cannot shatter 2 points, so 2 is a break point (brute-force check in the sketch below)
• Intervals [a1, a2]: cannot shatter 3 points, so 3 is a break point
• Convex sets: any point set on a circle can be shattered, so there is no break point
[Figures: example hypotheses and an impossible dichotomy for each case]
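A minimal brute-force sketch (not from the slides) that counts the dichotomies positive rays generate on N random points; it comes out to N + 1, far below 2^N:

% Count the dichotomies that positive rays h(x) = sign(x - a) generate on N points.
N = 5;
x = sort(rand(1, N));                                        % N random points on the line
thresholds  = [x(1)-1, (x(1:end-1)+x(2:end))/2, x(end)+1];   % one threshold a per gap
dichotomies = [];
for a = thresholds
    d = sign(x - a);
    d(d == 0) = 1;                                           % break ties as +1
    dichotomies = unique([dichotomies; d], 'rows');
end
size(dichotomies, 1)                                         % prints N + 1 = 6, versus 2^N = 32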
2D Hyperplanes: 3 is not a break point
We can shatter some point set of size 3 (three points not on a line).
2D Hyperplanes: 4 is a break point
Impossible dichotomies for any 4 points:
• Triangle with a point inside: label the inner point opposite to the three corners.
• 3 points on a line: label the middle point opposite to the two outer ones.
• Else (4 points in convex position): label opposite corners alike (the XOR pattern).
Very Important Theorem
If H has a break point k, then the growth function is polynomial:
m_H(N) ≤ Σ_{i=0}^{k−1} (N choose i)
which is a polynomial in N of maximum degree k − 1.
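A small numeric sketch (not from the slides) of how much a break point buys us, using k = 4 (2D hyperplanes) as the assumed example:

% Sauer bound: with break point k, m_H(N) <= sum_{i=0}^{k-1} nchoosek(N, i),
% a polynomial of degree k-1, versus 2^N when there is no break point.
k = 4;                          % break point of 2D hyperplanes
N = 20;
bound = 0;
for i = 0:(k-1)
    bound = bound + nchoosek(N, i);
end
bound                           % 1 + 20 + 190 + 1140 = 1351
2^N                             % 1048576, exponentially larger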
Growth Function Generalization Bound
VC Theorem: with probability at least 1 − δ,
E_out(g) ≤ E_in(g) + sqrt( (8/N) · ln( 4·m_H(2N) / δ ) )
Proof: Book Appendix (intuition in the book).
If the growth function is polynomial, the second term goes to 0 as N grows,
so we can learn with infinite hypothesis sets!!!
VC Dimension
The VC dimension of a hypothesis set H
is the maximal number of points it can shatter,
i.e. the largest N such that m_H(N) = 2^N.
It is denoted dvc.
It equals the smallest break point minus one.
Revisit Examples (VC Dim.)
• Positive Rays: 1
• Intervals: 2
• Convex sets: ∞ (unbreakable)
• 2D Hyperplanes: 3
VC Dimension For Linear Classification
The VC dimension for d-dimensional inputs (d+1 parameters, including the bias) is d+1.
Proof coming up:
Show dvc ≥ d+1
Show dvc < d+2
dvc ≥ d+1
Find a set of d+1 points we can shatter.
Idea: make the points (as vectors, with the bias coordinate x0 = 1) linearly independent
by using one dimension for each point.
X is a (d+1)×(d+1) matrix, rows = points, and X is invertible.
dvc ≥ d+1
Consider any dichotomy y ∈ {−1,+1}^(d+1).
Find θ such that sign(Xθ) = y.
Solve Xθ = y instead: θ = X⁻¹y exists since X is invertible, and then sign(Xθ) = sign(y) = y.
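A small sketch (not from the slides) of this construction for hyperplanes, with d = 4 and a random dichotomy as example values:

% Shatter d+1 points with a hyperplane: rows of X are the points,
% with the bias coordinate x0 = 1 in the first column.
d = 4;
X = [ones(d+1, 1), [zeros(1, d); eye(d)]];   % (d+1)x(d+1), invertible
y = sign(randn(d+1, 1));                     % an arbitrary dichotomy
y(y == 0) = 1;
theta = X \ y;                               % solve X*theta = y
all(sign(X*theta) == y)                      % prints 1: the dichotomy is realized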
dvc < d+2
For any d+2 points we must prove there is a dichotomy hyperplanes cannot capture.
Consider x1, …, x(d+2), each a (d+1)-dimensional point (vector, including the bias coordinate).
They must be linearly dependent (more vectors than dimensions):
x_j = Σ_{i≠j} a_i·x_i  with not all a_i zero (ignore the zero terms).
dvc < d+2
Dichotomy: y_i = sign(a_i) for the points with a_i ≠ 0, and y_j = −1.
Claim: this dichotomy is impossible to capture with hyperplanes.
If y_i = sign(a_i) for all i ≠ j, then a_i·θᵀx_i > 0, so θᵀx_j = Σ_{i≠j} a_i·θᵀx_i > 0,
which forces y_j = +1.
dvc = d+1 For Hyperplanes
dvc = d + 1 = the number of free parameters
Vapnik-Chervonenkis (VC)
Generalization Bound
For any tolerance δ > 0, with probability at least 1 − δ:
E_out(g) ≤ E_in(g) + sqrt( (8/N) · ln( 4·m_H(2N) / δ ) )
Quote Book “The VC Generalization bound is the most
important mathematical result in the theory of learning”
VC Generalization Bound
• Independent of learning algorithm, target
function, P(x), and out of sample error
– General Indeed
• We need to bound the growth function for each of our hypothesis spaces. For hyperplanes we showed dvc = #free parameters (d+1).
Exercise 2.5
A simple learning model with growth function m_H(N) = N + 1, and N = 100.
What is the probability that E_in is within 0.1 of E_out, according to the VC bound?
The bound gives δ = 4·m_H(2N)·e^(−ε²N/8) = 4·201·e^(−0.125) ≈ 709,
so it "happens with probability at least 1 − δ < 0", i.e. ridiculous.
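The computation behind that number, assuming (as in the book's Exercise 2.5) the growth function m_H(N) = N + 1:

% VC bound: P[|E_in - E_out| > epsilon] <= 4*m_H(2N)*exp(-epsilon^2*N/8)
N       = 100;
epsilon = 0.1;
delta   = 4*(2*N + 1)*exp(-epsilon^2*N/8)   % about 709: the bound is vacuous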
Cost of Generality
• Growth Function was really worst case
• Independent of P(x), target, out of sample
error
Sample Complexity
Fix tolerance δ (success probability ≥ 1 − δ).
Consider generalization error at most ε. How big must N be? From the VC bound:
N ≥ (8/ε²) · ln( 4·m_H(2N) / δ )
Upper bound it more by using the VC dimension polynomial m_H(2N) ≤ (2N)^dvc + 1:
N ≥ (8/ε²) · ln( 4·((2N)^dvc + 1) / δ )
Sampling Complexity
Plug in ε = δ = 0.1:
N ≥ 800 · ln( 40·((2N)^dvc + 1) )
Let's plot:
function sc_vcdim()
  % Sample complexity from the VC bound with epsilon = delta = 0.1:
  % N >= 800*log(40*((2N)^dvc + 1)), plotted for dvc = 3,...,9.
  dat = 5000:2500:100000;                   % candidate values of N
  hold off;
  plot(dat, dat, 'r-', 'linewidth', 3);     % reference line N = N
  hold on
  for i = 3:9                               % VC dimension dvc
    tmp = 800*log(40*((2.*dat).^i + 1));    % required N for this dvc
    plot(dat, tmp, 'b-', 'linewidth', 3);
  end
end
Sampling Complexity
Book statement: the theory suggests N proportional to dvc, with a large constant (about 10,000·dvc).
In practice, a rule of thumb is N ≥ 10·dvc.
VC Interpretation
• We can learn with infinite hypothesis sets.
• VC Dimension captures the effective number of parameters / degrees of freedom
In Sample Error + Model Complexity
Rewriting the VC bound: E_out(g) ≤ E_in(g) + Ω(N, H, δ),
where Ω(N, H, δ) = sqrt( (8/N) · ln( 4·m_H(2N) / δ ) ) is the model complexity penalty.
As a figure:
[Plot of error versus VC dimension: in-sample error decreases, model complexity increases,
and the out-of-sample error is their sum. Balance these at the minimum.]
Model Selection
t Models m1,…,mt
Which is better?
Ein(m1) + Ω(m1)
Ein(m2) + Ω(m2)
⋮
Ein(mt) + Ω(mt)
Pick the minimum one
Problem: slack in Ω (the penalty is a loose upper bound); a small sketch of the selection rule follows.
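A minimal sketch (not from the slides; the VC dimensions and in-sample errors are made-up example values) of picking the model that minimizes Ein + Ω:

% Model selection by penalized in-sample error, Ein(m) + Omega(m).
N     = 1000;  delta = 0.05;
dvc   = [2, 5, 10, 20];                              % VC dimension of each model (assumed)
Ein   = [0.20, 0.12, 0.09, 0.08];                    % in-sample errors (assumed)
Omega = sqrt(8/N * log(4*((2*N).^dvc + 1)/delta));   % complexity penalty from the VC bound
[~, best] = min(Ein + Omega);
best                                                 % index of the selected model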
Vapnik-Chervonenkis (VC) Theorem
For any δ > 0, with probability at least 1 − δ:
E_out(g) ≤ E_in(g) + sqrt( (8/N) · ln( 4·m_H(2N) / δ ) )
The test set estimate (M = 1 in the Hoeffding bound) is a lot tighter:
E_out(h) ≤ E_test(h) + sqrt( ln(2/δ) / (2N) )
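A quick comparison of the two bound widths (not from the slides; N, δ and dvc are example values):

% Width of the VC bound versus the test-set (M = 1) Hoeffding bound.
N = 10000;  delta = 0.05;  dvc = 10;
vc_width   = sqrt(8/N * log(4*((2*N)^dvc + 1)/delta))   % roughly 0.29
test_width = sqrt(log(2/delta)/(2*N))                   % roughly 0.014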
Learning Summary
• There is theory for regression as well.
• We will not go there.
• Move on to bias variance.
• Last learning theory in this course.
Bias Variance Decomposition
• Consider the least squares error measure again and see if we can understand the out-of-sample error.
• For simplicity, assume our target function is noiseless, i.e. it is an exact function of the input.
Experiments
• Take two model classes: ax + b and cx² + dx + e
• A target function defined for ¼ < x < ¾
• Repeatedly pick 3 random data points (x, target(x)) and fit our models
• Plot it and see (a sketch of the experiment follows below)
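A sketch of that experiment (not from the slides): the target sin(2πx) is a stand-in, since the slides do not say which target function was used; the interval and the 3 data points follow the bullets above.

% Repeatedly fit a line and a parabola to 3 random points from the target.
target = @(x) sin(2*pi*x);                    % assumed stand-in target
xs = linspace(0.25, 0.75, 200);               % the interval 1/4 < x < 3/4
hold off; plot(xs, target(xs), 'k-', 'linewidth', 2); hold on
for rep = 1:20
    x = 0.25 + 0.5*rand(1, 3);                % 3 random data points
    y = target(x);
    plot(xs, polyval(polyfit(x, y, 1), xs), 'b-');   % fit a*x + b
    plot(xs, polyval(polyfit(x, y, 2), xs), 'r-');   % fit c*x^2 + d*x + e
end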
Bias Variance
• The out-of-sample error we get depends on the hypothesis.
• The hypothesis is the result of the learning algorithm, which is a function of the training data.
• The training data affects the out-of-sample error.
• Think of the data as a random variable and analyze what happens.
• What happens if we repeatedly sample data and run our learning algorithm?
Notation
g^(D) = the hypothesis learned on data set D
Bias Variance
Let ḡ(x) = E_D[ g^(D)(x) ] be the average hypothesis over data sets D.
E_D[ E_out(g^(D)) ] = E_x[ E_D[ (g^(D)(x) − f(x))² ] ]
 = E_x[ E_D[ (g^(D)(x) − ḡ(x))² ] + (ḡ(x) − f(x))² ]
Variance: E_x[ E_D[ (g^(D)(x) − ḡ(x))² ] ], how much the learned hypothesis varies with the data set.
Bias: E_x[ (ḡ(x) − f(x))² ], how far the average hypothesis is from the target.
Bias Variance
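A numerical sketch (not from the slides) that estimates the bias and variance of the line model ax + b under the same assumed setup as the experiment sketch above:

% Estimate bias and variance for the line model by averaging over many data sets.
target = @(x) sin(2*pi*x);                    % assumed stand-in target
xs = linspace(0.25, 0.75, 200);
R  = 2000;                                    % number of data sets D
preds = zeros(R, numel(xs));
for r = 1:R
    x = 0.25 + 0.5*rand(1, 3);                % 3 random points per data set
    preds(r, :) = polyval(polyfit(x, target(x), 1), xs);
end
gbar     = mean(preds, 1);                                    % average hypothesis g-bar(x)
bias     = mean((gbar - target(xs)).^2)                       % E_x[(g-bar(x) - f(x))^2]
variance = mean(mean((preds - repmat(gbar, R, 1)).^2, 1))     % E_x[E_D[(g^(D)(x) - g-bar(x))^2]]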
Learning Curves
Plot of the in-sample and out-of-sample error as
a function of the training set size N (sketch below).
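A sketch of how such a learning curve could be simulated (not from the slides; the target, noise level and linear regression model are all assumptions):

% Simulated learning curve: average Ein and Eout of least squares fits versus N.
f    = @(x) 2*x + 1;                                         % assumed target
Ns   = 5:5:100;  reps = 200;
Ein  = zeros(size(Ns));  Eout = zeros(size(Ns));
for j = 1:numel(Ns)
    N = Ns(j);
    for r = 1:reps
        x  = rand(N, 1);     y  = f(x)  + 0.3*randn(N, 1);      % training data
        xt = rand(1000, 1);  yt = f(xt) + 0.3*randn(1000, 1);   % test data
        w  = [ones(N, 1), x] \ y;                               % least squares fit
        Ein(j)  = Ein(j)  + mean(([ones(N, 1), x]*w - y).^2)/reps;
        Eout(j) = Eout(j) + mean(([ones(1000, 1), xt]*w - yt).^2)/reps;
    end
end
plot(Ns, Ein, 'b-', Ns, Eout, 'r-', 'linewidth', 2);
legend('E_{in}', 'E_{out}');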