Transcript Slides
Distribution Specific Learnability –
an Open Problem in
Statistical Learning Theory
M. Hassan Zokaei-Ashtiani
December 2013
Overview
•
•
•
•
•
•
•
•
Statistical Learning Theory
No-Free-Lunch Theorem
Uniform Convergence
Fundamental Theorem of SLT
Learnability w.r.t. a distribution
Learnability w.r.t. a set of distributions
Semi-supervised learning, a case study
Summary
Statistical Learning Theory (SLT)
• Machine learning: automated detection of
meaningful patterns from data
• Analysis of learning methods
– Obtaining guarantees on the performance of
learning methods
– Overfitting…
• Statistical assumptions on data
– E.g., iid assumption
• Our focus: classification
Preliminaries and Notations
•
•
•
•
Domain set: 𝑋
Label set: Y = +1, −1
A labeling function 𝑙 is a function 𝑙: 𝑋 → 𝑌
Samples are generated iid from an unknown
distribution 𝐷, labeled by a labeling function 𝑓
• Training set: 𝑆 = 𝑥1 , 𝑦1 , . . , 𝑥𝑁 , 𝑦𝑁
• An algorithm 𝐴 takes the training sample 𝑆
and outputs a predictor
Preliminaries and Notations
• The expected error rate of a predictor ℎ is
– 𝐿𝐷,𝑓 (ℎ) = 𝑃𝑥~𝐷 [ℎ(𝑥) ≠ 𝑓(𝑥)]
• For an algorithm 𝐴
• 𝐿𝐷,𝑓 (𝐴) = 𝐿𝐷,𝑓 (𝐴(𝑆))
Empirical Risk Minimization (ERM)
• Given a training set 𝑆, the train-error is
defined by
– 𝐿𝑆 ℎ =
|{𝑖∈ 𝑚 :ℎ(𝑥𝑖 )≠𝑙(𝑥𝑖 )}|
𝑁
• ERM minimizes the train-error over a
predefined set of predictors
– 𝑎𝑟𝑔𝑚𝑖𝑛ℎ∈𝐻 𝐿𝑆 (ℎ)
PAC Model (Leslie Valiant)
• Given a train-set 𝑆, can we guarantee anything
about the performance of 𝐴 on?
• For example, can we say
– 𝐿𝐷,𝑓 𝐴(𝑆) < 𝜖?
– Even if we see that 𝐿𝑆 𝐴 𝑆
= 0?
PAC Model (Leslie Valiant)
• Given a train-set 𝑆, can we guarantee anything
about the performance of 𝐴 on?
• For example, can we say
– 𝐿𝐷,𝑓 𝐴(𝑆) < 𝜖?
– Even if we see that 𝐿𝑆 𝐴 𝑆
• Usually impossible!
= 0?
PAC Model
• We may be able to guarantee something like
this:
– With probability at least 1 − 𝛿, 𝐿𝐷,𝑓 𝐴(𝑆) < 𝜖
• Probably Approximately Correct (PAC)
• How can we do this?
Concentration Inequalities
• Provide bounds on how a random variable
deviates from a value (such as its expectation)
• Hoeffding’s inequality
– For a random variable 𝑋 ∈ [0,1]
– Pr 𝑋𝑁 − 𝐸[𝑋] > 𝜖 < exp(−2𝑁𝜖 2 )
– Where 𝑋𝑁 =
𝑥1 +𝑥2 +⋯+𝑥𝑁
𝑁
and 𝑥𝑖 s are independent
An Example: Cross Validation
• Assume that using an algorithm 𝐴, and a trainset 𝑆, we have obtained ℎ = 𝐴(𝑆)
• Now we are given a new test-set 𝑆𝑇 (with
𝑆𝑇 = 𝑁) which is independent of 𝑆
• Using Hoeffding’s inequality
• Pr 𝐿𝑆𝑇 (𝐴(𝑆)) − 𝐿𝐷,𝑓 (𝐴(𝑆)) > 𝜖 < exp −2𝑁𝜖 2
– Or in other words, with probability at least 1 − 𝛿
2
• 𝐿𝐷,𝑓 𝐴 𝑆
< 𝐿𝑆𝑇 𝐴 𝑆
+
𝑙𝑜𝑔𝛿
2𝑁
PAC Setting
• So if we have enough samples, we can
estimate the error rate of a specific predictor
• The bound seems to be very useful, with
exponential decay
• Can we test all of the possible predictors, and
select the best predictor?
– ERM?
No-Free-Lunch Theorem
(simplified version)
• Let 𝑋 be an infinite domain set
• Let 𝐴 be “any” learning algorithm
• Then there exists (𝐷, 𝑓) such that
– 𝐿𝐷 𝑓 = 0
– With probability of at least 1/7 (over the choice of
𝑆):
𝐿𝐷 𝐴 𝑆
> 1/8
• Independent of the number of train samples!
No-Free-Lunch Theorem
• NFL suggests that learning without some form
of prior knowledge is impossible
• The most natural way to incorporate prior
knowledge is to use a hypothesis class
• We assume that we have a hypothesis class 𝐻,
for which there exists h∗ ∈ 𝐻 with low error
PAC Learnability
• Let 𝑋 be a domain and 𝐻 be a set of
predictors (hypothesis class) over it
• We say that 𝐻 is PAC learnable if and only if
– There exists an algorithm 𝐴 that
– For every 𝜖, 𝛿 ∈ (0,1) and unknown (𝐷, 𝑓)
– There exists 𝑚𝐻 𝜖, 𝛿 < ∞ such that if we
generate 𝑚 > 𝑚𝐻 (𝜖, 𝛿) samples 𝑆 according to 𝐷
and label it by 𝑓
– Then Pr 𝐿𝐷,𝑓 𝐴 𝑆 − 𝐿𝐷,𝑓 (ℎ∗ ) > 𝜖 < 𝛿
Uniform Convergence
• 𝐻 has uniform convergence property
–
–
–
–
–
If there exists 𝑚𝑈𝐶 (𝜖, 𝛿)
Such that for all 𝜖, 𝛿 ∈ 0,1 and ℎ ∈ 𝐻
If we generate a sample 𝑆 with 𝑆 > 𝑚𝑈𝐶 𝜖, 𝛿
then we have:
For all ℎ ∈ 𝐻
• 𝑃 |𝐿𝑆 ℎ − 𝐿𝐷,𝑓 ℎ | > 𝜖 < 𝛿
– Uniform convergence is sufficient for learning
• using ERM
– It can be proved that it is also necessary!
Fundamental Theorem of ML
• These statements about 𝐻 are equivalent
– 𝐻 is PAC learnable
– 𝐻 has uniform convergence property
– ERM algorithm on 𝐻 will “work”
– 𝐻 has finite VC-Dimension
Prior Knowledge
• In the classic learning theory, our prior
knowledge is mostly used in selecting 𝐻
• But sometimes, we have some knowledge
about the probability distribution
– For example, we may know the marginal
distribution 𝐷 a priori
Learnability w.r.t. a Single Distribution
• Assume we have 𝐷 a priori
• PAC learnability of (𝐻, 𝐷)?
– 𝐻 should have a finite 𝜖-covers w.r.t. 𝐷
Learnability w.r.t. a Single Distribution
• Assume we have 𝐷 a priori
• PAC learnability of (𝐻, 𝐷)?
– 𝐻 should have a finite 𝜖-covers w.r.t. 𝐷
• Not realistic
Learnability w.r.t. a Set of Distributions
• Assume that we know 𝐷 ∈ Ω
• What is the necessary and sufficient condition
for learnability of (𝐻, Ω)
Learnability w.r.t. a Set of Distributions
• Assume that we know 𝐷 ∈ Ω
• What is the necessary and sufficient condition
for learnability of (𝐻, Ω)
• Still an open problem!
Learnability w.r.t. a Set of Distributions
• Assume that we know 𝐷 ∈ Ω
• What is the necessary and sufficient condition
for learnability of (𝐻, Ω)
• Still an open problem!
• Conjecture: 𝐻 should have an 𝜖-cover w.r.t. Ω
with finite VC-dimension
• More general version: ℎ, 𝐷 ∈ 𝑅
Learnability w.r.t. a Set of Distributions
• Why is it important?
– Theoretical aspects
– Unexplained phenomena in ML
•
•
•
•
Semi-supervised learning
Domain adaptation
Deep learning methods
…
Semi-Supervised Learning
• We are given both some labeled samples and a
lot of unlabeled samples
• In practice, unlabeled samples can help a lot
• In theory, worst-case analysis shows that
unlabeled data is not helpful [1]!
• Theory-practice gap!
• Researchers are looking for useful and realistic
assumptions to explain this
•
[1] Ben-david, Shai, Tyler Lu, and Dávid Pál. "D.: Does unlabeled data provably help? worstcase analysis of the sample complexity of semi-supervised learning." In: 21st Annual
Conference on Learning Theory. 2008.
Summary
• In classic ML theory, the prior knowledge is only used
in selecting 𝐻
– For this setting, necessary and sufficient conditions for
learnability are known
• Other forms of prior knowledge, including knowledge
of distribution may be available
• Learnability w.r.t. a set of distributions is an open
problem
– May help in understanding semi-supervised learning,
domain adaptation, etc.
• May drastically change our current models for learning