F13-02-BayesLearnRevised


Other kinds of supervised learning
• Reinforcement learning – learning a policy for influencing or reacting to an environment
  – No supervised output, but delayed rewards
  – Credit assignment problem
  – Game playing, robot in a maze, etc.
• On-line learning: predict on each instance in turn
• Semi-supervised learning uses both labeled and unlabeled data
• Active learning – request labels for particular instances
1
Unsupervised Learning
• Learning “what normally happens”
• No labels
• Clustering: grouping similar instances
• Example applications
  – Segmentation in customer relationship mgmt
  – Image compression: color quantization
  – Bioinformatics: learning motifs
  – Identifying unusual airplane landings
• Deep learning – learn the “features”
2
Bayesian Learning
(modified from Yahoo class)
3
Probability Review
• Based on experiment: outcome space Ω
containing all possible atomic outcomes
• Each outcome (atom) has probability density or
mass
• Event is a subset of Ω
• P(event) is sum (or integral) over event’s
atoms
• Random variable V maps Ω to (usually) R
• V=value is an event; the probabilities P(V=value) give V’s distribution
4
Example
• Roll a fair 6-sided die and then flip that
many fair coins.
• What is Ω?
5
Example
• Roll a fair 6-sided die and then flip that
many fair coins.
• What is Ω?
• Ω={(1,H), (1,T), (2, HH), (2, HT), …,
(6,TTTTTT)}
• Number of heads is a random variable
• What is expected number of heads?
Expectation of V is E(V) = ∑_{atoms a} P(a)·V(a)
6
• Events A and B independent iff:
P(A and B) = P(A) • P(B)
• Probability of A given B, (cond. probability)
P(A | B) = P(A and B) ⁄ P(B)
• So,
P(A and B) = P(A | B) • P(B)
P(B and A) = P(B | A) • P(A)
• Bayes Rule:
P(A | B) = P(B | A) P(A) ⁄ P(B)
7
Example
What is expected number of heads?
def. expectation: E(V) = ∑_{atoms a} P(a)·V(a)
• Expectations add: E(V1 + V2) = E(V1) + E(V2)
• Rule of conditioning (sum rule): if events e1, e2, …, ek partition Ω then:
  P(event) = ∑_i P(ei) P(event | ei) = ∑_i P(ei and event)
  E(randVar) = ∑_i P(ei) E(randVar | ei)
8

Expected number of heads
• E(# heads) = ∑_{r=1..6} P(roll = r) · E(# heads | roll = r)
             = (1/6) · (1 + 2 + 3 + 4 + 5 + 6)/2
             = 21/12 = 1.75
9
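The sum above can be checked by brute force, enumerating every atom of the compound experiment (a die roll followed by that many coin flips) and weighting the number of heads by the atom’s probability. This is just a sketch of the calculation, not part of the original slides.

```python
from itertools import product

# Enumerate the atoms (roll, flip sequence) of the compound experiment:
# roll a fair 6-sided die, then flip that many fair coins.
expected_heads = 0.0
for roll in range(1, 7):
    for flips in product("HT", repeat=roll):
        p_atom = (1 / 6) * (1 / 2) ** roll   # P(roll) * P(this flip sequence)
        expected_heads += p_atom * flips.count("H")

print(expected_heads)  # ≈ 1.75
```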
Bayes Rule for Learning (RVs)
• Assume joint distribution P(X=x, T=t)
• Want P(t | x) to predict label (target) for
new instance x (here (x,t) is an atom)
• P(t | x) = P(x | t) • P(t) ⁄ P(x) Bayes rule
• P(t | x) proportional to P(x | t) • P(t)
• From data, learn P(x | t) and P(t)
• Predict label t with largest product
10
How to learn probabilities
• Street hustler takes bets on coin flips
• You see HTH, what is probability that
next flip is H? What is θ=P(H) for coin?
(don’t be shy)
11
How to learn probabilities
• Street hustler takes bets on coin flips
• You see HTH, what is probability that
next flip is H? What is θ=P(H) for coin?
• What is the experiment?
– (What is x and t in the learning context?)
12
Frequentist
• Street hustler takes bets on coin
• You see event HTH, what is probability that
next flip is H?
• Frequentist: 2/3 maximizes likelihood,
the θ=P(H) value maximizing θ2(1-θ)
solve “derivative = 0” for θ.
• Likelihood function L(θ | HTH) = P(HTH | θ), viewed as a function of θ, vs. the single probability P(HTH | θ=2/3)
13
Bayesian
• Have prior distribution P(θ) on θ =P(H);
two phase experiment, pick θ then flip 3
times
• Posterior on θ is the distribution P(θ | HTH) = P(HTH | θ) • P(θ) / P(HTH)
• In this case,
θ2(1-θ)P(θ) / normalization
14
Bayesian examples
• Prior: P(θ=0) = P( θ=1/2) = P(θ=1) = 1/3:
θ2(1-θ) P(θ) is 0, 1/24, and 0 for these
three cases, posterior P(θ=1/2 | HTH) = 1
• Prior density: P(θ) = 1 for 0≤θ≤1:
θ2(1-θ) P(θ) is θ2(1-θ) for 0≤θ≤1
posterior P(θ | HTH) is 12 θ2(1-θ)
15
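The posterior 12 θ²(1−θ) and its summary statistics from the next slide can be verified numerically. A minimal sketch using a grid approximation of the integral (the grid size is an arbitrary choice):

```python
import numpy as np

# Grid check of the posterior for theta = P(H) after observing HTH,
# starting from the uniform prior P(theta) = 1 on [0, 1].
theta = np.linspace(0.0, 1.0, 100001)
d = theta[1] - theta[0]
likelihood = theta**2 * (1 - theta)              # proportional to P(HTH | theta)
posterior = likelihood / (likelihood.sum() * d)  # normalizes to 12 theta^2 (1 - theta)

mode = theta[np.argmax(posterior)]               # ≈ 2/3, the ML/MAP estimate
mean = (theta * posterior).sum() * d             # ≈ 3/5, Laplace's (2+1)/(3+2)
print(mode, mean)
```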
Posterior plot
(plot of the posterior density 12θ²(1−θ) not reproduced)
• Max at 2/3
• Average is 3/5
• 3/5 = (2+1)/(3+2)
• Not a coincidence! Laplace’s rule of succession – add one fictitious observation of each class
16
Bayes’ Estimation
• Treat parameter θ as a random variable with the prior distribution P(θ); use fixed sample X (RV S)
• Maximum Likelihood (ML):
  – θML = argmaxθ’ P(S=X | θ=θ’) = argmaxθ’ L(θ’ | S=X)
• Maximum a Posteriori (MAP):
  – θMAP = argmaxθ’ P(θ=θ’ | S=X) = argmaxθ’ P(S=X | θ=θ’) P(θ=θ’) / P(S=X)
• Mean a’Posteriori: θmean = E[θ | S=X] = ∫ θ’ P(θ=θ’ | S=X) dθ’
• Predictive distribution (full Bayes):
  P(T=t | S=X) = ∫ P(T=t | θ=θ’) P(θ=θ’ | S=X) dθ’
17
Use for learning (RVs)
• Draw enough data so that P(T=t | X=x)
estimated for every (x,t) pair
• This takes lots of data – curse of
dimensionality …rote learning
• Another approach: a hypothesis class
• Think of each hypothesis h as a model
for generating the training set X of (x,t)
pairs
18
Compound Experiment
• Prior P(H=h) on hypothesis space
• Model gives P(X=X | H=h) (here the data X is both t’s and x’s)
• Joint experiment (if data i.i.d. given h):
  P({(xi,ti)}, h) = P(h) ∏i ( P(xi | h) P(ti | xi, h) )
  – Factors many ways, but the above is useful
  – If xi ~ P(x), independent of h, then we can factor out the P(xi)’s:
    P({(xi,ti)}, h) = P(h) (∏i P(xi)) (∏i P(ti | xi, h))
    so P({(xi,ti)}, h) is proportional to P(h) ∏i P(ti | xi, h)
• Each hypothesis gives the same probability to the measurements, so we can treat the x’s as fixed and concentrate on just the t’s
19
• Prior P(h) on hypothesis space
• Model gives P(X | h) (X is the t’s given x’s, or both t’s and x’s)
• Posterior P(h | X) = P(X | h) P(h) / P(X)
• Max. likelihood: h having max P(X | h)
• Max. a’posteriori: h having max P(h | X)
• Predictive distribution (full Bayes): predict with average of h’s weighted by posteriors P(h | X)
20
Movie example
21
Discriminative and Generative
models
• Generative model: P((x, t) | h)
– Tells how to generate examples (both
instances and labels)
• Discriminative model: P(t | h, x)
– Tells how to create labels from instances,
like movie example
• Discriminant function: predict f(x), often f(x) = argmax_t f_t(x)
22
Tasty Coffee
(rectangles in discrete strength-sugar plane)
• Cups of coffee have discrete strength
and sugar features x
• Want to learn significant other’s tasty
(t=1) cups of coffee
• Assume rectangle in discrete strength-sugar plane, h(x) = 1 if x inside h
• Have set X of (x,t) examples
23
Ex. 1: uniform prior, noise-free
(rectangles in discrete strength-sugar plane)
• Prior P(h) uniform over rectangles and
assume x’s independent of h (focus on labels)
• P(labels | h,X) = 1 if t=h(x) for each (x,t) in X
0 otherwise
• For each h, Posterior P(h | labels,X)
= P(labels | h, X) P(h | X) / P(labels | X)
= P(labels | h, X) P(h) / normalization
is uniform over consistent rectangles
24
Ex. 2: Coffee w. label noise
• Prior P(h) uniform, i.i.d. label noise rate 1/3
• Fix an h and the instances (x’s) – concentrate on the labels t
• c = # examples in sample X where t = h(x)
• w = # examples where h is wrong
• P(labels | h) = (2/3)^c • (1/3)^w = (2/3)^(c+w) • (1/2)^w
• Posterior P(h | labels) = P(labels | h) P(h) / P(labels)
  proportional to (q/(1−q))^w, which drops as (1/2)^w since q = 1/3
25
Ex 3: coffee w. non-uniform prior
• Prior P(h) proportional to (1/2)^area(h), noise rate q = 1/3
• c = # examples in sample X where t = h(x)
• w = # examples where h is wrong
• P(labels | h) = (2/3)^c • (1/3)^w = (2/3)^(c+w) • (1/2)^w
• Posterior P(h | labels) = P(labels | h) P(h) / P(labels)
  proportional to (1/2)^w • (1/2)^area(h)
• Minimum-mistake h is maximum likelihood
• Minimum mistake+area h is maximum a’posteriori
• Full Bayes: vote h’s with weights (1/2)^(area(h)+w)
26
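The ML-vs-MAP contrast in Ex. 3 can be made concrete by enumerating every axis-aligned rectangle on a small grid. The grid size and the four-example sample below are made up for illustration; the weights follow the slide’s formulas (likelihood ∝ (1/2)^w, posterior ∝ (1/2)^(area+w)).

```python
from itertools import product

# Sketch of Ex. 3 on a hypothetical 4x4 strength-sugar grid.
GRID = 4
sample = [((1, 1), 1), ((2, 2), 1), ((0, 3), 0), ((3, 0), 0)]  # (x, t) pairs

def rectangles():
    # All axis-aligned rectangles (x1..x2, y1..y2) on the grid.
    for x1, x2, y1, y2 in product(range(GRID), repeat=4):
        if x1 <= x2 and y1 <= y2:
            yield (x1, x2, y1, y2)

def h_label(h, x):
    x1, x2, y1, y2 = h
    return 1 if x1 <= x[0] <= x2 and y1 <= x[1] <= y2 else 0

def wrong(h):
    return sum(1 for x, t in sample if h_label(h, x) != t)

def area(h):
    x1, x2, y1, y2 = h
    return (x2 - x1 + 1) * (y2 - y1 + 1)

h_ml = max(rectangles(), key=lambda h: 0.5 ** wrong(h))               # min mistakes
h_map = max(rectangles(), key=lambda h: 0.5 ** (area(h) + wrong(h)))  # min mistakes+area
print(h_ml, h_map)
```

On this sample the ML winner is any rectangle making zero mistakes, while the small-rectangle prior can make MAP prefer a tiny rectangle that accepts one mistake in exchange for less area.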
Ex 4: 10 hypotheses
• Label noise rate is 1/3, uniform prior
• Domain has 9 points (x1, x2, …, x9)
• Hypothesis h0 is 1 everywhere
• For 1≤i≤9, hi(xi)=1 and hi(xj)=0 if i ≠ j
• Sample = {(x1,1), (x2,1)}; predict on x3
• h0 is both the max. likelihood and max. a’posteriori hypothesis, but …
• Predictive distribution is maximized at 0 ≠ h0(x3)
27
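Ex 4 can be worked out exactly with fractions. This sketch follows the slide’s setup: uniform prior over the 10 hypotheses, each observed label matching the hypothesis with probability 2/3.

```python
from fractions import Fraction

q = Fraction(1, 3)                # label noise rate

def h(i, x):                      # hypothesis i's clean label on point x
    return 1 if i == 0 or i == x else 0

sample = [(1, 1), (2, 1)]         # observed: (x1, 1), (x2, 1)

def likelihood(i):                # P(observed labels | h_i)
    L = Fraction(1)
    for x, t in sample:
        L *= (1 - q) if h(i, x) == t else q
    return L

post = [likelihood(i) * Fraction(1, 10) for i in range(10)]
Z = sum(post)
post = [p / Z for p in post]

map_i = max(range(10), key=lambda i: post[i])   # 0: h0 is ML and MAP
# Predictive probability that x3's label is 1:
p1 = sum(post[i] * ((1 - q) if h(i, 3) == 1 else q) for i in range(10))
print(map_i, p1)                                # p1 = 4/9 < 1/2, so full Bayes predicts 0
```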
Ex 5: Generative Model
basketball pro vs others
• binary features like:
– tall, job-travel, wears-uniform, blue-eyes
• hypothesis is distribution of each feature
for each class, i.e. learn p(tall | pro) etc.
• Estimate these probabilities, and p(pro),
p(other)
• On new instance x, predict based on
p(class) p(x | class)
28
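The generative recipe above amounts to a naive Bayes classifier. A minimal sketch, with a made-up toy dataset and add-one (Laplace) smoothing as an extra assumption not on the slide:

```python
# Minimal naive Bayes sketch for the pro-vs-other example.
# Each row: (tall, job_travel, wears_uniform, blue_eyes), class label.
data = [
    ((1, 1, 1, 0), "pro"), ((1, 1, 1, 1), "pro"), ((1, 0, 1, 0), "pro"),
    ((0, 1, 0, 1), "other"), ((0, 0, 0, 0), "other"), ((1, 0, 0, 1), "other"),
]
classes = {"pro", "other"}
n = len(data)

def p_class(c):
    return sum(1 for _, t in data if t == c) / n

def p_feature(j, v, c):
    # P(feature j = v | class c) with add-one smoothing over the 2 values.
    in_c = [x for x, t in data if t == c]
    return (sum(1 for x in in_c if x[j] == v) + 1) / (len(in_c) + 2)

def predict(x):
    # Pick the class maximizing p(class) * prod_j p(x_j | class).
    def score(c):
        s = p_class(c)
        for j, v in enumerate(x):
            s *= p_feature(j, v, c)
        return s
    return max(classes, key=score)

print(predict((1, 1, 1, 0)))  # → pro
```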
Key interplay
• Underlying pattern being learned
• Features available
• Hypothesis space
• Number of examples available
The trick is finding the right mix, but…
The smaller the hypothesis space, the
luckier we need to be to catch the pattern
30
Triple Trade-Off
• There is a trade-off between three factors (Dietterich, 2003):
  – Complexity of H
  – Training set size, N
  – Generalization error, E, on new data
• As N increases, E decreases
• As H gets more complex, first E drops and then E rises
• Also PAC theory & VC-style generalization bounds
31
Where we are going:
• Done so far:
– General Bayesian parameter estimation
• Coming up:
– Bayesian decision theory (Duda and Hart ‘73)
– Estimating Gaussians
32
• Predictive distribution is “Bayes optimal”
– minimizes probability of a mistake if
data generated according to
assumptions (prior and models)
• But: all mistakes are not created equal!
33
Bayes decision theory
• Consider an automatic lotto payer that pays off scratcher wins ($5, $20, $100)
• Have costs for mistakes, a loss function L(pred, real):

                 payment (pred)
                  5    20   100
  Value    5      0    15    95
          20     30     0    80
         100     30    30     0
34
• (Bayes) risk of payment t on x is ∑_t’ L(t, t’) P(t’ | x)
• Bayes optimal prediction minimizes risk
• Say the machine reads ticket x and finds:
  P(5 | x) = 3/10, P(20 | x) = 1/10, P(100 | x) = 6/10

                 payment (pred)
                  5    20   100
  Value    5      0    15    95
          20     30     0    80
         100     30    30     0

• What should the machine do?
35
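The three risks can be computed directly from the loss table and the posterior on the ticket’s value; exact fractions avoid floating-point clutter. A short sketch:

```python
from fractions import Fraction

# L[pred][real] is the loss table from the slide.
L = {
    5:   {5: 0,  20: 30, 100: 30},
    20:  {5: 15, 20: 0,  100: 30},
    100: {5: 95, 20: 80, 100: 0},
}
P = {5: Fraction(3, 10), 20: Fraction(1, 10), 100: Fraction(6, 10)}

def risk(pred):
    # risk(t) = sum over true values t' of L(t, t') * P(t' | x)
    return sum(L[pred][real] * P[real] for real in P)

for t in (5, 20, 100):
    print(t, float(risk(t)))        # 21.0, 22.5, 36.5
print(min((5, 20, 100), key=risk))  # 5: pay $5 even though $100 is most likely
```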
Exercise:
• How big does the loss for an underpayment (the $30’s) have to be so that paying $100 on the distribution
  P(5 | x) = 3/10, P(20 | x) = 1/10, P(100 | x) = 6/10
  has the smallest risk?

                 payment (pred)
                  5    20   100
  Value    5      0    15    95
          20     30     0    80
         100     30    30     0
36
More on Generative approach
• Generative approach (pick t then x)
• Learn P(x | t) and use Bayes’ rule
P(t | x) = P(x | t) P(t) / P(x)
• Need model for P(x | t)
• One common assumption:
P(x | t) Gaussian
• How to learn (fit) Gaussian?
37
Another Generative model:
• Have a set of distributions (e.g. gaussians)
• Each hypothesis associates a distribution on X with
each class (or label) (distributions may be restricted e.g. have same variance)
• Put “prior” P(t) on classes/labels
• Assume samples generated by:
– Repeat:
• Pick label t from P(t)
• Pick features x from P(x | t)
• For each class t, estimate P(t) and P(x | t) from
sample
38
1-dimensional Gaussians
p(x) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))
• Maximum likelihood estimate has sample mean μ and sample variance
  σ² = (1/n) ∑i (xi − μ)² = E[(x − μ)²]
     = E[x² − 2μx + μ²]
     = E[x²] − 2μE[x] + μ²
     = E[x²] − μ²
• What Gaussian best fits −1, 1, 1, 7?
39
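The question above can be answered by plugging the four points into the ML formulas; the sketch also checks the E[x²] − μ² identity from the derivation.

```python
# Maximum likelihood fit of a 1-d Gaussian to the data in the question.
xs = [-1, 1, 1, 7]
n = len(xs)

mu = sum(xs) / n                              # sample mean
var = sum((x - mu) ** 2 for x in xs) / n      # ML variance (divide by n, not n-1)
# Same answer via the identity var = E[x^2] - mu^2:
var2 = sum(x * x for x in xs) / n - mu ** 2
print(mu, var, var2)                          # 2.0 9.0 9.0
```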
(figure-only slide: plot not reproduced)
40
Multivariate Gaussians
P(x) = (1 / ((2π)^(n/2) |Σ|^(1/2))) · exp(−(1/2) [x − μ]ᵀ Σ⁻¹ [x − μ])
• Mean vector μ, covariance matrix Σ; entries of the covariance matrix are variances (or covariances), Σi,j = E[(xi − μi)(xj − μj)] (subscripts are indices into vectors)
41
Gaussian Conditionals and
Marginals
• If p(x, y) is Gaussian, then:
  – Conditional p(x | Y=y) is Gaussian
  – Marginal p(x) = ∫ p(x, y) dy is Gaussian
42
Estimating Gaussians
(maximum likelihood)
• Estimate μ = (∑i xi) / n
• Estimate Σi,j = (∑k (xk,i − μi)(xk,j − μj)) / n
• (The above covariance estimate is biased; can use “n − 1”)
• If the domain is d-dimensional:
  – d parameters for μ
  – d(d+1)/2 parameters for the Σi,j’s
  – For each class!
  – Many parameters “requires” lots of data
43
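These ML formulas are two lines of matrix arithmetic. A minimal sketch with an arbitrary data matrix, just to exercise the estimates:

```python
import numpy as np

# ML estimates for a multivariate Gaussian from an (n x d) sample matrix.
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [2.0, -1.0]])
n, d = X.shape

mu = X.mean(axis=0)                   # d parameters
centered = X - mu
Sigma = (centered.T @ centered) / n   # symmetric: d(d+1)/2 free parameters
# np.cov(X.T, bias=True) computes the same matrix; bias=False divides by n-1.
print(mu)
print(Sigma)
```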
Common tricks
• Share the same Σ for all classes
  – Assumption made by linear discriminant analysis (LDA, also Fisher’s linear discriminant)
• Assume diagonal Σ’s for each class
• Assume Σ diagonal and shared
44
General decision boundaries for Gaussian generative model
• Duda and Hart ’73
• Also: same covariance matrix implies linear boundary
45
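The linear-boundary claim can be seen directly: with a shared Σ the quadratic terms in the two Gaussian log-densities cancel, leaving a rule of the form w·x + b > 0. A sketch with made-up means, covariance, and priors:

```python
import numpy as np

# Two Gaussian classes sharing one covariance matrix (illustration values).
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # shared by both classes
prior0 = prior1 = 0.5

Si = np.linalg.inv(Sigma)
# log p(x|t=1)p(t=1) - log p(x|t=0)p(t=0) reduces to w.x + b:
w = Si @ (mu1 - mu0)
b = -0.5 * (mu1 @ Si @ mu1 - mu0 @ Si @ mu0) + np.log(prior1 / prior0)

def predict(x):
    # Sign of the log posterior odds of class 1 over class 0.
    return 1 if w @ x + b > 0 else 0

print(predict(mu1), predict(mu0))   # each mean lands in its own class
```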
Main Points/terms:
• Features, instances, labels, examples
• Batch learning, inductive bias
• Classification, regression, loss function
• Training set, training error, test error
• Noise, over-fitting
• Probability: events and RV’s, independence,
sum rule, product rule, Bayes rule
46
Main points/terms (cont.)
• Bayesian parameter estimation: priors
and posteriors, max.likelihood, max
a’posteriori, mean a’posteriori, full Bayesian
prediction (predictive distribution)
• Generative models: Naïve Bayes, class-conditional Gaussians (estimating Gaussians)
• Bayes decision theory
47