Transcript Document
CpSc 810: Machine Learning
Bayesian Learning
Copyright Notice
Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!
Two Roles for Bayesian
Methods
Bayesian Decision Theory came long before
Version Spaces, Decision Tree Learning and
Neural Networks. It was studied in the field of
Statistical Theory and more specifically, in the
field of Pattern Recognition.
Provides practical learning algorithms
Naïve Bayes learning
Bayesian belief network learning
Combine prior knowledge (prior probabilities)
with observed data
Requires prior probabilities
Provides useful conceptual framework
Provides a “gold standard” for evaluating other learning algorithms
Additional insight into Occam’s razor
Bayes Theorem
Goal: To determine the most probable
hypothesis, given the data D plus any initial
knowledge about the prior probabilities of the
various hypotheses in H.
Prior probability of h, P(h): it reflects any
background knowledge we have about the
chance that h is a correct hypothesis (before
having observed the data).
Prior probability of D, P(D): it reflects the
probability that training data D will be observed
given no knowledge about which hypothesis h
holds.
Conditional Probability of observation D, P(D|h):
it denotes the probability of observing data D
given some world in which hypothesis h holds.
Bayes Theorem (Cont’d)
Posterior probability of h, P(h|D): it
represents the probability that h holds given
the observed training data D. It reflects our
confidence that h holds after we have seen
the training data D and it is the quantity that
Machine Learning researchers are
interested in.
Bayes Theorem allows us to compute
P(h|D):
P(h|D)=P(D|h)P(h)/P(D)
Choosing Hypotheses:
Maximum A Posteriori (MAP)
Goal: To find the most probable hypothesis h
from a set of candidate hypotheses H given the
observed data D.
MAP Hypothesis: hMAP = argmax_{h∈H} P(h|D)
                     = argmax_{h∈H} P(D|h)P(h)/P(D)
                     = argmax_{h∈H} P(D|h)P(h)
If every hypothesis in H is equally probable a
priori, we only need to consider the likelihood of
the data D given h, P(D|h). Then, hMAP becomes
the Maximum Likelihood,
hML = argmax_{h∈H} P(D|h)
Bayes Theorem: an example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.
P(cancer)=0.008
P(¬cancer)=0.992
P(+|cancer)=0.98
P(-|cancer)=0.02
P(+|¬cancer)=0.03
P(-|¬cancer)=0.97
P(+|cancer)P(cancer) = 0.98 × 0.008 ≈ 0.0078
P(+|¬cancer)P(¬cancer) = 0.03 × 0.992 ≈ 0.0298
Since 0.0298 > 0.0078, hMAP = ¬cancer. Normalizing, P(cancer|+) = 0.0078/(0.0078 + 0.0298) ≈ 0.21, so even after a positive test, no-cancer remains the more probable hypothesis.
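A minimal sketch of this calculation in Python, using only the numbers given above:

```python
# Hedged sketch: posterior for the cancer example via Bayes theorem.
p_cancer = 0.008            # prior P(cancer)
p_pos_given_cancer = 0.98   # P(+ | cancer)
p_pos_given_healthy = 0.03  # P(+ | ¬cancer)

joint_cancer = p_pos_given_cancer * p_cancer           # ≈ 0.0078
joint_healthy = p_pos_given_healthy * (1 - p_cancer)   # ≈ 0.0298

p_pos = joint_cancer + joint_healthy                   # P(+) by total probability
print("P(cancer | +) =", joint_cancer / p_pos)         # ≈ 0.21
```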
Basic Formulas for Probabilities
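The formulas themselves did not survive the transcript; as a rough reconstruction, the standard identities usually collected on this slide (and used later in these notes) are:
  Product rule: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
  Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
  Bayes theorem: P(h|D) = P(D|h)P(h)/P(D)
  Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then P(B) = Σ_{i=1}^{n} P(B|Ai)P(Ai)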
Brute Force MAP Hypothesis
Learner
For each hypothesis h in H, calculate the posterior probability
  P(h|D) = P(D|h)P(h) / P(D)
Output the hypothesis hMAP with the highest posterior probability
  hMAP = argmax_{h∈H} P(h|D)
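A small illustrative sketch of this brute-force learner, assuming the hypothesis space is a finite list and that the caller supplies hypothetical likelihood(h, D) and prior(h) functions:

```python
# Hedged sketch of a brute-force MAP learner over a finite hypothesis space H.
# `likelihood(h, D)` returning P(D|h) and `prior(h)` returning P(h) are
# placeholders supplied by the caller; they are not part of the original slides.
def brute_force_map(H, D, likelihood, prior):
    posteriors = {h: likelihood(h, D) * prior(h) for h in H}   # P(D|h)P(h)
    p_D = sum(posteriors.values())                             # P(D) by total probability
    posteriors = {h: p / p_D for h, p in posteriors.items()}   # P(h|D)
    h_map = max(posteriors, key=posteriors.get)                # argmax_{h in H} P(h|D)
    return h_map, posteriors
```

In the concept-learning setting of the next slides (uniform prior 1/|H| and a 0/1 likelihood), this computation gives every consistent hypothesis the same posterior 1/|VS_{H,D}|.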
Relation to Concept Learning
Consider our usual concept learning task
Instance space X, hypothesis space H, training
examples D
Consider the Find-S learning algorithm (it outputs the most specific consistent hypothesis from the version space VS_{H,D}).
Does Find-S output a MAP hypothesis?
Relation to Concept Learning
Assume a fixed set of instances <x1, …, xm>
Assume D is the set of their classifications:
D = <c(x1), …, c(xm)>
Choose P(D|h)
P(D|h) = 1 if h is consistent with D
P(D|h) = 0 if h is inconsistent with D
Choose P(h) to be uniform distribution
P(h)= 1/|H| for all h in H
P(D) can be derived from the theorem of total probability:
  P(D) = Σ_{hi∈H} P(D|hi) P(hi)
       = Σ_{hi∈VS_{H,D}} 1 · (1/|H|) + Σ_{hi∉VS_{H,D}} 0 · (1/|H|)
       = |VS_{H,D}| / |H|
Relation to Concept Learning
If h is inconsistent with D:
  P(h|D) = P(D|h)P(h)/P(D) = 0 · P(h)/P(D) = 0
If h is consistent with D:
  P(h|D) = P(D|h)P(h)/P(D) = 1 · (1/|H|) / P(D)
         = (1/|H|) · (|H| / |VS_{H,D}|) = 1 / |VS_{H,D}|
So, every consistent hypothesis is a MAP
hypothesis.
Evolution of Posterior Probabilities
for Consistent Learners
[Figure: three plots of probability over the hypothesis space — the prior P(h), the posterior P(h|D1) after observing D1, and the posterior P(h|D1,D2) after observing D1 and D2. As data accumulate, the posterior concentrates on the hypotheses that remain consistent with the data.]
Inductive Bias: A Second View
Maximum Likelihood and
Least-Squared Error
Consider any real-valued target function f and training examples <xi, di>, where di is a noisy training value:
  di = f(xi) + ei
ei is a random variable (noise) drawn independently for each xi according to a Gaussian distribution with mean 0.
Then, the maximum likelihood hypothesis
hML is the one that minimizes the sum of
squared errors:
  hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
Maximum Likelihood and
Least-Squared Error
Assume the training examples are mutually
independent given h,
hML = argmax_{h∈H} p(D|h)
    = argmax_{h∈H} Π_{i=1}^{m} p(di|h)
    = argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ²)) exp(−(1/2)((di − h(xi))/σ)²)
Maximize the natural log of this instead:
hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/2)((di − h(xi))/σ)² ]
    = argmax_{h∈H} Σ_{i=1}^{m} −(1/2)((di − h(xi))/σ)²
    = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))²
    = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
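A brief numerical check of this equivalence, assuming a one-parameter hypothesis class h(x) = w·x and synthetic data with Gaussian noise (all names and numbers here are illustrative):

```python
import math, random

# Hedged sketch: for h(x) = w*x with Gaussian noise, the w maximizing the
# log-likelihood is the same w that minimizes the sum of squared errors.
random.seed(0)
xs = [i / 10 for i in range(1, 21)]
ds = [2.0 * x + random.gauss(0.0, 0.3) for x in xs]      # d_i = f(x_i) + e_i

def sse(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w, sigma=0.3):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - 0.5 * ((d - w * x) / sigma) ** 2 for x, d in zip(xs, ds))

candidates = [w / 100 for w in range(100, 301)]          # crude grid search
print(max(candidates, key=log_likelihood))               # w chosen by maximum likelihood
print(min(candidates, key=sse))                          # w chosen by least squares (same)
```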
Maximum Likelihood Hypotheses
for Predicting Probability
Consider predicting survival probability from patient data:
Training examples < xi, di >, where di is 1 or 0.
Want to train neural network to output a
probability given xi (not a 0 or 1)
In this case:
  P(di | h, xi) = h(xi)^di (1 − h(xi))^(1−di)
Then:
  P(D|h) = Π_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) P(xi)
  hML = argmax_{h∈H} Π_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) P(xi)
      = argmax_{h∈H} Π_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di)
      = argmax_{h∈H} Σ_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]
Gradient Search to Maximize
Likelihood in a Neural Net
We denote by G(h, D) the quantity
  G(h, D) = Σ_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]
The negative of G(h, D) is called the cross entropy.
Then,
  ∂G(h, D)/∂wjk = Σ_{i=1}^{m} ∂G(h, D)/∂h(xi) · ∂h(xi)/∂wjk
                = Σ_{i=1}^{m} ∂[di ln h(xi) + (1 − di) ln(1 − h(xi))]/∂h(xi) · ∂h(xi)/∂wjk
                = Σ_{i=1}^{m} (di − h(xi)) / (h(xi)(1 − h(xi))) · ∂h(xi)/∂wjk
Assume our neural network is constructed from a single layer of sigmoid units; then
  ∂h(xi)/∂wjk = σ'(xi) xijk = h(xi)(1 − h(xi)) xijk
Thus,
  ∂G(h, D)/∂wjk = Σ_{i=1}^{m} (di − h(xi)) xijk
Gradient Search to Maximize
Likelihood in a Neural Net
Weight update rule for a sigmoid unit:
  wjk ← wjk + Δwjk
where
  Δwjk = η Σ_{i=1}^{m} (di − h(xi)) xijk
Compare to the backpropagation rule for minimizing squared error:
  Δwjk = η Σ_{i=1}^{m} h(xi)(1 − h(xi)) (di − h(xi)) xijk
The rule that minimizes cross entropy seeks the maximum likelihood hypothesis under the assumption that the observed Boolean value is a probabilistic function of the input instance.
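A compact sketch of this update rule for a single sigmoid unit, trained by batch gradient ascent on G(h, D); the toy data and the learning rate eta are illustrative assumptions:

```python
import math

# Hedged sketch: gradient ascent on G(h, D) (equivalently, minimizing cross
# entropy) for a single sigmoid unit h(x) = sigmoid(w . x).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(X, d, eta=0.05, epochs=2000):
    """X: list of input vectors (first component can be a bias), d: 0/1 targets."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        h = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
        for k in range(len(w)):
            # delta_w_k = eta * sum_i (d_i - h(x_i)) * x_ik
            w[k] += eta * sum((d[i] - h[i]) * X[i][k] for i in range(len(X)))
    return w

# Toy usage: learn a noisy threshold on one input plus a bias feature.
X = [[1.0, x / 10] for x in range(10)]
d = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
print(train_sigmoid_unit(X, d))
```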
Minimum Description Length
Principle
Occam’s razor: prefer the shortest
hypothesis
Rewrite hMAP:
  hMAP = argmax_{h∈H} P(D|h) P(h)
       = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
       = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]
Interesting fact from information theory:
The optimal (shortest expected coding length)
code for an event with probability p is -log2p bits
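For example, an event with probability 1/2 gets a 1-bit code, while an event with probability 1/4 gets a 2-bit code (−log2(1/4) = 2); rarer events receive longer codewords.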
Minimum Description Length
Principle
So interpret:
–log2(P(h)) is the description length LC1(h) of h under
the optimal coding C1 for the hypothesis space H;
–log2(P(D | h)) is the description length LC2(D | h) of
the data D given hypothesis h under its optimal
coding C2.
Prefer the hypothesis that minimizes
length(h) + length(misclassifications)
Minimum Description Length Principle (MDL):
prefer the hypothesis h that minimizes
  hMDL = argmin_{h∈H} [ LC1(h) + LC2(D|h) ]
Where LC(x) is the description length of x under
encoding C.
hMAP equals hMDL under optimal encodings.
Minimum Description Length
Principle
Example: H=decision trees, D=training data
labels
LC1(h) is the # bits to describe tree h
LC2(D|h) is the # bits to describe D given h
Note LC2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions.
Hence, hMDL trades off tree size for training
errors.
Most Probable Classification of
New Instances
So far we have sought the most probable hypothesis given the data D (i.e., hMAP).
Given a new instance x, what is its most probable classification?
hMAP(x) is not necessarily the most probable classification!
Consider three possible hypotheses:
  P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Given new instance x:
  h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?
Bayes Optimal Classifier
One great advantage of Bayesian Decision
Theory is that it gives us a lower bound on the
classification error that can be obtained for a
given problem.
Bayes Optimal Classification: the most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities:
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)
where V is the set of all the values a classification can take and vj is one possible such classification.
Bayes Optimal Classifier
Properties of the Optimal Bayes Classifier:
No other classification method using the same
hypothesis space and same prior knowledge can
outperform this classifier on average!
Bayes optimal classifier may not be in H!
Example:
P(h1|D) =0.4, P(-|h1)=0, P(+|h1)=1
P(h2|D)=0.3, P(-|h2)=1, P(+|h2)=0
P(h3|D)=0.3, P(-|h3)=1, P(+|h3)=0
Therefore:
  Σ_{hi∈H} P(+|hi) P(hi|D) = 0.4
  Σ_{hi∈H} P(−|hi) P(hi|D) = 0.6
and
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) = −
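A tiny sketch of this weighted vote, using the three-hypothesis example above (the dictionary encoding is just one convenient choice):

```python
# Hedged sketch: Bayes optimal classification as a posterior-weighted vote.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # each h_i's label for x

def bayes_optimal(values, posteriors, p_value_given_h):
    # argmax_{v in V} sum_{h in H} P(v|h) P(h|D)
    return max(values, key=lambda v: sum(p_value_given_h(v, h) * p
                                         for h, p in posteriors.items()))

# Each hypothesis here predicts a single label, so P(v|h) is 1 or 0.
label = bayes_optimal(["+", "-"], posteriors,
                      lambda v, h: 1.0 if predictions[h] == v else 0.0)
print(label)   # "-", since the weighted votes are 0.4 for "+" and 0.6 for "-"
```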
Gibbs Classifier
The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this hypothesis to classify the new instance
Surprising fact: assume target concepts are drawn at random from H according to the priors on H; then
  E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
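A minimal sketch of the Gibbs classifier, reusing the illustrative posteriors and predictions from the previous snippet:

```python
import random

# Hedged sketch: sample one hypothesis according to P(h|D) and classify with it.
def gibbs_classify(posteriors, predictions):
    hs = list(posteriors)
    h = random.choices(hs, weights=[posteriors[h] for h in hs])[0]
    return predictions[h]

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
print(gibbs_classify(posteriors, predictions))  # "+" with prob. 0.4, "-" with prob. 0.6
```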
Naïve Bayes Classifier
Let each instance x of a training set D be described by a conjunction of n attribute values <a1, a2, …, an> and let f(x), the target function, be such that f(x) ∈ V, a finite set.
Bayesian approach:
  vMAP = argmax_{vj∈V} P(vj | a1, a2, …, an)
       = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
       = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj)
Naïve Bayesian approach: we assume that the attribute values are conditionally independent given the target value, so that P(a1, a2, …, an | vj) = Πi P(ai | vj); only the individual terms P(ai|vj) need to be estimated, so not too large a data set is required.
Naïve Bayes classifier:
  vNB = argmax_{vj∈V} P(vj) Πi P(ai | vj)
Naïve Bayes: Example
Consider the weather data; we have to classify the instance:
  <Outlook = sunny, Temp = cool, Hum = high, Wind = strong>
The task is to predict the value (yes or no) of the concept PlayTennis. We apply the naïve Bayes rule:
  vMAP = argmax_{vj∈{yes,no}} P(vj) Πi P(ai | vj)
       = argmax_{vj∈{yes,no}} P(vj) P(Outlook=sunny | vj) P(Temp=cool | vj) P(Hum=high | vj) P(Wind=strong | vj)
Naïve Bayes: Example
Outlook
sunny
sunny
overcast
rain
rain
rain
overcast
sunny
sunny
rain
sunny
overcast
overcast
rain
Temperature Humidity Windy Class
hot
high
false
N
hot
high
true
N
hot
high
false
P
mild
high
false
P
cool
normal false
P
cool
normal true
N
cool
normal true
P
mild
high
false
N
cool
normal false
P
mild
normal false
P
mild
normal true
P
mild
high
true
P
hot
normal false
P
mild
high
true
N
P(yes) = 9/14    P(no) = 5/14

Outlook:  P(sunny|yes) = 2/9      P(sunny|no) = 3/5
          P(overcast|yes) = 4/9   P(overcast|no) = 0
          P(rain|yes) = 3/9       P(rain|no) = 2/5
Temp:     P(hot|yes) = 2/9        P(hot|no) = 2/5
          P(mild|yes) = 4/9       P(mild|no) = 2/5
          P(cool|yes) = 3/9       P(cool|no) = 1/5
Hum:      P(high|yes) = 3/9       P(high|no) = 4/5
          P(normal|yes) = 6/9     P(normal|no) = 1/5
Windy:    P(true|yes) = 3/9       P(true|no) = 3/5
          P(false|yes) = 6/9      P(false|no) = 2/5
Naïve Bayes: Example
Classify the instance:
  <Outlook = sunny, Temp = cool, Hum = high, Wind = strong>
(here Wind = strong plays the role of Windy = true in the table above)
  P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 0.0053
  P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 0.0206
Thus, the naïve Bayes classifier assigns the value no to PlayTennis!
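A compact sketch that recomputes these numbers from the table above (the tuple encoding of the data, and mapping Wind = strong to windy = "true", are just convenient choices):

```python
# Hedged sketch: naïve Bayes on the weather data using relative frequencies.
# Each row: (outlook, temperature, humidity, windy, class)
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def nb_score(instance, label):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                                    # P(v_j)
    for k, value in enumerate(instance):
        score *= sum(1 for r in rows if r[k] == value) / len(rows)   # P(a_i | v_j)
    return score

x = ("sunny", "cool", "high", "true")   # Outlook, Temp, Hum, Windy
for label in ("P", "N"):                # P = yes, N = no
    print(label, round(nb_score(x, label), 4))                  # ≈ 0.0053 and 0.0206
print("prediction:", max(("P", "N"), key=lambda v: nb_score(x, v)))  # N (no)
```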
Naïve Bayes: Estimating Probabilities
To estimate the probability of an
attribute-value A = v for a given class C
we use:
Relative frequency: nc/n,
where nc is the number of training instances that belong
to the class C and have value v for the attribute A, and
n is the number of training instances of the class C;
m-estimate of accuracy: (nc + mp)/(n + m),
where nc is the number of training instances that belong to the class C and have value v for the attribute A, n is the number of training instances of the class C, p is a prior estimate of the probability we wish to determine (e.g., uniform over the attribute's values), and m is the weight given to p (an equivalent sample size).
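For instance, P(overcast|no) above is 0 under relative frequency (nc = 0, n = 5), which would zero out the whole naïve Bayes product; with an illustrative p = 1/3 (three Outlook values) and m = 3, the m-estimate instead gives (0 + 3·1/3)/(5 + 3) = 1/8.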
Learning to Classify Text
Simplifications:
each document is represented by a vector of words;
the words are considered as numerical attributes wk;
the values of the word attributes wk are the frequencies with which the words occur in the text.
To estimate the probability P(wk | v) we use:
  P(wk | v) = (nk + 1) / (n + |Vocabulary|)
where n is the total number of word positions in all the documents (instances) whose target value is v, nk is the number of times word wk is found among these n word positions, and |Vocabulary| is the total number of distinct words found in the training data.
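A small sketch of this smoothed estimate, assuming documents are given as token lists (the tiny corpus is made up purely for illustration):

```python
# Hedged sketch: estimate P(w_k | v) = (n_k + 1) / (n + |Vocabulary|).
docs = [(["like", "this", "movie"], "pos"),
        (["boring", "movie"], "neg"),
        (["really", "like", "it"], "pos")]

vocabulary = {w for words, _ in docs for w in words}

def p_word_given_class(word, v):
    positions = [w for words, label in docs if label == v for w in words]
    n, n_k = len(positions), positions.count(word)
    return (n_k + 1) / (n + len(vocabulary))

print(p_word_given_class("like", "pos"))    # (2 + 1) / (6 + 6)
print(p_word_given_class("boring", "pos"))  # (0 + 1) / (6 + 6): never exactly zero
```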
Naïve Bayes: subtleties
Conditional independence assumption is
often violated
P(a1, a2, …, an | vj) = Πi P(ai | vj)
But it works surprisingly well anyway. Note
do not need estimated posteriors P(vj|x) to
be correct; need only that
argmax_{vj∈V} P(vj) Πi P(ai | vj)
= argmax_{vj∈V} P(vj) P(a1, …, an | vj)
Bayesian Belief Networks
The Bayes Optimal Classifier is often too
costly to apply.
The Naïve Bayes Classifier uses the
conditional independence assumption to
defray these costs. However, in many cases,
such an assumption is overly restrictive.
Bayesian belief networks provide an
intermediate approach which allows stating
conditional independence assumptions that
apply to subsets of the variables.
Allows combining prior knowledge about
dependencies among variables with observed
training data
Conditional Independence
We say that X is conditionally independent
of Y given Z if the probability distribution
governing X is independent of the value of Y
given a value for Z.
i.e., (∀ xi, yj, zk) P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)
or, P(X|Y,Z)=P(X|Z)
This definition can be extended to sets of
variables as well: we say that the set of
variables X1…Xl is conditionally independent of
the set of variables Y1…Ym given the set of
variables Z1…Zn , if
P(X1…Xl|Y1…Ym,Z1…Zn)=P(X1…Xl|Z1…Zn)
Bayesian Belief Networks
Network represents a set of conditional
independence assertions.
Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors (its parents).
Directed acyclic graph
Bayesian Belief Networks
Represent joint probability distribution over
all variables
e.g., P(Storm, BusTourGroup, …, ForestFire)
In general,
  P(y1, …, yn) = Πi P(yi | Parents(Yi))
where Parents(Yi) denotes the immediate predecessors of Yi in the graph.
So the joint distribution is fully defined by the graph plus the tables P(yi | Parents(Yi)).
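A minimal sketch of this factored joint for a toy two-node network Storm → Campfire; the conditional probability values are made up for illustration:

```python
# Hedged sketch: joint probability as the product of P(y_i | Parents(Y_i)).
parents = {"Storm": (), "Campfire": ("Storm",)}
cpt = {
    "Storm":    {(): 0.2},                      # P(Storm = T)
    "Campfire": {(True,): 0.1, (False,): 0.4},  # P(Campfire = T | Storm)
}

def joint(assignment):
    """P(y_1, ..., y_n) = prod_i P(y_i | Parents(Y_i)) for a full assignment."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[par] for par in parents[var])
        p_true = cpt[var][parent_values]
        prob *= p_true if value else (1.0 - p_true)
    return prob

print(joint({"Storm": True, "Campfire": False}))   # 0.2 * 0.9 = 0.18
```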
Inference in Bayesian Networks
A Bayesian Network can be used to compute
the probability distribution for any subset of
network variables given the values or
distributions for any subset of the remaining
variables.
Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is
known to be NP-hard.
In practice, inference can succeed in many cases:
Exact inference methods work well for some
network structures
Monte Carlo methods “simulate” the network
randomly to calculate approximate solutions.
Learning Bayesian Belief Networks
The network structure is given in advance
and all the variables are fully observable in
the training examples. ==> Trivial Case: just
estimate the conditional probabilities.
The network structure is given in advance
but only some of the variables are
observable in the training data. ==> Similar
to learning the weights for the hidden units
of a Neural Net: Gradient Ascent Procedure
The network structure is not known in
advance. ==> Use a heuristic search or
constraint-based technique to search
through potential structures.
Gradient Ascent for Bayes Nets
Let wijk denote one entry in the conditional probability table for variable Yi in the network:
  wijk = P(Yi = yij | Parents(Yi) = the list uik of parent values)
e.g., if Yi = Campfire, then uik might be <Storm = T, BusTourGroup = F>.
Perform gradient ascent by repeatedly:
1. Update all wijk using the training data D:
  wijk ← wijk + η Σ_{d∈D} Ph(yij, uik | d) / wijk
2. Then renormalize the wijk to assure that
  Σj wijk = 1 and 0 ≤ wijk ≤ 1
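A sketch of one such update pass, assuming the CPT entries are stored as nested lists w[i][j][k] and that a posterior(i, j, k, d) helper returning Ph(yij, uik | d) (e.g., computed by exact inference) is supplied separately:

```python
# Hedged sketch of one gradient-ascent step over Bayes-net CPT entries w[i][j][k].
# `posterior(i, j, k, d)` is assumed to return P_h(Y_i = y_ij, U_i = u_ik | d);
# it is a placeholder, not shown here.
def gradient_ascent_step(w, D, posterior, eta=0.01):
    for i in range(len(w)):                    # each variable Y_i
        for j in range(len(w[i])):             # each value y_ij
            for k in range(len(w[i][j])):      # each parent configuration u_ik
                grad = sum(posterior(i, j, k, d) / w[i][j][k] for d in D)
                w[i][j][k] += eta * grad
    # Renormalize so that, for each parent configuration k, sum_j w[i][j][k] = 1.
    for i in range(len(w)):
        for k in range(len(w[i][0])):
            total = sum(w[i][j][k] for j in range(len(w[i])))
            for j in range(len(w[i])):
                w[i][j][k] /= total
    return w
```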
More on Learning Bayes Nets
The EM algorithm can also be used. Repeatedly:
Calculate the probabilities of the unobserved variables, assuming the current hypothesis h
Calculate new wijk to maximize E[ln P(D|h)], where D now includes both the observed and unobserved variables
When the structure is unknown:
Algorithms use greedy search to add/subtract edges and nodes
Active research topic.
Summary: Bayesian Belief Networks
Combine prior knowledge with observed
data
Impact of prior knowledge is to lower the
sample complexity
Expectation Maximization (EM)
When to use:
Data is only partially observable
Unsupervised clustering (target value
unobservable)
Supervised learning (some instance attributes
unobservable)
Some Uses:
Train Bayesian Belief Network
Unsupervised clustering
Learning hidden Markov models
Generating Data from Mixture of k
Gaussians
Each instance x is generated by
Choosing one of the k Gaussians with uniform
probability
Generating an instance at random according to
that Gaussian
EM for Estimating k Means
Given:
  Instances from X generated by a mixture of k Gaussian distributions
  Unknown means <μ1, μ2, …, μk> of the k Gaussians
  We do not know which instance xi was generated by which Gaussian
Determine:
  Maximum likelihood estimates of <μ1, μ2, …, μk>
Think of the full description of each instance as yi = <xi, zi1, …, zik>, where
  zij is 1 if xi was generated by the jth Gaussian
  xi is observable
  zij is unobservable
EM for Estimating k Means
EM algorithm: pick a random initial h = <μ1, μ2, …, μk>, then iterate:
E step: Calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = <μ1, μ2, …, μk> holds:
  E[zij] = p(x = xi | μ = μj) / Σ_{n=1}^{k} p(x = xi | μ = μn)
         = exp(−(xi − μj)²/(2σ²)) / Σ_{n=1}^{k} exp(−(xi − μn)²/(2σ²))
EM for Estimating k Means
M step: Calculate a new maximum likelihood hypothesis h' = <μ1', μ2', …, μk'>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in the E step. Then replace the hypothesis h = <μ1, μ2, …, μk> by the new hypothesis h' = <μ1', μ2', …, μk'> and iterate.
  μj ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]
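A short sketch of this EM loop for one-dimensional data, k = 2, equal known variance, and uniform mixing weights; the synthetic data are purely illustrative:

```python
import math, random

# Hedged sketch: EM for the means of a mixture of k 1-D Gaussians with equal,
# known variance sigma^2 (as assumed in the slides).
def em_k_means(xs, k, sigma=1.0, iters=50):
    mus = random.sample(list(xs), k)              # random initial hypothesis h
    for _ in range(iters):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        E = []
        for x in xs:
            ws = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
            E.append([w / sum(ws) for w in ws])
        # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mus = [sum(E[i][j] * xs[i] for i in range(len(xs))) /
               sum(E[i][j] for i in range(len(xs))) for j in range(k)]
    return mus

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(100)] + \
     [random.gauss(6.0, 1.0) for _ in range(100)]
print(sorted(em_k_means(xs, 2)))   # typically close to [0, 6]
```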
EM Algorithm
Converges to a locally maximum likelihood h and provides estimates of the hidden variables zij
In fact, it finds a local maximum of E[ln P(Y|h)]
Y is complete data (observable plus
unobservable variables)
Expected value is taken over possible values of
unobserved variables in Y
General EM Problem
Given:
Observed data X={x1, x2 ,.., xm}
Unobserved data Z={z1, z2 ,.., zm}
Parameterized probability distribution P(Y|h),
where
Y = {y1, y2, …, ym} is the full data, where yi = xi ∪ zi
h are the parameters.
Determine:
h that locally maximizes E[ln P(Y|h)]
General EM Method
Define a likelihood function Q(h'|h) over the full data Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:
  Q(h'|h) ← E[ln P(Y|h') | h, X]
EM algorithm:
Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
  Q(h'|h) ← E[ln P(Y|h') | h, X]
Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this Q function:
  h ← argmax_{h'} Q(h'|h)