Bayesian Learning - Northwestern University


Machine Learning
Probability and Bayesian Networks
Doug Downey (adapted from Bryan Pardo, Northwestern University)
An Introduction
• Bayesian Decision Theory came long before
Version Spaces, Decision Tree Learning and
Neural Networks. It was studied in the field of
Statistical Theory and more specifically, in the
field of Pattern Recognition.
An Introduction
• Bayesian Decision Theory is at the basis of
important learning schemes such as…
– Naïve Bayes Classifier
– Bayesian Belief Networks
– EM Algorithm
• Bayesian Decision Theory is also useful because it
provides a framework within which many non-Bayesian classifiers can be studied
– See [Mitchell, Sections 6.3, 4,5,6].
Discrete Random Variables
• A is a Boolean random variable if it
denotes an event where there is
uncertainty about whether it occurs
• Examples
– The next US president will be Barack Obama
– You will get an A in the course
• P(A) = probability of A = the fraction of
all possible worlds where A is true
Visualizing P(A)

[Figure: a yellow oval (the worlds where A is true) inside a blue rectangle (all possible worlds)]

P(A) = (area of yellow oval) / (area of blue rectangle)

0 ≤ P(A) ≤ 1. If a value is over 1 or under 0, it isn't a probability.
Axioms of Probability
• Let there be a sample space S composed of a countable number of events: S = {e1, e2, e3, ..., en}
• The probability of each event is between 0 and 1: 0 ≤ P(e_i) ≤ 1
• The probability of the whole sample space is 1: P(S) = 1
• When two events are mutually exclusive, their probabilities are additive: P(e1 ∨ e2) = P(e1) + P(e2)
Visualizing Two Boolean RVs

[Figure: two overlapping ovals A and B inside the blue rectangle of all possible worlds]

P(A) = (area of yellow oval) / (area of blue rectangle)

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Conditional Probability
[Figure: overlapping ovals A and B inside the rectangle of all possible worlds; here A and B are NOT independent]

The conditional probability of A given B is given by the following formula:

P(A | B) = P(A ∧ B) / P(B)

Can we do the following?

P(A | B) = P(A ∧ B) / P(B) = P(A) P(B) / P(B) = P(A)

Only if A and B are independent.
Independence
• variables A and B are said to be
independent if knowing the value of A
gives you no knowledge about the
likelihood of B…and vice-versa
P(A|B) = P(A) and P(B|A) = P(B)
An Example: Cards
• Take a standard deck of 52 cards.
• On the first draw I pull the Ace of Spades.
• I don’t replace the card.
– What is the probability I’ll pull the Ace of Spades on
the second draw?
• Now, I replace the Ace after the 1st draw,
shuffle, and draw again.
– What is the chance I’ll draw the Ace of Spades on the
2nd draw?
Discrete Random Variables
• A is a discrete random variable if it takes a
countable number of distinct values
• Examples
– Your grade G in the course
– The number of heads k in n coin flips
• P(A=k) = the fraction of all possible worlds
where A equals k
• Notation: P_D(A = k) is the probability relative to a distribution D
– e.g., P_{fair grading}(G = "A"), P_{cheating}(G = "A")
Bayes Theorem
• Definition of Conditional Probability:
P(A | B) = P(A, B) / P(B)

• Corollary, the Chain Rule:
P(A | B) P(B) = P(A, B)

• Bayes Rule (Thomas Bayes, 1763):
P(B | A) = P(A, B) / P(A) = P(A | B) P(B) / P(A)
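As a quick worked use of Bayes rule, here is a minimal Python sketch; the spam prevalence and word probabilities below are made-up numbers chosen only to exercise the formula:

```python
# Bayes rule: P(B | A) = P(A | B) * P(B) / P(A)
# Toy numbers (assumed): B = "email is spam", A = "email contains the word 'free'".
p_b = 0.20                  # P(B): prior probability an email is spam
p_a_given_b = 0.60          # P(A | B): 'free' appears in a spam email
p_a_given_not_b = 0.05      # P(A | not B): 'free' appears in a non-spam email

# P(A) by the law of total probability
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Posterior via Bayes rule
p_b_given_a = p_a_given_b * p_b / p_a
print(f"P(spam | 'free') = {p_b_given_a:.3f}")   # -> 0.750
```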
ML in a Bayesian Framework
• Any ML technique can be expressed as
reasoning about probabilities
• Goal: Find hypothesis h that is most
probable given training data D
• Provides a more explicit way of describing
& encoding our assumptions
Some Definitions
• Prior probability of h, P(h):
– The background knowledge we have about the chance that h is a
correct hypothesis (before having observed the data).
• Prior probability of D, P(D):
– the probability that training data D will be observed given no
knowledge about which hypothesis h holds.
• Conditional Probability of D, P(D|h):
– the probability of observing data D given that hypothesis h holds.
• Posterior probability of h, P(h|D):
– the probability that h is true, given the observed training data D.
– the quantity that Machine Learning researchers are interested in.
Maximum A Posteriori (MAP)
• Goal: To find the most probable hypothesis h from a set
of candidate hypotheses H given the observed data D.
• MAP Hypothesis, hMAP
h_MAP = argmax_{h∈H} P(h | D)

      = argmax_{h∈H} [ P(D | h) P(h) / P(D) ]

      = argmax_{h∈H} P(D | h) P(h)
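To make the argmax concrete, here is a minimal Python sketch of MAP hypothesis selection over a toy hypothesis space; the priors and likelihoods are invented purely for illustration:

```python
# h_MAP = argmax_h P(D | h) P(h); P(D) is dropped because it is the same
# for every hypothesis and does not affect the argmax.
priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}          # P(h), assumed
likelihoods = {"h1": 0.01, "h2": 0.20, "h3": 0.50}  # P(D | h), assumed

def map_hypothesis(priors, likelihoods):
    # Score each hypothesis by P(D | h) * P(h) and return the best one.
    return max(priors, key=lambda h: likelihoods[h] * priors[h])

print(map_hypothesis(priors, likelihoods))  # 'h3': 0.5*0.1 = 0.05 beats 0.04 and 0.007
```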
Maximum Likelihood (ML)
• ML hypothesis is a special case of the MAP hypothesis
where all hypotheses are equally likely to begin with
h_MAP = argmax_{h∈H} P(D | h) P(h)

Assume P(h) = 1 / |H| for all h ∈ H. Then:

h_ML = argmax_{h∈H} P(D | h)
Example: Brute Force MAP Learning
• Assumptions
– The training data D is noise-free: d_i = c(x_i)
– The target concept c is in the hypothesis set H: c ∈ H
– All hypotheses are equally likely: P(h) = 1 / |H|
• Choice: probability of D given h
P(D | h) = 1 if h(d) = c(d) for all d ∈ D, and 0 otherwise
Brute Force MAP (continued)
By Bayes theorem:

P(h | D) = P(D | h) P(h) / P(D)

Given our assumptions:

If h is not consistent with the data D:
P(h | D) = 0 · (1/|H|) / P(D) = 0

If h is consistent with the data D:
P(h | D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VS_{H,D}| / |H|) = 1 / |VS_{H,D}|

where VS_{H,D} is the version space, and under these assumptions P(D) = |VS_{H,D}| / |H|.
Find-S as MAP Learning
• We can characterize the FIND-S learner
(chapter 2) in Bayesian terms
– Again P(D | h) is 1 if h is consistent on D, and
0 otherwise
– P(h) increases with…
• specificity of h
– Then: MAP hypothesis = output of Find-S
Neural Nets in a Bayesian Framework
• Under certain assumptions regarding noise
in the data, minimizing the mean squared
error (what multilayer perceptrons do)
corresponds to computing the maximum
likelihood hypothesis.
Least Squared Error = ML
recall:
h_ML = argmax_{h∈H} p(D | h)

Let's learn a target function f(x_i), represented by examples drawn from an error-prone example set D, where the i-th example is

d_i = f(x_i) + e

Assume the error e is drawn from a normal distribution.

[Figure: noisy examples scattered around the target function f, with the ML hypothesis h_ML fit to them]
Least Squared Error = ML
We express the maximum likelihood hypothesis as

h_ML = argmax_{h∈H} ∏_{i=1..m} p(d_i | h)

If the error is normally distributed, we can express p(d_i | h) as a normal density with variance σ² and mean μ = h(x_i):

h_ML = argmax_{h∈H} ∏_{i=1..m} (1 / √(2πσ²)) e^(−(d_i − μ)² / (2σ²))

     = argmax_{h∈H} ∏_{i=1..m} (1 / √(2πσ²)) e^(−(d_i − h(x_i))² / (2σ²))
Least Squared Error = ML
h_ML = argmax_{h∈H} ∏_{i=1..m} (1 / √(2πσ²)) e^(−(d_i − h(x_i))² / (2σ²))

(take the log)

h_ML = argmax_{h∈H} Σ_{i=1..m} [ ln(1 / √(2πσ²)) − (d_i − h(x_i))² / (2σ²) ]

(remove constants)

h_ML = argmin_{h∈H} Σ_{i=1..m} (d_i − h(x_i))²
Thus, the ML hypothesis minimizes the sum of the squared error
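A small numpy sketch of this equivalence, under the stated Gaussian-noise assumption: for a one-parameter hypothesis class h_w(x) = w·x, the w that maximizes the log-likelihood is the same w that minimizes the sum of squared errors. The data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
d = 2.0 * x + rng.normal(0, 0.1, size=x.shape)   # noisy examples d_i = f(x_i) + e
sigma = 0.1
ws = np.linspace(0, 4, 401)                      # candidate hypotheses h_w(x) = w * x

def sse(w):
    # sum of squared errors of hypothesis h_w on the data
    return np.sum((d - w * x) ** 2)

def log_likelihood(w):
    # sum_i log N(d_i ; w * x_i, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (d - w * x)**2 / (2 * sigma**2))

w_ml = ws[np.argmax([log_likelihood(w) for w in ws])]
w_lse = ws[np.argmin([sse(w) for w in ws])]
print(w_ml, w_lse)   # identical: the ML and least-squares hypotheses coincide
```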
Decision Trees in Bayes Framework
• Decent choice for P(h): simpler
hypotheses have higher probability
– Occam’s razor
• This can be encoded in terms of finding
the “Minimum Description Length”
encoding
– Provides a way to “trade off” hypothesis size
for training error
– Potentially prevents overfitting
Most Compact Coding
• Let's minimize the bits used to encode a message
• Idea:
– Assign shorter codes to more probable messages
• According to Shannon & Weaver
– An optimal code assigns −log2 P(i) bits to encode item i
• Thus the optimal (expected) length is

−Σ_i P(i) log2 P(i)
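A minimal sketch of the Shannon code-length idea: each item gets −log2 P(i) bits, and the expected length is the entropy of the distribution. The probabilities are made up for the example:

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # assumed message probabilities

code_lengths = {i: -math.log2(pi) for i, pi in p.items()}   # optimal bits per item
expected_length = sum(pi * code_lengths[i] for i, pi in p.items())

print(code_lengths)      # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(expected_length)   # 1.75 bits = the entropy of the distribution
```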
Minimum Description Length (MDL)
recall:
h_MAP = argmax_{h∈H} P(D | h) P(h)

so:
h_MAP = argmax_{h∈H} [ log2 P(D | h) + log2 P(h) ]

equivalently:
h_MAP = argmin_{h∈H} [ −log2 P(D | h) − log2 P(h) ]
Minimum Description Length (MDL)
h_MAP = argmin_{h∈H} [ −log2 P(D | h) − log2 P(h) ]

which starts to look like entropy...

−log2 P(h) is the description length of h under the optimal encoding. We notate this:
L_{C_H}(h) = −log2 P(h)

−log2 P(D | h) is the description length of the training data, given we use the optimal encoding for hypothesis h. Notate this:
L_{C_{D|H}}(D | h) = −log2 P(D | h)
Minimum Description Length (MDL)
Thus:
h_MAP = argmin_{h∈H} [ −log2 P(D | h) − log2 P(h) ]

can be written:
h_MAP = argmin_{h∈H} [ L_{C_{D|H}}(D | h) + L_{C_H}(h) ]

So the Minimum Description Length principle is: choose h_MDL such that

h_MDL = argmin_{h∈H} [ L_{C_{D|H}}(D | h) + L_{C_H}(h) ]

and if we have optimal encodings, h_MDL = h_MAP.
What does all that mean?
• The “optimal” hypothesis is the one that is
the smallest when we count…
– How long the hypothesis description must be
– How long the data description must be, given
the hypothesis
• Key idea: since we’re given h, we need only
encode h’s mistakes
What does all that mean?
• If the hypothesis is perfect, we don’t need
to encode any data.
• For each misclassification, we must
– say which item is misclassified
• This takes log2 m bits, where m = size of the dataset
– say what the right classification is
• This takes log2 k bits, where k = number of classes
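A sketch of that bookkeeping, comparing two hypothetical hypotheses: each misclassified example costs log2 m bits to name plus log2 k bits to correct, on top of the bits needed to describe the hypothesis itself. All the sizes below are assumptions for illustration.

```python
import math

m, k = 1000, 2   # dataset size and number of classes (assumed)

def description_length(hypothesis_bits, num_errors, m=m, k=k):
    # L(h) + L(D | h): bits for the hypothesis plus bits to encode its mistakes
    return hypothesis_bits + num_errors * (math.log2(m) + math.log2(k))

simple_tree  = description_length(hypothesis_bits=50,  num_errors=20)   # ~269 bits
complex_tree = description_length(hypothesis_bits=400, num_errors=0)    # 400 bits
print(simple_tree, complex_tree)   # MDL prefers the simpler, slightly imperfect tree
```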
The best MDL hypothesis
• The best hypothesis is the best tradeoff
between
– Complexity of the hypothesis description
– Number of times we have to tell people where
it screwed up.
Is MDL always MAP?
• Only given significant assumptions:
– If we know a representation scheme such that the size of h in H is −log2 P(h)
– Likewise, the size of the exception representation must be −log2 P(D | h)
– THEN
• MDL = MAP
Making Predictions
• The reason we learned h to begin with
• Does it make sense to choose just one h?
[Figure: three hypotheses about "Obama Elected President": h1: Looks matter, h2: Money matters, h3: Ideas matter]

We want a prediction: yes or no?
Maximum A Posteriori (MAP)
• Find most probable hypothesis
h_MAP = argmax_{h∈H} P(D | h) P(h)

• Use the predictions of that hypothesis

[Figure: h1: Looks matter, h2: Money matters, h3: Ideas matter]
…. do we really want to ignore the other hypotheses?
Imagine 8 hypotheses. Seven of them say “yes” and
have a probability of 0.1 each. One says “no” and
has a probability of 0.3. Who do you believe?
Bayes Optimal Classifier
• Bayes Optimal Classification: The most probable
classification of a new instance is obtained by combining
the predictions of all hypotheses, weighted by their
posterior probabilities:
argmax_{v∈V} Σ_{h∈H} P(v | h) P(h | D)
…where V is the set of all the values a classification can
take and v is one possible such classification.
No other method using the same H and prior knowledge is
better (on average).
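A minimal sketch of the eight-hypothesis scenario above: MAP follows the single most probable hypothesis ("no", posterior 0.3), while the Bayes optimal classifier sums posterior-weighted votes and answers "yes" (0.7 vs. 0.3). Here each hypothesis predicts its label deterministically, so P(v|h) is 1 or 0.

```python
# (hypothesis prediction, posterior P(h | D)) for the eight hypotheses
hypotheses = [("yes", 0.1)] * 7 + [("no", 0.3)]

# MAP: follow the single most probable hypothesis
map_vote = max(hypotheses, key=lambda hp: hp[1])[0]

# Bayes optimal: argmax_v sum_h P(v | h) P(h | D)
totals = {}
for prediction, posterior in hypotheses:
    totals[prediction] = totals.get(prediction, 0.0) + posterior
bayes_optimal_vote = max(totals, key=totals.get)

print(map_vote, bayes_optimal_vote)   # 'no' vs 'yes'
```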
Naïve Bayes Classifier
• Unfortunately, Bayes Optimal Classifier is usually too
costly to apply! ==> Naïve Bayes Classifier
We’ll be seeing more of these…
The Joint Distribution
• Make a truth table
listing all
combinations of
variable values
• Assign a probability to
each row
• Make sure the
probabilities sum to 1
A  B  C  |  Prob
0  0  0  |  0.10
0  0  1  |  0.20
0  1  0  |  0.10
0  1  1  |  0.05
1  0  0  |  0.05
1  0  1  |  0.20
1  1  0  |  0.25
1  1  1  |  0.05
Using The Joint Distribution
• Find P(A)
• Sum the probabilities
of all rows where A=1
P(A=1) = 0.05+ 0.2 +
0.25+ 0.05
= 0.55
• P(A) =

A  |  P(A)
0  |  0.45
1  |  0.55

(computed from the joint distribution table on the previous slide)
Using The Joint Distribution
• Find P(A|B)
• P(A=1 | B=1)
=P(A=1, B=1)/P(B=1)
=(0.25+0.05)/
(0.25+0.05+0.1+0.05)
A  B  |  P(A|B)
1  1  |  0.67
0  1  |  0.33
1  0  |  0.45
0  0  |  0.55

(computed from the joint distribution table shown earlier)
Using The Joint Distribution
• Are A and B
Independent?
P(A=1, B=1) = 0.3
P(A=1) = 0.55
P(B=1) = 0.45
P(A=1) · P(B=1) = 0.55 × 0.45 = 0.2475
P(A, B) ≠ P(A) P(B)
NO. They are NOT independent.
(using the joint distribution table shown earlier)
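The three queries above can be reproduced directly from the joint table; a minimal Python sketch:

```python
# Joint distribution P(A, B, C) from the table above
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.20, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.25, (1, 1, 1): 0.05,
}

def prob(pred):
    # Sum the probabilities of all rows (a, b, c) satisfying the predicate
    return sum(p for (a, b, c), p in joint.items() if pred(a, b, c))

p_a  = prob(lambda a, b, c: a == 1)                 # 0.55
p_ab = prob(lambda a, b, c: a == 1 and b == 1)      # 0.30
p_b  = prob(lambda a, b, c: b == 1)                 # 0.45

print(p_a, p_ab / p_b)          # P(A=1) and P(A=1 | B=1) = 0.30 / 0.45 ≈ 0.67
print(abs(p_ab - p_a * p_b))    # 0.0525 != 0, so A and B are not independent
```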
Why not use the Joint Distribution?
• Given m Boolean variables, we need to estimate 2^m − 1 values.
• 20 yes-no questions = a million values
• How do we get around this combinatorial
explosion?
– Assume independence of variables!!
…back to Independence
• The probability I have an apple in my lunch bag
is independent of the probability of a blizzard in
Japan.
• This is DOMAIN Knowledge, typically supplied
by the problem designer
P(A | B) = P(A)
Naïve Bayes Classifier
• Cases described by a conjunction of attribute values
– These attributes are our “independent” hypotheses
• The target function has a finite set of values, V
v_MAP = argmax_{v_j∈V} P(v_j | a_1 ∧ a_2 ∧ ... ∧ a_n)
• Could be solved using the joint distribution table
• What if we have 50,000 attributes?
– Attribute j is a Boolean signaling presence or absence of the jth
word from the dictionary in my latest email.
Naïve Bayes Classifier
v_MAP = argmax_{v_j∈V} P(v_j | a_1 ∧ a_2 ∧ ... ∧ a_n)

      = argmax_{v_j∈V} P(a_1 ∧ a_2 ∧ ... ∧ a_n | v_j) P(v_j) / P(a_1 ∧ a_2 ∧ ... ∧ a_n)

      = argmax_{v_j∈V} P(a_1 ∧ a_2 ∧ ... ∧ a_n | v_j) P(v_j)
Naïve Bayes Continued
v_MAP = argmax_{v_j∈V} P(a_1 ∧ a_2 ∧ ... ∧ a_n | v_j) P(v_j)

Conditional independence step:

v_NB = argmax_{v_j∈V} P(a_1 | v_j) P(a_2 | v_j) ... P(a_n | v_j) P(v_j)

     = argmax_{v_j∈V} P(v_j) ∏_i P(a_i | v_j)

Instead of one table of size 2^50,000 we have 50,000 tables of size 2.
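A minimal Naïve Bayes sketch over Boolean word-presence attributes. The tiny training set and the add-one smoothing are assumptions made only for illustration:

```python
import math
from collections import defaultdict

# Each example: (set of words present, label)
train = [({"free", "win"}, "spam"), ({"win", "cash"}, "spam"),
         ({"meeting", "notes"}, "ham"), ({"lunch", "notes"}, "ham")]
vocab = {"free", "win", "cash", "meeting", "notes", "lunch"}

label_counts = defaultdict(int)
word_counts = defaultdict(lambda: defaultdict(int))   # word_counts[label][word]
for words, label in train:
    label_counts[label] += 1
    for w in words & vocab:
        word_counts[label][w] += 1

def classify(words):
    best, best_score = None, float("-inf")
    for label, n in label_counts.items():
        # log P(v_j) + sum_i log P(a_i | v_j), with add-one smoothing
        score = math.log(n / len(train))
        for w in vocab:
            p_w = (word_counts[label][w] + 1) / (n + 2)
            score += math.log(p_w if w in words else 1 - p_w)
        if score > best_score:
            best, best_score = label, score
    return best

print(classify({"win", "free"}))     # -> 'spam'
print(classify({"notes", "lunch"}))  # -> 'ham'
```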
Bayesian Belief Networks
• Bayes Optimal Classifier
– Often too costly to apply (uses full joint probability)
• Naïve Bayes Classifier
– Assumes conditional independence to lower costs
– This assumption often overly restrictive
• Bayesian belief networks
– provide an intermediate approach
– allow conditional independence assumptions that apply to subsets of the variables
Example
• I'm at work, neighbor John calls to say my alarm is
ringing, but neighbor Mary doesn't call. Sometimes it's
set off by minor earthquakes. Is there a burglar?
• Variables: Burglary, Earthquake, Alarm, JohnCalls,
MaryCalls
• Network topology reflects "causal" knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Example contd.

[Figure: the burglary network]
Bayesian Networks
[Pearl 91]
P B , E , A , M , J 
Parents Pa of Alarm
Qualitative part:
Directed acyclic graph (DAG)
• Nodes - random vars.
• Edges - direct influence
Earthquake
Burglary
Alarm
MaryCalls
Together:
Define a unique distribution
in a factored form
E B P(A | B,E)
e b
0.95 0.05
e b
0.94 0.06
e b
0.29 0.01
e b
0.001 0.999
JohnCalls
Quantitative part:
Set of conditional probability
distributions
P  B , E , A , M , J   P  E   P  B   P  A B , E   P M A   P  J A 
Compactness
• A CPT for Boolean X_i with k Boolean parents has 2^k rows for the combinations of parent values
• Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p)
• If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
• I.e., it grows linearly with n, vs. O(2^n) for the full joint distribution
• For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
Semantics
The full joint distribution is defined as the product of the local
conditional distributions:
P(X_1, ..., X_n) = ∏_{i=1..n} P(X_i | Parents(X_i))

Example:
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
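A sketch of that computation in Python. The alarm CPT uses the numbers from the slide above; the remaining priors and call probabilities (P(b)=0.001, P(e)=0.002, P(j|a)=0.90, P(m|a)=0.70) are the textbook's usual values and should be read as assumptions here:

```python
# Factored burglary network: P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
p_b = {True: 0.001, False: 0.999}            # assumed prior on Burglary
p_e = {True: 0.002, False: 0.998}            # assumed prior on Earthquake
p_a = {(True, True): 0.95, (True, False): 0.94,    # P(A=true | B, E), from the CPT above
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}              # assumed P(JohnCalls=true | A)
p_m = {True: 0.70, False: 0.01}              # assumed P(MaryCalls=true | A)

def joint(b, e, a, j, m):
    # Multiply the local conditional distributions along the network structure
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pj = p_j[a] if j else 1 - p_j[a]
    pm = p_m[a] if m else 1 - p_m[a]
    return p_b[b] * p_e[e] * pa * pj * pm

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
print(joint(b=False, e=False, a=True, j=True, m=True))   # ≈ 0.00063
```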
Learning BB Networks: 3 cases
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
Trivial Case: just estimate the conditional probabilities.
2. The network structure is given in advance but only some
of the variables are observable in the training data.
Similar to learning the weights for the hidden units of a Neural Net:
Gradient Ascent Procedure
3. The network structure is not known in advance.
Use a heuristic search or constraint-based technique to search
through potential structures.
Constructing Bayesian networks
• 1. Choose an ordering of variables X1, … ,Xn
• 2. For i = 1 to n
– add Xi to the network
– select parents from X_1, ..., X_{i-1} such that
P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i-1})

This choice of parents guarantees:

P(X_1, ..., X_n) = ∏_{i=1..n} P(X_i | X_1, ..., X_{i-1})   (chain rule)
                 = ∏_{i=1..n} P(X_i | Parents(X_i))        (by construction)
Example
• Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)?
Example
• Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)?
Example
• Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)?
P(B | A, J, M) = P(B)?
Example
• Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A ,J, M) = P(E | A)?
P(E | B, A, J, M) = P(E | A, B)?
Example
• Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)?
No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A ,J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Example contd.
• Deciding conditional
independence is hard in
noncausal directions
– Causal models and
conditional independence
seem hardwired for
humans!
• Network is less compact
Inference in BB Networks
• A Bayesian Network can be used to compute the
probability distribution for any subset of network
variables given the values or distributions for any
subset of the remaining variables.
• Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is known
to be NP-hard (#P-complete)
• In theory, approximate techniques (such as Monte
Carlo Methods) can also be NP-hard, though in
practice, many such methods are shown to be useful.
Expectation Maximization Algorithm
• Learning unobservable relevant variables
• Example: Assume that data points have been uniformly generated from k distinct Gaussians with the same known variance. The problem is to output a hypothesis h = <μ_1, μ_2, ..., μ_k> that describes the means of each of the k distributions. In particular, we are looking for a maximum likelihood hypothesis for these means.
• We extend the problem description as follows: for each point x_i, there are k hidden variables z_i1, ..., z_ik such that z_il = 1 if x_i was generated by normal distribution l, and z_iq = 0 for all q ≠ l.
The EM Algorithm (Cont’d)
• An arbitrary initial hypothesis h = <μ_1, μ_2, ..., μ_k> is chosen.
• The EM Algorithm iterates over two steps:
– Step 1 (Estimation, E): Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming that the current hypothesis h = <μ_1, μ_2, ..., μ_k> holds.
– Step 2 (Maximization, M): Calculate a new maximum likelihood hypothesis h' = <μ_1', μ_2', ..., μ_k'>, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated in Step 1. Then replace the hypothesis h = <μ_1, μ_2, ..., μ_k> by the new hypothesis h' = <μ_1', μ_2', ..., μ_k'> and iterate.

The EM Algorithm can be applied to more general problems.
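A compact numpy sketch of these two steps for k one-dimensional Gaussians with equal, known variance, estimating only the means; the data, k, and sigma are assumptions chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
true_means = [0.0, 5.0]                                   # assumed, to generate toy data
x = np.concatenate([rng.normal(m, sigma, 100) for m in true_means])

k = 2
mu = np.linspace(x.min(), x.max(), k)                     # arbitrary initial hypothesis <mu_1..mu_k>

for _ in range(50):
    # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)), normalized over j
    resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma**2))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: the new means are the responsibility-weighted averages of the data
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(np.sort(mu))   # close to the means that generated the data
```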
Gibbs Classifier
• Bayes optimal classification can be too hard to compute
• Instead, randomly pick a single hypothesis (according to
the probability distribution of the hypotheses)
• use this hypothesis to classify new cases
argmax_{v∈V} Σ_{h∈H} P(v | h) P(h | D)   (the Bayes optimal rule, shown for comparison)

[Figure: hypotheses h1, h2, h3, with one sampled at random]
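A minimal sketch of the Gibbs step: draw one hypothesis according to its posterior and follow its prediction. The posteriors reuse the eight-hypothesis example from earlier and are purely illustrative:

```python
import random

# (prediction, posterior P(h | D)); posteriors sum to 1
hypotheses = [("yes", 0.1)] * 7 + [("no", 0.3)]

def gibbs_classify(hypotheses):
    # Draw a single hypothesis h with probability P(h | D), then follow its prediction
    predictions = [h for h, _ in hypotheses]
    posteriors = [p for _, p in hypotheses]
    return random.choices(predictions, weights=posteriors, k=1)[0]

print(gibbs_classify(hypotheses))   # 'yes' about 70% of the time, 'no' about 30%
```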