Transcript Document

Bayesian Learning
Evgueni Smirnov
Overview
• Bayes Theorem
• Maximum A Posteriori Hypothesis
• Naïve Bayes Classifier
• Learning Text Classifiers
Thomas Bayes (1702-1761)
His theory of probability was set out in 1764. His conclusions were accepted by Laplace in 1781, rediscovered by Condorcet, and remained unchallenged until Boole questioned them.
Bayes Theorem
Goal: To determine the posterior probability P(h|D) of
hypothesis h given the data D from:
• Prior probability of h, P(h): it reflects any background
knowledge we have about the chance that h is a correct
hypothesis (before having observed the data).
• Prior probability of D, P(D): it reflects the probability that
training data D will be observed given no knowledge about
which hypothesis h holds.
• Conditional Probability of observation D, P(D|h): it
denotes the probability of observing data D given some
world in which hypothesis h holds.
Bayes Theorem
Posterior probability of h, P(h|D): it represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D, and it is the quantity that data-mining researchers are interested in.
Bayes Theorem allows us to compute P(h|D):
P(h | D) = P(D | h) P(h) / P(D)
Maximum a Posteriori
Hypothesis (MAP)
In many learning scenarios, the learner considers a set of hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such hypothesis is called a maximum a posteriori (MAP) hypothesis.
h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)
(the denominator P(D) can be dropped in the last step because it does not depend on h)
Example
Consider a cancer test with two outcomes: positive [+] and negative [-]. The
test returns a correct positive result in 98% of the cases in which the disease is
actually present, and a correct negative result in 97% of the cases in which the
disease is not present. Furthermore, .008 of all people have this cancer.
P(cancer) = 0.008          P(¬cancer) = 0.992
P([+] | cancer) = 0.98     P([+] | ¬cancer) = 0.03
P([-] | cancer) = 0.02     P([-] | ¬cancer) = 0.97
A patient got a positive test [+]. The maximum a posteriori hypothesis is:
h_MAP = argmax_{h ∈ {cancer, ¬cancer}} P(h | [+]) = argmax_{h ∈ {cancer, ¬cancer}} P([+] | h) P(h)

P([+] | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P([+] | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

h_MAP = ¬cancer
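A minimal Python sketch of this computation, using only the numbers above (the final normalization by P([+]) is my addition, not on the slide):

```python
# Hypothesis scores for the cancer example; numbers taken from the slide above.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not_cancer = 0.98, 0.03

# Unnormalized posteriors P([+] | h) * P(h) for the two hypotheses.
score_cancer = p_pos_given_cancer * p_cancer              # ~0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~0.0298

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(h_map)  # -> not cancer

# Dividing by P([+]) = score_cancer + score_not_cancer gives the exact posterior:
print(score_cancer / (score_cancer + score_not_cancer))   # ~0.21
```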
Naïve Bayes Classifier
Let each instance x of a training set D be described by a conjunction of n attribute values <a1(x), a2(x), ..., an(x)>, and let V be a finite set of possible classes (concepts).
v_MAP = argmax_{vj ∈ V} P(vj | a1(x), a2(x), ..., an(x))
      = argmax_{vj ∈ V} P(a1(x), a2(x), ..., an(x) | vj) P(vj) / P(a1(x), a2(x), ..., an(x))
      = argmax_{vj ∈ V} P(a1(x), a2(x), ..., an(x) | vj) P(vj)
      = argmax_{vj ∈ V} P(vj) ∏_i P(ai(x) | vj)
The naïve Bayes assumption, used in the last step, is that the attributes are conditionally independent given the class!
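As a sketch of how this decision rule could be coded (the function names and the use of logarithms are my own choices, not from the slides), assuming the prior and conditional probabilities have already been estimated from the training data:

```python
import math

def naive_bayes_predict(instance, classes, prior, cond_prob):
    """Return the class v maximizing P(v) * prod_i P(ai(x) | v).

    instance:  dict mapping attribute name -> observed value ai(x)
    prior:     function v -> P(v)
    cond_prob: function (attribute, value, v) -> P(attribute = value | v)
    """
    def score(v):
        # Sum logs instead of multiplying probabilities to avoid numerical
        # underflow when there are many attributes; assumes all estimated
        # probabilities are strictly positive (see the m-estimate below).
        s = math.log(prior(v))
        for attribute, value in instance.items():
            s += math.log(cond_prob(attribute, value, v))
        return s

    return max(classes, key=score)
```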
Example
Consider the weather data below, where we have to classify the instance:
< Outlook = sunny, Temp = cool, Hum = high, Wind = strong >
The task is to predict the value (yes or no) of the concept PlayTennis. We apply the naïve Bayes rule:
v_MAP = argmax_{vj ∈ {yes, no}} P(vj) ∏_i P(ai(x) | vj)
      = argmax_{vj ∈ {yes, no}} P(vj) P(Outlook = sunny | vj) P(Temp = cool | vj) P(Hum = high | vj) P(Wind = strong | vj)
Example: Estimating Probabilities
Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N
P(yes) = 9/14            P(no) = 5/14      (class P = PlayTennis yes, N = no)

Outlook:
  P(sunny | yes) = 2/9     P(sunny | no) = 3/5
  P(overcast | yes) = 4/9  P(overcast | no) = 0
  P(rain | yes) = 3/9      P(rain | no) = 2/5

Temp:
  P(hot | yes) = 2/9       P(hot | no) = 2/5
  P(mild | yes) = 4/9      P(mild | no) = 2/5
  P(cool | yes) = 3/9      P(cool | no) = 1/5

Hum:
  P(high | yes) = 3/9      P(high | no) = 4/5
  P(normal | yes) = 6/9    P(normal | no) = 1/5

Windy:
  P(true | yes) = 3/9      P(true | no) = 3/5
  P(false | yes) = 6/9     P(false | no) = 2/5
Example
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) ≈ 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) ≈ 0.0206

Thus, the naïve Bayes classifier assigns the value no to PlayTennis!
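A minimal Python sketch that reproduces this calculation from the estimated probabilities above (the dictionary layout is my own; note that Wind = strong corresponds to Windy = true in the table):

```python
# Plug in the estimated probabilities from the tables above.
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}
instance = ["sunny", "cool", "high", "strong"]

scores = {}
for v in priors:
    score = priors[v]
    for value in instance:
        score *= cond[v][value]
    scores[v] = score

print(scores)                       # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))  # -> no
```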
Estimating Probabilities
– To estimate the probability P(A=v|C) of an attribute value A = v for a given class C we use:
• Relative frequency: nc/n, where nc is the number of training instances that belong to the class C and have value v for the attribute A, and n is the number of training instances of the class C;
• m-estimate of accuracy: (nc + mp)/(n + m), where nc and n are defined as above, p is the prior probability P(A=v), and m is the weight given to p (see the sketch below).
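As referenced above, a small sketch of the two estimators (the choice of p = 1/3 and m = 3 in the usage example is my own illustration, not from the slides):

```python
def relative_frequency(n_c, n):
    """Estimate P(A=v | C) as n_c / n."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """Estimate P(A=v | C) as (n_c + m*p) / (n + m), where p is a prior
    probability for A = v and m is the weight given to that prior."""
    return (n_c + m * p) / (n + m)

# P(Outlook = overcast | no) in the weather data is 0/5 under relative
# frequency, which zeroes out the whole naive Bayes product; an m-estimate
# (here with a uniform prior p = 1/3 over the three Outlook values, m = 3)
# keeps the estimate strictly positive.
print(relative_frequency(0, 5))   # 0.0
print(m_estimate(0, 5, 1/3, 3))   # 0.125
```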
Learning to Classify Text
• each document is represented by a vector of numerical word attributes wk;
• the values of the word attributes wk are the frequencies with which the words occur in the text.
To estimate the probability P(wk | v) we use:
P(wk | v) = (nk + mp) / (n + m) = (nk + 1) / (n + |Vocabulary|)

(the m-estimate with the uniform prior p = 1/|Vocabulary| and weight m = |Vocabulary|),
where n is the total number of word positions in all the documents
(instances) whose target value is v, nk is the number of times word
wk is found in these n word positions, and |Vocabulary| is the total
number of distinct words found in the training data.
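A minimal sketch of this estimate (the helper function and the toy documents are hypothetical, not from the slides):

```python
from collections import Counter

def word_probabilities(docs_of_class_v, vocabulary):
    """Estimate P(wk | v) = (nk + 1) / (n + |Vocabulary|) for every word wk.

    docs_of_class_v: documents whose target value is v, each a list of tokens.
    vocabulary:      set of all distinct words found in the training data.
    """
    counts = Counter(word for doc in docs_of_class_v for word in doc)
    n = sum(counts.values())   # total word positions in these documents
    size = len(vocabulary)     # |Vocabulary|
    return {w: (counts[w] + 1) / (n + size) for w in vocabulary}

# Toy usage: 5 word positions, 4 distinct words in the vocabulary.
docs = [["good", "match"], ["good", "game", "good"]]
vocab = {"good", "match", "game", "bad"}
probs = word_probabilities(docs, vocab)
print(probs["good"])  # (3 + 1) / (5 + 4) ≈ 0.444
print(probs["bad"])   # (0 + 1) / (5 + 4) ≈ 0.111
```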
Summary
• Bayesian methods provide the basis for probabilistic learning
methods that use knowledge about the prior probabilities of
hypotheses and about the probability of observing data given the
hypothesis;
• Bayesian methods can be used to determine the most probable
hypothesis given the data;
• The naive Bayes classifier is useful in many practical
applications.