naïve Bayesian classifier


Bayesian Classification
Bayesian Classification: Why?

• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
• It is based on the famous Bayes theorem.
• A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
• Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Bayes Theorem
• Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as "data record X belongs to a specified class C."
• For classification, we want to determine P(H|X) – the probability that the hypothesis H holds, given the observed data record X.
• P(H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round.
• In contrast, P(H) is the prior probability, or a priori probability, of H. In this example, P(H) is the probability that any given data record is an apple, regardless of how the data record looks.
• The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X.
Bayesian Classification: Simple introduction…
• Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know that it is true that X is an apple.
• P(X) is the prior probability of X, i.e., it is the probability that a data record from our set of fruits is red and round.
• Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes theorem is

  P(H|X) = P(X|H) P(H) / P(X)
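To make the formula concrete, the short Python sketch below simply plugs numbers into the theorem. The red-and-round/apple values are invented for illustration only; the slides do not give numbers for this example.

def posterior(p_x_given_h, p_h, p_x):
    """Bayes theorem: return P(H|X) from the likelihood P(X|H), prior P(H), and evidence P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical values: 80% of apples are red and round, 10% of all fruit
# records are apples, and 20% of all fruit records are red and round.
print(posterior(p_x_given_h=0.8, p_h=0.1, p_x=0.2))  # 0.4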
Bayes Classifier

• A probabilistic framework for solving classification problems
• Conditional probability:

  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)

• Bayes theorem:

  P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem

• Given:
  – A doctor knows that meningitis causes stiff neck 50% of the time
  – Prior probability of any patient having meningitis is 1/50,000
  – Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what's the probability he/she has meningitis?

  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
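The same arithmetic can be checked in a few lines of Python, plugging the slide's numbers straight into Bayes theorem:

p_s_given_m = 0.5     # P(S|M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50000       # P(M): prior probability of meningitis
p_s = 1 / 20          # P(S): prior probability of stiff neck

print(p_s_given_m * p_m / p_s)  # P(M|S) = 0.0002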
Bayesian Classifiers

• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2, …, An):
  – Goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers

• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:

    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
• How to estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier

• Assume independence among attributes Ai when the class is given:

  P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)

• Can estimate P(Ai | Cj) for all Ai and Cj.
• A new point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.
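As a rough illustration of this rule, here is a minimal Python sketch of a naïve Bayes classifier for categorical attributes. The function names and the dict-based record format are invented for this sketch, and no smoothing is applied, so an attribute value never seen with a class gets probability zero.

from collections import Counter, defaultdict

def train(records, labels):
    """Estimate P(Cj) and P(Ai|Cj) by relative frequencies in the training data."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)  # (attribute, class) -> counts of attribute values
    for record, label in zip(records, labels):
        for attr, value in record.items():
            cond_counts[(attr, label)][value] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def classify(record, priors, cond_counts, class_counts):
    """Return the class Cj that maximizes P(Cj) * prod_i P(Ai|Cj)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in record.items():
            score *= cond_counts[(attr, c)][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical usage with two toy training records:
# priors, cond, counts = train([{"Can Fly": "no"}, {"Can Fly": "yes"}],
#                              ["mammals", "non-mammals"])
# classify({"Can Fly": "no"}, priors, cond, counts)  # -> "mammals"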
How to Estimate Probabilities from Data?

• For continuous attributes:
  – Discretize the range into bins
    ◦ one ordinal attribute per bin
    ◦ violates independence assumption
  – Two-way split: (A < v) or (A > v)
    ◦ choose only one of the two splits as new attribute
  – Probability density estimation:
    ◦ Assume attribute follows a normal distribution
    ◦ Use data to estimate parameters of distribution (e.g., mean and standard deviation)
    ◦ Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c)
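The density-estimation option can be sketched in a few lines of Python: fit a normal distribution to a continuous attribute within one class and use its density as the estimate of P(Ai|c). The attribute values below are hypothetical, not taken from the slides.

import math

def normal_pdf(x, mean, std):
    """Normal density used as the conditional probability estimate P(Ai|c)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Hypothetical values of a continuous attribute for the records of one class c
values = [120.0, 95.0, 110.0, 105.0, 130.0]
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
print(normal_pdf(112.0, mean, std))  # estimate of P(Ai = 112 | c)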
Naïve Bayes Classifier: Example 1

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:

Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?
A: attributes
M: mammals
N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N) => Mammals
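The same numbers can be reproduced directly from the counts in the table above (7 mammals, 13 non-mammals, 20 records in total):

# Test record: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no
p_a_given_m = (6 / 7) * (6 / 7) * (2 / 7) * (2 / 7)        # P(A|M) ≈ 0.06
p_a_given_n = (1 / 13) * (10 / 13) * (3 / 13) * (4 / 13)   # P(A|N) ≈ 0.0042

p_m, p_n = 7 / 20, 13 / 20   # class priors from the 20 training records

print(p_a_given_m * p_m)  # ≈ 0.021
print(p_a_given_n * p_n)  # ≈ 0.0027  -> P(A|M)P(M) is larger, so classify as mammal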
Naïve Bayes (Summary)

• Robust to isolated noise points
• Handle missing values by ignoring the instance during probability estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some attributes:
  – It makes computation possible
  – It yields optimal classifiers when satisfied
  – But this is seldom satisfied in practice, as attributes (variables) are often correlated.