
PGM: Tirgul 11
Naïve Bayesian Classifier
+
Tree Augmented Naïve Bayes
(adapted from a tutorial by Nir Friedman and Moises Goldszmidt)
The Classification Problem
From a data set describing objects by vectors of features and a class:

Vector1 = <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3>   Presence
Vector2 = <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3>   Presence
Vector3 = <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3>    Presence
Vector4 = <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6>   Absence
Vector5 = <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6> Absence
Vector6 = <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3>   Presence
Vector7 = <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6> Absence

Find a function F: features → class to classify a new object.
Examples
Predicting heart disease
  Features: cholesterol, chest pain, angina, age, etc.
  Class: {present, absent}
Finding lemons in cars
  Features: make, brand, miles per gallon, acceleration, etc.
  Class: {normal, lemon}
Digit recognition
  Features: matrix of pixel descriptors
  Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
Speech recognition
  Features: signal characteristics, language model
  Class: {pause/hesitation, retraction}
Approaches
Memory based
  Define a distance between samples
  Nearest neighbor, support vector machines
Decision surface
  Find best partition of the space
  CART, decision trees
Generative models
  Induce a model and impose a decision rule
  Bayesian networks
Generative Models
Bayesian classifiers:
  Induce a probability describing the data: P(A1, …, An, C)
  Impose a decision rule. Given a new object <a1, …, an>:
    c = argmax_c P(C = c | a1, …, an)
We have shifted the problem to learning P(A1, …, An, C).
We are learning how to do this efficiently: learn a Bayesian network representation for P(A1, …, An, C).
Optimality of the decision rule
Minimizing the error rate...

Let c_i be the true class, and let l_j be the class returned by the classifier. A decision by the classifier is correct if c_i = l_j, and in error if c_i ≠ l_j. The error incurred by choosing label l_i is

  E(c_i | L) = \sum_{j=1}^{n} \delta(c_i, l_j) \, P(l_j | a) = 1 - P(l_i | a)

where \delta(c_i, l_j) is 1 when l_j differs from the correct label l_i and 0 otherwise. Thus, had we access to P, we would minimize the error rate by choosing l_i when

  P(l_i | a) \geq P(l_j | a) \quad \text{for all } j \neq i,

which is the decision rule for the Bayesian classifier.
Advantages of the Generative Model Approach

Output: a rank over the outcomes (likelihood of present vs. absent)
Explanation: what is the profile of a “typical” person with heart disease?
Missing values: handled both in training and in testing
Value of information: if the person has high cholesterol and blood sugar, which other test should be conducted?
Validation: confidence measures over the model and its parameters
Background knowledge: priors and structure
Evaluating the performance of a classifier: n-fold cross-validation

Partition the data set into n segments D1, D2, D3, ..., Dn.

Do n times (Run 1, Run 2, ..., Run n):
  Train the classifier on the green segments (the n-1 training segments)
  Test accuracy on the red segment (the held-out segment)

Compute statistics on the n runs:
• Variance
• Mean accuracy

Accuracy on test data of size m:

  Acc = \frac{1}{m} \sum_{k=1}^{m} \delta(c^k, l^k)

where \delta(c^k, l^k) is 1 when the label l^k returned for instance k matches its true class c^k.

[Figure: the original data set split into segments D1, ..., Dn; in each run one segment (red) is held out for testing and the rest (green) are used for training.]
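As a rough illustration of the procedure above, here is a minimal Python sketch (not part of the original tutorial); the train and predict callables are hypothetical placeholders for whatever classifier is being evaluated.

```python
import random

def n_fold_cross_validation(data, labels, train, predict, n=5, seed=0):
    """Estimate mean accuracy and variance over n train/test runs.

    train(X, y) is assumed to return a fitted model; predict(model, X)
    returns a list of predicted labels. Both are placeholders.
    """
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    segments = [indices[k::n] for k in range(n)]          # D1, ..., Dn

    accuracies = []
    for k in range(n):                                    # Run 1, ..., Run n
        test_idx = segments[k]
        train_idx = [i for j, seg in enumerate(segments) if j != k for i in seg]
        model = train([data[i] for i in train_idx], [labels[i] for i in train_idx])
        predicted = predict(model, [data[i] for i in test_idx])
        # Acc = (1/m) * sum_k delta(c^k, l^k) over the m held-out instances
        correct = sum(1 for i, l in zip(test_idx, predicted) if labels[i] == l)
        accuracies.append(correct / len(test_idx))

    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / n
    return mean, variance
```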
Advantages of Using a Bayesian Network

Efficiency in learning and query answering
Combine knowledge engineering and statistical induction
Algorithms for decision making, value of information, diagnosis and repair
Heart disease
Data source: UCI repository. Accuracy = 85%.

[Figure: learned Bayesian network over Outcome, Age, Vessels, MaxHeartRate, STSlope, Angina, ChestPain, BloodSugar, Cholesterol, RestBP, Sex, ECG, OldPeak, Thal.]
Problems with BNs as classifiers
When evaluating a Bayesian network, we examine the log-likelihood of the model B given the data D and try to maximize it:

  LL(B | D) = \sum_{i=1}^{N} \log P_B(a_1^i, \ldots, a_n^i, c^i)

When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:

A Bayesian network minimizes the error over all the variables in the domain, not necessarily the local error of the class given the attributes (OK with enough data).

Because of the penalty, a Bayesian network in effect looks only at the small subset of variables that affect a given node (its Markov blanket).
Problems with BNs as classifiers (cont.)
Let’s look closely at the log-likelihood term:

  LL(B | D) = \sum_{i=1}^{N} \log P_B(c^i | a_1^i, \ldots, a_n^i) + \sum_{i=1}^{N} \log P_B(a_1^i, \ldots, a_n^i)

The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.

When there are many attributes, the second term starts to dominate (the magnitude of the log grows as probabilities become small).

Why not use just the first term? We can no longer factorize, and the calculations become much harder.
The Naïve Bayesian Classifier
[Figure: Naïve Bayes structure for diabetes in Pima Indians (from the UCI repository): class C with children F1 = pregnant, F2 = age, F3 = insulin, F4 = dpf, F5 = mass, F6 = glucose.]

Fixed structure encoding the assumption that the features are independent of each other given the class:

  P(C | F_1, \ldots, F_6) \propto P(F_1 | C) \cdot P(F_2 | C) \cdots P(F_6 | C) \cdot P(C)

Learning amounts to estimating the parameters of P(F_i | C) for each F_i.
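To make the factorization concrete, here is a minimal sketch of applying the decision rule once the tables P(C) and P(F_i | C) are available; the dictionary-based tables and the function name are assumptions for illustration, not the tutorial's code.

```python
import math

def naive_bayes_predict(instance, prior, cond):
    """Return argmax_c of P(c) * prod_i P(f_i | c) for one instance.

    prior: dict mapping class c -> P(c)
    cond:  dict mapping (feature index i, value f, class c) -> P(f | c)
    instance: list of discrete feature values [f_1, ..., f_n]
    """
    best_class, best_score = None, float("-inf")
    for c, p_c in prior.items():
        # Sum of logs rather than a product, to avoid numerical underflow.
        score = math.log(p_c)
        for i, f in enumerate(instance):
            score += math.log(cond[(i, f, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```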
The Naïve Bayesian Classifier (cont.)
What do we gain?

We ensure that in the learned network, the probability P(C | A1, …, An) takes every attribute into account.

We will show a polynomial-time algorithm for learning the network.

Estimates are robust: they consist of low-order statistics that require only a few instances.

Naïve Bayes has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.
The Naïve Bayesian Classifier (cont.)
[Figure: the same Naïve Bayes structure: class C with children F1, ..., F6.]

Common practice is to estimate

  \hat{\theta}_{a_i | c} = \frac{N(a_i, c)}{N(c)}

These estimates are identical to the MLE for multinomials.
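A minimal sketch of these count-based estimates, assuming discrete feature vectors paired with class labels (the function and variable names are illustrative); its output matches the table format assumed in the prediction sketch above.

```python
from collections import Counter

def estimate_naive_bayes_parameters(data, labels):
    """MLE estimates: P(c) = N(c) / N and theta_{a_i|c} = N(a_i, c) / N(c)."""
    n_class = Counter(labels)                      # N(c)
    n_joint = Counter()                            # N(a_i, c), per feature i
    for x, c in zip(data, labels):
        for i, a in enumerate(x):
            n_joint[(i, a, c)] += 1

    prior = {c: n / len(labels) for c, n in n_class.items()}
    cond = {(i, a, c): n / n_class[c] for (i, a, c), n in n_joint.items()}
    return prior, cond
```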
Improving Naïve Bayes
Naïve Bayes encodes assumptions of independence that may be unreasonable:
  Are pregnancy and age independent given diabetes?
  Problem: the same evidence may be incorporated multiple times (a rare glucose level and a rare insulin level over-penalize the class variable).

The success of naïve Bayes is attributed to:
  Robust estimation
  The decision may be correct even if the probabilities are inaccurate

Idea: improve on naïve Bayes by weakening the independence assumptions. Bayesian networks provide the appropriate mathematical language for this task.
Tree Augmented Naïve Bayes (TAN)
[Figure: TAN structure for the Pima Indians diabetes data. The class C is a parent of every feature (pregnant, age, insulin, dpf, mass, glucose); in addition, tree edges connect the features themselves.]

  P(C | F_1, \ldots, F_6) \propto P(F_1 | C) \cdot P(F_2 | F_1, C) \cdots P(F_6 | F_3, C) \cdot P(C)

Approximate the dependence among features with a tree Bayes net.

Tree induction algorithm:
  Optimality: maximum likelihood tree
  Efficiency: polynomial algorithm
Robust parameter estimation
Optimal Tree construction algorithm
The procedure of Chow and Liu constructs a tree structure B_T that maximizes LL(B_T | D):

Compute the mutual information between every pair of attributes:

  I(A_i; A_j) = \sum_{a_i, a_j} P_D(a_i, a_j) \log \frac{P_D(a_i, a_j)}{P_D(a_i) P_D(a_j)}

Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as its weight.

Build a maximum weighted spanning tree of this graph.

Complexity: O(n²N) + O(n²) + O(n² log n) = O(n²N), where n is the number of attributes and N is the sample size.
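A minimal sketch of this procedure on discrete data, using empirical counts for the mutual information and a simple Prim-style maximum spanning tree; this is an illustrative reconstruction under those assumptions, not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical I(A_i; A_j) from a list of discrete feature vectors."""
    N = len(data)
    p_i = Counter(x[i] for x in data)
    p_j = Counter(x[j] for x in data)
    p_ij = Counter((x[i], x[j]) for x in data)
    return sum((n / N) * math.log((n / N) / ((p_i[a] / N) * (p_j[b] / N)))
               for (a, b), n in p_ij.items())

def chow_liu_tree(data, n_attrs):
    """Return the edges of a maximum weighted spanning tree over the attributes."""
    weight = {(i, j): mutual_information(data, i, j)
              for i, j in combinations(range(n_attrs), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n_attrs:        # Prim's algorithm on the MI weights
        u, v = max(((i, j) for i, j in weight
                    if (i in in_tree) != (j in in_tree)),
                   key=lambda e: weight[e])
        edges.append((u, v))
        in_tree.update((u, v))
    return edges
```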
Tree construction algorithm (cont.)
It is easy to “plant” the optimal tree in the TAN by revising the algorithm to use a conditional measure that takes the class into account:

  I(A_i; A_j | C) = \sum_{a_i, a_j, c} P_D(a_i, a_j, c) \log \frac{P_D(a_i, a_j | c)}{P_D(a_i | c) P_D(a_j | c)}

This measures the gain in log-likelihood of adding A_i as a parent of A_j when C is already a parent.
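The only change needed in the Chow-Liu sketch above is to condition the counts on the class; a sketch of the conditional measure (again illustrative, with assumed names):

```python
import math
from collections import Counter

def conditional_mutual_information(data, labels, i, j):
    """Empirical I(A_i; A_j | C) from discrete feature vectors and class labels."""
    N = len(data)
    p_c = Counter(labels)
    p_ic = Counter((x[i], c) for x, c in zip(data, labels))
    p_jc = Counter((x[j], c) for x, c in zip(data, labels))
    p_ijc = Counter((x[i], x[j], c) for x, c in zip(data, labels))
    # Sum over (a_i, a_j, c) of P(a_i, a_j, c) * log[ P(a_i, a_j | c) / (P(a_i | c) P(a_j | c)) ]
    return sum((n / N) * math.log((n / p_c[c]) /
                                  ((p_ic[(a, c)] / p_c[c]) * (p_jc[(b, c)] / p_c[c])))
               for (a, b, c), n in p_ijc.items())
```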
Problem with TAN
When estimating parameters we estimate the conditional probability P(A_i | Parents(A_i)). This is done by partitioning the data according to the possible values of Parents(A_i). When a partition contains just a few instances we get an unreliable estimate.

In naïve Bayes the partition was only on the values of the class (and we have to assume that is adequate). In TAN we have twice the number of partitions and get unreliable estimates, especially for small data sets.

Solution:

  \theta^{*}(x \mid \mathrm{Pa}_x) = \alpha \, P_D(x \mid \mathrm{Pa}_x) + (1 - \alpha) \, P_D(x)

where

  \alpha = \frac{N \cdot P_D(\mathrm{Pa}_x)}{N \cdot P_D(\mathrm{Pa}_x) + s}

and s is the smoothing bias, typically small.
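A minimal sketch of the smoothed estimate in terms of the relevant counts; the default value of s and the function name are illustrative assumptions.

```python
def smoothed_estimate(n_x_and_parents, n_parents, n_x, n_total, s=5.0):
    """theta*(x | Pa_x) = alpha * P_D(x | Pa_x) + (1 - alpha) * P_D(x),
    with alpha = N*P_D(Pa_x) / (N*P_D(Pa_x) + s).

    n_x_and_parents: count N(x, Pa_x)
    n_parents:       count N(Pa_x)   (so N * P_D(Pa_x) equals n_parents)
    n_x, n_total:    count N(x) and total sample size N
    s:               smoothing bias, typically small
    """
    alpha = n_parents / (n_parents + s)
    conditional = n_x_and_parents / n_parents if n_parents > 0 else 0.0
    marginal = n_x / n_total
    return alpha * conditional + (1 - alpha) * marginal
```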
Performance: TAN vs. Naïve Bayes
[Scatter plot: accuracy of Naïve Bayes (y-axis) vs. TAN (x-axis) on 25 data sets from the UCI repository (medical, signal processing, financial, games); both axes range from 65 to 100. Accuracy based on 5-fold cross-validation; no parameter tuning.]
Performance: TAN vs. C4.5

[Scatter plot: accuracy of C4.5 (y-axis) vs. TAN (x-axis) on 25 data sets from the UCI repository (medical, signal processing, financial, games); both axes range from 65 to 100. Accuracy based on 5-fold cross-validation; no parameter tuning.]
Beyond TAN
Can we do better by learning a more flexible structure?

Experiment: learn a Bayesian network without restrictions on the structure.
Performance: TAN vs. Bayesian Networks

[Scatter plot: accuracy of unrestricted Bayesian networks (y-axis) vs. TAN (x-axis) on 25 data sets from the UCI repository (medical, signal processing, financial, games); both axes range from 65 to 100. Accuracy based on 5-fold cross-validation; no parameter tuning.]
Classification: Summary
Bayesian networks provide a useful language to improve Bayesian classifiers.
  Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.

Additional benefits:
  Missing values
  Compute the tradeoffs involved in finding out feature values
  Compute misclassification costs

Recent progress:
  Combine generative probabilistic models, such as Bayesian networks, with decision surface approaches such as Support Vector Machines.