A Survey on Using Bayesian Reasoning in Data Mining
Supervised by: Dr. Rahgozar
Mostafa Haghir Chehreghani
Outline
- Bayes theorem
- MAP and ML hypotheses
- Minimum description length principle
- Bayes optimal classifier
- Naïve Bayes learner
- Summary
Two Roles for Bayesian Methods
- Provides practical learning algorithms:
  - Naïve Bayes learning
  - Bayesian belief network learning
  - Combine prior knowledge with observed data
  - Requires prior probabilities
- Provides useful conceptual framework:
  - Provides a gold standard for evaluating other learning algorithms
  - Additional insight into Ockham's razor
Bayes Theorem
  P(h | D) = P(D | h) P(h) / P(D)
- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h|D) = probability of h given D
- P(D|h) = probability of D given h
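As an illustration of the theorem (not part of the slides), here is a minimal Python sketch with made-up numbers for a hypothetical test; the prior, likelihood, and false-positive rate are chosen only to show how the pieces fit together:

    # Hypothetical numbers, invented for the example.
    p_h = 0.008             # P(h): prior probability the hypothesis is true
    p_d_given_h = 0.98      # P(D|h): probability of the observed data if h is true
    p_d_given_not_h = 0.03  # P(D|~h): probability of the data if h is false

    # P(D) by total probability, then the posterior P(h|D).
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
    p_h_given_d = p_d_given_h * p_h / p_d
    print(round(p_h_given_d, 3))  # roughly 0.21 with these made-up numbers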
Choosing Hypotheses
- Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis h_MAP:
  h_MAP = argmax_{h ∈ H} P(h | D)
        = argmax_{h ∈ H} P(D | h) P(h) / P(D)
        = argmax_{h ∈ H} P(D | h) P(h)
- If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:
  h_ML = argmax_{h_i ∈ H} P(D | h_i)
Why Likelihood Functions Are Great
- MLEs achieve the Cramér-Rao lower bound (asymptotically):
  - The CRLB on variance is the inverse of the Fisher information, i.e. the negative expected second derivative of the log-likelihood. Any unbiased estimator must have variance greater than or equal to the CRLB.
- The Neyman-Pearson lemma:
  - A likelihood ratio test has the minimum possible Type II error of any test with the significance level α that we selected.
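For reference, the standard textbook statement of the bound mentioned above, for an unbiased estimator of a scalar parameter θ (not taken from the slides):

    \operatorname{Var}(\hat{\theta}) \;\ge\; \frac{1}{I(\theta)},
    \qquad
    I(\theta) = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial \theta^2} \ln L(\theta; X)\right]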
Learning a Real Valued Function
- Consider any real-valued target function f
- Training examples <x_i, d_i>, where d_i is a noisy training value:
  - d_i = f(x_i) + e_i
  - e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with μ = 0
- Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:
  h_ML = argmin_{h ∈ H} Σ_{i=1..m} (d_i − h(x_i))²
Learning a Real Valued Function
  h_ML = argmax_{h ∈ H} p(D | h)
       = argmax_{h ∈ H} Π_{i=1..m} p(d_i | h)
       = argmax_{h ∈ H} Π_{i=1..m} (1 / sqrt(2πσ²)) · exp( −(1/2) ((d_i − h(x_i)) / σ)² )
- Maximize the natural log of this instead:
  h_ML = argmax_{h ∈ H} Σ_{i=1..m} [ ln(1 / sqrt(2πσ²)) − (1/2) ((d_i − h(x_i)) / σ)² ]
       = argmax_{h ∈ H} Σ_{i=1..m} [ −(1/2) ((d_i − h(x_i)) / σ)² ]
       = argmax_{h ∈ H} Σ_{i=1..m} [ −(d_i − h(x_i))² ]
       = argmin_{h ∈ H} Σ_{i=1..m} (d_i − h(x_i))²
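As a sketch of this conclusion (not from the slides), the following numpy snippet fits a linear hypothesis to synthetic data with Gaussian noise; for this hypothesis class, minimizing the sum of squared errors via ordinary least squares is exactly the h_ML of the derivation:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    d = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)  # d_i = f(x_i) + e_i, Gaussian noise

    # Hypothesis class: h(x) = w1*x + w0. Minimizing the sum of squared errors
    # (the h_ML of the derivation) is ordinary least squares.
    A = np.column_stack([x, np.ones_like(x)])
    w, *_ = np.linalg.lstsq(A, d, rcond=None)
    print(w)  # should be close to [2.0, 1.0] for this synthetic data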
Learning to Predict Probabilities
- Consider predicting survival probability from patient data
- Training examples <x_i, d_i>, where d_i is 1 or 0
- Want to train a neural network to output a probability given x_i (not a 0 or 1)
- In this case one can show that
  h_ML = argmax_{h ∈ H} Σ_{i=1..m} [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]
- Weight update rule for a sigmoid unit:
  w_jk ← w_jk + Δw_jk
  where
  Δw_jk = η Σ_{i=1..m} (d_i − h(x_i)) x_ijk
Minimum Description Length Principle
- Ockham's razor: prefer the shortest hypothesis
- MDL: prefer the hypothesis h that minimizes
  h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D | h) ]
  where L_C(x) is the description length of x under encoding C
- Example: H = decision trees, D = training data labels
  - L_C1(h) is the number of bits to describe tree h
  - L_C2(D | h) is the number of bits to describe D given h
  - Hence h_MDL trades off tree size against training errors
Minimum Description Length Principle
  h_MAP = argmax_{h ∈ H} P(D | h) P(h)
        = argmax_{h ∈ H} [ log2 P(D | h) + log2 P(h) ]
        = argmin_{h ∈ H} [ −log2 P(D | h) − log2 P(h) ]    (1)
- Interesting fact from information theory:
  - The optimal (shortest expected coding length) code for an event with probability p is −log2 p bits
- So interpret (1):
  - −log2 P(h) is the length of h under the optimal code
  - −log2 P(D|h) is the length of D given h under the optimal code
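To make the connection in (1) concrete, a tiny Python illustration with invented probabilities: choosing the hypothesis with the shortest total code length, −log2 P(h) − log2 P(D|h) bits, picks out the same hypothesis as MAP.

    from math import log2

    # Invented priors and likelihoods, just to show the bookkeeping.
    P_h = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
    P_D_given_h = {"h1": 0.02, "h2": 0.10, "h3": 0.08}

    def total_code_length(h):
        # -log2 P(h) bits to describe h, plus -log2 P(D|h) bits to describe D given h
        return -log2(P_h[h]) - log2(P_D_given_h[h])

    h_mdl = min(P_h, key=total_code_length)
    h_map = max(P_h, key=lambda h: P_D_given_h[h] * P_h[h])
    print(h_mdl, h_map)  # the two criteria agree ("h2" in both cases here)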
Most Probable Classification of New Instances
- So far we've sought the most probable hypothesis given the data D (i.e., h_MAP)
- Given a new instance x, what is its most probable classification?
  - h_MAP(x) is not the most probable classification!
- Consider three possible hypotheses:
  P(h1 | D) = .4,  P(h2 | D) = .3,  P(h3 | D) = .3
- Given a new instance x:
  h1(x) = +,  h2(x) = −,  h3(x) = −
- What's the most probable classification of x?
Bayes Optimal Classifier
- Bayes optimal classification:
  argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
- Example:
  P(h1 | D) = .4,  P(− | h1) = 0,  P(+ | h1) = 1
  P(h2 | D) = .3,  P(− | h2) = 1,  P(+ | h2) = 0
  P(h3 | D) = .3,  P(− | h3) = 1,  P(+ | h3) = 0
  therefore
  Σ_{h_i ∈ H} P(+ | h_i) P(h_i | D) = .4
  Σ_{h_i ∈ H} P(− | h_i) P(h_i | D) = .6
  and
  argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D) = −
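The same example written as a short Python sketch; the numbers are exactly those from the slide, and "+"/"-" stand for the two class values:

    # Posterior over hypotheses and each hypothesis's class distribution for x,
    # as in the example above.
    P_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    P_v_given_h = {
        "h1": {"+": 1.0, "-": 0.0},
        "h2": {"+": 0.0, "-": 1.0},
        "h3": {"+": 0.0, "-": 1.0},
    }

    def bayes_optimal(values=("+", "-")):
        # argmax_v sum_h P(v|h) P(h|D)
        return max(values, key=lambda v: sum(P_v_given_h[h][v] * P_h_given_D[h]
                                             for h in P_h_given_D))

    print(bayes_optimal())  # "-" (0.6 beats 0.4)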
Gibbs Classifier
- The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses
- Gibbs algorithm:
  1. Choose one hypothesis at random, according to P(h | D)
  2. Use this to classify the new instance
- Surprising fact: assume target concepts are drawn at random from H according to the priors on H. Then:
  E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
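A minimal Python sketch of the Gibbs procedure, reusing the posterior from the example above; the fixed labels here simply stand in for each hypothesis's prediction h_i(x):

    import random

    P_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    predictions = {"h1": "+", "h2": "-", "h3": "-"}  # h_i(x) for the new instance x

    def gibbs_classify():
        # 1. Sample one hypothesis according to P(h|D); 2. use it to classify x.
        (h,) = random.choices(list(P_h_given_D),
                              weights=list(P_h_given_D.values()), k=1)
        return predictions[h]

    print(gibbs_classify())  # "+" with probability 0.4, "-" with probability 0.6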
Naive Bayes Classifier
- Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods
- When to use:
  - A moderate or large training set is available
  - Attributes that describe instances are conditionally independent given the classification
- Successful applications:
  - Diagnosis
  - Classifying text documents
Naive Bayes Classifier
- Assume a target function f: X → V, where each instance x is described by attributes <a_1, a_2, ..., a_n>.
- The most probable value of f(x) is:
  v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, ..., a_n)
        = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)
        = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j)
- Naïve Bayes assumption:
  P(a_1, a_2, ..., a_n | v_j) = Π_i P(a_i | v_j)
- Which gives the naïve Bayes classifier:
  v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)
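A compact Python sketch of v_NB using relative-frequency estimates and no smoothing (so an unseen attribute value zeroes out a class); the weather-style toy data set is invented for the illustration:

    from collections import Counter, defaultdict

    # Invented toy training set: (attributes, class value).
    data = [
        (("sunny", "hot"),  "no"),
        (("sunny", "mild"), "no"),
        (("rain",  "mild"), "yes"),
        (("rain",  "cool"), "yes"),
        (("sunny", "cool"), "yes"),
    ]

    class_counts = Counter(v for _, v in data)
    # attr_counts[(i, a, v)] = number of class-v examples with value a for attribute i
    attr_counts = defaultdict(int)
    for attrs, v in data:
        for i, a in enumerate(attrs):
            attr_counts[(i, a, v)] += 1

    def classify(attrs):
        # v_NB = argmax_v P(v) * prod_i P(a_i | v), estimated by relative frequencies
        def score(v):
            p = class_counts[v] / len(data)
            for i, a in enumerate(attrs):
                p *= attr_counts[(i, a, v)] / class_counts[v]
            return p
        return max(class_counts, key=score)

    print(classify(("sunny", "cool")))  # "yes" for this toy data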
Conditional Independence
- Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
  (∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
  More compactly, we write
  P(X | Y, Z) = P(X | Z)
- Example: Thunder is conditionally independent of Rain, given Lightning
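A short numeric check of this definition in Python; the joint distribution below is invented and built as P(Z)·P(X|Z)·P(Y|Z), so X is conditionally independent of Y given Z by construction:

    from itertools import product

    P_Z = {0: 0.6, 1: 0.4}
    P_X_given_Z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
    P_Y_given_Z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}
    joint = {(x, y, z): P_Z[z] * P_X_given_Z[z][x] * P_Y_given_Z[z][y]
             for x, y, z in product((0, 1), repeat=3)}

    def p(pred):
        # Probability of the event described by pred(x, y, z) under the joint.
        return sum(prob for xyz, prob in joint.items() if pred(*xyz))

    # Check P(X=1 | Y=y, Z=z) == P(X=1 | Z=z) for all y, z.
    for y0, z0 in product((0, 1), repeat=2):
        lhs = p(lambda x, y, z: x == 1 and y == y0 and z == z0) / p(lambda x, y, z: y == y0 and z == z0)
        rhs = p(lambda x, y, z: x == 1 and z == z0) / p(lambda x, y, z: z == z0)
        assert abs(lhs - rhs) < 1e-9
    print("conditional independence holds")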
Inference in Bayesian Networks
- How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
  - The Bayes net contains all the information needed for this inference
  - If only one variable has an unknown value, it is easy to infer
  - In the general case, the problem is NP-hard
- In practice, inference can succeed in many cases:
  - Exact inference methods work well for some network structures
  - Monte Carlo methods simulate the network randomly to calculate approximate solutions
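A hedged sketch of the Monte Carlo idea in Python: sample the network forward many times and keep only the samples consistent with the evidence (rejection sampling) to approximate a posterior. The network structure and probabilities are invented, loosely echoing the Storm/Lightning/Thunder variables mentioned in these slides:

    import random

    random.seed(0)

    def sample_net():
        # Hypothetical tiny net: Storm -> Lightning -> Thunder, Storm -> Rain.
        storm = random.random() < 0.2
        lightning = random.random() < (0.6 if storm else 0.05)
        thunder = random.random() < (0.9 if lightning else 0.01)
        rain = random.random() < (0.8 if storm else 0.1)
        return storm, lightning, thunder, rain

    # Approximate P(Storm = True | Thunder = True) by rejection sampling.
    kept = [s for s, l, t, r in (sample_net() for _ in range(100_000)) if t]
    print(sum(kept) / len(kept))  # roughly 0.7 with these made-up probabilities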
Learning Bayes Nets
- Suppose the structure is known and the variables are partially observable
  - e.g. observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire, ...
- Similar to training a neural network with hidden units
- In fact, can learn the network's conditional probability tables using gradient ascent!
- Converges to a network h that (locally) maximizes P(D | h)
More on Learning Bayes Nets
- The EM algorithm can also be used. Repeatedly:
  1. Calculate probabilities of unobserved variables, assuming h
  2. Calculate new w_ijk to maximize E[ln P(D | h)], where D now includes both the observed and the (calculated probabilities of) unobserved variables
- When the structure is unknown...
  - Algorithms use greedy search to add/subtract edges and nodes
  - Active research topic
Summary
- Combine prior knowledge with observed data
- Impact of prior knowledge (when correct!) is to lower the sample complexity
- Active research area:
  - Extend from boolean to real-valued variables
  - Parameterized distributions instead of tables
  - Extend to first-order instead of propositional systems
  - More effective inference methods
  - ...
Any Questions?