Bayesian Networks


Bayesian Networks
Graphical Models
• Bayesian networks
• Conditional random fields
• etc.
[Diagram: a naive Bayes model with root node Topic=Course and child nodes “Grading”, “Instructor”, “Syllabus”, “Exams”, “Vapnik”]
P(Course ∧ Grading ∧ Instructor ∧ Syllabus ∧ Exams ∧ Vapnik)
  = P(Course) · P(Grading | Course) · P(Instructor | Course) · P(Syllabus | Course) · P(Exams | Course) · P(Vapnik | Course)
Naive Bayes:

P(C = c_i | X_1 = x_1, ..., X_n = x_n) ∝ P(C = c_i) · ∏_{j=1}^{n} P(X_j = x_j | C = c_i)
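As a minimal sketch of how this factorization is used in practice, the Python snippet below scores classes for a small “bag of words” document; the class names and probability values are illustrative placeholders, not taken from the slides.

import math

# Naive Bayes scoring: P(C=c | x_1,...,x_n) ∝ P(C=c) · ∏_j P(X_j=x_j | C=c).
# Priors and word likelihoods here are made-up illustrative numbers.
prior = {"Course": 0.01, "Other": 0.99}
likelihood = {                                   # P(word appears | class)
    "Course": {"Grading": 0.7, "Syllabus": 0.8, "Vapnik": 0.1},
    "Other":  {"Grading": 0.05, "Syllabus": 0.05, "Vapnik": 0.02},
}

def posterior(words):
    """Score each class by prior times product of word likelihoods, then normalize."""
    scores = {}
    for c in prior:
        log_score = math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)
        scores[c] = math.exp(log_score)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(posterior(["Grading", "Syllabus"]))        # posterior over {Course, Other}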
Bayesian networks
• The idea is to represent the dependencies (or causal relations) among
all the variables so that space and computation-time
requirements are minimized.
[Diagram: a network with two root nodes, Topic=Russian Mathematicians and Topic=Course; Vapnik is a child of both roots, and Syllabus is a child of Topic=Course]

Conditional probability tables for each node:

P(Russian Math):
  true   0.2
  false  0.8

P(Course):
  true   0.01
  false  0.99

P(Vapnik | Russian Math, Course):
  R.M.    Course    Vapnik=true    Vapnik=false
  true    true      0.95           0.05
  true    false     0.8            0.2
  false   true      0.6            0.4
  false   false     0.05           0.95

P(Syllabus | Course):
  Course    Syllabus=true    Syllabus=false
  true      0.9              0.1
  false     0.2              0.8
Semantics of Bayesian networks
• If the network is correct, the full joint probability
distribution can be calculated from the network:

P((X_1 = x_1) ∧ (X_2 = x_2) ∧ ... ∧ (X_n = x_n))
  = P((X_1 = x_1) | (X_2 = x_2) ∧ ... ∧ (X_n = x_n))        (chain rule)
  × P((X_2 = x_2) | (X_3 = x_3) ∧ ... ∧ (X_n = x_n))
  × ...
  × P(X_n = x_n)
  = ∏_{i=1}^{n} P(X_i = x_i | parents(X_i))

where parents(X_i) denotes specific values of the parents of X_i.
(The last step uses the fact that each node is conditionally independent
of its non-descendants given its parents, which is exactly what the
network structure encodes.)
Compare with Naïve Bayes.
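As a small sketch, this product can be evaluated directly from the conditional probability tables of the Course / Russian Mathematicians network shown earlier; the Python variable and function names below are mine, while the numbers are the CPT entries from that slide.

# Full joint via P(x_1,...,x_n) = ∏_i P(x_i | parents(X_i)),
# using the CPTs from the Course / Russian Mathematicians example.
P_rm_true = 0.2                                              # P(RussianMath = true)
P_course_true = 0.01                                         # P(Course = true)
P_vapnik_true = {(True, True): 0.95, (True, False): 0.8,     # keyed by (RussianMath, Course)
                 (False, True): 0.6, (False, False): 0.05}
P_syllabus_true = {True: 0.9, False: 0.2}                    # keyed by Course

def bern(p_true, value):
    """P(X = value) for a Boolean X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(rm, course, vapnik, syllabus):
    """Joint probability as the product of each node's CPT entry given its parents."""
    return (bern(P_rm_true, rm)
            * bern(P_course_true, course)
            * bern(P_vapnik_true[(rm, course)], vapnik)
            * bern(P_syllabus_true[course], syllabus))

For instance, joint(True, False, False, False) evaluates exactly the query worked out on the next slide.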
Example
• Calculate
P[(RussianMath = true) ∧ (Course = false) ∧ (Vapnik = false) ∧ (Syllabus = false)]
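Using the chain-rule factorization and the CPT values from the earlier slide (as reconstructed above), this works out to:

P = P(RM = true) · P(Course = false) · P(Vapnik = false | RM = true, Course = false) · P(Syllabus = false | Course = false)
  = 0.2 × 0.99 × 0.2 × 0.8
  ≈ 0.0317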
Examples
What is the unconditional (marginal) probability that Vapnik is
true?
What is the unconditional (marginal) probability that
Russian Mathematicians is true?
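As a sketch of the first calculation, sum out the parents of Vapnik, using the CPT values from the earlier slide:

P(Vapnik = true) = Σ_{rm} Σ_{c} P(rm) · P(c) · P(Vapnik = true | rm, c)
  = 0.2·0.01·0.95 + 0.2·0.99·0.8 + 0.8·0.01·0.6 + 0.8·0.99·0.05
  ≈ 0.205

The second question is immediate from the prior table, since Russian Mathematicians has no parents: P(RussianMath = true) = 0.2.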
Different types of inference in Bayesian Networks
Causal inference
Evidence is a cause; the inference is the probability of an effect.
Example: Instantiate evidence Course = true. What is P(Syllabus | Course)?
Diagnostic inference
Evidence is an effect; the inference is the probability of a cause.
Example: Instantiate evidence Syllabus = true. What is P(Course |
Syllabus)?
Example: What is P(Course | Vapnik)?
Inter-causal inference
Explaining away different possible causes of an effect.
Example: What is P(Course | Vapnik, RussianMath)?
Why is P(Course | Vapnik, RussianMath) <
P(Course | Vapnik)?
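As a sketch of the diagnostic case, P(Course | Vapnik) can be computed with Bayes' rule from the CPTs reconstructed earlier (so these numbers inherit that reconstruction):

P(Vapnik = true | Course = true) = Σ_{rm} P(rm) · P(Vapnik = true | rm, Course = true) = 0.2·0.95 + 0.8·0.6 = 0.67
P(Vapnik = true | Course = false) = 0.2·0.8 + 0.8·0.05 = 0.20
P(Vapnik = true) = 0.01·0.67 + 0.99·0.20 ≈ 0.205
P(Course = true | Vapnik = true) = 0.01·0.67 / 0.205 ≈ 0.033

So observing “Vapnik” raises the probability of Topic=Course from the prior 0.01 to roughly 0.03.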
Complexity of Bayesian Networks
For n random Boolean variables:
• Full joint probability distribution: 2^n entries
• Bayesian network with at most k parents per node:
– Each conditional probability table: at most 2^k entries
– Entire network: n · 2^k entries
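For a rough sense of the savings (numbers chosen only for illustration), take n = 20 Boolean variables with at most k = 3 parents each:

Full joint distribution: 2^20 = 1,048,576 entries
Bayesian network: 20 · 2^3 = 160 entries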
What are the advantages
of Bayesian networks?
• Intuitive, concise representation of joint probability
distribution (i.e., conditional dependencies) of a set of
random variables.
• Represents “beliefs and knowledge” about a particular
class of situations.
• Efficient (?) (approximate) inference algorithms
• Efficient, effective learning algorithms
Issues in Bayesian Networks
• Building / learning network topology
• Assigning / learning conditional probability tables
• Approximate inference via sampling
• In general, however, exact inference in Bayesian networks
is too expensive.
Approximate inference in Bayesian networks
Instead of enumerating all possibilities, sample to estimate
probabilities.
[Diagram: a Bayesian network over variables X1, X2, X3, ...]
General question: What is P(X|e)?
Notation convention: upper-case letters refer to random variables;
lower-case letters refer to specific values of those variables
Direct Sampling
• Suppose we have no evidence, but we want to determine
P(C,S,R,W) for all C,S,R,W.
• Direct sampling:
– Sample each variable in topological order, conditioned
on values of parents.
– I.e., always sample from P(Xi | parents(Xi))
Example
1. Sample from P(Cloudy). Suppose returns true.
2. Sample from P(Sprinkler | Cloudy = true). Suppose
returns false.
3. Sample from P(Rain | Cloudy = true). Suppose returns
true.
4. Sample from P(WetGrass | Sprinkler = false, Rain = true).
Suppose returns true.
Here is the sampled event: [true, false, true, true]
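Below is a minimal Python sketch of direct sampling for this Cloudy/Sprinkler/Rain/WetGrass network; the CPT numbers are illustrative placeholders, not values taken from the slides.

import random
from collections import Counter

# Illustrative CPTs for the Cloudy/Sprinkler/Rain/WetGrass network (placeholder numbers).
P_cloudy_true = 0.5
P_sprinkler_true = {True: 0.1, False: 0.5}             # keyed by Cloudy
P_rain_true = {True: 0.8, False: 0.2}                  # keyed by Cloudy
P_wet_true = {(True, True): 0.99, (True, False): 0.9,  # keyed by (Sprinkler, Rain)
              (False, True): 0.9, (False, False): 0.0}

def direct_sample():
    """Sample every variable in topological order from P(Xi | parents(Xi))."""
    cloudy = random.random() < P_cloudy_true
    sprinkler = random.random() < P_sprinkler_true[cloudy]
    rain = random.random() < P_rain_true[cloudy]
    wet = random.random() < P_wet_true[(sprinkler, rain)]
    return (cloudy, sprinkler, rain, wet)

# Estimate each joint entry P(c, s, r, w) by its observed frequency N_S(x) / N.
N = 100_000
counts = Counter(direct_sample() for _ in range(N))
print(counts[(True, False, True, True)] / N)           # estimate for the event sampled above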
• Suppose there are N total samples, and let N_S(x_1, ..., x_n) be
the observed frequency of the specific event x_1, ..., x_n.

lim_{N → ∞} N_S(x_1, ..., x_n) / N = P(x_1, ..., x_n)

so, for large N,

N_S(x_1, ..., x_n) / N ≈ P(x_1, ..., x_n)
• Suppose N samples, n nodes. Complexity O(Nn).
• Problem 1: Need lots of samples to get good probability
estimates.
• Problem 2: Many samples are not realistic; low likelihood.
Markov Chain Monte Carlo Sampling
• One of the most common methods used in real applications.
• Uses the idea of the “Markov blanket” of a variable Xi:
– its parents, its children, and its children’s parents
Illustration of “Markov Blanket”
[Diagram: a node X shown with its parents, its children, and its children’s other parents]
• Recall that, by construction of a Bayesian network, a node
is conditionally independent of its non-descendants, given
its parents.
• Proposition: A node Xi is conditionally independent of all
other nodes in the network, given its Markov blanket.
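This is what makes the MCMC update described next cheap: the conditional of a node given its Markov blanket has a simple closed form (a standard result, stated here for reference rather than taken from the slides):

P(x_i | mb(X_i)) ∝ P(x_i | parents(X_i)) · ∏_{Y_j ∈ children(X_i)} P(y_j | parents(Y_j))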
Markov Chain Monte Carlo Sampling
Algorithm
• Start with a random sample of the variables: (x_1, ..., x_n). This
is the current “state” of the algorithm.
• Next state: randomly sample a value for one non-evidence
variable Xi, conditioned on the current values in the “Markov
blanket” of Xi.
Example
• Query: What is P(Rain | Sprinkler = true, WetGrass = true)?
• MCMC:
– Random sample, with evidence variables fixed:
[true, true, false, true]
– Repeat:
1. Sample Cloudy, given current values of its Markov blanket:
Sprinkler = true, Rain = false. Suppose the result is false. New
state: [false, true, false, true]
2. Sample Rain, given current values of its Markov blanket:
Cloudy = false, Sprinkler = true, WetGrass = true. Suppose the
result is true. New state: [false, true, true, true]
• Each sample contributes to estimate for query
P(Rain | Sprinkler = true, WetGrass = true)
• Suppose we perform 100 such samples, 20 with Rain =
true and 80 with Rain = false.
• Then the answer to the query is
Normalize((20, 80)) = (0.20, 0.80)
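A compact Gibbs-sampling sketch of this query is given below, reusing the illustrative sprinkler CPT values from the direct-sampling example; the numbers and helper names are assumptions, not slide content.

import random

# Illustrative CPTs (same placeholder values as in the direct-sampling sketch).
P_cloudy_true = 0.5
P_sprinkler_true = {True: 0.1, False: 0.5}             # keyed by Cloudy
P_rain_true = {True: 0.8, False: 0.2}                  # keyed by Cloudy
P_wet_true = {(True, True): 0.99, (True, False): 0.9,  # keyed by (Sprinkler, Rain)
              (False, True): 0.9, (False, False): 0.0}

def bern(p_true, value):
    """P(X = value) for a Boolean X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def sample_weighted(w_true, w_false):
    """Sample True/False in proportion to the two unnormalized weights."""
    return random.random() < w_true / (w_true + w_false)

def gibbs_estimate(n_samples=10_000):
    # Evidence is fixed: Sprinkler = true, WetGrass = true.
    sprinkler, wet = True, True
    # Non-evidence variables start at random values.
    cloudy, rain = random.random() < 0.5, random.random() < 0.5
    rain_true_count = 0
    for _ in range(n_samples):
        # Resample Cloudy given its Markov blanket (Sprinkler, Rain):
        #   P(c | s, r) ∝ P(c) · P(s | c) · P(r | c)
        w_t = P_cloudy_true * bern(P_sprinkler_true[True], sprinkler) * bern(P_rain_true[True], rain)
        w_f = (1 - P_cloudy_true) * bern(P_sprinkler_true[False], sprinkler) * bern(P_rain_true[False], rain)
        cloudy = sample_weighted(w_t, w_f)
        # Resample Rain given its Markov blanket (Cloudy, Sprinkler, WetGrass):
        #   P(r | c, s, w) ∝ P(r | c) · P(w | s, r)
        w_t = P_rain_true[cloudy] * bern(P_wet_true[(sprinkler, True)], wet)
        w_f = (1 - P_rain_true[cloudy]) * bern(P_wet_true[(sprinkler, False)], wet)
        rain = sample_weighted(w_t, w_f)
        rain_true_count += rain
    return rain_true_count / n_samples                 # estimate of P(Rain = true | evidence)

print(gibbs_estimate())

Each iteration counts the current value of Rain, so the returned fraction is the normalized estimate described above.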
• Claim: “The sampling process settles into a dynamic
equilibrium in which the long-run fraction of time spent in
each state is exactly proportional to its posterior
probability, given the evidence.”
• That is: for all variables Xi, the probability of the value xi
of Xi appearing in a sample is equal to P(xi | e).
Claim (again)
• Claim: MCMC settles into behavior in which each state is
sampled exactly according to its posterior probability,
given the evidence.