Directed Graphical Models: Bayes Nets and Probabilities

Bayes Nets and Probabilities
Oliver Schulte
Machine Learning 726
Bayes Nets: General Points
 Represent domain knowledge.
 Allow for uncertainty.
 Complete representation of probabilistic knowledge.
 Represent causal relations.
 Fast answers to types of queries:
 Probabilistic: What is the probability that a patient has strep
throat given that they have fever?
 Relevance: Is fever relevant to having strep throat?
2/57
Bayes Net Links
• Judea Pearl's Turing Award
• See UBC’s AISpace
3/57
Probability Reasoning (With Bayes Nets)
4/57
Random Variables
 A random variable has a probability
associated with each of its values.
 A basic statement assigns a value to a random
variable.
Variable    Value     Probability
Weather     Sunny     0.7
Weather     Rainy     0.2
Weather     Cloudy    0.08
Weather     Snow      0.02
Cavity      True      0.2
Cavity      False     0.8
5/57
Probability for Sentences
 A sentence or query is formed by using “and”,
“or”, “not” recursively with basic statements.
 Sentences also have probabilities assigned to
them.
Sentence                                    Probability
P(Cavity = false AND Toothache = false)     0.72
P(Cavity = true OR Toothache = false)       0.08
6/57
Probability Notation
 Often probability theorists write A, B instead of A ∧ B (like Prolog).
 If the intended random variables are known, they
are often not mentioned.
Shorthand                                Full Notation
P(Cavity = false, Toothache = false)     P(Cavity = false ∧ Toothache = false)
P(false, false)                          P(Cavity = false ∧ Toothache = false)
7/57
Axioms of probability
 For any formula A, B
 0 ≤ P(A) ≤ 1
 P(true) = 1 and P(false) = 0
 P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
 P(A) = P(B) if A and B are
logically equivalent.
 Formulas can be viewed as sets of complete assignments.
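A quick numerical check of the disjunction axiom, using the Cavity/Toothache numbers that appear later in the deck (a Python sketch, not part of the slides):

```python
# P(Cavity OR Toothache) = P(Cavity) + P(Toothache) - P(Cavity AND Toothache)
p_cavity, p_toothache, p_both = 0.2, 0.2, 0.12   # values used later in the deck
print(p_cavity + p_toothache - p_both)           # 0.28
```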
8/57
Rule 1: Logical Equivalence
P(NOT (NOT Cavity))              P(Cavity)                          0.2    0.2
P(NOT (Cavity AND Toothache))    P(Cavity = F OR Toothache = F)     0.88   0.88
P(NOT (Cavity OR Toothache))     P(Cavity = F AND Toothache = F)    0.72   0.72
9/57
The Logical Equivalence Pattern
P(NOT (NOT Cavity)) = P(Cavity): 0.2 = 0.2
P(NOT (Cavity AND Toothache)) = P(Cavity = F OR Toothache = F): 0.88 = 0.88
P(NOT (Cavity OR Toothache)) = P(Cavity = F AND Toothache = F): 0.72 = 0.72

Rule 1: Logically equivalent expressions have the same probability.
10/57
Rule 2: Marginalization
P(Cavity, Toothache)        P(Cavity, Toothache = F)        P(Cavity)
0.12                        0.08                            0.2

P(Cavity = F, Toothache)    P(Cavity = F, Toothache = F)    P(Cavity = F)
0.08                        0.72                            0.8

P(Cavity, Toothache)        P(Cavity = F, Toothache)        P(Toothache)
0.12                        0.08                            0.2
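A small Python sketch of Rule 2, summing out one variable from the joint distribution in the table above:

```python
# Joint distribution over (Cavity, Toothache); numbers from the table above.
joint = {
    (True, True): 0.12,    # Cavity, Toothache
    (True, False): 0.08,   # Cavity, Toothache = F
    (False, True): 0.08,   # Cavity = F, Toothache
    (False, False): 0.72,  # Cavity = F, Toothache = F
}

def p_cavity(value):
    """P(Cavity = value): sum out Toothache (marginalization)."""
    return sum(p for (c, t), p in joint.items() if c == value)

def p_toothache(value):
    """P(Toothache = value): sum out Cavity (marginalization)."""
    return sum(p for (c, t), p in joint.items() if t == value)

print(p_cavity(True), p_toothache(True))  # 0.2 and 0.2
```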
11/57
The Marginalization Pattern
P(Cavity, Toothache) + P(Cavity, Toothache = F) = P(Cavity): 0.12 + 0.08 = 0.2
P(Cavity = F, Toothache) + P(Cavity = F, Toothache = F) = P(Cavity = F): 0.08 + 0.72 = 0.8
P(Cavity, Toothache) + P(Cavity = F, Toothache) = P(Toothache): 0.12 + 0.08 = 0.2
12/57
Prove the Pattern: Marginalization
 Theorem. P(A) = P(A,B) + P(A, not B)
 Proof.
1. A is logically equivalent to [(A and B) ∨ (A and not B)].
2. P(A) = P([(A and B) ∨ (A and not B)]) = P(A and B) + P(A and not B) - P([(A and B) ∧ (A and not B)]). (Disjunction Rule.)
3. [(A and B) ∧ (A and not B)] is logically equivalent to false, so P([(A and B) ∧ (A and not B)]) = 0.
4. So 2. implies P(A) = P(A and B) + P(A and not B).
13/57
Completeness of Bayes Nets
 A probabilistic query system is complete if it can
compute a probability for every sentence.
 Proposition: A Bayes net is complete. Proof has two steps.
1. Any system that encodes the joint distribution is complete.
2. A Bayes net encodes the joint distribution.
14/57
The Joint Distribution
15/57
Assigning Probabilities to Sentences
 A complete assignment is a conjunctive sentence that
assigns a value to each random variable.
 The joint probability distribution specifies a
probability for each complete assignment.
 A joint distribution determines a probability for every sentence.
 How? Spot the pattern.
16/57
Probabilities for Sentences: Spot the
Pattern
Sentence                                    Probability
P(Cavity = false AND Toothache = false)     0.72
P(Cavity = true OR Toothache = false)       0.08
P(Toothache = false)                        0.8
17/57
Inference by enumeration
18/57
Inference by enumeration
 Marginalization: For any sentence A, sum the joint
probabilities for the complete assignments where A is true.
 P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2.
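A sketch of inference by enumeration in Python. The four entries with Toothache = true are the ones summed above; the remaining four entries are not shown on the slide and are filled in here with the standard textbook values (an assumption):

```python
# Full joint over (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108,  # from the slide
    (True,  False, True):  0.012,  # from the slide
    (True,  True,  False): 0.016,  # from the slide
    (True,  False, False): 0.064,  # from the slide
    (False, True,  True):  0.072,  # assumed (standard textbook value)
    (False, False, True):  0.008,  # assumed
    (False, True,  False): 0.144,  # assumed
    (False, False, False): 0.576,  # assumed
}

def prob(sentence):
    """Sum the joint probabilities of the complete assignments where the sentence is true."""
    return sum(p for assignment, p in joint.items() if sentence(*assignment))

print(prob(lambda toothache, catch, cavity: toothache))  # 0.108 + 0.012 + 0.016 + 0.064 = 0.2
```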
19/57
Completeness Proof for Joint
Distribution
 Theorem [from propositional logic] Every sentence is
logically equivalent to a disjunction of the form
A1 or A2 or ... or Ak
where the Ai are complete assignments.
1. All of the Ai are mutually exclusive (joint probability 0).
Why?
2. So if S is equivalent to A1 or A2 or ... or Ak, then
P(S) = Σi P(Ai)
where each Ai is given by the joint distribution.
20/57
Bayes Nets and The Joint Distribution
21/57
Example: Complete Bayesian Network
22/57
The Story
 You have a new burglar alarm installed at home.
 It’s reliable at detecting burglary but also responds to earthquakes.
 You have two neighbors that promise to call you at work when they
hear the alarm.
 John always calls when he hears the alarm, but sometimes confuses
alarm with telephone ringing.
 Mary listens to loud music and sometimes misses the alarm.
23/57
Computing The Joint Distribution
 A Bayes net provides a compact factored representation of a joint
distribution.
 In words, the joint probability is computed as follows.
1. For each node Xi:
2. Find the assigned value xi.
3. Find the values y1,...,yk assigned to the parents of Xi.
4. Look up the conditional probability P(xi|y1,...,yk) in the Bayes net.
5. Multiply together these conditional probabilities.
24/57
Product Formula Example: Burglary
 Query: What is the joint probability that
all variables are true?
 P(M, J, A, E, B) =
P(M|A) P(J|A) P(A|E,B) P(E) P(B)
= .7 x .9 x .95 x .002 x .001
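The same product, written out as a small Python sketch (conditional probabilities as quoted above):

```python
p_m_given_a  = 0.7    # P(M | A)
p_j_given_a  = 0.9    # P(J | A)
p_a_given_eb = 0.95   # P(A | E, B)
p_e          = 0.002  # P(E)
p_b          = 0.001  # P(B)

# Product formula: multiply each node's conditional probability given its parents.
print(p_m_given_a * p_j_given_a * p_a_given_eb * p_e * p_b)  # ≈ 1.197e-06
```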
25/57
Compactness of Bayesian Networks
 Consider n binary variables
 Unconstrained joint distribution requires O(2^n) probabilities
 If we have a Bayesian network, with a maximum of k parents for any node, then we need O(n 2^k) probabilities
 Example
 Full unconstrained joint distribution
 n = 30: need 2^30 probabilities for full joint distribution
 Bayesian network
 n = 30, k = 4: need 480 probabilities
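The counts behind this example, as a quick check (Python sketch):

```python
n, k = 30, 4
print(2 ** n)      # 1073741824 entries for the full joint distribution
print(n * 2 ** k)  # 480 conditional probabilities for the Bayes net
```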
26/57
Summary: Why are Bayes nets useful?
- Graph structure supports
- Modular representation of knowledge
- Local, distributed algorithms for inference and learning
- Intuitive (possibly causal) interpretation
- Factored representation may have exponentially fewer
parameters than full joint P(X1,…,Xn) =>
- lower sample complexity (less data for learning)
- lower time complexity (less time for inference)
27/57
Is it Magic?
 How can the Bayes net reduce parameters?
 By exploiting conditional independencies.
 Why does the product formula work?
 The Bayes net topological or graphical semantics:
1. The graph by itself entails conditional independencies.
2. The Chain Rule.
28/57
Conditional Probabilities and Independence
29/57
Conditional Probabilities: Intro
 Given (A) that a die comes up with an odd number,
what is the probability that (B) the number is
1. a 2
2. a 3
 Answer: the number of cases that satisfy both A and B,
out of the number of cases that satisfy A.
 Examples:
1. #faces with (odd and 2) / #faces with odd = 0/3 = 0.
2. #faces with (odd and 3) / #faces with odd = 1/3.
30/57
Conditional Probs ctd.
 Suppose that 50 students are taking 310 and 30 are
women. Given (A) that a student is taking 310, what is
the probability that (B) they are a woman?
 Answer:
#students who take 310 and are a woman/
#students in 310
= 30/50 = 3/5.
 Notation: P(A|B)
31/57
Conditional Ratios: Spot the Pattern
 Spot the Pattern
P(die comes up with odd number)    P(die comes up with 3)               P(3 | odd number)
1/2                                1/6                                  1/3

P(Student takes 310)               P(Student takes 310 and is woman)    P(Student is woman | Student takes 310)
50/15,000                          30/15,000                            3/5
32/57
Conditional Probs: The Ratio Pattern
 Spot the Pattern
P(die comes up with 3) / P(die comes up with odd number) = P(3 | odd number): (1/6) / (1/2) = 1/3
P(Student takes 310 and is woman) / P(Student takes 310) = P(Student is woman | Student takes 310): (30/15,000) / (50/15,000) = 3/5
P(A|B) = P(A and B)/ P(B) Important!
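The ratio pattern as a small Python sketch, reproducing both examples above:

```python
def conditional(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b

print(conditional(1 / 6, 1 / 2))            # P(3 | odd number) = 1/3
print(conditional(30 / 15000, 50 / 15000))  # P(woman | takes 310) = 3/5
```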
33/57
Conditional Probabilities: Motivation
 Much knowledge can be represented as implications
B1,..,Bk =>A.
 Conditional probabilities are a probabilistic version of
reasoning about what follows from conditions.
 Cognitive Science: Our minds store implicational
knowledge.
34/57
The Product Rule: Spot the Pattern
P(Cavity)        P(Toothache | Cavity)        P(Cavity, Toothache)
0.2              0.6                          0.12

P(Toothache)     P(Cavity | Toothache)        P(Cavity, Toothache)
0.2              0.6                          0.12

P(Cavity = F)    P(Toothache | Cavity = F)    P(Toothache, Cavity = F)
0.8              0.1                          0.08
35/57
The Product Rule Pattern
P(Cavity) x P(Toothache | Cavity) = P(Cavity, Toothache): 0.2 x 0.6 = 0.12
P(Toothache) x P(Cavity | Toothache) = P(Cavity, Toothache): 0.2 x 0.6 = 0.12
P(Cavity = F) x P(Toothache | Cavity = F) = P(Toothache, Cavity = F): 0.8 x 0.1 = 0.08
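A quick check of the product rule with these numbers (Python sketch):

```python
p_cavity = 0.2
p_toothache_given_cavity = 0.6
# Product rule: P(Cavity, Toothache) = P(Toothache | Cavity) * P(Cavity)
print(p_toothache_given_cavity * p_cavity)  # 0.12
```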
36/57
Independence
 A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B)
or P(A, B) = P(A) P(B)
 Suppose that Weather is independent of the Cavity Scenario. Then the joint
distribution decomposes:

P(Toothache, Catch, Cavity, Weather)
= P(Toothache, Catch, Cavity) P(Weather)
 Absolute independence powerful but rare

 Dentistry is a large field with hundreds of variables, none of which are
independent. What to do?
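A small sketch of what the decomposition buys us, assuming the Weather distribution from the earlier table and the Cavity/Toothache joint used throughout (Catch is omitted to keep it short):

```python
# If Weather is independent of the dental variables, the joint factors into a product.
p_weather = {"Sunny": 0.7, "Rainy": 0.2, "Cloudy": 0.08, "Snow": 0.02}
p_cavity_toothache = {(True, True): 0.12, (True, False): 0.08,
                      (False, True): 0.08, (False, False): 0.72}

# Build the joint over (Weather, Cavity, Toothache) by the independence factorization.
joint = {(w, c, t): pw * pct
         for w, pw in p_weather.items()
         for (c, t), pct in p_cavity_toothache.items()}

# Check: P(Weather = Sunny | Cavity) equals P(Weather = Sunny).
p_sunny_and_cavity = sum(p for (w, c, t), p in joint.items() if w == "Sunny" and c)
p_cavity = sum(p for (w, c, t), p in joint.items() if c)
print(p_sunny_and_cavity / p_cavity)  # 0.7
```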
37/57
Exercise
 Prove that the three definitions of independence are
equivalent (assuming all positive probabilities).
 A and B are independent iff
1.
P(A|B) = P(A)
2. or P(B|A) = P(B)
3. or P(A, B) = P(A) P(B)
38/57
Conditional independence

If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:

(1) P(catch | toothache, cavity) = P(catch | cavity)

The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)

Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache,Cavity) = P(Catch | Cavity)


The equivalences for independence also hold for conditional independence, e.g.:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Conditional independence is our most basic and robust form of knowledge about uncertain
environments.
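A numerical sketch of the decomposition P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity), using the dentist joint; the Toothache = false entries are not on the slides and are the standard textbook values (an assumption):

```python
# Keys are (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, False, True):  0.008,   # assumed
    (False, True,  False): 0.144, (False, False, False): 0.576,   # assumed
}

def p(event):
    """Probability of an event, by enumeration over complete assignments."""
    return sum(pr for a, pr in joint.items() if event(*a))

p_cav = p(lambda t, c, cav: cav)                    # P(Cavity) = 0.2
lhs = p(lambda t, c, cav: t and c and cav) / p_cav  # P(Toothache, Catch | Cavity)
rhs = (p(lambda t, c, cav: t and cav) / p_cav) * (p(lambda t, c, cav: c and cav) / p_cav)
print(lhs, rhs)  # both ≈ 0.54
```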

39/57
Bayes Nets Graphical
Semantics
40/57
Common Causes: Spot the Pattern
[Figure: common-cause graph, Catch ← Cavity → Toothache]
 Catch is independent of toothache given Cavity.
41/57
Burglary Example
 JohnCalls, MaryCalls are conditionally
independent given Alarm.
42/57
Spot the Pattern: Chain Scenario
 MaryCalls is independent of Burglary
given Alarm.
 JohnCalls is independent of Earthquake
given Alarm.
43/57
The Markov Condition
 A Bayes net is constructed so that:
each variable is conditionally independent of its nondescendants
given its parents.
 The graph alone (without specified probabilities) entails
conditional independencies.
 Causal Interpretation: Each parent is a direct cause.
44/57
Derivation of the Product
Formula
45/57
The Chain Rule
 We can always write
P(a, b, c, …, z) = P(a | b, c, …, z) P(b, c, …, z)
(Product Rule)
 Repeatedly applying this idea, we obtain
P(a, b, c, …, z) = P(a | b, c, …, z) P(b | c, …, z) P(c | …, z) … P(z)
 Order the variables such that children come before parents.
 Then, given its parents, each node is independent of its other ancestors by the topological independence.
 P(a, b, c, …, z) = Πx P(x | parents(x))
46/57
Example in Burglary Network
 P(M, J, A, E, B) = P(M | J, A, E, B) P(J, A, E, B) = P(M|A) P(J, A, E, B)
= P(M|A) P(J | A, E, B) P(A, E, B) = P(M|A) P(J|A) P(A, E, B)
= P(M|A) P(J|A) P(A | E, B) P(E, B)
= P(M|A) P(J|A) P(A | E, B) P(E) P(B)
 Colours show applications of the Bayes net topological independence.
47/57
Explaining Away
48/57
Common Effects: Spot the Pattern
• Influenza and Smokes are independent.
• Given Bronchitis, they become dependent.
• Battery Age and Charging System are independent.
• Given Battery Voltage, they become dependent.

[Figures: common-effect graphs, Influenza → Bronchitis ← Smokes and Battery Age → Battery Voltage ← Charging System OK]
49/57
Conditioning on Children
• Independent Causes:
A and B are independent.
• “Explaining away” effect:
Given C, observing A makes B
less likely.
• E.g. Bronchitis in UBC “Simple
Diagnostic Problem”.
[Figure: A → C ← B]
⇒ A and B are (marginally) independent, become dependent once C is known.
50/57
D-separation
• A, B, and C are non-intersecting subsets of nodes in a
directed graph.
• A path from A to B is blocked if it contains a node such that either
a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.
• If all paths from A to B are blocked, A is said to be d-separated from B by C.
• If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⊥ B | C (A is conditionally independent of B given C).
51/57
D-separation: Example
52/57
Bayes’ Theorem
54/57
Abductive Reasoning
 Implications are often causal, from
cause to effect.
 Many important queries are
diagnostic, from effect to cause.
[Figures: Burglary → Alarm and Cavity → Toothache]
55/57
Bayes’ Theorem: Another Example
 A doctor knows the following.
 The disease meningitis causes the patient to have a stiff neck
50% of the time.
 The prior probability that someone has meningitis is
1/50,000.
 The prior that someone has a stiff neck is 1/20.
 Question: knowing that a person has a stiff neck what is the
probability that they have meningitis?
56/57
Spot the Pattern: Diagnosis
P(Cavity)        P(Toothache | Cavity)         P(Toothache)     P(Cavity | Toothache)
0.2              0.6                           0.2              0.6

P(Wumpus)        P(Stench | Wumpus)            P(Stench)        P(Wumpus | Stench)
0.2              0.6                           0.2              0.6

P(Meningitis)    P(Stiff Neck | Meningitis)    P(Stiff Neck)    P(Meningitis | Stiff Neck)
1/50,000         1/2                           1/20             ?
57/57
Spot the Pattern: Diagnosis
P(Cavity) x P(Toothache | Cavity) / P(Toothache) = P(Cavity | Toothache): 0.2 x 0.6 / 0.2 = 0.6
P(Wumpus) x P(Stench | Wumpus) / P(Stench) = P(Wumpus | Stench): 0.2 x 0.6 / 0.2 = 0.6
P(Meningitis) x P(Stiff Neck | Meningitis) / P(Stiff Neck) = P(Meningitis | Stiff Neck): (1/50,000) x (1/2) / (1/20) = 1/5,000
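The meningitis row, computed directly from Bayes' theorem (Python sketch):

```python
p_meningitis = 1 / 50000  # prior P(Meningitis)
p_stiff_given_m = 1 / 2   # P(Stiff Neck | Meningitis)
p_stiff = 1 / 20          # P(Stiff Neck)

# Bayes' theorem: P(Meningitis | Stiff Neck) = P(Stiff Neck | Meningitis) P(Meningitis) / P(Stiff Neck)
print(p_stiff_given_m * p_meningitis / p_stiff)  # 0.0002 = 1/5,000
```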
58/57
Explain the Pattern: Bayes’ Theorem
 Exercise: Prove Bayes’ Theorem
P(A | B) = P(B | A) P(A) / P(B).
59/57
On Bayes’ Theorem
 P(a | b) = P(b | a) P(a) / P(b).
 Useful for assessing diagnostic probability from causal
probability:
 P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect).
Likelihood P(Effect|Cause): how well does the cause explain the effect?
Prior P(Cause): how plausible is the explanation before any evidence?
Evidence term / normalization constant P(Effect): how surprising is the evidence?
60/57