Transcript Slide 1

Bayesian Networks:
Motivation
• Capture independence and conditional
independence where they exist.
• Among variables where dependencies exist,
encode the relevant portion of the full joint.
• Use a graphical representation for which we
can more easily investigate the complexity
of inference and can search for efficient
inference algorithms.
© Jude Shavlik 2006, David Page 2007. CS 760 – Machine Learning (UW-Madison), Lecture #5.
A Bayesian Network is a
...
• Directed Acyclic Graph (DAG) in which
…
• … the nodes denote random variables
• … each node X has a conditional
probability distribution
P(X|Parents(X)).
• The intuitive meaning of an arc from X
to Y is that X directly influences Y.
Additional Terminology
• If X and its parents are discrete, we can
represent the distribution P(X|Parents(X))
by a conditional probability table (CPT)
specifying the probability of each value of X
given each possible combination of settings
for the variables in Parents(X).
• A conditioning case is a row in this CPT (a
setting of values for the parent nodes).
An Example Bayes Net

[Figure: DAG with arcs A → C, B → C, C → D, C → E]

P(A) = 0.1 (1,9)
P(B) = 0.2 (1,4)

A B | P(C)
T T | 0.9 (9,1)
T F | 0.6 (3,2)
F T | 0.3 (3,7)
F F | 0.2 (1,4)

C | P(D)
T | 0.9 (9,1)
F | 0.2 (1,4)

C | P(E)
T | 0.8 (4,1)
F | 0.1 (1,9)

The numbers (probabilities) are also called parameters. In parentheses are the hyperparameters. These usually are not shown, but are important when doing parameter learning.
Bayesian Network
Semantics
• A Bayesian Network completely specifies a
full joint distribution over its random
variables, as below -- this is its meaning.
• P(x1, …, xn) = ∏_{i=1}^{n} P(xi | Parents(Xi))
• In the above, P(x1,…,xn) is shorthand
notation for P(X1=x1,…,Xn=xn).
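To make the semantics concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that encodes the example Bayes net above as conditional probability tables keyed by conditioning cases and evaluates the joint probability of one complete assignment.

```python
# Example Bayes net from the slides: A -> C <- B, C -> D, C -> E.
# Each CPT maps a conditioning case (tuple of parent values) to P(node = True | parents).
cpts = {
    "A": {(): 0.1},
    "B": {(): 0.2},
    "C": {(True, True): 0.9, (True, False): 0.6,
          (False, True): 0.3, (False, False): 0.2},
    "D": {(True,): 0.9, (False,): 0.2},
    "E": {(True,): 0.8, (False,): 0.1},
}
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}

def joint(assignment):
    """P(x1, ..., xn) = product over all nodes of P(xi | Parents(Xi))."""
    p = 1.0
    for node, cpt in cpts.items():
        case = tuple(assignment[par] for par in parents[node])
        p_true = cpt[case]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

# P(A=F, B=F, C=F, D=F, E=F) = 0.9 * 0.8 * 0.8 * 0.8 * 0.9 = 0.41472
print(joint({"A": False, "B": False, "C": False, "D": False, "E": False}))
```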
Conditional
Independence Again
• A node X is conditionally independent
of its predecessors given Parents(X).
• Markov Blanket of X: set consisting of
the parents of X, the children of X, and
the other parents of the children of X.
• X is conditionally independent of all other nodes in the network given its Markov Blanket.
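As a small illustration (my own sketch, reusing the parent lists from the example above), the Markov blanket can be read directly off the graph:

```python
def markov_blanket(node, parents):
    """Parents of node, children of node, and the children's other parents."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
print(markov_blanket("C", parents))  # {'A', 'B', 'D', 'E'}
print(markov_blanket("A", parents))  # {'B', 'C'}: B is a co-parent of A's child C
```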
Learning CPTs from
Complete Settings
• Suppose we are given a set of data, where
each data point is a complete setting for all
the variables.
• One assumption we make is that the data
set is a random sample from the distribution
we’re trying to model.
• For each node in our network, we consider each entry in its CPT (each setting of values for its parents). For each entry in the CPT, we have a prior (possibly uniform) Dirichlet distribution over its values. We simply update this distribution based on the relevant data points (those that agree on the settings for the parents that correspond with this CPT entry).
• A second, implicit assumption is that the distributions over different rows of the CPT are independent of one another.
• Finally, it is worth noting that instead of this
last assumption, we might have a stronger
bias over the form of the CPT. We might
believe it is a linear function or a tree, in
which case we could use a learning
algorithm we have seen already.
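To make the counting concrete, here is a rough sketch (my own code, using the P(D | C) prior from the example net; the data points are made up) of updating one node's Dirichlet pseudo-counts from fully observed data.

```python
def update_dirichlet(node, parents, prior, data):
    """
    prior: {conditioning case (tuple of parent values): [alpha_true, alpha_false]}
    data:  list of dicts, each a complete setting of all variables.
    Returns the posterior pseudo-counts: prior plus observed tallies.
    """
    posterior = {case: list(alphas) for case, alphas in prior.items()}
    for point in data:
        case = tuple(point[p] for p in parents)
        if point[node]:
            posterior[case][0] += 1
        else:
            posterior[case][1] += 1
    return posterior

# P(D | C) with prior (9,1) for C=T and (1,4) for C=F.
prior_D = {(True,): [9, 1], (False,): [1, 4]}
data = [{"C": True, "D": True}, {"C": False, "D": False}, {"C": True, "D": False}]
print(update_dirichlet("D", ["C"], prior_D, data))
# {(True,): [10, 2], (False,): [1, 5]}; the posterior-mean estimate of
# P(D=T | C=T) becomes 10 / (10 + 2) ≈ 0.83.
```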
Simple Example
• Suppose we believe the variables PinLength
and HeadWeight directly influence whether
a thumbtack comes up heads or tails. For
simplicity, suppose PinLength can be long or
short and HeadWeight can be heavy or
light.
• Suppose we adopt the following prior over
the CPT entries for the variable Thumbtack.
Simple Example (Continued)

CPT for the Thumbtack node (parents HeadWeight and PinLength):

HeadWeight:        heavy   heavy   light   light
PinLength:         long    short   long    short
heads               .9      .8      .8      .7
tails               .1      .2      .2      .3
(hyperparameters)  (9,1)   (4,1)   (16,4)  (7,3)

(Normal roles of rows and columns in the CPT reversed just to make it fit.)
Simple Example
(Continued)
• Notice that we have equal confidence in our
prior (initial) probabilities for the first and
last columns of the CPT, less confidence in
those of the second column, and more in
those of the third column.
• A new data point will affect only one of the
columns. A new data point will have more
effect on the second column than the
others.
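As a rough illustration of this point (my own arithmetic, using the hyperparameters above): suppose we observe one new heads outcome. Only the column matching its (HeadWeight, PinLength) setting changes, and how much it changes depends on that column's total pseudo-count.

```python
def updated_p_heads(alpha_heads, alpha_tails, new_heads=1):
    """Posterior-mean estimate of P(heads) after adding new observations of heads."""
    return (alpha_heads + new_heads) / (alpha_heads + alpha_tails + new_heads)

# Second column (heavy, short): prior (4,1), P(heads) = 0.8
print(updated_p_heads(4, 1))   # 5/6  ≈ 0.83 -> moves noticeably
# Third column (light, long): prior (16,4), P(heads) = 0.8
print(updated_p_heads(16, 4))  # 17/21 ≈ 0.81 -> barely moves
```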
More Difficult Case: What if Some Variables Are Missing?
• Recall our earlier notion of hidden variables.
• Sometimes a variable is hidden because it
cannot be explicitly measured. For
example, we might hypothesize that a
chromosomal abnormality is responsible for
some patients with a particular cancer not
responding well to treatment.
Missing Values
(Continued)
• We might include a node for this
chromosomal abnormality in our network
because we strongly believe it exists, other
variables can be used to predict it, and it is
in turn predictive of still other variables.
• But in estimating CPTs from data, none of
our data points has a value for this variable.
General EM Framework
• Given: Data with missing values, Space of
possible models, Initial model.
• Repeat until no change greater than
threshold:
• Expectation (E) Step: Compute expectation over
missing values, given model.
• Maximization (M) Step: Replace current model
with model that maximizes probability of data.
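Schematically, the loop looks like the following sketch (my own pseudocode-style Python; e_step, m_step, and change are placeholders for the BN-specific computations described on the next slides).

```python
def em(data, model, e_step, m_step, change, threshold=1e-6, max_iters=100):
    """Generic EM loop: alternate E and M steps until no change greater than threshold."""
    for _ in range(max_iters):
        completed = e_step(model, data)   # expectation over missing values, given model
        new_model = m_step(completed)     # model maximizing probability of completed data
        if change(model, new_model) <= threshold:
            return new_model
        model = new_model
    return model
```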
(“Soft”) EM vs. “Hard” EM
• Standard (soft) EM: expectation is a
probability distribution.
• Hard EM: expectation is “all or
nothing”… most likely/probable value.
• The advantage of hard EM is computational efficiency when the expectation is over a state consisting of values for multiple variables (the next example illustrates this).
EM for Parameter Learning:
E Step
• For each data point with missing values,
compute the probability of each possible
completion of that data point. Replace the
original data point with all these
completions, weighted by probabilities.
• Computing the probability of each completion (the expectation) is just answering a query over the missing variables given the others.
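For the example network, the E step for a data point in which only C is missing is a direct application of Bayes' rule. A minimal sketch (my own code, with the CPT numbers from the example net):

```python
def posterior_C(a, b, d, e):
    """P(C | A=a, B=b, D=d, E=e) for the example net, by enumerating both values of C."""
    p_c_given_ab = {(True, True): 0.9, (True, False): 0.6,
                    (False, True): 0.3, (False, False): 0.2}
    p_d_given_c = {True: 0.9, False: 0.2}
    p_e_given_c = {True: 0.8, False: 0.1}
    weights = {}
    for c in (True, False):
        pc = p_c_given_ab[(a, b)] if c else 1.0 - p_c_given_ab[(a, b)]
        pd = p_d_given_c[c] if d else 1.0 - p_d_given_c[c]
        pe = p_e_given_c[c] if e else 1.0 - p_e_given_c[c]
        weights[c] = pc * pd * pe
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}

# A data point with A=0, B=0, D=0, E=0:
print(posterior_C(False, False, False, False))
# {True: ~0.007, False: ~0.993}, matching the rows shown rounded to 0.01 / 0.99
# in the completed-data table later in the lecture.
```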
EM for Parameter Learning:
M Step
• Use the completed data set to update
our Dirichlet distributions as we would
use any complete data set, except that
our counts (tallies) may be fractional
now.
• Update CPTs based on new Dirichlet
distributions, as we would with any
complete data set.
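A sketch of the corresponding M-step update for one CPT (my own code, following the same Dirichlet bookkeeping as before; the tallies added are now the E-step probabilities rather than 0/1 counts):

```python
def m_step_update(prior, weighted_points, node, parents):
    """
    prior: {conditioning case (tuple of parent values): [alpha_true, alpha_false]}
    weighted_points: list of (assignment dict, weight) pairs from the E step,
                     one pair per completion of each original data point.
    """
    posterior = {case: list(alphas) for case, alphas in prior.items()}
    for point, weight in weighted_points:
        case = tuple(point[p] for p in parents)
        if point[node]:
            posterior[case][0] += weight   # fractional count for node = True
        else:
            posterior[case][1] += weight   # fractional count for node = False
    # New CPT entries from the updated Dirichlet distributions (posterior means).
    cpt = {case: a_t / (a_t + a_f) for case, (a_t, a_f) in posterior.items()}
    return posterior, cpt
```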
EM for Parameter
Learning
• Iterate E and M steps until no changes
occur. We will not necessarily get the global
MAP (or ML given uniform priors) setting of
all the CPT entries, but under a natural set
of conditions we are guaranteed
convergence to a local MAP solution.
• EM algorithm is used for a wide variety of
tasks outside of BN learning as well.
Subtlety for Parameter
Learning
• Danger: overcounting based on the number of iterations required to converge to settings for the missing values.
• To avoid this, after each repetition of the E step, reset all Dirichlet distributions (to the original priors) before repeating the M step.
EM for Parameter Learning

(Network and CPTs as in the example Bayes net above.)

Data:
A B C D E
0 0 ? 0 0
0 0 ? 1 0
1 0 ? 1 1
0 0 ? 0 1
0 1 ? 1 0
0 0 ? 0 1
1 1 ? 1 1
0 0 ? 0 0
0 0 ? 1 0
0 0 ? 0 1
EM for Parameter Learning

(Network and CPTs as above.)

Data after the E step: each missing value of C has been replaced by a distribution over its values.
A B  C                   D E
0 0  0: 0.99, 1: 0.01    0 0
0 0  0: 0.80, 1: 0.20    1 0
1 0  0: 0.02, 1: 0.98    1 1
0 0  0: 0.80, 1: 0.20    0 1
0 1  0: 0.70, 1: 0.30    1 0
0 0  0: 0.80, 1: 0.20    0 1
1 1  0: 0.003, 1: 0.997  1 1
0 0  0: 0.99, 1: 0.01    0 0
0 0  0: 0.80, 1: 0.20    1 0
0 0  0: 0.80, 1: 0.20    0 1
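To spell out where one of these rows comes from (my own arithmetic from the CPTs above): for the second data point (A=0, B=0, D=1, E=0), P(C=1 | A, B, D, E) ∝ P(C=1 | A=0, B=0) · P(D=1 | C=1) · P(E=0 | C=1) = 0.2 · 0.9 · 0.2 = 0.036, while P(C=0 | A, B, D, E) ∝ 0.8 · 0.2 · 0.9 = 0.144. Normalizing gives 0.036 / 0.180 = 0.20 and 0.144 / 0.180 = 0.80, which is the 0.80 / 0.20 entry.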
Multiple Missing Values

(Network and CPTs as above.)

Data:
A B C D E
? 0 ? 0 1
Multiple Missing Values

(Network and CPTs as above.)

The single data point expands into the four completions of (A, C), weighted by their probabilities:
A B C D E   weight
0 0 0 0 1   0.72
0 0 1 0 1   0.18
1 0 0 0 1   0.04
1 0 1 0 1   0.06
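These weights come from the joint probabilities of the four completions under the current parameters (my own arithmetic): P(A=0, B=0, C=0, D=0, E=1) = 0.9 · 0.8 · 0.8 · 0.8 · 0.1 = 0.04608; P(A=0, B=0, C=1, D=0, E=1) = 0.9 · 0.8 · 0.2 · 0.1 · 0.8 = 0.01152; P(A=1, B=0, C=0, D=0, E=1) = 0.1 · 0.8 · 0.4 · 0.8 · 0.1 = 0.00256; and P(A=1, B=0, C=1, D=0, E=1) = 0.1 · 0.8 · 0.6 · 0.1 · 0.8 = 0.00384. Dividing each by their sum, 0.064, gives 0.72, 0.18, 0.04, and 0.06.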
Multiple Missing Values

Updated CPTs after the M step (data expansion as on the previous slide):

P(A) = 0.1 (1.1, 9.9)
P(B) = 0.17 (1, 5)

A B | P(C)
T T | 0.9 (9, 1)
T F | 0.6 (3.06, 2.04)
F T | 0.3 (3, 7)
F F | 0.2 (1.18, 4.72)

C | P(D)
T | 0.88 (9, 1.24)
F | 0.17 (1, 4.76)

C | P(E)
T | 0.81 (4.24, 1)
F | 0.16 (1.76, 9)
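For example (my own arithmetic): the C=T row of P(D | C) starts with hyperparameters (9, 1). The completions with C=1 carry total weight 0.18 + 0.06 = 0.24, and all of them have D=0, so the pseudo-counts become (9, 1.24) and the new estimate is 9 / 10.24 ≈ 0.88. Similarly, P(B) goes from (1, 4) to (1, 5) because the data point has B=0 with total weight 1, giving 1 / 6 ≈ 0.17.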
Problems with EM
• Converges only to a local optimum (not much way around that, though).
• Deterministic … if priors are uniform,
may be impossible to make any
progress…
• … next figure illustrates the need for
some randomization to move us off an
uninformative prior…
What will EM do here?

Network: A → B → C

P(A) = 0.5 (1,1)

A | P(B)
T | 0.5 (1,1)
F | 0.5 (1,1)

B | P(C)
T | 0.5 (1,1)
F | 0.5 (1,1)

Data:
A B C
0 ? 0
1 ? 1
0 ? 0
1 ? 1
0 ? 0
1 ? 1
EM Dependent on Initial Beliefs

Same network and data, but the initial CPT for B is no longer uniform:

P(A) = 0.5 (1,1)

A | P(B)
T | 0.6 (6,4)
F | 0.4 (4,6)

B | P(C)
T | 0.5 (1,1)
F | 0.5 (1,1)

Data:
A B C
0 ? 0
1 ? 1
0 ? 0
1 ? 1
0 ? 0
1 ? 1
EM Dependent on Initial Beliefs

(Same network, initial CPTs, and data as on the previous slide.)

B is more likely T than F when A is T. Filling this in makes C more likely T than F when B is T. This makes B still more likely T than F when A is T. Etc. A small change in the CPT for B (swapping 0.6 and 0.4) would have the opposite effect.
Learning Structure +
Parameters
• Number of structures is
superexponential
• Finding optimal structure (ML or MAP)
is NP-complete
• Two common options:
• Severely restrict possible structures –
e.g., tree-augmented naïve Bayes (TAN)
• Heuristic search (e.g., sparse candidate)
Recall: Naïve Bayes Net

[Figure: the Class Value node has an arc to each feature F1, F2, F3, …, FN-2, FN-1, FN.]
Alternative: TAN

[Figure: as in naïve Bayes, Class Value has an arc to every feature; in addition, arcs among the features F1, …, FN form a directed tree.]
Tree-Augmented Naïve
Bayes
• In addition to the Naïve Bayes arcs
(class to feature), we are permitted a
directed tree among the features
• Given this restriction, there exists a
polynomial-time algorithm to find the
maximum likelihood structure
TAN Learning Algorithm
Friedman, Geiger &
Goldszmidt ’97
• For every pair of features, compute the
mutual information (information gain of one
for the other), conditional on the class
• Add arcs between all pairs of features,
weighted by this value
• Compute the maximum weight spanning
tree, and direct arcs from the root
• Compute parameters as already seen
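A rough sketch of this procedure (my own simplified code, not the authors' implementation; it assumes discrete features stored as rows of values, estimates the class-conditional mutual information from counts, and uses a basic Prim-style maximum spanning tree):

```python
import math
from collections import Counter

def cond_mutual_info(data, i, j, class_idx):
    """I(F_i; F_j | Class), estimated from counts over (f_i, f_j, class) triples."""
    n = len(data)
    c_xyz = Counter((r[i], r[j], r[class_idx]) for r in data)
    c_xz = Counter((r[i], r[class_idx]) for r in data)
    c_yz = Counter((r[j], r[class_idx]) for r in data)
    c_z = Counter(r[class_idx] for r in data)
    return sum((cnt / n) * math.log((cnt * c_z[z]) / (c_xz[(x, z)] * c_yz[(y, z)]))
               for (x, y, z), cnt in c_xyz.items())

def tan_tree(data, feature_idxs, class_idx):
    """Maximum-weight spanning tree over the features, weighted by conditional
    mutual information; arcs are directed away from an arbitrary root feature."""
    weight = {(i, j): cond_mutual_info(data, i, j, class_idx)
              for i in feature_idxs for j in feature_idxs if i < j}
    in_tree, arcs = {feature_idxs[0]}, []
    while len(in_tree) < len(feature_idxs):
        # Prim's algorithm: attach the out-of-tree feature with the heaviest edge.
        u, v = max(((u, v) for u in in_tree for v in feature_idxs if v not in in_tree),
                   key=lambda e: weight[(min(e), max(e))])
        arcs.append((u, v))
        in_tree.add(v)
    return arcs
```

The class-to-feature arcs of naïve Bayes are added on top of this tree, and the CPT parameters are then estimated as described earlier.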
One Other Key Point
• The previous discussion assumes we are
going to make a prediction based on the
best (e.g., MAP or maximum likelihood)
single hypothesis.
• Alternatively, we could avoid committing to a single Bayes net. Instead we could consider all possible Bayes nets and have a probability for each. For any new query, we could calculate the prediction of every network, weigh each network's prediction by the probability that it is the correct network (given our previous training data), and go with the highest-scoring prediction. Such a predictor is the Bayes-optimal predictor.
Problem with Bayes
Optimal
• Because there are a super-exponential
number of structures, we don’t want to
average over all of them. Two options are
used in practice:
• Selective model averaging: just choose a subset of “best” but “distinct” models (networks) and pretend it’s exhaustive.
• Go back to MAP/ML (model selection).
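A minimal sketch of the selective-averaging option (my own illustration; the model list, its posterior weights, and the prediction functions are placeholders, not anything defined in the lecture):

```python
def averaged_prediction(models, query):
    """Combine predictions from several networks, each weighted by its
    (approximate) posterior probability given the training data.
    models: list of (posterior_weight, predict_fn) pairs, where predict_fn(query)
            returns {candidate value: probability}."""
    total = sum(w for w, _ in models)
    combined = {}
    for weight, predict in models:
        for value, prob in predict(query).items():
            combined[value] = combined.get(value, 0.0) + (weight / total) * prob
    # Go with the highest-scoring prediction.
    return max(combined, key=combined.get), combined
```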