Transcript X - IPAM

Graphical models
Tom Griffiths
UC Berkeley
Challenges of probabilistic models
• Specifying well-defined probabilistic models
with many variables is hard (for modelers)
• Representing probability distributions over
those variables is hard (for computers/learners)
• Computing quantities using those distributions
is hard (for computers/learners)
Representing structured distributions
Four random variables:
X1: coin toss produces heads (domain {0,1})
X2: pencil levitates (domain {0,1})
X3: friend has psychic powers (domain {0,1})
X4: friend has two-headed coin (domain {0,1})
Joint distribution
• Requires 15 numbers to specify the probability of all values x1, x2, x3, x4
– N binary variables require 2^N − 1 numbers
• Similar cost when computing conditional probabilities:
P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)
(table of all 16 assignments of x1 x2 x3 x4, from 0000 to 1111)
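A minimal Python sketch of this cost (using an arbitrary, randomly generated joint distribution): the full joint is a table of 2^4 = 16 entries, and conditioning means summing over it.

```python
import itertools
import random

# Hypothetical joint distribution over (x1, x2, x3, x4): one entry per
# assignment, 2**4 = 16 numbers (15 free parameters, since they sum to 1).
random.seed(0)
weights = [random.random() for _ in range(16)]
total = sum(weights)
joint = {assignment: w / total
         for assignment, w in zip(itertools.product([0, 1], repeat=4), weights)}

# P(x1 = 1): sum over the 8 assignments with x1 = 1
p_x1 = sum(p for (x1, x2, x3, x4), p in joint.items() if x1 == 1)

# P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)
conditional = {(x2, x3, x4): p / p_x1
               for (x1, x2, x3, x4), p in joint.items() if x1 == 1}
print(p_x1, sum(conditional.values()))  # the conditional sums to 1
```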
How can we use fewer numbers?
Four random variables:
X1: coin toss produces heads (domain {0,1})
X2: coin toss produces heads (domain {0,1})
X3: coin toss produces heads (domain {0,1})
X4: coin toss produces heads (domain {0,1})
Statistical independence
• Two random variables X1 and X2 are independent if
P(x1|x2) = P(x1)
– e.g. coin flips: P(x1=H | x2=H) = P(x1=H) = 0.5
• Independence makes it easier to represent and
work with probability distributions
• We can exploit the product rule:
P(x1, x2, x3, x4) = P(x1 | x2, x3, x4) P(x2 | x3, x4) P(x3 | x4) P(x4)
If x1, x2, x3, and x4 are all independent…
P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4)
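A minimal sketch of the saving under full independence (coin-flip probabilities as above): four numbers determine every joint probability.

```python
# If x1..x4 are independent coin flips, one number per variable suffices.
p_heads = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5}  # P(xi = 1) for each variable

def joint(x1, x2, x3, x4):
    """P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4) under independence."""
    prob = 1.0
    for i, x in enumerate((x1, x2, x3, x4), start=1):
        prob *= p_heads[i] if x == 1 else 1 - p_heads[i]
    return prob

print(joint(1, 0, 1, 1))  # 0.0625, from 4 parameters instead of 15
```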
Expressing independence
• Statistical independence is the key to efficient
probabilistic representation and computation
• This has led to the development of languages for
indicating dependencies among variables
• Some of the most popular languages are based on
“graphical models”
Graphical models
• Introduction to graphical models
– definitions
– efficient representation and inference
– explaining away
• Graphical models and cognitive science
– uses of graphical models
Graphical models
• Express the probabilistic dependency
structure among a set of variables (Pearl, 1988)
• Consist of
– a set of nodes, corresponding to variables
– a set of edges, indicating dependency
– a set of functions defined on the graph that
specify a probability distribution
Undirected graphical models
• Consist of
– a set of nodes (e.g. X1, X2, X3, X4, X5)
– a set of edges
– a potential for each clique, multiplied together to yield the distribution over variables
• Examples
– statistical physics: Ising model, spin glasses
– neural networks (e.g. Boltzmann machines)
Ising models
• Consist of
– a set of nodes (e.g. X1, X2, X3, X4)
– a set of edges
– a potential for each clique, multiplied together to yield the distribution over variables
• Distribution is specified as…
P(x) ∝ ∏_ij exp{ w_ij f(xi, xj) }
Ising models
(figure)
Boltzmann machines
• Consist of
– a set of nodes (e.g. X1, X2, X3, X4, X5)
– a set of edges
– a potential for each clique, multiplied together to yield the distribution over variables
• Distribution is specified as…
P(x) ∝ exp{ Σ_ij w_ij f(xi, xj) + Σ_i θ_i xi }
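A minimal sketch of both formulas (the weights, biases, and the choice f(xi, xj) = xi·xj are invented for illustration; the normalizing constant is computed by brute-force enumeration, which is only feasible for tiny networks). Setting the biases to zero gives the pairwise-only, Ising-style form; keeping them gives the Boltzmann-machine form.

```python
import itertools
import math

# Hypothetical pairwise weights w_ij (one per edge) and biases theta_i
# for a tiny network of 4 binary units.
edges = {(0, 1): 1.0, (1, 2): -0.5, (2, 3): 0.8, (0, 3): 0.3}
theta = [0.2, -0.1, 0.0, 0.4]

def unnormalized(x):
    """exp{ sum_ij w_ij f(xi, xj) + sum_i theta_i xi }, taking f(xi, xj) = xi * xj."""
    pairwise = sum(w * x[i] * x[j] for (i, j), w in edges.items())
    unary = sum(t * xi for t, xi in zip(theta, x))
    return math.exp(pairwise + unary)

# Normalizing constant by summing over all 2^4 configurations.
states = list(itertools.product([0, 1], repeat=4))
Z = sum(unnormalized(x) for x in states)
probs = {x: unnormalized(x) / Z for x in states}
print(sum(probs.values()))  # 1.0 (up to floating point)
```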
Boltzmann machines
(figure: true images alongside Boltzmann machine and PCA reconstructions; Hinton & Salakhutdinov, 2006)
Directed graphical models
• Consist of
– a set of nodes (e.g. X1, X2, X3, X4, X5)
– a set of edges
– a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables
• Constrained to directed acyclic graphs (DAGs)
• Called Bayesian networks or Bayes nets
Bayesian networks and Bayes
• Two different problems
– Bayesian statistics is a method of inference
– Bayesian networks are a form of representation
• There is no necessary connection
– many users of Bayesian networks rely upon
frequentist statistical methods
– many Bayesian inferences cannot be easily
represented using Bayesian networks
Graphical models
• Introduction to graphical models
– definitions
– efficient representation and inference
– explaining away
• Graphical models and cognitive science
– uses of graphical models
Efficient representation and inference
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
Network: X3 → X1 ← X4 and X3 → X2, with distributions P(x3), P(x4), P(x1|x3, x4), P(x2|x3)
The Markov assumption
Every node is conditionally independent of its nondescendants, given its parents
P(xi | xi1,...,xk )  P(xi | Pa(Xi ))
where Pa(Xi) is the set of parents of Xi
k
P(x1,..., x k )   P(x i | Pa(X i ))
i1
(via the product rule)
Efficient representation and inference
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
Network: X3 → X1 ← X4 and X3 → X2
Numbers required: P(x3): 1, P(x4): 1, P(x1|x3, x4): 4, P(x2|x3): 2
total = 8 (vs 15)
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
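A minimal sketch of the same network in Python, storing exactly these 8 numbers (the probability values themselves are invented for illustration) and computing the joint from the factorization.

```python
# CPTs for the network X3 -> X1 <- X4, X3 -> X2; each entry is P(xi = 1).
p_x3 = 0.01                       # P(friend has psychic powers)     1 number
p_x4 = 0.01                       # P(friend has two-headed coin)    1 number
p_x2_given_x3 = {0: 0.0, 1: 0.5}  # P(pencil levitates | x3)         2 numbers
p_x1_given_x3_x4 = {              # P(coin toss heads | x3, x4)      4 numbers
    (0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.99, (1, 1): 1.0,
}                                 # total = 8 numbers (vs 15 for the full joint)

def bernoulli(p, x):
    return p if x == 1 else 1 - p

def joint(x1, x2, x3, x4):
    """P(x1, x2, x3, x4) = P(x1|x3, x4) P(x2|x3) P(x3) P(x4)."""
    return (bernoulli(p_x1_given_x3_x4[(x3, x4)], x1)
            * bernoulli(p_x2_given_x3[x3], x2)
            * bernoulli(p_x3, x3)
            * bernoulli(p_x4, x4))

print(joint(1, 0, 0, 0))
```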
Reading a Bayesian network
• The structure of a Bayes net can be read as the
generative process behind a distribution
• Gives the joint probability distribution over
variables obtained by sampling each variable
conditioned on its parents
Reading a Bayesian network
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
Network: X3 → X1 ← X4 and X3 → X2, with distributions P(x3), P(x4), P(x1|x3, x4), P(x2|x3)
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
Reading a Bayesian network
• The structure of a Bayes net can be read as the
generative process behind a distribution
• Gives the joint probability distribution over
variables obtained by sampling each variable
conditioned on its parents
• Simple rules for determining whether two variables
are dependent or independent
Identifying independence
(figure: six three-node graphs over X1, X2, X3, split into cases where X1 and X3 are dependent and cases where X1 and X3 are independent; shaded variables are observed)
Identifying independence
Four random variables:
X1: coin toss produces heads
X2: pencil levitates
X3: friend has psychic powers
X4: friend has two-headed coin
(figure: the network X3 → X1 ← X4, X3 → X2 drawn three times with different variables observed: X4 and X2 are independent; X4 and X2 are dependent; X4 and X2 are independent)
Reading a Bayesian network
• The structure of a Bayes net can be read as the
generative process behind a distribution
• Gives the joint probability distribution over
variables obtained by sampling each variable
conditioned on its parents
• Simple rules for determining whether two variables
are dependent or independent
• Independence makes inference more efficient
Computing with Bayes nets
P(x1 1, x 2, x 3 , x 4 )
P(x 2, x 3 , x 4 | x1 1) 
P(x1 1)
P(x4) X4
P(x1|x3, x4)
X3 P(x3)
X1
X2 P(x2|x3)
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)

Computing with Bayes nets
P(x1 1) 
 P(x
1
x 2 ,x 3 ,x 4
P(x4) X4
P(x1|x3, x4)
1, x 2 , x 3, x 4 )
sum over 8 values
X3 P(x3)
X1
X2 P(x2|x3)
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
Computing with Bayes nets
P(x1 1) 
 P(x
1
| x 3, x 4 )P(x 2 | x 3 )P(x 3 )P(x 4 )
x 2 ,x 3 ,x 4
P(x4) X4
P(x1|x3, x4)
X3 P(x3)
X1
X2 P(x2|x3)
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
Computing with Bayes nets
P(x1 1) 
 P(x
1
x 3 ,x 4
sum over 4 values
P(x4) X4
P(x1|x3, x4)
| x 3, x 4 )P(x 3 )P(x 4 )
X3 P(x3)
X1
X2 P(x2|x3)
P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)
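A minimal sketch of the saving, reusing invented CPTs of the same shape: the naive sum over (x2, x3, x4) has 8 terms, while the structured sum, once x2 has been summed out of P(x2|x3), has only 4.

```python
import itertools

# Made-up CPTs for X3 -> X1 <- X4, X3 -> X2 (each entry is P(xi = 1)).
p_x3, p_x4 = 0.01, 0.01
p_x2_given_x3 = {0: 0.0, 1: 0.5}
p_x1_given_x3_x4 = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.99, (1, 1): 1.0}

def b(p, x):
    """Probability that a Bernoulli(p) variable takes value x."""
    return p if x == 1 else 1 - p

# Naive: sum the factored joint over all 8 values of (x2, x3, x4).
naive = sum(b(p_x1_given_x3_x4[(x3, x4)], 1) * b(p_x2_given_x3[x3], x2)
            * b(p_x3, x3) * b(p_x4, x4)
            for x2, x3, x4 in itertools.product([0, 1], repeat=3))

# Structured: sum_x2 P(x2|x3) = 1, leaving only 4 terms over (x3, x4).
structured = sum(b(p_x1_given_x3_x4[(x3, x4)], 1) * b(p_x3, x3) * b(p_x4, x4)
                 for x3, x4 in itertools.product([0, 1], repeat=2))

print(naive, structured)  # identical values, half the terms
```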
Computing with Bayes nets
• Inference algorithms for Bayesian networks exploit
dependency structure
• Message-passing algorithms
– “belief propagation” passes simple messages between
nodes, exact for tree-structured networks
• More general inference algorithms
– exact: “junction-tree”
– approximate: Monte Carlo schemes
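A minimal sketch of the message-passing idea on a three-node chain (all probabilities invented; real belief propagation implementations handle general trees and incorporate evidence): each message sums out one upstream variable, and the result matches brute-force enumeration.

```python
# Chain X1 -> X2 -> X3, all binary, with invented probabilities.
p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # key: (x1, x2)
p_x3_given_x2 = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.7}  # key: (x2, x3)

# Message passing: each message marginalizes out one upstream variable.
msg_1_to_2 = {x2: sum(p_x1[x1] * p_x2_given_x1[(x1, x2)] for x1 in (0, 1))
              for x2 in (0, 1)}
p_x3 = {x3: sum(msg_1_to_2[x2] * p_x3_given_x2[(x2, x3)] for x2 in (0, 1))
        for x3 in (0, 1)}
print("message passing:", p_x3)

# Check against brute-force enumeration of all 2^3 joint assignments.
check = {x3: sum(p_x1[x1] * p_x2_given_x1[(x1, x2)] * p_x3_given_x2[(x2, x3)]
                 for x1 in (0, 1) for x2 in (0, 1))
         for x3 in (0, 1)}
print("enumeration:    ", check)
```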
Logic and probability
• Bayesian networks are equivalent to a probabilistic
propositional logic
• Associate variables with atomic propositions…
– Bayes net specifies a distribution over possible worlds,
probability of a proposition is a sum over worlds
• More efficient than simply enumerating worlds
• Developing similarly efficient schemes for
working with other probabilistic logics is a major
topic of current research
Graphical models
• Introduction to graphical models
– definitions
– efficient representation and inference
– explaining away
• Graphical models and cognitive science
– uses of graphical models
Identifying independence
(figure: six three-node graphs over X1, X2, X3, split into cases where X1 and X3 are dependent and cases where X1 and X3 are independent; shaded variables are observed)
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:
P(W = w | S, R) = 1 if S = s or R = r
= 0 if R = ¬r and S = ¬s.
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(w | r) P(r) / P(w)
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(w | r) P(r) / Σ_{r, s} P(w | r, s) P(r, s)
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(r) / [ P(r, s) + P(r, ¬s) + P(¬r, s) ]
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(r) / [ P(r) + P(¬r, s) ]
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet:
P(r | w) = P(r) / [ P(r) + P(¬r) P(s) ] ≥ P(r)
(the denominator P(r) + P(¬r) P(s) lies between P(s) and 1)
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet and sprinklers were left on:
P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)
(both P(w | r, s) and P(w | s) equal 1)
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
Compute probability it rained last night, given that the grass is wet and sprinklers were left on:
P(r | w, s) = P(r | s) = P(r)
Explaining away
Rain → Grass Wet ← Sprinkler
P(R, S, W) = P(R) P(S) P(W | S, R)
P(W = w | S, R) = 1 if S = s or R = r, 0 if R = ¬r and S = ¬s.
P(r | w) = P(r) / [ P(r) + P(¬r) P(s) ] ≥ P(r)
P(r | w, s) = P(r | s) = P(r)
“Discounting” to prior probability.
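Plugging in hypothetical numbers makes the discounting concrete (P(r) = 0.3 and P(s) = 0.2 are invented for illustration):

```python
# Numerical illustration of explaining away, with hypothetical priors.
p_r = 0.3   # P(rain)
p_s = 0.2   # P(sprinkler)
# Grass is wet iff it rained or the sprinkler was on (deterministic OR).

# P(r | w) = P(r) / (P(r) + P(not r) P(s)): rises above the prior.
p_r_given_w = p_r / (p_r + (1 - p_r) * p_s)

# P(r | w, s) = P(r | s) = P(r): the sprinkler explains the wet grass away.
p_r_given_w_s = p_r

print(f"prior P(r)            = {p_r:.3f}")
print(f"P(r | wet)            = {p_r_given_w:.3f}")    # ~0.682
print(f"P(r | wet, sprinkler) = {p_r_given_w_s:.3f}")  # back to 0.300
```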
Contrast w/ production system
Rain → Grass Wet ← Sprinkler
• Formulate IF-THEN rules:
– IF Rain THEN Wet
– IF Wet THEN Rain
– IF Wet AND NOT Sprinkler THEN Rain
• Rules do not distinguish directions of inference
• Requires combinatorial explosion of rules
Contrast w/ spreading activation
Rain → Grass Wet ← Sprinkler
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active.
• Observing grass wet, Rain and Sprinkler become
more active
• Observing grass wet and sprinkler, Rain cannot
become less active. No explaining away!
Contrast w/ spreading activation
Rain → Grass Wet ← Sprinkler
• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain ↔ Sprinkler
• Observing grass wet, Rain and Sprinkler become
more active
• Observing grass wet and sprinkler, Rain becomes
less active: explaining away
Contrast w/ spreading activation
Rain → Grass Wet ← Sprinkler, plus Burst pipe → Grass Wet
• Each new variable requires more inhibitory connections
• Not modular
– whether a connection exists depends on what others exist
– big holism problem
– combinatorial explosion
Contrast w/ spreading activation
(McClelland & Rumelhart, 1981)
Graphical models
• Capture dependency structure in distributions
• Provide an efficient means of representing and
reasoning with probabilities
• Support kinds of inference that are problematic
for other cognitive models: explaining away
– hard to capture in a production system
– more natural than with spreading activation
Graphical models
• Introduction to graphical models
– definitions
– efficient representation and inference
– explaining away
• Graphical models and cognitive science
– uses of graphical models
Uses of graphical models
• Understanding existing cognitive models
– e.g., neural network models
Sigmoid belief networks
• We can view multilayer perceptrons as Bayes nets with specific probabilities (e.g., Neal, 1992)
• Makes it possible to use Bayesian tools with existing neural network models (e.g., Mackay, 1992)
(figure: a small layered network with units x1, x2; z1, z2; and y)
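A minimal sketch of this view (the layer sizes and weights below are invented): each unit is a binary random variable whose probability of being 1 is a sigmoid function of its parents, so running the network in the direction of the forward pass is just ancestral sampling in a Bayes net.

```python
import math
import random

random.seed(1)

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

# Hypothetical weights for a tiny layered net: x1, x2 -> z1, z2 -> y.
w_xz = [[1.5, -2.0], [-1.0, 0.5]]   # w_xz[i][j]: weight from x_{i+1} to z_{j+1}
w_zy = [2.0, -1.5]                  # weights from z1, z2 to y
b_z, b_y = [0.0, 0.1], -0.2         # biases

def sample_net(x):
    """Ancestral sampling: draw each unit given its parents, layer by layer."""
    z = [int(random.random() < sigmoid(sum(w_xz[i][j] * x[i] for i in range(2)) + b_z[j]))
         for j in range(2)]
    y = int(random.random() < sigmoid(sum(w_zy[j] * z[j] for j in range(2)) + b_y))
    return z, y

print(sample_net([1, 0]))
```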
Uses of graphical models
• Understanding existing cognitive models
– e.g., neural network models
• Representation and reasoning
– a way to address holism in induction (c.f. Fodor)
The holism of confirmation
• If everything we know is one big probability
distribution, then discovering one small fact
requires changing all of our beliefs…
• Used by Fodor (2001) as an argument against
the possibility of inductive logic
• Bayes nets: everything can be connected to
everything, but inference can still be efficient
Uses of graphical models
• Understanding existing cognitive models
– e.g., neural network models
• Representation and reasoning
– a way to address holism in induction (c.f. Fodor)
• Defining generative models
– mixture models, language models, …
Graphical models and coin flipping
• Fair coin: P(H) = 0.5; four flips d1, d2, d3, d4, with no parameter node
• Coin with unknown bias: P(H) = θ; θ is a parent of each of d1, d2, d3, d4
• Hidden Markov model: hidden states s1, s2, s3, s4, with si ∈ {Fair coin, Trick coin}, each si generating the corresponding di
• Plate notation: θ → di, with a plate over i (N = number of replications)
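A minimal generative sketch of the hidden Markov model version (the transition and emission probabilities are assumptions added for illustration; the slide only specifies that si ∈ {Fair coin, Trick coin}):

```python
import random

random.seed(0)

# Assumed parameters: the coin state tends to persist from flip to flip,
# and the trick coin is biased toward heads.
p_stay = 0.9                            # P(s_i = s_{i-1})
p_heads = {"fair": 0.5, "trick": 0.9}   # P(d_i = H | s_i)

def sample_flips(n=4):
    s = random.choice(["fair", "trick"])
    states, flips = [], []
    for _ in range(n):
        states.append(s)
        flips.append("H" if random.random() < p_heads[s] else "T")
        if random.random() > p_stay:    # occasionally switch coins
            s = "trick" if s == "fair" else "fair"
    return states, flips

print(sample_flips())  # hidden states s1..s4 and the observed flips d1..d4
```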
A hierarchical Bayesian model
physical knowledge → FH, FT
θi ~ Beta(FH, FT) for each coin
Coin 1: θ1 → d1, d2, d3, d4
Coin 2: θ2 → d1, d2, d3, d4
...
Coin 200: θ200 → d1, d2, d3, d4
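A minimal generative sketch of this hierarchical model (the values of FH and FT are invented; the structure follows the slide: θi ~ Beta(FH, FT) for each coin, and each flip is Bernoulli(θi)):

```python
import random

random.seed(0)

# Hypothetical hyperparameters encoding physical knowledge about coins
# (pseudo-counts of heads and tails).
FH, FT = 10.0, 10.0

def sample_coins(n_coins=200, flips_per_coin=4):
    """theta_i ~ Beta(FH, FT) for each coin; each flip d ~ Bernoulli(theta_i)."""
    coins = []
    for _ in range(n_coins):
        theta = random.betavariate(FH, FT)
        flips = ["H" if random.random() < theta else "T"
                 for _ in range(flips_per_coin)]
        coins.append((theta, flips))
    return coins

coins = sample_coins()
print(coins[0])   # (theta_1, [d1, d2, d3, d4]) for Coin 1
```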
Uses of graphical models
• Understanding existing cognitive models
– e.g., neural network models
• Representation and reasoning
– a way to address holism in induction (c.f. Fodor)
• Defining generative models
– mixture models, language models, …
• Modeling human causal reasoning
– more on Friday!