Markov nets: an even quicker intro

Download Report

Transcript Markov nets: an even quicker intro

Markov Networks
Model Definition
Comparison to Bayes Nets
Inference techniques
Learning Techniques
We can have potentials
on any cliques—not just
the maximal ones.
So, for example we can
have a potential on A
in addition to the other
four pairwise potentials
A
B
D
Qn: What is the
most likely
configuration of A&B?
C
Okay, you convinced me
that given any potentials
we will have a consistent
Joint. But given any joint,
will there be a potentials
I can provide?
Hammersley-Clifford
theorem…
Although A,B would
Like to agree, B&C
Need to agree,
C&D need to disagree
And D&A need to agree
.and the latter three have
Higher weights!
Mr. & Mrs. Smith example 
Moral: Factors
are not marginals!
Markov Networks
• Undirected graphical models
Smoking
Cancer
Asthma

Cough
Potential functions defined over cliques
1
P( x)    c ( xc )
Z c
Z    c ( xc )
x
c
Smoking Cancer
Ф(S,C)
False
False
4.5
False
True
4.5
True
False
2.7
True
True
4.5
Log-Linear models for Markov
Nets
A
B
D
C
Without loss of generality
Factors are “functions” over their domains
Log linear model consists of
 Features fi (Di ) (functions over domains)
Weights wi for features s.t.
Markov Networks
• Undirected graphical models
Smoking
Cancer
Asthma

Cough
Log-linear model:
1


P( x)  exp   wi f i ( x) 
Z
 i

Weight of Feature i
Feature i
 1 if  Smoking  Cancer
f1 (Smoking, Cancer )  
 0 otherwise
w1  1.5
Markov Nets vs. Bayes Nets
Property
Markov Nets
Bayes Nets
Form
Prod. potentials
Prod. potentials
Potentials
Arbitrary
Cond. probabilities
Cycles
Allowed
Forbidden
Partition func. Z = ? global
Indep. check
Z = 1 local
Graph separation D-separation
Indep. props. Some
Some
Inference
Convert to Markov
MCMC, BP, etc.
Connection to MCMC:
MCMC requires sampling a node given its markov blanket
Need to use P(x|MB(x)). For Bayes nets MB(x) contains more
nodes than are mentioned in the local distribution CPT(x)
 For Markov nets,
Inference in Markov Networks
• Goal: Compute marginals & conditionals of
P( X ) 
1


exp   wi f i ( X ) 
Z
 i



Z   exp   wi f i ( X ) 
X
 i

• Exact inference is #P-complete
• Most BN inference approaches work for MNs too
– Variable Elimination used factor multiplication—and
should work without change..
• Conditioning on Markov blanket is easy:
exp  i wi fi ( x ) 
P( x | MB( x )) 
exp  i wi fi ( x  0)   exp  i wi fi ( x  1) 
• Gibbs sampling exploits this
Gibbs sampling for Markov networks
• Example: P(D | ¬c)
• Resample non-evidence
variables in a pre-defined
order or a random order
• Suppose we begin with A
– B and C are Markov blanket
of A
– Calculate P(A | B,C)
– Use current Gibbs sampling
value for B & C
– Note: never change
(evidence).
A
B
C
D
E
F
A B C D E F
1
0
0
1
1
0
Example: Gibbs sampling
• Resample probability
distribution of A
A
B C D E F
1
0
0
1
1
0
?
0
0
1
1
0
Φ1 × Φ 2 × Φ 3 =
Normalized result =
a
¬a
b
1
5
¬b
4.3
0.2
¬a
25.8 0.8
a
¬a
0.97 0.03
¬a
2
1
a ¬a
c
ϕ1
ϕ2
a
a
A
¬c 3 4
ϕ3
B
C
D
E
F
1 2
Example: Gibbs sampling
• Resample probability
distribution of B
A
B C D E F
1
0
0
1
1
0
b
¬b
1 0 0 1 1 0
1
?
0
1
1
0
Φ1 × Φ 2 × Φ 4 =
Normalized result =
d
¬d
1
2
2
1
a
¬a
b
1
5
¬b
4.3
0.2
ϕ1
ϕ2
A
B
b
¬b
1
8.6
b
¬b
0.11
0.89
C
ϕ4
D
E
F
MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
for each variable x
sample x according to P(x|neighbors(x))
state ← state with new value of x
P(F) ← fraction of states in which F is true
Learning Markov Networks
• Learning parameters (weights)
– Generatively
– Discriminatively
• Learning structure (features)
• Easy Case: Assume complete data
(If not: EM versions of algorithms)
Entanglement in log likelihood…
a
b
c
Learning for log-linear
formulation
What is the expected
Value of the feature
given the current
parameterization
of the network?
Requires inference to answer
(inference at every iteration—
sort of like EM )
Use gradient ascent
Unimodal, because Hessian is
Co-variance matrix over features
Why should we spend so much time
computing gradient?
• Given that gradient is being used only in doing the
gradient ascent iteration, it might look as if we
should just be able to approximate it in any which
way
– Afterall, we are going to take a step with some arbitrary
step size anyway..
• ..But the thing to keep in mind is that the gradient is
a vector. We are talking not just of magnitude but
direction. A mistake in magnitude can change the
direction of the vector and push the search into a
completely wrong direction…
P( X ) 
1


exp   wi f i ( X ) 
Z
 i



Z   exp   wi f i ( X ) 
X
 i

Generative Weight Learning
• Maximize likelihood or posterior probability
• Numerical optimization (gradient or 2nd order)
• No local maxima

log Pw ( x)  ni ( x)  Ew ni ( x)
wi
No. of times feature i is true in data
Expected no. times feature i is true according to model
• Requires inference at each step (slow!)
Alternative Objectives to maximize..
• Since log-likelihood
requires network inference
to compute the derivative,
we might want to focus on
other objectives whose
gradients are easier to
compute (and which also –
hopefully—have optima at
the same parameter values).
• Two options:
– Pseudo Likelihood
– Contrastive Divergence
Given a single data instance x log-li
Log prob of data
Log prob of all other possible
data instances (w.r.t. current q
Maximize the distance
(“increase the divergence”)
Compute likelihood of
each possible data instance
just using markov blanket
(approximate chain rule)
Pick a sample of
typical other instances
(need to sample from Pq
Run MCMC initializing with
the data..)
Pseudo-Likelihood
PL( x)   P( xi | neighbors ( xi ))
i
• Likelihood of each variable given its neighbors
in the data
• Does not require inference at each step
• Consistent estimator
• Widely used in vision, spatial statistics, etc.
• But PL parameters may not work well for
long inference chains
[Which can lead to disasterous results]
Discriminative Weight Learning
• Maximize conditional likelihood of query (y)
given evidence (x)

log Pw ( y | x)  ni ( x, y )  Ew ni ( x, y )
wi
No. of true groundings of clause i in data
Expected no. true groundings according to model
• Approximate expected counts by counts in
MAP state of y given x
Markov Logic: Intuition
• A logical KB is a set of hard constraints
on the set of possible worlds
• Let’s make them soft constraints:
When a world violates a formula,
It becomes less probable, not impossible
• Give each formula a weight
(Higher weight  Stronger constraint)
P(world)  exp  weights of formulas it satisfies

Markov Logic: Definition
• A Markov Logic Network (MLN) is a set of
pairs (F, w) where
– F is a formula in first-order logic
– w is a real number
• Together with a set of constants,
it defines a Markov network with
– One node for each grounding of each predicate
in the MLN
– One feature for each grounding of each formula
F in the MLN, with the corresponding weight w
Example: Friends & Smokers
Smoking causes cancer.
Friends have similar smoking habits.
Example: Friends & Smokers
x Smokes( x )  Cancer ( x )
x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Smokes(A)
Cancer(A)
Smokes(B)
Cancer(B)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Friends(A,B)
Friends(A,A)
Smokes(A)
Smokes(B)
Cancer(A)
Friends(B,B)
Cancer(B)
Friends(B,A)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Friends(A,B)
Friends(A,A)
Smokes(A)
Smokes(B)
Cancer(A)
Friends(B,B)
Cancer(B)
Friends(B,A)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Friends(A,B)
Friends(A,A)
Smokes(A)
Smokes(B)
Cancer(A)
Friends(B,B)
Cancer(B)
Friends(B,A)
Markov Logic Networks
• MLN is template for ground Markov nets
• Probability of a world x:
1


P( x)  exp   wi ni ( x) 
Z
 i

Weight of formula i
No. of true groundings of formula i in x
• Typed variables and constants greatly reduce
size of ground Markov net
• Functions, existential quantifiers, etc.
• Infinite and continuous domains
Relation to Statistical Models
• Special cases:
–
–
–
–
–
–
–
–
–
–
–
Markov networks
Markov random fields
Bayesian networks
Log-linear models
Exponential models
Max. entropy models
Gibbs distributions
Boltzmann machines
Logistic regression
Hidden Markov models
Conditional random fields
• Obtained by making all
predicates zero-arity
• Markov logic allows
objects to be
interdependent
(non-i.i.d.)
Relation to First-Order Logic
• Infinite weights  First-order logic
• Satisfiable KB, positive weights 
Satisfying assignments = Modes of distribution
• Markov logic allows contradictions between
formulas
MAP/MPE Inference
• Problem: Find most likely state of world
given evidence
arg max P( y | x)
y
Query
Evidence
MAP/MPE Inference
• Problem: Find most likely state of world
given evidence
1


arg max
exp   wi ni ( x, y ) 
Zx
y
 i

MAP/MPE Inference
• Problem: Find most likely state of world
given evidence
arg max
y
 w n ( x, y)
i i
i
MAP/MPE Inference
• Problem: Find most likely state of world
given evidence
arg max
y
 w n ( x, y)
i i
i
• This is just the weighted MaxSAT problem
• Use weighted SAT solver
(e.g., MaxWalkSAT [Kautz et al., 1997] )
• Potentially faster than logical inference (!)
The MaxWalkSAT Algorithm
for i ← 1 to max-tries do
solution = random truth assignment
for j ← 1 to max-flips do
if ∑ weights(sat. clauses) > threshold then
return solution
c ← random unsatisfied clause
with probability p
flip a random variable in c
else
flip variable in c that maximizes
∑ weights(sat. clauses)
return failure, best solution found
But … Memory Explosion
• Problem:
If there are n constants
and the highest clause arity is c, c
the ground network requires O(n ) memory
• Solution:
Exploit sparseness; ground clauses lazily
→ LazySAT algorithm [Singla & Domingos,
2006]
Computing Probabilities
• P(Formula|MLN,C) = ?
• MCMC: Sample worlds, check formula
holds
• P(Formula1|Formula2,MLN,C) = ?
• If Formula2 = Conjunction of ground atoms
– First construct min subset of network necessary
to answer query (generalization of KBMC)
– Then apply MCMC (or other)
• Can also do lifted inference [Braz et al, 2005]
Ground Network Construction
network ← Ø
queue ← query nodes
repeat
node ← front(queue)
remove node from queue
add node to network
if node not in evidence then
add neighbors(node) to queue
until queue = Ø
But … Insufficient for Logic
• Problem:
Deterministic dependencies break MCMC
Near-deterministic ones make it very slow
• Solution:
Combine MCMC and WalkSAT
→ MC-SAT algorithm [Poon & Domingos, 2006]
Learning
•
•
•
•
Data is a relational database
Closed world assumption (if not: EM)
Learning parameters (weights)
Learning structure (formulas)
Weight Learning
• Parameter tying: Groundings of same clause

log Pw ( x)  ni ( x)  Ew ni ( x)
wi
No. of times clause i is true in data
Expected no. times clause i is true according to MLN
• Generative learning: Pseudo-likelihood
• Discriminative learning: Cond. likelihood,
use MC-SAT or MaxWalkSAT for inference
Structure Learning
• Generalizes feature induction in Markov nets
• Any inductive logic programming approach can be
used, but . . .
• Goal is to induce any clauses, not just Horn
• Evaluation function should be likelihood
• Requires learning weights for each candidate
• Turns out not to be bottleneck
• Bottleneck is counting clause groundings
• Solution: Subsampling
Structure Learning
• Initial state: Unit clauses or hand-coded
KB
• Operators: Add/remove literal, flip sign
• Evaluation function:
Pseudo-likelihood + Structure prior
• Search: Beam, shortest-first, bottom-up
[Kok & Domingos, 2005; Mihalkova & Mooney, 2007]
Alchemy
Open-source software including:
• Full first-order logic syntax
• Generative & discriminative weight learning
• Structure learning
• Weighted satisfiability and MCMC
• Programming language features
alchemy.cs.washington.edu
Alchemy
Represent- F.O. Logic +
ation
Markov nets
Prolog
BUGS
Horn
clauses
Bayes
nets
Inference
Model check- Theorem Gibbs
ing, MC-SAT proving
sampling
Learning
Parameters
& structure
No
Params.
Uncertainty Yes
No
Yes
Relational
Yes
No
Yes