Transcript MLNs

First Order
Representations and
Learning
coming up later: scalability!
Announcements

PIG assignment (last one) is out


Added some other examples to the “PIG workflow”
lecture, as well as pointers to documentation, etc.
PIG has been installed on Gates cluster


Sample exam questions will be out



see docs on cluster on course page
soon (Friday?)
think about the non-programming exercises in the
HWs also
Exam is 4/30 (in-class)
A little history

Tasks you do well:







statistical
approaches
Recognizing human faces
Character recognition (7 vs 4)
Simulations of physical events
Understanding statements in natural language
Proving mathematical theorems
Playing chess, Othello, …
logical
Solving logic puzzles (Soduko)
approaches
Logical approaches

“All men are mortal, Socrates is a man, so
Socrates is mortal”.



quantification: for all,
there exists
plus logical
connectives: and, or, …
for all W,X,Y,Z:


for all X: man(X)  mortal(X)
man(socrates)
father(X,Y) & mother(X,Z) & father(W,Y) &
father(W,Z) & W!=X <==> sibling(X,W)
for all X: feedsYoungMilk(X)  mammal(X)

for all X: mammal(X)  hasSpine(X)
Markov Logic Networks
Mostly pilfered from Pedro’s slides
Motivation



The real world is complex and uncertain
Logic handles complexity
Probability handles uncertainty
MLN’s Bold Agenda:
“Unify Logical and Statistical AI”
Field
Logical
approach
Knowledge
representation
First-order logic Graphical models
Automated
reasoning
Satisfiability
testing
Machine learning Inductive logic
programming
Planning
Classical
planning
Natural language Definite clause
grammars
processing
Statistical
approach
Markov chain
Monte Carlo
(e.g. Gibbs)
Neural networks,
….
Markov decision
processes
Prob. contextfree grammars
Markov Networks: [Review]

Undirected graphical models
Smoking
Cancer
Asthma

Cough
Potential functions defined over cliques
1
P( x)    c ( xc )
Z c
Z    c ( xc )
x
c
Smoking Cancer
Ф(S,C)
False
False
4.5
False
True
4.5
True
False
2.7
True
True
4.5
Markov Networks

Undirected graphical models
Smoking
Cancer
Asthma

Cough
Log-linear model:
1


P( x)  exp   wi f i ( x) 
Z
 i

Weight of Feature i
Feature i
 1 if  Smoking  Cancer
f1 (Smoking, Cancer )  
 0 otherwise
w1  1.5
Inference in Markov Networks

Goal: Compute marginals & conditionals of
P( X ) 


1


exp   wi f i ( X ) 
Z
 i



Z   exp   wi f i ( X ) 
X
 i

Exact inference is #P-complete
Conditioning on Markov blanket is easy:
w f ( x) 


P( x | MB( x )) 
exp   w f ( x  0)   exp   w f ( x  1) 
exp
i

i
i i
Gibbs sampling exploits this
i i
i
i i
MCMC: Gibbs Sampling
state ← random truth assignment
for i ← 1 to num-samples do
for each variable x
sample x according to P(x|neighbors(x))
state ← state with new value of x
P(F) ← fraction of states in which F is true
Overview


Motivation
Foundational areas







Probabilistic inference [Markov nets]
Statistical learning
Logical inference
Inductive logic programming
Putting the pieces together
MLNs
Applications
Learning Markov Networks

Learning parameters (weights)




Generatively
Discriminatively
Learning structure (features)
In this tutorial: Assume complete data
(If not: EM versions of algorithms)
Generative Weight Learning



Maximize likelihood or posterior probability
Numerical optimization (gradient or 2nd order)
No local maxima

log Pw ( x)  ni ( x)  Ew ni ( x)
wi
No. of times feature i is true in data
Expected no. times feature i is true according to model

Requires inference at each step (slow!)
Pseudo-Likelihood
PL( x)   P( xi | neighbors ( xi ))
i





Likelihood of each variable given its
neighbors in the data
Does not require inference at each step
Consistent estimator
Widely used in vision, spatial statistics, etc.
But PL parameters may not work well for
long inference chains
Discriminative Weight Learning

Maximize conditional likelihood of query (y)
given evidence (x)

log Pw ( y | x)  ni ( x, y )  Ew ni ( x, y )
wi
No. of true groundings of clause i in data
Expected no. true groundings according to model

Approximate expected counts by counts in
MAP state of y given x

[Like CRF learning – WC]
Overview


Motivation
Foundational areas







Probabilistic inference [Markov nets]
Statistical learning
Logical inference
Inductive logic programming
Putting the pieces together
MLNs
Applications
First-Order Logic





Constants, variables, functions, predicates
E.g.: Anna, x, MotherOf(x), Friends(x, y)
Literal: Predicate or its negation
Clause: Disjunction of literals
Grounding: Replace all variables by constants
E.g.: Friends (Anna, Bob)
World (model, interpretation):
Assignment of truth values to all ground
predicates
Inference in First-Order Logic


“Traditionally” done by theorem proving
(e.g.: Prolog)
Propositionalization followed by model
checking turns out to be faster (often a lot)




[This was popularized by Kautz & Selman – WC]
Propositionalization:
Create all ground atoms and clauses
Model checking: Satisfiability testing
Two main approaches:


Backtracking (e.g.: DPLL)
Stochastic local search (e.g.: WalkSAT)
Satisfiability

Input: Set of clauses
(Convert KB to conjunctive normal form (CNF))






(A v B v ~C)(~B v D) …
Output: Truth assignment that satisfies all clauses,
or failure
The paradigmatic NP-complete problem
Solution: Search
Key point:
Most [random] SAT problems are actually easy
Hard region: Narrow range of
#Clauses / #Variables
Backtracking





Assign truth values by depth-first search
Assigning a variable deletes false literals
and satisfied clauses
Empty set of clauses: Success
Empty clause: Failure
Additional improvements:


Unit propagation (unit clause forces truth value)
Pure literals (same truth value everywhere)
Stochastic Local Search






Uses complete assignments instead of partial
Start with random state
Flip variables in unsatisfied clauses
Hill-climbing: Minimize # unsatisfied clauses
Avoid local minima: Random flips
Multiple restarts
The WalkSAT Algorithm
for i ← 1 to max-tries do
solution = random truth assignment
for j ← 1 to max-flips do
if all clauses satisfied then
return solution
c ← random unsatisfied clause
with probability p
flip a random variable in c
else
flip variable in c that maximizes
number of satisfied clauses
return failure
Overview


Motivation
Foundational areas







Probabilistic inference [Markov nets]
Statistical learning
Logical inference
Inductive logic programming
Putting the pieces together
MLNs
Applications
Markov Logic


Logical language: First-order logic
Probabilistic language: Markov networks



Learning:



Syntax: First-order formulas with weights
Semantics: Templates for Markov net features
Parameters: Generative or discriminative
Structure: ILP with arbitrary clauses and MAP score
Inference:



MAP: Weighted satisfiability
Marginal: MCMC with moves proposed by SAT solver
Partial grounding + Lazy inference
Markov Logic: Intuition



A logical KB is a set of hard constraints
on the set of possible worlds
Let’s make them soft constraints:
When a world violates a formula,
It becomes less probable, not impossible
Give each formula a weight
(Higher weight  Stronger constraint)
P(world)  exp  weights of formulas it satisfies

Markov Logic: Definition

A Markov Logic Network (MLN) is a set of
pairs (F, w) where



F is a formula in first-order logic
w is a real number
Together with a set of constants,
it defines a Markov network with


One node for each grounding of each predicate in
the MLN
One feature for each grounding of each formula F
in the MLN, with the corresponding weight w
Example: Friends & Smokers
Smoking causes cancer.
Friends have similar smoking habits.
Example: Friends & Smokers
x Smokes( x )  Cancer ( x )
x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Smokes(A)
Cancer(A)
Smokes(B)
Cancer(B)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Friends(A,B)
Friends(A,A)
Smokes(A)
Smokes(B)
Cancer(A)
Friends(B,B)
Cancer(B)
Friends(B,A)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Friends(A,B)
Friends(A,A)
Smokes(A)
Smokes(B)
Cancer(A)
Friends(B,B)
Cancer(B)
Friends(B,A)
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Friends(A,B)
Friends(A,A)
Smokes(A)
Smokes(B)
Cancer(A)
Friends(B,B)
Cancer(B)
Friends(B,A)
Markov Logic Networks

MLN is template for ground Markov nets

Probability of a world x:
1


P( x)  exp   wi ni ( x) 
Z
 i

Weight of formula i



No. of true groundings of formula i in x
Typed variables and constants greatly reduce
size of ground Markov net
Functions, existential quantifiers, etc.
Infinite and continuous domains
Relation to Statistical Models

Special cases:











Markov networks
Markov random fields
Bayesian networks
Log-linear models
Exponential models
Max. entropy models
Gibbs distributions
Boltzmann machines
Logistic regression
Hidden Markov models
Conditional random fields

Obtained by making all
predicates zero-arity

Markov logic allows
objects to be
interdependent
(non-i.i.d.)
Relation to First-Order Logic



Infinite weights  First-order logic
Satisfiable KB, positive weights 
Satisfying assignments = Modes of distribution
Markov logic allows contradictions between
formulas
<aside> Is this a good thing?

Pros:




Drawing connections between problems
Specifying/visualizing/prototyping new methods
…
Cons:



Robust, general tools are very hard to build for
computationally hard tasks
This is especially true when learning is involved
Eg, MCMC doesn’t work well for large weights
Learning




Data is a relational database
Closed world assumption (if not: EM)
Learning parameters (weights)
Learning structure (formulas)
Weight Learning

Parameter tying: Groundings of same clause

log Pw ( x)  ni ( x)  Ew ni ( x)
wi
No. of times clause i is true in data
Expected no. times clause i is true according to MLN


Generative learning: Pseudo-likelihood
Discriminative learning: Cond. Likelihood

[still just like CRFs - all we need is to do inference
efficiently. Here they use a Collins-like method.
WC]
Problem 1… Memory
Explosion

Problem:
If there are n constants
and the highest clause arity is c,
c
the ground network requires O(n ) memory

Solution:
Exploit sparseness; ground clauses lazily
→ LazySAT algorithm [Singla & Domingos, 2006]

[WC: this is a partial solution….]
Ground Network Construction
network ← Ø
queue ← query nodes
repeat
node ← front(queue)
remove node from queue
add node to network
if node not in evidence then
add neighbors(node) to queue
until queue = Ø
Example: Friends & Smokers
1.5 x Smokes( x )  Cancer ( x )
1.1 x, y Friends ( x, y )  Smokes( x )  Smokes( y ) 
Two constants: Anna (A) and Bob (B)
Friends(A,B)
Friends(A,A)
Smokes(A)
Smokes(B)
Cancer(A)
Friends(B,B)
Cancer(B)
Friends(B,A)
Problem 2: MAP/MPE Inference

Problem: Find most likely state of world
given evidence
arg max P( y | x)
y
Query
Evidence
MAP/MPE Inference

Problem: Find most likely state of world
given evidence
1


arg max
exp   wi ni ( x, y ) 
Zx
y
 i

MAP/MPE Inference

Problem: Find most likely state of world
given evidence
arg max
y
 w n ( x, y)
i i
i
MAP/MPE Inference

Problem: Find most likely state of world
given evidence
arg max
y



 w n ( x, y)
i i
i
This is just the weighted MaxSAT problem
Use weighted SAT solver
(e.g., MaxWalkSAT [Kautz et al., 1997] )
Potentially faster than logical inference (!)
The MaxWalkSAT Algorithm
for i ← 1 to max-tries do
solution = random truth assignment
for j ← 1 to max-flips do
if ∑ weights(sat. clauses) > threshold then
return solution
c ← random unsatisfied clause
with probability p
flip a random variable in c
else
flip variable in c that maximizes
∑ weights(sat. clauses)
return failure, best solution found
MCMC is Insufficient for Logic

Problem:
Deterministic dependencies break MCMC
Near-deterministic ones make it very slow

Solution:
Combine MCMC and WalkSAT
→ MC-SAT algorithm [Poon & Domingos, 2006]
Current Best Solution: The
MC-SAT Algorithm


[Basic idea: figure out how to sample from SAT
assignments, and use this as a subroutine for MCMC WC]
MC-SAT = MCMC + SAT


MCMC: Slice sampling with an auxiliary variable
for each clause
SAT: Wraps around SampleSAT (a uniform sampler)
to sample from highly non-uniform distributions

Sound: Satisfies ergodicity & detailed balance

Efficient: Orders of magnitude faster than
Gibbs and other MCMC algorithms
The MC-SAT Algorithm
X(0)  A random solution satisfying all hard clauses
for k  1 to num_samples
MØ
forall Ci satisfied by X(k – 1)
With prob. 1 – exp(– wi ) add Ci to M
endfor
X(k)  A uniformly random solution satisfying M
endfor
The MC-SAT Algorithm

Sound: Satisfies ergodicity and detailed balance
Assuming we have a perfect uniform sampler
 Of course, we don’t (since the problem is #P-hard)
But there is an approximate uniform sampler [Wei et al.
2004]




SampleSAT = WalkSAT + Simulated annealing
Trade off uniformity vs. efficiency
by tuning the prob. of WS steps vs. SA steps
Overview


Motivation
Foundational areas







Probabilistic inference [Markov nets]
Statistical learning
Logical inference
Inductive logic programming
Putting the pieces together
MLNs
Applications
Applications







Basics
Logistic regression
Hypertext classification
Information retrieval
Entity resolution
Hidden Markov models
Information extraction







Statistical parsing
Semantic processing
Bayesian networks
Relational models
Robot mapping
Planning and MDPs
Practical tips
Text Classification
page = { 1, … , n }
word = { … }
topic = { … }
Topic(page,topic!)
HasWord(page,word)
!Topic(p,t)
[bias term, since topics are sparse? –WC]
HasWord(p,+w) => Topic(p,+t)
Entity Resolution
Problem: Given database, find duplicate records
HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)
TFIDFSimGreaterThan0.8(+t,+f,r’)
HasToken(+t,+f,r) ^ HasToken(+t,+f,r’)
=> SameField(f,r,r’)
SameField(f,r,r’) => SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”)
=> SameRecord(r,r”)
TFIDFSimGreaterThan0.6(+t,+f,r’) => SameField(f,r,r’)
TFIDFSimGreaterThan0.4(+t,+f,r’) => SameField(f,r,r’)
Etc, etc
Entity Resolution
Problem: Given database, find duplicate records
HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)
HasToken(+t,+f,r) ^ HasToken(+t,+f,r’)
=> SameField(f,r,r’)
SameField(f,r,r’) => SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”)
=> SameRecord(r,r”)
Entity Resolution
Can also resolve fields:
HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)
HasToken(+t,+f,r) ^ HasToken(+t,+f,r’)
=> SameField(f,r,r’)
SameField(f,r,r’) <=> SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”)
=> SameRecord(r,r”)
SameField(f,r,r’) ^ SameField(f,r’,r”)
=> SameField(f,r,r”)
P. Singla & P. Domingos, “Entity Resolution with
Markov Logic”, in Proc. ICDM-2006.
Hidden Markov Models
obs = { Obs1, … , ObsN }
state = { St1, … , StM }
time = { 0, … , T }
State(state!,time)
Obs(obs!,time)
State(+s,0)
State(+s,t) => State(+s',t+1)
Obs(+o,t) => State(+s,t)
William’s summary

MLNs are a compact, elegant way of describing a
Markov network




Standard learning methods work
Network may be very very large
Inference may be expensive
Doesn’t eliminate feature engineering


E.g., complicated “feature” predicates
Experimental results for joint matching/NER are not that
strong overall



Cascading segmentation and then matching improves
segmentation, probably not matching
But it needs to be carefully restricted (efficiency?)
And it’s no better than a “pseudo” joint model using plausible
matches as features