Transcript PPT

FACTORIE
Probabilistic Programming
via
Imperatively Defined Factor Graphs
Seminar
"Probabilistic Models for Information Extraction",
Summer '11
Presenter: Artem Boldyrev
Tutor: Dr. Martin Theobald
Based on papers:
• Andrew McCallum, Karl Schultz, and Sameer Singh: FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs, NIPS'09
• Michael Wick, Andrew McCallum, and Gerome Miklau: Scalable Probabilistic Databases with Factor Graphs and MCMC, VLDB'10
Outline
• Introduction
– Factor graph
– Inference
– IDF in FACTORIE
• Citation matching
– Approaches
– Implementation with IDF
– Experiments
• Scalable Probabilistic DB
– Named Entity Resolution
– View maintenance
– Experiments
Motivation
• Cascading errors in pipelines
– Joint inference across multiple subtasks
• Uni-directional inference information flow
– A highly-coupled, bi-directional joint inference
• Graphical model for inference
– Imperatively-defined factor graphs
• Error reduction & faster running time
than the previous state-of-the-art system
Factor Graph
Bipartite graph
– Variable nodes (circles)
– Factor nodes (squares)
– Edge between variable and factor if the factor
depends on the variable
Markov Random Field
• Given a collection of cliques A of X
• We define an undirected graphical model as
the set of distributions:
    p(y | x) = (1/Z) ∏_A Φ_A(x_A, y_A)
• Φ_A is a factor; it computes a scalar value over
its inputs x_A and y_A
• The constant Z is called the normalization
constant:
    Z = Σ_{x,y} ∏_A Φ_A(x_A, y_A)
Markov Random Field
• We assume that factors have the log-linear form:
    Φ_A(x_A, y_A) = exp( Σ_k θ_Ak f_Ak(x_A, y_A) )
• θ_Ak is a real-valued parameter vector (weights)
– Weights learned from the data
• {f_Ak} is a set of feature functions
– Designed by an expert
• Example feature (with weight 5):
    f(Smokes, Cancer) = 1 if Smokes(P) ⇒ Cancer(P), 0 otherwise
Inference Example
• Given a factor graph
• P(A|B,C) = P(A,B,C) / P(B,C) = 0.5
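The slide's concrete factor graph is only in the (omitted) figure, so the sketch below queries a made-up two-factor graph over booleans in the same way, by brute-force enumeration (the factor values phi1/phi2 are invented):

```python
from itertools import product

# p(a, b, c) is proportional to phi1(a, b) * phi2(b, c).
def phi1(a, b): return 2.0 if a == b else 1.0
def phi2(b, c): return 3.0 if b == c else 1.0

def joint(a, b, c): return phi1(a, b) * phi2(b, c)

# Z sums the unnormalized scores of all 8 possible worlds.
Z = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))

# P(A=a | B=b, C=c) = P(a,b,c) / P(b,c): marginalize A in the denominator.
def cond_a(a, b, c):
    return joint(a, b, c) / sum(joint(x, b, c) for x in [True, False])
```

Note that Z cancels in the conditional query: only the unnormalized factor products are needed.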
Imperatively Defined Factor Graphs
• A factor template T_j consists of:
– parameters θ_jk,
– feature functions f_jk, and
– a description of an arbitrary relationship between
variables, yielding a set of satisfying tuples (x_i, y_i)
• Let T be the set of factor templates; then the
probability distribution becomes:
    p(y | x) = (1/Z(x)) ∏_{T_j ∈ T} ∏_{(x_i, y_i) ∈ T_j} exp( Σ_{k=1}^{K_j} θ_jk f_jk(x_i, y_i) )
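The key point of the template formula is that every factor stamped out by one template shares the same weights. A minimal sketch outside FACTORIE, with an emission-style weight table that is purely illustrative:

```python
# One template: a factor between each label and its observed token,
# all scored with the same shared weight table (invented values).
THETA = {("PER", "smith"): 2.0, ("O", "the"): 1.5}

def log_score(tokens, labels):
    # Unrolling the template yields one factor per (label, token) pair;
    # the total log-score is the sum of the shared-weight factor scores.
    return sum(THETA.get((lab, tok), 0.0) for tok, lab in zip(tokens, labels))
```

A longer sequence simply stamps out more factors; no new parameters appear.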
FACTORIE
• "Factor Graphs, Imperative, Extensible"
• Implemented as a library in Scala
• Combining declarative and procedural
domain knowledge
• A model written as an IDF is a factor graph
with all the traditional semantics of factor
graphs
FACTORIE
Design Goals
• Representation of factor graphs
• Scalability
• Efficient, built-in support for training
• Easy to use
FACTORIE
Applications
• named entity recognition
• entity resolution
• relation extraction
• parsing
• schema matching
• ontology alignment
• etc.
Outline
• Introduction
– Factor graph
– Inference
– IDF in FACTORIE
• Citation matching
– Approaches
– Implementation with IDF
– Experiments
• Scalable Probabilistic DB
– Named Entity Resolution
– View maintenance
– Experiments
Citation Matching
• We have a large collection of citation
strings from "References" sections
• Citation strings have:
– Different citation styles
– Different abbreviations
– Typographical errors
Citation Matching
• Input data:
– Set of mention strings (citations)
• Our goal is:
– Segmentation
• Output: author, title and venue
– Entity Resolution
• Output: clusters of the mention strings
Citation Matching
Citation Data
Citation Matching
Segmentation
Citation Matching
Entity Resolution
Citation Matching
Approaches
• Pipeline approach
• Iterated pipeline approach
• Bi-directional approach
for joint inference over segmentation and entity resolution
(our approach)
Citation Matching
Pipeline Approach
• Tasks are solved in some order
• Cascading errors through the stages
• Uni-directional information flow
Citation Matching
Iterated Pipeline Approach
• Close the loop of the pipeline
– Tasks use information from each other
• Cascading errors are still not eliminated
Citation Matching
Bi-directional approach (our approach)
• Integrate the models in a single, unified, "fully-joint"
factor graph
• We achieve a decrease in cascading errors
Citation Matching
via Imperatively-defined factor graphs
Segmentation
Variables:
• Token: Observed variable representing a word in the mention
• Label: Variable that can take any of the field types as a value
• Field: Consecutive Tokens that have the same label type
Factors:
• LabelToken: Factor between every token and its label
• LabelNextToken: Factor between every label and the next token
• LabelPrevToken: Factor between every label and the previous token
• FieldFactor: Factor created for every Field to allow field-wise features

class Token(word:String) extends EnumVariable(word)

class Label(label:String, val token:Token)
  extends EnumVariable(label) with VarInSeq

// LabelToken factor template
new TemplateWithDotStatistics2[Label,Token] {
  def unroll1(label:Label) = Factor(label, label.token)
  def unroll2(token:Token) =
    throw new Error("Token values shouldn't change") }

// LabelNextToken factor template
new TemplateWithDotStatistics2[Label,Token] {
  def unroll1(label:Label) =
    if (label.hasNext) Factor(label, label.next.token) else Nil
  def unroll2(token:Token) =
    throw new Error("Token values shouldn't change") }

// LabelPrevToken factor template
new TemplateWithDotStatistics2[Label,Token] {
  def unroll1(label:Label) =
    if (label.hasPrev) Factor(label, label.prev.token) else Nil
  def unroll2(token:Token) =
    throw new Error("Token values shouldn't change") }

// FieldFactor factor template
new TemplateWithDotStatistics1[Field]
Citation Matching
via Imperatively-defined factor graphs
Entity Resolution
Variables:
• Mention: Variable that takes a single Entity as its value
• Entity: Set of Mentions that are coreferent
Factors:
• Affinity: Factor created between coreferent pairs of Mentions
• Repulsion: Factor created between pairs of Mentions that are not coreferent

class Mention(val string:String)
  extends PrimitiveVariable[Entity]

class Entity extends SetVariable[Mention] {
  var canonical:String = ""
  def add(m:Mention, d:DiffList) = {
    super.add(m,d); m.set(this,d)
    canonical = recomputeCanonical(members) }
  def remove(m:Mention, d:DiffList) = {
    super.remove(m,d); m.set(null,d)
    canonical = recomputeCanonical(members) }
}

// Affinity factor template
new Template2[Mention,Mention] with DotStatistics1[AffinityVector] {
  def unroll1(mention:Mention) =
    for (other <- mention.entity.mentions) yield Factor(other, mention)
  def unroll2(mention:Mention) = Nil
  def statistics(mention1:Mention, mention2:Mention) =
    Stat(new AffinityVector(mention1.name, mention2.name)) }

// Repulsion factor template
new Template2[Mention,Mention] with DotStatistics1[AffinityVector] {
  // Same idea as the Affinity factor template, but unrolled over
  // pairs of Mentions that are not coreferent
}
Citation Matching
via Imperatively-defined factor graphs
Combining the two tasks
Variables:
• No new variables
Factors:
• JntInfBased: Factor between Mention and Label based on joint factors (Poon and Domingos AAAI 2007)
• JointAffinity: Factor between Fields of the same type of coreferent Mentions
• JointRepulsion: Factor between Fields of the same type of non-coreferent Mentions
Citation Matching
via Imperatively-defined factor graphs
(Stages of IDF programming)

// 1. Defining the data representation
class Document extends VariableSequence[Token]
class Token(word:String) extends EnumVariable(word)
class Label(label:String, val token:Token)
  extends EnumVariable(label) with VarInSeq
class Mention(val string:String) extends PrimitiveVariable[Entity]
class Entity extends SetVariable[Mention] {
  var canonical:String = ""
  def add(m:Mention, d:DiffList) = {
    super.add(m,d); m.set(this,d)
    canonical = recomputeCanonical(members) }
  def remove(m:Mention, d:DiffList) = {
    super.remove(m,d); m.set(null,d)
    canonical = recomputeCanonical(members) } }

// 2. Defining the factors for scoring
new Template2[Mention,Entity] with DotStatistics1[Bool] {
  def unroll1(m:Mention) = Factor(m, m.entity)
  def unroll2(e:Entity) =
    for (mention <- e.mentions) yield Factor(mention, e)
  def statistics(m:Mention, e:Entity) =
    Bool(distance(m.string, e.canonical) < 0.5) }

// 3. Selecting the inference procedure
val sampler = new ProposalSampler[Mention] {
  def propose(m:Mention) = {
    // Move Mention m to a randomly-sampled Entity.
    entities.sample.add(m) }
}

// 4. Reading in the data, running inference
val documents = loadData()
sampler.process(documents.mentions, 1)

4 key imperative constructs in IDFs:
• Imperative variable value coordination
• Imperatively defined model structure
• Imperatively defined mapping from neighbor
variables to features
• Imperatively defined jump function
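The proposal (jump) function above relies on FACTORIE's sampler for the accept/reject decision; the Metropolis-Hastings loop underneath can be sketched outside FACTORIE as follows (the integer state and score function are stand-ins for a possible world and its model score):

```python
import math
import random

# Unnormalized "world score" over an integer state (stand-in for a model).
def score(s):
    return math.exp(-abs(s) / 2.0)

def mh_step(s, rng):
    proposed = s + 1 if rng.random() < 0.5 else s - 1  # symmetric jump
    # Accept with probability min(1, score(proposed) / score(current)):
    # only the ratio matters, so the normalizer Z cancels.
    return proposed if rng.random() < score(proposed) / score(s) else s

def run_chain(steps, seed=0):
    rng = random.Random(seed)
    chain, s = [], 0
    for _ in range(steps):
        s = mh_step(s, rng)
        chain.append(s)
    return chain
```

Because only the score ratio of the proposed and current states is needed, a proposal touching few variables can be scored by unrolling only the factors that changed.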
Citation Matching
via Imperatively-defined factor graphs
Experimental Setup
• Joint entity resolution and segmentation of research
paper citations
– Cora citation dataset
– 1295 citations, 134 entities, 36487 tokens
– Evaluated using three-fold cross-validation
• Isolated model
– Each task is completely independent of the other
• Joint model
– Single model over both tasks
• Compare with Poon and Domingos' previous state-of-the-art isolated and joint MLNs
Citation Matching
via Imperatively-defined factor graphs
• 3-15 times faster
– 50-90 mins: Isolated and Joint MLN
– 3 mins: Isolated IDF
– 18 mins: Joint IDF
• 20-25% error reduction
– 13% error reduction between
isolated IDF and joint IDF on
coreference
– 25.2% error reduction between
joint MLN and joint IDF
Citation Matching
via Imperatively-defined factor graphs
Results: Adding Joint Factors (Bi-directionality)
• Isolated: base model containing only isolated segmentation and coreference factors
• Semi-Joint: model containing weakly joint factors -> only the JntInfBased factor
• Fully-Joint: model consisting of bi-directional, highly-coupled factors -> all joint factors
FACTORIE
Conclusion
• Object-oriented
• Scalable
• Imperative constructs in IDFs
• Combines declarative and procedural knowledge
• Applicable to a wide range of problems
• Extensible concept
• Error reduction through increased influence between
tasks
Outline
• Introduction
– Factor graph
– Inference
– IDF in FACTORIE
• Citation matching
– Approaches
– Implementation with IDF
– Experiments
• Scalable Probabilistic DB
– Named Entity Resolution
– View maintenance
– Experiments
Scalable Probabilistic DB
with Factor Graphs and MCMC
What about this query?
SELECT String, Probability
FROM Token
WHERE Label='PER'
• Is it doable? (Named Entity Recognition)
• Efficiently?
• YES!
We can obtain the marginal probability for
each tuple in this query.
Scalable Probabilistic DB
with Factor Graphs and MCMC
• How can we do it?
• Use:
– Factor graphs (IDFs) for representing possible
worlds, and
– MCMC for efficient inference
• FACTORIE provides both
Scalable Probabilistic DB
with Factor Graphs and MCMC
• We define a relational factor graph over the TOKEN
relation
– Tokens are words
– Label can be
• PER (person), ORG (organization), LOC (location), MISC (none of
the above), O (not a named entity)
Scalable Probabilistic DB
with Factor Graphs and MCMC
• We define a relational factor graph over the TOKEN
relation
– 4 factor templates:
• Emission: between observed strings and corresponding labels
• Transmission: between consecutive labels
• Bias: over labels
• Skip-edge: between labels whose strings are identical
Scalable Probabilistic DB
with Factor Graphs and MCMC
• The main problem is query evaluation:
– return the set of tuples in the answer of
a query Q over the PDB <W,π> with
their probabilities
• We say t ∈ Q iff there exists w ∈ W s.t. t ∈ Q(w);
then the probability is:
    P(t ∈ Q) = Σ_{w ∈ W : t ∈ Q(w)} π(w)
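This definition can be checked by enumeration on a toy possible-worlds database (the worlds, tuples, and probabilities below are invented for illustration):

```python
# Each possible world is a set of tuples; pi assigns it a probability.
WORLDS = {
    frozenset({"t1", "t2"}): 0.5,
    frozenset({"t1"}): 0.3,
    frozenset({"t2"}): 0.2,
}

def q(world):
    # A trivial query: select every tuple whose name starts with "t".
    return {t for t in world if t.startswith("t")}

# P(t in Q) = sum of pi(w) over the worlds w with t in Q(w).
def tuple_prob(t):
    return sum(p for w, p in WORLDS.items() if t in q(w))
```

Exact enumeration is only feasible for toy instances, which is exactly why the following slides turn to sampling.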
Scalable Probabilistic DB
with Factor Graphs and MCMC
• Alternatively, the probability can be expressed as an
expectation over possible worlds
• We use a sampling approach for estimating the
marginals, because it offers a
trade-off between time and fidelity
Scalable Probabilistic DB
with Factor Graphs and MCMC
• How can we sample?
• Algorithm 1
• Disadvantages:
– Dependent samples
– Expensive collection of
counts
• Can we do better?
• YES!
Scalable Probabilistic DB
with Factor Graphs and MCMC
• How can we sample?
• Algorithm 2
• Advantage:
– Thinning
• Open problem:
– Choosing k for thinning
• Can we do better?
• YES!
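The thinning bookkeeping amounts to dropping a burn-in prefix and then keeping every k-th remaining sample; a minimal sketch (the burn-in and k values are arbitrary):

```python
# Discard the first burn_in samples, then keep every k-th of the rest.
def thinned(samples, burn_in, k):
    return samples[burn_in::k]

# Estimated marginal P(X = value) from the thinned samples.
def marginal(samples, value, burn_in, k):
    kept = thinned(samples, burn_in, k)
    return kept.count(value) / len(kept)
```

Thinning reduces the dependence between counted samples at the cost of discarding most of the chain, which is why choosing k remains an open problem.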
Scalable Probabilistic DB
with Factor Graphs and MCMC
• How can we sample?
• Algorithm 3 (idea)
– View maintenance:
    Q(w') = Q(w) − Q'(w, Δ⁻) ∪ Q'(w, Δ⁺)
• Selection:
    σ(w') = σ(w) − σ(Δ⁻) ∪ σ(Δ⁺)
• Cartesian product:
    w'.R1 × w'.R2 = (w.R1 × w.R2) − (w.R1 × Δ⁻.R2) ∪ (w.R1 × Δ⁺.R2)
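The selection rule above can be exercised on a toy view (the tuples and the 'PER' predicate are illustrative):

```python
# Incremental maintenance of a selection view under a single MCMC change:
# sigma(w') = (sigma(w) - sigma(delta_minus)) | sigma(delta_plus).
# Tuples are (token_id, label) pairs; the view keeps label == "PER".
def sigma(tuples):
    return {t for t in tuples if t[1] == "PER"}

def maintain(view, delta_minus, delta_plus):
    return (view - sigma(delta_minus)) | sigma(delta_plus)
```

Only the changed tuples are re-examined, instead of re-running the selection over the whole world after every MCMC step.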
Scalable Probabilistic DB
with Factor Graphs and MCMC
• How can we sample?
• Algorithm 3
• Advantages:
– View maintenance
– Thinning
• Open problem:
– Choosing k for thinning
Scalable Probabilistic DB
with Factor Graphs and MCMC
• Named Entity Recognition
• We use for experiments:
– NYT DB from the year 2004
– 10 million tokens from 1788 articles
• We store NYT tokens in our DB relation
TOKEN(TOK_ID, DOC_ID, STRING, LABEL, TRUTH)
Scalable Probabilistic DB
with Factor Graphs and MCMC
• Experiments (NER with skip-chain CRFs)
• Training based on MH
• Materialized MCMC sampler (Algorithm 3) vs.
basic MCMC sampler (Algorithm 1)
• Query 1:
SELECT String
FROM Token
WHERE Label='PER'
• Squared-error loss (element-wise squared loss)
Scalable Probabilistic DB
with Factor Graphs and MCMC
• Experiment on parallelization:
– Benefit of generating samples with higher
independence
– 8 identical copies (worlds)
– Averaging 8 parallel chains
– Evaluating on Query 1:
SELECT String
FROM Token
WHERE Label='PER'
– Squared error
Scalable Probabilistic DB
with Factor Graphs and MCMC
• Experiment on aggregation:
– Query 2:
SELECT COUNT(*)
FROM TOKEN
WHERE LABEL='PER'
– Query 3:
SELECT T.doc_id
FROM Token T
WHERE (SELECT COUNT(*)
       FROM Token T1
       WHERE T1.label='PER' AND T.doc_id=T1.doc_id)
     =(SELECT COUNT(*)
       FROM Token T1
       WHERE T1.label='ORG' AND T.doc_id=T1.doc_id)
Scalable Probabilistic DB
with Factor Graphs and MCMC
Conclusion
• Framework for probabilistic databases
• Factor graphs for modeling the distribution
over possible worlds
• Efficient evaluation of arbitrary relational
queries (MCMC)
Future work
• Automatic construction of proposal
functions
Thank you!
http://factorie.googlecode.com/
Additional slides
• Smokers example
– MRF -> Factor graph
– Implementation with the 4 stages of IDF programming
• Citation matching
– factor template example, unrolling
• Markov Chain Monte Carlo
• Factor graph unrolling
Undirected Graphical Models
• Markov Random Fields
• Factor graphs
Friends & Smokers
MRF
p ∈ {Amy, Bob}
∀p: Smokes(p) ⇒ Cancer(p) [5]
∀p1,p2: Friends(p1,p2) ⇒ (Smokes(p1) ⇔ Smokes(p2)) [2]
Friends & Smokers
Factor Graph
p ∈ {Amy, Bob}
∀p: Smokes(p) ⇒ Cancer(p) [5]
∀p1,p2: Friends(p1,p2) ⇒ (Smokes(p1) ⇔ Smokes(p2)) [2]
Friends & Smokers
via Imperatively-defined factor graphs
Four stages of IDF programming
• Defining the data representation
• Defining the factors for scoring
• Select the inference procedure
• Reading in the data, run inference and
learning
Friends & Smokers
via Imperatively-defined factor graphs
// Define entity, attribute and relation types
object PersonDomain extends CategoricalDomain[Person]
class Person(val name:String) extends ItemizedObservation[Person] with Entity[Person] {
  def domain = PersonDomain
  type GetterType = PersonGetter
  class GetterClass extends PersonGetter // Person#GetterClass#super
  val smokes = new Smokes; class Smokes extends BooleanVariable with Attribute
  val cancer = new Cancer; class Cancer extends BooleanVariable with Attribute
  override def toString = name }
object friend extends Relation[Person,Person]
// Define boilerplate, to support access to attributes in the entity-relationship syntax
class PersonGetter extends EntityGetter[Person] {
  def smokes = getAttribute(_.smokes)
  def cancer = getAttribute(_.cancer)
  // the people whom this Person considers friends
  def friends = getRelationDst[friend.type,Person,Person](friend)
  // the people who consider this Person a friend
  def friendly = getRelationSrc[friend.type,Person,Person](friend) }
Friends & Smokers
via Imperatively-defined factor graphs
Four stages of IDF programming
• Defining the data representation
• Defining the factors for scoring
• Select the inference procedure
• Reading in the data, run inference and
learning
Friends & Smokers
via Imperatively-defined factor graphs
∀p: Smokes(p) ⇒ Cancer(p) [5]
∀p1,p2: Friends(p1,p2) ⇒ (Smokes(p1) ⇔ Smokes(p2)) [2]
// Define Model
val model = new Model (
  Forany[Person] {
    p => p.smokes ==> p.cancer
  } * 5.0 % "SmokingCausesCancer",
  Forany[Person] {
    p => p.friends.smokes ==> p.smokes
  } * 2.0 % "FriendsSmoke"
)
Friends & Smokers
via Imperatively-defined factor graphs
Four stages of IDF programming
• Defining the data representation
• Defining the factors for scoring
• Select the inference procedure
• Reading in the data, run inference and
learning
// Do 2000 iterations of sampling, gathering sample counts every 20 iterations
val inferencer =
new VariableSamplingInferencer(new VariableSettingsSampler[BooleanVariable](model))
inferencer.burnIn = 100; inferencer.iterations = 2000; inferencer.thinning = 20
Friends & Smokers
via Imperatively-defined factor graphs
Four stages of IDF programming
• Defining the data representation
• Defining the factors for scoring
• Select the inference procedure
• Reading in the data, run inference and
learning
// Create the data
val amy = new Person("Amy")
val bob = new Person("Bob"); friend(amy,bob); friend(bob,amy)
// Do inference
val marginals = inferencer.infer(List(amy.cancer))
// Test
println("p(amy.cancer == true) = " + marginals(amy.cancer).pr(1))
// p(amy.cancer == true) = 0.55
Citation Matching
via Imperatively-defined factor graphs
Entity Resolution
Template example
Coreference template measuring the compatibility
between a Mention and the canonical representation
of its assigned Entity.
Example: move Mention3 into Entity1.
unroll1 returns a factor between the Mention and its newly
assigned Entity.
unroll2 returns a list of factors between the Entity and all
its member Mentions.
new Template2[Mention,Entity] with DotStatistics1[Bool] {
  def unroll1(m:Mention) = Factor(m, m.entity)
  def unroll2(e:Entity) =
    for (mention <- e.mentions) yield Factor(mention, e)
  def statistics(m:Mention, e:Entity) =
    Bool(distance(m.string, e.canonical) < 0.5) }
Markov Chain Monte Carlo
• only represents a single possible world at a time
• performs inference by stochastically proposing
– some change to the current possible world
• accepts the change with a probability that
– depends on the ratio of post- and pre-proposal scores
• efficient, because
– normalization constants cancel,
– factors over unchanged variables cancel,
– in the implementation they are not even created
• avoids unrolling and scoring the entire world
MCMC proposal (jump) function
for citation matching
Can make changes to
• entity resolution proposal
– selects a random Mention, then
– moves it to another randomly selected Entity with probability 0.8,
or
– makes it a singleton in a new Entity with probability 0.2
• segmentation proposal
– selects a random Field, then
– changes the range of the Field randomly, then
– adjusts corresponding Labels automatically via imperative variable
value coordination
Unrolling
• Scoring a proposal
(Scores of factors that didn't change cancel)
• To score efficiently:
– the proposal method runs
– build a list of variables that changed
– find factors that touch the changed variables
– find other unchanged variables needed to calculate those
factors' scores