Slide Material for DHS Reverse Site Visit


Information Extraction
Data Mining and Topic Discovery
with Probabilistic Models
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada,
Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.
Information Extraction
with Conditional Random Fields
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada,
Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.
Goal:
Mine actionable knowledge
from unstructured text.
An HR office
Jobs, but not HR jobs
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
A Portal for Job Openings
Data Mining the Extracted Job Information
IE from Research Papers
[McCallum et al ‘99]
IE from Research Papers
Mining Research Papers
[Rosen-Zvi, Griffiths, Steyvers,
Smyth, 2004]
IE from
Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents
several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
Why prefer “knowledge base search” over “page search”?

• Targeted, restricted universe of hits
  – Don’t show resumes when I’m looking for job openings.
• Specialized queries
  – Topic-specific
  – Multi-dimensional
  – Based on information spread on multiple pages.
• Get correct granularity
  – Site, page, paragraph
• Specialized display
  – Super-targeted hit summarization in terms of DB slot values
• Ability to support sophisticated data mining
Information Extraction needed to
automatically build the Knowledge Base.
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
NAME
TITLE
ORGANIZATION
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
IE
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Larger Context
Spider
Filter
Data
Mining
IE
Segment
Classify
Associate
Cluster
Discover patterns
- entity types
- links / relations
- events
Database
Document
collection
Actionable
knowledge
Prediction
Outlier detection
Decision support
Landscape of IE Tasks (1/4):
Pattern Feature Domain
Text paragraphs
without formatting
Grammatical sentences
and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets,
rich formatting & links
Tables
Landscape of IE Tasks (2/4):
Pattern Scope
Web site specific
Formatting
Amazon.com Book Pages
Genre specific
Layout
Resumes
Wide, non-specific
Language
University Names
Landscape of IE Tasks (3/4):
Pattern Complexity
E.g. word patterns:
Closed set
Regular set
U.S. states
U.S. phone numbers
He was born in Alabama…
Phone: (413) 545-1323
The big Wyoming sky…
The CALD main office can be
reached at 412-268-1299
Complex pattern
U.S. postal addresses
University of Arkansas
P.O. Box 140
Hope, AR 71802
Headquarters:
1128 Main Street, 4th Floor
Cincinnati, Ohio 45210
Ambiguous patterns,
needing context and
many sources of evidence
Person names
…was among the six houses
sold by Hope Feldman that year.
Pawel Opalinski, Software
Engineer at WhizBang Labs.
Landscape of IE Tasks (4/4):
Pattern Combinations
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
Single entity
Binary relationship
Person: Jack Welch
Relation: Person-Title
Person: Jack Welch
Title:
CEO
Person: Jeffrey Immelt
Location: Connecticut
“Named entity” extraction
Relation: Company-Location
Company: General Electric
Location: Connecticut
N-ary record
Relation:
Company:
Title:
Out:
In:
Succession
General Electric
CEO
Jack Welch
Jeffrey Immelt
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
Precision = (# correctly predicted segments) / (# predicted segments) = 2/6

Recall = (# correctly predicted segments) / (# true segments) = 2/4

F1 = harmonic mean of Precision & Recall = 1 / ( ((1/P) + (1/R)) / 2 )
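A quick check of these formulas on the example above (2 correct segments out of 6 predicted, 4 true); this tiny helper is illustrative, not from the slides:

```python
def f1_score(num_correct, num_predicted, num_true):
    precision = num_correct / num_predicted   # 2/6
    recall = num_correct / num_true           # 2/4
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1_score(2, 6, 4))  # 0.4
```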
State of the Art Performance
• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• Wrapper induction
– Extremely accurate performance obtainable
– Human effort (~30min) required on each site
Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning
Hidden Markov Models
HMMs are the standard sequence modeling tool in
genomics, music, speech, NLP, …
Graphical model / finite state model:

[figure: states S_t-1, S_t, S_t+1, ... with transitions, emitting observations O_t-1, O_t, O_t+1, ...]

Generates a state sequence and an observation sequence o = o_1, o_2, ... o_|o|, with

P(s, o) = Π_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Parameters, for all states S = {s_1, s_2, …}:
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: maximize probability of training observations (w/ prior).
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM:
person name
location name
background

Find the most likely state sequence (Viterbi):  s* = argmax_s P(s, o)
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name”
state extract as a person name:
Person name: Pedro Domingos
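A minimal Viterbi sketch for this decoding step; the state set and the tabular parameters are stand-ins for the trained HMM on the slide, not its actual values:

```python
import numpy as np

states = ["background", "person name", "location name"]  # assumed state set

def viterbi(obs, start_p, trans_p, emit_p):
    """Return argmax_s P(s, o) for an HMM with tabular parameters.
    start_p[s] and trans_p[s_prev, s] are probabilities (numpy arrays);
    emit_p[s] maps word -> probability (unseen words get a small floor)."""
    n, S = len(obs), len(states)
    delta = np.zeros((n, S))            # best log-prob ending in state s at time t
    back = np.zeros((n, S), dtype=int)  # backpointers
    for s in range(S):
        delta[0, s] = np.log(start_p[s]) + np.log(emit_p[s].get(obs[0], 1e-6))
    for t in range(1, n):
        for s in range(S):
            scores = delta[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + np.log(emit_p[s].get(obs[t], 1e-6))
    path = [int(np.argmax(delta[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]
```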
HMM Example: “Nymble”
[Bikel, et al 1998],
[BBN “IdentiFinder”]
Task: Named Entity Extraction
States: start-of-sentence, end-of-sentence, Person, Org, Other (plus five other name classes).

Train on ~500k words of news wire text.

Transition probabilities: P(s_t | s_t-1, o_t-1), backing off to P(s_t | s_t-1), then P(s_t).
Observation probabilities: P(o_t | s_t, s_t-1) or P(o_t | s_t, o_t-1), backing off to P(o_t | s_t), then P(o_t).

Results (F1):
Language   Case    F1
English    Mixed   93%
English    Upper   91%
Spanish    Mixed   90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
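A sketch of this kind of back-off (“shrinkage”) for the emission distribution; the fixed interpolation weights are an assumption for illustration (Nymble derives its weights from data counts):

```python
def backoff_emission(word, state, prev_word, p_full, p_state, p_word,
                     lambdas=(0.6, 0.3, 0.1)):
    """P(o_t | s_t, o_t-1), backed off to P(o_t | s_t), then P(o_t).
    p_full, p_state, p_word are dicts of estimated probabilities."""
    l1, l2, l3 = lambdas
    return (l1 * p_full.get((word, state, prev_word), 0.0)
            + l2 * p_state.get((word, state), 0.0)
            + l3 * p_word.get(word, 0.0))
```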
We want More than an Atomic View of Words
Would like richer representation of text:
many arbitrary, overlapping features of the words.
Example features of the word “Wisniewski” and its neighbors:
- identity of word
- is “Wisniewski”
- ends in “-ski”
- is capitalized
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are “and Associates”
Problems with Richer Representation
and a Generative Model
These arbitrary features are not independent.
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Two choices:
Model the dependencies.
Each state would have its own
Bayes Net. But we are already
starved for training data!
Ignore the dependencies.
This causes “over-counting” of
evidence (ala naïve Bayes).
Big problem when combining
evidence, as in Viterbi!
Conditional Sequence Models
• We prefer a model that is trained to maximize a
conditional probability rather than joint probability:
P(s|o) instead of P(s,o):
– Can examine features, but not responsible for generating them.
– Don’t have to explicitly model their dependencies.
– Don’t “waste modeling effort” trying to generate what we are
given at test time anyway.
Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning
From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

s = s_1, s_2, ... s_n    o = o_1, o_2, ... o_n

Joint:

P(s, o) = Π_{t=1}^{|o|} P(s_t | s_t-1) P(o_t | s_t)

Conditional:

P(s | o) = (1 / P(o)) Π_{t=1}^{|o|} P(s_t | s_t-1) P(o_t | s_t)
         = (1 / Z(o)) Π_{t=1}^{|o|} Φ_s(s_t, s_t-1) Φ_o(o_t, s_t)

where Φ_o(t) = exp( Σ_k λ_k f_k(s_t, o_t) )

(A super-special case of Conditional Random Fields.)

Set parameters λ by maximum likelihood, using an optimization method on ∂L/∂λ.
Linear Chain Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
[linear chain: states S_t, S_t+1, S_t+2, S_t+3, S_t+4 jointly conditioned on the whole observation sequence O = O_t, O_t+1, O_t+2, O_t+3, O_t+4]

Markov on s, conditional dependency on o.

P(s | o) = (1 / Z_o) Π_{t=1}^{|o|} exp( Σ_j λ_j f_j(s_t, s_t-1, o, t) )
The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.

Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
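A sketch of that dynamic program: computing the partition function Z(o) by the forward recursion in O(|o| |S|^2), given precomputed log-potentials (a stand-in for Σ_j λ_j f_j, not any particular trained model):

```python
import numpy as np

def log_partition(log_phi):
    """log Z(o) for a linear-chain CRF.
    log_phi[t, s_prev, s] = sum_j lambda_j * f_j(s_t=s, s_t-1=s_prev, o, t),
    with a single dummy start state assumed at index 0 for position 0."""
    T, S, _ = log_phi.shape
    alpha = log_phi[0, 0, :]  # forward log-scores after the first position
    for t in range(1, T):
        # logsumexp over the previous state
        alpha = np.logaddexp.reduce(alpha[:, None] + log_phi[t], axis=0)
    return np.logaddexp.reduce(alpha)
```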
CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency
• Features may be arbitrary functions of any or all
observations
• Parameters need not fully specify generation of
observations; require less training data
• Easy to incorporate domain knowledge
• State means only “state of process”, vs
“state of process” and “observational history I’m keeping”
Training CRFs

Maximize log-likelihood of parameters given training data:  L({λ_k} | {<o, s>}^(i))

Log-likelihood gradient:

∂L/∂λ_k = Σ_i C_k(s^(i), o^(i))  −  Σ_i Σ_s P_λ(s | o^(i)) C_k(s, o^(i))  −  λ_k / σ^2

where C_k(s, o) = Σ_t f_k(o, t, s_t-1, s_t)

That is: (feature count using correct labels) − (expected feature count using predicted labels) − (smoothing penalty).
Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95

         :  Number of    :       Production of Milk and Milkfat 2/
  Year   :  Milk Cows 1/ :  Per Milk Cow     : Percentage of Fat :      Total
         :               :  Milk   : Milkfat : in All Milk       : Milk    : Milkfat
         :  (1,000 Head) :  --- Pounds ---   : (Percent)         : (Million Pounds)
  1993   :  9,589        :  15,704 :  575    :  3.66             : 150,582 : 5,514.4
  1994   :  9,500        :  16,175 :  592    :  3.66             : 153,664 : 5,623.7
  1995   :  9,461        :  16,451 :  602    :  3.66             : 155,644 : 5,694.3

1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Table Extraction from Government Reports
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
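A sketch of line-level features like these; the names and the crude alignment test are illustrative paraphrases of the list above, not the paper’s exact feature set:

```python
import re

def line_features(line, prev_line=""):
    n = max(len(line), 1)
    feats = {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "five_plus_spaces": "     " in line,
    }
    # crude check: do runs of whitespace start at the same columns as in
    # the previous line?
    cols = {m.start() for m in re.finditer(r"\s{2,}", line)}
    prev_cols = {m.start() for m in re.finditer(r"\s{2,}", prev_line)}
    feats["aligns_with_prev"] = bool(cols & prev_cols)
    return feats
```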
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
                          HMM    Stateless MaxEnt   CRF
Line labels, % correct    65%    85%                95%
Table segments, F1        64%    —                  92%
IE from Research Papers
[McCallum et al ‘99]
IE from Research Papers
Field-level F1:

Hidden Markov Models (HMMs)         75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)      89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)    93.9   [Peng, McCallum, 2004]

(Δ error ~40% from SVMs to CRFs)
Chinese Word Segmentation
[McCallum & Feng 2003]
~100k words data, Penn Chinese Treebank
Lexicon features: adjective ending character, adverb ending character, building words, Chinese number characters, Chinese period, cities and regions, countries, dates, department characters, digit characters, foreign name chars, function words, job title, locations, money, negative characters, organization indicator, preposition characters, provinces, punctuation chars, Roman alphabetics, Roman digits, stopwords, surnames, symbol characters, verb chars, wordlist (188k lexicon).
Chinese Word Segmentation Results
[McCallum & Feng 2003]
Precision and recall of segments with perfect boundaries:
Method     # training sentences   prec.   recall   F1
[Peng]     ~5M                    75.1    74.0     74.2
[Ponte]    ?                      84.4    87.8     86.0
[Teahan]   ~40k                   ?       ?        94.4
[Xue]      ~10k                   95.2    95.1     95.2   (prev. world’s best)
CRF        2805                   97.3    97.8     97.5   (Δ error 50%)
CRF        140                    95.4    96.0     95.7
CRF        56                     93.9    95.0     94.4
Named Entity Recognition
CRICKET MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side
Boland said on Thursday they
had signed Leicestershire fast
bowler David Millns on a one
year contract.
Millns, who toured Australia with
England A in 1992, replaces
former England all-rounder
Phillip DeFreitas as Boland's
overseas professional.
Labels and examples:
PER:   Yayuk Basuki, Innocent Butare
ORG:   3M, KDP, Cleveland
LOC:   Cleveland, Nirmal Hriday, The Oval
MISC:  Java, Basque, 1,000 Lakes Rally
Automatically Induced Features
[McCallum & Li, 2003, CoNLL]
Index   Feature
0       inside-noun-phrase (o_t-1)
5       stopword (o_t)
20      capitalized (o_t+1)
75      word=the (o_t)
100     in-person-lexicon (o_t-1)
200     word=in (o_t+2)
500     word=Republic (o_t+1)
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (first-mention t+2)
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_t-1)
4945    moderately-rare-first-name (o_t-1) & very-common-last-name (o_t)
4474    word=the (o_t-2) & word=of (o_t)
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]
Method                                                  F1
HMMs (BBN’s Identifinder)                               73%
CRFs w/out Feature Induction                            83%
CRFs with Feature Induction based on Likelihood Gain    90%
Related Work
• CRFs are widely used for information extraction
...including more complex structures, like trees:
– [Zhu, Nie, Zhang, Wen, ICML 2007] Dynamic
Hierarchical Markov Random Fields and their
Application to Web Data Extraction
– [Viola & Narasimhan]: Learning to Extract Information
from Semi-structured Text using a Discriminative
Context Free Grammar
– [Jousse et al 2006]: Conditional Random Fields for
XML Trees
Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning
String Edit Distance
• Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations
that transform string x into y.
String Edit Distance
• Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations
that transform string x into y.
• Applications
– Database Record Deduplication
Apex International Hotel
Grassmarket Street
Apex Internat’l
Grasmarket Street
Records are duplicates of the same hotel?
String Edit Distance
• Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations
that transform string x into y.
• Applications
– Database Record Deduplication
– Biological Sequences
AGCTCTTACGATAGAGGACTCCAGA
AGGTCTTACCAAAGAGGACTTCAGA
String Edit Distance
• Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations
that transform string x into y.
• Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation
Il a achete une pomme
He bought an apple
String Edit Distance
• Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations
that transform string x into y.
• Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation
– Textual Entailment
He bought a new car last night
He purchased a brand new automobile yesterday evening
Levenshtein Distance [1966]

Edit operations for aligning two strings:
copy     Copy a character from x to y           (cost 0)
insert   Insert a character into y              (cost 1)
delete   Delete a character from y              (cost 1)
subst    Substitute one character for another   (cost 1)

Align two strings:
x1 = William W. Cohon
x2 = Willleam Cohen

Lowest cost alignment:
x1:  W i l l i a m _ W . _ C o h o n
x2:  W i l l l e a m _ C o h e n
per-operation costs: 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0
(11 copies at cost 0; 2 substitutions, 3 deletes, and 1 insert at cost 1 each)

Total cost = 6 = Levenshtein Distance
Levenshtein Distance

Edit operations:
copy     Copy a character from x to y           (cost 0)
insert   Insert a character into y              (cost 1)
delete   Delete a character from y              (cost 1)
subst    Substitute one character for another   (cost 1)

Dynamic program: D(i,j) = score of best alignment from x_1...x_i to y_1...y_j

D(i,j) = min of:
  D(i-1,j-1) + d(x_i, y_j)   (copy if x_i = y_j, else subst)
  D(i-1,j)   + 1             (delete)
  D(i,j-1)   + 1             (insert)

DP table for x = “Willleam” against y = “William” (the bottom-right cell is the total cost = distance):

        W  i  l  l  i  a  m
     0  1  2  3  4  5  6  7
  W  1  0  1  2  3  4  5  6
  i  2  1  0  1  2  3  4  5
  l  3  2  1  0  1  2  3  4
  l  4  3  2  1  0  1  2  3
  l  5  4  3  2  1  1  2  3
  e  6  5  4  3  2  2  2  3
  a  7  6  5  4  3  3  2  3
  m  8  7  6  5  4  4  4  2
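The recurrence above, directly as code; a minimal sketch:

```python
def levenshtein(x, y):
    """D(|x|, |y|) via the dynamic program above."""
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        D[i][0] = i                                   # i deletions
    for j in range(len(y) + 1):
        D[0][j] = j                                   # j insertions
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = min(D[i-1][j-1] + (x[i-1] != y[j-1]),  # copy / subst
                          D[i-1][j] + 1,                     # delete
                          D[i][j-1] + 1)                     # insert
    return D[-1][-1]

print(levenshtein("Willleam", "William"))  # 2
```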
Levenshtein Distance with Markov Dependencies

Same four edit operations (copy, insert, delete, subst), but now each operation’s cost depends on the previous operation: a 4x4 table of costs indexed by (previous operation, current operation), so that e.g. a repeated delete is cheaper than a first delete. Learn these costs from training data.

The dynamic programming table becomes 3D: the same alignment lattice as before, with one layer per edit operation.
Ristad & Yianilos (1997)

Essentially a Pair-HMM, generating an edit/state/alignment-sequence a and two strings x1, x2.

[figure: alignment lattice between x1 = “William W. Cohon” (positions a.i1) and x2 = “Willleam Cohen” (positions a.i2), with an edit operation a.e (copy, subst, delete, insert) at each step]

Complete data likelihood:

p(a, x1, x2) = Π_t p(a_t | a_t-1) p(x1[a_t.i1], x2[a_t.i2] | a_t)

Match score = incomplete data likelihood (sum over all alignments consistent with x1 and x2):

p(x1, x2) = Σ_{a : x1,x2} Π_t p(a_t | a_t-1) p(x1[a_t.i1], x2[a_t.i2] | a_t)

Given a training set of matching string pairs, the objective fn is

O = Π_j p(x1^(j), x2^(j))

Learn via EM:
Expectation step: calculate likelihood of alignment paths.
Maximization step: make those paths more likely.
Ristad & Yianilos Regrets

• Limited features of input strings
  – Examine only a single character pair at a time
  – Difficult to use upcoming string context, lexicons, ...
  – Example: “Senator John Green” vs “John Green”
• Limited edit operations
  – Difficult to generate arbitrary jumps in both strings
  – Example: “UMass” vs “University of Massachusetts”.
• Trained only on positive match data
  – Doesn’t include information-rich “near misses”
  – Example: “ACM SIGIR” ≠ “ACM SIGCHI”

So, consider a model trained by conditional probability.
Conditional Probability (Sequence) Models
• We prefer a model that is trained to maximize a
conditional probability rather than joint probability:
P(y|x) instead of P(y,x):
– Can examine features, but not responsible for
generating them.
– Don’t have to explicitly model their dependencies.
CRF String Edit Distance

[figure: alignment lattice between x1 = “William W. Cohon” and x2 = “Willlleam Cohen”, with copy/subst/delete/insert operations along the alignment a = (a.i1, a.e, a.i2)]

Joint complete data likelihood:

p(a, x1, x2) = Π_t p(a_t | a_t-1) p(x1[a_t.i1], x2[a_t.i2] | a_t)

Conditional complete data likelihood:

p(a | x1, x2) = (1 / Z_{x1,x2}) Π_t Φ(a_t, a_t-1, x1, x2)

Want to train from a set of string pairs, each labeled one of {match, non-match}:

match       “William W. Cohon”   /  “Willlleam Cohen”
non-match   “Bruce D’Ambrosio”   /  “Bruce Croft”
match       “Tommi Jaakkola”     /  “Tommi Jakola”
match       “Stuart Russell”     /  “Stuart Russel”
non-match   “Tom Deitterich”     /  “Tom Dean”
CRF String Edit Distance FSM

[FSM with four edit-operation states: subst, copy, delete, insert]
CRF String Edit Distance FSM

Two copies of the edit-operation states (subst, copy, delete, insert), reached from Start: one sub-model for match (m=1), one for non-match (m=0).

Conditional incomplete data likelihood:

p(m | x1, x2) = (1 / Z_{x1,x2}) Σ_{a ∈ S_m} Π_t Φ(a_t, a_t-1, x1, x2)
CRF String Edit Distance FSM

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

Probability summed over all alignments in match states: 0.8
Probability summed over all alignments in non-match states: 0.2
CRF String Edit Distance FSM

x1 = “Tom Dietterich”
x2 = “Tom Dean”

Probability summed over all alignments in match states: 0.1
Probability summed over all alignments in non-match states: 0.9
Parameter Estimation

Given a training set of string pairs and match/non-match labels, the objective fn is the incomplete log likelihood:

O = Σ_j log p(m^(j) | x1^(j), x2^(j))

The complete log likelihood:

Σ_j Σ_a log p(m^(j) | a, x1^(j), x2^(j)) p(a | x1^(j), x2^(j))

Expectation Maximization:
• E-step: estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using current parameters.
• M-step: change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS).

This is “conditional EM”, but avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.
Efficient Training
• Dynamic programming table is 3D;
|x1| = |x2| = 100, |S| = 12, .... 120,000 entries
• Use beam search during E-step
[Pal, Sutton, McCallum 2005]
• Unlike completely observed CRFs, objective
function is not convex.
• Initialize parameters not at zero, but so as to
yield a reasonable initial edit distance.
What Alignments are Learned?

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

[figure: learned alignment lattice; the path runs through the match sub-model, mostly copy operations, with deletes absorbing the extra “a” and “k”]
What Alignments are Learned?

x1 = “Bruce Croft”
x2 = “Tom Dean”

[figure: learned alignment lattice; the path runs through the non-match sub-model]
What Alignments are Learned?

x1 = “Jaime Carbonell”
x2 = “Jamie Callan”

[figure: learned alignment lattice between the two strings]
Summary of Advantages
• Arbitrary features of the input strings
– Examine past, future context
– Use lexicons, WordNet
• Extremely flexible edit operations
– Single operation may make arbitrary jumps in both
strings, of size determined by input features
• Discriminative Training
– Maximize ability to predict match vs non-match
• Easy to Label Data
– Match/Non-Match
no need for labeled alignments
Experimental Results:
Data Sets
• Restaurant name, Restaurant address
– 864 records, 112 matches
– E.g. “Abe’s Bar & Grill, E. Main St”
“Abe’s Grill, East Main Street”
• People names, UIS DB generator
– synthetic noise
– E.g. “John Smith”
vs “Snith, John”
• CiteSeer Citations
– In four sections: Reason, Face, Reinforce, Constraint
– E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...”
“Russell & Norvig, “Artificial Intelligence: An Intro...”
Experimental Results:
Features
• same, different
• same-alphabetic, different-alphabetic
• same-numeric, different-numeric
• punctuation1, punctuation2
• alphabet-mismatch, numeric-mismatch
• end-of-1, end-of-2
• same-next-character, different-next-character
Experimental Results:
Edit Operations
• insert, delete, substitute/copy
• swap-two-characters
• skip-word-if-in-lexicon
• skip-parenthesized-words
• skip-any-word
• substitute-word-pairs-in-translation-lexicon
• skip-word-if-present-in-other-string
Experimental Results
[Bilenko & Mooney 2003]
F1 (average of precision and recall):

Distance          Restaurant   Restaurant   CiteSeer
metric            name         address      Reason   Face    Reinf   Constraint
Levenshtein       0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.    0.354        0.712        0.938    0.966   0.907   0.941
Vector            0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector    0.433        0.532        0.924    0.875   0.808   0.913
Experimental Results
[Bilenko & Mooney 2003]
F1 (average of precision and recall):

Distance           Restaurant   Restaurant   CiteSeer
metric             name         address      Reason   Face    Reinf   Constraint
Levenshtein        0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.     0.354        0.712        0.938    0.966   0.907   0.941
Vector             0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector     0.433        0.532        0.924    0.875   0.808   0.913
CRF Edit Distance  0.448        0.783        0.964    0.918   0.917   0.976
Experimental Results
Data set: person names, with word-order noise added
F1 without skip-word-if-present-in-other-string: 0.856
F1 with skip-word-if-present-in-other-string: 0.981
Related Work
• Learned Edit Distance
– [Bilenko & Mooney 2003], [Cohen et al 2003],...
– [Joachims 2003]: Max-margin, trained on alignments
• Conditionally-trained models with latent variables
– [Jebara 1999]: “Conditional Expectation Maximization”
– [Quattoni, Collins, Darrell 2005]: CRF for visual object
recognition, with latent classes for object sub-patches
– [Zettlemoyer & Collins 2005]: CRF for mapping
sentences to logical form, with latent parses.
Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning
Semi-Supervised Learning
How to train with limited labeled data?
Augment with lots of unlabeled data
“Expectation Regularization”
[Mann, McCallum, ICML 2007]
Supervised Learning
Decision
boundary
Creation of labeled instances requires extensive human effort
What if limited labeled data?
Small amount of labeled data
Semi-Supervised Learning:
Labeled & Unlabeled data
Small amount of labeled data
Large amount of unlabeled data
Augment limited labeled data by using unlabeled data
More Semi-Supervised Algorithms
than Applications
[bar chart: number of papers per year, 1998-2006, for semi-supervised learning algorithms vs. applications; algorithms papers far outnumber applications]

Compiled from [Zhu, 2007]
Weakness of Many
Semi-Supervised Algorithms
Difficult to Implement
Significantly more complicated than supervised
counterparts
Fragile
Meta-parameters hard to tune
Lacking in Scalability
O(n2) or O(n3) on unlabeled data
“EM will generally degrade [tagging]
accuracy, except when only a limited
amount of hand-tagged text is
available.”
[Merialdo, 1994]
“When the percentage of
labeled data increases from
50% to 75%, the performance
of [Label Propagation with
Jensen-Shannon divergence]
and SVM become almost
same, while [Label propagation
with cosine distance] performs
significantly worse than SVM.”
[Niu,Ji,Tan, 2005]
Families of
Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
Family 1 : Expectation Maximization
[Dempster, Laird, Rubin, 1977]
Fragile -- often worse than supervised
Family 2: Graph-Based Methods
[Szummer, Jaakkola, 2002]
[Zhu, Ghahramani, 2002]
Lacking in scalability, Sensitive to choice of metric
Family 3: Auxiliary-Task Methods
[Ando and Zhang, 2005]
Complicated to find appropriate auxiliary tasks
Family 4: Decision Boundary in Sparse
Region
Family 4: Decision Boundary in Sparse
Region
Transductive SVMs [Joachims, 1999]: Sparsity measured by margin
Entropy Regularization [Grandvalet and Bengio, 2005] …by label entropy
Minimal Entropy Solution!

How do we know the minimal entropy solution is wrong?

We suspect at least some of the data is in the second class!

In fact we often have prior knowledge of the relative class proportions.

[class-proportion bar chart: 0.8 Student, 0.2 Professor]
How do we know the minimal entropy solution is wrong?

We suspect at least some of the data is in the second class!

In fact we often have prior knowledge of the relative class proportions.

[class-proportion bar chart: 0.1 Gene Mention, 0.9 Background]
How do we know the minimal entropy solution is wrong?

We suspect at least some of the data is in the second class!

In fact we often have prior knowledge of the relative class proportions.

[class-proportion bar chart: 0.6 Person, 0.4 Organization]
Families of
Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
5. Expectation Regularization
Family 5:
Expectation Regularization
Favor decision boundaries in low-density regions that also match the prior class proportions.

[class-proportion bar chart]
Family 5:
Expectation Regularization
Label Regularization is the special case of Expectation Regularization in which the constrained expectation is the class prior p(y); the general case constrains conditional expectations p(y|feature).
Expectation Regularization
Simple:
Easy to implement
Robust:
Meta-parameters need little or no tuning
Scalable:
Linear in number of unlabeled examples
Discriminative Models
• Predict class boundaries directly
• Do not directly estimate class densities
• Make few assumptions (e.g. independence)
on features
• Are trained by optimizing conditional log-likelihood
Expectation Regularization (XR)

For logistic regression, the constraints are expectations of the class labels.

XR objective = Log-likelihood (on labeled data) − KL-divergence between a prior distribution and the model’s expected distribution over the unlabeled data.

Prior distribution: provided from supervised training or estimated on the labeled data.
After Training, Model Matches
Prior Distribution
Supervised only
Supervised + XR
Gradient for Logistic Regression: when the model’s expected label distribution on the unlabeled data matches the prior distribution, the XR term of the gradient is 0.
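A minimal sketch of the XR objective for binary logistic regression; the penalty weight lam and all names here are illustrative assumptions, not the paper’s notation:

```python
import numpy as np

def xr_objective(w, X_lab, y_lab, X_unlab, prior, lam=10.0):
    """Negative log-likelihood on labeled data plus a weighted KL term
    pushing the model's average prediction on unlabeled data toward the
    prior class distribution prior = [p(y=0), p(y=1)]."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    p = sigmoid(X_lab @ w)
    nll = -np.sum(y_lab * np.log(p) + (1 - y_lab) * np.log(1 - p))
    # model's expected class distribution over the unlabeled pool
    q1 = np.mean(sigmoid(X_unlab @ w))
    q = np.array([1.0 - q1, q1])
    kl = np.sum(prior * np.log(prior / q))
    return nll + lam * kl
```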
XR Results for Classification
Secondary Structure Prediction
Accuracy by # labeled examples:

Method                              2         100       1000
SVM (supervised)                    —         55.41%    66.29%
Cluster Kernel SVM                  —         57.05%    65.97%
QC Smartsub                         —         57.68%    59.16%
Naïve Bayes (supervised)            52.42%    57.12%    64.47%
Naïve Bayes EM                      50.79%    57.34%    57.60%
Logistic Regression (supervised)    52.42%    56.74%    65.43%
Logistic Regression + Ent. Reg.     48.56%    54.45%    58.28%
Logistic Regression + XR            57.08%    58.51%    65.44%
XR Results for Classification: Sliding
Window Model
CoNLL03 Named Entity Recognition Shared Task
XR Results for Classification: Sliding
Window Model 2
BioCreativeII 2007 Gene/Gene Product Extraction
XR Results for Classification: Sliding
Window Model 3
Wall Street Journal Part-of-Speech Tagging
XR Results for Classification: SRAA
Simulated/Real Auto/Aviation Text Classification
Noise in Prior Knowledge
What happens when users’ estimates of the class proportions are in error?
Noisy Prior Distribution
CoNLL03 Named Entity Recognition Shared Task
20% change
in probability of
majority class
Conclusion
Expectation Regularization is an effective, robust
method of semi-supervised training which can be
applied to discriminative models, such as logistic
regression
Ongoing and Future Work
Applying Expectation Regularization to other
discriminative models, e.g. conditional random fields
Experimenting with priors other than class label priors
End of Part 1
Joint Inference in
Information Extraction & Data Mining
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada,
Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.
From Text to Actionable Knowledge
Spider
Filter
Data
Mining
IE
Segment
Classify
Associate
Cluster
Discover patterns
- entity types
- links / relations
- events
Database
Document
collection
Actionable
knowledge
Prediction
Outlier detection
Decision support
Knowledge
Discovery
IE
Segment
Classify
Associate
Cluster
Problem:
Discover patterns
- entity types
- links / relations
- events
Database
Document
collection
Actionable
knowledge
Combined in serial juxtaposition,
IE and DM are unaware of each others’
weaknesses and opportunities.
1) DM begins from a populated DB, unaware of
where the data came from, or its inherent
errors and uncertainties.
2) IE is unaware of emerging patterns and
regularities in the DB.
The accuracy of both suffers, and significant mining
of complex text sources is beyond reach.
Solution:
Uncertainty Info
Spider
Filter
Data
Mining
IE
Segment
Classify
Associate
Cluster
Discover patterns
- entity types
- links / relations
- events
Database
Document
collection
Actionable
knowledge
Emerging Patterns
Prediction
Outlier detection
Decision support
Solution:
Unified Model
Spider
Filter
Data
Mining
IE
Segment
Classify
Associate
Cluster
Probabilistic
Model
Discover patterns
- entity types
- links / relations
- events
Discriminatively-trained undirected graphical models
Document
collection
Conditional Random Fields
[Lafferty, McCallum, Pereira]
Conditional PRMs
[Koller…], [Jensen…],
[Getoor…], [Domingos…]
Complex Inference and Learning
Just what we researchers like to sink our teeth into!
Actionable
knowledge
Prediction
Outlier detection
Decision support
Scientific Questions
• What model structures will capture salient dependencies?
• Will joint inference actually improve accuracy?
• How to do inference in these large graphical models?
• How to do parameter estimation efficiently in these models,
which are built from multiple large components?
• How to do structure discovery in these models?
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences
(Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution
– Joint Segmentation and Co-ref
(Graph Partitioning)
(Sparse BP)
– Joint Relation Extraction and Data Mining
(ICM)
– Probability + First-order Logic, Co-ref on Entities
(MCMC)
Cascaded Predictions
Named-entity tag
Part-of-speech
Segmentation
(output prediction)
Chinese character (input observation)
Cascaded Predictions
Named-entity tag
Part-of-speech
(output prediction)
Segmentation
(input observation)
Chinese character (input observation)
Cascaded Predictions
Named-entity tag
(output prediction)
Part-of-speech
(input observation)
Segmentation
(input observation)
Chinese character (input observation)
But errors cascade--must be perfect at every stage to do well.
Joint Prediction
Cross-Product over Labels

O(|V| x 1485^2) parameters
O(|o| x 1485^2) running time

3 x 45 x 11 = 1485 possible states
e.g.: state label = (Wordbeg, Noun, Person)

Segmentation+POS+NE (output prediction)
Chinese character (input observation)
Joint Prediction
Factorial CRF
O(|V| x 2785) parameters
Named-entity tag
(output prediction)
Part-of-speech
(output prediction)
Segmentation
(output prediction)
Chinese character (input observation)

Linear-Chain to Factorial CRFs
Model Definition

Linear-chain:

p(y | x) = (1 / Z(x)) Π_{t=1}^T Φ_y(y_t, y_t-1) Φ_xy(x_t, y_t)

Factorial (three label chains u, v, w over input x):

p(y | x) = (1 / Z(x)) Π_{t=1}^T Φ_u(u_t, u_t-1) Φ_v(v_t, v_t-1) Φ_w(w_t, w_t-1)
                                Φ_uv(u_t, v_t) Φ_vw(v_t, w_t) Φ_wx(w_t, x_t)

where Φ(·) = exp( Σ_k λ_k f_k(·) )
Dynamic CRFs
Undirected conditionally-trained analogue
to Dynamic Bayes Nets (DBNs)
Factorial
Higher-Order
Hierarchical
Training CRFs

Maximize log-likelihood of parameters given training data:  L({λ_k} | {<o, s>}^(i))

Log-likelihood gradient:

∂L/∂λ_k = Σ_i C_k(s^(i), o^(i))  −  Σ_i Σ_s P_λ(s | o^(i)) C_k(s, o^(i))  −  λ_k / σ^2

where C_k(s, o) = Σ_t f_k(o, t, s_t-1, s_t)

That is: (feature count using correct labels) − (expected feature count using predicted labels) − (smoothing penalty).
Training DCRFs

Maximize log-likelihood of parameters given training data:  L({λ_k} | {<o, s>}^(i))

Log-likelihood gradient:

∂L/∂λ_k = Σ_i C_k(s^(i), o^(i))  −  Σ_i Σ_s P_λ(s | o^(i)) C_k(s, o^(i))  −  λ_k / σ^2

where C_k(s, o) = Σ_t Σ_{c ∈ Cliques} f_k(o, t, c)

Same form as general CRFs: (feature count using correct labels) − (expected feature count using predicted labels) − (smoothing penalty).
Experiments
Simultaneous noun-phrase & part-of-speech tagging
NP:   B I I B I I O O O
POS:  N N N O N N V O V
      Rockwell International Corp. 's Tulsa unit said it signed

NP:   B I I O B I O B I
POS:  O J N V O N O N N
      a tentative agreement extending its contract with Boeing Co.

• Data from CoNLL Shared Task 2000 (Newswire)
  – 8936 training instances
  – 45 POS tags, 3 NP tags
  – Features: word identity, capitalization, regex’s, lexicons…
Experiments
Simultaneous noun-phrase & part-of-speech tagging
(same labeled example sentences as above)
Two experiments:
• Compare exact and approximate inference
• Compare Noun Phrase Segmentation F1 of
  – Cascaded CRF+CRF
  – Cascaded Brill+CRF
  – Joint Factorial DCRFs
Comparison of Cascaded & Joint:

             Cascaded               Joint
             CRF+CRF    Brill+CRF   FCRF
POS acc      98.28      N/A         98.92
Joint acc    95.56      N/A         96.48
NP F1        93.10      93.33       93.87   (Δ error 20%)

CRF+CRF and FCRF trained on 8936 CoNLL sentences.
Brill tagger trained on 30,000+ sentences, including CoNLL test set!
Accuracy by Training Set Size
Joint prediction of part-of-speech and noun-phrase in newswire,
matching accuracy with only 50% of the training data.
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences
(Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution
– Joint Segmentation and Co-ref
(Graph Partitioning)
(Sparse BP)
– Joint Relation Extraction and Data Mining
(ICM)
– Probability + First-order Logic, Co-ref on Entities
(MCMC)
Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
…
Senator Joe Green said today
…
.
Green ran
for …
Dependency among similar, distant mentions ignored.
Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
…
Senator Joe Green said today
…
.
Green ran
for …
14% reduction in error on most repeated field
in email seminar announcements.
Inference:
Tree reparameterized BP
[Wainwright et al, 2002]
See also
[Finkel, et al, 2005]
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences
(Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution
– Joint Segmentation and Co-ref
(Graph Partitioning)
(Sparse BP)
– Joint Relation Extraction and Data Mining
(ICM)
– Probability + First-order Logic, Co-ref on Entities
(MCMC)
Joint co-reference among all pairs
Affinity Matrix CRF

“Entity resolution”
“Object correspondence”

[factor graph: mentions “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .” connected by pairwise Y/N coreference variables with affinities 45, 99, 11]

~25% reduction in error on co-reference of proper nouns in newswire.

Inference:
Correlational clustering / graph partitioning
[Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
Coreference Resolution
AKA "record linkage", "database record deduplication",
"citation matching", "object correspondence", "identity uncertainty"
Output
Input
News article,
with named-entity "mentions" tagged
Number of entities, N = 3
Today Secretary of State Colin Powell
met with . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . he . . . . . .
. . . . . . . . . . . . . Condoleezza Rice . . . . .
. . . . Mr Powell . . . . . . . . . .she . . . . . . .
. . . . . . . . . . . . . . Powell . . . . . . . . . . . .
. . . President Bush . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . Rice . . . . . . . . . .
. . . . . . Bush . . . . . . . . . . . . . . . . . . . . .
........... . . . . . . . . . . . . . . . .
#1
Secretary of State Colin Powell
he
Mr. Powell
Powell
#2
Condoleezza Rice
she
Rice
.........................
#3
President Bush
Bush
Inside the Traditional Solution

Pair-wise Affinity Metric:  Mention (3) “. . . Mr Powell . . .”  vs.  Mention (4) “. . . Powell . . .”  →  Y/N?

Feature                                              Fires?   Weight
Two words in common                                  N        29
One word in common                                   Y        13
"Normalized" mentions are string identical           Y        39
Capitalized word in common                           Y        17
> 50% character tri-gram overlap                     Y        19
< 25% character tri-gram overlap                     N        -34
In same sentence                                     Y        9
Within two sentences                                 Y        8
Further than 3 sentences apart                       N        -1
"Hobbs Distance" < 3                                 Y        11
Number of entities in between two mentions = 0       N        12
Number of entities in between two mentions > 4       N        -3
Font matches                                         Y        1
Default                                              Y        -19

OVERALL SCORE = 98  >  threshold = 0
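A sketch of that scoring step: sum the weights of the features that fire and compare to the threshold. The feature keys are shorthand for the table above; the weights are taken from it, and the active set shown reproduces the score of 98:

```python
WEIGHTS = {
    "one_word_in_common": 13,
    "normalized_mentions_identical": 39,
    "capitalized_word_in_common": 17,
    "trigram_overlap_gt_50pct": 19,
    "same_sentence": 9,
    "within_two_sentences": 8,
    "hobbs_distance_lt_3": 11,
    "font_matches": 1,
    "default": -19,
}

def affinity(active_features, threshold=0):
    score = sum(WEIGHTS[f] for f in active_features)
    return score, score > threshold

# "Mr Powell" vs "Powell": all of the features above fire
print(affinity(list(WEIGHTS)))  # (98, True) -> predict coreferent
```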
The Problem

[pairwise decisions: “Mr Powell”–“Powell”: affinity = 98 → Y; “Powell”–“she”: affinity = 104 → Y; “Mr Powell”–“she”: affinity = 11 → N]

Pair-wise merging decisions are being made independently from each other.

Affinity measures are noisy and imperfect. They should be made in relational dependence with each other.
A Generative Model Solution
[Russell 2001], [Pasula et al 2002]
(Applied to citation matching, and
object correspondence in vision)
[plate diagram: N entities, each with attributes (surname, gender, age, …), generating mention attributes (context words, distance, fonts, …)]

Issues:
1) Generative model
makes it difficult
to use complex
features.
2) Number of entities
is hard-coded into
the model structure,
but we are supposed
to predict num entities!
Thus we must modify
model structure during
inference---MCMC.
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML]

[factor graph: mentions “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .” with pairwise Y/N variables and affinities 45, 30, 11]

Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.

P(y | x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ′ f′(y_ij, y_jk, y_ik) )
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003]

[an inconsistent assignment: labels (Y, N, Y) on the three pairs with affinities (45, 30, 11); the triangle factor drives its score to −infinity]

P(y | x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ′ f′(y_ij, y_jk, y_ik) )
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003]

[a consistent assignment: labels (Y, N, N) on the three pairs with affinities (45, 30, 11); total score 64]

P(y | x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ′ f′(y_ij, y_jk, y_ik) )
Inference in these MRFs = Graph Partitioning
[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

[graph: mentions “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . Condoleezza Rice . . .”, “. . . she . . .” with edge weights 45, 106, 30, 134, 11, 10]

log P(y | x)  =  Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij)
              =  Σ_{i,j within partitions} w_ij  −  Σ_{i,j across partitions} w_ij
Inference in these MRFs = Graph Partitioning (one candidate partitioning)

[same graph; for this partitioning, Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w_ij = 22]
Inference in these MRFs = Graph Partitioning (a better partitioning)

[same graph; for this partitioning, Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w_ij = 314]
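A sketch of evaluating that partitioning objective for a candidate clustering; the mention names and signed affinities here are illustrative, not the slide’s exact weights:

```python
def partition_score(weights, cluster_of):
    """Sum of within-partition edge weights minus across-partition weights.
    weights: {(i, j): w_ij}; cluster_of: {mention: cluster id}."""
    score = 0
    for (i, j), w in weights.items():
        score += w if cluster_of[i] == cluster_of[j] else -w
    return score

w = {("Mr Powell", "Powell"): 45,
     ("Mr Powell", "she"): -30,
     ("Powell", "she"): -11}
print(partition_score(w, {"Mr Powell": 0, "Powell": 0, "she": 1}))  # 45 + 30 + 11 = 86
```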
Co-reference Experimental Results
[McCallum & Wellner, 2003]
Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories:

                            Partition F1            Pair F1
Single-link threshold       16%                     18%
Best prev match [Morton]    83%                     89%
MRFs                        88%  (Δ error = 30%)    92%  (Δ error = 28%)

DARPA MUC-6 newswire article corpus, 30 stories:

                            Partition F1            Pair F1
Single-link threshold       11%                     7%
Best prev match [Morton]    70%                     76%
MRFs                        74%  (Δ error = 13%)    80%  (Δ error = 17%)
Joint Co-reference for
Multiple Entity Types [Culotta & McCallum 2005]
People
Stuart Russell
Y/N
Stuart Russell
Y/N
Y/N
S. Russel
Joint Co-reference for
Multiple Entity Types [Culotta & McCallum 2005]
People
Stuart Russell
Organizations
University of California at Berkeley
Y/N
Y/N
Stuart Russell
Y/N
Y/N
S. Russel
Berkeley
Y/N
Y/N
Berkeley
Joint Co-reference for
Multiple Entity Types [Culotta & McCallum 2005]
People
Stuart Russell
Organizations
University of California at Berkeley
Y/N
Y/N
Stuart Russell
Y/N
Y/N
S. Russel
Berkeley
Y/N
Y/N
Reduces error by 22%
Berkeley
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences
(Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution
– Joint Segmentation and Co-ref
(Graph Partitioning)
(Sparse BP)
– Joint Relation Extraction and Data Mining
(ICM)
– Probability + First-order Logic, Co-ref on Entities
(MCMC)
Joint segmentation and co-reference
Extraction from and matching of
research paper citations.
[graphical model: each observed citation string o has a segmentation s and citation attributes c; pairwise co-reference decisions y connect citations, tied together by world knowledge and database field values p]

Example pair:
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference:
Sparse Generalized Belief Propagation
[Pal, Sutton, McCallum, 2005]
[Wellner, McCallum, Peng, Hay, UAI 2004]
see also [Marthi, Milch, Russell, 2003]
Joint segmentation and co-reference
Joint IE and Coreference from Research Paper Citations
Textual citation mentions
(noisy, with duplicates)
Paper database, with fields,
clean, duplicates collapsed
AUTHORS               TITLE       VENUE
Cowell, Dawid…        Probab…     Springer
Montemerlo, Thrun…    FastSLAM…   AAAI…
Kjaerulff             Approxi…    Technic…
Citation Segmentation and Coreference
Laurel, B.
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
Citation Segmentation and Coreference
Laurel, B.
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
1)
Segment citation fields
Citation Segmentation and Coreference
Laurel, B.
Y
?
N
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
1)
Segment citation fields
2)
Resolve coreferent citations
Citation Segmentation and Coreference
Laurel, B.
Y
?
N
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
AUTHOR =
TITLE =
PAGES =
BOOKTITLE =
EDITOR =
PUBLISHER =
YEAR =
Brenda Laurel
Interface Agents: Metaphors with Character
355-366
The Art of Human-Computer Interface Design
T. Smith
Addison-Wesley
1990
1)
Segment citation fields
2)
Resolve coreferent citations
3)
Form canonical database record
Resolving conflicts
Citation Segmentation and Coreference
Laurel, B.
Y
?
N
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
AUTHOR =
TITLE =
PAGES =
BOOKTITLE =
EDITOR =
PUBLISHER =
YEAR =
Perform
Brenda Laurel
Interface Agents: Metaphors with Character
355-366
The Art of Human-Computer Interface Design
T. Smith
Addison-Wesley
1990
1)
Segment citation fields
2)
Resolve coreferent citations
3)
Form canonical database record
jointly.
IE + Coreference Model
(Figure, built up in stages:)
- Observed citation x, e.g. "J Besag 1986 On the…"
- CRF segmentation s, labeling tokens AUT AUT YR TITL TITL …
- Citation mention attributes c, e.g. AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the…"
- One such (x, s, c) structure for each citation mention, e.g. "Smyth , P Data mining…" and "Smyth . 2001 Data Mining…"
- Binary coreference variables for each pair of mentions, taking values y / n
- Research paper entity attribute nodes, e.g. AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining…", shared by the coreferent mentions

Such a highly connected graph makes
exact inference intractable, so…
Approximate Inference 1
• Loopy Belief Propagation
(Figure: messages m_i(v_j) passed between individual nodes v1…v6.)
• Generalized Belief Propagation
(Figure: messages passed between regions of nodes v1…v9.)
Here, a message is a conditional probability table passed among nodes.
But message size grows exponentially with region size!
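To make the message passing concrete, here is a minimal sketch of sum-product loopy BP on a pairwise model with discrete variables. The potential-table layout and the function signature are illustrative assumptions, not the system described above.

    import numpy as np

    def loopy_bp(unary, pairwise, edges, n_iters=20):
        """Sum-product loopy BP on a pairwise model.
        unary:    dict node -> length-k array of local potentials
        pairwise: dict (s, t) -> k_s x k_t potential table
        edges:    list of (s, t) pairs; messages flow both ways"""
        msgs, neighbors = {}, {}
        for (s, t) in edges:
            msgs[(s, t)] = np.ones(len(unary[t])) / len(unary[t])
            msgs[(t, s)] = np.ones(len(unary[s])) / len(unary[s])
            neighbors.setdefault(s, []).append(t)
            neighbors.setdefault(t, []).append(s)
        for _ in range(n_iters):
            new = {}
            for (s, t) in msgs:
                # product of unary at s and messages into s from all neighbors but t
                belief = unary[s].copy()
                for u in neighbors[s]:
                    if u != t:
                        belief *= msgs[(u, s)]
                table = pairwise[(s, t)] if (s, t) in pairwise else pairwise[(t, s)].T
                m = table.T @ belief          # sum out x_s
                new[(s, t)] = m / m.sum()     # normalize for numerical stability
            msgs = new
        # final node marginals (beliefs)
        beliefs = {}
        for s in unary:
            b = unary[s].copy()
            for u in neighbors.get(s, []):
                b *= msgs[(u, s)]
            beliefs[s] = b / b.sum()
        return beliefs

Note that each message here is a full table over the receiving variable's values, which is exactly what blows up when nodes are grouped into regions.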
Approximate Inference 2
• Iterated Conditional Modes (ICM) [Besag 1986]
(Figure: nodes v1…v6; all nodes but one held constant.)
Visit one variable at a time, setting it to its most probable value given all the others:
    v6^(i+1) = argmax_{v6} P(v6 | v \ v6)
then likewise for v5, v4, and so on.
But greedy, and easily falls into local minima.
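A minimal sketch of ICM under these definitions; the integer encoding of assignments and the local scoring callback are assumptions for illustration.

    def icm(init, num_values, local_score, max_sweeps=50):
        """Iterated Conditional Modes [Besag 1986].
        init:        initial assignment, a list of ints
        num_values:  number of values each variable can take
        local_score: f(v, value, assignment) -> score of setting variable v
                     to value, with all other variables held constant"""
        assign = list(init)
        for _ in range(max_sweeps):
            changed = False
            for v in range(len(assign)):
                # v^(k+1) = argmax_value P(v = value | all other variables)
                best = max(range(num_values),
                           key=lambda val: local_score(v, val, assign))
                if best != assign[v]:
                    assign[v] = best
                    changed = True
            if not changed:
                # a local maximum: greedy, possibly far from the global optimum
                break
        return assign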
Approximate Inference 2
• Iterated Conditional Modes (ICM) [Besag 1986]
    v4^(k+1) = argmax_{v4} P(v4 | v \ v4)
• “Iterated Conditional Sampling” or “Sparse Belief Propagation”
Instead of passing only the argmax, pass a sample of the top values of P(v4 | v \ v4),
e.g. an N-best list (the top N values).
Can use a “generalized” version of this, doing exact inference on a region of
several nodes at once.
Here, a “message” grows only linearly with region size and N!
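A tiny sketch of the N-best message, assuming the conditional distribution is available as a vector of probabilities:

    import numpy as np

    def n_best_message(conditional, n=5):
        """Instead of a single argmax, return the top-n values of
        P(v | v \ v) as a sparse 'message' (an N-best list)."""
        top = np.argsort(conditional)[::-1][:n]
        return [(int(i), float(conditional[i])) for i in top]

    # e.g. n_best_message(np.array([0.05, 0.6, 0.25, 0.1]), n=2)
    # -> [(1, 0.6), (2, 0.25)]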
Inference by Sparse “Generalized BP”
[Pal, Sutton, McCallum 2005]
- Exact inference on the linear-chain (segmentation) regions.
- From each chain, pass an N-best list into coreference.
- Approximate inference over coreference by graph partitioning…
  made to scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000]
  …integrating out uncertainty in samples of extraction.
(Example mentions: “Smyth , P Data mining…”, “Smyth . 2001 Data Mining…”, “J Besag 1986 On the…”)
Inference:
Sample = N-best List from CRF Segmentation

N-best segmentations of one citation mention (columns: Name | Title | Book Title | Year):
1. Laurel, B. | Interface Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
2. Laurel, B. | Interface Agents: Metaphors with Character The Art of Human Computer Interface Design | 1990   (title swallows book title)
3. Laurel, B. Interface | Agents: Metaphors with Character | The Art of Human Computer Interface Design   (name swallows start of title)

When calculating similarity with another citation, have more
opportunity to find correct, matching fields.

N-best segmentations of a second mention (columns: Name | Title | …), compared field by field, with coreference decision y / ? / n:
1. Laurel, B | Interface Agents: Metaphors with Character The … | …
2. Laurel, B. | Interface Agents: Metaphors with Character | …
3. Laurel, B. Interface Agents | Metaphors with Character | … | 1990
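A hedged sketch of the idea on this slide: when comparing two citations, compare every pair of segmentations from their N-best lists field by field, so a correct field boundary in either list can still be matched. The field names and the similarity function are illustrative, not the system's actual features.

    from difflib import SequenceMatcher

    def field_sim(a, b):
        return SequenceMatcher(None, a, b).ratio() if a and b else 0.0

    def nbest_similarity(nbest1, nbest2,
                         fields=("Name", "Title", "BookTitle", "Year")):
        """Each N-best list holds dicts mapping field -> string, one dict
        per candidate segmentation. Score every pair of segmentations and
        keep the best match, giving more opportunity to find correct,
        matching fields."""
        best = 0.0
        for seg1 in nbest1:
            for seg2 in nbest2:
                sims = [field_sim(seg1.get(f, ""), seg2.get(f, ""))
                        for f in fields]
                best = max(best, sum(sims) / len(fields))
        return best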
Inference by Sparse “Generalized BP”
[Pal, Sutton, McCallum 2005]
- Exact (exhaustive) inference over entity attributes.
- Revisit exact inference on the IE linear chain, now conditioned on entity attributes.
(Figure: coreference variables y / n / n among the three example mentions.)
Parameter Estimation: Piecewise Training
[Sutton & McCallum 2005]
Divide-and-conquer parameter estimation:
- IE linear chain: exact MAP
- Coref graph edge weights: MAP on individual edges
- Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb the MAP gradient with a quasi-Newton method.
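A rough sketch of the divide-and-conquer idea, not the paper's code: each piece of the model is trained as an independent local log-linear model, climbing its own MAP gradient with a quasi-Newton optimizer. The data layout is a simplifying assumption.

    import numpy as np
    from scipy.optimize import minimize

    def train_piece(feature_vectors, labels, num_labels, l2=1.0):
        """MAP training of one piece as an independent log-linear model.
        feature_vectors: n x d array; labels: length-n int array."""
        n, d = feature_vectors.shape

        def neg_log_post(w_flat):
            W = w_flat.reshape(num_labels, d)
            scores = feature_vectors @ W.T                  # n x num_labels
            log_z = np.logaddexp.reduce(scores, axis=1)     # local log-partition
            ll = scores[np.arange(n), labels] - log_z
            # Gaussian prior on weights = L2 penalty
            return -(ll.sum() - l2 * (w_flat ** 2).sum() / 2)

        w0 = np.zeros(num_labels * d)
        res = minimize(neg_log_post, w0, method="L-BFGS-B")  # quasi-Newton climb
        return res.x.reshape(num_labels, d)

    # Train each piece (linear-chain factors, coref edge factors, entity
    # attribute factors) separately on its local data, then combine weights.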
Results on 4 Sections of CiteSeer Citations
Coreference F1 performance

N         Reinforce   Face    Reason   Constraint
1         0.946       0.967   0.945    0.961
3         0.950       0.979   0.961    0.960
7         0.948       0.979   0.951    0.971
9         0.982       0.967   0.960    0.971
Optimal   0.995       0.992   0.994    0.988

Average error reduction is 35%.
“Optimal” makes the best use of the N-best list by using true labels,
indicating that even more improvement can be obtained.
Joint segmentation and co-reference
[Wellner, McCallum, Peng, Hay, UAI 2004]

Extraction from and matching of research paper citations:

Laurel, B. Interface Agents:
Metaphors with Character, in
The Art of Human-Computer Interface
Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents:
Metaphors with Character, in
Laurel, The Art of Human-Computer
Interface Design, 355-366, 1990.

(Figure: factor graph with observed citation text o, segmentation s, citation attributes c, co-reference decisions y, database field values p, and world knowledge.)

35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.

Inference:
Sparse Belief Propagation
[Pal, Sutton, McCallum, 2005]
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)
Data
• 270 Wikipedia articles
• 1000 paragraphs
• 4700 relations
• 52 relation types
– JobTitle, BirthDay, Friend, Sister, Husband, Employer, Cousin, Competition, Education, …
• Targeted for density of relations
– Bush/Kennedy/Manning/Coppola families and friends
George W. Bush … “his father George H. W. Bush” … “his cousin John Prescott Ellis” …
George H. W. Bush … “his sister Nancy Ellis Bush” …
Nancy Ellis Bush … “her son John Prescott Ellis” …

Cousin = Father’s Sister’s Son
(Figure: George H. W. Bush and Nancy Ellis Bush linked by a sibling edge; son edges to George W. Bush and John Prescott Ellis respectively; so a pair (X, Y) matching this pattern, here X = George W. Bush and Y = John Prescott Ellis, is likely a cousin.)
John Kerry … “celebrated with Stuart Forbes” …

Name              Son
Rosemary Forbes   John Kerry
James Forbes      Stuart Forbes

Name              Sibling
Rosemary Forbes   James Forbes

(Figure: Rosemary Forbes and James Forbes linked by a sibling edge; son edges to John Kerry and Stuart Forbes respectively; a cousin edge is inferred between John Kerry and Stuart Forbes.)
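A small sketch of how a mined relational path such as Cousin = Father's Sister's Son could be evaluated against a growing relation DB. The DB layout and relation names are hypothetical.

    def holds(db, rel, a, b):
        """db: dict mapping relation name -> set of (subject, object) pairs."""
        return (a, b) in db.get(rel, set())

    def cousin_path_feature(db, x, y):
        """Fires if there exist f, s with Father(x, f), Sister(f, s),
        Son(s, y): the mined path Cousin = Father's Sister's Son."""
        for (x2, f) in db.get("Father", set()):
            if x2 != x:
                continue
            for (f2, s) in db.get("Sister", set()):
                if f2 == f and holds(db, "Son", s, y):
                    return True
        return False

    # e.g. with Father(GWBush, GHWBush), Sister(GHWBush, NancyEllisBush),
    # Son(NancyEllisBush, JohnPrescottEllis), the feature fires for
    # (GWBush, JohnPrescottEllis): likely a cousin.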
Iterative DB Construction

“Joseph P. Kennedy, Sr … son John F. Kennedy with Rose Fitzgerald”
(extracted: Son = John F. Kennedy, Wife = Rose Fitzgerald)

Name                 Son               Wife
Joseph P. Kennedy    John F. Kennedy   Rose Fitzgerald
John F. Kennedy
Ronald Reagan
George W. Bush

Fill DB with “first-pass” CRF; use relational features with “second-pass” CRF.
(A confidence score of 0.3 appears on one extraction in the figure.)
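A rough sketch of the two-pass loop; the CRF and DB interfaces here are placeholders, not the actual system's API.

    def iterative_db_construction(documents, crf, db, extract,
                                  relational_features):
        """Fill the DB with a first-pass CRF, then re-extract with a
        second-pass CRF whose features include relational features
        computed from the partially filled DB."""
        # First pass: plain lexical features only
        for doc in documents:
            for relation in extract(crf, doc, db_features=None):
                db.add(relation)
        # Second pass: features can now consult the DB
        # (e.g. the Cousin = Father's Sister's Son path above)
        for doc in documents:
            for relation in extract(crf, doc,
                                    db_features=relational_features(db)):
                db.add(relation)
        return db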
Results

                F1      Prec    Recall
ME              .5489   .6475   .4763
CRF             .5995   .7019   .5232
RCRF            .6100   .6799   .5531
RCRF .9         .6008   .7177   .5166
RCRF .5         .6136   .7095   .5406
RCRF Truth      .6791   .7553   .6169
RCRF Truth .5   .6363   .7343   .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features
Examples of Discovered Relational Features
• Mother: Father → Wife
• Cousin: Mother → Husband → Nephew
• Friend: Education → Student
• Education: Father → Education
• Boss: Boss → Son
• MemberOf: Grandfather → MemberOf
• Competition: PoliticalParty → Member → Competition
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)
Sometimes graph partitioning with pairwise comparisons is not enough.
• Entities have multiple attributes (name, email, institution, location);
  need to measure “compatibility” among them.
• Having 2 “given names” is common, but not 4.
– e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions.
– ∃ a pair of last-name strings that differ by more than 5?
We need measures on hypothesized “entities”.
We need first-order logic.
Pairwise Co-reference Features

Mentions: Howard Dean, Dean Martin, Howard Martin
SamePerson(Dean Martin, Howard Dean)?
SamePerson(Howard Dean, Howard Martin)?
SamePerson(Dean Martin, Howard Martin)?

Pairwise Features
StringMatch(x1,x2)
EditDistance(x1,x2)
Toward High-Order Representations of Identity Uncertainty

Mentions: Howard Dean, Dean Martin, Howard Martin
SamePerson(Howard Dean, Howard Martin, Dean Martin)?

First-Order Features
• maximum edit distance between any pair is < 0.5
• number of distinct names < 3
• all have same gender
• there exists a number mismatch
• only pronouns

Weighted Logic
• This model brings together the two main (long-separated) branches of
  Artificial Intelligence: Logic and Probability.
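A minimal sketch of first-order (cluster-level) features like those listed above; the mention attributes ('text', 'gender', 'number') are illustrative assumptions.

    from itertools import combinations
    from difflib import SequenceMatcher

    def edit_dist(a, b):
        # a crude normalized edit distance in [0, 1]
        return 1.0 - SequenceMatcher(None, a, b).ratio()

    def first_order_features(cluster):
        """cluster: list of mention dicts. Features quantify over the
        whole hypothesized entity: SamePerson(x1, ..., xn)?"""
        texts = [m["text"] for m in cluster]
        return {
            "max_pair_edit_dist_lt_0.5":
                all(edit_dist(a, b) < 0.5 for a, b in combinations(texts, 2)),
            "num_distinct_names_lt_3": len(set(texts)) < 3,
            "all_same_gender": len({m["gender"] for m in cluster}) == 1,
            "exists_number_mismatch": len({m["number"] for m in cluster}) > 1,
        }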
Toward High-Order Representations of Identity Uncertainty
Combinatorial Explosion!

SamePerson(x1,x2) …
SamePerson(x1,x2,x3) …
SamePerson(x1,x2,x3,x4) …
SamePerson(x1,x2,x3,x4,x5) …
SamePerson(x1,x2,x3,x4,x5,x6) …
…
Mentions: Dean Martin, Howard Dean, Howard Martin, Dino, Howie, Martin, …

This space complexity is common in first-order probabilistic models.
Markov Logic as a Template to Construct a Markov Network using First-Order Logic
[Richardson & Domingos 2005] [Paskin & Russell 2002]

Grounding the Markov network requires space O(n^r)
  n = number of constants
  r = highest clause arity
For example, with n = 1000 constants and a clause of arity r = 3, a single
clause template already yields on the order of 10^9 ground clauses.

How can we perform inference and learning in models that cannot be grounded?
Inference in First-Order Models: SAT Solvers
• Weighted SAT solvers [Kautz et al 1997]
– Require complete grounding of the network
• LazySAT [Singla & Domingos 2006]
– Saves memory by storing only clauses that may become unsatisfied
– Initialization still requires time O(n^r) to visit all ground clauses
Inference in First-Order Models: MCMC
• Gibbs Sampling
– Difficult to move between high-probability configurations by changing
  single variables
  • Although, consider MC-SAT! [Poon & Domingos ‘06]
• An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
– Can be extended to partial configurations
  • Only instantiate relevant variables
– Key advantage: can design arbitrary “smart” jumps
– Successfully used in BLOG models [Milch et al 2005]
– 2 parts: proposal distribution, acceptance distribution.
Model
“First-order features” f_w: SamePerson(x), scored over entire hypothesized
clusters, e.g. {Howard Dean, Governor, Howie}, {Dean Martin, Dino},
{Howard Martin, Howie Martin}.
(Model equation shown as a figure.)
Z_X: a sum over all possible configurations!
Inference with Metropolis-Hastings
• p(y’)/p(y): likelihood ratio
– Ratio of P(Y|X)
– Z_X cancels!
• q(y’|y): proposal distribution
– probability of proposing the move y → y’
• What is nice about this?
– Can design arbitrary “smart” proposal distributions
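A hedged sketch of this sampler over clusterings, with a deliberately simple random move; the scoring function is a placeholder for the unnormalized model score, and a real system would use the "smart" proposals described above.

    import copy
    import math
    import random

    def metropolis_hastings(y0, score, propose, n_steps=10000):
        """score(y) returns an unnormalized log-probability: the
        normalizer Z_X cancels in the ratio. propose(y) returns
        (y', log q(y|y') - log q(y'|y))."""
        y = y0
        for _ in range(n_steps):
            y_new, log_q_ratio = propose(y)
            # accept with min(1, [p(y')/p(y)] * [q(y|y')/q(y'|y)])
            log_alpha = score(y_new) - score(y) + log_q_ratio
            if log_alpha >= 0 or random.random() < math.exp(log_alpha):
                y = y_new
        return y

    def propose_move(y):
        """y: list of clusters (lists of mentions). Move one random
        mention to another, possibly new, cluster."""
        y_new = copy.deepcopy(y)
        src = random.choice([i for i, c in enumerate(y_new) if c])
        m = y_new[src].pop(random.randrange(len(y_new[src])))
        dst = random.randrange(len(y_new) + 1)
        if dst == len(y_new):
            y_new.append([m])
        else:
            y_new[dst].append(m)
        y_new = [c for c in y_new if c]
        # for this sketch we treat the proposal as symmetric: log q ratio = 0
        return y_new, 0.0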
Proposal Distribution
(Figure, three builds: each proposal moves mentions between hypothesized
entities, e.g. from y = {Dean Martin, Howie Martin}, {Howard Martin, Dino}
to y’ = {Dean Martin, Dino}, {Howard Martin, Howie Martin}, and back again
through intermediate configurations.)
Feature List
• Exact Match/Mismatch
– Entity type
– Gender (requires lexicon)
– Number
– Case
– Entity Text
– Entity Head
– Entity Modifier / Numerical Modifier
– Sentence
– WordNet: hypernym, synonym, antonym
• Other
– Relative pronoun agreement
– Sentence distance in bins
– Partial text overlaps
• Quantification
– Existential
  ∃ a gender mismatch
  ∃ three different first names
– Universal
  ∀ NER type match
  ∀ named mentions string-identical
• Filters (limit quantifiers to mention type)
– None
– Pronoun
– Nominal (description)
– Proper (name)
Learning the Likelihood Ratio
Can’t normalize over all possible y’s.
...temporarily consider the following... ad hoc training:
Maximize p(b|y,x), where b in {TRUE, FALSE}
Error-Driven Training Motivation
Where to get training examples?
• Generating all possible partial clusters is intractable.
– Sample uniformly?
– Sample from clusters visited during inference?
• Error-driven
– Focus learning on the examples that need it most
Error-Driven Training Results

                   B-cubed F1
Non-error-driven   69
Error-driven       72
Learning the Likelihood Ratio
Given a pair of configurations, learn to rank the “better”
configuration higher.
Rank-Based Training
• Instead of training
[Powell, Mr. Powell, he] --> TRUE
[Powell, Mr. Powell, she] --> FALSE
• ...Rather...
[Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
[Powell, Mr. Powell, he] > [Powell, Mr. Powell]
[Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
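A sketch of this rank-based update in the style of a ranking perceptron; the feature dictionaries are placeholders for whatever the configurations' first-order features are.

    def rank_update(w, feats_better, feats_worse, lr=1.0):
        """Given two configurations where the first should outrank the
        second (e.g. [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]),
        nudge weights so that score(better) > score(worse).
        feats_*: dict feature -> value; w: dict feature -> weight."""
        def score(f):
            return sum(w.get(k, 0.0) * v for k, v in f.items())
        if score(feats_better) <= score(feats_worse):  # ranking violated
            for k, v in feats_better.items():
                w[k] = w.get(k, 0.0) + lr * v
            for k, v in feats_worse.items():
                w[k] = w.get(k, 0.0) - lr * v
        return w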
Rank-Based Training Results

                                B-cubed F1
Non-rank-based (error-driven)   72
Rank-based (error-driven)       79
Previous best in literature     68
Experimental Results
[Culotta, Wick, Hall, McCallum, NAACL/HLT 2007]
• ACE 2005 newswire coreference
• All entity types
– Proper, common, and pronoun mentions
• 443 documents

                                B-cubed
Previous best results, 1997:    65
Previous best results, 2002:    67
Previous best results, 2005:    68
Our new results:                79
Learning the Proposal Distribution by Tying Parameters
• Proposal distribution q(y’|y): a “cheap” approximation to p(y)
• Reuse a subset of the parameters in p(y)
• E.g. in the identity uncertainty model:
– Sample two clusters
– Stochastic agglomerative clustering to propose a new configuration
Scalability
• Currently running on >2 million author name mentions.
• Canopies
• Schedule of different proposal distributions
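A minimal sketch of canopies [McCallum, Nigam, Ungar 2000]: a cheap inverted index groups mentions into overlapping blocks, so expensive coreference comparisons only happen within a block. The token-based cheap metric here is illustrative.

    from collections import defaultdict

    def build_canopies(mentions):
        """mentions: list of name strings. Cheap overlapping blocking:
        any two mentions sharing a token land in a common canopy, so the
        expensive pairwise model never compares 'Smyth' with 'Besag'."""
        index = defaultdict(set)
        for i, m in enumerate(mentions):
            for token in m.lower().split():
                index[token].add(i)
        return index

    def candidate_pairs(mentions):
        pairs = set()
        for ids in build_canopies(mentions).values():
            ids = sorted(ids)
            for a in range(len(ids)):
                for b in range(a + 1, len(ids)):
                    pairs.add((ids[a], ids[b]))
        return pairs  # far fewer than the ~n^2/2 all-pairs comparisons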
Weighted Logic Summary
• Brings together
– Logic
– Probability
• Inference and learning are incredibly difficult.
• Our recommendation:
– MCMC for inference
– Error-driven, rank-based training
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)
End of Part 2
Sparse Belief Propagation
“Beam Search” in arbitrary graphical models
A different pruning strategy, based on variational inference:
1. Compute messages m_st (from x_s to x_t) and m_vt (from x_v to x_t)
2. Form the marginal b(x_t)
3. Retain 1-ε of the mass, giving the sparse belief b’(x_t)

“Sparse Belief Propagation” used during training
[Pal, Sutton, McCallum, ICASSP 2006]
(Figure: accuracy vs. training time, comparing sparse BP with a traditional beam.)
Like beam search, but with modifications from variational methods of inference.
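A minimal sketch of the pruning step: after forming a belief b(x_t), keep only the smallest set of values covering 1-ε of the probability mass and zero out the rest, a beam whose width adapts to the distribution.

    import numpy as np

    def sparsify_belief(b, epsilon=0.01):
        """Retain the top values covering (1 - epsilon) of the mass of b,
        then renormalize; messages touching x_t can be computed over the
        surviving values only."""
        order = np.argsort(b)[::-1]
        csum = np.cumsum(b[order])
        # smallest prefix of high-probability values covering the mass
        k = int(np.searchsorted(csum, 1.0 - epsilon)) + 1
        sparse = np.zeros_like(b)
        sparse[order[:k]] = b[order[:k]]
        return sparse / sparse.sum()

    # e.g. sparsify_belief(np.array([0.5, 0.3, 0.15, 0.05]), epsilon=0.1)
    # keeps [0.5, 0.3, 0.15] and renormalizes.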
Context
(Figure: pipeline from Document collection → Spider → Filter → IE
(Segment, Classify, Associate, Cluster) → Database → Data Mining
(Discover patterns: entity types, links/relations, events) → Actionable
knowledge (Prediction, Outlier detection, Decision support), with joint
inference among the detailed steps.)

Leveraging Text in Social Network Analysis
Outline
Social Network Analysis with Topic Models
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics
Multi-Conditional Mixtures