Automatically Building Special Purpose Search Engines with

Toward Unified Graphical Models of
Information Extraction and Data Mining
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with
Charles Sutton, Aron Culotta, Ben Wellner, Khashayar Rohanimanesh, Wei Li
Goal:
Mine actionable knowledge
from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
A Portal for Job Openings
Category = High Tech
Keyword = Java
Location = U.S.
Job Openings:
Data Mining the Extracted Job Information
IE from
Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents
several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
IE from Research Papers
[McCallum et al ‘99]
Mining Research Papers
[Rosen-Zvi, Griffiths, Steyvers,
Smyth, 2004]
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Larger Context

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster)
→ Database → Data Mining (Discover patterns: entity types, links / relations, events)
→ Actionable knowledge (Prediction, Outlier detection, Decision support)
Outline
• Examples of IE and Data Mining
• Brief review of Conditional Random Fields
• Joint inference: Motivation and examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Efficiently training large, multi-component models:
  Piecewise Training
Hidden Markov Models
HMMs are the standard sequence modeling tool in
genomics, music, speech, NLP, …
Graphical model / finite state model:
a chain of states  S_{t-1} → S_t → S_{t+1} → ...  (transitions),
each state emitting one observation  O_{t-1}, O_t, O_{t+1}, ...

Generates:  state sequence  s = s_1 ... s_|o|
            observation sequence  o = o_1 o_2 ... o_|o|

    P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Parameters, for all states S = {s_1, s_2, ...}:
  Start state probabilities:            P(s_1)
  Transition probabilities:             P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t),
    usually a multinomial over an atomic, fixed alphabet.
Training:
Maximize probability of training observations (w/ prior)
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Rich Caruana spoke this example sentence.
and a trained HMM:
person name
location name
background
Find the most likely state sequence: (Viterbi)
Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated “person name”
state extract as a person name:
Person name: Rich Caruana
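The Viterbi step above can be sketched in a few lines. This is a minimal illustration, not the system from the talk; the states ("bg" for background, "person" for person name) and all probabilities are invented for the example:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence under an HMM, in log space.
    Unseen words get a tiny smoothing probability."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-6)), [s])
          for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            score, path = max(
                (V[t - 1][prev][0] + math.log(trans_p[prev][s])
                 + math.log(emit_p[s].get(obs[t], 1e-6)), V[t - 1][prev][1])
                for prev in states)
            V[t][s] = (score, path + [s])
    return max(V[-1].values())[1]

# Toy model: a "person name" state and a "background" state.
states = ["bg", "person"]
start_p = {"bg": 0.8, "person": 0.2}
trans_p = {"bg": {"bg": 0.7, "person": 0.3},
           "person": {"bg": 0.4, "person": 0.6}}
emit_p = {"bg": {"Yesterday": 0.3, "spoke": 0.3},
          "person": {"Rich": 0.5, "Caruana": 0.5}}

tags = viterbi("Yesterday Rich Caruana spoke".split(),
               states, start_p, trans_p, emit_p)
# → ['bg', 'person', 'person', 'bg']: "Rich Caruana" extracted as a person name
```

Words decoded through the "person" state are then extracted as a person name, exactly as the slide describes.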
We want More than an Atomic View of Words
Would like richer representation of text:
many arbitrary, overlapping features of the words.
    S_{t-1} -- S_t -- S_{t+1} ...
       |        |        |
    O_{t-1}    O_t    O_{t+1} ...

For example, features of the observation O_{t-1} (the word "Wisniewski"):
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"
From HMMs to Conditional Random Fields
s = s_1, s_2, ..., s_n        o = o_1, o_2, ..., o_n

[Lafferty, McCallum, Pereira 2001]

Joint:

    P(s, o) = \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Conditional:

    P(s | o) = (1 / P(o)) \prod_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

             = (1 / Z(o)) \prod_{t=1}^{|o|} \Phi_s(s_t, s_{t-1}) \Phi_o(o_t, s_t)

where

    \Phi_o(t) = exp( \sum_k \lambda_k f_k(s_t, o_t) )
(A super-special case of
Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on the likelihood L.
Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
1. FSM special-case:
linear chain among unknowns, parameters tied across time steps.
    States:       S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, ...
    Observations: o = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}, ...

    P(s | o) = (1 / Z(o)) \prod_{t=1}^{|o|} exp( \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) )
2. In general:
CRFs = "Conditionally-trained Markov Network"
arbitrary structure among unknowns
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]:
Parameters tied across hits from SQL-like queries ("clique templates")
Linear-chain CRFs vs. HMMs
• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all
observations
– Parameters need not fully specify generation of
observations; can require less training data
– Easy to incorporate domain knowledge
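The linear-chain CRF above can be sketched by brute force on a tiny example: score every label sequence with exp(Σ_t Σ_k λ_k f_k(s_{t-1}, s_t, o, t)) and normalize globally. The labels, feature functions, and weights below are invented for illustration; brute-force enumeration of Z is only feasible at toy size:

```python
import itertools, math

def crf_prob(seq, obs, label_set, feats, weights):
    """P(s | o) for a linear-chain CRF, with Z computed by brute-force
    enumeration of all label sequences (fine for toy-sized problems)."""
    def score(s):
        total = 0.0
        for t in range(len(obs)):
            prev = s[t - 1] if t > 0 else "<start>"
            total += sum(w * f(prev, s[t], obs, t)
                         for f, w in zip(feats, weights))
        return total
    Z = sum(math.exp(score(s))
            for s in itertools.product(label_set, repeat=len(obs)))
    return math.exp(score(seq)) / Z

obs = ["Rich", "Caruana", "spoke"]
labels = ("PER", "O")
feats = [
    # features may look at the whole observation sequence o at once
    lambda prev, cur, o, t: 1.0 if o[t][0].isupper() and cur == "PER" else 0.0,
    lambda prev, cur, o, t: 1.0 if o[t][0].islower() and cur == "O" else 0.0,
    lambda prev, cur, o, t: 1.0 if prev == "PER" and cur == "PER" else 0.0,
]
weights = [1.5, 2.0, 0.8]

best = max(itertools.product(labels, repeat=3),
           key=lambda s: crf_prob(s, obs, labels, feats, weights))
# → ('PER', 'PER', 'O')
```

Note how the first two features inspect the raw observation (capitalization) without the model having to generate it, which is exactly the freedom the slide contrasts with HMMs.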
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
        :  Number of   :          Production of Milk and Milkfat 2/
  Year  : Milk Cows 1/ :-----------------------------------------------------
        :              :   Per Milk Cow    : Percentage of   :      Total
        :              :  Milk   : Milkfat : Fat in All Milk :   Milk   : Milkfat
--------------------------------------------------------------------------------
        :  1,000 Head  :  --- Pounds ---   :     Percent     :   Million Pounds
  1993  :    9,589     : 15,704  :   575   :      3.66       : 150,582  : 5,514.4
  1994  :    9,500     : 16,175  :   592   :      3.66       : 153,664  : 5,623.7
  1995  :    9,461     : 16,451  :   602   :      3.66       : 155,644  : 5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Table Extraction from Government Reports
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
Labels (12 in all), applied by the CRF line-by-line, e.g.:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features,
  time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
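A sketch of how such per-line features might be computed. The helper below is hypothetical, not the paper's actual feature extractor; the whitespace-alignment proxy in particular is a crude stand-in:

```python
def line_features(line, prev_line=""):
    """A few of the slide's per-line features for table extraction."""
    n = max(len(line), 1)

    # column positions where a run of 2+ spaces starts: a crude proxy
    # for "whitespace in this line aligns with the previous line"
    def gap_starts(s):
        return {i for i in range(len(s) - 1)
                if s[i] == " " and s[i + 1] == " "
                and (i == 0 or s[i - 1] != " ")}

    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "big_gap": " " * 5 in line,   # contains 5+ consecutive spaces
        "aligns_with_prev": bool(gap_starts(line) & gap_starts(prev_line)),
    }

f = line_features("  1993        9,589   15,704")
# a data row: indented, digit-heavy, contains a 5+ space gap
```

The conjunction features from the slide would then be built by pairing these values across the time offsets {0,0}, {-1,0}, {0,1}, {1,2}.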
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
                     Line labels,       Table segments,
                     percent correct    F1
HMM                  65 %               64 %
Stateless MaxEnt     85 %               --
CRF                  95 %               92 %
IE from Research Papers
[McCallum et al ‘99]
IE from Research Papers
                                        Field-level F1
Hidden Markov Models (HMMs)             75.6
[Seymore, McCallum, Rosenfeld, 1999]

Support Vector Machines (SVMs)          89.7
[Han, Giles, et al, 2003]

Conditional Random Fields (CRFs)        93.9   (40% error reduction)
[Peng, McCallum, 2004]
Named Entity Recognition
CRICKET MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side
Boland said on Thursday they
had signed Leicestershire fast
bowler David Millns on a one
year contract.
Millns, who toured Australia with
England A in 1992, replaces
former England all-rounder
Phillip DeFreitas as Boland's
overseas professional.
Labels and examples:
PER    Yayuk Basuki, Innocent Butare
ORG    3M, KDP, Cleveland
LOC    Cleveland, Nirmal Hriday, The Oval
MISC   Java, Basque, 1,000 Lakes Rally
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]
Method                              F1
HMMs: BBN's Identifinder            73%
CRFs w/out Feature Induction        83%
CRFs with Feature Induction         90%
  (based on LikelihoodGain)
Outline
• Examples of IE and Data Mining
• Brief review of Conditional Random Fields
• Joint inference: Motivation and examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Efficiently training large, multi-component models:
  Piecewise Training
Larger Context

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster)
→ Database → Data Mining (Discover patterns: entity types, links / relations, events)
→ Actionable knowledge (Prediction, Outlier detection, Decision support)
Knowledge Discovery

Document collection → IE (Segment, Classify, Associate, Cluster) → Database
→ KD (Discover patterns: entity types, links / relations, events)
→ Actionable knowledge

Problem:
Combined in serial juxtaposition,
IE and KD are unaware of each others’
weaknesses and opportunities.
1) KD begins from a populated DB, unaware of where
the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and
regularities in the DB.
The accuracy of both suffers, and significant mining
of complex text sources is beyond reach.
Solution:
Uncertainty Info

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster)
→ Database → Data Mining (Discover patterns: entity types, links / relations, events)
→ Actionable knowledge (Prediction, Outlier detection, Decision support)

with Uncertainty Info passed forward from IE to Data Mining,
and Emerging Patterns passed back from Data Mining to IE.
Solution:
Unified Model

Document collection → Spider → Filter → a single Probabilistic Model spanning
IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns:
entity types, links / relations, events)
→ Actionable knowledge (Prediction, Outlier detection, Decision support)

Discriminatively-trained undirected graphical models:
  Conditional Random Fields [Lafferty, McCallum, Pereira]
  Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Complex Inference and Learning:
just what we researchers like to sink our teeth into!
Larger-scale Joint Inference for IE
• What model structures will capture salient dependencies?
• Will joint inference improve accuracy?
• How to do inference in these large graphical models?
• How to efficiently train these models,
which are built from multiple large components?
1. Jointly labeling cascaded sequences
Factorial CRFs
[Sutton, Rohanimanesh, McCallum, ICML 2004]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
But errors cascade--must be perfect at every stage to do well.
Joint prediction of part-of-speech and noun-phrase in newswire,
matching accuracy with only 50% of the training data.
Inference:
Tree reparameterization BP
[Wainwright et al, 2002]
2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
… Senator Joe Green said today … . Green ran for …

Dependency among similar, distant mentions ignored.
2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
… Senator Joe Green said today … . Green ran for …

14% reduction in error on most repeated field
in email seminar announcements.

Inference:
Tree reparameterization BP
[Wainwright et al, 2002]
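One way to picture the skip edges: connect pairs of positions whose capitalized tokens are identical, so the linear chain gets a long-range factor between, e.g., the two mentions of "Green". A minimal sketch; the identical-capitalized-word heuristic is illustrative, not necessarily the paper's exact rule:

```python
def skip_edges(tokens):
    """Candidate skip-chain edges: pairs of positions that share an
    identical capitalized word."""
    seen = {}          # word -> earlier positions where it occurred
    edges = []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            for j in seen.get(tok, []):
                edges.append((j, i))       # long-range factor (j, i)
            seen.setdefault(tok, []).append(i)
    return edges

edges = skip_edges("Senator Joe Green said today that Green ran".split())
# → [(2, 6)]: a skip edge between the two "Green" mentions
```

Each returned pair would become an extra factor in the CRF, letting the label chosen at one mention influence the label at the distant, similar mention.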
3. Joint co-reference among all pairs
Affinity Matrix CRF
“Entity resolution”
“Object correspondence”
Pairwise Y/N coreference decisions over mention pairs
(. . . Mr Powell . . . , . . . Powell . . . , . . . she . . .),
with affinity edge weights 45, 99, 11.

~25% reduction in error on co-reference of
proper nouns in newswire.

Inference:
Correlational clustering graph partitioning
[Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
Coreference Resolution
AKA "record linkage", "database record deduplication",
"entity resolution", "object correspondence", "identity uncertainty"
Input: news article, with named-entity "mentions" tagged.
Output: number of entities, N = 3, each with its set of mentions.
Today Secretary of State Colin Powell
met with . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . he . . . . . .
. . . . . . . . . . . . . Condoleezza Rice . . . . .
. . . . Mr Powell . . . . . . . . . .she . . . . . . .
. . . . . . . . . . . . . . Powell . . . . . . . . . . . .
. . . President Bush . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . Rice . . . . . . . . . .
. . . . . . Bush . . . . . . . . . . . . . . . . . . . . .
........... . . . . . . . . . . . . . . . .
#1
Secretary of State Colin Powell
he
Mr. Powell
Powell
#2
Condoleezza Rice
she
Rice
.........................
#3
President Bush
Bush
Inside the Traditional Solution
Pair-wise Affinity Metric

Mention (3): . . . Mr Powell . . .    Mention (4): . . . Powell . . .    Y/N?

Feature                                            Fires?   Weight
Two words in common                                  N        29
One word in common                                   Y        13
"Normalized" mentions are string identical           Y        39
Capitalized word in common                           Y        17
> 50% character tri-gram overlap                     Y        19
< 25% character tri-gram overlap                     N       -34
In same sentence                                     Y         9
Within two sentences                                 Y         8
Further than 3 sentences apart                       N        -1
"Hobbs Distance" < 3                                 Y        11
Number of entities in between two mentions = 0       N        12
Number of entities in between two mentions > 4       N        -3
Font matches                                         Y         1
Default                                              Y       -19

OVERALL SCORE = 98   > threshold = 0
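The weighted-sum decision above is easy to sketch. The weights come from the slide, and the set of fired features matches its Y/N column (their weights sum to the slide's overall score of 98); the feature names are paraphrased identifiers:

```python
WEIGHTS = {
    "two_words_in_common": 29,
    "normalized_mentions_string_identical": 39,
    "one_word_in_common": 13,
    "capitalized_word_in_common": 17,
    "trigram_overlap_over_50pct": 19,
    "trigram_overlap_under_25pct": -34,
    "in_same_sentence": 9,
    "within_two_sentences": 8,
    "further_than_3_sentences_apart": -1,
    "hobbs_distance_under_3": 11,
    "entities_in_between_eq_0": 12,
    "entities_in_between_gt_4": -3,
    "font_matches": 1,
    "default": -19,
}

def affinity(fired):
    """Overall affinity = sum of weights of the fired features;
    merge the mention pair when the score clears the threshold (0)."""
    return sum(WEIGHTS[name] for name in fired)

# features that fire for ". . . Mr Powell . . ." vs ". . . Powell . . ."
fired = {"one_word_in_common", "normalized_mentions_string_identical",
         "capitalized_word_in_common", "trigram_overlap_over_50pct",
         "in_same_sentence", "within_two_sentences",
         "hobbs_distance_under_3", "font_matches", "default"}
score = affinity(fired)   # → 98, above the threshold of 0: merge
```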
The Problem
. . . Mr Powell . . .   . . . Powell . . .   . . . she . . .
pairwise affinities: 98, 104, 11; independent decisions: Y, N, Y

Pair-wise merging decisions are being made
independently from each other.
Affinity measures are noisy and imperfect.
They should be made in relational dependence with each other.
A Generative Model Solution
[Russell 2001], [Pasula et al 2002],
[Milch et al 2003], [Marthi et al 2003]
(Applied to citation matching, and
object correspondence in vision)
(Plate model over N entity ids; each mention's context words, surname,
distance, fonts, gender, age, … are generated from its entity.)
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003, ICML]
. . . Mr Powell . . . --(45)-- . . . Powell . . .
        \--(30)--  . . . she . . .  --(11)--/
(a Y/N coreference variable on each edge)

Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.

    P(y | x) = (1/Z_x) exp( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_ij)
                            + \sum_{i,j,k} \lambda' f'(y_ij, y_jk, y_ik) )
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003]
. . . Mr Powell . . .   . . . Powell . . .   . . . she . . .
edge decisions: N on (45), N on (30), Y on (11)
→ configuration score: 4
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003]
. . . Mr Powell . . .   . . . Powell . . .   . . . she . . .
edge decisions: Y on (45), N on (30), Y on (11)
→ configuration score: infinity (inconsistent triangle)
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003]
. . . Mr Powell . . .   . . . Powell . . .   . . . she . . .
edge decisions: Y on (45), N on (30), N on (11)
→ configuration score: 64
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
Four mentions (. . . Mr Powell . . . , . . . Powell . . . ,
. . . Condoleezza Rice . . . , . . . she . . .)
with pairwise edge weights 45, 106, 30, 134, 11, 10.

    log P(y | x) ∝ \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_ij)
                 = \sum_{i,j within partitions} w_ij - \sum_{i,j across partitions} w_ij
(One partitioning of the same four mentions: objective value = 22.)
(A better partitioning, with across-partition weights w'_ij: objective value = 314.)
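The partitioning objective can be sketched directly: sum within-partition edge weights and subtract across-partition ones. The mention names and signed weights below are hypothetical (log-potential style, so a negative weight means "probably not coreferent"); with a strong positive Mr Powell / Powell edge and a negative Mr Powell / she edge, splitting off "she" wins:

```python
def partition_score(weights, clusters):
    """Graph-partitioning objective: sum of within-cluster edge
    weights minus sum of across-cluster edge weights."""
    cluster_of = {m: i for i, c in enumerate(clusters) for m in c}
    return sum(w if cluster_of[a] == cluster_of[b] else -w
               for (a, b), w in weights.items())

# hypothetical signed affinities between three mentions
weights = {("MrPowell", "Powell"): 45,
           ("MrPowell", "she"): -30,
           ("Powell", "she"): 11}

split = partition_score(weights, [{"MrPowell", "Powell"}, {"she"}])   # 64
lump = partition_score(weights, [{"MrPowell", "Powell", "she"}])      # 26
# the split beats lumping all three mentions together
```

Maximizing this score over all partitionings is exactly the correlational-clustering problem cited on the slide.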
Co-reference Experimental Results
[McCallum & Wellner, 2003]
Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories
                            Partition F1              Pair F1
Single-link threshold         16 %                     18 %
Best prev match [Morton]      83 %                     89 %
MRFs                          88 % (30% err. red.)     92 % (28% err. red.)

DARPA MUC-6 newswire article corpus, 30 stories
                            Partition F1              Pair F1
Single-link threshold         11 %                      7 %
Best prev match [Morton]      70 %                     76 %
MRFs                          74 % (13% err. red.)     80 % (17% err. red.)
4. Joint segmentation and co-reference
Extraction from and matching of
research paper citations.
(Graphical model: observed citation strings o, segmentations s,
citation attributes c, co-reference decisions y, database field values,
and World Knowledge, over citation pairs such as:

  Laurel, B. Interface Agents: Metaphors with Character, in
  The Art of Human-Computer Interface Design, B. Laurel (ed),
  Addison-Wesley, 1990.

  Brenda Laurel. Interface Agents: Metaphors with Character, in
  Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.)
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference:
Variant of Iterated Conditional Modes
[Besag, 1986]
[Wellner, McCallum, Peng, Hay, UAI 2004]
see also [Marthi, Milch, Russell, 2003]
4. Joint segmentation and co-reference
Joint IE and Coreference from Research Paper Citations
Textual citation mentions
(noisy, with duplicates)
Paper database, with fields,
clean, duplicates collapsed
AUTHORS               TITLE        VENUE
Cowell, Dawid…        Probab…      Springer
Montemerlo, Thrun…    FastSLAM…    AAAI…
Kjaerulff             Approxi…     Technic…
Citation Segmentation and Coreference
Laurel, B.
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
Citation Segmentation and Coreference
Laurel, B.  Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
            ( Y ? N )
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

AUTHOR =     Brenda Laurel
TITLE =      Interface Agents: Metaphors with Character
PAGES =      355-366
BOOKTITLE =  The Art of Human-Computer Interface Design
EDITOR =     T. Smith
PUBLISHER =  Addison-Wesley
YEAR =       1990

Perform  1) Segment citation fields
         2) Resolve coreferent citations  (resolving conflicts)
         3) Form canonical database record
jointly.
IE + Coreference Model
CRF Segmentation s:    AUT  AUT   YR   TITL TITL ...
Observed citation x:   J    Besag 1986 On   the…
IE + Coreference Model
Citation mention attributes c:
  AUTHOR = "J Besag"    YEAR = "1986"    TITLE = "On the…"

CRF Segmentation s
Observed citation x:  J Besag 1986 On the…
IE + Coreference Model
Structure (c, s, x) for each citation mention:
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
IE + Coreference Model
Binary coreference variables for each pair of mentions: y / n / n
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
IE + Coreference Model
Research paper entity attribute nodes:
  AUTHOR = "P Smyth"    YEAR = "2001"    TITLE = "Data Mining…"   ...
Coreference variables: y / n / n
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
IE + Coreference Model
Research paper entity attribute node; coreference variables: y / y / y
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
IE + Coreference Model
Coreference variables: y / n / n
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
Such a highly connected graph makes
exact inference intractable, so…
Approximate Inference 1
• Loopy Belief Propagation:
  messages m_i(v_j) passed between nodes of the loopy graph v1…v6
  (e.g. m1(v2), m2(v1), m2(v3), m3(v2))
• Generalized Belief Propagation:
  messages passed between regions of a grid v1…v9

Here, a message is a conditional probability table passed among nodes.
But message size grows exponentially with the size of the overlap between regions!
Approximate Inference 2
• Iterated Conditional Modes (ICM)  [Besag 1986]
  Update one variable at a time, holding the others constant
  (e.g. on a grid v1…v6):

      v_6^{i+1} = argmax_{v_6^i} P(v_6^i | v \ v_6^i)
      v_5^{j+1} = argmax_{v_5^j} P(v_5^j | v \ v_5^j)
      v_4^{k+1} = argmax_{v_4^k} P(v_4^k | v \ v_4^k)
      ...

Structured inference scales well here,
but greedy, and easily falls into local minima.
• Iterated Conditional Sampling (ICS) (our name)
  Instead of selecting only the argmax, keep a sample of the argmaxes of
  P(v_4^k | v \ v_4^k), e.g. an N-best list (the top N values).

Can use a "Generalized Version" of this;
doing exact inference on a region of several nodes at once.
Here, a "message" grows only linearly with overlap region size and N!
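ICM itself is only a few lines; ICS replaces the single argmax with an N-best list. A minimal ICM sketch on an invented two-variable model (the unary scores and agreement bonus are hypothetical):

```python
def icm(variables, domains, local_score, sweeps=100):
    """Iterated Conditional Modes [Besag 1986]: repeatedly set each
    variable to the value maximizing its conditional score, holding
    the others fixed; stop when a full sweep changes nothing."""
    assign = {v: domains[v][0] for v in variables}
    for _ in range(sweeps):
        changed = False
        for v in variables:
            best = max(domains[v], key=lambda val: local_score(v, val, assign))
            if best != assign[v]:
                assign[v], changed = best, True
        if not changed:          # converged (a local maximum)
            break
    return assign

# toy model: v1 prefers "a"; v1 and v2 prefer to agree
unary = {"v1": {"a": 2.0, "b": 0.0}, "v2": {"a": 0.0, "b": 0.5}}

def local_score(v, val, assign):
    other = "v2" if v == "v1" else "v1"
    return unary[v][val] + (1.0 if assign[other] == val else 0.0)

result = icm(["v1", "v2"], {"v1": ["b", "a"], "v2": ["b", "a"]}, local_score)
# → {'v1': 'a', 'v2': 'a'}
```

ICS would keep the top N values of each conditional instead of only `best`, passing that N-best list on like a small message.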
Features of this Inference Method
1) Structured or "factored" representation (a la GBP)
2) Uses samples to approximate density
3) Closed-loop message-passing on loopy graph (a la BP)
Related Work
• Beam search
– “Forward”-only inference
• Particle filtering, e.g. [Doucet 1998]
– Usually on tree-shaped graph, or “feedforward” only.
• MC Sampling…Embedded HMMs [Neal, 2003]
– Sample from high-dim continuous state space; do forward-backward
• Sample Propagation [Paskin, 2003]
– Messages = samples, on a junction tree
• Fields to Trees [Hamze & de Freitas, UAI 2003]
– Rao-Blackwellized MCMC, partitioning G into non-overlapping trees
• Factored Particles for DBNs [Ng, Peshkin, Pfeffer, 2002]
– Combination of Particle Filtering and Boyan-Koller for DBNs
IE + Coreference Model
Exact inference on these linear-chain regions;
from each chain, pass an N-best list into coreference.
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
IE + Coreference Model
Approximate inference by graph partitioning,
…integrating out uncertainty in samples of extraction.
Made to scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
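The canopies idea [McCallum, Nigam, Ungar 2000] can be sketched as follows: a cheap similarity (here token-set Jaccard, chosen for illustration) groups citations into possibly overlapping canopies, and the expensive pairwise coreference model then runs only within each canopy. The thresholds are invented:

```python
def jaccard(a, b):
    """Cheap similarity: token-set overlap."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

def canopies(items, sim=jaccard, t_loose=0.3, t_tight=0.6):
    """Greedy canopy clustering: everything within t_loose of a center
    joins its canopy (canopies may overlap); everything within t_tight
    can no longer seed a new canopy."""
    remaining = list(range(len(items)))
    out = []
    while remaining:
        center = remaining[0]
        out.append({i for i in remaining
                    if sim(items[center], items[i]) >= t_loose})
        remaining = [i for i in remaining
                     if sim(items[center], items[i]) < t_tight]
    return out

cites = ["Smyth P Data Mining",
         "P Smyth Data Mining 2001",
         "Besag On the Statistics"]
groups = canopies(cites)
# → [{0, 1}, {2}]: the two Smyth citations share a canopy
```

Only pairs inside a common canopy are ever scored by the expensive model, which is what makes the 1M-citation scale feasible.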
IE + Coreference Model
Exact (exhaustive) inference over entity attributes;
coreference variables: y / n / n
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
IE + Coreference Model
Revisit exact inference on the IE linear chain,
now conditioned on entity attributes; coreference variables: y / n / n
  Smyth , P Data mining…
  Smyth . 2001 Data Mining…
  J Besag 1986 On the…
Parameter Estimation
Separately for different regions:
  IE linear-chain:             exact MAP
  Coref graph edge weights:    MAP on individual edges
  Entity attribute potentials: MAP, pseudo-likelihood

In all cases:
climb the MAP gradient with a quasi-Newton method.
Experimental Results
• Set of citations from CiteSeer
– 1500 citation mentions
– to 900 paper entities
• Hand-labeled for coreference and field-extraction
• Divided into 4 subsets, each on a different topic
– RL, Face detection, Reasoning, Constraint Satisfaction
– Within each subset many citations share authors,
publication venues, publishers, etc.
• 70% of the citation mentions are singletons
Coreference Results
Coreference cluster recall:

N              Reinforce   Face   Reason   Constraint
1 (Baseline)     0.946     0.96    0.94      0.96
3                0.95      0.98    0.96      0.96
7                0.95      0.98    0.95      0.97
9                0.982     0.97    0.96      0.97
Optimal          0.99      0.99    0.99      0.99
• Average error reduction is 35%.
• “Optimal” makes best use of N-best list by using true labels.
• Indicates that even more improvement can be obtained
Information Extraction Results
Segmentation F1:

              Reinforce   Face    Reason   Constraint
Baseline        .943      .908     .929      .934
w/ Coref        .949      .914     .935      .943
Err. Reduc.     .101      .062     .090      .142
P-value         .0442     .0014    .0001     .0001
• Error reduction ranges from 6-14%.
• Small, but significant at 95% confidence level (p-value < 0.05)
Biggest limiting factor in both sets of results:
data set is small, and does not have large coreferent sets.
Outline
• Examples of IE and Data Mining
• Brief review of Conditional Random Fields
• Joint inference: Motivation and examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Efficiently training large, multi-component models:
  Piecewise Training
Piecewise Training
Piecewise Training with NOTA
Experimental Results
Named entity tagging (CoNLL-2003)
Training set = 15k newswire sentences, 9 labels

          Test F1   Training time
CRF        89.87     9 hours
MEMM       88.90     1 hour
CRF-PT     90.50     5.3 hours

stat. sig. improvement at p = 0.001
Experimental Results 2
Part-of-speech tagging (Penn Treebank, small subset)
Training set = 1154 newswire sentences, 45 labels

          Test F1   Training time
CRF        88.1      14 hours
MEMM       88.1      2 hours
CRF-PT     88.8      2.5 hours

stat. sig. improvement at p = 0.001
“Parameter Independence Diagrams”
Graphical models = formalism for representing
independence assumptions among variables.
Here we represent
independence assumptions among parameters (in factor graph)
“Parameter Independence Diagrams”
Train some in pieces, some globally
Piecewise Training Research Questions
• How to select the boundaries of “pieces”?
• What choices of limited interaction are best?
• How to sample sparse subsets of NOTA instances?
• Application to simpler models (classifiers)
• Application to more complex models (parsing)
Main Application Project:
Expertise
Cites
Research
Paper
Grant
Venue
Person
University
Groups
Summary
• Conditional Random Fields
– Conditional probability models of structured data
• Data mining complex unstructured text suggests the
need for joint inference IE -> DM.
• Early examples
  – Factorial finite state models
  – Jointly labeling distant entities
  – Coreference analysis
  – Segmentation uncertainty aiding coreference
• Piecewise Training
– Faster + higher accuracy
by making independence assumptions
End of Talk
Label Bias Problem in
Conditional Sequence Models
• Example (after Bottou ‘91):
           r → o → b   "rob"
  start <
           r → i → b   "rib"
• Bias toward states with few “siblings”.
• Per-state normalization in MEMMs does not allow “probability
mass” to transfer from one branch to the other.
Proposed Solutions
• Determinization:
  – not always possible
  – state-space explosion
    (start → r, then branch:  o → b "rob"  /  i → b "rib")
• Use fully-connected models:
– lacks prior structural knowledge.
• Our solution: Conditional random fields (CRFs):
– Probabilistic conditional models generalizing MEMMs.
– Allow some transitions to vote more strongly than others in
computing state sequence probability.
– Whole sequence rather than per-state normalization.
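The contrast can be sketched numerically. In the locally normalized (MEMM-style) model, each middle state has a single successor, so its transition probability is 1 regardless of the observed character: the branch chosen after 'r' is frozen. In a globally normalized (CRF-style) model, the middle transition can vote based on the observation. The branch priors and feature weight below are invented for illustration:

```python
import math

# Per-state normalization: after the shared 'r', each middle state has
# exactly one successor, so the rest of the path has probability 1
# whatever is observed. The branch decision rests on the prior alone.
def memm_prob(branch_prior):
    return branch_prior * 1.0 * 1.0   # middle and final steps ignore evidence

# Global normalization: each branch scores exp(weight) when the observed
# middle character matches the branch ('o' for "rob", 'i' for "rib").
def crf_probs(word, branches, weight=2.0):
    scores = {name: prior * math.exp(weight if word[1] == mid else 0.0)
              for name, (prior, mid) in branches.items()}
    Z = sum(scores.values())
    return {name: s / Z for name, s in scores.items()}

branches = {"rob": (0.6, "o"), "rib": (0.4, "i")}

# observed word is "rib", but the MEMM's prior favors the "rob" branch:
memm_says_rob = memm_prob(0.6) > memm_prob(0.4)   # label bias in action
crf = crf_probs("rib", branches)                  # here "rib" wins
```

The whole-sequence normalization lets probability mass transfer between the branches when the 'i' is observed, which is exactly what the per-state normalization forbids.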