
Information Extraction
with Markov Random Fields
Andrew McCallum
University of Massachusetts Amherst
Joint work with
Aron Culotta, Wei Li and Ben Wellner
Bruce Croft, James Allan, David Pinto, Xing Wei, Andres Corrada-Emmanuel
David Jensen, Jen Neville,
John Lafferty (CMU), Fernando Pereira (UPenn)
IE from Cargo Container Ship Manifests
Cargo Tracking Div., US Navy

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
IE from Research Papers
[McCallum et al ‘99]
IE in Context

Pipeline: Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining (Prediction, Outlier detection, Decision support)
Supporting steps: Create ontology; Label training data → Train extraction models
What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying...

Slots to fill: NAME, TITLE, ORGANIZATION
What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.
Running IE over the same news excerpt fills the slots:
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
Applied to the same excerpt, segmentation and classification yield the labeled mentions:
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
Association then groups mentions into records, and clustering merges mentions of the same entity (marked *):
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Issues that arise

Application issues
• Directed spidering
• Page filtering
• Sequence segmentation
• Segment classification
• Record association
• Co-reference

Scientific issues
• Learning more than 100k parameters from limited and noisy training data
• Taking advantage of rich, interdependent features
• Deciding which input feature representation to use
• Efficient inference in models with many interrelated predictions
• Clustering massive data sets
Outline
✓ The need. The task.
• Review of Random Field Models for IE.
• Recent work
  – New results for Table Extraction
  – New training procedure using Feature Induction
  – New extended model for Co-reference Resolution
• Conclusions
Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...

Finite state model: states with transitions, each state emitting an observation. Graphical model: a chain S_{t-1} → S_t → S_{t+1} → ..., with each S_t generating O_t.

Generates a state sequence and an observation sequence o = o_1, o_2, ..., o_{|o|}:

P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

Parameters, for all states S = {s_1, s_2, ...}:
• Start state probabilities: P(s_1)
• Transition probabilities: P(s_t | s_{t-1})
• Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: maximize probability of training observations (w/ prior)
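A minimal sketch of this factorization in Python; the toy tables below are invented for illustration, not a trained model:

    # Joint probability of aligned state and observation sequences under an HMM:
    # P(s, o) = prod_t P(s_t | s_{t-1}) P(o_t | s_t)
    def hmm_joint_prob(states, obs, start, trans, emit):
        p = start[states[0]] * emit[(states[0], obs[0])]
        for t in range(1, len(obs)):
            p *= trans[(states[t - 1], states[t])] * emit[(states[t], obs[t])]
        return p

    # made-up parameters for a two-state extractor
    start = {"person": 0.3, "background": 0.7}
    trans = {("person", "person"): 0.5, ("person", "background"): 0.5,
             ("background", "person"): 0.2, ("background", "background"): 0.8}
    emit = {("person", "Arny"): 0.4, ("person", "spoke"): 0.1,
            ("background", "Arny"): 0.01, ("background", "spoke"): 0.3}
    print(hmm_joint_prob(["person", "background"], ["Arny", "spoke"], start, trans, emit))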
IE with Hidden Markov Models

Given a sequence of observations:
  Yesterday Arny Rosenberg spoke this example sentence.
and a trained HMM with states such as person name, location name, and background,

find the most likely state sequence (Viterbi): argmax_s P(s, o)

Any words generated by the designated "person name" state are extracted as a person name:
  Person name: Arny Rosenberg
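A self-contained sketch of Viterbi decoding for such an extractor; the states, transition table, and crude capitalization-based emission model are all toy choices of ours:

    states = ["person name", "location name", "background"]
    start = {s: 1.0 / 3 for s in states}
    trans = {(a, b): (0.5 if a == b else 0.25) for a in states for b in states}

    def emit(s, w):
        # toy emission model: "person name" is rigged to prefer capitalized words,
        # so it will also tag sentence-initial capitals like "Yesterday"
        if s == "person name":
            return 0.3 if w[0].isupper() else 0.001
        if s == "location name":
            return 0.05 if w[0].isupper() else 0.001
        return 0.001 if w[0].isupper() else 0.2

    def viterbi(words):
        # delta[s] = probability of the best state path ending in s
        delta = {s: start[s] * emit(s, words[0]) for s in states}
        backptrs = []
        for w in words[1:]:
            prev, delta, ptr = delta, {}, {}
            for s in states:
                r = max(states, key=lambda r: prev[r] * trans[(r, s)])
                delta[s], ptr[s] = prev[r] * trans[(r, s)] * emit(s, w), r
            backptrs.append(ptr)
        path = [max(delta, key=delta.get)]
        for ptr in reversed(backptrs):
            path.append(ptr[path[-1]])
        return path[::-1]

    words = "Yesterday Arny Rosenberg spoke this example sentence .".split()
    labels = viterbi(words)
    print([w for w, l in zip(words, labels) if l == "person name"])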
Regrets from Atomic View of Tokens

Would like richer representation of text: multiple overlapping features, whole chunks of text.

Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"

Example line, sentence, or paragraph features:
– length
– is centered in page
– percent of non-alphabetics
– white-space aligns with next line
– containing sentence has two verbs
– grammatically contains a question
– contains links to "authoritative" pages
– emissions that are uncountable
– features at multiple levels of granularity
Problems with Richer Representation and a Generative Model

• These arbitrary features are not independent:
  – Overlapping and long-distance dependences
  – Multiple levels of granularity (words, characters)
  – Multiple modalities (words, formatting, layout)
  – Observations from past and future
• HMMs are generative models of the text: P(s, o)
• Generative models do not easily handle these non-independent features. Two choices:
  – Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
  – Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Conditional Sequence Models

• We would prefer a conditional model: P(s|o) instead of P(s,o):
  – Can examine features, but not responsible for generating them.
  – Don't have to explicitly model their dependencies.
  – Don't "waste modeling effort" trying to generate what we are given at test time anyway.
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000], [Lafferty, McCallum, Pereira 2001]

From HMMs to MEMMs to CRFs, with s = s_1, s_2, ... s_n and o = o_1, o_2, ... o_n (each model a chain over states S_{t-1}, S_t, S_{t+1}, ... and observations O_{t-1}, O_t, O_{t+1}, ...):

HMM (a special case of MEMMs and CRFs):
P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)

MEMM:
P(s | o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}, o_t)

CRF:
P(s | o) = (1/Z_o) ∏_{t=1}^{|o|} exp( Σ_j λ_j f_j(s_t, s_{t-1}) + Σ_k μ_k g_k(s_t, o_t) )
Conditional Random Fields (CRFs)
[Lafferty, McCallum, Pereira 2001]

A chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, ... conditioned on the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.
Markov on s, conditional dependency on o.

P(s | o) = (1/Z_o) ∏_{t=1}^{|o|} exp( Σ_k λ_k f_k(s_t, s_{t-1}, o, t) )

The Hammersley-Clifford theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.

Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.

Set parameters by maximum likelihood, using an optimization method on ∂L.
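A sketch of this model in code: score a label sequence with the feature sum above, and compute Z_o by the forward recursion. The label set, feature functions, and weights are illustrative stand-ins, not the paper's:

    import math

    S = ["B", "I", "O"]  # hypothetical label set

    def features(s_t, s_prev, obs, t):
        # f_k(s_t, s_{t-1}, o, t): a few toy indicator features
        return {
            ("trans", s_prev, s_t): 1.0,
            ("cap", s_t): 1.0 if obs[t][0].isupper() else 0.0,
            ("word", obs[t], s_t): 1.0,
        }

    def local_score(lam, s_t, s_prev, obs, t):
        return sum(lam.get(k, 0.0) * v for k, v in features(s_t, s_prev, obs, t).items())

    def log_Z(lam, obs):
        # forward algorithm over log-potentials, O(|o| |S|^2)
        alpha = {s: local_score(lam, s, "<start>", obs, 0) for s in S}
        for t in range(1, len(obs)):
            alpha = {s: math.log(sum(math.exp(alpha[r] + local_score(lam, s, r, obs, t))
                                     for r in S))
                     for s in S}
        return math.log(sum(math.exp(a) for a in alpha.values()))

    def log_prob(lam, labels, obs):
        score = local_score(lam, labels[0], "<start>", obs, 0)
        score += sum(local_score(lam, labels[t], labels[t - 1], obs, t)
                     for t in range(1, len(obs)))
        return score - log_Z(lam, obs)

    lam = {("cap", "B"): 1.5, ("trans", "B", "I"): 0.8}  # made-up weights
    obs = "Arny Rosenberg spoke".split()
    print(math.exp(log_prob(lam, ["B", "I", "O"], obs)))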
General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations, so less training data is required
• Easy to incorporate domain knowledge
Main Point #1

Conditional probability sequence models give great flexibility regarding features used, and have efficient dynamic-programming-based algorithms for inference.
Outline
✓ The need. The task.
✓ Review of Random Field Models for IE.
• Recent work
  – New results for Table Extraction
  – New training procedure using Feature Induction
  – New extended model for Co-reference Resolution
• Conclusions
Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat 2/: United States, 1993-95

        Number of       Per Milk Cow          Percentage of        Total
Year    Milk Cows 1/    Milk      Milkfat     Fat in All           Milk       Milkfat
        (1,000 Head)    (Pounds)  (Pounds)    Milk Produced (%)    (Million Pounds)
1993    9,589           15,704    575         3.66                 150,582    5,514.4
1994    9,500           16,175    592         3.66                 153,664    5,623.7
1995    9,461           16,451    602         3.66                 155,644    5,694.3

1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Table Extraction from Government Reports
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

100+ documents from www.fedstats.gov. A CRF assigns one label per line of the document.

Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
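As code, several of these line features are one-liners over the raw text; function and feature names here are ours:

    import re

    def line_features(line, prev_line=""):
        n = max(len(line), 1)
        return {
            "pct_digit": sum(c.isdigit() for c in line) / n,
            "pct_alpha": sum(c.isalpha() for c in line) / n,
            "indented": line.startswith(" "),
            "five_plus_spaces": "     " in line,
            # crude alignment test: runs of whitespace start at the same columns
            "ws_aligns_prev": bool({m.start() for m in re.finditer(r"\s{2,}", line)}
                                   & {m.start() for m in re.finditer(r"\s{2,}", prev_line)}),
        }

    print(line_features("1993    9,589     15,704", "Year    Cows      Milk"))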
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method                    Line labels (% correct)   Table segments (F1)
HMM                       65 %                      64 %
Stateless MaxEnt          85 %                      -
CRF w/out conjunctions    52 %                      68 %
CRF                       95 %                      92 %
Δ error vs. HMM           85 %                      77 %
Main Point #2

Conditional Random Fields are more accurate in practice than HMMs
... on a table extraction task
... (and others)
Outline
✓ The need. The task.
✓ Review of Random Field Models for IE.
• Recent work
  ✓ New results for Table Extraction
  – New training procedure using Feature Induction
  – New extended model for Co-reference Resolution
• Conclusions
A key remaining question for CRFs, given their freedom to include the kitchen sink's worth of features, is: which features should be included?
Feature Induction for CRFs
[McCallum, 2003, UAI, submitted]

1. Begin with knowledge of atomic features, but no features yet in the model.
2. Consider many candidate features, including atomic features and conjunctions.
3. Evaluate each candidate feature.
4. Add to the model some of those ranked highest.
5. Train the model.
Candidate Feature Evaluation
[McCallum, 2003, UAI, submitted]

Common method: Information Gain
InfoGain(C, F) = H(C) − Σ_{f∈F} P(f) H(C | f)

True optimization criterion: likelihood of training data
LikelihoodGain(f, λ) = L_{+f} − L

The technical meat is in how to calculate this efficiently for CRFs (see the sketch below):
• Mean field approximation
• Emphasize error instances (related to Boosting)
• Newton's method to set λ
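A runnable toy of the induction loop, shown with a stateless maxent classifier standing in for a full CRF so it fits here; each candidate is scored by retraining with that one feature added and measuring the gain in training log-likelihood (the mean-field and Newton shortcuts above exist precisely to avoid such retraining):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train(feats, data, iters=300, lr=0.5):
        # maxent / logistic regression fit by gradient ascent on log-likelihood
        w = dict.fromkeys(feats, 0.0)
        for _ in range(iters):
            for x, y in data:
                p = sigmoid(sum(w[f] for f in feats if f in x))
                for f in feats & x:
                    w[f] += lr * (y - p)
        return w

    def loglik(w, data):
        return sum(math.log(p if y else 1 - p)
                   for x, y in data
                   for p in [sigmoid(sum(w.get(f, 0.0) for f in x))])

    # toy data: x = set of atomic features present, y = label
    data = [({"cap", "the"}, 1), ({"cap"}, 1), ({"the"}, 0), (set(), 0)]
    candidates = {"cap", "the"}   # atomic candidates; conjunctions would be added here
    model = set()                 # no features in the model yet

    for _ in range(2):            # induction rounds
        base = loglik(train(model, data), data)
        gains = {f: loglik(train(model | {f}, data), data) - base
                 for f in candidates - model}
        model.add(max(gains, key=gains.get))
    print(model)  # "cap" is added first: it best separates the toy labels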
Named Entity Recognition

CRICKET MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract.
Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels and examples:
PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Cleveland
LOC: Cleveland, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally
Automatically Induced Features
[McCallum & Li, 2003, CoNLL]

Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (first-mention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474    word=the (o_{t-2}) & word=of (o_t)
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]

Method                                                F1     # features
CRFs w/out Feature Induction                          75%    ~1M
CRFs with Feature Induction based on LikelihoodGain   89%    ~60k
Main Point #3

One of CRFs' chief challenges---choosing the feature set---can also be addressed in a formal way, based on maximum likelihood.

For CRFs (and many other methods), Information Gain is not the appropriate metric for selecting features.
Outline
✓ The need. The task.
✓ Review of Random Field Models for IE.
• Recent work
  ✓ New results for Table Extraction
  ✓ New training procedure using Feature Induction
  – New extended model for Co-reference Resolution
• Conclusions
IE in Context

Pipeline: Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining (Prediction, Outlier detection, Decision support)
Supporting steps: Create ontology; Label training data → Train extraction models
Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: a news article with named-entity "mentions" tagged:
Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .

Output: number of entities, N = 3
#1: Secretary of State Colin Powell; he; Mr. Powell; Powell
#2: Condoleezza Rice; she; Rice
#3: President Bush; Bush
Inside the Traditional Solution

Pair-wise Affinity Metric: Mention (3) ". . . Mr Powell . . ." vs. Mention (4) ". . . Powell . . ." → Y/N?

Fires  Feature                                            Weight
N      Two words in common                                 29
Y      One word in common                                  13
Y      "Normalized" mentions are string identical          39
Y      Capitalized word in common                          17
Y      > 50% character tri-gram overlap                    19
N      < 25% character tri-gram overlap                   -34
Y      In same sentence                                     9
Y      Within two sentences                                 8
N      Further than 3 sentences apart                      -1
Y      "Hobbs Distance" < 3                                11
N      Number of entities in between two mentions = 0      12
N      Number of entities in between two mentions > 4      -3
Y      Font matches                                         1
Y      Default                                            -19

OVERALL SCORE = 98  > threshold = 0
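A sketch of such a hand-built affinity score, using a simplified subset of the tests above (no real Hobbs distance, sentence positions, or font information here):

    # Sum the weights of the tests that fire for a mention pair; merge if > 0.
    def trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def affinity(m1, m2):
        w1, w2 = set(m1.split()), set(m2.split())
        tri = len(trigrams(m1) & trigrams(m2)) / max(len(trigrams(m1) | trigrams(m2)), 1)
        score = -19                                        # "Default"
        if len(w1 & w2) >= 2: score += 29                  # two words in common
        if len(w1 & w2) == 1: score += 13                  # one word in common
        if any(w[0].isupper() and w in w2 for w in w1):
            score += 17                                    # capitalized word in common
        if tri > 0.50: score += 19                         # > 50% tri-gram overlap
        if tri < 0.25: score -= 34                         # < 25% tri-gram overlap
        return score

    s = affinity("Mr Powell", "Powell")
    print(s, "-> merge" if s > 0 else "-> keep apart")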
The Problem

. . . Mr Powell . . . ↔ . . . Powell . . . : affinity = 98 → Y
. . . Powell . . . ↔ . . . she . . . : affinity = 11 → Y
. . . Mr Powell . . . ↔ . . . she . . . : affinity = −104 → N

Pair-wise merging decisions are being made independently from each other.
Affinity measures are noisy and imperfect.
They should be made in relational dependence with each other.
A UC Berkeley Solution
[Russell 2001], [Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision)

A generative model: N entities, each with an id and attributes (surname, gender, age, ...); mentions carry observed context words, distance, fonts, ...

Problems:
1) Generative model makes it difficult to use complex features.
2) Number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify the model structure during inference---complicated.
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML submitted]

Mentions: . . . Mr Powell . . . , . . . Powell . . . , . . . she . . .
Edge weights: (Mr Powell, Powell) = 45; (Mr Powell, she) = −30; (Powell, she) = 11. Each edge carries a Y/N coreference decision y_ij.

Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.

P(y | x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ′ f′(y_ij, y_jk, y_ik) )
A Markov Random Field for Co-reference (MRF)
[McCallum & Wellner, 2003, ICML submitted]

Example configurations, with the edge weights (45), (−30), (11) from above:
• (Mr Powell, Powell) = N, (Mr Powell, she) = N, (Powell, she) = Y → score = 11 − (45 − 30) = −4
• (Mr Powell, Powell) = Y, (Mr Powell, she) = N, (Powell, she) = Y → inconsistent triangle: score = −∞
• (Mr Powell, Powell) = Y, (Mr Powell, she) = N, (Powell, she) = N → score = 45 − (−30 + 11) = 64

P(y | x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ′ f′(y_ij, y_jk, y_ik) )
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Mentions: . . . Mr Powell . . . , . . . Powell . . . , . . . Condoleezza Rice . . . , . . . she . . .
Edge weights: (Mr Powell, Powell) = 45; (Mr Powell, Condoleezza Rice) = −106; (Mr Powell, she) = −30; (Condoleezza Rice, she) = 134; (Powell, she) = 11; (Powell, Condoleezza Rice) = −10.

log P(y | x) ∝ Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) = Σ_{i,j w/in partitions} w_ij − Σ_{i,j across partitions} w_ij
Example: partitioning {Mr Powell} {Powell, she} {Condoleezza Rice} scores 11 − 33 = −22.
Example: partitioning {Mr Powell, Powell} {Condoleezza Rice, she} scores 179 − (−135) = 314.
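A small check of this objective in code, using the signed edge weights recovered from the figure:

    # Score a candidate partitioning: sum of within-partition edge weights
    # minus sum of across-partition edge weights.
    weights = {
        ("Mr Powell", "Powell"): 45,
        ("Mr Powell", "Condoleezza Rice"): -106,
        ("Mr Powell", "she"): -30,
        ("Condoleezza Rice", "she"): 134,
        ("Powell", "she"): 11,
        ("Powell", "Condoleezza Rice"): -10,
    }

    def score(partitions):
        cluster = {m: i for i, part in enumerate(partitions) for m in part}
        within = sum(w for (a, b), w in weights.items() if cluster[a] == cluster[b])
        across = sum(w for (a, b), w in weights.items() if cluster[a] != cluster[b])
        return within - across

    print(score([{"Mr Powell"}, {"Powell", "she"}, {"Condoleezza Rice"}]))  # -22
    print(score([{"Mr Powell", "Powell"}, {"Condoleezza Rice", "she"}]))    # 314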
Markov Random Fields for Co-reference

• Train by maximum likelihood
  – Can calculate the gradient by Gibbs sampling, or approximate by stochastic gradient ascent (e.g. voted perceptron).
  – Given labeled training data in which the partitions are given, learn an affinity measure for which graph partitioning will reproduce those partitions.
• In need of better algorithms for graph partitioning
  – Standard algorithms (e.g. Fiduccia-Mattheyses) do not apply with negative edge weights.
  – Currently beginning to use "Typical Cut" [Gdalyahu, Weinshall, Werman, 1999]---a greedy, randomized algorithm.
Co-reference Experimental Results
[McCallum & Wellner, 2003, ICML submitted]

Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories
Method                     Partition F1         Pair F1
Single-link threshold      16 %                 18 %
Best prev match [Morton]   83 %                 89 %
MRFs                       88 % (Δerror=30%)    92 % (Δerror=28%)

DARPA MUC-6 newswire article corpus, 30 stories
Method                     Partition F1         Pair F1
Single-link threshold      11 %                 7 %
Best prev match [Morton]   70 %                 76 %
MRFs                       74 % (Δerror=13%)    80 % (Δerror=17%)
Outline
✓ The need. The task.
✓ Review of Random Field Models for IE.
✓ Recent work
  ✓ New results for Table Extraction
  ✓ New training procedure using Feature Induction
  ✓ New extended model for Co-reference Resolution
• Conclusions
MEMM & CRF Related Work

• Maximum entropy for language tasks:
  – Language modeling [Rosenfeld '94, Chen & Rosenfeld '99]
  – Part-of-speech tagging [Ratnaparkhi '98]
  – Segmentation [Beeferman, Berger & Lafferty '99]
  – Named entity recognition "MENE" [Borthwick, Grishman, … '98]
• HMMs for similar language tasks
  – Part of speech tagging [Kupiec '92]
  – Named entity recognition [Bikel et al '99]
  – Other Information Extraction [Leek '97], [Freitag & McCallum '99]
• Serial Generative/Discriminative Approaches
  – Speech recognition [Schwartz & Austin '93]
  – Reranking Parses [Collins '00]
• Other conditional Markov models
  – Non-probabilistic local decision models [Brill '95], [Roth '98]
  – Gradient-descent on state path [LeCun et al '98]
  – Markov Processes on Curves (MPCs) [Saul & Rahim '99]
  – Voted Perceptron-trained FSMs [Collins '02]
Conclusions

• Conditional Markov Random Fields combine the benefits of
  – Conditional probability models (arbitrary features)
  – Markov models (for sequences or other relations)
• Success in
  – Feature induction
  – Coreference analysis
• Future work:
  – Factorial CRFs, structure learning, semi-supervised learning.
  – Tight integration of IE and Data Mining

End of talk
Voted Perceptron Sequence Models
[Collins 2002]

Like CRFs with stochastic gradient ascent and a Viterbi approximation.

Given training data: { (o^(i), s^(i)) }
Initialize parameters to zero: λ_k = 0 for all k
Iterate to convergence:
  for all training instances i:
    s_Viterbi = argmax_s ∏_t exp( Σ_k λ_k f_k(s_t, s_{t-1}, o, t) )
    for all k: λ_k ← λ_k + C_k(s^(i), o^(i)) − C_k(s_Viterbi, o^(i))
where C_k(s, o) = Σ_t f_k(s_{t-1}, s_t, o, t), as before.

The update is analogous to the gradient for this one training instance.

Avoids calculating the partition function (normalizer) Z_o, but is gradient ascent, not a 2nd-order or conjugate-gradient method.
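A sketch of the update step in code (the Viterbi decoder is assumed; Collins' "voted" variant additionally keeps and votes over all intermediate weight vectors, which is omitted here):

    from collections import Counter

    def counts(labels, obs, feature_fn):
        # C_k(s, o) = sum_t f_k(s_{t-1}, s_t, o, t)
        c, prev = Counter(), "<start>"
        for t, s in enumerate(labels):
            c.update(feature_fn(prev, s, obs, t))
            prev = s
        return c

    def perceptron_update(lam, gold, predicted, obs, feature_fn):
        # add feature counts of the true path, subtract those of the Viterbi path
        diff = counts(gold, obs, feature_fn)
        diff.subtract(counts(predicted, obs, feature_fn))
        for k, v in diff.items():
            if v:
                lam[k] = lam.get(k, 0.0) + v
        return lam

    # toy indicator features, invented for illustration
    def feats(prev, s, obs, t):
        return [("trans", prev, s), ("emit", obs[t], s)]

    lam = perceptron_update({}, ["B", "O"], ["O", "O"], ["Powell", "spoke"], feats)
    print(lam)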
Part-of-speech Tagging
[Pereira 2001, personal comm.]

45 tags, 1M words training data, Penn Treebank

The/DT asbestos/NN fiber/NN ,/, crocidolite/NN ,/, is/VBZ unusually/RB resilient/JJ once/IN it/PRP enters/VBZ the/DT lungs/NNS ,/, with/IN even/RB brief/JJ exposures/NNS to/TO it/PRP causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ ,/, researchers/NNS said/VBD ./.

Method                          error    Δ err   oov error   Δ err
HMM                             5.69%            45.99%
CRF                             5.55%            48.05%
CRF, using spelling features*   4.27%   -24%     23.76%      -50%

* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
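The starred spelling features, written as a toy token predicate (names are ours):

    def spelling_features(w):
        suffixes = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")
        f = {"word=" + w: True,
             "capitalized": w[:1].isupper(),
             "begins_with_#": w.startswith("#"),
             "contains_hyphen": "-" in w}
        f.update({"ends" + s: w.endswith(s[1:]) for s in suffixes})
        return {k for k, v in f.items() if v}

    # -> {'word=Resiliently', 'capitalized', 'ends-ly'} (set order may vary)
    print(spelling_features("Resiliently"))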
Big-Picture Future Directions

• Unification of IE, data mining, decision making
  – Build large, working systems to test these ideas
  – Bring CiteSeer to UMass; build a super-replacement
• Configuring & training complex systems with less human effort
  – Resource-bounded human/machine interaction
  – Learning from labeled and unlabeled data, semi-supervised learning
• Combinations of text with other modalities
  – images, video, sound
  – robotics, state estimation, model-building
  – bioinformatics, security ...
• Use of IE-inspired models for applications in
  – networking, compilers, security, bio-sequences, ...
IE from ...

• SEC Filings
  – Dale Kirkland (NASD), charged with detecting securities trading fraud
• Open Source software
  – Lee Giles (Penn State), building a system that supports code sharing, interoperability, and assessment of reliability
• Biology and Bioinformatics Research Pubs
  – Karen Duca (Virginia Bioinformatics Institute), part of a large team trying to assemble facts that will help guide future experiments
• Web pages about researchers, groups, pubs, conferences and grants
  – Steve Mahaney (NSF), charged with a Congressionally-mandated effort by NSF to model the research impact of its grants