Automatically Building Special Purpose Search Engines
with Unified Models of Information Extraction and Data Mining,
with Application to Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst
Joint work with David Jensen
Knowledge Discovery and Dissemination (KDD) Conference
September 2004
Intelligence Technology Innovation Center
ITIC
Goal:
Improve the state-of-the-art in our ability
to mine actionable knowledge
from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
Employer: foodscience.com
JobTitle: Ice Cream Guru
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Data Mining the Extracted Job Information
IE from
Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents
several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
Traditional Pipeline
Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)
Problem:
Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (Discover patterns: entity types, links / relations, events) → Actionable knowledge
Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
1) KD begins from a populated DB, unaware of where
the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and
regularities in the DB.
The accuracy of both suffers, and significant mining
of complex text sources is beyond reach.
Solution:
Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)
Uncertainty Info flows forward from IE into Data Mining; Emerging Patterns flow back from Data Mining into IE.
Research & Approach: Unified Model
Document collection → Spider → Filter → a single Probabilistic Model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)
Conditionally-trained undirected graphical models:
- Conditional Random Fields [Lafferty, McCallum, Pereira]
- Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…], …
Complex inference and learning: just what we researchers like to sink our teeth into!
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
– Formally justified as marginalization in graphical models
– Applications to new word discovery in Chinese word segmentation,
and correction propagation in interactive IE
• Joint inference, with efficient methods
– Multiple, cascaded label sequences (Factorial CRFs)
– Multiple distant, but related mentions (Skip-chain CRFs)
– Multiple co-reference decisions (Affinity Matrix CRF)
– Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
– Social network analysis from Email and the Web
– A new portal: research, people, connections.
Types of Uncertainty in
Knowledge Discovery from Text
• Confidence that extractor correctly obtained
statements the author intended.
• Confidence that what was written is truthful
– Author could have had misconceptions.
– …or have been purposefully trying to mislead.
• Confidence that the emerging, discovered
pattern is a reliable fact or generalization.
1. Labeling Sequence Data
Linear-chain CRFs
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model,
trained to maximize conditional probability of outputs given inputs
Finite state model
Graphical model
Output sequence (FSM states): y_{t-1} = OTHER, y_t = PERSON, y_{t+1} = PERSON, y_{t+2} = ORG, y_{t+3} = TITLE, …
Input sequence (observations): x_{t-1} = said, x_t = Arden, x_{t+1} = Bement, x_{t+2} = NSF, x_{t+3} = Director, …

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \psi_y(y_t, y_{t-1}) \, \psi_{xy}(x_t, y_t),
where \psi(\cdot) = \exp\big( \sum_k \lambda_k f_k(\cdot) \big)
Segmenting tables in textual gov’t reports, 85% reduction in error over HMMs.
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04],…
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ‘04]
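To make the factorized form above concrete, here is a minimal NumPy sketch (my illustration, not the implementation behind these results): transition potentials ψ_y and emission potentials ψ_xy are toy score tables, and Z(x) is computed with the forward algorithm. The label set, vocabulary, and potentials are hypothetical.

```python
import numpy as np

# Hypothetical label and vocabulary sets, for illustration only.
labels = ["OTHER", "PERSON", "ORG", "TITLE"]
vocab = ["said", "Arden", "Bement", "NSF", "Director"]
L, V = len(labels), len(vocab)

rng = np.random.default_rng(0)
log_trans = rng.normal(size=(L, L))   # log psi_y(y_t, y_{t-1})
log_emit = rng.normal(size=(L, V))    # log psi_xy(x_t, y_t)

def log_score(x, y):
    """Unnormalized log score: sum_t [log psi_y + log psi_xy] (initial transition omitted)."""
    s = log_emit[y[0], x[0]]
    for t in range(1, len(x)):
        s += log_trans[y[t], y[t - 1]] + log_emit[y[t], x[t]]
    return s

def log_partition(x):
    """log Z(x) by the forward algorithm: sum over all label paths."""
    alpha = log_emit[:, x[0]].copy()
    for t in range(1, len(x)):
        alpha = log_emit[:, x[t]] + np.logaddexp.reduce(alpha[None, :] + log_trans, axis=1)
    return np.logaddexp.reduce(alpha)

x = [vocab.index(w) for w in ["said", "Arden", "Bement", "NSF", "Director"]]
y = [labels.index(l) for l in ["OTHER", "PERSON", "PERSON", "ORG", "TITLE"]]
print("p(y|x) =", np.exp(log_score(x, y) - log_partition(x)))
```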
Confidence Estimation in
Linear-chain CRFs
[Culotta, McCallum 2004]
Finite State Lattice
Output sequence: y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, …
Lattice of FSM states at each position: {OTHER, PERSON, ORG, TITLE, …}
Input sequence (observations): x_{t-1} = said, x_t = Arden, x_{t+1} = Bement, x_{t+2} = NSF, x_{t+3} = Director, …

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \psi_y(y_t, y_{t-1}) \, \psi_{xy}(x_t, y_t)
Confidence Estimation in
Linear-chain CRFs
[Culotta, McCallum 2004]
Constrained Forward-Backward
Same lattice of FSM states over the observations (x_{t-1} = said, x_t = Arden, x_{t+1} = Bement, x_{t+2} = NSF, x_{t+3} = Director, …), but the sum is restricted to the set C of label paths consistent with the extracted field:

p(\text{Arden Bement} \in \text{PERSON} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y} \in C} \prod_{t=1}^{T} \psi_y(y_t, y_{t-1}) \, \psi_{xy}(x_t, y_t)
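A minimal sketch of the constrained forward pass (my re-implementation under toy potentials, not the paper's code): the sum is restricted to paths that give a chosen span a chosen label, and the ratio to the unconstrained Z(x) is the confidence. Label indices and potentials below are hypothetical.

```python
import numpy as np

def log_forward(log_trans, log_emit_seq, allowed=None):
    """log of the summed path score; `allowed[t]` optionally restricts the labels at step t."""
    T, L = log_emit_seq.shape
    mask = np.zeros((T, L))
    if allowed is not None:
        for t, labs in allowed.items():
            mask[t, :] = -np.inf
            mask[t, list(labs)] = 0.0
    alpha = log_emit_seq[0] + mask[0]
    for t in range(1, T):
        alpha = log_emit_seq[t] + mask[t] + np.logaddexp.reduce(alpha[None, :] + log_trans, axis=1)
    return np.logaddexp.reduce(alpha)

# Confidence that positions 1-2 ("Arden Bement") are PERSON, under random toy potentials.
rng = np.random.default_rng(0)
L, T = 4, 5                      # labels: 0=OTHER, 1=PERSON, 2=ORG, 3=TITLE (hypothetical)
log_trans = rng.normal(size=(L, L))
log_emit_seq = rng.normal(size=(T, L))
PERSON = 1
conf = np.exp(log_forward(log_trans, log_emit_seq, allowed={1: [PERSON], 2: [PERSON]})
              - log_forward(log_trans, log_emit_seq))
print("confidence:", conf)
```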
Forward-Backward Confidence Estimation
improves accuracy/coverage
[Accuracy/coverage curves: constrained forward-backward confidence vs. traditional token-wise confidence vs. no use of confidence.]
Confidence Estimation Applied
• New word discovery in Chinese word segmentation
[Peng, Feng, McCallum, COLING 2004]
– Improves segmentation accuracy by ~25%
• Highlighting fields for Interactive Information Extraction
[Kristjansson, Culotta, Viola, McCallum, AAAI 2004] (Honorable Mention Award)
– After the user fixes the least confident field, constrained Viterbi automatically reduces error by another 23% (see the sketch below).
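Correction propagation can be sketched as constrained Viterbi: after the user fixes one field, re-decode the sequence with that label clamped, so the correction propagates to neighboring labels. This is an illustrative re-implementation under toy potentials, not the system's code; positions and labels are hypothetical.

```python
import numpy as np

def constrained_viterbi(log_trans, log_emit_seq, clamped=None):
    """Most probable label sequence; `clamped` maps positions to labels fixed by the user."""
    T, L = log_emit_seq.shape
    clamped = clamped or {}
    big_neg = -1e9
    delta = log_emit_seq[0].copy()
    back = np.zeros((T, L), dtype=int)
    if 0 in clamped:
        keep = clamped[0]
        delta = np.full(L, big_neg); delta[keep] = log_emit_seq[0, keep]
    for t in range(1, T):
        scores = delta[None, :] + log_trans          # [current label, previous label]
        back[t] = scores.argmax(axis=1)
        delta = log_emit_seq[t] + scores.max(axis=1)
        if t in clamped:                             # clamp the user-corrected position
            keep = clamped[t]
            masked = np.full(L, big_neg); masked[keep] = delta[keep]
            delta = masked
    y = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

rng = np.random.default_rng(1)
log_trans, log_emit_seq = rng.normal(size=(4, 4)), rng.normal(size=(6, 4))
print("before correction:", constrained_viterbi(log_trans, log_emit_seq))
print("after clamping position 2 to label 3:", constrained_viterbi(log_trans, log_emit_seq, {2: 3}))
```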
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
– Formally justified as marginalization in graphical models
– Applications to new word discovery in Chinese word segmentation,
and correction propagation in interactive IE
• Joint inference, with efficient methods
– Multiple, cascaded label sequences (Factorial CRFs)
– Multiple distant, but related mentions (Skip-chain CRFs)
– Multiple co-reference decisions (Affinity Matrix CRF)
– Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
– Social network analysis from Email and the Web
– A new portal: research, people, connections.
1. Jointly labeling cascaded sequences
Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
Layers: Named-entity tag / Noun-phrase boundaries / Part-of-speech / English words
In a cascade, errors propagate: every stage must be nearly perfect to do well. A Factorial CRF instead predicts the layers jointly (structure sketched below).
Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.
Inference: Tree reparameterization BP [Wainwright et al, 2002]
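A minimal structural sketch of the factorial-CRF graph (illustrative only, not the ICML 2004 implementation): each layer is a chain over the same time steps, with within-chain transition factors plus cotemporal factors tying adjacent layers together, so the labels are predicted jointly rather than in a cascade.

```python
from itertools import product

def factorial_crf_factors(layers, T):
    """Return the factor scopes of a factorial CRF with the given layers over T time steps."""
    factors = []
    for layer in layers:
        for t in range(T - 1):                       # within-chain transition factors
            factors.append((f"{layer}[{t}]", f"{layer}[{t+1}]"))
    for (a, b), t in product(zip(layers[:-1], layers[1:]), range(T)):
        factors.append((f"{a}[{t}]", f"{b}[{t}]"))   # cotemporal factors linking adjacent layers
    return factors

# Two layers (e.g. part-of-speech and noun-phrase) over a 4-token sentence.
for scope in factorial_crf_factors(["POS", "NP"], T=4):
    print(scope)
```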
2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
Example: "… Senator Joe Green said today … . Green ran for …"
In an ordinary linear chain, the dependency among similar, distant mentions is ignored; a skip edge connects the two occurrences of "Green" so they are labeled jointly (see the sketch below).
14% reduction in error on the most-repeated field in email seminar announcements.
Inference: Tree reparameterization BP [Wainwright et al, 2002]
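A small sketch of how candidate skip edges might be chosen (my assumption about the heuristic, following the slide's example): connect distant positions whose tokens are identical capitalized words, so their label decisions are coupled.

```python
from itertools import combinations

def skip_edges(tokens):
    """Pairs of positions holding the same capitalized token (candidate skip edges)."""
    return [(i, j) for i, j in combinations(range(len(tokens)), 2)
            if tokens[i] == tokens[j] and tokens[i][:1].isupper()]

tokens = "Senator Joe Green said today ... Green ran for ...".split()
print(skip_edges(tokens))   # connects the two occurrences of "Green"
```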
3. Joint co-reference among all pairs
Affinity Matrix CRF (“Entity resolution”, “Object correspondence”)
Mentions “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .” are connected by pairwise Y/N coreference variables with affinity scores (45, 99, 11 on the slide's edges).
25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlational clustering / graph partitioning [Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
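The affinity-matrix view can be sketched as partitioning a graph of pairwise affinity scores: positive edges pull mentions into the same entity, negative edges push them apart. The greedy agglomerative pass below only illustrates the objective; it is not the partitioning algorithm used in the cited papers, and the affinity values are illustrative.

```python
def greedy_partition(mentions, affinity):
    """Greedily merge clusters while the best between-cluster affinity sum is positive."""
    clusters = [{m} for m in mentions]
    def score(a, b):
        return sum(affinity.get((x, y), affinity.get((y, x), 0.0)) for x in a for y in b)
    while True:
        best = max(((score(a, b), i, j)
                    for i, a in enumerate(clusters) for j, b in enumerate(clusters) if i < j),
                   default=(0.0, -1, -1))
        s, i, j = best
        if s <= 0:                      # no merge improves the objective
            return clusters
        clusters[i] |= clusters[j]
        del clusters[j]

mentions = ["Mr Powell", "Powell", "she"]
affinity = {("Mr Powell", "Powell"): 45.0, ("Powell", "she"): 11.0, ("Mr Powell", "she"): -99.0}
print(greedy_partition(mentions, affinity))   # e.g. [{"Mr Powell", "Powell"}, {"she"}]
```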
4. Joint segmentation and co-reference
Joint IE and Coreference from Research Paper Citations
Textual citation mentions
(noisy, with duplicates)
Paper database, with fields,
clean, duplicates collapsed
AUTHORS              TITLE       VENUE
Cowell, Dawid…       Probab…     Springer
Montemerlo, Thrun…   FastSLAM…   AAAI…
Kjaerulff            Approxi…    Technic…
Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

Are these two mentions coreferent? (Y / ? / N)

1) Segment citation fields
2) Resolve coreferent citations
3) Form canonical database record, resolving conflicts:

AUTHOR    = Brenda Laurel
TITLE     = Interface Agents: Metaphors with Character
PAGES     = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR    = T. Smith
PUBLISHER = Addison-Wesley
YEAR      = 1990

Perform all three steps jointly.
IE + Coreference Model

Observed citation (x): J Besag 1986 On the…
CRF segmentation (s): AUT AUT YR TITL TITL …
Citation mention attributes (c): AUTHOR = “J Besag”, YEAR = “1986”, TITLE = “On the…”

The same structure (x, s, c) is built for each citation mention, e.g. “Smyth , P Data mining…” and “Smyth . 2001 Data Mining…”.

Binary coreference variables connect each pair of mentions (here taking values y, n, n).

Research paper entity attribute nodes summarize each set of coreferent mentions, e.g. AUTHOR = “P Smyth”, YEAR = “2001”, TITLE = “Data Mining…”, …

Such a highly connected graph makes exact inference intractable, so…
Approximate Inference 1
• Loopy Belief Propagation: messages (e.g. m1(v2), m2(v1), m2(v3), m3(v2)) are passed between individual nodes v1 … v6.
• Generalized Belief Propagation: messages are passed between overlapping regions of nodes (v1 … v9 grouped into regions).
Here, a message is a conditional probability table passed among nodes or regions. But message size grows exponentially with the size of the overlap between regions!
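A compact sketch of loopy belief propagation on a pairwise model (the generic textbook scheme, not the system's inference code): each directed edge carries a message, which here is just a table over the neighbor's values, and messages are iterated until they (hopefully) settle. The model below is a toy triangle of binary variables.

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, iters=50):
    """unary[v]: (K,) log-potentials; pairwise[(u,v)]: (K,K) log-potentials; edges: list of (u,v)."""
    msgs = {(u, v): np.zeros(len(unary[v])) for (a, b) in edges for (u, v) in [(a, b), (b, a)]}
    neighbors = {v: [] for v in unary}
    for a, b in edges:
        neighbors[a].append(b); neighbors[b].append(a)
    for _ in range(iters):
        for (u, v) in list(msgs):
            pw = pairwise[(u, v)] if (u, v) in pairwise else pairwise[(v, u)].T
            incoming = unary[u] + sum(msgs[(w, u)] for w in neighbors[u] if w != v)
            m = np.logaddexp.reduce(incoming[:, None] + pw, axis=0)
            msgs[(u, v)] = m - np.logaddexp.reduce(m)       # normalize for numerical stability
    beliefs = {v: unary[v] + sum(msgs[(w, v)] for w in neighbors[v]) for v in unary}
    return {v: np.exp(b - np.logaddexp.reduce(b)) for v, b in beliefs.items()}

# Tiny loopy (triangle) example with binary variables and "agreement" potentials.
unary = {v: np.log(np.array([0.6, 0.4])) for v in "abc"}
agree = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
pairwise = {("a", "b"): agree, ("b", "c"): agree, ("a", "c"): agree}
print(loopy_bp(unary, pairwise, edges=[("a", "b"), ("b", "c"), ("a", "c")]))
```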
Approximate Inference 2
• Iterated Conditional Modes (ICM) [Besag 1986]
Hold all variables but one constant, set that variable to its most probable value given the rest, and cycle through the variables:

v_6^{(i+1)} = argmax_{v_6} P(v_6 | v \ v_6)
v_5^{(j+1)} = argmax_{v_5} P(v_5 | v \ v_5)
v_4^{(k+1)} = argmax_{v_4} P(v_4 | v \ v_4)

Structured inference scales well here, but it is greedy and easily falls into local minima.
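ICM itself is easy to sketch (a generic version of Besag's procedure over toy potentials, not the system's code): hold everything but one variable fixed, set that variable to its conditional argmax, and sweep until nothing changes.

```python
import numpy as np

def icm(unary, pairwise, edges, init=None, sweeps=20):
    """Iterated Conditional Modes on a pairwise model with log-potentials."""
    state = dict(init) if init else {v: int(np.argmax(u)) for v, u in unary.items()}
    nbrs = {v: [] for v in unary}
    for a, b in edges:
        nbrs[a].append(b); nbrs[b].append(a)
    for _ in range(sweeps):
        changed = False
        for v in unary:
            score = unary[v].copy()
            for w in nbrs[v]:
                pw = pairwise[(v, w)] if (v, w) in pairwise else pairwise[(w, v)].T
                score += pw[:, state[w]]           # condition on the neighbor's current value
            new = int(np.argmax(score))
            changed |= (new != state[v])
            state[v] = new
        if not changed:                            # local optimum reached
            return state
    return state

unary = {v: np.log(np.array([0.5, 0.5])) for v in "abc"}
agree = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
print(icm(unary, {("a", "b"): agree, ("b", "c"): agree}, edges=[("a", "b"), ("b", "c")]))
```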
• Iterated Conditional Sampling (ICS) (our name)
Instead of selecting only the argmax, keep a sample of the highest-scoring values of P(v_4 | v \ v_4), e.g. an N-best list (the top N values).
Can use a “generalized version” of this, doing exact inference on a region of several nodes at once.
Here, a “message” grows only linearly with overlap region size and N!
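The N-best variant can be sketched like this (my illustrative reading of the slide, with made-up scores): instead of committing to a single argmax per region, keep the top-N values from each upstream region and let the downstream region reason over all the combinations.

```python
import numpy as np
from itertools import product

def n_best(scores, n):
    """Indices of the top-n entries of a score vector, best first (the 'message' passed downstream)."""
    return list(np.argsort(scores)[::-1][:n])

# Toy conditional log-scores for two upstream regions (e.g. two segmentation chains).
region_a = np.log(np.array([0.05, 0.40, 0.10, 0.30, 0.15]))
region_b = np.log(np.array([0.50, 0.25, 0.25]))

# Instead of passing only the argmax (ICM), pass an N-best list from each region (ICS) ...
A, B = n_best(region_a, 3), n_best(region_b, 2)

# ... and let a downstream region (here a toy compatibility table) pick the best joint setting.
compat = np.log(np.random.default_rng(0).random((5, 3)))
best = max(product(A, B), key=lambda ab: region_a[ab[0]] + region_b[ab[1]] + compat[ab])
print("jointly best (a, b) from the N-best lists:", best)
```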
IE + Coreference Model: inference procedure
Exact inference on the linear-chain (segmentation) regions; from each chain, pass an N-best list into coreference.
(Mentions: “Smyth , P Data mining…”, “Smyth . 2001 Data Mining…”, “J Besag 1986 On the…”)
Then approximate inference over coreference by graph partitioning, integrating out the uncertainty in the samples of extraction; made to scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].
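Canopies [McCallum, Nigam, Ungar 2000] keep the pairwise coreference graph tractable at that scale: a cheap distance puts mentions into overlapping blocks, and only within-canopy pairs get the expensive comparison. A minimal sketch with a toy token-overlap distance (the distance function and thresholds are illustrative, not the paper's):

```python
def canopies(items, cheap_dist, tight, loose):
    """Overlapping canopies: items within `loose` of a center join its canopy;
    items within `tight` are removed from the pool of future centers."""
    remaining, result = list(items), []
    while remaining:
        center = remaining.pop(0)
        canopy = [center] + [x for x in remaining if cheap_dist(center, x) < loose]
        remaining = [x for x in remaining if cheap_dist(center, x) >= tight]
        result.append(canopy)
    return result

def token_jaccard_dist(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)

citations = ["J Besag 1986 On the ...", "Besag 1986 On the statistical ...",
             "Smyth 2001 Data Mining ...", "Smyth P Data mining ..."]
for c in canopies(citations, token_jaccard_dist, tight=0.3, loose=0.7):
    print(c)
```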
Then exact (exhaustive) inference over the entity attribute nodes.
Finally, revisit exact inference on the IE linear chains, now conditioned on the entity attributes.
Parameter Estimation
Parameters are estimated separately for the different regions:
- IE linear chain: exact MAP
- Coref graph edge weights: MAP on individual edges
- Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb the MAP gradient with a quasi-Newton method.
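The "climb the MAP gradient with a quasi-Newton method" step can be sketched with any off-the-shelf L-BFGS routine. Below is an illustrative stand-in that fits a tiny log-linear (logistic) model with a Gaussian prior via scipy; the real models' likelihoods and gradients differ, but the optimization pattern is the same. The data, prior variance, and model are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # toy features
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100) > 0).astype(float)
sigma2 = 10.0                                      # Gaussian prior variance (illustrative)

def neg_map(w):
    """Negative penalized log-likelihood and its gradient for a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)) + w @ w / (2 * sigma2)
    grad = X.T @ (p - y) + w / sigma2
    return nll, grad

result = minimize(neg_map, np.zeros(3), jac=True, method="L-BFGS-B")
print("MAP weights:", result.x)
```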
4. Joint segmentation and co-reference
[Wellner, McCallum, Peng, Hay, UAI 2004]
Extraction from and matching of research paper citations.

Observed citations (o):
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Model variables: segmentation (s), citation attributes (c), co-reference decisions (y), database field values, world knowledge.

35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: Variant of Iterated Conditional Modes [Besag, 1986]
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
– Formally justified as marginalization in graphical models
– Applications to new word discovery in Chinese word segmentation,
and correction propagation in interactive IE
• Joint inference, with efficient methods
– Multiple, cascaded label sequences (Factorial CRFs)
– Multiple distant, but related mentions (Skip-chain CRFs)
– Multiple co-reference decisions (Affinity Matrix CRF)
– Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
– Social network analysis from Email and the Web
– A new portal: research, people, connections.
One Application Project:
Workplace effectiveness ~ Ability to leverage network of acquaintances
“The power of your little black book”
But filling Contacts DB by hand is tedious, and incomplete.
Email Inbox + WWW → (automatically) → Contacts DB
System Overview
Email and the WWW are processed by a CRF-based pipeline:
Person Name Extraction → names → Name Coreference → Homepage Retrieval → Contact Info and Person Name Extraction / Keyword Extraction → Social Network Analysis
An Example
To: “Andrew McCallum” [email protected]
Subject ...
Search for new people.

First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, …
Key Words: Information extraction, social network, …
Summary of Results

Example keywords extracted:
Person              Keywords
William Cohen       Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller       Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness  Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell        Machine learning, Cognitive states, Learning apprentice, Artificial intelligence

Contact info and name extraction performance (25 fields):
        Token Acc   Field Prec   Field Recall   Field F1
CRF     94.50       85.73        76.33          80.76
Expert Finding:
When solving some task, find friends-of-friends with relevant expertise.
Avoid “stove-piping” in large org’s by automatically suggesting collaborators.
Given a task, automatically suggest the right team for the job. (Hiring aid!)
Social Network Analysis:
Understand the social structure of your organization.
Suggest structural changes for improved efficiency.
Main Application Project:
Entity-relation schema over the research literature: Research Paper (linked by Cites), Person, Expertise, Grant, Venue, University, Groups.
Status:
• Spider running. Over 1.5M PDFs in hand.
• Best-in-world published results in IE from
research paper headers and references.
• First version of multi-entity co-reference running.
• First version of Web servlet interface up.
• Well-engineered: Java, servlets, SQL, Lucene,
SOAP, etc.
• Public launch this Fall.
Software Infrastructure
MALLET:
Machine Learning for Language Toolkit
• ~80k lines of Java
• Document classification, information extraction, clustering, coreference, POS tagging, shallow parsing, relational classification, …
• New package: Graphical models and modern inference methods.
– Variational, Tree-reparameterization, Stochastic sampling, contrastive
divergence,…
• New documentation and interfaces.
• Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and hundreds of thousands of training examples, as needed for NLP.
Released as Open Source Software.
http://mallet.cs.umass.edu
In use at UMass, MIT, CMU,
Stanford, Berkeley, UPenn,
UT Austin, Purdue…
Publications and Contact Info
• Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum
and Ben Wellner. Neural Information Processing Systems (NIPS), 2004.
• An Integrated, Conditional Model of Information Extraction and Coreference with Application to
Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. Conference on
Uncertainty in Artificial Intelligence (UAI), 2004.
• Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton
and Andrew McCallum. ICML workshop on Statistical Relational Learning, 2004.
• Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron
Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS) 2004.
• Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting
Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML 2004.
• Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. AAAI 2004.
(Winner of Honorable Mention Award.)
• Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun
Peng and Andrew McCallum. HLT-NAACL, 2004.
• Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng,
Fangfang Feng, and Andrew McCallum. International Conference on Computational Linguistics
(COLING 2004), 2004.
• Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. HLT-NAACL, 2004.
http://www.cs.umass.edu/~mccallum
End of Talk