Scalable Information Extraction and Integration

Transcript Scalable Information Extraction and Integration

Towards Web-Scale
Information Extraction
Eugene Agichtein
Mathematics & Computer Science
Emory University
[email protected]
http://www.mathcs.emory.edu/~eugene/
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
1
The Value of Text Data

“Unstructured” text data is the primary form of human-generated
information
 Blogs, web pages, news, scientific literature, online reviews, …
 Semi-structured data (database generated): see Prof. Bing Liu’s
KDD webinar: http://www.cs.uic.edu/~liub/WCM-Refs.html
 The techniques discussed here are complimentary to structured
object extraction methods

Need to extract structured information to effectively manage,
search, and mine the data

Information Extraction: mature, but active research area
 Intersection of Computational Linguistics, Machine Learning,
Data mining, Databases, and Information Retrieval
 Traditional focus on accuracy of extraction
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
2
Example: Answering Queries Over Text
For years, Microsoft
Corporation CEO Bill
Gates was against open
source. But today he
appears to have changed
his mind. "We can be
open source. We love the
concept of shared
source," said Bill Veghte,
a Microsoft VP. "That's a
super-important shift for
us in terms of code
access.“
Richard Stallman,
founder of the Free
Software Foundation,
countered saying…
Eugene Agichtein
Select Name
From PEOPLE
Where Organization = ‘Microsoft’
PEOPLE
Name
Bill Gates
Bill Veghte
Richard Stallman
Title
Organization
CEO
Microsoft
VP
Microsoft
Founder Free Soft..
Bill Gates
Bill Veghte
(from William Cohen’s IE tutorial, 2003)
KDD Webinar: Towards Web-Scale Information Extraction
3
Outline

Information Extraction Tasks




Entity tagging
Relation extraction
Event extraction
Scaling up Information Extraction


Focus on scaling up to large collections (where
data mining can be most beneficial)
Other dimensions of scalability
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
4
Information Extraction Tasks

Extracting entities and relations: this talk
 Entities: named (e.g., Person) and generic (e.g., disease name)
 Relations: entities related in a predefined way (e.g., Location of a
Disease outbreak, or a CEO of a Company)
 Events: can be composed from multiple relation tuples

Common extraction subtasks:
 Preprocess: sentence chunking, syntactic parsing, morphological
analysis
 Create rules or extraction patterns: hand-coded, machine
learning, and hybrid
 Apply extraction patterns or rules to extract new information
 Postprocess and integrate information

Co-reference resolution, deduplication, disambiguation
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
5
Entity Tagging

Identifying mentions of entities (e.g., person names, locations,
companies) in text
 MUC (1997): Person, Location, Organization, Date/Time/Currency
 ACE (2005): more than 100 more specific types

Hand-coded vs. Machine Learning approaches

Best approach depends on entity type and domain:
 Closed class (e.g., geographical locations, disease names, gene &
protein names): hand coded + dictionaries
 Syntactic (e.g., phone numbers, zip codes): regular expressions
 Semantic (e.g., person and company names): mixture of context,
syntactic features, dictionaries, heuristics, etc.
 “Almost solved” for common/typical entity types
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
6
Example: Extracting Entities from Text

Useful for data warehousing, data cleaning, web
data integration
Address
Citation
House
number
Building
Road
City
State
Zip
4089 Whispering Pines Nobel Drive San Diego CA 92122 1
Ronald Fagin, Combining Fuzzy Information from Multiple
Systems, Proc. of ACM SIGMOD, 2002
Segment(si)
Sequence
Label(si)
S1
Ronald Fagin
Author
S2
Combining Fuzzy Information from Multiple Systems
Title
S3
Proc. of ACM SIGMOD
Conference
S4
2002
Year
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
7
Hand-Coded Methods


Easy to construct in some cases
 e.g., to recognize prices, phone numbers, zip codes,
conference names, etc.
Intuitive to debug and maintain
 Especially if written in a “high-level” language:
ContactPattern  RegularExpression(Email.body,”can be reached at”)
[IBM Avatar]


Can incorporate domain knowledge
Scalability issues:




Labor-intensive to create
Highly domain-specific
Often corpus-specific
Rule-matches can be expensive
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
8
Machine Learning Methods

Can work well when lots of training data easy to construct

Can capture complex patterns that are hard to encode with
hand-crafted rules



e.g., determine whether a review is positive or negative
extract long complex gene names
Non-local dependencies
The human T cell leukemia lymphotropic virus type 1 Tax protein
represses MyoD-dependent transcription by inhibiting MyoDbinding to the KIX domain of p300.“
[From AliBaba]
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
9
Representation Models [Cohen and McCallum, 2003]
Classify Pre-segmented
Candidates
Lexicons
Abraham Lincoln was born in Kentucky.
Abraham Lincoln was born in Kentucky.
member?
Alabama
Alaska
…
Wisconsin
Wyoming
Boundary Models
Abraham Lincoln was born in Kentucky.
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
Classifier
which class?
which class?
Try alternate
window sizes:
Finite State Machines
Abraham Lincoln was born in Kentucky.
Context Free Grammars
Abraham Lincoln was born in Kentucky.
BEGIN
Most likely state sequence?
NNP
NNP
V
V
P
Classifier
PP
which class?
VP
NP
BEGIN
END
BEGIN
NP
END
VP
S
…and beyond
Any of these models can be used to capture words, formatting or both.
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
10
Popular Machine Learning Methods
For details: [Feldman, 2006 and Cohen, 2004]

Naive Bayes
SRV [Freitag 1998], Inductive Logic Programming
Rapier [Califf and Mooney 1997]
Hidden Markov Models [Leek 1997]
Maximum Entropy Markov Models [McCallum et al. 2000]
Conditional Random Fields [Lafferty et al. 2001]

Scalability







Can be labor intensive to construct training data
At run time, complex features can be expensive to construct or
process (batch algorithms can help: [Chandel et al. 2006] )
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
11
Some Available Entity Taggers

ABNER:



Alias-I LingPipe



http://mallet.cs.umass.edu/index.php/Main_Page
Collection of NLP and ML tools, can be trained for name entity tagging
MinorThird:



http://www.alias-i.com/lingpipe/
MALLET:


http://www.cs.wisc.edu/~bsettles/abner/
Linear-chain conditional random fields (CRFs) with orthographic and contextual
features.
http://minorthird.sourceforge.net/
Tools for learning to extract entities, categorization, and some visualization
Stanford Named Entity Recognizer:


http://nlp.stanford.edu/software/CRF-NER.shtml
CRF-based entity tagger with non-local features
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
12
Alias-I LingPipe ( http://www.alias-i.com/lingpipe/ )

Statistical named entity tagger

Generative statistical model



Find most likely tags given lexical and linguistic features
Accuracy at (or near) state of the art on benchmark tasks
Explicitly targets scalability:



~100K tokens/second runtime on single PC
Pipelined extraction of entities
User-defined mentions, pronouns and stop list


Specified in a dictionary, left-to-right, longest match
Can be trained/bootstrapped on annotated corpora
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
13
Outline

Overview of Information Extraction




Entity tagging
Relation extraction
Event extraction
Scaling up Information Extraction


Focus on scaling up to large collections (where
data mining and ML techniques shine)
Other dimensions of scalability
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
14
Relation Extraction Examples

Extract tuples of entities that are related in predefined way
Disease Outbreaks relation
May 19 1995, Atlanta -- The Centers for Disease Control
and Prevention, which is in the front line of the world's
response to the deadly Ebola epidemic in Zaire, is finding
itself hard pressed to cope with the crisis…
Relation Extraction
Date
Disease Name
Location
Jan. 1995 Malaria
Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia
May 1995 Ebola
U.S.
Zaire
„We show that CBF-A and CBF-C interact with each other
to form a CBF-A-CBF-C complex and that CBF-B does not
interact with CBF-A or CBF-C individually but that it
associates with the CBF-A-CBF-C complex.“
CBF-A
CBF-B
interact
complex
associates
CBF-C
CBF-A-CBF-C complex
Eugene Agichtein
[From AliBaba]
KDD Webinar: Towards Web-Scale Information Extraction
15
Relation Extraction Approaches
Knowledge engineering
 Experts develop rules, patterns:



Can be defined over lexical items: “<company> located in <location>”
Over syntactic structures: “((Obj <company>) (Verb located) (*) (Subj
<location>))”
Sophisticated development/debugging environments:

Proteus, GATE
Machine learning

Supervised: Train system over manually labeled data


Partially-supervised: train system by bootstrapping from “seed” examples:



Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005,
Cardie et al 2006, Mooney et al. 2005, …
Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001, …
“Open” (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007
Hybrid or interactive systems:


Experts interact with machine learning algorithms (e.g., active learning family) to
iteratively refine/extend rules and patterns
Interactions can involve annotating examples, modifying rules, or any combination
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
16
Open Information Extraction [Banko et al., IJCAI 2007]

Self-Supervised Learner:




Single-Pass Extractor:



Classify all pairs of candidate entities for some (undetermined) relation
Heuristically generate a relation name from the words between entities
Redundancy-Based Assessor:


All triples in a sample corpus (e1, r, e2) are considered potential “tuples” for relation r
Positive examples: candidate triplets generated by a dependency parser
Train classifier on lexical features for positive and negative examples
Estimate probability that entities are related from co-occurrence statistics
Scalability

Extraction/Indexing




Query-time


No tuning or domain knowledge during extraction, relation inclusion determined at query time
0.04 CPU seconds pre sentence, 9M web page corpus in 68 CPU hours
Every document retrieved, processed (parsed, indexed, classified) in a single pass
Distributed index for tuples by hashing on the relation name text
Related efforts: [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al.
2006], [Rozenfeld and Feldman 2006], …
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
17
Event Extraction

Similar to Relation Extraction, but:




Events can be nested
Significantly more complex (e.g., more slots) than relations/template
elements
Often requires coreference resolution, disambiguation, deduplication, and
inference
Example: an integrated disease outbreak event [Hatunnen et al. 2002]
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
18
Event Extraction: Integration Challenges

Information spans multiple documents
 Missing or incorrect values
 Combining simple tuples into complex events
 No single key to order or cluster likely duplicates while separating them
from similar but different entities.
 Ambiguity: distinct physical entities with same name (e.g., Kennedy)

Duplicate entities, relation tuples extracted
 Large lists with multiple noisy mentions of the same entity/tuple
 Need to depend on fuzzy and expensive string similarity functions
 Cannot afford to compare each mention with every other.

See Part II of KDD 2006 Tutorial “Scalable Information Extraction and
Integration” -- scaling up integration: http://www.scalability-tutorial.net/
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
19
Other Information Extraction Tutorials
See these tutorials for more details:

R. Feldman, Information Extraction – Theory and Practice,
ICML 2006
http://www.cs.biu.ac.il/~feldman/icml_tutorial.html

W. Cohen, A. McCallum, Information Extraction and
Integration: an Overview, KDD 2003
http://www.cs.cmu.edu/~wcohen/ie-survey.ppt
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
20
Summary: Accuracy of Extraction Tasks
[Feldman, ICML 2006 tutorial]

Errors cascade (errors in entity tag cause errors in relation extraction)

This estimate is optimistic:



Primarily for well-established (tuned) tasks
Many specialized or novel IE tasks (e.g. bio- and medical- domains) exhibit
lower accuracy
Accuracy for all tasks is significantly lower for non-English
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
21
Multilingual Information Extraction

Active research area, beyond the scope of this talk. Nevertheless, a few (incomplete)
pointers are provided.

Closely tied to machine translation and cross-language information retrieval efforts.

Language-independent named entity tagging and related tasks at CoNLL:



Global Autonomous Language Exploitation program (GALE):



Tools and data for building MT and IE systems for six languages
http://aitc.aitcnet.org/nsf/iamtc/index.html
REFLEX project: NER for 50 languages



http://www.darpa.mil/ipto/Programs/gale/concept.htm
Interlingual Annotation of Multilingual Text Corpora (IAMTC)


2006: multi-lingual dependency parsing (http://nextens.uvt.nl/~conll/)
2002, 2003 shared tasks: language independent Named Entity Tagging
(http://www.cnts.ua.ac.be/conll2003/ner/)
Exploit for training temporal correlations in weekly aligned corpora
http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX
Cross-Language Information Retrieval (CLEF)

http://www.clef-campaign.org/
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
22
Outline

Overview of Information Extraction




Entity tagging
Relation extraction
Event Extraction
Scaling up Information Extraction


Focus on scaling up to large collections (where
data mining and ML techniques shine)
Other dimensions of scalability
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
23
Scaling Information Extraction to the Web

Dimensions of Scalability

Corpus size:



Document accessibility:



Deep web: documents only accessible via a search interface
Dynamic sources: documents disappear from top page
Source heterogeneity:



Applying rules/patterns is expensive
Need efficient ways to select/filter relevant documents
Coding/learning patterns for each source is expensive
Requires many rules (expensive to apply)
Domain diversity:



Extracting information for any domain, entities, relationships
Some recent progress (e.g., see slide 17)
Not the focus of this talk
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
24
Scaling Up Information Extraction

Scan-based extraction
 Classification/filtering to avoid processing documents
 Sharing common tags/annotations

General keyword index-based techniques
 QXtract, KnowItAll

Specialized indexes
 BE/KnowItNow, Linguist’s Search Engine

Parallelization/distributed processing
 IBM WebFountain, UIMA, Google’s Map/Reduce
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
25
Efficient Scanning for Information Extraction
Output Tuples
Text Database
Classifier
1. Retrieve docs
from database
2. Filter documents
Extraction
…
System
3. Process filtered
documents
4. Extract output
tuples

80/20 rule: use few simple rules to capture majority of the instances
[Pantel et al. 2004]

Train a classifier to discard irrelevant documents without processing
[Grishman et al. 2002]
 (e.g., the Sports section of NYT is unlikely to describe disease
outbreaks)

Share base annotations (entity tags) for multiple extraction tasks
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
26
Exploiting Keyword and Phrase Indexes

Generate queries to retrieve only relevant documents

Data mining problem!

Some methods in literature:




Traversing Query Graphs [Agichtein et al. 2003]
Iteratively refine queries [Agichtein and Gravano 2003]
Iteratively partition document space [Etzioni et al., 2004]
Case studies: QXtract, KnowItAll
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
27
Simple Strategy: Iterative Set Expansion
Output Tuples
Text Database
…
Extraction
Query
System
Generation
2. Process retrieved
documents
1. Query
database with
seed tuples
3. Extract tuples
from docs
(e.g., <Malaria, Ethiopia>)
4. Augment seed
tuples with new
tuples
(e.g., [Ebola AND Zaire])
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
Time for retrieving a
document
Eugene Agichtein
Time for processing
a document
Time for answering
a query
KDD Webinar: Towards Web-Scale Information Extraction
28
Reachability via Querying
Tuples
t1
Documents
d1
[Agichtein et al. 2003b]
Reachability Graph
t1
<SARS, China>
t2
d2
t2
t3
t5
t4
<Ebola, Zaire>
t3
<Malaria, Ethiopia>
d3
t4
d4
t5
d5
<Cholera, Sudan>
t1 retrieves document d1
that contains t2
<H5N1, Vietnam>
Upper recall limit: determined by the size
Eugene
KDD Webinar:
Towards Web-Scale
Information Extraction
ofAgichtein
the biggest
connected
component
29
Reachability Graph for DiseaseOutbreaks
Eugene Agichtein
DiseaseOutbreaks,
KDD Webinar: Towards Web-Scale Information
Extraction
30
New York
Times 1995
Getting Around Reachability Limits

KnowItAll [Etzioni et al., WWW 2004]:



Add keywords to partition documents into
retrievable disjoint sets
Submit queries with parts of extracted instances
QXtract [Agichtein and Gravano 2003a]


General queries with many matching documents
Assumes many documents retrievable per query
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
31
QXtract
User-Provided Seed Tuples
[Agichtein and Gravano 2003]
1. Get document sample with
“likely negative” and “likely
positive” examples.
2. Label sample documents using
information extraction system
as “oracle.”
3. Train classifiers to “recognize”
useful documents.
4. Generate queries from classifier
model/rules.
Eugene Agichtein
Seed Sampling
?
?
?
?
? ?
?
?
Search Engine
Text Database
Information Extraction
tuple1
tuple2
tuple3
tuple4
tuple5
+
+
-
-
+
+
- -
Classifier Training
+
+
-
-
+
+
-
-
Query Generation
Queries
KDD Webinar: Towards Web-Scale Information Extraction
32
Using Generic Indexes: Summary

Order of magnitude speed up in runtime
But: keyword indexes are approximate (so the
queries are not precise)
Require many documents to retrieve

Can we do better?


Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
33
Index Structures for Information Extraction

Bindings Engine [Cafarella and Etzioni 2005]

Indexing and querying entities: [K. Chakrabarti et al. 2006]

IBM Avatar project:
 http://www.almaden.ibm.com/cs/projects/avatar/

Other indexing schemes:



Linguist’s search engine (P. Resnik): http://lse.umiacs.umd.edu:8080/
FREE: Indexing regular expressions: [Cho and Rajagolapan, ICDE 2002]
Indexing and querying linguistic information in XML: [Bird et al., 2006]
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
34
Bindings Engine (BE) [Cafarella and Etzioni 2005]


Variabilized search query language
Integrates variable/type data with inverted index, minimizing query seeks


Index <NounPhrase>, <Adj-Term> terms
Key idea: neighbor index

At each position in the index, store “neighbor text” – both lexemes and tags
Query: “cities such as <NounPhrase>”
as
#docs
billy
docid0
19
pos0
docid1
pos1
…
docid#docs-1
pos#docs-1
cities
friendly
give
#posns #posns
pos0
12
pos
neighbor
pos1 pos…
0
0
1
neighbor
pos#pos-1
1
…
pos#pos-1
…
mayors
nickels
philadelphia
blk_offset
#neighbors
neighbor0
str0
neighbor1
str1
<offset>
3
AdjTleft
such
NPright
Philadelphia
such
words
Result: in document 19: “I love cities such as Philadelphia.”
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
35
Related Approach: [K. Chakrabarti et al. 2006]


Support “relationship” keyword queries over indexed entities
Top-K support for early processing termination
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
36
Workload-Driven Indexing [S. Chakrabarti et al. 2006]
Indexing Thousands of Entity Types
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
37
Workload-Driven Indexing (continued)
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
38
Parallelization/Adaptive Processing

Parallelize processing:



WebFountain [Gruhl et al. 2004]
UIMA architecture
Map/Reduce
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
39
IBM WebFountain
[Gruhl et al. 2004]






Dedicated share-nothing 256-node cluster
Blackboard annotation architecture
Data pipelined and streamed past each
“augmenter” to add annotations
Merge and index annotations
Index both tokens and annotations
Between 25K-75K entities per second
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
40
UIMA (IBM Research)

Unstructured Information Management Architecture (UIMA)

http://www.research.ibm.com/UIMA/

Open component software architecture for development, composition,
and deployment of text processing and analysis components.

Run-time framework allows to plug in components and applications and
run them on different platforms. Supports distributed processing, failure
recovery, …

Scales to millions of documents – incorporated into IBM OmniFind, grid
computing-ready

The UIMA SDK (freely available) includes a run-time framework, APIs,
and tools for composing and deploying UIMA components.

Framework source code also available on Sourceforge:

http://uima-framework.sourceforge.net/
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
41
Map/Reduce ([Dean & Ghemawat, OSDI 2004])
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
42
Map/Reduce (continued)

General framework



Scales to 1000s of machines
Recently implemented in Nutch and other open source efforts
“Maps” nicely to information extraction

Map phase:




Parse individual documents
Tag entities
Propose candidate relation tuples
Reduce phase


Merge multiple mentions of same relation tuple
Resolve co-references, duplicates
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
43
Summary

Presented a brief overview of information extraction

Techniques and ideas to scale up information extraction




Scan-based techniques (limited impact)
Exploiting general indexes (limited accuracy)
Building specialized index structures (most promising)
Scalability is a data mining problem




Querying graphs  link discovery
Workload mining for index optimization
Automatically optimizing for specific text mining application
Considerations for building integrated data mining systems
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
44
References and Supplemental Materials

References, slides, and additional information
available at:
http://www.mathcs.emory.edu/~eugene/kdd-webinar/

Will also post answers to questions
Eugene Agichtein
KDD Webinar: Towards Web-Scale Information Extraction
45

Scalable Information Extraction and Integration

Transcript Scalable Information Extraction and Integration

Directory