Information Extraction 3
Web-Scale Information Extraction
The Value of Text Data
“Unstructured” text data is the primary form of human-generated
information
Blogs, web pages, news, scientific literature, online reviews, …
Semi-structured data (database generated): see Prof. Bing Liu’s
KDD webinar: http://www.cs.uic.edu/~liub/WCM-Refs.html
The techniques discussed here are complementary to structured object extraction methods
Need to extract structured information to effectively manage, search,
and mine the data
Information Extraction: mature, but active research area
Intersection of Computational Linguistics, Machine Learning,
Data mining, Databases, and Information Retrieval
Traditional focus on accuracy of extraction
Outline
Information Extraction Tasks
Entity tagging
Relation extraction
Event extraction
Scaling up Information Extraction
Focus on scaling up to large collections (where data
mining can be most beneficial)
Other dimensions of scalability
Information Extraction Tasks
Extracting entities and relations: this talk
Entities: named (e.g., Person) and generic (e.g., disease name)
Relations: entities related in a predefined way (e.g., Location of a
Disease outbreak, or a CEO of a Company)
Events: can be composed from multiple relation tuples
Common extraction subtasks (a minimal end-to-end sketch follows this slide):
Preprocess: sentence chunking, syntactic parsing, morphological
analysis
Create rules or extraction patterns: hand-coded, machine learning, and
hybrid
Apply extraction patterns or rules to extract new information
Postprocess and integrate information
Co-reference resolution, deduplication, disambiguation
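To make the subtask flow concrete, here is a minimal sketch of the preprocess / apply-patterns / postprocess pipeline in Python. The crude sentence splitter, the single hand-written pattern, and the set-based deduplication are illustrative assumptions, not a production pipeline.

```python
import re

def split_sentences(text):
    """Very crude sentence chunking, for illustration only."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# One hand-coded extraction pattern: "<Disease> outbreak in <Location>".
PATTERN = re.compile(r"(?P<disease>[A-Z][a-z]+) outbreak in (?P<location>[A-Z][a-zA-Z]+)")

def extract(text):
    tuples = []
    for sentence in split_sentences(text):        # 1. preprocess
        for m in PATTERN.finditer(sentence):      # 2-3. apply extraction patterns
            tuples.append((m.group("disease"), m.group("location")))
    return sorted(set(tuples))                    # 4. postprocess: deduplicate

print(extract("A Malaria outbreak in Ethiopia was confirmed. "
              "Officials discussed the Malaria outbreak in Ethiopia."))
# [('Malaria', 'Ethiopia')]
```

Real systems replace each of these stages with the heavier machinery discussed on the following slides (parsers, learned taggers, coreference resolution), but the overall flow is the same.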
Entity Tagging
Identifying mentions of entities (e.g., person names, locations, companies) in
text
MUC (1997): Person, Location, Organization, Date/Time/Currency
ACE (2005): more than 100 finer-grained entity types
Hand-coded vs. Machine Learning approaches
Best approach depends on entity type and domain:
Closed class (e.g., geographical locations, disease names, gene & protein
names): hand coded + dictionaries
Syntactic (e.g., phone numbers, zip codes): regular expressions
Semantic (e.g., person and company names): mixture of context, syntactic
features, dictionaries, heuristics, etc.
“Almost solved” for common/typical entity types
Example: Extracting Entities from Text
Useful for data warehousing, data cleaning, web data
integration
Address example: "4089 Whispering Pines Nobel Drive San Diego CA 92122"
House number: 4089 | Building: Whispering Pines | Road: Nobel Drive | City: San Diego | State: CA | Zip: 92122
Citation example: "Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002"
Segment s1: Ronald Fagin -> Author
Segment s2: Combining Fuzzy Information from Multiple Systems -> Title
Segment s3: Proc. of ACM SIGMOD -> Conference
Segment s4: 2002 -> Year
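To make the citation example concrete, here is a minimal, rule-based segmentation sketch in Python. The hand-written regular expression and the field names are illustrative assumptions for one citation style; they are not the method that produced the segmentation above.

```python
import re

# Illustrative hand-coded pattern for citations of the form
# "Author, Title, Proc. of <venue>, <year>".  An assumption for this sketch,
# not a general-purpose citation parser.
CITATION_PATTERN = re.compile(
    r"^(?P<author>[^,]+),\s*"
    r"(?P<title>.+?),\s*"
    r"(?P<conference>Proc\. of [^,]+),\s*"
    r"(?P<year>\d{4})$"
)

def segment_citation(text):
    """Return {segment label: text} or None if the pattern does not match."""
    match = CITATION_PATTERN.match(text.strip())
    return match.groupdict() if match else None

print(segment_citation(
    "Ronald Fagin, Combining Fuzzy Information from Multiple Systems, "
    "Proc. of ACM SIGMOD, 2002"))
# {'author': 'Ronald Fagin', 'title': 'Combining Fuzzy Information from Multiple Systems',
#  'conference': 'Proc. of ACM SIGMOD', 'year': '2002'}
```

Rules like this are quick to write for regular formats but break as soon as citation styles vary, which is exactly what motivates the learned segmenters discussed next.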
Hand-Coded Methods
Easy to construct in some cases
e.g., to recognize prices, phone numbers, zip codes,
conference names, etc.
Intuitive to debug and maintain
Especially if written in a “high-level” language:
ContactPattern RegularExpression(Email.body,”can be reached at”)
[IBM Avatar]
Can incorporate domain knowledge
Scalability issues:
Labor-intensive to create
Highly domain-specific
Often corpus-specific
Rule-matches can be expensive
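As a concrete illustration of the hand-coded style above, here is a minimal sketch in the spirit of the Avatar ContactPattern rule: extract a phone number that follows the trigger phrase "can be reached at". The simplified phone-number regex is an illustrative assumption, not the rule language used by that system.

```python
import re

# Hand-coded rule: a US-style phone number appearing right after "can be reached at".
CONTACT_RULE = re.compile(
    r"can be reached at\s+(?P<phone>\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4})",
    re.IGNORECASE,
)

def extract_contacts(email_body):
    return [m.group("phone") for m in CONTACT_RULE.finditer(email_body)]

print(extract_contacts("Alice can be reached at (404) 555-1212 after 5pm."))
# ['(404) 555-1212']
```

Such rules are intuitive to write and debug, but as the slide notes they are labor-intensive to create and tend to be domain- and corpus-specific.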
Machine Learning Methods
Can work well when large amounts of training data are easy to construct
Can capture complex patterns that are hard to encode with hand-crafted
rules
e.g., determine whether a review is positive or negative
extract long complex gene names
Non-local dependencies
"The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300."
[From AliBaba]
Popular Machine Learning Methods
For details: [Feldman, 2006 and Cohen, 2004]
Naive Bayes
SRV [Freitag 1998], Inductive Logic Programming
Rapier [Califf and Mooney 1997]
Hidden Markov Models [Leek 1997]
Maximum Entropy Markov Models [McCallum et al. 2000]
Conditional Random Fields [Lafferty et al. 2001]
Scalability
Can be labor intensive to construct training data
At run time, complex features can be expensive to construct or process
(batch algorithms can help: [Chandel et al. 2006] )
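For a flavor of how one of these sequence models is trained in practice, here is a minimal sketch of a linear-chain CRF tagger using the third-party sklearn-crfsuite package (my choice for brevity; it is not one of the tools listed in these slides). The toy training set, feature choices, and hyperparameters are illustrative assumptions.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sent, i):
    """Simple orthographic and contextual features for token i of a sentence."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# A toy labeled corpus; a real tagger needs thousands of labeled sentences.
train_sents = [["Ebola", "outbreak", "reported", "in", "Zaire", "."]]
train_labels = [["B-DISEASE", "O", "O", "O", "B-LOC", "O"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, train_labels)

test = ["Malaria", "cases", "in", "Ethiopia"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```

The orthographic and contextual features mirror the kind used by taggers such as ABNER and the Stanford NER system listed on the next slide.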
Some Available Entity Taggers
ABNER:
http://www.cs.wisc.edu/~bsettles/abner/
Linear-chain conditional random fields (CRFs) with orthographic and contextual
features.
Alias-I LingPipe
http://www.alias-i.com/lingpipe/
MALLET:
http://mallet.cs.umass.edu/index.php/Main_Page
Collection of NLP and ML tools, can be trained for named entity tagging
MinorThird:
http://minorthird.sourceforge.net/
Tools for learning to extract entities, categorization, and some visualization
Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
CRF-based entity tagger with non-local features
Alias-I LingPipe ( http://www.alias-i.com/lingpipe/ )
Statistical named entity tagger
Generative statistical model
Find most likely tags given lexical and linguistic features
Accuracy at (or near) state of the art on benchmark tasks
Explicitly targets scalability:
~100K tokens/second runtime on single PC
Pipelined extraction of entities
User-defined mentions, pronouns and stop list
Specified in a dictionary, left-to-right, longest match
Can be trained/bootstrapped on annotated corpora
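The dictionary-driven, left-to-right, longest-match behavior mentioned above can be sketched in a few lines. This is a generic illustration of the matching strategy, not LingPipe's actual API; the toy gazetteer is an assumption.

```python
def longest_match_tag(tokens, dictionary):
    """Greedy left-to-right, longest-match dictionary tagging.

    dictionary maps tuples of tokens to an entity type,
    e.g. {("New", "York"): "LOCATION", ("Ebola",): "DISEASE"}.
    """
    max_len = max(len(k) for k in dictionary)
    tags, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in dictionary:
                tags.append((" ".join(span), dictionary[span]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts here; move on
    return tags

gazetteer = {("New", "York"): "LOCATION", ("Ebola",): "DISEASE", ("Zaire",): "LOCATION"}
print(longest_match_tag("Ebola outbreak in Zaire and New York".split(), gazetteer))
# [('Ebola', 'DISEASE'), ('Zaire', 'LOCATION'), ('New York', 'LOCATION')]
```

Dictionary matching of this kind is fast and predictable, which is one reason it suits high-throughput pipelines like the one described above.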
Outline
Overview of Information Extraction
Entity tagging
Relation extraction
Event extraction
Scaling up Information Extraction
Focus on scaling up to large collections (where data
mining and ML techniques shine)
Other dimensions of scalability
Relation Extraction Examples
Extract tuples of entities that are related in predefined way
Disease Outbreaks relation
May 19 1995, Atlanta -- The Centers for Disease Control
and Prevention, which is in the front line of the world's
response to the deadly Ebola epidemic in Zaire, is finding
itself hard pressed to cope with the crisis…
Date        Disease Name       Location
Jan. 1995   Malaria            Ethiopia
July 1995   Mad Cow Disease    U.K.
Feb. 1995   Pneumonia          U.S.
May 1995    Ebola              Zaire
Relation Extraction
"We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."
Extracted relations (from the highlighted figure):
interact(CBF-A, CBF-C)
complex(CBF-A, CBF-C) -> CBF-A-CBF-C complex
associates(CBF-B, CBF-A-CBF-C complex)
[From AliBaba]
Relation Extraction Approaches
Knowledge engineering
Experts develop rules, patterns:
Can be defined over lexical items: “<company> located in <location>”
Over syntactic structures: “((Obj <company>) (Verb located) (*) (Subj
<location>))”
Sophisticated development/debugging environments:
Proteus, GATE
Machine learning
Supervised: Train system over manually labeled data
Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005,
Cardie et al 2006, Mooney et al. 2005, …
Partially-supervised: train system by bootstrapping from “seed” examples:
Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001,
…
“Open” (no seeds): Sekine et al. 2006, Cafarella et al. 2007, Banko et al.
2007
Hybrid or interactive systems:
Experts interact with machine learning algorithms (e.g., active learning
family) to iteratively refine/extend rules and patterns
Interactions can involve annotating examples, modifying rules, or any
combination
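As an illustration of the lexical-pattern style above ("<company> located in <location>"), here is a minimal sketch. The capitalized-word regex is a crude stand-in for the output of an entity tagger; a real system would match patterns over tagged entities or parse structures instead.

```python
import re

# Crude stand-ins for tagged Company and Location mentions: capitalized word sequences.
ENTITY = r"[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*"
LOCATED_IN = re.compile(rf"(?P<company>{ENTITY})\s*,?\s+located in\s+(?P<location>{ENTITY})")

def extract_located_in(text):
    return [(m.group("company"), m.group("location")) for m in LOCATED_IN.finditer(text)]

print(extract_located_in("Example Corp, located in San Diego, announced a new product."))
# [('Example Corp', 'San Diego')]
```

Patterns over syntactic structures, as in the second example on the slide, are more robust to word-order variation but cost a parse per sentence.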
Open Information Extraction [Banko et al., IJCAI 2007]
Self-Supervised Learner:
All triples in a sample corpus (e1, r, e2) are considered potential “tuples” for relation r
Positive examples: candidate triples generated by a dependency parser
Train classifier on lexical features for positive and negative examples
Single-Pass Extractor:
Classify all pairs of candidate entities for some (undetermined) relation
Heuristically generate a relation name from the words between entities
Redundancy-Based Assessor:
Estimate probability that entities are related from co-occurrence statistics
Scalability:
Extraction/indexing: every document is retrieved and processed (parsed, indexed, classified) in a single pass; 0.04 CPU seconds per sentence, a 9M web page corpus in 68 CPU hours; distributed index for tuples by hashing on the relation name text
Query time: no tuning or domain knowledge during extraction; relation inclusion determined at query time
Related efforts: [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al. 2006], [Rozenfeld and
Feldman 2006], …
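A toy, hedged sketch of the single-pass extractor and the redundancy idea: generate a relation name from the words between two entity mentions, then count how many distinct sentences support each tuple. The stopword filtering and string heuristics here are stand-ins, not the Banko et al. implementation.

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was"}

def candidate_tuple(sentence, e1, e2):
    """Heuristically name a relation from the words between two entity mentions."""
    m = re.search(re.escape(e1) + r"\s+(.*?)\s+" + re.escape(e2), sentence)
    if not m:
        return None
    relation = " ".join(w for w in m.group(1).split() if w.lower() not in STOPWORDS)
    return (e1, relation, e2) if relation else None

# Toy "corpus" of (sentence, entity1, entity2) items from an entity tagger.
corpus = [
    ("The Ebola epidemic in Zaire is spreading.", "Ebola", "Zaire"),
    ("An Ebola epidemic in Zaire was reported.", "Ebola", "Zaire"),
]
counts = Counter(t for s, e1, e2 in corpus if (t := candidate_tuple(s, e1, e2)) is not None)
for tup, n in counts.items():
    print(tup, "supported by", n, "sentence(s)")
# ('Ebola', 'epidemic', 'Zaire') supported by 2 sentence(s)
```

The actual assessor uses a probabilistic model over these co-occurrence counts, but the intuition is the same: tuples extracted independently from many sentences are more likely to be correct.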
Event Extraction
Similar to Relation Extraction, but:
Events can be nested
Significantly more complex (e.g., more slots) than relations/template elements
Often requires coreference resolution, disambiguation, deduplication, and
inference
Example: an integrated disease outbreak event [Huttunen et al. 2002]
Event Extraction: Integration Challenges
Information spans multiple documents
Duplicate entities, relation tuples extracted
Missing or incorrect values
Combining simple tuples into complex events
No single key to order or cluster likely duplicates while separating
them from similar but different entities.
Ambiguity: distinct physical entities with same name (e.g., Kennedy)
Large lists with multiple noisy mentions of the same entity/tuple
Need to depend on fuzzy and expensive string similarity functions
Cannot afford to compare each mention with every other (a blocking and fuzzy-matching sketch follows this slide)
See Part II of KDD 2006 Tutorial “Scalable Information Extraction and
Integration” -- scaling up integration: http://www.scalability-tutorial.net/
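One common ingredient of the integration step is fuzzy string matching for deduplication under a blocking scheme that avoids all-pairs comparison. A minimal sketch, using Python's difflib similarity ratio, a first-token blocking key, and an arbitrary threshold, all of which are illustrative choices rather than recommendations:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similar(a, b, threshold=0.75):
    """Fuzzy string similarity; the threshold is an illustrative choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedupe(mentions):
    """Cluster near-duplicate mentions within blocks keyed on the first token,
    so that we never compare every mention against every other mention."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[m.split()[0].lower()].append(m)
    clusters = []
    for block in blocks.values():
        block_clusters = []
        for m in block:
            for cluster in block_clusters:
                if similar(m, cluster[0]):
                    cluster.append(m)
                    break
            else:
                block_clusters.append([m])
        clusters.extend(block_clusters)
    return clusters

print(dedupe(["Centers for Disease Control",
              "Centers for Disease Control and Prevention",
              "World Health Organization",
              "World Health Organisation"]))
```

Note that naive blocking on the first token would miss duplicates that differ in their leading word; real systems use more careful blocking keys, as discussed in the tutorial referenced above.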
Summary: Accuracy of Extraction Tasks
[Feldman, ICML 2006 tutorial]
Errors cascade (errors in entity tagging cause errors in relation extraction)
This estimate is optimistic:
Primarily for well-established (tuned) tasks
Many specialized or novel IE tasks (e.g., bio- and medical domains) exhibit lower accuracy
Accuracy for all tasks is significantly lower for non-English text
Multilingual Information Extraction
Closely tied to machine translation and cross-language information retrieval efforts.
Language-independent named entity tagging and related tasks at CoNLL:
2006: multi-lingual dependency parsing (http://nextens.uvt.nl/~conll/)
2002, 2003 shared tasks: language independent Named Entity Tagging
(http://www.cnts.ua.ac.be/conll2003/ner/)
Global Autonomous Language Exploitation program (GALE):
http://www.darpa.mil/ipto/Programs/gale/concept.htm
Interlingual Annotation of Multilingual Text Corpora (IAMTC)
Tools and data for building MT and IE systems for six languages
http://aitc.aitcnet.org/nsf/iamtc/index.html
REFLEX project: NER for 50 languages
Exploits temporal correlations in weakly aligned corpora for training
http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX
Cross-Language Information Retrieval (CLEF)
http://www.clef-campaign.org/
Outline
Overview of Information Extraction
Entity tagging
Relation extraction
Event Extraction
Scaling up Information Extraction
Focus on scaling up to large collections (where data
mining and ML techniques shine)
Other dimensions of scalability
Scaling Information Extraction to the Web
Dimensions of Scalability
Corpus size:
Applying rules/patterns is expensive
Need efficient ways to select/filter relevant documents
Document accessibility:
Deep web: documents only accessible via a search interface
Dynamic sources: documents disappear from top page
Source heterogeneity:
Coding/learning patterns for each source is expensive
Requires many rules (expensive to apply)
Domain diversity:
Extracting information for any domain, entities, relationships
Some recent progress (e.g., the Open Information Extraction approach above)
Not the focus of this talk
Scaling Up Information Extraction
Scan-based extraction
Classification/filtering to avoid processing documents
Sharing common tags/annotations
General keyword index-based techniques
QXtract, KnowItAll
Specialized indexes
BE/KnowItNow, Linguist’s Search Engine
Parallelization/distributed processing
IBM WebFountain, UIMA, Google’s Map/Reduce
Efficient Scanning for Information Extraction
[Diagram: scan-based extraction over a text database]
1. Retrieve documents from the text database
2. Filter documents with a classifier
3. Process the filtered documents with the extraction system
4. Extract output tuples
80/20 rule: use a few simple rules to capture the majority of the instances [Pantel et al. 2004]
Train a classifier to discard irrelevant documents without processing them [Grishman et al. 2002]
(e.g., the Sports section of the NYT is unlikely to describe disease outbreaks; a minimal sketch of such a filter follows this slide)
Share base annotations (entity tags) for multiple extraction tasks
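A minimal sketch of such a filtering classifier, using scikit-learn with a toy training set. The feature choice, model, and example documents are illustrative assumptions, not the setup used by Grishman et al.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data: 1 = likely to contain disease-outbreak facts, 0 = irrelevant.
# A real filter would be trained on thousands of labeled documents.
docs = [
    "Ebola epidemic spreads in Zaire, health officials say",
    "Malaria outbreak reported in Ethiopia",
    "The Yankees won the game 5-3 last night",
    "Quarterly earnings beat analyst expectations",
]
labels = [1, 1, 0, 0]

filter_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
filter_clf.fit(docs, labels)

new_docs = ["Pneumonia cases rise sharply in the U.S.", "Team trades star pitcher"]
# Only documents predicted relevant are passed to the (expensive) extraction system.
relevant = [d for d, keep in zip(new_docs, filter_clf.predict(new_docs)) if keep]
print(relevant)
```

The classifier is cheap relative to parsing and pattern matching, so even a modestly accurate filter can cut total extraction time substantially.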
Exploiting Keyword and Phrase Indexes
Generate queries to retrieve only relevant documents
Data mining problem!
Some methods in literature:
Traversing Query Graphs [Agichtein et al. 2003]
Iteratively refine queries [Agichtein and Gravano 2003]
Iteratively partition document space [Etzioni et al., 2004]
Case studies: QXtract, KnowItAll
Simple Strategy: Iterative Set Expansion
[Diagram: iterative set expansion over a text database]
1. Query the database with seed tuples, using queries generated from the tuples (e.g., [Ebola AND Zaire])
2. Process the retrieved documents with the extraction system
3. Extract tuples from the documents (e.g., <Malaria, Ethiopia>)
4. Augment the seed tuples with the new tuples and repeat
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
where R = time to retrieve a document, P = time to process a document, Q = time to answer a query
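A skeleton of the iterative set expansion loop above; `search_documents` and `extract_tuples` are hypothetical stand-ins for a keyword index and an extraction system, and the tuple schema (disease, location) is only an example.

```python
def iterative_set_expansion(seed_tuples, search_documents, extract_tuples, max_iterations=5):
    """Query-driven extraction: turn known tuples into keyword queries,
    process only the retrieved documents, and grow the tuple set."""
    known = set(seed_tuples)
    seen_docs = set()
    for _ in range(max_iterations):
        new_tuples = set()
        for disease, location in known:
            query = f"{disease} AND {location}"            # e.g., [Ebola AND Zaire]
            for doc_id, text in search_documents(query):
                if doc_id in seen_docs:
                    continue
                seen_docs.add(doc_id)
                new_tuples.update(extract_tuples(text))    # e.g., {("Malaria", "Ethiopia")}
        if new_tuples <= known:                            # converged: nothing new found
            break
        known |= new_tuples
    return known
```

The cost matches the formula above: each newly retrieved document pays R + P, and each generated query pays Q, so the strategy wins when the queries reach most relevant documents without scanning the whole collection.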
Some IE Tools Available
MALLET (UMass)
Statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text.
Sample Application:
GeneTaggerCRF: a gene-entity tagger based on
MALLET (MAchine Learning for LanguagE Toolkit). It
uses conditional random fields to find genes in a text file.
MinorThird
http://minorthird.sourceforge.net/
“a collection of Java classes for storing text,
annotating text, and learning to extract entities and
categorize text”
Stored documents can be annotated in independent
files using TextLabels (denoting, say, part-of-speech
and semantic information)
GATE
http://gate.ac.uk/ie/annie.html
A leading toolkit for text mining
Distributed with an Information Extraction component set called ANNIE (demo)
Used in many research projects
A long list can be found on its website
Integration with IBM UIMA is underway
Sunita Sarawagi's CRF package
http://crf.sourceforge.net/
A Java implementation of conditional random fields for
sequential labeling.
UIMA (IBM)
Unstructured Information Management Architecture.
A platform for unstructured information
management solutions from combinations of
semantic analysis (IE) and search
components.
Some Interesting Websites Based on IE
ZoomInfo
CiteSeer.org (some of us use it every day!)
Google Local, Google Scholar
and many more…
UIMA (IBM Research)
Unstructured Information Management Architecture (UIMA)
http://www.research.ibm.com/UIMA/
Open component software architecture for development, composition, and
deployment of text processing and analysis components.
Run-time framework allows plugging in components and applications and running them on different platforms. Supports distributed processing, failure recovery, …
Scales to millions of documents – incorporated into IBM OmniFind, grid
computing-ready
The UIMA SDK (freely available) includes a run-time framework, APIs, and
tools for composing and deploying UIMA components.
Framework source code also available on Sourceforge:
http://uima-framework.sourceforge.net/
UIMA – Quick Overview
Architecture, Software Framework, and Tooling
Analytics Bridge the Unstructured & Structured Worlds
[Diagram: text and multi-modal analytics (UIMA) bridge unstructured and structured information]
Unstructured information (text, chat, email, audio, video; docs, emails, phone calls, reports): high-value, most current, fastest growing ... BUT buried in huge volumes (noise), implicit semantics, inefficient search
Analytics discover relevant semantics and build them into structure: topics, entities, relationships; people, places, organizations, times, events; customer opinions, products, problems; threats, chemicals, drugs, drug interactions, ...
Structured information (indices, DBs, KBs): explicit semantics, efficient search, focused content ... BUT slow growing, narrow coverage, less current/relevant
Analytics: The kinds of things they do
• Independently developed
• From an increasing # of sources
• Different technologies & interfaces
• Highly specialized & fine grained
Analysis Capabilities: language and speaker identifiers, tokenizers, classifiers, part-of-speech detectors, document structure detectors, parsers, translators, named-entity detectors, face recognizers, relationship detectors
Capability Specializations: modality, human language, domain of interest, source style and format, input/output semantics, privacy/security, precision/recall tradeoffs, performance/precision tradeoffs, ...
The right analysis for the job will likely be a best-of-breed
combination integrating capabilities across many dimensions.
UIMA's basic building blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for downstream processing.
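To convey the annotator/CAS idea without the Java SDK, here is a conceptual Python sketch. This is not the UIMA API; the class names, the simplistic CAS dictionary, and the crude capitalization heuristic are illustrative assumptions only.

```python
import re

class SimpleCAS:
    """Toy stand-in for UIMA's Common Analysis Structure: the artifact text
    plus a growing list of typed annotations with character offsets."""
    def __init__(self, text):
        self.text = text
        self.annotations = []   # each item: (type, start, end, features)

    def add(self, ann_type, start, end, **features):
        self.annotations.append((ann_type, start, end, features))

    def select(self, ann_type):
        return [a for a in self.annotations if a[0] == ann_type]

class TokenAnnotator:
    """First annotator: adds Token annotations directly over the raw text."""
    def process(self, cas):
        for m in re.finditer(r"\w+", cas.text):
            cas.add("Token", m.start(), m.end())

class CapitalizedNameAnnotator:
    """Second annotator: builds on existing Token annotations to add
    (very crude) Person annotations for capitalized tokens."""
    def process(self, cas):
        for _, start, end, _ in cas.select("Token"):
            if cas.text[start:end][0].isupper():
                cas.add("Person", start, end)

cas = SimpleCAS("Fred Center is the CEO of Center Micros.")
for annotator in (TokenAnnotator(), CapitalizedNameAnnotator()):
    annotator.process(cas)
print(cas.select("Person"))
```

The point of the sketch is the flow, not the heuristics: each annotator reads the types produced so far, adds its own, and hands the shared analysis structure to the next component.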
UIMA CAS: representation now aligned with the XMI standard
[Diagram: Common Analysis Structure (CAS) for the artifact (e.g., document) "Fred Center is the CEO of Center Micros"]
Analysis results (i.e., artifact metadata) layered over the artifact:
Parser annotations: NP, VP, PP
Named entity annotations: Person ("Fred Center"), Organization ("Center Micros")
Relationship annotation: CeoOf (Arg1: Person, Arg2: Org)
• Analyzed by a collection of text analytics
• Detected semantic entities and relations highlighted
• Represented in the UIMA Common Analysis Structure (CAS)
UIMA: Unstructured Information Management Architecture
Open Software Architecture and Emerging Standard
Platform independent standard for interoperable text and multi-modal analytics
Under Development: UIMA Standards Technical Committee Initiated under
OASIS
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima
Software Framework Implementation
SDK Available on IBM Alphaworks
http://www.alphaworks.ibm.com/tech/uima
Tools, Utilities, Runtime, Extensive Documentation
Creation, Integration, Discovery, Deployment of analytics
Java, C++, Perl, Python (others possible)
Supports co-located and service-oriented deployments (e.g., SOAP)
Cross-language, high-performance APIs to the common data structure (CAS)
Embeddable on Systems Middleware (e.g., ActiveMQ, WebSphere, DB2)
Apache UIMA open-source project
http://incubator.apache.org/uima/
[Diagram: an end-to-end UIMA pipeline with user-defined workflows]
Sources (text, chat, email, speech, video) are connected, read, and segmented by any UIMA-compliant readers and segmenters (e.g., web crawler, file system reader, streaming speech segmenter).
Content is analyzed by any UIMA-compliant analysis engines, which assign task-relevant semantics and pass results along in a CAS (e.g., entity & relation detectors, deep parser, Arabic-English translator as a web service, transcription engine, video object detector).
Results are indexed or processed by any UIMA-compliant CAS consumers (e.g., index tokens & annotations in an IR engine; index entities & relations in a relational DB or OWL knowledge base).
End-user applications reach the relevant knowledge through query interfaces and query services over the relational database, the text IR engine index, the OWL knowledge base, and the video search index.
UIMA: pluggable framework, user-defined workflows
CAS: common UIMA data representation & interchange, aligned with OMG & W3C standards (i.e., XMI, SOAP, RDF)
UIMA Component Architecture
[Diagram: a Collection Processing Engine (CPE)]
A Collection Reader pulls artifacts (text, chat, email, audio, video) from a collection and creates a CAS for each.
An Aggregate Analysis Engine, driven by a Flow Controller, runs component Analysis Engines; each Analysis Engine wraps an Annotator that adds annotations to the CAS.
CAS Consumers take the completed CAS and populate indices, ontologies, DBs, and knowledge bases.
Key: framework-constructed components vs. developer-coded components.