Information Extraction

Pharos Summer School
Fundamentals of Social Applications
June 2009
Avaré Stewart
http://www.l3s.uni-hannover.de/~stewart/pharos/
Roadmap
• Part I: Overview Social Applications
– current shortcomings, solutions
• Part II : Information Extraction (IE)
– tasks, techniques, tools
• Part III: Evaluation
• Part IV: IE & IR Applications in Context
Overview of Social Applications
The Social Applications Phenomenon
The social applications phenomenon today is driven
by social media
Social Media:
• information content of the
“citizen journalist”, user
generated content
• a popular way for people to connect in the online world,
through personal & business relationships
What's the Social Media Hype?
Capitalize on Social Processes: Diffusion / Cascade
• Coverage:
– Reach small or large audiences
– Breaks publication barriers
• Business / Advertisement
– Repeated visiting: offer the best links and readers will
come back
• Information Gathering / Sharing:
– Cut the time you spend looking for media
– Link economy is real… Give some, get some
– Dynamic Content: not endpoint of
conversation, but the beginning…
• Social Intervention / Detection
– Rumors, fads, infectious disease
The Many Faces of Social Applications
Domain:
• Music, politics, cycling, medicine
Media Type:
• Video: YouTube, Daily Motion, Facebook
Services:
• meeting people
• expressing a point of view
• serendipitous discovery
What Are Some Limitations
with Social Applications?
Where's the “Social” Web ?
Social Sites intentionally seek distinction
Problem:
sheer number: redundancy, overlap:
• type of media, resources
• topics
Social
Networking
Divide
Overlaps exist, but remain untapped to the benefit
of those who actually constitute the
social networking ecosystem
The so-called Social Web is,
ironically, divided
Open Social Networking (OSN)
Aspects of an Open Social Network
• Unified Data Spaces
• Personal Identity Unification
• Unified Applications
Unified Data Spaces Linking Open Data Cloud
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Personal Identity Unification
• OpenID : a single digital identity
• Retaggr : social media
profile card
• Geek Chart : graphical
profile - pie chart
• DandyID : collect online
profiles in one place
• FriendFeed : real-time
aggregator, consolidates
the updates from sites
Unified Applications
Multi-Site APIs: common API for social
applications across multiple websites
– OpenSocial
– Data Portability Project
Single-Site APIs: partner / interact
programmatically
– YouTube Data API: videos
– Spinn3r: indexing blogosphere
– etc....
Pharos Scenario
Bloggers Who
Don’t Tag
Taggers Who
Don’t Blog
???
Social Network Divide
Missing Link: Cross-Tagging
Exploit the tags
assertions
made by users of one
social site
to personalize the
experience
for users in another,
comparable site
Overview: Cross Tagging
Better
Browsing
Better
Search
Better
Recommendations
Cross-Tagging for Personalized Open Social
Networking, Stewart, Diaz, Balby Marinho 2008
What More Can We Do with
Social Applications?
Social Media Communities & Content
• Social media: examined primarily for its
popularity in connecting people
• In Pharos: examine blogs for improved,
personalized information access
Complex Information Needs & Social
Media Search
• Polarity, opinion
• Memes and themes
• Related, multi-lingual resources
• Entities: people, organizations, etc.
• Relationships between entities
• Events: who, what, where, when, how
Events ? ... Momentum is Shifting
• Industry:
– Complex Event Processing (CEP)
– Event correlation:
• Event Filtering , Event Aggregation
• Event Masking, Root Cause Analysis
• Research:
– Event detection
– Associations
– De-duplicate
Humans think in
terms of events
and entities
Events - natural
abstraction of real
world
Information Retrieval, Meet Information
Extraction ... from Blogs
• Information Extraction (IE):
– a subarea of Natural Language
Processing (NLP)
– needed to solve complex (event-driven) information needs
– hard, because natural language is
complex, vague and ambiguous,
i.e., unstructured
• potentially harder for blogs &
informal sources
IR
IE
Social Media
Anatomy of a Blog
Rich Source for Personalized Information
Archive
Author
Tag
Content
Trackback
Permalink
Comment
Timestamp
Blogroll
Feed
Title
Part II: Information Extraction
Tasks, Techniques and Tools
What is Information
Extraction ?
Unstructured Data
• Encoded in a way that makes it difficult for
computers to immediately interpret
• Multiple languages, across multiple documents
Why Information Extraction?
• Large amount of unstructured or semi-structured information
– Web pages, email, news articles, call-center text records, business
reports, annotations, spreadsheets, research papers, blogs, tags,
instant messages (IM), …
• High impact applications
– Business intelligence, personal information management, Web
communities, Web search and advertising, scientific data
management, e-government, medical records management, …
• Open ended and growing rapidly
• Information Extraction:
– Superimpose formal meaning on unstructured information
– Elicit facts and relationships
– Feed database/knowledgebase
Why? ... Information is Locked Away...
Events, Facts,
Relationships
Information
System
Human Tasks &
Distillation /
Extraction
System
Pre-Filtering
Inaccessible data ... growing; sophisticated needs ... growing
What is Information Extraction (IE) ?
• ...isolates relevant text fragments, extracts
relevant information from the fragments,
and pieces together the targeted
information in a coherent framework
• ... builds systems that find and link
relevant information while ignoring
extraneous and irrelevant information
• IE is used to get some information out of unstructured data
Cowie and Lehnert, 1996, p. 81
Information Extraction: e.g. Disaster
Unstructured
Text
Information Extraction (IE) System
Structured
Text
Information Extraction: Major Tasks
• Segmentation
– Tokenization, Sentence Splitting
• Classification
– POS Tagging, Lemmatization, Disambiguation, …
– Entity Detection
• Association
– Noun Phrase Chunking
– Parsing
– Relationship Detection
• Normalization & Deduplication
– Anaphora Resolution
– Normalization of Formats, Schema
– Record Linkage, Record Deduplication
– Mention Tracking
What are the
Components and Tasks
of an
Information Extraction
System?
General View of IE System
[Figure: general view of an IE system]
• Training Phase: INPUT: training corpus; preprocessing; learning (knowledge acquisition)
• Deployment Phase: INPUT: source text; preprocessing; extraction; feedback
• External Knowledge: thesaurus, ontology, knowledge base
• Extraction Grammar
• OUTPUT: structured information
Information Extraction, Moens 06
Common IE Tasks: Preprocessing &
Recognition
Pre-Processing Tasks
• Normalization
• Sentence Splitting
• Tokenization
• POS Tagging
• Chunking
• Parsing
• Sense Disambiguation
Recognition Tasks
• Named Entity (NE)
• Co-reference Resolution (CO)
• Template Element Construction (TE)
• Template Relation Construction (TR)
• Scenario Template (ST)
• Semantic Role
• Timex Line Recognition
Ex: Text Normalization
AVIAN INFLUENZA, HUMAN (101): EGYPT, 79TH, 80TH CASES
*****************************************************
A ProMED-mail post
<http://www.promedmail.org>
ProMED-mail is a program of the
International Society for Infectious Diseases
http://www.isid.org
Date: Mon 8 Jun 2009
Source: Egyptian Chronicles [edited]
<http://egyptianchronicles.blogspot.com/2009/06/h5n1-follow-up-no80.html>
• Clean junk formatting
• Text is transformed to make it consistent
• Performed before the text is processed
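A minimal sketch of what such a normalization pass might look like for the ProMED-mail post above. The function name `normalize_text` and the particular cleanup rules (dropping separator lines, unwrapping bracketed URLs, collapsing whitespace) are assumptions chosen for illustration; a real pipeline would pick its rules per source.

```python
import re


def normalize_text(raw):
    """Return a cleaned, consistently formatted version of a raw post."""
    cleaned = []
    for line in raw.splitlines():
        line = line.strip()
        # Drop decorative separator lines such as "*****" or "-----".
        if re.fullmatch(r"[*=_\-]{3,}", line):
            continue
        # Unwrap URLs written as <http://...>.
        line = re.sub(r"<(https?://[^>\s]+)>", r"\1", line)
        # Collapse runs of spaces and tabs into one space.
        line = re.sub(r"[ \t]+", " ", line)
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


raw_post = (
    "AVIAN INFLUENZA,  HUMAN (101)\n"
    "*****\n"
    "A ProMED-mail post\n"
    "<http://www.promedmail.org>\n"
)
print(normalize_text(raw_post))
```

Running the pass before sentence splitting and tokenization keeps layout noise out of the later steps.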
Sentence Splitting
• Segments text into sentences
• Required for the tagger
• Domain- and application-independent
Example: He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded.
The computer must tell which of the dots denote an actual sentence boundary
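The difficulty with the dots can be sketched in a few lines. The code below compares a naive split on every period with a heuristic that skips a small, hand-picked abbreviation list. The list and the function names are assumptions for illustration only, and the heuristic gets "D.C." right here only because it is not in the list; a token that is both an abbreviation and a sentence end remains ambiguous.

```python
import re

# Assumption: a tiny hand-made abbreviation list; real systems use larger
# lexicons or a trained classifier.
ABBREVIATIONS = ("mr.", "mrs.", "dr.", "a.m.", "p.m.")


def naive_split(text):
    """Split after every ., ! or ? that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_sentences(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and not token.lower().endswith(ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "He called Mr. White at 4p.m. in Washington, D.C. Mr. Green responded."
print(naive_split(text))      # breaks after "Mr." and "4p.m." as well
print(split_sentences(text))  # two sentences
```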
Tokenization
• Tokenization / Word Segmentation:
– Numbers, punctuation, symbols
– Is a token a string of contiguous alphanumeric characters with space on either side?
Words are not always surrounded by whitespace:
• Abbreviations: etc. and Calif.
• A text-based medium.
Whitespace not indicating a word break:
• Phone: 0171 378 0647
• San Francisco
• Ditto: in spite of
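A small regex-based tokenizer illustrates the first group of cases: it keeps hyphenated words and listed abbreviations whole and splits sentence-final periods off. The pattern, the abbreviation set and the name `tokenize` are assumptions made for this sketch; it deliberately does nothing about whitespace that is not a word break, so the phone number still comes out as three tokens.

```python
import re

# Assumption: abbreviations whose trailing period belongs to the token.
ABBREVIATIONS = {"etc.", "Calif."}

# Words (optionally hyphenated, optionally followed by a period),
# digit runs, or any single punctuation character.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*\.?|\d+|[^\w\s]")


def tokenize(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if len(token) > 1 and token.endswith(".") and token not in ABBREVIATIONS:
            # Sentence-final period: split it off as its own token.
            tokens.extend([token[:-1], "."])
        else:
            tokens.append(token)
    return tokens


print(tokenize("A text-based medium."))
print(tokenize("etc. and Calif."))
print(tokenize("Phone: 0171 378 0647"))
```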
Parts of Speech (POS)
• POS: category / class
• Words in same class have similar syntactic
behavior
• Ex: Noun: person, place, thing, animal
• Ex: verbs express action
Ex: Penn Treebank POS Tagset
Tag    Description                          Example
CC     Coord. conjunction                   and, but, or
CD     Cardinal number                      one, two
DT     Determiner                           a, the
EX     Existential there                    there
FW     Foreign word                         mea culpa
IN     Prep. / subordinating conjunction    of, in, by
JJ     Adjective                            yellow
JJR    Adjective, comparative               bigger
JJS    Adjective, superlative               wildest
LS     List item marker                     1, 2, One
MD     Modal                                can, should
NN     Noun, singular                       dog
NNS    Noun, plural                         dogs
NNP    Proper noun, singular                IBM
NNPS   Proper noun, plural                  West Indies
PDT    Predeterminer                        all, both
POS    Possessive ending                    ´s
PRP    Personal pronoun                     I, you, he
RB     Adverb                               quickly, never
RBR    Adverb, comparative                  faster
RBS    Adverb, superlative                  fastest
RP     Particle                             up, off
SYM    Symbol                               +, %, &
TO     To                                   to
UH     Interjection                         ah, oops
VB     Verb, base form                      eat
VBD    Verb, past tense
VBG    Verb, gerund
VBN    Verb, past participle
VBP    Verb, non-3rd person
VBZ    Verb, 3rd person
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-
WRB    Wh-adverb
$
#
(
)
Chunking
• Words are organized into groups
• Phrases: word groupings, clumped as a unit
[S [NP That man]
   [VP [VBD caught]
       [NP the butterfly]
       [PP [IN with] [NP a net]]]]
Parsing
• Labeled syntactic
tree corresponding to
the interpretation of
the sentence
• Resolution of
syntactic ambiguities
Sense Disambiguation
Time flies like an arrow
[S [NP Time] [VP flies [PP like [NP an arrow]]]]
Fruit flies like a banana
[S [NP Fruit flies] [VP like [NP a banana]]]
What are Some Basic
Recognition Tasks?
IE Recognition Tasks
MUC Recognition Tasks
• Named Entity (NE)
• Co-reference Resolution (CO)
• Template Element Construction (TE)
• Template Relation Construction (TR)
• Scenario Template (ST)
ACE Recognition Tasks
• Entity detection and tracking (EDT)
• Relation detection and characterization (RDC)
• Event detection and characterization (EDC)
• Temporal expression detection (TERN)
Year
• MUC-1: 1987
• MUC-2: 1989
• MUC-3: 1991
• MUC-4: 1992
• MUC-5: 1993
• MUC-6: 1995
• MUC-7: 1998
• ACE Pilot: 1999
• Event ACE: 2002 ...
• ACE + Text Analysis Conference (TAC): 2009
Named Entity Recognition (NE)
• recognition of entity
names:
– people, organizations
– place names
– temporal expressions &
numerical expressions
Co-reference Resolution (CO)
• Identify chains of noun phrases that refer
to the same object
• Example: John saw Mary. The girl was very beautiful; she wore a new red dress.
• Scope:
– Within document
– Across document
• Types:
– Pronominal: 'they', 'it', 'he', 'hers', 'themselves', etc. resolve to
proper nouns, common nouns, other pronouns
Proper Noun Coreference
• Names of people, places, products and
companies referred to in many different
variations.
3M
Minnesota Mining and Manufacturing
3M Corp.
NYC
New York
New York City
N.Y.C
Ref: Coreference as a Foundation for Link
Analysis over Free Text
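As a rough illustration of matching name variants, the sketch below treats two mentions as aliases when they are equal after stripping punctuation, or when one is the acronym of the other. The helper names are hypothetical and the rule is deliberately simple: it links "NYC" and "N.Y.C" to "New York City", but not "New York" to "New York City" or "3M" to "Minnesota Mining and Manufacturing". Variants like those need a dictionary or a learned model.

```python
import re


def acronym(name):
    """Initials of the capitalized words, e.g. 'New York City' -> 'NYC'."""
    words = re.findall(r"[A-Za-z]+", name)
    return "".join(w[0] for w in words if w[0].isupper())


def normalize_mention(mention):
    """Upper-case the mention and drop everything except letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", mention).upper()


def is_alias(a, b):
    na, nb = normalize_mention(a), normalize_mention(b)
    if na == nb:
        return True
    acr_a, acr_b = acronym(a).upper(), acronym(b).upper()
    # One mention is the acronym of the other (empty acronyms never match).
    return (acr_b != "" and na == acr_b) or (acr_a != "" and nb == acr_a)


print(is_alias("NYC", "New York City"))
print(is_alias("N.Y.C", "New York City"))
print(is_alias("3M", "Minnesota Mining and Manufacturing"))
```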
Other Coreference Types
• Apposition:
– noun phrases, side by side
– one defines or modifies the other
– Example: John Smith, chairman of General Electric, resigned yesterday.
• Predicate Nominal:
– noun phrase is the main predicate of a sentence
– subject and predicate nominal are connected by a
linking verb (copula)
– Example: John is the finest juggler in the world.
Template Element Construction (TE)
• Specified classes and attributes
of entities, e.g. person:
– name (name variants)
– title, nationality
– description in the text
– subtype
Template Relation Construction (TR)
• Two-slot template
representing a binary
relation:
– e.g., employee_of,
product_of, location_of
– pointers to template
elements
Fei-Yu Xu 08
Scenario Template Production (ST)
• information involving
several relations or events:
– Joint venture
– Partners
– Products
– Profits
Fei-Yu Xu 08
Can We Extract
Temporal Expressions?
Temporal expression detection (TERN)
• Time Expression Recognition and Normalization
– recognize and normalize expressions that refer to date
and time
– Timestamp of events
– Meaning of temporal expressions
– Conditions associating time with a relation / event
• TIMEX2 Standard
• XML tags + time
• second generation TIMEX
Some Examples: TIMEX2 Time
Thursday, July 15, 1999
Precise Time:
I was sick <TIMEX2 VAL="1999-07-14"> yesterday </TIMEX2>.
Duration:
I will be on vacation for <TIMEX2 VAL="P3W" ANCHOR_DIR="AFTER"
ANCHOR_VAL="1999-07-15"> three weeks </TIMEX2>.
Pronouns:
The contractor submitted a proposal on <TIMEX2 VAL="1999-07-13">
Tuesday </TIMEX2>.
<TIMEX2 VAL="1999-07-14"> The day after <TIMEX2 VAL="1999-07-13">
that </TIMEX2> </TIMEX2>, the contract was awarded.
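The VAL attributes in these examples can be computed from the reference date. The sketch below normalizes a few expression types relative to an anchor date; the function name, the small word lists and the rule that a bare weekday name means the most recent such day before the anchor are assumptions made for illustration, not part of the TIMEX2 standard.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
UNIT_CODES = {"day": "D", "week": "W", "month": "M", "year": "Y"}


def normalize_timex(expression, anchor):
    """Return a TIMEX2-style VAL string for the expression, or None if unknown."""
    text = expression.lower().strip()
    if text == "today":
        return anchor.isoformat()
    if text == "yesterday":
        return (anchor - timedelta(days=1)).isoformat()
    if text in WEEKDAYS:
        # Assumption: the most recent such weekday strictly before the anchor.
        delta = (anchor.weekday() - WEEKDAYS.index(text)) % 7 or 7
        return (anchor - timedelta(days=delta)).isoformat()
    parts = text.split()
    if len(parts) == 2 and parts[0] in NUMBER_WORDS:
        unit = parts[1].rstrip("s")
        if unit in UNIT_CODES:
            # Durations use the ISO 8601 period form, e.g. P3W.
            return f"P{NUMBER_WORDS[parts[0]]}{UNIT_CODES[unit]}"
    return None


anchor = date(1999, 7, 15)  # Thursday, July 15, 1999
for expr in ["yesterday", "Tuesday", "three weeks"]:
    print(expr, "->", normalize_timex(expr, anchor))
```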
State of the Art Performance
• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• N-ary relation extraction, event detection
– Much lower -> errors accumulate!
How Can Information
Extraction Be Performed?
Common IE Techniques
• Knowledge Engineering
• Corpus Based / Machine Learning
Classification for IE
• Many problems needed for IE can be reformulated as a
classification problem
• Input: Training Data
• Classifier: Learning Algorithm
• Output: Hypothesis that fits the data
• Features: object description, context
• Class: to which the object belongs
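To make the input / classifier / output scheme concrete, here is a toy sketch that casts sentence-boundary detection as binary classification. The perceptron is used only because it is short; it is a stand-in, not a method the slides prescribe. The two features, the abbreviation set and the four training examples are all invented for illustration.

```python
# Assumption: a tiny abbreviation lexicon used as one of the features.
ABBREVIATIONS = {"mr.", "p.m."}


def features(prev_token, next_token):
    """Describe a candidate boundary: bias, next word capitalized, previous token is an abbreviation."""
    return [
        1.0,
        float(next_token[:1].isupper()),
        float(prev_token.lower() in ABBREVIATIONS),
    ]


def score(weights, x):
    return sum(w * v for w, v in zip(weights, x))


def train_perceptron(examples, epochs=10):
    """Input: labeled training data. Output: a weight vector (the hypothesis)."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for prev_token, next_token, label in examples:
            x = features(prev_token, next_token)
            predicted = int(score(weights, x) > 0)
            # Standard perceptron update: move weights only on mistakes.
            weights = [w + (label - predicted) * v for w, v in zip(weights, x)]
    return weights


def is_boundary(weights, prev_token, next_token):
    return score(weights, features(prev_token, next_token)) > 0


# (token ending in a period, following token, 1 = sentence boundary)
TRAINING = [
    ("responded.", "The", 1),
    ("Mr.", "White", 0),
    ("p.m.", "in", 0),
    ("1.", "introduction", 0),
]

weights = train_perceptron(TRAINING)
print(is_boundary(weights, "Mr.", "Green"))
print(is_boundary(weights, "arrived.", "She"))
```

The same pattern (describe each candidate with features, learn from labeled examples, apply the hypothesis) carries over to named entity or semantic role classification with richer feature sets.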
Classification Scheme
• The class / semantic distinction that we want
to assign to an information unit:
– Named Entity: protein, drug, disease
– Semantic Role: e.g., verb: agent
– Grammatical Role: object, subject
– Domain Independent: person, organization
– Sentence boundary: {!, ., -}
Ex: Features
Semantic Role Recognition
Feature: Value
• Phrase type: noun / verb phrase, determined by the POS tag of the
syntactic head
• Syntactic head: word that composes the syntactic head of the phrase that
represents i
• Voice: active or passive
• Named Entity Class: class (person, organization) of the syntactic head
The actual set of features used is determined by a feature
selection strategy, specific to the problem at hand
Moens06
Ex. Features
Coreference Resolution (CO)
Feature: Value
• Number Agreement: true if i and j agree in number
• Gender Agreement: true if i and j agree in gender
• Alias: true if i is an alias of j, or vice versa
• Pronoun i (j): true if i (j) is a pronoun
• Appositive: true if j is an appositive of i
• Definiteness: true if j is preceded by "the" or a demonstrative pronoun
• Grammatical Role: true if the grammatical roles of i and j match,
e.g. subject, direct / indirect object
• Proper name: true if both are proper names
• Named entity class: true if both have the same semantic class
• Discourse distance: number of sentences or words that i and j are apart
Moens06
Do It Yourself: IE Task
• A sample of text from the Wall Street Journal is given, together
with a template
• The task is to fill the template with information about
succession events extracted from the text
• There are six events in total, although complete information is
not available for all of them
Text:
New York Times Co. named Russell T.
Lewis, 45, president and general
manager of its flagship New York Times
newspaper, responsible for all
business-side activities.
He was executive vice president and
deputy general manager. He succeeds
Lance R. Primis, who in September was
named president and chief operating
officer of the parent.
Template:
<ORGANIZATION-1>
NAME: "New York Times Co."
<ORGANIZATION-2>
NAME: "New York Times"
<PERSON-1>
NAME: "Russell T. Lewis"
<PERSON-2>
NAME: "Lance R. Primis"
http://gate.ac.uk/ie/ie_example.html
Some Techniques : At a Glance
Classification:
• Maximum Entropy
• Hidden Markov
• Conditional Random Field
• Tree Learning
• Relational Learning
• Support Vector Machine
What Tools Can I Use to
Perform Information
Extraction?
An IE Toolkit: Lexical Resources
Machine-readable corpora, dictionaries, etc.,
and tools for processing them
• Tools: GATE, UIMA; Tagger, NER, Parser
• Dictionary: WordNet, VerbNet, Comlex
• Treebank: Brown, GENIA, Penn Treebank, Linguistic Data Consortium (LDC)
• Ontology: BCO, UMLS, Open Biomedical Ontology
Part III: Evaluation in Information Extraction
Evaluation
• We evaluate our systems to:
– See how they are behaving w.r.t. the
gold standard
– Compare them with other systems
• Types of Evaluations:
– Intrinsic: specific to extraction task
– Extrinsic: task on which extraction relies,
e.g.: Information Retrieval task
Evaluation Precision / Recall
              Expert Yes   Expert No
System Yes    TP           FP
System No     FN           TN
• Recall = TP / (TP + FN): fraction of correct/relevant
answers which are predicted
• Precision = TP / (TP + FP): fraction of predictions
which are correct/relevant
• Fall Out = FP / (FP + TN): proportion of incorrect class members
predicted by the system, given the number of incorrect
class members, i.e., Expert No
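The three ratios follow directly from the four cells of the table. A short sketch, with made-up counts and a hypothetical helper name, that also guards against empty denominators:

```python
def evaluate(tp, fp, fn, tn):
    """Compute recall, precision and fall-out from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Convention assumed here: report 0.0 when the denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "fall_out": ratio(fp, fp + tn),
    }


# Invented counts: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
scores = evaluate(tp=8, fp=2, fn=4, tn=6)
print(scores)
```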
F Measure
Combined measure for Precision and
Recall:
$$F = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$
P = precision
R = recall
β = a factor that indicates the relative
importance of recall and precision
When β = 1, recall and precision are of
equal importance => harmonic mean
(F1-measure)
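The F measure, F = (B² + 1)PR / (B²P + R), translates into a one-line function. The name `f_measure` and the convention of returning 0.0 when both precision and recall are zero are choices made for this sketch.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted combination of precision and recall.

    beta > 1 gives more weight to recall, beta < 1 to precision,
    and beta = 1 yields the harmonic mean (F1).
    """
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (beta_sq + 1) * precision * recall / denominator


print(f_measure(0.8, 0.5))          # F1
print(f_measure(0.5, 1.0, beta=2))  # recall-weighted
```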
What Other Types of
Metrics Exist Besides
Precision and Recall?
Vilain Metric : Pron. Coreference
• Equivalence class evaluation
– Groups built by the system are compared against the gold
standard (Key)
– Compare equivalence classes defined by links in the
Key with those computed by the system (Response)
A Model-Theoretic Coreference Scoring Scheme
Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, Lynette
Hirschman
Example: John saw Mary. He thought she was a very beautiful girl
and she wore a new red dress.
Coreference Chains: John – he; Mary – she – girl
Vilain Recall: Concepts
Key Links: <A-B, B-C>
Response Links: { (A-C) }
S: equivalence class relative to Key
• S = {A, B, C}, where |S| = 3
p(S): Response partition on S (from Key)
• intersection of S and Response
• elements in Key, not in Response
• p(S) = { (A-C), (B) }, so |p(S)| = 2
c(S): minimal number of "correct links"
to generate S
• c(S) = (|S| - 1) = 2
m(S): number of "missing" Response links
• m(S) = (|p(S)| - 1) = 1
Vilain: Recall / Precision
Recall: computed over the Key equivalence classes, counting the links
that must be added to the Response
Precision: computed over the Response equivalence classes, counting the links
that must be added to the Key
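The concepts S, p(S), c(S) and m(S) combine into the link-based score of Vilain et al.: recall is the sum of (|S| - |p(S)|) over all Key classes divided by the sum of (|S| - 1), and precision swaps the roles of Key and Response. The sketch below follows that definition; the function names are hypothetical, and it assumes the classes within the Key (and within the Response) are disjoint.

```python
def vilain_recall(key, response):
    """key, response: lists of sets of mentions (equivalence classes)."""
    numerator = 0
    denominator = 0
    for s in key:
        # p(S): the pieces of S induced by the response classes; mentions
        # that the response does not cover become singleton pieces.
        pieces = set()
        for mention in s:
            owner = next((r for r in response if mention in r), None)
            if owner is None:
                pieces.add(frozenset([mention]))
            else:
                pieces.add(frozenset(owner & s))
        numerator += len(s) - len(pieces)   # c(S) - m(S)
        denominator += len(s) - 1           # c(S)
    return numerator / denominator if denominator else 0.0


def vilain_precision(key, response):
    # Same computation with Key and Response exchanged.
    return vilain_recall(response, key)


key = [{"A", "B", "C"}]       # Key links A-B, B-C
response = [{"A", "C"}]       # Response link A-C
print(vilain_recall(key, response))
print(vilain_precision(key, response))
```

For the slide's example this gives recall (3 - 2) / (3 - 1) = 0.5, since one link is missing from the Response, and precision 1.0, since the single Response link is consistent with the Key.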
Do It Yourself: Vilain Metric
Part IV: Exploiting Information Extraction with
IR in Social Applications
IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment
Classify
Associate
Cluster
Database
Load DB
Document collection
Train extraction models
Label training data
Query,
Search
Datamine
What does an Entity
Extraction Scenario Look
Like?
Scenario I: OKKAM tackling the Flood
of Identifiers
http://www.reuters.com/news/globalcoverage/barackobama
http://en.wikipedia.org/wiki/Barack_obama
http://www.OPENCALAIS.com/watch?v=z4W2_raF_iw
??
http://www.facebook.com/home.php#/barackobama?ref=s
http://dbpedia.org/resource/Barack_Obama
http://www.linkedin.com/in/barackobama
http://farm4.static.flickr.com/3193/2437394249_824e76ed76.jpg?v=0
http://current.com/index.php/items/89822170/obama_to_sign_stimulus_bill_today_in_denver.htm
OKKAM & Information Extraction
Information Extraction & OKKAMization
http://www.okkam.org/
• NER: detect named entity
• decide about type (e.g. Person)
• send ID request to the OKKAM ENS (based on entity
name, type + context information)
• OKKAM returns an OKKAM ID (or list of candidates), e.g.
http://www.okkam.org/ens/idb3016709-b9e1-42c0-ac5f-6383d2e5b235
• attach ID to entity reference in text
=> prepare for information integration,
entity-centric search,
semantic infusion (attachment
of information about entity)
What Does an Event
Extraction Scenario Look
Like?
Scenario II: Epidemic Intelligence
Goal: early identification of
potential health threats:
• verification, assessment,
investigation
State of the Art: Event-Based
• web data
• NLP, Data Mining, Machine
Learning techniques
• extract epidemic events from
the unstructured text..
• News, domain-specific reports,
blogs
Event Mining for Early Detection, Rapid
Response ...
How Can Events Be Used in
Pharos Audio-Visual
Search?
Scenario III: Facets in Pharos
• Event-Centric Search / Browsing
– Document representation no longer Bag-of-Words:
– Events => N-ary relations between entities or classes
Scenario III: Extraction from Informal
Text
• Transcribed Speech
– Discourse structure of "speech text" differs from
written text
– Transcription errors
– Missing orthographic features
• Sentence Boundaries difficult to detect
• Automatic Speech Recognition (ASR) Vocabulary Problem
• Blogs
– Affective, opinionated
– Topic fluctuating, prose
– Many authors, different styles
– Inconsistent capitalization patterns
– Malformed sentences & phrases, slang, ...
Part V: Wrap Up & Conclusion
What Considerations Do I
Need to Make for My
Information Extraction
System?
Consideration for IE System
Dimension: Description
document structure of the
input text
• free text
• semi-structured
richness of the natural
language processing (NLP)
• shallow NLP
• deep NLP
complexity of the pattern
rules
• single slot
• multiple slots
data size
• training data
• application data
degree of automation
• supervised
• semi-supervised
• unsupervised
type of evaluation
• gold standard corpus?
• evaluation measures used ?
• evaluation of machine learning
What Are Some Important
Directions in Information
Extraction?
Research Trends in IE
Concept
Description
[1] Semi- / Un-supervised, Self-Learning
Supervised methods assume:
• annotated documents
• broad coverage
• sufficient data redundancy
[2] Open Information Extraction
• Target relations not known in advance
[3] Web-Scale Systems
• Number of relations is large
Research trends in IE
• Self-supervised Information Extraction at
Web Scale
– KnowItAll: extracting a closed set of relations
[Etzioni 2005]
– TextRunner: extracting an open set of relations
[Banko 2007]
– Open IE: The Tradeoffs Between Open and
Traditional Relation Extraction [Banko 2008]
– SRES [Feldman 2006], LEILA [Suchanek 2006]:
extracting a closed relation set with more elaborate
linguistic preprocessing
• Scalability:
– Large set of seed relations (e.g. entire IMDB)
– Open-ended corpora
• Noise: incorrect seed interpretations
In Summary ....
Information is No Longer Locked
Away...
Information
System
Human
Decision
System
Pre-Filtering
Information
Extraction
Events, Facts, Relationships,
Opinions
Social Application
Integration
IR and IE Tradeoffs
• IE needs more CPU power; find a suitable tradeoff
between data size, analysis depth,
complexity, time, etc.
• Deeper analysis and complex template
structures consume more time than shallow
analysis and simple named entity recognition
or binary relation extraction
• Ease of use needs improvement
… Lighting the Way …
IE is acknowledged as an urgently needed information
technology in a constantly growing digitized world
The winners in a globalized information
society?
… Those who outstrip their competitors through comprehensive,
integrated and precise access to digital
information for decision-making processes!
Thank You
Useful Tools
• ANNIE : Information Extraction System
– http://gate.ac.uk/ie/annie.html
• Stanford Parser
– http://nlp.stanford.edu:8080/parser/
• WhatsWrongWithMyNLP?
– http://code.google.com/p/whatswrong/
• LingPipe
– http://alias-i.com/lingpipe/html
– http://www-nlp.stanford.edu/downloads/
Useful Links
• Software Tools for NLP
– http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/nlp_tools.html
• Statistical NLP / corpus-based computational
linguistics resources
– http://nlp.stanford.edu/links/statnlp.html
• Stanford NLP Group
– http://www-nlp.stanford.edu/downloads/
• Linguist List - Language and Resources
– http://www.linguistlist.org/langres/index.html
Selected References
• Foundations of Statistical Natural Language Processing, Manning and Schütze
• Information Extraction, Moens
• Text Mining Handbook, Feldman, Sanger
• Maximum Entropy Model for NLP, Ratnaparkhi