chapter 11 Information Extraction

Information Extraction:
Beyond Document Retrieval
Robert Gaizauskas and Yorick Wilks
Computational Linguistics and
Chinese Language Processing
vol. 3, no. 2, 1998, pp. 17-60
Journal of Documentation, Vol 54, No.
1, 1998, pp. 70-105.
IE (Wilks)-1
IE and IR
• IE
– extracting pre-specified sorts of information
from short, natural language texts
– example
• business newswire texts for retirements,
appointments, promotions, …
• extract the names of the participating companies and
individuals, the post involved, the vacancy reason,
and so on
IE and IR (Continued)
– Populating a structured information source (or
database) from an unstructured, or free text,
information source
– the structured database is used
• for searching or analysis using conventional
database queries or data-mining techniques
• for generating a summary
• for constructing indices into the source texts
• ...
IE and IR (Continued)
• IR
– Given a user query, the system selects a relevant subset of
documents from a larger set.
– The user then browses the selected documents in
order to fulfil his or her information need.
• Differences
– IR retrieves relevant documents from collections
– IE extracts relevant information from documents
Combining IR and IE
(a) an IR query
chief executive officer had president chairman post succeed name
(b) a retrieved text
<DOC>
<DOCNO> 940413-0062. </DOCNO>
<HL> Who’s News: @ Burns Fry Ltd. </HL>
<DD> 04/13/94 </DD>
<SO> WALL STREET JOURNAL (J), PAGE B10 </SO>
<TXT>
<p>
BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years
old, was named executive vice president and director of fixed
income at this brokerage firm. Mr. Wright resigned as president
of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to
succeed Mark Kassirer, 48, who left Burns Fry last month. A
Merrill Lynch spokeswoman said it has named a successor
to Mr. Wright, who is expected to begin his new position by the end
of the month.
</p> </TXT> </DOC>
(c)
an empty template
<TEMPLATE> :=
DOC_NR:
CONTENT:
<SUCCESSION_EVENT> :=
SUCCESSION_ORG:
POST:
IN_AND_OUT:
VACANCY_REASON:
<IN_AND_OUT> :=
IO_PERSON:
NEW_STATUS:
ON_THE_JOB:
OTHER_ORG:
REL_OTHER_ORG:
<ORGANIZATION> :=
ORG_NAME:
ORG_ALIAS:
ORG_DESCRIPTOR:
ORG_TYPE:
ORG_LOCALE:
ORG_COUNTRY:
<PERSON> :=
PER_NAME:
PER_ALIAS:
PER_TITLE:
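The empty template above can be mirrored as a set of nested record types. The following is only a minimal sketch in Python, assuming one dataclass per template object and lower-cased slot names; the class names and the helper `unfilled_slots` are illustrative and not part of the MUC definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Organization:
    org_name: Optional[str] = None
    org_alias: Optional[str] = None
    org_descriptor: Optional[str] = None
    org_type: Optional[str] = None
    org_locale: Optional[str] = None
    org_country: Optional[str] = None


@dataclass
class Person:
    per_name: Optional[str] = None
    per_alias: Optional[str] = None
    per_title: Optional[str] = None


@dataclass
class InAndOut:
    # The person slot is called IO_PERSON in the filled fragment.
    io_person: Optional[Person] = None
    new_status: Optional[str] = None
    on_the_job: Optional[str] = None
    other_org: Optional[Organization] = None
    rel_other_org: Optional[str] = None


@dataclass
class SuccessionEvent:
    succession_org: Optional[Organization] = None
    post: Optional[str] = None
    in_and_out: List[InAndOut] = field(default_factory=list)
    vacancy_reason: Optional[str] = None


@dataclass
class Template:
    doc_nr: Optional[str] = None
    content: List[SuccessionEvent] = field(default_factory=list)


def unfilled_slots(obj) -> List[str]:
    """Hypothetical helper: names of slots that are still None or an empty list."""
    return [name for name, value in vars(obj).items() if value in (None, [])]


print(unfilled_slots(Template()))  # ['doc_nr', 'content']
```

Filling the template then amounts to creating these objects and linking them, in the same way the filled fragment links objects by identifier.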
(d)
a fragment of the filled template
<TEMPLATE-9404130062-1> :=
DOC_NR: “9404130062”
CONTENT: <SUCCESSION_EVENT-9404130062-1>
<SUCCESSION_EVENT-9404130062-1> :=
SUCCESSION_ORG: <ORGANIZATION-9404130062-1>
POST: “executive vice president”
IN_AND_OUT: <IN_AND_OUT-9404130062-1>
<IN_AND_OUT-9404130062-2>
VACANCY_REASON: OTH_UNK
<IN_AND_OUT-9404130062-2> :=
IO_PERSON: <PERSON- 9404130062-1>
NEW_STATUS: IN
ON_THE_JOB: NO
OTHER_ORG: <ORGANIZATION- 9404130062-2>
REL_OTHER_ORG: OUTSIDE_ORG
<ORGANIZATION-9404130062-1> :=
ORG_NAME: “Burns Fry Ltd.”
ORG_ALIAS: “Burns Fry”
ORG_DESCRIPTOR: “this brokerage firm”
ORG_TYPE: COMPANY
ORG_LOCALE: Toronto CITY
ORG_COUNTRY: Canada
<ORGANIZATION-9404130062-2> :=
ORG_NAME: “Merrill Lynch Canada Inc.”
ORG_ALIAS: “Merrill Lynch”
ORG_DESCRIPTOR: “a unit of Merrill Lynch & Co.”
ORG_TYPE: COMPANY
<PERSON-9404130062-1> :=
PER_NAME: “Donald Wright”
PER_ALIAS: “Wright”
PER_TITLE: “Mr.”
<PERSON-9404130062-2> :=
PER_NAME: “Mark Kassirer”
(e) a summary generated from the filled template
BURNS FRY Ltd. named Donald Wright as executive vice president.
Donald Wright resigned as president of Merrill Lynch Canada Inc.
Mark Kassirer left as president of BURNS FRY Ltd.
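Generating a summary like (e) from a filled template can be done with simple sentence patterns. The sketch below is an illustration only: it assumes the succession event has been flattened into a plain dictionary, and the function names, dictionary keys, and the OUT entry for Mark Kassirer (taken from the source text, since the slide shows only part of the filled template) are assumptions.

```python
def end_sentence(text: str) -> str:
    """Add a final full stop unless the text already ends with one."""
    return text if text.endswith(".") else text + "."


def summarise(event: dict) -> list:
    """Render one sentence per IN_AND_OUT entry of a succession event."""
    org = event["succession_org"]
    post = event["post"]
    sentences = []
    for io in event["in_and_out"]:
        person = io["io_person"]
        if io["new_status"] == "IN":
            sentences.append(end_sentence(f"{org} named {person} as {post}"))
        elif io["new_status"] == "OUT":
            sentences.append(end_sentence(f"{person} left as {post} of {org}"))
    return sentences


# Hypothetical flattened version of the filled template fragment.
event = {
    "succession_org": "Burns Fry Ltd.",
    "post": "executive vice president",
    "in_and_out": [
        {"io_person": "Donald Wright", "new_status": "IN"},
        {"io_person": "Mark Kassirer", "new_status": "OUT"},  # assumed entry
    ],
}

for sentence in summarise(event):
    print(sentence)
```

A real generator would also use OTHER_ORG and REL_OTHER_ORG to produce sentences such as the "resigned as president of ..." line.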
History of Information Extraction
• Early work on template filling
– work carried out or under way before the
DARPA programme
• work carried out in response to the DARPA
MUC programme
• recent work on IE outside the DARPA
programme
Early Work on Template Filling
• The Linguistic String Project at New York
University
– Derive information formats (regularised table-like forms) from the profusion of natural
language forms
– Permit “fact retrieval” (as opposed to document
retrieval) on such a database
Early Work on Template Filling
(Continued)
– the information formats are not predefined a
priori by experts in the field
– the information formats are induced by using
distributional analysis to discover word classes
in a set of texts of a sub-language
Early Work on Template Filling
(Continued)
• Language understanding research at Yale
University by Roger Schank
– stories followed certain stereotypical patterns
called scripts
– knowing the script, language comprehenders
are able to fill in details and make inferential
leaps where the information required to make
the leap is not present in the text
– first attempt using this approach: FRUMP
(Gerald De Jong)
Message Understanding
Conferences (Continued)
• MUC-1 (May 1987, San Diego)
– six systems participated
– tactical naval operations reports on ship
sightings and engagements
– 12 training reports, 2 unseen messages
• MUC-2 (May 1989, San Diego)
– eight systems participated
– the same domain as MUC-1
– 105 training messages, 20 blind messages (1st
run), 5 blind messages (2nd run)
– a template and fill rules for the slots
Message Understanding
Conferences (Continued)
• MUC-3 (May 1991, San Diego)
– fifteen systems participated
– newswire stories about terrorist attacks in nine
Latin American countries
– 1,300 development texts,
three blind test sets of 100 texts
– a template consisting of 18 slots
– formal evaluation criteria (precision & recall)
– semi-automated scoring program available
Message Understanding
Conferences (Continued)
• MUC-4 (June 1992, McLean, Virginia)
– seventeen sites participated
– domain and template structures unchanged
– changes to the task definitions, corpus,
measures of performance, and test protocols
Message Understanding
Conferences (Continued)
• MUC-5 (August 1993, Baltimore, Maryland)
– 17 systems participated (14 American, 1 British,
1 Canadian, 1 Japanese)
– financial newswire stories and microelectronics
products announcements
– English and Japanese
– development and test corpora increased
– new evaluation metrics and scoring programs
Message Understanding
Conferences (Continued)
• MUC-6 (Nov 1995, Columbia, Maryland)
– 17 sites took part
– named entity recognition, coreference
identification, template element and scenario template
extraction tasks
– management succession events in financial
news stories
Task complexity measures
• text corpus complexity (vocabulary size,
average sentence length)
• text corpus dimensions (volume of texts,
total number of sentences/words)
• template characteristics (number of object
types, number of slots)
• difficulty of tasks (hard to measure; approximated by
the number of pages of relevance
rules and template fill definitions)
Evaluation Metrics
• Recall
– a measure of the fraction of the required information that
has been correctly extracted
• Precision
– a measure of the fraction of the extracted information that is
correct
• Beyond Precision and Recall
– correct, partially correct, incorrect, missing, spurious, noncommittal
– overgeneration
• fraction of the extracted information that is spurious
– undergeneration
• fraction of the information that should have been extracted that is missing
– substitution
• fraction of the nonspurious extracted information that is not correct
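The measures above can be computed from per-category slot counts. The slides give only verbal definitions, so the sketch below rests on assumptions: it follows the MUC scoring convention, as best recalled, of giving half credit to partially correct fills, it ignores noncommittal fills, and the names `SlotCounts` and `muc_scores` and the example counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SlotCounts:
    correct: int
    partial: int
    incorrect: int
    missing: int
    spurious: int


def ratio(numerator: float, denominator: float) -> float:
    """Safe division: an empty denominator yields 0.0."""
    return numerator / denominator if denominator else 0.0


def muc_scores(c: SlotCounts) -> dict:
    possible = c.correct + c.partial + c.incorrect + c.missing   # fills in the answer key
    actual = c.correct + c.partial + c.incorrect + c.spurious    # fills the system produced
    matched = c.correct + 0.5 * c.partial                        # assumed half credit for partial
    nonspurious = c.correct + c.partial + c.incorrect
    return {
        "recall": ratio(matched, possible),
        "precision": ratio(matched, actual),
        "overgeneration": ratio(c.spurious, actual),
        "undergeneration": ratio(c.missing, possible),
        "substitution": ratio(c.incorrect + 0.5 * c.partial, nonspurious),
    }


# Hypothetical counts, for illustration only.
scores = muc_scores(SlotCounts(correct=6, partial=2, incorrect=2, missing=2, spurious=4))
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts there are 12 possible and 14 actual fills, so precision is 7/14 and recall is 7/12.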
MUC-5
• Tasks
– two domains: joint ventures and microelectronics
– two languages: Japanese and English
– acronyms: EJV, JJV, EME, JME
• Resources
– EJV materials: Wall Street Journal, Lexis/Nexis,
PROMT
– gazetteer of place names, list of corporate names
and nationalities, list of corporate designators, list
of countries, list of nationalities, list of
international organizations, definitions of standard
industry codes, list of currency names/nationalities,
list of female forenames, list of male forenames,
CIA World Factbook
MUC-6
• Tasks
– named entity recognition
• recognition and classification of definite named entities
such as organizations, persons, locations, dates and
monetary amounts
• <enamex type=“organization”>Bridgestone Sports
Co.</enamex> said <timex
type=“date”>Friday</timex> it has set up a joint
venture in <enamex
type=“location”>Taiwan</enamex> with a local
concern and a Japanese trading house to produce golf
clubs to be shipped to <enamex type=“location”>Japan</enamex>
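Annotated output of this kind is easy to read back into a list of typed entities. The following is a rough sketch, assuming the markup uses plain ASCII double quotes (the slide shows typographic quotes) and only the element names `enamex`, `timex` and `numex`; the function names are illustrative.

```python
import re

# Matches <enamex type="...">...</enamex> and the timex/numex equivalents.
TAG = re.compile(r'<(enamex|timex|numex)\s+type="([^"]+)">(.*?)</\1>', re.DOTALL)


def extract_entities(text: str) -> list:
    """Return (element, type, surface string) triples in order of appearance."""
    return [
        (m.group(1), m.group(2), " ".join(m.group(3).split()))
        for m in TAG.finditer(text)
    ]


def strip_markup(text: str) -> str:
    """Remove the annotation tags, keeping the annotated text."""
    return TAG.sub(lambda m: m.group(3), text)


sample = (
    '<enamex type="organization">Bridgestone Sports Co.</enamex> said '
    '<timex type="date">Friday</timex> it has set up a joint venture in '
    '<enamex type="location">Taiwan</enamex> with a local concern'
)

for entity in extract_entities(sample):
    print(entity)
print(strip_markup(sample))
```

Comparing such triples for a system response and an answer key is one way to obtain the correct, missing and spurious counts used in scoring.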
MUC-6 (Continued)
– coreference resolution
• identification of expressions in the text that referred
to the same object, set or activity
• <coref id=“100”>Galactic Enterprises</coref> said
<coref id=“101” type=“ident” ref=“100”>it</coref>
would build a new space station before the year
2016
– template element filling
– scenario template filling
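The coreference markup links each mention to an antecedent through its `ref` attribute, so mentions can be grouped into chains by following those links. A minimal sketch, assuming ASCII double quotes in the attributes and no nested `coref` elements; the function name `coref_chains` is illustrative.

```python
import re

COREF = re.compile(r'<coref\s+([^>]*)>(.*?)</coref>', re.DOTALL)
ATTR = re.compile(r'(\w+)="([^"]*)"')


def coref_chains(text: str) -> dict:
    """Group mentions into chains keyed by the id of the first mention."""
    mentions = {}  # mention id -> surface string
    parent = {}    # mention id -> id of the mention it refers to
    for m in COREF.finditer(text):
        attrs = dict(ATTR.findall(m.group(1)))
        mentions[attrs["id"]] = m.group(2).strip()
        if "ref" in attrs:
            parent[attrs["id"]] = attrs["ref"]

    def root(mention_id: str) -> str:
        # Follow ref links back to the start of the chain (guarding against cycles).
        seen = set()
        while mention_id in parent and mention_id not in seen:
            seen.add(mention_id)
            mention_id = parent[mention_id]
        return mention_id

    chains = {}
    for mention_id, surface in mentions.items():
        chains.setdefault(root(mention_id), []).append(surface)
    return chains


sample = (
    '<coref id="100">Galactic Enterprises</coref> said '
    '<coref id="101" type="ident" ref="100">it</coref> '
    'would build a new space station'
)
print(coref_chains(sample))  # {'100': ['Galactic Enterprises', 'it']}
```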
The Generic IE System
• text zoner
– divide the input text into a set of segments
• preprocessor
– convert a text segment into a sequence of sentences,
where each sentence is a sequence of lexical items, with
associated lexical attributes (e.g., part-of-speech)
• filter
– eliminate some of the sentences from the previous stage
by filtering out irrelevant ones
• preparser
– detect reliable small-scale structures in sequences of
lexical items (e.g., noun groups, verb groups, etc.)
The Generic IE System
• fragment combiner
– turn a set of parse tree or logical form fragments into a
parse tree or logical form for the whole sentence
• semantic interpreter
– generate a semantic structure, meaning representation,
or logical form from a parse tree or parse tree fragments
• lexical disambiguation
– disambiguate any ambiguous predicates in the logical
form
• coreference resolution or discourse processing
– build a connected representation of the text by linking
different descriptions of the same entity in different
parts of the text
• template generator
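The generic system is a chain of stages, each consuming the output of the previous one. The sketch below illustrates only that architecture with the first three stages; the stage implementations are toy stand-ins (blank-line zoning, naive splitting on full stops, keyword filtering) and all function names are assumptions.

```python
from functools import reduce
from typing import Callable, List, Set


def text_zoner(text: str) -> List[str]:
    """Split raw text into segments on blank lines."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]


def preprocessor(segments: List[str]) -> List[List[str]]:
    """Turn segments into sentences, each a list of lower-cased tokens.

    Splitting on '. ' is deliberately naive; a real system needs a proper
    sentence splitter and would attach lexical attributes to each token.
    """
    sentences = []
    for seg in segments:
        for sent in seg.split(". "):
            tokens = sent.strip(" .\n").lower().split()
            if tokens:
                sentences.append(tokens)
    return sentences


def make_filter(keywords: Set[str]) -> Callable:
    """Build a filter stage that keeps sentences containing any keyword."""
    def sentence_filter(sentences: List[List[str]]) -> List[List[str]]:
        return [s for s in sentences if keywords & set(s)]
    return sentence_filter


def run_pipeline(stages: List[Callable], data):
    """Feed the data through each stage in turn."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


stages = [text_zoner, preprocessor, make_filter({"named", "resigned"})]
text = "Donald Wright was named director. He starts soon.\n\nThe weather was fine."
print(run_pipeline(stages, text))
```

Later stages (preparser, fragment combiner, semantic interpreter, and so on) would be appended to the `stages` list in the same way.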
LaSIE: A Case Study
• Lexical Processing
– Tokenisation
• text segmentation: distinguish the document header and segment the text into
paragraphs
• tokenisation: identify which sequences of characters will be treated as individual
tokens
– Sentence splitting
• determine sentence boundaries in the text
• the full stops are not sufficient guides, e.g., Allan J. Smith, Mr.
– Part-of-speech tagging
• process one sentence at a time, and associate with each token one of the 48 part-of-speech tags of the University of Pennsylvania (Penn Treebank) tagset
– Morphological analysis
• determine root forms of nouns and verbs
– Gazetteer lookup
• employ 5 gazetteers (lists of names) to facilitate the process of recognizing and
classifying named entities
• organization names, location names, personal given names, company designators,
and personal titles
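Gazetteer lookup can be sketched as a longest-match search over the token sequence. This is not LaSIE's actual implementation; the function name, the category labels and the tiny gazetteers are assumptions made for illustration.

```python
def gazetteer_lookup(tokens, gazetteers):
    """Return (start, end, category) for the longest gazetteer match at each position.

    `gazetteers` maps a category name to a list of entries; entries may be
    multi-word. Matching is case-insensitive, `end` is exclusive.
    """
    index = {}
    for category, entries in gazetteers.items():
        for entry in entries:
            index[tuple(entry.lower().split())] = category
    max_len = max((len(key) for key in index), default=0)

    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                matches.append((i, i + n, index[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Hypothetical miniature gazetteers.
gazetteers = {
    "organization": ["Merrill Lynch"],
    "location": ["Toronto", "Canada"],
    "given_name": ["Donald", "Mark"],
    "company_designator": ["Ltd.", "Inc."],
    "title": ["Mr."],
}
tokens = ["Mr.", "Wright", "left", "Merrill", "Lynch", "Canada", "Inc."]
print(gazetteer_lookup(tokens, gazetteers))
```

The matches only mark candidate evidence; deciding that "Merrill Lynch Canada Inc." is a single organization name is left to the named entity grammar.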
LaSIE: Parsing
• Parsing with a special named entity grammar
– recognize multi-word structures which identify
organizations, persons, locations, dates, and
monetary amounts
– ORGAN_NP --> ORGAN_NP LOC_NP CDG
Merrill Lynch Canada Inc.
– PERSON_NP --> FIRST_NAME NNP
Donald Wright
– organization(e17), name(e17, “Burns Fry Ltd.”)
LaSIE: Parsing (Continued)
• Parsing with a more general phrasal grammar
– recognize noun phrases, verb phrases, prepositional
phrases, adjective phrases, sentences, and relative
clauses
– [NP Donald Wright], [ADJP 46 years old], [VP [VP was
named][NP executive vice president and director of
fixed income]][PP at this brokerage firm]
– person(e21), name(e21, “Donald Wright”)
name(e22), lobj2(e22,e23)
title(e23, “executive vice president”)
firm(e24), det(e24, this)
LaSIE: Parsing (Continued)
• Select a “best parse” from the set of partial,
fragmentary, and possibly overlapping
phrasal analyses
– choose that sequence of non-overlapping
phrases of semantically interpretable categories
(sentence, noun phrase, verb phrase and
prepositional phrase) which covers the most
words and consists of the fewest phrases
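The selection criterion above (most words covered, then fewest phrases, no overlaps) can be met with a small dynamic programme over candidate spans. This is a sketch of one way to do it, not necessarily how LaSIE does; phrases are assumed to be `(start, end, label)` triples over token positions with `end` exclusive, and the function name is illustrative.

```python
from bisect import bisect_right

INTERPRETABLE = {"S", "NP", "VP", "PP"}


def best_parse(phrases):
    """Pick non-overlapping interpretable phrases covering the most words,
    preferring fewer phrases when coverage is equal."""
    candidates = sorted(
        (p for p in phrases if p[2] in INTERPRETABLE), key=lambda p: p[1]
    )
    ends = [p[1] for p in candidates]
    # best[k] = (words covered, -number of phrases, chosen phrases)
    # for the best selection among the first k candidates.
    best = [(0, 0, [])]
    for k, (start, end, label) in enumerate(candidates):
        # Count earlier candidates that end at or before this start.
        j = bisect_right(ends, start, 0, k)
        covered, neg_count, chosen = best[j]
        take = (covered + end - start, neg_count - 1, chosen + [(start, end, label)])
        skip = best[k]
        best.append(max(take, skip, key=lambda s: s[:2]))
    return best[-1][2]


# Hypothetical candidate analyses over an eight-word sentence.
phrases = [
    (0, 2, "NP"),
    (2, 4, "VP"),
    (4, 8, "NP"),
    (2, 8, "VP"),
    (5, 8, "PP"),
]
print(best_parse(phrases))  # [(0, 2, 'NP'), (2, 8, 'VP')]
```

Both NP + VP + NP and NP + VP cover all eight words here; the two-phrase analysis wins on the second criterion.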
LaSIE: Discourse Processing
Application Areas of Information Extraction
• Finance
– categorize newswire stories of relevance to stock traders
• Military Intelligence
• Medicine
– help classification of patient records and discharge summaries to
assist in public health research and in medical treatment auditing
• Law
– support intelligent retrieval from legal texts
• Police
– extract information about road traffic incidents from police
incident logs
• Technology/product tracking
– track commodity price changes and factors affecting changes in the
relevant newsfeeds
Application Areas of Information Extraction
(Continued)
• Fault Diagnosis
– extract information from reports of car faults
• Software system requirements specification
– NLP techniques used to assist in the process of deriving formal
software specifications from less formal, natural language
specifications
– the formal specification is viewed as a template which needs to be
filled from a natural language specification, supplemented with a
dialogue with the user
• Academic research
– Academic journals and publications are increasingly becoming
available on-line and offer a prime source of material for IE
technology
Challenges for the future
• Higher precision and recall
• User-defined IE
– permit users to define the extraction task, with the
system then adapting to the new scenario
• Integration with other technologies
– information retrieval
– natural language generation
– machine translation
– data mining