Artificial Intelligence Research Centre
Program Systems Institute
Russian Academy of Sciences
152020 Pereslavl-Zalessky
Russia
AiReC
INEX: Tools for Information Extraction
+7 48535 98065
[email protected]
Information extraction
Objective: extract meaningful information of a pre-specified type from (typically large amounts of) text for further analysis
Output: data structures of a pre-specified format (filled scenario templates)
Examples
Sports report: <winner>, <loser>, <score>, <location>, <date>…
Database of rental accommodation listings: <location>, <rental price>, <number of bedrooms>, <phone number>…
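A filled scenario template can be pictured as a record with one named slot per element. The sketch below is illustrative only: the class name `RentalTemplate`, its field names, and the sample values are assumptions made for this example and are not taken from INEX.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RentalTemplate:
    """Hypothetical scenario template for the rental accommodation example."""
    location: Optional[str] = None
    rental_price: Optional[str] = None
    bedrooms: Optional[int] = None
    phone_number: Optional[str] = None

    def is_complete(self) -> bool:
        # A template counts as complete once every slot has a value.
        return all(value is not None for value in asdict(self).values())


# Invented sample values: only two of the four slots are filled.
partial = RentalTemplate(location="Pereslavl-Zalessky", bedrooms=2)
print(partial.is_complete())  # → False
```

Unfilled slots stay `None`, which makes it easy to tell a partially filled template from a complete one.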
Possible IE application scenarios:
inference of new information (knowledge acquisition)
query formulation and answering in human-computer systems
automatic generation of abstracts and summaries
visualization of document content, etc.
The 'Newsmaking' task
<newsmaker>
<type of newsmaker> (person or organization)
<message>
<type of message> (original, cited, a reference to another newsmaker)
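The four slots of this task could be represented as a typed record, with the two "type" slots restricted to the values listed above. This is a sketch under assumptions: all class and field names are hypothetical, and the example sentence and the message type chosen for it are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class NewsmakerType(Enum):
    PERSON = "person"
    ORGANIZATION = "organization"


class MessageType(Enum):
    ORIGINAL = "original"
    CITED = "cited"
    REFERENCE = "reference to another newsmaker"


@dataclass
class NewsmakingTemplate:
    newsmaker: str
    newsmaker_type: NewsmakerType
    message: str
    message_type: MessageType


# Invented input: "The ministry said that prices would fall."
# Labelling this message as CITED is an assumption of the example.
filled = NewsmakingTemplate(
    newsmaker="The ministry",
    newsmaker_type=NewsmakerType.ORGANIZATION,
    message="prices would fall",
    message_type=MessageType.CITED,
)
print(filled.newsmaker_type.value, "/", filled.message_type.value)
```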
IE system architecture
[Diagram: processing pipeline from input text to results]
Inputs: Collection of texts, Input text
Resources: Linguistic information, Information extraction rules, Coreference resolution rules
Components: Linguistic processor, Semantic analyser
Stages: Tokenisation & sentence segmentation, Morphological analysis, Filtering, Disambiguation (partial), Microsyntactic analysis, Macrosyntactic analysis, Named entity recognition, Extraction of task-specific information, Applying information extraction rules, Coreference resolution, Merging partial results
Output: Results
Tokenisation & sentence segmentation
Tokenisation: identification of words, punctuation marks, delimiters, special characters
Sentence segmentation: recognizing sentence boundaries
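Both steps can be approximated with regular expressions. The sketch below is deliberately naive and is not the INEX implementation: the function names are hypothetical, and the sentence rule (end punctuation followed by whitespace and an ASCII capital letter) will mis-split at abbreviations and initials.

```python
import re

# A token is either a run of word characters or one non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenise(text):
    """Split text into words and single punctuation or special characters."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and a capital letter follow."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


text = "INEX extracts facts. It fills templates!"
print(split_sentences(text))  # → ['INEX extracts facts.', 'It fills templates!']
print(tokenise("It fills templates!"))  # → ['It', 'fills', 'templates', '!']
```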
Morphological analysis
maps every word-form of the input text to one or more canonical forms
recognizes the word's morphological properties
Results are typically ambiguous.
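The ambiguity is easy to see with a lexicon lookup that returns every reading of a word-form. This is a toy sketch: the `LEXICON` entries, the property names, and the `analyse` function are all assumptions made for illustration, and a real analyser would use a full morphological dictionary.

```python
# Toy lexicon: word-form -> list of (canonical form, morphological properties).
LEXICON = {
    "saw": [
        ("see", {"pos": "verb", "tense": "past"}),
        ("saw", {"pos": "noun", "number": "singular"}),
    ],
    "reports": [
        ("report", {"pos": "noun", "number": "plural"}),
        ("report", {"pos": "verb", "person": 3, "number": "singular"}),
    ],
    "the": [("the", {"pos": "determiner"})],
}


def analyse(word_form):
    """Return every reading of a word-form; unknown words get one 'unknown' reading."""
    key = word_form.lower()
    return LEXICON.get(key, [(key, {"pos": "unknown"})])


for lemma, properties in analyse("saw"):
    print(lemma, properties)
```

Here "saw" receives two readings (past tense of "see", or the noun "saw"), so a later stage has to choose between them.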
Filtering

reduces the text passed on for further processing to its potentially relevant portions
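One simple way to filter is to keep only the sentences that contain a trigger word for the target scenario. The sketch below assumes this keyword approach; the function name and the trigger list are hypothetical, and the source does not say which filtering criterion INEX uses.

```python
def filter_relevant(sentences, trigger_words):
    """Keep only sentences containing at least one trigger word (case-insensitive)."""
    triggers = {word.lower() for word in trigger_words}
    kept = []
    for sentence in sentences:
        # Strip surrounding punctuation so that "said," still matches "said".
        words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
        if words & triggers:
            kept.append(sentence)
    return kept


sentences = [
    "The minister said the budget was approved.",
    "The weather was mild on Tuesday.",
]
print(filter_relevant(sentences, ["said", "announced"]))
# → ['The minister said the budget was approved.']
```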
Disambiguation
can be performed as:
a side effect of other processes (e.g., microsyntactic analysis)
a stand-alone stage
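As a stand-alone stage, disambiguation might apply contextual rules to the ambiguous output of morphological analysis. The sketch below uses a single hand-written rule and is only an assumption about how such a stage could look; the function name and the tag labels are hypothetical.

```python
def disambiguate(tagged):
    """tagged: list of (word, set of candidate parts of speech).

    Single contextual rule: directly after an unambiguous determiner, drop the
    'verb' reading if another reading remains. The result is partial: tokens
    the rule does not cover stay ambiguous.
    """
    result = []
    previous = set()
    for word, candidates in tagged:
        remaining = set(candidates)
        if previous == {"determiner"} and len(remaining) > 1:
            remaining.discard("verb")
        result.append((word, remaining))
        previous = remaining
    return result


tagged = [
    ("the", {"determiner"}),
    ("saw", {"noun", "verb"}),
    ("cuts", {"noun", "verb"}),
]
resolved = disambiguate(tagged)
print(resolved)
```

In this input "saw" is reduced to its noun reading, while "cuts" keeps both readings because no rule applies to it.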
Microsyntactic analysis
identifies noun phrases (NP)
identifies some regularly formed constructions (numbers, dates, personal proper names)
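Regularly formed constructions such as dates and numbers lend themselves to pattern matching. The sketch below covers only two of them with hypothetical regular expressions (English month names, one date format); noun phrase identification is not attempted here.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Hypothetical patterns: "12 March 2004" style dates, integers and decimals.
DATE_RE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def find_constructions(text):
    """Return (label, matched text) pairs; digits inside a date are not reported again."""
    found = []
    date_spans = []
    for match in DATE_RE.finditer(text):
        found.append(("date", match.group()))
        date_spans.append(match.span())
    for match in NUMBER_RE.finditer(text):
        inside_date = any(start <= match.start() and match.end() <= end
                          for start, end in date_spans)
        if not inside_date:
            found.append(("number", match.group()))
    return found


print(find_constructions("On 12 March 2004 the club sold 3500 tickets."))
# → [('date', '12 March 2004'), ('number', '3500')]
```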
Macrosyntactic analysis
identifies clause boundaries
constructs the clause hierarchy within a sentence
Named entity recognizer
identifies proper names
assigns semantic features to certain items
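A gazetteer lookup is one common, simple way to identify names and attach a semantic feature to each. The sketch below assumes that approach; the `GAZETTEER` contents, the feature labels, and the function name are hypothetical. Plain substring search is a known weakness: "Russia" would also match inside "Russian".

```python
# Hypothetical gazetteer: proper name -> semantic feature.
GAZETTEER = {
    "Pereslavl-Zalessky": "location",
    "Russia": "location",
    "Program Systems Institute": "organization",
}


def recognise_entities(text):
    """Return (name, semantic feature, start offset) for each gazetteer hit, in text order."""
    entities = []
    for name, feature in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            entities.append((name, feature, start))
            start = text.find(name, start + len(name))
    return sorted(entities, key=lambda entity: entity[2])


sample = "The Program Systems Institute is located in Pereslavl-Zalessky, Russia."
for name, feature, offset in recognise_entities(sample):
    print(offset, name, feature)
```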
Information extraction rules
a domain knowledge representation formalism (scenario templates)
a set of patterns to identify template elements in a text (covering the many possible ways of expressing the elements of the target event)
An IE pattern includes:
a set of rules that define how to find the pattern in a text
a set of constraints that textual elements must satisfy to fill a particular slot of the target template
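The two parts of a pattern can be sketched as a retrieval rule plus per-slot constraints. The example below uses a regular expression as the rule and simple length checks as constraints; both are assumptions chosen for brevity, and INEX's actual rule formalism is not described in the source.

```python
import re

# Retrieval rule: one surface pattern for a reported statement.
RULE = re.compile(
    r"(?P<newsmaker>[A-Z][\w .-]*?) (?:said|announced|stated) that (?P<message>[^.]+)\."
)

# Slot constraints: a candidate filler must pass its check to be accepted.
CONSTRAINTS = {
    "newsmaker": lambda text: len(text.split()) <= 5,
    "message": lambda text: len(text.split()) >= 2,
}


def apply_pattern(sentence):
    """Return the filled slots, or None if the rule or any constraint fails."""
    match = RULE.search(sentence)
    if match is None:
        return None
    slots = match.groupdict()
    if all(check(slots[name]) for name, check in CONSTRAINTS.items()):
        return slots
    return None


print(apply_pattern("The ministry said that prices would fall."))
# → {'newsmaker': 'The ministry', 'message': 'prices would fall'}
```

Covering the many ways an event can be worded would require a set of such patterns, one per phrasing.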
Coreference resolver
recognizes different occurrences of the same entity in a text
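A very small illustration of the idea: link a short mention to an earlier, longer name when all of its words appear in that name. This rule is an assumption made for the sketch and is far weaker than real coreference resolution rules; for instance, it does not handle pronouns. The function name is hypothetical.

```python
def resolve_coreference(mentions):
    """Group mentions into entities; returns a list of lists of mentions.

    A mention joins the first existing entity whose first mention contains
    all of its words, ignoring case and the article 'the'.
    """
    def words(mention):
        return {w for w in mention.lower().split() if w != "the"}

    entities = []
    for mention in mentions:
        for entity in entities:
            if words(mention) <= words(entity[0]):
                entity.append(mention)
                break
        else:
            # No existing entity matched, so this mention starts a new one.
            entities.append([mention])
    return entities


mentions = [
    "Program Systems Institute",
    "Russian Academy of Sciences",
    "the Institute",
    "the Academy",
]
print(resolve_coreference(mentions))
```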
Merging partial results
merges partially filled templates to produce a final, maximally filled template
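Merging can be sketched as a slot-by-slot union of partial templates. The example below assumes that all partial templates describe the same event and that the first value found for a slot wins; how INEX resolves conflicting values is not stated in the source. The function name and sample values are hypothetical.

```python
def merge_templates(partials):
    """Merge partially filled templates (dicts with None for empty slots) into one."""
    merged = {}
    for template in partials:
        for slot, value in template.items():
            if value is not None and merged.get(slot) is None:
                # First non-empty value for this slot wins.
                merged[slot] = value
            else:
                # Record the slot even if it is still empty.
                merged.setdefault(slot, None)
    return merged


partials = [
    {"winner": "Team A", "loser": None, "score": None},
    {"winner": None, "loser": "Team B", "score": "2:1"},
]
print(merge_templates(partials))
# → {'winner': 'Team A', 'loser': 'Team B', 'score': '2:1'}
```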