Artificial Intelligence Research
Centre
Program Systems Institute
Russian Academy of Sciences
152020 Pereslavl-Zalessky
Russia
AiReC
INEX:
Tools for Information Extraction
+7 48535 98065
[email protected]
Information extraction
Objective:
extract meaningful information of a pre-specified type from (typically large
amounts of) text for further analysis
Output:
data structures of a pre-specified
format (filled scenario templates)
Examples
Sports report: <winner>, <loser>,
<score>, <location>, <date>…
Database of rental accommodation
offers: <location>, <rental
price>, <number of bedrooms>, <phone
number>…
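A filled scenario template can be pictured as a record whose slots are either filled or still empty. The following is only a minimal sketch of that idea for the rental example; the class name `RentalAd`, its field names, and the helper `filled_slots` are hypothetical and are not part of INEX.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RentalAd:
    """Hypothetical scenario template for the rental accommodation example."""
    location: Optional[str] = None
    rental_price: Optional[str] = None
    bedrooms: Optional[int] = None
    phone_number: Optional[str] = None

    def filled_slots(self) -> List[str]:
        # A slot counts as filled when it holds anything other than None.
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# A partially filled template: only two of the four slots are known.
ad = RentalAd(location="Pereslavl-Zalessky", bedrooms=2)
print(ad.filled_slots())
```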
Possible IE application
scenarios:
inference of new information
(knowledge acquisition)
query formulation and answering in
human-computer systems
automatic generation of abstracts and
summaries
visualization of document content, etc.
The 'Newsmaking' task
<newsmaker>
<type of newsmaker> (person or
organization)
<message>
<type of message> (original, cited, a
reference to another newsmaker)
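The four slots above could be represented as a small typed record. This is an illustrative sketch only: the names `NewsmakerType`, `MessageType`, and `NewsmakingRecord` are assumptions made for the example, and the sample values are invented.

```python
from dataclasses import dataclass
from enum import Enum


class NewsmakerType(Enum):
    PERSON = "person"
    ORGANIZATION = "organization"


class MessageType(Enum):
    ORIGINAL = "original"
    CITED = "cited"
    REFERENCE = "reference to another newsmaker"


@dataclass
class NewsmakingRecord:
    """Hypothetical record with one field per slot of the task."""
    newsmaker: str
    newsmaker_type: NewsmakerType
    message: str
    message_type: MessageType


# Invented example values, for illustration only.
record = NewsmakingRecord(
    newsmaker="The ministry",
    newsmaker_type=NewsmakerType.ORGANIZATION,
    message="Taxes will be lowered next year.",
    message_type=MessageType.CITED,
)
print(record.newsmaker_type.value, "/", record.message_type.value)
```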
IE system architecture
[Figure: architecture diagram; labels recovered from the diagram are grouped below]
Input: collection of texts, input text
Knowledge sources: linguistic information, information extraction rules,
coreference resolution rules
Linguistic processor: tokenisation & sentence segmentation, morphological
analysis, filtering, disambiguation (partial), microsyntactic analysis,
macrosyntactic analysis, named entity recognition
Semantic analyser (extraction of task-specific information): applying
information extraction rules, coreference resolution, merging partial results
Output: results
Tokenisation & sentence
segmentation
Tokenisation
identification of words, punctuation
marks, delimiters, special characters
Sentence segmentation
recognizing sentence boundaries
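As a rough illustration of these two steps, the sketch below uses plain regular expressions. It is a deliberately naive approximation, not the INEX implementation: the function names are hypothetical, and the sentence splitter would break on abbreviations such as "Dr. Smith" and only recognises sentences that start with an ASCII capital letter.

```python
import re

# A token is either a run of word characters or a single non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into words, punctuation marks and special characters."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Naive boundary rule: ., ! or ? followed by whitespace and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


print(tokenize("Hello, world!"))
print(split_sentences("It rained. The match was cancelled."))
```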
Morphological analysis
maps every word-form of the input text
to (a) canonical form(s)
recognizes the word's morphological
properties
Results are typically ambiguous.
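The ambiguity mentioned above can be seen even in a toy dictionary lookup. The sketch below assumes a tiny hand-written lexicon (`LEXICON`) and a hypothetical `analyse` function; a real morphological analyser would rely on a full dictionary and inflection rules.

```python
# Toy lexicon: word form -> list of (canonical form, part of speech, features).
# The entries are illustrative assumptions, not data from INEX.
LEXICON = {
    "saw": [("see", "VERB", "past"), ("saw", "NOUN", "singular")],
    "reports": [("report", "NOUN", "plural"), ("report", "VERB", "3sg present")],
    "won": [("win", "VERB", "past")],
}


def analyse(word_form):
    """Return every known analysis of a word form (possibly more than one)."""
    key = word_form.lower()
    return LEXICON.get(key, [(key, "UNKNOWN", "")])


for form in ["Reports", "won", "xyz"]:
    print(form, "->", analyse(form))
```

The form "Reports" receives two analyses, which is the kind of ambiguity that later stages have to resolve.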
Filtering
reduces the text to be subjected to
further processing to potentially
relevant portions
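One simple way to picture filtering is a trigger-word test on each sentence. This is only a sketch under that assumption; the slides do not say which criterion INEX uses, and `filter_sentences` and the trigger list are hypothetical.

```python
def filter_sentences(sentences, trigger_words):
    """Keep only sentences that contain at least one trigger word."""
    triggers = {w.lower() for w in trigger_words}
    relevant = []
    for sentence in sentences:
        # Crude normalisation: strip surrounding punctuation and lowercase.
        words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
        if words & triggers:
            relevant.append(sentence)
    return relevant


sentences = ["The minister said taxes will fall.", "It was sunny."]
print(filter_sentences(sentences, ["said", "announced"]))
```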
Disambiguation
performed either as a side effect of other processes (e.g.,
microsyntactic analysis)
or as a stand-alone stage
Microsyntactic analysis
identifies noun phrases (NP)
identifies some regularly formed
constructions (numbers, dates, personal
proper names)
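Regularly formed constructions such as dates lend themselves to pattern matching. The sketch below recognises only one English date format ("12 May 2003") with a regular expression; the pattern and the function `find_dates` are assumptions made for illustration and say nothing about how INEX handles dates.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches dates written as day, month name, four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2}) (" + MONTHS + r") (\d{4})\b")


def find_dates(text):
    """Return (day, month, year) for every date in the supported format."""
    return [(int(m.group(1)), m.group(2), int(m.group(3)))
            for m in DATE_RE.finditer(text)]


print(find_dates("The match was played on 12 May 2003 in Moscow."))
```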
Macrosyntactic analysis
identifies clause boundaries
constructs clause hierarchy within a
sentence
Named entity recognizer
identifies proper names
assigns semantic features to certain
items
Information extraction rules
a domain knowledge representation
formalism (scenario templates)
a set of patterns to identify template
elements in a text (covering the many
possible ways to talk about the target
event elements)
An IE pattern includes:
a set of rules that define how to
find the pattern in a text
a set of constraints that textual
elements must satisfy to fit into a particular slot of
the target template
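The two parts of a pattern, a matching rule and slot constraints, can be sketched for the sports report example. Everything here is hypothetical: the regular expression stands in for a matching rule, `score_is_consistent` stands in for a slot constraint, and the team names are used only as sample input.

```python
import re

# Matching rule: "<winner> beat|defeated <loser> <score>".
PATTERN = re.compile(
    r"(?P<winner>[A-Z]\w+) (?:beat|defeated) (?P<loser>[A-Z]\w+) "
    r"(?P<score>\d+[-:]\d+)"
)


def score_is_consistent(slots):
    """Constraint: the winner's goal count must exceed the loser's."""
    first, second = (int(x) for x in re.split(r"[-:]", slots["score"]))
    return first > second


def apply_pattern(sentence):
    """Return the filled slots, or None if the rule or constraint fails."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    slots = match.groupdict()
    return slots if score_is_consistent(slots) else None


print(apply_pattern("Spartak beat Dynamo 2-1 on Sunday."))
print(apply_pattern("Spartak beat Dynamo 1-2."))
```

A single regular expression covers only one way of describing the event; a realistic rule set would need many such patterns.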
Coreference Resolver
recognizes different occurrences of the
same entity in a text
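A very rough approximation of this step is to link a short mention to an earlier, longer mention that contains all of its words. The sketch below assumes exactly that heuristic; `resolve_mentions` is a hypothetical helper, it ignores pronouns entirely, and it only works when the full name appears before its shortened forms.

```python
def resolve_mentions(mentions):
    """Group mentions; a mention joins the first entity whose first
    (representative) mention contains all of its words."""
    entities = []
    for mention in mentions:
        tokens = set(mention.lower().split())
        for entity in entities:
            if tokens <= set(entity[0].lower().split()):
                entity.append(mention)
                break
        else:
            # No existing entity matched: start a new one.
            entities.append([mention])
    return entities


mentions = [
    "Program Systems Institute",
    "Institute",
    "Russian Academy of Sciences",
    "Academy",
]
print(resolve_mentions(mentions))
```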
Merging partial results
merging partially filled templates to
produce a final, maximally filled
template
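Merging can be pictured as combining dictionaries whose empty slots are marked with `None`. This is a minimal sketch under the assumption that partial templates describe the same event and that the first non-empty value for a slot wins; how INEX decides which templates to merge and how it handles conflicts is not stated in the slides.

```python
def merge_templates(partials):
    """Merge partially filled templates; the first non-empty value wins."""
    merged = {}
    for partial in partials:
        for slot, value in partial.items():
            if value is None:
                continue  # empty slot: nothing to contribute
            merged.setdefault(slot, value)
    return merged


partials = [
    {"winner": "Spartak", "score": None},
    {"score": "2-1", "loser": "Dynamo"},
]
print(merge_templates(partials))
```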