Two Applications of Information Extraction to Biological Science
Journal Articles: Enzyme Interactions and Protein Structures
Kevin Humphreys, George Demetriou, & Robert Gaizauskas
Department of Computer Science, University of Sheffield
(Pacific Symposium on Biocomputing, Vol 5, Pages 502-513, 2000)
Abstract
• The application of information extraction (IE)
technology to scientific journal papers in the
area of molecular biology.
• Two bioinformatics applications: EMPathIE,
concerned with enzyme and metabolic
pathways; and PASTA, concerned with
protein structure.
1. Introduction
• The prototypical IE tasks are those defined by the
U.S. DARPA MUCs, requiring the filling of a
complex template from newswire texts on subjects
such as joint venture announcements, management
succession events, or rocket launchings.
• This paper describes the use of the technology
developed through MUC evaluations in two
bioinformatics applications.
2. IE Technology
• MUC-7 specified five separate component tasks:
– Named Entity recognition: organizations, persons,
locations, dates and monetary amounts.
– Coreference resolution: the identification of
expressions that refer to the same object, set or activity.
– Template Element filling: the filling of small scale
templates for specified classes of entity in the texts.
– Template Relation filling: the filling of two-slot
templates representing a binary relation, with pointers
to template elements.
– Scenario Template filling: the detection and
construction of relations between template elements as
participants in a particular type of event, or scenario.
3. Two Bioinformatics Applications of IE (1/2)
• EMPathIE
– Enzyme and Metabolic Pathways Information
Extraction.
– Aimed to extract details of enzyme reactions from
articles in the journals Biochimica et Biophysica Acta
and FEMS Microbiology Letters.
– Typically, journal articles in this domain describe
details of a single enzyme reaction, often with little
indication of related reactions or of which pathways the
reaction may be part of, so details from several
articles must be combined to identify pathways.
3. Two Bioinformatics Applications of IE (2/2)
• PASTA
– Protein Active Site Template Acquisition
– Aimed to extract information concerning the roles of
amino acids in protein molecules, and to create a database
of protein active sites from both scientific journal
abstracts and full articles.
– New protein structures are being reported at very high
rates and the number of co-ordinate sets (currently about
9000) in the Protein Data Bank (PDB) can be expected to
increase ten-fold in the next five years.
– Computational methods would be very useful to biologists
in comparison and classification work and to those
engaged in modeling studies.
3.1 EMPathIE (1/2)
• The EMP database contains over 20,000
records of enzyme reactions, collected from
journal articles published since 1964, and so
can provide training data.
• Template definitions:
– Three Template Elements: enzyme, organism and
compound.
– A single Template Relation: source, relating
enzyme and organism elements
– A scenario Template for the specific metabolic
pathway task.
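The template definitions above can be pictured as plain record types. The following is a minimal sketch, not the EMPathIE template format itself: only the three element types and the source relation come from the slide, while the class names, the field names and the slots of the scenario template are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Element types named on the slide; everything else here is hypothetical.
ELEMENT_TYPES = {"enzyme", "organism", "compound"}


@dataclass(frozen=True)
class TemplateElement:
    """A small template describing one entity found in the text."""
    kind: str   # one of ELEMENT_TYPES
    name: str   # surface string taken from the article

    def __post_init__(self):
        if self.kind not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {self.kind}")


@dataclass(frozen=True)
class SourceRelation:
    """Two-slot relation: an enzyme and the organism it comes from."""
    enzyme: TemplateElement
    organism: TemplateElement

    def __post_init__(self):
        if self.enzyme.kind != "enzyme" or self.organism.kind != "organism":
            raise ValueError("source relates an enzyme to an organism")


@dataclass
class PathwayScenario:
    """Scenario-level template tying elements together (slots assumed)."""
    pathway: str
    participants: List[TemplateElement] = field(default_factory=list)


# Usage with placeholder values (the organism name is a stand-in).
enzyme = TemplateElement("enzyme", "isocitrate lyase")
organism = TemplateElement("organism", "example organism")
relation = SourceRelation(enzyme, organism)
scenario = PathwayScenario("glyoxylate cycle", [enzyme])
```

Keeping the relation as a pair of references to elements, rather than copying their strings, mirrors the idea of a two-slot template whose slots point at template elements.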
3.1 EMPathIE (2/2)
• A manually produced sample Scenario Template, taken from an article on
‘isocitrate lyase activity’ in FEMS Microbiology Letters.
Glyoxylate cycle
3.2 PASTA (1/3)
• The entities to be extracted:
– proteins
– amino acid residues
– species
– types of structural characteristics
• secondary structure, quaternary structure
– active sites
– other (probably less important) regions
– chains
– interactions
• hydrogen bonds, disulphide bonds etc.
3.2 PASTA (2/3)
3.2 PASTA (3/3)
4. EMPathIE and PASTA (1/2)
• The IE systems are both derived from the
LaSIE system, a general purpose IE system,
under development at Sheffield since 1994.
• The processing modules:
4. EMPathIE and PASTA (2/2)
• Both systems have a pipeline architecture
consisting of four principal stages.
– Text preprocessing
• SGML/structure analysis, tokenisation
– Lexical and terminological processing
• Terminology lexicons, morphological analysis,
terminology grammars
– Parsing and semantic interpretation
• Sentence boundary detection, part-of-speech
tagging, phrase grammars, semantic interpretation
– Discourse interpretation
• Coreference resolution, domain modeling
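The four-stage pipeline can be sketched as a list of functions applied in order, each adding annotations to a shared document record. This is only an illustration of the architecture: the stage bodies below are trivial stand-ins, and every function and key name is an assumption rather than part of LaSIE, EMPathIE or PASTA.

```python
from typing import Any, Callable, Dict, List

# A document is modelled as a dict that each stage annotates in turn.
Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def text_preprocessing(doc: Document) -> Document:
    # Stand-in for SGML/structure analysis and tokenisation.
    doc["tokens"] = doc["text"].split()
    return doc


def lexical_processing(doc: Document) -> Document:
    # Stand-in for lexicon lookup, morphology and terminology grammars.
    doc["terms"] = [t for t in doc["tokens"] if t.lower().endswith("ase")]
    return doc


def parsing(doc: Document) -> Document:
    # Stand-in for tagging, phrase grammars and semantic interpretation.
    doc["semantics"] = [("term", t) for t in doc["terms"]]
    return doc


def discourse_interpretation(doc: Document) -> Document:
    # Stand-in for coreference resolution and domain modelling.
    doc["instances"] = sorted(set(doc["semantics"]))
    return doc


PIPELINE: List[Stage] = [
    text_preprocessing,
    lexical_processing,
    parsing,
    discourse_interpretation,
]


def run(text: str) -> Document:
    """Pass one text through every stage in order."""
    doc: Document = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

Because each stage only reads what earlier stages wrote, a stage can be swapped or refined without touching the others, which is the main appeal of a pipeline design.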
4.1 Text Preprocessing
• Both the SGML and sectioniser modules
may specify that certain text regions are to
be excluded from any subsequent
processing, avoiding detailed processing of
apparently irrelevant text.
• The tokenisation of the input needs to
identify tokens within compound names.
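To show what identifying tokens within compound names can mean in practice, here is a small sketch. The splitting rule (runs of letters, runs of digits, and single punctuation marks) is an assumption chosen for illustration; the slides do not give the actual tokenisation rules.

```python
import re
from typing import List

# Letters, digits and individual punctuation marks become separate tokens,
# so hyphenated chemical names expose their component parts.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def tokenise(text: str) -> List[str]:
    """Split text into tokens, keeping the parts of compound names."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("mannitol-1-phosphate 5-dehydrogenase")
# tokens == ['mannitol', '-', '1', '-', 'phosphate', '5', '-', 'dehydrogenase']
```

Later stages can then look up `mannitol` and `phosphate` individually, which a whitespace-only tokeniser would not allow.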
4.2 Lexical and Terminological Preprocessing (1/3)
• The main information sources used for
terminology identification:
– Case-insensitive terminology lexicons, listing
component terms of various categories
– Morphological cues: standard biochemical
suffixes
– Hand-constructed grammar rules for each
terminology class
4.2 Lexical and Terminological Preprocessing (2/3)
• The enzyme name mannitol-1-phosphate
5-dehydrogenase would be recognized firstly by
the classification of mannitol as a potential
compound modifier, and phosphate as a
compound, both by being matched in the
terminology lexicon.
• Morphological analysis would suggest
dehydrogenase as a potential enzyme head, due
to its suffix -ase.
• Grammar rules would apply to combine the
enzyme head with a known compound and
modifier which can play the role of enzyme
modifier.
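The three steps for mannitol-1-phosphate 5-dehydrogenase can be traced in a toy recogniser. This is a sketch under stated assumptions, not the EMPathIE grammar: the two lexicon entries come from the slide, but the label names, the `locant` label for position numbers, and the single combination rule are invented for illustration.

```python
import re

# Toy lexicon holding only the two entries mentioned on the slide.
LEXICON = {
    "mannitol": "compound_modifier",
    "phosphate": "compound",
}


def classify(token: str) -> str:
    """Label one token using lexicon lookup, then a suffix cue."""
    key = token.lower()            # the lexicons are case-insensitive
    if key in LEXICON:
        return LEXICON[key]
    if key.endswith("ase"):        # morphological cue for enzyme heads
        return "enzyme_head"
    if key.isdigit():
        return "locant"            # hypothetical label for position numbers
    return "other"


def recognise_enzyme(name: str):
    """Apply one hand-written rule: modifiers and a compound, then a head."""
    tokens = re.findall(r"[A-Za-z]+|\d+", name)
    labels = [classify(t) for t in tokens]
    if not labels or labels[-1] != "enzyme_head":
        return None
    allowed = {"compound_modifier", "compound", "locant"}
    body = labels[:-1]
    if "compound" in body and all(label in allowed for label in body):
        return list(zip(tokens, labels))
    return None


result = recognise_enzyme("mannitol-1-phosphate 5-dehydrogenase")
```

Here `result` pairs each token with its label, ending in `('dehydrogenase', 'enzyme_head')`, while a string such as `"phosphate"` alone yields `None` because no enzyme head is present.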
4.2 Lexical and Terminological Preprocessing (3/3)
• The biochemical terminology lexicons,
assembled from various publicly available
resources (e.g. SWISS-PROT), have been
structured to distinguish various term
components which are then assembled by
grammar rules.
• The total number of lexicon entries is
approximately 25,000 component terms at
present, in 52 categories.
4.3 Parsing and Semantic Interpretation
• The syntactic processing modules treat any
terms recognized in the previous stage as
non-decomposable units, with a syntactic
role of proper noun.
• The POS tagger only attempts to assign
tags to tokens which are not part of
proposed terms.
• The phrasal grammar includes
compositional semantic rules, which are
used to construct a semantic representation
of the ‘best’, possibly partial, parse of
each sentence.
4.4 Discourse Interpretation (1/2)
• The discourse interpreter adds the semantic
representation of each sentence to a predefined
domain model, made up of an ontology, or concept
hierarchy, plus inheritable properties and
inference rules associated with concepts.
• The domain model is gradually populated with
instances of concepts from the text to become a
discourse model.
• A coreference mechanism attempts to merge each
newly introduced instance with an existing one,
subject to various syntactic and semantic
constraints.
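The idea of merging a new instance into an existing one can be sketched as follows. This is a simplified illustration, not the LaSIE discourse interpreter: the ontology fragment is hypothetical, only semantic constraints are checked (syntactic ones are omitted), and preferring the most recent compatible instance is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical fragment of a concept hierarchy: child -> parent.
ONTOLOGY: Dict[str, Optional[str]] = {
    "entity": None,
    "protein": "entity",
    "enzyme": "protein",
    "organism": "entity",
}


def is_a(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or lies below it."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY.get(current)
    return False


@dataclass
class Instance:
    concept: str
    properties: Dict[str, str] = field(default_factory=dict)


def compatible(new: Instance, old: Instance) -> bool:
    """Semantic constraints only: related concepts, no clashing values."""
    related = is_a(new.concept, old.concept) or is_a(old.concept, new.concept)
    clash = any(old.properties.get(k, v) != v for k, v in new.properties.items())
    return related and not clash


def add_instance(model: List[Instance], new: Instance) -> Instance:
    """Merge `new` into the most recent compatible instance, else append."""
    for old in reversed(model):
        if compatible(new, old):
            if is_a(new.concept, old.concept):
                old.concept = new.concept      # keep the more specific concept
            old.properties.update(new.properties)
            return old
    model.append(new)
    return new
```

For example, after adding an `enzyme` instance named isocitrate lyase, a later bare `protein` mention merges into it, whereas an `organism` mention or an enzyme with a different name is added as a separate instance.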
4.4 Discourse Interpretation (2/2)
• The template writer module reads off the required
information from the final discourse model and
formats it as in the template specification.
• An initial domain model for the EMPathIE
metabolic pathway task has been manually
constructed, directly from the template definition,
and subsequent refinement will involve extending
the concept subhierarchies and the addition of
coreference constraints on the hypothesised
instances, based on available training data.
5. Results & Evaluation (1/2)
• A complete prototype EMPathIE system exists
which can produce filled templates.
• The terminology recognition portion has been
informally reviewed by molecular biologists, who
judged it remarkably good.
• The PASTA system has been implemented as far as
the terminology recognition stage. Preliminary
template design has been carried out, and work on
building a domain model is starting.
• A corpus of 52 abstracts of journal articles has been
manually annotated with terminology classes, allowing
an automatic evaluation of the PASTA terminology
system using the MUC scoring software.
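An automatic evaluation of this kind compares the system's annotations with the manual ones and reports recall, precision and a combined F-measure. The sketch below is not the MUC scoring software: it assumes exact matching of span and class only, uses the balanced F-measure, and the annotation tuples are made-up placeholders rather than PASTA data.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]   # (start offset, end offset, class)


def score(key: Set[Annotation], response: Set[Annotation]):
    """Exact-match recall, precision and balanced F-measure."""
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Placeholder annotations: `key` is the manual markup, `response` the system's.
key = {(0, 8, "protein"), (10, 15, "residue"), (20, 27, "species"), (30, 35, "protein")}
response = {(0, 8, "protein"), (10, 15, "residue"), (40, 45, "species")}
recall, precision, f_measure = score(key, response)
```

With these placeholders two of four key annotations are found and two of three responses are correct, so recall is 0.5 and precision is 2/3.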
5. Results & Evaluation (2/2)
Initial Named Entity results for the PASTA system
6. Conclusion
• In these two projects, much of the low-level
work of moving IE systems into the molecular
biology domain has been carried out.
• Generalized the software to handle longer,
multi-sectioned articles with embedded SGML.
• Generalized tokenisation routines to cope with
scientific nomenclature.
• Generalized terminology recognition procedures to
deal with a broad range of molecular biological
terminology.
• Made good progress in designing template elements,
template relations, and scenario templates.