Tools for Ontology-based Corpus Annotation

Tomoko OHTA, Yuka TATEISI, and
Jun’ichi TSUJII
Department of Information Science, Graduate
School of Science, University of Tokyo
Abstract
Introduction:
Automatic information extraction is a key technology for helping researchers access the information contained in research
papers and for extending databases on substances and biological processes. We aim to build information extraction systems
[1,2] for biology papers and their abstracts, which are available from the MEDLINE [3] database. As part of a project on
information extraction from research papers in the biology domain, we are creating an expert-tagged corpus of
MEDLINE abstracts, which will be used for training and testing the information extraction systems [4,5]. The markup
scheme is based on a conceptual domain model (ontology) and is implemented in XML [6].
Tools:
The task of annotation can be regarded as identifying the terms that appear in the texts and classifying them according to a
pre-defined classification. For the classification to be reliable, it must be well defined and easy for
the domain experts who annotate the texts to understand. To fulfill this requirement, we think that the tag-set should be based on a
concrete data model (ontology) of the biology domain, which serves as a standardized representation of the background
knowledge of domain experts.
Although an XML-tagged text can be created with a text editor, semantically annotated corpora must be created by
domain experts, who are not always familiar with XML tagging schemes. An easy-to-use tagging tool that helps annotators is
therefore indispensable for efficiency and accuracy.
We developed a GUI-based tag definition tool, TagEdit, and a tagging tool, JTag, both in Java. TagEdit
supports defining a new tag-set, refining tag definitions, enhancing the tag-set by adding or removing tags, and
enriching tags by adding or removing attributes. JTag has two frames: a tag
selection frame and an annotation frame. In the tag selection frame, the ontology-based tag-set defined with
TagEdit appears as a concept hierarchy. Tag data, including the class of each tag, its position, and the values of its
attributes, is saved as an XML document, and the annotated text can be saved in tag-embedded form.
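The relation between the two storage forms (separate tag data with positions, and tag-embedded text) can be illustrated with a minimal Python sketch. It is not JTag's code: the `Tag` record, the `embed` function, and the use of character offsets are all assumptions made for illustration, and overlapping tags are deliberately rejected.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Tag:
    """Hypothetical stand-off record: tag class, character span, attributes."""
    tag_class: str
    start: int  # offset of the first annotated character
    end: int    # offset one past the last annotated character
    attrs: dict = field(default_factory=dict)


def embed(text, tags):
    """Return the text with non-overlapping tags embedded as XML-style elements."""
    out, pos = [], 0
    for t in sorted(tags, key=lambda t: t.start):
        if t.start < pos:
            raise ValueError("overlapping tags are not handled in this sketch")
        attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in t.attrs.items())
        out.append(escape(text[pos:t.start]))  # untagged text before the span
        out.append(f"<{t.tag_class}{attr_str}>"
                   f"{escape(text[t.start:t.end])}</{t.tag_class}>")
        pos = t.end
    out.append(escape(text[pos:]))  # untagged tail
    return "".join(out)


text = "Aldosterone binds to receptors"
tags = [
    Tag("NE", 0, 11, {"class": "other_organic_compounds"}),
    Tag("NE", 21, 30, {"class": "protein"}),
]
embedded = embed(text, tags)
```

Keeping the tag data separate from the text leaves the raw abstract untouched, while the embedded form is easier to read and exchange.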
Introduction
Information need
Information retrieval (IR) & filtering (IF)
Information extraction (IE)
Document / term classification & categorization
Summarization, …
Overview of GENIA system
Background knowledge design
Ontology, Data model, Markup language, …
Resource building
Corpus annotation (aid tool), Database construction, …
Core module
Information extraction, Information retrieval, …
Web-based integrated interface
Overview of GENIA System
[Figure: system architecture. Modules and their functions:]
Retrieval Module
•Request enhancement
•Spawn request
•Classify documents
Corpus Module
•Markup generation / compilation
•Annotated corpus construction
Interface Module
•GUI
•HTML conversion
•System integration
Concept Module
•BK design / construction / compilation
Information Extraction Module
•Identify & classify terms
•Identify events
Database Module
•DB design / access / management
•DB construction
[Other labels in the figure: User, IR Request, Abstract, Full Paper, MEDLINE, Corpus, Raw (OCR), Annotated, Text, Structure, Document, Named-Entity, Event, Markup language, Data model, Background Knowledge, Ontology, Database, Security]
Overview of GENIA Ontology
We aim to construct an ontology to model bio-molecular reactions in
humans.
The ontology will be used by systems that extract biological event information
from online research papers and documents.
The ontology consists of multiple taxonomies, relations between their
nodes, and corresponding linguistic representations.
We are implementing the ontology in LiLFeS, a Prolog-like typed-feature
manipulation language (Makino, et al. Proc. COLING-ACL '98, 807-811, 1998.), on which various natural language processing programs
are implemented.
By using LiLFeS, we aim to seamlessly incorporate the ontology into
natural language processing systems.
Name Ontology
Taxonomies
SUBSTANCE1 to SUBSTANCE4, each with attribute1, attribute2, …
ROLE1 to ROLE4, each with attribute3, attribute4, …
Terms
•AMINO ACID
•DNA
•ORGANIC COMPOUND
•PROTEIN
•AGENT
•ENZYME
•PHOSPHATASE
•TRANSCRIPTION FACTOR
Event Ontology
REACTION1 to REACTION5, each with attribute1, attribute2, …
• substance ACTIVATE substance
• substance ACTIVATE protein
• protein ACTIVATE pathway
• PHOSPHORYLATE
• INHIBIT
• REGULATE
Name hierarchy
(Links in the figure are labeled Is-a or Part-of.)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |          +-mono-cell organism
       |        |         |          +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        |         +-other (natural source)
       |        +-artificial-+-cell line
       |                     +-other (artificial source)
       +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group
       |           |          |         |       |         +-protein complex
       |           |          |         |       |         +-individual protein molecule
       |           |          |         |       |         +-subunit of protein complex
       |           |          |         |       |         +-substructure of protein
       |           |          |         |       |         +-domain or region of protein
       |           |          |         |       +-peptide
       |           |          |         |       +-amino acid monomer
       |           |          |         +-nucleic-+-DNA-+-DNA family or group
       |           |          |         |         |     +-individual DNA molecule
       |           |          |         |         |     +-domain or region of DNA
       |           |          |         |         +-RNA-+-RNA family or group
       |           |          |         |         |     +-individual RNA molecule
       |           |          |         |         |     +-domain or region of RNA
       |           |          |         |         +-other polymer of nucleic acids
       |           |          |         |         +-nucleic acid monomer
       |           |          |         +-lipid-+-steroid
       |           |          |         +-carbohydrate
       |           |          |         +-other organic compounds
       |           |          +-inorganic
       |           +-atom
       +-other (name)
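A fragment of this hierarchy can be held as nested dictionaries so that a program can check whether one class lies under another. The sketch below is only an illustration in Python (the project itself uses LiLFeS). The fragment is abridged, and `NAME_HIERARCHY`, `path_to`, and `is_under` are hypothetical names.

```python
# Abridged fragment of the name hierarchy; a leaf is an empty dict.
NAME_HIERARCHY = {
    "name": {
        "source": {
            "natural": {
                "organism": {"multi-cell organism": {}, "mono-cell organism": {}, "virus": {}},
                "tissue": {},
                "cell type": {},
            },
            "artificial": {"cell line": {}},
        },
        "substance": {
            "compound": {
                "organic": {
                    "amino": {"protein": {}, "peptide": {}, "amino acid monomer": {}},
                    "nucleic": {"DNA": {}, "RNA": {}},
                    "lipid": {"steroid": {}},
                },
                "inorganic": {},
            },
            "atom": {},
        },
    }
}


def path_to(label, tree=NAME_HIERARCHY, prefix=()):
    """Return the tuple of nodes from the root down to label, or None if absent."""
    for node, children in tree.items():
        here = prefix + (node,)
        if node == label:
            return here
        found = path_to(label, children, here)
        if found is not None:
            return found
    return None


def is_under(label, ancestor):
    """True if ancestor appears on the path from the root to label."""
    path = path_to(label)
    return path is not None and ancestor in path
```

For example, `is_under("protein", "substance")` holds, while `is_under("cell line", "substance")` does not, which is the kind of check an annotation tool needs when a tag class is selected from the concept hierarchy.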
Seamless Incorporation into
Natural Language Processing System
Practical NLP applications
Knowledge Acquisition Module
Event Extraction from Biology Research Papers
Grammar
Sequential HPSG Parser
Programming Language
Parallel HPSG Parser
Parallel Programming Environment
Semantics
Domain Ontology
Ontology and Texts
Concept
Top level ontology (e.g. Gene Ontology)
Middle level ontology (e.g. database model of pathway databases)
Bottom level ontology (e.g. fact databases)
Granularity
Text
Textbook
Review article etc.
Research article etc.
Case report form etc.
Corpus Annotation
Purpose
Provide Semantically Annotated Corpus
Mark up the instances of the GENIA ontology
Learning and testing data for information extraction programs
Outline
Definition of GPML (GENIA Project Markup Language):
New mechanisms for handling overlaps and complicated attribute structures
Target Objects: Named Entities
Substance: protein, DNA, RNA, …
Source (location): organism, cell, tissue, …
Target Texts: 1,000 MEDLINE abstracts
GPML (GENIA Project Markup Language)
Text structure & information markup
Named-entity markup
Coreference markup
Event markup
Text structure & info. markup
Document structure
[document]
a document header, author names, a publication date, a title, an abstract, keywords, a body
[body]
sections with a title, captions with a title
[abstract/section/caption]
lines
Document information
[document header]
a unique document id
a source, a language, a domain
document categories or classes
…
Named-entity markup
Attributes of NE element
A unique ID
For referring to the tag element
A name
As close to the canonical form as possible
Zero or more classes
To determine the class of this named entity
An equivalence link
To a synonym or an abbreviation / full form
Extra information
An annotator name
The time of annotation
Assurance
…
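As a rough illustration of how these attributes could be carried on one element, the Python sketch below builds an NE element with the standard library. The attribute names (`id`, `name`, `class`, `equiv`, `annotator`, `time`) and the space-separated class list are assumptions made for this example, not the GPML definition.

```python
import xml.etree.ElementTree as ET


def make_ne(ne_id, text, name, classes=(), equiv=None, annotator=None, time=None):
    """Build a hypothetical NE element; optional attributes are omitted when unset."""
    attrs = {"id": ne_id, "name": name}
    if classes:
        attrs["class"] = " ".join(classes)  # zero or more classes (assumed encoding)
    if equiv is not None:
        attrs["equiv"] = equiv              # link to a synonym or full form
    if annotator is not None:
        attrs["annotator"] = annotator
    if time is not None:
        attrs["time"] = time                # time of annotation, as a string
    elem = ET.Element("NE", attrs)
    elem.text = text
    return elem


ne = make_ne("ne10", "receptors", "receptor", classes=["protein"], annotator="annotator1")
serialized = ET.tostring(ne, encoding="unicode")
```

The name attribute holds the near-canonical form ("receptor"), while the element text keeps the surface form found in the abstract ("receptors").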
Coreference markup
Attributes of coreference (REFEXP) element
A unique ID
One or more links to referred objects (for both conjunction and disjunction)
These can be named entities and events.
Extra information
An annotator name, the time of annotation, assurance, …
Auxiliary element (REFAUX)
To handle complicated coordination of referred objects
Underlying principle: Disjunction of conjunction
( … and … ) or … or ( … and … )
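The "disjunction of conjunction" principle can be pictured as a list of alternatives, each of which is a list of object IDs that must all be referred to together. The Python sketch below assumes that representation. The IDs and the function names `render` and `referenced_ids` are hypothetical and are not part of GPML.

```python
def render(alternatives):
    """Format a disjunction of conjunctions, e.g. (a and b) or c."""
    parts = []
    for conjunction in alternatives:
        joined = " and ".join(conjunction)
        # Parenthesize only real conjunctions (two or more members).
        parts.append(f"({joined})" if len(conjunction) > 1 else joined)
    return " or ".join(parts)


def referenced_ids(alternatives):
    """Collect every object ID mentioned in any alternative."""
    return {obj_id for conjunction in alternatives for obj_id in conjunction}


# A referring expression that points to (ne1 and ne2), or ne3, or (ne4 and ne5).
refexp = [["ne1", "ne2"], ["ne3"], ["ne4", "ne5"]]
formula = render(refexp)
```

A two-level structure of this kind can express any coordination of referred objects once it is rewritten into this normal form, which is one reason to fix a single shape for links.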
Event markup
Attributes of event element
A unique ID
A class and a type
To determine the form of this event.
Zero or more links to from-molecules, to-molecules, from-tissues, to-tissues, components, and enzymes
To describe this event
Zero or more effect names
To determine the effect of this event
Affirmative mode & Definiteness
To recognize positive & negative sentences and quantity & quality
Extra information
An annotator name, The time of annotation, Assurance, …
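The event attributes listed above map naturally onto a record type. The following Python sketch is one possible in-memory form, under the assumption that links are stored as lists of element IDs. The field names, the example class and type values, and the `dangling_links` check are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical in-memory form of one event element."""
    id: str
    event_class: str
    event_type: str
    from_molecules: list = field(default_factory=list)
    to_molecules: list = field(default_factory=list)
    from_tissues: list = field(default_factory=list)
    to_tissues: list = field(default_factory=list)
    components: list = field(default_factory=list)
    enzymes: list = field(default_factory=list)
    effects: list = field(default_factory=list)  # zero or more effect names
    affirmative: bool = True                     # positive vs. negative sentence
    definite: bool = True

    def links(self):
        """All IDs this event links to, in a fixed field order."""
        return (self.from_molecules + self.to_molecules + self.from_tissues
                + self.to_tissues + self.components + self.enzymes)


def dangling_links(event, known_ids):
    """Return linked IDs that do not correspond to any annotated element."""
    return [obj_id for obj_id in event.links() if obj_id not in known_ids]


event = Event("ev1", "reaction", "phosphorylation",
              from_molecules=["ne1"], to_molecules=["ne2"], enzymes=["ne9"])
missing = dangling_links(event, {"ne1", "ne2"})
```

A consistency check like `dangling_links` is the sort of validation a tagging tool could run before saving, since every link is only meaningful if its target element exists.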
Example of NE Annotation
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group"
unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human
mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class"
cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte"
mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from
blood by a Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h
with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK"
cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds"
nm="RU-26988" mt="SV" unsure="OK" cmt="">RU-26988 </NE ti="7">(<NE ti="17" class="other_organic_compounds"
nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17
alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">), with or without an excess of unlabeled <NE ti="8"
class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9"
class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a
single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK"
cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108
sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds"
nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12"
class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE
ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13">
greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK"
cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone"
mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type"
nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for
studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV"
subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.
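Because the closing tags in this example repeat the ti attribute (as in `</NE ti="3">`), a standard XML parser would reject the text. A small Python sketch using regular expressions can still pull out the entities. It assumes NE elements are not nested, which holds for the example above, and the function names are hypothetical.

```python
import re

# Opening tag with attributes, the annotated text, and a closing tag that may
# itself carry attributes (as in the example above).
NE_PATTERN = re.compile(r'<NE\s+([^>]*)>(.*?)</NE[^>]*>', re.DOTALL)
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def extract_entities(marked_up):
    """Return (attributes, surface text) pairs for each non-nested NE element."""
    entities = []
    for attr_text, surface in NE_PATTERN.findall(marked_up):
        attrs = dict(ATTR_PATTERN.findall(attr_text))
        # Collapse line breaks and repeated spaces inside the annotated text.
        entities.append((attrs, " ".join(surface.split())))
    return entities


def strip_tags(marked_up):
    """Recover the plain text by replacing each NE element with its content."""
    return NE_PATTERN.sub(lambda m: m.group(2), marked_up)


sample = ('Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" '
          'mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites'
          '</NE ti="3"> in circulating <NE ti="2" class="cell_type" '
          'nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">'
          'human mononuclear leukocytes</NE ti="2">.')
entities = extract_entities(sample)
plain = strip_tags(sample)
```

Each extracted pair keeps the full attribute dictionary, so the class and the nm value of every entity remain available to a training or evaluation program.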
TagEdit: Tag Definition Tool
Implemented in Java
Functions:
Definition of new tag sets
Refinement of tag definition
Enhancement of tag sets
Features:
Tag sets in conformity with XML or GPML can be defined and modified
Tag definitions are saved as a file
Definition of new tag
Select a class and click the right button
Click “Create Child Tag” to create a new class
Refinement of tag
Select a class and click the right button to refine the tag
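The operations TagEdit offers (creating a child tag, adding attributes, saving the definitions to a file) can be pictured with a small tree structure. The Python sketch below is only an illustration: TagEdit itself is a Java GUI tool, and the `TagDef` class and the XML layout of the saved definition are assumptions, since the actual file format is not described here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class TagDef:
    """Hypothetical tag definition: a class name, its attributes, and child classes."""
    name: str
    attributes: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def add_child(self, name):
        """Create a child tag under this class and return it."""
        child = TagDef(name)
        self.children.append(child)
        return child

    def to_element(self):
        """Convert this definition and its subtree to an XML element (assumed layout)."""
        elem = ET.Element("tag", {"name": self.name})
        for attribute in self.attributes:
            ET.SubElement(elem, "attribute", {"name": attribute})
        for child in self.children:
            elem.append(child.to_element())
        return elem


root = TagDef("substance")
protein = root.add_child("protein")   # corresponds to creating a child tag
protein.attributes.append("subclass") # corresponds to adding an attribute
definition_xml = ET.tostring(root.to_element(), encoding="unicode")
```

Writing `definition_xml` to disk would correspond to saving the tag definitions as a file.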
JTag: Tagging Tool
Implemented in Java
Functions:
Insertion, deletion, and editing of tags
Features:
Tag data is saved as an XML document
Annotated text can be saved as a GPML document (tag-embedded form)
Screen Capture of JTag
Insertion of tag
Click “Insert” to insert a new tag
Editing and deletion of tag
Click “Edit” to edit the tag
Click “Delete” to delete the tag
Saving data
Click “save” or “save as” to save the tag data as an XML document
Click “Export” to save a GPML document
Searching the terms
Click “Highlight Special Words” to search for a term
Click “Position Jump” to jump to any position