Pax Terminologica -- Ontology-Based Alignment of Semantic Roles


Pax Terminologica
Barry Smith
Institute for Formal Ontology and Medical
Information Science (IFOMIS), Saarland University /
University at Buffalo
1
Overview
systems for semantic annotation
linguistics vs. science
semantic annotation in biomedical informatics
improving systems for semantic annotation
conclusions
2
The Penn Treebank Project

annotates naturally occurring text for
linguistic structure, producing skeletal
parses showing syntactic and semantic
information in tree form
3
Automatic Content Extraction
Program (ACE)

develops text corpora in English,
Chinese and Arabic annotated for
entities, the relations among them and
the events in which they participate.
4
High Accuracy Retrieval from
Documents (HARD)

creates corpora and annotations
including topics, metadata and relevance
judgements
5
Annotation Graph Toolkit (AGTK)

formal framework for representing
linguistic annotations of time series data.
6
TimeML
robust specification language for markup of natural language to support:
• time stamping of events (identifying and anchoring in time)
• ordering events with respect to one another
• reasoning about persistence
7
SpaceML
provides facilities for annotating
• category attributions to spatial regions (self-connected, bounded, regular, etc.)
• ascription to regions of topological, distance, morphological and orientation relations
• the definition of a region in terms of its boundary
8
WordNet

annotates English nouns, verbs,
adjectives and adverbs to synonym sets,
each representing one underlying lexical
concept.
9
FrameNet

documents the range of semantic and
syntactic combinatory possibilities
(valences) of each word in each of its
senses
10
is there order in this chaos?
11
ISO/TC 37 / SC 4 N 076

Ide, N., Romary, L., de la Clergerie, E.
(2003). International Standard for a
Linguistic Annotation Framework.
HLT-NAACL 2003 (Edmonton)
12
OntoGloss (influenced by ISO
Linguistic Annotation Framework)

an ontology-based annotation tool that
uses predefined terms in an ontology
to mark up a document
No standard portal for semantic
annotation tools/projects (?)
13
Purposes of semantic annotation
information retrieval (incl. semantic indexing = answering queries that use words not used in the text, including words from other languages)
automatic translation
disambiguation
topic extraction and text summarization
information integration
reasoning
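Semantic indexing, the first purpose listed above, can be illustrated with a minimal sketch. It assumes a toy synonym table in the style of WordNet synonym sets; the concept ids, word forms, and documents are invented for illustration and come from no real resource.

```python
# Toy synonym sets (hypothetical data): each concept id maps to word forms,
# possibly drawn from several languages.
SYNSETS = {
    "concept:heart_attack": {"heart attack", "myocardial infarction", "herzinfarkt"},
    "concept:bone_cell": {"osteoblast", "bone-forming cell"},
}

def concepts_in(text):
    """Return the ids of all concepts with at least one word form in the text."""
    lowered = text.lower()
    return {cid for cid, forms in SYNSETS.items()
            if any(form in lowered for form in forms)}

def build_index(documents):
    """Map each concept id to the set of document ids annotated with it."""
    index = {}
    for doc_id, text in documents.items():
        for cid in concepts_in(text):
            index.setdefault(cid, set()).add(doc_id)
    return index

def search(index, query):
    """Find documents that share a concept with the query, not merely a word."""
    hits = set()
    for cid in concepts_in(query):
        hits |= index.get(cid, set())
    return hits

documents = {
    "doc1": "The patient suffered a myocardial infarction in 2004.",
    "doc2": "Osteoblast activity was measured in vitro.",
}
index = build_index(documents)
print(search(index, "Herzinfarkt"))  # the query word never occurs in doc1
```

Because both query and text are reduced to concept ids before matching, the German query retrieves an English document that does not contain the query word.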
14
for linguistics
fiction no less important than fact
English has no privileged status
regimentation not allowed
annotation frameworks may be competitive
cross-framework consistency is not important
15
for science
factual discourse alone important
English is language par excellence
regimentation is allowed
goal of truth: to create a single computer-processable map of reality
truth is one → must strive for consistency of annotations and additivity of annotation frameworks
16
for science
must end the terminology wars
Plant Ontology (PO)
cell =def. structural and physiological unit of a plant
what should PO do when it needs to study bacteria in plants?
answer: all shall use the word ‘cell’ to mean the same thing!
(all = in biology)
17
the ideal (of additivity)
WordNet for single word forms
FrameNet for valencies/combination forms
SpaceNet for spatial structures
TimeNet for temporal structures
ChemNet for chemical structures
CellNet for cellular structures
etc.

18
a scientific problem: huge swarms of
biomedical data at different
granularities, from molecule to clinic

methods for data integration needed to
enable reasoning across data at multiple
granularities

(genomic medicine ...)
19
orthodox solutions to this problem
dumb statistical number-crunching
or:
Semantic Web, Unified Medical Language System (UMLS), Moby, etc.
let a million flowers bloom
and rely on mappings between already existing controlled vocabularies/annotation systems
20
an alternative solution
use the peer-reviewed biomedical literature
contains both textual descriptions of biological functions (incl. diseases) and references to entities represented in the biochemical databases
use high-quality semantic annotations of the former to integrate across the latter
→ the Gene Ontology
21
22
23
The methodology of annotations

Model organism databases employ scientific
curators, who use the experimental observations
reported in the biomedical literature to link gene
products (such as proteins) with GO terms in
annotations.
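As a rough illustration of what such an annotation records, the sketch below links a gene product to a GO term together with the publication that supports the link. The field names and all identifiers are placeholders assumed for this example; they are not the schema of any model organism database.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    gene_product: str  # e.g. a protein identifier (placeholder values below)
    go_term: str       # identifier of the GO term asserted for the product
    reference: str     # publication reporting the supporting experiment

def index_by_term(annotations):
    """Group gene products by the GO term they are annotated with."""
    by_term = defaultdict(set)
    for ann in annotations:
        by_term[ann.go_term].add(ann.gene_product)
    return by_term

# Placeholder identifiers, not real database entries.
annotations = [
    Annotation("protein-A", "GO:example-1", "paper-17"),
    Annotation("protein-B", "GO:example-1", "paper-42"),
    Annotation("protein-A", "GO:example-2", "paper-17"),
]
by_term = index_by_term(annotations)
print(sorted(by_term["GO:example-1"]))
```

Grouping by term is what lets a shared vocabulary act as an integration point: two databases that use the same term identifier can be queried together through it.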
24
The process of annotations
leads to improvements and extensions of the ontology, which in turn leads to better annotations
a virtuous cycle of improvement in the quality and reach of both future annotations and the ontology itself,
yielding a slowly growing computer-interpretable map of biological reality within which major databases are automatically integrated in semantically searchable form
25
need to extend GO by means of other ontologies, e.g. Cell Ontology, via integrated definitions
GO + Cell type = New Definition
Cell type:
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A.11.329.629]
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
New Definition:
Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.
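The Cell Ontology entry shown above uses simple tag: value lines, which makes it easy to read mechanically. The following is a minimal sketch of such a reader, assuming the wrapped definition has been joined into a single line. It handles only this one stanza and is not a full parser for the OBO file format; the function names are invented for the example.

```python
STANZA = '''\
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A.11.329.629]
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
'''

def parse_stanza(text):
    """Parse 'tag: value' lines into a dict mapping each tag to a list of values."""
    term = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the first ': ' only, so ids such as CL:0000062 stay intact.
        tag, _, value = line.partition(": ")
        term.setdefault(tag, []).append(value.strip())
    return term

def relationships(term):
    """Split each 'relationship' value into a (relation, target id) pair."""
    pairs = []
    for value in term.get("relationship", []):
        relation, target = value.split(None, 1)
        pairs.append((relation, target))
    return pairs

term = parse_stanza(STANZA)
print(term["name"][0])
print(relationships(term))
```

Once the develops_from targets are available as identifiers, a definition such as the one for osteoblast differentiation can refer to the cell types by id rather than by free text.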
26
need to extend GO also to semantic
annotation of clinical literature
unfortunately, available (UMLS)
clinical vocabularies are of variable
quality and low mutual consistency
27
→ need for prospective standards to assure consistency and high quality
create rules for high-quality controlled vocabularies for the annotation of scientific literature
make everyone follow these rules
regimentation!
28
first step
a shared portal for (so far) 58 ontologies
(low regimentation)
http://obo.sourceforge.net
29
30
Second step:
The OBO Foundry
http://obofoundry.org/
31
The OBO Foundry
scientific standards and principles-based coordination of systems for
semantic annotation of biomedical
literature to create a single
interoperable family of gold standard
reference ontologies
32
The OBO Foundry
A subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles designed to ensure
– formal robustness
– stability
– compatibility
– interoperability
– support for logic-based reasoning
33
The OBO Foundry
– Custodians
• Michael Ashburner (Cambridge)
• Suzanna Lewis (Berkeley)
• Barry Smith (Buffalo/Saarbrücken)
34
The OBO Foundry
A prospective standard
designed to guarantee interoperability of
ontologies from the very start
established March 2006; already 13
OBO ontologies have joined the
Foundry and are being correspondingly
reformed; three new ontologies are
being constructed ab initio in its terms
35
The OBO Foundry
Initial Candidate Members
– GO Gene Ontology
– CL Cell Ontology
– SO Sequence Ontology
– ChEBI Chemical Ontology
– PATO Phenotype (Quality) Ontology
– FuGO Functional Genomics Investigation Ontology
– FMA Foundational Model of Anatomy
– RO Relation Ontology
– ChEBI Chemical Entities of Biological Interest
– CARO Common Anatomy Reference Ontology
– PrO Protein Ontology
– RnaO RNA Ontology
36
The OBO Foundry
Under development
– Disease Ontology
– Mammalian Phenotype Ontology
– OBO-UBO / Ontology of Biomedical Reality
– Organism (Species) Ontology
– Plant Trait Ontology
– Environment Ontology
– Behavior Ontology
– Biomedical Image Ontology
– Clinical Trial Ontology
37
38
The OBO Foundry
CRITERIA
• The ontology is open and available to be used by all.
• The ontology is in, or can be instantiated in, a common formal language.
• The developers of the ontology agree in advance to collaborate with developers of other OBO Foundry ontologies where domains overlap.
39
The OBO Foundry
CRITERIA
• The developers of each ontology commit to its
maintenance in light of scientific advance, and to
soliciting community feedback for its improvement.
• They commit to working with other Foundry
members to ensure that, for any particular domain,
there is community convergence on a single
controlled vocabulary
40
The OBO Foundry
CRITERIA
• The ontology possesses a unique identifier
space within OBO.
• The ontology provider has procedures for
identifying distinct successive versions.
• The ontology includes textual definitions for
all terms.
41
The OBO Foundry
CRITERIA
• The ontology has a clearly specified and clearly delineated content.
• The ontology is well-documented.
• The ontology has a plurality of independent users.
42
The OBO Foundry
CRITERIA
• The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.*
*Genome Biology 2005, 6:R46
43
OBO Relation Ontology
Foundational: is_a, part_of
Spatial: located_in, contained_in, adjacent_to
Temporal: transformation_of, derives_from, preceded_by
Participation: has_participant, has_agent
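Two of the relations listed above, is_a and part_of, are transitive, and this is one reason formally defined relations support automated reasoning. The sketch below computes the transitive closure of a handful of assertions. The terms and assertions are hypothetical, chosen only for illustration, and the function is a naive fixed-point loop rather than anything taken from the Relation Ontology itself.

```python
def transitive_closure(edges):
    """Return all (x, z) pairs reachable by chaining the given (x, y) pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

# Hypothetical assertions, not copied from any Foundry ontology.
is_a = {("osteoblast", "bone cell"), ("bone cell", "cell")}
part_of = {("nucleus", "cell"), ("cell", "tissue"), ("tissue", "organ")}

inferred_is_a = transitive_closure(is_a)
inferred_part_of = transitive_closure(part_of)
print(("osteoblast", "cell") in inferred_is_a)
print(("nucleus", "organ") in inferred_part_of)
```

The same closure can be run over assertions drawn from different ontologies, provided they use the relation with the same definition, which is the point of sharing one relation vocabulary.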
44
analogy with FrameNet
• the constituent ontologies in the OBO
Foundry are focused overwhelmingly on
single nouns
• the OBO Relation Ontology is designed to
ensure a common structure of relations
shared by all Foundry ontologies –
comparable to SpaceML, TimeML ...
• need something like (Bio)FrameNet to pull
the different levels of granularity together
45
The OBO Foundry
CRITERIA
• Further criteria will be added over time in order to bring about a gradual improvement in the quality of the ontologies in the Foundry
46
The OBO Foundry
GOALS
• semantic alignment of OBO Foundry
ontologies through a common system of
formally defined relations
• to enable reasoning both within and across
ontologies, and thus also within and between
the literature annotated in its terms
• and thus also to support reasoning across
associated data
47
The OBO Foundry
GOALS
• to promote re-usability of data
• if data-schemas are formulated using a
single well-integrated framework for
semantic annotation in widespread use,
then this data will to this degree itself
become more widely accessible and
usable
48
The OBO Foundry
GOALS
• to help in creating better mappings e.g.
between human and model organism
phenotypes:
S Zhang, O Bodenreider, “Alignment of Multiple
Ontologies of Anatomy: Deriving Indirect Mappings from
Direct Mappings to a Reference Ontology”, AMIA 2005
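The idea named in the title cited above, deriving indirect mappings from direct mappings to a reference ontology, can be sketched as a simple join on the reference term. This is only an illustration of that general idea under assumed data, not the procedure of the cited paper; all identifiers are invented.

```python
def derive_indirect(map_a_to_ref, map_b_to_ref):
    """Pair terms of A and B that are mapped to the same reference term."""
    # Invert B's mapping so reference terms can be looked up directly.
    ref_to_b = {}
    for b_term, ref in map_b_to_ref.items():
        ref_to_b.setdefault(ref, set()).add(b_term)
    pairs = set()
    for a_term, ref in map_a_to_ref.items():
        for b_term in ref_to_b.get(ref, set()):
            pairs.add((a_term, b_term))
    return pairs

# Hypothetical direct mappings from two anatomy vocabularies to one reference.
human_to_ref = {"HUM:heart": "REF:0001", "HUM:femur": "REF:0002"}
mouse_to_ref = {"MUS:heart": "REF:0001", "MUS:tail": "REF:0003"}
print(derive_indirect(human_to_ref, mouse_to_ref))
```

With a shared reference, n vocabularies need n direct mappings rather than a separate mapping for every pair.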
49
The OBO Foundry
GOALS
• to introduce the scientific method into the
development of semantic annotation
frameworks
• to introduce some of the features of scientific
peer review into biomedical ontology
development
50
The OBO Foundry
GOALS
• to aid literature search:
http://www.gopubmed.org/
• to subvert the current policy of ad hoc creation
of new annotation schemas by each clinical
research group by providing a common shared
framework
51
The OBO Foundry
GOALS
• to use the Foundry ontologies as a benchmark
for improving existing terminologies
• to create controlled vocabularies for
semantic annotation of clinical trial records,
scientific journal articles, ...
52
The OBO Foundry
GOALS
• to create an evolving map-like computable
representation of the entire domain of
biomedical reality
• to create the conditions for a step-by-step
evolution towards high quality ontologies in the
biomedical domain
• which will serve as stable attractors for clinical
and biomedical researchers in the future
53
The OBO Foundry
GOALS
• to end the terminology wars; and to
advance regimentation of clinical and
other vocabularies in a scientific spirit
54
Conclusion 1
existing linguistic resources for semantic annotation are scattered to the four winds
→ need for something like the OBO Library to ensure that the different available tools are available for comparison and alignment
55
Conclusion 2

linguists developing tools for semantic
annotation with scientific purposes need
something like the Foundry to ensure a
complete set of interoperable tools which
allow for additivity of annotations
56
the ideal
BioWordNet for single word forms
SpaceNet for spatial structures
TimeNet for temporal structures
ChemNet for chemical structures
CellNet for cellular structures
BioFrameNet for valencies/combination forms
57
58