Integrating Existing Ontologies


Copyright © 1997 Pangea Systems, Inc. All rights reserved.
Building Ontologies
Copyright © 1998 Pangea Systems, Inc. All rights reserved.
- No field of Ontological Engineering equivalent to Knowledge or Software Engineering;
- No standard methodologies for building ontologies;
- Such a methodology would include:
  - a set of stages that occur when building ontologies;
  - guidelines and principles to assist in the different stages;
  - an ontology life-cycle which indicates the relationships among stages.
- Gruber's guidelines for constructing ontologies are well known.
The Development Lifecycle


- Two kinds of complementary methodologies emerged:
  - Stage-based, e.g. TOVE [Uschold96]
  - Iterative evolving prototypes, e.g. MethOntology [Gomez Perez94].
- Most have TWO stages:
  1. Informal stage: ontology is sketched out using either natural language descriptions or some diagram technique
  2. Formal stage: ontology is encoded in a formal knowledge representation language that is machine computable
- An ontology should ideally be communicated to people and unambiguously interpreted by software
  - the informal representation helps the former
  - the formal representation helps the latter.
A Provisional Methodology
- A skeletal methodology and life-cycle for building ontologies;
- Inspired by the software engineering V-process model;
  - The left side charts the processes in building an ontology
  - The right side charts the guidelines, principles and evaluation used to 'quality assure' the ontology
- The overall process moves through a life-cycle.
The V-model Methodology
[Figure: V-model diagram]
- User Model: Identify purpose and scope; Knowledge acquisition. Ontology in Use. Evaluation: coverage, verification, granularity.
- Conceptualisation Model: Conceptualisation; Integrating existing ontologies. Conceptualisation principles: commitment, conciseness, clarity, extensibility, coherency.
- Implementation Model: Encoding; Representation. Encoding/Representation principles: encoding bias, consistency, house styles and standards, reasoning system exploitation.
The ontology building life-cycle
[Figure: life-cycle diagram]
- Identify purpose and scope
- Knowledge acquisition
- Building
- Conceptualisation
- Encoding
- Evaluation
- Integrating existing ontologies
- Language and representation
- Available development tools
User Model: Identify purpose and scope
- Decide what applications the ontology will support
  - EcoCyc: Pathway engineering, qualitative simulation of metabolism, computer-aided instruction, reference source
  - TAMBIS: retrieval across a broad range of bioinformatics resources
- The use to which an ontology is put affects its content and style
  - Impacts re-usability of the ontology
User Model: Knowledge Acquisition
- Specialist biologists; standard text books; research papers and other ontologies and database schema.
- Motivating scenarios and informal competency questions: informal questions the ontology must be able to answer
- Evaluation: Fitness for purpose
  - Coverage and competency
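
Informal competency questions can later be turned into executable coverage checks. The following is a minimal sketch, assuming a toy ontology held in a Python dict; the class names, slot names and the three questions are hypothetical illustrations and are not taken from EcoCyc or TAMBIS.

```python
# Toy ontology: class name -> parent class and slots (all names are hypothetical).
ONTOLOGY = {
    "Macromolecule": {"parent": None, "slots": set()},
    "Protein": {"parent": "Macromolecule", "slots": {"sequence"}},
    "Enzyme": {"parent": "Protein", "slots": {"catalyses"}},
}


def ancestors(cls):
    """Return the superclasses of cls, nearest first."""
    found = []
    parent = ONTOLOGY[cls]["parent"]
    while parent is not None:
        found.append(parent)
        parent = ONTOLOGY[parent]["parent"]
    return found


def all_slots(cls):
    """Slots declared on cls plus those inherited from its ancestors."""
    slots = set(ONTOLOGY[cls]["slots"])
    for ancestor in ancestors(cls):
        slots |= ONTOLOGY[ancestor]["slots"]
    return slots


# Each competency question becomes a named yes/no check.
COMPETENCY_QUESTIONS = {
    "Is every enzyme a protein?":
        lambda: "Protein" in ancestors("Enzyme"),
    "Can we ask which reaction an enzyme catalyses?":
        lambda: "catalyses" in all_slots("Enzyme"),
    "Can we ask where a protein is located?":
        lambda: "location" in all_slots("Protein"),
}


def coverage_report():
    """Map each question to True (answerable) or False (a coverage gap)."""
    return {question: check() for question, check in COMPETENCY_QUESTIONS.items()}
```

A question that maps to False marks a gap in coverage: either the ontology needs extending or the question is out of scope.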

Conceptualisation Model: Conceptualisation


- Identify the key concepts, their properties and the relationships that hold between them;
  - Which ones are essential?
  - What information will be required by the applications?
- Structure domain knowledge into explicit conceptual models.
- Identify natural language terms to refer to such concepts, relations and attributes;
- Determine naming conventions
  - Consistent naming for classes and slots
  - EcoCyc:
    - Classes are capitalized, hyphenated, plural
    - Slot names are uppercase
- A quality ontology captures relevant biological distinctions with high fidelity
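
A naming convention is easier to keep consistent when it is checked mechanically. The sketch below assumes the EcoCyc-style convention just described (capitalized, hyphenated class names; uppercase slot names). The function name and the sample slot names are hypothetical, and pluralization is not checked because a regular expression cannot do that reliably.

```python
import re

# Capitalized words joined by hyphens, e.g. "Protein-Complexes".
CLASS_NAME = re.compile(r"[A-Z][A-Za-z]*(-[A-Z][A-Za-z]*)*")
# Uppercase words joined by hyphens, e.g. "LEFT-END" (a hypothetical slot name).
SLOT_NAME = re.compile(r"[A-Z]+(-[A-Z]+)*")


def naming_violations(class_names, slot_names):
    """Return the names that break the convention, labelled by kind."""
    bad = [f"class {n}" for n in class_names if not CLASS_NAME.fullmatch(n)]
    bad += [f"slot {n}" for n in slot_names if not SLOT_NAME.fullmatch(n)]
    return bad


violations = naming_violations(
    ["Protein-Complexes", "DNA-Segments", "nucleic acids"],
    ["CATALYZES", "leftEnd"],
)
```

Here `violations` lists "nucleic acids" and "leftEnd", the two names that do not follow the convention.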
Conceptualisation Model: Pitfalls
- Pitfall: Missing ontological elements
  - Missing classes: Swiss-Prot Protein complexes
  - Missing attributes: Genetic code identifier
  - Confuse 1:1 with 1:Many, or 1:Many with Many:Many
    - Cofactor as an attribute of reaction
  - Important data is stored within text/comment fields
- Pitfall: Extra ontological elements
- Pitfall: Stop-over (when do I stop elaborating?)
- Pitfall: Relevance (do I really need all this detail?)

Integrating Existing Ontologies
- Reuse or adapt existing ontologies when possible
  - Save time
  - Correctness
  - Facilitate interoperation
- Integration of ontologies
  - Ontologies have to be aligned
  - Hindered by poor documentation and argumentation
  - Hindered by implicit assumptions
  - Shared generic upper level ontologies should make integration easier
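
A first alignment pass is often purely lexical. The sketch below proposes candidate matches between two lists of class names by normalising their labels. The `normalise` and `propose_alignment` helpers are hypothetical, and the plural handling is deliberately crude. Lexical matches are only candidates: the implicit assumptions noted above mean each pair still needs human review.

```python
def normalise(label):
    """Lowercase, treat hyphens as spaces, and crudely strip a plural 's'."""
    text = label.lower().replace("-", " ").strip()
    return text[:-1] if text.endswith("s") else text


def propose_alignment(names_a, names_b):
    """Return (a, b) pairs whose normalised labels coincide, plus unmatched names from a."""
    index_b = {normalise(b): b for b in names_b}
    pairs, unmatched = [], []
    for a in names_a:
        match = index_b.get(normalise(a))
        if match is None:
            unmatched.append(a)
        else:
            pairs.append((a, match))
    return pairs, unmatched


reference = ["MacroMolecule", "Nucleic Acid", "Protein", "Peptide"]
ecocyc = ["MacroMolecule", "Nucleic-Acids", "Proteins", "PolyPeptides"]
pairs, unmatched = propose_alignment(reference, ecocyc)
```

With these inputs "Peptide" stays unmatched. Deciding whether it corresponds to "PolyPeptides" is a modelling question, not one that string matching can settle.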
Encoding: Implementation Toolkit

- Construct ontology using an ontology-development system
  - Does the data model have the right expressivity?
    - Is it just a taxonomy or are relationships needed?
    - Is multiple parentage needed? Inverse relationships?
    - What types of constraints are needed?
    - Are reasoning services needed?
  - What are authoring features of the development tool?
  - Can ontology be exported to a DBMS schema?
  - Can ontology be exported to an ontology exchange language?
  - Is simultaneous updating by multiple authors needed?
  - Size limitations of development tool?
Encoding: Ontology Implementation Pitfalls
- Pitfall: Semantic ambiguity
  - Multiple ways to encode the same information
  - Meaning of class definitions unclear
- Pitfall: Encoding Bias
  - Encoding the ontology changes the ontology
Encoding: Ontology Implementation Pitfalls
- Pitfall: Redundancy (lack of normalization)
  - Exact same information repeated
  - Presence of computationally derivable information
    - Date of birth and age
    - DNA sequence and reverse complement
  - More effort required for entry and update
  - Partial updates lead to inconsistency
  - OK if redundant information is maintained automatically
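
One way to avoid this kind of redundancy is to store only the primary value and derive the rest on demand. The sketch below does this for the reverse-complement example. The `DnaSegment` class is hypothetical, and the code assumes sequences contain only uppercase A, C, G and T.

```python
from dataclasses import dataclass

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class DnaSegment:
    sequence: str  # the only stored field

    @property
    def reverse_complement(self):
        """Derived on demand, so it can never disagree with `sequence`."""
        return self.sequence.translate(_COMPLEMENT)[::-1]


segment = DnaSegment("ATGC")
first = segment.reverse_complement   # "GCAT"
segment.sequence = "AAAC"            # a single update, nothing else to keep in sync
second = segment.reverse_complement  # "GTTT"
```

Because the derived value is never stored, a partial update cannot leave the two out of step.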
Encoding: The Interaction Problem
- Task influences what knowledge is represented and how it is represented
  - Molecular biology: chemical and physical properties of proteins
  - Bioinformatics: accession number, function, gene
  - Underlying perspectives mean they may not be reconcilable
- If an ontology has too many conflicting tasks it can end up compromised (TaO experience)
Evaluate it - A guide for reusability
- Conciseness
  - No redundancy
  - Appropriateness: protein molecules at the atomic resolution when amino acid level would do
- Clarity
- Consistency
  - Satisfiability: it doesn't contradict itself
  - Enzyme is both a protein which catalyses a reaction and does not catalyse a reaction
- Commitment
  - Do I have to buy into a load of stuff I don't really need or want just to get the bit I do?
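
The simplest consistency check looks for statements asserted both positively and negatively. The sketch below assumes assertions are stored as hypothetical (subject, predicate, object, holds) tuples. It only catches direct contradictions of this kind and does far less than a real reasoning system.

```python
def contradictions(assertions):
    """Return statements that are asserted both as true and as false."""
    positive = {(s, p, o) for s, p, o, holds in assertions if holds}
    negative = {(s, p, o) for s, p, o, holds in assertions if not holds}
    return sorted(positive & negative)


assertions = [
    ("Enzyme", "isA", "Protein", True),
    ("Enzyme", "catalyses", "Reaction", True),
    ("Enzyme", "catalyses", "Reaction", False),  # the contradiction from the slide
]
clashes = contradictions(assertions)
```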

Documentation: Make Ontology Understandable!
- Produce clear informal and formal documentation
  - An ontology that cannot be understood will not be reused
  - Genbank feature table
  - NCBI ASN.1 definitions
- There exists a space of alternative ontology design decisions
  - Semantics / Granularity
  - Terminology
- Pitfall: Neglecting to record design rationale
Publish the Ontology
- Formal and informal specifications
  - Intended domain of application
  - Design rationale
  - Limitations
- See EcoCyc paper in ISMB-93/Bioinformatics 00
- See TAMBIS paper in Bioinformatics 99
Macromolecule Reference Ontology
[Figure: class hierarchy diagram]
Classes: MacroMolecule, SequenceComponent, Gene, Lipid, Nucleic Acid, RNA, Protein, Motif, Phosphorylation site, Peptide, Enzyme, Restriction site, mRNA, cDNA, DNA, gDNA, mDNA
Relationship: componentOf
Discussion
- What is a macromolecule?
- Where does macromolecule fit into an upper level ontology?
  - Substance?
  - Structure?
- Is lipid a macromolecule?
  - If we replace macromolecule with biopolymer is the placement of lipid legit?
- Is a peptide a protein and therefore a macromolecule? If not, where does it go?
Taxonomy and Roles
- Do we want to assert everything in a taxonomy?
- Or do we want to define things in terms of their properties?
  - Enzyme = Protein catalyses Reaction
  - gDNA = DNA hasLocation Chromosomal
  - Sufficient as well as necessary conditions
- What's the relationship between
  - cDNA and EST
  - cDNA and some child of RNA?
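
Defining classes by their properties means membership can be computed rather than asserted. The sketch below treats the two definitions above as necessary-and-sufficient conditions and classifies individuals against them. The individuals and the dict-based encoding are hypothetical illustrations, not the behaviour of any particular reasoning system.

```python
# Each individual is described by a type plus property values (hypothetical data).
INDIVIDUALS = {
    "hexokinase": {"type": "Protein", "catalyses": "Reaction"},
    "collagen": {"type": "Protein"},
    "chrom-fragment": {"type": "DNA", "hasLocation": "Chromosomal"},
}

# Necessary-and-sufficient definitions, following the two slide examples.
DEFINITIONS = {
    "Enzyme": {"type": "Protein", "catalyses": "Reaction"},
    "gDNA": {"type": "DNA", "hasLocation": "Chromosomal"},
}


def classify(description):
    """Return the defined classes whose every condition the description meets."""
    return sorted(
        name
        for name, conditions in DEFINITIONS.items()
        if all(description.get(prop) == value for prop, value in conditions.items())
    )


inferred = {name: classify(desc) for name, desc in INDIVIDUALS.items()}
```

Nothing in the data states that "hexokinase" is an Enzyme; that membership follows from the definition, while "collagen" stays a plain Protein.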
Axioms and constraints
- Not all RNA is translated to protein
  - Do we want to say that DNA is translated to protein?
  - Do we want to model catalytic RNAs?
- Relationships: what other ones do we need?
  - Genes express proteins
  - Genes express rRNA, tRNA
  - Genes are found on gDNA
  - Genes are found on mDNA
  - Genes have their own components: recursive relationships with partitive semantics
- Reasoning? Instances?
- Reusable? Clear? Concise?
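
A recursive part-whole relationship can be queried by following the links transitively. The sketch below assumes a single componentOf link per part and uses hypothetical instance names. Treating "found on" as componentOf is itself a modelling assumption made only for this illustration.

```python
# Direct componentOf links: part -> whole (hypothetical instance names).
COMPONENT_OF = {
    "exon-1": "gene-A",
    "promoter-A": "gene-A",
    "gene-A": "gDNA-1",
}


def wholes(part):
    """Follow componentOf upward and list every whole that contains `part`."""
    chain = []
    seen = {part}
    while part in COMPONENT_OF:
        part = COMPONENT_OF[part]
        if part in seen:  # guard against accidental cycles in the data
            raise ValueError(f"componentOf cycle at {part}")
        seen.add(part)
        chain.append(part)
    return chain


containers = wholes("exon-1")
```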
Ontological Pitfalls
- Stop-over: when do I stop over-elaborating?
  - Proteins -> amino acid residues -> side chains -> physical chemical properties ...
- Relevance
  - Do we need to mention all the types of nucleic acid?
EcoCyc
[Figure: EcoCyc class hierarchy diagram]
Classes: Chemicals, MacroMolecule, Compounds-And-Elements, Nucleic-Acids, Compounds, Proteins, Lipids, RNA, Misc-RNA, DNA, PolyPeptides, Protein-Complexes, DNA-Segments, Genes
Macromolecule in other Ontologies
- Gene Ontology
  - Used to add attributes to gene instances in databases
  - Doesn't need to talk about molecules or components of molecules
- TAMBIS Ontology
  - Models it in a similar way to our reference macromolecule ontology
  - Because it asks questions of bioinformatics sources