Building Ontologies from the Ground Up When users set out to

DIY Ontology Development
Mark A. Musen
Professor of Medicine and Computer Science
Stanford University
Porphyry’s depiction of Aristotle’s Categories
Supreme genus: SUBSTANCE
Differentiae: material / immaterial
Subordinate genera: BODY / SPIRIT
Differentiae: animate / inanimate
Subordinate genera: LIVING / MINERAL
Differentiae: sensitive / insensitive
Proximate genera: ANIMAL / PLANT
Differentiae: rational / irrational
Species: HUMAN / BEAST
Individuals: Socrates, Plato, Aristotle, …
Creating Ontologies
in Machine-Processable Form
• Provides a mechanism for developers to codify
salient distinctions about the world or some
application area
• Provides a structure for knowledge bases that can
enable
– Information retrieval
– Information integration
– Automated translation
– Decision support
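As a minimal sketch of what "machine-processable" can mean, the fragment below encodes Porphyry's tree from the earlier slide as explicit is-a links and lets a program derive the superclasses of a term. The names `IS_A`, `INSTANCE_OF`, `ancestors`, and `is_a` are illustrative assumptions and do not come from any ontology tool or standard.

```python
# Illustrative sketch only: Porphyry's tree as explicit is-a links.
# Each class maps to its direct superclass (None for the supreme genus).
IS_A = {
    "SUBSTANCE": None,
    "BODY": "SUBSTANCE",
    "SPIRIT": "SUBSTANCE",
    "LIVING": "BODY",
    "MINERAL": "BODY",
    "ANIMAL": "LIVING",
    "PLANT": "LIVING",
    "HUMAN": "ANIMAL",
    "BEAST": "ANIMAL",
}

# Individuals are attached to their species.
INSTANCE_OF = {"Socrates": "HUMAN", "Plato": "HUMAN", "Aristotle": "HUMAN"}


def ancestors(cls):
    """Return the superclasses of cls, nearest first."""
    chain = []
    parent = IS_A[cls]
    while parent is not None:
        chain.append(parent)
        parent = IS_A[parent]
    return chain


def is_a(individual, cls):
    """True if the individual belongs to cls directly or by inheritance."""
    direct = INSTANCE_OF[individual]
    return cls == direct or cls in ancestors(direct)


print(ancestors("HUMAN"))          # ['ANIMAL', 'LIVING', 'BODY', 'SUBSTANCE']
print(is_a("Socrates", "BODY"))    # True
print(is_a("Socrates", "SPIRIT"))  # False
```

Nothing states that Socrates is a BODY; the program infers it by following the links. That kind of derived relationship is what a knowledge base built on such a structure can use for retrieval or integration.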
The New Philosophers
• Categorizing “what exists” in machine-understandable form
• Providing a structure that enables
– Developers to locate and update relevant
descriptions
– Computers to infer relationships and properties
• Creating new abstractions to facilitate the
creation of this structure
There is a misconception …
• That people building ontologies are all well versed
in metaphysics, computer science, knowledge
representation, and the content domain
• That ontologies in the real world are “clean” and
well defined
• That most people who are creating ontologies
understand all the ramifications of what they are
doing!
Lots of ontology builders are not
very good philosophers
• Nearly always, ontologies are created to address
pressing professional needs
• The people who have the most insight into
professional knowledge may have little
appreciation for metaphysics, principles of
knowledge representation, or computational logic
• There simply aren’t enough good philosophers to
go around
The pressing need to standardize
the names of human genes
But the human genome is only
part of the problem …
• Biologists maintain huge databases of gene
sequences and gene expression for a wide range of
“model organisms” (e.g., mouse, rat, yeast, fruit
fly, round worm, slime mold)
• Database entries are annotated with details
such as the name of a gene, the function of the
gene, and so on
• How do you ensure uniformity in the nature of
these annotations?
Gene Ontology Consortium
• Founded in 1998 as a collaboration among
scientists responsible for developing different
databases of genomic data for model organisms
(fruit fly, yeast, mouse)
• Now, essentially all developers of all model-organism
databases participate
• Goal: To produce a dynamic, controlled
vocabulary that can be applied to all organism
databases even as knowledge of gene and protein
roles in cells is accumulating and changing
GO = Three Ontologies
• Molecular Function
– elemental activity or task
– example: DNA binding
• Cellular Component
– location or complex
– example: cell nucleus
• Biological Process
– goal or objective within cell
– example: secretion
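One way to picture how a controlled vocabulary enforces uniform annotations is the sketch below. It checks each database annotation against the three ontologies, using only the example terms from this slide. The `VOCABULARY` layout, the `check_annotation` function, and the gene names are hypothetical and are not GO's actual data format or tooling.

```python
# Illustrative sketch only: a tiny controlled vocabulary grouped by the
# three ontologies, populated with the example terms from the slide.
VOCABULARY = {
    "molecular_function": {"DNA binding"},
    "cellular_component": {"cell nucleus"},
    "biological_process": {"secretion"},
}


def check_annotation(annotation):
    """Return a list of problems with one annotation (empty if none)."""
    problems = []
    for aspect, term in annotation["terms"].items():
        if aspect not in VOCABULARY:
            problems.append(f"unknown ontology: {aspect}")
        elif term not in VOCABULARY[aspect]:
            problems.append(f"term not in {aspect}: {term}")
    return problems


# Hypothetical database entries; the gene names are made up.
good = {
    "gene": "geneA",
    "terms": {
        "molecular_function": "DNA binding",
        "cellular_component": "cell nucleus",
    },
}
bad = {"gene": "geneB", "terms": {"molecular_function": "binds DNA"}}

print(check_annotation(good))  # []
print(check_annotation(bad))   # ['term not in molecular_function: binds DNA']
```

The free-text phrase "binds DNA" is rejected even though a biologist would read it as equivalent to "DNA binding". Forcing every database to use the same term is what makes the annotations comparable across organisms.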
GO has been wildly successful!!
• Dozens of biologists around the world
contribute to GO on a regular basis
• The ontology is updated every 30 minutes!
• It’s now impossible to work in most areas of
computational biology without making use
of GO terms
But GO has real problems …
• Ontologies are represented in an idiosyncratic format that
is not compatible with standard knowledge-representation
systems
• The format is based on directed acyclic graphs of concepts,
without the general ability to specify machine-interpretable
properties of concepts or definitions of concepts
• Because of the informal knowledge-representation system,
lots of errors have crept into GO
– Terms that are duplicated in different places
– Terms with no superclasses
– Uncertain relationships between terms
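Two of the error types listed above can be detected mechanically once the graph is available as data. The sketch below assumes a hypothetical record layout (term id mapped to a name and a list of parent ids), not GO's own file format, and flags duplicated term names and terms with no superclass. It does not attempt the third error type, since judging whether a relationship is uncertain needs semantics that a bare graph does not carry.

```python
from collections import Counter

# Hypothetical term records: id -> (name, list of parent ids).
TERMS = {
    "T1": ("root process", []),
    "T2": ("secretion", ["T1"]),
    "T3": ("secretion", ["T1"]),        # same name under a second id
    "T4": ("protein secretion", ["T2"]),
    "T5": ("orphan term", []),          # no superclass, yet not a root
}
ROOTS = {"T1"}  # ids that are allowed to have no parent


def duplicated_names(terms):
    """Names that appear under more than one term id."""
    counts = Counter(name for name, _ in terms.values())
    return sorted(name for name, n in counts.items() if n > 1)


def orphans(terms, roots):
    """Term ids with no parents that are not declared roots."""
    return sorted(
        term_id
        for term_id, (_, parents) in terms.items()
        if not parents and term_id not in roots
    )


print(duplicated_names(TERMS))  # ['secretion']
print(orphans(TERMS, ROOTS))    # ['T5']
```

Checks like these are cheap to run on every update. They need only the graph structure, so they apply even to a format with no machine-interpretable definitions.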
Tension in the GO Community
• Biologists around the world with pressing needs to
integrate research databases work together to add
terms to GO nearly continuously
– Using an impoverished, nonstandard knowledge-representation
system
– Using no standards to assure uniform modeling
conventions from one part of GO to another
• Computer scientists bemoan all this
ad-hoc-ery and condemn GO as a hack that will
become increasingly unusable and unmaintainable
A wonderful keynote talk
from the recent meeting
on Standards and
Ontologies for Functional
Genomics
The Capulets and Montagues
A plague on both your houses?
Professor Carole Goble
University of Manchester, UK
Warning:
This talk contains sweeping generalisations
© Carole Goble
Prologue
Two households, both alike in dignity,
In fair genomics, where we lay our scene,
(One, comforted by its logic’s rigour,
Claims ontology for the realm of pure,
The other, with blessed scientist’s vigour,
Acts hastily on models that endure),
From ancient grudge break to new mutiny,
When “being” drives a fly-man to blaspheme.
From forth the fatal loins of these two foes
Researchers to unlock the book of life;
Whole misadventured piteous overthrows
Can with their work bury their clans’ strife.
The fruitful passage of their GO-mark'd love,
And the continuance of their studies sage,
Which, united, yield ontologies undreamed-of,
Is now the hours' traffic of our stage;
The which if you with patient ears attend,
What here shall miss, our toil shall strive to mend.
Based on an idea by Shakespeare
The Montagues
One, comforted by its logic’s rigour,
Claims ontology for the realm of pure
Computer Science, Knowledge engineering, AI
Logic and Languages
Theory
Top down, well-behaved neatness
Generic and lots of toys
Methodologies & patterns
Tools and standards
Technology push
Academic pursuit
The Capulets
The other, with blessed scientist’s vigour,
Acts hastily on models that endure
Life Scientists
Practice
Bottom up, real-world
Specific and many of them
Methodologies, community practice
Tools and standards
Application pull
Practical pursuit – build ‘n’ use it
The Philosophers
One, comforted by its logic’s rigour,
Claims ontology for the realm of pure
Philosophers
Theory
Truth
Generic – the one true ontology?
Methodologies, patterns & foundational ontologies
Not really into tools
No push or pull
Academic pursuit
The Princes of Genomics
Rebellious subjects, enemies to peace,
Profaners of this neighbour-stained steel,--
Will they not hear? What, ho! you men, you beasts,
That quench the fire of your pernicious rage
With purple fountains issuing from your veins,
On pain of torture, from those bloody hands
Throw your mistemper'd weapons to the ground,
And hear the sentence of your moved prince.
Three civil brawls, bred of an airy word,
By thee, old Capulet, and Montague,
Have thrice disturb'd the quiet of our streets,
And made genomics's ancient citizens
Cast by their grave beseeming ornaments,
To wield old partisans, in hands as old,
Canker'd with peace, to part your canker'd hate:
A tragedy?
As in Romeo and Juliet,
the threats are political
and sociological
Creating ontologies has become a
widespread cottage industry
• Professional Societies
– MGED: Microarray Gene Expression Data Society
– HUPO: Human Proteome Organization
• Government
– NCI Thesaurus
– NIST: Process Specification Language
• Open Biological Ontologies
– GO
– Three dozen (and growing) other ontologies
– Mostly in DAG-Edit, some in Protégé format
Moving from cottage industry
to the industrial age
• Government and professional societies must set
expectations regarding the need for appropriate
standards
• Government and professional societies must invest
in educational programs to teach Montagues to
identify with Capulets, and vice versa
• Demonstration projects must communicate to the
potential developers of future ontologies the
strengths and weaknesses of the guidelines, tools,
and languages that facilitated the development
work
A thousand flowers are blooming
from every corner of the landscape
• Ontologies are being developed by interested groups from
every sector of academia, industry, and government
• Many of these ontologies have been proven to be
extraordinarily useful to wide communities
• Many of these same ontologies have been shown to be
structurally flawed and of uncertain semantics
• We finally are at the stage where we have tools and
representation languages that can lift us out of the grass
roots to create durable and maintainable ontologies with
rich semantic content