Bio-Ontologies Meeting
Glasgow
30/07/04
Using ontologies to provide semantic richness
in biological image databases
(Sub-title: In Praise of Good Colleagues)
David Shotton
Director, Image Bioinformatics Research Laboratory
Oxford e-Science Centre
Department of Zoology, University of Oxford
Oxford OX1 3PS, UK
e-mail: david.shotton@zoo.ox.ac.uk
© David Shotton 2004
Acknowledgements
Chris Catton
BioImage Development Manager: ImageStore Ontology and SABO developer
Simon Sparks
BioImage Software Engineer: OWLBase query engine developer
John Pybus
BioImage Systems Manager
Chris Wilson
SABO research project
Chris Holland
ImageBLAST research project
Ruth Dalton
SABO research project
European Commission funding of the ORIEL Project - IST-2001-32688
Outline of my presentation
Expert knowledge and tacit knowledge
The Semantic Web and ontologies
Ontologies in biology
The BioImage Database: its purpose, structure and ontology usage
Enabling ‘smart queries’ by importing external ontologies into BioImage
ImageBLAST: hypersearches across distributed biological databases
Concluding remarks and cautionary tales
This is a fairly straightforward article, but nowhere in it are you told that:
Caenorhabditis elegans is a nematode worm, one of the handful of model
organisms for which the complete genome has been sequenced
or that
A transcription factor binds to nuclear DNA to control the readout of genetic
information from a particular gene
These facts are so basic to the paper that they are assumed
Expert knowledge and tacit knowledge
Mutual understanding within any field of knowledge is based on a
shared conceptualisation developed by scholars over the years
This shared conceptualisation is often implicit through scholars’ choice of
vocabulary and theories when speaking or writing
Furthermore, in order to communicate at the highest level (as in the Nature
paper), scholars must assume that those listening to or reading their words
are part of this community and share the conceptualization
Much of what is communicated in a paper or an academic lecture is first a
reinforcement and then an extension of the shared tacit knowledge.
It is this assumed tacit knowledge, every bit as much as the technical
jargon, that makes scientific literature so impenetrable to non-specialists
My next few slides are designed to make explicit some of the key points
relating to ontologies, for the benefit of those for whom this may be new
Electronic communication of complex knowledge
In human society, much of our knowledge is implicit or tacit: we know more than we think we know!
However, today, as more and more knowledge is held on-line, more and more
communication needs to be M2M, from one computer to another
To accomplish such communication successfully, and to permit semantic
reasoning over distributed information resources
such tacit knowledge must be made explicit, and
the meaning of information must be specified unambiguously
This is difficult, and demands anal attention to detail
The next slide illustrates what I mean . . .
[Slide image: a projected photograph of a panda, with layered captions]
What is this?
This is not a panda
This is not a photograph of a panda
This is not even a projected digital image of a photograph of a panda
This is a caption for a projected digital image of a photograph of a panda
In biology, meanings may be complex
In normal conversation, “daughter” means a female human child
conceived by sexual intercourse between mother and father, and then
born after a gestation of nine months within the mother’s uterus
In non-mammalian animal species, development is usually from eggs
But sex is not always required: female aphids can give birth to daughters
by parthenogenesis, without the need for fertilization of the eggs by male
sperm
And in the field of cell biology, the word “daughter” has an entirely
separate meaning: two genetically identical “daughter cells” are
produced every time a single cell divides
Biological ontologies have thus to understand the context in which the
word “daughter” is used, in order to apply the correct meaning
What is the Semantic Web, and how can it help?
The concept of the Semantic Web was first clearly articulated in 2001 in an
eponymous SciAm article by Tim Berners-Lee, Jim Hendler and Ora Lassila
While the World Wide Web permits access to data in human-readable form,
the Semantic Web provides access to information structured in a formal
logical manner, such that computers can reason over it, extracting meaning
It involves three technologies, each resting hierarchically on the previous one:
The use of XML as a markup language more expressive than HTML
The use of RDF triples, which permit one to make simple logical statements (subject-verb-object), written in XML in a form that a computer can understand
The use of ontologies – formal representations of a particular domain of
knowledge (e.g. the GO ontology about genes and gene products) – written in a
high-level ontology language such as OWL (W3C’s Web Ontology Language), which
is itself expressed as a set of RDF statements
RDF triples
An RDF triple might state that a mouse is_a mammal, informing the computer
that an entity ‘mouse’ is included in the more general category of ‘mammal’
This has the advantage that mouse inherits all class properties previously
defined for mammal, such as the possession of four legs and fur
By using several RDF triples referring to the same subject, multiple attributes
can be defined:
Subject (Entity)          Property (Attribute)    Object (Value)
Mouse (class)             is_a                    Mammal
This mouse (instance)     has_location            Oxford
This mouse (instance)     has_identifier          667
In RDF, the statement “This mouse is located in Oxford” is simply:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="Mouse">
    <Location>Oxford</Location>
  </rdf:Description>
</rdf:RDF>
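For readers who prefer to see this done programmatically, here is a minimal
sketch using the open-source Python rdflib library (not part of BioImage
itself); the example.org namespace and the property names are illustrative
assumptions, not terms from the ImageStore Ontology:

# Build and serialise the triple "This mouse is located in Oxford" with rdflib
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/bioimage/")     # hypothetical namespace

g = Graph()
mouse = EX["mouse667"]                              # the instance 'This mouse'
g.add((mouse, RDF.type, EX.Mouse))                  # this mouse is_a Mouse
g.add((mouse, EX.has_location, Literal("Oxford")))  # located in Oxford
g.add((mouse, EX.has_identifier, Literal("667")))   # identifier 667

print(g.serialize(format="xml"))                    # emits the equivalent RDF/XML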
What type of animal is shown in this image?
Ailuropoda melanoleuca
German taxonomists claimed it was a bear
British taxonomists claimed it was a racoon
US taxonomists weren’t quite sure
Today, the balance of opinion is “bear”
A panda is only a bear because we all now say it is!
So what is an ontology? “An ontology is a formal explicit specification of a
shared conceptualisation”
The role of an ontology is to facilitate the understanding, sharing, re-use and
integration of knowledge through the construction of an explicit domain model
We understand taxonomic hierarchies
Mouse is_a Rodent, Rodent is_a Mammal, Mammal is_a Vertebrate, Vertebrate is_a Animal
In an ontology, one can express more complex relationships about a mouse,
other than just its taxonomy
A partial ontology of ‘mouse’
[Diagram] A graph of relationships, for example:
Colony is_a Group of organisms
Mouse member_of Colony
Mouse has_species_name Mus musculus
Mouse has_ID 667
Leg proper_part_of Mouse (has_cardinality: 4; has_position: front / rear; has_handedness: left / right; has_length: number)
Fur proper_part_of Mouse (default_colour: white; has_length: number unit; has_density: number per unit area)
Mouse has_mode_of_locomotion Locomotion
Leg used_for Locomotion
Running is_a Locomotion
Running hypothesised_function Escape
How do you build an ontology?
You need to define all the terms within a domain of knowledge, and specify
the relationships they have to one another
The structure of these relationships is a Directed Acyclic Graph, in which child
terms can have more than one parent
The relationships of a child term to its two (or more) parent terms can be
different, as shown in the previous example:
mouse is_a rodent – type relationship
mouse member_of colony – collective relationship
The thinking crow problem
To properly annotate videos of Betty, we need to be able to structure not only
people’s interpretations of the world, but also Betty’s view of what is going on!
Biological ontologies
There is good ontological coverage of the genes and gene products of model
organisms in the form of the Gene Ontology (http://www.geneontology.org)
But until very recently little work had been done at the other end of the
biological spectrum, in the field of animal behaviour
However, my department is full of people undertaking whole animal biology
To be able to include their images and videos within the BioImage Database,
we decided to develop a draft standard animal behaviour ontology, SABO
SABO is an upper-level ontology designed to cover all of animal behaviour,
built around Niko Tinbergen’s four questions: “How does it work? How did it
develop? How is it used? and How did it evolve?”
Because interpretations of behavioural events can be very subjective, we
have been careful to separate fact from hypothesis in the design of SABO,
with emphasis on the authority for any claims
Fact and hypothesis in SABO
For example, a courtship event
Courtship behaviour in ducks
Male mallard ducks attract their mates using a “grunt-whistle”, which
Konrad Lorenz hypothesised in 1941 was derived from body shaking
Using the SABO ontology, this can be recorded in the following RDF triples:
Grunt-Whistle (a type of courtship behaviour)
generates hypothesis
Hypothesis About Evolutionary Origin (an ontology class)
Hypothesis About Evolutionary Origin
hypothesised evolutionary origin
Body Shaking (a type of behaviour)
Hypothesis About Evolutionary Origin
has author
“Lorenz, Konrad” (instance data)
Hypothesis About Evolutionary Origin
has date
“1941” (instance data)
The Ethodata Ontology
SABO was used as one of the two starting points for a recent Animal Behaviour
Metadata Workshop held at Cornell University, at which leading international
ethologists worked together to create an Animal Behavior Metadata Standard
Our introduction of formal ontologies to this community was greatly helped by
the fact that Chris Wilson, who had worked with us on SABO, recently started a
Ph.D. at Cornell with Jack Bradbury, the workshop organiser
The Workshop output is a human-readable hierarchy of defined ethological
terms, the draft Animal Behavior Metadata Standard (ethodata.comm.nsdl.org)
The Workshop has commissioned us to develop this hierarchy into a fully-fledged computable ontology of animal behaviour, for the benefit of the whole
ethological community
Based on the draft Animal Behavior Metadata Standard and on SABO, and
written in OWL, this has the new agreed name of the Ethodata Ontology
We have already made a start on this work, and will use it to enter structured
ethological image metadata into the BioImage Database
A view of the BioImage home page structure (www.bioimage.org)
Note the hierarchical browse categories and the alternative Browse / Search arrangement
The BioImage Database Project
The value of digital image information depends upon how easily it can be
located, searched for relevance, and retrieved
Detailed descriptive metadata about the images are essential, and without
them, digital image repositories become little more than meaningless and
costly data graveyards
The BioImage Database aims to provide a searchable database of high-quality multidimensional research images of biological specimens, both
‘raw’ and processed, with detailed supporting metadata concerning:
the biological specimen itself
the experimental procedure
details of image formation and subsequent digital processing
the people, institutions and funding agencies involved
the curation and provenance of the image and its metadata
to provide rich and accurate search results to queries over our data
and to integrate such multi-dimensional digital image data with other life
science resources by providing links to literature and ‘factual’ databases
The organisation within BioImage
The basic unit of organisation within the BioImage Database is the
BioImage Study, roughly equivalent to a scientific publication
A BioImage Study will contain one or more Image Sets, each
corresponding to a particular scientific experiment or investigation
Each Image Set will contain one or more Images on a common theme
Such an Image may be of any form or dimensionality
a 2D image, a 3D image, a video, or a 4D (x, y, z, time) image set
Users may browse or search the BioImage Database
by Study, by Image Set or by Image
For each representation, a thumbnail representative image and core
metadata of the results (title, authors, description, LSID) are initially
presented, and deeper metadata is available by clicking the title
Browses and searches may then be progressively refined
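Purely as an illustration of the containment hierarchy described above (the
field names below are assumptions, not the BioImage schema itself), the Study,
Image Set and Image levels could be modelled like this:

# Illustrative sketch of the Study > Image Set > Image hierarchy
from dataclasses import dataclass, field
from typing import List

@dataclass
class Image:
    lsid: str                  # Life Science Identifier of this image
    dimensionality: str        # e.g. "2D", "3D", "video", "4D (x, y, z, time)"
    thumbnail_uri: str = ""

@dataclass
class ImageSet:
    title: str                 # one experiment or investigation
    images: List[Image] = field(default_factory=list)

@dataclass
class Study:
    title: str                 # roughly equivalent to a scientific publication
    authors: List[str]
    description: str
    image_sets: List[ImageSet] = field(default_factory=list)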
The basic BioImage metadata model
[Diagram] People: Researcher; Photographer or microscopist
Objects: Cell or organism; Subject or specimen
Events and conditions: Preparation (experimental study conditions or manipulations); Image capture (camera or microscope, illumination, focus, etc.)
Output: Image sets of multidimensional images, including videos
So people are related to objects and conditions / equipment through events
The structure of the BioImage Database
[Architecture diagram] The BioImage server runs an Apache Web server and a VideoWorks Web server, with Tomcat hosting the application
View: Java applets, XSL, JSP and SiteMesh, serving browser interfaces over HTTP and SOAP interfaces for SOAP clients
Controller: Struts, dispatching to a logic layer of servlets (submission, query and administration servlets)
Model: Java beans, backed by the OWLBase query engine, the BioImage metadata in PostgreSQL and a local image filestore
External resources reached over SOAP: an OBO server (ontologies) and the NCBI server (taxonomies)
Things to note about the architecture: external
User submission, searching and browsing activities are all mediated by the
ImageStore Ontology
Submission forms are generated dynamically from the ontology, to suit the
type of submission
Thus, for instance, people submitting light microscopy images are not asked
for the accelerating voltage of their electron microscope
There is complete separation of content from presentation
Presentation to users is via HTML, while SOAP is used to communicate with
Web Service clients
The Struts controller orchestrates data transfer between the system and the
user
This permits simple customization of the appearance of the data
Multilingual capabilities are enabled by Struts, and are invoked simply by
re-setting the default language of the user’s browser
This shows the Access Control Interface
The same HTML page is being viewed in both cases, using alternate resource bundles
Things to note about the architecture: internal
Data are exchanged within the system in XML format, using the BioImage
schema
There is no hard-coded ‘business logic’ - structures and semantics are
generated at run time
The ImageStore Ontology is the central data model
This single point of control greatly simplifies database maintenance, since
changes are automatically and dynamically propagated throughout the
system
The entire BioImage database structure can be automatically regenerated
from the ImageStore Ontology whenever this is required (for example in a
new form after updating the ImageStore Ontology), using metadata from a
previous XML dump
This allows easy migration to a new DBMS, e.g. from PostgreSQL to Oracle (a sketch of the idea appears at the end of this slide)
OWLBase is used to reference the ontology and to mediate data transfers
OWLBase thus provides an abstraction layer for submissions and queries
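The following rough sketch shows the general idea of deriving relational
structure from ontology classes. It assumes the rdflib library, and the file
name and mapping rules are purely illustrative; it is not the OWLBase or
BioImage code:

# Regenerate simple SQL DDL from the classes and properties of an ontology file
from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

g = Graph()
g.parse("imagestore.owl")                       # hypothetical local copy of the ontology

ddl = []
for cls in g.subjects(RDF.type, OWL.Class):
    name = g.value(cls, RDFS.label) or cls.split("#")[-1]
    # one table per ontology class, one column per property declared on it (simplified)
    cols = ["id SERIAL PRIMARY KEY"]
    for prop in g.subjects(RDFS.domain, cls):
        cols.append(f"{prop.split('#')[-1]} TEXT")
    ddl.append(f"CREATE TABLE {name} ({', '.join(cols)});")

print("\n".join(ddl))   # feed to PostgreSQL (or another DBMS) to rebuild the structure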
The ImageStore Ontology
The ImageStore Ontology was constructed using the Jena toolkit
(www.hpl.hp.com/semweb) and our own open source Ontology
Organiser, an ontology constraint propagator and datatype manager
ImageStore:
uses a subset of the class model of the Advanced Authoring Format
(sourceforge.net/projects/aaf and www.aafassociation.org) to
describe media objects
uses a subset of MPEG-7 to describe multimedia content, and
has its own data model to describe scientific experiments
It is currently written in DAML+OIL
We are in the process of upgrading BioImage to use Jena 2, which will
permit us to convert the ImageStore Ontology into OWL
What is required of an image ontology?
A generic image ontology such as the ImageStore Ontology must describe
all aspects of the images themselves:
their acquisition (including details of who took the original micrograph,
where, when, under what conditions, for what purpose, etc.)
the media object itself (source and derivation, image type, dynamic
range, resolution, format, codec, etc.)
the denotation of the referent (a description of exactly what is recorded
by the image, e.g. the nature, age and pre-treatment of the subject), and
the connotation of the referent (i.e. the interpretation, meaning, purpose
or significance imparted to the image by a human, its relevance to its
creator and others, and its semantic relationship to other images).
In addition to these ancillary metadata about the image, there is yet a
further need to record semantic content metadata related directly to the
information content of the images or videos themselves
These semantic content metadata carry very high information value, since
they relate directly to spatial (or spatio-temporal) features that are of most
immediate relevance to human understanding of media content, namely
“Where, when and why is what happening to whom?”
Image description – separating fact from hypothesis
BioImage Study title: Xklp1: a Xenopus kinesin-like protein essential
for spindle organisation and chromosome positioning
Denotation (raw fact): Immunofluorescence localization of Xklp1 in XL177 cells
Connotation (interpretation): Xklp1 is involved in chromosome localization during mitosis in embryonic Xenopus cells, since it is positioned at the metaphase plate
Vernos et al., 1995
Representing fact and hypothesis within ImageStore
[OWL class diagram] The diagram links the classes Event, Segment, FormOfExpression, EventContentDescription, NarrativeContentDescription, Denotation and Connotation through subClassOf relationships and OWL restrictions (onProperty) over object properties such as participant, tool, states, location, weather and habitat, together with datatype properties drawing on MPEG-7 types (cameraMotionType, SpatialMask) and RegionOfInterest
Reified rdf:Statements span three domains: the real world, the media world and the narrative world
The BioImage advanced search interface
The Advanced Search Interface permits Boolean searches, search restrictions,
and re-use of previous searches in combination with new terms
Automated SQL query generation
Stage one: user inputs a query “Find images of bears”
Stage two: the ontology reasons over the request
Stage three: OWLBase converts the request to SQL
Stage four: metadata is retrieved from the database
Stage five: metadata is returned to OWLBase as XML
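Purely as an illustration of stages two and three, the sketch below shows how a
query term might be expanded through the ontology and turned into SQL; the
table and column names, and the toy ontology, are hypothetical rather than the
real BioImage schema:

# Expand a query term via the ontology, then generate SQL for the metadata store
def expand_term(ontology, term):
    """Return the term plus all of its subclasses known to the ontology."""
    expanded = {term}
    for child in ontology.get(term, []):        # ontology: {class: [subclasses...]}
        expanded |= expand_term(ontology, child)
    return expanded

toy_ontology = {"bear": ["giant panda", "brown bear", "polar bear"]}

terms = expand_term(toy_ontology, "bear")
placeholders = ", ".join(["%s"] * len(terms))
sql = f"SELECT study_id, title, thumbnail_uri FROM study_subjects WHERE subject_name IN ({placeholders})"
# cursor.execute(sql, tuple(terms))  -> matching rows, later serialised as XML for OWLBase
print(sql, sorted(terms))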
In summary:
Queries are made by our ontology-driven database query engine, OWLBase
OWLBase passes a query via the ImageStore ontology to the underlying
PostgreSQL metadata relational database
The database returns metadata of studies matching the search term:
authors
title
description
network locator (URI) for the representative thumbnail image
IDs of all the component datasets and images
These XML data are then used to populate the HTML Study Results Web
page that is displayed to the user
Many of these items link to deeper metadata
If the user now clicks on one of the nodes linking to deeper metadata, a new
OWLBase query is initiated that returns information about that component
Search result, showing Studies
What’s so special?
For each query, OWLBase builds in memory an RDF ‘knowledge graph’
representing the structure of the components of each of the matching studies
As the user clicks on nodes linking to deeper metadata, each new OWLBase
query return is used to extend the RDF graph of the resource
In this way, the in-memory representation of the relevant metadata is built up
dynamically and incrementally, as required
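A minimal sketch of this incremental pattern, assuming the rdflib library and
purely illustrative URIs (this is not the OWLBase implementation):

# Merge each new query return into a single in-memory RDF graph for the session
from rdflib import Graph

knowledge_graph = Graph()          # built up over the user's session

def merge_query_result(rdf_xml_fragment: str) -> None:
    """Union the metadata returned by one query into the session graph."""
    knowledge_graph.parse(data=rdf_xml_fragment, format="xml")

study_fragment = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                             xmlns:ex="http://example.org/bioimage/">
  <rdf:Description rdf:about="http://example.org/bioimage/Study42">
    <ex:title>Xklp1 localisation</ex:title>
  </rdf:Description>
</rdf:RDF>"""

merge_query_result(study_fragment)   # first search result
# a later click on the study node would merge a further fragment describing its
# image sets, extending (not replacing) knowledge_graph
print(len(knowledge_graph))          # number of triples accumulated so far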
At present, this would not seem to provide much additional functionality over
and above a conventional relational database SQL query system
However, the fact that the searches use the ImageStore Ontology and build
up an OWLBase RDF graph opens the possibility of three novel advances:
Use of external third-party ontologies
Smart queries within the BioImage Database, and
Hypersearches across distributed resources
‘People’ metadata within BioImage
People have attributes:
First and last names, dates of birth, addresses, phone numbers, etc
People have various affiliations:
Current membership of an institution, e.g. a university
Former membership of another institution – e.g. undertook
the research while a postdoc there
Simultaneous membership of a third organisation,
e.g. an international research project partnership
People have grants:
“The work in this BioImage Study was funded by BBSRC”
People may have different roles within a BioImage Study:
This person planned the study – Principal investigator
That person prepared the specimen – Technician
A third person undertook the electron microscopy – Postdoc
Together they wrote the Nature paper – Authors
Use of external ontologies
Because all BioImage queries are passed through the ImageStore ontology,
and because ImageStore can be extended using external third-party
ontologies, we have the possibility of using such external ontologies to
enhance BioImage searches
In its simplest form, this can just be used to simplify metadata submission
For example, an organisation such as a pharmaceutical company might
choose to use an instance of the BioImage Database System internally,
behind its own firewall, for the organization of its own confidential research
images
If that company already had an ontology-controlled database of all its
employees’ details, there would be no need to re-enter those metadata for
each image these people wished to record – all that would be required would
be to link the BioImage Database System to the employee records ontology
But external ontologies can do much more for us . . .
Using external biological ontologies within BioImage
Biological content can be described using external ontologies – currently
the GO ontology (www.geneontology.org) for genes and gene products, and
the NCBI taxonomy (www.ncbi.nlm.nih.gov/Taxonomy) to identify species
and soon others will also be used, e.g. the Ethodata Ontology
We have already implemented the display of an interactive taxonomic hierarchy
that permits the user to browse by narrowing or broadening the scope of the
results displayed after a query, by clicking at different points in the taxonomy
Thus the images of specimens derived from all rodents can be refined to
show only those from mice, or broadened to show all mammalian images
Similar modification of other parameters is also possible
For instance from confocal fluorescence images to real-time confocal
images or to all fluorescence images (these relationships being structured
within the ImageStore Ontology)
At present we can use third party ontologies only if we pre-import them
We wish now to extend this functionality by creating dynamic access to external
ontologies that are published in XML on the Web, thus ensuring that we always
access the most recent version
Smart queries within the BioImage Database
We propose next to use external ontologies to provide the ability to undertake
semantically rich searches of the BioImage Database that can handle
synonyms (‘mouse’ and ‘Mus musculus’)
hierarchies (‘rodent’ and ‘mammal’)
exclusions (not a computer mouse)
and related terms (‘laboratory animal’ and
‘model species’)
rather than being limited to conventional ‘Google-like’ searching by means
of exact keyword matching, results of which are rather unpredictable!
We do not yet know how this Semantic Web approach to database querying
will scale with increasing database size, and we will need to undertake
comparative research after implementing it
Hypersearches of distributed information sources
At present, the BioImage Database gives users the straightforward capability
of linking out from a BioImage study, dataset or image via standard Web
hyperlinks to relevant material elsewhere on the Web
For example, the Advanced Search Interface enables users to enter BioImage
queries of the type: “Retrieve all images of Drosophila testes showing
expression of the gene always early (aly)”, and then enable users to link out
from these BioImage studies both to the gene sequences and to literature
publications of relevance
What we cannot do at present, however, is to send complex queries across a
set of databases, of the type: “Retrieve images of whole Drosophila, Xenopus
and mouse embryos showing the comparative neural expression of the most
anterior of their Hox genes at different developmental stages, and show me
these gene sequences aligned to maximise homology”
We wish to investigate how to undertake complex integrated ‘hypersearches’
simultaneously over the BioImage Database and relevant ontology-enabled
and Web Services-enabled sequence, structural and literature databases
How to implement hypersearches
The conventional way to search across disparate databases would be to map
their schemas onto some common system, and then use that to distribute a
query across them in a manner that each database can understand.
Our approach is somewhat different, and relies on the fact that OWLBase
dynamically builds up an RDF representation of the information space of
interest, and that external ontologies can be integrated with ImageStore
Specifically, we plan to import relevant sub-graphs from published external
ontologies (i.e. class data rather than instance data) dynamically into the RDF
graph being built up within OWLBase during each query
We will then use this extended graph to structure the hypersearches, by
providing ‘internal’ knowledge about the structure of external databases
OWLBase will thus act as more than just a query engine.
It will build dynamic graphs of relationships between stuff
within BioImage and stuff outside, and then run queries
over that bigger graph
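A sketch of this approach, assuming rdflib and SPARQL; the file names, URLs and
class URIs below are illustrative assumptions, not real resources:

# Merge an external class-level sub-graph into the session graph, then query it
from rdflib import Graph

combined = Graph()
combined.parse("bioimage_session_graph.rdf")                   # graph OWLBase has built so far
combined.parse("http://example.org/ontologies/hox_terms.owl")  # external ontology sub-graph

results = combined.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?cls ?label WHERE {
        ?cls rdfs:subClassOf* <http://example.org/ontologies/HoxGene> ;
             rdfs:label ?label .
    }""")
for row in results:
    print(row.cls, row.label)   # external terms now available to structure the hypersearch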
ImageBLAST
The ability to mount semantically rich queries over a variety of database
resources opens the possibility of developing new bioinformatics search tools
Our first proposal for this, initially envisioned by our collaborator Michael
Ashburner at the ORIEL Varenna conference last September, is ImageBLAST
By analogy with the BLAST tool for identifying homologous genes, Michael’s
vision was for a tool in which a researcher could enter a nucleotide sequence
and have returned images of the normal and mutant expression patterns of
the protein encoded by that sequence, from all the model organism image
databases, together with detailed metadata describing all that is known about
that gene and its protein
Recently, my student Chris Holland and I have been designing some possible
user interfaces for ImageBLAST
I will show them to you in fairly swift succession, to give you a glimpse of the
vision we have in mind
The ImageBLAST home page
The ImageBLAST hypersearch interface
Gene name disambiguation
‘SAP1’ is a synonym for three separate gene products:
beta 4 defensin (DEFB4, aka HBD-2)
ELK4 (aka ETS-domain protein), and
prosaposin (aka GLBA). Such homonym / synonym ambiguities are common
We will use the system developed by our ORIEL partner Martijn Schuemie of
the Erasmus University in Rotterdam for gene name disambiguation, in
combination with the ‘conceptual fingerprinting’ software of our industrial
partner Collexis BV of Rotterdam
Conceptual fingerprinting involves weighting terms in a piece of text on the
basis of their frequency and proximity. Terms are defined using the MESH
system and the UMLS biomedical thesaurus
Comparing numerical conceptual fingerprints permits rapid matching of
related texts, and enables resolution of gene name ambiguity on the basis of
the context of its usage
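To make the idea concrete, here is a toy sketch of frequency-based
fingerprinting and cosine matching; it omits proximity weighting and the
MeSH/UMLS term lists, and is not the Collexis software:

# Represent texts as term-frequency fingerprints and compare them by cosine similarity
from collections import Counter
import math

def fingerprint(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    return {term: counts[term] for term in vocabulary}

def cosine(fp1, fp2):
    dot = sum(fp1[t] * fp2[t] for t in fp1)
    norm = math.sqrt(sum(v * v for v in fp1.values())) * math.sqrt(sum(v * v for v in fp2.values()))
    return dot / norm if norm else 0.0

vocab = {"gene", "defensin", "transcription", "lysosomal"}
query_fp = fingerprint("SAP1 is a beta defensin gene ...", vocab)
candidate_fp = fingerprint("the lysosomal protein prosaposin gene ...", vocab)
print(cosine(query_fp, candidate_fp))   # compare fingerprints to judge which sense of SAP1 is meant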
Summary results on ‘adh’ in Drosophila
DNA results on ‘adh’ in Drosophila
Product results on ‘adh’ in Drosophila
Structure of Drosophila adh
Pathway results on ‘adh’ in Drosophila
Example of a specific pathway
Phenotype results on ‘adh’ in Drosophila
One phenotype study on ‘adh’ in Drosophila
Will ImageBLAST work?
To work, ImageBLAST will clearly require intimate linkage between the
ImageStore Ontology, the Gene Ontology, and the forthcoming Cell Ontology
It will also require integration with the Bio-MOBY Web Services for sequence
bioinformatics (biomoby.org) developed by our Canadian colleague Mark
Wilkinson
At present, our vision seems far from risk free
However, the pace of Semantic Web developments in which we have
participated over the last two years has been truly astonishing
This gives reason to hope that, within a further two years, new developments
in information space representation, and new methods for ontology integration
and automated data extraction, will substantially aid us in attaining our goal
Such image bioinformatics tools, if indeed we succeed in developing them,
will enormously facilitate knowledge mining within biological images, and will
enable hitherto impossible types of on-line research to be undertaken
Populating the BioImage Database
But first the images must be made available in an ontology-driven database!
The BioImage Database will receive images regularly from three main sources:
Journals: Three major scientific publications have already agreed to provide
the BioImage Database with biological images on a regular basis:
The EMBO Journal
EMBO Reports
The Journal of Microscopy
Research projects and specialist databases:
e.g. the Drosophila Testis Gene Expression Database
Laboratory image collections
The Open Microscopy Environment
If you have collections of high quality research images that you wish to
publish, please let me know or contact us via www.bioimage.org
Final words of caution
A cautionary tale
We recently wrote to a colleague requesting a copy of a beautiful confocal
image that he had collected some years ago
His reply typifies the wasteful fate of an unfortunately large proportion of
biological research images:
“Concerning the image data you requested - this is a tough one. The
image was recorded about ten years ago, and I never managed to write
a paper about the work so it was never published. The original data (if
they still exist) must be on some magneto-optical disk in one of many
boxes in my flat - quite hopeless to find at short notice. All I can
promise is that I’ll look into this once I am back from my travels – but
that will take a few months. Whether anyone still has hardware
capable of reading the disc is quite another matter! Sorry about this.”
It is perhaps the best possible argument
for the routine publication of images arising from publicly funded
research in databases such as the BioImage Database, that can
provide a safe repository for them and free access to them for the
community
and for the funding of such databases from the public purse
Ontologies are supposed to fit together neatly
- like irregular four-sided Penrose tiles
The blue shape represents our Ethodata Ontology – just one among many in the information landscape
. . . creating a harmonious whole
“Penroses” by Ruth McDowell
. . . but what if they don’t?
“Weeping woman” by Pablo Picasso
It is hoped that ontologies from different fields can be made ‘orthogonal’ to
one another: non-overlapping and yet with no gaps between them
However, at present this is just an optimistic hope
As yet, there is insufficient ontological coverage of the universe of
knowledge to know whether this particular vision of the Semantic Web can be
realised
The data deluge and the paradigm trap
The volume of data generated in the Life Sciences is now estimated to be
doubling every month
A single active cell biology lab may generate 10 to 100 Gbytes of
multidimensional image data a month
Soon the only way to handle the data will be through the presuppositional
‘lens’ of an ontology – people will never have time to look at the raw data
Does that matter? After all, the ontology is a specification of the accepted
paradigm established by the respected leading academics of the day
In other words, an ontology fossilizes the prejudices of the old farts
Could this perhaps, maybe, just possibly, lead to a blinkered view of the
world?
Might this hamper the process of discovery and inhibit the overthrow of
incorrect hypotheses?
- what if Newton had written the ontology for physics?
BEWARE!
End
Additional slides of relevance
Entity-Attribute-Value storage
Entity-Attribute-Value databases have recently found favour among
healthcare professionals as a way of recording patient data
Like patient data, image descriptive data may be sparse – an image
represents a small subset of the objects in the real world, just as a patient
will have only a small subset of all possible diseases and treatments
Whereas in conventional relational database models, each description is
stored in a specific column, the EAV approach uses row modelling: each
description generates a row consisting of:
an entity (e.g. this_rose)
an attribute (a property of the entity, e.g. has_colour), and
a corresponding value of the attribute (e.g. red)
These EAV triples are easily encoded in RDF
For the BioImage Database, we use conventional relational tables for those
items upon which searches are frequently made – author, title, species, etc.
- and have adopted the EAV approach for those metadata items that are not
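An illustrative sketch of the row-modelled approach (sqlite3 is used here only
for brevity; BioImage itself stores metadata in PostgreSQL, and the table and
column names are assumptions):

# Entity-Attribute-Value storage: one row per description, nothing stored for absent attributes
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT)")

db.execute("INSERT INTO entity VALUES (1, 'this_rose')")
db.execute("INSERT INTO eav VALUES (1, 'has_colour', 'red')")
db.execute("INSERT INTO eav VALUES (1, 'has_species_name', 'Rosa rugosa')")

rows = db.execute("SELECT attribute, value FROM eav WHERE entity_id = 1").fetchall()
print(rows)   # [('has_colour', 'red'), ('has_species_name', 'Rosa rugosa')]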
Patient records for blood parameters
A conventional relational database table, with lots of blanks
Adding new columns to the table to accommodate new tests is not easy
First name | Last name | Disease       | White cell count | Cholesterol | Ethyl alcohol | Prostate-specific antigen | (lots more columns ...)
Mary       | Smith     | Alcoholism    |                  |             | 0.3 mg/dl     |                           |
John       | Smith     | Cancer        |                  |             |               | 40 ng/ml                  |
Ken        | Jones     | Heart disease |                  | 340 mg/dl   |               |                           |
Barry      | Brown     | AIDS          | 630 cells/µl     |             |               |                           |
(... with lots of blank values)
EAV tables to record patient details
Person table:
First | Last  | ID
Mary  | Smith | 125
John  | Smith | 126
Ken   | Jones | 127
Barry | Brown | 128

Auxiliary table:
Resource Name | Resource ID | Property ID
Person        | 125         | 1
Person        | 126         | 2
Person        | 127         | 3
Person        | 128         | 4

Attribute - Value table:
ID | Attribute                 | Value
1  | Ethyl alcohol             | 0.3
2  | Prostate specific antigen | 40
3  | Cholesterol               | 340
4  | White cell count          | 630

Units appropriate to each attribute are defined in the ontology, and so do not
need to be specified in the table
An example from everyday life . . .
BIRTHS, MARRIAGES AND DEATHS
Born to Revd John and Mrs Marjorie Sanders of St Paul’s
Vicarage, Tadcaster Road, Leeds: a daughter Emily Jane,
at 11:25 a.m. on 25th December 2003, weight 3.6 Kg.
“Is Emily Jane’s father a Yorkshire clergyman?”
Note that the only common element between the question and the press
announcement is the child’s name
No conventional electronic query, formulated to interrogate a relational
database containing the information within the press announcement, could
possibly come up with the correct answer to this question
Why? People are able to employ deductive reasoning and extensive
linguistic, cultural and geographical knowledge
Use of the correct ontology could help a computer to reach the same
conclusion
What would that ontology have to ‘know’?
That a daughter is a female child, and that a male parent is a father
That “John” is a man’s name
That “Revd” is an abbreviation for “The Reverend”, the title given to an
ordained minister of religion
That a typical employment for a minister of religion in the Anglican
Church is to be a vicar, i.e. the minister of a parish church
That Anglican parish churches are named after Christian saints;
That a “vicarage” is a house provided for the accommodation of a vicar
and his/her family
That since Revd Sanders lives in St Paul’s Vicarage, as well as being
an ordained minister of religion, it is highly likely that he is indeed the
Anglican vicar of St Paul’s church;
That a synonym for “vicar” is “clergyman”
That Leeds is an English city within the county of Yorkshire
Do mountains exist?
Are we at the top of Everest?
Are we on Mount Fuji at all?
Ontologies
Ontologies can describe many different kinds of relationships
[Diagram] Herbivore is_a Diet; Omnivore is_a Diet; Bears has_diet Herbivore / Omnivore
However, ontologies can have problems …
We classify pandas as herbivores because 99% of their diet is bamboo
What about the other 1%?
Autopsy of one panda revealed bones of a bamboo rat in its stomach
In captivity, pandas will eat pork coated with honey
Does this make the panda an omnivore?
Humans make ‘reasonable judgements’ when classifying things
However, machines usually reason over facts that are either true or false, and
cannot easily be programmed to make subtle distinctions
Scientific imaging
Images and videos form a vital
part of the scientific record, for
which words are no substitute
In the post-genomic world,
attention is now focused on the
functional analyses of gene
expression, and on organization
and integration within cells
In a month a single active cell
biology lab may generate
between 10 and 100 Gbytes of
multidimensional image data
But at present little of this is
published
The problem of image publication
Even when images are published,
they are often only processed
images, not the original image data
For example one might publish
a single section or a projection
from a complete 3D confocal
image
or a couple of frames from a
movie
It would be of great value if more
original image data were published
This would both permit
re-analysis and secondary
meta-research
and would be useful for
teaching and learning
Using Protégé to define a class in the ontology
Ontology Organiser
A constraint propagator and datatype manager
Eliminates the cognitive overload of the user during ontology development
while asserting relationships between resources
Ontology Organiser has capabilities not found in other editors like OilEd
First it can reduce the cognitive overload of the user during ontology
development while asserting relationships between resources. It can:
evaluate constraints placed on relationships, and
propagate any alterations necessary up through an ontology's hierarchy,
thereby maintaining ‘semantic robustness’
Second, it addresses the more technical problem regarding the lack of
support for datatypes in existing ontology editing packages. Ontology
Organiser goes some way to aid the user in defining, modifying and
referencing custom datatypes in their ontologies
Ontology Organiser is available from SourceForge. Details can be found at
www.bioimage.org/publications.do