PIONEER HI-BRED
INTERNATIONAL, INC.
Plant Ontologies –
Industrial Science meets
Renaissance Concepts
Dave Selinger
Computational Biologist
Pioneer Hi-Bred,
DuPont Agriculture and Nutrition
Outline
What is the nature of the problem that a Plant
Anatomy Ontology can solve?
What is an Ontology?
How do you make a Plant Anatomy Ontology?
Does it really solve the problem?
RESEARCH
Industrial Science
Not science in industry, but the industrialization of data
creation, i.e. the ‘omics revolutions.
High-throughput data
Sequencing
Expression
Medium-throughput data
Proteomics
Metabolomics
Low-throughput data
Gene/protein function
Phenotype
The double-edged sword of
Industrial Science
Industrial science means lots of cheap data
Sequencing
<< $0.01/base
$10,000 prokaryotic genomes are reality
$10,000 eukaryotic genomes will be reality in the next five years
Expression
<$0.50/gene
And much of this data is available for free after it is
produced!
Lots of data means that you can’t sit down with
your lab notebook and analyze the data by hand.
Databases,
software for searching and comparing
Whole new areas of research devoted to finding
meaningful patterns in lots of data.
Organizing information
Information is not knowledge.
But knowledge can be acquired from information.
But only with a lot of effort (see the second law of thermodynamics)
The central challenge with industrial science is organizing the
information.
The organization of the information determines what you can
discover.
Experimental design
Good design will produce a contrast that will support or refute a
hypothesis.
Statistical rigor
– Is the signal higher than the noise?
– How conclusive will the discoveries be?
Context
How do we compare across experiments?
Not too hard if one person did all the experiments and
kept careful notes.
If multiple people, then we need to define what was
done, what the analysis was, and what the sample was.
What was done – e.g. MIAME standard for describing the
technical details of an expression experiment.
Analysis – e.g. ANOVA, SAM, etc.
Sample – ?
Renaissance concepts
(historically Enlightenment)
Things can be systematically described
and classified
Linnaeus’ problem is much the same as
the sample description problem
Organisms - Linnaeus, Species Plantarum,
1753
Variable specificity
California Laurel or Oregon Myrtlewood?
Kernel or seed?
In addition, a term like kernel assumes that all of its
parts are present in the sample, but this assumption could be wrong
Ontologies to the rescue?
Ontology = the study of being (Philosophy)
The specification of a conceptualization of a domain of interest
(Computer Science)
Original and continuing computer science interest was Artificial
Intelligence.
How can a computer make inferences?
Need to define meanings – the word “can”, for example, has more than one.
Structure and relationships in an ontology allow a computer to make
inferences.
– Mary is the mother of Bill. Is Mary a parent of Bill?
– IsA Mother Parent
Parts of an ontology
Concepts -> objects, real and abstract, processes, functions
Partitions -> rules that can classify concepts
Attributes -> properties of a concept, can have individual and class
attributes
Relationships -> is a, part of
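The Mary and Bill inference can be sketched in a few lines of Python. This is only an illustration of the idea: the names IS_A, FACTS and holds are made up for this example and do not come from any ontology tool.

```python
# Hypothetical mini knowledge base; all names are illustrative only.
IS_A = {"mother": "parent", "father": "parent"}  # narrower relation -> broader relation

# Stated facts as (subject, relation, object) triples.
FACTS = {("Mary", "mother", "Bill")}


def holds(subject, relation, obj):
    """Return True if the fact is stated directly or follows from is-a links."""
    for s, r, o in FACTS:
        if s != subject or o != obj:
            continue
        # Walk up the is-a chain, starting from the stated relation.
        while r is not None:
            if r == relation:
                return True
            r = IS_A.get(r)
    return False
```

Nothing states that Mary is a parent of Bill, but holds("Mary", "parent", "Bill") is true because the stated relation "mother" is linked to "parent" by an is-a relationship.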
Does an ontology make sense?
The value of ontologies is a current debate among
information scientists.
One group advocates that ontologies are necessary for computers
to understand content.
Semantic web -> an extension of the current HTML/XML based web to
something with ontological inference
Others argue that ontologies are not needed and are not practical.
Complexity is ok; just use a Google-like search to connect concepts.
However, some problems, like organismal classification and the
periodic table are very amenable to an ontological approach.
Formal categories and stable entities
Expert users and catalogers
Forms of ontologies
Ontologies can take several forms (data structures)
Controlled vocabulary (List)
Terms but no relationships
Enforces systematic naming
Hierarchy (tree structure) => Taxonomy
Terms and “is a” relationship
Children are unique and have a single parent
Directed acyclic graph => Gene Ontology
Multiple relationship types
Children with multiple parents
Features of Trees
Because each child node has only one parent
There is an unambiguous path to the root from each leaf
Child nodes can be easily grouped at any level of the structure
Trees can express only one organizing principle
Work well for taxonomy (at least eukaryotic taxonomy)
Organizing principle is classification by similarity
All terms have an “is a” relationship to the next level term
Organisms were classified before evolution was hypothesized, but
the classification matches the evolutionary relationships
Similar example would be the periodic table of the elements
Classification can facilitate discovery of underlying principles
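The two tree properties listed above (one path to the root, easy grouping at any level) can be sketched as follows. The taxonomy is deliberately simplified: intermediate ranks are skipped, and the names PARENT, path_to_root and group_at are assumptions made for this example.

```python
# A tree stored as child -> single parent (simplified, ranks skipped).
PARENT = {
    "Laurus nobilis": "Laurus",
    "Umbellularia californica": "Umbellularia",
    "Laurus": "Lauraceae",
    "Umbellularia": "Lauraceae",
    "Lauraceae": "root",
}


def path_to_root(term):
    """Follow the single parent link upward; the path is unambiguous."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def group_at(level_term):
    """Return all leaf terms that sit below level_term."""
    has_children = set(PARENT.values())
    return sorted(
        t for t in PARENT
        if t not in has_children and level_term in path_to_root(t)
    )
```

Because every child has exactly one parent, path_to_root never has to choose between branches, and grouping at the family or genus level is a simple filter on that path.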
A tree based Anatomy Ontology
Developed by Winston Hide’s group at SANBI and
Electric Genetics
Single concept, orthogonal trees
Cells
Tissues
Organs
Disease state
Each tree is independent, but has related
dimensions describing a sample
Set operations, intersection or union, between
trees allow specific queries.
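A minimal sketch of such set-based queries, assuming each tree maps a term to the set of samples labeled with it. The sample ids and annotations are invented, and the hierarchy inside each tree is left out to keep the example short.

```python
# Hypothetical annotations: term -> set of sample ids labeled with that term.
ORGAN = {"leaf": {"s1", "s2"}, "root": {"s3", "s4"}}
DISEASE_STATE = {"healthy": {"s1", "s3"}, "infected": {"s2", "s4"}}


def query_and(*sample_sets):
    """Samples that carry every one of the requested terms (intersection)."""
    return set.intersection(*sample_sets)


def query_or(*sample_sets):
    """Samples that carry at least one of the requested terms (union)."""
    return set.union(*sample_sets)
```

For example, query_and(ORGAN["leaf"], DISEASE_STATE["infected"]) selects infected leaf samples by intersecting one term from each independent tree.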
Features of DAGs
A tree is a special case of the DAG class
Children can have multiple parents.
Allows multiple classifications of the same child
E.g. a guard cell is both part of a leaf and is an epidermal cell.
Allows for more than a binary classification of a concept
If this results from poor definition of the concept, then it
is not good.
Multiple parentage fits a “normalized” data model
Like a normalized relational database, a DAG can
minimize duplication of objects (concepts).
Sample DAG
Root
  Cooking
    Spices
      – Bay leaf
        • Laurus nobilis
        • Umbellularia californica (California laurel)
  Trees
    Lauraceae
      – Laurus
        • Laurus nobilis
      – Umbellularia
        • Umbellularia californica
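The sample DAG can be written down as a child -> parents mapping, with the bay laurel spelled with its botanical name Laurus nobilis. The sketch below is only one possible encoding; PARENTS and ancestors are names chosen for this example.

```python
# The sample DAG as child -> list of parents; a child may have several.
PARENTS = {
    "Cooking": ["Root"],
    "Trees": ["Root"],
    "Spices": ["Cooking"],
    "Bay leaf": ["Spices"],
    "Lauraceae": ["Trees"],
    "Laurus": ["Lauraceae"],
    "Umbellularia": ["Lauraceae"],
    "Laurus nobilis": ["Bay leaf", "Laurus"],
    "Umbellularia californica": ["Bay leaf", "Umbellularia"],
}


def ancestors(term):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen
```

Each species is stored once but is found under both the cooking branch and the botanical branch, which is the "normalized" property of a DAG.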
Constructing the Pioneer Plant
Ontology
Decided to produce a DAG
Used DAGeditor (editor developed for GO)
Developed our own web based viewing tool
AmiGO was too complicated to re-use. Other public browsers
did not have the functionality we wanted.
Decided to focus on Corn and Soybeans
Used Kiesselbach’s 1949 monograph on corn structure
and reproduction as the primary source.
Used Iowa State University Ag Extension publications
for the development stages of corn and soybeans
Added information from a botany textbook to cover
missing terms from soybean.
To collaborate or not to
collaborate?
Advantage of just using the Pioneer Ontology was
that it served our needs and was focused on corn
and soybeans, our major crops.
Disadvantage was that it was not synchronized to
the public ontology
We would not be able to easily integrate public tissue
classifications with ours
We would not be able to easily take advantage of
improvements to the public ontology
Presumably the public ontology would be more
“botanically correct” than ours.
Plant Ontology Consortium
Focused on model organisms
Arabidopsis
Rice, and other grasses (e.g. corn) described with the rice terms.
Used a DAG approach
Multiple concepts
Structure (cells, tissues, sporophyte and gametophyte)
Development
Used DAGeditor and other GO approaches
Most terms have multiple parents
Same software and data structures as GO
Plant Ontology
Domain = Plant anatomy and development
Concepts
Plant parts (leaf, root, flower, meristem, etc.)
Life cycle stages (sporophyte, gametophyte)
Developmental stages (V1, flowering, R1, etc.)
Relationships between concepts
“A kind of” (Is a)
– A prop root is a root
“A part of” (part of)
– A root cap is part of a root
In addition, for plant anatomy a “develops from” relation is needed
– For example the relationship between stomatal guard cells and the guard
mother cell
– Guard cells develop from guard mother cells
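The three relationship types can be sketched as typed edges between terms. The triples below restate the examples from this slide; the names EDGES and related are assumptions made for the example and are not the POC file format.

```python
# Typed edges as (child term, relationship, parent term) triples.
EDGES = [
    ("prop root", "is_a", "root"),
    ("root cap", "part_of", "root"),
    ("guard cell", "develops_from", "guard mother cell"),
]


def related(term, relationship):
    """Return the terms that 'term' points to through one relationship type."""
    return [parent for child, rel, parent in EDGES
            if child == term and rel == relationship]
```

Keeping the relationship type on each edge lets a query follow only "part of" links, only "is a" links, or any mix of the three.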
Adapting the POC ontology for
Pioneer’s needs
Problem is that it has many more terms than
required for our experiments
Some terms describe tissues or cells that are not
practical to collect (e.g. antipodal cells)
Some terms describe parts not found in corn (e.g.
nectary)
Another problem is that we collect samples that
are convenient subdivisions of structures
Tip and base of an immature ear. Each differs from a
whole immature ear in terms of what it contains.
Basal endosperm – morphologically distinct from starchy
endosperm, but not found in the ontology
Our current solution
Add additional terms to the POC ontology
Use a different id system
easily distinguished from POC terms
will not be overwritten by on-going public curation efforts.
Label experiments with the terms from the ontology.
Create a Custom ontology
Query the whole ontology with the terms used in the labeling and keep only
– Terms that are used to label an experimental sample
– Parent terms of used terms.
Can be readily rebuilt if new experiments or terms are added.
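The custom ontology step can be sketched as a pruning pass over a child -> parents mapping. The small ontology below is an invented, simplified stand-in, not the real POC structure, and FULL, USED and prune are names chosen for this example.

```python
# Hypothetical, simplified ontology as child -> list of parents.
FULL = {
    "ear": ["plant structure"],
    "immature ear tip": ["ear"],          # locally added term
    "flower": ["plant structure"],
    "nectary": ["flower"],                # not used to label any sample
    "leaf": ["plant structure"],
    "epidermal cell": ["plant structure"],
    "guard cell": ["leaf", "epidermal cell"],
}

# Terms actually used to label experimental samples.
USED = {"immature ear tip", "guard cell"}


def prune(parents, used):
    """Keep the used terms plus all of their parent terms, up to the root."""
    keep = set()
    stack = list(used)
    while stack:
        term = stack.pop()
        if term in keep:
            continue
        keep.add(term)
        stack.extend(parents.get(term, []))
    return {term: list(parents.get(term, [])) for term in keep}
```

Calling prune(FULL, USED) again after new experiments or terms are added rebuilds the custom ontology from scratch, so nothing has to be edited by hand.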
What can you do with the
ontology?
Provides a grouping mechanism
Summarize expression for a tissue
Compare expression between tissues
Make complex queries that involve multiple tissues
Provides a systematic label for annotating genes
Where is the gene expressed?
Query annotation of genes based on terms
Provides a description of the complexity of tissue samples
Leaf sample is composed of multiple cell types with different roles
Cell types can be shared between tissues or structures
Comparing by tissue
The ontology provides the groupings, but how to summarize?
Mean?
Median?
Maximum value?
Significance of differences?
Each group will be much more variable than a set of
samples from a controlled experiment.
But you may be able to eliminate the inevitable false
discoveries that appear when looking at large numbers
of genes.
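The choice of summary matters, as a small sketch with made-up numbers shows. The expression values, sample ids and the summarize function are all hypothetical; only the standard library statistics module is used.

```python
from statistics import mean, median

# Made-up expression values for one gene, keyed by sample id.
EXPRESSION = {"s1": 10.0, "s2": 14.0, "s3": 120.0, "s4": 3.0}

# Ontology term used to label each sample.
SAMPLE_TERM = {"s1": "leaf", "s2": "leaf", "s3": "leaf", "s4": "root"}


def summarize(term):
    """Summarize expression across all samples labeled with one term."""
    values = [EXPRESSION[s] for s, t in SAMPLE_TERM.items() if t == term]
    return {"mean": mean(values), "median": median(values), "max": max(values)}
```

In this example a single high leaf sample pulls the mean well above the median, which is one reason the summary statistic has to be chosen with the variability of each group in mind.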
Annotating genes
This is the primary use for TAIR and Gramene
Potentially label most genes with tissues of expression
However, need to differentiate presence from preferential
expression.
A gene may be present in many tissues, but highly expressed in
a few
Another gene may be present in the same tissues, but similarly
expressed in all of them.
– Might need to precompute and indicate which tissues the gene is
significantly preferentially expressed in.
– Might be able to use the RMS differences between expression in
each tissue as a measure of consistency.
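One possible reading of the RMS idea is the root mean square of all pairwise differences between tissue-level expression values. That interpretation is an assumption, as are the function name and the numbers used to exercise it.

```python
from itertools import combinations
from math import sqrt


def rms_pairwise_difference(tissue_expression):
    """Root mean square of the differences between every pair of tissues.

    tissue_expression maps a tissue term to one summarized expression value.
    A value near zero means the gene is expressed similarly in all tissues.
    """
    pairs = list(combinations(tissue_expression.values(), 2))
    return sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))
```

A gene with the same value in every tissue scores 0, while a gene that is high in only a few tissues scores well above 0, so the score separates uniform presence from preferential expression.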
Complexity
Genes may appear to differ between tissues for
trivial reasons
Example: Gene appears to be preferentially expressed
in stem versus leaf tissue.
If gene is really specific to vascular tissue and stem has more…
If gene is expressed late in development, adjacent leaves and
stems may differ in developmental stage.
Ontology can guide further experiments
Compare vascular and non-vascular tissue from both leaf and
stem.
Compare multiple leaf and stem samples from different positions
(developmental stages).
Conclusions
The Plant Ontology classifies experiments and
genes based on anatomical and developmental
concepts.
Now that we have significant data, can we, like
Darwin, discern the underlying mechanisms for
how anatomical and developmental differences
occur?
The Plant Ontology will be successful and used
long term if it facilitates these kinds of
investigations.
Acknowledgements
Pioneer
Henry Mirsky
Lane Arthur
Bob Merrill
POC
Doreen Ware (Gramene)
Katica Ilic (TAIR)