Transcript PPT

PIONEER HI-BRED INTERNATIONAL, INC.
Plant Ontologies – Industrial Science meets Renaissance Concepts
Dave Selinger
Computational Biologist
Pioneer Hi-Bred, DuPont Agriculture and Nutrition
Outline

What is the nature of the problem that a Plant Anatomy Ontology can solve?

What is an Ontology?

How do you make a Plant Anatomy Ontology?

Does it really solve the problem?
RESEARCH
Industrial Science

Not science in industry, but the industrialization of data creation, i.e. the ’omics revolutions.

High-throughput data
  – Sequencing
  – Expression
Medium-throughput data
  – Proteomics
  – Metabolomics
Low-throughput data
  – Gene/protein function
  – Phenotype
The double-edged sword of Industrial Science

Industrial science means lots of cheap data
  – Sequencing
    – << $0.01/base
    – $10,000 prokaryotic genomes are a reality
    – $10,000 eukaryotic genomes will be a reality in the next five years
  – Expression
    – < $0.50/gene
  – And much of this data is available for free after it is produced!
Lots of data means that you can’t sit down with your lab notebook and analyze the data by hand.
  – Databases, software for searching and comparing
  – Whole new areas of research devoted to finding meaningful patterns in lots of data.
Organizing information


Information is not knowledge.
  – But knowledge can be acquired from information.
  – But only with a lot of effort (see the second law of thermodynamics).
Central challenge with Industrial science is organizing the information.
  – The organization of the information determines what you can discover.
  – Experimental design
    – Good design will produce a contrast that will support or refute a hypothesis.
    – Statistical rigor
      – Is the signal higher than the noise?
      – How conclusive will the discoveries be?
Context

How do we compare across experiments?
  – Not too hard if one person did all the experiments and kept careful notes.
  – If multiple people, then we need to define what was done, what the analysis was, and what the sample was.
    – What was done – e.g. MIAME standard for describing the technical details of an expression experiment.
    – Analysis – e.g. ANOVA, SAM, etc.
    – Sample – ?
Renaissance concepts
(historically Enlightenment)

Things can be systematically described and classified
  – Organisms - Linnaeus, Species Plantarum, 1753
Linnaeus’ problem is much the same as the sample description problem
  – Variable specificity
    – California Laurel or Oregon Myrtlewood?
    – Kernel or seed?
  – In addition, a term like kernel assumes all parts, but this assumption could be wrong
Ontologies to the rescue?

Ontology = the study of being (Philosophy)
  – The specification of a conceptualization of a domain of interest (Computer Science)
Original and continuing computer science interest was Artificial Intelligence.
  – How can a computer make inferences?
  – Need to define meanings – the word “can”, for example.
  – Structure and relationships in an ontology allow a computer to make inferences.
    – Mary is the mother of Bill. Is Mary a parent of Bill?
    – IsA Mother Parent
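The Mary/Bill inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `IS_A` and `FACTS` tables and the `holds` function are hypothetical names, not part of any ontology tool.

```python
# Hypothetical mini ontology: class-level "is a" links plus one asserted fact.
IS_A = {"Mother": "Parent", "Father": "Parent"}
FACTS = {("Mary", "Mother", "Bill")}  # (subject, relation, object)


def holds(subject, relation, obj):
    """Return True if the fact is asserted or follows from an 'is a' chain."""
    for s, r, o in FACTS:
        if (s, o) != (subject, obj):
            continue
        # Walk up the "is a" chain starting from the asserted relation.
        while r is not None:
            if r == relation:
                return True
            r = IS_A.get(r)
    return False


print(holds("Mary", "Parent", "Bill"))  # True
print(holds("Mary", "Father", "Bill"))  # False
```

Nothing states directly that Mary is a parent of Bill; the answer follows only because "IsA Mother Parent" is recorded once at the class level.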

Parts of an ontology
  – Concepts -> objects, real and abstract, processes, functions
  – Partitions -> rules that can classify concepts
  – Attributes -> properties of a concept, can have individual and class attributes
  – Relationships -> is a, part of
Does an ontology make sense?

The value of ontologies is a current debate among information scientists.
  – One group advocates that ontologies are necessary for computers to understand content.
    – Semantic web -> an extension of the current HTML/XML based web to something with ontological inference
  – Others argue that ontologies are not needed and are not practical
    – Complexity is OK; just use a Google-like search to connect concepts.
However, some problems, like organismal classification and the periodic table, are very amenable to an ontological approach.
  – Formal categories and stable entities
  – Expert users and catalogers
Forms of ontologies

Ontologies can take several forms (data structures)
  – Controlled vocabulary (List)
    – Terms but no relationships
    – Enforces systematic naming
  – Hierarchy (tree structure) => Taxonomy
    – Terms and “is a” relationship
    – Children are unique and have a single parent
  – Directed acyclic graph => Gene Ontology
    – Multiple relationship types
    – Children with multiple parents
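As a rough illustration, the three forms map onto three simple Python data structures. The terms and the `is_tree` helper below are made up for this sketch and are not taken from any published ontology file.

```python
# 1. Controlled vocabulary: a flat set of approved names, no relationships.
controlled_vocabulary = {"leaf", "root", "guard cell", "epidermal cell"}

# 2. Tree: every child has exactly one "is a" parent.
tree_parent = {
    "prop root": "root",
    "root": "plant structure",
    "leaf": "plant structure",
}

# 3. DAG: a child may have several parents, each with its own relationship type.
dag_parents = {
    "guard cell": [("is_a", "epidermal cell"), ("part_of", "leaf")],
    "epidermal cell": [("is_a", "cell")],
    "leaf": [("is_a", "plant structure")],
}


def is_tree(parents):
    """True if no child in a child -> [(relationship, parent)] map has more than one parent."""
    return all(len(links) <= 1 for links in parents.values())


# Any tree can be written in the DAG form, but not the other way round.
tree_as_dag = {child: [("is_a", parent)] for child, parent in tree_parent.items()}
print(is_tree(tree_as_dag), is_tree(dag_parents))  # True False
```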
Features of Trees

Because each child node has only one parent
  – There is an unambiguous path to the root from each leaf
  – Child nodes can be easily grouped at any level of the structure
Trees can express only one organizing principle
  – Work well for taxonomy (at least eukaryotic taxonomy)
    – Organizing principle is classification by similarity
    – All terms have an “is a” relationship to the next level term
    – Organisms were classified before evolution was hypothesized, but the classification matches the evolutionary relationships
  – Similar example would be the periodic table of the elements
  – Classification can facilitate discovery of underlying principles
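Both tree properties (one path to the root, easy grouping at any level) can be seen in a small sketch. The `PARENT` table is a heavily abbreviated, illustrative taxonomy that skips most ranks, and the function names are hypothetical.

```python
# Abbreviated "is a" tree: each child has exactly one parent.
PARENT = {
    "Zea mays": "Poaceae",
    "Oryza sativa": "Poaceae",
    "Glycine max": "Fabaceae",
    "Poaceae": "Angiosperms",
    "Fabaceae": "Angiosperms",
}


def path_to_root(term):
    """Follow the single parent link until the root is reached."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def group_by_level(leaves, depth_from_root):
    """Group leaves by the ancestor found `depth_from_root` steps below the root."""
    groups = {}
    for leaf in leaves:
        path = path_to_root(leaf)[::-1]  # root first
        key = path[min(depth_from_root, len(path) - 1)]
        groups.setdefault(key, []).append(leaf)
    return groups


print(path_to_root("Zea mays"))
# ['Zea mays', 'Poaceae', 'Angiosperms']
print(group_by_level(["Zea mays", "Oryza sativa", "Glycine max"], 1))
# {'Poaceae': ['Zea mays', 'Oryza sativa'], 'Fabaceae': ['Glycine max']}
```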
A tree based Anatomy Ontology


Developed by Winston Hide’s group at SANBI and Electric Genetics
Single concept, orthogonal trees
  – Cells
  – Tissues
  – Organs
  – Disease state
Each tree is independent, but the trees are related dimensions describing a sample
Set operations, intersection or union, between trees allow specific queries.
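A minimal sketch of such set-based queries, assuming each tree keeps a map from term to the ids of the samples labeled with it. The sample ids, terms, and function names are invented for illustration and do not describe the actual SANBI implementation.

```python
# Hypothetical annotations: one dict per tree (dimension), term -> sample ids.
by_tissue = {"liver": {"s1", "s2", "s3"}, "kidney": {"s4", "s5"}}
by_disease_state = {"normal": {"s1", "s4"}, "tumor": {"s2", "s3", "s5"}}


def query_all(*sample_sets):
    """Intersection: samples that carry every one of the given labels."""
    return set.intersection(*sample_sets)


def query_any(*sample_sets):
    """Union: samples that carry at least one of the given labels."""
    return set.union(*sample_sets)


liver_tumor = query_all(by_tissue["liver"], by_disease_state["tumor"])
liver_or_kidney = query_any(by_tissue["liver"], by_tissue["kidney"])
print(sorted(liver_tumor))      # ['s2', 's3']
print(sorted(liver_or_kidney))  # ['s1', 's2', 's3', 's4', 's5']
```

Because the trees are orthogonal, a query never needs to know how the other trees are organized; it only combines the sample sets they return.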
Features of DAGs

A tree is a special case of the DAG class
Children can have multiple parents.
  – Allows multiple classifications of the same child
    – E.g. a guard cell is both part of a leaf and is an epidermal cell.
Allows for more than a binary classification of a concept
  – If this results from poor definition of the concept, then it is not good.
Multiple parentage fits a “normalized” data model
  – Like a normalized relational database, a DAG can minimize duplication of objects (concepts).
Sample DAG

Root
  – Cooking
    – Spices
      – Bay leaf
        • Laurus nobilis
        • Umbellularia californica (California laurel)
  – Trees
    – Lauraceae
      – Laurus
        • Laurus nobilis
      – Umbellularia
        • Umbellularia californica
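The sample DAG can be written as a child-to-parents map, which makes the shared nodes explicit. This is a sketch only: `PARENTS` and `ancestors` are illustrative names, and the genus is written with its botanical spelling, Laurus.

```python
# The sample DAG as a child -> parents mapping.
PARENTS = {
    "Cooking": ["Root"],
    "Trees": ["Root"],
    "Spices": ["Cooking"],
    "Bay leaf": ["Spices"],
    "Lauraceae": ["Trees"],
    "Laurus": ["Lauraceae"],
    "Umbellularia": ["Lauraceae"],
    # Each species is stored once but reachable through two branches.
    "Laurus nobilis": ["Bay leaf", "Laurus"],
    "Umbellularia californica": ["Bay leaf", "Umbellularia"],
}


def ancestors(term):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


print(sorted(ancestors("Laurus nobilis")))
# ['Bay leaf', 'Cooking', 'Lauraceae', 'Laurus', 'Root', 'Spices', 'Trees']
```

A query for "Spices" and a query for "Lauraceae" both find the same species node, with no duplicated entries.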
Constructing the Pioneer Plant Ontology

Decided to produce a DAG
  – Used DAGeditor (editor developed for GO)
  – Developed our own web based viewing tool
    – AmiGO was too complicated to re-use. Other public browsers did not have the functionality we wanted.
Decided to focus on Corn and Soybeans
  – Used Kiesselbach’s 1949 Monograph on Corn structure and reproduction as the primary source.
  – Used Iowa State University Ag Extension publications for the development stages of corn and soybeans
  – Added information from a botany textbook to cover terms missing for soybean.
To collaborate or not to collaborate?

Advantage of just using the Pioneer Ontology was that it served our needs and was focused on corn and soybeans, our major crops.
Disadvantage was that it was not synchronized with the public ontology
  – We would not be able to easily integrate public tissue classifications with ours
  – We would not be able to easily take advantage of improvements to the public ontology
  – Presumably the public ontology would be more “botanically correct” than ours.
Plant Ontology Consortium

Focused on model organisms
  – Arabidopsis
  – Rice (and other grasses, e.g. corn, using the rice terms)
Used a DAG approach
  – Multiple concepts
    – Structure (cells, tissues, sporophyte and gametophyte)
    – Development
  – Used DAGeditor and other GO approaches
    – Most terms have multiple parents
    – Same software and data structures as GO
Plant Ontology


Domain = Plant anatomy and development
Concepts
  – Plant parts (leaf, root, flower, meristem, etc.)
  – Life cycle stages (sporophyte, gametophyte)
  – Developmental stages (V1, flowering, R1, etc.)
Relationships between concepts
  – “A kind of” (Is a)
    – A prop root is a root
  – “A part of” (part of)
    – A root cap is part of a root
  – In addition, for plant anatomy a “develops from” relation is needed
    – For example the relationship between stomatal guard cells and the guard mother cell
    – Guard cells develop from guard mother cells
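One way to hold the three relationship types is a list of typed edges. The sketch below uses only the examples from this slide; the `EDGES` layout and the helper names are assumptions for illustration, not the Plant Ontology file format.

```python
# Typed edges: (child, relationship, parent).
EDGES = [
    ("prop root", "is_a", "root"),
    ("root cap", "part_of", "root"),
    ("guard cell", "develops_from", "guard mother cell"),
]


def parents_of(term, relationship):
    """Parents of `term` reached through one edge of the given type."""
    return [p for c, r, p in EDGES if c == term and r == relationship]


def children_of(term, relationship=None):
    """Children of `term`, optionally restricted to one relationship type."""
    return [c for c, r, p in EDGES if p == term and relationship in (None, r)]


print(parents_of("guard cell", "develops_from"))  # ['guard mother cell']
print(children_of("root"))                        # ['prop root', 'root cap']
print(children_of("root", "part_of"))             # ['root cap']
```

Keeping the relationship type on the edge lets a query ask "what is a kind of root?" separately from "what is part of a root?".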
Adapting the POC ontology for Pioneer’s needs

Problem is that it has many more terms than required for our experiments
  – Some terms describe tissues or cells that are not practical to collect (e.g. antipodal cells)
  – Some terms describe parts not found in corn (e.g. nectary)
Another problem is that we collect samples that are convenient subdivisions of structures
  – Tip and base of an immature ear. Each differs from a whole immature ear in terms of what it contains.
  – Basal endosperm – morphologically distinct from starchy endosperm, but not found in the ontology
Our current solution

Add additional terms to the POC ontology
  – Use a different id system
    – easily distinguished from POC terms
    – will not be overwritten by on-going public curation efforts.
Label experiments with the terms from the ontology.
Create a Custom ontology
  – Query the whole ontology with the terms used in the labeling and keep only
    – terms that are used to label an experimental sample
    – parent terms of used terms.
  – Can be readily rebuilt if new experiments or terms are added.
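The pruning step can be sketched as a walk upward from the terms used as labels. This is an illustration, not the Pioneer implementation: it assumes "parent terms" means all ancestors, and the term names, the `PHI:` prefix for locally added terms, and the function name are hypothetical.

```python
def build_custom_ontology(parents, used_terms):
    """Keep the used terms and every ancestor of a used term; drop everything else."""
    keep = set()
    stack = list(used_terms)
    while stack:
        term = stack.pop()
        if term not in keep:
            keep.add(term)
            stack.extend(parents.get(term, []))
    # Every parent of a kept term is itself kept, so the links can be copied as-is.
    return {term: list(parents.get(term, [])) for term in keep}


# Hypothetical full ontology (child -> parents); "PHI:" marks a locally added term.
full = {
    "guard cell": ["epidermal cell", "leaf"],
    "epidermal cell": ["cell"],
    "leaf": ["plant structure"],
    "cell": ["plant structure"],
    "nectary": ["flower"],
    "flower": ["plant structure"],
    "PHI:immature ear tip": ["ear"],
    "ear": ["plant structure"],
}
custom = build_custom_ontology(full, {"guard cell", "PHI:immature ear tip"})
print(sorted(custom))
# ['PHI:immature ear tip', 'cell', 'ear', 'epidermal cell', 'guard cell', 'leaf', 'plant structure']
```

Since the custom ontology is derived entirely from the full ontology plus the current labels, rebuilding it after new experiments or terms are added is just a matter of re-running the function.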
What can you do with the ontology?

Provides a grouping mechanism
  – Summarize expression for a tissue
  – Compare expression between tissues
  – Make complex queries that involve multiple tissues
Provides a systematic label for annotating genes
  – Where is the gene expressed?
  – Query annotation of genes based on terms
Provides a description of the complexity of tissue samples
  – Leaf sample is composed of multiple cell types with different roles
  – Cell types can be shared between tissues or structures
Comparing by tissue

The ontology provides the groupings, but how to summarize?
  – Mean?
  – Median?
  – Maximum value?
Significance of differences?
  – Each group will be much more variable than a set of samples from a controlled experiment.
  – But you may be able to eliminate the inevitable false discoveries that appear when looking at large numbers of genes.
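The three candidate summaries can be computed side by side once samples are grouped by their ontology term. The expression values below are invented for one hypothetical gene, and `summarize` is an illustrative helper, not an existing tool.

```python
from statistics import mean, median

# Hypothetical expression values for one gene: sample id -> (ontology term, value).
samples = {
    "s1": ("leaf", 10.0),
    "s2": ("leaf", 14.0),
    "s3": ("leaf", 30.0),
    "s4": ("root", 2.0),
    "s5": ("root", 4.0),
}


def summarize(samples):
    """Group values by ontology term and report mean, median and maximum per group."""
    groups = {}
    for term, value in samples.values():
        groups.setdefault(term, []).append(value)
    return {
        term: {"mean": mean(values), "median": median(values), "max": max(values)}
        for term, values in groups.items()
    }


summary = summarize(samples)
print(summary["leaf"])  # {'mean': 18.0, 'median': 14.0, 'max': 30.0}
print(summary["root"])  # {'mean': 3.0, 'median': 3.0, 'max': 4.0}
```

In the leaf group the single high sample pulls the mean well above the median, which is the kind of within-group variability the slide warns about.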
Annotating genes

This is the primary use for TAIR and Gramene
  – Potentially label most genes with tissues of expression
  – However, need to differentiate presence from preferential expression.
    – A gene may be present in many tissues, but highly expressed in a few
    – Another gene may be present in the same tissues, but similarly expressed in all of them.
      – Might need to precompute and indicate which tissues the gene is significantly preferentially expressed in.
      – Might be able to use the RMS differences between expression in each tissue as a measure of consistency.
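One possible reading of the RMS idea is the root mean square of all pairwise differences between a gene's per-tissue expression values; the slide does not define the measure, so this definition is an assumption, as are the function name and the example values.

```python
from itertools import combinations
from math import sqrt


def rms_pairwise_difference(expr_by_tissue):
    """Root mean square of the pairwise differences between tissue expression values.

    Needs at least two tissues. Low values mean the gene is expressed similarly
    everywhere; high values suggest preferential expression.
    """
    squared = [(a - b) ** 2 for a, b in combinations(expr_by_tissue.values(), 2)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical per-tissue summaries for two genes present in the same tissues.
uniform_gene = {"leaf": 10.0, "root": 11.0, "stem": 9.0}
preferential_gene = {"leaf": 100.0, "root": 10.0, "stem": 10.0}

print(rms_pairwise_difference(uniform_gene) < rms_pairwise_difference(preferential_gene))  # True
```

Both genes would be labeled as "present" in leaf, root and stem, but only the second one has a large RMS difference, which is the distinction between presence and preferential expression.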
Complexity

Genes may appear to differ between tissues for trivial reasons
  – Example: Gene appears to be preferentially expressed in stem versus leaf tissue.
    – If gene is really specific to vascular tissue and stem has more…
    – If gene is expressed late in development, adjacent leaves and stems may differ in developmental stage.
  – Ontology can guide further experiments
    – Compare vascular and non-vascular tissue from both leaf and stem.
    – Compare multiple leaf and stem samples from different positions (developmental stages).
Conclusions

The Plant Ontology classifies experiments and genes based on anatomical and developmental concepts.

Now that we have significant data, can we, like Darwin, discern the underlying mechanisms for how anatomical and developmental differences occur?

The Plant Ontology will be successful and used long term if it facilitates these kinds of investigations.
Acknowledgements

Pioneer
  – Henry Mirsky
  – Lane Arthur
  – Bob Merrill
POC
  – Doreen Ware (Gramene)
  – Katica Ilic (TAIR)