Shah - Buffalo Ontology Site

Computations using pathways
and networks
Nigam Shah
[email protected]
THE GOAL = MAKING SENSE OF
HIGH THROUGHPUT DATA
High throughput data
• “high throughput” is one of those fuzzy terms
that is never really defined anywhere
• Genomics data is considered high throughput if:
• You cannot “look” at your data to interpret it
• Generally speaking it means ~ 1000 or more genes and
20 or more samples.
• There are about 40 different high throughput
genomics data generation technologies.
• DNA, mRNA, proteins, metabolites … all can be
measured
How does ontology help?
• An ontology provides an organizing framework for
creating “abstractions” of the high throughput
data
• The simplest ontologies (i.e. terminologies,
controlled vocabularies) provide the most bang for the buck
• Gene Ontology (GO) is the prime example
• More structured ontologies, such as those that
represent pathways and higher-order biological concepts,
still have to demonstrate real utility.
Gene Ontology to analyze microarray
data
Using GO annotations
Descriptions built by connecting/linking ontology
terms
Biologists interpret a list of genes and form a result
statement such as:
The photosynthesis genes located in the chloroplast are
repressed in response to ozone stress and have the ABRE
binding site enriched in their promoters.
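A result statement like the one above usually rests on some test of whether a GO term is over-represented in the gene list. The slides do not say which statistic was used; the sketch below assumes a plain hypergeometric over-representation test, and the gene names and annotations are made up for illustration.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def enrichment(gene_list, annotations, background):
    """Return {term: p-value} for every term annotated to a listed gene."""
    background = set(background)
    genes = set(gene_list) & background
    terms = {t for g in genes for t in annotations.get(g, ())}
    results = {}
    for term in terms:
        K = sum(1 for g in background if term in annotations.get(g, ()))
        k = sum(1 for g in genes if term in annotations.get(g, ()))
        results[term] = hypergeom_tail(k, len(genes), K, len(background))
    return results

# Toy data (hypothetical): 10 background genes, 3 annotated to photosynthesis.
background = [f"g{i}" for i in range(1, 11)]
annotations = {"g1": {"photosynthesis"}, "g2": {"photosynthesis"},
               "g3": {"photosynthesis"}, "g4": {"transport"}}
pvalues = enrichment(["g1", "g2", "g3"], annotations, background)
print(pvalues)
```

In practice the p-values would also need a multiple-testing correction, which is omitted here.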
…more structure
OBOL
Relations Ontology
?<link>?
<Some MF> in <Some BP>
Between-ontology structure
… more structure [beyond GO]: PATO
The building blocks of phenotype descriptions: EQ
Entity (bearer) such as spermatocyte, wing
Quality (property, attribute)
- a kind of dependent continuant
Formally, an EQ description defines:
- a Quality which inheres_in a bearer entity
The building blocks are combined according to the Phenosyntax
www.fruitfly.org/~cjm/formats
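As a minimal sketch of the EQ idea, an Entity and a Quality can be held as a pair and rendered with the inheres_in relation. This is not the Phenosyntax format itself; the class name and the example labels are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EQ:
    """One phenotype description: a quality that inheres_in a bearer entity."""
    entity: str   # bearer, e.g. an anatomical term (label only, hypothetical)
    quality: str  # property or attribute of the bearer (label only, hypothetical)

    def as_statement(self):
        return f"{self.quality} inheres_in {self.entity}"

description = EQ(entity="wing", quality="decreased length")
print(description.as_statement())
```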
Semantically structured annotations
1. Relationship ontology
2. Mouse Pathology ontology
3. Tissue/Organ
4. Gene ontology
Basal layer of organ
shows membranous
staining
mRNA of genes encoding proteins
with mf in bp at cc is increased in
sample-id which shows some
pathology in some tissue in
some organ
Queries enabled:
1. Identify all images with a specific pathology
2. Identify cases with pathology and some gene expression changes
3. Correlate changes in biological processes with changes in morphology
Discovery enabled:
1. Classify samples in expression space and “look” for histological changes that
correlate with it.
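Once annotations are stored as structured records rather than free text, the queries listed above reduce to simple filters. The sketch below is only an illustration: the record layout, field names, and values are hypothetical and do not come from the project described in the slides.

```python
# Hypothetical annotation records; field names and values are illustrative only.
records = [
    {"sample_id": "s1", "image": "img_001", "pathology": "membranous staining",
     "tissue": "basal layer", "organ": "skin",
     "increased_processes": {"cell adhesion"}},
    {"sample_id": "s2", "image": "img_002", "pathology": "hyperplasia",
     "tissue": "epithelium", "organ": "colon",
     "increased_processes": {"cell cycle"}},
    {"sample_id": "s3", "image": "img_003", "pathology": "membranous staining",
     "tissue": "basal layer", "organ": "skin",
     "increased_processes": set()},
]

def images_with_pathology(records, pathology):
    """Query 1: all images annotated with a specific pathology."""
    return [r["image"] for r in records if r["pathology"] == pathology]

def cases_with_pathology_and_change(records, pathology):
    """Query 2: cases with the pathology and at least one expression change."""
    return [r["sample_id"] for r in records
            if r["pathology"] == pathology and r["increased_processes"]]

def processes_by_pathology(records):
    """Query 3 (first step): group changed processes by observed pathology."""
    table = {}
    for r in records:
        table.setdefault(r["pathology"], set()).update(r["increased_processes"])
    return table

print(images_with_pathology(records, "membranous staining"))
print(cases_with_pathology_and_change(records, "membranous staining"))
print(processes_by_pathology(records))
```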
HOW
WHY
Open Questions/Challenges
• Creation/acceptance of a systematic formalism
for creating expressive annotations. (e.g.
associated_with, involves)
• A generic tool that uses ontologies and allows the
user to compose terms and cross-ontology
annotations
• Easy term/annotation composition
• Control the amount of alternative [compositional]
statements allowed
“Pathways” to analyze array data
• The notion of a cancer signaling pathway can
serve as an organizing framework for interpreting
microarray expression data.
• On examining a relatively small set of genes
based on prior biological knowledge about a
given pathway, the analysis becomes more
specific.
Reactome’s sky painter
Operations on pathway resources
Operation | Custom code | RDF + SPARQL | OWL + SWRL
Verify a pathway resource | Proofreading Reactome [1] | In progress | In progress
Perform integrated querying of multiple pathway resources | Hard (“wrapper” approaches) | PKB [2] |
Verify multiple pathway resources | Too hard (there are ~200) | |
Merge and compare multiple pathway resources | | |
“Reason” over pathway resources | | |
[1] A case study in pathway knowledgebase verification, BMC Bioinformatics 2006, 7:196
[2] Pathway Knowledge Base: An Integrated pathway resource using BioPAX, Submitted to Applied Ontology
Merge and compare pathway resources
• Given a set of ‘nodes’ and some ‘links’ among them, query
multiple pathway sources and fill in the most plausible
interactions between the nodes.
• Plausible = not contradicted by existing data and knowledge
• Current pathway resources [in BioPAX] cannot support this
because the manner in which ‘nodes’ and ‘links’ are
identified is arbitrary.
• Reactome has started to connect the pathway steps with GO
biological processes.
• BioPAX lets pathway sources “export” their nodes and links.
• …but p53 in resource A is still different from P53 in resource B
• … and Activate in resource A is still different from activates in
resource B
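The identifier problem above can be made concrete with a small sketch: two resources only become comparable after node names and link names are mapped to shared identifiers. The mapping tables here are hand-written and hypothetical; a real merge would need curated identifiers and a shared relations vocabulary, which is exactly what the slide says is missing.

```python
# Hypothetical lookup tables; real resources would need curated mappings.
NODE_IDS = {"p53": "TP53", "tp53": "TP53", "mdm2": "MDM2"}
RELATIONS = {"activate": "activates", "activates": "activates",
             "inhibit": "inhibits", "inhibits": "inhibits"}

def normalize(edge):
    """Map a (node, link, node) triple onto shared node and relation names."""
    src, rel, dst = edge
    return (NODE_IDS.get(src.lower(), src),
            RELATIONS.get(rel.lower(), rel),
            NODE_IDS.get(dst.lower(), dst))

# Toy exports from two resources that spell the same things differently.
resource_a = [("p53", "Activate", "MDM2")]
resource_b = [("P53", "activates", "mdm2"), ("MDM2", "inhibits", "P53")]

raw_shared = set(resource_a) & set(resource_b)
shared = set(map(normalize, resource_a)) & set(map(normalize, resource_b))
print(raw_shared)  # nothing matches before normalization
print(shared)
```

Without the two tables the intersection is empty even though both resources describe the same interaction.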
Problem
• I have no clue what a pathway is!
• A set or series of interactions, often forming a
network, which biologists have found useful to
group together for organizational, historic,
biophysical or other reasons.
• The complexity and abstraction represented in
a pathway is decided by its author attempting
to represent the interactions between a set of
genes, proteins, and small molecules.
“Networks” to analyze high throughput
genomic data
Building networks
• Take a high throughput
dataset
• Define a notion of
‘relatedness’ depending on
the dataset
• Co-expression for
microarray data
• Co-occurrence for literature
networks
• …
• List the [node]--<link>--[node]
pairs
• Find a good graph drawing
program!
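The steps above can be sketched for the microarray case, assuming Pearson correlation as the notion of relatedness and a fixed cutoff for drawing a link. Both choices, and the toy expression values, are assumptions for illustration rather than anything specified in the slides.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(expr, threshold=0.9):
    """List [node]--<link>--[node] pairs whose |correlation| meets the cutoff."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, "co-expressed", g2, r))
    return edges

# Toy expression profiles (hypothetical genes, four samples each).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expr)
print(edges)
```

The resulting edge list can be handed to any graph drawing program; a lower cutoff gives a denser network.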
Nice hairball but …
From Long et al, in Trends in
Biochemical Sciences, vol 32, no 7.
From Srinivasan et al, in Briefings in
Bioinformatics August 2007.
Srinivasan B, Snow R, Shah N and
Batzoglou S in Interactome Networks
conference @ CSHL
Hypotheses/Models to analyze high
throughput genomic data
Events and Implicit claims
A hypothesis is a statement
about relationships (among
objects) within a biological
system.
Protein P induces transcription of
gene X
An ‘event’ is a relationship
between two biological entities.
P
promoter |
gene X
Implicit claims that can be
tested:
1. P is a transcription factor.
2. P is a transcriptional
activator.
3. P is localized to the nucleus.
4. P can bind to the promoter
of gene X
Representing Events Explicitly
A hypothesis consists of at least one event stream
An event stream is a sequence of one or more events or event
streams with logical joints (or operators) between them.
An event has exactly one agent_a, exactly one agent_b and
exactly one operator (i.e. a relationship between the two
agents). It also has a physical location that denotes ‘where’ the
event happened, the genetic context of the organism and
associated experimental perturbations when the event
happened.
A logical joint is the conjunction between two event streams.
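The definitions above map naturally onto a small set of record types. The sketch below is one possible reading, not the HyBrow implementation: the class names, field defaults, and the example operator strings are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Event:
    """Exactly one agent_a, one operator, one agent_b, plus context."""
    agent_a: str
    operator: str
    agent_b: str
    location: str = "unspecified"        # where the event happened
    genetic_context: str = "wild type"   # genetic context of the organism
    perturbations: Tuple[str, ...] = ()  # associated experimental perturbations

@dataclass
class EventStream:
    """One or more events or event streams, with a logical joint between each pair."""
    parts: List[Union[Event, "EventStream"]]
    joints: List[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.parts:
            raise ValueError("an event stream needs at least one part")
        if len(self.joints) != len(self.parts) - 1:
            raise ValueError("need exactly one joint between consecutive parts")

    def events(self):
        """Yield every event in order, descending into nested streams."""
        for part in self.parts:
            if isinstance(part, Event):
                yield part
            else:
                yield from part.events()

@dataclass
class Hypothesis:
    """A hypothesis consists of at least one event stream."""
    streams: List[EventStream]

    def __post_init__(self):
        if not self.streams:
            raise ValueError("a hypothesis needs at least one event stream")

# Hypothetical example built from the slide's "Protein P induces transcription of gene X".
binds = Event("P", "binds promoter of", "gene X", location="nucleus")
induces = Event("P", "induces transcription of", "gene X", location="nucleus")
hypothesis = Hypothesis([EventStream([binds, induces], joints=["then"])])
print([e.operator for s in hypothesis.streams for e in s.events()])
```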
User interfaces
Hypothesis described in
Natural Language
Biological process described
in a formal language
Evaluating a hypothesis
A. Representation of a hypothesis in terms of events
(ev = event)
B. Holding the mouse on a neighboring
hypothesis (b1) shows what event was
replaced to create it
C. Plot of the support versus conflicts
for submitted and neighboring
hypotheses (n1, b1). Clicking on
n1 submits that hypothesis as ‘seed’
HyBrow: lessons learnt
• The minimum requirement for a formal
representation:
• Ability to represent data → information →
knowledge
• A language to unambiguously express your
“thought experiment” (your model, hypothesis,
theory, theorem etc)
• A reasoning framework to evaluate the outcome/
validity/accuracy of your thought experiment
• Project Home page: www.hybrow.org
Pathways as “models”?
• Pathways are assumed to be models representing biological
processes, without actually knowing the modeling formalism in
which the model is valid.
• The ‘language’ of writing out a pathway doesn’t really have a
grammar and/or a logic
• Most pathways end up being lists of heterogeneous sets of
“steps” (in terms of the time of execution, the place of
execution, the abstraction level, the kind of ‘thing’ passed along
etc…)
• Lots of discussion on the requirements of data providers, but where are
the users/consumers and their use cases?
Claims
• Pathways are useful only if they can serve as
“models” [accurate representations] of a process
• Hence whatever needs to be done to ensure that a pathway is a
valid model of at least one formalism should be required of the
pathway author.
• A pathway representation that doesn’t solve the
problem of uniquely identifying entities doesn’t
solve the problem of integrating pathways.
• We just end up with marked up, structured information from
multiple providers, without actually integrating anything.
Success of projects in the Biomedical domain
[Figure: projects placed on two axes, KR complexity (minimal to high) and
computational complexity (minimal to high). Projects shown: Virtual Soldier,
TMJ, HyBrow, TAMBIS, Riboweb, BioCyc, BioSigNet, BioLingua, Pathway Logic,
Mycin, PharmGKB, Reactome, Use of GO.]