Interactive Visual Analysis of Gene Expression Data


G2P Visual Analytics Group Roadmap
San Diego Meeting, January 2010
Ruth Grene, chair
Greg Abram, co-chair
Bernice Rogowitz
Bjoern Usadel, Lenny Heath, Nick Provart, Eric Lyons,
Steve Welch, Tom Brutnell
January 10, 2010 draft
Our process
1. Worked with iPLANT biologists to understand specific experimental and
analytical “use cases” as part of the process of uncovering intermediate
steps between phenotype and genotype
2. Expressed these processes as high-level “workflow” diagrams, showing the
steps they would go through to accomplish a scientific discovery
3. Captured their insights on “gaps”: what technology capabilities would
enhance their science?
4. Developed recommendations for the overall iPLANT cyberinfrastructure
that would support scientists’ and educators’ current and future needs
Our Workflow Approach
• We worked with iPLANT scientists to express their scientific use cases in
terms of data and operations on those data
– Multiple types of data (e.g., experimental, computed, archival)
– Multiple types of operations (e.g., analytical, visualization, search)
• A Workflow is a pathway of operations on data
– Operation
– Data
– Flow
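The three elements above (operation, data, flow) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Operation` and `Workflow` classes, the data names, and the toy threshold filter are hypothetical and do not come from any iPLANT tool.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Operation:
    """A named step that turns input data items into one output data item."""
    name: str
    func: Callable[..., Any]
    inputs: List[str]   # names of the data items consumed
    output: str         # name of the data item produced


@dataclass
class Workflow:
    """Operations plus the data they read and write; flows follow from the names."""
    operations: List[Operation] = field(default_factory=list)

    def flows(self) -> List[Tuple[str, str]]:
        """Return (producer, consumer) pairs linked by a shared data item."""
        producers = {op.output: op.name for op in self.operations}
        return [(producers[name], op.name)
                for op in self.operations
                for name in op.inputs if name in producers]

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """Run operations in listed order; each must find its inputs in `data`."""
        data = dict(data)
        for op in self.operations:
            data[op.output] = op.func(*(data[name] for name in op.inputs))
        return data


# Toy example (hypothetical data): keep genes above a threshold, then count them.
wf = Workflow([
    Operation("filter",
              lambda expr: {g: v for g, v in expr.items() if v > 1.0},
              ["expression"], "responsive"),
    Operation("count", len, ["responsive"], "n_responsive"),
])
result = wf.run({"expression": {"geneA": 2.5, "geneB": 0.3, "geneC": 1.7}})
```

Here the flow between "filter" and "count" is not declared separately; it follows from the fact that one operation produces the data item the other consumes.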
G2P Workflows
• Workflow for Maize Gene Analysis (“Tom’s Workflow”)
• Analysis of Omics Data from a Model Species (“Ruth’s
Workflow”)
• Analysis of Gene Expression Data from a Partially Sequenced
Species
• Simplified Analysis of Omics Data – proposed first prototype
• Interactive visualization of gene expression data
• Steve’s analytical workflow
• iPTOL workflow
Workflow for Maize Gene Analysis
[Workflow diagram; box labels:]
Modeling and Statistical Inference
Candidate maize gene
Homolog Finder (e.g., CoGe)
Literature search
List of homologous Arabidopsis gene IDs
5 genes of interest
Examine clusters that can handle maize data (e.g., eNorthern, MapMan)
note: very limited data for maize, so may need to go to rice
For each, examine structure of transcripts and expression over time (e.g., eFP Maize Genome Browser)
Expression data for 20 maize genes
Co-Expression Analysis (e.g., ATTED2)
Expression Network of 10 Arabidopsis Genes
Homolog Finder (e.g., CoGe)
Find expression values for these genes (e.g., Next Gen)
List of 20 homologous maize gene IDs
/tb/ber
DNA Subway Metaphor
• Tom’s workflow fits the DNA Subway metaphor very nicely
– Linear workflow
– Options can be substituted at various nodes
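A linear workflow with substitutable options at each node could be sketched as below. The tool functions are placeholders invented for illustration; they do not call CoGe, ATTED2, or any other real service.

```python
from typing import Callable, Dict, List

Step = Callable[[List[str]], List[str]]


# Placeholder tools (hypothetical): two interchangeable homolog finders.
def find_homologs_a(genes: List[str]) -> List[str]:
    return [g + "_homolog" for g in genes]


def find_homologs_b(genes: List[str]) -> List[str]:
    return [g.upper() + "_HOMOLOG" for g in genes]


def keep_first_five(genes: List[str]) -> List[str]:
    return genes[:5]


# Each stop on the line offers one or more interchangeable tools.
STOPS: List[Dict[str, Step]] = [
    {"tool_a": find_homologs_a, "tool_b": find_homologs_b},
    {"top5": keep_first_five},
]


def run_line(genes: List[str], choices: List[str]) -> List[str]:
    """Visit the stops in order, applying the tool chosen at each one."""
    for options, choice in zip(STOPS, choices):
        genes = options[choice](genes)
    return genes
```

Swapping `"tool_a"` for `"tool_b"` in `choices` changes which tool runs at the first stop while the rest of the line stays the same.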
Workflow for Analysis of Omics Data in a Model Species
[Workflow diagram; box labels:]
Experiments
Gene expression data
Metabolite Data
Identify sub-cellular locations of gene products (e.g., SUBA, Interactome)
Inferred Protein-Protein interactions
Visualize
Visually-identified, cell-based, network regions of interest
Interactive Visual & Statistical Analysis (e.g., ViVA, Co-expression analysis, PlantMetGenMap, Cytoscape/GeneMANIA)
• Integrated gene expression and metabolomic data
• Interactive visual and statistical analysis
• Explicit support for iterative what-if analysis
Visually identified genes and metabolites to map onto functional pathways
iterate
iterate
Visualize
Testable Hypotheses
Visually-identified enriched pathways
/rg/ber
But, analysis paths are not always linear…
• Ruth’s non-linear use-case, for example, shows the analytical
importance of
– Branching
– Recursion
– Back-tracking
– N.B. - The DNA Subway can be conceptualized as a special case of a
more general workflow model
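One way to picture the "special case" remark is to treat a workflow as a directed graph of steps and test whether it happens to be a single chain. The sketch below is an assumption-laden illustration: the graph encoding and the step names are hypothetical, loosely borrowed from the workflow slides.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # step name -> names of possible next steps


def is_linear(graph: Graph) -> bool:
    """True if the steps form one chain: no branching, merging, or loops."""
    nodes = set(graph) | {n for succ in graph.values() for n in succ}
    indegree = {n: 0 for n in nodes}
    for succ in graph.values():
        if len(succ) > 1:
            return False              # branching
        for n in succ:
            indegree[n] += 1
    if any(d > 1 for d in indegree.values()):
        return False                  # merging, or a loop back into the chain
    starts = [n for n in nodes if indegree[n] == 0]
    if len(starts) != 1:
        return False                  # pure loop, or several separate chains
    # Walk from the single start; a chain visits every step exactly once.
    seen, current = set(), starts[0]
    while current is not None and current not in seen:
        seen.add(current)
        nxt = graph.get(current, [])
        current = nxt[0] if nxt else None
    return seen == nodes


subway_line: Graph = {
    "find homologs": ["co-expression analysis"],
    "co-expression analysis": ["find expression values"],
    "find expression values": [],
}

branching_workflow: Graph = {
    "interactive analysis": ["map onto pathways", "locate gene products"],
    "map onto pathways": ["interactive analysis", "testable hypotheses"],
    "locate gene products": ["interactive analysis"],
}
```

Under this encoding, branching is a step with several successors, and recursion or back-tracking is an edge that returns to an earlier step.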
Gene Expression from A Partially Sequenced Species
[Workflow diagram with steps numbered 1, 2, 3, 4a, 4b, 5, 6, 7; box labels:]
Experimental exposure of plants to differing conditions
Ecophysiological data
Responsive genes
Meta Annotator: Explore known features of these genes (e.g., signaling pathways, eFP, literature)
Paint responsive genes onto pathways (e.g., MapMan)
Identification of homologs in reference species (e.g., CoGe)
Formulate mechanistic models
Homologs in reference
Visually-identified enriched pathways
Compare magnitude of activity across reference pathways (e.g., PageMan, KEGG, GO, MapMan)
Identification of candidate homologs that have been reported as co-expressed with reference homologs (e.g., statistical correlation)
Network of co-expressed genes for reference species
/rg/ber
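The workflow above mentions co-expression established by statistical correlation. A minimal sketch of that idea, assuming Pearson correlation and an arbitrary cut-off of 0.9 (both are illustrative choices, not the method of any tool named in the slides), might look like this:

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles (non-zero variance assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles: Dict[str, List[float]],
                       threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Return gene pairs whose expression profiles correlate at or above threshold."""
    return [(g1, g2)
            for g1, g2 in combinations(sorted(profiles), 2)
            if pearson(profiles[g1], profiles[g2]) >= threshold]


# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(profiles)
```

The resulting edge list is the simplest form of a co-expression network; real analyses would also need to handle multiple testing and constant profiles, which this sketch does not.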
Workflow for Interactive Gene Expression Analysis
[Workflow diagram: high-level view of interactive visual analysis workflow; box labels:]
Array experiments: Multiple conditions for multiple genes
Gene Expression Data
Interactive visual analysis
Visually-identified patterns and relationships
/ber
A Workflow for Interactive Visual Analysis
[Workflow diagram; box labels:]
Array experiments: Multiple conditions for multiple genes
Gene Expression Data
Common data model
Histogram Stack: Visually select regions of interest in data for each experiment
Parallel Coordinates: Identify gene expression patterns across multiple experiments
Scatterplot: View and find clusters in two or three variables
Gene Array Visualizer: View behavior of, and select genes in, the gene array
Chromosome Map Visualization: Explore which chromosomes have highest expression (e.g., eQTL)
Drill-down reveals the underlying components
Visually-identified patterns and relationships
/ber
Interactive Visual Analysis of Gene Expression Data
Modeling Work Flow
[Workflow diagram; box labels:]
SBML
Manual Model Entry
Format Conversion
OpenMI component
SBML
NextGen Pipeline
Opt. method
Parameters to map
Association Mapping
MapMan
User defined visual templates
Eigenvalues & vectors
Sensitivity coefficients
eFP Browser
Sensitivity Analysis
Information theoretic comparisons
Cross validation or bootstrapping
Visualization of Model Outputs
Eigen Plot
Tissue
Problem-specific runs
RNA-seq
Environ.
Environmental data
DNA-seq
Parameter Estimation
Biol. data
QTL Visualization
CoGe
Other models
Independent: Biol. data, Environmental data
Information matrix
Verification
Graph. Residuals Analysis
eQTL Visualization
[Workflow diagram; box labels:]
Method-dependent Parameters (from GUI)
eQTL Data from database
Select phenotypes
Select genome regions
List of phenotypes
List of genomic regions
Specialized visualization component
Method-dependent Parameters (from GUI)
iPToL Visualization Workflows
1. Large Trees
[Workflow diagram; box labels:]
Phylogenetic Tree
Tip and Node Labels
Interactive Tree Visualization, incorporating intelligent filtering (e.g., PhyloWidget, Paloverde)
Visually-identified patterns and relationships
Challenge: visualization tools that support analysis of 100-500K
tips, and support interactive what-if analysis, and “semantic”
zoom, in addition to standard zoom, pan, and select functions
2. Tree Comparisons
[Workflow diagram; box labels:]
Phylogenetic Tree 1 (e.g., Species Tree)
Phylogenetic Tree 2 (e.g., Gene Tree)
Two Dynamically-Linked Interactive Tree Visualizations, highlighting correspondences
Visually-identified correspondences
Challenge: infrastructure that supports semantic “brushing” and
linking between different representations
General Observations from our Scientists
– We found that a workflow model worked very well for representing the data and
processes in plant genomic research exploration
• Suitable for representing gene expression, metabolites and signaling components
• Accommodated different approaches, including, e.g., metabolism, growth habit, and
physiology
– Testable hypotheses may be presented visually as linear pathways or as networks.
– Having a common structure for representing use cases helped identify similarities and
patterns across different biological domains
– The workflow methodology helped scientists clarify and communicate the processes
they used
– Using this methodology helped scientists clarify requirements for data and integration
– Most important: This methodology helped scientists identify operations that don’t exist
today which could help them create next-generation science
Specific Observation 1: Different User Populations
Need to serve three different types of users
– Plant biologists who are not computer scientists -- need sophisticated, easy-to-use analysis and visualization tools
• additionally, will serve educational and outreach applications
– Power-users: Computationally-savvy plant biologists -- need easy-to-use
system that allows them to create personalized and customized workflows
– Developers who are creating new tools – need established APIs so that their
tools can be easily used by the community
Different Interfaces for Different Users
Biologists’ View: High-level templates for Plant Biologists and casual users,
with lots of defaults and pre-selected parameters
[Diagram labels: List of genes; Co-expression analysis; Network]
Power Users’ View: Visibility into underlying workflows, with freedom to add
different data sources, select tools and parameters
[Diagram labels: Co-expression analysis; List of genes; Metabolites; Statistical analysis tool; Interactive Visual Analysis; Pathways; Network]
Underlying Infrastructure supports both views: Provides explicit treatment of
underlying data, databases, data integration, tools, operations, parameters, defaults,
wrappers, provenance, interconnectivity, access, etc.
Specific Observation 2: Dynamic and Interactive Data
Analysis
– Most current tools and methods are static
• Scientists want to interact with their data, to do interactive “what-if” experiments
• Scientists want access to dynamic, time-varying data, and
tools to help analyze those data
Specific Observation 3: Multiple Data Types
• Multiple data types. Scientists want to bring multiple types of
data into their analyses, to see patterns across multiple data
types, including networks, pathways, sequences, tabular data,
images, 3-D, text
Example: Using PubMed to Integrate Data about Relevant Literature
1. Click on PubMed link for selected term
2. Clips citing selected term
3. Obtain cited article by clicking PubMed icon
Reference--http://brainmaps.org
Specific Observation 4: Exploring links across different
visual representations
• Interactive “brushing” (a la ViVA). Support for color “painting” in
one visual representation that is reflected in other linked
representations
– e.g., painting a metabolic pathway with metabolomic and gene
expression magnitude
– e.g., using interactive visualization to identify genes of interest, and
have the relevant sub-cellular structures or pathways automatically
highlighted
– e.g., mark a pathway or tree-node and see gene expression for that
region
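The brushing behaviour described above can be sketched with a shared selection that every linked view subscribes to. This is a hypothetical minimal model, not ViVA's design; the class names, view names, and gene IDs are invented for illustration.

```python
from typing import Callable, Iterable, List, Set


class SharedSelection:
    """Selection state shared by all linked views."""

    def __init__(self) -> None:
        self._selected: Set[str] = set()
        self._listeners: List[Callable[[Set[str]], None]] = []

    def subscribe(self, listener: Callable[[Set[str]], None]) -> None:
        self._listeners.append(listener)

    def brush(self, gene_ids: Iterable[str]) -> None:
        """Replace the selection and notify every subscribed view."""
        self._selected = set(gene_ids)
        for listener in self._listeners:
            listener(set(self._selected))


class View:
    """A visual representation that highlights whatever is currently brushed."""

    def __init__(self, name: str, selection: SharedSelection) -> None:
        self.name = name
        self.highlighted: Set[str] = set()
        self._selection = selection
        selection.subscribe(self._on_selection)

    def _on_selection(self, gene_ids: Set[str]) -> None:
        self.highlighted = gene_ids

    def brush(self, gene_ids: Iterable[str]) -> None:
        """A brush made in this view goes through the shared selection."""
        self._selection.brush(gene_ids)


selection = SharedSelection()
scatterplot = View("scatterplot", selection)
pathway_map = View("pathway map", selection)
scatterplot.brush({"geneA", "geneB"})
```

Because views never talk to each other directly, a new representation can join the linked set by subscribing to the same selection.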
Specific Observation 5: Annotation and Provenance
• Capturing scientist-identified features
– When a scientist identifies a key pathway or relationship, the system should
allow him/her to capture that pattern, so that it can later be used for
communication, as a template for future analyses, or to search for
similar patterns in other data sets.
– The scientists want tools to help them keep track of where data came from,
what operations were done, and why
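Tracking where data came from, what was done, and why could be sketched as a history that travels with each data item. The record fields, the `TrackedData` class, and the file name below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


@dataclass
class ProvenanceRecord:
    operation: str              # what was done
    source: str                 # where the data came from
    parameters: Dict[str, Any]  # how it was done
    reason: str                 # why it was done
    timestamp: str


@dataclass
class TrackedData:
    value: Any
    source: str
    history: List[ProvenanceRecord] = field(default_factory=list)

    def apply(self, func: Callable[..., Any], reason: str,
              **parameters: Any) -> "TrackedData":
        """Apply func to the value and return new data with an extended history."""
        record = ProvenanceRecord(
            operation=func.__name__,
            source=self.source,
            parameters=parameters,
            reason=reason,
            timestamp=datetime.now(timezone.utc).isoformat(),
        )
        return TrackedData(func(self.value, **parameters),
                           self.source, self.history + [record])


def above_threshold(expression: Dict[str, float], threshold: float) -> Dict[str, float]:
    return {g: v for g, v in expression.items() if v > threshold}


# "array_experiment_1.csv" is a hypothetical source name.
raw = TrackedData({"geneA": 2.5, "geneB": 0.3}, source="array_experiment_1.csv")
filtered = raw.apply(above_threshold, reason="keep responsive genes", threshold=1.0)
```

Each call leaves the original item untouched, so earlier stages of an analysis remain available for back-tracking.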
Specific Observation 6: Need to Validate Tools and
Workflows
• How does a scientist know if a “pre-packaged” component or workflow is valid or
appropriate or computationally correct?
• One suggestion: use “Social Computing” methodologies
– The community provides ratings, comments and annotations that build knowledge
• Community of plant biologists and computational scientists
– For components: users can rate components, discuss their uses, register criticisms,
provide comparisons, suggest competing technologies
– For workflows: users can comment on their uses, suggest extensions, quibble with
particular choices, etc.
Specific Observation 7: Scalability and Extensibility
• The scientists need a system that will allow them to continually update, improve
and modify the work they do, keeping pace with
– Larger data sets
– Improved analytics
– Alternate methods
– New tools
– New data
– Joining different types of data
• Need for easy methods to substitute components, visualizations, methods,
analyses, data, etc.
• The iPLANT cyberinfrastructure has to be able to grow organically and flexibly
Implications for the CyberInfrastructure
• The CyberInfrastructure needs to be:
– Based on re-usable, composable components
– Extensible, able to support
• Updated components
• New data types
• New visual and analytics methods
• Iteration
• Leaps forward in scale, interactivity and dynamic data
– Easy to use for plant biologists, power users and developers
VisTrails - example workflow methodology
[Figure labels: Power-user workflow; Provenance and metadata; Conceptual workflow; Interactive visualizations]
• Visual programming interface for representing data and operations as workflows
• Loose coupling, using parameterizable Python wrappers
• Extensible, flexible, re-usable components and workflows
• Coupled with an attractive, flexible User Interface (to be developed)
• Supported by “plumbing” infrastructure that provides explicit treatment of underlying data,
databases, data integration, tools, interconnectivity, access, etc.
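The idea of loose coupling through parameterizable wrappers can be illustrated with a generic sketch. It does not use or imitate the VisTrails API; `ToolWrapper`, `top_expressed`, and the parameter names are hypothetical.

```python
from typing import Any, Callable, Dict, List


class ToolWrapper:
    """Expose an existing function through a uniform interface with defaults."""

    def __init__(self, name: str, func: Callable[..., Any],
                 defaults: Dict[str, Any]) -> None:
        self.name = name
        self._func = func
        self.defaults = dict(defaults)

    def run(self, data: Any, **overrides: Any) -> Any:
        """Casual users rely on the defaults; power users override parameters."""
        unknown = set(overrides) - set(self.defaults)
        if unknown:
            raise ValueError(
                f"unknown parameters for {self.name}: {sorted(unknown)}")
        params = {**self.defaults, **overrides}
        return self._func(data, **params)


# A stand-in for an existing analysis tool (hypothetical).
def top_expressed(expression: Dict[str, float], n: int,
                  min_value: float) -> List[str]:
    ranked = sorted(expression.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, value in ranked if value >= min_value][:n]


wrapper = ToolWrapper("top_expressed", top_expressed,
                      {"n": 2, "min_value": 0.0})
data = {"geneA": 2.5, "geneB": 0.3, "geneC": 1.7}
default_result = wrapper.run(data)        # pre-selected parameters
custom_result = wrapper.run(data, n=1)    # power-user override
```

The workflow engine only needs to know the wrapper's `run` method, so the wrapped tool can be replaced without touching the rest of the workflow.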
Multiple Levels
Conceptual Level: High-level templates for Plant Biologists and casual
users, with lots of defaults pre-selected
[Diagram labels: List of genes; Co-expression analysis; Network]
Power-User Level: Visibility into underlying workflows, with freedom to
select tools and data sources and program new operations
[Diagram labels: List of genes; Interactive Visual Analysis; Statistical analysis tool; Network]
Infrastructure Level: The explicit treatment of underlying data, databases,
data integration, tools, operations, parameters, defaults, wrappers, provenance,
interconnectivity, access, etc.
[Diagram labels: APIs; iPLANT Cyberinfrastructure]
[Architecture diagram; box labels:]
iPlant RIC
iP Visual Programming Interface
Predefined Visual Programs
Other iP Visualization Tools
Other iPlant Tools
iPlant GUI Application
Dataflow Engine and Component Set
Visualization Data Cache
iPlant Cyberinfrastructure
…
iPlant Resource (six boxes)
Visual Analytics Roadmap
• Stage 1: Explore workflow methodology with plant biologists, develop
sample workflows, and provide insights into requirements for the iPLANT
cyberinfrastructure (done)
• Stage 2: Use VisTrails to provide a prototype workflow for a real biology
problem using real data (1Q2010)
Simplified Omics Workflow for First Demo
[Workflow diagram; box labels:]
Experiments
Gene expression data
Metabolite Data
Interactive Visual & Statistical Analysis (e.g., ViVA, Co-expression analysis, PlantMetGenMap, Cytoscape/GeneMANIA)
• Integrated gene expression and metabolomic data
• Interactive visual and statistical analysis
• Explicit support for iterative what-if analysis
Visually identified genes and metabolites to map onto functional pathways
iterate
Visualize
Testable Hypotheses
Visually-identified enriched pathways
/rg/ber
Some Conclusions
• For plant biologists, the intellectual process is creative and
diverse; there is no one-size-fits-all solution
• There is a great hunger to
– Use existing tools
– Use components of existing tools
– Develop new tools
• For visual analysis, no set of tools will be comprehensive
– Pre-existing tools are very valuable
– Reusable visualization components will be important
– New tools, especially more abstract, general tools, will be invaluable
Review of Scientists’ Requirements
• Whatever the direction, we need to support
– Different users with different needs
– Dynamic and interactive data
– Multiple data types
– Interactive “brushing” or “painting” across all visual representations
– Annotation and provenance
– A methodology to validate our components and tools
– Scalability and extensibility
And most important, we need to…
Provide an analysis environment that will enable new science
– Enable scientists to see relationships across multiple types of data
(e.g., integrating gene expression and metabolomics data)
– Enable scientists to do what-if experiments, interactive exploration
of static and dynamic data, and to integrate modeling and
visualization capabilities
– Enable scientists to mark a region in one representation and see
the impact in all other linked representations
– Provide a system that is flexible and extensible, which will grow
organically as data grows in volume, and new data, tools, and
methodologies emerge
iPLANT should be the platform for the future of plant science, and
the choice for future plant scientists
Back-up
“Build” or “Buy” Continuum
[Continuum diagram, from “<= Time consuming/expensive” to “Quick/inexpensive/flexible =>”:]
Build from scratch, e.g., with OpenGL
Enhance existing software
Wrap components from existing software
Wrap existing applications
• Key Question: Do we build a fixed system, or do we explicitly design a system that welcomes
different applications (entire software packages) or components (e.g., visualizations,
algorithms)?
• The Workflow approach: Building a system that is extensible and evolutionary, able to
integrate new functions that we have not anticipated
– For the first release, identify the tasks and capabilities we want to support. Question: Who
decides?
– Understand the state of the art in commercial and academic systems. Identify candidate
applications or components to wrap.
– Identify gaps, since this will determine the need for new software (e.g., modifying a component to
accept plant genomes)
– Focus development effort on overall design goals (e.g., dynamic linking between representations,
translators, wrappers, etc.)
RG/BR
12/16/09
Alternate Simplified Demo
[Workflow diagram with steps numbered 1, 2, 3, 4, 6; box labels:]
Experimental exposure of plants to differing conditions
Responsive genes
Ecophysiological data
Paint identified genes onto pathways (e.g., MapMan)
Identification of homologs in reference species (e.g., CoGe)
Homologs in reference
Formulate Testable Hypotheses
Explore known features of reference homologs from the literature
Simplified from the Gene Expression from A Partially Sequenced Species workflow
/rg/ber
Other data sources to be Incorporated
• 1. Motifs from Regulatory Regions in Model Species
• 2. Cell-specific Expression
• 3. Pathways Wiki, place gene(s) of interest in established pathways.
• 4. Metabolites, incorporate information from Reactome
• 5. Literature, PubMed Assistant???
• 6. Displays of inferred regulatory networks, as in GeneMANIA.
• 7. Small RNAs and target genes.
Specific Requirements for Visual Analysis
• Multiple data types. Ability to see patterns across different types of data,
including networks, pathways, sequences, tabular data, images, 3-D, text.
• Multiple modes of interaction, including static visualizations, interactive “what-if”
visual analyses, multiple time slices, dynamic data.
• Interactive “brushing”. Support for color “painting” in one visual representation
that is reflected in other linked representations (e.g., using interactive
visualization to identify genes of interest, and have the relevant sub-cellular
structures or pathways automatically highlighted)
• Capturing scientist-identified features. When a scientist identifies a key pathway
or relationship, the system should allow him/her to capture that pattern, so that it
can later be used for communication, as a template for future analyses, or
to search for similar patterns in other data sets.