Modeling and Associated Visualization Needs

A Trilogy in Four Parts
The Acts:
Not in Chronological Order
• Overview of the G2P cyberinfrastructure
• Systems biology models (bottom up)
– Viz needs: Multivariate Dynamics, Inner Space, &
Sensitivity Analysis
• Ecophysiological models (top down)
– Viz needs: The same, plus Outer Space
• Statistical models (non-mechanistic)
– Viz needs: Help & fast!!
Solving the G2P problem means
developing a methodology…
…that lets one start with some species & trait that
one knows very little about and end with the
ability to quantitatively predict trait scores for
target genotype/environment combinations.
[Figure: diagram leading from Ignorance to Prediction. Labels: acquire data, elicit hypotheses, build quantitative models, testing, tools.]
To work, such a methodology must be cyber-enabled
[Figure: G2P cyberinfrastructure. Roles: super-user, developer. DI components for metabolic data, whole plant data, environment data, expression data, and seq data feed modeling and statistical inference, which connects to the user-inferred hypothesis and experiment and to visualization.]
Systems Biology Models
Modeling a single gene
• Temperature
• Influx: controlled by the amounts of upstream regulatory gene products
• M: amount of the gene product at time t
• Efflux: some fraction of M degrades per unit time

\[
\text{Rate} = \frac{\text{Change in amount}}{\text{unit time}} = \frac{\text{Influx amount}}{\text{unit time}} - \frac{\text{Efflux amount}}{\text{unit time}}
\]
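The influx-minus-efflux balance for a single gene can be sketched numerically. This is a minimal illustration, assuming a constant influx and a constant degradation fraction; the function name `simulate_gene`, its parameters, and the numbers are hypothetical and do not come from the slides.

```python
def simulate_gene(influx, decay_frac, m0=0.0, dt=0.01, t_end=50.0):
    """Integrate dM/dt = influx - decay_frac * M with forward Euler steps.

    influx     : amount of product made per unit time (assumed constant here)
    decay_frac : fraction of M that degrades per unit time
    """
    m = m0
    trajectory = [m]
    steps = int(round(t_end / dt))
    for _ in range(steps):
        # Change in amount = (influx - efflux) * time step
        m += dt * (influx - decay_frac * m)
        trajectory.append(m)
    return trajectory


# Hypothetical parameter values, chosen only for illustration.
traj = simulate_gene(influx=2.0, decay_frac=0.5)

# At steady state influx equals efflux, so M settles at influx / decay_frac.
steady_state = 2.0 / 0.5
print(f"final M = {traj[-1]:.4f}, expected steady state = {steady_state:.4f}")
```

In a real model the influx would depend on the upstream regulatory gene products rather than being a constant.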
Linking multiple genes…
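[Figure: two linked genes on the DNA. RNAP at the promoter region of gene “A” transcribes the “A” gene codons; translation (protein synthesis) yields a transcription factor, which acts at the promoter region of gene “B” (“B” gene codons, RNAP).]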
A “Bathtub” Model
• Temperature modulates all rates
• Other gene products affect degradation
• Transcription factors modulate reading
What is a “product”?
• RNAs: messenger (mRNA) & otherwise
• Some models do not distinguish mRNA & protein (e.g., when time scales are long)
• Some models individually represent mRNA, cytosolic protein, and nuclear protein
• Some models will separate products by tissue/organ (e.g., leaves, phloem, meristem)
• Many models include metabolites & protein complexes
• Basic equation is still the same (influx − efflux)

\[
\text{Rate} = \frac{\text{Change in amount}}{\text{unit time}} = \frac{\text{Influx amount}}{\text{unit time}} - \frac{\text{Efflux amount}}{\text{unit time}}
\]
[Figure: candidate functional forms for the terms of dM/dt, each plotted against its input: linear, constant fraction, enhancer, repressor, activation, Hill function, Michaelis-Menten (of the form M/(1 + M)), mass action, etc.]
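Several of the functional forms named on this slide have standard textbook definitions, sketched below. The function names, parameter names, and default values are assumptions made for illustration; the slides do not give the exact parameterization used in any particular model.

```python
def linear(x, k=1.0):
    """Rate proportional to the input."""
    return k * x


def michaelis_menten(x, vmax=1.0, km=1.0):
    """Saturating rate; with vmax = km = 1 this is x / (1 + x)."""
    return vmax * x / (km + x)


def hill_activation(x, vmax=1.0, k=1.0, n=2.0):
    """Sigmoidal activation by an enhancer at level x (Hill coefficient n)."""
    return vmax * x**n / (k**n + x**n)


def hill_repression(x, vmax=1.0, k=1.0, n=2.0):
    """Sigmoidal repression: rate falls as the repressor level x rises."""
    return vmax * k**n / (k**n + x**n)


def mass_action(a, b, k=1.0):
    """Rate proportional to the product of two reactant amounts."""
    return k * a * b


for level in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(level, michaelis_menten(level), hill_activation(level), hill_repression(level))
```

With the default parameters, Hill activation and Hill repression sum to 1 at every input level, which makes the enhancer/repressor symmetry easy to see.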
Temperature effects
One form of temperature effect
[Figure: protein folding in the endoplasmic reticulum, next to the nucleus, at 32.0 C and at 39.5 C. A chaperone folds and quality-checks (QC) the protein; folded protein is packaged to go, bad protein is unreleased.]
(Abstracted from Ellgaard et al. 1999)
Temperature effects
[Figure: functional forms for the terms of dM/dt, each plotted against its input (linear, constant fraction, enhancer, repressor, activation, Hill function, Michaelis-Menten of the form M/(1 + M), mass action, etc.), with a rate factor R applied to dM/dt.]
A close up – the diurnal clock
[Figure: annotated clock-model equations for mRNA and protein, written as influx − efflux. Term annotations: Hill function, Michaelis-Menten, environmental effect (light), mass action, translation, net transport into nucleus.]
Locke et al., 2005 - 9 of 13 equations
Barak et al., 2000
(S. Brady)
[Figure labels: flowering time prediction; plant growth & metabolism; photosynthesis biochemistry; root development; soil conditions (water stress).]
Sensitivity Analysis & Sloppy Systems
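[Figure: flowering-time gene network leading to flowering. Pathways: photoperiod, autonomous, vernalization, gibberellin. Inputs: light, low temperature. Genes: TOC1, PHYB, LHY, CCA1, CRY2, GI, FPA, FVE, FCA, CO, FLC, FT, SOC1, AP1, LFY. Genes carry sensitivity letters from A to E (including C-). One gene is annotated as nearly nonfunctional in the Landsberg erecta strain.]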
Adapted from various literature
Z. Dong, 2003.
Each letter is a power of two in sensitivity
Stiff & Sloppy Directions
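[Figure: elongated ellipse in the plane of Parameter 1 versus Parameter 2, centered on the optimum goodness-of-fit. The long axis is the “sloppy” direction and the short axis is the “stiff” direction; sloppy/stiff ca. 1000.]
All parameter combinations inside this ellipse yield essentially identical goodness-of-fit values.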
The “ellipses” may be “hyper-pancakes” with 15 to 30 sloppy
directions. How can these be meaningfully visualized??
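One common way to find stiff and sloppy directions is to take the eigenvectors of the Hessian of the goodness-of-fit function at the optimum: large eigenvalues mark stiff directions, small ones mark sloppy directions. The sketch below does this for two parameters. The Hessian values are hypothetical, and reading “Sloppy/Stiff ca. 1000” as a ratio of ellipse axis lengths is an assumption.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, d]].

    Returns ((larger eigenvalue, its eigenvector), (smaller eigenvalue, its eigenvector)).
    """
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam_big, lam_small = mean + radius, mean - radius
    # Rotation angle of the eigenvector that belongs to the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    v_big = (math.cos(theta), math.sin(theta))
    v_small = (-math.sin(theta), math.cos(theta))
    return (lam_big, v_big), (lam_small, v_small)


# Hypothetical Hessian of the cost at the optimum (values chosen for illustration).
H = (500000.5, 499999.5, 500000.5)

(stiff_val, stiff_dir), (sloppy_val, sloppy_dir) = eig_sym_2x2(*H)

# A contour of constant cost has semi-axes proportional to 1 / sqrt(eigenvalue),
# so the sloppy axis is longer than the stiff axis by this factor.
axis_ratio = math.sqrt(stiff_val / sloppy_val)

print("stiff direction :", stiff_dir)
print("sloppy direction:", sloppy_dir)
print("sloppy/stiff axis length ratio:", axis_ratio)
```

With many parameters the same idea applies to the full Hessian, and the eigenvalue spectrum itself (typically spread over many decades) is one candidate for visualization.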
Sloppy directions in a clock model
[Figure: cytosolic 'Y' protein (GIGANTEA?), 0.000 to 0.024, versus hours after sunset, 0 to 120. Curves: Locke et al. and simplified.]
71 parameters reduced to 46 parameters
Ecophysiological Models…
• …come in three flavors
– Environmental physics models (1945 to present)
– Crop simulation models (1965 to present)
– Geochemical cycling models
  • Blend the characteristics of both of the above
  • Are more recent
• …are now poised to contribute to the G2P problem via a top-down approach
What is the focus of models in Environmental Physics?
• Mimics conditions inside a uniform plant canopy;
• The typical setting is an agricultural field;
– Includes plant-related, edaphic (soil), and meteorological inputs;
• Based on physical principles;
– Conservation of matter and energy; conduction and convection;
– Some plant processes: gas exchange, photosynthesis, respiration
• Plant structure consists of leaves, stems, roots;
• Time horizon typically a few days with time steps on the order of minutes.
• Ergo, plants often do not grow
Environmental Physics Models: 1945-75
• 1D or Bulk approach;
• Big Leaf / Big Root submodels;
• Bucket soil submodels;
• Resistance analogs used for the atmospheric environment;
• Limited prediction of soil or canopy scalar variables;
• Many empirical relationships;
• Nebulous controlling variables (e.g., canopy resistance to vapor flux);
• Poor plant/environment feedback.
[Figure: bulk model schematic. Atmosphere (TAIR, VPD) above a Big Leaf (TLEAF) and a Big Root in a Bucket of Soil (TSOIL).]
Environmental Physics Models: 1975-90
• Multi-layer atmosphere, soil, and canopy;
• “Scaled leaf” approach within canopy layers;
• Relationships between photosynthesis, transpiration, and biophysics (e.g., stomatal action);
• Use finite difference methods to compute soil heat, water, and gas flows;
• Incorporate root density functions and soil physical properties.
[Figure: multi-layer model schematic. Atmosphere layers: TAIR, VPD, CO2, and wind speed profiles. Canopy layers (sunlit and shade): TCANOPY, VPD, CO2, and canopy profiles. Soil layers with rooting profile: TSOIL and soil water profiles.]
What is a Crop Growth Model?
• Mimics one “average plant” at a field or smaller scale;
• The plant environment is an agricultural production setting;
– Includes cultural- and production-related I/O variables;
– Includes varietal, edaphic, and meteorological inputs;
• Based on physiological processes;
– Photosynthesis, respiration, transpiration, nutrient uptake, carbon
partitioning, growth, and phenological development;
• Plant structure consists of leaves, stems, roots, & grain;
• Annual time horizon with daily or hourly time steps.
What is the current status of Crop Growth Models?
• Skillful models can account for ca. 70% of yield variance;
• Ongoing work focuses on refinement and applications;
– Problems being researched include methods for estimating cultivar
and soil characteristics on an operational scale;
• Model structures and approaches have matured;
• Recent physical theory may not be emphasized;
• Physical theory does not seem to improve predictions.
Interestingly, incorporating crop growth model components into
physical models does not guarantee improved predictability
either, even though physical scientists recognize knowledge of
the plant as limiting.
Special case: geochemical cycling models
• Used to model “ecosystem services” and/or “land surface processes” inside general circulation models
• Blend of both kinds of models;
• Includes plant-related, edaphic, and meteorological inputs;
• Based on physical principles
– Conservation of matter and energy; conduction and convection;
– Some plant processes: gas exchange, photosynthesis, respiration
• Plant structure consists of leaves, stems, roots;
• Time horizon of years with time steps on the order of minutes (depends on spatial scale).
Main points --
• Neither current crop growth models nor
environmental physics models adequately
depict plant process control mechanisms;
• This accounts for the failure of models to
mimic the plasticity of real plants across
different environments;
• The information needed to remedy this
situation is emerging from the genomic
sciences;
• Incorporating this information requires a
reorganization of crop models
New Crop Growth Model Concept
[Figure: a control submodel with sensors, coupled to a physical submodel. Inputs: energy, water, N; exchanged variables include T. Labels: [CPAI], [KE60].]
Viz needs for ecophysiological
models and G2P components
• Largely the same as for systems biology
models – multivariate dynamics in spatially
discrete plant parts
• Note that our “G2P solution” specifies
predicting trait scores in non-constant
environments.
– That most directly refers to the outdoors
– Therefore geographic variation must also be
considered
A hazy shade of winter…
• One frame of a movie comparing the standard deviation
of flowering time for the Columbia strain of A. thaliana
germinating on each day.
• Projected by the gene-based model of Wilczek et al,
2009.
• The standard deviation is over five years (left, 2004-2009, real data; right, 2094-2099, A1B climate scenario).
Statistical genetic methods I
• Can be used to
– Predict phenotypes based on genotypes
– Locate regions of the genome likely to contain genes
controlling particular phenotypes
• Can be used when
– Knowledge of gene mechanisms is lacking
• Big Caveat
– The mathematical form of the G2P relationship is just
assumed to be linear
– … and the data & models elaborated until the job
gets done to adequate accuracy
Statistical genetic methods II
• Why does it work?
– Because there are sufficient regimes of near
linearity buried in mechanistic network eq’ns
that general linear statistical models have
levels of predictive skill useful for some
purposes (e.g. crop breeding)
– Rest assured that there are limits to what
should be expected of these models
• How does it work?
What are genetic markers?
[Figure: aligned DNA sequences of 25 different genetic lines, plotted by position within the gene, with a single nucleotide polymorphism (SNP) marked.]
(Data from the Purugganan Lab)
Different sibling lines will have
different marker combinations
The DNA sequence for line 1 has the same
sequence as parent “B” at the location of
marker “g17286”…
…but in line 8 the DNA matches parent “A” at
that location
Many different linear models
\[
\text{Pheno}_m = \mu + \beta_1 X_{m,\text{ve001}} + \beta_2 X_{m,\text{T1G11a}} + \dots + \beta_{150000} X_{m,\text{last\_marker}}
\]
Genome Wide Association
Finding quantitative trait loci (QTL)
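Find markers i, j, and k such that
\[
\text{Pheno}_m = \mu + \beta_i X_{m,i} + \beta_{kj} X_{m,k} X_{m,j} + \text{other terms}
\]
is a good fit, where
\[
X_{m,n} = \begin{cases} 1 & \text{if marker } n \text{ is from parent A in line } m \\ 0 & \text{if marker } n \text{ is from parent B in line } m \end{cases}
\]
etc.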
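The single-marker part of such a model can be sketched as a scan: fit one marker at a time by least squares and record how much phenotypic variance it explains. This is a toy illustration only. The names `mu` and `beta_n` for the intercept and the marker effect, the function `single_marker_scan`, and the eight-line data set are all assumptions, not taken from the slides or from any QTL package.

```python
def single_marker_scan(genotypes, phenotypes):
    """For each marker n, fit Pheno_m = mu + beta_n * X[m][n] by least squares.

    genotypes  : list of lines, each a list of 0/1 marker codes (1 = parent A)
    phenotypes : list of trait scores, one per line
    Returns a list of (beta_n, r_squared), one entry per marker.
    """
    n_lines = len(phenotypes)
    mean_y = sum(phenotypes) / n_lines
    ss_tot = sum((y - mean_y) ** 2 for y in phenotypes)
    results = []
    for n in range(len(genotypes[0])):
        x = [row[n] for row in genotypes]
        mean_x = sum(x) / n_lines
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, phenotypes))
        if sxx == 0.0 or ss_tot == 0.0:
            results.append((0.0, 0.0))  # monomorphic marker or constant trait
            continue
        beta = sxy / sxx
        r2 = (sxy * sxy) / (sxx * ss_tot)
        results.append((beta, r2))
    return results


# Toy data: 8 lines, 3 markers; only marker 0 affects the trait.
genotypes = [
    [1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0],
    [1, 0, 0], [1, 1, 1], [0, 0, 0], [0, 1, 1],
]
phenotypes = [10.0 + 3.0 * row[0] for row in genotypes]

scan = single_marker_scan(genotypes, phenotypes)
best = max(range(len(scan)), key=lambda n: scan[n][1])
print("per-marker (beta, R^2):", scan)
print("best marker index:", best)
```

Real analyses add interaction terms, account for linkage between neighboring markers, and use significance thresholds rather than raw R².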
What a QTL analysis output looks like.
This is a “1d-scan” – i.e. $X_{m,j}$
(Buckler et al, Science, 2009)
Two Stat Inf Viz Problems
• Higher order scans, e.g. $\beta_{kjl} X_{m,k} X_{m,j} X_{m,l}$
– Remember SNP numbers can be in the 150K to 3M range.
• eQTL viz problems
– Can be 30K phenotypes…
– …and higher order scans
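The scale of the problem follows from simple counting: an order-k scan over N markers has "N choose k" candidate terms. The short sketch below tabulates this for the marker counts quoted above; the function name `n_scan_terms` is hypothetical. For 150K markers it gives about 1.1 × 10^10 pairs and about 5.6 × 10^14 triples, per phenotype.

```python
import math


def n_scan_terms(n_markers, order):
    """Number of distinct marker combinations in a scan of the given order."""
    return math.comb(n_markers, order)


for n_markers in (150_000, 3_000_000):
    for order in (1, 2, 3):
        terms = n_scan_terms(n_markers, order)
        print(f"{n_markers:>9,} markers, order {order}: {terms:.3e} terms")
```

For eQTL work each count is multiplied again by the number of expression phenotypes, which can be around 30K.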
eQTL Analysis – Looking for Regulators
[Figure: two linked genes on the DNA. The transcription factor produced from the “A” gene codons (RNAP, promoter region, protein synthesis) acts at the promoter region of gene “B” (“B” gene codons, RNAP).]
Let “Pheno” be the amount of mRNA (expression) produced by gene “B”.
This could be different in lines that varied either in the promoter of “B” or
in lines that had differences in the coding region of gene “A”. These are
called “cis” and “trans” effects, respectively.
Massive eQTL Variation
75% of all genes have at least 1 eQTL
[Figure: position of eQTL for each of 15,771 genes, arranged by physical order, across chromosomes I to V. QTL effect shown as Bay + or Bay -. The cis diagonal and a trans hotspot are marked. (D. Kliebenstein)]
eQTL Viz Problems…
How to plot interaction effects?
That is, $X_{m,j} X_{m,k}$
and a gazillion phenotypes
Questions?
Virtual soybean simulations from Han et al. 2007