IPDPS Presentation of S4W - People at VT Computer Science

Modeling and Understanding
Stress Response Mechanisms
with Expresso
Ruth G. Alscher
Lenwood S. Heath
Naren Ramakrishnan
Virginia Tech, Blacksburg, VA 24061
ORNL Workshop on Genomics
Duke University
May 1, 2001
Who’s Who
Computer Science
Plant Biology
Ruth Alscher
Plant Stress
Virginia Tech
Boris Chevone
Plant Stress
Dawei Chen
Molecular Biology
Bioinformatics
Ron Sederoff, Ross Whetten,
Len van Zyl, Y-H. Sun
North Carolina State Univ.
Lenwood Heath (CS)
Algorithms
Virginia Tech
Naren Ramakrishnan (CS)
Data Mining
Problem Solving Environments
Craig Struble,
Vincent Jouenne (CS)
Image Analysis
Forest Biotechnology
Ina Hoeschele (DS)
Statistical Genetics
Keying Ye (STAT)
Bayesian Statistics
Statistics
Virginia Tech
People
Ron Sederoff
Ross Whetten
Y-H. Sun
Keying Ye
Lenny Heath
Craig Struble
Ruth Alscher
Len van Zyl
Boris Chevone
Dawei Chen
Vincent Jouenne
Naren Ramakrishnan
Overview
• Plant responses to environmental stress
• Stress on a chip
• Summary of results obtained
• Expresso
– Managing expression experiments
– Analyzing expression data
– Reaching conclusions
• Where we go from here
– Modeling experiments
– Modeling pathways
Plant-Environment Interactions
• Several defense systems that respond to
environmental stress are known.
• Their relative importance is not known.
• Mechanistic details are not known. Redox sensing
may be involved.
Scenarios for Effect of Abiotic
Stress on Plant Gene Expression
The 1999 Experiment: A Measure of
Long Term Adaptation to
Drought Stress
• Loblolly pine seedlings (two unrelated genotypes “C”
and “D”) were subjected to mild or severe drought
stress for four (mild) or three (severe) cycles.
– Mild stress: needles dried down to –10 bars; little
effect on growth, new flushes as in control trees.
– Severe stress: needles dried down to –17 bars;
growth retardation, fewer new flushes compared to
controls.
• Harvest RNA at the end of growing season,
determine patterns of gene expression on DNA
microarrays.
• With algorithms incorporated into Expresso, identify
genes and groups of genes involved in stress
responses.
Hypotheses
• There is a group of genes whose expression
confers resistance to drought stress.
• Expression of this group of genes is lower under
severe than under mild stress.
• Individual members of gene families show distinct
responses to drought stress.
Selection of cDNAs for Arrays
• 384 ESTs (xylem, shoot tip cDNAs of loblolly) were
chosen on the basis of function and grouped into
categories.
• Major emphasis was on processes known to be
stress responsive.
• In cases where more than one EST had similar
BLAST hits, all ESTs were used.
Categories within Protective and
Protected Processes
Gene
Expression
Signal
Transduction
Protease-associated
ROS and Stress
Environmental
Protective
Processes
Nucleus
Cell Wall Related
Trafficking
Phenylpropanoid
Pathway
Change
Development
Protected
Processes
Secretion
Cells
Cytoskeleton
Tissues
Plant Growth Regulation
Chloroplast Associated
Metabolism
Carbon Metabolism
Respiration and Nucleic Acids
Mitochondrion
A Note about Categories
• Categories are not mutually exclusive;
gene(s) may be assigned to more than one
category. For example, heat shock proteins
have been grouped under these different
categories and subcategories:
– Abiotic stress – heat
– Gene expression – post-translational processing –
chaperones
– Abiotic stress - chaperones
Abiotic
Biotic
Stress
Protective
Processes
Cell Wall Related
“Isoflavone
Reductases”
Antioxidant
Processes
Phenylpropanoid
Pathway
Categories
within
“Protective
Processes”
Drought
Dehydrins, Aquaporins
Heat
Non-Plant
Heat shock proteins
(Chaperones)
Xenobiotics
GSTs
Chaperones
NADPH/Ascorbate/
Glutathione
Scavenging Pathway
Sucrose Metabolism
Cellulose
Arabinogalactan proteins
Cytosolic
ascorbate
peroxidase
superoxide
dismutase-Fe
superoxide
dismutase-Cu-Zn
glutathione
reductase
Extensins and proline rich proteins
Hemicellulose
Pectins
Xylose
Other Cell Wall Proteins
Lignin Biosynthesis
isoflavone reductases
phenylalanine ammonia-lyases
S-adenosylmethionine decarboxylases
glycine hydroxymethyltransferases
4-coumarate-CoA
ligases
CCoAOMTs
cinnamyl-alcohol
dehydrogenase
Quality Control
• Positive: LP-3, a loblolly gene known to respond
positively to drought stress in loblolly pine, was
included.
LP-3 was positive in the moist versus mild
comparison, and unchanged in the moist versus
severe comparison.
• Negative: Four clones of human genes used as
negative controls in the Arabidopsis Functional
Genomics project were included. The clones did
not respond.
Drought
Abiotic
Biotic
ROS and Stress
Protective
Processes
Cell Wall Related
“Isoflavone
Reductases”
Antioxidant
Processes
Phenylpropanoid
Pathway
Dehydrins, Aquaporins
Heat
Heat shock proteins
Non-Plant
Xenobiotics GSTs
Cytosolic
ascorbate
Chaperones
peroxidase
NADPH/Ascorbate/
Glutathione
Scavenging Pathway
Sucrose Metabolism
Categories that
contained positives in
genotypes C and D
(Control versus Mild)
Cellulose
Data from two slides (4 arrays)
for C and two slides (4 arrays)
for D were collected.
Xylose
superoxide
dismutase-Fe
superoxide
dismutase-Cu-Zn
glutathione
reductase
Extensins, Arabinogalactan,
and Proline Rich Proteins
Hemicellulose
Pectins
Other Cell Wall Proteins
Lignin Biosynthesis
isoflavone reductases
phenylalanine ammonia-lyase
S-adenosylmethionine decarboxylase
glycine hydroxymethyltransferase
4-coumarate-CoA
ligase
CCoAOMT
cinnamyl-alcohol
dehydrogenase
Hypotheses versus Results
• Among the genes responding to mild stress,
there exists a population of genes whose
expression confers resistance.
– Genes in 69 categories responded positively to mild
stress in Genotypes C and D (the positive response
was not observed in the severe stress condition in
Genotype D).
• There is evidence for a response to drought
among genes associated with other stresses.
– Isoflavone reductase homologs and GSTs responded
positively to mild drought stress.
– These categories are previously documented to
respond to biotic stress and xenobiotics, respectively.
Relationships among HSP homologs
In control versus mild stress,
HSP 100, 70, and 23 responded in C and D;
HSP 80s did not respond in either C or D.
Candidate Categories
• Include
– Aquaporins
– Dehydrins
– Heat shock proteins/chaperones
• Exclude
– Isoflavone reductases
Experimental Design: Computational
and Statistical Issues
• Numerous sources of error in microarray
experiments: identify, control, and analyze
• Clones on a microarray need to be replicated and
randomly placed (Lee et al., PNAS 97, August 29,
2000, 9834-9)
• Differing results among replicates can indicate
sources of error; consistency gives confidence
Expresso: A Problem Solving
Environment (PSE) for Microarray
Experiment Design and Analysis
• Integration of design and procedures
• Integration of image analysis tools and statistical
analysis (via Perl scripts)
• Connections to web database and sequence
alignment tools
• The software Aleph was used for inductive logic
programming (ILP).
Expresso: A Microarray Experiment
Management System
Design of Microarrays I
• Selected 384 archived ESTs
• Organized into 4 microtitre source plates after PCR
• Pipetted into 8 sets of 4 randomized microtitre
plates; each set a different arrangement of the
384 ESTs
• Printed type A microarrays from first 4 sets (16
plates); printed type B microarrays from second 4
sets
• Each array type has 4 replicates of each EST,
randomly placed
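The plate randomization described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the 96-well plate size (384 ESTs over 4 plates), the clone names, and the seeded shuffle are all assumptions.

```python
import random
from collections import Counter

N_ESTS = 384          # from the slide
PLATES_PER_SET = 4    # from the slide
N_SETS = 8            # from the slide
WELLS_PER_PLATE = N_ESTS // PLATES_PER_SET  # 96 wells per plate: an assumption


def make_plate_sets(seed=0):
    """Return N_SETS arrangements, each a list of PLATES_PER_SET plates."""
    rng = random.Random(seed)
    ests = [f"EST{i:03d}" for i in range(N_ESTS)]  # hypothetical clone names
    plate_sets = []
    for _ in range(N_SETS):
        order = ests[:]
        rng.shuffle(order)  # a fresh random arrangement for every set
        plates = [order[p * WELLS_PER_PLATE:(p + 1) * WELLS_PER_PLATE]
                  for p in range(PLATES_PER_SET)]
        plate_sets.append(plates)
    return plate_sets


def replicate_counts(sets):
    """Count how many times each EST occurs across the given plate sets."""
    return Counter(est for plates in sets for plate in plates for est in plate)


plate_sets = make_plate_sets()
type_a = plate_sets[:4]   # first 4 sets (16 plates) feed the type A arrays
type_b = plate_sets[4:]   # second 4 sets (16 plates) feed the type B arrays
print(set(replicate_counts(type_a).values()))
print(set(replicate_counts(type_b).values()))
```

Because every set holds each EST exactly once, each array type built from 4 sets ends up with 4 replicates of every EST, which is the property the slide states.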
Design of Microarrays II
• Each slide contained 2 identical arrays (of type A
or B), 4 replicates of each EST per array
• Each slide, therefore, has a total of 8 replicates of
each EST
• A second slide also contained 2 arrays of the other
type, 4 replicates of each EST
• Total of 16 replicates of each EST for a 2 slide set
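The replicate arithmetic on this slide can be checked in a few lines. The constant names are illustrative only; the numbers come from the bullets above.

```python
REPLICATES_PER_ARRAY = 4   # replicates of each EST on one array
ARRAYS_PER_SLIDE = 2       # two identical arrays (type A or B) per slide
SLIDES_PER_SET = 2         # one type A slide plus one type B slide

per_slide = REPLICATES_PER_ARRAY * ARRAYS_PER_SLIDE
per_slide_set = per_slide * SLIDES_PER_SET
print(per_slide, per_slide_set)  # 8 16
```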
Spot and Clone Analysis
• Image Analysis: gridding, spot identification,
intensity and background calculation,
normalization
• Statistics: fold or ratio estimation, combining
replicates
• Higher-level Analysis: a slew of clustering
methods, inductive logic programming (ILP)
Analysis of Expression Data
• Microarray Suite: Manual grid; extract intensities
for each spot; compute ratios; compute calibrated
ratios
• Spot Statistics:
– Each calibrated ratio is the uncalibrated ratio divided by
the mean of all the uncalibrated ratios, so the
mean of the calibrated ratios is 1.0
– Our tools use the logarithm of each calibrated ratio
– Positive: expression increase
– Negative: expression decrease
– Zero: no change in expression
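A minimal sketch of the spot statistics above, assuming calibration means dividing each ratio by the mean ratio. The log base (2 here) and the example values are assumptions; the slides do not state them.

```python
import math


def calibrate(ratios):
    """Divide each ratio by the mean ratio so the calibrated mean is 1.0."""
    mean_ratio = sum(ratios) / len(ratios)
    return [r / mean_ratio for r in ratios]


def log_calibrated(ratios, base=2.0):
    """Log of each calibrated ratio; ratios must be strictly positive."""
    return [math.log(r, base) for r in calibrate(ratios)]


example_ratios = [0.5, 1.0, 1.5, 3.0]  # made-up values for illustration
for raw, logged in zip(example_ratios, log_calibrated(example_ratios)):
    # positive -> expression increase, negative -> decrease, zero -> no change
    print(raw, round(logged, 3))
```

Working on the log scale makes increases and decreases symmetric around zero, which is what the sign-based analysis on the following slides relies on.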
Analysis of Expression Data
• The multiple (typically 16) log calibrated ratios for
a replicated clone do NOT follow a normal
distribution.
• Distribution is spread relatively evenly over a large
range.
• Statistical analysis based on mean and standard
deviation will be overly pessimistic in identifying
clones that are up- or down-expressed.
• From the observation of an even spread of the log
ratios, we assume that a clone whose expression is
not different from a probe pair will show a
distribution centered at a mean log ratio of 0.0.
Computational Methods
(A Probabilistic Analysis)
• In a zero-centered distribution, the probability that
any particular log ratio is positive (or negative) is
0.5.
• The number of positive (or negative) log ratios
follows a binomial distribution with parameters 16
and 0.5.
• The probability of 12 or more positive log ratios (or 12
or more negative log ratios), out of 16, for a clone whose
expression was unaffected by drought stress is
0.0384064.
• A clone with 12 or more positive log ratios is up-expressed
with a probability of about 0.96.
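The quoted probability can be reproduced as the upper tail of a binomial distribution with parameters 16 and 0.5, that is, the chance of 12 or more positive log ratios. The function name is illustrative, not part of Expresso.

```python
from math import comb


def upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


p_value = upper_tail(16, 12)
print(round(p_value, 7))      # 0.0384064
print(round(1 - p_value, 2))  # 0.96
```

The tail sums the terms for 12 through 16 positives, which gives 2517/65536.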
Computational Methods
(Alternate Assumptions)
• Our more general assumption avoids the trap of
having to classify the response of each SPOT;
rather, we classify the response of an EST as one
of
– Up-regulated
– Down-regulated
– No clear change
• Response CLASSIFICATION rather than
QUANTIFICATION allows us to develop unified
relationships among genes and among treatments.
• Provides sufficient results for the use of inductive
logic programming (ILP).
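One way to picture the three-way classification is a sign count over the replicated log ratios of an EST. The threshold of 12 out of 16 is borrowed from the probabilistic analysis and is an assumption here, as are the function name and the example values.

```python
def classify_est(log_ratios, threshold=12):
    """Classify an EST from the signs of its replicated log ratios."""
    positives = sum(1 for x in log_ratios if x > 0)
    negatives = sum(1 for x in log_ratios if x < 0)
    if positives >= threshold:
        return "up-regulated"
    if negatives >= threshold:
        return "down-regulated"
    return "no clear change"


# Made-up replicate values, 16 per EST, for illustration only.
print(classify_est([0.3] * 13 + [-0.2] * 3))
print(classify_est([-0.1] * 12 + [0.4] * 4))
print(classify_est([0.1] * 8 + [-0.1] * 8))
```

Only the sign of each log ratio matters in this sketch, so no spot needs to be quantified individually.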
Related Statistical Results
• Chen et al. (J. Biomed. Optics 2, 1997, 364-374)
– Assume a normal distribution and normalize ratios
– No replicates
– Estimate a confidence interval for ratios that applies
to each spot
• Lee et al. (PNAS 97, August 29, 2000, 9834-9)
emphasize need for replication
• Black and Doerge (PNAS, to appear)
– Investigate distributional assumptions of log-normal
and gamma distributions on intensities
– Determine the number of replicates needed for a
particular confidence level under each distribution
– Assume that normalization and location-dependent
noise have been eliminated.
Clustering Techniques
Clustering
Conceptual Clustering
Attribute-Value Methods
Similarity-Metric
SVMs
SOMs
Agglomerative (bottom-up)
Divisive (top-down)
Inductive Logic Programming
• ILP is a data mining algorithm expressly designed
for inferring relationships.
• By expressing relationships as rules, it provides
new information and resultant testable
hypotheses.
• ILP groups related data and chooses in favor of
relationships having short descriptions.
• ILP can also flexibly incorporate a priori biological
knowledge (e.g., categories and alternate
classifications).
ILP subsumes
two forms of reasoning
• Unsupervised learning
– “Find clusters of genes that have similar/consistent
expression patterns”
• Supervised learning
– “Find a relationship between a priori functional
categories and gene expression”
• Hybrid reasoning
– “Is there a relationship between genes in a given
functional category and genes in a particular
expression cluster?”
– ILP mines this information in a single step
Rule Inference in ILP
• Infers rules relating gene expression levels to
categories, both within a probe pair and across
probe pairs, without explicit direction
• Example Rule:
[Rule 142] [Pos cover = 69 Neg cover = 3]
~level(A,moist_vs_severe,positive) :- level(A,moist_vs_mild,positive).
• Interpretation:
“If the moist versus mild stress comparison was
positive for some clone named A, it was
negative or unchanged in the moist versus
severe comparison for A, with a confidence of
95.8%.”
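The 95.8% figure is consistent with dividing the positive cover by the total cover of the rule. Whether Aleph reports confidence in exactly this way is an assumption; the sketch only shows that the arithmetic matches.

```python
def rule_confidence(pos_cover, neg_cover):
    """Fraction of covered examples that agree with the rule."""
    return pos_cover / (pos_cover + neg_cover)


confidence = rule_confidence(69, 3)  # covers reported for Rule 142
print(f"{confidence:.1%}")  # 95.8%
```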
More Rules we Obtained
• [Rule 6]
level(A,moist_vs_mild,positive) :- category(A, transport_protein).
level(A,mild_vs_severe,negative) :- category(A, transport_protein).
• [Rule 13]
level(A,moist_vs_mild,positive) :- category(A, heat).
• [Rule 17]
level(A,moist_vs_mild,positive) :- category(A, cellwallrelated).
ILP in a Data Mining Context
Clustering
Conceptual Clustering
Attribute-Value Methods
ILP combines the expressiveness
of conceptual clustering with
the efficiency of attribute-value
techniques.
Similarity-Metric
SVMs
SOMs
Agglomerative (bottom-up)
Divisive (top-down)
Current Status of Expresso
• Completely automated and integrated
– Statistical analysis
– Data mining
– Experiment capture in MEL
• Current Work: Integrating
– Image processing
– Querying by semi-structured views
– Automatic experiment composition
• Future Work
– Model-based design and management
– Randomized experiment layout with constraints
– Closing-the-loop
Future Directions
Next Generation Stress Chips
1. Time course, short and long term, to capture gene
expression events underlying “emergency” and
adaptive events following drought stress
imposition.
(Use all available ESTs for candidate stress
resistance genes.)
2. Generate cDNA library from stressed seedlings.
Screen for full-length clones. Repeat Step 1.
3. Initiate modeling of kinetics of drought stress
responses.
Expresso: Future Directions
• An open, integrated system for design, process,
analysis, data mining, data storage, and
integration of information from web-based
resources.
• Supports closing the experimental loop.
Accumulated results influence later experiments,
as well as enable construction of testable models
of pathways.
• Multiple models are refined and evaluated within
Expresso.
• Biologists have interactive access to models and
control Expresso’s components.