Correlating mRNA and protein Abundance

Transcript Correlating mRNA and protein Abundance

Correlating mRNA and protein abundance
via genomic and proteomic characteristics
Dov Greenbaum
Gerstein Lab
Thesis Seminar
April 21, 2004
outline
Why analyze mRNA and protein correlations
Background
Disparate Data Sources
Correlating mRNA and Protein
Results
Other analyses
Formalism – comparing genome, transcriptome and
proteome in terms of broad categories
New Data Sets
Analysis via Broad Categories
Analysis of factors affecting correlations
Another reason to expect correlations 
Expression and Protein Interactions
Why Correlate mRNA & Protein?
Experiments
5000
4500
4000
3500
3000
2500
2000
1500
1000
500
0
mRNA Protein
Both mRNA and Protein Levels are
necessary for complete analysis
Shown mathematically in Hatzimanikatis et al Biotechnology 1999
Combinations of RNA and protein detection approaches have recently aided in the
identification of biomarkers in cancer
Hegde et al Current Opinion in Biotech 2003
Relationship between mRNA and
Protein levels
dPi = k mRNA - k P
s;i *
i
d;i i
dt
where ks,i and kd,i are the protein
synthesis and degradation
rate constants, respectively,
ks;i * mRNAi
At steady state:
Pi =
kdi
Methods for determining mRNA expression
Each have Strengths and Weaknesses
Methods for determining protein
abundance
2DE Gel Electrophoresis
–
•
•
•
(Klose, 1975; O’Farrell, 1975)
Multiple staining options
Small dynamic range
limited in what it can detect
Methods for determining protein
abundance
ICAT
– ICAT reagent-- relative
levels
– VB dynamic range
– Cannot detect posttranslational modifications
– it require proteins to contain
cysteine residues, & these
residues must be in the
region of a peptide that is
produced during proteolytic
cleavage
MudPit
Really only HT that can
detect PT modifications
Other Methods for determining
protein abundance
DIGE
– e.g. Cy3 vs cy5
labeling
– Very big dynamic
range
2D-electrophoresis
Tap Tagging
Weissman & O’Shea
(Oct 2003)
Other Methods for determining
protein abundance
4000
3000
2000
1000
80000 60000 40000 20000
Max
Max Prot
Affy
Tap
MP
ICAT
DIG
2DE
0
TAP
MP
ICAT
DIG
2DE
0
Same mRNA levels yet protein data varied
> 20X
N ~100, r = 0.9
Protein Quantification via measurement of radioactivity
Gygi et al Molecular and Cellular Biology,1999.
Same mRNA levels yet protein data varied > 20X
Do some ORFs bias the results?
73 proteins (69%) R = 0.356
mRNA vs Protein
r = 0.74
Protein Quantification via
image analysis
Futcher et al Molecular and
Cellular Biology, 1999
Jury is out…
Gygi et al: “This study revealed that
transcript levels provide little
predictive value with respect to the
extent of protein expression.”
Futcher et al: “there is a good correlation
between protein abundance and mRNA
abundance for the proteins that we have
studied”.
mRNA vs Protein
Greenbaum et al Bioinformatics 2001
r =0.67
3 Genes in Lung Adenocarcinomas
Op18, Annexin IV, and GAPD
r = 0.025
Chen et al Molecular & Cellular
Proteomics, 2002.
murine hematopoietic precursor MPRO
change in expression 0 - 72 hr
murine hematopoietic precursor MPRO
change in expression 0 - 72 hr
R = 0.58
~ 80% of the genes are
located in the first and
third quadrants
Ratios of wt+gal to wt gal
ICAT vs microarray
N ~ 290, r = 0.6
Ideker et al Science, 2001
Yeast growth under two different media
r = 0.45 but almost 1.0 for same loci in same pathway
Washburn et al PNAS 2003
Integrating multiple sources of
Information
The challenge for computational biology
is to provide methodologies for
transforming high-throughput
heterogeneous data sets into
biological insights about the
underlying mechanisms. Although highthroughput assays provide a global
picture, the details are often noisy,
hence conclusions should be supported
by several types of observations.
Integration of data from assays that
examine cellular systems from different
viewpoints (for instance, gene
expression and protein-protein
interactions) can lead to a more
coherent reconstruction and reduce the
effects of noise.
Nir Friedman Science 2004
Sources of Data
Data set
mRNA
expression
Protein
abundance
Description
Size [ORFs]
Reference
Young
Gene chip profiles yeast cells with mutations that affect
transcription
5455
Holstege et al. (1998)
Church
Gene chip profiles of yeast cells under four different conditions
6263
Roth et al. (1998)
Samson
Comparing gene chip profiles for yeast cells subjected to
alkylating agent
6090
Jelinsky et al. (1998)
SAGE
Yeast cells during vegetative growth
3778
Velculescu et al. (1997)
Reference expression
Scaling and integrating the mRNA expression set into one data
source
6249
-
2-DE #1
Measurement of yeast protein abundance by two-dimensional
(2D) gel electrophoresis and mass spectrometry
156
2-DE #2
Similar to 2-DE set #1
Transposon
Large-scale fusions of yeast genes with lacZ by transposon
insertion
Reference abundance
Scaling and integrating the 2-DE data sets into one data
source
Annotated Localization
Subcellular localizations of yeast proteins
2133 (6280)
Drawid et al. (2000)
Transmem-brane
segments
Predicted transmembrane and soluble proteins in yeast
2710 (6280)
Gerstein (1998)
MIPS functions
Functional categories for yeast ORFs
3519 (6194)
Mewes et al. (2000)
GOR secondary
structure
Predicted secondary structure for yeast ORFs
71
1410
181
Gygi et al. (1999)
Futcher et al. (1999)
Ross-Macdonald et al.
(1999)
-
Annotation
6280
Gerstein (1998)
Reference mRNA Sets
Young
Church
Samson
SAGE
Fitting Protein Data
Original Set
mRNA vs Protein
Greenbaum et al Bioinformatics 2001
r =0.67
Outliers
below trendline
above trendline
(2STDEV from the mean)
ORF
YBR118W
YER065C
YMR303C
YOL086C
YJR009C
YGR192C
YJR104C
YML054C
YJL052W
YKR059W
YML008C
YFL022C
YJL008C
YPL160W
YOR361C
YCL030C
YNL209W
FUNCTION
translation elongation factor eEF1 alpha-A chain
Isocitrate Lyase
Alcohol dehydrogenase II
Alcohol dehydrogenase I
Glyceraldehyde-3-phosphate dehydrogenase 2
Glyceraldehyde-3-phosphate dehydrogenase 3
Copper-zinc superoxide dismutase
lactate dehydrogenase cytochrome b2
glyceraldehyde-3-phosphate dehydrogenase 1
Translation initiation factor
S-adenosyl-methionine delta-24-sterol-c-methyltransferase
Phenylalanine-- tRNA Ligase beta chain
Component of chaperonin-containing T-complex
leucine--tRNA ligase
translation initiation factor eIF3 subunit
phosphoribosyl-AMP cyclohydrolase
heat shock protein of HSP70 family
High Protein
Metabolism (1)
Energy(2)
MIPS
5,30
1,2, 30
1, 2, 30
1, 2, 30
1, 2, 30
1, 2, 30
11,30
1,2,30
1,2,30
5,30
1,30
5,30
6,30
5,30
3,5,30
1
5,30
Low Protein
Prot. Syn. (5)
Prot. Fate (6)
Later larger datasets concurred with these results
in that Generally…
10000000
1000000
protein
100000
10000
1000
100
10
1
0.1
1
10
100
mRNA
AA metabolism & Energy
are 2X as likely to have high
protein vs mRNA than the
general population
1000
Protein synthesis (~35% of all
protein synthesis genes) and Protein
fate (folding, modification,
destination) are more likely to have
low protein vs mRNA than the general
population
Non-Outliers Generally…
Tight Regulation by the cell
Only 3% of transcription associated genes (n =
441) have significantly uncorrelated mRNA and
protein levels (2STDEV from trendline)
Transcription Assoc. genes are 25% of the
essential genes in yeast.
Essential Genes as a group have higher
correlations than the general yeast population
7% of Cell Cycle associated genes (n = 432) have
significant non-correlation
Quick Summary
• Why correlate mRNA and protein levels?
• Merged Disparate Data Sets
– Distinct but complimentary
• Global Correlations
• Outliers are interesting:
– Metabolism & Energy Relatively high protein
levels
– Protein Synthesis & Protein Fate low protein
levels
Data Set Size
~170 ORFs
2 DE-gel datasets
~6,000 ORFs
5 Affymetrix GeneChips
+ SAGE data
~6,000 ORFs
Enrichments
(Feature, [v,S], [w,G]) =
(F,[v,S]) -(F,[w,G])
(F,[w,G])
V & W are weights (expression level)
of Sets S & G
Visual Formalism
~170 ORFs
~6,000 ORFs
Depletion of
Random Coil Secondary Structure  STABILITY
Concurrence with data from Perczel et al Chemistry 2003
Regarding stability of specific secondary structures
Enrichment of
Amino Acids 
STABILITY
Alanine’s, Glycines, Valines result in more compact structures
More compact = more stable (i.e. thermophilic enzymes tend to be very compact)
Enrichment of
Amino Acids
Simple story: translatome is
enriched in same way as
transcriptome
Enrichment of Molecular Weights/Biomass
Abundant proteins are smaller = reduces cost
Effect of
transcription
yeast cell favors the expression of shorter ORFs over longer ones
(as opposed to long lightweight ORFs – see MW of aa)
This selection is happening, for the most part at the transcriptome level
-------------------------------------------------------------------------------------------------Neg Correlation between ORF length and mRNA expression Jansen &
Gerstein 2000 (And to a lesser degree with Protein Abundance)
Enrichment of Molecular Weights/Biomass
Abundant proteins are smaller = reduces cost
Effect of
transcription
CONCURS with experimental results from Akashi, Genetics 2003
See also: Akashi,Genetics 1996 & Moriyama and Powell, NAR 1998
hypothesize that this trend exists in S. cerevisiae, D. melanogaster and E.
coli. (although probably not in C. elegans)
Enrichment of
Functional Categories
10000000
1000000
protein
100000
10000
1000
100
10
1
0.1
1
10
mRNA
100
1000
Depletion
Functional Categories
Transcription & Cell Growth
Molecular switches
Require only minimal expression
Enrichment of
localization - BIAS?
(Drawid & Gerstein. 2000),
Review
Formalism
Different gene sets b/c of limited data
Enrichments
concur with experimental results
Fitting Protein Data
Newer Set
Mudpit fit first into mRNA space
then inverse fit back into protein space
then each of the data sets is fit via least
squares onto the Aebersold data set
Aebersold
Aebersold
Futcher
Reference
Yates
Gygi
mRNA
125
Futcher
Reference
Yates
Gygi
mRNA
29
113
102
116
125
73
61
56
64
69
150
143
128
150
1436
785
1346
1504
1480
6250
Fitting Protein Data
Newer Set
Mudpit fit first into mRNA space
then inverse fit back into protein space
then each of the data sets is fit via least
squares onto the Aebersold data set
Aebersold
Aebersold
Futcher
Reference
Yates
Gygi
mRNA
125
Futcher
Reference
Yates
Gygi
mRNA
29
113
102
116
125
73
61
56
64
69
150
143
128
150
1436
785
1346
1504
1480
6250
Fitting Protein Data
Newer Set
Mudpit fit first into mRNA space
then inverse fit back into protein space
then each of the data sets is fit via least
squares onto the Aebersold data set
Aebersold
Aebersold
Futcher
Reference
Yates
Gygi
mRNA
125
Futcher
Reference
Yates
Gygi
mRNA
29
113
102
116
125
73
61
56
64
69
150
143
128
150
1436
785
1346
1504
1480
6250
Fitting Protein Data
Newer Set
Mudpit fit first into mRNA space
then inverse fit back into protein space
then each of the data sets is fit via least
squares onto the Aebersold data set
Aebersold
Aebersold
Futcher
Reference
Yates
Gygi
mRNA
125
Futcher
Reference
Yates
Gygi
mRNA
29
113
102
116
125
73
61
56
64
69
150
143
128
150
1436
785
1346
1504
1480
6250
Fitting Protein Data
Newer Set
Mudpit fit first into mRNA space
then inverse fit back into protein space
then each of the data sets is fit via least
squares onto the Aebersold data set
Aebersold
Aebersold
Futcher
Reference
Yates
Gygi
mRNA
125
Futcher
Reference
Yates
Gygi
mRNA
29
113
102
116
125

73
61
56
64
69

150
143
128
150
1436
785
1346
1504
1480
6250


Global Correlation
mRNA Set 6249 ORFs
Protein Set # 2 2 2DE sets & 2 Mudpit
~2000 ORFs
Protein Abundance
1000
100
10
MudPit (1)
MudPit (2)
2DE (1)
2DE (2)
R = 0.66
1
0.1
0.1
1
10
mRNA Expression
100
1000
Functional Categories
Cell Cycle (R=0.71)
1000
Reference Data (R=0.66)
Protein Abundance
Cell Rescue (R=0.45)
100
Co-regulated
proteins
10
1
0.1
0.1
1
10
mRNA Expression
100
Subcellular Localization
Mudpit does not have the 2DE biases
Subcellular Localization
Protein Abundance
100
Lack of correlation in
mitochondria Concurs
with experimental
results from
Ohlmeier S et al.
JBC 2004
10
Nucleolus (R=0.8)
1
Cell Periphery (R=0.74)
Reference Data (R=0.66)
Mitochondria (R=0.42)
0.1
0.1
1
10
mRNA Expression
100
Expression as a function of localization
is well correlated with protein levels
(latest data)
Membrane
r = 0.73
Bud
r =0.76
r global = 0.46
Nucleus
r = 0.49
P
ER
r = 0.61
M
Cytoplasm
r = 0.50
Cell Wall
r =0.52
Mitochondria
r = 0.50
Extracellular
r = 0.33
Golgi
r = 0.28
Endosome
r = 0.87
Why would we not find strong
correlations?
Post translational modifications
Protein degradation
Error and Bias
Ribosomal Occupancy
Arava et al. (2003) Proc. Natl. Acad. Sci. USA
1
0.9
0.8
Occupancy
CAI
Coefficient
of Variation
Top
Top
Correlation
0.7
0.6
Top
0.5
0.4
0.3
Bottom
Our results concurred with Bottom
experimental findings by Brown and
Herschlag’s groups:
0.2
0.1
Bottom
0
Ribosomal Occupancy
Top Frac.
Bot. Frac.
0.78
0.30
Moreover:
mRNAs not associated with any
polysomes have even less of a
correlation r = 0.2
 v. strong translational control
1
0.9
Coefficient of Variation
Top
0.8
Correlation
0.7
0.6
0.5
0.4
0.3
0.2
Bottom
0.1
0
mRNA Expression Variability
Top Frac.
Bot. Frac.
0.89
0.20
mRNA expression
Variability of mRNA expression
time
1
0.9
Coefficient of Variation
Top
0.8
Correlation
0.7
0.6
0.5
0.4
0.3
0.2
Bottom
0.1
0
mRNA Expression Variability
Top Frac.
Bot. Frac.
0.89
0.20
mRNA expression
Variability of mRNA expression
time
Codon Adaptation Index
0.6
CAI
0.5
Top
Correlation
0.4
0.3
0.2
0.1
Bottom
0
Codon Usage
Top Frac.
Bot. Frac.
0.48
0.02
Concurs with experimental data: CAI does not
Predict mRNA and protein the same way
shown to be the result of different levels of
degredation
Another summary
Newer, larger data set
Looking at Broad Catagories
I
Post translational modifications?
where we expect PT control --> low r. Where we don’t expect -->
high r
Occupancy
Variability
II Protein Degradation?
CAI
III Experimental Error?
next section
Expression and interactions
Types of protein-protein interactions
– Protein complexes
• For example: proteasome, ribosome
– Aggregated interactions
• Yeast two-hybrid (Y2H)
• Genetic/physical interactions from MIPS
Relationship of P-P-interactions to abs.
expression level
Dij 
Ei  E j
Ei  E
similar protein results
Protein-Protein
Interactions & Expression
Correlations
Cell Cycle
CDC28 expt. (Davis)
Sets of interactions
(all pairs,
control)
Pairwise interactions
(from MIPS)
(Uetz et al.)
between selected expression timecourses
(strong interactions in permanent complexes, clearly diff.)
Protein-Protein Interactions &
Expression Correlations
Cell Cycle
CDC28 expt. (Davis)
Sets of interactions
(all pairs,
control)
Pairwise interactions
(from MIPS)
(Uetz et al.)
between selected expression timecourses
(strong interactions in permanent complexes, clearly diff.)
Permanent vs. Transient
Complexes
1
0.8
Rosetta
0.6
transient
Permanent
.
L Ribosome
0.4
S Ribosome
0.2
SAGA
0
-0.2
-0.2
0
0.2
0.4
0.6
CC
0.8
1
1.2
correlation
ORC2
ORC6
ORC5
ORC4
ORC3
ORC1
DPB3
CDC45
DPB2
CDC2
CDC7
POL2
HYS2
POL32
DBF4
MCM3
MCM6
CDC47
MCM2
CDC46
CDC54
Representing Expression Correlations within a Large
Complex in a Matrix
MCM3
MCM6
CDC47
MCM2
CDC46
CDC54
DPB3
CDC45
DPB2
CDC2
CDC7
POL2
HYS2
POL32
DBF4
ORC2
ORC6
ORC5
ORC4
ORC3
ORC1
Permanent? Transient?
correlation
L7/L12
Cell degrades all excess riboosmal proteins, except
L7 & L12
correlation
Expression Correlations Segment Large
Replication Complex into Component Parts
MCM3
MCM6
CDC47
MCM2
CDC46
CDC54
Temporally transient
MCMs
prots.
Polym.
d&e
ORC
DPB3
CDC45
DPB2
CDC2
CDC7
POL2
HYS2
POL32
DBF4
ORC2
ORC6
ORC5
ORC4
ORC3
ORC1
Proteasome
No distinction visible between
components
Proteasome
Overall .43
20S .50
19S .51
indicative of the possibility that
the two components are
really one?
Division is an artifact of their
discovery—M Hochstrasser
%ORFs in complexes with
significant correlation
Complex (> 2 ORFS, P < 0.001)
n
alpha
Cdc15
Cdc28
Rosetta
Alpha, al-treh. anchor (50)
4
Cacinerum B (100)
3
67%
67%
Chaperone containing T-complex TRiC (130)
8
50%
25%
Pho85p (133.20)
6
Glycine decarboxylase (200)
3
67%
ATPase (210)
4
100%
TRAPP (260.60)
10
Vps4p ATPase (260.70)
3
Nucleosome protein (320).
8
Cytochrome bc1 complex (420.30)
9
Cytochrome c oxidase (420.40)
8
F0/F1 ATP synthase (complex V)(420.5)
75%
33%
50%
40%
67%
100%
50%
87%
37%
75%
44%
78%
78%
38%
88%
50%
15
60%
Ribonucleoside reductase (430)
4
Nuclear processing (440.10.10)
5
RNA polymerase I (510.10)
8
38%
RNA polymerase II (510.40.10)
9
44%
Tornow & Mewes NAR 2003
75%
50%
40%
38%
50%
Average Expression of all
subnunits in a complex
mRNA expression (x103 )
1
10
100
10000000
1.0635
y = 3028.4x
2
R = 0.6076
1000000
100000
10000
1000
100
10
1
protein abundance
0.1
PP INT Summary
Complexes
broad catagories minimize noise
– Permanent complexes show strong co-expression
Posttranscriptional regulation functions at a whole complex
level (Washburn et al PNAS 2003)
– Transient complexes have weaker co-expression
Aggregated BINARY interactions (Y2H, physical, genetic)
Weak co-expression similar to transient complexes --noisy data?
ERROR ?  minimized in larger groups
Global Summary
mRNA expression is related to protein
abundance
Broad categories minimize noise that
prevents us from seeing this correlation
Integrating various genomic data is integral
to an analysis
Biologically relevant results can be seen
when looking at mRNA and protein
populations
Future Research
Further indepth analysis into protein
degredation
Integrate new Tap Tagging data into protein
abundance ref set
More intensive modeling of the relationship
between mRNA and protein
Relationship between mRNA and
Protein levels
dPi = k mRNA - k P
s;i *
i
d;i i
dt
where ks,i and kd,i are the protein
synthesis and degradation
rate constants, respectively, and
 is the growth rate
ks;i * mRNAi
At steady state:
Pi =
kdi
N end rule
PEST?
N End Rule in Yeast
Met
Pro
Val
Gly
Thr
Ser
Ala
Cys
AA
Glu
Ile
Tyr
Gln
Asp
His
Asn
Fast Decay
Slow Decay
Trp
Leu
Phe
Lys
Arg
1
10
100
In Vivo Hallf Life (Min)
1000
10000
Results of protein degredation
Significantly higher correlation for fast
decaying proteins
Not for slow decay
high decay rate is indicative of greater
cellular control over level e.g. proteins
with half lives of days – cell can’t tightly
control
Results are same for mRNA degredation -half lives have been quantified
Acknowledgments
Gerstein Lab
Weissman Lab
Zheng Lian
This work
Ronald Jansen (MSKCC)
Yuval Kluger (NYU)
Keck
(HHMI Biopolymer Laboratory and W. M. Keck
Other Projects
Haiyuan Yu
Hedi Hegyi
Jimmy Lin
Rajdeep Das
Jiang Qian
Nick Luscombe
Entire Gerstein Lab
Foundation Biotechnology Resource Laboratory)
Christopher Colangelo
Ken Williams
Thesis Committee
Mark Gerstein
Sherman Weissman
Kevin White
Genetics Department
SABRINA
Liana

Correlating mRNA and protein Abundance

Transcript Correlating mRNA and protein Abundance

Directory