Proteomics: A Challenge for Technology and
Information Science
CBCB Seminar, November 21, 2005
Tim Griffin
Dept. Biochemistry, Molecular Biology and Biophysics
[email protected]
What is proteomics?
“Proteomics includes not only the identification
and quantification of proteins, but also the
determination of their localization, modifications,
interactions, activities, and, ultimately, their
function.”
-Stan Fields in Science, 2001.
Genomics vs. Proteomics
Similarities:
Large datasets, tools needed for annotation and
interpretation of results
Differences:
Genomics – generally mature technologies and data processing
methods; questions asked usually involve quantitative changes
in RNA transcripts (microarrays)
Proteomics – still evolving; complexity of protein biochemical
properties (expression changes, modifications, interactions,
activities) means many questions to ask and data to interpret;
methods changing, different approaches (mass spec, arrays,
etc.)
Genomics, Proteomics, and Systems Biology
[Figure: overview diagram. Genomics: genomic DNA, mRNA (sequencing, arrays). Proteomics: protein products, functional protein (catalytic activity, subcellular location, protein modifications, 3D structure, protein dynamics), with approaches including protein cataloguing, quantitative profiling, protein phosphorylation, and descriptive protein interaction maps. Computational biology. Technologies marked as mature, prototype, or emerging. System level: identify system components; measure and define properties; interactions between components.]
“Shotgun” identification of proteins in mixtures by LC-MS/MS
Liquid chromatography coupled to tandem mass spectrometry (MS/MS)
[Figure: peptides → µLC separation (50-100 µm) → ionization (MALDI or electrospray) → isolation, fragmentation, and mass analysis of peptide fragments → tandem mass spectrum (m/z); thousands of spectra in a matter of hours]
Peptide sequence determination from MS/MS spectra
Collision-induced dissociation (CID) creates two prominent ion series:
y-series: y14 y13 y12 y11 y10 y9 y8 y7 y6 y5 y4 y3 y2 y1
H2N-N--S--G--D--I--V--N--L--G--S--I--A--G--R-COOH
b-series: b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14
[Figure: MS/MS spectrum, relative abundance vs. m/z (200-1200)]
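The two ion series can be made concrete with a short calculation. The following is a minimal sketch, assuming singly charged fragments and standard monoisotopic residue masses; the function name `fragment_ions` is made up for illustration, and only fragments from cleavage between adjacent residues are listed.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # H2O, added to the C-terminal (y) fragments
PROTON = 1.00728   # charge carrier for singly charged ions


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b_i: first i residues plus a proton
        b_ions.append(sum(masses[:i]) + PROTON)
        # y_i: last i residues plus water plus a proton
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("NSGDIVNLGSIAGR")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i:<2} {b:9.4f}    y{i:<2} {y:9.4f}")
```

Differences between neighbouring peaks in one series equal a residue mass, which is what allows the sequence to be read from the spectrum.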
Peptide sequence identifies the protein
[Figure: MS/MS spectrum (relative abundance vs. m/z, 200-1200) of H2N-NSGDIVNLGSIAGR-COOH, annotated with the fragment ladder GDIVNLGSIAGR, DIVNLGSIAGR, IVNLGSIAGR, VNLGSIAGR, NLGSIAGR, LGSIAGR, GSIAGR, SIAGR, IAGR, AGR, GR, R]
YMR134W, yeast protein involved in iron metabolism
High-throughput protein identification by LC-MS/MS and automated
sequence database searching
[Figure: raw MS/MS spectrum (relative abundance vs. m/z, 200-1200) → protein sequence and/or DNA sequence database search → peptide sequence match (fragment ladder GDIVNLGSIAGR through R) → protein identification]
Direct identification of 1000+ proteins from complex mixtures
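The final step of this workflow, going from a matched peptide sequence to a protein, can be sketched as an in-silico digest plus lookup. This is a simplified illustration, not how any particular search engine works (real tools score spectra against candidate peptides); the function names and both toy protein sequences are invented, and the cleavage rule assumed is the usual trypsin rule (after K or R, not before P).

```python
def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def build_index(database):
    """Map each tryptic peptide to the set of proteins that contain it."""
    index = {}
    for name, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            index.setdefault(peptide, set()).add(name)
    return index


# Toy database: both sequences are invented for illustration only.
toy_database = {
    "protein_A": "MKTAYIAKNSGDIVNLGSIAGRLLPEK",
    "protein_B": "MSDNRPLVKGGAWTR",
}
index = build_index(toy_database)
print(index.get("NSGDIVNLGSIAGR", set()))
```

A peptide that maps to exactly one protein identifies it directly; peptides shared by several proteins need additional evidence.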
Dealing with the data
Integrated workflow?
1. Data acquisition
• Experimental information, metadata capture
2. Peak analysis
• Sequence database searching
• Quantitative analysis
3. Knowledge annotation and interpretation
• Database mining
• Assignment of function, pathway, localization etc.
• Output for database archiving, publication
1. Data acquisition: capturing experimental information
Proteomics Experimental Data Repository
(PEDRo)
Proposed schema
• Similar to genomic needs, but experimental info a bit different
2. Peak Analysis
Computational algorithms for searching MS/MS spectra
against protein sequence databases, mRNA sequences,
DNA sequences:
• ProFound
• Mascot
• PepSea
• MS-Fit
• MOWSE
• Peptident
• Multident
• Sequest
• PepFrag
• MS-Tag
[Figure: MS/MS spectrum (relative abundance vs. m/z, 200-1200) → protein identification]
• need cpu horsepower (parallel computing)
2. Peak Analysis: data formats
[Figure: three data formats (Format 1, Format 2, Format 3), each tied to its own output (Output 1, Output 2, Output 3), with question marks between them]
• Lack of flexibility
• Slow to evolve
• Lack of incorporation of competing products, methods
2. Peak Analysis: need general, flexible, in-house solutions
[Figure: Format 1, Format 2, Format 3 → reverse engineering of data formats → general tools for analysis of multiple data formats]
2. Peak Analysis: reverse engineering data formats
http://sashimi.sourceforge.net/software_glossolalia.html
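One way to picture "general tools for multiple data formats" is a set of small readers that all return the same in-memory structure. The sketch below is an assumption-laden illustration, not the approach of the project linked above: it handles only simplified, single-spectrum versions of two text peak-list formats (.dta and MGF), and the names `Spectrum`, `parse_dta`, `parse_mgf`, and `read_spectrum` are hypothetical.

```python
from dataclasses import dataclass, field

PROTON = 1.00728


@dataclass
class Spectrum:
    precursor_mz: float
    charge: int
    peaks: list = field(default_factory=list)  # (m/z, intensity) pairs


def parse_dta(text):
    """Simplified .dta: first line is 'MH+ charge', then 'm/z intensity' lines."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    mh_plus, charge = float(lines[0][0]), int(lines[0][1])
    # Convert the singly protonated mass to the observed precursor m/z.
    precursor_mz = (mh_plus + (charge - 1) * PROTON) / charge
    peaks = [(float(mz), float(inten)) for mz, inten in lines[1:]]
    return Spectrum(precursor_mz, charge, peaks)


def parse_mgf(text):
    """Simplified MGF: one BEGIN IONS/END IONS block with PEPMASS and CHARGE."""
    precursor_mz, charge, peaks = 0.0, 0, []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line in ("BEGIN IONS", "END IONS"):
            continue
        if line.startswith("PEPMASS="):
            precursor_mz = float(line.split("=", 1)[1].split()[0])
        elif line.startswith("CHARGE="):
            charge = int(line.split("=", 1)[1].rstrip("+"))
        elif "=" in line:
            continue  # other key=value headers (TITLE etc.) are ignored here
        else:
            mz, inten = line.split()[:2]
            peaks.append((float(mz), float(inten)))
    return Spectrum(precursor_mz, charge, peaks)


READERS = {"dta": parse_dta, "mgf": parse_mgf}


def read_spectrum(text, fmt):
    """Dispatch to the reader for the given format name."""
    return READERS[fmt](text)


# Invented example data describing the same spectrum in both formats.
dta_text = "1354.71 2\n175.12 50\n232.14 80\n"
mgf_text = "BEGIN IONS\nTITLE=example\nPEPMASS=677.86\nCHARGE=2+\n175.12 50\n232.14 80\nEND IONS\n"
from_dta = read_spectrum(dta_text, "dta")
from_mgf = read_spectrum(mgf_text, "mgf")
print(from_dta.peaks == from_mgf.peaks)
```

Downstream analysis code then depends only on `Spectrum`, so supporting a new format means adding one reader to `READERS`.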
2. Peak analysis: quality control of protein matches
Unfiltered – 10^5+ matches
(lots of noise and junk)
→ filtering →
Filtered – thousands of “true” matches
• Statistical analysis of database results (tools are available)
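The slide does not name a specific statistical method. As one illustration of the filtering idea, the sketch below uses a target-decoy estimate of the false discovery rate, which is a technique assumed here and not taken from the talk; the function name `filter_by_fdr`, the score scale, and the example matches are all invented.

```python
def filter_by_fdr(matches, max_fdr=0.01):
    """Keep target matches above the lowest score cutoff that meets max_fdr.

    matches: list of (score, is_decoy) tuples; higher score = better match.
    The FDR at each cutoff is estimated as decoys / targets above the cutoff.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    best_cutoff_index = 0
    for i, (score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            best_cutoff_index = i  # deepest position still within the FDR limit
    return [m for m in ranked[:best_cutoff_index] if not m[1]]


# Invented example: seven matches, two of them against decoy sequences.
example = [(9.1, False), (8.7, False), (8.0, False), (7.5, True),
           (7.2, False), (3.0, True), (2.9, False)]
kept = filter_by_fdr(example, max_fdr=0.25)
print([score for score, _ in kept])
```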
2. Peak Analysis: Quantitative analysis
State 1: N = normal isotope label
State 2: H = heavy isotopic label (e.g. 2H, 13C, 15N)
combine, proteolyze and isolate labeled peptides
analyze peptides by mass spectrometry
• External chemical labeling
• Metabolic labeling (SILAC)
• Enzymatic incorporation (O16/O18)
[Figure: mass spectrum, intensity vs. mass-to-charge (m/z), showing paired N- and H-labeled peptide peaks]
relative protein abundance = [intensity of N-labeled peptide] / [intensity of H-labeled peptide]
• Flexibility is key – need tools to handle different quantitative methods
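The abundance ratio above translates directly into code. This is a minimal sketch with invented intensities; summarizing several peptide ratios by their median is an assumption made here for illustration, and the function names are hypothetical.

```python
from statistics import median


def peptide_ratio(intensity_n, intensity_h):
    """Relative abundance = intensity of N-labeled / intensity of H-labeled peptide."""
    if intensity_h <= 0:
        raise ValueError("heavy-channel intensity must be positive")
    return intensity_n / intensity_h


def protein_ratio(peptide_pairs):
    """Median of per-peptide N/H ratios (the median is an assumed summary)."""
    return median(peptide_ratio(n, h) for n, h in peptide_pairs)


# Invented intensities for three peptide pairs from one hypothetical protein.
pairs = [(120.0, 240.0), (90.0, 200.0), (60.0, 100.0)]
print(protein_ratio(pairs))
```

Because the function only needs paired intensities, the same code applies whether the label came from chemical, metabolic, or enzymatic incorporation.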
2. Peak Analysis: Quantitative analysis
[Figure: +TOF MS spectrum (20 MCA scans from mm_sample.wiff; intensity in counts vs. m/z, 1914-1934 amu; max. 274.0 counts) with peak clusters labeled Sample 1 (around m/z 1918) and Sample 2 (around m/z 1926). Labeled peaks: 1916.9909, 1917.9946, 1918.9924, 1920.0007, 1921.0165, 1924.9803, 1926.0240, 1927.0231, 1928.0203, 1929.0322, 1930.0176, 1931.0077]
Evolving methodologies: iTRAQ
Samples 1, 2, 3, 4: each digested to peptides
iTRAQ labels: +114 (sample 1), +115 (sample 2), +116 (sample 3), +117 (sample 4)
Multidimensional separation
[Figure: MS/MS spectrum (intensity vs. m/z). Diagnostic ions at 114, 115, 116, 117 used for quantitative analysis; peptide fragments used for sequence identification]
• 4-way multiplexing: simultaneous comparison of multiple states, replicates
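Reading the four diagnostic ions out of an MS/MS peak list can be sketched as follows. The matching window of ±0.2 around each nominal reporter m/z is an assumption, the function names are hypothetical, and the example intensities are invented (the reporter m/z values are the ones labeled later in this deck).

```python
REPORTER_CHANNELS = (114, 115, 116, 117)  # nominal iTRAQ reporter m/z values


def reporter_intensities(peaks, window=0.2):
    """Sum intensity within +/- window of each nominal reporter m/z.

    peaks: list of (m/z, intensity) pairs from one MS/MS spectrum.
    """
    totals = {channel: 0.0 for channel in REPORTER_CHANNELS}
    for mz, intensity in peaks:
        for channel in REPORTER_CHANNELS:
            if abs(mz - channel) <= window:
                totals[channel] += intensity
    return totals


def relative_abundance(totals):
    """Express each channel as a fraction of the summed reporter signal."""
    total = sum(totals.values())
    if total == 0:
        raise ValueError("no reporter ion signal found")
    return {channel: value / total for channel, value in totals.items()}


# Invented intensities; the last peak stands in for a sequence fragment ion.
spectrum = [(114.1005, 100.0), (115.0963, 200.0), (116.0972, 300.0),
            (117.1025, 400.0), (175.119, 50.0)]
fractions = relative_abundance(reporter_intensities(spectrum))
print(fractions)
```

Peaks outside the reporter windows are ignored for quantification; they remain available for sequence identification.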
Need for “changeable” tools
[Figure: “old” – +TOF MS spectrum (20 MCA scans from mm_sample.wiff; intensity in counts vs. m/z, 1914-1934 amu; max. 274.0 counts) with Sample 1 and Sample 2 peak clusters. “new” – reporter ions for samples 1-4 at m/z 114.1005, 115.0963, 116.0972, 117.1025. Relative intensity = relative protein abundance]
Automated analysis tools?
3. Knowledge annotation: making sense of lists of data
3. Knowledge annotation: mining proteomic/genomic databases
3. Knowledge annotation: needs
• Annotation: accession numbers and protein names
• Functional assignments (functional degeneracy?)
• Pathway assignments
• Subcellular localization
• Disease implications
• Comparison of different proteomic datasets (e.g. expression profiles
compared to modification state profiles, other protein properties)
Automated and streamlined??
• Publication and deposit in databases
• Visualization of complex phenomena, interpretation of
biological relevance
• Modeling, integration with genomics data – computational
and systems biology
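The first annotation needs listed above (accession numbers, names, function, pathway, localization) amount to joining a list of identified proteins against annotation sources. A minimal sketch, assuming a simple in-memory lookup table: the function name `annotate` and the placeholder accession `UNKNOWN_1` are hypothetical, and the single real entry uses only what this deck states about YMR134W.

```python
ANNOTATIONS = {
    # Entry taken from the slide text; other fields are left unknown on purpose.
    "YMR134W": {"organism": "yeast", "function": "iron metabolism",
                "pathway": None, "localization": None},
}


def annotate(accessions, annotations):
    """Split identified accessions into annotated entries and missing ones."""
    annotated, missing = {}, []
    for accession in accessions:
        if accession in annotations:
            annotated[accession] = annotations[accession]
        else:
            missing.append(accession)
    return annotated, missing


# "UNKNOWN_1" is a placeholder for an identification with no annotation yet.
annotated, missing = annotate(["YMR134W", "UNKNOWN_1"], ANNOTATIONS)
print(annotated["YMR134W"]["function"], missing)
```

Keeping the unannotated accessions in a separate list makes the gaps explicit, which is where database mining would be directed next.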