Statistical Analysis Using Scaffold - Proteome Software

Analysis of Complex Proteomic
Datasets Using Scaffold
Free Scaffold Viewer can be downloaded at:
www.proteomesoftware.com
Scaffold: Why do we need it?
Shotgun proteomics → Analysis of complex mixtures
Whole cell extract
10,000+ proteins
600,000 peptides
1.2 Million Spectra!!!
• Beyond the realm of manual interpretation
• How do we determine what is a valid protein
identification?
Statistical Analysis Using Scaffold
• All search engines use different scoring
algorithms → Results cannot be compared directly
• Many search engines' results are described by
more than one value
Examples:
Mascot → Ion Score and Identity Score
Sequest → Xcorr and DeltaCn
Peptide Prophet*
• Creates a universal score (discriminant score) for the search
engine result (e.g. XCorr and DeltaCn are compressed to one
score for SEQUEST results, Ion score and Identity score for
Mascot results)
• Plots a histogram of the discriminant scores and
fits a bimodal mixture of standard statistical
distributions to separate correct from incorrect hits
• Computes the probability that the match is correct at a
given discriminant score
*Nesvizhskii, A. I. et al, Anal. Chem. 2003, 75, 4646-4658
[Figure: Histogram of discriminant scores. x-axis: Discriminant score (D), -3.9 to 7.3; y-axis: Number of spectra in each bin, 0 to 200.]
[Figure: Histogram of discriminant scores overlaid with two fitted curves labeled “incorrect” and “correct”. x-axis: Discriminant score (D), -3.9 to 7.3; y-axis: Number of spectra in each bin, 0 to 200.]
Assumes a mixture of standard statistical distributions
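The mixture assumption can be illustrated with a small expectation-maximization (EM) fit. This is only a sketch: it assumes both the “incorrect” and “correct” populations are Gaussian, which is a simplification chosen for brevity and not necessarily the distributions Scaffold or Peptide Prophet use. The function name `fit_two_gaussians` and the synthetic scores are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(scores, iterations=50):
    """Fit a two-component Gaussian mixture by EM.

    Returns two (weight, mean, sd) tuples, ordered by increasing mean.
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    params = []
    # Initialise one component from the lower half, one from the upper half.
    for chunk in (ordered[:half], ordered[half:]):
        mean = sum(chunk) / len(chunk)
        var = sum((x - mean) ** 2 for x in chunk) / len(chunk)
        params.append((0.5, mean, max(math.sqrt(var), 1e-3)))

    for _ in range(iterations):
        # E-step: responsibility of the second component for each score.
        resp = []
        for x in scores:
            lo = params[0][0] * normal_pdf(x, params[0][1], params[0][2])
            hi = params[1][0] * normal_pdf(x, params[1][1], params[1][2])
            resp.append(hi / (lo + hi) if lo + hi > 0 else 0.5)
        # M-step: re-estimate weight, mean, and sd of each component.
        updated = []
        for weights in ([1.0 - r for r in resp], resp):
            total = sum(weights)
            if total < 1e-9:  # component collapsed; keep its old parameters
                updated.append(params[len(updated)])
                continue
            mean = sum(w * x for w, x in zip(weights, scores)) / total
            var = sum(w * (x - mean) ** 2 for w, x in zip(weights, scores)) / total
            updated.append((total / len(scores), mean, max(math.sqrt(var), 1e-3)))
        params = updated
    return tuple(sorted(params, key=lambda p: p[1]))


# Synthetic discriminant scores (made up for illustration only).
rng = random.Random(0)
scores = [rng.gauss(-1.0, 0.8) for _ in range(700)]
scores += [rng.gauss(4.0, 1.0) for _ in range(300)]
incorrect, correct = fit_two_gaussians(scores)
```

The fit only makes sense when both populations are present in the data; with too few correct hits the second component has nothing to latch onto.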
[Figure: Histogram of discriminant scores with the “incorrect” and “correct” distributions and a Peptide Probability Threshold marked. x-axis: Discriminant score (D), -3.9 to 7.3; y-axis: Number of spectra in each bin, 0 to 200.]
$$p(+ \mid D) = \frac{p(D \mid +)\,p(+)}{p(D \mid +)\,p(+) + p(D \mid -)\,p(-)}$$
Here + denotes a correct match and - an incorrect match.
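The Bayes' rule expression on this slide can be evaluated numerically once the two distributions are known. The sketch below assumes Gaussian components with made-up parameters (prior weight, mean, standard deviation); `peptide_probability` is a hypothetical helper, not part of Scaffold.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def peptide_probability(d, correct, incorrect):
    """Posterior probability that a match with discriminant score d is correct.

    Each component is a (prior, mean, sd) tuple.
    """
    prior_pos, mean_pos, sd_pos = correct
    prior_neg, mean_neg, sd_neg = incorrect
    numerator = normal_pdf(d, mean_pos, sd_pos) * prior_pos
    denominator = numerator + normal_pdf(d, mean_neg, sd_neg) * prior_neg
    if denominator == 0.0:
        # Both densities underflowed; this naive form cannot handle far tails.
        raise ValueError("score too far in the tails for this sketch")
    return numerator / denominator


# Hypothetical parameters: 30% of matches correct, 70% incorrect.
correct = (0.3, 4.0, 1.0)
incorrect = (0.7, -1.0, 1.0)

p_low = peptide_probability(-1.0, correct, incorrect)
p_mid = peptide_probability(1.5, correct, incorrect)
p_high = peptide_probability(4.0, correct, incorrect)
```

With these parameters the two likelihoods are equal at D = 1.5, so the posterior there equals the prior for a correct match; a peptide probability threshold corresponds to a cut on D further to the right.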
One search engine may not be enough
[Figure: Three-way Venn diagram of results from SEQUEST, X!Tandem, and Mascot. Region percentages shown on the slide: 9%, 22%, 4%, 34%, 19%, 7%, 5%.]
• Peptide Prophet statistics are applied separately for
each search engine result (i.e. Mascot, SEQUEST,
and X!Tandem)
• Scaffold Merger combines the peptide probabilities
from each search engine to generate a protein
probability
The probability of identifying a spectrum
+ the probability of agreement between search engines
→ Protein Probability
Advantages of using Scaffold
• Allows you to choose a statistical error rate by setting
probability thresholds
• Allows you to compare and combine results from
different experiments and different search engines
• Allows sharing of raw data and search results
• Accepted as a suitable statistical method to validate
large datasets
This is the Samples view
List of all the proteins found in your samples
Homologous proteins (proteins matched to the same peptides)
are shown. You can link out directly to database entries
How does Scaffold deal with peptides that can be assigned to
more than one protein?
General Rule → Explain the spectral data
with the smallest set of proteins
Protein A and Protein B share all the same peptides,
so they will be grouped together
Protein A and Protein B each have one unique peptide →
they will be listed separately only if the peptide
probability is > 50%
Protein B has two unique peptides → it will be listed
separately
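The “smallest set of proteins” rule can be sketched as a grouping step followed by a greedy set cover. This is an illustration under stated assumptions, not Scaffold's actual algorithm: greedy set cover is a common approximation, the 50% peptide probability condition is not modeled, and the function names and example data are hypothetical.

```python
from collections import defaultdict


def group_indistinguishable(protein_peptides):
    """Group proteins that are matched to exactly the same peptides."""
    by_peptides = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        by_peptides[frozenset(peptides)].append(protein)
    return {tuple(sorted(names)): set(peps) for peps, names in by_peptides.items()}


def minimal_protein_list(protein_peptides):
    """Greedily pick protein groups until every peptide is explained."""
    groups = group_indistinguishable(protein_peptides)
    unexplained = set().union(*groups.values())
    chosen = []
    while unexplained:
        # Take the group explaining the most still-unexplained peptides;
        # sorting the keys makes ties deterministic.
        best = max(sorted(groups), key=lambda g: len(groups[g] & unexplained))
        chosen.append(best)
        unexplained -= groups[best]
    return chosen


# Made-up example: A and B share all peptides, D adds nothing beyond C.
hits = {
    "A": {"pep1", "pep2"},
    "B": {"pep1", "pep2"},
    "C": {"pep2", "pep3", "pep4"},
    "D": {"pep3"},
}
protein_list = minimal_protein_list(hits)
```

In this example A and B end up in one group because they cannot be told apart, and D is dropped because its only peptide is already explained by C.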
Scaffold will extract GO terms from NCBI annotations
Gene Ontology “GO” terms
• Controlled vocabulary containing consistent
descriptions of gene products in different
databases
• Describe gene products in terms of their
associated biological processes, cellular
components and molecular functions in a species
independent manner
Gene Ontology Project http://www.geneontology.org/GO.doc.shtml
List of samples
Probability thresholds for peptide and protein
identifications and required number of unique
peptides can be defined
Color coded to represent probability that protein
identification is correct
This is the Proteins view
Spectrum of each peptide labeled with y and b ions which can
be used for manual validation
Manual Spectrum Evaluation
• Search engine scores → Is the peptide found by more
than one search engine?
Mascot ion score > 40
SEQUEST Xcorr > 2 (+2 ion), > 2.5 (+3 ion)
deltaCn > 0.2
• Good signal-to-noise
• Long stretches of y and/or b ions
• All dominant peaks are assigned as y or b ions
• Fragmentation chemistry
N-terminal cleavage at P → dominant y-ion
C-terminal cleavage at D and E → dominant b-ion
Peptides containing W → abundant y-ions
S and T → tend to lose water (-18 Da)
R, N, and Q → tend to lose ammonia (-17 Da)
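The score cut-offs in this checklist can be written down as a small helper. The thresholds are the ones listed on the slide; the function name `check_score_guidelines` and its return format are hypothetical, and charge states other than +2 and +3 are left undecided because the slide gives no Xcorr cut-off for them.

```python
def check_score_guidelines(charge, xcorr=None, delta_cn=None, mascot_ion_score=None):
    """Compare search engine scores with the rule-of-thumb cut-offs.

    Returns a dict of score name -> True/False, or None when the slide
    gives no cut-off for that charge state. Scores passed as None are skipped.
    """
    checks = {}
    if mascot_ion_score is not None:
        checks["mascot_ion_score"] = mascot_ion_score > 40
    if xcorr is not None:
        min_xcorr = {2: 2.0, 3: 2.5}.get(charge)
        checks["xcorr"] = None if min_xcorr is None else xcorr > min_xcorr
    if delta_cn is not None:
        checks["delta_cn"] = delta_cn > 0.2
    return checks


# Scores of the two example spectra from this deck.
good = check_score_guidelines(2, xcorr=2.61, delta_cn=0.4, mascot_ion_score=60.1)
bad = check_score_guidelines(3, xcorr=2.26, delta_cn=0.2, mascot_ion_score=9.93)
```

These checks cover only the numeric scores; signal-to-noise, ion series coverage, and fragmentation chemistry still need to be judged on the spectrum itself.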
Good Spectrum
Peptide Sequence IAELAGFSVPENTK
+2 charge on parent peptide
1474.73 AMU, +2 H (Parent Error: -650 ppm)
[Figure: Annotated MS/MS spectrum. x-axis: m/z, 0 to 1250; y-axis: 0% to 100%. Labeled peaks include b3 to b13, y3 to y12, and b9-H2O.]
Good signal-to-noise
Good coverage of y and b ion series
Dominant y-ion at N-terminal cleavage of P
SEQUEST: Xcorr = 2.61, deltaCn = 0.4
Mascot: Ion Score = 60.1, Identity Score = 37.3
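For manual validation it helps to know where the b and y ions of IAELAGFSVPENTK are expected. The sketch below computes singly charged, unmodified b and y ion m/z values from standard monoisotopic residue masses. The helper names are hypothetical, and modifications, higher charge states, and neutral losses are deliberately left out.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in sequence) + WATER


def fragment_ions(sequence):
    """Singly charged b and y ion m/z values, keyed like 'b3' or 'y5'."""
    ions = {}
    for i in range(1, len(sequence)):
        # b ions contain the first i residues, y ions the last i residues.
        ions[f"b{i}"] = sum(MONO[aa] for aa in sequence[:i]) + PROTON
        ions[f"y{i}"] = sum(MONO[aa] for aa in sequence[-i:]) + WATER + PROTON
    return ions


sequence = "IAELAGFSVPENTK"
mass = peptide_mass(sequence)
ions = fragment_ions(sequence)
y5 = ions["y5"]  # PENTK, the fragment from cleavage N-terminal to P
```

The computed neutral mass is about 1474.77 Da, close to the 1474.73 AMU reported for the parent, and y5 falls near m/z 588.3.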
Bad Spectrum
Peptide Sequence YPLADYALTPDMAIVDANLVMDMPK
+3 charge on parent peptide
2767.75 AMU, +3 H (Parent Error: -240 ppm)
[Figure: Annotated MS/MS spectrum. x-axis: m/z, 0 to 2500; y-axis: 0% to 100%. Peak labels mix a few b and y ions with many +2H, neutral-loss, a, x, and internal-fragment assignments (e.g. b17-H2O+2H, a7-H2O+1, x8+2H, internal PLADYALTPD-CO, internal PLADYAL-NH3).]
Poor coverage of y and b ion series
Multiple unassigned peaks
Poor signal-to-noise
SEQUEST: Xcorr = 2.26, deltaCn = 0.2
Mascot: Ion Score = 9.93, Identity Score = 37.3
This is the Statistics view
Scaffold Statistics View
Score Histogram
Blue indicates “incorrect”
proteins
Red indicates “correct”
proteins
Important!
Must have enough data to
fit two distributions for the
statistics to be valid.
A protein is “correct” if it passes the peptide probability,
protein probability, and minimum number of peptides filters.
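The three filters can be expressed as a single predicate. This is a minimal sketch: the `ProteinHit` record, the function name, and the default thresholds are hypothetical and only illustrate how the peptide probability, protein probability, and minimum peptide count combine.

```python
from dataclasses import dataclass


@dataclass
class ProteinHit:
    accession: str
    protein_probability: float
    peptide_probabilities: list  # one probability per unique peptide


def passes_filters(hit, min_protein_prob=0.95, min_peptide_prob=0.95, min_peptides=2):
    """True if enough peptides pass the peptide filter and the protein passes its own."""
    qualifying = [p for p in hit.peptide_probabilities if p >= min_peptide_prob]
    return hit.protein_probability >= min_protein_prob and len(qualifying) >= min_peptides


# Made-up hits for illustration.
hits = [
    ProteinHit("P1", 0.99, [0.97, 0.96, 0.50]),  # two peptides pass
    ProteinHit("P2", 0.99, [0.97, 0.60]),        # only one peptide passes
    ProteinHit("P3", 0.80, [0.99, 0.99]),        # protein probability too low
]
accepted = [hit.accession for hit in hits if passes_filters(hit)]
```

Loosening or tightening any of the three thresholds changes which proteins are counted as “correct”, which is why the settings used should be reported alongside the results.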
With at least 2 unique peptides (95% peptide prob),
the maximum protein probability is ~100%.
With only 1 unique peptide (95% peptide prob),
the maximum protein probability is <90%.
[Figure: Statistics view, SEQUEST only, with missed IDs marked]
[Figure: Statistics view, Mascot only, with missed IDs marked]
Using both Mascot and Sequest results in more
“correct” protein identifications
[Figure: Comparison of Mascot only, Both, and Sequest only]
This is the Publish View
Publication Guidelines
for Proteomic Data
Molecular & Cellular Proteomics
http://www.mcponline.org/misc/ParisReport_Final.shtml
Data Analysis
• Name and version of software used to extract peak list
• Name and version of database searching software (Mascot,
Sequest, Spectrum Mill, or X!Tandem)
• Values of all search parameters used (enzyme, modifications,
mass tolerance, etc.)
• Name and size of the database searched (Swiss-Prot or NCBI and
the number of sequence entries)
• Name and version of any additional software used for statistical
analysis and an explanation of the analysis (Scaffold, #peptide
requirements, probability settings)
Each Peptide Identified
• Peptide sequence noting any modifications or
missed cleavages
• Parent peptide ion mass and charge
• All search engine scores
Each Protein Identified
• Accession number
• Sequence coverage and total number of unique
peptides