Mass Spectrometry - Indiana University


L529 - Presentation
PROTEOMICS
- Yogita Mantri
-Arvind Gopu
11/10/2003
Introduction – What is Proteomics?
“The identification, characterization and
quantification of all proteins involved in a
particular pathway, organelle, cell, tissue, organ
or organism that can be studied in concert to
provide accurate and comprehensive data about
that system.”
http://www.inproteomics.com/prodef.html
Central lesson from eukaryotic genome projects


Evolutionary complexity is not primarily determined by increasing
the number of genes, but by increasing variation on the level of
the synthesized proteins.
This is achieved by generating MULTIPLE proteins from only
ONE gene, e.g. by
- different combinations of exons by alternative splicing
- post-translational protein processing (e.g. cleavage of propeptides)
- post-translational protein modifications (e.g. acetylation,
glycosylation)
Modified central dogma: DNA --> RNA --> protein(s)
It is therefore important to perform analyses on the level of gene
PRODUCTS

Key advantage of proteomics

Researchers work on the level of gene products and
deal with genes that are really expressed to give a
detectable PRODUCT, not genes that are merely "expressed",
which only says that they produce a detectable mRNA while it
is not clear whether there is a gene product or not.

Key limitation of proteomics

Usually, only a fraction of the proteins synthesized can
be detected in a proteomics experiment, whereas the
expression of ALL genes can be monitored in a whole-genome
array experiment.
Key prerequisite of proteomics

A genome sequence for the investigated organism or at
least a collection of many cDNA sequences is required.
Experimental Background
Mass Spectrometry
What is Mass Spec?





Analytical tool measuring molecular weight (MW) of
sample
Only picomolar concentrations required
Accurate to within 0.01% of the total molecular weight of
the sample, and to within 5 ppm for small organic
molecules
For a Mr of 40 kDa, 0.01% corresponds to a 4 Da error
This is often small enough to detect amino acid substitutions /
post-translational modifications
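The accuracy figures above can be checked with a few lines of arithmetic. This is only a sketch: the 500 Da small molecule and the phosphorylation mass shift (about 79.966 Da) are illustrative assumptions, not values taken from the slides.

```python
def absolute_error_da(mass_da, relative_accuracy):
    """Absolute mass uncertainty (Da) for a relative accuracy given as a fraction."""
    return mass_da * relative_accuracy


def ppm_error(measured_da, theoretical_da):
    """Signed measurement error in parts per million."""
    return (measured_da - theoretical_da) / theoretical_da * 1e6


# 0.01 % of a 40 kDa protein, as quoted on the slide.
protein_error = absolute_error_da(40_000.0, 0.01 / 100)

# 5 ppm of a hypothetical 500 Da small molecule (assumed example mass).
small_molecule_error = absolute_error_da(500.0, 5e-6)

# Assumed illustration: a phosphorylation adds roughly 79.966 Da,
# which is well above the 4 Da uncertainty on the 40 kDa protein.
PHOSPHORYLATION_SHIFT_DA = 79.966
shift_is_detectable = PHOSPHORYLATION_SHIFT_DA > protein_error

print(f"40 kDa protein error: {protein_error:.1f} Da")
print(f"500 Da molecule error: {small_molecule_error:.4f} Da")
print(f"Phosphorylation detectable: {shift_is_detectable}")
```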
What sort of info is returned?





Structural information can be generated
Particularly using tandem mass spectrometers
Fragment sample & analyse products
Useful for peptide & oligonucleotide sequencing
Plus identification of individual compounds in
complex mixtures
How does a Mass Spectrometer work?





3 fundamental parts: the ionisation source, the
analyser, the detector
Samples easier to manipulate if ionised
Separation in analyser according to mass-to-charge
ratios (m/z)
Detection of separated ions and their relative
abundance
Signals sent to data system and formatted in a m/z
spectrum
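Since the analyser separates ions by m/z rather than by mass, the same molecule appears at different positions depending on its charge. A minimal sketch of that relationship, assuming ionisation by gain or loss of protons (the function name and the 2000 Da example mass are hypothetical):

```python
PROTON_MASS_DA = 1.007276  # mass of a proton in Da


def mass_to_charge(neutral_mass_da, charge):
    """m/z of an ion formed by adding (charge > 0) or removing (charge < 0) protons."""
    if charge == 0:
        raise ValueError("a neutral molecule is not seen by the analyser")
    return (neutral_mass_da + charge * PROTON_MASS_DA) / abs(charge)


# One hypothetical 2000 Da peptide observed in several charge states.
for z in (1, 2, 3):
    print(f"z = +{z}: m/z = {mass_to_charge(2000.0, z):.4f}")
```

The negative-charge branch corresponds to the negatively charged ions mentioned for saccharides and oligonucleotides.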
Simplified Schematic


The analyser, detector and ionisation source are under high
vacuum to allow unhindered movement of ions
Operation is under complete data system control
Schematic of a typical TOF-MS/MS
Sample Introduction
& Ionisation


Direct into ionisation source or via
chromatography for component
separation (HPLC, GC, capillary
electrophoresis)
Ions can be positively charged (for
proteins) or negatively charged (for
saccharides and oligonucleotides)
Ionisation methods








Atmospheric Pressure Chemical Ionisation (APCI)
Chemical Ionisation (CI)
Electron Impact (EI)
Electrospray Ionisation (ESI)
Fast Atom Bombardment (FAB)
Field Desorption / Field Ionisation (FD/FI)
Matrix Assisted Laser Desorption Ionisation
(MALDI) (Clemmer Group)
Thermospray Ionisation
Detection & Recording of Ions


Detector monitors ion current, amplifies it and
then transmits signal to data system
Common detectors: photomultiplier, electron
multiplier, micro-channel plate
Mass spectrometry is a very powerful method to
analyse the structure of organic compounds, but
suffers from 3 major limitations:
Compounds cannot be characterised without clean
samples
The technique cannot provide sensitive
and selective analysis of complex mixtures
For big molecules like peptides, spectra are very complex
and very difficult to interpret
Tandem MS or MS/MS has 2 mass spectrometers in series.
The first mass spectrometer (MS1) is used to SELECT, from the
primary ions, those of a particular m/z value, which then pass into
the Fragmentation Region. The ion selected by MS1 is the
parent ion and can be a molecular ion or an ion resulting from a primary
fragmentation. DISSOCIATION occurs in the fragmentation
region. The daughter ions are analysed in the second
spectrometer (MS2). In effect, MS1 can be viewed as an ion
source for MS2.
MS1
MS2
Peptide Sequencing







Peptides of 2.5 kDa or less give best data
Protein sample often taken from 2-D gels and digested
A protein digest can be analysed as entire mix
Initial MS spectrum showing Mr of all components in digest
(peptide map) may be enough for a database search and
identification
Peptides fragmented along the amino acid backbone in tandem
mass spectrometry
Some peptides generate enough info for full sequence, others
only generate partial sequences of 4-5 amino acids
Often this “tag” sequence is sufficient for database identification
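The peptide-map idea above can be sketched as follows: digest each database protein in silico, compute peptide masses, and count how many observed masses are explained. The slides do not name the enzyme, so trypsin (cleave after K or R, not before P) is an assumption here, and the two database sequences are made-up toy examples.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed enzyme rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER_MASS


def count_matches(observed_masses, protein, tolerance_da=0.5):
    """Number of observed masses explained by the protein's digest."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(obs - theo) <= tolerance_da for theo in theoretical)
        for obs in observed_masses
    )


# Toy database (hypothetical sequences) and a simulated peptide map.
database = {"protein_A": "MKWVTFISLLFLFSSAYSR", "protein_B": "GASPVTKCLINDR"}
observed = [peptide_mass(p) for p in tryptic_digest(database["protein_B"])]
best_hit = max(database, key=lambda name: count_matches(observed, database[name]))
print(best_hit)
```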
Data Analysis
Common Data Analysis - Pipeline
Issue #1 (Relatively Minor?)

Diverse set of Mass Spectrometers…




More flexibility BUT ...
Different data formats
Limited Data analysis possible
Exchange of RAW datasets and creation of public
repositories for the data/software? Difficult, if not
impossible
Work Around for Issue #1?

To get around this problem






Convert to ASCII text - speed and loss of precision can be
an issue
Other formats specific to this field
A lot of XML-based file formats seem to be floating around
Of course, using an XML format (for example) gives rise to
an additional level of complexity -- parsers, formatters, etc.
It does add flexibility between data formats
Indexing techniques are used to speed up access
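One way such an index could work is to record the byte offset of every scan once, then seek directly to a scan instead of re-reading the file. This is a minimal sketch; the `SCAN <id>` text layout is a hypothetical format invented for illustration and is not one of the formats mentioned above.

```python
import io


def build_index(handle):
    """Map scan id -> byte offset of its 'SCAN <id>' header line."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"SCAN "):
            index[int(line.split()[1])] = offset
    return index


def read_scan(handle, index, scan_id):
    """Jump straight to one scan and return its (m/z, intensity) pairs."""
    handle.seek(index[scan_id])
    handle.readline()  # skip the header line
    peaks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b"SCAN "):
            break  # reached the next scan
        if line.strip():
            mz, intensity = line.split()
            peaks.append((float(mz), float(intensity)))
    return peaks


# In-memory stand-in for a large ASCII export (hypothetical layout).
data = io.BytesIO(b"SCAN 1\n100.0 5\n200.5 7\nSCAN 2\n150.25 3\n")
index = build_index(data)
scan2 = read_scan(data, index, 2)
print(index, scan2)
```

The index is built in one pass and costs a dictionary entry per scan; after that, access time no longer depends on where the scan sits in the file.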
Issue #2 (Much bigger!)



Data Size
Higher Dimensionality
The combination is even deadlier!


More detail in a minute … Before that …
The LC/MSMS spectrum data looks like this:
- LC
- Drift
- TOF
- Intensity

i.e., 3-D + Intensity
Issue #2 (Continued…)

As a first step in data analysis:

Find peaks in the LC/MSMS data





Peaks is kind of a misnomer.
Center of mass (or something like that) is a better term.
Illustrates inherent non-uniformity within proteomics circles
Easier said than done as we found out!
Let us start with a simpler case of finding peaks in 2D data – a little more complicated than 1-D …
Peak Finding – 2-D data
http://www.cs.nott.ac.uk/~gxk/aim/notes/hillclimbing.doc
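As an illustration of hill climbing on 2-D data, the sketch below starts from every grid cell, repeatedly moves to the highest of the 8 surrounding cells, and collects the cells where the climb stops. The grid values are invented for the example, and flat plateaus are not handled specially (every plateau cell would be reported as its own peak).

```python
def hill_climb(grid, start):
    """Steepest-ascent climb from `start`; returns the local maximum reached."""
    rows, cols = len(grid), len(grid[0])
    r, c = start
    while True:
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and grid[nr][nc] > grid[best[0]][best[1]]:
                    best = (nr, nc)
        if best == (r, c):
            return best  # no neighbour is strictly higher
        r, c = best


def find_peaks(grid):
    """Distinct local maxima reached when climbing from every cell."""
    starts = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return sorted({hill_climb(grid, s) for s in starts})


# Made-up intensity grid with two peaks.
grid = [
    [1, 2, 1, 1, 2],
    [2, 6, 3, 2, 3],
    [1, 3, 2, 5, 4],
    [1, 2, 4, 9, 5],
]
peaks = find_peaks(grid)
print(peaks)
```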
Peak Finding - Higher Dimensions?

As mentioned earlier, data is of the form:

- LC
- Drift
- TOF
- Intensity

i.e., 3-D + Intensity
Add to this the huge data size, and it becomes clear how difficult
a problem this is
Some Possible Solutions

Solutions we thought about:
- Find peaks using a brute force approach
  Not computationally feasible in terms of time and
  memory
- Squeeze 3-D data into 2-D, find peaks and then work
  backwards.
  This is the algorithm implemented by Frank - one of the
  IU Chemistry folks
- Use existing implementations of graph functions
  available in packages (for example: LEDA) to
  preprocess data and then find peaks on the smaller data
  set
Our Peak Finding Algorithm



Used the LEDA package for C++
Specifically made use of its O(n log n) implementation
of the Delaunay Triangulation neighbor finding
algorithm in 3-D space
Once neighbors were found, a brute force
peak finding step was run



How good were our results?
More details? Take a look at our summer
presentation at Chemistry
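The second stage described above (brute force peak finding once neighbors are known) can be sketched as a scan for points whose intensity exceeds that of all their neighbors. This is not the LEDA-based C++ code from the project: the neighbor graph here comes from a simple O(n^2) distance-radius search used purely as a stand-in for the Delaunay neighbors, and the points and intensities are invented.

```python
import math
from itertools import combinations


def radius_neighbours(points, radius):
    """Stand-in neighbour graph: link points closer than `radius` (O(n^2))."""
    neighbours = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= radius:
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours


def local_maxima(intensity, neighbours):
    """Indices whose intensity is strictly greater than every neighbour's.

    A point with no neighbours at all is reported as a peak.
    """
    return [
        i for i, value in enumerate(intensity)
        if all(value > intensity[j] for j in neighbours[i])
    ]


# Hypothetical (LC, drift, TOF) coordinates forming two separate clusters.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5), (6, 5, 5), (5, 6, 5)]
intensity = [10, 4, 3, 2, 8, 1]

graph = radius_neighbours(points, radius=1.5)
peak_indices = local_maxima(intensity, graph)
print(peak_indices)
```

With a triangulation supplying the neighbor lists, only the `radius_neighbours` step would change; the local-maximum scan stays linear in the number of edges.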
Sample of the data … What it looks like?
Peptide Assignment


Find sequence of amino acids that can
generate the list of masses seen in the
tandem MS scan.
Many different strategies:



Searching MS/MS spectra against a sequence
database (Sequest, Mascot, etc)
De novo sequencing (no database!)
Hybrid
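The database-search strategy can be illustrated with a very simple score: predict the singly charged b- and y-ion masses for each candidate peptide and count how many appear in the spectrum. This shared-peak count is only a stand-in for illustration; it is not the scoring used by Sequest or Mascot, and the candidate peptides are made up.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565
PROTON_MASS = 1.007276


def fragment_mzs(peptide):
    """Singly charged b- and y-ion m/z values for backbone cleavages."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON_MASS)               # b ion
        ions.append(sum(masses[i:]) + WATER_MASS + PROTON_MASS)  # y ion
    return ions


def shared_peak_count(spectrum, peptide, tolerance=0.5):
    """Number of predicted fragments found in the observed spectrum."""
    return sum(
        any(abs(peak - frag) <= tolerance for peak in spectrum)
        for frag in fragment_mzs(peptide)
    )


def best_match(spectrum, candidates, tolerance=0.5):
    """Candidate peptide explaining the most fragment peaks."""
    return max(candidates, key=lambda p: shared_peak_count(spectrum, p, tolerance))


# Simulated spectrum for a toy peptide, searched against toy candidates.
spectrum = fragment_mzs("PEPTIDE")
candidates = ["PEPTIDR", "PEPTIDE", "GASPVTK"]
print(best_match(spectrum, candidates))
```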
Scoring Peptide Sequences

Multiple search engines are available




Sequest and Mascot
They use different scoring algorithms
Search outputs are not comparable
Search outputs usually require expert
validation …
An example of scoring system: SCOPE





Probabilistic model for scoring tandem MS
against peptide database
Two stage model
Uses dynamic programming
Incorporates fragment ion probabilities, noisy
spectra and instrument measurement error
Details:

http://bioinformatics.oupjournals.org/cgi/screenpdf/17/suppl_1/S13.pdf (Scoring Spectra section)
Peptide Validation


Validate peptide assignments made during
the database search step.
Obviously, method used should be
standardized and independent from the
experimental and computational methods
used
Manual Validation

Filtering by database search scores

Problems:



Filtering criteria vary among researchers
Error rates are unknown
Possible only on very small datasets
Model Based Validation?

Empirical Statistical Model to estimate
accuracy …



Anal. Chem 2002, 74, 5383 – 5392
Employs Expectation Maximization and
Machine Learning techniques
Learns to distinguish correct from
incorrect database search results
Model Based Validation – EM algorithm


Each peptide assignment evaluated w.r.t. all
other assignments including incorrect ones
Denote correct and incorrect assignments as
(+) and (-); Scores as x_1, x_2 … x_s

\[
P(+ \mid x_1, x_2, \dots, x_s) =
\frac{P(x_1, x_2, \dots, x_s \mid +)\, P(+)}
     {\sum_{i \in \{+,-\}} P(x_1, x_2, \dots, x_s \mid i)\, P(i)}
\]
Model Based Validation – EM algorithm
(Continued …)

Replace search scores with discriminant
function F
\[
P(+ \mid F) = \frac{P(F \mid +)\, P(+)}{\sum_{i \in \{+,-\}} P(F \mid i)\, P(i)}
\]

A number of probabilistic parameters were considered
The distributions ended up being approximated by
Gaussian and Gamma distributions.

(More details are out of scope of this presentation; please refer to the paper)
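The Bayes formula for P(+ | F) can be evaluated directly once the two score distributions are fixed. The sketch below assumes a Gaussian for correct assignments and a shifted Gamma for incorrect ones; the slides do not say which distribution models which class, so that pairing, the shift parameter and all numeric values are assumptions. The EM step that would fit these parameters from data is omitted.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal density, used here for P(F | +) (assumed pairing)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def gamma_pdf(x, shape, scale, shift):
    """Shifted gamma density, used here for P(F | -) (assumed pairing)."""
    y = x - shift
    if y <= 0:
        raise ValueError("score at or below the shift is outside the model")
    return y ** (shape - 1) * math.exp(-y / scale) / (math.gamma(shape) * scale ** shape)


def posterior_correct(f, prior_correct, gauss_params, gamma_params):
    """P(+ | F) from Bayes' rule with two classes, + and -."""
    numerator = gaussian_pdf(f, *gauss_params) * prior_correct
    denominator = numerator + gamma_pdf(f, *gamma_params) * (1 - prior_correct)
    return numerator / denominator


# Hypothetical parameter values, chosen only to make the example run.
GAUSS = (3.0, 1.5)        # mu, sigma for correct assignments
GAMMA = (2.0, 1.0, -3.0)  # shape, scale, shift for incorrect assignments
PRIOR = 0.3               # assumed fraction of correct assignments

high_score = posterior_correct(3.0, PRIOR, GAUSS, GAMMA)
low_score = posterior_correct(-1.0, PRIOR, GAUSS, GAMMA)
print(f"P(+ | F=3.0)  = {high_score:.3f}")
print(f"P(+ | F=-1.0) = {low_score:.3f}")
```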

Example of Automated Validation





An example: Protein Prophet
Compute probabilities that peptides assigned to
MS/MS spectra are correct
Learns distributions of search scores and peptide
properties among correct and incorrect results
The computed probabilities are claimed to be a true
measure of the confidence!
Combines probabilities of peptides assigned to
MS/MS spectra to compute probability that
corresponding proteins are present in the sample
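A simple way to picture the last step (peptide probabilities combined into a protein probability) is to ask for the chance that at least one peptide assignment is correct. The sketch below assumes the peptide assignments are independent; it is a simplification for illustration and not the full model of the tool named above.

```python
from math import prod


def protein_probability(peptide_probabilities):
    """P(protein present) = 1 - P(every peptide assignment is wrong)."""
    return 1.0 - prod(1.0 - p for p in peptide_probabilities)


# Hypothetical peptide probabilities for two proteins.
strong_evidence = protein_probability([0.9, 0.5])
weak_evidence = protein_probability([0.2, 0.2, 0.2])
print(f"{strong_evidence:.3f} {weak_evidence:.3f}")
```

Several weak peptide hits accumulate under this formula, which is one reason the independence assumption should be treated with care.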
Interpretation

Assign a biological meaning to the output of
the pipeline
Current Issues and Challenges
After Proteomics…..
Functional Genomics
ProteinChip(TM)
Slide adapted from http://www.ciphergen.com/tech_doc11.2.html
Limitations of Proteomics
Experimental limitations:
Large-scale protein analysis difficult because:
-Proteins are fragile
-They can exist in multiple isoforms
-There is no protein equivalent of PCR for
amplification of a small sample
Data Analysis Limitations:
-Data contains a lot of noise that is difficult
to separate from actual signal. This results
in wastage of computing resources on
searching for unlikely spectra.
-Database searches for matching spectra
only give scores, leaving manual
intervention necessary for eliminating false
positives
Biomedical limitations
-In practice, it is very difficult to trace the complete
progression of a disease.
-Hence, using proteomics for monitoring the
biochemistry of a disease is like using a photo
camera to record a football match.
-Case of breast cancer research:
http://www.mcponline.org/cgi/reprint/2/5/281.pdf
References and Further Reading






Explains the whole process nicely -- article
http://swehsc.pharmacy.arizona.edu/analysis/Proteomics_News.htm
Mascot Home page -- help section
http://www.matrixscience.com/help_index.html
Presentation about MS MS data
http://sashimi.sourceforge.net/extra/oral.pdf
http://www.genetik.unibielefeld.de/D1E33C76A7CCA010AAD3B435B51404EE/Genome_Research_WS2002_03/stunde_ws0203_10.pd
http://bmbus6.leeds.ac.uk/mres/5130/MassSpectrometry.ppt
Some info about drug discovery/economic issues n such:
http://monod.uwaterloo.ca/cs798/proteomics.pdf
Paper on interpreting MSMS data
http://chem-ncms.unl.edu/asms2003/kurt.pdf
How to estimate correctness of MS MS prediction -- EM !!!
http://www.proteomecenter.org/PDFs/Keller.AnalChem.02.pdf

http://www.nature.com/cgitaf/DynaPage.taf?file=/nbt/journal/v21/n3/full/nbt0303-221.html

http://www.esainc.com/MolecularProteomics/molecular_proteomics.htm

Others:
http://genome.ucsd.edu/classes/be202/ppt/11

Delaunay Triangulation:
http://almond.srv.cs.cmu.edu/afs/cs/project/quake/public/www/triangle.delaunay.html

SCOPE paper -- screen PDF
http://bioinformatics.oupjournals.org/cgi/screenpdf/17/suppl_1/S13.pdf
Internet sites

www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
(Dr Alison E. Ashcroft at Leeds)





www.asms.org (The American Society for Mass Spectrometry)
www.spectroscopynow.com (Base Peak)
Mass Spec tools
www.expasy.ch/tools/#proteome
http://prowl.rockefeller.edu
www.mann.embl-heidelberg.de
Bibliography
Internet sites:
http://www.google.com
http://www.bmss.org.uk/what_is/whatis.html
http://www.duke.edu/~mdfeezor/NSHome/inform/msms1.html
http://www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
http://ms.mc.vanderbilt.edu/tutorials/ms/3.htm
http://www.garvan.unsw.edu.au/public/corthals/book/IPMS.html
http://www.micromass.co.uk/basics/Glossary.html
Ionization Methods
Further Reading
1. For MALDI beginner:
http://www.srsmaldi.com/Maldi/Guide.html
2. For MALDI lab user:
http://www.srsmaldi.com/Maldi/Lab.html
3. For MALDI tutorial:
http://ms.mc.vanderbilt.edu/tutorials/maldi/maldi-ie_files/frame.htm
4. Ionization Methods 1:
http://www.jeol.com/ms/docs/ionize.html
5. Ionization Methods 2:
http://www.waters.com/Waters_Website/Applications/lcms/lcms_itq.htm
SELDI Web sites:
• Molecular Analytical Systems (MAS).
http://www.seldi.org/
• Manufacturers of ProteinChip(R)
http://www.ciphergen.com/