Proteogenomics - NUS Computing

Hugo Willy
A combination of the words
Proteomics and Genomics.
Proteogenomics commonly refers to
studies that use proteomic
information, often derived from mass
spectrometry, to improve gene
annotations.
Mass spectrometry is used to characterize
protein sequences.
The basic idea is to ionize proteins
and let them “fly” in a vacuum chamber.
The mass-to-charge (m/z) ratio of each
ion can be deduced from its time of
flight (TOF) to reach a detector, or
from the frequency at which it
circles in a magnetic field.
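The two relations above can be sketched in a few lines of Python. This is a minimal sketch under idealized assumptions that are not in the slides: a linear TOF analyzer in which every ion is accelerated through the same voltage and then drifts over a field-free length, and an ideal cyclotron motion in a uniform magnetic field. The function names and the example instrument parameters are hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, in coulombs
DALTON = 1.66053906660e-27   # one dalton, in kilograms


def mz_from_tof(flight_time_s, drift_length_m, accel_voltage_v):
    """m/z (Da per elementary charge) for an idealized linear TOF analyzer.

    Energy balance: z*e*V = (1/2)*m*v^2 with v = L/t,
    so m/z = 2*e*V*t^2 / L^2.
    """
    kg_per_charge = (2.0 * E_CHARGE * accel_voltage_v
                     * flight_time_s ** 2 / drift_length_m ** 2)
    return kg_per_charge / DALTON


def tof_from_mz(mz, drift_length_m, accel_voltage_v):
    """Inverse of mz_from_tof: flight time in seconds for a given m/z."""
    return drift_length_m * math.sqrt(
        mz * DALTON / (2.0 * E_CHARGE * accel_voltage_v))


def mz_from_cyclotron(frequency_hz, field_tesla):
    """m/z from the cyclotron frequency f = z*e*B / (2*pi*m)."""
    return E_CHARGE * field_tesla / (2.0 * math.pi * frequency_hz) / DALTON


if __name__ == "__main__":
    # Hypothetical instrument: 1 m drift tube, 20 kV acceleration.
    t = tof_from_mz(1000.0, drift_length_m=1.0, accel_voltage_v=20000.0)
    print(f"flight time for m/z 1000: {t * 1e6:.1f} microseconds")
    print(f"recovered m/z: {mz_from_tof(t, 1.0, 20000.0):.3f}")
```

With these assumed parameters the flight time comes out on the order of 16 microseconds, and heavier ions (larger m/z) arrive later, which is what lets the detector separate them.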
Some mass spectrometry techniques ionize
whole proteins, but the currently popular
method is to chop a protein into peptides.
The peptides are separated by their masses
before ionization and sequenced
independently.
The peptide sequences are mapped back to
known protein sequences or used for de
novo sequencing (very much like genome
sequencing).
The peptide lengths, according to the
people I met, are around 7-15 amino acids.
Pros:
It is accurate in determining mass.
Assuming unambiguous mapping to a
protein sequence, it can show
which proteins are actually
translated in the cell, and therefore
which mRNAs get translated and which
do not.
It can be used to quantify the amount of
different proteins in the sample, as
opposed to predicting it from the mRNA
levels measured by microarray.
Pros (continued):
It can identify post-translational
modifications, e.g.
whether proteins are phosphorylated
(kinase related),
whether proteins are methylated or acetylated
(important in the histone code),
whether proteins are ubiquitinated (related to
protein degradation).
It can detect programmed (ribosomal)
frameshifts and alternative splicing
events.
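One way to see why mass spectrometry can reveal these modifications is that each one adds a characteristic mass to the peptide. The sketch below is a simplified illustration, not the method of any particular search tool: it compares an observed peptide mass with the unmodified mass and lists modifications whose monoisotopic mass shift matches within a tolerance. The ubiquitination entry assumes the GlyGly remnant left on the modified lysine after tryptic digestion; the function name and tolerance are hypothetical.

```python
# Monoisotopic mass shifts in daltons for a single modification.
PTM_MASS_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "methylation": 14.01565,
    # Assumption: ubiquitination seen as the GlyGly remnant after trypsin.
    "ubiquitination (GlyGly remnant)": 114.04293,
}


def candidate_ptms(observed_mass, unmodified_mass, tolerance_da=0.01):
    """Return the modifications whose mass shift explains the difference."""
    delta = observed_mass - unmodified_mass
    return [name for name, shift in PTM_MASS_SHIFTS.items()
            if abs(delta - shift) <= tolerance_da]


if __name__ == "__main__":
    # Made-up masses: the peptide is about 79.966 Da heavier than expected.
    print(candidate_ptms(1079.9663, 1000.0))
```

Real searches must also consider several modifications on one peptide and decide which residue carries them, which this sketch ignores.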
Cons:
It is still expensive (but an expert at the
RECOMB Satellite for Computational
Proteomics said it is just as expensive as
RNA-Seq).
It is hard to distinguish amino acids, or
combinations of amino acids, whose masses
are the same or nearly the same (most notably
leucine and isoleucine, which have identical masses).
We do not have a reliable way to amplify
proteins in the sample (a serious problem).
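The mass ambiguity can be made concrete with standard monoisotopic residue masses. The sketch below (function name is illustrative) computes a peptide's neutral mass as the sum of its residue masses plus one water, and shows that swapping leucine for isoleucine changes nothing, while two glycines are almost indistinguishable from one asparagine.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one H2O is added for the free N- and C-termini


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


if __name__ == "__main__":
    print(peptide_mass("PEPTIDE") == peptide_mass("PEPTLDE"))  # L vs I
    print(abs(peptide_mass("GG") - peptide_mass("N")))         # near zero
    print(abs(peptide_mass("K") - peptide_mass("Q")))          # about 0.036 Da
```

Lysine and glutamine differ by only about 0.036 Da, so whether they can be told apart depends on the mass accuracy of the instrument.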
Accurate prediction of translation start sites.
Accurate prediction of programmed
frameshifts.
Accurate prediction of post-translational
modifications.
Confirmation of whether a (pseudo)gene is actually
translated.
Observation: most current gene prediction
algorithms are not based on proteomic
data (because such data were not available).
For a novel protein, the task is to map the
peptides from the mass spectrometry
experiments to the exomes/genomes
(a problem similar to that of RNA-Seq).
Currently they try to collect exomes
(regions that are assumed to be exons)
and translate them in 6 different frames
(3 on each DNA strand).
They also build an exon splice graph, which
models the different splicing alternatives of a
single gene.
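The six-frame translation mentioned above can be sketched as follows. This is a minimal illustration assuming the standard genetic code and a DNA string containing only A, C, G and T; the function names and the frame labels (+1 to +3, -1 to -3) are conventions chosen here, not taken from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the 3 forward and 3 reverse-strand translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_peptide(dna, peptide):
    """List the frames whose translation contains the peptide."""
    return [name for name, protein in six_frame_translation(dna).items()
            if peptide in protein]


if __name__ == "__main__":
    print(six_frame_translation("ATGGCCTAA"))
    print(find_peptide("ATGGCCTAA", "MA"))
```

Stop codons are kept as `*` in the output, so a peptide can only match within a stretch that contains no stop.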
[Figure: exon splice graph] Each box represents a
single exon, and the arrows represent possible
combinations of exons in the translated protein product.
They developed a program called Inspect to search for a
peptide in this graph. It can be
found at http://proteomics.ucsd.edu/Inspect
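To make the splice graph idea concrete, here is a toy sketch of searching for a peptide in such a graph. It is not how Inspect works; it is a brute-force illustration under strong simplifying assumptions: each exon is already translated in the correct frame, no codon is split across an exon boundary, and the graph is acyclic. All names and the example data are hypothetical.

```python
def enumerate_paths(edges, node, path=None):
    """Yield every path from `node` to an exon with no successors."""
    path = (path or []) + [node]
    successors = edges.get(node, [])
    if not successors:
        yield path
    for nxt in successors:
        yield from enumerate_paths(edges, nxt, path)


def find_peptide_in_splice_graph(exon_peptides, edges, start_exons, peptide):
    """Return the exon paths whose concatenated translation contains the peptide."""
    hits = []
    for start in start_exons:
        for path in enumerate_paths(edges, start):
            protein = "".join(exon_peptides[exon] for exon in path)
            if peptide in protein:
                hits.append(path)
    return hits


if __name__ == "__main__":
    # Toy gene: exon E1 can be followed by E2 or E3, both of which join E4.
    exon_peptides = {"E1": "MKT", "E2": "AYI", "E3": "GQR", "E4": "LLS"}
    edges = {"E1": ["E2", "E3"], "E2": ["E4"], "E3": ["E4"]}
    print(find_peptide_in_splice_graph(exon_peptides, edges, ["E1"], "TGQ"))
    print(find_peptide_in_splice_graph(exon_peptides, edges, ["E1"], "LLS"))
```

The peptide `TGQ` spans the E1-E3 junction, so it matches only the isoform that uses that junction, which is the kind of evidence that supports one splicing alternative over another. Enumerating all paths grows quickly with the number of exons, so a practical tool would need a more efficient search.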
Revising gene models, and hence their
annotations.
Finding novel peptides that map to
non-exonic regions (novel genes?).
Nitin Gupta et al. Whole proteome analysis of posttranslational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Res 2007.
Proteogenomics: Annotating Genomes using the Proteome. Natalie Castellana. Poster in RECOMB CP 2011.
http://proteomics.ucsd.edu/recombcp2011/Posters/Poster_B19.pdf
Tutorial: Proteogenomics. Natalie Castellana.
http://bix.ucsd.edu/projects/recombcp10_tutorials/RECOMBCP_Tutorial_Castellana.pdf
Most of the work is done by Pavel Pevzner and other
groups at UC San Diego. Here is their website:
http://proteomics.ucsd.edu/
Comparative proteogenomics is a branch of
proteogenomics that
compares proteomic data from multiple
related species concurrently and exploits
the homology between their proteins to
improve annotations with higher statistical
confidence.
In a sense, this is the approximate
peptide matching problem.
However, it needs to take residue
conservation in different parts of the
proteins into account, e.g. sites that are
post-translationally modified must be
preserved to maintain function.
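The combination of approximate matching with conserved positions can be sketched as below. This is a deliberately simple stand-in, assuming substitutions only (no insertions or deletions) and a caller-supplied set of peptide positions that must match exactly, for example a known modification site. The function name and parameters are hypothetical.

```python
def approximate_matches(peptide, protein, max_mismatches=1,
                        conserved_positions=()):
    """Find windows of `protein` that match `peptide` with few substitutions.

    conserved_positions holds 0-based indices within the peptide that are
    not allowed to differ (e.g. a post-translationally modified residue).
    """
    hits = []
    k = len(peptide)
    for start in range(len(protein) - k + 1):
        window = protein[start:start + k]
        mismatches = [i for i in range(k) if peptide[i] != window[i]]
        if len(mismatches) > max_mismatches:
            continue
        if any(i in conserved_positions for i in mismatches):
            continue
        hits.append((start, window))
    return hits


if __name__ == "__main__":
    protein = "MMAKSPITGGAKTPLTQQ"   # made-up ortholog sequence
    peptide = "AKSPLT"               # position 2 (S) is the assumed PTM site
    print(approximate_matches(peptide, protein))
    print(approximate_matches(peptide, protein, conserved_positions=(2,)))
```

Without the conservation constraint both `AKSPIT` and `AKTPLT` are accepted; requiring the serine at position 2 to be preserved rejects the second window, which has lost the assumed modification site.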
Some work in comparative
proteogenomics:
Nitin Gupta et al. Comparative proteogenomics:
Combining mass spectrometry and comparative
genomics to analyze multiple genomes. Genome
Res 2008.
GenoMS (Castellana et al. MCP 2010): a
program to map peptides to the genomes of other
related organisms.
Metaproteomics (also called community proteomics,
environmental proteomics, or community
proteogenomics) is the study of all proteins
recovered directly from
environmental samples.
This involves simultaneous mapping of
peptides to all known genomes and
proteomes to get the identity of different
organisms present in a sample.
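A toy version of this simultaneous mapping is sketched below. It assumes exact substring matching against a small in-memory collection of proteomes, which is far simpler than a real metaproteomic search; the organism names, sequences, and function names are made up for illustration. Peptides that occur in exactly one organism count as evidence for it, while peptides shared by several organisms are set aside as ambiguous.

```python
from collections import Counter


def organisms_for_peptide(peptide, proteomes):
    """Return the set of organisms with a protein containing the peptide."""
    return {organism for organism, proteins in proteomes.items()
            if any(peptide in protein for protein in proteins)}


def profile_sample(peptides, proteomes):
    """Count organism-specific peptide hits and collect ambiguous peptides."""
    unique_hits = Counter()
    ambiguous = []
    for peptide in peptides:
        organisms = organisms_for_peptide(peptide, proteomes)
        if len(organisms) == 1:
            unique_hits[next(iter(organisms))] += 1
        elif len(organisms) > 1:
            ambiguous.append(peptide)
    return unique_hits, ambiguous


if __name__ == "__main__":
    proteomes = {  # hypothetical reference proteomes
        "org_A": ["MKTAYIAKQR", "GGSLLPEW"],
        "org_B": ["MKTAYLVNNR", "HHSLLPEW"],
    }
    sample = ["AYIAKQ", "AYLVNN", "SLLPEW", "ZZZZ"]
    hits, ambiguous = profile_sample(sample, proteomes)
    print(dict(hits), ambiguous)
```

Scanning every proteome for every peptide does not scale to all known genomes, so an indexed search would be needed in practice.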
Example work in this field is by
Wilmes P, Bond PL. Metaproteomics: studying
functional gene expression in microbial
ecosystems. Trends Microbiol. 2006.
CSPS (Bandeira et al. Nat. Biot. 2009)
MassBank
http://www.massbank.jp/en/document.html
I notice that Hoang’s problem (the one
that may make it possible to store multiple
reference genomes) is going to be very
relevant.
RNA-Seq - Mass Spectrometry = Noncoding RNA?
Anything else?