GST II: ---Title--- - University of Missouri
Computational Proteomics
Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
http://digbio.missouri.edu
573-882-7064 (O)
Outline
Introduction
Protein identification using Mass-spec
Protein interaction and pathway
Summary
Introduction – What is Proteomics?
“The identification, characterization and
quantification of all proteins involved in a
particular pathway, organelle, cell, tissue,
organ or organism that can be studied in
concert to provide accurate and
comprehensive data about that system.”
http://www.inproteomics.com/prodef.html
Scope of proteomics
Graves and Haystead (2002)
Microbiol & Molec. Biol.
Rev. 66, 39-63
Eukaryote Gene/Protein Expression Control
[Figure: control points along the path from DNA to protein. In the nucleus: DNA, primary RNA transcript, mRNA (transcriptional control, RNA processing control). Across the nucleus membrane: RNA transport control. In the cytosol: mRNA, inactive mRNA, protein, inactive protein, modified protein (mRNA degradation control, translation control, protein degradation control, post-translational control).]
Methods: Mass spec, Microarray
2D PAGE
[Figure: 2D PAGE gels, control vs. toxicant; axes: isoelectric point and experimental mass. Bruno ME et al., Arch Biochem Biophys (2002) 406, 153-164]
Mass Spectroscopy Techniques
Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF)
mainly for peptide mass mapping
Electrospray MS/MS
more sensitive for protein identification
de novo amino acid sequencing
MS fingerprint for protein
protein
MPSESSYKVHRPAKSGGS
trypsin digestion
peptides
MPSESSYK
VHR
PAK
SGGS
In-silico Digestion
[Figure: every protein in the database (MPSESSYKVHRPAKSGGS, another protein, ...) goes through in-silico digestion.]
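A minimal sketch of in-silico digestion, assuming the simplified rule that trypsin cleaves after every K or R. That rule reproduces the peptides shown on the fingerprint slide; real digestion tools usually add exceptions (for example, no cleavage before proline) and allow missed cleavages, neither of which is modeled here. The function name is hypothetical.

```python
def tryptic_digest(sequence):
    """Split a protein after every K or R (simplified rule, no exceptions)."""
    peptides = []
    current = ""
    for residue in sequence:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:
        # Keep the C-terminal peptide, which need not end in K or R.
        peptides.append(current)
    return peptides


print(tryptic_digest("MPSESSYKVHRPAKSGGS"))
# prints ['MPSESSYK', 'VHR', 'PAK', 'SGGS']
```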
Peak Picking
|PM(a) – PM(b)| < Error
score(TM(a), TM(bi))
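The mass-tolerance test above can be sketched as follows. This is only an illustration: the masses, the protein names, and the 0.5 Da tolerance are made-up values, and counting matched peaks is a stand-in for a real scoring function.

```python
def count_matches(observed, theoretical, tolerance=0.5):
    """Count observed masses lying within `tolerance` of any theoretical mass."""
    matched = 0
    for obs in observed:
        if any(abs(obs - theo) < tolerance for theo in theoretical):
            matched += 1
    return matched


# Hypothetical peak list and hypothetical in-silico digests of two proteins.
observed_peaks = [500.3, 842.5, 1200.7]
database = {
    "protein_a": [500.2, 842.4, 990.1],
    "protein_b": [501.5, 1200.9],
}

ranking = sorted(
    database,
    key=lambda name: count_matches(observed_peaks, database[name]),
    reverse=True,
)
print(ranking)
```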
MOWSE Score (1)
Popular scoring scheme used.
Protein score based on frequency of occurrence of peptides.
Frequency table is created for every database used.
MOWSE Score (2)
Peptide        Protein 0 – 10 kDa   Protein 10 – 20 kDa   Protein 20 – 30 kDa
0 – 100 Da     54                   23                    7
100 – 200 Da   34                   12                    23
...            ...                  ...                   ...
MOWSE Score (3)
Bin frequencies are normalized by dividing by the maximum number in the column.
Scoring scheme
Sj = 50 / (Pn * H)
where Pn is the product of the n normalized frequencies of matching peptides and H is the protein molecular weight.
Proteins are ranked by their scores.
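The normalization and scoring steps can be sketched as below, using the 0 – 10 kDa column from the example frequency table. The unit of H is not stated on the slide; kDa is assumed here, and the 8 kDa protein is a made-up example. Function names are hypothetical.

```python
def normalize_column(counts):
    """Divide every bin count in one protein-mass column by the column maximum."""
    peak = max(counts)
    return [count / peak for count in counts]


def mowse_score(matched_frequencies, protein_mass):
    """Sj = 50 / (Pn * H), with Pn the product of the normalized frequencies."""
    pn = 1.0
    for frequency in matched_frequencies:
        pn *= frequency
    return 50.0 / (pn * protein_mass)


# Column for proteins of 0-10 kDa: peptide bins 0-100 Da and 100-200 Da.
column = normalize_column([54, 34])

# Hypothetical 8 kDa protein with one matching peptide in each of the two bins
# (H assumed to be in kDa).
score = mowse_score(column, protein_mass=8.0)
print(column, score)
```

Rare peptide masses have small normalized frequencies, so matching them shrinks Pn and raises the score.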
Too many matches
For each mass, there are very many peptides in the database with that mass.
There are many missed peaks in the MS.
There is a lot of noise in the MS.
For each MS, there could be many proteins in the database that match the MS.
From Peptides to Protein
Computational Studies on Confidence
Assessment for Protein Identification
We have developed a statistical model which gives a p-value indicating the confidence that the protein identification is true. The model is based on the Extreme Value Distribution of the protein identification scores from randomly shuffled MS spectral peaks.
Score: 1268
P-value: 0.025
Distribution of score for Swissprot
with a large number of input spectra
Cumulative Distribution of score
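A minimal sketch of how a p-value can be read off an extreme value distribution. The slides do not say which form of the distribution or which fitting procedure is used; a Gumbel distribution fitted by the method of moments is assumed here, and the null scores are simulated stand-ins for scores obtained from shuffled spectra.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(null_scores):
    """Method-of-moments estimate of the Gumbel location and scale."""
    mean = statistics.mean(null_scores)
    std = statistics.stdev(null_scores)
    scale = std * math.sqrt(6) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def p_value(score, location, scale):
    """P(null score >= score) = 1 - exp(-exp(-z)) under the fitted Gumbel."""
    z = (score - location) / scale
    return -math.expm1(-math.exp(-z))


# Simulated null scores: the best of 50 random scores, repeated 500 times
# (purely illustrative, not real identification scores).
random.seed(0)
null_scores = [max(random.gauss(100, 10) for _ in range(50)) for _ in range(500)]

location, scale = fit_gumbel(null_scores)
print(p_value(max(null_scores), location, scale))
```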
Tandem Mass (MS/MS) Spectrum
MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER
MRIMVRTLRGDRVALDVD
GATTTVAQVKGMVMARER
b-ion
y-ion
Assumption: the peptide will break between every two adjacent amino acids, providing a unique sequence pattern.
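Under that assumption, the b-ion and y-ion ladders of a peptide can be computed as in the sketch below. It uses standard monoisotopic residue masses and singly charged ions only; modifications, other ion types, and higher charge states are ignored. The function name is hypothetical.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z values for every backbone cut.

    Entry k (0-based) of each list belongs to the cut after residue k + 1:
    the b-ion holds the N-terminal part, the y-ion the C-terminal part.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for cut in range(1, len(peptide)):
        b_ions.append(sum(masses[:cut]) + PROTON)
        y_ions.append(sum(masses[cut:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER")
print(len(b_ions), len(y_ions))
```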
MS/MS Fragmentation Pattern
[Figure: peptide backbone from H2N through residues R1 to R4 to COOH, with the cleavage sites labeled by the N-terminal fragment ions a1-a3, b1-b3, c1-c3 and the C-terminal fragment ions x3-x1, y3-y1, z3-z1.]
A real MS/MS spectrum with good quality
LGSSEVEQVQLVVDGVK
SEQUEST: Preliminary Score
MKFLILLFNILCLFPVLAADNHGVGPQGAS...
While parsing through the database, all peptides that match the input mass within some user-specified mass tolerance (e.g., +/- 1.0 amu) get a preliminary score (Sp):
Sp = Σ(im) * nm * (1+β) * (1+ρ) / nt
Σ(im) = sum of matched intensities
nm = number of matched fragment ions
nt = number of total fragment ions
β = fragment ion continuity factor
ρ = immonium ion factor
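The preliminary score can be written out directly from these definitions. The sketch below takes the continuity and immonium factors as given inputs, because the slides do not say how they are derived; the function name and the example numbers are hypothetical.

```python
def preliminary_score(matched_intensities, n_total, beta=0.0, rho=0.0):
    """Sp = (sum of matched intensities) * nm * (1 + beta) * (1 + rho) / nt."""
    n_matched = len(matched_intensities)
    return sum(matched_intensities) * n_matched * (1 + beta) * (1 + rho) / n_total


# Hypothetical candidate: 2 of 4 predicted fragment ions matched, no bonus factors.
print(preliminary_score([10.0, 20.0], n_total=4))
# prints 15.0
```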
X-Correlation Score
• Sequence database has been parsed.
• Candidate peptides for correlation analysis are the top
500 preliminary scoring peptides.
• A theoretical spectrum is constructed for each candidate
peptide and compared against the input spectrum via
correlation analysis.
Discrete correlation function:
R[τ] = Σ_t x[t] y[t+τ]
Calculated via Fourier Transforms:
R[τ] <=> X(f)Y*(f)
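A minimal sketch of the discrete correlation, computed by direct summation rather than by Fourier transforms so that the definition stays visible. The final score shown here, the zero-offset correlation minus the mean correlation at nearby non-zero offsets, follows the commonly described SEQUEST formulation and is an assumption, since the slides do not spell it out. Spectra are assumed to be already binned into equal-length intensity lists.

```python
def correlation(x, y, tau):
    """R[tau] = sum over t of x[t] * y[t + tau]; out-of-range samples count as zero."""
    total = 0.0
    for t in range(len(x)):
        if 0 <= t + tau < len(y):
            total += x[t] * y[t + tau]
    return total


def xcorr_score(x, y, window=75):
    """Correlation at zero offset minus the mean over offsets -window..window (tau != 0)."""
    background = [
        correlation(x, y, tau) for tau in range(-window, window + 1) if tau != 0
    ]
    return correlation(x, y, 0) - sum(background) / len(background)


# Toy binned spectra (hypothetical intensities).
experimental = [0, 1, 0, 2, 0]
theoretical = [0, 1, 0, 2, 0]
print(xcorr_score(experimental, theoretical, window=2))
# prints 4.0
```

For long spectra the direct sum is slow, which is why the Fourier-transform route on the slide is preferred in practice.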
Calculation of X-Correlation Score
[Figure: theoretical spectrum and experimental spectrum plotted against m/z.
Theoretical spectrum peaks (m/z): 88.1, 185.2, 361.5, 490.6, 561.7, 692.9, 806.0, 893.1, 1050.2, 1226.4
Experimental spectrum labeled peaks (m/z): 185.3, 255.7, 360.9, 403.0, 519.1, 662.3, 805.5, 892.6, 1007.4, 1155.5, 1226.8, 1324.8]
De Novo Sequencing Using Spectrum Graph Approach
Each node of the graph represents a peak in the spectrum.
Two nodes have an edge if and only if the two corresponding peaks differ by the mass of an amino acid.
The path that connects the two ends corresponds to a feasible solution.
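The spectrum graph idea can be sketched as follows. The peak list is a made-up ladder of prefix masses (starting at 0), the 0.02 Da tolerance is an assumption, and isoleucine is folded into leucine because the two have identical masses. Real spectra contain noise and missing peaks, which this sketch does not handle.

```python
# Monoisotopic residue masses in Da ("L" stands for both L and I).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_edges(peaks, tolerance):
    """Connect peak i to peak j when their mass gap equals one residue mass."""
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            gap = peaks[j] - peaks[i]
            for residue, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tolerance:
                    edges[i].append((j, residue))
    return edges


def enumerate_sequences(peaks, tolerance=0.02):
    """List every residue sequence spelled by a path from the first to the last peak."""
    peaks = sorted(peaks)
    edges = build_edges(peaks, tolerance)
    last = len(peaks) - 1
    sequences = []

    def walk(node, prefix):
        if node == last:
            sequences.append(prefix)
            return
        for next_node, residue in edges[node]:
            walk(next_node, prefix + residue)

    walk(0, "")
    return sequences


print(enumerate_sequences([0.0, 57.021, 128.058, 215.090]))
# prints ['GAS', 'QS']
```

The two answers arise because G plus A has the same mass as Q, so more than one path can connect the two ends of the ladder.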
Multiple paths
on the spectral ladder
From Graph to Sequence
Protein Complex
Nucleosome
Protein-Protein Interactions
Protein complexes, molecular machines
Protein interaction cascade (signal transduction)
Transient vs. stable interaction
Binary interaction vs. complex
[Figure: bait a and preys; proteins labeled a, b, d, e, f, h, k, m]
Genetic vs. Physical Interaction
Signal transduction
Complex system
Regulatory network
Physical interaction
Genetic interaction
Transcription factor
Expressed gene
Experimental methods
Yeast Two-hybrid screens
Mass Spectrometry
Immunoprecipitation
Affinity binding
Antibody blockage
Protein chips
Rosetta stone approach for predicting protein interaction
• protein A is homologous to subsequence from protein C
• protein B is homologous to subsequence from protein C
• subsequences from A and B are NOT homologous to each other
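The three conditions can be checked mechanically once homology search results are available. The sketch below assumes the hits have already been computed elsewhere and are supplied as (composite protein, start, end) regions; it also interprets "subsequence" as requiring the two matched regions of C to be non-overlapping, which is an added assumption. All names and coordinates are hypothetical.

```python
def rosetta_stone_pairs(hits, homologous_pairs):
    """Predict interacting pairs (A, B) linked through a composite protein C.

    hits: protein -> list of (composite, start, end) regions it is homologous to.
    homologous_pairs: set of frozensets naming proteins homologous to each other.
    """
    predictions = set()
    names = sorted(hits)
    for index, a in enumerate(names):
        for b in names[index + 1:]:
            if frozenset((a, b)) in homologous_pairs:
                continue  # A and B must not be homologous to each other
            for comp_a, start_a, end_a in hits[a]:
                for comp_b, start_b, end_b in hits[b]:
                    same_composite = comp_a == comp_b
                    disjoint = end_a < start_b or end_b < start_a
                    if same_composite and disjoint:
                        predictions.add((a, b, comp_a))
    return predictions


# Hypothetical hits: A and B match separate regions of C, D overlaps both.
hits = {
    "A": [("C", 1, 200)],
    "B": [("C", 250, 480)],
    "D": [("C", 150, 300)],
}
print(rosetta_stone_pairs(hits, homologous_pairs=set()))
```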
Online Databases
Database      URL                                                      Database size (binary / complex)
DIP           http://dip.doe-mbi.ucla.edu                              18,000
BIND          http://binddb.org                                        6171 / 851
MIPS          http://mips.gsf.de/proj/yeast/CYGD/                      11,200 / 1050
MINT          http://cbm.bio.uniroma2.it/mint/                         3786 / 782
BRITE         http://www.genome.ad.jp/brite/                           5506
PathCalling   http://genome.c.kanazawa-u.ac.jp/Y2H/                    957
Interact      http://www.bioinf.man.ac.uk/resources/interactpr.shtml   1000
PIMRider      http://pim.hybrigenics.com/                              1400
GRID          http://biodata.mshri.on.ca/grid/                         14,318
200
Yeast Protein Interaction Network
Deletion phenotype:
Red = lethal
Green = non-lethal
Orange = slow growth
Yellow = unknown
An example of a scale-free network
Most nodes have few connections
A small number of nodes (network hubs) are connected to a large number of other nodes
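Hubs can be found by counting how many interactions each protein takes part in. The sketch below uses a made-up edge list and an arbitrary degree cutoff; it only illustrates the idea of node degree and does not test whether a network is actually scale-free.

```python
from collections import Counter


def degree_counts(edges):
    """Number of interaction partners (edges) per node."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def find_hubs(edges, min_degree):
    """Nodes with at least `min_degree` connections."""
    return sorted(node for node, d in degree_counts(edges).items() if d >= min_degree)


# Hypothetical toy network: H is connected to everything, the rest barely at all.
edges = [("H", "p1"), ("H", "p2"), ("H", "p3"), ("H", "p4"), ("H", "p5"), ("p1", "p2")]

print(find_hubs(edges, min_degree=4))
# prints ['H']
print(Counter(degree_counts(edges).values()))
```

The second print gives the degree distribution (how many nodes have each degree), which is the quantity inspected when judging whether most nodes have few connections.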
PPI Viewer
o Protein-Protein-Interaction and Complex Viewer
o http://mips.gsf.de/proj/yeast/CYGD/interaction/
o Search ste20 (YHL007c, STE20, Ste20p, ste20Δ)
Binary interaction:
cdc28 >genetic< ste20
Bem1p >physical< Ste20p
Ste20p >physical< Prp20p
...
Complex data (Bait: Rad1p):
Rad1p, Car2p, Dun1p, Far1p, Gpd1p, Gpd2p, Msi1p, Pdc6p, Sec6p, Sen1p, Ste20p, Ubi4p, YDR324c, YGR086c, YHR033w, YLR368w, YNL116w, YPL004c
Protein Interaction Graph
http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast
Predict cellular function
for hypothetical protein
Function inference based on neighbors
Consensus approach
Markov random field
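A minimal sketch of the consensus approach: each annotated interaction partner votes for its own functions, and the functions are ranked by vote share. The protein names and annotations are made up, and the Markov random field approach is not implemented here.

```python
from collections import Counter


def predict_function(protein, interactions, annotations):
    """Rank functions of `protein` by the share of votes from its annotated neighbors."""
    votes = Counter()
    for a, b in interactions:
        if protein in (a, b):
            neighbor = b if a == protein else a
            for function in annotations.get(neighbor, []):
                votes[function] += 1
    total = sum(votes.values())
    return [(function, count / total) for function, count in votes.most_common()]


# Hypothetical data: HYP1 is the unannotated protein.
interactions = [("HYP1", "P1"), ("HYP1", "P2"), ("P3", "HYP1"), ("P1", "P2")]
annotations = {"P1": ["transport"], "P2": ["transport"], "P3": ["signaling"]}

print(predict_function("HYP1", interactions, annotations))
```

The vote share gives a rough confidence for each predicted function; a protein with no annotated neighbors simply returns an empty list.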
Overview of Signal Transduction
Stimuli (signal)
CELL
Secretion
Motility
Metabolism
Genetic Transfer
Cell-Cell Communication
Gene Transcription
Sporulation/
Apoptosis
Essential for understanding disease and designing drugs
Problem Formulation
signal
sensor
1. Define cascade proteins
2. Find interaction path
Protein-1
Protein-2
Protein-3
Gene-1
Gene-2
transcription factor
Nucleus
Cascade of (physical) protein interaction chains
Finding a plausible signal cascade path
Short path
Biologically meaningful (function, subcellular location)
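The "short path" criterion can be sketched with a breadth-first search over the physical interaction network, from the sensor to the transcription factor. The network below is made up, and the second criterion (biological plausibility from function and subcellular location) is not modeled; in practice it would be used to filter or re-rank candidate paths.

```python
from collections import deque


def shortest_path(interactions, source, target):
    """Breadth-first search for a shortest interaction chain from source to target."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for next_node in sorted(neighbors.get(node, ())):
            if next_node not in previous:
                previous[next_node] = node
                queue.append(next_node)
    return None  # no chain of interactions connects the two proteins


# Hypothetical network with a short and a long route from sensor to TF.
interactions = [
    ("sensor", "P1"), ("P1", "P2"), ("P2", "TF"),
    ("sensor", "P3"), ("P3", "P4"), ("P4", "P5"), ("P5", "TF"),
]
print(shortest_path(interactions, "sensor", "TF"))
# prints ['sensor', 'P1', 'P2', 'TF']
```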
Pathway Construction for Amino Acid Transport in Yeast
[Figure: Working Model of the amino acid and peptide transport pathway.
Labels: poor nitrogen; rich amino acid; GAP1...; BAP2...; (general); (specific); PTR2; peptide transport; Dipeptide.
Proteins: Ssy1p, Ptr3p, Ssy5p, Ubc4p, Ubc2p, Ptr1p, Cup9p, Aut10p, Tup1p, Mai1p, Ssn6p, YPL158C, Cln1p, Cdc28p, Vma22p, Rpn6p, Jsn1p, Clb3p, Rtg3p, Gcn4p, Pre1p, Cns1p, Mig1p, Sho1p, Gln3p, Dal80p, Stp1p, Ptr2p, Gap1p, Bap2p, Tat2p.
Linked processes: Amino acid synthesis; Energy metabolism; Glucose metabolism.
Edge legend: Two hybrid; Complex from Mass; Coprecipitate or pull-down; Other biochemical methods; Transcriptional control.]
Reading Assignments
Suggested reading:
http://www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
Yu Chen and Dong Xu. Computational Analyses of High-Throughput Protein-Protein Interaction Data. Current Protein and Peptide Science. 4:159-181. 2003.
Optional reading:
www.bio.davidson.edu/courses/genomics/proteomics.html
Optional Assignment (1)
1. Make a yeast protein-interaction network connecting Rho2p, Rom2p, Ste20p, and Pfy1p. Use binary physical protein-protein interactions to connect all the edges. Try to make the network as simple as possible (i.e., involving few proteins).
2. Can you predict the function of the yeast gene YLR269C based on high-throughput protein-protein interaction data? How confident are you in this prediction?
Optional Assignment (2)
3. A protein complex was identified containing Rpn5p, Rri1p, YDR179Cp, YIL071Cp, YMR025Wp, YOL117Wp. Can you find the bait of this complex? How many possible binary interactions in this complex can be verified by yeast two-hybrid data?
4. It is known that Cup9p is degraded by the 26S proteasome. Identify as many proteins in the yeast 26S proteasome as possible. Find a physical interaction network between proteins in the 26S proteasome and Cup9p.