Protein Sequence - University of California, Davis
Protein Sequence
Amino Acid Composition
IEC
RP HPLC
Ancient Sequencing methods
Modern Sequencing methods
Sequencing the Gene
Then what?
Amino Acid Composition
1952 - Complete Acid Hydrolysis
Ion Exchange Chromatography with programmed
buffer changes (~3 hr)
Post-column derivatization with
Ninhydrin
Fluorescamine
1980 - Complete Acid Hydrolysis
Precolumn derivatization to Phenylthiohydantoins
Reversed-Phase HPLC (~30 min)
Sequencing
Sanger Endgroup Analysis
Modify the protein with fluorodinitrobenzene
(amines), aka FDNB, Sanger’s reagent.
Alternative reagent, dansyl chloride, fluorescent.
Hydrolyze protein
Separate by TLC
Identify N-terminal amino acid by Rf
Treat protein with Aminopeptidase
Repeat until the end gets ragged
Use proteolytic fragments for simplicity
Sequencing
Generate proteolytic fragments
Use more than one protease in separate experiments
Trypsin cleaves after Arg and Lys residues
Chymotrypsin cleaves after Phe, Tyr, Trp
Separate fragments (HV paper electrophoresis/HPLC)
Sequence all peptides independently
Assemble the sequence using overlap info
[Figure: overlapping peptide maps from the trypsin and chymotrypsin (Chtr) digests]
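The cleavage rules above can be sketched in a few lines. This is a minimal illustration, not a full protease model: the function name `digest` and the test peptide are hypothetical, and real enzymes have exceptions (for example, reduced cleavage before proline) that are ignored here.

```python
# Minimal in-silico digest: cut the chain after every residue in `cleave_after`.
# Assumption: simple "cleaves after X" rules only; no missed cleavages,
# no proline exception.

TRYPSIN = "KR"        # cleaves after Lys (K) and Arg (R)
CHYMOTRYPSIN = "FYW"  # cleaves after Phe (F), Tyr (Y), Trp (W)


def digest(sequence, cleave_after):
    """Return the list of fragments, in order from N- to C-terminus."""
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in cleave_after:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    return fragments


peptide = "AFGKWLRYS"  # hypothetical example sequence
print(digest(peptide, TRYPSIN))       # tryptic fragments
print(digest(peptide, CHYMOTRYPSIN))  # chymotryptic fragments
```

Because the two enzymes cut at different positions, each chymotryptic fragment spans a junction between tryptic fragments, which is the overlap information used to order the peptides.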
Automated Sequencing
Use proteolytic fragments
Sequence each peptide using automated
Edman Degradation
Each Edman cycle removes one amino acid
Converts it to PTH amino acid for HPLC
Assemble the sequence using overlap info
[Figure: overlapping peptide maps from the trypsin and chymotrypsin (Chtr) digests]
N-Terminal Edman Degradation
[Reaction scheme: peptide + phenylisothiocyanate; structures omitted]
Peptide: attack on phenylisothiocyanate
+ H+: releases the anilinothiazolinone amino acid and Peptide N-1
Rearrangement: anilinothiazolinone amino acid to PTH-amino acid
PTH-amino acid
Absorbs 260-275 nm
RP-HPLC compatible
C-Terminal Edman Degradation
[Reaction scheme; structures omitted]
Activation of carboxyl by acetic anhydride
Attack by thiocyanate
Hydrolysis (+H2O): releases Peptide N-1 and the TH-amino acid
Alternative Sequencing - MS
Use non-fragmenting ionization
Electrospray ionization (ESI) + traditional mass spec
Matrix-assisted laser desorption-ionization + time-of-flight mass spec (MALDI-TOF)
Measures mass of mature, intact protein
and/or complexes
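To compare a measured intact mass with a sequence, the expected mass can be estimated from the residues. The sketch below is illustrative only: the names `protein_mass` and `esi_mz` are hypothetical, the table holds standard average residue masses in daltons, and disulfides, cofactors, and post-translational modifications of a mature protein are not accounted for.

```python
# Average residue masses (Da): each amino acid minus one water.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528   # one water for the free N- and C-termini
PROTON = 1.00728   # mass added per positive charge in ESI


def protein_mass(sequence):
    """Average mass of an unmodified, linear polypeptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def esi_mz(mass, charge):
    """m/z of the [M + zH]z+ ion seen in an electrospray spectrum."""
    return (mass + charge * PROTON) / charge


m = protein_mass("MGKARSMVLKHSTKARS")
print(round(m, 2), [round(esi_mz(m, z), 2) for z in (1, 2, 3)])
```

A difference between the measured and calculated mass is then a hint that the mature protein has been processed or modified.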
Sequencing the Gene
DNA synthesis in vitro requires
Template (the DNA you want to sequence)
Primer (complementary to region upstream of where you want to
sequence)
Polymerase
dXTPs, Mg2+
Primer pairs with template, free 3’-OH group ready for
action
As dXTPs base-pair with template, the 3’-OH attacks the
α-phosphate of the dXTP, displacing PPi, making a
phosphodiester, extending the nascent DNA chain by one
base
The Polymerase Reaction
Elongation of a primer that
is base-paired with a template
Requires a free 3’-OH group
[Figure: the primer’s free 3’-OH attacks the incoming nucleoside triphosphate; structures omitted]
Primer:   5’ A G C A A C C A T T A A T 3’-OH
Template: 3’ T C G T T G G T A A T T A C T A G A A T T C A 5’
Di-deoxy Terminators
If 2’, 3’-dideoxy nucleoside triphosphates were used, the reaction
would proceed for only one cycle because there would be no free
3’-OH group to attack the next dXTP
If a fraction of a percent of ONE 2’, 3’-dideoxy nucleoside
triphosphate (say ddTTP) were used
SOME polymer would be terminated EACH time that base was
incorporated, i.e., each time dA occurs in the template.
If 1/1000th of the dTTP were ddTTP, then 1/1000th of the polymers
would terminate at each dA in the template… the rest would continue
You would get many polymers of different sizes, each corresponding
to the occurrence of a dA in the template
Use four separate reactions, one with ddTTP, one with ddATP,
one with ddGTP, and one with ddCTP (and all other components)
For every base in the template, one of the four reaction
mixtures would contain a polymer terminated at that position
[Figure: gel with one lane per reaction (ddTTP, ddCTP, ddGTP, ddATP), labeled "Agarose gel", bands running from large to small; columns "Base in polymer" and "Sequence of template"]
Use fluorescent or radioactive primer so you can see every polymer
Separate them by size (gel electrophoresis)
Read sequence of polymers from gel
Infer the sequence of the template by Watson-Crick base pairing
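The logic of the four reactions can be simulated directly. This is a toy sketch with hypothetical function names and an illustrative template; product lengths are counted from the end of the primer, and it assumes every possible terminated product is present in detectable amounts.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def new_strand(template_3to5):
    """Strand synthesized 5'->3' from a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def lane(template_3to5, dd_base):
    """Product lengths in the reaction containing one dideoxy base."""
    strand = new_strand(template_3to5)
    return [i + 1 for i, base in enumerate(strand) if base == dd_base]


def read_gel(template_3to5):
    """Read the bands from smallest to largest across the four lanes."""
    lanes = {base: lane(template_3to5, base) for base in "ACGT"}
    base_at_length = {n: b for b, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))


template = "CTAGAATTCA"                          # written 3'->5', illustrative
polymer = read_gel(template)                     # sequence read from the gel
inferred = "".join(COMPLEMENT[b] for b in polymer)  # Watson-Crick inference
print(lane(template, "T"), polymer, inferred)
```

Each length appears in exactly one lane, so sorting the bands by size spells out the new strand, and its complement is the template.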
Dideoxy Terminators
3’ A T G T C A C A G G A C A G A 5’
5’ T A C A G T C T C C T G T C T 3’
A, T, G, and C. What are the Amino Acids?
Standard Genetic Code
First base at left, second base across the top, third base at right; *** = stop

First    U          C          A          G          Third
U        UUU Phe    UCU Ser    UAU Tyr    UGU Cys    U
         UUC Phe    UCC Ser    UAC Tyr    UGC Cys    C
         UUA Leu    UCA Ser    UAA ***    UGA ***    A
         UUG Leu    UCG Ser    UAG ***    UGG Trp    G
C        CUU Leu    CCU Pro    CAU His    CGU Arg    U
         CUC Leu    CCC Pro    CAC His    CGC Arg    C
         CUA Leu    CCA Pro    CAA Gln    CGA Arg    A
         CUG Leu    CCG Pro    CAG Gln    CGG Arg    G
A        AUU Ile    ACU Thr    AAU Asn    AGU Ser    U
         AUC Ile    ACC Thr    AAC Asn    AGC Ser    C
         AUA Ile    ACA Thr    AAA Lys    AGA Arg    A
         AUG Met    ACG Thr    AAG Lys    AGG Arg    G
G        GUU Val    GCU Ala    GAU Asp    GGU Gly    U
         GUC Val    GCC Ala    GAC Asp    GGC Gly    C
         GUA Val    GCA Ala    GAA Glu    GGA Gly    A
         GUG Val    GCG Ala    GAG Glu    GGG Gly    G
ORFs - Look for longest uninterrupted sequence
Protein Sequence from Nucleotide Sequence
5' GCCCTTTCTAAAATGTCCAAAATGGCGCAAACCAAACTGTATGATGTGA 3'   Coding Strand
3' CGGGAAAGATTTTACAGGTTTTACCGCGTTTGGTTTGACATACTACACT 5'   Template Strand
5' GCCCUUUCUAAAAUGUCCAAAAUGGCGCAAACCAAACUGUAUGAUGUGA 3'   Message
Frame 1: A L S K M S K M A Q T K L Y D V ...
Frame 2: P F L K C P K W R K P N C M M * ...
Frame 3: P F * N V Q N G A N Q T V * C E ...
How do you know which strand is the coding strand?
You don't... There are six possible frames for translation.
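The six possible frames can be generated mechanically. The sketch below uses hypothetical function names; the codon table is built from the standard genetic code in DNA letters (T in place of U), and the input is the 49-base coding-strand fragment shown on this slide.

```python
# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(offset, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Three frames on the given strand (+) and three on its complement (-)."""
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[label + str(offset + 1)] = translate(strand, offset)
    return frames


CODING = "GCCCTTTCTAAAATGTCCAAAATGGCGCAAACCAAACTGTATGATGTGA"
for name, protein in six_frames(CODING).items():
    print(name, protein)
```

Looking for the longest stretch without a `*` in each of the six outputs is one simple way to pick a candidate ORF.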
So, you’ve got the sequence…So what?
Next topic: Bioinformatics
Inferences based on homology
Questions
1. Has the gene been sequenced before? (Will I be able to publish?)
2. What is the sequence of the protein encoded by the gene?
3. Has the protein been sequenced before?
4. Is the gene similar to one that has been sequenced before?
   1. Did I sequence the right gene?
   2. Will I be able to find structural or functional relatives?
5. Is the protein similar to one that has been sequenced before?
   1. How similar?
   2. What does the similarity mean?
6. Can I predict the function of the gene product, or is the predicted function
   consistent with what I know about the protein?
7. Can I get information about structural features of the gene product?
   1. Secondary structure
   2. Folding domains or other common patterns
   3. Hydropathy profiles
      1. How might predicted helices and/or sheets pack?
      2. Is it likely to be a membrane protein, a transmembrane protein?
Answers: Sequence Similarities
and Similarity Searches
1. Search sequence databases for homologous proteins.
2. Find families of proteins that are similar to your protein.
3. Use information about the structure and properties of
the similar protein(s) to establish inferences about your
protein. If the exact sequence is in the database, the
similarity search routines will find that, too.
4. Determine whether two sequences are related (or
identical) by aligning them so that homologous regions
are adjacent.
5. For two identical sequences:
MGKARSMVLKHSTKARS
MGKARSMVLKHSTKARS
But, what about:
Imperfect homology
MGKARSMLLKHSTKARS
MGKARTMVLKHSTRARS
Gaps/insertions
MGKARSMLLKHSLKARS
MGRA----LKHSLRART
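Once two sequences are lined up, the crudest measure of similarity is a count of identical columns. This sketch uses a hypothetical helper, `identical_columns`, and assumes the gap in the last example is written with `-` characters so that both strings have the same length.

```python
def identical_columns(a, b):
    """Count aligned positions where both sequences carry the same residue.

    Assumes `a` and `b` are already aligned (equal length, '-' for a gap).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


pairs = [
    ("MGKARSMVLKHSTKARS", "MGKARSMVLKHSTKARS"),  # identical
    ("MGKARSMLLKHSTKARS", "MGKARTMVLKHSTRARS"),  # imperfect homology
    ("MGKARSMLLKHSLKARS", "MGRA----LKHSLRART"),  # gap
]
for a, b in pairs:
    n = identical_columns(a, b)
    print(n, "of", len(a), "columns identical")
```

A plain identity count treats every substitution as equally bad and every gap as free, which is why similarity scores and gap penalties are needed.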
And, how homologous is homologous?
Need
Similarity scores for pairs of amino acids
Method for dealing with gaps
Algorithms for comparing a sequence
with a database
Ways to assess the degree of homology
Ways to link structural info with
sequence info
Dynamic Programming
Needleman-Wunsch Algorithm
Compares similarity of two proteins a & b at
positions i & j:
$NW_{i,j} = \max(NW_{i-1,j-1} + s(a_i, b_j),\ NW_{i-1,j} + g,\ NW_{i,j-1} + g)$
$NW_{i-1,j-1}$ = running total
$s(a_i, b_j)$ = similarity between residue i of protein a and
residue j of protein b
g = gap penalty
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
Fill a Matrix with all possibilities
Simple example: s = 1 (match) or 0 (mismatch), and g = 0
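The matrix fill can be written out as a short function. This is a minimal sketch of the recurrence, with hypothetical parameter names; the defaults follow the simple example (match = 1, mismatch = 0, gap = 0), and only the final score is returned, not the traceback that recovers the alignment itself.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment score: fill the (len(a)+1) x (len(b)+1) matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    nw = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned to gaps only
        nw[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned to gaps only
        nw[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            nw[i][j] = max(nw[i - 1][j - 1] + s,   # align a[i] with b[j]
                           nw[i - 1][j] + gap,     # gap in b
                           nw[i][j - 1] + gap)     # gap in a
    return nw[-1][-1]


print(needleman_wunsch("MGKARSMLLKHSLKARS", "MGRALKHSLRART"))
print(needleman_wunsch("MGKARSMLLKHSLKARS", "MGRALKHSLRART",
                       match=1, mismatch=-1, gap=-2))
```

With g = 0 and s = 1 or 0 the score is just the largest number of matches that can be kept in order; a negative gap penalty makes the algorithm weigh each gap against the matches it buys.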
Smith-Waterman
Always compare NW terms to zero so
that the running score never drops below zero.
$NW_{i,j} = \max(NW_{i-1,j-1} + s(a_i, b_j),\ NW_{i-1,j} + g,\ NW_{i,j-1} + g,\ 0)$
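The only changes from the global version are the extra zero in the max and reporting the best cell anywhere in the matrix. The sketch below is illustrative; the scoring values (match = 2, mismatch = -1, gap = -2) are assumptions chosen for the example, not values from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between any substrings of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    sw = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            sw[i][j] = max(sw[i - 1][j - 1] + s,
                           sw[i - 1][j] + gap,
                           sw[i][j - 1] + gap,
                           0)               # restart instead of going negative
            best = max(best, sw[i][j])
    return best


print(smith_waterman("TTACGTT", "GGACGGG"))  # local match "ACG"
```

Because a poorly matching region resets the score to zero rather than dragging it down, a short conserved stretch inside two otherwise unrelated sequences still scores well.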
BLAST & FASTA
FASTA - great, we won’t talk about it
much faster and more selective than SW,
but less sensitive
Basic Local Alignment Search Tool
less selective and more sensitive than
FASTA,
i.e., you may get more hits, but some of
them may be wrong
BLAST
Divide sequence into “words” of length W (e.g.,
BLASTp, initial W = 3)
Compare all W-length words
Retain only pairs with similarity above a
threshold, T
Call them High-Scoring Pairs (HSPs)
Increase W, repeat with HSPs
Keep going while the pairs remain above a minimum similarity,
and compare to random probability (E)
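The first step, cutting sequences into words and finding matching pairs, can be sketched as follows. This is a deliberate simplification with hypothetical function names: it keeps only exactly identical words, whereas the procedure above retains any word pair scoring above the threshold T, and it stops before the extension and E-value steps.

```python
def words(sequence, w=3):
    """All overlapping words of length w, in order."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def seed_hits(query, subject, w=3):
    """(query position, subject position) for every identical word pair."""
    index = {}
    for j, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(j)  # where each word occurs
    return [(i, j)
            for i, word in enumerate(words(query, w))
            for j in index.get(word, [])]


print(words("MGKARS"))
print(seed_hits("MGKARS", "AAKARSA"))
```

Indexing the words once is what makes the search fast: most of a database never has to be aligned, only the neighborhoods of seed hits.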
Scoring Matrices: Making similarity quantitative
Compare the actual frequency to the
frequency expected by chance alone.
Probability that alanine appears at position x
in a protein
= fraction of Ala in all proteins
= $p_{Ala}$
Probability that one protein has Ala at position
x, and another protein has Gly?
= $p_{Ala} p_{Gly}$
The frequency due to chance alone.
Similarity
$q_{Ala,Gly}$ = ACTUAL frequency that Ala and
Gly are at position x in two proteins (in
your database)
$R_{i,j} = q_{i,j} / (p_i p_j)$
Score: $S_{i,j} = \log_2 R_{i,j} = \log_2 \frac{q_{i,j}}{p_i p_j}$
“Log-Odds Scores”
General form, with a scaling constant $\lambda$: $S_{i,j} = \frac{1}{\lambda} \log \frac{q_{i,j}}{p_i p_j}$
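A worked example makes the sign of the score concrete. The frequencies below are made-up round numbers chosen for easy arithmetic, not values from any real database, and `log_odds` is a hypothetical helper name.

```python
import math


def log_odds(q_ij, p_i, p_j):
    """Score in bits: log2 of observed pair frequency over chance frequency."""
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies: each residue makes up half of all positions,
# so the chance frequency of the pair is 0.5 * 0.5 = 0.25.
print(log_odds(0.50, 0.5, 0.5))    # pair seen twice as often as chance
print(log_odds(0.25, 0.5, 0.5))    # pair seen exactly as often as chance
print(log_odds(0.125, 0.5, 0.5))   # pair seen half as often as chance
```

Positive scores mark substitutions observed more often than chance predicts, zero marks chance level, and negative scores mark substitutions that are disfavored.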
Remember Chou & Fasman?
PAM Matrices
Margaret Dayhoff assembled the Atlas of Protein
Sequence and Structure
Evolutionarily accepted mutations
Calculated $q_{i,j}$ for all aa’s in closely related
proteins
These were accepted by Nature as similar/close
enough
Generate half matrices: Point Accepted
Mutation/Percent Accepted Mutations
Scale, so PAM1 reflects 1 accepted mutation per 100
residues; PAM50, 50 accepted mutations per 100 residues
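In the Dayhoff model, matrices for larger PAM distances are obtained by multiplying the PAM1 mutation-probability matrix by itself. The sketch below shows that scaling step on a toy two-letter alphabet; the matrix values and the names `TOY_PAM1` and `pam_n` are assumptions for illustration, not real amino acid data.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Mutation-probability matrix after n PAM units (pam1 to the n-th power)."""
    result = pam1
    for _ in range(n - 1):
        result = matmul(result, pam1)
    return result


# Toy PAM1: each residue has a 1% chance of changing per PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

pam50 = pam_n(TOY_PAM1, 50)
print(pam50[0][0])  # probability a residue is unchanged after 50 PAM
```

Because repeated and reverted changes at the same position are counted as separate mutations, the fraction of positions that actually differ after 50 PAM is smaller than 50 percent.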
BLOSUM
Henikoff and Henikoff
BLOcks of Amino Acid SUbstitution
Matrix
BLOCKS is a database of related
proteins
BLAST Search
Go to BLAST Website
Enter Nucleotide or AA sequence
Choose BLAST type
Nucleotide-nucleotide: BLASTn
Protein-protein: BLASTp
6-frame-translated nucleotide vs. protein: BLASTx
Others
Then?
Does it make sense?
Multisequence Alignment
Secondary structure prediction
Domains
Families
Caveat
It ain't what you don't know that'll kill you,
it's what you know that ain't so.