Protein Sequence - University of California, Davis

Protein Sequence

Amino Acid Composition
  IEC
  RP HPLC
Ancient Sequencing methods
Modern Sequencing methods
Sequencing the Gene
Then what?
Amino Acid Composition

1952 - Complete Acid Hydrolysis
  Ion Exchange Chromatography with programmed buffer changes (~3 hr)
  Post-column derivatization with
    Ninhydrin
    Fluorescamine
1980 - Complete Acid Hydrolysis
  Precolumn derivatization to Phenylthiohydantoins
  Reversed-Phase HPLC (~30 min)
Sequencing

Sanger Endgroup Analysis
  Modify the protein with fluorodinitrobenzene (FDNB, Sanger’s reagent), which reacts with amines.
  Alternative reagent: dansyl chloride (fluorescent).
  Hydrolyze the protein
  Separate by TLC
  Identify the N-terminal amino acid by Rf
  Treat the protein with aminopeptidase
  Repeat until the end gets ragged
  Use proteolytic fragments for simplicity
Sequencing

Generate proteolytic fragments
  Use more than one protease in separate experiments
    Trypsin cleaves after Arg and Lys residues
    Chymotrypsin cleaves after Phe, Tyr, Trp
  Separate fragments (high-voltage paper electrophoresis/HPLC)
  Sequence all peptides independently
  Assemble the sequence using overlap info
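The cleavage rules above can be tried out in silico. The following is a minimal sketch, assuming the simple "cut after these residues" rule exactly as stated on the slide (real proteases have additional exceptions that are ignored here). The function name `digest` and the example protein are hypothetical.

```python
def digest(protein, cleave_after):
    """Split a protein after every residue listed in cleave_after."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        if residue in cleave_after:
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal piece that does not end in a cleavage site.
        fragments.append(protein[start:])
    return fragments


TRYPSIN = "KR"         # cleaves after Lys (K) and Arg (R)
CHYMOTRYPSIN = "FYW"   # cleaves after Phe (F), Tyr (Y), Trp (W)

protein = "GAKFLRSWMKYDE"  # hypothetical sequence, for illustration only
tryptic = digest(protein, TRYPSIN)
chymotryptic = digest(protein, CHYMOTRYPSIN)
print(tryptic)        # ['GAK', 'FLR', 'SWMK', 'YDE']
print(chymotryptic)   # ['GAKF', 'LRSW', 'MKY', 'DE']
```

Because the two proteases cut at different residues, each chymotryptic peptide spans a junction between two tryptic peptides, which is what makes the overlap step possible.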
[Figure: fragments from the Trypsin and Chtr (chymotrypsin) digests]
Automated Sequencing
Use proteolytic fragments
Sequence each peptide using automated Edman Degradation
  Each Edman cycle removes one amino acid
  Converts it to a PTH amino acid for HPLC
Assemble the sequence using overlap info
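Assembly from overlaps can be sketched as a greedy merge: repeatedly join the two peptides whose ends overlap the most. This is only an illustration under simplifying assumptions (error-free peptide sequences, no repeated motifs that create false overlaps). The function names and the peptide list are hypothetical; the peptides are what trypsin and chymotrypsin would give for the made-up protein GAKFLRSWMKYDE.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments):
    """Greedily merge peptides by their longest suffix/prefix overlap."""
    frags = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    # Drop peptides that lie entirely inside another peptide.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_a, best_b = 0, None, None
        for a in frags:
            for b in frags:
                if a != b:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # nothing overlaps any more; return the pieces we have
        frags.remove(best_a)
        frags.remove(best_b)
        frags.append(best_a + best_b[best_k:])
    return frags


tryptic = ["GAK", "FLR", "SWMK", "YDE"]       # hypothetical trypsin peptides
chymotryptic = ["GAKF", "LRSW", "MKY", "DE"]  # hypothetical chymotrypsin peptides
print(assemble(tryptic + chymotryptic))       # ['GAKFLRSWMKYDE']
```

With real data, one-residue overlaps are often ambiguous, which is one reason to use more than one protease and as many overlapping peptides as possible.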
[Figure: fragments from the Trypsin and Chtr (chymotrypsin) digests]
N-Terminal Edman Degradation
[Reaction scheme: Peptide + Phenylisothiocyanate
  -> attack on phenylisothiocyanate
  -> rearrangement (+ H+)
  -> anilinothiazolinone amino acid + Peptide N-1
  -> PTH-amino acid]
PTH-amino acid
  Absorbs 260-275 nm
  RP-HPLC compatible
C-Terminal Edman Degradation
[Reaction scheme: Peptide
  -> activation of carboxyl by acetic anhydride
  -> attack by thiocyanate
  -> hydrolysis (+H2O)
  -> Peptide N-1 + TH-amino acid]
Alternative Sequencing - MS
Use non-fragmenting ionization
  Electrospray ionization + traditional mass spec
  Matrix-assisted laser desorption-ionization + time-of-flight mass spec (MALDI-TOF)
Measures mass of mature, intact protein and/or complexes
Sequencing the Gene
DNA synthesis in vitro requires
  Template (the DNA you want to sequence)
  Primer (complementary to the region upstream of where you want to sequence)
  Polymerase
  dXTPs, Mg++
Primer pairs with template, free 3’-OH group ready for action
As dXTPs base-pair with the template, the 3’-OH attacks the α-phosphate of the dXTP, displacing PPi, making a phosphodiester, and extending the nascent DNA chain by one base
The Polymerase Reaction
Elongation of a primer that is base-paired with a template
Requires a free 3’-OH group
[Figure: chemical structures of the primer 3’ end and the incoming dXTP]
5’ A G C A A C C A T T A A T 3’ (primer, free 3’-OH)
3’ T C G T T G G T A A T T A C T A G A A T T C A 5’ (template)
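Primer extension can be mimicked with a few lines of code. This is a rough sketch that models only Watson-Crick pairing, not the chemistry. It assumes the template is written 3'→5' and the primer 5'→3', so that position k of one pairs with position k of the other, as the sequences in the figure appear to be drawn. `extend_primer` is a hypothetical name.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def extend_primer(template_3to5, primer_5to3):
    """Extend a base-paired primer to the end of the template, one base at a time."""
    paired = template_3to5[:len(primer_5to3)]
    if any(COMPLEMENT[t] != p for t, p in zip(paired, primer_5to3)):
        raise ValueError("primer is not complementary to the template")
    product = list(primer_5to3)
    for base in template_3to5[len(primer_5to3):]:
        # Each cycle adds the dXTP that pairs with the next template base
        # to the free 3'-OH end of the growing chain.
        product.append(COMPLEMENT[base])
    return "".join(product)


template = "TCGTTGGTAATTACTAGAATTCA"  # template from the figure, read 3'->5'
primer = "AGCAACCATTAAT"              # primer from the figure, read 5'->3'
print(extend_primer(template, primer))  # AGCAACCATTAATGATCTTAAGT
```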
Di-deoxy Terminators
If 2’, 3’-dideoxy nucleoside triphosphates were used, the reaction would proceed for only one cycle because there would be no free 3’-OH group to attack the next dXTP
If a fraction of a percent of ONE 2’, 3’-dideoxy nucleoside triphosphate (say ddTTP) were used
  SOME polymer would be terminated EACH time that base was incorporated, i.e., each time dA occurs in the template.
  If 1/1000th of the dTTP were ddTTP, then about 1/1000th of the growing polymers would terminate at each dA in the template… the rest would continue
  You would get many polymers of different sizes, each corresponding to the occurrence of a dA in the template
Use four separate reactions, one with ddTTP, one with ddATP, one with ddGTP, and one with ddCTP (and all other components)
  For every base in the template, one of the four reaction mixtures would contain a polymer that terminated at that base

Use fluorescent or radioactive primer so you can see every polymer
Separate them by size (gel electrophoresis)
Read sequence of polymers from gel
Infer the sequence of the template by Watson-Crick base pairing
[Figure: agarose gel with lanes ddTTP, ddCTP, ddGTP, ddATP, fragments running from small to large; columns give the base in the polymer and the sequence of the template]
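The four-reaction scheme and the gel read-out can be sketched as follows. This is an idealized model, assuming every possible termination product is present and resolved as its own band. The function names and the short template are hypothetical, and primer length is ignored because it adds the same offset to every fragment.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def termination_lengths(template_3to5):
    """For each ddXTP reaction, list the lengths of the terminated polymers."""
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(template_3to5, start=1):
        # A polymer of this length ends with the base that pairs with
        # this template position, so it appears in that ddXTP lane.
        lanes[COMPLEMENT[base]].append(length)
    return lanes


def read_gel(lanes):
    """Read the bands from smallest to largest fragment."""
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)


template = "ATGTCA"                      # hypothetical template, read 3'->5'
lanes = termination_lengths(template)
polymer = read_gel(lanes)                # sequence of the new strand, 5'->3'
inferred = "".join(COMPLEMENT[b] for b in polymer)
print(lanes)     # {'A': [2, 4], 'C': [3], 'G': [5], 'T': [1, 6]}
print(polymer)   # TACAGT
print(inferred)  # ATGTCA
```

Note that the gel gives the sequence of the synthesized polymer directly; the template sequence is the inferred complement.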
Dideoxy Terminators
3’ A T G T C A C A G G A C A G A 5’
5’ T A C A G T C T C C T G T C T 3’
A, T, G, and C. What are the Amino Acids?
Standard Genetic Code
First / Second   U         C         A         G
U                UUU Phe   UCU Ser   UAU Tyr   UGU Cys
                 UUC Phe   UCC Ser   UAC Tyr   UGC Cys
                 UUA Leu   UCA Ser   UAA ***   UGA ***
                 UUG Leu   UCG Ser   UAG ***   UGG Trp
C                CUU Leu   CCU Pro   CAU His   CGU Arg
                 CUC Leu   CCC Pro   CAC His   CGC Arg
                 CUA Leu   CCA Pro   CAA Gln   CGA Arg
                 CUG Leu   CCG Pro   CAG Gln   CGG Arg
A                AUU Ile   ACU Thr   AAU Asn   AGU Ser
                 AUC Ile   ACC Thr   AAC Asn   AGC Ser
                 AUA Ile   ACA Thr   AAA Lys   AGA Arg
                 AUG Met   ACG Thr   AAG Lys   AGG Arg
G                GUU Val   GCU Ala   GAU Asp   GGU Gly
                 GUC Val   GCC Ala   GAC Asp   GGC Gly
                 GUA Val   GCA Ala   GAA Glu   GGA Gly
                 GUG Val   GCG Ala   GAG Glu   GGG Gly
(*** = stop codon)
Protein Sequence from Nucleotide Sequence
ORFs - Look for longest uninterrupted sequence
5' GCCCTTTCTAAAATGTCCAAAATGGCGCAAACCAAACTGTATGATGTGA 3'  Coding Strand
3' CGGGAAAGATTTTACAGGTTTTACCGCGTTTGGTTTGACATACTACACT 5'  Template Strand
5' GCCCUUUCUAAAAUGUCCAAAAUGGCGCAAACCAAACUGUAUGAUGUGA 3'  Message
Frame 1: A L S K M S K M A Q T K L Y D V ...
Frame 2: P F L K C P K W R K P N C M M * ...
Frame 3: P F * N V Q N G A N Q T V * C E ...
How do you know which strand is the coding strand?
You don't... There are six possible frames for translation (three on each strand).
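The six-frame translation can be reproduced with a short script. This is a sketch using only the standard genetic code written on the DNA alphabet (T in place of U); the helper names are hypothetical, and no start-codon logic is applied, so "open" simply means "free of stop codons".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the other strand, read 5'->3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate complete codons starting at offset frame (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Translate all three frames of both strands."""
    strands = {"+": dna, "-": reverse_complement(dna)}
    return {f"{sign}{frame + 1}": translate(seq, frame)
            for sign, seq in strands.items() for frame in range(3)}


def longest_open_stretch(protein):
    """Longest run of residues not interrupted by a stop (*)."""
    return max(protein.split("*"), key=len)


coding = "GCCCTTTCTAAAATGTCCAAAATGGCGCAAACCAAACTGTATGATGTGA"
for name, protein in six_frames(coding).items():
    print(name, protein, longest_open_stretch(protein))
```

Frames +1, +2 and +3 give the three translations listed on the slide (up to the last complete codon); the three "-" frames are the ones you would need if the other strand turned out to be the coding strand.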
So, you’ve got the sequence…So what?
Next topic: Bioinformatics
Inferences based on homology
Questions
1. Has the gene been sequenced before? (Will I be able to publish?)
2. What is the sequence of the protein encoded by the gene?
3. Has the protein been sequenced before?
4. Is the gene similar to one that has been sequenced before?
   1. Did I sequence the right gene?
   2. Will I be able to find structural or functional relatives?
5. Is the protein similar to one that has been sequenced before?
   1. How similar?
   2. What does the similarity mean?
6. Can I predict the function of the gene product, or is the predicted function consistent with what I know about the protein?
7. Can I get information about structural features of the gene product?
   1. Secondary structure
   2. Folding domains or other common patterns
   3. Hydropathy profiles
      1. How might predicted helices and/or sheets pack?
      2. Is it likely to be a membrane protein, a transmembrane protein?
Answers: Sequence Similarities and Similarity Searches
1. Search sequence databases for homologous proteins.
2. Find families of proteins that are similar to your protein.
3. Use information about the structure and properties of the similar protein(s) to establish inferences about your protein. If the exact sequence is in the database, the similarity search routines will find that, too.
4. Determine whether two sequences are related (or identical) by aligning them so that homologous regions are adjacent.
5. For two identical sequences:
   MGKARSMVLKHSTKARS
   MGKARSMVLKHSTKARS
But, what about:
Imperfect homology
   MGKARSMLLKHSTKARS
   MGKARTMVLKHSTRARS
Gaps/insertions
   MGKARSMLLKHSLKARS
   MGRA----LKHSLRART
And, how homologous is homologous?
Need
  Similarity scores for pairs of amino acids
  Method for dealing with gaps
  Algorithms for comparing a sequence with a database
  Ways to assess the degree of homology
  Ways to link structural info with sequence info
Dynamic Programming
Needleman-Wunsch Algorithm
Compares similarity of two proteins a & b at positions i & j:
$$NW_{i,j} = \max\left(NW_{i-1,j-1} + s(a_i, b_j),\; NW_{i-1,j} + g,\; NW_{i,j-1} + g\right)$$
$NW_{i-1,j-1}$ = running total
$s(a_i, b_j)$ = similarity between residue i of protein a and residue j of protein b
$g$ = gap penalty
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
Fill a Matrix with all possibilities
Simple example: s = 1 (match) or 0 (mismatch) and g = 0
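The matrix fill and the traceback can be written out directly from the recurrence. This is a minimal sketch using the simple scoring of the example (s = 1 for identical residues, 0 otherwise, g = 0) as defaults. The function name and the tie-breaking order in the traceback are arbitrary choices, so other equally good alignments may exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    def s(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    nw = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned against gaps
        nw[i][0] = nw[i - 1][0] + gap
    for j in range(1, cols):          # first row: b aligned against gaps
        nw[0][j] = nw[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            nw[i][j] = max(nw[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                           nw[i - 1][j] + gap,
                           nw[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and nw[i][j] == nw[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and nw[i][j] == nw[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return nw[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("MGKARSMLLKHSLKARS", "MGRALKHSLRART")
print(score)
print(top)
print(bottom)
```

With g = 0, gaps are free, so the score is just the number of identical residues that can be lined up in order; a negative gap penalty would make the algorithm prefer fewer, longer gaps.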
Smith-Waterman
Always compare the NW terms to zero so that the running total never drops below zero.
$$NW_{i,j} = \max\left(NW_{i-1,j-1} + s(a_i, b_j),\; NW_{i-1,j} + g,\; NW_{i,j-1} + g,\; 0\right)$$
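The effect of the extra zero is easiest to see in code. The sketch below returns only the best local score and where it ends. The scoring values (match = 2, mismatch = -1, gap = -2) are assumptions chosen for illustration; the zero floor only matters when mismatches or gaps can make the running total negative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, (i, j)) where the best region ends."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap,
                          0)  # restart instead of carrying a negative total
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    return best, best_pos


# Two hypothetical sequences that share only the stretch MGKA.
print(smith_waterman("XXMGKAYY", "ZZMGKAWW"))  # (8, (6, 6))
```

Because the total is reset to zero wherever the alignment goes bad, the best score picks out the best-matching region rather than penalizing the unrelated flanks.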
BLAST & FASTA
FASTA - great, we won’t talk about it
  much faster and more selective than SW, but less sensitive
BLAST - Basic Local Alignment Search Tool
  less selective and more sensitive than FASTA, i.e., you may get more hits, but some of them may be wrong
BLAST
Divide sequence into “words” of length W (e.g., BLASTp, initial W = 3)
Compare all W-length words
  Retain only pairs with similarity above a threshold, T
  Call them High-Scoring Pairs (HSPs)
  Increase W, repeat with HSPs
  Keep going while the pairs remain above a minimum similarity, and compare to random probability (E)
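The word-and-extend idea can be illustrated with a toy version. This is a deliberately simplified sketch, not BLAST itself: it seeds on exact word matches instead of scoring words against a threshold T with a substitution matrix, extends without gaps, and computes no E value. The function names, the +1/-1 scoring and the `max_drop` cutoff are all assumptions.

```python
def word_hits(query, subject, w=3):
    """Return (i, j) pairs where query[i:i+w] == subject[j:j+w]."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, w=3, match=1, mismatch=-1, max_drop=2):
    """Extend a word hit in both directions without gaps.

    Stops once the score has fallen more than max_drop below the best seen.
    Returns (best_score, query_start, query_end) with query_end exclusive.
    """
    score = best = w * match
    start, end = i, i + w
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject) and best - score <= max_drop:
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, end = score, qi
    score = best
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0 and best - score <= max_drop:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, start = score, qi
        qi -= 1
        sj -= 1
    return best, start, end


query = "MGKARSMLLKHSTKARS"
subject = "MGKARTMVLKHSTRARS"
hits = word_hits(query, subject)
print(hits)
print(extend_hit(query, subject, 8, 8))  # word LKH grows to (11, 0, 17)
```

Starting from short words and growing only the promising ones is what makes this kind of search much faster than filling a full dynamic-programming matrix for every database sequence.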
Scoring Matrices: Making similarity quantitative
Compare the actual frequency to the frequency expected by chance alone.
Probability that alanine appears at position x in a protein
  = fraction of Ala in all proteins = $p_{Ala}$
Probability that one protein has Ala at position x, and another protein has Gly?
  = $p_{Ala}\,p_{Gly}$
  The frequency due to chance alone.
Similarity
  $q_{Ala,Gly}$ = ACTUAL frequency that Ala and Gly are at position x in two proteins (in your database)
  $R_{i,j} = q_{i,j} / (p_i p_j)$
  Score: $S_{i,j} = \log_2(R_{i,j}) = \log_2\left(q_{i,j} / (p_i p_j)\right)$
  “Log-Odds Scores”
1   q i,j 
  log


  p ip j 
Remember Chou & Fasman?
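The log-odds score defined above is a one-line calculation. In the sketch below the frequencies are hypothetical numbers picked to make the arithmetic easy; they are not real amino acid frequencies, and `log_odds_score` is an illustrative name.

```python
import math


def log_odds_score(q_ij, p_i, p_j):
    """S_ij = log2(q_ij / (p_i * p_j)): observed pairing vs. chance."""
    return math.log2(q_ij / (p_i * p_j))


p_ala, p_gly = 0.25, 0.25  # hypothetical background frequencies

print(log_odds_score(0.125, p_ala, p_gly))    # 1.0: pair seen twice as often as chance
print(log_odds_score(0.0625, p_ala, p_gly))   # 0.0: pair seen exactly as often as chance
print(log_odds_score(0.03125, p_ala, p_gly))  # -1.0: pair seen half as often as chance
```

Positive scores therefore mark substitutions that occur more often than chance in related proteins, and negative scores mark substitutions that are avoided.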

PAM Matrices
Margaret Dayhoff assembled the Atlas of Protein Sequence and Structure
Evolutionarily-accepted mutations
  Calculated $q_{i,j}$ for all aa’s in closely-related proteins
  These were accepted by Nature as similar/close enough
Generate half matrices: Point Accepted Mutation/Percent Accepted Mutations
Scale, so PAM1 reflects 1 accepted mutation per 100 residues; PAM50, 50 accepted mutations per 100 residues
BLOSUM
Henikoff and Henikoff
BLOcks of Amino Acid SUbstitution Matrix
BLOCKS is a database of related proteins
BLAST Search
Go to BLAST Website
Enter Nucleotide or AA sequence
Choose BLAST type
  Nucleotide-nucleotide: BLASTn
  Protein-protein: BLASTp
  6-frame-translated nucleotide vs. protein: BLASTx
  others
Then?
Does it make sense?
Multisequence Alignment
Secondary structure prediction
Domains
Families
Caveat
It ain't what you don't know that'll kill you,
it's what you know that ain't so.
