PowerPoint slides - Center for Biological Sequence Analysis
Protein homology modeling
Morten Nielsen,
CBS, BioCentrum,
DTU
Objectives
• Understand the basic concepts of
homology modeling
• Learn why even sequences with very low
sequence similarity can be modeled
– Understand why %id is such a terrible
measure of reliability
• See the beauty of sequence profiles
Background. Why protein modeling?
• Because it works!
– Close to 50% of all new sequences can be homology
modeled
• Experimental effort to determine protein
structure is very large and costly
• The gap between the size of the protein
sequence data and protein structure data is
large and increasing
Homology modeling and the human genome
Swiss-Prot database
~200,000 in Swiss-Prot
~2,000,000 if TrEMBL is included
New PDB structures
PDB New Fold Growth
Old folds
New folds
• The number of unique folds in nature is fairly small (possibly a few thousand)
• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
• Number of new folds is NOT growing
Worldwide Structural Genomics
• ”Fold space coverage”
• Complete genomes
• Signaling proteins
• Improving technology
• Disease-causing organisms
• Model organisms
• Membrane proteins
• Protein-ligand interactions
Structural Genomics in North America
• 10-year, $600 million project initiated in 2000,
funded largely by NIH
• Aim: structural information on 10,000 unique
proteins (now 4,000-6,000); so far 1,000 have been
determined
• Improve current techniques to reduce time
(from months to days) and cost (from $100,000
to $20,000/structure)
• 9 research centers currently funded (2005),
targets are from model and disease-causing
organisms (a separate project on TB proteins)
Homology modeling for structural genomics
What a new fold can give
Roberto Sánchez et al. Nature Structural Biology 7, 986 - 990 (2000)
How well can we do it?
Sali, A. & Kuriyan, J. Trends Biochem. Sci. 22, M20–M24 (1999)
Homology modeling. Why can we do it?
The structure of a protein is uniquely
determined by its amino acid sequence (but
sequence is sometimes not enough):
– prions
– pH, ions, cofactors, chaperones
Structure is conserved much longer than
sequence in evolution
Identification of fold
If sequence similarity is
high proteins share
structure (Safe zone)
If sequence similarity is low
proteins may share
structure (Twilight zone)
Most proteins do not have a
partner with high sequence
homology
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11, 2836-47
Example.
Anne Mølgaard, a postdoc in our group, spent her PhD determining the
structure of the sequence below
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
• What is the function?
• Where is the active site?
Could she have saved three years of
work?
• Function
• Run Blast against PDB
• No significant hits
• Run Blast against NR (Sequence database)
• Function is Acetylesterase?
• Where is the active site?
Example. Where is the active site?
1G66 Acetylxylan esterase
1USW Hydrolase
1WAB Acetylhydrolase
Example. Where is the active site?
• Align sequence against structures of
known acetylesterase, like
• 1WAB, 1FXW, …
• Cannot be aligned. Too low sequence
similarity
1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY
Is it really impossible?
• Worked for 2-3 years in SBI-AT
developing methods for homology modeling
in the twilight zone
• Shown that homology modeling is possible also
for very low sequence homology
• So, try to show that Anne could have
saved 3 years of work if she had used the
most advanced homology modeling
techniques
How can we do it?
• Identify template(s) – initial alignment
• Can give you protein function
• Improve alignment
• Can give you active site
• Backbone generation
• Loop modeling
• Most difficult part
• Side chains
• Refinement
• Validation
How to do it
Identify fold (template) for modeling
– Find the structure in the PDB database that resembles your new protein the most
– Can be used to predict function
– And maybe active sites
Align protein sequence to template
– Simple alignment methods
– Sequence profiles
– Threading methods
– Pseudo force fields
Model side chains and loops
Protein homology modeling is only possible
if %id is greater than 30-50%
Why %id is so bad!!
1200 models sharing 25-95% sequence identity with the
submitted sequences (www.expasy.ch/swissmod)
Identification of correct fold
• % ID is a poor measure
– Many evolutionarily related proteins share
low sequence homology
– A short alignment of 5 amino acids can
share 100% id, what does this mean?
• Alignment score even worse
– Many sequences will score high against
everything (hydrophobic stretches)
• P-value or E-value more reliable
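The point about short alignments can be checked with a few lines of code. The sketch below is only an illustration: the function name `percent_identity` and the choice to ignore gapped columns are assumptions, and the 5-residue strings are toy examples.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: gapped columns are skipped; other conventions exist.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# A 5-residue toy alignment reaches 100% id although it carries almost no
# evidence of a shared fold.
short_hit = percent_identity("LAGDS", "LAGDS")
# One substitution and one gap already move the value a lot on so few columns.
gapped_hit = percent_identity("LAGDS", "I-GDS")
print(short_hit, gapped_hit)
```

Because %id ignores alignment length, the same value can come from five columns or five hundred, which is one reason it is a poor measure of reliability.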
What are P and E values?
• E-value
– Number of expected hits in database with score higher than match
– Depends on database size
• P-value
– Probability that a random hit will have score higher than match
– Database size independent
• Example: match score 150; 10 hits with higher score (E=10); 10000 hits in database => P=10/10000 = 0.001
[Figure: distribution of database hit scores; x-axis: Score]
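The arithmetic on this slide can be reproduced with a small counting sketch. It follows the slide's simplified definitions (count the hits that score higher, then divide by database size). BLAST itself derives E-values from Karlin-Altschul statistics rather than by counting. The function name and the toy score list are assumptions.

```python
def e_and_p_value(match_score, database_scores):
    """Counting-based E and P values, following the slide's definitions.

    E: number of database hits scoring higher than the match.
    P: that number divided by the database size.
    """
    if not database_scores:
        raise ValueError("database_scores must not be empty")
    n_higher = sum(1 for s in database_scores if s > match_score)
    return n_higher, n_higher / len(database_scores)


# Toy database: 10000 hits, 10 of which score above the match score of 150.
database_scores = [200.0] * 10 + [50.0] * 9990
e_value, p_value = e_and_p_value(150.0, database_scores)
print(e_value, p_value)
```

Doubling the database with the same score distribution would double E but leave P unchanged, which is the database-size dependence the slide refers to.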
Template identification
• Simple sequence based methods
– Align (BLAST) sequence against sequence of proteins with known
structure (PDB database)
• Sequence profile based methods
– Align sequence profile (Psi-BLAST) against sequence of proteins
with known structure (PDB, FUGUE)
– Align sequence profile against profile of proteins with known
structure (FFAS)
• Sequence and structure based methods
– Align profile and predicted secondary structure against proteins
with known structure (3D-PSSM, Phyre)
• Sequence profiles and structure based methods
– Our work
Template quality
• Selecting the best template is crucial!
• The best template may not be the one
with the highest % id (best p-value…)
– Template 1: 93% id, 3.5 Å resolution
– Template 2: 90% id, 1.5 Å resolution
Template quality – Ramachandran plot
What goes wrong when Blast fails?
• Conventional sequence alignment uses a (Blosum)
scoring matrix to identify amino acids matches in
the two protein sequences
• This scoring matrix is identical at all positions in
the protein sequence!
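[Figure: position-by-position scoring of the example sequence EVVFIGDSLVQLMHQC; figure labels: A, G, D, S, G, X]
When Blast works!
[Figure: 1PLC._ and 1PLB._]
When Blast fails!
[Figure: 1PLC._ and 1PMY._]
When Blast fails, use sequence
profiles!
[Figure: 1PLC._ and 1PMY._]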
Blosum scoring matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C   0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H  -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Sequence profiles
• In reality not all positions in a protein are equally likely to mutate
• Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
• Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than the BLOSUM penalty
• Sequence profiles can capture these differences
Protein structure classification
Protein world
Protein fold
Protein superfamily
Protein family
New Fold
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
Conserved G column: matching anything but G => large negative score
Variable column: anything can match
How to make sequence profiles
1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
3. Use weight matrix to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
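The profile-building step can be sketched in a few lines. This is a simplified illustration, not the PSI-BLAST procedure: it uses a flat pseudocount, no sequence weighting, and a uniform background frequency. The function names, the pseudocount value, and the toy alignment are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_profile(alignment, pseudocount=0.1):
    """Per-column amino acid frequencies with a flat pseudocount added.

    Simplification: real methods weight sequences and use substitution-aware
    pseudocounts; gaps and unknown characters are simply ignored here.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def log_odds(profile, background=1.0 / 20.0):
    """Convert frequencies to log-odds scores against a uniform background."""
    return [{aa: math.log2(p / background) for aa, p in col.items()} for col in profile]


# Toy alignment: column 0 is variable, columns 1-3 (G, D, S) are conserved.
toy_alignment = ["AGDSL", "IGDST", "VGDSM", "LGDSK"]
profile = build_profile(toy_alignment)
scores = log_odds(profile)
print(round(scores[1]["G"], 2), round(scores[1]["A"], 2), round(scores[0]["A"], 2))
```

In the conserved G column every residue except G receives a negative score, while the variable first column tolerates several residues. That is the position dependence a single BLOSUM matrix cannot express.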
And how to really do it?
• Make profile (three iterations)
• blastpgp -i fastafile -d nr -j 4 -e 1e-5 -C restart.file
• Run profile against database
• blastpgp -i fastafile -d db.fsa -R restart.file
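For scripting the two steps, the commands can be assembled as argument lists. The sketch below only rebuilds the command lines from the slide with the same flags and file names. The function name is an assumption, and nothing is executed here because `blastpgp` and the databases may not be installed.

```python
def profile_search_commands(fasta, profile_db="nr", search_db="db.fsa",
                            checkpoint="restart.file"):
    """Return the two blastpgp argument lists used on the slide.

    The flags are copied verbatim from the slide; only the wrapping is new.
    """
    build = ["blastpgp", "-i", fasta, "-d", profile_db,
             "-j", "4", "-e", "1e-5", "-C", checkpoint]
    search = ["blastpgp", "-i", fasta, "-d", search_db, "-R", checkpoint]
    return build, search


build_cmd, search_cmd = profile_search_commands("fastafile")
print(" ".join(build_cmd))
print(" ".join(search_cmd))
# To run them, pass each list to subprocess.run(cmd, check=True) on a machine
# where blastpgp and the databases are available.
```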
Sequence profiles (1J2J.B)
[Figure panels: profile after 0 iterations (Blosum62), 1 iteration, 2 iterations, and 3 iterations]
Example. Anne's sequence
(SGNH active site)
Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around
• S9, G42, N74, and H195
[Figure: profile-profile scoring matrix, 1K7C.A versus 1WAB._]
Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID
Active-site residues marked in the figure: S, G, N (first block) and H (third block)
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
Structural superposition
Blue: 1K7C.A
Red: 1WAB._
Where was the active site?
Rhamnogalacturonan
acetylesterase (1k7c)
Including structure
• Sequences within a protein superfamily share only remote sequence homology, but they share high structural homology
• Structure is known for template
• Predict structural properties for query
– Secondary structure
– Surface exposure
• Position specific gap penalties derived from
secondary structure and surface exposure
Using structure
Sequence & structure profile-profile based
alignments
– Template
• Sequence based profiles
• Annotated secondary structure
• Predicted secondary structure
– Query
• Sequence based profile
• Predicted secondary structure
– Position specific gap penalties derived from
secondary structure
How good are we?
Alignment accuracy. Scoring functions
• Blosum62 score matrix. Fg=1. Ng=0?
    L   A   G   D   S   D
F   0  -2  -3  -3  -2  -3
I   2  -1  -4  -3  -2  -3
G  -4   0   6  -1   0  -1
D  -4  -2  -1   6   0   6
S  -2   1   0   0   4   0
L   4  -1  -4  -4  -2  -4
• Score =2+6+6+4-1=17
• Alignment
LAGDS
I-GDS
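The score of 17 for this alignment can be recomputed directly. The sketch below assumes that "Fg" is the cost of the first gap position and "Ng" the cost of each following gap position, which is how the slide's arithmetic reads. The function name and the small matrix subset are illustrative.

```python
# Only the Blosum62 entries needed for this alignment.
BLOSUM62_SUBSET = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def score_alignment(top, bottom, matrix, first_gap=1, next_gap=0):
    """Sum substitution scores and subtract gap costs along an alignment.

    Assumption: first_gap and next_gap correspond to the slide's Fg and Ng.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            # The matrix is symmetric, so look the pair up in either order.
            score += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
            in_gap = False
    return score


total = score_alignment("LAGDS", "I-GDS", BLOSUM62_SUBSET)
print(total)
```

The terms are L/I = 2, one gap = -1, G/G = 6, D/D = 6 and S/S = 4, matching the sum on the slide.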
Alignment accuracy. Scoring functions
• How to make the most of sequence
profiles?
$m_{a,b} = w_{seq} \log\left(\frac{p_a \cdot p_b}{o_{seq}}\right) + w_{sec} \log\left(\frac{s_a \cdot s_b}{o_{sec}}\right)$
(columns: residue, protein, position, then frequencies for) A R N D C Q E G H I L K M F P S T W Y V
T 1K7C.A 1 0.06 0.03 0.04 0.03 0.01 0.02 0.03 0.04 0.01 0.04 0.05 0.04 0.02 0.02 0.02 0.08 0.37 0.00 0.01 0.06
T 1K7C.A 2 0.04 0.21 0.02 0.02 0.01 0.02 0.03 0.02 0.01 0.02 0.03 0.04 0.01 0.01 0.01 0.07 0.36 0.00 0.01 0.03
V 1K7C.A 3 0.03 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.36 0.20 0.01 0.02 0.02 0.01 0.01 0.02 0.00 0.01 0.22
…..
G 1K7C.A 7 0.04 0.01 0.02 0.02 0.01 0.01 0.01 0.71 0.01 0.01 0.02 0.02 0.01 0.01 0.01 0.05 0.02 0.00 0.01 0.0
D 1K7C.A 8 0.02 0.02 0.05 0.67 0.00 0.02 0.05 0.02 0.01 0.01 0.01 0.02 0.00 0.01 0.01 0.03 0.02 0.00 0.01 0.0
S 1K7C.A 9 0.06 0.02 0.03 0.03 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.03 0.01 0.01 0.02 0.59 0.04 0.00 0.01 0.02
E 1WAB._ 40 0.04 0.21 0.06 0.12 0.00 0.06 0.06 0.02 0.02 0.01 0.02 0.19 0.01 0.01 0.04 0.04 0.04 0.00 0.01 0.0
V 1WAB._ 41 0.03 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.00 0.31 0.14 0.01 0.01 0.05 0.01 0.03 0.04 0.01 0.02 0.2
….
G 1WAB._ 45 0.03 0.02 0.01 0.01 0.00 0.01 0.01 0.81 0.00 0.01 0.01 0.01 0.00 0.01 0.01 0.02 0.02 0.00 0.00 0.0
D 1WAB._ 46 0.02 0.01 0.06 0.68 0.00 0.01 0.03 0.02 0.02 0.01 0.01 0.02 0.00 0.01 0.01 0.05 0.03 0.00 0.00 0.0
S 1WAB._ 47 0.04 0.02 0.02 0.02 0.01 0.01 0.03 0.02 0.01 0.01 0.01 0.03 0.01 0.01 0.01 0.68 0.03 0.00 0.01 0.0
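A minimal sketch of how two profile columns could be compared with a dot-product log-odds score is given below. The reading of the symbols is an assumption: p_a and p_b are taken as amino acid frequency vectors, s_a and s_b as secondary structure probability vectors, and o_seq and o_sec as normalisation constants, set here to the dot product of two uniform vectors. The weights, the secondary structure vectors, and the function names are hypothetical. The profile rows are copied from the table above, including final values that appear truncated in the source.

```python
import math


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(u, v))


def column_score(p_a, p_b, s_a, s_b, w_seq=1.0, w_sec=0.5,
                 o_seq=0.05, o_sec=1.0 / 3.0, floor=1e-6):
    """Weighted sum of sequence and secondary-structure log-odds terms.

    All default parameter values are hypothetical; floor avoids log(0).
    """
    seq_term = math.log(max(dot(p_a, p_b), floor) / o_seq)
    sec_term = math.log(max(dot(s_a, s_b), floor) / o_sec)
    return w_seq * seq_term + w_sec * sec_term


# Rows from the table, order A R N D C Q E G H I L K M F P S T W Y V.
G7_1K7C = [0.04, 0.01, 0.02, 0.02, 0.01, 0.01, 0.01, 0.71, 0.01, 0.01,
           0.02, 0.02, 0.01, 0.01, 0.01, 0.05, 0.02, 0.00, 0.01, 0.0]
V3_1K7C = [0.03, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.00, 0.36,
           0.20, 0.01, 0.02, 0.02, 0.01, 0.01, 0.02, 0.00, 0.01, 0.22]
G45_1WAB = [0.03, 0.02, 0.01, 0.01, 0.00, 0.01, 0.01, 0.81, 0.00, 0.01,
            0.01, 0.01, 0.00, 0.01, 0.01, 0.02, 0.02, 0.00, 0.00, 0.0]

# Hypothetical (helix, strand, coil) probabilities, identical for all columns
# so that only the sequence term differs between the two comparisons.
LOOP = [0.1, 0.1, 0.8]

conserved = column_score(G7_1K7C, G45_1WAB, LOOP, LOOP)
mismatch = column_score(V3_1K7C, G45_1WAB, LOOP, LOOP)
print(round(conserved, 2), round(mismatch, 2))
```

Two conserved glycine columns score well above zero, while a hydrophobic column against the glycine column scores below zero, even though no Blosum matrix is consulted.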
Alignment accuracy
Alignment performance (y-axis: Fractional n4)
Method        Train   Test
Blosum        0.259   0.212
Profile       0.393   0.348
Profile+SS    0.417   0.386
Profile+ASS   0.420   0.393
Fold recognition
• Benchmark
– Query sets: 100 sequences for training, 200 for testing
– Database of 355 PDB structures
– Align Query against Db
• If structurally similar, hit = 1; else hit = 0
– Use CE to define structural similarity
• Calculate AUC (area under the ROC curve)
– Perfect method can separate hits from non-hits
• How to rank hits?
– Alignment score?
– %Id
– Z score (p-value)
CE structural alignment
(combinatorial extension)
AUC performance measure
Query    Templ    Score       Hit/nonhit
1CJ0.A   1B78.A    0.170963   0
1CJ0.A   1B8A.A   -0.040029   0
1CJ0.A   1B8B.A   -0.012789   0
1CJ0.A   1B8G.A   12.342823   1
1CJ0.A   1B9H.A   13.394361   1
1CJ0.A   1BAR.A   -1.281068   0
1CJ0.A   1BAV.C   -1.091305   0
Sorted by score:
Query    Templ    Score       Hit/nonhit
1CJ0.A   1B8G.A   12.342823   1
1CJ0.A   1DTY.A   11.867786   1
1CJ0.A   1DGD._   11.271914   1
1CJ0.A   1GTX.A   11.010288   1
1CJ0.A   2GSA.A   10.958170   1
1CJ0.A   1BW9.A    2.651775   0
1CJ0.A   1AUP._    2.507336   1
1CJ0.A   1GTM.A    2.444512   0
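The AUC for a ranked list like the one above for query 1CJ0.A can be computed by comparing every hit with every non-hit. The sketch below uses this pairwise definition, which equals the area under the ROC curve. The function name is an assumption, and the scores and labels are the eight sorted entries listed above.

```python
def auc(scores, labels):
    """Fraction of (hit, non-hit) pairs in which the hit scores higher.

    Ties count as half a win.
    """
    hits = [s for s, lab in zip(scores, labels) if lab == 1]
    nonhits = [s for s, lab in zip(scores, labels) if lab == 0]
    if not hits or not nonhits:
        raise ValueError("need at least one hit and one non-hit")
    wins = sum(1.0 if h > n else 0.5 if h == n else 0.0
               for h in hits for n in nonhits)
    return wins / (len(hits) * len(nonhits))


# Sorted hits for query 1CJ0.A, copied from the listing above.
scores = [12.342823, 11.867786, 11.271914, 11.010288,
          10.958170, 2.651775, 2.507336, 2.444512]
labels = [1, 1, 1, 1, 1, 0, 1, 0]
print(auc(scores, labels))
```

Only one hit (1AUP._) is ranked below a non-hit (1BW9.A), so 11 of the 12 hit/non-hit pairs are ordered correctly. A perfect method would give 1.0 and a random ranking about 0.5.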
Fold recognition performance
Test set performance (AUC)
Method      All     Per Protein
sco         0.921   0.956
z score     0.958   0.971
bl          0.749   0.855
pdbblast    0.809   0.888
blast       0.698   0.809
Outlook
• Include position dependent gap penalties
• The method now uses equal gap penalties
throughout the scoring matrix
• In real proteins placement of insertions and
deletions is highly structure dependent
• No gaps in secondary structure elements
• Gaps most frequent in loops
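One simple way to express position-dependent gap penalties is to derive them from a secondary structure string. The sketch below is an illustration of the idea only: the three-state alphabet (H helix, E strand, C coil/loop), the penalty values, and the function name are all assumptions.

```python
def gap_penalties(secondary_structure, loop_penalty=1.0, element_penalty=10.0):
    """Per-position gap-open penalties from a three-state structure string.

    Gaps are cheap in loops (C) and expensive inside helices (H) and
    strands (E). Both penalty values are hypothetical.
    """
    penalties = []
    for state in secondary_structure:
        if state not in "HEC":
            raise ValueError(f"unknown secondary structure state: {state!r}")
        penalties.append(loop_penalty if state == "C" else element_penalty)
    return penalties


penalties = gap_penalties("CCHHHHCCEEEC")
print(penalties)
```

An alignment routine would then look up the penalty at the position where a gap opens instead of using one constant for the whole sequence.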
CASP. Which are the best methods
• Critical Assessment of Structure Predictions
• Every second year
• Sequences from about-to-be-solved structures are given to groups who submit
their predictions before the structure is
published
• Modelers make predictions
• Meeting in December where correct answers
are revealed
CASP6 results
The top 4 homology modeling groups in CASP6
• All winners use consensus predictions
– The wisdom of the crowd
• Same approach as in CASP5!
• Nothing has happened in 2 years!
The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are
Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis
Galton left his home and headed for a country fair… He
believed that only a very few people had the
characteristics necessary to keep societies healthy. He
had devoted much of his career to measuring those
characteristics, in fact, in order to prove that the vast
majority of people did not have them. … Galton came
across a weight-judging competition…Eight hundred people
tried their luck. They were a diverse lot, butchers,
farmers, clerks and many other non-experts…The crowd
had guessed … 1,197 pounds; the ox weighed 1,198
The wisdom of the crowd!
– The highest scoring hit will often be wrong
• Not one single prediction method is
consistently best
– Many prediction methods will have the
correct fold among the top 10-20 hits
– If many different prediction methods all have
a common fold among the top hits, this fold is
probably correct
3D-Jury (Best group)
Inspired by Ab initio modeling methods
– Average of frequently obtained low energy
structures is often closer to the native structure
than the lowest energy structure
Find most abundant high scoring model in a list of
predictions from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score Sij = # of Cα pairs within 3.5 Å
(if # > 40; else Sij = 0)
4. 3D-Jury score of model i = Σj Sij / (N+1)
Similar methods developed by A Elofsson (Pcons)
and D Fischer (3D shotgun)
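The consensus scoring in steps 3 and 4 can be sketched from a matrix of pairwise counts. The sketch assumes the superposition has already been done and that each entry holds the number of Cα pairs within 3.5 Å. N is taken here to be the number of models, which is an assumption. The function name and the toy counts are hypothetical.

```python
def jury_scores(pair_counts, min_pairs=40):
    """Consensus score per model from pairwise C-alpha pair counts.

    Counts at or below min_pairs contribute 0; each model's score is the sum
    of its similarities to all other models divided by N + 1.
    """
    n = len(pair_counts)
    scores = []
    for i in range(n):
        total = 0
        for j in range(n):
            if i == j:
                continue
            count = pair_counts[i][j]
            total += count if count > min_pairs else 0
        scores.append(total / (n + 1))
    return scores


# Toy counts for four models: models 0-2 agree with each other, model 3 is
# an outlier that superimposes poorly on everything else.
pair_counts = [
    [0, 120, 110, 30],
    [120, 0, 115, 25],
    [110, 115, 0, 35],
    [30, 25, 35, 0],
]
scores = jury_scores(pair_counts)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

The selected model is the one most similar to the rest of the set, regardless of the score its own server assigned to it. This is the "wisdom of the crowd" idea in code.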
How to do it? Where is the crowd
• Meta prediction server
– Web interface to a list of public protein
structure prediction servers
– Submit query sequence to all selected servers
in one go
http://bioinfo.pl/meta/
Meta Server
Evaluating the crowd.
Meta Server
Evaluating the crowd. 3D Jury
Take home message
• Identifying the correct fold is only a small step
towards successful homology modeling
• Do not trust % ID or alignment score to identify
the fold. Use p-values
• Use sequence profiles and local protein
structure to align sequences
• Do not trust one single prediction method, use
consensus methods (3D Jury)
• Only if everything else fails, use ab initio methods