Foldrec_2008 - Center for Biological Sequence Analysis

Protein Fold recognition
Morten Nielsen,
CBS, BioSys,
DTU
Objectives
• Understand the basic concepts of fold
recognition
• Learn why even sequences with very low
sequence similarity can be modeled
– Understand why %id is such a terrible
measure of reliability
• See the beauty of sequence profiles
– Position specific scoring matrices (PSSMs)
Background. Why protein modeling?
• Because it works!
– Close to 50% of all new sequences can be homology
modeled
• Experimental effort to determine protein
structure is very large and costly
• The gap between the size of the protein
sequence data and protein structure data is
large and increasing
Homology modeling and the human genome
How can we do it?
• Identify template(s) – initial alignment
• Can give you protein function
• Improve alignment
• Can give you active site
• Backbone generation
• Loop modeling
• Most difficult part
• Side chains
• Refinement
• Validation
Identification of fold
If sequence similarity is
high proteins share
structure (Safe zone)
If sequence similarity is low
proteins may share
structure (Twilight zone)
Most proteins do not have a
partner with high sequence
homology
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11, 2836-47
Example.
A post doc in our group did her PhD obtaining the structure of the
sequence below
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
• What is the function?
• Where is the active site?
What would you do?
• Function
• Run Blast against PDB
• No significant hits
• Run Blast against NR (Sequence database)
• Function is Acetylesterase?
• Where is the active site?
Example. Where is the active site?
1G66 Acetylxylan esterase
1USW Hydrolase
1WAB Acetylhydrolase
Example. Where is the active site?
• Align sequence against structures of
known acetylesterase, like
• 1WAB, 1FXW, …
• Cannot be aligned. Too low sequence
similarity
1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY
Is it really impossible?
Protein homology modeling is only possible
if %id is greater than 30-50%
Why %id is so bad!!
1200 models sharing 25-95% sequence identity with the
submitted sequences (www.expasy.ch/swissmod)
Identification of correct fold
• % ID is a poor measure
– Many evolutionarily related proteins share
low sequence homology
– A short alignment of 5 amino acids can
share 100% id, what does this mean?
• Alignment score even worse
– Many sequences will score high against
everything (hydrophobic stretches)
• P-value or E-value more reliable
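A minimal sketch of why %id says little on its own. The function name and the choice of denominator (aligned, non-gap columns) are assumptions made for illustration; other definitions of %id exist.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned, non-gap columns (one of several conventions)."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# A 5-residue match gives 100% id, but says nothing about the rest of the protein.
short_hit = percent_identity("GDSLV", "GDSLV")
# A short gapped alignment: 3 identical columns out of 4 aligned columns.
gapped_hit = percent_identity("LAGDS", "I-GDS")
```

The value does not depend on alignment length, so a short chance match can outrank a long, genuinely related one.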
What are P and E values?
• E-value
– Number of expected hits
in database with score
higher than match
– Depends on database size
• P-value
– Probability that a random
hit will have score higher
than match
– Database size independent
[Figure: score distribution. Match score 150; 10 hits with higher
score (E=10); 10000 hits in database => P=10/10000 = 0.001]
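The numbers on this slide can be reproduced with a small sketch. It uses the simplified, empirical definitions given here (counting hits in a list of scores); real BLAST statistics are derived from an extreme value distribution instead. The function name and the score list are hypothetical.

```python
def e_and_p_value(database_scores, match_score):
    """Empirical E-value (hits scoring higher than the match) and P-value (E / database size)."""
    e_value = sum(1 for score in database_scores if score > match_score)
    p_value = e_value / len(database_scores)
    return e_value, p_value


# Hypothetical database of 10000 hits, 10 of which score above the match score of 150.
scores = [200] * 10 + [100] * 9990
e_value, p_value = e_and_p_value(scores, 150)
```

Doubling the database with the same score distribution doubles E but leaves P unchanged, which is the sense in which P is database size independent.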
What goes wrong when Blast fails?
• Conventional sequence alignment uses a (Blosum)
scoring matrix to identify amino acid matches in
the two protein sequences
• This scoring matrix is identical at all positions in
the protein sequence!
[Figure: example sequence EVVFIGDSLVQLMHQC]
Blosum scoring matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Alignment accuracy. Scoring functions
• Blosum62 score matrix. Fg=1. Ng=0?
   L  A  G  D  S  D
F  0 -2 -3 -3 -2 -3
I  2 -1 -4 -3 -2 -3
G -4  0  6 -1  0 -1
D -4 -2 -1  6  0  6
S -2  1  0  0  4  0
L  4 -1 -4 -4 -2 -4
• Score =2+6+6+4-1=17
• Alignment
LAGDS
I-GDS
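The score of the alignment above can be checked with a short sketch. It assumes that Fg and Ng on the slide mean the penalty for the first gap position and for each following gap position; that reading, the function name, and the reduced score table are assumptions made for illustration.

```python
# Only the Blosum62 entries needed for this alignment (values as shown on the slide).
BLOSUM62_SUBSET = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def alignment_score(seq_a, seq_b, scores, first_gap=1, next_gap=0):
    """Sum substitution scores over aligned columns, minus gap penalties.

    Simplification: consecutive gap columns count as one gap run,
    regardless of which sequence carries the gap.
    """
    total = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            total -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            # The matrix is symmetric, so look the pair up in either order.
            total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
            in_gap = False
    return total


score = alignment_score("LAGDS", "I-GDS", BLOSUM62_SUBSET)
```

With these inputs the terms are 2 (L/I), -1 (gap), 6 (G/G), 6 (D/D) and 4 (S/S), matching the sum on the slide.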
When Blast works!
[Figure: structures 1PLC._ and 1PLB._]
When Blast fails!
[Figure: structures 1PLC._ and 1PMY._]
When Blast fails, use sequence
profiles!
[Figure: structures 1PLC._ and 1PMY._]
Sequence profiles
• In reality not all positions in a protein are
equally likely to mutate
• Some amino acids (active sites) are highly
conserved, and the penalty for a mismatch must
be very high
• Other amino acids can mutate almost for
free, and the penalty for a mismatch should be
lower than the BLOSUM penalty
• Sequence profiles can capture these
differences
Protein structure hierarchy
Protein world
Protein fold
Protein superfamily
Protein family
New Fold
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
Matching anything but G => large negative score
Anything can match
How to make sequence profiles
1. Align (BLAST) sequence against large sequence
database (Swiss-Prot)
2. Select significant alignments and make profile
(weight matrix) using techniques for sequence
weighting and pseudo counts
3. Use weight matrix to align against sequence
database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
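The profile-building step can be sketched for a single alignment column. This is a simplified illustration, not the Psi-BLAST procedure: it assumes a uniform background frequency, a flat pseudo count per amino acid, and no sequence weighting. The function name and the example columns are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_scores(column, pseudocount=1.0):
    """Log-odds scores (log2) for one alignment column; gap characters are ignored."""
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    return {
        aa: math.log2((counts[aa] + pseudocount) / total / background)
        for aa in AMINO_ACIDS
    }


conserved = column_scores("GGGGGGGG")  # fully conserved glycine column
variable = column_scores("AT-I-ATT")   # variable column containing gaps
```

In the conserved column only G scores positive and every other residue is penalized; in the variable column several residues score positive. That position-specific difference is what a single Blosum matrix cannot express.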
Blast iterations
Protein world
Protein
Sequence profiles (1J2J.B)
0 iterations (Blosum62)
[Table: Blosum62 matrix, identical to the table under "Blosum scoring matrix"]
Sequence profiles (1J2J.B)
0 iterations (Blosum62)
3 iterations
Example. (SGNH active site)
Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around
S9, G42, N74, and H195
[Figure: profile-profile scoring matrix, 1K7C.A against 1WAB._]
Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
              S                                G                               N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
                                       H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
Structural superposition
Blue: 1K7C.A
Red: 1WAB._
Where was the active site?
Rhamnogalacturonan
acetylesterase (1k7c)
Including structure
• Sequences within a protein superfamily
share only remote sequence homology,
but they share high structural homology
• Structure is known for template
• Predict structural properties for query
– Secondary structure
– Surface exposure
• Position specific gap penalties derived from
secondary structure and surface exposure
Using structure
Sequence & structure profile-profile based
alignments
– Template
• Sequence based profiles
• Annotated secondary structure
• Predicted secondary structure
– Query
• Sequence based profile
• Predicted secondary structure
– Position specific gap penalties derived from
secondary structure
How good are we?
Alignment accuracy (twilight zone)
Fold recognition performance
What are the different methods?
• Simple sequence based methods
– Align (BLAST) sequence against sequence of proteins with known
structure (PDB database)
• Sequence profile based methods
– Align sequence profile (Psi-BLAST) against sequence of proteins
with known structure (PDB, FUGUE)
– Align sequence profile against profile of proteins with known
structure (FFAS)
• Sequence and structure based methods
– Align profile and predicted secondary structure against proteins
with known structure (3D-PSSM, Phyre)
• Sequence profiles and structure based methods
– HHpred
Take home message
• Identifying the correct fold is only a small step
towards successful homology modeling
• Do not trust % ID or alignment score to identify
the fold. Use P-values
• You can do reliable fold recognition AND
homology modeling even for low sequence
homology
• Use sequence profiles and local protein
structure to align sequences
CASP. Which are the best methods
• Critical Assessment of Structure Predictions
• Every second year
• Sequences from about-to-be-solved
structures are given to groups who submit
their predictions before the structure is
published
• Modelers make prediction
• Meeting in December where correct answers
are revealed
CASP6 results
The top 4 homology modeling groups in CASP6
• All winners use consensus predictions
– The wisdom of the crowd
• Same approach as in CASP5!
• Nothing has happened in 2 (4!!) years!
The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are
Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis
Galton left his home and headed for a country fair… He
believed that only a very few people had the
characteristics necessary to keep societies healthy. He
had devoted much of his career to measuring those
characteristics, in fact, in order to prove that the vast
majority of people did not have them. … Galton came
across a weight-judging competition…Eight hundred people
tried their luck. They were a diverse lot, butchers,
farmers, clerks and many other non-experts…The crowd
had guessed … 1,197 pounds, the ox weighed 1,198
The wisdom of the crowd!
– The highest scoring hit will often be wrong
• Not one single prediction method is
consistently best
– Many prediction methods will have the
correct fold among the top 10-20 hits
– If many different prediction methods all have
a common fold among the top hits, this fold is
probably correct
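The consensus idea can be sketched as a simple vote over the top hits of several methods. The method names, fold labels, and function name below are hypothetical placeholders, not output from real servers.

```python
from collections import Counter


def consensus_fold(top_hits_by_method):
    """Return the fold found in the top hits of the most methods, with its vote count."""
    votes = Counter()
    for hits in top_hits_by_method.values():
        votes.update(set(hits))  # each method votes at most once per fold
    return votes.most_common(1)[0]


# Hypothetical top hits; no method ranks the consensus fold first every time.
top_hits = {
    "method_a": ["fold_x", "fold_y"],
    "method_b": ["fold_z", "fold_y"],
    "method_c": ["fold_y", "fold_x"],
}
fold, n_votes = consensus_fold(top_hits)
```

Here the fold shared by all three lists wins even though two of the methods rank another fold first.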
3D-Jury (Best group)
Inspired by Ab initio modeling methods
– Average of frequently obtained low energy
structures is often closer to the native structure
than the lowest energy structure
Find most abundant high scoring model in a list of
prediction from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score $S_{ij}$ = # of Cα pairs within 3.5 Å
(if # > 40; else $S_{ij} = 0$)
4. 3D-Jury score of model $i$ = $\sum_{j} S_{ij} / (N+1)$
Similar methods developed by A Elofsson (Pcons)
and D Fischer (3D shotgun)
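Steps 3 and 4 can be sketched as follows, assuming the superposition in step 2 has already produced a matrix of Cα pair counts for every pair of models. Taking N as the number of other models is an assumption; the slide does not define it. The function name and the counts are hypothetical.

```python
def jury_scores(pair_counts, min_pairs=40):
    """3D-Jury style score per model from a symmetric matrix of Calpha pair counts.

    pair_counts[i][j] is the number of Calpha pairs within 3.5 A after
    superimposing models i and j (assumed to be computed elsewhere).
    """
    n_models = len(pair_counts)
    n_others = n_models - 1  # assumption: N counts the other models
    scores = []
    for i in range(n_models):
        total = 0
        for j in range(n_models):
            if i == j:
                continue
            count = pair_counts[i][j]
            # Similarities at or below the threshold are set to zero.
            total += count if count > min_pairs else 0
        scores.append(total / (n_others + 1))
    return scores


# Hypothetical counts for three models.
counts = [
    [0, 100, 30],
    [100, 0, 50],
    [30, 50, 0],
]
scores = jury_scores(counts)
best_model = max(range(len(scores)), key=scores.__getitem__)
```

The model that resembles the most other models gets the highest score, whatever score its own server assigned to it.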
How to do it? Where is the crowd
• Meta prediction server
– Web interface to a list of public protein
structure prediction servers
– Submit query sequence to all selected servers
in one go
http://bioinfo.pl/meta/
Meta Server
Evaluating the crowd.
Meta Server
Evaluating the crowd. 3D Jury
Structural Genomics in North America
• 10 year $600 million project initiated in 2000,
funded largely by NIH
• AIM: structural information on 10000 unique
proteins (now 4-6000), so far 1000 have been
determined
• Improve current techniques to reduce time
(from months to days) and cost (from $100,000
to $20,000 per structure)
• 9 research centers currently funded (2005),
targets are from model and disease-causing
organisms (a separate project on TB proteins)
Homology modeling for structural genomics
What a new fold can give
Roberto Sánchez et al. Nature Structural Biology 7, 986 - 990 (2000)
Take home message
• Identifying the correct fold is only a small step
towards successful homology modeling
• Do not trust % ID or alignment score to identify
the fold. Use p-values
• Use sequence profiles and local protein
structure to align sequences
• Do not trust one single prediction method, use
consensus methods (3D Jury)