Protein structure and homology modeling
Morten Nielsen,
CBS, BioCentrum,
DTU
Objectives
• Understand the basic concepts of
homology modeling
• Learn why even sequences with very low
sequence similarity can be modeled
– Understand why %id is such a terrible
measure of reliability
• See the beauty of sequence profiles?
• Learn where to find the best public
methods
Outline
• Why homology modeling
• How is it done
• How to decide when to use homology
modeling
– Why is %id such a terrible measure?
• What are the best methods?
• Models in immunology
Why protein modeling?
• Because it works!
– Close to 50% of all new sequences can be homology
modeled
• Experimental effort to determine protein
structure is very large and costly
• The gap between the size of the protein
sequence data and protein structure data is
large and increasing
Homology modeling and the human genome
Swiss-Prot database
~200,000 in Swiss-Prot
~2,000,000 if TrEMBL is included
New PDB structures
[Figure: PDB new fold growth, old folds vs. new folds]
• The number of unique folds in nature is fairly small (possibly a
few thousand)
• 90% of new structures submitted to PDB in the past three
years have folds similar to ones already in PDB
Identification of fold
If sequence similarity is
high, proteins share
structure (Safe zone)
If sequence similarity is low,
proteins may share
structure (Twilight zone)
Most proteins do not have a
partner with high sequence
homology
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11, 2836-47
Why %id is so bad!!
1200 models sharing 25-95% sequence identity with the
submitted sequences (www.expasy.ch/swissmod)
Identification of correct fold
• % ID is a poor measure
– Many evolutionarily related proteins share
low sequence homology
• Alignment score even worse
– Many sequences will score high against
everything (hydrophobic stretches)
• P-value or E-value more reliable
What are P and E values?
• E-value
– Number of expected hits
in database with score
higher than match
– Depends on database size
• P-value
– Probability that a random
hit will have score higher
than match
– Database size independent
[Figure: score distribution. Score 150; 10 hits with higher score (E=10); 10000 hits in database => P=10/10000 = 0.001]
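The numbers in the example can be reproduced with a short sketch. It follows the simplified empirical definitions given on this slide (E = number of database hits scoring higher than the match, P = E divided by the number of hits in the database), not the extreme-value statistics that BLAST itself reports. The function name and the toy score list are made up for illustration.

```python
def empirical_e_and_p(match_score, database_scores):
    """Return (E, P) using the simplified definitions from the slide.

    E: number of database hits scoring higher than the match.
    P: E divided by the total number of hits in the database.
    """
    if not database_scores:
        raise ValueError("database_scores must not be empty")
    e_value = sum(1 for s in database_scores if s > match_score)
    p_value = e_value / len(database_scores)
    return e_value, p_value


# Toy database (hypothetical): 10 hits above 150 and 9990 hits below it.
scores = [200] * 10 + [50] * 9990
e, p = empirical_e_and_p(150, scores)
print(e, p)  # 10 0.001
```

Doubling the database with more low-scoring hits leaves E at 10 but halves P in this toy definition, which is why the two numbers should not be compared across searches without knowing how each was computed.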
How to do it
Identify fold
(template) for
modeling
– Find the structure in
the PDB database that
resembles your new
protein the most
– Can be used to predict
function
Align protein sequence
to template
– Simple alignment
methods
– Sequence profiles
– Threading methods
– Pseudo force fields
Model side chains and
loops
Template identification
Simple sequence based methods
– Align (BLAST) sequence against sequence of proteins
with known structure (PDB database)
Sequence profile based methods
– Align sequence profile (Psi-BLAST) against sequence
of proteins with known structure (PDB)
– Align sequence profile against profile of proteins with
known structure (FFAS)
Sequence and structure based methods
– Align profile and predicted secondary structure
against proteins with known structure (3D-PSSM)
Sequence profiles
In conventional alignment, a scoring matrix
(BLOSUM62) gives the score for
matching two amino acids
– In reality not all positions in a protein are
equally likely to mutate
– Some amino acids (active sites) are highly
conserved, and the penalty for a mismatch must
be very high
– Other amino acids can mutate almost for
free, and the penalty for a mismatch is lower
than the BLOSUM penalty
Sequence profiles can capture these
differences
Protein structure classification
Protein world
Protein fold
Protein superfamily
Protein family
New Fold
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
– Conserved position: matching anything
but G => large negative score
– Variable position: anything can match
Sequence profiles
1. Align (BLAST) sequence against large sequence
database (Swiss-Prot)
2. Select significant alignments and make profile
(weight matrix) using techniques for sequence
weighting and pseudo counts
3. Use weight matrix to align against sequence
database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
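The profile-building step can be sketched in a few lines. This is a minimal illustration, not what Psi-BLAST actually does: it assumes uniform background frequencies, a flat pseudo count per amino acid, no sequence weighting, and it ignores gaps. The function names and the four-sequence toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned sequences.

    Assumptions (not from the slides): uniform background frequencies,
    a flat pseudo count per amino acid, gaps ignored, no sequence weighting.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_sequence(profile, seq):
    """Score an ungapped sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Toy alignment (hypothetical): column 0 is a fully conserved G.
alignment = ["GDSG", "GESA", "GDTG", "GNSL"]
profile = build_profile(alignment)
print(round(profile[0]["G"], 2), round(profile[0]["A"], 2))
```

In the conserved first column G gets a positive score and every other residue a negative one, while the more variable columns give flatter scores. That position dependence is what a single BLOSUM matrix cannot express.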
Example.
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
• What is the function?
• Where is the active site?
Example.
• Function
• Run Blast against PDB
• No significant hits
• Run Blast against NR (Sequence database)
• Function is Acetylesterase?
• Where is the active site?
Example. Where is the active site?
1G66 Acetylxylan esterase
1USW Hydrolase
1WAB Acetylhydrolase
Example. Where is the active site?
• Align sequence against structures of
known acetylesterases, such as
1WAB, 1FXW, …
• Cannot be aligned: sequence
similarity is too low
1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A
71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._
160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY
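For reference, the %id of an aligned fragment such as the QAL/DAL pair above can be computed as follows. This is a minimal sketch: the function name is made up, and the choice of denominator (columns where neither sequence has a gap) is one common convention among several, so values from different tools may not agree.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# The two aligned fragments from the slide.
query = "GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF"
template = "GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY"
print(f"{percent_identity(query, template):.1f}% identity")
```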
Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around
S9, G42, N74, and H195
Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = 5.29522
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG-----
[Marked residues in this block: S, G, N]
1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP
1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L
[Marked residue in this block: H]
Structural superposition
Blue: 1K7C.A
Red: 1WAB._
Where was the active site?
Rhamnogalacturonan
acetylesterase (1k7c)
Including structure
• Sequences within a protein superfamily
share only remote sequence homology,
but they share high structural homology
• Structure is known for template
• Predict structural properties for query
– Secondary structure
– Surface exposure
• Position specific gap penalties derived from
secondary structure and surface exposure
Structure biased alignment (3D-PSSM)
http://www.sbg.bio.ic.ac.uk/~3dpssm/
CASP. Which are the best methods
• Critical Assessment of protein Structure Prediction
• Every second year
• Sequences from about-to-be-solved structures are given to groups who submit
their predictions before the structure is
published
• Modelers make predictions
• Meeting in December where correct answers
are revealed
CASP6 results
The top 4 homology modeling groups in
CASP6
• All winners use consensus predictions
– The wisdom of the crowd
• Same approach as in CASP5!
• Nothing has happened in 2 years!
The Wisdom of the Crowds
The Wisdom of Crowds. Why the Many are
Smarter than the Few. James Surowiecki
One day in the fall of 1906, the British scientist Francis
Galton left his home and headed for a country fair… He
believed that only a very few people had the
characteristics necessary to keep societies healthy. He
had devoted much of his career to measuring those
characteristics, in fact, in order to prove that the vast
majority of people did not have them. … Galton came
across a weight-judging competition…Eight hundred people
tried their luck. They were a diverse lot, butchers,
farmers, clerks and many other non-experts…The crowd
had guessed … 1,197 pounds; the ox weighed 1,198
The wisdom of the crowd!
– The highest scoring hit will often be wrong
• Not one single prediction method is
consistently best
– Many prediction methods will have the
correct fold among the top 10-20 hits
– If many different prediction methods all have
a common fold among the top hits, this fold is
probably correct
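The voting idea in the last bullet can be sketched as a simple count. This is an illustration only: the method names and fold identifiers are placeholders, and real consensus servers compare 3D models rather than fold labels.

```python
from collections import Counter


def consensus_fold(predictions, top_n=10):
    """Return (fold, votes) for the fold found in the top hits of most methods.

    predictions: dict mapping method name -> ranked list of fold identifiers.
    Each method votes at most once per fold.
    """
    votes = Counter()
    for ranked_folds in predictions.values():
        votes.update(set(ranked_folds[:top_n]))
    if not votes:
        return None, 0
    return votes.most_common(1)[0]


# Hypothetical ranked hit lists from three prediction methods.
predictions = {
    "method_A": ["fold_1", "fold_7", "fold_3"],
    "method_B": ["fold_2", "fold_3", "fold_9"],
    "method_C": ["fold_3", "fold_5", "fold_1"],
}
print(consensus_fold(predictions))  # ('fold_3', 3)
```

Note that fold_3 is the top hit of only one method, yet it is the only fold all three methods agree on.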
3D-Jury (Best group)
Inspired by Ab initio modeling methods
– Average of frequently obtained low energy
structures is often closer to the native structure
than the lowest energy structure
Find most abundant high scoring model in a list of
prediction from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score $S_{ij}$ = number of Cα pairs within 3.5 Å
(if number > 40; else $S_{ij} = 0$)
4. 3D-Jury score for model $i$ = $\sum_{j} S_{ij} / (N+1)$
Similar methods developed by A Elofsson (Pcons)
and D Fischer (3D shotgun)
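Steps 3 and 4 of the scoring scheme can be sketched as follows. The sketch assumes the pairwise superpositions have already been done elsewhere and that the number of Cα pairs within 3.5 Å is available as a symmetric matrix. It also assumes that N is the number of other models each model is compared against, which the slide does not state. The function name and the toy matrix are hypothetical.

```python
def jury_scores(pair_counts, threshold=40):
    """Score each model from a symmetric matrix of pairwise similarity counts.

    pair_counts[i][j]: number of C-alpha pairs within 3.5 A after
    superimposing models i and j (assumed precomputed elsewhere).
    Counts not above `threshold` contribute 0, as on the slide.
    N is assumed to be the number of other models (an assumption).
    """
    n_models = len(pair_counts)
    n_other = n_models - 1
    scores = []
    for i in range(n_models):
        total = 0
        for j in range(n_models):
            if i == j:
                continue  # a model is not compared with itself
            s_ij = pair_counts[i][j]
            if s_ij > threshold:
                total += s_ij
        scores.append(total / (n_other + 1))
    return scores


# Toy matrix (hypothetical): models 0-2 resemble each other, model 3 is an outlier.
pair_counts = [
    [0, 80, 75, 10],
    [80, 0, 90, 20],
    [75, 90, 0, 30],
    [10, 20, 30, 0],
]
scores = jury_scores(pair_counts)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)  # [38.75, 42.5, 41.25, 0.0] 1
```

The outlier scores zero because none of its pairwise counts exceeds the threshold, while the model most similar to the rest of the cluster is selected.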
How to do it? Where is the crowd?
• Meta prediction server
– Web interface to a list of public protein
structure prediction servers
– Submit query sequence to all selected servers
in one go
http://bioinfo.pl/meta/
Meta Server
Evaluating the crowd.
Meta Server
Evaluating the crowd. 3D Jury
From fold to structure
Flying to the moon has not made man
conquer space
Finding the right fold does not allow you to
make accurate protein models
– Can allow prediction of protein function
Alignment is still a very hard problem
– Most protein interactions are determined by
the loops, and they are the least conserved
parts of a protein structure
Ab initio protein modeling
Modeling of new-fold proteins
• Only when everything else fails
• Challenge
• Close to impossible to model Nature's
folding potential
Challenge. Folding potential
• New folds are in general
constructed from a set of
subunits, where each subunit
is part of a known fold.
• The subunits are small
compared to the overall fold
of the protein. No objective
function exists to guide the
global packing of the subunits.
Objective function
sij = 120aa
dij = 6Å
A way to solution
• Glue the structure together piecewise from fragments.
• Guide process by empirical/statistical potential
Fragments with correct
local structure
Example (Rosetta web server)
www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php
Structure
Rosetta prediction
Take home message
• Identifying the correct fold is only a small step
towards successful homology modeling
• Do not trust % ID or alignment score to identify
the fold. Use p-values
• Use sequence profiles and local protein
structure to align sequences
• Do not trust one single prediction method, use
consensus methods (3D Jury)
• Only if everything else fails, use ab initio methods