PROTEOMICS
3D Structure Prediction
Contents
• Protein 3D structure.
– Basics
– PDB
– Prediction approaches
• Protein classification.
Protein Structures:
Primary
Amino acid
sequence
Secondary
Alpha helices &
Beta sheets,
loops.
Tertiary
Packing of
secondary
elements.
Quaternary
Packing of several
polypeptide chains
How Does a Protein Fold
• The classical nucleation-propagation model:
– the first event (fast) is hydrophobic collapse
accompanied by the formation of secondary
structures.
In this step domains are formed.
– the second step (slow) is the precise ordering
of the secondary elements: packing of
hydrophobic core, domain arrangement, etc.
• The 3D structure is assumed to be the most
stable structure - minimal free energy.
– Local minimum or global minimum?
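The local-versus-global question can be illustrated with a toy one-dimensional energy landscape. This is only a sketch: the double-well function, step size, and starting points are invented for illustration and do not model any real protein.

```python
def energy(x):
    # Toy double-well: shallow minimum near x = +1, deeper minimum near x = -1.
    return (x**2 - 1) ** 2 + 0.3 * x

def gradient(x):
    # Derivative of energy(x) with respect to x.
    return 4 * x * (x**2 - 1) + 0.3

def descend(x, step=0.01, iters=5000):
    # Plain gradient descent: always moves downhill, so it can get trapped.
    for _ in range(iters):
        x -= step * gradient(x)
    return x

trapped = descend(1.5)    # starts in the right-hand well and stays there
deepest = descend(-1.5)   # starts in the left-hand well, the global minimum

print(f"from +1.5: x = {trapped:.3f}, E = {energy(trapped):.3f}")
print(f"from -1.5: x = {deepest:.3f}, E = {energy(deepest):.3f}")
```

Both runs end in a stable state, but only one of them is the global minimum, which is the situation the prion example that follows describes.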
Prions
• Proteins found in mammals.
• Responsible for mad cow disease (BSE).
• There is no difference in the sequence of a normal
prion and an abnormal prion.
• The difference lies in the 3D structure.
• Disease is assumed to be propagated by the
introduction of an abnormal prion that is capable of
changing the configuration of a normal prion to an
abnormal one.
• Conclusion: there are several stable configurations
for a single protein.
PDB - Protein Data Bank
• http://www.rcsb.org/pdb/index.html
• Contains proteins whose structure has been
solved.
• Number of solved proteins: 19,225.
• Ratio of solved structures / proteins: 1/7
(SwissProt) - 1/40 (TrEMBL)
• The entry for each protein consists of the
x,y,z coordinates of every atom.
• Tutorial
http://www.rcsb.org/pdb/query_tut.html
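A minimal sketch of reading atom coordinates from PDB-format text, assuming the standard fixed-width ATOM record layout (x, y, z in columns 31-38, 39-46, 47-54). The sample records are made up for illustration and are not copied from any real entry; `parse_atom_line` and `parse_pdb` are hypothetical helper names.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM/HETATM record."""
    return {
        "atom": line[12:16].strip(),     # atom name, columns 13-16
        "residue": line[17:20].strip(),  # residue name, columns 18-20
        "chain": line[21],               # chain identifier, column 22
        "res_num": int(line[22:26]),     # residue sequence number, columns 23-26
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

def parse_pdb(text):
    """Return one dict per ATOM/HETATM record, skipping all other records."""
    return [
        parse_atom_line(line)
        for line in text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]

# Invented sample records (not from a real PDB entry).
sample = "\n".join([
    "HEADER    EXAMPLE",
    "ATOM      1  N   GLY A 124      -3.513   7.684   2.146  1.00  0.00",
    "ATOM      2  CA  GLY A 124      -2.207   8.216   1.771  1.00  0.00",
])

for atom in parse_pdb(sample):
    print(atom["atom"], atom["residue"], atom["res_num"], atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters because adjacent PDB fields can run together without a separating space.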
Prion Protein Domain from Mouse – Entry 1AG2:
Ribbons Vs. Cylinders
Broad View of the protein world I
• Estimation: ~1000-20,000 protein families
composed of members that share
detectable sequence similarity.
– A new sequence is expected to be similar to
other sequences in the data base, and can be
expected to share structural features with
these proteins.
• Structure prediction:
– >50% sequence identity implies similar structure.
– >30% sequence identity implies common structural
elements.
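The two rules of thumb above can be sketched as a small function over a pairwise alignment. This assumes the two sequences are already aligned to equal length with "-" for gaps, and that identity is measured over ungapped columns only; both are illustrative choices, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

def structural_expectation(identity):
    """Apply the slide's rough thresholds to a percent identity value."""
    if identity > 50:
        return "similar structure"
    if identity > 30:
        return "common structural elements"
    return "no inference from sequence identity alone"

identity = percent_identity("ACDEFGHIK-", "ACDQFGHLKM")
print(round(identity, 1), structural_expectation(identity))
```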
Broad View of the protein world II
• There is a limited number of different 3D
structures.
– When newly determined structures are compared with
previously found structures, the new structure
often folds into alpha & beta elements in the
same order and in the same spatial configuration
as already known structures.
• Often there is no sequence similarity.
• Totally different sequences can fold into
similar structures.
Three Main Approaches
for Structural Prediction:
• Ab-Initio.
• Comparative Modeling.
• Fold Recognition.
Example:
A pathway for folding
a 2-domain protein.
http://www.pdg.cnb.uam.es/cursos/FVi2001DIA1/
The Ab-Initio Method
• The Structural Prediction Problem: “Given a
protein sequence, compute its structure”.
• Computation is based on energy calculation
stemming from the position of each atom in space
and its physical-chemical relations with other
atoms.
• Theoretically possible.
• Astronomical, highly under-constrained search
space.
• The biophysics is complex and incompletely understood.
• Practically, next to impossible.
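A back-of-the-envelope sketch of why the search space is called astronomical. The numbers are assumptions chosen for illustration only: 3 backbone conformations per residue, a 100-residue chain, and 10^13 conformations sampled per second.

```python
import math

residues = 100              # assumed chain length
states_per_residue = 3      # assumed conformations per residue
samples_per_second = 1e13   # assumed sampling rate
seconds_per_year = 3.15e7

# Work in log10 so the numbers stay manageable.
log10_conformations = residues * math.log10(states_per_residue)
log10_seconds = log10_conformations - math.log10(samples_per_second)
log10_years = log10_seconds - math.log10(seconds_per_year)

print(f"conformations ~ 10^{log10_conformations:.1f}")
print(f"exhaustive search ~ 10^{log10_years:.1f} years")
```

Under these assumptions an exhaustive search would take vastly longer than the age of the universe, so any practical method must prune or guide the search.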
Comparative (Homology) Modeling
• Evolutionarily related proteins (homologous) usually
have similar structure.
• The similarity of structures is very high in core
regions (helices & sheets).
However, loops may vary even in pairs of homologous
structures with a high degree of sequence similarity.
Thick backbone - known
structure.
Thin lines - modeled
structure.
Some side-chains are not
positioned correctly,
but some look good.
Modeling Performance
Structure similarity predicted from sequence
similarity:
• Sander & Schneider (1991) aligned all the
sequences in PDB.
• Developed a formula for structure similarity based
on sequence similarity.
• Structure similarity depends on the length of the
protein.
Modeling Performance - Examples
• A protein of 10 amino acids requires 80% identity
for a similar structure.
• A protein of length > 80 requires
– ~30% identity for common sub-structures.
– ~50% identity for a similar structure.
– ~80% identity for a similar structure at very
good resolution.
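One commonly quoted form of the length-dependent cutoff attributed to Sander & Schneider is t(L) = 290.15 · L^(-0.562) percent for alignments of 10 to 80 residues, and a flat 24.8% above 80 residues. The constants in this sketch are recalled from the literature rather than taken from these slides, so treat them as an assumption to verify. They reproduce the 80% figure for 10 residues but give a lower cutoff for long proteins than the rounded ~30% quoted above.

```python
def identity_threshold(length):
    """Approximate percent identity above which structural similarity is expected.

    Constants are an assumption recalled from the literature on the
    Sander & Schneider threshold curve; verify before relying on them.
    """
    if length < 10:
        raise ValueError("curve is not defined for alignments shorter than 10")
    if length <= 80:
        return 290.15 * length ** -0.562
    return 24.8

for n in (10, 20, 40, 80, 150):
    print(n, round(identity_threshold(n), 1))
```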
Fold Recognition Approaches
Fold - a combination of secondary structural units in
the same configuration.
Protein structural classification uses fold as a basic
level of classification.
Fold<->Family Relations
• Estimation 1: There are 1500-20,000 protein
families, based on homology.
Each family contains ~ one fold.
• Estimation 2: There are 700-1500 protein folds.
• Conclusions:
1. Many protein families share the same fold.
2. Different sequences are folded similarly.
• The common fold approach to structure
prediction: Use the collection of determined
structures to predict the structure of a protein.
How Condensed is a Fold?
How many different sequences can result in the
same fold for an average domain of 150 amino
acids?
– There are $20^{150} \approx 10^{195}$ different sequences.
– About $10^{38}$ are less than 20% identical.
– Assume that only 1 in a million has a stable fold:
$10^{32}$.
– Expected number of different folds is 1000.
– About $10^{29}$ different sequences fold similarly.
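The estimate above can be checked in log10 space. This sketch takes the slide's inputs (10^38 dissimilar sequences, 1 in a million stable, 1000 folds) as given and only redoes the arithmetic; note that 20^150 works out to about 10^195.

```python
import math

domain_length = 150
log10_all_sequences = domain_length * math.log10(20)   # log10 of 20^150

log10_dissimilar = 38                        # slide's figure: < 20% identical
log10_stable = log10_dissimilar - 6          # 1 in a million folds stably
log10_per_fold = log10_stable - math.log10(1000)   # spread over 1000 folds

print(f"all sequences      ~ 10^{log10_all_sequences:.0f}")
print(f"stable, dissimilar ~ 10^{log10_stable}")
print(f"sequences per fold ~ 10^{log10_per_fold:.0f}")
```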
Fold Recognition
• A fold is shared by family members, both close
and distant (distance is related to sequence
similarity)
– the globin fold
• For a query protein - if its family members are
identified, and their fold is known, we could assign
it the same fold.
Method 1: Which alignment algorithm detects close
and distant relatives?
PSI-BLAST
Fold Recognition - Threading
• Threading allows for identification of structure
similarity without sequence similarity.
• The amino acid (aa) sequence of a query protein is
examined for compatibility with the structural
core of a known protein.
“Given a protein structure, what sequences fold into it?”
Threading
• The protein core is a very compact environment
composed of alpha and beta secondary structures.
• The core is very hydrophobic, with no room for water
molecules, extra aa, or aa with chemically different
side chains.
• Side chains have many contacts with neighboring
aa for stability.
• Threading matches the aa of the query with aa of
a known structure:
– If threading gives a good score, then the core
of the query is assumed to fold similarly.
Threading
• Two main methods:
– Contact potential method.
– Structural profile (Environmental template).
• Contact potential method
– the number of contact points and proximity
between aa is analyzed for every known
structure.
– The query is checked against all the
interactions in the core and their contribution
to the stability of the structure.
– The fold that results in the most energetically
stable structure is chosen.
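The contact potential idea can be sketched with a toy pairwise energy. Everything here is an assumption for illustration: the hydrophobic set, the energy values, the fold names, and the contact lists are invented, and the query is threaded without gaps (query residue i sits at template position i). Real methods use statistically derived potentials and allow gaps.

```python
HYDROPHOBIC = set("AVLIMFWC")  # simplified hydrophobic set (assumption)

def pair_energy(a, b):
    """Toy contact energy: lower is more stable."""
    a_h, b_h = a in HYDROPHOBIC, b in HYDROPHOBIC
    if a_h and b_h:
        return -1.0   # hydrophobic-hydrophobic contact is favorable
    if a_h != b_h:
        return 0.5    # hydrophobic-polar contact is penalized
    return 0.0        # polar-polar contact is neutral

def threading_energy(query, contacts):
    """Sum the pair energies over a template's core contacts (index pairs)."""
    return sum(pair_energy(query[i], query[j]) for i, j in contacts)

def best_fold(query, folds):
    """Pick the template whose contact map gives the lowest total energy."""
    return min(folds, key=lambda name: threading_energy(query, folds[name]))

# Hypothetical templates, each described only by its core contacts.
folds = {
    "fold_A": [(0, 2), (2, 4), (4, 7)],
    "fold_B": [(1, 3), (3, 6), (0, 1)],
}
query = "LKVEALKI"

for name, contacts in folds.items():
    print(name, threading_energy(query, contacts))
print("chosen:", best_fold(query, folds))
```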
Threading - Structural Profile
• The environment of every aa in known structures
is determined, including
– the secondary structure, the area of the side-chain that
is buried by closeness to other atoms, types of nearby
chains, etc.
• Each position is classified into one of 18 types
– 6 representing increasing levels of residue burial and
fraction of surface covered by polar atoms
– combined with three classes of secondary structures.
• Each aa is assessed for its ability to fit into that
type of site in the structure.
– Buried group is matched well with hydrophobic aa.
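The 18 environment types are the product of 6 burial/polarity classes and 3 secondary-structure classes. The sketch below enumerates them and scores a residue against a class. The class labels and the +1/-1 scoring rule are hypothetical placeholders; real structural profiles use scores derived from known structures.

```python
from itertools import product

# Hypothetical labels: three buried grades, two partly buried grades, one exposed.
BURIAL_CLASSES = ["B1", "B2", "B3", "P1", "P2", "E"]
SECONDARY_CLASSES = ["helix", "sheet", "other"]

# Every combination of burial class and secondary structure: 6 x 3 = 18 types.
ENVIRONMENTS = [f"{b}/{s}" for b, s in product(BURIAL_CLASSES, SECONDARY_CLASSES)]

HYDROPHOBIC = set("AVLIMFWC")  # simplified hydrophobic set (assumption)

def toy_fit(amino_acid, environment):
    """Toy compatibility: +1 when hydrophobicity matches burial, -1 otherwise.

    Partly buried (P) and exposed (E) classes are both treated as exposed
    here, purely to keep the example short.
    """
    buried = environment.startswith("B")
    return 1 if (amino_acid in HYDROPHOBIC) == buried else -1

print(len(ENVIRONMENTS), ENVIRONMENTS[:3])
print(toy_fit("L", "B1/helix"), toy_fit("K", "B1/helix"), toy_fit("K", "E/other"))
```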
Structural Profile
• Profile rows are the residue positions in the
structure, each labeled with one of the 18
environment types.
• Profile columns are the 20 aa + insertion +
deletion.
– If a residue is inside a loop, many substitutions
are allowed, as well as insertions and deletions.
• The score for a given aa at a position
estimates how well the aa fits that
position's environment type.
• How shall we find the best fitting region?
Structural Profile
• Dynamic programming algorithm finds the
best match of a query sequence to a
specific fold.
– Statistical significance can be computed by
doing the above for all sequences in the
database.
• The same analysis will be repeated for each
fold.
• The fold with the best statistically
significant score is chosen.
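A minimal sketch of the dynamic programming step, assuming a global alignment with a single linear gap penalty and a profile given as one score table per structural position. The score values, the gap penalty, and the use of a z-score as the significance measure are illustrative assumptions, not the method of any particular program.

```python
import statistics

def align_to_profile(query, profile, gap=-2.0, unknown=-1.0):
    """Best global alignment score of a query sequence against a profile.

    profile[j] maps amino acid -> score at structural position j.
    """
    n, m = len(query), len(profile)
    # dp[i][j] = best score aligning query[:i] with profile[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + profile[j - 1].get(query[i - 1], unknown)
            insertion = dp[i - 1][j] + gap   # query residue left unaligned
            deletion = dp[i][j - 1] + gap    # profile position left unaligned
            dp[i][j] = max(match, insertion, deletion)
    return dp[n][m]

def z_score(score, background_scores):
    """How many standard deviations the score lies above the background mean."""
    spread = statistics.pstdev(background_scores)
    if spread == 0:
        return 0.0
    return (score - statistics.mean(background_scores)) / spread

# Toy profile: buried positions favor hydrophobic aa, exposed favor charged aa.
buried = {"L": 2, "V": 2, "I": 2, "A": 1, "K": -2, "E": -2, "D": -2}
exposed = {"K": 2, "E": 2, "D": 2, "L": -1, "V": -1, "I": -1}
profile = [buried, exposed, buried, exposed]

database = ["KLEV", "EEKK", "LLVV", "DKDE"]  # invented background sequences
background = [align_to_profile(seq, profile) for seq in database]
score = align_to_profile("LKVE", profile)
print(score, round(z_score(score, background), 2))
```

Repeating this for every fold profile and keeping the fold with the highest significant score gives the selection rule described above.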
Threading - Pros and Cons:
• Good results.
• Environmental properties may be more accurate
than amino acid similarity matrices.
• Can lead to effective and fast implementations.
• Able to discover structural similarities impossible
to detect by sequence searching methods.
• Requires the existence of already known proteins
with similar structure.
CASP - Critical Assessment of Structure
Prediction
• Competition among different groups for resolving
the 3D structure of proteins that are about to be
solved experimentally.
• Current state - only fragments are “solved”:
– Ab-initio - the worst, but greatly improved in
recent years.
– Modeling - performs very well when homologous
sequences with known structures exist.
– Fold recognition - PSI-BLAST is used for training
the threading procedures. Performs well.
A Clickable
Structure
Prediction
Flowchart:
http://www.bmm.icnet.uk/people/rob/CCP11BBS/flowchart2.html
Protein Classification
Proteins are classified to reflect both structural and
evolutionary relatedness. The principal levels are:
1. Family: Clear evolutionary relationship.
In general, > 30% pairwise residue identity between
the proteins.
2. Superfamily: Probable common evolutionary origin.
Combines families whose member proteins
have low sequence identities, but whose structural
and functional features suggest a common
evolutionary origin.
Structurally, superfamily members share a common
fold.
SCOP - Structural
Classification of Proteins
• http://scop.mrc-lmb.cam.ac.uk/scop/
• Hierarchical classification of all proteins
with known structures.
• Classification (from broadest to most specific):
– Class - all alpha, all beta, alpha & beta (a/b),
alpha + beta (a + b).
– Fold - the major structural similarity unit.
– Superfamily.
– Family.
– PDB entry for a protein.
CATH - Class, Architecture,
Topology, Homologous Superfamily
• http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html
• Another protein structure classification database.
• Classification:
– Class - all alpha, all beta, alpha & beta (a/b), alpha + beta
(a + b).
– Architecture - gross orientation of
secondary structures, independent
of connectivity.
– Topology - clusters structures according
to their topological connections and
numbers of secondary structures.
– Homologous superfamilies - clusters proteins with highly
similar structures and functions.
PFAM - Protein Families
• http://www.sanger.ac.uk/Software/Pfam/
• A database containing a large collection of multiple
sequence alignments and profile hidden Markov
models (profile HMMs).
• Profile HMM is a probabilistic model which
describes a set of sequences.
• Widely used to describe related sequences.
• Defines domains - areas of homology that have a
3D structure independent of the rest of the
protein.
ProtoMap
• http://protomap.cornell.edu/
• Classification of all the proteins in the SWISSPROT and TrEMBL
databases into groups of related proteins.