Bioinformatics of proteins

Bioinformatics of proteins:
Sequence, structure and the
‘symbiosis’ between them
Maya Schushan
The Ben-Tal lab
OUTLINE
• Sequence:
Databases, domains, motifs & annotations
• Structure:
Secondary structure, structure databases,
visualization and identification of functional
site
Sequences, domains, motifs & annotations
UniProt
• UniProt is a collaboration between the
European Bioinformatics Institute (EBI),
the Swiss Institute of Bioinformatics
(SIB) and the Protein Information
Resource (PIR).
• In 2002, the three institutes decided to
pool their resources and expertise and
formed the UniProt Consortium.
Sequences, domains, motifs & annotations
UniProt
• The world's most comprehensive catalog of information on
proteins
• Sequence, function & more…
• Comprised mainly of the databases:
– SwissProt – 366226 protein entries last year, 412525 now –
high-quality annotation, non-redundant & cross-referenced to many
other databases.
– TrEMBL – 5708298 protein entries last year, 7341751 now –
computer translation of the genetic information from the EMBL
Nucleotide Sequence Database → many proteins are poorly
annotated, since only automatic annotation is generated
Sequences, domains, motifs & annotations
UniProt
• Annotation description includes:
– Function(s) of the protein;
– Posttranslational modification(s) such as carbohydrates,
phosphorylation, acetylation and GPI-anchor;
– Domains and sites, for example, calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes;
– Secondary structure, e.g. alpha helix, beta sheet;
– Quaternary structure, e.g. homodimer, heterotrimer, etc.;
– Similarities to other proteins;
– Disease(s) associated with any number of deficiencies in the
protein;
– Sequence conflicts, variants, etc
Sequences, domains, motifs & annotations
UniProt
• Connected to many other databases
(e.g. Pfam , Prosite, EC, GO, PdbSum, PDB (to be discussed…))
• Each sequence has a unique 6-character accession
• Entries in SwissProt also have IDs, which usually make sense
(e.g. CADH1_HUMAN for a cadherin of humans)
• Download sequence in FASTA format
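A downloaded FASTA file is simple to read programmatically. The sketch below is a minimal, hedged example: `parse_fasta` is a hypothetical helper name, and the URL pattern in `fetch_uniprot_fasta` is an assumption about the current UniProt web service rather than something stated on these slides.

```python
from urllib.request import urlopen


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_uniprot_fasta(accession):
    """Download one entry as FASTA text (assumed URL pattern; needs network access)."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

With network access, `parse_fasta(fetch_uniprot_fasta("P05102"))` would be expected to return a single record for the entry used in the following slides.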
Sequences, domains, motifs & annotations
UniProt: http://www.uniprot.org/
Type accession:
P05102
Or ID:
MTH1_HAEPH
Sequences, domains, motifs & annotations
General data: name, origin, EC (enzymatic reaction)…
Sequences, domains, motifs & annotations
Functional data, including the GO annotations
Scroll down to find the sequence & download the FASTA
Sequences, domains, motifs & annotations
Known sites, predicted/known secondary structures,
Natural variation or mutagenesis
Sequences, domains, motifs & annotations
The protein’s sequence in FASTA format
Download
Send to BLAST
Sequences, domains, motifs & annotations
References for all info in the page- important to take a look…
Sequences, domains, motifs & annotations
Connections to other databases
• Other sequence databases, e.g. GenBank
• Related structures in the PDB (if available)
• Model structure in the ModBase database - automatically derived!
• All sorts of domain/motif databases
• The family related to the entry
Sequences, domains, motifs & annotations
Pfam- domain database
•Proteins are generally composed of one or more
functional regions, commonly termed domains.
•Different combinations of domains give rise to the
diverse range of proteins found in nature.
•The identification of domains that occur within
proteins can therefore provide insights into their
function.
Sequences, domains, motifs & annotations
Pfam- domain database
• The Pfam database is a large collection of protein domain
families.
• Each family is represented by multiple sequence alignments
and hidden Markov models (HMMs).
• Pfam entries are classified in one of four ways:
Family: A collection of related proteins
Domain: A structural unit which can be found in
multiple protein contexts
Repeat: A short unit which is unstable in isolation but
forms a stable structure when multiple copies are
present
Motifs: A short unit found outside globular domains
Sequences, domains, motifs & annotations
Pfam- domain database
There are two components to Pfam:
• Pfam-A entries are high quality, manually curated families.
These Pfam-A entries cover a large proportion of the
sequences in the sequence database.
• Pfam-B- automatically generated entries. Although of lower
quality, Pfam-B families can be useful for identifying
functionally conserved regions when no Pfam-A entries are
found.
•Pfam also generates higher-level groupings of related
families, known as clans. A clan is a collection of Pfam-A
entries which are related by similarity of sequence, structure
or profile-HMM.
Sequences, domains, motifs & annotations
Pfam- domain database
Pfam (http://pfam.sanger.ac.uk/) allows you to:
•Analyze your protein sequence for Pfam matches
•View Pfam family annotation and alignments
•See groups of related families
•Look at the domain organization of a protein sequence
•Find the domains on a PDB structure
•Query Pfam by keyword
Sequences, domains, motifs & annotations
Pfam- domain database
Searching for a certain protein accession
Sequences, domains, motifs & annotations
Other domain/motifs databases:
• PROSITE
• BLOCKS
• InterPro
• SMART
• Etc…
Sequences, domains, motifs & annotations
Classifying protein function
• Each protein performs one (or more…) specific
functions, e.g. catalysis of a
specific enzymatic reaction, transport of an ion,
interaction with a DNA molecule etc…
• To make specific functions easy to refer to,
attempts have been made to enumerate and
classify the various functions performed by
proteins.
Sequences, domains, motifs & annotations
Classifying protein function
Example: some of the diverse
functions exhibited by
membrane proteins.
Sequences, domains, motifs & annotations
Enzyme Commission number (EC number)
• A numerical classification scheme for enzymes,
based on the chemical reactions they catalyze
• EC numbers do not specify enzymes, but enzyme-catalyzed reactions. If different enzymes (for
instance from different organisms) catalyze the
same reaction, then they receive the same EC
number.
• By contrast, the UniProt database identifiers
uniquely specify a protein by its amino acid
sequence.
Sequences, domains, motifs & annotations
Enzyme Commission number (EC number)
• Every enzyme code consists of the letters "EC" followed by
four numbers separated by periods. Those numbers
represent a progressively finer classification of the enzyme.
• For example, the tripeptide aminopeptidases have the code
"EC 3.4.11.4":
• EC 3 enzymes are hydrolases (enzymes that use water to
break up some other molecule)
• EC 3.4 are hydrolases that act on peptide bonds
• EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
• EC 3.4.11.4 are those that cleave off the amino-terminal
end from a tripeptide
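The progressively finer classification can be made concrete with a few lines of code. This is a small sketch, not part of any official EC tooling: `parse_ec` and `top_level_class` are hypothetical helper names, and the class table holds only the six top-level classes listed in this presentation.

```python
# Top-level EC classes as listed in this presentation.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(ec):
    """Split e.g. 'EC 3.4.11.4' into its four nested levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    # Each level is a prefix of the full number: 3, 3.4, 3.4.11, 3.4.11.4
    return [".".join(parts[:i]) for i in range(1, 5)]


def top_level_class(ec):
    """Return the name of the top-level class for an EC number."""
    return EC_CLASSES[int(parse_ec(ec)[0])]


print(parse_ec("EC 3.4.11.4"))
print(top_level_class("EC 3.4.11.4"))
```

Each element of the returned list is one step down the hierarchy, from the general class to the specific reaction.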
Sequences, domains, motifs & annotations
Enzyme Commission number (EC number)
• For example, the tripeptide aminopeptidases have the code
"EC 3.4.11.4", as shown for an enzyme from
Lactobacillus helveticus in BRENDA, the
Comprehensive Enzyme Information System:
Sequences, domains, motifs & annotations
Enzyme Commission number (EC number)
• EC 1 - Oxidoreductases
• EC 2 - Transferases
• EC 3 - Hydrolases
• EC 4 - Lyases
• EC 5 - Isomerases
• EC 6 - Ligases
Sequences, domains, motifs & annotations
Gene Ontology
• A collaborative effort to address the need for consistent
descriptions of gene products in different databases
• The GO project has developed three structured controlled
vocabularies (ontologies) that describe gene products in
terms of their associated biological processes, cellular
components and molecular functions in a species-independent manner.
• The use of GO terms by collaborating databases
facilitates uniform queries across them. The controlled
vocabularies are structured so that they can be queried at
different levels.
Sequences, domains, motifs & annotations
Gene Ontology
Cellular component
A cellular component is just that, a component of a
cell, but with the proviso that it is part of some larger object;
this may be an anatomical structure (e.g. rough
endoplasmic reticulum or nucleus) or a gene product
group (e.g. ribosome, proteasome or a protein dimer)
Sequences, domains, motifs & annotations
Gene Ontology
Biological process
A biological process is a series of events accomplished
by one or more ordered assemblies of molecular
functions.
Examples of biological process terms are signal
transduction or pyrimidine metabolism.
It can be difficult to distinguish between a
biological process and a molecular function, but the
general rule is that a process must have more than
one distinct step.
Sequences, domains, motifs & annotations
Gene Ontology
Molecular function
describes activities, such as catalytic or binding
activities, that occur at the molecular level.
Molecular functions generally correspond to
activities that can be performed by individual gene
products, but some activities are performed by
assembled complexes of gene products.
Examples of broad functional terms are catalytic
activity, transporter activity, or binding; examples
of narrower functional terms are adenylate cyclase
activity or Toll receptor binding.
Sequences, domains, motifs & annotations
Gene Ontology
Topology
The ontologies are in the form of directed acyclic graphs
(DAG), with the graph nodes being GO terms.
The ontologies are hierarchically structured: a more
specialized term (child) can be related to more than one less
specialized term (parent).
E.g. the biological process hexose biosynthetic process has
two parents, hexose metabolic process and monosaccharide
biosynthetic process: a biosynthetic process is a type of
metabolic process and a hexose is a type of monosaccharide.
When any gene is involved in hexose biosynthetic process, it
is automatically annotated to both hexose metabolic process
and monosaccharide biosynthetic process.
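The automatic propagation of annotations up the graph can be sketched as a traversal of the DAG. The tiny parent table below is illustrative only: the two parents of hexose biosynthetic process come from the slide, while the extra link to monosaccharide metabolic process is an assumption added to show propagation over more than one level.

```python
# child term -> list of parent terms (a hand-made fragment, not the real GO file)
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    # Assumed for illustration: both parents share a less specialized ancestor.
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
}


def ancestors(term, parents):
    """Collect every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotation, parents):
    """A gene annotated to a term is implicitly annotated to all its ancestors."""
    return {annotation} | ancestors(annotation, parents)


print(sorted(propagate("hexose biosynthetic process", PARENTS)))
```

Because a term may have several parents, the `seen` set is what keeps a shared ancestor from being visited twice.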
Sequences, domains, motifs & annotations
Gene Ontology Example
Sequences, domains, motifs & annotations
Gene Ontology Interface
Search by gene or protein accession
http://www.geneontology.org/
Sequences, domains, motifs & annotations
Summary of the first part- protein
sequence databases and tools
• UniProt- the most comprehensive protein
sequence database. Connected to many other
databases and resources.
• Pfam- domain database. Many others… InterPro,
PROSITE, BLOCKS etc.
• EC and GO classifications of protein function
OUTLINE
• Sequence:
Databases, domains, motifs & annotations
• Structure:
Secondary structure, structure
databases, visualization and identification
of functional site
Investigating & visualizing protein structures
From Sequence to Structure
• All information about the native structure of a protein
is encoded in the amino acid sequence + its native
solution environment.
• Many possible conformations → still, only one or a few
native folds are exhibited for each protein (Levinthal’s
paradox)
• Protein folding is driven by various forces:
– Ionic forces
– Hydrogen bonds
– The hydrophobic effect
– ...
Investigating & visualizing protein structures
Secondary Structure Prediction
Why predict secondary structures of proteins?
1) When the structure of the protein is still
unknown. This can serve as the first step for
structure prediction- first predict the secondary
structures, then how they are arranged together.
2) For calculating better multiple sequence
alignments or pairwise alignments.
Investigating & visualizing protein structures
Predicting 2° Structure

Each amino acid has a
different propensity for
being in each 2° structure.

For example, Proline causes a
kink which destroys the helix
structure. Thus, Proline is
usually found only at the
helix end.

The different structures
also have typical lengths.
Investigating & visualizing protein structures
Predicting 2° Structure
http://www.predictprotein.org/
Investigating & visualizing protein structures
Predicting 2° Structure
All these and more…
Investigating & visualizing protein structures
Predicting 2° Structure
• Input: Sequence
• Output: Secondary structure prediction,
globular regions, coiled-coil regions,
transmembrane helices, PROSITE motifs,
bonded cysteines…
• The Meta PredictProtein server now allows
many other options…
http://www.predictprotein.org/meta.php
Investigating & visualizing protein structures
Predicting 2° Structure

A common measure is Q3 = the % of amino acids
that were predicted correctly.
Authors        Year   % accuracy   Method
Chou-Fasman    1974   50%          propensities of aa's in secondary structures
Garnier        1978   62%          interactions between aa's
Levin          1993   69%          multiple seq. alignments (MSA)
Rost & Sander  1994   72%          neural networks + MSA

Today, Q3 is about 75-78% (as determined
objectively by CASP)
The theoretical limit is thought to be about 90%
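The Q3 measure itself is straightforward to compute once a predicted and an observed secondary-structure string are available. The sketch below assumes the usual three-state alphabet (H = helix, E = strand, C = coil); the function name `q3` and the example strings are made up for illustration.

```python
def q3(predicted, observed):
    """Percentage of residues whose three-state assignment was predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


# Made-up example: 7 of the 8 positions agree.
print(q3("HHHEECCC", "HHHEEECC"))
```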
Investigating & visualizing protein structures
Predicting 2° Structure
E.g. PSIPRED
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
• A simple and accurate secondary structure
prediction method, incorporating two feed-forward neural networks which perform an
analysis on output obtained from PSI-BLAST.
• Using a very stringent cross-validation method
to evaluate the method's performance, the recent
version of PSIPRED achieves an average Q3 score of
80.7%.
Investigating & visualizing protein structures
Protein 3D Structures
A protein’s structure has a critical effect on its function:
1. Binding pockets
PDB ID 1nw7
Investigating & visualizing protein structures
Protein 3D Structures
A protein’s structure has a critical effect on its function:
2. Areas of specific chemical\electrical properties
Investigating & visualizing protein structures
Protein 3D Structures
A protein’s structure has a critical effect on its function:
3. Importance of the global fold for function
Investigating & visualizing protein structures
Tertiary structure = protein fold
Complete 3-dimensional structure
Why is it interesting? Isn’t the sequence enough?
• The structure is more conserved
• Detection of distant evolutionary relationships
• A key to understanding protein function
• Structure-based drug design
Investigating & visualizing protein structures
RCSB- the Protein Data Bank
• The main & comprehensive database for biological
macro-molecular structures
• Each structure receives a PDB ID: a unique
4-character identifier
• Search by author, PDB ID or any keyword.
• Download structures
Investigating & visualizing protein structures
RCSB- Protein Databank
http://www.rcsb.org/pdb/home/home.do
PDB ID: 3mht
Investigating & visualizing protein structures
RCSB- The Protein Data Bank
Download structure
The paper describing
the structure
Data concerning the
structure: resolution, R-value…
Display
structure
Investigating & visualizing protein structures
RCSB- The Protein Data Bank
PDB files have a specific format:
• TITLE
• REMARK
• COMPND
• JRNL- reference
• SEQRES- the original sequence
• HELIX, SHEET- secondary structure
• ATOM – The actual protein/DNA/RNA chain
• HETATM- additional atoms such as ligands, water etc.
• …
Investigating & visualizing protein structures
RCSB – The Protein Data Bank
PDB files have a specific format:
ATOM      7  SD  MET A   1     -29.059  28.614  71.539  1.00 26.90           S
ATOM      8  CE  MET A   1     -27.535  29.074  70.866  1.00 16.57           C
ATOM      9  N   ILE A   2     -29.656  32.903  69.094  1.00 25.93           N
ATOM     10  CA  ILE A   2     -30.077  33.171  67.730  1.00 25.49           C
HETATM 3139  C6  SAH   328     -11.642  26.514  89.489  1.00 17.97           C
HETATM 3140  N6  SAH   328     -10.474  26.661  90.103  1.00 14.50           N
HETATM 3141  N1  SAH   328     -11.895  25.334  88.899  1.00 23.10           N
HETATM 3142  C2  SAH   328     -13.079  25.090  88.350  1.00 16.93           C
HETATM 3143  N3  SAH   328     -14.120  25.887  88.278  1.00 16.05           N
HETATM 3144  C4  SAH   328     -13.832  27.092  88.861  1.00 14.31           C
HETATM 3145  O   HOH   329     -29.525  42.890  90.934  1.00 24.84           O
HETATM 3146  O   HOH   330     -28.213  42.867  93.588  1.00  8.11           O
HETATM 3147  O   HOH   331     -24.619  35.287  96.173  1.00 17.96           O
Columns: record type; atom serial number; atom name; residue or molecule name; chain (if it exists); residue numbering; coordinates X, Y, Z; occupancy; B-factor; element.
http://www.wwpdb.org/documentation/format3.1-20080211.pdf
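Because ATOM and HETATM records are fixed-width, they are best read by column position rather than by splitting on whitespace (the chain field, for instance, may be blank). The following is a minimal sketch of such a reader; `parse_atom_line` is a hypothetical helper, the column ranges follow the PDB format document linked above, and the sample line is rebuilt field by field from the first record on the slide.

```python
def parse_atom_line(line):
    """Read the main fields of one ATOM/HETATM record by fixed column position."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError(f"not a coordinate record: {record!r}")
    return {
        "record": record,
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21].strip(),         # column 22, may be blank
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),    # columns 77-78
    }


# The first record from the slide, assembled field by field to make the widths explicit.
SAMPLE = "".join([
    "ATOM".ljust(6), "7".rjust(5), " ", " SD ", " ", "MET", " ", "A",
    "1".rjust(4), " " * 4,
    "-29.059".rjust(8), "28.614".rjust(8), "71.539".rjust(8),
    "1.00".rjust(6), "26.90".rjust(6), " " * 10, "S".rjust(2),
])

atom = parse_atom_line(SAMPLE)
print(atom["name"], atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

For real work, an established library such as Biopython's PDB module is usually a safer choice than a hand-written parser.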
Investigating & visualizing protein structures
RCSB – The Protein Data Bank
More Sequences Than Structures
Discrepancy between the number of known sequences and
solved structures:
5,047,807 UniRef90 entries vs.
19988 90% Non-redundant structures
Computational methods are needed to
obtain more structures
Investigating & visualizing protein structures
Fold classification
Classification: clustering proteins into structural
families
Motivation?
• Profound analysis of evolutionary mechanisms
• Constraints on secondary structure packing?
• Classification at domain level
Investigating & visualizing protein structures
Fold classification
http://scop.berkeley.edu
• The SCOP database aims to provide a description of the
structural and evolutionary relationships between all proteins
whose structure is known, including all entries in the PDB.
• The SCOP classification of proteins has been constructed
manually, but with the assistance of tools to make the task
manageable and help provide generality.
Investigating & visualizing protein structures
Fold classification
1. Family: Clear evolutionary relationship
Generally, this means that pairwise residue identities
between the proteins are 30% and greater.
2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose
structural and functional features suggest that a common
evolutionary origin is probable are placed together in
superfamilies.
Investigating & visualizing protein structures
Fold classification
3. Fold: Major structural similarity
Same major secondary structures in the same arrangement
and with the same topological connections. Different proteins
with the same fold often have peripheral elements of
secondary structure and turn regions that differ in size and
conformation. In some cases, these differing peripheral
regions may comprise half the structure.
Proteins of the same fold category may not have a common
evolutionary origin: the structural similarities could arise
from convergent evolution.
Investigating & visualizing protein structures
[Figure: growth of unique folds as defined by SCOP; number of folds vs. year]
Investigating & visualizing protein structures
Fold classification
CATH:
• Hierarchical classification of protein domain structures in the PDB.
• Domains are clustered at five major levels:
Class
Architecture
Topology
Homologous superfamily
Sequence family
Investigating & visualizing protein structures
Fold classification
• Class [C] - derived from secondary structure content
(automatic)- alpha, beta, alpha and beta, few secondary structures.
• Architecture [A] - derived from orientation of secondary
structures (manual)
• Topology [T] - derived from topological connection and
secondary structures- (by automated structural alignment)
• Homologous Superfamily [H]/sequence family- clusters of
similar structures & functions.
Investigating & visualizing protein structures
SCOP Vs. CATH
Same SCOP family, different CATH
topologies: d1rh6b (a.6.1.7) / 1rh6B00
(1.10.1660.20) vs. d1g4da
(a.6.1.7) / 1g4dA00 (1.10.10.10)
Csaba et al., 2009
Different SCOP classes, same CATH
homologous superfamilies: d1bbxd
(b.34.13.1) / 1bbxD00
(2.40.50.40) vs. d1rhpa (d.9.1.1) /
1rhpA00 (2.40.50.40)
Investigating & visualizing protein structures
SCOP Vs. CATH
SCOP levels: class, fold, superfamily, family
CATH levels: class, architecture, topology, homologous superfamily, sequence family
CATH is more directed toward structural classification, while
SCOP pays more attention to evolutionary relationships
Investigating & visualizing protein structures
PdbSum
• A database providing an overview of all biological
macromolecular structures
• Connected to UniProt → find the sequence accession of a
known PDB ID
• Detailed description of many structure properties, e.g.:
– EC number
– Chains & ligands and their interactions
– Clefts
– Secondary structure
– FASTA sequence of structure…
– …
Investigating & visualizing protein structures
PdbSum
PDB ID
http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
Free text
Search by sequence
Investigating & visualizing protein structures
PdbSum
Useful tabs
UniProt
accession
Chains &
ligands
Investigating & visualizing protein structures
PdbSum
GO annotation
EC and reaction
Highlights from
the related paper
Investigating & visualizing protein structures
PdbSum
Protein tab
Secondary structure from the PDB
Investigating & visualizing protein structures
PdbSum
Ligand tab
The ligand’s
structure
LigPlot: predicts the
residues that
bind the ligand
Investigating & visualizing protein structures
Before the invention of computer graphics, trained artists were
employed to hand-draw understandable pictures of proteins
Irving Geis (1908 – 1997)
Investigating & visualizing protein structures
PyMol Viewer
Features:
• Viewing 3D Structures
• Rendering Figures
• Giving Presentations
• Animating Molecules
• Sharing Visualizations
• Exporting Geometry
Investigating & visualizing protein structures
Pymol Viewer:
Potassium channel (KcsA) from
Streptomyces lividans, PDB ID 1bl8
Doyle et al., 1998
Investigating & visualizing protein structures
View Manipulation
• Identify the different parts
of the screen:
-the external GUI window
-the internal GUI window.
• The internal window contains
the viewer, which displays the
molecule, and the command
line.
Investigating & visualizing protein structures
View Manipulation
To manipulate an object, we use the
letter icons near its name
- A – Action
- S – Show
- H – Hide
- L – Label
- C – Color
Investigating & visualizing protein structures
View Manipulation
Change the representation of the object to “Cartoon” using:
S (show) → As → Cartoon
Investigating & visualizing protein structures
View Manipulation
Other protein representations under “S” → “As”:
• Lines
• Ribbons
• Sticks
• Dots
• Spheres
• Surface
Investigating & visualizing protein structures
View Manipulation
Color by chain: C (color) → by chain
Investigating & visualizing protein structures
View Manipulation
Other coloring options:
• Color by spectrum: b-factor, rainbow
• Color by secondary structure (“SS”)
• Color by element:
• A lot of available colors;
others can be defined in the external GUI:
“settings” → “colors…” → “new”
Investigating & visualizing protein structures
Selecting and manipulating specific parts of
the molecule
• Select specific amino acids
by clicking on them .
• Select a range in the
sequence by clicking the
first residue, and then
“shift+click” on the last
residue.
• The selection will be
indicated on the structure
(in pink dots).
Investigating & visualizing protein structures
Selecting and manipulating specific parts of
the molecule
• In the object list, a new
object “(sele)” was added.
•This object represents
the current selection
• You can manipulate it
with the buttons next to
the object. For example,
change its representation
to sticks
(“S” → “As” → “Sticks”)
Investigating & visualizing protein structures
Selecting and manipulating specific parts of
the molecule
• Give a different name to the
selection, so you can easily
manipulate it later.
• Select the first chain again
(using the sequence) and
change its name to “chain1” by
pressing:
“Action → Rename Selection”
and typing “chain1”.
Investigating & visualizing protein structures
Making high-quality photos
1. Change the background color to white, with
“Display → Background → White”
on the external GUI menu:
Investigating & visualizing protein structures
Making high-quality photos
2. Type in the command line: “ray [x], [y]” … and wait…
3. Save the image with: “Save” → “Image”
Take care not to accidentally click on
the image before saving!
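The GUI steps above also have command-line equivalents, so a figure can be reproduced from a short script. The sketch below only writes a PyMOL command script as text; `build_pymol_script` is a hypothetical helper, the default image size and file name are arbitrary choices, and the commands used (`fetch`, `hide`, `show`, `util.cbc`, `bg_color`, `ray`, `png`) are standard PyMOL commands.

```python
def build_pymol_script(pdb_id, width=1200, height=900, image="figure.png"):
    """Return the text of a PyMOL command script that renders one structure."""
    commands = [
        f"fetch {pdb_id}",          # download the structure from the PDB
        "hide everything",
        "show cartoon",             # same as S -> As -> Cartoon
        "util.cbc",                 # color by chain
        "bg_color white",           # same as Display -> Background -> White
        f"ray {width}, {height}",   # ray-trace at the requested size
        f"png {image}",             # save the rendered image
    ]
    return "\n".join(commands) + "\n"


script = build_pymol_script("1bl8")
print(script)
```

Saving the returned text to a file such as `figure.pml` and opening it with PyMOL should reproduce the rendering without any mouse clicks.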
Investigating & visualizing protein structures
ConSurf
The goal: identification of functionally
important amino acids that mediate the
interaction of a query protein with ligands,
DNA/RNA, other proteins etc.
Approach: Functionally important amino acid sites
are often evolutionarily conserved
Investigating & visualizing protein structures
Consurf
Beta Class N6-Adenine DNA
Methyltransferase
Investigating & visualizing protein structures
ConSurf
The 3D structure of
Beta Class N6-Adenine
DNA Methyltransferase
has already been solved:
PDB id : 1nw7
Investigating & visualizing protein structures
Consurf
• The ConSurf webserver calculates
the evolutionary rate for each
position in the protein
• The results, mapped on the
structure, reveal residues crucial
for function and structure stability
• In this case, the ligand is bound in
a highly conserved cluster of
residues
http://consurf.tau.ac.il/
Investigating & visualizing protein structures
Consurf
The consensus
sequence
approach:
..W..
..W..
..W..
..W..
.. E..
.. G..
Investigating & visualizing protein structures
Consurf
However, some sequences might be close
homologues of each other
..W..
..W..
..W..
..W..
.. E..
primates
.. G..
Conclusion:
Assessing conservation without taking into
consideration the phylogenetic relations may lead to
uneven sampling in sequence space
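The effect of uneven sampling can be illustrated with the alignment column from the slide. The sketch below is deliberately crude and is not what ConSurf does: it compares the plain frequency of the most common residue with the same frequency after collapsing each group of close homologues to a single representative. Which sequences belong to the primates is an assumption made for the example, as are the group labels.

```python
from collections import Counter


def conservation(column):
    """Fraction of sequences that carry the most common residue in the column."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def conservation_by_group(column, groups):
    """Same measure, after reducing every group to its most common residue."""
    by_group = {}
    for residue, group in zip(column, groups):
        by_group.setdefault(group, []).append(residue)
    representatives = [
        Counter(residues).most_common(1)[0][0] for residues in by_group.values()
    ]
    return conservation(representatives)


column = "WWWWEG"
# Assumption for illustration: the four W sequences are the closely related primates.
groups = ["primates", "primates", "primates", "primates", "species_5", "species_6"]

print(conservation(column))                   # W looks well conserved
print(conservation_by_group(column, groups))  # much less so once redundancy is removed
```

A phylogeny-aware method replaces this all-or-nothing grouping with branch lengths, which is the motivation for the approach described next.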
Investigating & visualizing protein structures
Consurf
Phylogenetic reconstruction may be used to distinguish
between two possible cases:
1. Structural/functional constraints that truly result in
sequence conservation as a result of evolutionary
pressure.
2. Short evolutionary time that may be mistaken as
sequence conservation, while no evolutionary pressure
affects the examined position.
Investigating & visualizing protein structures
Consurf
Rate4Site:
an algorithm for calculating the evolutionary
rate at each amino acid site
Definition:
Evolutionary rate =
number of AA replacements/(site*year)
Conserved sites evolve slowly
variable sites evolve rapidly
Pupko et al., 2002
Mayrose et al., 2005
Investigating & visualizing protein structures
Consurf
Web-Server:
http://consurf.tau.ac.il/
Landau et al., 2005
Investigating & visualizing protein structures
Consurf coloring bar
The Rate4Site conservation scores are continuous values
rather than integers.
Such scores are hard to display directly on a structure.
Hence, the ConSurf webserver divides them into 9
bins - 1 for highly variable, 9 for the most conserved
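The idea of turning continuous rates into nine color grades can be sketched as follows. This is only an illustration under stated assumptions: equal-width bins are used here for simplicity, whereas the actual ConSurf binning scheme may differ, and `to_grades` is a hypothetical function name. Low evolutionary rates (slowly evolving, conserved sites) map to grade 9.

```python
def to_grades(rates, n_bins=9):
    """Map continuous evolutionary rates to grades 1 (variable) .. n_bins (conserved)."""
    lowest, highest = min(rates), max(rates)
    if highest == lowest:
        # All sites evolve at the same rate: put them in the middle grade.
        return [n_bins // 2 + 1] * len(rates)
    width = (highest - lowest) / n_bins
    grades = []
    for rate in rates:
        index = min(int((rate - lowest) / width), n_bins - 1)  # 0 = slowest sites
        grades.append(n_bins - index)                           # slowest -> top grade
    return grades


# Made-up rates for five sites.
print(to_grades([0.0, 0.5, 1.0, 4.5, 9.0]))
```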
Investigating & visualizing protein structures
Consurf
The ConSurf webserver
Essential input - MSA and tree constructed
by ConSurf through “advanced options”:
1. PDB ID/PDB file/model structure and chain
Essential and optional input:
1. PDB ID/PDB file/model structure and chain
2. Constructed MSA, with the query sequence
included
3. Phylogenetic tree
http://consurf.tau.ac.il/index.html
Essential and Optional input:
Bayesian
Max Likelihood
1NW7
Check in
the PDBsum…
MSA
Sequence name
in the MSA
Tree
Email
http://consurf.tau.ac.il/index.html
Essential input:
1NW7
Check in
the PDBsum…
http://consurf.tau.ac.il/index.html
Essential input:
Email
Alignment
method
SWISS-PROT
UniProt
Additional
BLAST options
http://consurf.tau.ac.il/index.html
Calculation Finished:
Easy web-based viewer
Viewer for producing medium-quality images*
View scores
Produced or input MSA
View phylogenetic tree
Script for coloring in RasTop*
Instructions for PyMOL*
Investigating & visualizing protein structures
Consurf
Jmol- Easy web-based viewer
Investigating & visualizing protein structures
Consurf
Summary - MSA Quality
• ConSurf is dependent on the quality of the MSA.
• When an MSA is not given by the user, sequences
are automatically gathered by PSI-BLAST and
aligned by CLUSTALW with default parameters.
• Even though these alignments are usually good, it
is highly recommended to inspect the alignment
manually and with other tools in order to improve
the quality of the evolutionary data .
Investigating & visualizing protein structures
Consurf
A caveat: In some cases the functionally
important region may not be conserved
at all
The peptide-binding groove of
the MHC class I
heavy chain.
PDB id : 2vaa
Investigating & visualizing protein structures
PatchFinder- identification of functional sites
Patch- a spatially continuous
cluster of surface residues.
Problems:
– Subjectivity of boundaries.
– Difficult to apply on large
datasets
Investigating & visualizing protein structures
PatchFinder
Input:
1. Protein structure
2. Multiple sequence alignment (MSA)
Steps:
(1) Assignment of conservation scores (Rate4Site [3])
(2) Identification of exposed residues
(3) Extraction of the surface patch of conserved residues
with the highest statistical significance (ML-patch).
(4) Identification of non-overlapping secondary patches
[1] Nimrod et al., 2005
[2] Nimrod et al., 2008
[3] Mayrose et al., 2004
Investigating & visualizing protein structures
PatchFinder- http://patchfinder.tau.ac.il/
Investigating & visualizing protein structures
Summary of structure-related
databases & tools
• Secondary structure prediction- PredictProtein,
Meta PredictProtein and PSIPRED.
• PDB, SCOP and CATH- collection and classification
of structures available by experiment.
• Structure visualization- PyMol
• Conservation analysis- Consurf and Patchfinder
Protein structure prediction
Structure Prediction Approaches
1. Homology (Comparative) Modeling
Based on sequence similarity with a protein for
which a structure has been solved.
2. Threading (Fold Recognition)
Requires a structure similar to a known structure
3. Ab-initio fold prediction
Not based on similarity to a sequence/structure
Ab-initio
Structure prediction from “first principles”:
Given only the sequence, try to predict the structure
based on physico-chemical properties
(energy, hydrophobicity etc.)
• When all else fails → works for novel folds
• Shows that we understand the process
The Force Field
(energy function)
A group of mathematical expressions describing the
potential energy of a molecular system
• Each expression describes a different type of physico-chemical interaction between atoms in the system:
– Van der Waals forces
– Covalent bonds
– Hydrogen bonds
– Charges
– Hydrophobic effects
(Van der Waals forces, hydrogen bonds, charges and hydrophobic effects are non-bonded terms)
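To make the idea of summed energy terms concrete, here is a minimal sketch of two common non-bonded expressions: a Lennard-Jones term for Van der Waals interactions and a Coulomb term for charges. It is not the force field of any particular package; the function names, the single shared `epsilon` and `sigma`, and the example atoms are illustrative assumptions, and 332.06 is the commonly used conversion constant that gives energies in kcal/mol for distances in Å and charges in units of e.

```python
import math
from itertools import combinations


def lennard_jones(r, epsilon, sigma):
    """Van der Waals term: repulsive at short range, weakly attractive further out."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def coulomb(r, q1, q2, k=332.06):
    """Electrostatic term between two point charges."""
    return k * q1 * q2 / r


def nonbonded_energy(atoms, epsilon=0.2, sigma=3.4):
    """Sum both terms over all atom pairs; each atom is (x, y, z, charge)."""
    total = 0.0
    for (x1, y1, z1, q1), (x2, y2, z2, q2) in combinations(atoms, 2):
        r = math.dist((x1, y1, z1), (x2, y2, z2))
        total += lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2)
    return total


# Three made-up atoms with partial charges.
atoms = [(0.0, 0.0, 0.0, 0.4), (4.0, 0.0, 0.0, -0.4), (0.0, 4.0, 0.0, 0.0)]
print(nonbonded_energy(atoms))
```

A real force field adds bonded terms (bond lengths, angles, torsions) and uses per-atom-type parameters instead of one shared pair.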
Approaches to Ab-initio Prediction
1. Molecular Dynamics
• Simulates the forces that govern the protein within water.
• Since proteins usually fold naturally, this should lead to the
native protein structure.
Problems:
• Thousands of atoms
• Huge number of time steps to reach the folded protein
→ feasible only for very small proteins
Approaches to Ab-initio Prediction
2. Minimal Energy
Assumption: the folded form is the minimal energy
conformation of a protein
Main principles:
• Define an energy function.
• Search for the 3D conformation that minimizes the energy.
Ab-initio
2. Minimal Energy
• Use of simplified energy function
• Search methods for minimal energy conformation:
– Greedy search
– Simulated annealing
–…
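Simulated annealing, one of the search methods listed above, can be sketched on a toy problem. The one-dimensional `energy` function below is a made-up double-well landscape standing in for a real conformational energy, and all parameter values are arbitrary choices; the point is only the accept/reject rule and the gradually decreasing temperature.

```python
import math
import random


def energy(x):
    """Toy energy landscape with a shallow and a deep minimum (hypothetical)."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def simulated_annealing(energy_fn, x0, steps=5000, t_start=2.0, t_end=0.01,
                        step_size=0.5, seed=0):
    """Return the lowest-energy point visited and its energy."""
    rng = random.Random(seed)
    x, e = x0, energy_fn(x0)
    best_x, best_e = x, e
    for i in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy_fn(candidate)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_candidate <= e or rng.random() < math.exp(-(e_candidate - e) / t):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = simulated_annealing(energy, x0=2.0)
print(best_x, best_e)
```

Unlike a greedy search, the occasional uphill move at high temperature gives the search a chance to leave a local minimum before the system is cooled.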
Ab-initio
• Current methods (e.g. Rosetta) primarily utilize the
fact that although we are far from observing all
protein folds, we probably have seen nearly all sub-structures:
Local sequence-structure relationships:
• A library of known sub-structures
(fragments less than 10 residues) is created.
• A range of possible conformations for
each fragment in the query protein is selected.
Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
Ab-initio
Non-local sequence-structure relationships:
• The primary nonlocal interactions considered are hydrophobic
burial, electrostatics, main-chain hydrogen bonding etc.
Structures that are consistent with both the local and
non-local interactions are generated by minimizing
the non-local interaction energy in the space defined
by the local structure distributions.
Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
Ab-initio - Example
Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
Fold Recognition
(Threading)
Given a sequence and a library of folds, thread the sequence
through each fold. Take the one with the highest score.
• Method will fail if new protein does not belong to any fold in
the library.
• Score of the threading is computed based on known
physical chemistry properties and statistics of amino acids.
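Threading: example
• structural template
• neighbor definition
• energy function: E = \sum_{positions i,j} E_{a_i b_j}, summed over neighboring positions i, j
Sequence threaded onto the template: ACCECADAAC
E = -3-1-4-4-1-4-3-3 = -23
Pairwise energies E_ab:
      A    C    D    E   …
A    -3   -1    0    0   ..
C    -1   -4    1    2   ..
D     0    1    5    6   ..
E     0    2    6    7   ..
.     .    .    .    .
[Figure: structural template with positions 1-10, labeled with the threaded residues]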
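The scoring step of this example can be sketched in a few lines. The pairwise energies are the E_ab values from the slide's table; the list of neighboring positions, however, was part of the figure and is not recoverable here, so `CONTACTS` below is a hypothetical neighbor definition chosen purely for illustration and does not reproduce the slide's total.

```python
# Pairwise energies E_ab from the slide (symmetric, so each pair is stored once).
E_AB = {
    ("A", "A"): -3, ("A", "C"): -1, ("A", "D"): 0, ("A", "E"): 0,
    ("C", "C"): -4, ("C", "D"): 1, ("C", "E"): 2,
    ("D", "D"): 5, ("D", "E"): 6,
    ("E", "E"): 7,
}

# Hypothetical neighbor definition: pairs of template positions (1-based) in contact.
CONTACTS = [(1, 6), (2, 5), (3, 10), (8, 9), (1, 2)]


def pair_energy(a, b):
    """Look up E_ab regardless of the order of the two residues."""
    return E_AB[(a, b)] if (a, b) in E_AB else E_AB[(b, a)]


def threading_energy(sequence, contacts):
    """Sum the pairwise energies over all neighboring template positions."""
    return sum(pair_energy(sequence[i - 1], sequence[j - 1]) for i, j in contacts)


print(threading_energy("ACCECADAAC", CONTACTS))
```

Repeating this calculation for every template in a fold library and ranking the totals is the essence of fold recognition.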
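Find best fold for a protein sequence:
Fold recognition (threading)
[Figure: the query sequence MAHFPGFGQSLLFGYPVYVFGD... is threaded through each fold in the library (1 ... 56 ... n); each threading receives a score (e.g. -10, -123, 20.5), and the best-scoring fold is the potential fold]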
GenTHREADER
• Align the query sequence with each template
(requires some sequence homology!)
• Assess the alignment by:
– Sequence alignment score
– Pairwise potentials
– Solvation function
• Record lengths of: alignment, query, template
• The overall score is computed using a neural network.
Jones DT et al. J. Mol. Biol. 287: 797-815(1999)
I-TASSER- Hybrid Approach
• In a recent community-wide blind experiment, CASP7, I-TASSER
generated the best 3D structure predictions among all
automated servers.
•Based on the secondary-structure threading and the iterative
implementation of the Threading ASSEmbly Refinement
(TASSER) program.
•For predicting the biological function of the protein, the
I-TASSER server matches the predicted 3D models to the
proteins in 3 independent libraries which consist of proteins of
known enzyme classification (EC) number, gene ontology (GO)
vocabulary, and ligand-binding sites.
I-TASSER
Test Case:
Rosetta Vs. TASSER
Grey: Crystal
structure of Betannnn:
Purple: Rosetta
prediction, starting
from homology
modeling
Green: TASSER
prediction
Homology Modeling –
Basic Idea
1. A protein structure is defined by its amino acid sequence.
2. Closely related sequences adopt highly similar structures; distantly
related sequences may still fold into similar structures.
3. The three-dimensional structure of proteins from the same family is
more conserved than their primary sequences.
Triosephosphate isomerases:
44.7% sequence identity
0.95 Å RMSD
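Sequence identity figures like the one above are computed from a pairwise alignment. The sketch below uses one common convention, counting identical residues over the aligned (gap-free) columns; other conventions divide by the alignment length or by the shorter sequence, so reported values can differ slightly. The function name and the example sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues among columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared


# Made-up alignment: 6 gap-free columns, 5 of them identical.
print(percent_identity("ACDE-FG", "ACDQKFG"))
```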
General Scheme
1. Searching for structures related to the query sequence
2. Selecting templates
3. Aligning the query sequence with template structures
4. Building a model for the query using information from
the template structures
5. Evaluating the model
Fiser A et al. Methods in Enzymology 374: 461-491(2004)
General Scheme
Homology modeling requires
handling structures & sequences
• Query- only the protein sequence is available- usually found
at the UniProt database
• Template- after identification, both structural and sequence-related data should be found- UniProt (or NCBI databases),
RCSB and PDBsum
Homology modeling- query-template alignment
Different levels of similarity between the template & query
initiate various computational approaches:
Homology modeling- model
evaluation
Evolutionary Conservation
http://consurf.tau.ac.il
Homology Modeling
• The accuracy of the model depends on its
sequence identity with the template: