Transcript EC->PDB
Gene Annotation and Analysis
Lab Work
Reference: European Multimedia
Bioinformatics Educational Resource
Chapter 6: Fold Classification
• The aim of this part of the tutorial is to learn
about the structure of your protein.
• Where 3D coordinates are available, we start
by examining the Protein Data Bank (PDB)
summary files. We then examine some of the
structure classification resources, and
compare results to see if they are similar.
Step 1: Homologues with known 3D structure
• In this step, we will seek homologues of known structure.
• i) Use your sequence to run BLAST at the EBI or NCBI, setting
PDB as the database. Choose the most significant hit and note
its PDB ID code. If a structure exists, it should be the first hit.
• ii) Alternatively, from your sequence's SWISS-PROT or TrEMBL
entry, you could use the EMBnet Direct BLAST option, again
selecting PDB as the database.
• Reflections...
– How many matches were statistically significant?
– Why are there fewer hits from a BLAST of PDB than from a BLAST of a
sequence database?
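To make the first reflection concrete, the sketch below filters a saved BLAST result for statistically significant hits. It assumes the result was saved in the 12-column tabular layout of NCBI BLAST+ (`-outfmt 6`). The sample rows, the PDB-style identifiers and the E-value cut-off of 1e-3 are invented for illustration and are not part of the tutorial.

```python
# Column names of the NCBI BLAST+ tabular format (-outfmt 6).
COLUMNS = ["query", "subject", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Hypothetical hits against PDB sequences (identifiers and scores are invented).
SAMPLE_HITS = "\n".join([
    "myseq\t1ABC_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-110\t390",
    "myseq\t2XYZ_B\t41.0\t190\t105\t4\t5\t190\t8\t195\t3e-35\t128",
    "myseq\t3DEF_A\t27.3\t66\t44\t2\t20\t85\t101\t162\t2.1\t28.5",
])


def significant_hits(tabular_text, max_evalue=1e-3):
    """Return hits with E-value <= max_evalue, most significant first."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] <= max_evalue:
            hits.append(row)
    # Lowest E-value first; ties broken by the higher bit score.
    return sorted(hits, key=lambda h: (h["evalue"], -h["bitscore"]))


hits = significant_hits(SAMPLE_HITS)
print(len(hits), "significant hit(s)")
if hits:
    print("Best hit:", hits[0]["subject"], "E-value:", hits[0]["evalue"])
```

The cut-off is a judgment call: a stricter value reduces false positives at the cost of missing distant homologues.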
Step 2: PDB summary files
• Here, we will explore details of your structure from its PDB summary files.
• i) Supply the PDB ID code to the PDBsum query form. Examine the entry,
including the images provided, the secondary structure elements, any
associated ligand(s), the PDB header, and so on.
• ii) Alternatively, supply a keyword to the Search string box (e.g.,
"rhodopsin").
• Reflections...
– How do the secondary structure elements shown here compare with the predictions
from the previous Chapter? How accurate are the predictions?
– What information is stored in the PDB header? What is the purpose of having a
sequence stored in the header when sequence databases store this information? (HINT)
– Does your protein have an associated ligand? If so, which residues in the sequence
interact with the ligand? Referring back to your sequence alignment, are these residues
conserved? Would you expect them to be? Do any of them lie in the motifs defined by
PROSITE, BLOCKS or PRINTS?
Step 3: Protein Classification
• PDBsum unites a number of resources. We will now make use of this
feature to explore some structure classification databases and, where
appropriate, an enzyme classification resource.
• i) From PDBsum, follow the CATH link. Follow the link(s) under the Quick
Links heading to discover the position of your protein within the CATH
hierarchy and view its structural relatives. To navigate the hierarchy, click
on the links next to the hierarchy icons.
• ii) From PDBsum, follow the SCOP link to discover the position of your
protein within the SCOP hierarchy (under the heading Quick Links) and
view its structural relatives. Follow the links to the NCBI's PDB entry and
explore the external links. NOTE: Some proteins have more than one
polypeptide chain; the SCOP hierarchy for each chain can be explored by
following its links independently.
• iii) Where applicable, PDBsum also links to the Enzyme Commission (EC)
classification. Follow the EC->PDB link to view this, and the corresponding
links to the ExPASy ENZYME, KEGG and WIT databases.
• Reflections...
– CATH:
• What is the CATH number of your protein?
• What does this number mean in terms of its Class, Architecture, Topology and Homology?
(HINT)
• How does the structural hierarchy differ from the familial hierarchies used in PRINTS?
• If your structure belongs to more than one class, why might this be so?
– SCOP:
• How is your protein classified in SCOP in terms of its Class, Fold, Superfamily and Family?
• How does the structural hierarchy differ from the familial hierarchies used in PRINTS?
• Is the Pfam link to the same entry found in Chapter 3? If not, how is it related?
• Does the classification differ from that given by CATH? If so, why might this be so? (HINT)
• If your structure belongs to more than one class, why might this be so?
– EC->PDB:
• If there is no EC->PDB link in your entry, why might this be so?
• If there is a link, what is the EC number of the entry, and what catalytic role does this number
reflect? (HINT)
Step 4: Protein structure visualisation
• Here, we will visualise the structure of your protein and become familiar with molecular
viewers and the PDB file format.
• i) Follow the link from PDBsum to the PDB entry. Click on QuickPDB (this requires a
Java-enabled browser). Highlight residues that you know to be important, such as motifs
identified in the protein family database searches. NOTE: Theoretical models are no longer
available from the main PDB directory.
• ii) To download the raw PDB file, return to the PDB entry and follow the link to Download
Files on the right-hand side menu. In the Download files menu, choose the uncompressed
PDB link. View the file in a text editor to get a feel for how the annotation fields and atomic
coordinates are encoded.
• iii) There are several PDB structure viewers and molecular visualisation packages available for
download. Some examples are listed below.
Downloadable Molecular Structure Viewers
– PDB structure viewers: RasMol, QuickPDB and Deep View
– Non-PDB format viewer (accepts files in an NCBI-specific format): Cn-3D
• Reflections...
– How do the conserved motifs relate to the structure?
– What functional inference, if any, can you deduce from the
relative positions of the conserved motifs in 3D? E.g., do
they congregate around an active site? (HINT)
– In the PDB file, what is the name of the field used to store
the 3D coordinates? (HINT)
Quiz: Chapter 6
1. Databases such as CATH and SCOP are used to identify:
A. The structural family to which a protein belongs.
B. The genic family to which a protein belongs.
C. Homologous proteins.
D. Analogous proteins.
2. Resources such as EC->PDB are used to identify:
A. The structural class of proteins.
B. The catalytic activity of enzymes with known structure.
C. The family to which a protein belongs.
D. Details of the reaction mechanism of a protein.
3. In CATH, proteins are grouped together at the topology-level on the basis that they
share:
A. The same gross secondary structure composition.
B. The same secondary structures but different connectivities.
C. The same overall shape and connectivity of secondary structures.
D. A common ancestor.
4. For SCOP, which of the following statements is TRUE:
A. Entries are created using automated methods only.
B. Entries are created using automated and manual methods.
C. Entries are created using manual methods only.
D. Entries are derived from CATH.
5. Coordinates for known protein structures are housed in:
A. CATH.
B. SCOP.
C. PDBsum.
D. PDB.
Information
6.1 PDB
6.2 PDB Summary
6.3 CATH
6.4 SCOP
6.5 EC->PDB
6.6 Visualisation of Protein Molecules
6.1 PDB
• The Protein Data Bank (PDB) is the principal
repository of biological macromolecule structures.
These are derived using a number of different
experimental techniques (listed under the Materials and
Methods section), including electron, X-ray and
neutron diffraction, and NMR. The PDB is maintained
by a non-profit consortium, termed the Research
Collaboratory for Structural Bioinformatics (RCSB).
Several mirrors are available worldwide from which
PDB entries may be viewed and downloaded.
• Stored in a text format, PDB files contain a 'header' and a main
body, which stores the atomic coordinates of all the resolved
atoms in the structure. The header includes the following
details:
– information on the protein, organism, etc.
– literature citations
– protein sequence (which may be different from those found in
sequence databases, e.g., if the protein has been engineered to
facilitate crystallisation)
– the method by which the structure was obtained
– crystal packing and refinement information
– secondary structure information (e.g., helix from residues 13-25, turn
from residues 26-30, etc.)
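As an illustration of how the main body of a PDB file can be read, the sketch below pulls atom names, residue details and x, y, z coordinates out of fixed-width coordinate records. The column positions follow the published PDB file format; the four sample lines are hand-made for the example and are not taken from any particular entry.

```python
# Hand-made sample: one header line followed by three coordinate records.
SAMPLE_PDB = "\n".join([
    "HEADER    HYPOTHETICAL EXAMPLE RECORDS",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "ATOM      3  C   MET A   1      26.913  26.639   3.531  1.00  9.62           C",
])


def parse_atom_records(pdb_text):
    """Extract atoms from ATOM/HETATM lines using fixed column positions."""
    atoms = []
    for line in pdb_text.splitlines():
        # The record name occupies columns 1-6.
        if line[0:6] not in ("ATOM  ", "HETATM"):
            continue
        atoms.append({
            "serial": int(line[6:11]),        # columns 7-11
            "name": line[12:16].strip(),      # columns 13-16
            "res_name": line[17:20].strip(),  # columns 18-20
            "chain": line[21],                # column 22
            "res_seq": int(line[22:26]),      # columns 23-26
            "x": float(line[30:38]),          # columns 31-38
            "y": float(line[38:46]),          # columns 39-46
            "z": float(line[46:54]),          # columns 47-54
        })
    return atoms


atoms = parse_atom_records(SAMPLE_PDB)
for atom in atoms:
    print(atom["name"], atom["res_name"], atom["res_seq"],
          atom["x"], atom["y"], atom["z"])
```

Because the format is column-based rather than whitespace-delimited, slicing by position is safer than calling `split()`, which fails when adjacent fields run together.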
6.1.1 Growth of the PDB
• Unlike DNA sequencing, protein structure determination is
not yet a fully automated process. Different techniques have
different limitations. In crystallography, for example,
obtaining crystals can be difficult; or, having got crystals,
finding candidates that will diffract well can be problematic.
Whatever the technique used, building a robust structural
model from the raw data is a further time-limiting factor. The
rate of submission of new structures to the PDB is thus far less
than the deposition rate of sequence data to the central
sequence repositories: e.g., in July 2002, PDB contained
16,507 entries, while in June 2002 GenBank contained
17,471,000 sequences - see Fig 6.2. Note that both of these
figures are highly redundant, so the number of unique
structures and sequences is very much smaller.
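The size of the gap can be checked with a line of arithmetic. The sketch below uses only the two counts quoted above; the variable names are ours.

```python
pdb_entries = 16_507            # PDB entries, July 2002 (quoted above)
genbank_sequences = 17_471_000  # GenBank sequences, June 2002 (quoted above)

# How many GenBank sequences there were for every PDB entry.
ratio = genbank_sequences / pdb_entries
print(f"About {ratio:.0f} GenBank sequences per PDB entry")
```

In other words, the two repositories differed in size by roughly three orders of magnitude, before allowing for redundancy in either.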
Fig 6.2. Difference in the growth of the number of sequences in GenBank vs. the number of
3D structures in PDB. The graph has been truncated at 1994 to keep the curves on the same
scale.
6.2 PDB Summary
• PDBsum provides summary information for all
proteins of known structure. The structure
summary includes details of resolution and R
factor, secondary structure, associated ligands,
fold cartoons, ligand interactions, and so on.
A brief overview of this summary information
is also available for each entry.
6.3 CATH
• CATH is a hierarchical classification of protein structural relationships derived using
a combination of automatic and manual methods. CATH identifies different classes
by means of a unique number (by analogy with the E.C. system for enzymes), as
well as a descriptive name. The acronym denotes:
• Class - the highest level of the classification, derived from overall secondary
structure content and packing
• Architecture - describes the gross arrangement or orientation of secondary
structures, independent of their connectivities
• Topology - relates both to the overall shape and connectivity of the secondary
structures
• Homologous superfamily - clusters protein domains that share both sequence and
structural similarity (and are hence believed to be homologous)
• In addition, a Sequence (S-level) subset clusters H-level structures on the basis of
sequence identity. Domains in the same S-level have sequence identities >35%
(with at least 60% of the larger domain equivalent to the smaller), indicating highly
similar structures. Fig 6.3 depicts a few examples of architectures recognised in
CATH.
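To show how a CATH number encodes the four levels described above, the sketch below splits a dotted number into its Class, Architecture, Topology and Homologous superfamily prefixes. The example number and the function name are illustrative only. The class names in the lookup table are the ones CATH uses for classes 1 to 4, but treat the table as an assumption and check it against the current CATH release.

```python
# Names of the main CATH classes (assumed; verify against the CATH website).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVELS = ["class", "architecture", "topology", "homologous_superfamily"]


def parse_cath_number(cath_number):
    """Split a C.A.T.H number into the identifier at each level."""
    parts = cath_number.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {cath_number!r}")
    result = {}
    for depth, level in enumerate(LEVELS, start=1):
        # Each level is identified by the number down to that depth.
        result[level] = ".".join(parts[:depth])
    result["class_name"] = CATH_CLASSES.get(int(parts[0]), "unknown")
    return result


example = parse_cath_number("3.40.50.300")
for key, value in example.items():
    print(f"{key}: {value}")
```

Two domains that share the first three numbers but differ in the fourth have the same topology but are placed in different homologous superfamilies.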
6.4 SCOP
• The SCOP (Structural Classification Of Proteins) database describes structural and
evolutionary relationships between proteins of known structure. The database has
been constructed using a combination of manual inspection and automated
methods, because current automatic sequence and structure comparison tools
cannot identify all structural relationships reliably. Proteins are classified in a
hierarchical fashion to reflect their structural and evolutionary relatedness.
• Within the hierarchy, the principal levels describe the family, superfamily and fold:
proteins are clustered into families with clear evolutionary relationships if they
have sequence identities >= 30% (but this is not an absolute measure); proteins
are placed in superfamilies when, in spite of low sequence identity, their structural
and functional characteristics suggest a common evolutionary origin; and proteins
are classed as having a common fold if they have the same major secondary
structures in the same arrangement and with the same topology, whether or not
they have a common evolutionary origin. In these cases, the structural similarities
could have arisen as a result of physical principles that favour particular packing
arrangements and fold topologies.
• Boundaries between such levels may be subjective, but the higher levels generally
reflect the clearest structural similarities.
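The family criterion above rests on percentage sequence identity, which can be computed from a pairwise alignment. The sketch below is one simple way to do so. It assumes identity is counted over columns where neither sequence has a gap (other definitions exist), and the two aligned sequences are invented for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # ignore columns containing a gap
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Invented pairwise alignment ("-" marks a gap).
seq_a = "MKT-AYIAKQR"
seq_b = "MKTLAYLAK-R"

identity = percent_identity(seq_a, seq_b)
print(f"Identity: {identity:.1f}%")
print("Meets the >= 30% guideline:", identity >= 30.0)
```

As the text notes, the 30% figure is a guideline rather than an absolute measure, so a value just below it does not rule out a family relationship.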
6.5 EC->PDB
• The Enzyme Structures Database (EC->PDB) relates known enzyme
structures deposited in the PDB to their Enzyme Commission (EC)
classification and provides links to the ExPASy ENZYME Data Bank. The EC
classification comprises the following broad categories:
– E.C.1.-.-.- Oxidoreductases
– E.C.2.-.-.- Transferases
– E.C.3.-.-.- Hydrolases
– E.C.4.-.-.- Lyases
– E.C.5.-.-.- Isomerases
– E.C.6.-.-.- Ligases
• Entries also include links to the Kyoto Encyclopedia of Genes and
Genomes, KEGG (an effort to computerise current knowledge of molecular
and cellular biology in terms of interacting genes and molecular
pathways), and the Evolutionary Analysis of Metabolism, PUMA2.
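The broad categories listed above can be looked up from the first field of an EC number. The sketch below is a minimal illustration; the function name is ours, and the table contains only the six classes given in this section.

```python
# The six broad EC classes listed in this section.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the broad class for an EC number such as '3.4.21.4' or 'E.C.1.-.-.-'."""
    text = ec_number.strip()
    # Drop an optional leading "E.C." or "EC" label.
    for prefix in ("E.C.", "EC"):
        if text.upper().startswith(prefix):
            text = text[len(prefix):]
            break
    text = text.strip(" .")
    first_field = text.split(".")[0]
    if not first_field.isdigit():
        raise ValueError(f"cannot read a class number from {ec_number!r}")
    return EC_CLASSES.get(int(first_field), "unknown")


for number in ("3.4.21.4", "E.C.1.-.-.-", "EC 2.7.11.1"):
    print(number, "->", ec_class(number))
```

Dashes in the later fields mark levels that are left unspecified, so a partial number such as E.C.1.-.-.- still resolves to its broad class.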
6.6 Visualisation of Protein Molecules
• Many protein structure viewers use atomic coordinates and
secondary structure information directly from the PDB (these
include RasMol, QuickPDB and Deep View). Others use their
own format (e.g., Cn-3D). Each has various display options, to
highlight different residues or residue types, display the atoms
in different styles, etc., as shown in Fig 6.4. Programs such as
Cn-3D and QuickPDB allow users to highlight areas of interest,
which can be useful for mapping known motifs to 3D space
(e.g., conserved regions in some families surround the active
site, even if they are not close together in sequence).
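The point that residues far apart in sequence can lie close together in space can be checked numerically as well as by eye. The sketch below lists the residues whose C-alpha atoms fall within a cut-off distance of a chosen point. The coordinates, the site centre and the 8 angstrom cut-off are all invented for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical C-alpha coordinates (in angstroms), keyed by residue number.
CA_COORDS = {
    10: (12.0, 5.0, 3.0),
    11: (15.5, 6.0, 2.0),
    57: (13.0, 7.0, 4.0),
    102: (11.0, 4.0, 5.5),
    150: (30.0, 22.0, 18.0),
}


def residues_near(coords, centre, cutoff=8.0):
    """Return residue numbers whose C-alpha lies within cutoff of centre."""
    return sorted(
        residue for residue, xyz in coords.items()
        if math.dist(xyz, centre) <= cutoff
    )


site_centre = (12.5, 5.5, 4.0)  # invented position of an active site
near = residues_near(CA_COORDS, site_centre)
print("Residues near the site:", near)
```

In this made-up example, residues 10, 57 and 102 are widely separated in sequence yet cluster around the same point, which is the pattern a viewer's highlighting tools would reveal.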
Fig 6.4. Different protein structure viewers displaying the ubiquitin-like signalling protein,
Nedd8 (PDB ID: 1NND). (A) Deep View, (B) RasMol, (C) QuickPDB and (D) Cn-3D. (A) illustrates
classical ball and stick mode, (B) cartoon mode, (C) a wireframe α-carbon trace, with a small
section of the structure highlighted in blue, and (D) a hybrid display with amino acid chains in
cartoon mode and non-amino acid atoms in space-filling mode.
• Some programs are more than just viewers.
For example, Deep View has functions for
superposition of different structures, virtual
amino acid mutations, interfacing with Swiss-Model, and so on (see Chapter 7). There are
several advanced features for structural
biologists, including importing electron
density maps to build structures, and various
integrated modelling tools for energy
minimisation.