High Throughput Processing of
the Structural Information of the
Protein Data Bank
Zoltán Szabadka, Vince Grolmusz
Department of Computer Science
Eötvös University, Budapest
What is wrong with the PDB?
• It is not uniform, each author has a different style
• It is hard to process it automatically
– Residue numbering is not always sequential
– The chemical symbols of the atoms are often missing
– It is not easy to tell how many ligands there are in an entry; chain
ids are not used consistently
– It is not clearly indicated if a molecule has missing atoms, and
which atoms are missing
• There is a need for a “front-end” database to the
PDB
Flow of data
Internet → (download and check for updates) → local PDB mirror → (structural decomposition) → database of structure and coordinate data
database → SQL query → test sets of docking algorithms
database → SQL query → list of binding sites
database → SQL query → statistical information
What type of molecules are there
in a PDB entry?
• Protein chains (P)
• DNA/RNA chains (N)
• Ligands (L)
• Metals and other small ions (I)
• Water molecules (W)
Information stored in the database
• Covalent structure of molecules
• List of components of each entry
• Coordinate data for each atom
• Interactions between molecules
E/R diagram of the database
covalent structure
[Diagram: entities molecule, monomer, atom and bond, linked by “contains” relationships; attribute labels shown: id, num, type, symbol.]
E/R diagram of the database
component structure
[Diagram: entities entry, component, atom and interaction, linked by “contains” relationships; attribute labels shown: id, pdbid, molecule id, type, length, (x,y,z).]
PDB file formats
PDB format
This is the original PDB file format. It contains data records on separate
lines, each with a fixed length and format, e.g. ATOM, HETATM,
SEQRES, CONECT, etc.
mmCIF format
This is a relational database description language; a file contains data
tables called categories.
XML format
The same tables are described by XML tags. The file sizes are
huge; a file contains more tags than data.
Structural units of an entry
• The basic structural unit of both the PDB and the mmCIF format is the so-called monomer. It can be a molecule, a molecule fragment or just an atom.
• Each monomer has a code of at most three letters, called the monomer id, e.g. ALA for alanine, MG for magnesium ion, ACE for acetyl group, or HOH for water.
• A protein chain consists of many amino acid monomers, each having a sequence number that indicates its position within the chain.
• Similarly, DNA/RNA chains consist of many nucleic acid monomers.
• Metals, small ions, water and most ligands are a single monomer with a unique monomer id.
• The basic problem is that certain ligand molecules consist of two or more monomers, and this information is not always properly annotated in the PDB entries in either format.
mmCIF data categories
• entity
List of the molecules in the entry. Each is one of three types: polymer, non-polymer or
water. Each molecule has an entity id.
• entity_poly
Contains the type of each polymer entity, e.g. polypeptide(L).
• struct_asym
List of the components in the asymmetric unit. Each component has an
asym id and an entity id.
• pdbx_poly_seq_scheme
Describes the sequence of monomers in a polymer entity.
• pdbx_nonpoly_scheme
List of the monomers belonging to the non-polymer entities.
• atom_site
Coordinate data for the atoms whose positions could be determined experimentally.
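To make the category layout concrete, here is a minimal sketch of reading two of these tables from mmCIF text. The function name read_loop and the sample data are illustrative assumptions, and the parser is deliberately simplified: it only handles loop_ tables whose values fit on one line, and it relies on shlex for quoting, which is only an approximation of the mmCIF quoting rules.

```python
import shlex

def read_loop(text, category):
    """Return the rows of one mmCIF category as a list of dicts.

    Simplified sketch: multi-line (semicolon-delimited) values and
    values containing a lone apostrophe are not supported.
    """
    prefix = "_" + category + "."
    columns, rows = [], []
    in_loop = False
    for line in text.splitlines():
        line = line.strip()
        if line == "loop_":
            in_loop, columns = True, []
        elif not line or line.startswith("#"):
            in_loop = False              # a '#' or blank line ends the table
        elif in_loop and line.startswith("_"):
            columns.append(line)         # column header, e.g. _entity.id
        elif in_loop and columns and all(c.startswith(prefix) for c in columns):
            values = shlex.split(line)
            rows.append({c[len(prefix):]: v for c, v in zip(columns, values)})
    return rows

# Invented sample: three components, two of them copies of the same polymer.
sample = """\
loop_
_struct_asym.id
_struct_asym.entity_id
A 1
B 1
C 2
#
loop_
_entity.id
_entity.type
1 polymer
2 non-polymer
#
"""

asym = read_loop(sample, "struct_asym")
entity_type = {r["id"]: r["type"] for r in read_loop(sample, "entity")}
# Join each component (asym id) to the type of its entity.
components = [(r["id"], entity_type[r["entity_id"]]) for r in asym]
```

The join on entity id is the same relationship the struct_asym category describes: each component in the asymmetric unit points to one entity.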
Structural decomposition
based on the mmCIF format
• First we read the list of components in the asymmetric unit.
• For each component, we read its entity type, and for each polymer entity, its polymer type.
• Then we read the sequence of monomers for the polymer entities, and the list of monomers belonging to the non-polymer entities.
• The structure of the monomers is known ‘a priori’ from a file named components.cif, which can be found at RCSB’s web site. So for each monomer, we have a list of atoms, lacking coordinate information.
• Now we go through the table atom_site, and for each atom, we find the monomer it belongs to and fill in the coordinates for the atom just found. If an atom of a monomer is not found, it is marked as missing.
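The last two steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function name fill_coordinates, the tuple layouts, and the tiny templates dictionary (a stand-in for what would be read from components.cif, heavy atoms only) are all hypothetical.

```python
def fill_coordinates(templates, monomers, atom_site):
    """Attach coordinates to template atoms and report the missing ones.

    templates: monomer id -> list of atom names (stand-in for components.cif)
    monomers:  list of (asym_id, seq_num, monomer_id)
    atom_site: list of (asym_id, seq_num, atom_name, x, y, z)
    """
    # Start with every expected atom marked as missing (None).
    atoms = {
        (asym_id, seq_num, name): None
        for asym_id, seq_num, mon_id in monomers
        for name in templates[mon_id]
    }
    # Fill in the coordinates for each atom that was observed.
    for asym_id, seq_num, name, x, y, z in atom_site:
        key = (asym_id, seq_num, name)
        if key in atoms:
            atoms[key] = (x, y, z)
    missing = sorted(key for key, xyz in atoms.items() if xyz is None)
    return atoms, missing

# Invented example: a glycine whose carbonyl oxygen has no coordinates.
templates = {"GLY": ["N", "CA", "C", "O"], "HOH": ["O"]}
monomers = [("A", 1, "GLY"), ("B", 1, "HOH")]
atom_site = [
    ("A", 1, "N", 0.0, 0.0, 0.0),
    ("A", 1, "CA", 1.5, 0.0, 0.0),
    ("A", 1, "C", 2.0, 1.4, 0.0),
    ("B", 1, "O", 9.0, 9.0, 9.0),
]
atoms, missing = fill_coordinates(templates, monomers, atom_site)
```

Because the expected atom list comes from the template rather than from the entry, a missing atom shows up as an explicit record instead of a silent gap.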
Definition of molecule types
• Protein chain: a polymer entity of type “polypeptide(L)” that is at least 10 monomers long
• DNA/RNA chain: a polymer entity that is at least 5 monomers long and whose type is either “polydeoxyribonucleotide” or “polyribonucleotide”, or more than half of whose monomers are nucleic acids (monomer id A, C, G, I, T or U)
• Ion: a monomer whose id is on a predefined list of metals and small ions
• Water: the monomers of the water entity
• Ligand: all monomers that do not belong to the above categories form the set of ligand monomers
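These rules translate almost directly into a decision function. The sketch below is one reading of the definitions, with assumptions stated up front: the function name classify is hypothetical, the IONS set is a small illustrative subset (the actual predefined list is not given here), and the length threshold for DNA/RNA is taken to apply to both the type test and the majority test.

```python
NUCLEIC = {"A", "C", "G", "I", "T", "U"}
# Illustrative subset only; the real predefined ion list is not shown in the slides.
IONS = {"MG", "CA", "ZN", "NA", "CL", "K"}

def classify(entity_type, polymer_type, monomer_ids):
    """Return P, N, I, W or L for one component, given its monomer ids."""
    n = len(monomer_ids)
    if entity_type == "water":
        return "W"
    if entity_type == "polymer":
        if polymer_type == "polypeptide(L)" and n >= 10:
            return "P"
        nucleic = sum(m in NUCLEIC for m in monomer_ids)
        is_nucleic_type = polymer_type in (
            "polydeoxyribonucleotide", "polyribonucleotide")
        if n >= 5 and (is_nucleic_type or 2 * nucleic > n):
            return "N"
        return "L"      # short polymers fall through to the ligand set
    if n == 1 and monomer_ids[0] in IONS:
        return "I"
    return "L"
```

Under this reading, a peptide shorter than 10 monomers is treated as ligand material, which is what allows short peptide inhibitors to appear as ligands.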
Ligands and binding sites
• We define a graph on the atoms that have coordinate data. It has
two types of edges:
– covalent: if the distance between the two atoms is less than 1.25 times the sum
of their covalent radii
– VdW: if it is not covalent, but the distance between the two atoms is less than
the sum of their van der Waals radii
• The graph is built using a three-dimensional kd-tree in O(n log n) time
• We go through the edges:
– if an edge of covalent type connects two ligand molecules, then they are
joined together into one new molecule
– if an edge connects a ligand to a protein chain, then this intermolecular
interaction is recorded in the protein-ligand interaction table,
marking the binding site of this ligand on the protein surface
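The edge rules and the merging step can be sketched in a few lines. To keep the example dependency-free, a uniform cell grid stands in for the kd-tree used by the authors; the radii are illustrative values chosen for the example, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

# Illustrative radii in angstroms (assumed values, not from the slides).
COVALENT = {"C": 0.77, "N": 0.75, "O": 0.73}
VDW = {"C": 1.70, "N": 1.55, "O": 1.52}

def classify_edge(a, b):
    """a, b: (element, (x, y, z)). Returns 'covalent', 'vdw' or None."""
    d = dist(a[1], b[1])
    if d < 1.25 * (COVALENT[a[0]] + COVALENT[b[0]]):
        return "covalent"
    if d < VDW[a[0]] + VDW[b[0]]:
        return "vdw"
    return None

def build_edges(atoms, cell=4.0):
    """Grid search; cell must be at least the largest VdW radius sum."""
    grid = defaultdict(list)
    for i, (_, xyz) in enumerate(atoms):
        grid[tuple(floor(c / cell) for c in xyz)].append(i)
    edges = []
    for (cx, cy, cz), members in grid.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j:            # report each pair once
                        kind = classify_edge(atoms[i], atoms[j])
                        if kind:
                            edges.append((i, j, kind))
    return edges

def merge_ligands(edges, molecule_of, kind_of):
    """Join covalently bonded ligands; collect protein-ligand contacts."""
    parent = {m: m for m in kind_of}
    def find(m):                         # union-find with path halving
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m
    for i, j, kind in edges:
        a, b = molecule_of[i], molecule_of[j]
        if kind == "covalent" and kind_of[a] == kind_of[b] == "L":
            parent[find(a)] = find(b)
    interactions = set()
    for i, j, kind in edges:
        a, b = molecule_of[i], molecule_of[j]
        if {kind_of[a], kind_of[b]} == {"P", "L"}:
            protein, ligand = (a, b) if kind_of[a] == "P" else (b, a)
            interactions.add((protein, find(ligand)))
    roots = {m: find(m) for m in kind_of if kind_of[m] == "L"}
    return roots, interactions

# Invented example: two ligand monomers bonded to each other, one near a protein atom.
atoms = [("C", (0.0, 0.0, 0.0)), ("O", (1.4, 0.0, 0.0)), ("N", (4.4, 0.0, 0.0))]
molecule_of = {0: "L1", 1: "L2", 2: "P1"}
kind_of = {"L1": "L", "L2": "L", "P1": "P"}
edges = sorted(build_edges(atoms))
roots, interactions = merge_ligands(edges, molecule_of, kind_of)
```

In the example, the C-O pair is 1.4 Å apart and becomes a covalent edge, so L1 and L2 end up in one molecule; the O-N pair is about 3.0 Å apart and is recorded as a protein-ligand contact for the merged ligand.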
PDB version: June 6, 2005
• Number of PDB entries: 31,217
• Number of entries processed: 26,445
• Number of protein chains: 59,842
• Number of different sequences: 18,333
• Number of ligands: 53,834
• Number of different ligand molecules: 6,016
• Number of all atoms: 269,237,779
– Number of atoms in protein chains: 240,243,785
– Number of atoms in DNA/RNA chains: 7,709,842
– Number of atoms in ligands and ions: 2,479,339
– Number of atoms in water: 18,804,813
Distribution of elements in
ligands and ions
[Two charts of element frequencies. Organic elements: H, C, O, N, P, S, Other. Inorganic elements: MG, FE, CA, ZN, CL, NA, MN, F, K, CU, CD, W, I, BR, HG, X, CO, NI, Other.]
The distribution of the organic and the most frequent inorganic elements among the ligands
and ions. We found 70 different elements.
Distribution of elements in
protein chains
Element  Number      %
H        120638461   50.22
C        75710684    31.51
O        22672185    9.44
N        20660541    8.60
S        540432      0.22
SE       20730       0.01

There were 17 different elements in the protein chains. The tables show the number of occurrences and, for the non-standard elements, the monomers that contain them.

Element  Number  Monomers
P        499     MIS, CSP, PTR, LLP, SEP, TPO, CYQ, GPL, PAS, ASQ, NEP, SDP, LYX
F        116     EFC, FTR, YOF, BFD, LEF, 4FW, 4F3, MFC
AS       53      CAS, CAF, CZZ, CSR, CZ2
HG       48      CMH
I        13      TYI, PHI
BE       9       BFD
B        4       CLB, CLD, SBL, SBD
BR       4       DBY
CL       2       CLB, CLD
PB       2       CSB
V        2       SVA
Distribution of protein monomers
Monomer  %      %
LEU      8.81   8.77
ALA      8.09   8.25
GLY      7.58   7.66
VAL      6.97   7.08
GLU      6.50   6.57
SER      6.22   6.10
LYS      5.99   5.93
ASP      5.75   5.73
THR      5.72   5.71
ILE      5.44   5.56
ARG      4.97   4.95
PRO      4.68   4.65
ASN      4.40   4.34
PHE      3.91   3.90
GLN      3.81   3.73
TYR      3.52   3.47
HIS      2.44   2.46
MET      2.05   2.14
TRP      1.47   1.43
CYS      1.45   1.39
MSE      0.17   0.14

The table shows the distribution of the 20 natural
amino acids and selenomethionine in the different
chains and in all chains. The other non-standard
monomers are listed below.
Monomer and number of occurrences:
ACE 186, MLY 172, CGU 147, PCA 122, SEP 85, NH2 83, CME 76, PTR 55, KCX 48, CSD 48,
MLE 46, TPO 44, YOF 39, CEA 37, CAS 30, LLP 28, CSO 24, OCS 22, CSW 21, TYS 20,
ABA 19, CXM 18, CSS 16, DAL 16, CSX 15, TPQ 15, FME 15, MLZ 14, MVA 11, IIL 10,
SME 10, CSE 9, MHO 9, STY 9, NLE 8, M3L 8, SAR 8, SEB 7, BMT 7, MEN 7,
5HP 7, YCM 7, SCY 6, FTR 6, SAC 5, MIS 5, DLE 5, AYA 5, TRQ 5, IAS 4,
TRN 4, BFD 4, CMH 4, DSN 4, CSR 4, NEM 4, OMT 4, HIC 4, DAR 3, CYG 3,
ASI 3, ALY 3, HMR 3, ORN 3, SET 3, NEP 3, TYI 3, CAF 3, HTR 3, TA4 3,
SEC 2, DOH 2, CSB 2, DTR 2, DMT 2, STA 2, MME 2, DGL 2, ASQ 2, CSP 2,
BHD 2, CYM 2, NVA 2, MSA 2, CMT 2, DAH 2, 143 2, CZ2 2, TRO 2, LEF 2,
HSL 2, DCY 2, DVA 2, MSO 2, NIY 2, LYZ 2, CCS 2, CSZ 2, C5C 1, PAS 1

Monomers listed without a count:
AEI, PAQ, OSE, SNC, TBM, DHN, CR5, LLY, EFC, IML,
DBY, 2MR, SEG, CYD, GHG, DMG, LYX, ASB, DDE, CYQ,
MHL, MCL, MFC, CLD, GLZ, PCC, DHA, DPN, SVA, TMD,
CSA, S1H, AHP, AHB, 4F3, SBD, GPL, TYQ, CAY, PHI,
ARO, LAL, CLB, BAL, C6C, DAS, OAS, 5CS, MPT, NPH,
DSE, CY4, TRF, SOC, DHI, TMB, GLH, CZZ, 4HT, DTY,
EHP, 3AH, DHL, MTY, BUC, MGY, DAB, PEC, HLU, MDO,
SBL, GLQ, TYY, BCS, 175, PYX, MNV, SDP, TYN, 4FW
Protein-Ligand interactions
[Figure: example from entry 10gs (labels A, C and 1-4).]

Conditions:
1   bond type = VDW
2a  no missing atom from protein
2b  <10% missing atoms from protein
3   no missing atom from ligand
4   protein size between 1000 and 10000
5   ligand size between 10 and 100

condition     interaction  entry   int. type
1             50988        12798   15289
1&4           45872        12072   14196
1&4&2b        20055        5752    6558
1&4&2b&5      13176        4660    4900
1&4&2b&3&5    10285        3655    3691
1&4&2a&3&5    6053         2193    2261

The table above shows the number
of protein-ligand interactions, the
number of entries they occur in, and
the number of different interaction
types as more and more conditions are met.
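Selections like these map naturally onto SQL WHERE clauses. The sketch below uses Python's built-in sqlite3 module with an entirely hypothetical table layout and invented rows; it assumes "size" means the number of atoms, which the slides do not state. It shows conditions 1, 4, 2b, 3 and 5 combined in one query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema; the real database layout is not given in the slides.
con.execute("""
    CREATE TABLE protein_ligand_interaction (
        pdbid TEXT, bond_type TEXT,
        protein_atoms INTEGER, protein_missing INTEGER,
        ligand_atoms INTEGER, ligand_missing INTEGER
    )""")
# Invented rows with placeholder entry ids, for illustration only.
con.executemany(
    "INSERT INTO protein_ligand_interaction VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("t001", "VDW", 2500, 0, 30, 0),        # passes every condition
        ("t001", "VDW", 2500, 0, 5, 0),         # ligand too small (5)
        ("t002", "VDW", 800, 0, 40, 0),         # protein too small (4)
        ("t003", "VDW", 4000, 500, 25, 0),      # 12.5% of protein missing (2b)
        ("t004", "VDW", 4000, 100, 25, 0),      # passes every condition
        ("t004", "COVALENT", 4000, 100, 25, 0), # wrong bond type (1)
        ("t005", "VDW", 3000, 0, 50, 2),        # ligand atoms missing (3)
    ],
)
result = con.execute("""
    SELECT COUNT(*), COUNT(DISTINCT pdbid)
    FROM protein_ligand_interaction
    WHERE bond_type = 'VDW'                          -- condition 1
      AND protein_atoms BETWEEN 1000 AND 10000       -- condition 4
      AND protein_missing * 10 < protein_atoms       -- condition 2b
      AND ligand_missing = 0                         -- condition 3
      AND ligand_atoms BETWEEN 10 AND 100            -- condition 5
""").fetchone()
```

Replacing the 2b line with `protein_missing = 0` gives the stricter 2a variant from the last row of the table.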
Distribution of missing atoms
[Chart: number of PDB entries (vertical axis, 0 to 10000) by number of missing atoms (horizontal axis, bins 0, 1-10, 11-100, 101-1000, 1001-10000, 10001-).]
The distribution of the number of atoms missing from protein chains in the PDB
entries. Note that there are relatively few entries in which only a few atoms are missing.
Distribution of missing segments
[Three charts, one each for missing segments at the beginning, in the middle and at the end of the chains; horizontal axis: segment length, 1 to 40 amino acids.]
The distribution of the lengths of
missing chain segments at the
beginning, in the middle and at the end
of the chains. The length is measured
in amino acids. Note that in the
middle of the chain, typically 4-7
amino acids are missing.
Thank You!