Macromolecular Structure Database group
EMBL-EBI
3D databases and data warehouse technology
Overview
Overall Strategy
Terms and background
Populating the databases
Clean up processes
How can I use the database?
What next
What is a database?
By the term ‘database’ we refer to the system rather than the data
Indexed file space
Also used as a shorthand for a database management system (DBMS)
Methods for accessing and changing data
Controls for referential integrity
Normalisation
Data fields in a normalised database appear only once

CHAIN
ID    attr
A     185
...   ...

RESIDUE
CHAIN ID   SEQ   COMP ID
A          1     ASP
A          2     LYS
...        ...   ...

COMPONENT
ID    attr
ASP   -1
LYS   +1
...   ...

Data fields in a denormalised database are repeated in different places

CHAIN
ID    attr
A     185
...   ...

RESIDUE
CHAIN ID   SEQ   COMP ID   CHAINattr   COMPattr
A          1     ASP       185         -1
A          2     LYS       185         +1
...        ...   ...       ...         ...

COMPONENT
ID    attr
ASP   -1
LYS   +1
...   ...
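A minimal sketch of the two layouts, using Python's built-in sqlite3 module. The table and column names follow the slide; the SQL types and the view name residue_denorm are assumptions made for illustration, and SQLite stands in for whichever DBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised: each attribute is stored exactly once.
    CREATE TABLE chain     (id TEXT PRIMARY KEY, attr INTEGER);
    CREATE TABLE component (id TEXT PRIMARY KEY, attr TEXT);
    CREATE TABLE residue (
        chain_id TEXT REFERENCES chain(id),
        seq      INTEGER,
        comp_id  TEXT REFERENCES component(id),
        PRIMARY KEY (chain_id, seq)
    );
    INSERT INTO chain     VALUES ('A', 185);
    INSERT INTO component VALUES ('ASP', '-1'), ('LYS', '+1');
    INSERT INTO residue   VALUES ('A', 1, 'ASP'), ('A', 2, 'LYS');

    -- Denormalised: the join copies the chain and component
    -- attributes onto every residue row.
    CREATE VIEW residue_denorm AS
    SELECT r.chain_id, r.seq, r.comp_id,
           c.attr AS chain_attr, m.attr AS comp_attr
    FROM residue r
    JOIN chain c     ON c.id = r.chain_id
    JOIN component m ON m.id = r.comp_id;
""")

rows = conn.execute("SELECT * FROM residue_denorm ORDER BY seq").fetchall()
for row in rows:
    print(row)
```

In the normalised tables the value 185 is stored once; in the denormalised view it is repeated on every residue of chain A, which makes queries simpler at the cost of redundancy.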
Structural hierarchy
assembly
molecule (entity)
chain
residue
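The four levels can be pictured as nested containers. This is only an illustrative sketch: the class and field names are hypothetical and are not taken from the database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Residue:
    seq: int
    comp_id: str  # component identifier, e.g. "ASP"


@dataclass
class Chain:
    chain_id: str
    residues: List[Residue] = field(default_factory=list)


@dataclass
class Molecule:  # "entity" in the slide's terminology
    name: str
    chains: List[Chain] = field(default_factory=list)


@dataclass
class Assembly:
    assembly_id: int
    molecules: List[Molecule] = field(default_factory=list)

    def residue_count(self) -> int:
        """Count residues across every chain of every molecule."""
        return sum(len(c.residues) for m in self.molecules for c in m.chains)


chain_a = Chain("A", [Residue(1, "ASP"), Residue(2, "LYS")])
assembly = Assembly(1, [Molecule("example protein", [chain_a])])
print(assembly.residue_count())
```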
ASU and assemblies
[Figure: an assembly and the ASU (asymmetric unit), each shown as chains made up of residues]
The pipeline
[Figure: pipeline diagram. Labels: PDB archive, manual edit, edited PDB, archive DB, post-load processes, data warehouse, services, distribution; file formats pdb and cif]
The first steps
A series of scripts
Parses non-standard header records
Fills in chain identifiers
Outputs a first cut clean file
Manual editing
~1000 entries require manual editing
The result is a PDB format file that can be passed to the subsequent automatic steps
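As an illustration of the "fills in chain identifiers" step, here is a minimal sketch. It assumes that a missing identifier simply gets a default letter; the rule the real scripts apply is not given here, and the function name is hypothetical. The only format fact relied on is that the chain identifier occupies column 22 of ATOM and HETATM records in the PDB format.

```python
def fill_chain_ids(lines, default_chain="A"):
    """Give ATOM/HETATM records with a blank chain ID a default one.

    In the fixed-column PDB format the chain identifier is column 22
    (index 21 when counting from zero).
    """
    fixed = []
    for line in lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        if is_atom and len(line) > 21 and line[21] == " ":
            line = line[:21] + default_chain + line[22:]
        fixed.append(line)
    return fixed


# Build two demo records column by column so the layout is explicit:
# record name (1-6), serial (7-11), atom name (13-16), residue name
# (18-20), chain ID (22), residue number (23-26).
template = "ATOM  {:>5} {:<4} {:>3} {}{:>4}"
records = [
    template.format(1, " N", "ASP", " ", 1),   # chain ID missing
    template.format(2, " CA", "ASP", "B", 1),  # chain ID present
]
fixed = fill_chain_ids(records)
for line in fixed:
    print(line)
```

Records that already carry a chain identifier pass through unchanged, so the step is safe to run on every entry.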
bizarre errors …
1ew1
...
ATOM     47  N6    A A   2       2.068   5.433  -2.482
ATOM     47  N6    A A   2       2.068   5.433  -2.482
...
ATOM     59 1H6    A A   2       1.160   5.722  -2.818
ATOM     59 1H6    A A   2       1.160   5.722  -2.818
ATOM     60 2H6    A A   2       2.901   5.700   2.985
ATOM     60 2H6    A A   2       2.901   5.700  -2.985
...
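One way to surface a discrepancy like the sign difference on atom 60 is to group ATOM records by serial number and report any serial whose copies disagree. This is a sketch, not the group's actual check: the function name and tolerance are hypothetical, and whitespace splitting is used for brevity even though real PDB files are fixed-column.

```python
from collections import defaultdict


def conflicting_atoms(lines, tol=1e-3):
    """Return serial numbers whose repeated ATOM records disagree in x, y or z."""
    coords_by_serial = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "ATOM":
            continue
        serial = int(fields[1])
        xyz = tuple(float(v) for v in fields[-3:])  # last three fields: x, y, z
        coords_by_serial[serial].append(xyz)

    conflicts = []
    for serial, coords in sorted(coords_by_serial.items()):
        first = coords[0]
        differs = any(
            abs(a - b) > tol for xyz in coords[1:] for a, b in zip(first, xyz)
        )
        if differs:
            conflicts.append(serial)
    return conflicts


records = [
    "ATOM     47  N6    A A   2       2.068   5.433  -2.482",
    "ATOM     47  N6    A A   2       2.068   5.433  -2.482",
    "ATOM     59 1H6    A A   2       1.160   5.722  -2.818",
    "ATOM     59 1H6    A A   2       1.160   5.722  -2.818",
    "ATOM     60 2H6    A A   2       2.901   5.700   2.985",
    "ATOM     60 2H6    A A   2       2.901   5.700  -2.985",
]
print(conflicting_atoms(records))
```

On the records shown for 1ew1 only serial 60 is reported, because its z coordinate differs in sign between the two copies.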
automatic processing
process details
Automatic cleanup (d2c)
Incorporates quaternary structure information
Runs many checks and corrections
Outputs an mmCIF file
Loading
Metadata-driven custom loader
Load through views with insert triggers
Many heuristics also applied to data within these triggers
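Loading through a view with an insert trigger can be sketched as follows. SQLite (via Python's sqlite3 module) stands in for the production database, and the table, view and trigger names, the entry ID and the clean-up rule are all hypothetical; the point is only that the loader writes to a view while the trigger applies heuristics before the row reaches the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (
        entry_id TEXT PRIMARY KEY,
        organism TEXT
    );

    -- The loader only ever inserts into this view.
    CREATE VIEW entry_load AS SELECT entry_id, organism FROM entry;

    -- The trigger tidies each incoming row before it reaches the table:
    -- lower-case the entry ID, drop '$' characters, and strip
    -- surrounding blanks and trailing dots from the organism name.
    CREATE TRIGGER entry_load_insert
    INSTEAD OF INSERT ON entry_load
    BEGIN
        INSERT INTO entry (entry_id, organism)
        VALUES (
            lower(NEW.entry_id),
            rtrim(trim(replace(NEW.organism, '$', '')), '.')
        );
    END;
""")

conn.execute(
    "INSERT INTO entry_load VALUES (?, ?)", ("1ABC", "ESCHERICHIA $COLI.")
)
loaded = conn.execute("SELECT entry_id, organism FROM entry").fetchall()
print(loaded)
```

Keeping the heuristics in the trigger means every loader, however it is written, passes through the same corrections.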
Using reference data
Variations in legacy data
Hinders accurate searches
Hinders links to other services
Match data against controlled vocabularies
Within scripts
Within database during load
Semi-automated
Use string matching algorithms
Effective when the controlled vocabulary is well maintained
$COLI
COLI
E. COLI
E.COLI
ESCHERCHIA COLI
ESCHERICHI $COLI
ESCHERICHIA $ COLI
ESCHERICHIA $COLI
ESCHERICHIA COLI
ESCHERICHIA COLI.
EXCHERICHIA COLI
EXPRESCHERICHIA COLI
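The variants listed above can be matched against a controlled vocabulary with a generic string-similarity measure. The sketch below uses Python's difflib as a stand-in for whatever matching algorithm is actually used; the vocabulary, the clean-up rules and the 0.8 cutoff are assumptions. Names that fall below the cutoff return None and would go to manual review, which is the semi-automated part.

```python
import difflib
import re

# Hypothetical controlled vocabulary; a real one would come from a
# maintained taxonomy resource.
VOCABULARY = ["ESCHERICHIA COLI", "HOMO SAPIENS", "SACCHAROMYCES CEREVISIAE"]


def clean(name):
    """Strip '$', repeated blanks and trailing dots from a legacy name."""
    name = name.upper().replace("$", " ")
    name = re.sub(r"\s+", " ", name).strip()
    return name.rstrip(".").strip()


def match_organism(name, cutoff=0.8):
    """Return the closest vocabulary term, or None to flag manual review."""
    hits = difflib.get_close_matches(clean(name), VOCABULARY, n=1, cutoff=cutoff)
    return hits[0] if hits else None


for legacy in ["ESCHERICHIA $ COLI", "EXCHERICHIA COLI", "ESCHERCHIA COLI", "$COLI"]:
    print(repr(legacy), "->", match_organism(legacy))
```

Misspellings and stray characters resolve automatically, while heavily abbreviated forms such as "$COLI" are too far from any term and are left for a curator.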
Chemical Components
More difficult to deal with
Where coordinates and nomenclature do not agree, we have to judge which, if either, is correct
We maintain a curated database of compounds, against which legacy data is compared
Atom nomenclature: ongoing; relatively easy to correct where the compound has been correctly identified
Stereochemistry: may indicate that the compound name is incorrect
Ligand nomenclature
Ligands are often named inconsistently or even entirely incorrectly, e.g. α-D-mannose (MAN) vs. β-D-mannose (BMA)
Errors are detected using a graph-based structure comparison algorithm
[Figure: structures of MAN and BMA]
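To give a flavour of graph-based comparison, the sketch below treats each compound as a graph of atoms (labelled by element) and bonds, and compares graphs by Weisfeiler-Lehman colour refinement. This is a substitute technique chosen for brevity, not the algorithm used in the pipeline, and it is only a necessary test for identical connectivity. It also ignores stereochemistry, so anomers such as MAN and BMA, which share the same connectivity, would need stereo labels on the atoms to be told apart. The toy molecules and all names are hypothetical.

```python
from collections import Counter


def wl_signature(elements, bonds, rounds=3):
    """Multiset of refined atom colours for an element-labelled graph."""
    neighbours = {atom: [] for atom in elements}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    colours = dict(elements)  # initial colour = element symbol
    for _ in range(rounds):
        # New colour = own colour plus the sorted colours of the neighbours.
        colours = {
            atom: (colours[atom], tuple(sorted(colours[n] for n in neighbours[atom])))
            for atom in elements
        }
    return Counter(colours.values())


def same_graph(mol_a, mol_b):
    """True if the two molecules are indistinguishable by colour refinement."""
    return wl_signature(*mol_a) == wl_signature(*mol_b)


# Heavy-atom graphs only; atom names deliberately differ between copies.
reference_ethanol = ({"CA": "C", "CB": "C", "OH": "O"}, [("CB", "OH"), ("CA", "CB")])
reference_ether = ({"C1": "C", "C2": "C", "O1": "O"}, [("C1", "O1"), ("O1", "C2")])
observed = ({"C1": "C", "C2": "C", "O1": "O"}, [("C1", "C2"), ("C2", "O1")])

print(same_graph(observed, reference_ethanol))
print(same_graph(observed, reference_ether))
```

Because the comparison uses connectivity rather than atom names, a ligand deposited under the wrong name or with non-standard atom labels can still be matched to the right reference compound.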
not all cases resolvable
1d7t
DTY 4 in chain A, model 1: is it D or L?
HEADER    DE NOVO PROTEIN                         19-OCT-99   1D7T
TITLE     NMR STRUCTURE OF AN ENGINEERED CONTRYPHAN CYCLIC PEPTIDE
TITLE    2 (MOTIF CPXXPXC)
...
MODRES 1D7T DTY A    4  TYR  D-TYROSINE
...
HET    DTY  A   4      21
...
HETNAM     DTY D-TYROSINE
...
FORMUL   1  DTY    C9 H11 N1 O3
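A common way to test handedness from coordinates is the signed volume spanned by the N, C and CB atoms around the alpha carbon. The sketch below assumes the usual sign convention (positive for an L-amino acid with this atom order, in line with the CORN rule); the coordinates are synthetic, not taken from 1d7t, and the tolerance is an arbitrary choice. A volume close to zero means the geometry is too flat to decide, which is one way a case can remain unresolvable.

```python
def chiral_volume(n, ca, c, cb):
    """Signed volume (N-CA) . ((C-CA) x (CB-CA)) around the alpha carbon."""
    ax, ay, az = (n[i] - ca[i] for i in range(3))
    bx, by, bz = (c[i] - ca[i] for i in range(3))
    cx, cy, cz = (cb[i] - ca[i] for i in range(3))
    return (
        ax * (by * cz - bz * cy)
        + ay * (bz * cx - bx * cz)
        + az * (bx * cy - by * cx)
    )


def handedness(n, ca, c, cb, tol=0.5):
    """Classify as 'L', 'D' or 'undetermined' from the signed volume."""
    volume = chiral_volume(n, ca, c, cb)
    if volume > tol:
        return "L"
    if volume < -tol:
        return "D"
    return "undetermined"


# Synthetic, idealised geometry with CA at the origin.
ca = (0.0, 0.0, 0.0)
n, c, cb = (-0.7, -1.2, 0.5), (1.4, 0.0, 0.5), (-0.7, 1.2, 0.5)


def mirror(p):
    """Reflect a point through the z = 0 plane, inverting the handedness."""
    return (p[0], p[1], -p[2])


print(handedness(n, ca, c, cb))
print(handedness(mirror(n), ca, mirror(c), mirror(cb)))
```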
post-load processing
process details
Involved in deriving data and building cross-links to other services
Geometric information
Analysing non-polymer components and assembling full entities from individual components
Links to taxonomy and sequence databases
transformation to DW
process details
Set of SQL scripts
Supports Oracle (routinely) and MySQL (development)
Periodically undertake full transform
takes a couple of weeks
Provide weekly incremental patches
much faster
Supports transforms into different data marts
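The difference between a full transform and an incremental patch can be sketched as follows. Everything here is hypothetical: the table names, the modified_week column used to find changed entries, the entry IDs and the toy "transform" (upper-casing a title) are illustrations, and SQLite stands in for Oracle or MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE archive_entry (
        entry_id TEXT PRIMARY KEY, title TEXT, modified_week INTEGER
    );
    CREATE TABLE dw_entry (entry_id TEXT PRIMARY KEY, title_upper TEXT);
""")

TRANSFORM = "SELECT entry_id, upper(title) FROM archive_entry"


def full_transform(conn):
    """Rebuild the warehouse table from scratch (slow on real data)."""
    conn.execute("DELETE FROM dw_entry")
    conn.execute("INSERT INTO dw_entry " + TRANSFORM)


def incremental_patch(conn, since_week):
    """Re-derive only the entries modified in or after since_week."""
    conn.execute(
        "DELETE FROM dw_entry WHERE entry_id IN "
        "(SELECT entry_id FROM archive_entry WHERE modified_week >= ?)",
        (since_week,),
    )
    conn.execute(
        "INSERT INTO dw_entry " + TRANSFORM + " WHERE modified_week >= ?",
        (since_week,),
    )


conn.executemany(
    "INSERT INTO archive_entry VALUES (?, ?, ?)",
    [("1abc", "first title", 1), ("2xyz", "second title", 1)],
)
full_transform(conn)

# Week 2: one entry is revised and one is new; only those are re-derived.
conn.execute(
    "UPDATE archive_entry SET title = 'revised title', modified_week = 2 "
    "WHERE entry_id = '1abc'"
)
conn.execute("INSERT INTO archive_entry VALUES ('3new', 'third title', 2)")
incremental_patch(conn, since_week=2)

warehouse = conn.execute("SELECT * FROM dw_entry ORDER BY entry_id").fetchall()
print(warehouse)
```

The patch touches only the rows that changed, which is why it is much faster than rebuilding the whole warehouse.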
coming soon …
Continuing cleanup
HET group curation
Sequence cross-references
Citations
More choice on downloads
Data marts (even single tables)
Groups of entries
Release of clean PDB files (end 2006)
who did what
[Figure: the pipeline diagram again, with the same stage labels as above]