Macromolecular Structure Database group

EMBL-EBI
3D databases and data warehouse technology
Overview
 Overall Strategy
 Terms and background
 Populating the databases
 Clean up processes
 How can I use the database?
 What next
What is a database?
 By the term ‘database’ we refer to the system rather than the data
 Indexed file space
 Also used as a shorthand for a database management system (DBMS)
 Methods for accessing and changing data
 Controls for referential integrity
Normalisation
 Data fields in a normalised database appear only once
CHAIN
ID    attr
A     185
...   ...

RESIDUE
CHAIN ID   SEQ   COMP ID
A          1     ASP
A          2     LYS
...        ...   ...

COMPONENT
ID    attr
ASP   -1
LYS   +1
...   ...
 Data fields in a denormalised database are repeated in different places

CHAIN
ID    attr
A     185
...   ...

RESIDUE
CHAIN ID   SEQ   COMP ID   CHAINattr   COMPattr
A          1     ASP       185         -1
A          2     LYS       185         +1
...        ...   ...       ...         ...

COMPONENT
ID    attr
ASP   -1
LYS   +1
...   ...
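The two layouts above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (chain, component, residue, residue_denorm) are assumptions modelled on the slide, not the actual MSD schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised form: each attribute is stored exactly once.
CREATE TABLE chain     (id TEXT PRIMARY KEY, attr INTEGER);
CREATE TABLE component (id TEXT PRIMARY KEY, attr INTEGER);
CREATE TABLE residue (
    chain_id TEXT REFERENCES chain(id),
    seq      INTEGER,
    comp_id  TEXT REFERENCES component(id),
    PRIMARY KEY (chain_id, seq)
);
INSERT INTO chain VALUES ('A', 185);
INSERT INTO component VALUES ('ASP', -1), ('LYS', 1);
INSERT INTO residue VALUES ('A', 1, 'ASP'), ('A', 2, 'LYS');

-- Denormalised form: chain and component attributes are copied
-- onto every residue row, so no join is needed at query time.
CREATE TABLE residue_denorm AS
SELECT r.chain_id, r.seq, r.comp_id,
       c.attr AS chain_attr, k.attr AS comp_attr
FROM residue r
JOIN chain c     ON c.id = r.chain_id
JOIN component k ON k.id = r.comp_id;
""")

denorm_rows = conn.execute(
    "SELECT chain_id, seq, comp_id, chain_attr, comp_attr "
    "FROM residue_denorm ORDER BY seq"
).fetchall()
print(denorm_rows)
```

The denormalised table answers residue queries without joins, at the cost of repeating the chain attribute on every row of that chain.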
Structural hierarchy
assembly
molecule (entity)
chain
residue
ASU and assemblies
[Figure: diagram labelled assembly, ASU, chain, residues]
The pipeline
[Figure: pipeline diagram. Labels: archive PDB, manual edit, edited PDB, archive DB, post-load processes, data warehouse, services, distribution; file formats: pdb, cif]
The first steps
 A series of scripts
 Parses non-standard header records
 Fills in chain identifiers
 Outputs a first-cut clean file
 Manual editing
 ~1000 entries require manual editing
 The result is a PDB-format file that can be passed to the subsequent automatic steps
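One of the script steps above, filling in chain identifiers, might look roughly like the following. This is a minimal sketch, not the group's actual script: the function name fill_chain_ids and the choice of a single default identifier are assumptions. It relies only on the fixed-column PDB layout, where the chain identifier is column 22 of ATOM and HETATM records.

```python
def fill_chain_ids(lines, default_chain="A"):
    """Return a copy of `lines` with blank chain IDs replaced by `default_chain`.

    In fixed-column PDB format the chain identifier is column 22
    (index 21) of ATOM and HETATM records. Other records are left alone.
    """
    fixed = []
    for line in lines:
        is_coordinate = line.startswith(("ATOM  ", "HETATM"))
        if is_coordinate and len(line) > 21 and line[21] == " ":
            line = line[:21] + default_chain + line[22:]
        fixed.append(line)
    return fixed


# Hypothetical (truncated) record with a blank chain ID, built column by column.
record = "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}".format(1, " N", "ASP", " ", 1)
fixed_record = fill_chain_ids([record])[0]
print(fixed_record)
```

A real script would need a rule for entries with several unlabelled chains; that decision is not described in the slides and is left out here.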
EMBL-EBI
bizarre errors …
1ew1
...
ATOM     47  N6    A A   2       2.068   5.433  -2.482
ATOM     47  N6    A A   2       2.068   5.433  -2.482
...
ATOM     59 1H6    A A   2       1.160   5.722  -2.818
ATOM     59 1H6    A A   2       1.160   5.722  -2.818
ATOM     60 2H6    A A   2       2.901   5.700   2.985
ATOM     60 2H6    A A   2       2.901   5.700  -2.985
...
automatic processing
[Figure: pipeline diagram, repeated]
process details
 Automatic cleanup (d2c)
 Incorporates quaternary structure information
 Runs a lot of checks and corrections
 Outputs mmCIF file
 Loading
 Metadata-driven custom loader
 Load through views with insert triggers
 Many heuristics also applied to data within these triggers
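Loading through views with insert triggers can be illustrated with SQLite, which supports INSTEAD OF triggers on views. This is a sketch under stated assumptions: the tables, the view entry_load, the cleanup heuristic and the entry codes are all hypothetical, and the production loader described above targets a different database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE entry (
    code        TEXT PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(id)
);

-- Loader-facing view: flat rows, as they might come out of a parsed file.
CREATE VIEW entry_load AS
SELECT e.code, o.name AS organism
FROM entry e JOIN organism o ON o.id = e.organism_id;

-- The trigger splits each inserted row across the underlying tables and
-- applies a small cleanup heuristic (trim whitespace, upper-case the name).
CREATE TRIGGER entry_load_insert INSTEAD OF INSERT ON entry_load
BEGIN
    INSERT OR IGNORE INTO organism (name)
    VALUES (upper(trim(NEW.organism)));
    INSERT INTO entry (code, organism_id)
    SELECT NEW.code, id FROM organism
    WHERE name = upper(trim(NEW.organism));
END;
""")

# Hypothetical entry codes; the loader only ever writes to the view.
insert_sql = "INSERT INTO entry_load (code, organism) VALUES (?, ?)"
conn.execute(insert_sql, ("0abc", " Escherichia coli "))
conn.execute(insert_sql, ("0xyz", "ESCHERICHIA COLI"))

loaded = conn.execute(
    "SELECT code, organism FROM entry_load ORDER BY code"
).fetchall()
organisms = conn.execute("SELECT name FROM organism").fetchall()
print(loaded, organisms)
```

Keeping the heuristics inside the trigger means every load path gets the same cleanup, whichever tool does the insert.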
EMBL-EBI
Using reference data
 Variations in legacy data
 Hinders accurate searches
 Hinders links to other services
 Match data against controlled vocabularies
 Within scripts
 Within database during load
 Semi-automated
 Use string matching algorithms
 Effective when the controlled vocabulary is well maintained
$COLI
COLI
E. COLI
E.COLI
ESCHERCHIA COLI
ESCHERICHI $COLI
ESCHERICHIA $ COLI
ESCHERICHIA $COLI
ESCHERICHIA COLI
ESCHERICHIA COLI.
EXCHERICHIA COLI
EXPRESCHERICHIA COLI
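Matching variants like those listed above against a controlled vocabulary could be sketched with Python's standard difflib module. The vocabulary contents, the cleanup steps and the 0.8 cutoff are illustrative assumptions; the slides do not say which string matching algorithm is used.

```python
import difflib

# Hypothetical controlled vocabulary; a real one would be far larger.
CONTROLLED_VOCABULARY = [
    "ESCHERICHIA COLI",
    "HOMO SAPIENS",
    "SACCHAROMYCES CEREVISIAE",
]


def match_organism(raw, vocabulary=CONTROLLED_VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary term for `raw`, or None if nothing is close."""
    # Strip stray '$' and '.' characters and normalise case and spacing.
    cleaned = " ".join(raw.replace("$", " ").replace(".", " ").upper().split())
    hits = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return hits[0] if hits else None


for variant in ["ESCHERCHIA COLI", "EXCHERICHIA COLI", "ESCHERICHI $COLI", "E. COLI"]:
    print(repr(variant), "->", match_organism(variant))
```

Misspellings resolve automatically, while abbreviations such as "E. COLI" fall below the cutoff and return None. Those are the cases that would go to manual review, which is where the semi-automated part comes in.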
Chemical Components
 More difficult to deal with
 Where coordinates and nomenclature do not agree, we have to make a judgement on which, if either, is correct
 We maintain a curated database of compounds, against which legacy data are compared
 Atom nomenclature – ongoing; relatively easy to correct where the compound has been correctly identified
 Stereochemistry – may indicate that the compound name is incorrect
Ligand nomenclature
 Ligands are often named inconsistently or even entirely incorrectly, e.g. α-D-mannose (MAN) vs. β-D-mannose (BMA)
 Errors are detected using a graph-based structure comparison algorithm
[Figure: structures of MAN and BMA]
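The general idea of a graph-based comparison can be sketched as a small backtracking isomorphism test on element-labelled atom graphs. This is not the algorithm used in the pipeline, whose details are not given here; the function names and the toy molecules are assumptions. It compares connectivity only, so anomers such as MAN and BMA, which differ in stereochemistry, would additionally need stereo descriptors to be told apart.

```python
def _adjacency(atoms, bonds):
    """Build a neighbour set for every atom name."""
    adj = {name: set() for name in atoms}
    for u, v in bonds:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def graphs_match(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return True if two element-labelled molecular graphs are isomorphic.

    atoms_*: dict mapping atom name -> element symbol
    bonds_*: list of (atom name, atom name) pairs
    """
    if len(atoms_a) != len(atoms_b):
        return False
    adj_a, adj_b = _adjacency(atoms_a, bonds_a), _adjacency(atoms_b, bonds_b)
    if sum(map(len, adj_a.values())) != sum(map(len, adj_b.values())):
        return False
    order = list(atoms_a)
    mapping = {}  # atom name in A -> atom name in B

    def extend(i):
        if i == len(order):
            return True
        a = order[i]
        for b in atoms_b:
            if b in mapping.values():
                continue
            if atoms_a[a] != atoms_b[b] or len(adj_a[a]) != len(adj_b[b]):
                continue
            # Every already-mapped neighbour of a must map to a neighbour of b.
            if all(mapping[n] in adj_b[b] for n in adj_a[a] if n in mapping):
                mapping[a] = b
                if extend(i + 1):
                    return True
                del mapping[a]
        return False

    return extend(0)


# Toy heavy-atom graphs: ethanol under two naming schemes, and dimethyl ether.
ethanol = ({"C1": "C", "C2": "C", "O": "O"}, [("C1", "C2"), ("C2", "O")])
renamed = ({"CA": "C", "CB": "C", "OH": "O"}, [("CB", "OH"), ("CA", "CB")])
ether = ({"C1": "C", "C2": "C", "O": "O"}, [("C1", "O"), ("C2", "O")])

same_compound = graphs_match(*ethanol, *renamed)
different_compound = graphs_match(*ethanol, *ether)
print(same_compound, different_compound)
```

Because the test ignores atom names, a ligand can be checked against a curated reference compound even when its atoms are labelled differently.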
not all cases resolvable
1d7t
DTY 4 in chain A, model 1 - is it D or L ??
HEADER    DE NOVO PROTEIN                         19-OCT-99   1D7T
TITLE     NMR STRUCTURE OF AN ENGINEERED CONTRYPHAN CYCLIC PEPTIDE
TITLE    2 (MOTIF CPXXPXC)
...
MODRES 1D7T DTY A    4  TYR  D-TYROSINE
...
HET    DTY  A   4      21
...
HETNAM     DTY D-TYROSINE
...
FORMUL   1  DTY    C9 H11 N1 O3
post-load processing
[Figure: pipeline diagram, repeated]
process details
 Involved in deriving data and building cross-links to other services
 Geometric information
 Analysing non-polymer components and assembling full entities from individual components
 Links to taxonomy and sequence databases
transformation to DW
[Figure: pipeline diagram, repeated]
process details
 Set of SQL scripts
 Supports Oracle (routinely) and MySQL (development)
 Periodically undertake full transform
 takes a couple of weeks
 Provide weekly incremental patches
 much faster
 Supports transforms into different data marts
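The difference between a full transform and an incremental patch can be sketched as follows. The actual transform is a set of SQL scripts whose contents are not shown here, so the tables archive_residue and dw_entry, the derived column n_residues and the entry codes are all hypothetical.

```python
import sqlite3


def transform(conn, codes=None):
    """Rebuild warehouse rows for the given entry codes, or for all entries."""
    if codes is None:
        # Full transform: empty the warehouse table and rebuild every entry.
        conn.execute("DELETE FROM dw_entry")
        codes = [row[0] for row in
                 conn.execute("SELECT DISTINCT code FROM archive_residue")]
    for code in codes:
        # Incremental path: replace only the rows belonging to this entry.
        conn.execute("DELETE FROM dw_entry WHERE code = ?", (code,))
        conn.execute(
            "INSERT INTO dw_entry (code, n_residues) "
            "SELECT code, COUNT(*) FROM archive_residue "
            "WHERE code = ? GROUP BY code",
            (code,),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE archive_residue (code TEXT, seq INTEGER, comp_id TEXT);
CREATE TABLE dw_entry (code TEXT PRIMARY KEY, n_residues INTEGER);
INSERT INTO archive_residue VALUES
    ('e1', 1, 'ASP'), ('e1', 2, 'LYS'), ('e2', 1, 'ASP');
""")

transform(conn)                                   # full transform
conn.execute("INSERT INTO archive_residue VALUES ('e2', 2, 'LYS')")
transform(conn, codes=["e2"])                     # patch for the changed entry

warehouse = conn.execute(
    "SELECT code, n_residues FROM dw_entry ORDER BY code"
).fetchall()
print(warehouse)
```

The patch touches only the entries that changed, which is why it runs much faster than rebuilding everything.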
coming soon …
 Continuing cleanup
 HET group curation
 Sequence cross-references
 Citations
 More choice on downloads
 Data marts (even single tables)
 Groups of entries
 Release of clean PDB files (end 2006)
who did what
[Figure: pipeline diagram, repeated]