Macromolecular Structure Database group


EMBL-EBI
Integration of Sequence and 3D
structure Databases
“The key to Bioinformatics is
integration, integration, integration”
Bioinformatics: Bringing it all together, technology feature, M. Chicurel, Nature 419, 751, 2002
“Coordinates by themselves just specify shape and are not necessarily of intrinsic
biological value, unless they can be related to other information”
Integrative database analysis in structural genomics, Mark Gerstein, Nature Structural Biology 7, 960, 2000
“Only the development of integrated bioinformatics systems
will enable the manipulation of complex biological information”
Editorial, Bioinformatics 18 (12), 1551, 2002
“The information management challenge for the future will be to develop
new ways to acquire, store and retrieve not only biological data per se,
but also those data in the context of biological knowledge”
Biological Databases and Informatics Program Announcement NSF 02-058
EMBL-Bank: DNA sequences
ArrayExpress: microarray expression data
UniProt: protein sequences
EMSD: macromolecular structure data
EnsEMBL: human genome, gene annotation
Integration With UniProt
eFamily Project
Future Plans
Integration With UniProt
UniProt (Universal Protein Resource) is
the world's most comprehensive
catalogue of information on proteins. It is
a central repository of protein sequence
and function created by joining the
information contained in Swiss-Prot,
TrEMBL, and PIR.
http://www.ebi.ac.uk/uniprot/index.html
MSD / UniProt: Two Different Database Systems
The UniProt services and the MSD services are linked by an agreed common mechanism for exchange of information.
“One of the major benefits of using databases for data storage is for data sharing”
MSD/UniProt Collaboration
- Collaboration between the MSD (Sameer Velankar, Phil McNeil) and UniProt (Virginie Mittard, Daniel Barrell) groups
- Depends upon:
  - Clean UniProt (UNP) cross references in the DBREF records for each chain (where possible)
  - Clean taxonomy ids for each PDB chain
  - Taxonomy for PDB Source and UniProt OS must be the same
- Cleanup of the DBREF records in the PDB entries
  - Cleanup of the UniProt cross references in PDB entries
- Cleanup of Source Information
  - NCBI Taxonomy IDs
- Cleanup of the Reference information
- Update UniProt entries
  - Source, Reference, Secondary structure information
- Supply Additional Information
  - revision date, experimental method, resolution, R-factor
- Residue-by-residue mapping between MSD and UniProt enables chimaeras to be handled correctly
Sequence Schema
Residue by Residue Mapping to UniProt
PDB   CHAIN  UNP     SERIAL  PDB_RES  PDB_SEQ  UNP_RES  UNP_RES  ANNOTATION
1HG1  A      P06608  1                ALA      22       A        NOT OBSERVED
1HG1  A      P06608  2                ASP      23       D        NOT OBSERVED
1HG1  A      P06608  3                LYS      24       K        NOT OBSERVED
1HG1  A      P06608  4       4        LEU      25       L
1HG1  A      P06608  5       5        PRO      26       P
1HG1  A      P06608  6       6        ASN      27       N
1HG1  A      P06608  7       7        ILE      28       I
1HG1  A      P06608  8       8        VAL      29       V
1HG1  A      P06608  9       9        ILE      30       I
1HG1  A      P06608  10      10       LEU      31       L
1HG1  A      P06608  11      11       ALA      32       A
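The per-residue mapping for 1HG1 chain A can be held in a small record type and queried in either direction. The sketch below is only an illustration: the class name, field names and helper functions are assumptions, not the MSD schema, and only the first five rows of the slide are reproduced.

```python
from typing import NamedTuple, Optional


class ResidueMapping(NamedTuple):
    """One row of the per-residue mapping (field names are illustrative)."""
    pdb: str
    chain: str
    unp: str
    serial: int                # position in the SEQRES sequence
    pdb_res: Optional[int]     # ATOM-record residue number, None if not observed
    pdb_seq: str               # three-letter residue name
    unp_res: int               # position in the UniProt sequence
    unp_code: str              # one-letter UniProt residue
    annotation: str = ""


ROWS = [
    ResidueMapping("1HG1", "A", "P06608", 1, None, "ALA", 22, "A", "NOT OBSERVED"),
    ResidueMapping("1HG1", "A", "P06608", 2, None, "ASP", 23, "D", "NOT OBSERVED"),
    ResidueMapping("1HG1", "A", "P06608", 3, None, "LYS", 24, "K", "NOT OBSERVED"),
    ResidueMapping("1HG1", "A", "P06608", 4, 4, "LEU", 25, "L"),
    ResidueMapping("1HG1", "A", "P06608", 5, 5, "PRO", 26, "P"),
]


def pdb_to_unp(rows, pdb, chain, pdb_res):
    """Return (UniProt accession, UniProt position) for an observed PDB residue."""
    if pdb_res is None:
        return None
    for row in rows:
        if (row.pdb, row.chain, row.pdb_res) == (pdb, chain, pdb_res):
            return row.unp, row.unp_res
    return None


def unobserved_serials(rows):
    """SEQRES positions that have no coordinates in the entry."""
    return [row.serial for row in rows if row.annotation == "NOT OBSERVED"]
```

Because every row carries its own UniProt accession, a chimaeric chain would simply have rows pointing at different accessions.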
Display of Mappings
Integration With IntEnz
- IntEnz is the Integrated relational Enzyme database and is the most up-to-date version of the Enzyme Nomenclature.
- The IntEnz relational database, implemented and supported by the EBI, is the master copy of the Enzyme Nomenclature data.
- MSD uses the UniProt accession code(s) mapped to each chain to link to the IntEnz EC number.
- This is done directly via the MSD and IntEnz Oracle relational databases.
http://www.ebi.ac.uk/intenz/index.html
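The chain to UniProt to EC link amounts to a join on the UniProt accession. The sketch below uses an in-memory SQLite database as a stand-in for the two Oracle databases. The table names, column names and the EC value "x.x.x.x" are placeholders invented for illustration, not the real MSD or IntEnz schema or a real EC assignment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chain_unp (pdb TEXT, chain TEXT, unp TEXT);  -- hypothetical MSD-side table
    CREATE TABLE unp_ec (unp TEXT, ec TEXT);                  -- hypothetical IntEnz-side table
    """
)
conn.execute("INSERT INTO chain_unp VALUES ('1HG1', 'A', 'P06608')")
# Placeholder EC number, not the real annotation for this protein.
conn.execute("INSERT INTO unp_ec VALUES ('P06608', 'x.x.x.x')")


def ec_numbers(connection, pdb, chain):
    """EC numbers reachable from a PDB chain via its UniProt accession(s)."""
    rows = connection.execute(
        "SELECT e.ec FROM chain_unp c JOIN unp_ec e ON e.unp = c.unp "
        "WHERE c.pdb = ? AND c.chain = ? ORDER BY e.ec",
        (pdb, chain),
    ).fetchall()
    return [row[0] for row in rows]
```

A chain with no UniProt cross reference simply yields an empty list, which matches the "where possible" caveat on the DBREF cleanup.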
eFamily
http://www.efamily.org.uk/
The eFamily project is designed to integrate the information contained
in five of the major protein databases.
eFamily Core Activities
To integrate the information contained in the five major protein databases.
- The member databases (CATH, SCOP, MSD, InterPro, and Pfam) contain information describing protein domains.
- For SCOP, CATH and MSD the data is primarily concerned with 3D structures.
- In InterPro and Pfam the focus is mainly on the sequences.
- It is often difficult for biologists to navigate from protein sequence to protein structure and back again.
- eFamily aims to provide the scientific community with a coherent and rich view of protein families that allows users to navigate seamlessly between the worlds of protein structure and protein sequence, through improved data resources and integration via grid technologies.
DATA INTEGRATION
[Diagram relating CATH, SCOP, UniProt, PROSITE, InterPro, Pfam and GO. Link labels: common domains definition; HMM prediction; mapping and curation; curated; GO; mapping per residue; MSD mapping (residues/sequence); mapping start–end.]
Complexity of Mappings
Relationships to be mapped:
- InterPro-UniProt(s)
- UniProt-PDBCHAIN(S)
- CATH/SCOP DOMAIN-PDBCHAIN(S)
- CATH/SCOP DOMAIN-UniProt
- InterPro-CATH/SCOP
Points to consider:
- An InterPro entry is a collection of one or more UniProt entries
- Unlike in the PDB, the concept of CHAIN does not exist in UniProt
- A UniProt entry is always numbered from 1 to N
- PDB SEQRES residue numbering is from 1 to N
- PDB CHAIN (ATOM records) residue numbering is not necessarily 1 to N
- UniProt to PDB mapping can be one to many
- PDB CHAIN to UniProt mapping can be one to many
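Since ATOM-record numbering need not run from 1 to N, start and end ranges are safer to derive from the per-residue mapping than from a fixed offset. The sketch below collapses (PDB residue, UniProt residue) pairs into contiguous ranges. The function name is an assumption, and PDB insertion codes are ignored for simplicity.

```python
def collapse_to_ranges(pairs):
    """Collapse per-residue pairs into (pdb_start, pdb_end, unp_start, unp_end).

    pairs: list of (pdb_res, unp_res) tuples in chain order.
    A new range starts whenever either numbering stops being consecutive.
    """
    ranges = []
    for pdb_res, unp_res in pairs:
        if ranges and ranges[-1][1] + 1 == pdb_res and ranges[-1][3] + 1 == unp_res:
            pdb_start, _, unp_start, _ = ranges[-1]
            ranges[-1] = (pdb_start, pdb_res, unp_start, unp_res)
        else:
            ranges.append((pdb_res, pdb_res, unp_res, unp_res))
    return ranges


# Observed residues 4-11 of 1HG1 chain A map to UniProt positions 25-32.
observed = [(n, n + 21) for n in range(4, 12)]
print(collapse_to_ranges(observed))  # [(4, 11, 25, 32)]
```

A chain that maps to several UniProt entries would need one such list per accession, which is where the one-to-many cases show up.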
MSD-SCOP Mapping for 1cbw
[Columns shown: Chains, PDB Residue Range, Swiss-Prot Residue Range, SCOP Domain]
MSD-CATH Mapping for 1cbw
[Columns shown: Chains, Swiss-Prot Residue Range, PDB Residue Range, CATH Domains]
MSD-Pfam Mapping for 1cbw
[Columns shown: Swiss-Prot Residue Range, Chains, PDB Residue Range, Pfam Domain]
Practical Applications of
Database Integration
Mappings Used in Pfam
- Pfam now uses the UniProt to structure mapping from the MSD Search Database
- Saves duplication of effort and weeks of compute
- Use mapping for annotation of alignments
[Figure: Pfam domains highlighted on structure of RuBisCo (8ruc)]
Mappings Used in InterPro
Mappings Used in SCOP
Comparison of SCOP, CATH and Pfam
Domains
SCOP, CATH and Pfam have developed web services for describing their particular domain families.
These services can be queried with a protein identifier, protein accession or PDB identifier.
The databases use the MSD/UniProt mapping to translate between the sequence and structure domains.
XML & Web Services
The eFamily project has developed an XML schema to describe:
- Domains
- Annotation
- Sequence Alignments
- Structure Alignments
This will be used to provide web services as part of the eFamily project.
More information about the XML schema is available at http://www.efamily.org.uk/xml/efamily/documentation/efamily.shtml
We are also developing a Perl-based API for the eFamily XML, which will be available from the eFamily site as well as via BioPerl.
The MSD residue-by-residue mapping is made available in XML format based on the eFamily schema.
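As a rough idea of what serialising the mapping to XML involves, the sketch below writes a few residues with Python's standard library. The element and attribute names (`mapping`, `residue`, `serial`, `pdbSeq`, `unpRes`) are invented for illustration and are not taken from the eFamily schema, which should be consulted at the URL above.

```python
import xml.etree.ElementTree as ET


def mapping_to_xml(pdb, chain, unp, residues):
    """Serialise (serial, pdb_seq, unp_res) tuples to an XML string.

    Element and attribute names are hypothetical, not the eFamily schema.
    """
    root = ET.Element("mapping", pdb=pdb, chain=chain, unp=unp)
    for serial, pdb_seq, unp_res in residues:
        ET.SubElement(
            root, "residue", serial=str(serial), pdbSeq=pdb_seq, unpRes=str(unp_res)
        )
    return ET.tostring(root, encoding="unicode")


xml_text = mapping_to_xml("1HG1", "A", "P06608", [(4, "LEU", 25), (5, "PRO", 26)])
```

Parsing `xml_text` back with `ET.fromstring` recovers the same attributes, which is a quick way to check a round trip.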
Future Plans
Mapping Annotation
Residue Mapping Program 1
- Makes use of cleaned-up cross-reference and taxonomy data, SEQRES and ATOM/HETATM records from the PDB, and the sequence from the UniProt entry to align and map each residue.
- Makes connected segments from the PDB ATOM/HETATM records for each chain.
- These are then aligned against the SEQRES records, and the alignments for all segments are merged to get the SEQRES-ATOM alignment.
- This enables any unobserved residues to be considered.
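The segment-then-align idea can be sketched in a few lines. This is a simplified illustration, not the MSD program: segments are assumed to be connected when residue numbers are consecutive, and each segment is placed by exact substring search rather than a true sequence alignment. The SEQRES string is the eleven residues listed for 1HG1 chain A. The gap at residue 8 is hypothetical, added only to produce two segments.

```python
def connected_segments(atom_residues):
    """Split (residue_number, one_letter_code) pairs into runs of consecutive numbers."""
    segments = []
    for number, code in atom_residues:
        if segments and segments[-1][-1][0] + 1 == number:
            segments[-1].append((number, code))
        else:
            segments.append([(number, code)])
    return segments


def align_segments_to_seqres(seqres, segments):
    """Return one entry per SEQRES position: the ATOM residue number, or None if unobserved."""
    alignment = [None] * len(seqres)
    search_from = 0
    for segment in segments:
        text = "".join(code for _, code in segment)
        start = seqres.find(text, search_from)
        if start < 0:
            raise ValueError(f"segment {text!r} not found in SEQRES")
        for offset, (number, _) in enumerate(segment):
            alignment[start + offset] = number
        search_from = start + len(segment)
    return alignment


seqres = "ADKLPNIVILA"
# Residues 4-11 are observed in the slide; residue 8 is dropped here as a hypothetical gap.
atoms = [(4, "L"), (5, "P"), (6, "N"), (7, "I"), (9, "I"), (10, "L"), (11, "A")]
segments = connected_segments(atoms)
seqres_atom = align_segments_to_seqres(seqres, segments)
print(seqres_atom)  # [None, None, None, 4, 5, 6, 7, None, 9, 10, 11]
```

Positions left as `None` are the unobserved residues. Very short segments can match in more than one place, which is one reason a real implementation would need a proper alignment step.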
Residue Mapping Program 2
- A similar operation is performed on the UniProt sequence and connected segments from the ATOM/HETATM records to get the UNP-ATOM alignment.
- The SEQRES-ATOM and UNP-ATOM alignments are then merged to get the final alignment.
- This is repeated for each chain in the PDB archive (with a UNP cross-reference).
- The mapping is loaded into the MSD relational database and validated.
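Merging the two alignments can be pictured as joining them on the shared ATOM residue number. The sketch below is an assumption about how such a merge could look, not the actual program. Unobserved SEQRES positions are left as `None` here, although the slide for 1HG1 shows them mapped as well, so a full implementation would need an extra step for them.

```python
def merge_alignments(seqres_to_atom, unp_to_atom):
    """Join SEQRES-ATOM and UNP-ATOM alignments on the ATOM residue number.

    seqres_to_atom: {seqres_serial: atom_number or None}
    unp_to_atom:    {unp_position: atom_number}
    Returns {seqres_serial: unp_position or None}.
    """
    atom_to_unp = {atom: unp for unp, atom in unp_to_atom.items()}
    return {
        serial: atom_to_unp.get(atom) if atom is not None else None
        for serial, atom in seqres_to_atom.items()
    }


# Values follow the first six rows listed for 1HG1 chain A.
seqres_to_atom = {1: None, 2: None, 3: None, 4: 4, 5: 5, 6: 6}
unp_to_atom = {25: 4, 26: 5, 27: 6}
merged = merge_alignments(seqres_to_atom, unp_to_atom)
print(merged)  # {1: None, 2: None, 3: None, 4: 25, 5: 26, 6: 27}
```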
Integrating data from MSD into CATH
- Protocols have been developed for regular imports of a subset of the MSD data warehouse into a local CATH database set up in Oracle 9i.
- For example, information on the biological unit and on protein-ligand interactions will be integrated to increase functional annotations for CATH domain families.
MSD & CATH Data Exchange
Two-step process of data synchronisation
- Data are moved from the MSD search database to the CATH-UCL site using a combination of Oracle Export/Import and SQL*Loader utilities.
- Subsequent updates in the MSD database are pushed to the CATH site using an incremental replication mechanism.
- Data from the CATH site are pushed to the MSD site, using the same two-step process.
- The two databases are synchronised.
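The two-step pattern (a bulk copy, then replaying later changes) can be illustrated with plain dictionaries. This toy sketch stands in for the Oracle Export/Import, SQL*Loader and replication tooling named above. The function names, the change-log format and the "release" values are all assumptions made for illustration.

```python
def initial_load(source_rows):
    """Step 1: bulk copy of the source table (stand-in for export/import)."""
    return dict(source_rows)


def apply_changes(target_rows, change_log):
    """Step 2: replay changes recorded at the source since the last sync."""
    for operation, key, value in change_log:
        if operation == "delete":
            target_rows.pop(key, None)
        elif operation == "upsert":
            target_rows[key] = value
        else:
            raise ValueError(f"unknown operation: {operation}")
    return target_rows


# Hypothetical contents; the keys reuse entry identifiers mentioned in the slides.
msd_rows = {"1cbw": "release 1"}
cath_copy = initial_load(msd_rows)

change_log = [("upsert", "8ruc", "release 1"), ("upsert", "1cbw", "release 2")]
msd_rows = apply_changes(msd_rows, change_log)
cath_copy = apply_changes(cath_copy, change_log)
```

Running the same two functions in the opposite direction mirrors the CATH to MSD leg of the exchange.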
Structure: SCOP, CATH
Sequence: UniProt (née Swiss-Prot/TrEMBL/PIR), InterPro, GO, Pfam
Function: IntEnz
Literature: Medline