Annotation_02Nov2009 - European Bioinformatics Institute

Annotation Procedures for Structural Data
Deposited in the PDBe at EBI
EBI is an Outstation of the European Molecular Biology Laboratory.
The Protein Data Bank in Europe (PDBe) group
• Established in 1996 at the European Bioinformatics Institute: an autonomous structural database capability in Europe.
• One of the four sites around the world where structural data can be deposited.
• Stable and clean repository for macromolecular structure data.
• Services that allow users to access, search and retrieve structural data from a single web access point.
Data Processing at PDBe
Depositor → AutoDep4.0 → “Raw” PDB file → Automated + Manual Curation → Depositor’s comments → “Annotated” PDB file → Structure release
Data Deposition at the PDBe using AutoDep4.0
• Structure deposition and archival tool developed at the PDBe (EBI).
• Based on Java/XML technology.
• Freely available under license to academic and industry users.
• Easy to install and use for in-house archiving before deposition to the PDB via the PDBe interface.
http://www.ebi.ac.uk/pdbe-xdep/autodep
The Curation Process
• Raw information obtained from the depositor:
a) atomic coordinates (proteins, nucleic acids, ligands, solvents)
b) source of the macromolecule
c) number of protein chains present in the asymmetric unit
d) experimental data (structure factor file)
• Three phases of curation:
1) Automated curation
2) Manual curation
3) Final checks
Automated Curation
• Consists of a series of programs written in Fortran and Perl
• Annotators contribute ideas and programs to improve the curation process
• We work in a Unix command-line interface
• This is the first step: a big wrapper
The Wrapper
• Automatically generates:
• Chain ID for every HETATM and HOH (each gets the chain ID of the closest polypeptide chain)
• Quaternary structure, according to PISA (REMARK 300 and 350)
• Structure validation: close contacts (REMARK 500) and chirality checks
• Solvent molecules that lie farther than expected from the protein (REMARK 525)
• HELIX, SHEET, SSBOND, CISPEP records
• Residue-by-residue mapping against the UniProt database
• DOHLC output
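The chain ID assignment and the distant-solvent check can be sketched as follows. This is a minimal Python illustration, not the wrapper's actual Fortran/Perl code: the function names, the tuple layout of the atoms and the 5.0 Å cutoff are all assumptions made for the example, and symmetry-related molecules are ignored.

```python
import math

# Assumed cutoff (in angstroms) for calling a solvent molecule "distant";
# the value actually used by the wrapper is not given in the slides.
DISTANT_SOLVENT_CUTOFF = 5.0


def nearest_chain(het_xyz, polymer_atoms):
    """Return (chain_id, distance) of the polymer atom closest to het_xyz.

    polymer_atoms is a list of (chain_id, (x, y, z)) tuples.
    """
    best_chain, best_dist = None, math.inf
    for chain_id, xyz in polymer_atoms:
        dist = math.dist(het_xyz, xyz)
        if dist < best_dist:
            best_chain, best_dist = chain_id, dist
    return best_chain, best_dist


def assign_chain_ids(het_atoms, polymer_atoms, cutoff=DISTANT_SOLVENT_CUTOFF):
    """Give each HETATM/HOH the chain ID of its nearest polymer atom.

    het_atoms is a list of (res_name, (x, y, z)) tuples. Each result is
    (res_name, chain_id, distance, is_distant_solvent).
    """
    results = []
    for res_name, xyz in het_atoms:
        chain_id, dist = nearest_chain(xyz, polymer_atoms)
        # Only waters are flagged here as candidates for REMARK 525.
        is_distant = res_name == "HOH" and dist > cutoff
        results.append((res_name, chain_id, dist, is_distant))
    return results
```

A brute-force search over all polymer atoms is enough for a sketch; a production tool would more likely use a spatial grid or tree to keep the search fast on large entries.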
Contents of a Curated PDB file
Sequence related information:
1) Sequences (SEQRES): all macromolecules present during crystallization, including expression tags and residues missing from the coordinates due to disorder.
2) Sequence database reference (DBREF): provides the mapping (FASTA alignment) between the sequence (SEQRES) and the UniProt database.
Checks made …
• Is the UniProt accession number correct? The sequence similarity between the UniProt sequence and the target sequence should be at least ~95%.
• Identify the N- and C-terminal cross-references with UniProt and add fragment information (if any) to the COMPND record.
• Merge the data from the UniProt entry into COMPND (molecule name), SOURCE (scientific name of the organism) and KEYWDS.
• Add the EC number, if available.
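The ~95% check on the UniProt accession can be illustrated with a small Python sketch. It assumes the two sequences have already been aligned (for example by FASTA) and are passed in as equal-length strings with "-" for gaps; it measures percent identity over the aligned, non-gap positions as a simple stand-in for "similarity". The function names are hypothetical.

```python
def percent_identity(aligned_seqres, aligned_uniprot):
    """Percent identity over aligned positions where neither side is a gap."""
    if len(aligned_seqres) != len(aligned_uniprot):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_seqres, aligned_uniprot):
        if a == "-" or b == "-":
            continue  # expression tags or truncations do not count against identity
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def accession_looks_correct(aligned_seqres, aligned_uniprot, threshold=95.0):
    """Apply the ~95% rule of thumb from the slide (threshold is adjustable)."""
    return percent_identity(aligned_seqres, aligned_uniprot) >= threshold
```

Skipping gap positions means an N-terminal expression tag present in SEQRES but absent from UniProt does not lower the score, which matches the idea that tags are expected in the deposited sequence.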
09.10.07
Macromolecular Structure Database
http://www.ebi.ac.uk/msd/
Curation procedures continued….
1) If no sequence database reference is available, the sequence is self-referenced (i.e. the database reference is the PDB entry itself).
2) Additional details regarding the sequence (gaps, cloning artifacts, structural disorder) are provided in REMARK 999.
3) Disagreements between a UniProt sequence and the sequence present in the PDB file (SEQADV) are marked as a) engineered mutation, b) conflict or c) microheterogeneity.
4) Residues missing from the coordinates: listed in REMARK 465
5) Non-hydrogen atoms missing from the coordinates: listed in REMARK 470
6) Zero-occupancy residues: REMARK 475
7) Zero-occupancy atoms: REMARK 480
8) Related PDB entries (same UniProt accession numbers) are listed in REMARK 900
9) Backbone discrepancies
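How residues and atoms might be sorted into REMARK 465, 475 and 480 can be sketched in Python. This is an illustration under assumptions: atoms are given as (chain, residue number, atom name, occupancy) tuples, the expected residue numbers come from the SEQRES mapping, and a residue counts as zero-occupancy only when every one of its atoms has occupancy 0. The function names are hypothetical.

```python
from collections import defaultdict


def classify_occupancy(atoms):
    """Split zero-occupancy findings into whole residues and single atoms.

    atoms: iterable of (chain, resseq, atom_name, occupancy).
    Returns (remark_475, remark_480):
      remark_475 -> [(chain, resseq)] residues whose atoms are all zero-occupancy
      remark_480 -> [(chain, resseq, atom_name)] isolated zero-occupancy atoms
    """
    by_residue = defaultdict(list)
    for chain, resseq, atom_name, occ in atoms:
        by_residue[(chain, resseq)].append((atom_name, occ))

    remark_475, remark_480 = [], []
    for key in sorted(by_residue):
        zero = [name for name, occ in by_residue[key] if occ == 0.0]
        if zero and len(zero) == len(by_residue[key]):
            remark_475.append(key)
        else:
            remark_480.extend((key[0], key[1], name) for name in zero)
    return remark_475, remark_480


def missing_residues(expected_numbers, atoms, chain):
    """Residue numbers expected from SEQRES but absent from the coordinates
    (candidates for REMARK 465)."""
    observed = {resseq for c, resseq, _, _ in atoms if c == chain}
    return [n for n in expected_numbers if n not in observed]
```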
Ligand Curation
• Ligands interacting with a protein/DNA chain → substrate, product, inhibitor (drug molecule), metal ion, modified amino acid or nucleotide.
• MODRES token added for modified amino acids and nucleotides which are part of the polymer (i.e. protein/DNA) chain.
• Specialized software (DOHLC: Do Het Link and Connect records) is used to get the bond type, stereochemistry and IUPAC-compliant name for each ligand in the structure.
• DOHLC is a graph-based structure comparison algorithm: it checks each ligand/HET against its dictionary definition and renames residues and atoms.
• Generates REMARK 620 (metal coordination), LINK and CONECT records.
• DOHLC fails on bad geometry, an incomplete ligand or a new HETGROUP.
• If no match is found for a HETGROUP, a new ligand is created.
• HETGROUP with missing atoms: REMARK 610
• HETGROUP with zero-occupancy atoms: REMARK 615
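The bookkeeping around a dictionary comparison can be sketched in Python. This is deliberately not the DOHLC algorithm: DOHLC compares structures as graphs, whereas the sketch below only compares atom names against a dictionary entry, which is enough to show how REMARK 610, REMARK 615 and the "new ligand" case could arise. The function name, the dictionary layout and the example entry are assumptions for illustration.

```python
def check_het_group(res_name, observed_atoms, dictionary):
    """Compare one HET group with its dictionary definition by atom name.

    observed_atoms: dict mapping atom name -> occupancy for this HET group.
    dictionary:     dict mapping residue name -> set of expected atom names.
    """
    definition = dictionary.get(res_name)
    if definition is None:
        # No dictionary match: the slides say a new ligand is created.
        return {"status": "new_ligand", "missing": [], "zero_occupancy": []}

    missing = sorted(definition - set(observed_atoms))            # REMARK 610
    zero = sorted(name for name, occ in observed_atoms.items()
                  if occ == 0.0)                                    # REMARK 615
    return {"status": "matched", "missing": missing, "zero_occupancy": zero}


# Illustrative dictionary with a single small entry (non-hydrogen atoms only).
EXAMPLE_DICTIONARY = {"SO4": {"S", "O1", "O2", "O3", "O4"}}
```

A name-based comparison cannot rename atoms or detect wrong stereochemistry; those steps need the graph matching that the slides attribute to DOHLC.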
Generating Assembly Information
• Biological unit: biologically relevant form of the molecule
• Quaternary structure: the way protein chains tend to associate with one another
• The matrices forming the quaternary structure are reported as BIOMT records in REMARK 350
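To show how BIOMT records are used, here is a minimal Python sketch that reads the BIOMT1 to BIOMT3 rows from REMARK 350 lines and applies one operator to a coordinate. It assumes the fields are separated by whitespace, which holds for typical entries; a strict reader would use the fixed columns of the PDB format instead. The function names are hypothetical.

```python
def parse_biomt(lines):
    """Collect BIOMT operators from REMARK 350 lines.

    Returns {serial: (rotation, translation)} where rotation is a list of
    three rows and translation is a list of three floats.
    """
    operators = {}
    for line in lines:
        fields = line.split()
        if (len(fields) != 8 or fields[:2] != ["REMARK", "350"]
                or not fields[2].startswith("BIOMT")):
            continue
        row = int(fields[2][5]) - 1          # BIOMT1..BIOMT3 -> row 0..2
        serial = int(fields[3])              # operator number
        rotation, translation = operators.setdefault(
            serial, ([None, None, None], [0.0, 0.0, 0.0]))
        rotation[row] = [float(v) for v in fields[4:7]]
        translation[row] = float(fields[7])
    return operators


def apply_operator(operator, xyz):
    """Return rotation * xyz + translation for one BIOMT operator."""
    rotation, translation = operator
    return tuple(
        sum(rotation[i][j] * xyz[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
```

Applying every operator listed for a biomolecule to the chains it names builds the full assembly from the asymmetric unit.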
ASU contents → Expand crystal symmetry → Analyze surface and contacts → Possible assemblies
Loss of accessible surface area >10% of total surface.
[Figure: PISA assemblies for entry 1E94. Panel annotations: "True complexes also look good!", "Best!!"]
Structure validation
• Final checks:
• Programs check for PDB format accuracy and internal consistency
• Manual check by another annotator
• Automatic generation of the letter to the depositor + manual addition of special comments