RCSB Workshop on Biological Macromolecular Structure Models


Transcript RCSB Workshop on Biological Macromolecular Structure Models

Topic 4: Current Database Resources and Models
Contributors: Alexei Adzhubei, Stephen Bryant, Torsten Schwede, Kim Henrick, Daron Standley
Discussion Leader: Torsten Schwede
Workshop on Biological Macromolecular Structure Models
RCSB PDB • Piscataway, NJ • November 19-20, 2005
Topic 4: Current database resources and models
Some questions for discussion:
• Can models be made available in real time?
• How are models currently made available to the community?
• What resources are required to do this?
• How are they archived?
• How are they validated?
How are models currently made available to the community?
• Standalone software: software for comparative modelling (e.g. MODELLER, Insight-II, MacroModel, Composer, MolIDE, WhatIf, etc.), protein-protein docking, virtual screening, molecular dynamics, etc.
• Interactive modelling servers: servers for comparative protein modelling, fold recognition, ab initio modelling, secondary structure prediction, protein-protein docking, etc.
• Model databases: databases of protein structure models (e.g. ModBase and the SWISS-MODEL Repository), plus several specialized modelling resources.
What are the roles of model databases?
• Repository function
"Where can I download the data for the model described in the paper Chaplin et al. (2005) with the model ID xxxx?"
  - Assignment of a stable accession code is required (DOI, AC code, etc.).
  - Archiving is required; single point of entry (database or portal).
• Information access provider
"Which structural information is currently available for my gene/ORF/protein of interest in my specific pet organism?"
  - Information should be up to date and complete; frequent updates are required.
  - Evaluation and accuracy assessment are essential; query results need a clear assignment of expected information accuracy (experimental > comparative > ab initio, etc.).
• Archive function
"How accurate was the prediction of protein/complex XYZ by method SuperModel in the year 2002?"
  - An archive of models superseded by experimental structures or better models, for method evaluation purposes (e.g. results of CASP, EVA, CAPRI, etc.).
Public databases with model data content
• RCSB PDB: subset of model data (excluded from the main archive). http://www.rcsb.org/pdb/cgi/models.cgi

Protein model archives:
• MDB: searchable database with models from the PDB, based on the MDB dictionary (mmCIF and PDBML). http://sgp.uio.no/mdb/
• PMDB: model repository allowing for model deposition; also contains models from previous CASPs. http://a.caspur.it/PMDB/

Protein model databases with updated data content:
• ModBase: exhaustive set of comparative models generated automatically by ModPipe (MODELLER). http://modbase.compbio.ucsf.edu/
• SWISS-MODEL Repository: large-scale automated homology modelling by the SWISS-MODEL pipeline. http://swissmodel.expasy.org/repository/

... and many other resources specialized in certain protein families, organisms, genomes, methods, etc.
How can models be made available in real-time?
Options for discussion:
1. Centralized model database
2. Federated distributed databases
3. Distributed Annotation System (DAS)
4. Interactive modelling META-servers: "modelling on the fly"
What resources are required?
• Storage requirements for model repository / archives
Example: Structure Navigator, average size of a single model (query size: 233 residues):
  - Residue alignment: 4.7 KB
  - Cα coordinates: 5.8 KB
  - All-atom coordinates: 46 KB
• PDB format (compressed): ~1 kB per residue
• 1.5 TB for all-atom models of UniProt (1.5 × 10⁹ amino-acid residues, excluding splice variants) in compressed PDB format.
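The 1.5 TB figure follows directly from the per-residue estimate; a back-of-the-envelope check (decimal units are an assumption):

```python
# Back-of-the-envelope check of the 1.5 TB storage estimate.
residues = 1.5e9          # amino-acid residues in UniProt, excluding splice variants
bytes_per_residue = 1000  # ~1 kB per residue in compressed PDB format (slide estimate)

total_bytes = residues * bytes_per_residue
total_tb = total_bytes / 1e12
print(f"{total_tb:.1f} TB")  # 1.5 TB
```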
How can models be made available in real-time?
Option 1: Centralized model database for all modelling information
• Advantages:
  - Single entry point; data format homogeneity can be ensured.
  - Allows direct comparison of data sets, e.g. for evaluation purposes.
• Disadvantages:
  - Data volumes will be immense (e.g. for full-atom models, MD trajectories, docking results, etc.).
  - Huge effort for data standardization and documentation; modelling data is complex and requires detailed documentation, interpretation, and accuracy evaluation.
  - Huge effort for weekly data updates (comparative models need to be updated, e.g., when the target-template alignment has changed or better templates become available).
How can models be made available in real-time?
Option 2: System of federated distributed databases
Exchange of "meta-data" between participating sites, describing which targets models are available for, and their expected accuracy.
• Advantages:
  - Exchange of "meta-data" allows a synoptic view of available models.
  - Consistency of accuracy evaluation can be ensured by a standardized software stack.
  - Relatively small volume of weekly data exchange for updates.
  - Complex documentation of individual models would remain the task of the individual partner sites.
• Disadvantages:
  - Model coordinate data is not available as a single archive.
Example: InterPro
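As an illustration of the "meta-data" idea, a sketch of the compact record a partner site might publish weekly (all field names, values, and the helper are hypothetical, not an agreed standard):

```python
# Hypothetical "meta-data" record a federated partner site could publish
# so that other sites know which targets it holds models for, by which
# method, and at what expected accuracy. Field names are illustrative only.
meta_record = {
    "provider": "SWISS-MODEL Repository",  # partner site
    "target": "UniProt:P69905",            # example target sequence accession
    "model_id": "SMR-0001",                # hypothetical site-local accession
    "method": "comparative",               # comparative / fold recognition / ab initio
    "expected_accuracy": "medium",         # site's own accuracy estimate
    "last_updated": "2005-11-14",
}

def targets_covered(records):
    """Synoptic view: which targets have at least one model at any site."""
    return {r["target"] for r in records}

print(sorted(targets_covered([meta_record])))  # ['UniProt:P69905']
```

Only records like this would be exchanged weekly; the full coordinate data and documentation would stay with each partner site.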
How can models be made available in real-time?
Option 3: Similar to the Distributed Annotation System (DAS)
DAS, the Distributed Annotation System, is conceptually composed of a reference server and annotation server(s). A single client (e.g. a web browser) integrates information from multiple servers by gathering annotation information from multiple distant web sites, collating the information, and displaying it to the user in a single view. (http://www.biodas.org)
• Advantages:
  - Little coordination is needed among the various information providers.
• Disadvantages:
  - Limited to sequence-centric model views (sugars? docking?)
  - Model coordinate data is not available as a single archive.
  - Model meta-data is not available as a single archive.
  - Model accuracy assessment?
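The client-side collation step DAS relies on can be sketched as follows (in-memory toy data standing in for server responses; this is not the DAS wire protocol):

```python
# Sketch of DAS-style collation: a client merges per-residue-range annotations
# from several independent annotation servers into one sequence-centric view.
# Server names and annotations below are hypothetical.
def collate(annotation_sets):
    """Merge {(start, end): label} maps from many servers into one view,
    recording which server contributed each annotation."""
    view = {}
    for server, annotations in annotation_sets.items():
        for (start, end), label in annotations.items():
            view.setdefault((start, end), []).append((server, label))
    return view

# Two hypothetical annotation servers describing the same reference sequence.
server_a = {(1, 120): "comparative model available"}
server_b = {(1, 120): "domain: globin fold", (130, 145): "low-complexity region"}

merged = collate({"site-a": server_a, "site-b": server_b})
print(merged[(1, 120)])  # both servers annotated residues 1-120
```

The sequence-centric keying is exactly what limits this option: sugars, docked complexes, or trajectories have no single residue coordinate system to merge on.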
How can models be made available in real-time?
Option 4: Modelling META-servers
A real-time meta-server (similar to bioinfo.pl or meta-PredictProtein) submitting to different modelling servers and collating the results.
• Advantages:
  - Only limited development necessary; low requirements for bandwidth and storage.
• Disadvantages:
  - "Real time" means several hours up to several days between request and result, depending on the type of query.
  - Limited to sequence-centric protein model views (sugars? RNA? ligand docking? MD trajectories?)
  - Model accuracy assessment? Selection and ranking of results?
  - Integration of manual models from model repositories?
  - How can we tell in advance whether a suitable model can be delivered for a given query in a meta-server approach?
Model Validation
Model validation is necessary to assign a confidence score to a protein model when the correct structure is not known (unlike CASP, where the correct answer is known to the evaluator).
• Simple geometry checks are insufficient to validate models.
• Statistical potentials, force fields, etc. do not, in our experience, perform well for ranking models from different methods.
• Historical performance: automated modelling methods can be evaluated by assessing their performance on a large set of blind predictions (EVA project). This retrospective data set makes it possible to predict the expected accuracy for a new target protein.
• Consensus methods can be applied if several independent models exist for the same protein. However, the diversity of the model ensemble remains an open issue, i.e. the consensus model is not necessarily the best model. Successful consensus methods use a well-defined set of input components; scoring of arbitrary collections of models is not a solved problem.
• No commonly agreed validation standard exists in the modelling community.
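To make the consensus idea concrete, a toy ranking (this is an assumption-laden sketch, not any published consensus method): given pairwise similarity scores between candidate models of the same protein, rank each model by its mean agreement with the others.

```python
# Toy consensus ranking: models that agree most with the ensemble score highest.
# As noted above, the top consensus model is not necessarily the best model,
# and the result depends strongly on the diversity of the input ensemble.
def consensus_rank(similarity):
    """similarity[i][j] in [0, 1]; returns model indices, best consensus first."""
    n = len(similarity)
    mean_sim = [
        sum(similarity[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: mean_sim[i], reverse=True)

# Three hypothetical models: 0 and 1 agree closely, 2 is an outlier.
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(consensus_rank(sim))  # [1, 0, 2]
```

If the outlier (model 2) happened to be the correct structure, this scheme would still rank it last, which is precisely the open issue the slide raises.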
Things to do ...
• Standards for model accuracy validation
  - Model validation needs to account for the biological application of the model. Functional assignment by fold recognition, design of point mutations, and virtual ligand screening have different accuracy requirements.
• Data standard for model exchange and archiving:
  - Use common data formats (e.g. mmCIF/XML etc.)
  - Sequence and cross-references to sequence databases
  - Specific data for different types of models, e.g. coordinates of the model(s), structural or functional annotation, trajectories, etc.
  - Supporting evidence (e.g. alignments and templates underlying comparative models)
  - Confidence measures: per model, per chain, per residue, per atom
  - Parameters and methods used for all individual modelling steps
• Define a minimal model annotation standard for data exchange.
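One way such a minimal annotation record could look in code (every field name here is an illustration of the items listed above, not an agreed specification):

```python
# Illustrative minimal model annotation record; field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    accession: str                # stable accession code (DOI, AC code, ...)
    target_sequence_ref: str      # cross-reference to a sequence database
    model_type: str               # comparative / fold recognition / ab initio ...
    coordinates: str              # e.g. a reference to an mmCIF/PDBML file
    per_residue_confidence: list  # one confidence value per residue
    templates: list = field(default_factory=list)         # supporting evidence
    method_parameters: dict = field(default_factory=dict) # per-step parameters

record = ModelRecord(
    accession="doi:10.0000/example",       # hypothetical identifier
    target_sequence_ref="UniProt:P69905",  # example sequence cross-reference
    model_type="comparative",
    coordinates="model.cif",
    per_residue_confidence=[0.8, 0.7, 0.9],
    templates=["PDB:1A3N"],
    method_parameters={"alignment": "HMM", "refinement": "none"},
)
```

The point of the sketch is only that confidence is attached per residue and evidence (templates, alignments, parameters) travels with the coordinates.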
Topic 4: Current database resources and models
Some questions for discussion:
• Can models be made available in real time?
• How are models currently made available to the community?
• What resources are required to do this?
• Which models should be archived? How are they archived?
• How are they validated?
Example: InterPro
(Resources shown: PDB-MSD, CATH, ModBase, SWISS-MODEL Repository)
Slides by Daron Standley
Real-time Model Server
Model Data Server or Modelling Meta-Server?
• Input: amino-acid sequence, sequence ID, Gene Ontology ID, keyword
• Primary output: “consensus” alignment, secondary/tertiary structure, functional annotation, confidence score
• Secondary output: quaternary structure, interactions, literature references, fold classification, list of sequence/structural neighbours, alternate solutions
Structure Navigator
http://www.pdbj.org/strucnavi/
• Over 35,000,000 pre-computed alignments
• Both SOAP and web-based interfaces
• Rapidly generates alignments and 3D superpositions
Storage requirements for Structure Navigator: average size of a single model (query size: 233 residues):
• Residue alignment: 4.7 KB
• Cα coordinates: 5.8 KB
• All-atom coordinates: 46 KB
Slides by Alexei Adzhubei
Current public, searchable databases with protein models as primary data content
• RCSB PDB: has a subset of model data (excluded from the main archive). http://www.rcsb.org/pdb/cgi/models.cgi
• SWISS-MODEL Repository: SWISS-MODEL automated homology modelling. http://swissmodel.expasy.org/repository/
• MODBASE: models generated automatically by MODELLER. http://modbase.compbio.ucsf.edu/
• MDB: models from the PDB, compatible with the PDB, built using the MDB dictionary (mmCIF and PDBML). http://sgp.uio.no/mdb/
• PMDB: CASP models. http://a.caspur.it/PMDB/
• ?
How to store and access models?
Create a central repository? Create a network of databases?
Major questions:
• What models to store (automatic, expert)?
• What primary data to store?
• What generated data to store (e.g. model accuracy), and how to generate it?
• What data to distribute to participating databases in a network of databases?
Things to do:
• Develop a “model data description” standard acceptable to everyone.
• Develop a data exchange format based on mmCIF, PDBML, or another standard, and use it for data exchange between databases and servers.
• Develop a “minimal data” standard for data exchange.
Slides by Kim Henrick
MODELS of what?
1. Protein
2. Amino acids
3. Sugars
4. Docked polymer-polymer
5. Ligand docking
1. Models generated automatically should be:
   (a) done on demand or stored, depending on available CPU resources
   (b) if stored, recalculated each week when new sequences contribute to the alignment (HMMs etc.)
2. Models generated with manual intervention should always be stored.
3. Models with some experimental input, e.g.:
   (a) large multimeric complexes: part EM, part X-ray, part solution scattering, part dynamics
   (b) models of large fibre-like structures built from X-ray structures of single domains, small fragments, or NMR-determined fragments, e.g. pili fibres, multidomain cell-cell adhesion proteins, muscle, heparin, keratan
How are they archived?
Require UML/mmCIF/XML/SQL to describe:
a. Sequence and cross-references to sequence database information
b. Coordinates of the model(s), possibly an ensemble of the highest-scoring set
c. Per-model, per-chain, per-residue confidence measures
d. Method used for domain boundaries, and specification of domains
e. Template identification procedure
f. Algorithms used for backbone, loop, and side-chain construction/versioning
g. Alignments underlying template-based modelling (HMMs)
h. Scoring/ranking method
i. Annotation of the model
j. Model refinement methods applied
k. Molecular force fields used
l. Model dependencies, e.g. on UniProt weekly updates
m. Specific data for different types of models
How are they validated?
• Simple geometry checks are insufficient: a model should have near-perfect geometry anyway. It is not clear whether it is possible to validate a theoretical model at all.
• Validation is really only possible by tests against targets whose structures are unknown to the testers, e.g. CAPRI/CASP.
• Also possible:
  - matching predicted residue interactions with text mining
  - matching predicted residue pairs with known function
  - matching predicted function to experimental details
Old Slides
Public databases with model data content
Other examples of specialized model resources ...
... for specific protein families:
• GPCRDB, information system for GPCRs, including 3D models
• PKR, the Protein Kinase Resource
... for selected organisms and genomes:
• 3DGenomics, structural annotations for ~100 proteomes
• PlasmoDB, comparative models for Plasmodium falciparum
• Arabidome3D, comparative models for Arabidopsis thaliana
More projects are in the process of generating large model data sets, e.g. the Human Proteome Folding project.
Which type of models?
• Proteins, large multimeric complexes, sugars, nucleotides, docked ligands, mixed models (fibre diffraction, scattering in solution, EM, etc.)
• How much experimental information has contributed to a model?
Level of detail / resolution (figure: spectrum ordered by experimental vs. model contribution):
• High-resolution experiments
• QM/MM simulations
• Medium-resolution experiments
• MD simulations
• Low-resolution experiments + modelling
• High-homology comparative modelling
• Low-homology comparative modelling / docking
• Ab initio predictions