High Performance Application Program Interfaces to

Enabling Rapid Interaction
with the Protein Data Bank
Alexy Khrabrov
Rutgers University
John D. Westbrook
Rutgers University
Goals
• Provide application and database access to macromolecular
structure data
• Follow a standards-based approach (OMG MMS finalized 2001)
• Build on the informatics structure of the PDB data ontology
• Provide high-performance access
• Provide direct access to compact binary data structures (e.g.
coordinates)
• Provide broad granularity of access (individual atoms to
biological assemblies)
Program Level Access to the
Details of Molecular Structure
Ligand – Which ligands are contained
within the entry?
Chain/Entity – Extract the sequence
and coordinates for each molecular
entity.
Secondary Structure – Extract
helices and sheets for the entry.
Residues/Atoms - What is the
environment of this residue? Extract
the coordinates for a selection of
atoms or residues.
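The "environment of this residue" question above can be sketched as a distance query over atom coordinates. This is a minimal illustration, not the API itself: the `Atom` record, the `residue_environment` function, and the 4.0 Å default cutoff are all hypothetical names and values chosen for the example.

```python
import math
from collections import namedtuple

# Hypothetical atom record; the real API exposes richer AtomSite structures.
Atom = namedtuple("Atom", "res_id name x y z")

def distance(a, b):
    """Euclidean distance between two atoms."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)

def residue_environment(atoms, res_id, cutoff=4.0):
    """Return sorted ids of residues with any atom within `cutoff` of `res_id`."""
    target = [a for a in atoms if a.res_id == res_id]
    nearby = set()
    for other in atoms:
        if other.res_id == res_id:
            continue  # skip the query residue's own atoms
        if any(distance(other, t) <= cutoff for t in target):
            nearby.add(other.res_id)
    return sorted(nearby)

# Three made-up atoms spaced 3.8 apart along the x axis.
atoms = [
    Atom(1, "CA", 0.0, 0.0, 0.0),
    Atom(2, "CA", 3.8, 0.0, 0.0),
    Atom(3, "CA", 7.6, 0.0, 0.0),
]
print(residue_environment(atoms, 2))
```

The brute-force loop is quadratic in the number of atoms; a spatial grid or tree would be the usual choice for full-size entries.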
API Architecture Features
• API organization is based on the PDB Exchange Data Dictionary;
access methods are provided at the level of data
categories/classes
• PDB Exchange Dictionary provides the content to automatically
generate:
• OMG Interface Definition Language (IDL) and access classes
• SQL queries required to support the CORBA server
• Software to load PDB datafiles in memory or into a
supporting relational database engine
Current Data Dictionaries
http://deposit.pdb.org/mmcif/
• PDB data exchange (XML Schema/CIF)
• mmCIF
• NMR
• 3D-EM
• Modeling
• Crystallization
• Symmetry
• Image data
• BIOSYNC
• Including structural genomics and data harvesting extensions
Extending Data Dictionaries for
Deposition
• X-ray
– macromolecular naming, source organism, crystallization and
cell parameters, data collection, structure solution and
phasing, model building, refinement, model quality
• NMR
– explicit details on sample preparation, contents and
conditions, constraints, force constants, related statistics
• Protein Production
– source information, target gene production, bacterial cloning,
bacterial expression, purification
Elements of Dictionary Metadata
• Data Attributes
– Definition
– Examples
– Data type (primitive type/regular expression patterns)
– Range or allowed values
• Classes
– Categories
– Subcategories
– Category groups
• Associations
– Parent-child relationships
– Interdependencies/exclusivity
– Methods
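The metadata elements listed above can be pictured as small records. The sketch below is an assumption-laden illustration, not the dictionary's actual representation: the class names `AttributeDef` and `CategoryDef`, their fields, the regular expression, and the definition wording are all invented for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class AttributeDef:
    """Hypothetical record for one data attribute."""
    name: str
    definition: str
    pattern: str                                   # data type as a regular expression
    value_range: Optional[Tuple[float, float]] = None
    examples: List[str] = field(default_factory=list)

    def is_valid(self, text):
        """Check a raw value against the type pattern, then the allowed range."""
        if re.fullmatch(self.pattern, text) is None:
            return False
        if self.value_range is not None:
            low, high = self.value_range
            return low <= float(text) <= high
        return True

@dataclass
class CategoryDef:
    """Hypothetical record for a category (class) and its parent, if any."""
    name: str
    attributes: List[AttributeDef]
    parent: Optional[str] = None

occupancy = AttributeDef(
    name="occupancy",
    definition="Fraction of the atom present at this site (illustrative wording).",
    pattern=r"[0-9]*\.?[0-9]+",
    value_range=(0.0, 1.0),
    examples=["1.00", "0.50"],
)
atom_site = CategoryDef(name="atom_site", attributes=[occupancy])

print(occupancy.is_valid("0.50"), occupancy.is_valid("1.5"), occupancy.is_valid("abc"))
```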
Automatic Production of Macromolecular Structure API Components
[Diagram: PDB Exchange Dictionary + API Specific Data Dictionaries;
Metamodel Framework; CORBA IDL, SQL Schema, XML DTD/Schemas,
Data Loaders; Database Access Classes]
Macromolecular Structure API Data Flow
[Diagram: mmCIF Parsers; XML Files; mmCIF Data Files (Data
Reference Standard); Relational Database; CORBA Server;
Applications]
Metadata Framework
• PDB Exchange Dictionary
• Defines content model
• Grouping Dictionary
• Maps dictionary content to API organization
• Assigns attributes to API aggregate data types
and indices
• Schema Mapping Dictionary
• Maps content to physical storage layer
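One way to picture the three dictionaries working together is as three lookup tables chained from an API field down to storage. This is only a sketch: the table and column names, the `AtomSite.cartn` key format, and the `storage_columns` helper are assumptions made for illustration.

```python
# Content model (stand-in for the PDB Exchange Dictionary): category -> attributes.
EXCHANGE = {"atom_site": ["id", "Cartn_x", "Cartn_y", "Cartn_z", "occupancy"]}

# Grouping dictionary (hypothetical layout): API aggregate field -> dictionary items.
GROUPING = {
    "AtomSite.cartn": ["atom_site.Cartn_x", "atom_site.Cartn_y", "atom_site.Cartn_z"],
    "AtomSite.occupancy": ["atom_site.occupancy"],
}

# Schema mapping dictionary (hypothetical names): dictionary item -> (table, column).
SCHEMA = {
    "atom_site.Cartn_x": ("ATOM_SITE", "CARTN_X"),
    "atom_site.Cartn_y": ("ATOM_SITE", "CARTN_Y"),
    "atom_site.Cartn_z": ("ATOM_SITE", "CARTN_Z"),
    "atom_site.occupancy": ("ATOM_SITE", "OCCUPANCY"),
}

def storage_columns(api_field):
    """Resolve an API field to the physical columns that hold its data."""
    columns = []
    for item in GROUPING[api_field]:
        category, attribute = item.split(".", 1)
        if attribute not in EXCHANGE[category]:
            raise KeyError("%s is not defined in the content model" % item)
        columns.append(SCHEMA[item])
    return columns

print(storage_columns("AtomSite.cartn"))
```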
Automatic Generation of IDL
• Metadata framework is input data for
automated generation of CORBA IDL
• IDL is a platform-independent definition of
the API
• IDL is used to produce client stubs and
server skeleton classes on any platform
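The idea of generating IDL from metadata can be sketched with a few lines of text templating. The `generate_idl_struct` function and its (type, name) input format are hypothetical; the actual generator and its inputs are not shown in these slides.

```python
def generate_idl_struct(name, fields):
    """Render an IDL struct from a list of (idl_type, field_name) pairs."""
    lines = ["struct %s" % name, "{"]
    for idl_type, field_name in fields:
        lines.append("    %s %s;" % (idl_type, field_name))
    lines.append("};")
    return "\n".join(lines)

# A subset of fields, as they might be derived from the metadata framework.
fields = [("string", "id"), ("VectorXYZ", "cartn"), ("float", "occupancy")]
print(generate_idl_struct("AtomSite", fields))
```

Because the struct text is derived entirely from the field list, adding an attribute to the dictionary would change the generated interface without hand-editing IDL.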
Automatic Generation of API Server
• Metadata framework is input data for
automated generation of server access
classes:
• SQL access methods
• Implementation of abstract skeleton methods
using DB2 CLI
• Integrate with any custom server methods
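Generated SQL access methods can be illustrated as query strings built from category metadata. The sketch below uses Python's built-in `sqlite3` module as a stand-in for DB2 CLI, which is a deliberate substitution; the table layout, the `generate_select` helper, and the "DEMO" row are made up for the example.

```python
import sqlite3

def generate_select(table, columns, key_column):
    """Build a parameterized SELECT from metadata-supplied identifiers.

    Table and column names come from trusted metadata; the key value is
    bound as a parameter at execution time.
    """
    column_list = ", ".join(columns)
    return "SELECT %s FROM %s WHERE %s = ?" % (column_list, table, key_column)

# In-memory database with one made-up row, standing in for the real store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atom_site (entry_id TEXT, id TEXT, cartn_x REAL)")
conn.execute("INSERT INTO atom_site VALUES ('DEMO', '1', 1.0)")

query = generate_select("atom_site", ["id", "cartn_x"], "entry_id")
rows = conn.execute(query, ("DEMO",)).fetchall()
print(query)
print(rows)
```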
API Server Extension
• Extend content model through PDB
exchange data dictionary
• Extend supporting dictionaries in metadata
framework
• Autogenerate IDL
• Autogenerate skeleton implementations
• Integrate custom code
Supporting Alternative APIs
• Adapt IDL autogenerator
• Revise MDF->IDL to MDF->new API spec
• Adapt autogenerator of server skeleton
implementations
• Integrate custom methods
Server Availability
• OpenMMS toolkit provides Java interface to
Oracle/MySQL using JDBC (core mmCIF classes)
• C++ server using native interface to DB2 (EEE)
implemented on 4-node Linux cluster (NDB beta
test in Sept.)
• Installation of DB2 (EEE) at SDSC underway to
support high-performance access
Client Program Examples
DsMmsMacromolecularStructure.idl excerpt:
struct AtomSite
{
    string id;
    IndexId type_symbol;
    AtomIndex label;
    IndexId label_entity;
    VectorXYZ cartn;
    float occupancy;
    float b_iso_or_equiv;
};
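On the client side, a struct like this appears as an object with one attribute per member. The sketch below mirrors the excerpt with plain Python dataclasses for illustration only; it is not the generated stub code. `IndexId` is assumed to be an integer index and `AtomIndex` is left untyped because neither is defined in the excerpt; `centroid` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class VectorXYZ:
    x: float
    y: float
    z: float

@dataclass
class AtomSite:
    id: str
    type_symbol: int       # IndexId: assumed here to be an integer index
    label: Any             # AtomIndex: layout not shown in the excerpt
    label_entity: int      # IndexId: same assumption as above
    cartn: VectorXYZ
    occupancy: float
    b_iso_or_equiv: float

def centroid(atoms):
    """Mean Cartesian position of a non-empty list of AtomSite records."""
    n = float(len(atoms))
    return VectorXYZ(
        sum(a.cartn.x for a in atoms) / n,
        sum(a.cartn.y for a in atoms) / n,
        sum(a.cartn.z for a in atoms) / n,
    )

# Two made-up atoms for demonstration.
demo = [
    AtomSite("1", 0, None, 0, VectorXYZ(0.0, 0.0, 0.0), 1.0, 10.0),
    AtomSite("2", 0, None, 0, VectorXYZ(2.0, 4.0, 6.0), 1.0, 12.0),
]
print(centroid(demo))
```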
Client Program Examples
A primary requirement of the design was that it present an interface that was clearly defined and
easy to use from the point of view of developing new applications. The code examples in this
section illustrate how client programs can use the API to quickly access macromolecular structure
data. As a simple example, the following Python code fragment prints the atom identifier
and the Cartesian (x, y, z) position for atoms in the macromolecule 4HHB.
Example 1. Retrieving the AtomSite list for hemoglobin (4HHB) and printing the atomic
coordinates.
    import sys

    # ef: entry factory object, assumed to be obtained earlier (not shown)
    sid = "4HHB"
    try:
        e = ef.get_entry_from_id(sid)
    except Exception:
        print("cannot get entry %s, exiting!" % sid)
        sys.exit(1)
    print("got entry!")

    # Get the atom site list
    atoms = e.get_atom_site_list()
    print("got %d atoms total" % len(atoms))
    print("A few atoms:")
    for a in atoms[:10]:
        # Atom identifier followed by Cartesian x, y, z coordinates
        print("%s\t%.3f %.3f %.3f" %
              (a.id, a.cartn.x, a.cartn.y, a.cartn.z))
Example 2. Listing symmetry information and the residue ranges for the helices of
hemoglobin (4HHB).
    # Get the symmetry information
    s = e.get_sym_info()
    print("space group: %s" % s.space_group)
    print("cell constants: ")
    c = s.acell.unit_cell
    print("a=%.3f, b=%.3f, c=%.3f" %
          (c.length_a, c.length_b, c.length_c))
    print("alpha=%.3f, beta=%.3f, gamma=%.3f" %
          (c.angle_alpha, c.angle_beta, c.angle_gamma))

    # Get the secondary structures
    sconfs = e.get_struct_conf_list()
    print("Secondary structures:")
    for a in sconfs:
        # Element id, then start and end residue (chain, residue name, number)
        print("%s\t%s %s %s\t--> %s %s %s" % (
            a.id,
            a.beg_auth.asym.id, a.beg_auth.comp.id, a.beg_auth.seq.id,
            a.end_auth.asym.id, a.end_auth.comp.id, a.end_auth.seq.id))
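Once the cell constants have been retrieved as in Example 2, a client can derive quantities such as the unit cell volume. The sketch below applies the standard formula for a general (triclinic) cell; `unit_cell_volume` is a hypothetical helper, not part of the API, and the input values are made up.

```python
import math

def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of a general unit cell; lengths in angstroms, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(v)) for v in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)

# Orthogonal cell with made-up edge lengths.
print("%.1f" % unit_cell_volume(10.0, 20.0, 30.0, 90.0, 90.0, 90.0))
```

With the objects from Example 2, the call would take `c.length_a`, `c.length_b`, `c.length_c`, `c.angle_alpha`, `c.angle_beta`, and `c.angle_gamma` as its arguments.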
Client Availability
• Example clients provide category-level access in
Java OpenMMS and C++ native servers
• Clients available in Java, C++ and Python
• C++ API extended to support efficient detailed
molecular selections (e.g. coordinates of secondary
structure elements, symmetry related molecular
elements, biological assemblies)
Access
• Protein Data Bank Site
• http://www.pdb.org/
• OpenMMS site (Java implementation)
• http://openmms.sdsc.edu
• PDB Software Download Site (C++ and
Python implementation)
• http://deposit.pdb.org/mmcif/FILM/
• PDB Dictionary Resource Site
• http://deposit.pdb.org/mmcif/
• PDB Beta Data Site
• ftp://beta.rcsb.org/pub/pdb/uniformity/data/