Transcript PowerPoint

A Virtual File System for the
PubChem Chemical Structure and
Bioassay Database
Wolf-D. Ihlenfeldt
Xemistry GmbH
Königstein, Germany
PubChem on the Web
PubChem Project Mission
 Provide comprehensive public access to screening
data generated by NIH Roadmap Initiative and other
public research projects
 Link assay results, structures screened, literature
references, basic computed properties, external
information sources
 Convenient and free queries and download of filtered
structure and assay data for further research
 Wait a moment - they call that convenient ?!?
Problems with Interactive Data
Retrieval in PubChem
 Separation between text/data (Entrez) and structure query
systems with inconsistent interfaces
 Dumbed-down structure query interface, but overengineered
text query tools
 Obscure Entrez syntax for combining multiple subqueries
 Quirky Entrez approaches regarding numerical queries,
quoting, field names, output formats, history titles, auto
query expansion…
 History of history problems
Interactive Data Retrieval in
PubChem
 Very limited customization of downloadable data
content
 Full structure data record only as ASN.1 blob,
optionally with gratuitous homebrew XML wrapper
 SD file is incomplete, only an approximation of the structure,
and still not compatible with a strict interpretation of the MDL
standards
 Nevertheless, well done system for browsing, but not
for serious data collection
Routes to Programmatic Data
Retrieval from PubChem
Some disconnected components exist:
 Entrez e-utils
Basic access to Entrez text databases, get status, retrieve ID sets,
some record data or set history via simple text-based queries
 PubChem structure display pages
Can be abused for direct download of single records in ASN.1 format,
bypassing the FTP wait queue
 PubChem Power User Gateway (PUG)
XML/ASN.1 specification for executing simple structure queries and
getting ID sets, history handle from PubChem servers
 No direct SQL server db access ever, that's policy!
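As a rough illustration of the "simple text-based queries" that E-utils accepts, the sketch below builds an ESearch URL. The helper name `esearch_url` is hypothetical; the base URL and the `db`, `term`, `retmax` and `usehistory` parameters are standard E-utils parameters, and `pccompound` is assumed here as the Entrez name of the PubChem compound database. Nothing is sent over the network.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pccompound", retmax=20, use_history=False):
    """Build an ESearch URL that asks Entrez for the IDs matching a text query."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        # Ask the server to keep the result set on its history server.
        params["usehistory"] = "y"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


print(esearch_url("aspirin", retmax=5))
```

The returned URL can be fetched with any HTTP client; the reply is an XML document containing the matching ID list.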
The Cactvs Toolkit
 Universal scripting environment for chemical data
processing
 Framework of chemical objects (ensembles,
reactions, tables, …), dynamically defined object
properties with associated computation methods,
and extension modules (I/O modules for different
types of files, database access, data type handlers,
command extensions,…)
 Lazy computation – request some data on an
object, and a way will be found to get it if possible
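The lazy-computation idea can be pictured with a small registry of computation methods. This is only a toy sketch, not how Cactvs is implemented: the class `LazyObject`, the registry `METHODS` and the property names are all made up for illustration.

```python
from collections import Counter


def hill_formula(counts):
    """Format element counts in Hill order (C, H, then alphabetical)."""
    rest = sorted(k for k in counts if k not in ("C", "H"))
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(k + (str(counts[k]) if counts[k] > 1 else "") for k in order)


# Hypothetical registry: property name -> (compute function, prerequisite names)
METHODS = {
    "element_counts": (lambda symbols: Counter(symbols), ("symbols",)),
    "formula": (hill_formula, ("element_counts",)),
}


class LazyObject:
    """Holds known data and computes missing properties on first request."""

    def __init__(self, **known):
        self._values = dict(known)

    def get(self, name):
        if name not in self._values:
            if name not in METHODS:
                raise KeyError(f"no data and no computation method for {name!r}")
            func, needs = METHODS[name]
            # Resolve prerequisites recursively, then cache the result.
            self._values[name] = func(*(self.get(n) for n in needs))
        return self._values[name]


ethanol = LazyObject(symbols=["C", "C", "O"] + ["H"] * 6)
print(ethanol.get("formula"))  # C2H6O
```

Requesting `formula` triggers the computation of `element_counts` as a side effect; both values are cached, so a second request costs nothing.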
Cactvs and PubChem
 Cactvs Toolkit licensed by NCBI as integral component
of the PubChem software suite
 Used for file I/O, syntax verification, property
computation, structure depiction, structure
identification via hashcodes, interface to NIST InChI
suite, fingerprints, sub/superstructure & formula
search system, WWW structure sketching
 Only externally available toolkit that understands
PubChem data structures (ASN.1 specs for
substances, compounds, assays, and PUG) – including
literature references, conformer data, etc.
Basic PubChem Integration
 Ensemble object creation via CID:
set eh [ens create $cid]
Direct download and parsing of binary ASN.1 record via display
page. Also supported as file I/O module.
 Computation of CID and SIDs from structure:
set cid [ens get $eh E_CID]
set sidlist [ens get $eh E_SIDSET]
Parsing of Entrez E-utils output from submission of InChI string
as text search
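The parsing step mentioned above might look roughly like the following sketch, which extracts the ID list from an ESearch XML reply. The function name `parse_esearch_ids` is hypothetical, and `SAMPLE_REPLY` is a hand-written, trimmed stand-in for a server reply rather than captured output.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed example of the ESearch reply layout (not real output).
SAMPLE_REPLY = """<eSearchResult>
  <Count>1</Count><RetMax>1</RetMax><RetStart>0</RetStart>
  <IdList><Id>2244</Id></IdList>
</eSearchResult>"""


def parse_esearch_ids(xml_text):
    """Return the integer IDs listed in an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return [int(node.text) for node in root.findall("./IdList/Id")]


print(parse_esearch_ids(SAMPLE_REPLY))  # [2244]
```

An empty `IdList` yields an empty list, which a caller can treat as "structure not found in PubChem".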
Basic PubChem Integration
 Compound name lookup
set iupacname [ens get $eh E_IUPAC_NAME]
Direct download and parsing of XML CID display record,
extracting OpenEye computed name
 CAS number lookup
set casno [ens get $eh E_CAS]
Direct download and parsing of XML SID set display records
which contain depositor-supplied names, using pattern
recognition
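The slides do not say how the pattern recognition works. One plausible sketch is to scan the depositor-supplied names for strings of the CAS registry number shape and keep only those whose check digit is correct. The names `CAS_PATTERN`, `cas_checksum_ok` and `find_cas_numbers` are hypothetical; the check-digit rule itself (weighted digit sum modulo 10) is the standard CAS one.

```python
import re

# CAS registry numbers: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def cas_checksum_ok(part1, part2, check):
    """Verify the CAS check digit: weights 1, 2, 3... from the right, mod 10."""
    digits = (part1 + part2)[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)


def find_cas_numbers(names):
    """Return the distinct, checksum-valid CAS numbers found in a list of names."""
    found = []
    for name in names:
        for match in CAS_PATTERN.finditer(name):
            if cas_checksum_ok(*match.groups()) and match.group(0) not in found:
                found.append(match.group(0))
    return found


names = ["aspirin", "50-78-2", "CAS 7732-18-5", "12-34-5"]
print(find_cas_numbers(names))  # ['50-78-2', '7732-18-5']
```

The checksum filter matters because depositor name lists contain many catalog numbers that merely look like CAS numbers.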
Initial PubChem Integration
 CAS number I/O module
set eh [molfile read $casfile]
Look up CID as generic term via E-utils, download ASN.1
record via CID. Also supported as object creation command:
set eh [ens create $cas]
The PubChem Virtual File Project
 Improved access to PubChem database
make it indistinguishable from a local, read-only structure
file in Cactvs scripting environment
 Input functions
transparently read structures and all their data from
PubChem
 Query functions
convenient development and archival of queries exceeding
the capabilities of Web interfaces and PUG, maintaining
standard Cactvs query and retrieval syntax
General Approach
 Implement a Cactvs I/O module
I/O modules incorporate function tables with rich set of
functions that are automatically called in specific situations,
capability flags, documentation fields, etc.
 Hidden, automatic use of Entrez E-utils and PUG
Run as many tasks as possible on the Entrez/PubChem structure
search servers; data download and local processing only as last resort
 Optimize for sake of efficiency and just being nice
Use caching techniques to reduce network and server load,
observe NCBI script access rules
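The combination of caching and polite access can be sketched as a small wrapper around whatever function performs the download. The class `PoliteFetcher` and its parameters are hypothetical, and the default interval of 0.34 s (about three requests per second) is an assumption that should be checked against NCBI's current usage guidelines.

```python
import time


class PoliteFetcher:
    """Cache replies by URL and space out the requests that do hit the server."""

    def __init__(self, fetch, min_interval=0.34,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch              # callable: url -> reply
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_request = None

    def get(self, url):
        if url in self._cache:
            return self._cache[url]      # no network traffic at all
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)        # throttle before contacting the server
        self._last_request = self._clock()
        reply = self._fetch(url)
        self._cache[url] = reply
        return reply


# Usage with a stand-in fetch function instead of a real HTTP call.
fetcher = PoliteFetcher(lambda url: f"reply for {url}", min_interval=0.0)
print(fetcher.get("https://example.org/a"))
```

Injecting `fetch`, `clock` and `sleep` keeps the wrapper testable without network access or real delays.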
PubChem Virtual File I/O
Code sample:
filex load pubchem
19
molfile open <pubchem>
molfile0
molfile count molfile0
12002343
molfile read molfile0
ens0
ens props ens0
…E_INCHI E_IUPAC_NAME E_NCBI_COMPOUND_ID E_EXACT_MASS E_TPSA E_SMILES E_SMILES/2….
ens get ens0 E_CID
1
molfile read molfile0
ens1
molfile set molfile0 record 999999
Operations behind the scenes:
 molfile open: contact Entrez e-utils, get database status
 First molfile read: e-utils, get 5K sector of record-CID map, then
single-record ASN.1 download via display page
 Second molfile read: single-record ASN.1 download via display page
 molfile set to record 999999: try to load compressed CID use bit vector
from xemistry.com; fallback is more e-utils queries for record/CID map
sectors
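The sector-wise record/CID map can be sketched as follows. This is a guess at the bookkeeping, not the module's actual code: it assumes 1-based record numbers, reads "5K" as 5000 entries per sector, and uses the E-utils paging parameters `retstart` and `retmax` to name a sector. `sector_window`, `RecordCidMap` and the loader callback are hypothetical.

```python
SECTOR_SIZE = 5000  # "5K sector", assumed to mean 5000 map entries


def sector_window(record, sector_size=SECTOR_SIZE):
    """Return (retstart, retmax, offset) for the sector holding a 1-based record."""
    if record < 1:
        raise ValueError("record numbers start at 1")
    sector = (record - 1) // sector_size
    retstart = sector * sector_size
    offset = (record - 1) - retstart
    return retstart, sector_size, offset


class RecordCidMap:
    """Maps record numbers to CIDs, loading and caching one sector at a time."""

    def __init__(self, load_sector):
        self._load_sector = load_sector  # callable: (retstart, retmax) -> list of CIDs
        self._sectors = {}

    def cid_for(self, record):
        retstart, retmax, offset = sector_window(record)
        if retstart not in self._sectors:
            self._sectors[retstart] = self._load_sector(retstart, retmax)
        return self._sectors[retstart][offset]


print(sector_window(999999))  # (995000, 5000, 4998)
```

Jumping to record 999999 therefore needs only the one sector starting at offset 995000, and neighbouring records are served from the cached sector.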
Simple PubChem Queries
Code sample:
set fh [molfile open <pubchem>]
set cidlist [molfile scan $fh "structure >= $smarts" \
{proplist E_CID}]
Operations behind the scenes:
 Set-up of PUG record
 Post PUG, monitor return status
 Cache CID result data
 Direct access to result set, no structure download
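The "post PUG, monitor return status" step is an asynchronous submit-and-poll cycle. The sketch below shows only that control flow: `submit` and `poll` are hypothetical callables standing in for the actual HTTP POST of PUG XML, and the status strings are assumptions modelled on PUG's status vocabulary.

```python
import time

PENDING = ("queued", "running")  # assumed status names for unfinished requests


def run_pug_query(submit, poll, request, interval=2.0, max_polls=30,
                  sleep=time.sleep):
    """Submit a request, then poll until the server reports a final status.

    submit(request) and poll(request_id) both return (status, payload);
    while the query is pending, payload is the request ID to poll with.
    """
    status, payload = submit(request)
    polls = 0
    while status in PENDING:
        if polls >= max_polls:
            raise TimeoutError("query still pending after maximum number of polls")
        sleep(interval)
        status, payload = poll(payload)
        polls += 1
    if status != "success":
        raise RuntimeError(f"query failed with status {status!r}")
    return payload  # e.g. the CID list or a history key


# Usage with canned replies instead of a real server.
replies = iter([("running", "req1"), ("success", [2244, 2245])])
cids = run_pug_query(lambda req: ("queued", "req1"),
                     lambda req_id: next(replies),
                     "<query/>", sleep=lambda seconds: None)
print(cids)  # [2244, 2245]
```

Caching the final payload under the query text would then cover the "cache CID result data" step, so a repeated scan does not resubmit the query.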
Intermediate PubChem Queries
Code sample:
set fh [molfile open <pubchem>]
set enslist [molfile scan $fh \
"or {structure = $smiles1} {structure = $smiles2}\
{structure = $smiles3}" enslist]
Operations behind the scenes:
 Create and post PUG records, get history keys
 Perform server-side e-utils result merge via history
keys
 Retrieve CID set
 Download structures as ASN.1 blobs via CID
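The server-side merge can be expressed as one more ESearch call whose term refers to earlier results by their history keys (`#1 OR #2 OR #3`). The helper name `merge_history_url` is hypothetical, and the sketch assumes that all partial results live in the same `WebEnv` history session; the `#key` term syntax and the `WebEnv` and `usehistory` parameters are standard E-utils features.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def merge_history_url(query_keys, webenv, operator="OR", db="pccompound"):
    """Build an ESearch URL that combines stored result sets on the server."""
    term = f" {operator} ".join(f"#{key}" for key in query_keys)
    params = {"db": db, "term": term, "WebEnv": webenv, "usehistory": "y"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# "WEBENV123" is a placeholder for the session string returned by the server.
print(merge_history_url([1, 2, 3], "WEBENV123"))
```

Because the merge happens on the history server, only the final CID set has to travel over the network before the structures are downloaded.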
Power PubChem Queries
Code sample:
set stfh [molfile open $mysdfile]
set fh [molfile open <pubchem>]
set th [molfile scan $fh \
"and {structure ~>= $stfh 95} {formula >= \[M\]0} \
{E_NMOLECULES = 1} {E_STEREO_COUNT(1) >= 1}" \
{table E_CID score E_SMILES E_FORMULA record image} \
{} 1000]
table write $th similar_in_pubchem.xls
Bioassay access is unfortunately not yet part of PUG.
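The `structure ~>= $stfh 95` clause in the sample above reads as a similarity search with a 95 percent threshold. Assuming this means a Tanimoto score on fingerprint bits (the measure PubChem uses for its similarity search), the comparison could be sketched as below. The function names and the representation of a fingerprint as a set of bit positions are illustrative choices only.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def meets_threshold(bits_a, bits_b, percent=95):
    """True if the Tanimoto score is at least `percent`; integer math avoids rounding."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return True
    return 100 * len(a & b) >= percent * len(a | b)


query = set(range(20))
candidate = set(range(19))
print(tanimoto(query, candidate), meets_threshold(query, candidate))
```

Two fingerprints that share 19 of 20 set bits just reach the 95 percent threshold; losing one more shared bit drops the pair below it.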
Summary

Goal: Make PubChem finally conveniently accessible as data
source for local work

Feature: Read all data from PubChem records, and further
manipulate it to your heart's content

Feature: Write and conserve complex queries beyond what
you can do with the Web interface

Feature: Export data in many more formats than possible
via the Web interface

Future: Sort out remaining problems with caching and field
access in complex queries, use parallel PUG submissions,
integrate assay data access
Availability
 Is a standard component of 3.353 and later CACTVS
toolkit releases
 Free academic downloads from www.xemistry.com
for multiple platforms (Linux, MS Windows, MacOSX,
Solaris, BSD)
 Also part of basic commercial toolkit, to be distributed
with regular customer updates