
Web Interface to Dictionary of Natural Products© at AstraZeneca
Péter Várkonyi, DECS Cheminformatics, AstraZeneca R&D Mölndal, Sweden
email: [email protected]
Web-based application and database exploiting DNP data
The purpose of the application is to make DNP data easily accessible AstraZeneca-wide through a simple,
straightforward query interface. The query must support chemical structure search, and the results must be
available in electronic form, including the chemical structures.
ABSTRACT
Dictionary of Natural Products (Chapman & Hall) is a collection of chemical substances from natural sources, updated twice a year. The data are catalogued, and the catalogue is the
only way to retrieve information from the database. It was therefore decided to make the database searchable and available to all researchers at the company. The data, including
the chemical structures of the substances, were placed in an ORACLE database. The structures are converted and processed with JChem. The query interface is a web
application built on Java Server Pages (JSP) technology. The most important fields (id, full chemical name, CAS number, molecular weight, importance, and
pharmaceutical importance) and the chemical structure were selected to be searchable. Chemical structures are entered in the query with the MarvinSketch applet. The
alphanumeric search is performed in ORACLE via JDBC, and the chemical structure search uses JChem. The chemical structures in the query results are
presented with the MarvinView applet. Besides the searchable fields and the chemical structure, the bibliographic references are displayed.
Dictionary of Natural Products (Chapman & Hall)
Substances produced by natural organisms have always played a significant role in pharmaceutical research. Dictionary of Natural Products is one of the most extensive
collections of these substances. It contains nearly 200 000 records corresponding to approximately 40 000 parent compounds. Besides the name and chemical
structure of each substance, DNP contains some of its most important properties, a description of its significance, information about the associated hazards, and bibliographic
references.
The data are delivered in two files:
- a text file containing the compound identifiers and all the alphanumeric data.
The "records" of the text file are designed for printing in publishing
quality, not for searching the content of the fields.
- an SDF file containing the compound identifiers and the chemical structures
in MDL's SD format.
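As an illustration of how the compound identifiers can be read alongside the structures in the SDF file, here is a minimal Python sketch of an SD file reader. It relies only on the general SD layout (a molfile block ending in "M  END", data items introduced by "> <TAG>" lines, and "$$$$" closing each record). The tag name UKEY and the one-atom sample record are assumptions made for the example, not taken from the DNP distribution, and this is not the JChem import used in the application.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List

# Header line of an SD data item, e.g. "> <UKEY>"
TAG_RE = re.compile(r"^>.*<([^>]+)>")


@dataclass
class SdRecord:
    molblock: str                      # molfile text up to and including "M  END"
    fields: Dict[str, str] = field(default_factory=dict)


def parse_record(lines: List[str]) -> SdRecord:
    """Split one SD record into its molfile block and its data items."""
    end = next((i for i, l in enumerate(lines) if l.rstrip() == "M  END"),
               len(lines) - 1)
    record = SdRecord(molblock="\n".join(lines[: end + 1]))
    tag, values = None, []
    for line in lines[end + 1:]:
        match = TAG_RE.match(line)
        if match:                      # a new data item starts
            tag, values = match.group(1), []
            record.fields[tag] = ""
        elif tag is not None:
            if not line.strip():       # a blank line closes the data item
                tag = None
            else:
                values.append(line)
                record.fields[tag] = "\n".join(values)
    return record


def iter_sdf_records(lines: Iterable[str]) -> Iterator[SdRecord]:
    """Yield one SdRecord per "$$$$"-terminated record."""
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":
            if current:
                yield parse_record(current)
            current = []
        else:
            current.append(line)
    if any(l.strip() for l in current):  # last record without a terminator
        yield parse_record(current)


# Made-up one-atom record; "UKEY" as the SD tag name is an assumption.
SAMPLE = """\
example
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <UKEY>
EXAMPLE001

$$$$
"""
```

Calling list(iter_sdf_records(SAMPLE.splitlines())) gives one record whose fields dictionary maps the identifier tag to its value, which is what allows the structures to be matched to the records of the text file.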
The available fields in the text file can be divided into four types:
- identifiers and catalogue numbers, such as:
  UKEY  DNP record identifier
  CASM  Chemical Abstracts registry number
  ALDR  Aldrich catalogue number
- properties, such as:
  MOLF  Molecular formula
  OPTR  Optical rotation
  LGPE  Partition coefficient data (experimental)
- use or importance, such as:
  UIMP  Use, importance
  DUIM  Pharmaceutical importance
  HAZD  Hazard
- bibliographic reference data, such as:
  DATE  Date of reference
  INIT  Initials of author
The full list of available fields in DNP is shown in Table 1.
Table 1 - Fields in Dictionary of Natural Products
aldr  Aldrich catalogue number
boil  Boiling point/sublimation point
casm  CAS Registry number
ctfl  Connection table status code
dens  Relative density
derd  Derivative descriptor
devs  Development status
diag  Diagram code
docn  Entry number(s) in printed work(s)
docl  Latest DOCN
docs  Special DOCN
dref  Derivative cross-reference
entr  Control number
exno  Exchange number (Chapman & Hall number)
exnx  Old exchange number
fluk  Fluka catalogue number
gens  General statement
hazd  Hazard
hazf  Hazard flag
indx  Index name
lgpc  Partition coefficient data (calculated)
lgpe  Partition coefficient data (experimental)
melt  Melting point/freezing point
misc  Miscellaneous information
molf  Molecular formula
molw  Molecular weight
name  Entry name
optr  Optical rotation
phys  Physical description, solvent of recrystallization
pkas  pKa value
prog  Progress code
rare  Rare Chemicals Library number
rgrp  Linear diagram
rtec  RTECS registry number
sigm  Sigma catalogue number
solp  Solubility
sorc  Source, synthesis
sref  Diagram cross-reference
stra  Structure by analogy
subs  Subset code
supc  Supelco catalogue
supp  Supplier data
syns  Synonym
tocn  Type of compound code
uimp  Use, importance
ukey  Unique key
vard  Variant descriptor
xcas  Additional entry specific CAS registry number
As a first step, we designed the application to search and display the most relevant information besides the chemical
structure. The application is designed to be easily expandable with further fields to be either searched or displayed.
The selected fields are as follows:
- CAS number
- DNP record id
- Molecular formula
- Molecular weight
- Use, importance
- Pharmaceutical importance
- Name
The database environment used to store and search the data is ORACLE. The chemical information is handled through
ChemAxon's JChem JDBC link to the ORACLE database. The tables of the database and their relationships are shown in
Fig. 1.
Fig. 1 The tables and their relationships in the ORACLE database
DNPSTRUCT: CD_ID, CD_STRUCTURE, CD_SMILES, CD_FORMULA, CD_MOLWEIGHT, CD_HASH, CD_FLAGS, CD_TIMESTAMP, CD_FP1 ... CD_FP16, UKEY
REFERENCE: UKEY, REFERENCE, ORDER (one-to-many relationship)
SINGLE1: UKEY, NAME, CASM, MOLF, MOLW, UIMP, DUIM (one-to-one relationship)
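The table layout of Fig. 1 can be sketched as follows. This is a minimal illustration that uses an in-memory SQLite database as a stand-in for ORACLE. The column types, the choice of DNPSTRUCT as the parent table on both relationships, and the placeholder rows are assumptions; only the table and column names come from Fig. 1, and the JChem-maintained columns CD_HASH, CD_FLAGS, CD_TIMESTAMP and CD_FP1 ... CD_FP16 are left out for brevity.

```python
import sqlite3

# REFERENCE and ORDER are quoted because they clash with SQL keywords.
SCHEMA = """
CREATE TABLE DNPSTRUCT (
    CD_ID        INTEGER PRIMARY KEY,
    CD_STRUCTURE BLOB,
    CD_SMILES    TEXT,
    CD_FORMULA   TEXT,
    CD_MOLWEIGHT REAL,
    UKEY         TEXT NOT NULL UNIQUE
);
-- one-to-one: at most one SINGLE1 row per structure (UKEY is the primary key)
CREATE TABLE SINGLE1 (
    UKEY TEXT PRIMARY KEY REFERENCES DNPSTRUCT (UKEY),
    NAME TEXT, CASM TEXT, MOLF TEXT, MOLW REAL, UIMP TEXT, DUIM TEXT
);
-- one-to-many: any number of reference rows per structure
CREATE TABLE "REFERENCE" (
    UKEY        TEXT NOT NULL REFERENCES DNPSTRUCT (UKEY),
    "REFERENCE" TEXT,
    "ORDER"     INTEGER
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the three tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def insert_placeholder_rows(conn: sqlite3.Connection) -> None:
    """Insert made-up rows (not DNP data) to exercise the relationships."""
    conn.execute("INSERT INTO DNPSTRUCT (CD_SMILES, UKEY) VALUES ('C', 'EXAMPLE001')")
    conn.execute("INSERT INTO SINGLE1 (UKEY, NAME) VALUES ('EXAMPLE001', 'placeholder compound')")
    conn.executemany(
        'INSERT INTO "REFERENCE" (UKEY, "REFERENCE", "ORDER") VALUES (?, ?, ?)',
        [("EXAMPLE001", "second placeholder reference", 2),
         ("EXAMPLE001", "first placeholder reference", 1)],
    )


def references_for(conn: sqlite3.Connection, ukey: str):
    """Return (name, reference) pairs for one compound, in display order."""
    return conn.execute(
        'SELECT s.NAME, r."REFERENCE" FROM SINGLE1 s '
        'JOIN "REFERENCE" r ON r.UKEY = s.UKEY '
        'WHERE s.UKEY = ? ORDER BY r."ORDER"',
        (ukey,),
    ).fetchall()
```

Keeping the single-valued fields in SINGLE1 and the repeating references in their own table means that a results page needs one row per compound, while the reference list can be fetched separately by UKEY when it is requested.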
The application interface uses Java Server Pages (JSP) technology. The query form uses ChemAxon's
MarvinSketch applet to enter the chemical structure. The results form incorporates ChemAxon's MarvinView applet to
display the chemical structure. The bibliographic references are displayed on a separate form on demand. JChem
also handles the chemical structure search and the export of the results to various electronic
formats.
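The alphanumeric part of a query can be assembled from whichever fields the user filled in on the query form. The Python sketch below shows one way to do this with bound parameters; the "?" placeholder style is the same one a JDBC PreparedStatement uses. The function name, the case-insensitive substring matching, and the molecular weight range are assumptions made for illustration and do not describe the application's actual code. The chemical structure search, which JChem performs, is not covered.

```python
from typing import Dict, List, Optional, Tuple

# Text columns of SINGLE1 that can be searched (names taken from Fig. 1)
TEXT_COLUMNS = ("UKEY", "NAME", "CASM", "MOLF", "UIMP", "DUIM")


def build_query(
    criteria: Dict[str, str],
    molw_range: Optional[Tuple[float, float]] = None,
) -> Tuple[str, List[object]]:
    """Return a parameterized SELECT and its bind values.

    Column names come only from the fixed tuple above, so user input
    never becomes part of the SQL text itself.
    """
    clauses: List[str] = []
    params: List[object] = []
    for column in TEXT_COLUMNS:
        value = criteria.get(column, "").strip()
        if value:  # empty form fields are ignored
            clauses.append(f"UPPER({column}) LIKE UPPER(?)")
            params.append(f"%{value}%")
    if molw_range is not None:
        low, high = molw_range
        clauses.append("MOLW BETWEEN ? AND ?")
        params.extend([low, high])
    sql = "SELECT UKEY, NAME, CASM, MOLF, MOLW, UIMP, DUIM FROM SINGLE1"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY UKEY", params
```

Because the list of searchable columns is a single tuple, adding a further searchable field amounts to adding its column name there, which is in line with the goal of keeping the application easily expandable.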
Query Form
Result Form
Reference Form
Bibliographic reference fields
date  Date of reference
etal  Multiple author indicator
init  Initials of author
page  Page number
ptee  Patentee
rkey  Reference unique key
rsrt  Sort key for references
rtag  References contents tag
surn  Author’s surname
titl  Journal or book title
voln  Volume number
xtra  Extra information
The data are categorized by parent compound and sorted into chemical structure groups (e.g. carbohydrates, oxygen heterocycles, polyketides). A parent
compound can have several derivatives and variants, and the derivatives can also have several variants.
References
Dictionary of Natural Products on CD-ROM. Chapman & Hall, 2006.
Chapman & Hall Export Format, Version 9.0, July 1997.
Acknowledgement
In developing this application, the JSP example application in the JChem manual was consulted.
Szabolcs Csepregi and Szilárd Dóránt of ChemAxon helped me through the initial learning phase of using JChem.