JChem - ChemAxon

Benchmarking
JChem Oracle and Instant-JChem (and more)
Tobias Kind
FiehnLab at UC Davis Genome Center
November 2006
Free Academic Licenses for JChem and
Instant JChem provided by
1
ChemAxon product suite
Source: Chemaxon.com
We have free academic licenses for all products
2
Metabolomics @ Fiehnlab - The science of the small molecules
Compound Classes:
• sugars
• amino acids
• steroids
• fatty acids
• lipids
• phospholipids
• organic acids ...
Molecules under investigation
(shown with ChemAxon Marvin)
Visit us @ fiehnlab.ucdavis.edu
3D model of a molecule with surface plot
(shown with ChemAxon MarvinSpace)
3
Metabolomics is a truly emerging science
...tries to identify all small molecules (< 2000 Da)
in all life forms in a comprehensive manner
Life Science Tree:
Genomics (DNA)
Transcriptomics (RNA)
Proteomics (Proteins)
Metabolomics (Small Molecules)
4
Techniques and tools
• Analytical techniques (LC-MS, GC-MS, NMR, IR)
• BioInformatics, Cheminformatics
LTQ-FT-MS
Gas Chromatography
GC-TOF-MS
Liquid Chromatography
LC-MS
BioInformatics and Cheminformatics
Statistics (Statistica Dataminer)
Open Source + commercial software
5
We use cheminformatics tools for
mass spectrometry based structure elucidation
[Figure: workflow from a mass spectrum to candidate structures. Spectrum axes: m/z (944-966) vs. Relative Intensity (%)]

mass spectrum (molecular ion, isotopic pattern, accurate mass)
-> Formula Generator
-> Automatic Isotopic Pattern Filter
-> DB Search {fast approach but non-comprehensive}
   or molecular isomer generator {slow approach, needs constraints, comprehensive} {exhaustive drill down}

Formula Generator output (excerpt):

No.   Formula           Mass
861   C61H22N4OP2S2     952.071
862   C61H22N4O3P2S     952.089
863   C61H23N4OP3S      952.081
864   C61H23N4O3P3      952.098
865   C61H24N4OP4       952.09
866   C61H29O4PS3       952.097
867   C61H30O2P2S3      952.088
868   C61H31P3S3        952.08
869   C61H31O2P3S2      952.098
870   C61H32P4S2        952.09
871   C61H119N6O        951.945
872   C61H123O4S        951.914
873   C61H123O6         951.932
874   C61H124O2PS       951.906
875   C61H124O4P        951.924
876   C61H125O2P2       951.915
877   C61H126P3         951.907
878   C61H127N2S2       951.944
879   C61H127N2O2S      951.962

After the Automatic Isotopic Pattern Filter:

No.   Formula           Mass
1     C41H28O27         952.082
2     C41H28N8O20       952.142
3     C41H28N16O13      952.202
4     C41H28N24O6       952.262

Formula: C41H28O27
one structural isomer shown (one possible result)
millions of isomers possible

See our BMC Bioinformatics paper:
Metabolomic database annotations via query of elemental compositions:
Mass accuracy is insufficient even at less than 1 ppm ; http://www.biomedcentral.com/1471-2105/7/234
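The masses in the formula lists above can be checked with a few lines of code. The following is a minimal sketch, not the formula generator used on the slide: it sums monoisotopic masses for a formula string and computes a ppm error. The function names and the small element table are assumptions made for illustration.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
# Only the elements that occur in the formula lists are included.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.9737620,
    "S": 31.9720707,
}

def monoisotopic_mass(formula):
    """Sum element masses for a simple formula such as 'C41H28O27'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total

def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

if __name__ == "__main__":
    for f in ("C41H28O27", "C61H22N4OP2S2"):
        print(f, round(monoisotopic_mass(f), 3))
```

Comparing the two printed masses with `ppm_error` shows how close unrelated elemental compositions can lie, which is why the isotopic pattern filter is applied after formula generation.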
6
What are
JChem and Instant-JChem?
JChem and Instant JChem are cheminformatics tools for handling small
molecule structures together with substance data (logP, fingerprints, pKa, toxicity,
meta-information), plus searches, filters, web connections and more.
Difference: JChem is a complex package; Instant-JChem is one single tool.
Pictures: ChemAxon (JChem, Instant-JChem)
7
Benchmarking
Instant-JChem and JChem Oracle (and more)
Myth 1: JChem+Oracle is faster than Instant-JChem+Apache Derby - Reality: let's see...
Myth 2: JAVA is slow - Reality: It's fast (70% of C++).
Myth 3: Old Intel Xeons (Netburst) are slow - Reality: Yes.
Myth 4: Oracle is a hassle-free and handsome DB for beginners - Reality:
Myth 6: 2 CPUs are better than one - Reality: Yes.
Myth 7: Comparing apples with oranges (in Germany: pears) is unfair - c'mon...
Only the first myth is left.
8
Happy Oracle Ace
paid 10K for certificate
A bit of Oracle Reality
Oracle works; lots of people invested lots of money (ORCL market cap = 92 billion dollars)
It's good for large data (TByte) - it's overkill for a small DB.
If you plan to install it on your production workstation (a big No No)
• It will eat 600-800 MB of your valuable RAM (for nothing, on WINXP 32 bit)
• It will create 15,049 files in 2,029 folders (for what?)
• It will create a lot of hassle with certain network setups (DHCP)
• RTFM (read the … manual) is no joke and you need to learn SQL (try the free Aqua Data Studio)
• Complete learning will take you 1-2 years, but gives you extreme flexibility
If you plan to install JChem + Oracle you need
• JChem (includes cartridge for Oracle)
• Oracle
• Apache Tomcat
• 1-2 days of time (ChemAxon documentation is good, but too many things can go wrong with Oracle)
1st time Oracle user
9
A bit of Instant JChem Reality v1.0
A) Download
http://www.chemaxon.com/instantjchem/
B) Install
C) It runs instantly
• built-in Apache Derby DB
• JAVA engine included
• complete JChem included
• out-of-the-box tool
• can connect to other DBs
10
Importing Structures into Instant JChem
During import in Instant JChem only one CPU works.
The fingerprint calculation is probably not multi-threaded.
(Solution: a worker pool sized for n CPUs)
Short import time is critical for user convenience,
but not for long-term database projects.
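The worker-pool idea can be sketched as follows. This is an illustration of the general technique only, not how Instant JChem works internally: `toy_fingerprint` is a hypothetical stand-in for a real fingerprint calculation, and `fingerprint_all` is an assumed helper name.

```python
import os
import zlib
from concurrent.futures import ProcessPoolExecutor

def toy_fingerprint(smiles, n_bits=64):
    """Hypothetical stand-in for a fingerprint: hash character pairs into a bit set."""
    bits = 0
    for i in range(len(smiles) - 1):
        pair = smiles[i:i + 2].encode("ascii")
        bits |= 1 << (zlib.crc32(pair) % n_bits)
    return bits

def fingerprint_all(smiles_list, workers=None):
    """Distribute the per-structure work over a pool with one process per CPU."""
    workers = workers or os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches structures so that inter-process overhead stays small
        return list(pool.map(toy_fingerprint, smiles_list, chunksize=1000))
```

On platforms that start worker processes by spawning (for example Windows), `fingerprint_all` should be called from under an `if __name__ == "__main__":` guard. Because each structure is processed independently, the import step splits cleanly across n workers.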
11
Importing Structures into Instant-JChem
influence of JAVA hotspot compiler
The JAVA VM runs in two modes: with the client compiler and with the server compiler (directories under the JRE).
If you run any calculation-intensive programs always use server mode; in a batch file call java -server XYZ
Good and fast
Bad and slow
12
Influence of JAVA hotspot compiler
Importing Structures into Instant-JChem
[Bar chart: import of 250k structures (NCI99.smi) into Instant-JChem; y-axis: seconds (0-600), lower is better; bars: JAVA server vs. JAVA client]
Server JVM is 20% faster!
Testsystem: Dual Opteron 254 (2,8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate);
ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s);
QSOFT Ramdisk Enterprise 1,2 GByte ( read write 1 GByte/s transfer)
13
Influence of JAVA hotspot compiler with Instant-JChem
Task: Search for a substructure in a 3 million compound database
and calculate the Lipinski Rule of 5 on all the 4632 results.
JAVA server mode: 15 seconds (30% faster)
JAVA client mode: 21 seconds
Query substructure (SMILES): NC1=CC=NC2=C1C=CC(Cl)=C2
If you want to speed up this query,
pre-calculate all descriptors and
store them in the database.
http://en.wikipedia.org/wiki/Lipinski's_Rule_of_Five
(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5)
&& (acceptorCount() <= 10) (acceptor count for C and H)
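The filter expression above is a plain conjunction of four thresholds. As a minimal sketch, assuming the descriptors have already been calculated and stored (the `Descriptors` class and the example values are made up for illustration and are not ChemAxon API):

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    """Pre-calculated values for one compound (hypothetical record layout)."""
    mass: float
    logp: float
    donor_count: int
    acceptor_count: int

def passes_lipinski(d):
    """Same four thresholds as the expression on the slide."""
    return (d.mass <= 500 and d.logp <= 5
            and d.donor_count <= 5 and d.acceptor_count <= 10)

if __name__ == "__main__":
    # Illustrative values only, not measured data.
    small = Descriptors(mass=180.2, logp=1.2, donor_count=1, acceptor_count=4)
    large = Descriptors(mass=952.08, logp=1.0, donor_count=15, acceptor_count=27)
    print(passes_lipinski(small), passes_lipinski(large))
```

With stored descriptors the filter reduces to four comparisons per compound, so the expensive part (logP and the donor and acceptor counts) is paid once at import time instead of on every query.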
14
Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million
compound database and calculate the Lipinski Rule of
5 on all the 4632 results
                     2 CPUs        1 CPU
JAVA server mode:    15 seconds    33 seconds
JAVA client mode:    21 seconds    44 seconds
Doing the Lipinski utilizes both CPU cores!
Try Intel Quad! Try Opteron 8x!
Testsystem: Dual Opteron 254 (2,8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate);
ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s);
QSOFT Ramdisk Enterprise 1,2 GByte ( read write 1000 MByte/s transfer)
15
Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million compound database
and calculate the Lipinski Rule of 5 on all the 4632 results (on the fly)
1 CPU (1x2.8 GHz)*     33 seconds
2 CPUs (2x2.8 GHz)*    15 seconds
8 CPUs** (2 GHz)       4 seconds
Doing the Lipinski utilizes multiple CPU cores!
However a single logP calculation is dependent on CPU speed, not CPU cores.
Use AMD Opteron 8xCPU systems (or better). For cheaper setups use Intel Core 2 Quad (QX6700).
Testsystem*: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate);
Testsystem** : 4 x Dual-Core Opteron 870 2.0 GHz; CentOS 64-bit, 32 GByte RAM, 3.5 GB set for JAVA heap space
16
Influence of number of CPUs on complex calculations
with Instant-JChem
Task: Search in 1000 compounds from PubChem-1000-demo and calculate on the fly:

Filter               Hits   1 CPU (1x2.8 GHz)*   2 CPUs (2x2.8 GHz)*   8 CPUs** (2 GHz)
Bioavailability      832    30 s                 17 s                  7.5 s
Ghose filter         255    14 s                 8 s                   4.4 s
Lead likeness        531    53 s                 25 s                  9.8 s
Lipinski rule of 5   776    15 s                 7.5 s                 4.7 s
Muegge filter        277    7.5 s                4.2 s                 3.4 s
Veber filter         774    1.7 s                2.5 s                 1.5 s

Take home message:
The more complex the request, the more CPUs you need.
The lead likeness has 7 filters and reaches a 5-8 times speed-up with more CPUs.
Testsystem*: Dual Opteron 254 (2.8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer rate);
Testsystem**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit, 32 GByte RAM, 3.5 GB set for JAVA heap space
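Speed-up and parallel efficiency follow directly from the measured times. A minimal sketch, with function names chosen here for illustration:

```python
def speedup(t_one_cpu, t_n_cpus):
    """How many times faster the run on n CPUs is than the run on 1 CPU."""
    return t_one_cpu / t_n_cpus

def efficiency(t_one_cpu, t_n_cpus, n_cpus):
    """Speed-up divided by the number of CPUs (1.0 means ideal scaling)."""
    return speedup(t_one_cpu, t_n_cpus) / n_cpus

if __name__ == "__main__":
    # Lead likeness: 53 s on 1 CPU and 25 s on 2 CPUs of the same 2.8 GHz system.
    print(round(speedup(53, 25), 2), round(efficiency(53, 25, 2), 2))
```

The ratio against the 8-CPU times mixes two effects, because that system runs slower 2 GHz cores under a different OS, so it is better read as a system comparison than as a pure scaling figure.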
17
Scaling complex calculations to larger DBs
with Instant-JChem
Task: Now search in 250,000 compounds from NCI2000 and calculate on the fly:

Filter               Hits      Direct query   Calculation, 8 CPUs** (2 GHz)   Extrapolated time from 1000-compound DB   Obtained speed-up
Bioavailability      227,997   <1 s           380 s                           2055 s                                    5
Ghose filter         160,047   <1 s           230 s                           2762 s                                    12
Lead likeness        159,656   <1 s           1255 s                          2947 s                                    2
Lipinski rule of 5   199,821   <1 s           176 s                           1210 s                                    7
Muegge filter        145,234   <1 s           299 s                           1783 s                                    6
Veber filter         215,377   <1 s           20 s                            696 s                                     35

Take home message:
Do not extrapolate calculation times from different or smaller DBs.
The speed-ups here are 2-35 times larger than expected.
Pre-calculate values once, store them in the DB, and query the values later.
Testsystem**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit,
32 GByte RAM, 3.5 GB max set for JAVA heap space;
1.5 GByte JAVA heap space used.
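The extrapolated times appear to come from scaling the time measured on the 1000-compound DB linearly with the number of hits; the Bioavailability row reproduces under that assumption. A minimal sketch, with function names made up for illustration:

```python
def extrapolate_time(small_time_s, small_hits, large_hits):
    """Scale a measured time linearly with the number of hits."""
    return small_time_s * large_hits / small_hits

def obtained_speedup(extrapolated_s, measured_s):
    """Ratio of the naive extrapolation to the time actually measured."""
    return extrapolated_s / measured_s

if __name__ == "__main__":
    # Bioavailability: 7.5 s for 832 hits (1000-compound DB, 8 CPUs),
    # 380 s measured for 227,997 hits (250,000-compound DB, 8 CPUs).
    expected = extrapolate_time(7.5, 832, 227997)
    print(round(expected), round(obtained_speedup(expected, 380)))  # prints: 2055 5
```

A linear model like this ignores fixed start-up costs and caching effects, which is one plausible reason why the small DB overestimates the time needed on the large one.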
18
Derby database file sizes for Instant-JChem+Apache Derby

Compounds only         File size
100k structures        ~30 MByte
1 Mio structures       ~300 MByte
10 Mio structures      ~3 GByte
20 Mio structures      ~6 GByte
If you have dual or quad cores turn drive compression on.
You can save almost 50% space, speed overhead is low.
19
Instant-JChem on disk based and RAMDisk based systems
People who said the OS has efficient disk caching lied.
A large RAMDISK can speed up your system extremely.
A) If you have money – buy a Solid State Disk
RAMSAN-400; 128 GByte; Price $252,720
3,000 MB/s random sustained external throughput.
B) If you have some money – buy a RAID5 card.
ARECA ARC-1120 for 8 HDs, Price $500
200-400 MB/s read and write access
C) If you have little money - buy a RAMDISK
and stuff as much RAM in as possible (take a 64-bit OS)
500-1000 MB/s read and write access
...a normal hard drive has ~30-50 MB/s transfer rate
20
Instant-JChem on disk based and RAMdisk based system
A) Heap memory max 800 MByte (OK)
Load 3 Mio compound DB from RAMDISK: 2 seconds
Load 3 Mio compound DB from RAID5 disk: 11 seconds (factor 5)
Search substructure from RAMDISK DB: instant (memory buffered)
Search substructure from RAID5 DB: instant (memory buffered)
B) Heap memory max 200 MByte (too low)
Load 3 Mio compound DB from RAMDISK: 19 seconds
Load 3 Mio compound DB from RAID5 disk: 25 seconds (factor 1.3)
Search substructure from RAMDISK DB: 22 seconds
Search substructure from RAID5 DB: 38 seconds (factor 1.7)
Take home message: give JAVA (JChem)
as much heap memory as you can. For 3 million
structures you need a minimum of 300 MByte heap space.
No heap memory: performance degradation; everything must be read from disk.
My RAID5 is already extremely fast, still the RAMDISK is even faster.
21
JChem+Oracle DB on Xeon
vs.
Instant-JChem+Apache Derby DB on Opteron
(apples vs. oranges)
Task:
Import and indexing 3 million compounds (NCI2000 duplicated to 3 Mio)
3 GHz Dual Xeon with 2 GB system memory, JChem+Oracle DB = 5801 seconds (96 minutes)
2.8 GHz Dual Opteron with 2.88 GB memory, Instant-JChem+Apache Derby = 5333 seconds (88 minutes)
Take home message:
If you have a (modest) modern computer
it can handle JChem and Instant-JChem, and
a local database can be faster than a remote database.
Source Xeon data: Oracle Cartridge Benchmark
http://www.chemaxon.com/jchem/FAQ.html#benchmark3
22
Instant-JChem+Apache Derby DB on Socrates*
vs.
Instant-JChem+Apache Derby DB on Dual Opteron 2.8 GHz (WIN-XP)**
vs.
JChem+Oracle DB on Dual Xeon 3 GHz (W2003 Server)***
(more apples vs. oranges)
Task: Search for substructures in a 3 million
compound database (NCI2000x12)

Query (SMILES)                              # Hits    InstantJChem+Derby*   InstantJChem+Derby**   JChem+Oracle***
C1CN1c2cnnc3c(cncc23)C4=CSC=C4              0         0 s                   0 s                    0 s
O=C1ONC(N1c2ccccc2)c3ccccc3                 204       0 s                   0 s                    0 s
[#6]-c1cc(-[#6])nc(NS(=O)(=O)c2ccccc2)n1    1224      0 s                   0 s                    0 s
c1ncc2ncnc2n1                               65,208    2 s                   7 s                    14 s
Clc1ccccc1                                  274,608   5 s                   15 s                   43 s
O=Cc1ccccc1                                 443,580   9 s                   28 s                   85 s
Take home message: Instant-JChem is fast (nothing more).
Source: Instant-JChem (own system), JChem (ChemAxon website)
Socrates*: 4x Dual Opteron 870 2GHz; CentOS 64-bit, 32 GByte RAM, 4 GB set for JAVA
Opteron**: Dual Opteron 254 (2,8 GHz); WINXP-32bit; 2.88 GByte RAM (10 GByte/s transfer );
ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s);
QSOFT Ramdisk Enterprise 1,2 GByte ( read write 1000 MByte/s transfer)
Xeon***: Dual Intel Xeon 3GHz, 2GB memory, 160GB IDE hard drive;
Windows 2003 5.2; Oracle 9.2.0.7.0 DB buffer 1 GB; 1.5.0_06-b05 Apache Tomcat/5.5.12
23
A 20 million compound DB with Instant-JChem
in a local Derby DB (WinXP-32bit)
• Import is heavily disk dependent
• several hundred million read/write operations to disk (JAVA writes in 4 KB chunks)
• JAVA heap space used during import is around 600 MByte
• import time is not linear anymore
• WIN XP 32-bit + NTFS desperately try to cache the 6 GByte database file,
even if there is only 3 GByte memory maximum available (1 GByte max for cache).
• index creation (import smiles): 20h (too long)
• open index for search: 1 min
• substructure search: > 1 min (too long)
• 20 Mio is currently too large for Instant-JChem v1.0
-> use JChem+Oracle (or MySQL, MS SQL)
• Aim: Full PubChem data (15-20 Mio) locally
24
Some general JAVA + JChem speed advice
1. Always use the server JVM (check directories bin\client and bin\server);
check the batch or sh file options for java -server xyz xyz.jar
2. Use 64-bit systems; the JAVA maximum heap space on a 32-bit LINUX or WIN
system is only 1.6 GByte (-Xmx1600m)
3. Use only multicore machines (AMD Opterons, Intel Quad)
4. Use the fastest disks you can buy (WD Raptor) or use RAID5 or RAID6
for large files (PubChem SDF data for 5 Mio compounds = 30 GByte)
5. Give Instant-JChem as much memory as you have - minimum 500 MByte
for extreme speed (no wait time for searches)
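Points 1 and 2 come down to two JVM options: `-server` selects the server compiler and `-Xmx<size>m` sets the maximum heap. A minimal sketch of a launcher helper; the function name is made up for illustration, and `xyz.jar` is the placeholder from the slide, not a real file:

```python
def build_java_command(jar_path, max_heap_mb, server=True):
    """Assemble a java command line with server mode and a maximum heap size."""
    cmd = ["java"]
    if server:
        cmd.append("-server")
    cmd.append(f"-Xmx{max_heap_mb}m")  # maximum heap, e.g. -Xmx1600m
    cmd += ["-jar", jar_path]
    return cmd

if __name__ == "__main__":
    print(" ".join(build_java_command("xyz.jar", 1600)))
    # prints: java -server -Xmx1600m -jar xyz.jar
```

The resulting list can be passed to `subprocess.run` or pasted into a batch or sh file; on a 64-bit JVM the heap value can be raised well beyond the 32-bit limit.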
25
Let’s not forget competitors
Many good systems exist:
MDL (ISIS Base), ACDLabs (ACD/ChemFolder Enterprise), Tripos (Sybyl+Auspyx),
Molecular Networks (Carol), CDK and Taverna, Accelrys (Accord), Daylight (Thor and
Merlin), CambridgeSoft (ChemOffice Enterprise), Molsoft (ICM+MolCart)
Why is ChemAxon better?
Two reasons:
• The programs work under WINDOWS and LINUX
• ChemAxon has the best and most responsive public forum:
criticism is taken seriously, requested features are implemented ASAP,
and a public response comes within 1-3 days. WHY? Many commercial licensees.
Remember, for academics all free.
26
Results and conclusion
JChem Oracle vs. Instant-JChem
1. Instant-JChem+Derby is as fast as or faster than JChem+Oracle for DBs < 3 Mio
2. If you want to have fun and results at your fingertips: Instant-JChem
3. If you want extreme flexibility and you know JAVA+SQL: JChem-Oracle
4. We are far away from handling billions of structures in a DB (with modest effort).
We will handle such large numbers of structures file-stream based with cluster support.
5. Software producers (in general) need to put more effort into software development
for multi-core CPUs + clusters under Windows and LINUX.
27