BiomedicineandLifeSciencesI_HurngChunLee_03282007


Enabling Grids for E-sciencE
Building Grid-enabled Virtual Screening Service for Drug Discovery
Ying-Ta Wu (1) and Hurng-Chun Lee (2)
(1) Academia Sinica Genomic Research Center
(2) Academia Sinica Grid Computing Center (ASGC)
www.eu-egee.org
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks
Outline
• Avian flu drug analysis on the Grid
• Developing grid-enabled virtual screening service
• The next large-scale virtual screening on avian flu
The virtual screening
• Millions of chemical compounds available in laboratories
• 300,000 chemical compounds: ZINC, chemical combinatorial library
• High-throughput screening: $2/compound, nearly impossible
• Molecular docking (AutoDock): ~137 CPU years, 600 GB data
• Data challenge on EGEE, AuverGrid, TWGrid: ~6 weeks on ~2000 computers
• Target (PDB): neuraminidase (8 structures)
• Hits sorting and refining
• In vitro screening of 100 hits
1st large-scale avian flu virtual screening on the Grid
• In 2006, a grid-enabled high-throughput screening
against the H5N1 virus was performed
– Matching 300,000 ligands against 8 targets using AutoDock
– The computing requirement of 137 CPU years was met by
a 6-week high-throughput screening (HTS) activity on EGEE,
AuverGrid and TWGrid
– Two different computing models (WISDOM and DIANE) were
adopted for submitting docking jobs, addressing two different
user concerns (scalability and interactivity)
– The goal is to analyze the effectiveness of known drugs against
possible neuraminidase mutations
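The scale of the challenge can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on this slide; the 365.25-day year and the function names are assumptions made for illustration, and the result is an average, not a measured value.

```python
# Figures quoted on the slide.
LIGANDS = 300_000
TARGETS = 8
CPU_YEARS = 137
CAMPAIGN_WEEKS = 6

DAYS_PER_YEAR = 365.25  # assumption: average calendar year


def docking_pairs(ligands: int, targets: int) -> int:
    """Number of ligand/target pairs if every ligand is docked on every target."""
    return ligands * targets


def average_busy_cpus(cpu_years: float, weeks: float) -> float:
    """CPUs that must stay busy the whole time to finish the work in `weeks`."""
    cpu_days = cpu_years * DAYS_PER_YEAR
    return cpu_days / (weeks * 7)


pairs = docking_pairs(LIGANDS, TARGETS)
busy = average_busy_cpus(CPU_YEARS, CAMPAIGN_WEEKS)
print(f"{pairs:,} ligand/target pairs")
print(f"about {busy:.0f} CPUs busy on average over {CAMPAIGN_WEEKS} weeks")
```

Under these assumptions the campaign needs on the order of 1,200 continuously busy CPUs, which fits within the ~2000 computers mentioned earlier in the talk.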
High-throughput screening using WISDOM
• WISDOM: Wide In-Silico Docking On Malaria
• The platform was successfully tested in a previous challenge
• A workflow for Grid job handling: automatic job submission, status check and report, error recovery
• Push-model job scheduling + batch-mode job handling
[Diagram: WISDOM deployment. Labels: user interface; compounds database; compounds list; compounds sublists; parameter settings; target structures; software; Site1 and Site2, each with a storage element and a computing element; results; statistics]
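The push model with batch handling can be illustrated as follows: the compound list is cut into sublists, and every (target, sublist) job is assigned to a site before anything runs. This is a minimal sketch, not the WISDOM code. The function names, the round-robin assignment and the toy site and target names are assumptions; on the real Grid the placement is done by the middleware.

```python
from itertools import cycle


def split_into_sublists(compounds, sublist_size):
    """Cut the full compound list into fixed-size sublists (the last may be shorter)."""
    return [
        compounds[i:i + sublist_size]
        for i in range(0, len(compounds), sublist_size)
    ]


def push_schedule(sublists, targets, sites):
    """Push model: decide the site of every (target, sublist) job up front.

    Round-robin is a stand-in for whatever placement the Grid middleware makes.
    """
    site_cycle = cycle(sites)
    return [
        {"site": next(site_cycle), "target": target, "compounds": sublist}
        for target in targets
        for sublist in sublists
    ]


# Toy data: 10 hypothetical ligands, 2 hypothetical targets, 2 sites.
compounds = [f"ligand_{i:04d}" for i in range(10)]
sublists = split_into_sublists(compounds, sublist_size=4)
jobs = push_schedule(sublists, targets=["target_A", "target_B"], sites=["Site1", "Site2"])

for job in jobs:
    print(job["site"], job["target"], len(job["compounds"]), "compounds")
```

Because all placement decisions are made at submission time, a slow or failing site holds on to its jobs until the error-recovery step resubmits them, which is the trade-off against a pull model.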
Interactive screening using DIANE + GANGA
• DIANE: Distributed Analysis Environment
• An overlay system on top of a variety of distributed computing environments, taking care of all
synchronization, communication and workflow management details on behalf of the application
• A lightweight framework for parallel scientific applications in the master-worker model
• Pull-model job scheduling + interactive-mode job handling with a flexible failure recovery mechanism
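The pull model with failure recovery can be sketched in a few lines: idle workers ask the master for the next task, and a failed task goes back into the queue until a retry limit is reached. This is an illustration of the master-worker pattern only, not the DIANE API. The class, the method names, the retry limit and the placeholder energy value are all assumptions.

```python
from collections import deque


class Master:
    """Holds the task queue; workers pull from it (hypothetical sketch)."""

    def __init__(self, tasks, max_attempts=3):
        self.pending = deque(tasks)
        self.attempts = {task: 0 for task in tasks}
        self.max_attempts = max_attempts
        self.done = {}
        self.abandoned = []

    def pull(self):
        """Called by an idle worker; returns a task, or None if nothing is pending."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.attempts[task] += 1
        return task

    def report(self, task, result=None, failed=False):
        if not failed:
            self.done[task] = result
        elif self.attempts[task] < self.max_attempts:
            self.pending.append(task)  # failure recovery: queue the task again
        else:
            self.abandoned.append(task)


def run_worker(master, dock):
    """Worker loop: keep pulling tasks until the master has none left."""
    while (task := master.pull()) is not None:
        try:
            master.report(task, result=dock(task))
        except RuntimeError:
            master.report(task, failed=True)


# Simulated docking function that fails once for one ligand.
remaining_failures = {"ligand_2": 1}


def flaky_dock(task):
    if remaining_failures.get(task, 0) > 0:
        remaining_failures[task] -= 1
        raise RuntimeError("simulated worker failure")
    return -7.0  # placeholder docking energy, not a real result


master = Master(["ligand_1", "ligand_2", "ligand_3"])
run_worker(master, flaky_dock)
print(sorted(master.done), master.attempts["ligand_2"], master.abandoned)
```

Since workers only take a new task when they are free, fast workers naturally process more tasks than slow ones, which is one reason a pull model can reach a higher distribution efficiency on heterogeneous resources.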
The grid statistics
                                        WISDOM           DIANE
Total number of completed dockings      2 x 10^6         308,585
Estimated duration on 1 CPU             88.3 years       16.7 years
Duration of the experiment              6 weeks          4 weeks
Cumulative number of Grid jobs          54,000           2,580
Max. number of concurrent CPUs          2,000            240
Crunching rate                          912              203
Approximated distribution efficiency    46%              84%
Approximated throughput                 2 sec/docking    10 sec/docking

• ~600 GBytes of docking results were produced and archived on the Grid
• ~83% of jobs were successfully completed according to the Grid Logging and
Bookkeeping service; only ~70% of the results were actually stored on the Grid storage
element
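The distribution efficiency figures on this slide appear to be the crunching rate divided by the maximum number of concurrent CPUs. The sketch below checks that reading against the quoted numbers. The formula is an assumption inferred from the figures, not a definition given in the talk.

```python
# Values quoted on the slide.
STATS = {
    "WISDOM": {"crunching_rate": 912, "max_concurrent_cpus": 2000},
    "DIANE": {"crunching_rate": 203, "max_concurrent_cpus": 240},
}


def distribution_efficiency(crunching_rate: float, max_concurrent_cpus: int) -> float:
    """Assumed definition: achieved speed-up relative to the peak CPU count."""
    return crunching_rate / max_concurrent_cpus


efficiency = {
    name: distribution_efficiency(**row) for name, row in STATS.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.1%}")
```

The computed ratios agree with the quoted 46% and 84% to within about one percentage point, which supports this interpretation without proving it.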
Enrichment of primary in silico HTS
Original type: T06
• 2qwe: Zanamivir (known drug)
• Five out of six known effective compounds can be identified in the first 15% of the ranking
• E = (5/6)/15% = 5.5 (< 1 in most cases)
[Figure: ranking of known ligands with a 15% cut-off. Labels: DAN 35%; 4AM 13%; GNA 2.4% (GNA = zanamivir); pKd = 5.3, 7.5, 7.3; Ki = 4 uM, 150 nM, 1 nM]
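The enrichment value on this slide follows the usual ratio: the fraction of known actives recovered, divided by the fraction of the ranked list that was inspected. A minimal sketch, with a hypothetical function name:

```python
def enrichment_factor(actives_found: int, actives_total: int,
                      fraction_screened: float) -> float:
    """Fraction of known actives recovered divided by fraction of the list inspected."""
    if actives_total <= 0:
        raise ValueError("actives_total must be positive")
    if not 0 < fraction_screened <= 1:
        raise ValueError("fraction_screened must be in (0, 1]")
    return (actives_found / actives_total) / fraction_screened


# Slide figures: 5 of 6 known effective compounds in the top 15% of the ranking.
e = enrichment_factor(5, 6, 0.15)
print(f"E = {e:.2f}")

# A ranking no better than random recovers actives in proportion to the
# fraction inspected, which gives E = 1.
e_random = enrichment_factor(15, 100, 0.15)
print(f"E (random ranking) = {e_random:.2f}")
```

The computed value is slightly above the 5.5 quoted on the slide, which looks like a truncation of the same number.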
Mutation effects
300,000 x15% = 45,000
45,000 x 5% = 2,250
autodock
re-rank
Effects of point mutation
• Most known effective inhibitors lose binding affinity for a mutated target
– 2qwe: 2.4% → 11.5%
– 1f8c: 13% → 55%
[Figure labels: T01, E119A, DNA, 4AM, 55%, 11.5%]
Q: How can we deliver a user-friendly service integrating
high-throughput virtual screening and data
analysis?
Lessons learnt
• The grid-enabled virtual screening application does benefit
drug analysis in terms of money and time.
– 137 CPU years in 6 weeks using about 2000 grid worker nodes
– Primary HTS helps filter out 85% of compounds
– Global enrichment rate: 5.5
– Mutations do affect the efficiency of the known drugs and potential hits
• Gaps between the current system and a real end-user application
– Lack of a well-annotated ligand database
– Lack of a friendly user interface to run the virtual screening process on
the Grid
– Lack of an easy-to-use interface to access the produced docking
results for further analysis
– Lack of an automatic refinement pipeline
GUI - first step to real end-user application
[Screenshots of the GUI: interactive analysis, job history, progress monitoring]
Refinement pipeline
[Flow diagram showing data flow and processes, reconstructed from the slide labels]
• Targets: NA X-ray structures pdb_2HU0 and pdb_2HU4, prepared by the PDBQS formatter
• Compounds: ChemDiv most recent 200K and Life Chemical most recent 200K, prepared by the PDBQ formatter (SDF to PDBQ)
• Initial screening (20 runs), followed by the dlg parser and a threshold holder (keep top ~50% of total)
• Refinement screening (50 runs), followed by the dlg parser and a threshold holder (< 1500 ligands per target)
• Additional refinement (< 100 ligands per target)
• Outputs: energy table + compound structure, used for visualization and statistical analysis
• Database: ADME evaluation, scaffold clustering, fragment statistics
• Compound annotation: family, category, remark
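The two kinds of threshold used in the pipeline (keep a fraction of the ranking, then cap the number of ligands per target) can be sketched as below. This is an illustration under stated assumptions. The function names and toy energies are hypothetical, lower docking energy is treated as better, and the sketch reuses one set of energies, whereas the real pipeline re-docks the survivors with more runs before applying the next threshold.

```python
def rank_by_energy(energies):
    """Sort (ligand, energy) pairs from best (lowest energy) to worst."""
    return sorted(energies.items(), key=lambda item: item[1])


def keep_top_fraction(energies, fraction):
    """Keep the best-scoring fraction of ligands, at least one."""
    ranked = rank_by_energy(energies)
    keep = max(1, int(len(ranked) * fraction))
    return dict(ranked[:keep])


def keep_top_n(energies, limit):
    """Keep at most `limit` best-scoring ligands."""
    return dict(rank_by_energy(energies)[:limit])


# Toy energy table for one target: ligand_9 has the lowest (best) energy.
initial = {f"ligand_{i}": -float(i) for i in range(10)}

after_initial = keep_top_fraction(initial, 0.5)   # "keep top ~50% of total"
after_refinement = keep_top_n(after_initial, 3)   # stands in for "< 1500 ligands per target"
shortlist = keep_top_n(after_refinement, 1)       # stands in for "< 100 ligands per target"

print(list(after_initial))
print(list(after_refinement))
print(list(shortlist))
```

Applying the cheap 20-run screening to everything and the costlier 50-run screening only to the survivors keeps the total CPU cost close to that of the first pass.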
Common database
• Chemical properties to better annotate the compounds
• Results essential for further analysis are extracted and stored in a result database
• Database access through AMGA
– for access control
– for data replication
Proposal of the 2nd data challenge
• Proposed plan:
– Testing phase: May, 2007
– Official launch: June, 2007
• Biology goals
– Further analysis on the effect of the mutations
– Further analysis on the open conformation of NA
• Grid goals
– Improving the service usability
– Enabling the refinement pipeline
– Reducing researchers’ effort in data analysis
Credit
• Docking workflow preparation
– Contact point: Y.T. Wu
– E. Rovida
– P. D'Ursi
– N. Jacq
• Grid resource management
– Contact point: J. Salzemann
– TWGrid: H.C. Lee, H. Y. Chen
– AuverGrid: E. Medernach
– EGEE: Y. Legré
• Platform deployment on the Grid
– Contact point: H.C. Lee, J. Salzemann
– M. Reichstadt
– N. Jacq
• Users (deputy)
– J. Salzemann (N. Jacq)
– M. Reichstadt (E. Medernach)
– L. Y. Ho (H. C. Lee)
– I. Merelli, C. Arlandini (L. Milanesi)
– J. Montagnat (T. Glatard)
– R. Mollon (C. Blanchet)
– I. Blanque (D. Segrelles)
– D. Garcia
Mini Workshop
• Tomorrow afternoon from 2 pm in Conference Room 4
• Discussion of collaboration issues for the 2nd
avian flu data challenge
DIANE Directory Service
• Improving the scalability of the DIANE framework
• The Directory Service is a server containing a list of all the masters
• Each master registers itself with the Directory Service
• Workers obtain a master through the Directory Service
• The Directory Service has an algorithm for load balancing of the workers and prioritization of the masters
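The registration and lookup steps can be sketched as a small in-memory registry. The slide does not describe the actual balancing algorithm, so the policy below is an assumption chosen for illustration: each new worker goes to the master with the fewest workers per unit of priority. The class and method names are hypothetical and are not the DIANE interface.

```python
class DirectoryService:
    """Hypothetical registry: masters register, workers ask for a master."""

    def __init__(self):
        self._masters = {}  # name -> {"priority": int, "workers": int}

    def register_master(self, name, priority=1):
        if priority < 1:
            raise ValueError("priority must be at least 1")
        self._masters[name] = {"priority": priority, "workers": 0}

    def assign_worker(self):
        """Return the master with the lowest workers/priority ratio.

        Ties go to the higher-priority master, then to the lower name.
        """
        if not self._masters:
            raise LookupError("no master registered")

        def load_key(name):
            info = self._masters[name]
            return (info["workers"] / info["priority"], -info["priority"], name)

        chosen = min(self._masters, key=load_key)
        self._masters[chosen]["workers"] += 1
        return chosen

    def worker_counts(self):
        return {name: info["workers"] for name, info in self._masters.items()}


directory = DirectoryService()
directory.register_master("master_A", priority=2)
directory.register_master("master_B", priority=1)

assignments = [directory.assign_worker() for _ in range(6)]
counts = directory.worker_counts()
print(assignments)
print(counts)
```

With this policy a master with twice the priority ends up with roughly twice the workers, while no registered master is left without any.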