ECCROct18-06 - Digital Science Center


Overview of Chemical Informatics and
Cyberinfrastructure Collaboratory
October 18 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
[email protected]
http://www.infomall.org
http://www.chembiogrid.org
1
Activities



Local Teams, successful Prototypes and International
Collaboration set up in 3 initial major focus areas
• Chemical Informatics Cyberinfrastructure/Grids with services,
workflows and demonstration uses building on success in other
applications (LEAD) and showing distributed integration of
academic and commercial tools
• Computational Chemistry Cyberinfrastructure/Grids with
simulation, databases and TeraGrid use
• Education with courses and degrees
Review of activities suggests we also formalize work in two further areas
• Chemical Informatics Research – model applicability and datamining
• Interfacing with the User - interaction tools and portal optimized for
particular customer groups
Also have started an activity to identify “customers” for
Cyberinfrastructure and its implied Chemistry eScience model
2
CICC Senior Personnel








Geoffrey C. Fox
Mu-Hyun (Mookie) Baik
Dennis B. Gannon
Marlon Pierce
Beth A. Plale
Gary D. Wiggins
David J. Wild
Yuqing (Melanie) Wu











From Biology, Chemistry,
Computer Science, Informatics
at IU Bloomington and
IUPUI (Indianapolis)


Peter T. Cherbas
Mehmet M. Dalkilic
Charles H. Davis
A. Keith Dunker
Kelsey M. Forsythe
Kevin E. Gilbert
John C. Huffman
Malika Mahoui
Daniel J. Mindiola
Santiago D. Schnell
William Scott
Craig A. Stewart
David R. Williams
3
CICC Infrastructure Vision






Drug Discovery and other academic chemistry and pharmacology
research will be aided by powerful modern information technology
ChemBioGrid set up as distributed cyberinfrastructure in eScience model
ChemBioGrid will provide portals (user interfaces) to distributed
databases, results of high throughput screening instruments, results of
computational chemical simulations and other analyses
ChemBioGrid will provide services to manipulate this data and combine in
workflows; it will have convenient ways to submit and manage multiple
jobs
ChemBioGrid will include access to PubChem, PubMed, PubMed Central,
the Internet and its derivatives like Microsoft Academic Live and Google
Scholar
The services include open-source software like CDK, commercial code from
vendors such as BCI, OpenEye, Gaussian and Google, and any
user-contributed programs
ChemBioGrid will define open interfaces to use for a particular type of
service allowing plug and play choice between different implementations
4
CICC
Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
CICC
CICC Combines Grid Computing with Chemical Informatics
Large Scale Computing Challenges
Chemical Informatics is a non-traditional area of high
performance computing, but many new, challenging
problems may be investigated.
NIH PubMed DataBase
Chemical informatics text analysis programs can process 100,000’s of
abstracts of online journal articles to extract chemical signatures of
potential drugs.
OSCAR Text Analysis
Initial 3D Structure Calculation
Molecular Mechanics Calculations
Cluster Grouping
Toxicity Filtering
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data deluge
of publicly available data on potential new drugs.
Docking
OSCAR-mined molecular signatures can
be clustered, filtered for toxicity, and
docked onto larger proteins. These are
classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further
examined for drug potential.
Quantum Mechanics Calculations
NIH PubChem DataBase
POVRay Parallel Rendering
IU’s Varuna DataBase
Big Red (and the TeraGrid) will also enable us to perform
time-consuming, multi-step Quantum Chemistry calculations on all of
PubMed. Results go back to public databases that are freely
accessible by the scientific community.
CICC supports the NIH mission by combining state of
the art chemical informatics techniques with
• World class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists
world wide
5
Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories
CICC Prototype Web Services
Basic cheminformatics
Molecular weights
Molecular formulae
Tanimoto similarity
2D Structure diagrams
Molecular descriptors
3D structures
InChI generation/search
CMLRSS
R and Excel
Application based services
Compare (NIH)
Toxicity predictions (ToxTree)
Literature extraction (OSCAR3)
Clustering (BCI Toolkit)
Docking, filtering, ... (OpenEye)
Varuna simulation
Key Ideas
• Add value to PubChem with additional distributed services
and databases
• Develop nifty ideas like VOTables
• Wrapping existing code in web services is not difficult
• Provide “core” (CDK) services and exemplars of typical tools
• Provide access to key databases via a web service interface
• Provide access to major Compute Grids
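The claim that wrapping existing code in a web service is not difficult can be illustrated with a minimal sketch. It is not the CICC implementation: the slides describe WSDL-based services built on CDK, whereas this example swaps in a plain HTTP/JSON endpoint from the Python standard library. The function `molecular_weight`, the handler `WeightHandler`, the helper `serve`, the `formula` query parameter and the small mass table are all hypothetical names chosen for illustration.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Average atomic masses (g/mol) for a few common elements; illustrative subset only.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "S": 32.06, "P": 30.974, "Cl": 35.45}


def molecular_weight(formula):
    """The 'existing code': sum atomic masses for a flat formula such as 'C6H12O6'."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject anything the simple pattern did not consume completely (brackets, charges, ...).
    if not tokens or "".join(sym + n for sym, n in tokens) != formula:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return sum(ATOMIC_MASS[sym] * int(n or 1) for sym, n in tokens)


class WeightHandler(BaseHTTPRequestHandler):
    """The wrapper: answers GET /?formula=C6H12O6 with a JSON document."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        formula = query.get("formula", [""])[0]
        try:
            body = {"formula": formula, "mw": round(molecular_weight(formula), 3)}
            status = 200
        except (ValueError, KeyError):  # unparsable formula or unknown element
            body, status = {"error": "unsupported formula"}, 400
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the service until interrupted (port number is an arbitrary choice)."""
    HTTPServer(("localhost", port), WeightHandler).serve_forever()
```

Calling `serve()` and requesting `http://localhost:8080/?formula=H2O` would return the computed weight as JSON. The calculation itself stays an ordinary function, which is the point of the slide: the service layer is a thin wrapper around code that already exists.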

Next steps?
• Define WSDL interfaces to enable global production of
compatible Web services; refine CML
• Add more services (identify gaps)
• Add more databases, including 3D structural info
• Demonstrate use of services in other pipelining tools (KDE,
Knime – Pipeline Pilot already done)
• Extend Computational Chemistry (Varuna) Services
• Routine TeraGrid and Big Red use
• “Production” on OSCAR3, CDK, Gamess, Jaguar
• Develop more training material

Web Service Locations
Indiana University
• Clustering
• VOTables
• OSCAR3
• Toxicity classification
• Database services
Cambridge University
• InChI generation / search
• CMLRSS
• OpenBabel
SDSC
• Typical TeraGrid Site
InfoChem
• SPRESI database
NIH
• PubChem …..
• Compare …..
Penn State University (now moved to IU)
• CDK based services
• Fingerprints
• Similarity calculations
• 2D structure diagrams
• Molecular descriptors
Cheminformatics Education at IU






Linked to bioinformatics in Indiana University’s School of Informatics
• School of Informatics degree programs BS, MS, PhD
Programs offered at both the Indianapolis (IUPUI) and Bloomington
(IUB) campuses
• Bioinformatics MS and track on PhD
• Chemical Informatics MS and track on PhD
• Informatics Undergraduates can choose a chemistry cognate (change
to Life Sciences)
PhD in Informatics started in August 2005 and offers tracks in
• bioinformatics; chemical informatics; health informatics; human-computer
interaction design; social and organizational informatics;
more to come!
Good employer interest but modest student understanding of value of
Cheminformatics degree
3 core courses in Cheminformatics plus seminar/independent studies
Significant interest in distance education version of introductory
Cheminformatics course (enrollment promising in Distance Graduate
Certificate in Chemical Informatics)
8
Current Status












Web site http://www.chembiogrid.org
Wiki chosen to support project as a shared editable web space
Building Collaboratory involving PubChem – Global Information System
accessible anywhere and at any time – enhance PubChem with distributed
tools (clustering, simulation, annotation etc.) and data
Adopted Taverna as workflow system because it is popular in Bioinformatics,
but we will evaluate other systems such as GPEL from LEAD
Demonstrated CI-enhanced Chemistry simulations
Initiated Data-mining, User interface and Chemical Informatics tools research
Prototyped large set of runs on local Big Red 23 Teraflop supercomputer
(OSCAR3 and modeling moving to CDK Gamess Jaguar)
Initial results discussed at conferences/workshops/papers
• Gordon Conferences, ACS, SDSC tutorial
First new Cheminformatics courses offered
Advisory board set up and met – this is second meeting
Videoconferencing-based meetings with Peter Murray-Rust and group at
Cambridge roughly every 2-3 weeks
Good or potentially good interactions with Local HTS in CGB, NIH DTP,
Scripps, Lilly and Michigan ECCR
9
MLSCN Post-HTS Biology Decision Support
Percent Inhibition or IC50 data is retrieved from HTS
Question: Was this screen successful?
• Workflows encoding plate & control well statistics, distribution analysis, etc
Question: What should the active/inactive cutoffs be?
• Workflows encoding distribution analysis of screening results
Question: What can we learn about the target protein or cell line from this screen?
• Workflows encoding statistical comparison of results to similar screens,
docking of compounds into proteins to correlate binding with activity,
literature search of active compounds, etc
Compounds submitted to PubChem
Diagram labels: PROCESS, CHEMINFORMATICS, GRIDS
Grids can link data analysis (e.g. image processing developed in
existing Grids), traditional Cheminformatics tools, as well as
annotation tools (Semantic Web, del.icio.us) and enhance
lead ID and SAR analysis
A Grid of Grids linking collections of services at
• PubChem
• ECCR centers
• MLSCN centers
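The slide does not say which plate and control well statistics the workflows encode. As a sketch under that caveat, one widely used screen-quality measure is the Z'-factor, Z' = 1 - 3(sd_pos + sd_neg) / |mean_pos - mean_neg|, together with percent inhibition normalized against the controls. The function names and the convention that negative controls mean 0% inhibition and positive controls mean 100% are assumptions made for this example.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor of a plate from its control wells (values near 1 indicate good separation)."""
    mu_p, mu_n = mean(positive_controls), mean(negative_controls)
    sd_p, sd_n = stdev(positive_controls), stdev(negative_controls)
    if mu_p == mu_n:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (sd_p + sd_n) / abs(mu_p - mu_n)


def percent_inhibition(signal, mu_p, mu_n):
    """Scale a raw well signal so the negative-control mean maps to 0 and the positive-control mean to 100.

    Assumed convention: negative controls are uninhibited, positive controls fully inhibited.
    """
    return 100.0 * (mu_n - signal) / (mu_n - mu_p)


# Hypothetical control readings for one plate.
positive = [10, 12, 11, 9]
negative = [100, 98, 102, 100]
quality = z_prime(positive, negative)
halfway = percent_inhibition(55.25, mean(positive), mean(negative))
```

A workflow step could compute `quality` per plate to answer "Was this screen successful?" and then feed the distribution of percent-inhibition values into whatever rule is chosen for the active/inactive cutoff.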
10
Example HTS workflow: finding cell-protein relationships
• A protein implicated in tumor growth with known ligand is selected
(in this case HSP90 taken from the PDB 1Y4 complex)
• The screening data from a cellular HTS assay is similarity searched for
compounds with similar 2D structures to the ligand.
• Similar structures to the ligand can be browsed using client portlets.
• Similar structures are filtered for drugability, are converted to 3D, and are
automatically passed to the OpenEye FRED docking program for docking into
the target protein.
• Once docking is complete, the user visualizes the high-scoring docked
structures in a portlet using the JMOL applet.
• Docking results and activity patterns fed into R services for building of
activity models and correlations (Least Squares Regression, Random Forests,
Neural Nets)
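The 2D similarity search step in this workflow, and the Tanimoto similarity service listed among the prototypes, can be sketched as follows. This is an illustration only: fingerprints are represented as plain Python sets of "on" bit positions, the compound names and the 0.7 threshold are hypothetical, and in the real services the fingerprints would come from a toolkit such as CDK.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints for a ligand and three screened compounds.
ligand = {1, 2, 3, 4}
screened = {
    "cmpd_a": {1, 2, 3, 4, 5},
    "cmpd_b": {1, 2, 9},
    "cmpd_c": {1, 2, 3, 4},
}
hits = similarity_search(ligand, screened)
```

Here `hits` contains `cmpd_c` (score 1.0) followed by `cmpd_a` (score 0.8), while `cmpd_b` (score 0.4) falls below the threshold. In the workflow described above, the surviving compounds would then go on to filtering, 3D conversion and docking.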
11
Varuna environment for molecular modeling (Baik, IU)
Researcher
Chemical Concepts
Papers etc.
ChemBioGrid
Experiments
Reaction DB
QM Database
PubChem, PDB, NCI, etc.
DB Service: Queries, Clustering, Curation, etc.
QM/MM Database
Simulation Service: FORTRAN Code, Scripts
Condor
TeraGrid
Supercomputers
“Flocks”
12
Methods Development at the CICC








Tagging methods for web-based annotation exploiting del.icio.us
and Connotea
Development of QSAR model interpretability and applicability
methods
RNN-Profiles for exploration of chemical spaces
VisualiSAR - SAR through visual analysis
 See http://www.daylight.com/meetings/mug99/Wild/Mug99.html
Visual Similarity Matrices for High Volume Datasets
 See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php
Fast, accurate clustering using parallel Divisive K-means
Mapping of Natural Language queries to use cases and workflows
Advanced data mining models for drug discovery information
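The slide names parallel Divisive K-means without giving details. The following is a serial, pure-Python sketch of the general divisive (bisecting) idea only, not the CICC method: start with one cluster and repeatedly split the cluster with the largest within-cluster squared error using 2-means. The seeding rule and all function names are assumptions made for this example.

```python
def _dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def _sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = _centroid(points)
    return sum(_dist2(p, c) for p in points)


def two_means(points, iterations=20):
    """Split points into two groups with a deterministically seeded 2-means."""
    # Assumed seeding: the point farthest from the centroid, then the point farthest from that seed.
    c = _centroid(points)
    seed_a = max(points, key=lambda p: _dist2(p, c))
    seed_b = max(points, key=lambda p: _dist2(p, seed_a))
    centers = [seed_a, seed_b]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearer = 0 if _dist2(p, centers[0]) <= _dist2(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        new_centers = [_centroid(g) for g in groups]
        if new_centers == centers:
            break  # assignments have converged
        centers = new_centers
    return groups


def divisive_kmeans(points, k):
    """Bisect the highest-error cluster until k clusters exist or no split is possible."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(range(len(clusters)), key=lambda i: _sse(clusters[i]))
        left, right = two_means(clusters[worst])
        if not left or not right:
            break
        clusters[worst:worst + 1] = [left, right]
    return clusters


# Hypothetical 2D descriptor vectors forming three visible groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
clusters = divisive_kmeans(data, 3)
```

Each bisection only touches one cluster, so the assignment loop inside `two_means` and the independent splits of different clusters are natural places to parallelize, which is presumably what a parallel version would exploit.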
13
Structure of Proposal


a) Define audience that we are targeting
b) Cyberinfrastructure Framework with Key services: Registry, Computing, portal, workflow
• Exemplar Chemoinformatics Services
• Exemplar workflows using services
• WSDL defined for key cases to allow others to
contribute
• Tutorial



c) Education
d) IT/Cyber-enhanced Computational Chemistry
e) Cheminformatics Research
• Systems
• Tools and Modeling
14
Questions



We expect to respond to “big” NIH RFP in about 4 months
Should we partner with Michigan?
Who is “customer” and how do we get more?
• Do/Should chemists want our or more generally NIH’s product?
• Interactions with “large” and “small” industry




What is the balance between infrastructure, computational
chemistry, Cheminformatics tools and research, chemical
informatics systems and interfaces?
Should we stress literature (OSCAR3) project?
Balance of applications and generic capabilities?
How should we structure education component?
• Field does not have strong student appeal compared to Bioinformatics

We are strong in Computer Sciences
(Grids/Cyberinfrastructure) but it is doubtful there will be any CS reviewers
• We are strong in Cheminformatics systems, but it is not clear this is a
recognized activity; how do we justify the claim that
Grids/Cyberinfrastructure/Open Access are “good”?

Should we link more with biology?
15
Covering our bases: Who are our “Customers”?
INDIANA-MICHIGAN Chemical Informatics Center
Cyberinfrastructure
Webservices, Workflows, HTS-Tools, new DBs
Rest of
the World
NIH
Lilly
"Classical Chemical Informatics" - Contents
Structure-Based Drug Design; Generation, Curation
and Refinement of Protein-Ligand Interactions;
Docking, Homology Modeling, QSAR
"New Areas to Conquer"
Chemical Literature Processing; Cellular
Pharmacokinetics; Traditional Chemical Research
fields that were so far not reached by Informatics
Cheminfo-Aware
Science Community
Cheminfo-Ignorant
Science Community
16
What do we need to conquer the traditional chemical Research Community?
Chemist
- only interest in a small subset.
- want more DATA on this small set.
Computational Tools,
In-house DB's
PubChem; other DB's
- High-Fidelity Structural Data, Redox Potentials, Spectroscopy, Transition
State Structures, Energies, Molecular Orbitals…..
17
“Departments” of the future Center
Infrastructure/Technology Developers and Providers
• Computer Science: Develop scalable, robust and efficient
Containers & Cyberinfrastructure
• Informatics: Develop new services, data structures,
algorithms, tools
Application Scientists (Customers)
• Medicinal Chemistry: Develop new models, produce new
scientific concepts, new methods
• Chemistry: Conquer new fields, increase the
information content
Build Cyberinfrastructure, design databases, workflow, support Web
services with interface standards, wrap codes as services;
support infrastructure
Core group develops requirements for infrastructure and codes as
services and tests infrastructure with key exemplar projects.
Allow broad use by all
18