ECCR_IU_Mar15-07 - Digital Science Center
Overview of Chemical Informatics and
Cyberinfrastructure Collaboratory
March 15 2007
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
[email protected]
http://www.infomall.org
http://www.chembiogrid.org
Indiana University Summary
Indiana University is focusing on two major areas:
• Creating a comprehensive, easily accessible infrastructure for
chemoinformatics tools and data sources, linked with PubChem and
made available as web services, and partnering with screening centers and
other users to demonstrate how this infrastructure can be usefully applied
– Infrastructure can include any tools, not just ours (commercial/open source,
chemoinformatics, bioinformatics, and so on)
– New, custom applications can be built quickly using existing services in a
similar way to Google Maps and other “web 2.0” resources
• Being a central hub of chemoinformatics education, including offering
distance courses on chemoinformatics theory and techniques, practical
workshops on using chemoinformatics resources, and freely available
web-based educational resources
– We currently offer a Ph.D., M.S., and graduate certificate (distance) in chemical
informatics
– Distance education program allows you to “pick and choose” courses to meet
educational needs: certificate is awarded on completion of four courses
CICC Senior Personnel
Geoffrey C. Fox
Mu-Hyun (Mookie) Baik
Dennis B. Gannon
Kevin E. Gilbert
Rajarshi Guha
Marlon Pierce
Beth A. Plale
Gary D. Wiggins
David J. Wild
Yuqing (Melanie) Wu
From Biology, Chemistry,
Computer Science, Informatics
at IU Bloomington and
IUPUI (Indianapolis)
Peter T. Cherbas
Mehmet M. Dalkilic
Charles H. Davis
A. Keith Dunker
Kelsey M. Forsythe
John C. Huffman
Malika Mahoui
Daniel J. Mindiola
Santiago D. Schnell
William Scott
Craig A. Stewart
David R. Williams
CICC
Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
CICC Combines Grid Computing with Chemical Informatics
Large Scale Computing Challenges
Chemical Informatics is a non-traditional area of high
performance computing, but many new, challenging
problems may be investigated.
NIH PubMed Database
Chemical informatics text analysis programs can process 100,000s of
abstracts of online journal articles to extract chemical signatures of
potential drugs.
OSCAR Text Analysis
Initial 3D Structure Calculation
Molecular Mechanics Calculations
Cluster Grouping
Toxicity Filtering
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data deluge
of publicly available data on potential new drugs.
Docking
OSCAR-mined molecular signatures can
be clustered, filtered for toxicity, and
docked onto larger proteins. These are
classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further
examined for drug potential.
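Because each molecule is handled independently, these steps need no communication between tasks. A minimal Python sketch of that pattern follows. The function score_molecule is a hypothetical stand-in for a real docking or filtering call, not part of any CICC service, and the SMILES strings are arbitrary examples.

```python
from concurrent.futures import ThreadPoolExecutor


def score_molecule(smiles: str) -> float:
    """Hypothetical stand-in for an expensive per-molecule task (e.g. docking).

    It simply counts the letters in the SMILES string so the example runs anywhere.
    """
    return float(sum(1 for ch in smiles if ch.isalpha()))


def rank_molecules(smiles_list, workers=4, top_n=3):
    """Score every molecule independently, then return the top_n by score."""
    # Each call touches only its own input, so tasks can run in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score_molecule, smiles_list))
    ranked = sorted(zip(smiles_list, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    library = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]
    for smiles, score in rank_molecules(library):
        print(smiles, score)
```

A thread pool keeps the sketch portable. For genuinely CPU-bound scoring, the same map-then-rank structure would more likely be run through a process pool or a cluster batch scheduler.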
Quantum Mechanics Calculations
NIH PubChem Database
POVRay Parallel Rendering
IU’s Varuna Database
Big Red (and the TeraGrid) will
also enable us to perform
time-consuming, multi-step
Quantum Chemistry
calculations on all of PubMed.
Results go back to public
databases that are freely
accessible by the scientific
community.
CICC supports the NIH mission by combining state of
the art chemical informatics techniques with
• World class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists
world wide
Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories
CICC Web Service Infrastructure
Cheminformatics Services
Core functionality: Fingerprints, Similarity, Descriptors, 2D diagrams, File format conversion
Applications: Docking, Filtering, Druglikeness, Toxicity predictions, Mutagenicity predictions, Anti-cancer activity predictions, Pharmacokinetic parameters
Statistics Services
Computation functionality: Regression, Classification, Clustering, Sampling distributions
Applications: Predictive models, Feature selection, 2D plots, Arbitrary R code (PkCell)
Database Services
3D structures by CID, SMARTS, 3D Similarity
Docking scores/poses by CID, SMARTS, Protein, Docking scores
PubChem related data by CID, SMARTS
OSCAR Document Analysis
InChI Generation/Search
Computational Chemistry (Gamess, Jaguar etc.)
Grid Services
Varuna.net Quantum Chemistry
Service Registry
Job Submission and Management: Local Clusters, IU Big Red, TeraGrid, Open Science Grid
Portal Services
RSS Feeds
User Profiles
Collaboration as in Sakai
Web Service Locations
Cambridge University
InChI generation / search
CMLRSS
OpenBabel
Indiana University
Clustering
VOTables
OSCAR3
Toxicity classification
Database services
Statistics services
VCC Laboratory
ALogPS
NCI
CSLS
University of Cologne
NMRShiftDB
Where Does The Functionality Come From?
University of Michigan
PkCell
gNova Consulting
Digital Chemistry
BCI fingerprints
DivKMeans
Cambridge University
InChI generation / search
OSCAR
NIH
PubChem
PubMed
CDK
Cheminformatics
European Chemicals Bureau
ToxTree toxicity predictions
OpenEye
Docking
Indiana University
VOTables
NCI DTP predictions
Database services
R Foundation
R package
CICC Infrastructure Vision
Drug Discovery and other academic chemistry and pharmacology
research will be aided by powerful modern information technology
ChemBioGrid set up as distributed cyberinfrastructure in eScience model
ChemBioGrid will provide portals (user interfaces) to distributed
databases, results of high throughput screening instruments, results of
computational chemical simulations and other analyses
ChemBioGrid will provide services to manipulate this data and combine in
workflows; it will have convenient ways to submit and manage multiple
jobs
ChemBioGrid will include access to PubChem, PubMed, PubMed Central,
the Internet and its derivatives like Microsoft Academic Live and Google
Scholar
The services include open-source software like CDK, commercial code from
vendors such as BCI, OpenEye, Gaussian and Google, and any
user-contributed programs
ChemBioGrid will define open interfaces to use for a particular type of
service allowing plug and play choice between different implementations
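The plug-and-play idea can be illustrated with a small Python sketch. Everything named here is hypothetical: DescriptorService is an invented interface, and LetterCount and DigitCount are toy implementations that stand in for real descriptor services. The point is only that client code written against the interface works unchanged with either implementation.

```python
from typing import Protocol


class DescriptorService(Protocol):
    """Hypothetical open interface: any descriptor service maps SMILES to a number."""

    def compute(self, smiles: str) -> float: ...


class LetterCount:
    """Toy implementation: number of letters in the SMILES string."""

    def compute(self, smiles: str) -> float:
        return float(sum(1 for ch in smiles if ch.isalpha()))


class DigitCount:
    """Toy implementation: number of digits in the SMILES string."""

    def compute(self, smiles: str) -> float:
        return float(sum(1 for ch in smiles if ch.isdigit()))


def profile(smiles_list, service: DescriptorService):
    """Client code depends only on the interface, not on a specific service."""
    return {smiles: service.compute(smiles) for smiles in smiles_list}


if __name__ == "__main__":
    compounds = ["CCO", "c1ccccc1"]
    for implementation in (LetterCount(), DigitCount()):
        print(type(implementation).__name__, profile(compounds, implementation))
```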
Cheminformatics Education at IU
Linked to bioinformatics in Indiana University’s School of Informatics
• School of Informatics degree programs BS, MS, PhD
Programs offered at both the Indianapolis (IUPUI) and Bloomington
(IUB) campuses
• Bioinformatics MS and track on PhD
• Chemical Informatics MS and track on PhD
• Informatics Undergraduates can choose a chemistry cognate (change
to Life Sciences)
PhD in Informatics started in August 2005 and offers tracks in
• bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics;
more to come!
Good employer interest but modest student understanding of value of
Cheminformatics degree
3 core courses in Cheminformatics plus seminar/independent studies
Significant interest in distance education version of introductory
Cheminformatics course (enrollment promising in Distance Graduate
Certificate in Chemical Informatics)
Example: Spreading chemoinformatics education with CIC courseshare
• We have partnered with the University of Michigan to offer our introductory
chemoinformatics (I571) course concurrently at Indiana University and the
University of Michigan as a CIC courseshare, so UM pharmacy, chemistry and
engineering students can be trained in chemoinformatics techniques for
course credit at UM
• In addition, individual students in academia, government, and small and
large life science companies have taken the class remotely from all over the
country for credit towards the graduate certificate
• Uses a mixture of web conferencing (Breeze), videoconferencing, and online
resources for maximum flexibility
– Minimally all that is required is a telephone and internet-connected PC
– Students can replay any of the classes using just a regular PC
– Most recent course wiki is available at
http://cheminfo.informatics.indiana.edu/djwild/I571_2006_wiki
Giving a class remotely to UM students with video and web conferencing
MLSCN Post-HTS Biology Decision Support
Percent Inhibition or
IC50 data is retrieved
from HTS
Question: Was this
screen successful?
Workflows encoding plate
& control well statistics,
distribution analysis, etc
Question: What should the
active/inactive cutoffs be?
Workflows encoding
distribution analysis of
screening results
Question: What can we learn
about the target protein or cell
line from this screen?
Workflows encoding statistical comparison of results to similar screens,
docking of compounds into proteins to correlate binding with activity,
literature search of active compounds, etc
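The first two questions are often answered with simple control-well statistics. The Python sketch below computes the Z'-factor, a widely used plate quality measure, and an activity cutoff set at the negative-control mean plus three standard deviations. Both are common conventions offered here as assumptions: the slides do not say which statistics the CICC workflows encode, and the well readings are made up for illustration.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive_controls) - mean(negative_controls))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    spread = stdev(positive_controls) + stdev(negative_controls)
    return 1.0 - 3.0 * spread / separation


def activity_cutoff(negative_controls, n_sd=3.0):
    """Percent-inhibition threshold: negative-control mean plus n_sd sample SDs."""
    return mean(negative_controls) + n_sd * stdev(negative_controls)


if __name__ == "__main__":
    # Illustrative percent-inhibition readings, not real screening data.
    pos = [95.0, 97.0, 96.0, 98.0]
    neg = [2.0, 4.0, 3.0, 5.0]
    print("Z' =", round(z_prime(pos, neg), 3))
    print("cutoff =", round(activity_cutoff(neg), 2))
```

A Z' above roughly 0.5 is conventionally read as a well-separated assay. Compounds whose percent inhibition exceeds the cutoff would be flagged as active under this convention.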
Compounds submitted to
PubChem
PROCESS
CHEMINFORMATICS
Grids can link data analysis (e.g. image processing developed in existing
Grids), traditional Cheminformatics tools, as well as annotation tools
(Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
A Grid of Grids linking
collections of services at
PubChem
ECCR centers
MLSCN centers
GRIDS
Example HTS workflow: finding cell-protein relationships
A protein implicated in tumor growth with known ligand is selected (in this
case HSP90 taken from the PDB 1Y4 complex)
The screening data from a cellular HTS assay is similarity searched for
compounds with similar 2D structures to the ligand.
Similar structures to the ligand can be browsed using client portlets.
Similar structures are filtered for drugability, are converted to 3D, and are
automatically passed to the OpenEye FRED docking program for docking into
the target protein.
Once docking is complete, the user visualizes the high-scoring docked
structures in a portlet using the JMOL applet.
Docking results and activity patterns are fed into R services for building of
activity models and correlations
Model types shown: Least Squares Regression, Random Forests, Neural Nets
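The 2D similarity search step in this workflow is typically based on comparing fingerprints. A minimal Python sketch using the Tanimoto coefficient follows, with each fingerprint represented as a set of "on" bit positions. The fingerprints, compound names, and the 0.7 threshold are illustrative assumptions; a real service would compute fingerprints with a toolkit such as CDK.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treated as dissimilar by convention here
    return len(fp_a & fp_b) / union


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above threshold, best first."""
    hits = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    ligand = {1, 4, 7, 9, 12}
    screened = {
        "cmpd_A": {1, 4, 7, 9, 12},  # identical bits
        "cmpd_B": {1, 4, 7, 9},      # 4 shared bits out of 5 in the union
        "cmpd_C": {2, 3, 5},         # no shared bits
    }
    for name, score in similarity_search(ligand, screened):
        print(name, round(score, 2))
```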
Example: PubDock
• Database of approximately 1 million PubChem structures (the most druglike) docked into proteins taken from the PDB
• Available as a web service, so structures can be accessed in your own
programs, or using workflow tools like Pipeline Pilot
• Several interfaces developed, including one based on Chimera (below)
which integrates the database with the PDB to allow browsing of
compounds in different targets, or different compounds in the same target
• Can be used as a tool to help understand molecular basis of activity in
cellular or image based assays
Example: R Statistics applied to PubChem data
• By exposing the R statistical package, and the Chemistry Development Kit
(CDK) toolkit as web services and integrating them with PubChem, we can
quickly and easily perform statistical analysis and virtual screening of
PubChem assay data
• Predictive models for particular screens are exposed as web services, and
can be used either as simple web tools or integrated into other applications
• Example below uses DTP Tumor Cell Line screens - a predictive model
using Random Forests in R makes predictions of probability of activity
across multiple cell lines (available at http://rguha.ath.cx/~rguha/ncidtp/dtp)
Varuna environment for molecular modeling (Baik, IU)
Researcher
Chemical Concepts
Papers etc.
ChemBioGrid
Experiments
Reaction DB
QM Database
PubChem, PDB, NCI, etc.
DB Service: Queries, Clustering, Curation, etc.
QM/MM Database
Simulation Service: FORTRAN Code, Scripts
Condor “Flocks”
TeraGrid
Supercomputers
Methods Development at the CICC
Tagging methods for web-based annotation exploiting del.icio.us
and Connotea
Development of QSAR model interpretability and applicability
methods
RNN-Profiles for exploration of chemical spaces
VisualiSAR - SAR through visual analysis
See http://www.daylight.com/meetings/mug99/Wild/Mug99.html
Visual Similarity Matrices for High Volume Datasets
See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php
Fast, accurate clustering using parallel Divisive K-means
Mapping of Natural Language queries to use cases and workflows
Advanced data mining models for drug discovery information
Physics-based Scoring Algorithms
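Divisive K-means builds clusters top-down by repeatedly splitting one cluster in two with an ordinary 2-means step. The serial Python sketch below illustrates that idea only. It is not the parallel CICC implementation, and the choice to split the largest cluster, the deterministic seeding, and the toy 2D points are all assumptions made for the example.

```python
import math


def _mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def two_means(points, iterations=20):
    """Split points into two groups with Lloyd's algorithm (k = 2).

    Seeds deterministically: the first point and the point farthest from it.
    """
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        new_c0, new_c1 = _mean(left), _mean(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new_c0, new_c1
    return left, right


def divisive_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        largest = clusters.pop(0)
        left, right = two_means(largest)
        if not left or not right:
            clusters.append(largest)  # stop if the largest cluster cannot be split
            break
        clusters.extend([left, right])
    return clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
    for cluster in divisive_kmeans(data, 3):
        print(cluster)
```

Each bisection works on one cluster only, so the separate 2-means steps at a given level could in principle run concurrently.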
What Do You Get in a Web Service?
WSDL for all services available
Collected on a web page
Available in a UDDI repository
Javadocs or plain text descriptions
Source code and associated unit tests
Various client examples
Web pages (via PHP)
Python
Chimera
Web Service Vision
Web services provide a neutral approach to
exposing functionality
You can utilize them in
Workflow tools – Pipeline Pilot, Taverna, XBaya
Desktop clients – Chimera, custom
Web pages
They can be located anywhere
On your desktop
Intranet
Internet
Web Service Vision
Literally anything can be made into a web
service
Libraries
Standalone programs
Commercial code
Open-source code
RSS Feeds
Provide access to DB's via RSS feeds
Feeds include 2D/3D structures in CML
Viewable in Bioclipse, Jmol as well as Sage
etc.
Two feeds currently available
SynSearch – get structures based on full or partial
chemical names
DockSearch – get best N structures for a target
R, CDK & PubChem
Goals
Access cheminformatics from within R
Access PubChem data from within R
rcdk package allows cheminformatics to be done
within R using CDK functionality
rpubchem provides access to PubChem
compound data and bioassay data
Searchable via assay ID, keywords
J. Stat. Soft, 2007, 18(6)
Databases
Most of our databases aim to add value to
PubChem or link into PubChem
3D structures (MMFF94)
We maintain a local mirror for testing, data mining
Searchable by CID, SMARTS, 3D similarity
Docked ligands (FRED)
906K drug-like compounds docked into 7 targets
Will eventually cover ~2000 targets
(Cheminformatics) Algorithm Development
Goals
Focus on interpretability and applicability
Devise novel approaches to clustering problems
Investigate the utility of low dimensional
representations for a variety of problems
Examples
Ensemble feature selection (JCIM, in press)
Cluster counting with R-NN curves (in revision)
Chemical Data Mining
Working on screening data with Scripps, FL
Random forests (modeling & feature selection)
Naïve Bayes (modeling)
Identifying features indicative of toxicity
Domain applicability
NCI DTP Cell line activity predictions
Random forest models for 60 cell lines
All available as
downloadable R models
web services (supply SMILES, get prediction) with
web page clients
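A "supply SMILES, get prediction" service can be called from a few lines of client code. The Python sketch below only builds the request URL. The base address and the smiles parameter name are hypothetical placeholders, not the actual CICC endpoint. The services are described by WSDL, so a real client would normally be generated from the WSDL, and the plain HTTP GET form is an assumption made for brevity. The main point is that SMILES strings must be URL-encoded, since characters such as #, = and + have special meaning in URLs.

```python
from urllib.parse import urlencode

# Hypothetical endpoint used for illustration only.
BASE_URL = "http://example.org/predict"


def build_prediction_url(smiles: str, base_url: str = BASE_URL) -> str:
    """Return a GET URL with the SMILES safely percent-encoded."""
    return f"{base_url}?{urlencode({'smiles': smiles})}"


if __name__ == "__main__":
    # Propyne contains '#', which would otherwise start a URL fragment.
    print(build_prediction_url("CC#C"))
```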