Chemoinformatics & Cyberinfrastructure
Chemoinformatics
David Wild
Bioinformatics Retreat, Feb 2nd, 2007
Indiana University School of
David Wild, Geoffrey Fox, Bioinformatics retreat, February 2007. Page 1
Current state of chemoinformatics research
• What works and what doesn’t
– Fingerprints, clustering and diversity
– QSAR - predictive and descriptive methods, virtual screening
– 3D similarity, pharmacophores & docking
– Visualization, organization and navigation of chemical datasets
• Current buzz areas in chemoinformatics
• How can we use our internal strengths to do something new,
important and impressive?
What works and what doesn’t
• 2D structure and similarity searching well established
– Lots of papers comparing fingerprints for similarity
– Some recent evidence that SciTegic ECFPs are better for recall of actives
• Clustering well established but definite room for improvement
– Traditional methods: Ward’s, K-means, Jarvis-Patrick
– Recently, single-pass similarity cutoff methods used for very fast organization: >0.85 for similar activity, >0.55 for QSAR
– Data mining methods (ROCK, Chameleon, CURE, etc.) unexplored
– Diversity: hot -> cold -> smart
• QSAR - poor relation of academic work to industry usefulness
– Lots of papers: “this method works best on this dataset”
– Random forests appear to work rather well in practice
– Interpretability vs predictive ability
– Predictive methods for LogP, pKa, solubility, etc. work reasonably
– Virtual screening virtually useless unless tied in with the HTS screening process. However, it is useful for exploring around leads.
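A minimal sketch of the single-pass similarity cutoff idea mentioned above, assuming fingerprints are held as sets of "on" bit positions and compared with the Tanimoto coefficient. The function names and the toy fingerprints are illustrative assumptions, not taken from any particular package; only the 0.85 and 0.55 cutoffs come from the slide.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_cluster(fingerprints, cutoff=0.85):
    """Put each fingerprint into the first cluster whose leader is more than
    `cutoff` similar; otherwise start a new cluster with it as leader."""
    leaders = []   # index of the leader of each cluster
    clusters = []  # member indices of each cluster
    for i, fp in enumerate(fingerprints):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, fingerprints[leader]) > cutoff:
                clusters[c].append(i)
                break
        else:
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (hypothetical bit positions, not real molecules).
fps = [
    frozenset(range(0, 20)),     # A
    frozenset(range(0, 19)),     # B: shares 19 of A's 20 bits
    frozenset(range(0, 12)),     # C: shares 12 of A's 20 bits
    frozenset(range(100, 110)),  # D: no bits in common with the others
]
tight = single_pass_cluster(fps, cutoff=0.85)
loose = single_pass_cluster(fps, cutoff=0.55)
print(tight, loose)
```

Each compound is compared only against cluster leaders, so the data is read once; the result depends on input order, which is the usual trade-off for this kind of speed.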
What works and what doesn’t
• Mostly, 3D methods haven’t worked out yet
– Similarity & QSAR - Almost every paper: 2D better for recall and
precision but 3D methods give “interesting ideas”. Useful for “lead
hopping”
– Pharmacophore searching not widely used
– Docking - very useful for visual inspection, poor correlation of scoring
functions with binding
• Visualization, organization and navigation of datasets
– Still not clear how to work with datasets > few hundred compounds
– Dot plots, spreadsheet-based methods work minimally
– Need for UI design and research
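For reference, the recall and precision used to compare 2D and 3D similarity methods can be computed from a ranked hit list as sketched below. The compound identifiers and the set of known actives are made-up placeholders.

```python
def recall_precision(ranked_ids, actives, top_n):
    """Recall and precision of the first `top_n` entries of a ranked list."""
    retrieved = ranked_ids[:top_n]
    hits = sum(1 for cid in retrieved if cid in actives)
    recall = hits / len(actives) if actives else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical similarity-search ranking and known actives.
ranked = ["c1", "c7", "c3", "c9", "c2", "c5"]
known_actives = {"c1", "c3", "c5", "c8"}
r, p = recall_precision(ranked, known_actives, top_n=4)
print(r, p)
```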
The current buzz in chemoinformatics
• Decorporatization and commoditization of data and software
– MLSCN, PubChem, open source, small companies
– Crisis for the software companies, nice for academia
– Pharma companies in the brown stuff without a paddle
• Integration with other “ics”
– Data mining chemical/genomic information
– Linking compounds -> proteins -> pathways, etc (e.g. KEGG)
• Fuzzy boundaries, integration with science and informatics
– Microsoft 2020 vision for science
• Integration of text and structure searching
• Semantic web, services and mashups will probably have a
BIG impact: exporting best of breed… what happens to the
rest?
Suggested collaboration areas
• Chem/bio/complex systems mashups using web services in
each of the areas: nice, confined projects for students once you
have the infrastructure
• Chem and complex can work together on integrating text and
structure-based searching, indexing and crawling (e.g.
networks of web services and databases), and intelligent
agents
• Data mining of chemogenomic information
• Integration of advanced chemoinformatics methods with
systems biology and pathway mapping tools
• Performing research to establish best practices for areas of
chemoinformatics
• Tackling algorithmic problems for which there is currently no
good solution - docking and scoring
Cyberinfrastructure
Geoffrey Fox
Computer Science, Informatics and Physics
Cyberinfrastructure
• Supports distributed science – data, people, computers
• Exploits Internet technology (Web 2.0), adding (via Grid technology) management, security, supercomputers etc.
• It has two aspects: parallel, with low latency (microseconds) between nodes, and distributed, with highish latency (milliseconds) between nodes
• Parallel needed to get high performance on individual 3D simulations, data analysis etc.; must decompose problem
• Distributed aspect integrates already distinct components
• Cyberinfrastructure is in general a distributed collection of parallel systems
• Cyberinfrastructure is made of services (usually Web services) that are “just” programs or data sources packaged for distributed access
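A back-of-envelope sketch of why the microsecond versus millisecond distinction matters for a decomposed problem. It assumes a toy model in which every step costs a fixed compute time plus a number of message exchanges, each paying one latency, and it ignores bandwidth and any overlap of communication with computation. All numbers below are illustrative assumptions.

```python
def step_efficiency(compute_s, messages_per_step, latency_s):
    """Fraction of each step spent computing rather than waiting on messages."""
    communication_s = messages_per_step * latency_s
    return compute_s / (compute_s + communication_s)


# Assumed workload: 1 ms of computation and 10 message exchanges per step.
compute_s = 1e-3
messages = 10

par = step_efficiency(compute_s, messages, latency_s=5e-6)   # ~microseconds
dist = step_efficiency(compute_s, messages, latency_s=5e-3)  # ~milliseconds
print(f"parallel: {par:.3f}, distributed: {dist:.3f}")
```

Under these assumptions the tightly coupled step stays mostly compute-bound on a low-latency machine and becomes almost entirely wait time across a wide-area link, which is why the distributed aspect suits loosely coupled components.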
TeraGrid: Integrating NSF Cyberinfrastructure
[Map of TeraGrid sites: Buffalo, Wisc, UC/ANL, Utah, Cornell, Iowa, PU, NCAR, IU, NCSA, Caltech, PSC, ORNL, USC-ISI, UNC-RENCI, SDSC, TACC]
TeraGrid is a facility that integrates computational, information, and analysis resources at the
San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of
Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications,
Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh
Supercomputing Center, and the National Center for Atmospheric Research.
Today 100 Teraflop; tomorrow a petaflop; Indiana 20 teraflop today.
Cyberinfrastructure at IU
• Interpreted broadly (Web presences), there are many activities at IU
• Interpreted narrowly as the “programmable web” or “using Grid technologies”, there are large projects in atmospheric, earthquake and ice-sheet sciences, network systems, particle physics, crystallography and cheminformatics
• IU has an international reputation in both parallel and distributed Cyberinfrastructure including education, research and resources
• IU has the #31 supercomputer in the world and is part of two major national activities: TeraGrid and Open Science Grid
• There are several well-known Bioinformatics Grids such as BIRN (mainly images) and caBIG (cancer databases) from NIH and MyGrid from the UK (EBI)
• Could be opportunities to link Biology and Informatics/CS in Cyberinfrastructure projects
Cyberinfrastructure motivated by Web 2.0
• Capture the power of interactive Web/Grid sites
• Programmableweb.com: 363 Web 2.0 APIs, enabling people to create, collaborate and build on each other’s work
• Need a similar Life Science portal for tools and data
Web services, workflows, portals and ontologies
• Web Services allow us to quickly develop and deploy new tools,
interfaces that cross disciplines and are broadly accessible
– Can use simple HTTP and ignore Web Service complications
• Workflows (called mashups in Web 2.0) allow us to string
together collections of web services to do computation that is
tailored to the science (as a one-off or for re-use).
– Develop core capabilities as services and use in many different ways as
in 770 Google map mashups
• APIs/Languages/Data structures/Ontologies (WSDL, AJAX,
JSON at low level) allow us to describe workflows and services
in discoverable, standard ways, such that reasoning tools can
piece them together to match queries
• Portals enable composable reusable user interfaces
• Distributed posting of services and easily available composition
tools enable “everybody” to contribute
– Interesting implications for “broader participation”
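The points above about stringing services together and describing them so that tools can piece them together can be sketched as follows. This is an illustration only: the three "services" are local Python functions standing in for remote web services, and the registry layout, type names and toy lookup table are all assumptions made for the example.

```python
from collections import deque


# Hypothetical local stand-ins for remote services.
def name_to_smiles(name):
    lookup = {"ethanol": "CCO", "methanol": "CO"}  # toy synonym table
    return lookup[name]


def smiles_to_atom_count(smiles):
    # Toy "descriptor": counts letters only; not a real SMILES parser.
    return sum(ch.isalpha() for ch in smiles)


def atom_count_to_size_class(count):
    return "small" if count < 10 else "large"


# Each entry declares what the service consumes and produces.
REGISTRY = [
    {"name": "synonym", "input": "name", "output": "smiles",
     "call": name_to_smiles},
    {"name": "descriptor", "input": "smiles", "output": "atom_count",
     "call": smiles_to_atom_count},
    {"name": "classifier", "input": "atom_count", "output": "size_class",
     "call": atom_count_to_size_class},
]


def plan_workflow(registry, have, want):
    """Breadth-first search for the shortest service chain from `have` to `want`."""
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        kind, chain = queue.popleft()
        if kind == want:
            return chain
        for svc in registry:
            if svc["input"] == kind and svc["output"] not in seen:
                seen.add(svc["output"])
                queue.append((svc["output"], chain + [svc]))
    return None  # no chain of registered services answers the query


def run_workflow(chain, value):
    """Feed the output of each service into the next one."""
    for svc in chain:
        value = svc["call"](value)
    return value


chain = plan_workflow(REGISTRY, "name", "size_class")
result = run_workflow(chain, "ethanol")
print([svc["name"] for svc in chain], result)
```

Because each service only declares its input and output, adding a new entry to the registry makes it available to every future query without changing the planner.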
Model and Data Sharing
Cyberinfrastructure requires agreed sharing standards (data
structures, APIs, protocols, ontologies, languages) as it is intrinsically
internationally distributed
There are agreed data structures for taking
Sequence -> Protein -> Folding -> Interaction transparently, e.g. BLAST
Nothing at the level where genomics and proteomics is important: cells
and tissues.
Partial answers: CellML, FieldML, SBML which do not link to
relevant standards outside Biology
Need to connect models at these levels. Need Standard ontologies/data
structures for cell behaviors to allow connections and validation
Need to connect Models like SBW (Systems Biology
Workbench)/BioSpice ->Cell-level models (Compucell) ->Tissue level
models (Physiome)
Model builders at these scales not CS-sophisticated. Models NOT
interoperable and don’t use useful general ideas
Glazier organizing activity in this area with H. Sauro (U. Washington),
W. Li (UCSD-SDSC), Hunter (U. Auckland) and NIH
• Link to Open Grid Forum standard setting and community
activities
http://www.chembiogrid.org
• Database-enabled quantum chemistry computations
• Services to link PubChem, supercomputers, results of high throughput screening centers
• Education; IU has unique Cheminformatics degrees
• Portals
Chemical Informatics web service infrastructure
• Database Services
– Local NIH DTP Human Tumor Cell Line set
– Local PubChem mirror
– Derived properties database
– Pub3D, PubDock
– Synonym service
– VARUNA quantum chemistry database
• Statistics (based on R)
– Regression, Neural Nets, Random Forest
– LDA
– K-means clustering
– Plotting
– T-test and distribution sampling
• Computation Services
– OpenEye FRED, OMEGA, FILTER, …
– Cambridge OSCAR3
– BCI fingerprint generation, Ward’s, Divisive K-means clustering
– Tox Tree
– Similarity & fingerprint calculations (CDK)
– Descriptor calculation (CDK)
– 2D structure diagrams (CDK)
– 2D->3D File format conversions
Workflows - Taverna (taverna.sourceforge.net)
PubDock - Chimera-based interface
Kemo - A ChatBot for PubChem
• Uses ALICE chatbot www.alicebot.org
• AIML used to define knowledge base, e.g. reaction to common phrases like FIND ME, WHAT IS THE LOGP OF, etc
• Can iteratively improve knowledge base
• Accesses PubChem through web service interface
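The phrase-to-action idea can be illustrated with a small Python analogue. This is not AIML and does not use the ALICE engine; it only mimics mapping phrases such as FIND ME and WHAT IS THE LOGP OF onto a lookup. The rule table, `reply` and `stub_service` are hypothetical names, and the stub returns a tag rather than real PubChem data.

```python
import re

# Pattern -> query kind; the phrases mirror those named on the slide.
RULES = [
    (re.compile(r"^WHAT IS THE LOGP OF (?P<compound>.+?)\??$", re.IGNORECASE),
     "logp"),
    (re.compile(r"^FIND ME (?P<compound>.+?)\.?$", re.IGNORECASE),
     "search"),
]


def reply(utterance, service):
    """Answer an utterance by dispatching the first matching rule to `service`."""
    text = utterance.strip()
    for pattern, kind in RULES:
        match = pattern.match(text)
        if match:
            return service(kind, match.group("compound").strip())
    return "Sorry, I do not understand that yet."


def stub_service(kind, compound):
    # Placeholder for a web service call; returns a tag instead of real data.
    return f"[{kind} lookup for {compound}]"


answer = reply("What is the logP of aspirin?", stub_service)
print(answer)
```

Improving the knowledge base iteratively then amounts to appending rules for phrases that currently fall through to the fallback reply.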
Workflow in Xbaya - a meteorology tool!
http://www.extreme.indiana.edu/xgws/xbaya/
Indexing the world’s chemical information
AND computational functionality
• Crawl and index web pages, journal articles, etc. for
– Structures (InChIs, SMILES)
– Images (converted using Clide or ChemReader)
– Names (converted using OSCAR3 or similar package)
– Other information (IR spectra, reactions, etc…)
• Technology still immature, but improving quickly
• Problem with access to journal articles: we will assume open access in
the future!
• Expose computational functionality as web services, contextualize in an
OWL-S ontology (semantics), and publish in a UDDI
• Now we know what information we have, and what we can do with it
• Develop bots and intelligent agents to automatically do useful things
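A rough sketch of the structure-indexing step for the simplest case, InChI strings that appear literally in page text. The regular expression is a heuristic assumption (it takes everything up to the next whitespace and trims trailing sentence punctuation), the example pages and URLs are invented, and image or name conversion is out of scope here.

```python
import re
from collections import defaultdict

# Heuristic: "InChI=1/" or "InChI=1S/" followed by non-space characters.
INCHI_RE = re.compile(r"InChI=1S?/[^\s<>]+")


def extract_inchis(text):
    """Return InChI-like strings found in text, minus trailing punctuation."""
    return [found.rstrip(".,;") for found in INCHI_RE.findall(text)]


def build_index(pages):
    """Map each InChI to the set of page URLs that mention it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for inchi in extract_inchis(text):
            index[inchi].add(url)
    return index


# Invented pages standing in for crawled documents.
pages = {
    "http://example.org/a":
        "Ethanol, InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3, is a solvent.",
    "http://example.org/b":
        "Compare InChI=1S/CH4/h1H4 and InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3.",
}
index = build_index(pages)
for inchi, urls in sorted(index.items()):
    print(inchi, sorted(urls))
```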