Knowledge Discovery and Data Mining in Chemistry
Mopping up the Flood of
Data with Web Services
Gary Wiggins
Indiana University
School of Informatics
Overview of the Talk
Data Mining and Knowledge Discovery
DMKD in Bioinformatics
DMKD in Chemistry
Public Chemistry Databases for DMKD
Overview of Web Services
NIH-funded Projects Underway or Planned at
Indiana University
Educational Opportunities at IU
Data Mining and Knowledge
Discovery (DMKD)
Techniques began to be used around 1989
Rapid growth in the mid 1990s, with
DMKD field emerging around 1995
Built on DM tools such as Machine
Learning
Data Mining
One of the steps in Knowledge Discovery
Concerned with the actual extraction of
knowledge from data
Efficient and scalable methods for mining
interesting patterns and knowledge and
discovering hidden facts contained in large
databases
Data Mining Techniques
Efficient classification methods
Clustering
Outlier analysis
Frequent, sequential, and structured
pattern analysis
Visualization and spatial/temporal analysis
tools
Knowledge Discovery (KD)
“KD is a nontrivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns from
large collections of data.”
--Fayyad et al., as quoted by Cios and Kurgan
The KD process involves:
Understanding and preparation of the data
Data Mining (DM)
Verification and application of the discovered
knowledge
Framework for KD Process
Steps range from very few, e.g.,
Data collection and understanding
Data mining
Implementation
To multi-step models, e.g., Cios and
Kurgan’s six-step DMKD process model
Cios and Kurgan’s Six-Step DMKD
Process Model
Understanding the problem domain
Understanding the data
Preparation of the data
~50% or more of effort spent on this step
Data mining
Evaluation of the discovered knowledge
Using the discovered knowledge
General Data Mining/
Data Analysis Systems
SAS Enterprise Miner
SPSS
Insightful S-Plus
IBM DB2 Intelligent Miner
Microsoft SQLServer 2005
SGI MLC++ and MineSet Tree Visualizer
Inxight VizServer
Trends: Major Conferences
Knowledge Discovery and Data Mining (KDD) 2005
http://www.informatik.uni-trier.de/~ley/db/conf/kdd/kdd2005.html
International Conference on Machine Learning (ICML) 2006
http://www.icml2006.org/icml2006/technical/accepted.html
SIAM Conference on Data Mining 2006
http://www.siam.org/meetings/sdm06/proceedings.htm
12th Annual SIGKDD International Conference on
Knowledge Discovery and Data Mining,
Philadelphia, August 20-23, 2006
Areas of Interest on the Research Track:
Applications of data mining (biomedicine, business, e-commerce, defense)
Data and result visualization
Data warehousing
Data mining for community generation, social network analysis and graph-structured data
Foundations of data mining
Interactive and online data mining
KDD framework and process
Mining data streams
Mining high-dimensional data
Mining sensor data
Mining text and semi-structured data
Mining multi-media data
Novel data mining algorithms
Privacy and data mining
Robust and scalable statistical methods
Pre-processing and post-processing for data mining
Security issues
Spatial and temporal data mining
Trends in DMKD
OLAP (On-Line Analytical Processing)
Data warehousing
Association rules
High Performance DMKD systems
Visualization techniques
Applications of DM
More recently:
Database products that incorporate DM tools
New developments in design and implementation of the DMKD
process
Information visualization products as end-user queries
XML
XML: the Key to DM and KD?
Or simply a data exchange protocol?
Allows for the description and storage of
structured or semi-structured data and
their relationships
Can be used to exchange data in a
platform-independent way
BUT: only one paper at the major
conferences listed earlier dealt with
XML
XML helps:
Standardize communication between diverse
DM tools and databases (I/O procedures)
Build standard data repositories sharing data
between different DM tools that work on different
software platforms
Implement communication protocols between
DM tools
Provide a framework for integration of and
communication between different DMKD steps
Predictive Model Markup Language
(PMML) and Other Tools
In conjunction with XML, PMML enables
the automation of sharing of discovered
knowledge between different domains and
tools
XML-RPC
SOAP (Simple Object Access Protocol)
UDDI
OLAP
OLE DB-DM
Discovery Informatics: Definition
"Discovery Informatics is the study and
practice of employing the full spectrum of
computing and analytical science and
technology to the singular pursuit of
discovering new information by identifying
and validating patterns in data."
--William W. Agresti in 2003
Discovery Informatics
Discovery and Application of Information
Data Mining and Machine Learning are
two aspects of Discovery Informatics.
Trends: Bioinformatics
Conferences
International Conference on Intelligent Systems
for Molecular Biology (ISMB) 2006
http://ismb2006.cbi.cnptia.embrapa.br/papers.html
Research in Computational Molecular Biology
(RECOMB) 2006
http://www.informatik.uni-trier.de/~ley/db/conf/recomb/recomb2006.html
Pacific Symposium on Biocomputing (PSB) 2006
http://helix-web.stanford.edu/psb06/
Main Areas of Research in
Bioinformatics
Sequence alignment
Alternative splicing
Microarray analysis
Functional analysis
Analysis of single nucleotide
polymorphisms (SNPs)
Natural language text analysis
DMKD Sessions at Major
Bioinformatics Conferences
Databases and Data Integration
Text Mining and Information Extraction
Semantic Webs
Data Mining in Bioinformatics
(Bajcsy)
Data cleaning, data preprocessing, and
semantic integration of heterogeneous,
distributed biomedical databases
Existing data mining tools for biodata
analysis
Development of advanced, effective, and
scalable data mining methods in biodata
analysis
Preprocessing of Biodata
Integration of multiple microarray gene
experiments must resolve inconsistent
labels of genes to form a coherent data
store.
Focus on quantitative quality metrics
based on analytical and statistical data
descriptors and on relationships among
variables.
Semantic Integration of
Heterogeneous Biomedical
Databases
Combine multiple sources into a coherent
data store
Find semantically equivalent real-world
entities from several biomedical sources
Problems
Different labels for the same concept: gene_id
vs. g_id
Time asynchronization: same gene analyzed
at multiple development stages
Approaches for Semantic
Integration of Biodata
Construction of integrated biodata
warehouses or biodatabases
Construction of a federation of
heterogeneous distributed biodatabases
Must build up mapping rules or semantic
ambiguity resolution rules across multiple
databases
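The mapping rules mentioned above can be sketched in Python. This is only an illustration under stated assumptions: the field names gene_id and g_id come from the earlier slide, while the source names, the other field names, and the sample values are hypothetical.

```python
# Hypothetical mapping rules: each source's field names are mapped onto one
# shared vocabulary before records are merged into a single store.
FIELD_RULES = {
    "source_a": {"gene_id": "gene_id", "expr": "expression"},
    "source_b": {"g_id": "gene_id", "expression_level": "expression"},
}


def normalize(source, record):
    """Rename a record's fields according to the rules for its source."""
    rules = FIELD_RULES[source]
    return {rules[name]: value for name, value in record.items()}


def integrate(batches):
    """Merge records from several sources, grouping values by gene_id."""
    store = {}
    for source, records in batches:
        for record in records:
            row = normalize(source, record)
            store.setdefault(row["gene_id"], []).append((source, row["expression"]))
    return store


# Made-up sample records that use the two different labels for the same concept.
batches = [
    ("source_a", [{"gene_id": "G1", "expr": 2.5}]),
    ("source_b", [{"g_id": "G1", "expression_level": 2.7}]),
]
merged = integrate(batches)
```

Keeping the rules in a table separate from the merge logic means a new source can be added by writing one more rule set, which is the kind of ambiguity resolution a warehouse or federation would need at much larger scale.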
Existing Data Mining Tools for
Biodata Analysis-I
Sequence Analysis, e.g.,
NCBI/BLAST, ClustalW, HMMER, PHYLIP,
MEME, TRANSFAC, MDScan, Vector NTI,
Sequencher, MacVector
Structure Prediction and Visualization, e.g.,
RasMol, Raster3D, Swiss-Model, Scope,
MolScript, Cn3D
Existing Data Mining Tools for
Biodata Analysis-II
Genome Analysis, e.g.,
CAP3, Paracel GenomeAssembler,
GenomeScan, GeneMark, GenScan, X-Grail,
ORF Finder, GeneBuilder
Pathway Analysis and Visualization, e.g.,
KEGG, EcoCyc/MetaCyc, GenMapp
Microarray Analysis, e.g.,
ScanAlyze/Cluster/TreeView, Scanalytics
MicroArray Suite, Profiler, Silicon Genetics
Biospecific Data Analysis Software
Systems
Agilent GeneSpring
Spotfire
Invitrogen VectorNTI
Text Mining in Bioinformatics
Techniques have progressed from simple
recognition of terms to extraction of
interaction relationships in complex
sentences.
Search objectives have broadened to a
range of problems, e.g.,
Improving homology search
Identifying cellular location
Deriving genetic network technologies
Current Work in Biomedical Text
Mining (Cohen and Hersh)
Text mining operates at a finer level of granularity than
information retrieval and text summarization.
TM examines relationships between specific kinds of
information contained within and between documents.
Areas of active research:
Named entity recognition (genes, proteins, etc.)
Text classification
Synonym and abbreviation extraction
Relationship extraction
Hypothesis generation
Integrated frameworks
Systems Biology
Requires a shift in focus from genes and
proteins to the system’s structure and dynamics
Four key properties:
System structures
System dynamics
Control method
Design method
Systems Biology Markup Language (SBML) and
CellML
iSpecies.org
Data Mining in Chemistry
“Modern experimentation (whether
“classical” or high-throughput) should be
based on the productive interplay of
statistical techniques (design-of-experiments), molecular modeling as well
as cheminformatics.”
--Ulrich S. Schubert
Session on “Integration of Informatics
and Knowledge Management
Informatics”*
Integration of Informatics at the Systems Level and at the Data Level
Chris L. Waller, Ph.D., Director, World Wide Chemistry Informatics, Pfizer Global
Research & Development
Integrated Knowledge Management at Bayer HealthCare: Pharmacophore
Informatics
William J. Scott, Ph.D., Team Leader, Department for Chemistry Research, Bayer
Pharmaceuticals Corporation
Building a Knowledge Enabled Organization
Cory R. Brouwer, Ph.D., Associate Director, Knowledge Management Informatics,
Pfizer Global Research & Development
Knowledge Management: Building a Knowledge Enabled Organization
Victor Lobanov, Ph.D., Principal Scientist, MDI, Johnson & Johnson Pharmaceutical
R&D
*10th Annual Cheminformatics Conference, May 23-16, 2006, Philadelphia
Impact of HTS and Combinatorial
Chemistry Research
Most impact in:
the pharmaceutical industry
medical research
catalyst research
More recently:
polymer and materials research.
Diversity of Data Mining in
Chemistry
On 5/7/2006 there were 4072
references to either
“datamining” or “data mining”
in Chemical Abstracts.
3416 different index terms
were assigned to those
records.
2772 used 1-5 times (81%)
298 used 6-10 times (9%)
103 used 11-15 times (3%)
71 used 16-20 times (2%)
38 used 21-25 times (1%)
24 used 26-30 times (1%)
110 used 31-480 times (3%)
Most frequent co-term:
“bioinformatics” with 480 hits
or 12% of the occurrences
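The percentages on this slide can be checked with a few lines of Python. The counts are copied from the slide; the variable names are only illustrative.

```python
# Counts of index terms per usage range, copied from the slide.
term_counts = {
    "1-5": 2772, "6-10": 298, "11-15": 103, "16-20": 71,
    "21-25": 38, "26-30": 24, "31-480": 110,
}
total_terms = sum(term_counts.values())  # should equal the 3416 index terms

# Share of all index terms that falls in each usage range, in whole percent.
shares = {rng: round(100 * n / total_terms) for rng, n in term_counts.items()}

# Share of the 4072 references that also carry the term "bioinformatics".
total_references = 4072
bioinformatics_hits = 480
co_term_share = round(100 * bioinformatics_hits / total_references)
```

The seven counts add up to the 3416 index terms reported, and the rounded shares match the percentages listed above.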
[Bar chart "SFS graph": percentage of index terms (0% to 90%) in each usage range: 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-480]
Components of the Semantic Web
for Chemistry
XML – eXtensible Markup Language
RDF – Resource Description Framework
RSS – Rich Site Summary
Dublin Core – allows metadata-based
newsfeeds
OWL – for ontologies
BPEL4WS – for workflow and web services
Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.
Chemical Markup Language (CML)
Much of the semantics in a chemical article can
be supported by CML
Molecules
Structures
Reactions and reaction schemes
Spectra (including annotations)
Physicochemical data
XML dictionaries and lexicons provide linguistic
and semantic support for markup
Will lead to quicker authoring and higher quality
of embedded structures and data through
machine validation
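As a rough illustration of machine validation, the sketch below parses a small CML-style molecule with Python's standard library and checks that every bond refers to a declared atom. The fragment is a simplified, hand-written example (water), not output from any CML tool, and the check is an assumed example of validation rather than the validation described in the cited paper.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written, simplified CML fragment for water (illustrative only).
cml_text = """
<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""

root = ET.fromstring(cml_text.strip())
atoms = root.findall("cml:atomArray/cml:atom", CML_NS)
elements = [atom.get("elementType") for atom in atoms]
bonds = [tuple(bond.get("atomRefs2").split())
         for bond in root.findall("cml:bondArray/cml:bond", CML_NS)]

# Simple validation: a bond is "dangling" if it names an atom id that was never declared.
atom_ids = {atom.get("id") for atom in atoms}
dangling = [bond for bond in bonds if not set(bond) <= atom_ids]
```

Because the structure is explicit markup rather than a drawing, a check like this can run automatically when an article is authored.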
Key Factors in the Success of the
Chemical Semantic Web
Institutional Repositories: services
deployed and supported at an institutional
level to offer dissemination management,
stewardship, and where appropriate, long-term
preservation of both the intellectual
work created by an institutional community
and the records of the intellectual and
cultural life of the institutional community
Open Access Movement
Knowledge-Driven Bioinformatics
Enhanced with Chemistry
Text Mining (Banville)
“In the pharmaceutical field, it is ideally the
marriage of biological and chemical information
that needs to be the ultimate focus of text data
mining applications.”
Problems:
Lack of universal publication standards for identifying
each unique chemical entity
Selective indexing policies of A&I services
Need to understand how chemical structures link to
biological processes
OSCAR3 Service
Open-source Java application under
development by Peter Murray-Rust’s group at
Cambridge (not published yet)
Extracts chemical information from either a
paragraph of experimental data or a full paper
(e.g., melting points, infrared and NMR data,
and mass spectral information)
Produces an XML instance highlighting the
chemical information with an Extensible
Stylesheet Language (XSL) file
At IU, we are attaching a SOAP input/output
engine for a web service based on OSCAR3.
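To give a feel for this kind of extraction, here is a minimal Python sketch that pulls melting points out of an experimental paragraph with a regular expression. It is not OSCAR3's method or API; the pattern, the function name, and the sample paragraph are all assumptions made for illustration.

```python
import re

# Simplified pattern for melting points such as "mp 120-122 °C" or "m.p. 85 °C".
# \u2013 is an en dash, which often separates the two ends of a range in papers.
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*(\d+(?:\.\d+)?)(?:\s*[-\u2013]\s*(\d+(?:\.\d+)?))?\s*°\s*C"
)


def extract_melting_points(text):
    """Return (low, high) pairs in degrees Celsius found in the text."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        # A single value is reported as a range whose two ends are equal.
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


# Made-up experimental paragraph.
paragraph = "White solid, mp 120-122 °C; IR (KBr) 1715 cm-1. The oxime had m.p. 85 °C."
melting_points = extract_melting_points(paragraph)
```

A real extractor has to cope with far more variation in notation than one pattern can cover.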
OSCAR at Work in the Future
Semantic Scholars’ Grid I
[Diagram. Components and their labels:]
Sources: PubMed, Science.gov, Google Scholar, e-Prints, DSpace, etc.
Gatherer: fetch MD and documents into the Local Harvest Store
Indexer: index all local MD; query and get list
Analyzer: run a filter such as OSCAR2 on harvested MD and documents; store new MD
Local MD Store
Semantic Scholars’ Grid II
[Diagram. Components and their labels:]
Local MD Store
Foreign sources: ACM, CiteULike, IEEE, Connotea, Del.icio.us, Google Scholar, Wiley, etc.
Plug-in
Updater: synchronize SSG and foreign MD
Community Tools: Instant Citation Index, etc.
SSG Viewer: update local MD; control foreign interactions; view all MD; access Community Tools
Foreign User Interface: update and view foreign MD
Chemical Datamining Software
SureChem
http://surechem.reeltwo.com/
CLiDE
Recognizes structures, reactions, and text
http://www.simbiosys.ca/clide/
OSCAR
“OSCAR1” to check experimental data
• http://www.ch.cam.ac.uk/magnus/checker.html
• http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/
CSR (Chemical Structure Reconstruction)
http://www.scai.fraunhofer.de/uploads/media/MZ-ERCIM05_04.pdf
MDL DocSearch—combines MDL’s Isentris platform and EMC’s
Documentum
ChemDB
http://cdb.ics.uci.edu/CHEM/Web/
ChEBI, Chemical Entities of
Biological Interest
Dictionary of molecular entities focused on
small chemical compounds
Features an ontological classification,
showing the relationships between
molecular entities or classes of entities
and their parents and/or children
Vioxx Entry in ChEBI
The IUPAC International Chemical
Identifier (InChI)
Open source, non-proprietary, public-domain identifier
for chemicals
String of characters that uniquely represents a molecular
substance
Independent of the way the chemical structure is drawn
Enables reliable structure recognition and easy linking of
diverse data compilations
Accepts as input MOLfiles (or SDfiles) and CML files
Download the program to your computer at:
http://www.iupac.org/inchi/license.html
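An InChI string is built from slash-separated layers, which is what makes it easy to link and compare. The Python sketch below splits one into its parts. It assumes a single-component identifier and uses a present-day standard InChI for ethanol as the example; the function name is hypothetical, and generating an InChI from a structure still requires the IUPAC program.

```python
def parse_inchi(inchi):
    """Split an InChI string into its version, formula, and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    # Every layer after the formula starts with a one-letter prefix,
    # e.g. "c" for connectivity and "h" for hydrogen atoms.
    return {
        "version": version,
        "formula": formula,
        "layers": {layer[0]: layer[1:] for layer in layers},
    }


ethanol = parse_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

Keying layers by prefix is a simplification: identifiers with repeated prefixes would need a list instead of a dictionary.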
Generation of InChI for Vioxx with
wInChI
Vioxx Entry in PubChem
Compounds Found with InChI
Vioxx Bioassay Data in PubChem
Vioxx PubChem Link to External
Sources of Information
PubChem Link to Elsevier MDL
DiscoveryGate www.discoverygate.com
provides access to integrated scientific content from
databases, journal articles, patent publications and
reference works
information providers include Elsevier, Thomson Derwent, FIZ CHEMIE, the U.S. FDA, Prous Science
and Thieme
MDL Compound Index (the master list of substances
included in DiscoveryGate data sources) now
exceeds 14 million unique chemical structures with
the addition of 5 million chemical structures from the
PubChem database.
The Elsevier MDL/NIH Link via
PubChem and DiscoveryGate
Cross-indexes PubChem to the Compound
Index hosted on Elsevier MDL’s DiscoveryGate
platform
MDL added 5 million structures from PubChem
to their index, resulting in over 14 million unique
chemical structures
Links go both ways
Can move from biological data in PubChem to
bioactivity, chemical sourcing, synthetic methodology,
and EHS data in DiscoveryGate sources
Elsevier MDL’s xPharm
Comprehensive set of records linking:
Agents (compounds) (2300)
Targets (600)
Disorders (450)
Principles that govern their interactions (180)
Answers questions such as:
• What targets are associated with control of blood
pressure?
• What adverse effects are associated with
monoamine oxidase inhibitors?
Web Guide for Essential
Cheminformatics Resources
http://www.chembiogrid.org
http://www.indiana.edu/~cheminfo/cicc/
ChemBioGrid Chemical Databases
Web Services Overview
What are “Web Services”?
A distributed invocation system built on Grid
computing
• Independent of platform and programming
language
• Built on existing Web standards
A service oriented architecture with
• Interfaces based on Internet protocols
• Messages in XML (except for binary data
attachments)
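To make "messages in XML" concrete, the following Python sketch builds a SOAP 1.1 request envelope with the standard library. The envelope namespace is the SOAP 1.1 one; the service namespace, the operation name, and the parameter names are hypothetical and do not describe any service named in this talk.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:chem"  # hypothetical service namespace


def build_request(operation, params):
    """Build a SOAP 1.1 envelope whose body carries one operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical call: ask a format-conversion service to process a SMILES string.
request_xml = build_request("convertFormat", {"inputFormat": "smi", "inputData": "CCO"})
parsed = ET.fromstring(request_xml)
```

Because the message is plain XML sent over an Internet protocol, the client and the service can be written in different languages and run on different platforms.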
Web Services for Chemistry:
Problems
Performance and scalability
Proprietary data
Competition from high-performance desktop
applications
-- Geoff Hutchison, it’s a puzzle blog, 2005-01-05
ALSO:
Lack of a substantial body of trustworthy Open
Access databases
Non-standard chemical data formats (over 40 in
regular use and requiring normalization to one
another)
DM Internet Toolbox Architecture
Indiana University Planned
Projects:
http://www.chembiogrid.org
Design of a Grid-based distributed data
architecture
Development of tools for HTS data analysis and
virtual screening
Database for quantum mechanical simulation
data
Chemical prototype projects
Novel routes to enzymatic reaction mechanisms
Mechanism-based drug design
Data-inquiry-based development of new methods in
natural product synthesis
Web Services for Chemistry at IU
Interaction Layer
Purpose: Interactive software for creative access and exploitation of information by humans
Technologies: Microsoft .NET Smart Clients, portlets, Java applets, email and browser clients, visualization technologies
Aggregation Layer
Purpose: Workflows and data schemas customized for particular domains, applications and users
Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services
Web Service Layer
Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET
NCI Developmental Therapeutics
Program (DTP)
Downloadable data:
In vitro 60 cell line results
in vitro anti-HIV results
Yeast assay
200,000+ chemical structures
molecular targets
microarray data
Or search the database at:
• http://dtp.nci.nih.gov/docs/dtp_search.html
IU Database of NIH DTP Data
Contains over 200,000 chemical structures
tested in 60 cellular assays from different human
tumor cell lines
Also includes microarray assay profiles for the
untreated cell lines (~14,000 datapoints)
A local PostgreSQL database containing the
data that is exposed as a web service
Using workflows and complex SQL queries, we
can do advanced data mining that exploits the
chemical, biological and genomic information for
particular audiences (chemists, biologists, etc)
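A query of the kind described above might look like the sketch below. It is an assumption-laden illustration: an in-memory SQLite database stands in for the local PostgreSQL database so that the example is self-contained, and the table names, column names, and values are all made up rather than taken from the IU schema.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compounds (compound_id INTEGER PRIMARY KEY, smiles TEXT);
CREATE TABLE assay_results (compound_id INTEGER, cell_line TEXT, log_gi50 REAL);
""")
# Made-up compounds and assay values, for illustration only.
conn.executemany("INSERT INTO compounds VALUES (?, ?)",
                 [(1, "CCO"), (2, "c1ccccc1")])
conn.executemany("INSERT INTO assay_results VALUES (?, ?, ?)",
                 [(1, "line_A", -4.0), (1, "line_B", -4.2),
                  (2, "line_A", -6.5), (2, "line_B", -7.1)])

# Find compounds whose mean activity across cell lines is below a cutoff.
query = """
SELECT c.compound_id, c.smiles, AVG(r.log_gi50) AS mean_log_gi50
FROM compounds AS c
JOIN assay_results AS r ON r.compound_id = c.compound_id
GROUP BY c.compound_id, c.smiles
HAVING AVG(r.log_gi50) < ?
ORDER BY mean_log_gi50
"""
potent = conn.execute(query, (-6.0,)).fetchall()
```

Joining chemical structures to assay results in one query is what lets a workflow hand a chemist a list of structures and a biologist a list of cell-line profiles from the same data.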
Mining the NIH DTP database
60 cell lines
~200,000 compounds
Cell lines can be clustered based on gene expression similarity
Compounds can be clustered based on similarity of profile
across cell lines, or by chemical structure fingerprint similarity
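Fingerprint similarity is commonly measured with the Tanimoto coefficient: the number of bits two fingerprints share divided by the number of bits set in either. The Python sketch below pairs it with a simple single-pass (leader) clustering. The leader algorithm is swapped in for brevity and is not the clustering method used at IU; the fingerprints and compound names are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def cluster(fingerprints, threshold):
    """Single-pass (leader) clustering: join the first cluster whose leader is similar enough."""
    clusters = []
    for name, fp in fingerprints.items():
        for members in clusters:
            leader = fingerprints[members[0]]
            if tanimoto(fp, leader) >= threshold:
                members.append(name)
                break
        else:
            # No existing leader was similar enough, so this compound starts a new cluster.
            clusters.append([name])
    return clusters


# Made-up fingerprints: each set lists the positions of the bits that are on.
fingerprints = {
    "cmpd_1": {1, 4, 7, 9},
    "cmpd_2": {1, 4, 7, 8},
    "cmpd_3": {2, 3, 5},
}
groups = cluster(fingerprints, threshold=0.5)
```

The same similarity function works for profiles across cell lines once each profile is reduced to a set of features, which is why one measure can serve both clustering views.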
Use of Taverna at IU
A protein implicated in tumor growth is supplied to the docking
program (in this case HSP90 taken from the PDB 1Y4 complex)
A 2D structure is supplied for input into the similarity search (in this
case, the extracted bound ligand from the PDB 1Y4 complex)
The workflow employs our local NIH DTP database service to
search 200,000 compounds tested in human tumor cellular assays
for similar structures to the ligand.
Client portlets are used to browse these structures
Similar structures are filtered for drugability, and are automatically
passed to the OpenEye FRED docking program for docking into the
target protein.
Once docking is complete, the user visualizes the high-scoring
docked structures in a portlet using the Jmol applet.
Correlation of docking results and “biological fingerprints” across the
human tumor cell lines can help identify potential mechanisms of
action of DTP compounds
Taverna Workflow
Workflow definition
Available web services
(WSDL)
Visual depiction of workflow
Taverna in Action
CGL Contributions to CICC
Build Web/Grid services for connecting
Third party tool evaluation
Data sources
Applications (simulation, data mining, data assimilation, imaging, etc).
Computing resources
Information services.
Workflow (Taverna)
Grid tools: Globus and Condor (for interacting with TeraGrid)
Building standards-based Web portal environments.
OGCE grid portal project
JSR 168 Java standards.
This activity will begin in earnest over the summer.
Digital Chemistry (BCI) Clustering
Service Methods
makebitsGenerate
Description: Generate fingerprints from a SMILES structure
Input: SMIstring
Output: Fingerprint string
divkmGenerate
Description: Cluster fingerprints with Divkmeans
Input: SCNstring
Output: Clustered Hierarchy
smile2dkm
Description: Makebits + divkm
Input: SMIstring
Output: Clustered Hierarchy
optclusGenerate
Description: Generate the best levels in a hierarchy
Input: DKMstring
Output: Best partition cluster level
rnnclusGenerate
Description: Extract individual cluster partitions
Input: DKMstring
Output: Indiv. cluster partitions
smile2ClusterPartitioned
Description: Generate a new SMILES structure w/ extra col.
Input: SMIstring
Output: New SMILES structure
Local Web Service Methods for
WWMM of PMR’s Group
InChIGoogle
Description: Search an InChI structure through Google
Input: inchiBasic, type
Output: Search result in HTML format
InChIServer
Description: Generate InChI
Input: version, format
Output: An InChI structure
OBServer
Description: Transform a chemical format to another using Open Babel
Input: format, inputData, outputData, options
Output: Converted chemical structure string
CMLRSSServer
Description: Generate CMLRSS feed from CML data
Input: mol, title, description, link, source
Output: Converted CMLRSS feed of CML data
More Services
VOTables and related services
General purpose service for manipulating tabular data. Comes with third party tools for parsing, manipulating, displaying data. Includes import tools. Using this as an intermediary for data exchange between databases.
Draw2d
Uses CDK tools to create 2D images from SDF formatted data.
Common Substructure
Another CDK service that can be used to calculate the common substructure between two molecules.
Other CDK Services
See http://www.chembiogrid.org/wiki/index.php/Web_Services_Infrastructure. Based on Dr. Rajarshi Guha’s services.
ToxTree
An in silico toxicology prediction suite
Based on the CDK toolkit
Built on CML
Released as open source under the GPL
Standalone PC software
User Manual:
http://ecb.jrc.it/DOCUMENTS/QSAR/TOXTREE/toxTree_user_manual.pdf
ToxTree Service
An open-source Java application by Nina Jeliazkova
Estimates toxic hazard by applying a decision tree
approach.
Encodes the Cramer scheme
(Cramer G. M., R. A. Ford, R. L. Hall, Estimation of Toxic
Hazard - A Decision Tree Approach, J. Cosmet. Toxicol.,
Vol.16, pp. 255-276, Pergamon Press, 1978)
Could be applied to datasets from various compatible file
types.
We are converting this GUI application to a text-based
web service
Chemoinformatics Education at IU
School of Informatics degree programs
BS, MS, PhD
Programs offered at both the Indianapolis
(IUPUI) and Bloomington (IUB) campuses
Other Educational Activities
Graduate Certificate Program in Chemical
Informatics (4 courses by Distance Education)
I571 Chemical Information Technology (3 cr.)
I572 Computational Chemistry and Molecular
Modeling (3 cr.)
I573 Programming Techniques for Chemical and Life
Science Informatics (3 cr.)
I553 Independent Study in Chemical Informatics (3
cr.)
I571 as CIC Courseshare offering with Michigan
Experiments with teleconferencing as a distance
education tool
PhD in Informatics
Began in August 2005
Tracks:
bioinformatics; chemical informatics; health
informatics; human-computer interaction design;
social and organizational informatics
Under development:
complex systems, networks, modeling and
simulation; cybersecurity; discovery and application of
information; logical and mathematical foundations;
music informatics
Graduate Enrollment: Chemo-,
Laboratory, Bio-, Health Informatics
MS      Chem  Lab  Bio  Health
IUB        3    0   38       0
IUPUI      6   15   34      36
TOTAL      9   15   72      36
PhD     Chem  Lab  Bio  Health
IUB        1    0    3       0
IUPUI      1    0    4       3
TOTAL      2    0    7       3
Software/DBs Used in the Program
Company: Products and/or (Target Area)
ArgusLab: (Molecular modeling)
Digital Chemistry: Toolkit (Clustering)
Cambridge Cryst Data Ctr: Cambridge Structural DB & GOLD
CambridgeSoft: ChemDraw Ultra
Chemical Abstracts Service: SciFinder Scholar
Chemaxon: Marvin (and other software)
Daylight Chemical Info System: Toolkit
FIZ Karlsruhe: Inorganic Crystal Structure DB
IO-Informatics: Sentient
MDL CrossFire: Beilstein and Gmelin
OpenEye: Toolkit (and other software)
Sage Informatics: ChemTK
Serena Software: PCMODEL
Spotfire: DecisionSite
STN International: STN Express with Discover (Anal Ed)
Wavefunction: Spartan
Closing quote
“The future of chemistry depends on the
automated analysis of chemical
knowledge, combining disparate data
sources in a single resource, . . . which
can be analysed using computational
techniques to assess and build on these
data.”
Townsend et al. Org. Biomol. Chem. 2004, 2,
3299.
We all need help when overloaded!
Bibliography
Agresti, William W. “Discovery informatics.” Communications of the ACM
2003, 46(8), 25-28.
Banville, Debra L. “Mining chemical structural information from the drug
literature.” Drug Discovery Today January 2006, 11(1/2), 35-42.
Bajcsy, Peter; Han, Jiawei; Liu, Lei; Yang, Jiong. "Survey of bio-data
analysis from a data mining perspective." Chapter 2 in: Wang, Jason T. L.;
Zaki, Mohammed J.; Toivonen, Hannu T. T.; Shasha, Dennis (eds.), Data
Mining in Bioinformatics. London, Springer Verlag, 2005, pp.9-39.
Cios, Krzysztof J.; Kurgan, Lukasz A. “Trends in data mining and knowledge
discovery.” Chapter 1 in: Pal, N.R.; Jain, L.C.; Teodoresku, N. (eds.),
Knowledge Discovery in Advanced Information Systems. N.Y., Springer
Verlag, 2002, pp. 1-26.
Cohen, Aaron M.; Hersh, William R. "A survey of current work in biomedical
text mining." Briefings in Bioinformatics March 2005, 6(1), 57-71.
Corbett, Peter T.; Murray-Rust, Peter; Day, Nick E.; Townsend, Joe A.; Rzepa, Henry
S. “Chemistry publications in CML.” Abstracts of Papers, 231st ACS National
Meeting, Atlanta, GA, United States, March 26-30, 2006, CINF-055.
Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. Advances in
Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996. (quoted by Cios and
Kurgan)
Gardner, Stephen P. “Ontologies and semantic data integration.” Drug Discovery
Today 2005 10(14), 1001-1007.
Guha, R.; Howard, M.T.; Hutchison, G.R.; Murray-Rust, P.; Rzepa, H.; Steinbeck, C;
Wegner, J.; Willighagen, E.L. “The Blue Obelisk—Interoperability in chemical
informatics.” Journal of Chemical Information and Modeling 2006 Web Release Date:
22-Feb-2006; DOI: 10.1021/ci050400b
Holliday, Gemma L.; Murray-Rust, Peter; Rzepa, Henry S. “Chemical Markup, XML,
and the World Wide Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions.”
Journal of Chemical Information and Modeling 2006, 46(1), 145-157.
Jónsdóttir, S.O.; Jorgensen, F.S.; Brunak, S. “Prediction methods and databases
within chemoinformatics: emphasis on drugs and drug candidates.” Bioinformatics
2005 May 15; 21(10): 2145-60.
Karthikeyan, M.; Krishnan, S.; Pandey, Anil Kumar. “Harvesting chemical information
from the Internet using a distributed approach: ChemXtreme.” Journal of Chemical
Information and Modeling. DOI: 10.1021/ci050329.
Krallinger, Martin; Alonso-Allende Erhardt, Ramon; Valencia, Alfonso. “Text-mining
approaches in molecular biology and biomedicine.” Drug Discovery Today 2005,
10(6), 439-445.
Scherf Uwe, Ross Douglas T., Waltham Mark, Smith Lawrence H., Lee Jae K.,
Tanabe Lorraine, Kohn Kurt W., Reinhold William C., Myers Timothy G., Andrews
Darren T., Scudiero Dominic A., Eisen Michael B., Sausville Edward A., Pommier
Yves, Botstein David, Brown Patrick O., Weinstein John N. “A gene expression
database for the molecular pharmacology of cancer.” Nature Genetics 2000, 24, 236-244.
Schubert, Ulrich S. "Materials informatics: from data to knowledge towards integrated
escience approaches." QSAR & Combinatorial Science 2005, 24(1), 5. (NB: Entire
issue is devoted to this topic.)
SIAM International Conference on Data Mining (5th: 2005: Newport Beach, CA) Data
Mining; Proceedings. Kargupta, Hillol et al., eds. SIAM, 2005.
Torr-Brown, Sheryl. “Advances in knowledge management for pharmaceutical
research and development.” Current Opinion in Drug Discovery & Development
2005, 8(3), 316-322.
Web 2.0
Social Software: allows group interactions
Enables groups to form and organize themselves
Examples
• Wikis
• Blogs
• RSS (now found on chemistry.org)
• Podcasting/Coursecasting
• Webcasting/Webinars
• Flickr
• Jybe
• FURL
FURL (Frame Uniform Resource
Locator)
For archiving and sharing of web pages
Furler can capture the pages for a
discussion group
Tracks useful pages for a discussion
http://www.furl.net/home.jsp
Jybe (Join Your Browser with
Everyone)
Collaboration and communication in real
time with IE and Firefox
Screen-sharing AND editing
Privacy protected: must be invited
Upload documents to convert to html
http://www.jybe.com