How chemists use data

Dr William G Town
President, Kilmorie Consulting
Fourth Bloomsbury Conference on E-publishing and E-publications
24th and 25th June 2010
Overview
• Chemistry documentation perspective
• Case study – CCDC
• Study of chemists’ behaviour – JISC
• RSC Project Prospect
• RSC ChemSpider
• OreChem project
Chemists have a long tradition of documenting chemistry
• Gmelin Handbook of Chemistry (1817- )
• Beilstein Handbook of Organic Chemistry (1881 - )
• Chemical Abstracts (1907 - )
– CAS Online (1983 - )
– STN Express (1987 - )
– SciFinder (1995 - )
• ChemWeb (1996 - )
• Reaxys (2009 - )
Chemists have a long tradition of documenting chemistry
• Data centres (e.g. CCDC) started in 1960s
• Extensive chemical database activities
– Bibliographic databases (1960s – )(e.g. CAS)
– Factual databases (1980s – )(e.g. Beilstein)
– Open access databases (2000s – )(e.g. Crystal Eye)
What’s the status of chemistry online?
• Encyclopaedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Virtual Screening databases
• Property databases
• Screening assay results
• Patents with chemical structures (IBM & SureChem)
• ADME/Tox data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
• Commercial databases
Chemists like structures
digitonin
Cambridge Crystallographic Data Centre (CCDC)
• Founded in 1965 with grant funding in the Department of Chemistry, University of Cambridge
• Self-financing, self-administering institution since 1987
– Not-for-profit, charitable research institute
– Recognized institute for postgraduate degrees of the University of Cambridge
• Objectives
– “advancement and promotion of the science of chemistry and crystallography for the public benefit”
Cambridge Structural Database
Worldwide repository of validated small-molecule crystal structures
Lamotrigine
Acta Cryst., Sect. C: Cryst. Struct. Commun. (2009), 65, o460
Refcode: EFEMUX01
CSD Growth 1970-2010
Dec 09 – 500,000th structure milestone reached
Knowledge mining using the CSD
“Crystals are windows on the world of atoms”
(Chet Raymo, Boston Globe, Science Musings)
CSD System search and analysis software permits structural knowledge in the CSD to be mined from the raw data, to generate:
• Crystallographic knowledge
• Intra-molecular structural knowledge
• Inter-molecular structural knowledge
Knowledge mining using the CSD
Scientific Applications
• Structural chemistry and crystal engineering
• Rational drug discovery and design
• Protein – ligand interactions & ligand docking
• Drug development, formulation and delivery
• Materials research and development
• Crystal structure prediction
• Crystal structure determination
A study of scholarly communication between chemists and of their use of Web 2.0 technologies
• Study commissioned by JISC (UK Joint Information Systems Committee)
• Principal contractor was Publishing Directions (Deborah Kahn – project leader)
• Project team composed of Nicki Dennis, Lara Burns and me
• Started November ’08, reported in April ’09
http://www.jisc.ac.uk/media/documents/aboutus/workinggroups/scadvocacyfinal%20report.pdf
Background to the study
Methods of scholarly communication have changed rapidly
in the past decade. Improvements in computing and
social networking technologies, digital data capture
techniques, powerful data and text mining techniques
and other technological changes enable practices that
are collaborative, network based and highly intensive.
Background to the study
• We researched the needs of academics in two specific
areas, economics and chemistry.
• Recommendations were made on the advocacy
programmes likely to be most effective in each
discipline for encouraging take-up of useful
technologies and other developments that improve
scholarly communication.
Use of information resources
[Chart: Use of information resources by research chemists - top ten. Vertical axis: percentage of sample.]
Use of information resources
[Chart: Online resources used at least weekly in chemistry - top ten. Vertical axis: percentage of sample. Resources shown, in no particular order: Wikipedia, Web of Knowledge, SciFinder Scholar, spectral databases, Scopus, structural databases, article alerts, Google Scholar, reaction databases, eJournals.]
Use of information resources
• High use of Wikipedia and Google Scholar,
but chemists also use alerting services and
more specialised subject-based services
– This is likely to reflect the fact that chemists
are taught information skills as part of their
degree course
Data sharing
[Chart: Types of information shared by chemistry researchers (top ten). Vertical axis: % response. Categories shown: Models; CIF; Images; Course materials; Books; Chemical structures; Working papers; Experimental & theoretical datasets; PowerPoint presentations; Published journal articles.]
Data storage
Data storage and sharing
• Chemists share datasets since they work
collaboratively across institutes
• Despite considerable work around repositories
and storage, data are still being stored locally
rather than in institutional or subject based
repositories.
• Concerns around ownership of results and of
“competitors” obtaining the results need to be
addressed before this will change significantly.
Three years of semantic publishing –
RSC Project Prospect
What were they trying to improve?
– Discoverability
– Use
– Understanding
– Linking
And why...
• What chemistry on the web may become...
• Prolonged exposure to
Peter Murray-Rust
Quick, what can we mark up?
What standards did we have in 2007?
• InChI – for some compounds
• ChEBI for some compounds and groups of compounds
• Gene/Sequence/Cell Ontologies
• IUPAC Gold Book (dictionary, really, but online)
And RDF/OWL as distribution format
30-40% of RSC publishing
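An InChI is a layered text identifier, which is part of what makes it practical for marking up compounds in articles. The following is a minimal sketch in Python that splits an identifier into its layers. The example string is the standard InChI for acetic acid, chosen here for illustration and not taken from the slides; `split_inchi` is a hypothetical helper, not part of any InChI library, and it ignores cases where a layer tag repeats.

```python
def split_inchi(inchi):
    """Split an InChI string into a dict of layers (simplified illustration)."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    # The first two fields are the version and the chemical formula.
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # Each later layer starts with a one-letter tag,
        # e.g. c = connections, h = hydrogens.
        layers[layer[0]] = layer[1:]
    return layers


# Standard InChI for acetic acid (illustrative input).
layers = split_inchi("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)")
print(layers)
```

Because the identifier is plain text, it can be embedded in an article and indexed by ordinary search engines without special tooling.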
What did RSC learn with Prospect?
• This is probably the way to go – 4000 articles so far
• How do they cover all subjects?
– Standards not well defined in all areas
• Scale up in manual QA
• Scale up during huge growth and scope of RSC
publishing activities
• How to use all that real chemistry data?
• Pump prime to change what is asked from authors
• Is the vision the day-glo article? (“Free headache for
every user”)
Ontology Add-in for Word 2007
John Wilbanks
Intent: Term recognition & disambiguation based on OBO or OWL formats
Services: Ontology download web service
• Phil Bourne
• Lynn Fink
Relationships: Ontology browser
Source code and binary: http://research.microsoft.com/ontology/
Authoring: Chem4Word – Chemistry Drawing in Word
Author and edit 1D and 2D chemistry.
Intent: Recognizes chemical dictionary and ontology terms
<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
<molecule id="m1">
<atomArray>
<atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
<atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
<atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
<atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
<atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
<atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
<atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
<atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
</atomArray>
<bondArray>
<bond atomRefs2="a1 a2" order="1" />
<bond atomRefs2="a2 a3" order="1" />
<bond atomRefs2="a2 a4" order="2" />
<bond atomRefs2="a1 a5" order="1" />
<bond atomRefs2="a1 a6" order="1" />
<bond atomRefs2="a1 a7" order="1" />
<bond atomRefs2="a3 a8" order="1" />
</bondArray>
</molecule>
</cml>
Data: Semantics stored in Chemistry Markup Language
Intelligence: Verifies validity of authored chemistry
Relationships: Navigate and link referenced chemistry
Available soon:
http://research.microsoft.com/chem4word/
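The CML fragment on this slide is ordinary namespaced XML, so it can be read with a general-purpose XML parser. The following is a minimal sketch in Python using only the standard library. It works on an abbreviated copy of the slide's molecule with the 2D coordinates dropped, and `summarise_cml` is an illustrative name, not part of Chem4Word or any CML toolkit.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the xmlns attribute of the <cml> element on the slide.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Abbreviated copy of the slide's molecule (x2/y2 coordinates omitted).
CML_TEXT = """<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" />
      <atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" />
      <atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" />
      <atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def summarise_cml(text):
    """Return (element counts, bonds as (element, element, order)) for the first molecule."""
    root = ET.fromstring(text)
    molecule = root.find("cml:molecule", CML_NS)
    # Map atom ids (a1, a2, ...) to their element symbols.
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in molecule.iterfind("cml:atomArray/cml:atom", CML_NS)
    }
    counts = Counter(atoms.values())
    bonds = []
    for bond in molecule.iterfind("cml:bondArray/cml:bond", CML_NS):
        first, second = bond.get("atomRefs2").split()
        bonds.append((atoms[first], atoms[second], int(bond.get("order"))))
    return counts, bonds


counts, bonds = summarise_cml(CML_TEXT)
print(dict(counts))
print(bonds)
```

The counts come out as two carbons, four hydrogens and two oxygens, with one C=O double bond, which is consistent with acetic acid. This is the sense in which the semantics are stored with the document: the drawing can be regenerated from the data, and the data can be checked or indexed by software.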
Standards = longevity
Help implement and develop standards
– Open ontologies for chemistry
– InChI Trust
– How to publish this - pre-competition
Addressing a real need in standards
Pistoia Alliance
“An initiative to provide an open foundation of data
standards, ontologies and web-services to streamline the
Pharmaceutical Drug Discovery workflow”
Semantic Enrichment of the Scientific
Literature (SESL) Oct09-Oct10
• Pistoia Alliance-funded
• EBI
• Elsevier, NPG, OUP, RSC
How to use this information better to
benefit existing researchers –
computers and humans
• Real behaviour (for humans)
• Clear requirements (for computer discovery)
What do humans want?
As few interfaces as possible
media.obsessable.com
What do computers want?
Web services
flickr.com/photos/microcosmos
A free-to-access online database for chemists
Website and web services
Links over 25 million compounds integrated to <300 data sources
A curation platform for the public to improve the quality of data online
A deposition platform for the public to annotate and extend the data
ChemSpider – A Pragmatic Vision
“Build a Structure Centric Community”
– Integrate chemical structure data on the web
– Create a “structure-based hub” to information and
data
– Provide access to structure-based “algorithms”
– Let chemists contribute their own data
– Allow the community to curate/correct data
Why did the RSC acquire ChemSpider?
• Data versus documents
• Enhancing discoverability
• Build on cheminformatics expertise
• RSC presence in the open data space
• Critical mass of data for structure searching
• Networking chemical scientists
Crowd-sourcing chemistry curation
Identify/tag errors, edit names, synonyms, identify records to deprecate
CAS SciFinder
Reaxys
Differences between ChemSpider, Reaxys and SciFinder
• Everything on Reaxys and SciFinder is curated
• The data resources can be over 100 years old
• The platforms are commercial and “read-only”
• ChemSpider is free, to everyone
• Data are in a state of ongoing curation & annotation
• Data resources are from the “electronic era”
• Data are expanded daily and enhanced on an ongoing basis
• The platform delivers integrated algorithm access
Future of chemistry online?
• Make the internet searchable by chemical structure and
substructure by a free online service
• Aggregate and help improve disparate public sources
• Highlight high quality publications
• Test sharing and discussion of research data in the open
• Provide structural home to preserve researchers’ collections,
experimental and property data
OreChem Project
• Participants
– Cambridge University
– Cornell University
– Indiana University
– Penn State University
• Funding
– Microsoft Research
– NSF
OreChem Project
• Data integration
– Representation/reuse through common data models and
ontologies
• Data capture and recovery
– At source capture of experimental data and research process
(ELNs)
– Compound object authoring
– Retrospective harvesting of chemistry data
• Data storage and manipulation
– Cloud-based triple store
– Chemical structure search
– Linked data integration
– Computation of properties
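The “triple store” and “linked data integration” items can be illustrated with a toy example. The following is a minimal in-memory sketch in Python and is not the OreChem implementation: the `TripleStore` class, the `ex:` identifiers and the predicate names are all hypothetical, and a real deployment would use RDF tooling and published ontologies. The sample facts reuse the lamotrigine entry from the CSD slide earlier in this talk.

```python
class TripleStore:
    """Toy store of (subject, predicate, object) statements."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self._triples
            if all(p is None or p == value for p, value in zip(pattern, triple))
        )


store = TripleStore()
# Hypothetical identifiers and predicates; the facts come from the CSD slide.
store.add("ex:EFEMUX01", "ex:describesCompound", "ex:lamotrigine")
store.add("ex:EFEMUX01", "ex:heldIn", "ex:CSD")
store.add("ex:EFEMUX01", "ex:reportedIn", "Acta Cryst. C (2009), 65, o460")

# Everything known about one database entry.
about_entry = store.match(subject="ex:EFEMUX01")

# Which entries describe a given compound?
entries_for_compound = [
    s for s, _, _ in store.match(predicate="ex:describesCompound", obj="ex:lamotrigine")
]
print(entries_for_compound)
```

Because every fact has the same three-part shape, data harvested from different sources can be merged into one store and queried together, provided the sources agree on identifiers. That agreement is what the common data models and ontologies listed under data integration are meant to supply.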
Chemistry is particularly challenging
• Commercial value of chemical information (e.g.
Pharma industry)
• Nature of chemistry research culture
– Predominance of synthesis (creation) overshadows
discovery mode typical of physics or biology
– Autonomy, successful research with limited reliance
on others
• Dominance of scholarly societies as publishers
– ACS (CAS)
– RSC
Chemistry on the Internet – a future vision
• The “semantic web” for chemistry is in place
• Crowdsourcing is commonplace
• Chemists will search the web by “structure”
• Chemistry articles indexed and searchable
• Reduced number of searches to find data because data are integrated – compounds, vendors, syntheses, data, publications and patents
• A world of Open Access and Open Data
Linked Data on the Web
Acknowledgements
• Colin Groom, Gary Battle, CCDC
• Richard Kidd, RSC
• Tony Williams, RSC ChemSpider
• Carl Lagoze, OreChem
Any questions?
