RSC PPT Template


Semantics and standards in chemistry
Project Prospect, ChemSpider and chemical data
Anomalocaris
floreslivroselua.files.wordpress.com
What do our scientists want?
(apart from PDFs?)
Chemists like structures
digitonin
Three years of semantic publishing
What were we trying to improve?
• Discoverability
• Use
• Understanding
• Linking
And why...
• What chemistry on the web may become...
• Prolonged exposure to Peter Murray-Rust
Quick, what can we mark up?
What standards did we have in 2007?
• InChI – for some compounds
• ChEBI for some compounds and groups of compounds
• Gene/Sequence/Cell Ontologies
• IUPAC Gold Book (dictionary, really, but online)
And RDF/OWL as distribution format
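An InChI packs a structure into slash-separated layers, which is part of what makes it usable as a machine-readable identifier. A minimal sketch, assuming the common case of a version, a formula and single-letter prefixed layers; the function name `split_inchi` is hypothetical, and the example identifier is acetic acid rather than anything from these slides:

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    # Assumes at least a version and a formula are present.
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # Remaining layers start with a one-letter prefix: c, h, q, ...
        layers[layer[0]] = layer[1:]
    return layers


acetic_acid = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
layers = split_inchi(acetic_acid)
print(layers["formula"], layers["c"], layers["h"])
```

Real InChIs can repeat prefixes in later layers or omit the formula entirely, so this is an illustration of the layout, not a parser.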
30-40% of our publishing
RSS for human readers
RSS for computers
<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
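What a computer might do with such an item: pull out the compound and ontology identifiers. A minimal sketch, assuming the item sits inside an RSS 1.0 `rdf:RDF` root with the usual RSS 1.0, RDF and content-module namespaces (the slide does not show them); the function name `annotations` is hypothetical and the InChI is a short stand-in (methane):

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the fragment on the slide omits the declarations.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

FEED = """<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
    <content:items>
      <rdf:Bag>
        <rdf:li><content:item rdf:about="info:inchi/InChI=1/CH4/h1H4"/></rdf:li>
        <rdf:li><content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/></rdf:li>
      </rdf:Bag>
    </content:items>
  </item>
</rdf:RDF>"""


def annotations(feed_xml):
    """Map each item URI to the compounds and ontology terms it carries."""
    about = "{%s}about" % NS["rdf"]
    prefix = "info:inchi/"
    result = {}
    for item in ET.fromstring(feed_xml).findall("rss:item", NS):
        uris = [el.get(about) for el in item.findall(".//content:item", NS)]
        result[item.get(about)] = {
            "compounds": [u[len(prefix):] for u in uris if u.startswith(prefix)],
            "terms": [u for u in uris if not u.startswith(prefix)],
        }
    return result


found = annotations(FEED)
for uri, marks in found.items():
    print(uri, marks)
```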
Enhanced HTML
Database
Text mining (Oscar)
Manual QA
http://www.sciborg.org.uk/
http://oscar3-chem.sourceforge.net/
Enhanced RSS
Why is this hard?
How many numbered compounds actually are named in a given paper?
iloprost (1)
tributyl-1-hexynylstannane (2)
the desired 2-heptyne (3)
methyl–Pd(II) iodide 4 or 4′
alkynylstannane 5
the hypervalent stannate 6
(alkynyl)(methyl)Pd(II) complex 7
the desired methylalkyne 8
compounds 9–14
the stannyl precursors 15 and 16
methylated compounds 17 and 18
stannyl precursor 19
iloprost methyl ester 20
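To see why phrases like these are awkward, here is a minimal sketch that separates the trailing compound labels from digits inside the names (locants such as the 2 in 2-heptyne). This is not how Oscar works; the regex, the function name `compound_labels` and the rule "labels sit at the end of the phrase" are all assumptions made for illustration:

```python
import re

# Trailing label(s): digits with optional primes, optionally in parentheses,
# joined by a range dash, a comma, "and" or "or".
TAIL = re.compile(r"\(?(\d+[′']*(?:\s*(?:[–-]|,|and|or)\s*\d+[′']*)*)\)?\s*$")


def compound_labels(phrase):
    """Return the compound labels that end a phrase, expanding ranges."""
    match = TAIL.search(phrase)
    if match is None:
        return []
    labels = []
    for part in re.split(r"\s*(?:,|\band\b|\bor\b)\s*", match.group(1)):
        span = re.fullmatch(r"(\d+)\s*[–-]\s*(\d+)", part)
        if span:
            # "9–14" stands for six separate compounds.
            first, last = int(span.group(1)), int(span.group(2))
            labels.extend(str(n) for n in range(first, last + 1))
        else:
            labels.append(part)
    return labels


for phrase in ["the desired 2-heptyne (3)",
               "methyl–Pd(II) iodide 4 or 4′",
               "compounds 9–14"]:
    print(phrase, "->", compound_labels(phrase))
```

Even when the labels are found, "compounds 9–14" gives six labels and no names at all, and a name that itself ends in a number would be misread as a label.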
• Text mining is the easy bit
• Cleaning up afterwards is hard
We spent more time on cleaning than on mining when quality is important
Annotation: where and when?
Pre-publication? (by authors): ?
At publication? (by editors): Prospect
After publication? (by the crowd): ChemMantis
What if the authors did it all?
Ontology Add-in for Word 2007
John Wilbanks
Intent: Term recognition & disambiguation based on OBO or OWL formats
Services: Ontology download web service
• Phil Bourne
• Lynn Fink
Relationships: Ontology browser
Source code and binary:
http://research.microsoft.com/ontology/
Authoring: Chem4Word – Chemistry Drawing in Word
Author and edit 1D and 2D chemistry.
Intent: Recognizes chemical dictionary and ontology terms
<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>
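Because the structure is stored as atoms and bonds, software can read it back and check it. A minimal sketch, assuming numeric bond orders as in the molecule above (coordinates dropped for brevity); the helper names `hill_formula` and `bond_order_sums` are hypothetical and this is not Chem4Word's own validation:

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

CML = """<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C"/><atom id="a2" elementType="C"/>
      <atom id="a3" elementType="O"/><atom id="a4" elementType="O"/>
      <atom id="a5" elementType="H"/><atom id="a6" elementType="H"/>
      <atom id="a7" elementType="H"/><atom id="a8" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1"/><bond atomRefs2="a2 a3" order="1"/>
      <bond atomRefs2="a2 a4" order="2"/><bond atomRefs2="a1 a5" order="1"/>
      <bond atomRefs2="a1 a6" order="1"/><bond atomRefs2="a1 a7" order="1"/>
      <bond atomRefs2="a3 a8" order="1"/>
    </bondArray>
  </molecule>
</cml>"""


def hill_formula(molecule):
    """Molecular formula in Hill order (C, H, then alphabetical)."""
    atoms = molecule.iterfind("cml:atomArray/cml:atom", CML_NS)
    counts = Counter(atom.get("elementType") for atom in atoms)
    first = ["C", "H"] if "C" in counts else []
    order = [e for e in first if e in counts]
    order += sorted(e for e in counts if e not in first)
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order)


def bond_order_sums(molecule):
    """Sum of bond orders at each atom, a crude valence check."""
    sums = Counter()
    for bond in molecule.iterfind("cml:bondArray/cml:bond", CML_NS):
        for atom_id in bond.get("atomRefs2").split():
            sums[atom_id] += int(bond.get("order"))
    return dict(sums)


molecule = ET.fromstring(CML).find("cml:molecule", CML_NS)
formula = hill_formula(molecule)
sums = bond_order_sums(molecule)
print(formula, sums)
```

Each carbon sums to 4, each oxygen to 2 and each hydrogen to 1, which is the kind of sanity check a structure-aware editor can run and a picture cannot.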
Data: Semantics stored in Chemical Markup Language
Intelligence: Verifies validity of authored chemistry
Relationships: Navigate and link referenced chemistry
Available soon:
http://research.microsoft.com/chem4word/
What if the readers did it?
ChemSpider ChemMantis
Deposit structures
…build dictionaries
So now it’s 2010 – where are we?
4k articles marked up, 40k compounds
RSC open ontology development
• Methods, reactions, molecular processes
User interface
• Partially dependent on the publishing platform
• Do we have the answers?
Compelling ontology browsing?
Remaining challenges
Open problems
• Chemical structures from images
• Productive identifiers
• Degree of manual effort required
Putting ChemMantis and Prospect together
• Backfile (to 1841)
• Community curation
Standards = longevity
Help implement and develop standards
• Open ontologies for chemistry
• InChI Trust
• How to publish this - pre-competition
Addressing a real need in standards
Pistoia Alliance
“An initiative to provide an open foundation of data standards, ontologies and web-services to streamline the Pharmaceutical Drug Discovery workflow”
Semantic Enrichment of the Scientific Literature (SESL), Oct 2009 to Oct 2010
• Pistoia Alliance-funded
• EBI
• Elsevier, NPG, OUP, RSC
What have been the benefits?
For readers?
For authors?
For RSC?
How to use this information better to benefit existing researchers – computers and humans
• Real behaviour (for humans)
• Clear requirements (for computer discovery)
What do humans want?
As few interfaces as possible
media.obsessable.com
What do computers want?
Web services
flickr.com/photos/microcosmos
A free-to-access online database for chemists
Website and web services
Links over 20 million compounds integrated to <300 data sources
A curation platform for the public to improve the quality of data online
A deposition platform for the public to annotate and extend the data
What’s the status of chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Virtual Screening databases
• Property databases
• Screening assay results
• Patents with chemical structures (IBM & SureChem)
• ADME/Tox data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
• Other publishers’ databases
Caution! Question Everything!
ChemSpider SyntheticPages
Quality and cleanup
• Who says what Taxol is?
• What is the “timeline” for a molecule?
• How do we clean up the public data?
• Not even experts can agree, and the detective work can take days or weeks. See Taxol, digitonin
Crowd-sourcing chemistry curation
• identify/tag errors, edit names, synonyms, identify records to deprecate
Future of chemistry online?
Make the internet searchable by chemical structure and substructure by a free online service
Aggregate and help improve disparate public sources
Highlight our (and other publishers’) high quality publications
Test sharing and discussion of research data in the open
Provide a structural home to preserve researchers’ collections, experimental and property data
A society’s business...
Develop standards
Test, implement, refine, promote best practice
Evolve
(don’t forget the human behaviour)