Towards knowledge representation improvements in chemistry

Towards knowledge representation
improvements in chemistry
Evan Bolton, Ph.D.
Mar. 15, 2016
National Center for Biotechnology Information
Premise
MOST CHEMISTRY KNOWLEDGE IS
LOCKED UP IN TEXT.
Image credits:
https://play.google.com/store/apps/details?id=com.uc.addon.web2pdf
http://ideasuccessnetwork.com/idea-discovery-article-how-good-idea-mushroomed/
http://xk2.ahu.cn/
http://libraryschool.libguidescms.com/content.php?pid=682172
https://www.freshrelevance.com/blog/real-time-marketing-report-for-may-2014
Simplistic science workflow
Read papers
Do science
Search papers
Publish papers
Computers help, yet we rely on humans who abstract out keywords, article gist, data, etc.
CAS, Medline, ChEMBL, etc.
Image credits:
http://www.how-to-draw-funny-cartoons.com/cartoon-scientist.html
http://computertutorinc.net/computer-maintenance-safety-tips/
Computer aided abstraction
Natural Language Processing (NLP) attempts to “read” text with a computer with human-like understanding … is getting there
Named entity recognition (NER) is now as good as a human at finding chemical names, gene names, and disease names in a text corpus.
BioCreative V (2015)
Image credits:
http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/ner/
http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/
http://www.intechopen.com/books/theory-and-applications-for-advanced-text-mining/biomedical-named-entity-recognition-a-survey-of-machine-learning-tools
http://www.slideshare.net/NextMoveSoftware/leadmine-a-grammar-and-dictionary-driven-approach-to-chemical-entity-recognition
Effects of computer understanding
you can experience
Knowledge representation helps to provide “information about the world
in a form that a computer system can utilize to solve complex tasks”…
https://en.wikipedia.org/wiki/Knowledge_representation_and_reasoning
HOW DO WE BRING IT ALL TOGETHER?
Image credit: http://artint.info/html/ArtInt_8.html
Chemical information is everywhere now
PubChem data complexity
• Many links between large record collections
– ~210M Substances <-> ~85M Compounds
– ~85M Compounds <-> ~85M Compounds
– ~225M Bioactivities <-> ~3M Substances
– ~225M Bioactivities <-> ~2M Compounds
– ~225M Bioactivities <-> ~1M BioAssays
– ~11M PMIDs <-> ~100K Compounds
– ~3M Patents <-> ~30M Substances
– ~3M Patents <-> ~15M Compounds
• Sparse and dense data and/or linking
• New types of data, links, and metadata on a regular basis
With so many links, how does one ensure they are accurate and relevant?
How does one improve the specificity of links?
Chemical information is not easy
• Most chemistry information is for humans, not machines
• Chemists invent arbitrary conventions to communicate chemical information
• How do you describe a chemical substance?
– No standards, meta data associated with chemical representation (e.g., purity)
• How do you describe a chemical mixture?
– No standards (InChI?), often free-form text
• How do you describe a bioactivity?
– Emerging standards, not widely adopted
• Minimum Information About a Bioactive Entity (MIABE)
• How do you draw a chemical structure?
– IUPAC Graphical Representation Standards for Chemical Structure Diagrams
• Not widely adopted, large pre-existing corpus
Image credit: http://www.ivyroses.com/Chemistry/GCSE/What-is-a-substance.php
Chemical information is not designed for computers
• As a chemist, you can
understand and recognize that
this picture is the chemical
acetone
• You can put a chemical name
or registry identifier next to it
• Is this not good enough?
• Many names for structure
• The computer ‘sees’ a binary
image not a structure
Acetone
67-64-1
Almost all chemical information is
geared towards human understanding
in the form of text and images
Computer understanding of chemical information
CC(=O)C
propan-2-one
58.07914 g/mol
Acetone
67-64-1
• Give a computer a chemical structure; a (normal) human cannot understand it
• A computer can make the image from the structure
• A computer can associate information to the structure
• A computer can generate other key information from structure
Computer understanding can help provide human understanding
If the computer understands, we can leverage it for search, analysis, and more
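As a rough illustration of a computer generating key information from a structure, the sketch below recomputes the 58.07914 g/mol value shown for acetone. It starts from the molecular formula C3H6O rather than parsing the SMILES string, and the atomic weights in the table are assumed values chosen for illustration; neither the function name nor the table comes from the slides.

```python
import re

# Atomic weights in g/mol. These specific values are an assumption made
# for this sketch; they are not taken from the slides.
ATOMIC_WEIGHT = {"H": 1.00794, "C": 12.0107, "N": 14.0067, "O": 15.9994}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C3H6O'.

    Only element symbols followed by optional counts are handled;
    brackets, charges, and isotopes are out of scope for this sketch.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


# Acetone, CC(=O)C, has the molecular formula C3H6O.
acetone_mw = molecular_weight("C3H6O")
print(f"{acetone_mw:.5f} g/mol")
```

A cheminformatics toolkit would normally derive the formula (and much more) directly from the structure; the formula is typed in by hand here only to keep the example dependency-free.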
Chemical structure is not enough
Same chemical structure can represent many things, most of which are use case dependent
• Chemical information is a bit of a mess and can be rather nuanced
– Names, names, and more names (+200M in PubChem)
• Some standard names are not open and cannot be used/verified without $$$
– Name/structure associations vary by use case (many overlapping)
• Acetic acid vs. Acetic acid tri-hydrate
• Formaldehyde: (gas) vs. Formalin (liquid, 40% formaldehyde w/ water)
• Sulfuric acid: SO3 (gas) vs. H2SO4 (liquid)
• Glucose: L/D, ring open/closed (f/p), alpha/beta/both vs. Glucose monohydrate
• Large corpus in the ‘wild’ .. data source dependent nuances
• Verify with primary source(s) prior to information use
– i.e., is this the form of the chemical I care about?
Go below 32% formaldehyde in water and it is non-flammable
Users are not happy to see overlapping data on different forms of same chemical
Many data sources of relevant information
ChemIDplus: A TOXNET DATABASE
Biological Safety Data Sheets
NIOSH Pocket Guide to Chemical Hazards
http://www.csb.gov/
Pathogen Safety Data Sheets and Risk Assessment
SDS Search and Product Safety Center
SIRI MSDS Index
International Chemical Safety Cards (ICSC)
https://www.dol.gov/
SDS and Chemical Information from Manufacturers
http://www.sigmaaldrich.com/
Right to Know Hazardous Substance Fact
CHEMINDEX FREE on the WEB!
TOXLINE: A TOXNET DATABASE
Does each organization (or scientist) use their own favorite data source(s)?
Do these various data sources provide consistent information (gaps, errors)?
How do their decisions change with different information (or lack of it)?
Chemical data issues are fundamental
• Many answers to “How do I describe my substance and
its data?”
• Information is geared towards humans
• Need computer understanding of information
• Chemical structure representation is insufficient
• Publicly available chemical information is heavily
fragmented
• Lots of data links (use case dependent, relevancy, etc.)
WHAT ARE WE DOING ABOUT IT?
Community drive towards knowledge representation
Dear Colleagues,
We would like to thank you for taking the time to participate in our first meeting to address chemical
ontologies (CO). Below please find some notes / minutes from the meeting held in Basel, Switzerland, October
2, 2015.
The purpose was to explore using computers to ingest machine-readable forms of molecules and to generate
molecular attributes (descriptors). For example, ingesting a SMILES string and producing a set of triples that
describe the molecule [molecule X “is a” ketone; “is a” amino acid; “is a” steroid, etc.]. The output of which
would provide the basis of a chemical ontology to be used for classification purposes as well as for input for
downstream operations such as knowledge graphs, data mining, chemical text mining and cognitive computing
experiments. Historically these operations were performed manually or semi-automated; however, it is
desirable to have a computer process for large scale processing to meet current day demands resulting from
computer curation of the scientific literature. To date, two programs have been developed to accomplish this
objective: one at OntoChem, a German informatics company, and another (ClassyFire) at the University of
Alberta. While both programs produce reasonable output, there are differences that could lead to nonconformity in the resulting ontologies. One motivation for the workshop was anticipation that all parties will
benefit from common standards for a computer-derived chemical ontology. Overall we believe that a
Chemical Ontology can make contributions when it comes to answering scientifically relevant questions.
Workshop hosted by Fatma Oezdemir-Zaech, Novartis Pharma AG
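The "SMILES in, triples out" idea from the meeting notes can be sketched as follows. This is only an illustration: the lookup table stands in for a real classifier such as ClassyFire or OntoChem (whose actual interfaces are not shown in the slides), and the names `TOY_CLASSIFIER`, `classify`, and `to_triples` are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical stand-in for a real structure classifier: a hand-written
# lookup from SMILES strings to class labels, for illustration only.
TOY_CLASSIFIER: Dict[str, List[str]] = {
    "CC(=O)C": ["ketone", "carbonyl compound"],
    "CC(=O)O": ["carboxylic acid", "carbonyl compound"],
}


def classify(smiles: str) -> List[str]:
    """Return class labels for a SMILES string (empty list if unknown)."""
    return TOY_CLASSIFIER.get(smiles, [])


def to_triples(smiles: str) -> List[Triple]:
    """Express each label as a (subject, predicate, object) triple."""
    return [(smiles, "is a", label) for label in classify(smiles)]


for triple in to_triples("CC(=O)C"):
    print(triple)
```

Triples in this shape can be loaded into a knowledge graph or compared across classifiers, which is where common standards for the labels start to matter.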
Things connect thru ontologies
A common vocabulary for researchers who need to share information in a domain.
Including machine-interpretable definitions of basic concepts in the domain
and relations among them.
Slide courtesy of Stephen Boyer, IBM
A Goal for Building Chemical Ontologies :
Step 1 ) Establish - Molecular Attributes
Step 2 ) Establish - Functional Attributes
Step 3 ) Explore - Integration & Predication
Slide courtesy of Stephen Boyer, IBM
There are many types of : “ Attributes “
Physical Attributes :
Examples : Molecular Weight , Melting point , Boiling Point …etc
• Molecular Attributes :
Examples: Steroid, Prostaglandin, Amino Acid, Alkene, Imidazole, .
• Functional Attributes :
Examples: Anti-Inflammatory, Explosive, Refrigerant, Pesticide
Slide courtesy of Stephen Boyer, IBM
Consider this molecule
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
Substructure labels: Carboxylic acid, Benzoic acid, Azobenzene, Phenol, Sulfone, Hydroxy group, Sulfonamide, Azo, Benzene, Pyridine
Molecular Attributes (Labels)
• Is a benzoic acid
• Is a carboxylic acid
• Is a carbonyl cpd
• Is a phenol
• Is an azobenzene
• Is an azo compound
• Is a sulfone
• Is a sulfonamide
• Is a pyridine
• Is a benzene
• Is a hydroxy
Functional Attributes
• Is used for the treatment of Crohn's disease
• Is used for the treatment of rheumatoid arthritis
• Is used for the treatment of ulcerative colitis
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
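Several of the molecular labels above imply one another (a benzoic acid is a carboxylic acid, which is a carbonyl compound). A minimal sketch of how "is a" links let a computer expand a few direct labels into the full set is given below. The `IS_A` table is a tiny hand-made fragment with a single parent per label, which is an assumption of this sketch and not how ClassyFire, OntoChem, or ChEBI store their hierarchies.

```python
# Child -> parent "is a" links. A tiny hand-made fragment for illustration;
# real chemical ontologies allow several parents per class.
IS_A = {
    "benzoic acid": "carboxylic acid",
    "carboxylic acid": "carbonyl compound",
    "azobenzene": "azo compound",
}


def ancestors(label):
    """Follow the is-a links upward and return the chain of parents."""
    chain = []
    while label in IS_A:
        label = IS_A[label]
        chain.append(label)
    return chain


def expand(labels):
    """Return the direct labels plus everything implied by is-a links."""
    result = set(labels)
    for label in labels:
        result.update(ancestors(label))
    return result


print(ancestors("benzoic acid"))
print(sorted(expand(["benzoic acid", "azobenzene"])))
```

With such links in place, a search for "carbonyl compound" can return molecules that were only labeled "benzoic acid".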
Slide courtesy of Stephen Boyer, IBM
[Diagram elements: Name; Structure; SMILES; Operation A; Operation B; Molecular Attributes / Labels]
Slide courtesy of Stephen Boyer, IBM
[Diagram elements: Structure; Name; SMILES; Operation A; ClassyFire; OntoChem; Other (Ex: Derwent Frag Codes, ChEBI, MeSH); Molecular Labels]
Normalization Process – compatible with rules of chemical nomenclature
Slide courtesy of Stephen Boyer, IBM
Name
2-[(1S,2S,4R,8S,9S,11S,12R,13S,19S)-12,19-difluoro-11-hydroxy-6,6,9,13-tetramethyl-16-oxo-5,7-dioxapentacyclo[10.8.0.0^{2,9}.0^{4,8}.0^{13,18}]icosa-14,17-dien-8-yl]-2-oxoethyl acetate
Molecular Descriptors / Attributes / Labels
2-[(1S,2S,4R,8S,9S,11S,12R,13S,19S)-12,19-difluoro-11-hydroxy-6,6,9,13-tetramethyl-16-oxo-5,7-dioxapentacyclo[10.8.0.0^{2,9}.0^{4,8}.0^{13,18}]icosa-14,17-dien-8-yl]-2-oxoethyl acetate
Slide courtesy of Stephen Boyer, IBM
SMILES String
[H][C@@]12C[C@@]3([H])[C@]4([H])C[C@]([H])(F)C5=CC(=O)C=C[C@]5(C)[C@@]4(F)[C@@H](O)C[C@]3(C)[C@@]1(OC(C)(C)O2)C(=O)COC(C)=O
ClassyFire
ClassyFire: Halogenated steroids (6); Fluorohydrins (7); Halohydrins (7); 1,3-dioxolanes (9);
11-beta-hydroxysteroids (9); Dioxolanes (9); 3-oxo delta-1,4-steroids (10); Alpha-acyloxy
ketones (10); Delta-1,4-steroids (10); 11-hydroxysteroids (12); Gluco/mineralcorticoids,
progestogins and derivatives (13); Pregnane steroids (13); 20-oxosteroids (15); Acetate
salts (22); 3-oxosteroids (26); Oxosteroids (27); Carboxylic acid salts (30); Hydroxysteroids
(32); Cyclic ketones (45); Alpha amino acid amides (73); Pyrrolidines (80); D-alpha-amino
acids (85); Cyclic ketones (45); Acetals (50); Steroids and steroid derivatives (51); Alkyl
fluorides (53); Alkyl halides (67); Cyclic alcohols and derivatives (86); Ketones (101);
Organofluorides (128); Carboxylic acid esters (139); Secondary alcohols (187); Oxacyclic
compounds (192); Lipids and lipid-like molecules (209); Organohalogen compounds (272);
Ethers (393); Alcohols and polyols (395); Carboxylic acid derivatives (423); Carboxylic acids
and derivatives (548); Carbonyl compounds (598); Organic acids and derivatives (633);
Organoheterocyclic compounds (651); Organooxygen compounds (856); Organic
compounds (978); Chemical entities (989); Hydrocarbon derivatives (995);
OntoChem
OntoChem: 17-deoxy-prednisolones (6); halohydrins (6);
prednisolones (6); ethanoic acid esters (20); methyl esters
(20); acetals (37); alkyl fluorides (56); cyclic ketones (61);
natural product derivatives (92); fluorine compounds
(126); alkene derivatives (172); polycyclic compounds
(184); oxacyclic compounds (190); secondary alcohols
(202); carboxylic acids (249); formic acid derivatives (559);
lipophilic molecules (642); lipinski molecules (785);
bioavailable molecules (867); oxygen compounds (891);
small molecules (949); carbon compounds (974); hetero
compounds (978);
Green means likely synonym between ontologies
It’s an olefin
It’s an alkene
Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
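Finding "likely synonyms between ontologies" can be approximated by normalizing labels before comparing them. The sketch below uses a few labels copied from the two lists above; the normalization rules (lowercasing, naive plural stripping, and a one-entry synonym table based on the olefin/alkene remark) are assumptions for illustration and far cruder than a real normalization process would be.

```python
# Assumed synonym table; the single entry reflects the olefin/alkene remark.
SYNONYMS = {"olefin": "alkene"}


def normalize(label: str) -> str:
    """Lowercase, strip a trailing plural 's' (naively), apply synonyms."""
    key = label.strip().lower()
    if key.endswith("s"):
        key = key[:-1]
    return SYNONYMS.get(key, key)


def shared_labels(a, b):
    """Labels that coincide after normalization, in sorted order."""
    return sorted({normalize(x) for x in a} & {normalize(x) for x in b})


# Small subsets of the labels listed for each classifier above.
classyfire = ["Halohydrins", "Acetals", "Alkyl fluorides",
              "Cyclic ketones", "Ketones", "Ethers"]
ontochem = ["halohydrins", "acetals", "alkyl fluorides",
            "cyclic ketones", "prednisolones", "carboxylic acids"]

print(shared_labels(classyfire, ontochem))
print(normalize("Olefins") == normalize("alkene"))
```

String matching of this kind only suggests candidates; whether two labels really denote the same class still depends on how each ontology defines it.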
Classifications in PubChem
• Enables drill down to chemical lists with a classified feature of interest
• Allows interesting cross comparison between classification systems
– For example, compare MeSH with ChEBI
Of the 38 benzaldehydes in ChEBI, how many are within the set of 128 benzaldehydes in MeSH?
Only 20 between the two are in common
Only 19 of 20 are considered aldehydes
Only 17 of 19 are considered benzaldehydes!
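A cross comparison of this kind reduces to set operations on compound identifiers. The sketch below shows the mechanics on placeholder identifiers invented for illustration (they are not real ChEBI or MeSH records), then applies the overlap fraction to the counts quoted above (38, 128, and 20 in common).

```python
def compare(a: set, b: set) -> dict:
    """Split two identifier sets into shared and source-specific members."""
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


def overlap_fraction(n_a: int, n_b: int, n_common: int) -> float:
    """Jaccard-style overlap computed from set sizes alone."""
    return n_common / (n_a + n_b - n_common)


# Placeholder compound identifiers, invented for this example.
chebi_like = {"cpd-1", "cpd-2", "cpd-3", "cpd-4"}
mesh_like = {"cpd-3", "cpd-4", "cpd-5"}

result = compare(chebi_like, mesh_like)
print({key: sorted(value) for key, value in result.items()})

# Counts quoted on the slide: 38 in ChEBI, 128 in MeSH, 20 in common.
print(round(overlap_fraction(38, 128, 20), 3))
```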
Computational vs. Manual Classification
• Humans make mistakes
– This includes programming mistakes, classification mistakes, encoding mistakes
• Manual classifications only handle a small number of chemical structures
– Automated classification can help to extend classifiers to all known chemicals
• Harmonization of terminology?
– Can they speak a common language?
Image credit: https://commons.wikimedia.org/wiki/File:Brueghel-tower-of-babel.jpg
Benzene boiling point case study
Computational harmonization of chemical concepts
[comparison of MeSH, WHO ILO ICSC, OSHA concepts]
Benzene, Coal tar, Gasoline, Naphtha
• OSHA 706 is for naphtha. There are several names [Naptha, Naptha (coal tar), high solvent naptha, crude solvent coal tar naptha, 8030-30-6]. This maps to MeSH concept M0045647, with names [naphtha, O3L624621X, 8030-30-6, benzin, benzine, petroleum ether].
• There is a second record for naphtha from OSHA, 707, which is a narrower concept to OSHA 706 above. OSHA 707 has several names [naptha, petroleum ether, petroleum spirit, painters naphtha, refined solvent naptha, 8032-32-4, ligroin]. These names map to the MeSH concept M0045647 via ‘petroleum ether’ but also M0060581 with names [ligroin, 8032-32-4].
• ILO 1400 is for gasoline. There are three names [Gasoline, Benzin, 86290-81-5]. The corresponding MeSH concept is M0008998 with single name ‘Gasoline’. The ILO term ‘86290-81-5’ is not in MeSH. The other ILO term ‘Benzin’ is located in another concept M0045647 with names ‘naphtha, O3L624621X, 8030-30-6, benzin, benzine, petroleum ether’.
Harmonization suggestions to improve MeSH
1. Reg# ‘86290-81-5’ to be added to the gasoline concept M0008998.
2. M0060581 to be a related concept (narrower) to M0045647.
3. name ‘petroleum ether’ be moved from M0045647 to M0060581.
4. names [petroleum spirit, painters naphtha, refined solvent naptha] to be added to M0060581.
5. name [phenyl hydride] to be added to M0002332.
6. M0008998 to be a related concept (narrower) to M0045647.
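The name-to-concept matching behind this comparison can be sketched with a simple index. The concept identifiers and names below are copied from the slide; the case-insensitive exact matching and the function names `build_index` and `match` are assumptions for illustration, not the procedure actually used for the harmonization.

```python
from collections import defaultdict

# MeSH concepts and their names as quoted on the slide.
MESH = {
    "M0045647": ["naphtha", "O3L624621X", "8030-30-6",
                 "benzin", "benzine", "petroleum ether"],
    "M0060581": ["ligroin", "8032-32-4"],
    "M0008998": ["Gasoline"],
}

# Source records as quoted on the slide (spellings kept as given).
OSHA_707 = ["naptha", "petroleum ether", "petroleum spirit",
            "painters naphtha", "refined solvent naptha",
            "8032-32-4", "ligroin"]
ILO_1400 = ["Gasoline", "Benzin", "86290-81-5"]


def build_index(concepts):
    """Map each lowercased name to the set of concepts that carry it."""
    index = defaultdict(set)
    for concept_id, names in concepts.items():
        for name in names:
            index[name.lower()].add(concept_id)
    return index


def match(names, index):
    """Group a record's names by the concept(s) they hit in the index."""
    hits = defaultdict(list)
    for name in names:
        for concept_id in index.get(name.lower(), ()):
            hits[concept_id].append(name)
    return dict(hits)


index = build_index(MESH)
print(match(OSHA_707, index))
print(match(ILO_1400, index))
```

A record whose names land in more than one concept (as OSHA 707 and ILO 1400 do here) is a candidate for the kind of harmonization suggestions listed above, and names with no hit, such as ‘86290-81-5’, are candidates for addition.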
Summary
• Knowledge representation in chemistry enables computer
understanding of the domain
• Community efforts are underway to help build and harmonize
chemical ontologies
• There are many opportunities to improve the quality,
quantity, variety, relevancy, and integration of (open) chemical
data and chemical knowledge
PubChem Crew …
Steve Bryant (PI)
Evan Bolton
Sunghwan Kim
Jie Chen
Ben Shoemaker
Tiejun Cheng
Paul Thiessen
Gang Fu
Jiyao Wang
Renata Geer
Yanli Wang
Asta Gindulyte
Bo Yu
Lianyi Han
Leonid Zaslavsky
Jane He
Jian Zhang
Siqian He
Special thanks to the NCBI Help Desk, especially Rana
Morris, and past PubChem group members.
Special thanks
• Chemical Ontology Collaborators (including Stephen Boyer, Yannick Djoumbou, Lutz Weber)
• PubChemRDF Collaborators
– Especially:
• Janna Hastings (European Bioinformatics Institute), Michel Dumontier (Stanford University), Colin Batchelor (Royal Society of Chemistry), Egon Willighagen (Maastricht University)
• Stephan Schurer, Uma Vempati, and Hande Küçük (U. of Miami)
• NLM Linked Data Infrastructure Working Group (MeSH RDF)
• Chemical Health and Safety collaborators
– Especially: Leah McEwen (Cornell U.), Ralph Stuart (Keene State College), Ye Li (U. of Michigan)
• Software collaborators
– NextMove Software (Roger Sayle, Daniel Lowe, Noel O’Boyle, John May)
– Xemistry GmbH (Wolf D. Ihlenfeldt)
– OpenEye Scientific Software
• All PubChem Contributors and Collaborators
This research was supported [in part] by the Intramural Research Program of the NIH, National
Library of Medicine.
Have any questions?