Small Molecules in Bioinformatics


Small Molecule Resources at the EBI
Dr. Louisa Bellis
Chemical Content Curator, ChEMBL Group
EMBL-EBI, UK
Bioinformatics Resources for Immunologists
6th September 2013
Services | Research | Training | Industry
Agenda
• Introduction
• Small molecule resources
• ChEBI
• ChEMBL
• Searching and browsing
• Hands-on Exercises
Small Molecules within Bioinformatics
Genomes
Literature
Expressions
Nucleotide sequences
Protein sequences
Protein domains, families
Enzymes
3D structures
Small molecules
Pathways
Systems
Annotation of bioinformatics data
• Essential for capturing understanding and knowledge
associated with core data
• Often captured in free text, which is easier to read and better
for conveying understanding to a human audience, but…
• Difficult for computers to parse
• Quality varies from database to database
• Terminology used varies from annotator to annotator
• Towards annotation using standard vocabularies: ontologies
within bioinformatics
Small Molecule Databases can be used to:
• Investigate historical compounds and associated
bioactivity data.
• Create Structure-Activity Relationships (SARs)
• Direct synthesis
• Direct end product testing
ChEBI and ChEMBL
What is ChEBI?
• Chemical Entities of Biological Interest
• Freely available
• Focused on ‘small’ chemical entities (no proteins or
nucleic acids)
• Illustrated dictionary of chemical nomenclature
• High quality, manually annotated
• Provides chemical ontology
• ~39,000 ChEBI 3* compounds
Access ChEBI at http://www.ebi.ac.uk/chebi/
ChEBI Data Overview
Nomenclature
• caffeine
• 1,3,7-trimethylxanthine
• methyltheobromine
Ontology
• metabolite
• CNS stimulant
• trimethylxanthines
Chemical data
• Formula: C8H10N4O2
• Charge: 0
• Mass: 194.19
Database Xrefs
• MSDchem: CFF
• KEGG DRUG: D00528
Chemical Informatics
• InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
• SMILES: CN1C(=O)N(C)c2ncn(C)c2C1=O
Visualisation
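The mass quoted for caffeine can be checked directly from its formula. The sketch below is only an illustration: the `molecular_mass` helper and the small `ATOMIC_MASS` table are assumptions introduced here (rounded standard atomic weights), not part of ChEBI, and the parser handles only simple formulas without brackets or charges.

```python
import re

# Rounded standard atomic weights for the elements needed in this example.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_mass(formula: str) -> float:
    """Sum atomic masses for a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


print(round(molecular_mass("C8H10N4O2"), 2))  # 194.19
```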
CHEBI COMPOUND PAGE
ChEBI Chemical Structures
• Chemical structure may be interactively explored using the MarvinView applet
• Available in several formats:
• Image
• Molfile
• InChI and InChIKey
• SMILES
Automatic Cross-references
The ChEBI ontology
Organised into three sub-ontologies, namely
• Molecular structure ontology
• Subatomic particle ontology
• Role ontology
(R)-adrenaline
Molecular structure ontology
Role ontology
ChEBI ontology relationships
• Generic ontology relationships
• Chemistry-specific relationships
Viewing ChEBI ontology
What is ChEMBL?
• Database of bioactive, drug-like small molecules.
• Stores 2D structures and calculated properties (logP, molecular weight, Lipinski parameters, etc.)
• Contains abstracted bioactivity data, e.g. binding data
and IC50, from multiple primary scientific journals
• Covers about 33 years of compound synthesis and
testing
• Annotated FDA-approved drugs
Access ChEMBL at https://www.ebi.ac.uk/chembldb/
Data Statistics
• Focused towards compounds with drug-like properties by extraction from medicinal chemistry journals
• Includes small molecules (~92%) and peptides (~7%)
• Abstracted from 50,095 papers across 47 journals
• 1,487,579 compound records (~450,000 directly from PubChem)
• 1,295,510 distinct compound structures
• 11,420,351 activities (>6.0 million directly from PubChem)
• Binding measurements, functional assays and ADMET
• 9,844 targets, with over 5,400 protein targets and over 2,440 human targets
• Deposition of PubChem Substances and Bioassay assays
ChEMBL Data Overview
Compound
[Chemical structure diagram]
Target
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLERECVEETCSYEEAFEALE
SSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGTNYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGA
DLQENFCRNPDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRL
AVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGYCDLNYCEEAVEEETG
DGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMS
PWQVMLFRKSPQELLCGASLISDRWVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIH
PRYNWRENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQV
VNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKY
GFYTHVFRLKKWIQKVIDQFGE
Target name: Thrombin
Bioactivity
• Compound: Ki = 4.5 nM
Assay (SAR data)
• APTT: 11 min
Target Discovery
• Target identification
• Microarray profiling
• Target validation
• Assay development
• Biochemistry
• Clinical/Animal disease models
Lead Discovery
• High-throughput Screening (HTS)
• Fragment-based screening
• Focused libraries
• Screening collection
Lead Optimisation
• Medicinal Chemistry
• Structure-based drug design
• Selectivity screens
• ADMET screens
• Cellular/Animal disease models
• Pharmacokinetics
Preclinical Development
• Toxicology
• In vivo safety pharmacology
• Formulation
• Dose prediction
Clinical Trials
• Phase 1: PK, tolerability
• Phase 2: Efficacy
• Phase 3: Safety & Efficacy
Launch
• Indication discovery & expansion
Stages: Discovery (Med. Chem. SAR), Development, Use
Clinical Candidates
• ~15,000 candidates
Drugs
• ~2,400 drugs
ChEMBL database
• > 10,000,000 bioactivities
• > 1,300,000 compounds
• ~30,000 distinct lead series
ChEMBL Target Types
Molecular
• Protein
• Single Protein (e.g. PDE5)
• Protein Complex (e.g. Nicotinic acetylcholine receptor)
• Protein Family (e.g. Muscarinic receptors)
• Nucleic acid (e.g. DNA)
Non-molecular
• Cell-line (e.g. HEK293 cells)
• Tissue (e.g. Nervous)
• Subcellular-fraction (e.g. Mitochondria)
• Organism (e.g. Drosophila)
CHEMBL COMPOUND PAGE
Clickable structure
Drug Information
Structural Representations
ChEMBL --> ChEBI Link:
ChemSpider Links:
The links work both ways: ChEMBL links TO ChemSpider and FROM ChemSpider. They link on Standard InChI.
Wikipedia Links:
ChEMBL also has links with Wikipedia. These also use the Standard InChI as the common identifier, and they link to the Compound Report Card in ChEMBL.
Searching and Browsing
Chemical names
• Common or trivial names are those in wide everyday use.
• Advantages of common names include simplicity, ease of pronunciation and universal recognition.
• The main disadvantage is ambiguity – the same common name may refer to more than one type of chemical.
• Fluorene
• Fluorine
Systematic names
• A systematic name is one which corresponds to the chemical
structure such that the structure can be determined from the
name, e.g. 1,2-dimethyl-naphthalene
• Software packages exist which can generate structures from the
systematic names (e.g. ACD/Name, ChemOffice, MarvinSketch).
• More than one correct systematic name can be assigned to the
same molecular structure, depending on the manner in which
naming rules are applied (e.g. IUPAC names).
Examples of common and systematic names
Common names
• caffeine
• guaranine
• theine
Systematic names
• 1,3,7-trimethyl-3,7-dihydro-1H-purine-2,6-dione
• 7-methyltheophylline
• 1,3,7-trimethyl-2,6-dioxopurine
The ChEBI web service
• Programmatic access to a ChEBI entry
• SOAP based Java implementation
• Clients currently available in Java and Perl
• Methods include:
• getLiteEntity
• getCompleteEntity and getCompleteEntityByList
• getOntologyParents
• getOntologyChildren and getAllOntologyChildrenInPath
• getStructureSearch
• Documented at http://www.ebi.ac.uk/chebi/webServices.do
Web services
• Allow users to create their own applications to query data
Web service client object model
getLiteEntity
getCompleteEntity
getOntology (Parents and Children)
ChEMBL Web Services
• Programmatic access to the ChEMBL database
• Java, Perl and Python example scripts are provided to help you get started with the ChEMBL RESTful Web Service API
• Can be used to bring back compounds, lists of
compounds, images, targets and assays
• https://www.ebi.ac.uk/chembldb/index.php/ws
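As a rough illustration of the RESTful access described above, the sketch below builds a URL for one compound record and fetches it as JSON. The base URL is an assumption: it is the address of the present-day ChEMBL data API, not one given on these slides, and the helper names `molecule_url` and `fetch_molecule` are hypothetical. Check the ChEMBL web service documentation for the endpoints that apply to your release.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL (present-day ChEMBL data API; not taken from the slides).
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def molecule_url(chembl_id: str) -> str:
    """Build the URL for one compound record, requested as JSON."""
    return f"{BASE_URL}/molecule/{quote(chembl_id)}.json"


def fetch_molecule(chembl_id: str) -> dict:
    """Download and decode one compound record (needs network access)."""
    with urlopen(molecule_url(chembl_id), timeout=30) as response:
        return json.load(response)


# Only the URL is built here; call fetch_molecule("CHEMBL25") to hit the service.
print(molecule_url("CHEMBL25"))
```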
Examples of Web Services
INTERFACE SEARCHING
ChEBI simple and advanced text search
AND, OR and BUT NOT
Narrow to category
Structure drawing tools
Search options
Search Results
Hover-over for a larger structure
Click to go to entry page
Types of structure search
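• Identity – based on InChI, e.g. InChI=1/H2O/h1H2
• Substructure – uses fingerprints to narrow search range, then performs full substructure search algorithm
• Similarity – based on Tanimoto coefficient calculated between the fingerprints
Tanimoto(a,b) = c / (a+b-c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both
Fingerprint a: 0010110010 (4 bits set)
Fingerprint b: 1010110111 (7 bits set)
Bits in common: c = 4
Tanimoto(a,b) = 4 / (4+7-4) = 0.57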
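The worked example can be reproduced in a few lines. This is a minimal sketch that treats fingerprints as strings of 0s and 1s; real search engines use packed bit vectors and their own fingerprint definitions, and the function name `tanimoto` is simply chosen here for illustration.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient c / (a + b - c) for two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")  # bits set in the first fingerprint
    b = fp_b.count("1")  # bits set in the second fingerprint
    # Bits set at the same position in both fingerprints.
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    if a + b - c == 0:
        return 0.0  # both fingerprints are empty
    return c / (a + b - c)


print(round(tanimoto("0010110010", "1010110111"), 2))  # 0.57
```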
Browse via Periodic Table
Molecular entities / Elements
Navigate via links in ontology
Click to follow ontology links
ChEMBL Interface Searching:
• Keywords
• Compound name
• Trade Name
• Synonym
• Structure
• Exact match
• Substructure
• SMILES
• Single or a list of SMILES
Keyword searches: can use * as a wildcard
Can search with a list of ChEMBL IDs, or Keywords or SMILES
Run substructure and similarity searches
Types of Compound Names To Use
• ChEMBL captures all compound names, compound keys
and synonyms from the papers.
• Synonyms can be taken from the publications or are
curated from other sources (e.g. NCBI website).
• Curated and extracted synonyms in ChEMBL_16 > 660,000
• Types of synonyms captured include:
• Research codes
• FDA alternative names
• Trade Names (not for combinations of drugs)
• INN, BAN, JAN, USAN
Protein Sequence Search
• More precise method for identifying targets
• Input is a protein sequence of interest
• Uses BLAST* algorithm to perform pair-wise comparisons
between input sequence and all proteins in the Target
Dictionary, to find most closely related matches
• Results are scored according to similarity to input sequence
(determined by number of amino acids that are identical or
have similar properties)
*Altschul SF et al., J Mol Biol. 215(3), p403-10 (1990).
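To make the idea of scoring by identical residues concrete, the sketch below reads FASTA-formatted text and computes percent identity between two sequences. It is not BLAST: it assumes the two sequences are already aligned and of equal length, and it ignores residues with merely similar properties. The sequence named `hit` is a hypothetical variant made up for the example; `query` is the first ten residues of the thrombin sequence shown earlier.

```python
def parse_fasta(text: str) -> dict:
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical positions in two pre-aligned sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# 'hit' is a hypothetical sequence differing from 'query' at one position.
FASTA_TEXT = ">query\nMAHVRGLQLP\n>hit\nMAHVKGLQLP\n"
seqs = parse_fasta(FASTA_TEXT)
print(percent_identity(seqs["query"], seqs["hit"]))  # 90.0
```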
Find a protein sequence of interest
http://www.uniprot.org
Select entry of interest
Retrieve sequence in FASTA format
Paste in a FASTA file and run a search to fetch matching targets
Can also browse using the Taxonomy Family Tree browser
Search box for keyword searching
Browse Drugs Tab
Able to search the approved drugs using keywords
WHY USE ChEBI AND ChEMBL?
I want to find data and information on the target, IRAK4.
I also want to find out about the compounds that have been tested against this target.
But where would I start?
Identifying Chemical Tools
• Search ChEMBL for protein of interest
• Simple text search against protein names/synonyms OR
• Browse protein family tree OR
• Sequence search using BLAST (can find related proteins)
• Identify compounds active against this protein
• Sort/filter by relevant activity types and potency
• E.g., retrieve compounds with IC50/Ki < 100nM
• Retrieve other data for these compounds
• Structures, chemical properties, other activities
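The potency filter in the steps above can be sketched against a tab-delimited table of activities. The column names (`compound_id`, `activity_type`, `value`, `units`) and all rows in `SAMPLE_TSV` are hypothetical and chosen only for illustration; a real ChEMBL download uses its own headers, so the keys would need adjusting.

```python
import csv
import io

# Hypothetical tab-delimited activity table (made-up compounds and values).
SAMPLE_TSV = (
    "compound_id\tactivity_type\tvalue\tunits\n"
    "CPD-A\tIC50\t12\tnM\n"
    "CPD-B\tKi\t4.5\tnM\n"
    "CPD-C\tIC50\t2500\tnM\n"
    "CPD-D\tEC50\t30\tnM\n"
)


def potent_compounds(tsv_text: str, max_nm: float = 100.0) -> list:
    """Return IDs of compounds with an IC50 or Ki below max_nm (in nM)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        is_potency = row["activity_type"] in {"IC50", "Ki"}
        if is_potency and row["units"] == "nM" and float(row["value"]) < max_nm:
            hits.append(row["compound_id"])
    return hits


print(potent_compounds(SAMPLE_TSV))  # ['CPD-A', 'CPD-B']
```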
Compound Properties and Selectivity
• ChEMBL stores a wide range of calculated compound
properties (e.g., mol wt, logP, RO5 violations)
• Can be used to identify compounds most likely to have good
in vivo properties (Absorption, Distribution, Metabolism,
Excretion)
• Contains activity information against liability targets (e.g.,
cytochrome P450s, HERG K+ channel)
• If compounds have been tested in these assays, can avoid
those with potential toxicity issues
• Contains data on a wide range of targets
• If compounds have been tested against multiple targets, can
get an idea of their selectivity (important for validation
studies)
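The RO5 (rule-of-five) violation count mentioned above can be illustrated with a short function using the usual Lipinski thresholds: molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors. This is a sketch only; ChEMBL precomputes this value, and the function name and the input values below are illustrative assumptions rather than data from the database.

```python
def ro5_violations(mol_weight: float, logp: float,
                   h_donors: int, h_acceptors: int) -> int:
    """Count Lipinski rule-of-five violations for one compound."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # logP above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Illustrative (made-up) property values for two hypothetical compounds.
print(ro5_violations(194.19, -0.1, 0, 6))   # 0
print(ro5_violations(720.0, 6.2, 3, 12))    # 3
```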
DOWNLOAD AND ANALYSIS OF RESULTS
• The compound results can be downloaded as an SD file (*.sdf).
• The bioactivity data can be downloaded as an *.XLS file or a TAB file (tab-delimited)
Activity types and values
Assay details
Literature references
You can use the standard Excel filtering options to filter the results
Downloads and programmatic access
Downloading ChEBI flavours
• All downloads come in two flavours
• 3 star only entries (manually annotated ChEBI
entries)
• 2 and 3 star entries (manually annotated ChEBI,
ChEMBL and user submissions)
Downloading ChEBI
• OBO file
• Use in OBO-Edit
• SDF file
• Compatible with chemistry software such as Bioclipse
• Flat file, tab delimited
• Import all the data into Excel
• Parse it into your own database structure
• Oracle binary dumps
• Import into an Oracle database
• Generic SQL insert statements
• Import into a MySQL or PostgreSQL database
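For those who prefer to parse the OBO file themselves, the sketch below reads `[Term]` stanzas and collects each term's name and `is_a` parents. It handles only the `id`, `name` and `is_a` tags. The stanzas in `SAMPLE_OBO` use made-up `EX:` identifiers, not real ChEBI entries, and `parse_obo_terms` is a hypothetical helper name.

```python
# Made-up stanzas in OBO flat-file layout (identifiers are not real ChEBI IDs).
SAMPLE_OBO = """\
[Term]
id: EX:0001
name: example parent class

[Term]
id: EX:0002
name: example child compound
is_a: EX:0001 ! example parent class
"""


def parse_obo_terms(text: str) -> dict:
    """Return {term_id: {"name": str, "is_a": [parent ids]}} from OBO text."""
    terms = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"name": "", "is_a": []}
        elif line.startswith("["):
            current = None  # skip other stanza types such as [Typedef]
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop the trailing comment
            if key == "id":
                terms[value] = current
            elif key == "name":
                current["name"] = value
            elif key == "is_a":
                current["is_a"].append(value)
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
print(terms["EX:0002"]["is_a"])  # ['EX:0001']
```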
Downloading ChEMBL
Help and Feedback
• Email addresses for support queries and feedback
• General questions and feedback on ChEMBL interface:
[email protected]
• Reporting of data errors:
[email protected]
• General questions, support and feedback on ChEBI
[email protected]
Thank you