Computational Metabolomics
Computational Challenges in
Metabolomics (Part 1)
David Wishart, University of Alberta
Dagstuhl Seminar on Computational Mass
Spectrometry
Schloss Dagstuhl, Germany Aug. 23-28, 2015
The Pyramid of Life
Metabolomics (Metabolome)
Proteomics (Proteome)
Genomics (Genome)
Environmental Influence; Physiological Influence
Why Small Molecules Count
• 100% of all agricultural products (herbicides,
pesticides, fertilizers) are small molecules
• >99% of all compounds that give food or drinks their
aroma, color and taste are small molecules
• 91% of all known drugs are small molecules
• >85% of all common clinical assays test for small
molecules
• 60% of all drugs are derived from pre-existing
metabolites
• 10-15% of identified genetic disorders involve
diseases of small molecule metabolism
Proteomics vs. Metabolomics
Proteomics
• Very MS or MS/MS oriented
• Good separation is critical
• Generates lots of raw data
• Peptide and protein ID
• Isotopic labeling (ICAT) helps
• Possible to derive 3D structure
• Permits protein imaging
• Very dependent on databases
• Spectral processing and deconvolution is challenging
• Quantitation is challenging
• Data analysis requires MV stats
• Data integration is challenging
• Better software is needed
Metabolomics
• Very MS or MS/MS oriented
• Good separation is critical
• Generates lots of raw data
• Chemical ID
• Isotopic labeling (SIL) helps
• Possible to derive 3D structure
• Permits metabolite imaging
• Very dependent on databases
• Spectral processing and deconvolution is challenging
• Quantitation is challenging
• Data analysis requires MV stats
• Data integration is challenging
• Better software is needed
Proteomics vs. Metabolomics
Proteomics Workflow (Protein ID by PMF-MS)
Biofluid/Extracts --> HPLC or PAGE --> Tryptic Digest --> MALDI plate --> MS analysis --> Mass Fingerprint --> Protein ID
Metabolomics Workflow (Compound ID by GC/LC-MS)
Biological or Tissue Samples --> Extraction --> Biofluids or Extracts --> LC-MS or GC-MS --> LC/GC-MS Spectra (total ion chromatogram) --> Compound ID
Proteomics vs. Metabolomics
Proteomics
• Polymers of 20 amino acids (chemically similar)
• 185 million sequences (from DNA sequencing)
• Sequence defines MS & MS/MS spectra
• Trypsin gives definable cleavages
• MS alone can ID proteins (PMF)
• MS/MS fragmentation at 1 fixed energy
• MS/MS fragmentation is easily predictable and very distinct
• 30 common PTMs
• PTMs are somewhat predictable
Metabolomics
• 1000s of distinct chemical classes (chemically diverse)
• No information from DNA sequencing
• Structure defines MS & MS/MS spectra (adducts, fragments)
• No trypsin for small molecules (CID only)
• MS alone cannot ID metabolites
• Different energies for different molecules
• MS/MS & EI-MS fragments not easily predictable, often similar
• >400 PTMs via metabolism
• PTMs are hard to predict
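One way to see why "MS alone cannot ID metabolites" is that structural isomers share an elemental formula and therefore an identical monoisotopic mass. The sketch below is only an illustration, not part of the talk: the helper name monoisotopic_mass and the choice of leucine and isoleucine as the example pair are assumptions made here, and the parser handles only simple formulas built from C, H, N and O.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def monoisotopic_mass(formula: str) -> float:
    """Sum isotope masses for a simple formula such as 'C6H13NO2'.

    Hypothetical helper for illustration; no brackets, charges or isotopes.
    """
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


# Two distinct metabolites with the same elemental formula.
isomers = {"leucine": "C6H13NO2", "isoleucine": "C6H13NO2"}
masses = {name: monoisotopic_mass(formula) for name, formula in isomers.items()}

for name, mass in masses.items():
    print(f"{name}: {mass:.4f} Da")
```

Both compounds come out at the same mass, so a single MS measurement cannot separate them; MS/MS, EI-MS or retention time information is needed, as the next slide argues.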
Challenges for Metabolomics
• Most MS-based metabolomics studies ID <100 cmpds
(<1% of the known metabolome)
• Metabolite ID requires accurate, referential MS/MS or
EI-MS spectra and/or RT information
• Limited experimental MS/MS, EI-MS & RT data
• The chemical space of most metabolomes is not fully
known (perhaps >5 million compounds total)
• <1% of the chemicals in PubChem are relevant to
metabolomics
• Metabolomics needs specialized compound and
spectral (MS/MS, EI-MS, NMR) databases
• Metabolomics needs computational tools to predict
biologically viable metabolites and their spectra
LC-MS Spectral DBs
• MoNA – 236,604 spectra, 69,946 cmpds** (12,000)
• METLIN – 68,124 spectra, 13,048 cmpds
• mzCloud – 422,349 spectra, 2975 cmpds
• NIST14 MS/MS – 234,284 spectra, 9344 cmpds
• MassBank – 28,185 spectra, 11,500 cmpds
• Wiley LC-MSn – >10,000 spectra, 4500 poisons
• ReSpect – 9107 spectra, 3595 cmpds
• GNPS – 9000 spectra, 4200 natural products
Total #compounds with exp. MS/MS spectra ~20,000
Less than 60% are biologically relevant
How to Get Missing Spectra?
• Obtain or synthesize all biologically relevant
molecules (metabolites, HPVs, drugs,
pollutants, foods, etc.), prepare or synthesize
their metabolites and collect their NMR, LC-MS
and GC-MS spectra
COST - 5,000,000 cmpds X $1000/cmpd = $5 billion
• OR
• Do this entire exercise computationally
COST - 5,000,000 cmpds X $0.10/cmpd = $500,000
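The two cost estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope check using the per-compound figures from the slide; the variable names are chosen here for illustration.

```python
# Figures taken from the slide; names are illustrative only.
N_COMPOUNDS = 5_000_000
COST_EXPERIMENTAL_USD = 1000.00   # per compound, synthesize and measure
COST_COMPUTATIONAL_USD = 0.10     # per compound, predict in silico

experimental_total = N_COMPOUNDS * COST_EXPERIMENTAL_USD
computational_total = N_COMPOUNDS * COST_COMPUTATIONAL_USD
ratio = experimental_total / computational_total

print(f"Experimental:  ${experimental_total:,.0f}")
print(f"Computational: ${computational_total:,.0f}")
print(f"Cost ratio:    {ratio:,.0f}x")
```

Under these assumed unit costs the computational route is about four orders of magnitude cheaper.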
Computational Metabolomics
Known biomolecules (50,000)
--> Predicted biotransformations (50,000 --> 5,000,000)
--> Predicted MS/MS, NMR, GC-MS spectra of knowns + biotransformed
--> Match observed spectra to predicted spectra to ID compounds
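The step "match observed spectra to predicted spectra" needs some similarity score. The slides do not say which one is used, so the sketch below assumes a plain cosine similarity over binned peaks, a common choice for this kind of comparison. The function names, the bin width and the peak lists are all made up for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.01):
    """Map (m/z, intensity) peaks onto integer m/z bins, summing intensities."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def cosine_score(observed, predicted, bin_width=0.01):
    """Cosine similarity between two binned spectra (0 = no overlap, 1 = identical shape)."""
    a = bin_spectrum(observed, bin_width)
    b = bin_spectrum(predicted, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(observed, candidates):
    """Return the name of the candidate whose predicted spectrum scores highest."""
    return max(candidates, key=lambda name: cosine_score(observed, candidates[name]))


# Toy data: invented peak lists, not real reference spectra.
observed = [(85.03, 100.0), (127.04, 40.0), (145.05, 15.0)]
candidates = {
    "candidate_A": [(85.03, 90.0), (127.04, 50.0), (145.05, 10.0)],
    "candidate_B": [(91.05, 100.0), (119.05, 30.0)],
}

print(best_match(observed, candidates))
```

In practice the candidate list would first be narrowed by precursor mass, and the scoring would have to tolerate mass error and intensity differences between instruments and collision energies.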
The Human Metabolome
Database (HMDB)
http://www.hmdb.ca
• A web-accessible resource
containing detailed
information on 41,993
“quantified”, “detected” and
“expected” metabolites
• Data mined from the
literature and other eDBs
• 100s of drug metabolites
• 1000s of xenobiotics
• >10,000 reference spectra
• Supports sequence,
spectral, structure and text
searches as well as
compound browsing
• Full data downloads
The Drug Database
(DrugBank v. 4.3)
http://www.drugbank.ca
• 1602 small molecule drugs
• >5000 experimental drugs
• Data mined from the literature and other eDBs
• >1000 drugs with metabolizing enzyme data
• >1200 drug metabolites
• >600 MS+NMR spectra
• >4200 unique drug targets
• 208 data fields/drug
• Supports sequence, spectral, structure and text searches as well as compound browsing
• Full data downloads
The Toxic Exposome
Database (T3DB)
http://www.t3db.ca
• Comprehensive data on toxic
compounds (drugs,
pesticides, herbicides,
endocrine disruptors,
solvents, carcinogens, etc.)
• Data mined from the
literature and other eDBs
• >3600 toxic compounds
• >1900 reference spectra
• ~2100 toxic targets
• Supports sequence, spectral,
structure, text searches as
well as compound browsing
• Full data downloads
Secondary Metabolism
Temazepam
Oxazepam
Diazepam
N-(2-Benzoyl-4-chlorophenyl)-2-acetamidoacetamide
Nordazepam
BioTransformer
BioTransformer - Flowchart
[Flowchart. Input: Query Molecule. Branches: Phase I; Other Reactions. Decision nodes (YES/NO): Enzyme metabolite? (Machine Learning); SOM Predictor (Machine Learning); Reaction-specific structural constraints. Intermediate: SOMs. Processing node: Metabolite Generator. Outcomes: Metabolites; No metabolites.]
All structures are generated as SMILES, SDF or MOL files
BioTransformer – SOM Prediction
• Preference Learning based on 100 atomic (e.g. atom type) and 10 molecular features (e.g. mass)
• SOM predictor was trained for 9 CYP450s
• Average Prediction accuracy of 84.54%
• Structures generated based on 92 Phase I reactions
BioTransformer Results
• 6,230 Phase I metabolites
• 9,510 Phase II metabolites
• 6,110 Microbial metabolites
• 12,340 Other metabolites
5,000 compounds
34,000 metabolites
~220,000
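As a quick consistency check, the four category counts can be summed and compared with the rounded total. The sketch assumes, which the slide does not state explicitly, that the four categories make up the 34,000 metabolites and that the 5,000 compounds are the inputs they were generated from. The ~220,000 figure is left out because its derivation is not given.

```python
# Counts as listed on the slide; the grouping is an assumption (see lead-in).
counts = {
    "Phase I": 6_230,
    "Phase II": 9_510,
    "Microbial": 6_110,
    "Other": 12_340,
}
N_PARENT_COMPOUNDS = 5_000

total = sum(counts.values())
per_parent = total / N_PARENT_COMPOUNDS

print(f"Total predicted metabolites: {total:,}")
print(f"Average per parent compound: {per_parent:.1f}")
```

The sum is 34,190, which matches the slide's rounded 34,000, or roughly seven predicted metabolites per input compound.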
Computational Challenges in
Metabolomics (Part 2)
Sebastian Böcker, Friedrich Schiller
University
Dagstuhl Seminar on Computational Mass
Spectrometry
Schloss Dagstuhl, Germany Aug. 23-28, 2015