Transcript Document

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module 6
Metabolomic Data Analysis Using
MetaboAnalyst
David Wishart
A Typical Metabolomics
Experiment
2 Routes to Metabolomics
Quantitative (Targeted) Methods
Chemometric (Profiling) Methods
[Figure: NMR spectra (x-axis: ppm) with labeled peaks (TMAO, hippurate, allantoin, creatinine, taurine, citrate, urea, 2-oxoglutarate, succinate, fumarate, water) and a PCA scores plot (PC1 vs. PC2) with groups labeled Control, ANIT and PAP]
Metabolomics Data
Workflow
Chemometric Methods
Targeted Methods
• Data Integrity Check
• Spectral alignment or
binning
• Data normalization
• Data QC/outlier
removal
• Data reduction &
analysis
• Compound ID
• Data Integrity Check
• Compound ID and
quantification
• Data normalization
• Data QC/outlier
removal
• Data reduction &
analysis
Data Integrity/Quality
• LC-MS and GC-MS have
high number of false
positive peaks
• Problems with adducts
(LC), extra derivatization
products (GC), isotopes,
breakdown products
(ionization issues), etc.
• Not usually a problem
with NMR
• Check using replicates
and adduct calculators
MZedDB http://maltese.dbs.aber.ac.uk:8888/hrmet/index.html
HMDB http://www.hmdb.ca/search/spectra?type=ms_search
Data/Spectral Alignment
• Important for LC-MS
and GC-MS studies
• Not so important for
NMR (pH variation)
• Many programs
available (XCMS,
ChromA, MZmine)
• Most based on time
warping algorithms
http://mzmine.sourceforge.net/
http://bibiserv.techfak.uni-bielefeld.de/chroma
http://metlin.scripps.edu/download/
Binning (3000 pts to 14 bins)
xi,yi
x = 232.1 (AOC)
y = 10 (bin #)
bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8...
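A minimal sketch of the binning idea on this slide, assuming equal-width bins and simple summation of the intensities in each bin. The function name `bin_spectrum` and the toy spectrum are hypothetical and are not taken from MetaboAnalyst.

```python
def bin_spectrum(intensities, n_bins):
    """Sum consecutive spectral points into n_bins (roughly) equal-width bins."""
    n = len(intensities)
    if n_bins <= 0 or n_bins > n:
        raise ValueError("n_bins must be between 1 and the number of points")
    bins = []
    for b in range(n_bins):
        start = b * n // n_bins          # first point of this bin
        stop = (b + 1) * n // n_bins     # one past the last point of this bin
        bins.append(sum(intensities[start:stop]))
    return bins


# Toy spectrum (assumption): 3000 points, flat baseline with one narrow peak
spectrum = [1.0] * 3000
for i in range(2000, 2010):
    spectrum[i] = 50.0

binned = bin_spectrum(spectrum, 14)
peak_bin = max(range(len(binned)), key=lambda b: binned[b])
print(len(binned), "bins; peak falls in bin", peak_bin + 1)
```

Summing (rather than averaging) keeps the total area of the spectrum unchanged, which is one common choice for binned NMR data.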
Data Normalization/Scaling
• Can scale to sample or
scale to feature
• Scaling to whole sample
controls for dilution
• Normalize to integrated
area, probabilistic
quotient method,
internal standard,
sample specific (weight
or volume of sample)
• Choice depends on
sample & circumstances
Same or different?
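A minimal sketch of two of the sample-wise options listed above: normalization to integrated area and the probabilistic quotient method (Dieterle et al. 2006). It assumes the median spectrum of all samples is used as the reference; the function names and toy numbers are hypothetical.

```python
from statistics import median


def sum_normalize(sample, total=100.0):
    """Scale one sample (row) so that its intensities add up to `total`."""
    s = sum(sample)
    return [total * v / s for v in sample]


def pqn_normalize(samples):
    """Probabilistic quotient normalization of a list of samples (rows)."""
    # Step 1: normalize each sample to a constant integrated area
    normed = [sum_normalize(s) for s in samples]
    n_features = len(normed[0])
    # Step 2: reference spectrum = median of each feature across samples (assumption)
    reference = [median(s[j] for s in normed) for j in range(n_features)]
    result = []
    for s in normed:
        # Step 3: quotient of each feature against the reference
        quotients = [s[j] / reference[j] for j in range(n_features) if reference[j] > 0]
        # Step 4: the median quotient is the most probable dilution factor
        factor = median(quotients)
        result.append([v / factor for v in s])
    return result


samples = [
    [10.0, 20.0, 30.0, 40.0],
    [5.0, 10.0, 15.0, 20.0],    # same composition as the first sample, diluted 2x
    [12.0, 18.0, 33.0, 37.0],
]
normalized = pqn_normalize(samples)
print(normalized[0])
print(normalized[1])
```

In this toy case the first two samples differ only by dilution, so they become identical after normalization.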
Data Normalization/Scaling
• Can scale to sample or
scale to feature
• Scaling to feature(s)
helps manage outliers
• Several feature scaling
options available: log
transformation, autoscaling, Pareto scaling,
probabilistic quotient,
and range scaling
MetaboAnalyst http://www.metaboanalyst.ca
Dieterle F et al. Anal Chem. 2006 Jul 1;78(13):4281-90.
Data QC, Outlier Removal &
Data Reduction
• Data filtering (remove solvent peaks,
noise filtering, false positives, outlier
removal -- needs justification)
• Dimensional reduction or feature
selection to reduce number of
features or factors to consider (PCA
or PLS-DA)
• Clustering to find similarity
MetaboAnalyst
http://www.metaboanalyst.ca
• Web server designed to
handle large sets of LC-MS,
GC-MS or NMR-based
metabolomic data
• Supports both univariate
and multivariate data
processing, including t-tests, ANOVA, PCA, PLS-DA
• Identifies significantly
altered metabolites,
produces colorful plots,
provides detailed
explanations & summaries
• Links sig. metabolites to
pathways via SMPDB
MetaboAnalyst Workflow
Stages: Data preprocessing, Data normalization, Data analysis, Data annotation
Data input
• GC/LC-MS raw spectra
• Peak lists
• Spectral bins
• Concentration table
Data processing
• Spectra processing
• Peak processing
• Noise filtering
• Missing value estimation
Data integrity check
Data normalization
• Row-wise normalization
• Column-wise normalization
• Combined approach
Statistical Exploration
Two/multi-group analysis
• Univariate analysis
• Correlation analysis
• Chemometric analysis
• Feature selection
• Cluster analysis
• Classification
Time-series analysis
• Data overview
• Two-way ANOVA
• ASCA (ANOVA-simultaneous component analysis)
• Time-course analysis
Functional Interpretation
Enrichment analysis
• Over representation analysis
• Single sample profiling
• Quantitative enrichment analysis
Pathway analysis
• Enrichment analysis
• Topology analysis
• Interactive visualization
Quality checking
• Methods comparison
• Temporal drift
• Batch effect
• Biological checking
Other utilities
• Peak searching
• Pathway mapping
• Name/ID conversion
• Lipidomics
Outputs
• Processed data
• Result tables
• Analysis report
• Images
Image Center
• Resolution: 150/300/600 dpi
• Format: png, tiff, pdf, svg, ps
MetaboAnalyst Overview
• Raw data processing
– Using MetaboAnalyst
• Data Reduction & Statistical analysis
– Using MetaboAnalyst
• Functional enrichment analysis
– Using MSEA in MetaboAnalyst
• Metabolic pathway analysis
– Using MetPA in MetaboAnalyst
Example Datasets
Metabolomic Data Processing
Common Tasks
• Purpose: to convert various raw data
forms into data matrices suitable for
statistical analysis
• Supported data formats
– Concentration tables (Targeted Analysis)
– Peak lists (Untargeted)
– Spectral bins (Untargeted)
– Raw spectra (Untargeted)
Data Upload
Alternatively …
Data Set Selected
• Here we will be selecting a data set
from dairy cattle fed different
proportions of cereal grains (0%,
15%, 30%, 45%)
• The rumen was analyzed by NMR
spectroscopy using quantitative
metabolomic techniques
• High grain diets are thought to be
stressful on cows
Data Integrity Check
Data Normalization
• At this point, the data have been
transformed into a matrix with the samples
in rows and the variables
(compounds/peaks/bins) in columns
• MetaboAnalyst offers three types of
normalization: row-wise normalization,
column-wise normalization and combined
normalization
• Row-wise normalization aims to make
the samples (rows) comparable to each
other (e.g. urine samples with different
dilution effects)
Data Normalization
• Column-wise normalization aims to make
the variables (columns) comparable to
each other
• This procedure is useful when variables
are of very different orders of magnitude
• Four methods have been implemented for
this purpose – log transformation,
autoscaling, Pareto scaling and range
scaling
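A minimal sketch of the four column-wise methods named above, applied to a single variable (column). The formulas are the usual textbook definitions; a plain log10 is assumed for the log transformation, and the helper names are hypothetical rather than MetaboAnalyst functions.

```python
import math
from statistics import mean, stdev


def log_transform(values):
    """Log transformation (plain log10 assumed; values must be positive)."""
    return [math.log10(v) for v in values]


def scale_column(values, method):
    """Mean-center one variable and divide by a method-specific factor."""
    m = mean(values)
    if method == "auto":        # autoscaling: divide by the standard deviation
        d = stdev(values)
    elif method == "pareto":    # Pareto scaling: divide by sqrt(standard deviation)
        d = math.sqrt(stdev(values))
    elif method == "range":     # range scaling: divide by (max - min)
        d = max(values) - min(values)
    else:
        raise ValueError("unknown method: " + method)
    return [(v - m) / d for v in values]


values = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy concentrations of one compound
auto = scale_column(values, "auto")
pareto = scale_column(values, "pareto")
ranged = scale_column(values, "range")
print(auto)
print(pareto)
print(ranged)
```

Autoscaling gives every variable unit variance, while Pareto scaling shrinks large variables less aggressively, so it sits between no scaling and autoscaling.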
Normalization Result
Quality Control
• Dealing with outliers
– Detected mainly by visual inspection
– May be corrected by normalization
– May be excluded
• Noise reduction
– More of a concern for spectral bins/
peak lists
– Usually improves downstream results
Visual Inspection
• What does an outlier look like?
Finding outliers via PCA
Finding outliers via Heatmap
Outlier Removal
Noise Reduction
Noise Reduction (cont.)
• Characteristics of noise &
uninformative features
– Low intensities
– Low variances (default)
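A minimal sketch of variance-based filtering, assuming that the features with the lowest variance are treated as uninformative and dropped. The drop fraction, function name and toy table are hypothetical; MetaboAnalyst's own filter options may differ.

```python
from statistics import pvariance


def filter_low_variance(table, names, drop_fraction=0.25):
    """Drop the lowest-variance features.

    table: list of sample rows; names: one name per column (feature).
    """
    n_features = len(names)
    variances = [pvariance([row[j] for row in table]) for j in range(n_features)]
    n_drop = int(n_features * drop_fraction)
    # Feature indices ordered from lowest to highest variance
    order = sorted(range(n_features), key=lambda j: variances[j])
    dropped = set(order[:n_drop])
    keep = [j for j in range(n_features) if j not in dropped]
    kept_names = [names[j] for j in keep]
    kept_table = [[row[j] for j in keep] for row in table]
    return kept_names, kept_table


names = ["bin_1", "bin_2", "bin_3", "bin_4"]
table = [
    [1.0, 5.0, 0.10, 9.0],
    [1.0, 7.0, 0.30, 2.0],
    [1.0, 6.0, 0.20, 5.0],
]
kept_names, kept_table = filter_low_variance(table, names)
print(kept_names)
```

Raw variance depends on the scale of each feature, so in practice the filter is usually applied with that in mind (for example, after a suitable transformation).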
Data Reduction and
Statistical Analysis
Common tasks
• To identify important features;
• To detect interesting patterns;
• To assess differences between
phenotypes
• To facilitate classification /
prediction
ANOVA
View Individual Compounds
Questions
• Q: Which compounds show
significant differences between all
neighboring groups (0-15, 15-30, and
30-45)?
• Q: For Uracil, are groups 15, 30, 45
significantly different from each
other?
Overall correlation pattern
High resolution image
Specify format
Specify resolution
Specify size
Question
• Q: In untargeted metabolomics using
NMR, researchers often look for
region(s) of the spectra showing the
biggest change in their correlation
patterns under different conditions.
Can you do that in MetaboAnalyst?
• Hint: check the available parameters
of Correlation analysis
Template Matching
• Looking for compounds showing interesting
patterns of change
• Essentially a method to look for linear trends or
periodic trends in the data
• Best for data that has 3 or more groups
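A minimal sketch of template matching, assuming the template is a simple increasing pattern (1-2-3-4 across the four grain groups) and that each compound is scored by its Pearson correlation with that template. The compound names and values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Grain % of each sample and the template value assigned to each group
groups = [0, 0, 15, 15, 30, 30, 45, 45]
template_by_group = {0: 1, 15: 2, 30: 3, 45: 4}   # linear increase (assumption)
template = [template_by_group[g] for g in groups]

# Hypothetical concentrations, one value per sample
compounds = {
    "compound_A": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2, 4.1, 3.9],   # rises with grain %
    "compound_B": [4.0, 4.2, 3.1, 2.9, 2.0, 2.2, 1.1, 0.9],   # falls with grain %
    "compound_C": [2.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 3.0],   # no trend
}

scores = {name: pearson(values, template) for name, values in compounds.items()}
for name, r in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, round(r, 3))
```

Other patterns can be tested by changing the template, for example 3-2-1-4 for compounds that fall over the first three groups and rise in the last.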
Template Matching (cont.)
Strong linear
+ correlation
to grain %
Strong linear
- correlation
to grain %
Question
• Q: Identify compounds that decrease
in the first three groups but increase
in the last group?
PCA Scores Plot
PCA Loading Plot
Compounds
most responsible
for separation
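A minimal sketch of how scores and loadings relate, assuming only the first principal component is needed and using power iteration on the covariance matrix. This is an illustration on a toy table, not the routine MetaboAnalyst uses; all names and numbers are hypothetical.

```python
import math


def first_principal_component(table, n_iter=200):
    """Return (loadings, scores) of PC1 for a samples-by-variables table."""
    n, p = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in table]
    # Sample covariance matrix of the variables
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalize
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores: projection of each centered sample onto the loading vector
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Toy data: two groups of three samples; only the first variable differs
table = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 2.1],
    [0.9, 4.9, 1.9],
    [9.0, 5.0, 2.0],
    [9.3, 5.2, 2.1],
    [8.8, 4.8, 2.0],
]
loadings, scores = first_principal_component(table)
top_variable = max(range(len(loadings)), key=lambda j: abs(loadings[j]))
print("loadings:", [round(x, 3) for x in loadings])
print("scores:", [round(s, 2) for s in scores])
```

The variable with the largest absolute loading is the one that contributes most to the separation seen along PC1 in the scores.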
3D-PCA
Question
Q: Identify compounds that contribute
most to the separation between group
15 and 45
PLS-DA Score Plot
Evaluation of PLS-DA Model
• PLS-DA model evaluated by
cross-validation (R2 and Q2)
• Adding more components
improves the quality of fit,
but try to keep the number
of components small
• A 3-component model
seems to be a good
compromise here
• Good R2/Q2 (>0.7)
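A minimal sketch of what R2 and Q2 measure. To stay self-contained it uses a simple least-squares line with leave-one-out cross-validation instead of PLS-DA, so it illustrates the two statistics only, not the model on the slide. The data and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def r2_and_q2(x, y):
    """R2 from the full fit; Q2 from leave-one-out predictions."""
    n = len(y)
    my = sum(y) / n
    tss = sum((b - my) ** 2 for b in y)               # total sum of squares
    slope, intercept = fit_line(x, y)
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    press = 0.0                                       # predictive residual sum of squares
    for i in range(n):
        s, c = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (s * x[i] + c)) ** 2
    return 1 - rss / tss, 1 - press / tss


grain = [0.0, 0.0, 15.0, 15.0, 30.0, 30.0, 45.0, 45.0]
response = [1.0, 1.3, 2.1, 1.8, 3.2, 2.9, 4.1, 3.8]   # made-up values
r2, q2 = r2_and_q2(grain, response)
print(round(r2, 3), round(q2, 3))
```

R2 describes how well the model fits the data it was trained on, while Q2 describes how well it predicts held-out samples, so Q2 is never larger than R2 in this setting.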
Important Compounds
Model Validation
Questions
• Q: What does p < 0.01 mean?
• Q: How many permutations need to be
performed if you want to claim p value <
0.0001?
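A minimal sketch of a permutation test, assuming the test statistic is the absolute difference in group means and that the empirical p-value is reported as (hits + 1) / (permutations + 1). Both choices, and the function name, are assumptions for illustration; the point is that the number of permutations limits how small a p-value can be reported.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm, seed=0):
    """Empirical p-value for the difference in means under label shuffling."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                     # randomly reassign group labels
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


group_a = [1, 2, 3, 4, 5]
group_b = [10, 11, 12, 13, 14]
n_perm = 999
p = permutation_p_value(group_a, group_b, n_perm)
smallest_possible = 1 / (n_perm + 1)
print("p =", p, "smallest reportable p =", smallest_possible)
```

With this convention the smallest p-value that can be reported is 1 / (n_perm + 1), which is the quantity to think about when choosing the number of permutations.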
Heatmap Visualization
Note that the Heatmap is not being clustered on Rows (i.e. the % grain in
diet)
Heatmap Visualization (cont.)
Question
Q: Identify compounds with a low
concentration in groups 0 and 15 that
increase in groups 30 and 45
Q: Which compound is the only one
significantly increased in group 45?
Download Results
Analysis Report
Metabolite Set Enrichment
Analysis
Metabolite Set Enrichment
Analysis (MSEA)
http://www.msea.ca
• Web tool designed to handle
lists of metabolites (with or
without concentration data)
• Modeled after Gene Set
Enrichment Analysis (GSEA)
• Supports over
representation analysis
(ORA), single sample
profiling (SSP) and
quantitative enrichment
analysis (QEA)
• Contains a library of 6300
pre-defined metabolite sets
including 85 pathway sets &
850 disease sets
Enrichment Analysis
• Purpose: To test if there are some
biologically meaningful groups of
metabolites that are significantly enriched
in your data
• Biologically meaningful
– Pathways
– Disease
– Localization
• Currently, only supports human
metabolomic data
MSEA
• Accepts 3 kinds of input files
• 1) list of metabolite names only (ORA)
• 2) list of metabolite names +
concentration data from a single
sample (SSP)
• 3) a concentration table with a list of
metabolite names + concentrations
for multiple samples/patients (QEA)
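A minimal sketch of over representation analysis for input type 1 (a list of metabolite names), assuming a one-sided hypergeometric test against a reference list of metabolites. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_query, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe: metabolites in the reference list
    n_in_set:   metabolites belonging to the metabolite set (e.g. a pathway)
    n_query:    metabolites in the submitted compound list
    n_overlap:  submitted metabolites that belong to the set
    """
    upper = min(n_in_set, n_query)
    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Example (made-up numbers): 5 of 10 submitted compounds hit a 10-member set
# drawn from a reference list of 100 metabolites
p = ora_p_value(n_universe=100, n_in_set=10, n_query=10, n_overlap=5)
print(p)
```

Only about one hit would be expected by chance in this example, so an overlap of five gives a small p-value. When many metabolite sets are tested, the p-values are normally adjusted for multiple testing.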
The MSEA approach
Over Representation Analysis: Compound concentrations -> Compound selection (t-tests, clustering) -> Important compound lists (ORA input for MSEA)
Single Sample Profiling: Compound concentrations -> Compare to normal references -> Abnormal compounds
Quantitative Enrichment Analysis: Compound concentrations -> Assess metabolite sets directly
All three routes: Metabolite set libraries -> Find enriched biological themes -> Biological interpretation
Data Set Selected
• Here we are using a collection of
metabolites identified by NMR
(compound list + concentrations)
from the urine of 77 lung and
colon cancer patients, some of whom
were suffering from cachexia
(muscle wasting)
Start with a Compound List
Upload Compound List
Normally GSEA
would require
a list of all known
genes for the given
platform. Here we
just use the list of
metabolites found
in KEGG. ORA is
a “weak” analysis in
MSEA
Compound Name
Standardization
Name Standardization (cont.)
Select a Metabolite Set
Library
Result
Result (cont.)
The Matched Metabolite Set
Single Sample Profiling
(Basically used by a physician to
analyze a patient)
Single Sample Profiling
(cont.)
Concentration Comparison
Concentration Comparison
(cont.)
Quantitative Enrichment
Analysis
Result
The Matched Metabolite Set
Question
• Q: Are these metabolites increased
or decreased in the cachexia group?
Metabolic Pathway Analysis
with MetPA
Pathway Analysis
• Purpose: to extend and enhance
metabolite set enrichment analysis
for pathways by
– Considering the pathway structures
– Supporting pathway visualization
• Currently supports 15 organisms
Data Upload
Data Set Selected
• Here we are using a collection of
metabolites identified by NMR
(compound list + concentrations)
from the urine of 77 lung and
colon cancer patients, some of whom
were suffering from cachexia
(muscle wasting)
Normalization
Pathway Libraries
Network Topology Analysis
Position Matters

Which positions are important?
• Hubs
– Nodes that are highly connected (red ones)
• Bottlenecks
– Nodes on many shortest paths between other nodes (blue ones)
• Graph theory
– Degree centrality
– Betweenness centrality
Junker et al. BMC Bioinformatics 2006
Which Node is More
Important?
High
degree
centrality
High
betweenness
centrality
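A minimal sketch of the two centrality measures on a toy network, assuming an unweighted, undirected graph. Betweenness is computed with Brandes' algorithm and left unnormalized. The node names and the graph are hypothetical and do not represent a real pathway.

```python
from collections import deque


def degree_centrality(graph):
    """Number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Unnormalized betweenness (Brandes' algorithm, undirected, unweighted)."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}      # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}       # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                        # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                        # accumulate dependencies back to s
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}   # each pair was counted twice


# Two tightly connected clusters joined through a single "bridge" node
edges = [
    ("a1", "a2"), ("a1", "a3"), ("a1", "a4"), ("a2", "a3"), ("a2", "a4"), ("a3", "a4"),
    ("b1", "b2"), ("b1", "b3"), ("b1", "b4"), ("b2", "b3"), ("b2", "b4"), ("b3", "b4"),
    ("a1", "bridge"), ("bridge", "b1"),
]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
print("a1:", degree["a1"], betweenness["a1"])
print("bridge:", degree["bridge"], betweenness["bridge"])
```

In this toy graph the bridge node has the lowest degree but the highest betweenness, which is the contrast the slide draws between hubs and bottlenecks.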
Pathway Visualization
Pathway Visualization
(cont.)
Question
• Q: Which pathway do you think is
likely to be affected the most? Why?
Result
Not Everything Was
Covered
• Clustering (K-means, SOM)
• Classification (SVM, randomForests)
• Time-series data analysis
• Two factor data analysis
• Data quality checks
• Peak searching
• ….
Time Series Analysis in
MetaboAnalyst
Quality Checking Module