PPT - Bioinformatics.ca

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module 5
Metabolomic Data Analysis Using
MetaboAnalyst
David Wishart
Learning Objectives
• To become familiar with the standard
metabolomics data analysis workflow
• To become aware of key elements
such as: data integrity checking,
outlier detection, quality control,
normalization, scaling, etc.
• To learn how to use MetaboAnalyst to
facilitate data analysis
A Typical Metabolomics
Experiment
2 Routes to Metabolomics
[Figure: two panels. Quantitative (Targeted) Methods: an NMR spectrum (ppm axis, 7 to 1) with peaks assigned to TMAO, hippurate, allantoin, creatinine, taurine, citrate, urea, 2-oxoglutarate, water, succinate and fumarate. Chemometric (Profiling) Methods: an NMR spectrum (ppm axis, 7 to 1) and a PCA scores plot (PC1 vs. PC2) showing the Control, ANIT and PAP groups.]
Metabolomics Data
Workflow
Chemometric Methods
• Data Integrity Check
• Spectral alignment or binning
• Data normalization
• Data QC/outlier removal
• Data reduction & analysis
• Compound ID
Targeted Methods
• Data Integrity Check
• Compound ID and quantification
• Data normalization
• Data QC/outlier removal
• Data reduction & analysis
Data Integrity/Quality
• LC-MS and GC-MS produce a
high number of false
positive peaks
• Problems with adducts
(LC), extra derivatization
products (GC), isotopes,
breakdown products
(ionization issues), etc.
• Not usually a problem
with NMR
• Check using replicates
and adduct calculators
MZedDB http://maltese.dbs.aber.ac.uk:8888/hrmet/index.html
HMDB http://www.hmdb.ca/search/spectra?type=ms_search
Data/Spectral Alignment
• Important for LC-MS
and GC-MS studies
• Not so important for
NMR (pH variation)
• Many programs
available (XCMS,
ChromA, MZmine)
• Most based on time
warping algorithms
http://mzmine.sourceforge.net/
http://bibiserv.techfak.uni-bielefeld.de/chroma
http://metlin.scripps.edu/xcms/
Binning (3000 pts to 14 bins)
xi,yi
x = 232.1 (AOC)
y = 10 (bin #)
bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8...
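The binning step can be sketched as follows. This is a minimal illustration, assuming equal-width bins and approximating each bin's area by the sum of its intensities; the function name `bin_spectrum` and the synthetic spectrum are hypothetical and not taken from MetaboAnalyst.

```python
def bin_spectrum(intensities, n_bins):
    """Split a spectrum into n_bins contiguous, near-equal-width bins and sum each bin."""
    n = len(intensities)
    base, extra = divmod(n, n_bins)  # the first `extra` bins get one extra point
    binned, start = [], 0
    for b in range(n_bins):
        width = base + (1 if b < extra else 0)
        binned.append(sum(intensities[start:start + width]))
        start += width
    return binned


# Synthetic 3000-point "spectrum" (made-up values, for illustration only)
spectrum = [(i % 50) / 10.0 for i in range(3000)]
bins = bin_spectrum(spectrum, 14)
print(len(bins))
```

Because 3000 is not divisible by 14, the first few bins are one point wider than the rest; the total area is preserved either way.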
Data Normalization/Scaling
• Can scale to sample or
scale to feature
• Scaling to whole sample
controls for dilution
• Normalize to integrated
area, probabilistic
quotient method,
internal standard,
sample specific (weight
or volume of sample)
• Choice depends on
sample & circumstances
Same or different?
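Two of the sample-wise options listed above, normalization to integrated area and the probabilistic quotient method (Dieterle et al. 2006), can be sketched as below. This is a simplified illustration, assuming the median profile across samples serves as the reference; the function names and the data are made up.

```python
from statistics import median


def normalize_to_sum(sample, target=1.0):
    """Scale one sample so that its values add up to `target` (integral normalization)."""
    total = sum(sample)
    return [target * v / total for v in sample]


def pqn(samples):
    """Probabilistic quotient normalization of a list of samples (rows)."""
    # Step 1: integral normalization of every sample
    scaled = [normalize_to_sum(s) for s in samples]
    # Step 2: reference profile = feature-wise median across samples
    reference = [median(col) for col in zip(*scaled)]
    result = []
    for s in scaled:
        # Step 3: quotient of each feature against the reference (skip zero references)
        quotients = [v / r for v, r in zip(s, reference) if r > 0]
        # Step 4: divide by the median quotient, the estimated dilution factor
        factor = median(quotients)
        result.append([v / factor for v in s])
    return result


# Made-up data: sample b is sample a diluted two-fold
a = [10.0, 20.0, 30.0, 40.0]
b = [5.0, 10.0, 15.0, 20.0]
c = [12.0, 18.0, 33.0, 37.0]
normalized = pqn([a, b, c])
```

After normalization the diluted sample `b` matches `a`, which is the dilution correction that scaling to the whole sample is meant to provide.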
Data Normalization/Scaling
• Can scale to sample or
scale to feature
• Scaling to feature(s)
helps manage outliers
• Several feature scaling
options available: log
transformation, autoscaling, Pareto scaling,
probabilistic quotient,
and range scaling
MetaboAnalyst http://www.metaboanalyst.ca
Dieterle F et al. Anal Chem. 2006 Jul 1;78(13):4281-90.
Data QC, Outlier Removal &
Data Reduction
• Data filtering (remove solvent peaks,
noise filtering, false positives, outlier
removal -- needs justification)
• Dimensional reduction or feature
selection to reduce number of
features or factors to consider (PCA
or PLS-DA)
• Clustering to find similarity
MetaboAnalyst
http://www.metaboanalyst.ca
A comprehensive web server designed to process &
analyze LC-MS, GC-MS or NMR-based metabolomic data
MetaboAnalyst History
• 2009 v1.0 - Supports both univariate and
multivariate data processing, including t-tests, ANOVA, PCA, PLS-DA, colorful
plots, with detailed explanations &
summaries
• 2012 v2.0 - Identifies significantly altered
functions & pathways
• 2015 v3.0 – Better performance, better
graphical interactivity, biomarker analysis,
power analysis, integration with gene
expression data …
MetaboAnalyst Overview
• Raw data processing
• Data reduction & statistical analysis
• Functional enrichment analysis
• Metabolic pathway analysis
• Power analysis and sample size estimation
• Biomarker analysis
• Integrative analysis
MetaboAnalyst Modules
Data preprocessing
Data normalization
Data analysis
Data interpretation
Example Datasets
Metabolomic Data Processing
Common Tasks
• Purpose: to convert various raw data
forms into data matrices suitable for
statistical analysis
• Supported data formats
– Concentration tables (Targeted Analysis)
– Peak lists (Untargeted)
– Spectral bins (Untargeted)
– Raw spectra (Untargeted)
Select a Module (Statistical
Analysis)
Data Upload
Alternatively …
Data Set Selected
• Here we have selected a data set
from dairy cattle fed different
proportions of cereal grains (0%,
15%, 30%, 45%)
• The rumen was analyzed by NMR
spectroscopy using quantitative
metabolomic techniques
• High grain diets are thought to be
stressful on cows
Data Integrity Check
Data Normalization
Samples = rows
Compounds = columns
Data Normalization
• At this point, the data has been
transformed to a matrix with the samples
in rows and the variables
(compounds/peaks/bins) in columns
• MetaboAnalyst offers three types of
normalization: row-wise normalization,
column-wise normalization and combined
normalization
• Row-wise normalization aims to make
each sample (row) comparable to each
other (e.g. urine samples with different
dilution effects)
Data Normalization
• Column-wise normalization aims to make
each variable (column) comparable in
scale to each other, thereby generating a
“normal” distribution
• This procedure is useful when variables
are of very different orders of magnitude
• Four methods have been implemented for
this purpose – log transformation,
autoscaling, Pareto scaling and range
scaling
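The four column-wise methods named above can be written out for a single variable (column) as below. This is a sketch using the usual textbook definitions; the natural logarithm is an assumption, since the base used by a given tool may differ, and the data are made up.

```python
import math
from statistics import mean, stdev


def log_transform(column):
    """Natural-log transform; assumes strictly positive values."""
    return [math.log(v) for v in column]


def autoscale(column):
    """Mean-center and divide by the standard deviation (unit variance)."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def pareto_scale(column):
    """Mean-center and divide by the square root of the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / math.sqrt(s) for v in column]


def range_scale(column):
    """Mean-center and divide by the range (max minus min)."""
    m = mean(column)
    span = max(column) - min(column)
    return [(v - m) / span for v in column]


column = [2.0, 4.0, 4.0, 6.0, 9.0]  # made-up concentrations of one compound
print(autoscale(column))
```

Autoscaling gives every variable equal weight, while Pareto scaling shrinks large variables less aggressively, which is one reason the best choice cannot be known in advance.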
Normalization Result
Data Normalization
• You cannot know a priori what the
best normalization protocol will be
• MetaboAnalyst allows you to
interactively explore different
normalization protocols and to
visually inspect the degree of
“normality” or Gaussian behavior
• This example is nicely normalized
Next Steps
• After normalization has been
completed it is a good idea to look at
your data a little further to identify
outliers or noise that could/should be
removed
Quality Control
• Dealing with outliers
– Detected mainly by visual inspection
– May be corrected by normalization
– May be excluded
• Noise reduction
– More of a concern for spectral bins/
peak lists
– Usually improves downstream results
Visual Inspection
• What does an outlier look like?
Finding outliers via PCA
Finding outliers via Heatmap
Outlier Removal (Data Editor)
Noise Reduction (Data Filtering)
Noise Reduction (cont.)
• Characteristics of noise & uninformative
features
– Low intensities
– Low variances (default)
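A low-variance filter of this kind can be sketched as follows. This is an illustration only: the `keep_fraction` cutoff, the function name and the data are hypothetical, and plain variance is used even though real tools often use scale-independent measures such as relative standard deviation.

```python
from statistics import pvariance


def filter_low_variance(matrix, feature_names, keep_fraction=0.75):
    """Keep the keep_fraction of features (columns) with the highest variance."""
    columns = list(zip(*matrix))
    variances = [pvariance(col) for col in columns]
    n_keep = max(1, round(keep_fraction * len(columns)))
    ranked = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original column order
    filtered = [[row[j] for j in kept] for row in matrix]
    return filtered, [feature_names[j] for j in kept]


# Made-up matrix: samples in rows, features in columns; "flat" barely varies
matrix = [
    [1.0, 5.0, 10.0, 3.0],
    [2.0, 5.1, 14.0, 7.0],
    [3.0, 5.0, 9.0, 1.0],
    [4.0, 4.9, 20.0, 5.0],
]
names = ["a", "flat", "c", "d"]
filtered, kept_names = filter_low_variance(matrix, names)
print(kept_names)
```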
Data Reduction and
Statistical Analysis
Common Tasks
• To identify important features
• To detect interesting patterns
• To assess differences between the
phenotypes
• To facilitate classification or
prediction
• We will look at ANOVA, Multivariate
Analysis (PCA, PLS-DA) and
Clustering
ANOVA
• Looking at 4 different dairy cow
populations
– 0% grain in diet
– 15% grain in diet
– 30% grain in diet
– 45% grain in diet
• Try to identify those metabolites that
are different between all groups or
just between 0% and everything else
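For a single metabolite, the one-way ANOVA behind this comparison reduces to a ratio of between-group to within-group variance. The sketch below computes the F statistic only; the concentrations are made up, and a p-value would additionally require the F distribution (for example via `scipy.stats.f_oneway`).

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic and its two degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up concentrations of one metabolite in the 0%, 15%, 30% and 45% grain groups
groups = [
    [1.0, 1.2, 0.9],
    [1.1, 1.3, 1.0],
    [1.8, 2.1, 1.9],
    [2.9, 3.1, 3.0],
]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

A significant F only says that at least one group differs; deciding whether all groups differ or only 0% differs from the rest needs post-hoc pairwise comparisons.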
ANOVA
Click this to view
the table
Click this spot
and the 3-PP
graph pops up
View Individual Compounds
Click this to see
the uracil graphs
What’s Next?
• Click and compare different
compounds to see which ones are
most different or most similar
between the 4 groups
• Click on the Correlation link (under
the ANOVA link) to generate a heat
map that displays the pairwise
compound correlations and
compound clusters
Overall Correlation Pattern
Click this to save
a high res. image
High Resolution Image
What’s Next?
• When looking at >2 groups it is often
useful to look for patterns or trends
within particular metabolites
• Use Pattern Hunter to find these
trends
Pattern Matching
• Looking for compounds showing interesting
patterns of change
• Essentially a method to look for linear trends or
periodic trends in the data
• Best for data that has 3 or more groups
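One way to search for linear trends is to correlate each compound with a numeric template that assigns one value per group, such as 1-2-3-4 for steadily increasing grain content. The sketch below assumes Pearson correlation; the function names, compound names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_by_pattern(table, group_of_sample, pattern):
    """Correlate each compound with a per-group pattern and sort by |r|."""
    template = [pattern[g] for g in group_of_sample]
    scores = {name: pearson(values, template) for name, values in table.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Two samples per diet group; the pattern 1-2-3-4 encodes a linear increase
groups = ["0%", "0%", "15%", "15%", "30%", "30%", "45%", "45%"]
pattern = {"0%": 1, "15%": 2, "30%": 3, "45%": 4}
table = {
    "compound_A": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2, 4.1, 3.9],  # rises with grain %
    "compound_B": [8.0, 7.8, 6.1, 5.9, 4.0, 4.2, 2.1, 1.9],  # falls with grain %
    "compound_C": [3.0, 5.0, 4.0, 3.0, 5.0, 4.0, 3.0, 5.0],  # no clear trend
}
ranked = rank_by_pattern(table, groups, pattern)
print(ranked)
```

Other templates (for example 1-2-1-2 for a periodic pattern) can be passed in the same way; a compound with constant values would need special handling because its correlation is undefined.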
Pattern Matching (cont.)
Strong positive linear correlation to grain %
Strong negative linear correlation to grain %
Multivariate Analysis
• Use PCA option to view the
separation (if any) in the 4 groups
• Look at the 2D PCA Score Plot
– 2 most significant principal components
• Look at the 2D PCA Loading Plot
• Look at the PCA Plot in 3D
– 3 most significant principal components
• Options for viewing are located in the
top tabs
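The scores and loadings shown in these plots come from an eigen-decomposition of the covariance matrix of the centered data. The sketch below is a dependency-free illustration that uses power iteration with deflation; it is not how MetaboAnalyst computes PCA, the data are made up, and in practice a linear algebra library would be used instead.

```python
import math


def center(matrix):
    """Subtract the column mean from every value."""
    means = [sum(c) / len(c) for c in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of a symmetric matrix by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue


def pca(matrix, n_components=2):
    """Return sample scores, loadings and the fraction of variance per component."""
    centered = center(matrix)
    cov = covariance(centered)
    total_var = sum(cov[i][i] for i in range(len(cov)))
    loadings, explained = [], []
    for _ in range(n_components):
        v, lam = top_component(cov)
        loadings.append(v)
        explained.append(lam / total_var)
        # Deflate: remove the component just found before searching for the next
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    scores = [[sum(x * l for x, l in zip(row, load)) for load in loadings]
              for row in centered]
    return scores, loadings, explained


# Made-up matrix: 5 samples (rows) by 3 compounds (columns)
data = [
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0],
    [4.0, 2.0, 0.0],
    [6.0, 1.0, 2.0],
    [8.0, 3.0, 4.0],
]
scores, loadings, explained = pca(data)
print(explained)
```

The scores are what the 2D and 3D score plots display (one point per sample); the loadings show how strongly each compound contributes to a component. The sign of a component is arbitrary.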
PCA Scores Plot
PCA Loading Plot
Compounds
most responsible
for separation
Click on a point
to view
3D Score Plot
Drag to rotate
Mouse over to see
sample names
Multivariate Analysis
• Use PLS-DA option to view the
separation of the 4 (labeled) groups
• PLS-DA “rotates” the PCA axes to
maximize separation
• Look at the 2D PLS Scores Plot
• Look at the Q2 and R2 Values (Cross
Validation)
• Use the VIP plot to ID important
metabolites (VIP > 1.2)
PLS-DA Score Plot
Evaluation of PLS-DA Model
• PLS-DA Model
evaluated by cross
validation of Q2 and
R2
• Using too many
components can
over-fit
• 3 component model
seems to be a good
compromise here
• Good R2/Q2 (>0.7)
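For reference, the two statistics are commonly defined as follows. The notation is an assumption (it does not appear on the slides): y_i is the observed response for sample i, \hat{y}_i is the value fitted by the model built on all samples, and \hat{y}_{(i)} is the value predicted for sample i when it was left out during cross validation.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
Q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_{(i)})^2}{\sum_i (y_i - \bar{y})^2}
```

R2 measures how well the model fits the data it was trained on, while Q2 measures how well it predicts held-out samples. R2 keeps rising as components are added, so a widening gap between R2 and Q2 is the usual sign of over-fitting.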
Important Compounds
Model Validation
Note, permutation is computationally intensive. It is not
performed by default. Users need to set the permutation number
and press the submit button
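The idea behind permutation testing can be sketched on a single variable: shuffle the class labels many times and count how often the shuffled data separate the groups at least as well as the real labels do. The example below is a simplified stand-in that uses a between-group/within-group sum-of-squares ratio rather than a PLS-DA model statistic; the function names and data are made up.

```python
import random


def separation(values, labels):
    """Between-group / within-group sum-of-squares ratio for one variable."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)
    return between / within


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Fraction of label shuffles that separate the groups as well as the real labels."""
    rng = random.Random(seed)
    observed = separation(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if separation(values, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Made-up data: group B is clearly higher than group A
values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
labels = ["A"] * 4 + ["B"] * 4
p = permutation_p_value(values, labels)
print(p)
```

The cost grows linearly with the number of permutations, and each permutation of a full multivariate model requires refitting that model, which is why the step is computationally intensive.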
Hierarchical Clustering
(Heat Maps)
• An alternative way of viewing or
clustering multivariate data
• Allows one to look at the behavior of
individual metabolites
• Can ask questions such as: which
compounds have a low concentration in
groups 0 and 15 but increase in groups 30
and 45? Or which compound is the only
one significantly increased in group 45?
Heatmap Visualization
Note that the Heatmap is not being clustered on Rows. It is ordered by the
class labels
Heatmap Visualization (cont.)
What’s Next?
• Most of the multivariate analysis is
now done
• MetaboAnalyst has been keeping
track of the plots or graphs you have
generated
• Now it's time to generate a printed
report that summarizes what you’ve
done and what you’ve found
Download Results
Analysis Report
Select a Module
(Enrichment Analysis)
Metabolite Set Enrichment
Analysis (MSEA)
http://www.msea.ca (now part of
MetaboAnalyst)
• Designed to handle lists of
metabolites (with or without
concentration data)
• Modeled after Gene Set
Enrichment Analysis (GSEA)
• Supports over
representation analysis
(ORA), single sample
profiling (SSP) and
quantitative enrichment
analysis (QEA)
• Contains a library of 6300
pre-defined metabolite sets
including 85 pathway sets &
850 disease sets
Enrichment Analysis
• Purpose: To test if there are biologically
meaningful groups of metabolites that are
significantly enriched in your data
• Biologically meaningful in terms of:
– Pathways
– Disease
– Localization
• Currently, MSEA only supports human
metabolomic data
MSEA
• Accepts 3 kinds of input files
– list of metabolite names only (ORA –
over representation analysis)
– list of metabolite names + concentration
data from a single sample (SSP – single
sample profiling)
– a concentration table with a list of
metabolite names + concentrations for
multiple samples/patients (QEA –
quantitative enrichment analysis)
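For the first input type, over representation is commonly evaluated with a hypergeometric test: given a reference universe of metabolites, how likely is it to see at least the observed overlap between the uploaded list and a metabolite set by chance? The sketch below assumes that test; the function name and all counts are made up.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_query, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_universe, n_in_set, n_query)."""
    upper = min(n_in_set, n_query)
    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 1000 metabolites in the reference universe, a pathway set of 20,
# a query list of 30 compounds, 5 of which fall in the pathway
p_value = ora_p_value(n_universe=1000, n_in_set=20, n_query=30, n_overlap=5)
print(p_value)
```

When many metabolite sets are tested, the resulting p-values need a multiple-testing correction before they are interpreted.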
The MSEA Approach
[Figure: the three MSEA routes]
ORA (Over Representation Analysis): Compound concentrations → Compound selection (t-tests, clustering) → Important compound lists (ORA input for MSEA)
SSP (Single Sample Profiling): Compound concentrations → Compare to normal references → Abnormal compounds
QEA (Quantitative Enrichment Analysis): Compound concentrations → Assess metabolite sets directly
All routes: Find enriched biological themes (using metabolite set libraries) → Biological interpretation
Data Set Selected
• Here we are using a collection of
metabolites identified by NMR (compound
list + concentrations) from the urine of
77 lung and colon cancer patients, some
of whom were suffering from cachexia
(muscle wasting)
Start with a Compound List
for ORA
Upload Compound List
Normally GSEA
would require
a list of all known
genes for the given
platform. Here we
just use the list of
metabolites found
in KEGG. ORA is
a “weak” analysis in
MSEA
Perform Compound Name
Standardization
Name Standardization (cont.)
Select a Metabolite Set
Library
Result
Result (cont.)
Click on details
to see more
The Matched Metabolite Set
Click on SMPDB
to see more
information
Phenylalanine and Tyrosine
Metabolism in SMPDB
Single Sample Profiling (SSP)
(Basically used by a physician to analyze a
patient)
Concentration Comparison
Concentration Comparison
(cont.)
Quantitative Enrichment
Analysis (QEA)
Result
Click on details
to see more
The Matched Metabolite Set
Select a Module (Pathway
Analysis)
Pathway Analysis
• Purpose: to extend and enhance
metabolite set enrichment analysis
for pathways by
– Considering pathway structures
– Supporting pathway visualization
• Currently supports analysis for 21
diverse (model) organisms such as
human, mouse, Drosophila,
Arabidopsis, E. coli, yeast, etc.
(KEGG pathways only)
Data Set Selected
• Here we are using a collection of
metabolites identified by NMR
(compound list + concentrations)
from the urine of 77 lung and
colon cancer patients, some of whom
were suffering from cachexia
(muscle wasting)
Pathway Analysis Module
Data Upload
Perform Data Normalization
Select Pathway Libraries
Perform Network Topology
Analysis
Pathway Position Matters
• Which positions are important?
– Hubs: nodes that are highly connected (red ones)
– Bottlenecks: nodes on many shortest paths between other nodes (blue ones)
• Graph theory
– Degree centrality
– Betweenness centrality
Junker et al. BMC Bioinformatics 2006
Which Node is More
Important?
High degree centrality
High betweenness centrality
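Both measures can be computed directly from a pathway graph. The sketch below assumes an unweighted, undirected graph stored as an adjacency list and uses Brandes' algorithm for betweenness; the toy network and node names are made up, and scores are left unnormalized.

```python
from collections import deque


def degree_centrality(graph):
    """Number of direct neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths (sigma) from the source
        order, preds = [], {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies, starting from the farthest nodes
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected path was counted once from each end
    return {v: s / 2 for v, s in score.items()}


# Made-up network: a hub with three leaves, linked through "bridge" to a short chain
graph = {
    "hub": ["m1", "m2", "m3", "bridge"],
    "m1": ["hub"],
    "m2": ["hub"],
    "m3": ["hub"],
    "bridge": ["hub", "m4"],
    "m4": ["bridge", "m5"],
    "m5": ["m4"],
}
degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
print(degree)
print(betweenness)
```

In this toy network "bridge" has only two neighbors, yet every path between the hub side and the chain passes through it, so its betweenness is high relative to its degree.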
Pathway Visualization
Pathway Visualization
(cont.)
Pathway Impact
• Incorporates parameters such as the
log fold-change of the DE
metabolites, the statistical
significance of the set of pathway
genes and the topology of the
signaling pathway
• Combines the pathway topology with
the over-representation evidence
Result
Select a Module (Biomarker
Analysis)
Biomarker Analysis
• Purpose is to find biomarkers using ROC
(receiver operating characteristic) curves
with high sensitivity and specificity
• Maximize the area under the ROC curve (AUC) while
minimizing the number of metabolites
used in the biomarker panel
• 3 different modules (univariate – single
marker at a time, multivariate – many
combinations of biomarkers, manual –
user choice)
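For a single marker, the ROC curve and its AUC can be computed by sweeping a threshold over the marker values. The sketch below is a minimal illustration, assuming higher scores indicate the positive class; the scores and labels are made up.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up marker values; label 1 = case, 0 = control
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(auc(curve))
```

The true positive rate is the sensitivity and one minus the false positive rate is the specificity, so each point on the curve is one sensitivity/specificity trade-off.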
Select Test Data Set 1
Click Here
Click Here
Data Set Selected
• 90 patients (expectant mothers) at 3
months of pregnancy
• Serum samples
• 45 patients went on to develop pre-eclampsia at 6-7 months
• 45 patients had normal pregnancies
• Trying to find biomarkers for
predicting early pre-eclampsia
Perform Data Integrity Check
Click Here
Perform Log Normalization
Click Here
Check That It’s Normally
Distributed
before
Click Here
after
Select Multivariate Option
Click Here
View ROC Curve
Choose a Model (95% conf.)
Select model
95% Confidence Interval
Select Sig. Features Tab
Click Here
View VIP Plot
Select a Module (Power
Analysis)
Statistical Power
• Statistical power is the ability of a test
to detect an effect, if the effect actually
exists
– A power of 0.8 in a clinical trial means that
the study has an 80% chance of ending up
with a statistically significant treatment
effect if there really is an important
difference between treatments.
• To answer research questions:
– How powerful is my study?
– How many samples do I need to have for
what I want to get from the study?
Statistical Power (cont.)
• The statistical power of a test
depends on:
1. Sample size
2. Significance criterion (alpha)
3. Effect size
Increase power
• Larger effect size
• Larger sample size
Decrease power
• Stricter significance criterion (smaller alpha)
The Approach
• How do we get these values?
– Effect size can be estimated from pilot data
– Significance criteria
• Single metabolite - p value cutoff (e.g. 0.05, 0.01)
• Metabolomics data – FDR (e.g. 0.1)
– Sample size is our interest
– Power value is our interest
• You need to upload pilot data and set the
criteria; MetaboAnalyst will then compute a power
vs. sample size curve by computing power
values for a range of sample sizes [3, 1000]
Power vs. Sample size
At least 60 samples/group will be needed to get a power of 0.8
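How power depends on sample size, alpha and effect size can be illustrated for a single variable compared between two groups. The sketch below uses a normal approximation to a two-sided, two-sample test; it is not MetaboAnalyst's FDR-based procedure, and the effect size of 0.5 is a hypothetical value rather than one estimated from the pilot data shown in the slides.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def smallest_n(effect_size, target_power=0.8, alpha=0.05, n_range=range(3, 1001)):
    """First sample size per group in n_range that reaches the target power."""
    for n in n_range:
        if power_two_groups(effect_size, n, alpha) >= target_power:
            return n
    return None


# Power vs. sample size curve for a hypothetical standardized effect size of 0.5
curve = [(n, power_two_groups(0.5, n)) for n in (3, 10, 30, 100, 300, 1000)]
for n, power in curve:
    print(n, round(power, 3))
print(smallest_n(0.5))
```

The same function shows the three dependencies directly: power rises with sample size and effect size, and falls when alpha is made stricter.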
Not Everything Was Covered
• Clustering (K-means, SOM)
• Classification (SVM, randomForests)
• Time-series data analysis
• Two factor data analysis
• Integrative pathway analysis (gene and metabolite)
Time Series Analysis in
MetaboAnalyst
Integrative Pathway Analysis