How to look for the needle in a haystack
Jianguo (Jeff) Xia
Wishart Research Group
University of Alberta, Canada
METABOLOMICS 2010
Metabolomics at the University of Alberta
DI/GC/LC-MS, HPLC, NMR
Software Tools
Compound Databases
Outline
Introduction
Omics data overview
Lessons from other omics research
Project goals
Web-based metabolomics tools
  I. General data processing & statistical analysis
     MetaboAnalyst (http://www.metaboanalyst.ca)
  II. Identify functionally interesting patterns
     MSEA (http://www.msea.ca)
  III. Metabolic Pathway Analysis
     MetPA (http://metpa.metabolomics.ca)
Public databases
Summary
The ‘-omics’ data overview
Genomics: DNA sequence; 100,000 - 1,000,000
Transcriptomics: Gene expression; 10,000 - 100,000
Proteomics: Protein expression/interaction; 1,000 - 10,000
Metabolomics: Compound concentration; 100 - 1,000
Common questions
1. Are there some interesting patterns present in my data?
2. What are the most important features associated with different phenotypes?
3. Is there a real difference between the groups?
4. Can I use this data to predict a phenotype?
5. How to interpret these features / patterns?
6. How does my result compare with published data?
Common approaches
1st: Classical statistics; t-tests, ANOVA; since 1950s
2nd: High-dimensional feature selection (SAM, Limma); machine learning (SVM, neural networks); since 1990s
3rd: Group-based enrichment analysis; GSEA, GSA, Globaltest; since 2003
4th: Pathway analysis; SPIA, TopoGSA; since 2007
Project Goals
Provide well-established methods proven highly
successful in other ‘omics’ studies;
Do not re-invent the wheel!
Support traditional approaches
Cheminformatics approaches
Data processing & normalization procedures
Easy-to-use
Not command-line
Target users – bench biologists
Identify influential algorithms
Identify the best practices
Nature Genetics - 38, 500 - 501 (2006)
Metabolomics web applications
• General data processing & analysis
– MetaboAnalyst
– http://www.metaboanalyst.ca
• Metabolite Set Enrichment Analysis
– MSEA
– http://www.msea.ca
• Metabolomic Pathway Analysis
– MetPA
– http://metpa.metabolomics.ca
MetaboAnalyst
http://www.metaboanalyst.ca
General metabolomics data processing, normalization,
and statistical analysis
Support two-group and multi-group analysis
20+ well-established methods
Dynamic graphical presentation
Automatic report generation
What MetaboAnalyst can do:
Basic data processing:
Peak picking, Peak alignment, Baseline filtering, etc.
Data normalization
probabilistic quotient normalization, scaling, etc.
Data overview
PCA, Heatmaps, etc.
Identify important features
t-tests, ANOVA, SAM, etc.
Classification
PLS-DA, Random Forest, SVM, etc.
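The slide names probabilistic quotient normalization without showing how it works. The following is a minimal sketch of the general technique, not MetaboAnalyst's own code. It assumes the reference profile is the per-feature median across all samples, which is one common choice. The function name `pqn_normalize` and the data are made up for illustration.

```python
from statistics import median


def pqn_normalize(samples):
    """Probabilistic quotient normalization (illustrative sketch).

    samples: list of rows, one row per sample, each a list of
    non-negative feature intensities of equal length.
    """
    n_features = len(samples[0])
    # Reference profile: per-feature median across all samples (an assumption).
    reference = [median(row[j] for row in samples) for j in range(n_features)]
    normalized = []
    for row in samples:
        # Quotient of each feature against the reference, skipping zero references.
        quotients = [x / r for x, r in zip(row, reference) if r > 0]
        # The median quotient estimates the sample's overall dilution factor.
        factor = median(quotients)
        normalized.append([x / factor for x in row])
    return normalized


# Made-up data: the second sample is the first one diluted two-fold,
# the third differs from the first in a single feature only.
data = [
    [10.0, 20.0, 30.0, 40.0],
    [5.0, 10.0, 15.0, 20.0],
    [10.0, 20.0, 30.0, 80.0],
]
normalized = pqn_normalize(data)
```

Because the dilution factor is a median of quotients, the one genuinely changed feature in the third sample does not distort the scaling of its other features.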
Input data types: GC/LC-MS raw spectra; MS / NMR peak lists; MS / NMR spectra bins; metabolite concentrations
• Peak detection
• Retention time correction
• Baseline filtering
• Peak alignment
• Data integrity check
• Missing value imputation
Data normalization
• Row-wise normalization (4)
• Column-wise normalization (4)
Data analysis
• Univariate analysis (4)
• Dimension reduction (2)
• Feature selection (2)
• Cluster analysis (4)
• Classification (2)
Data annotation
• Peak searching (3)
• Pathway mapping
Download
• Processed data
• PDF report
• Images
MetaboAnalyst
Data Upload
Data processing and integrity check
Data normalization (1)
Data normalization (2)
Data Analysis
Clustering with PCA
Hierarchical clustering
Supervised approach – PLS-DA
Feature selection - ANOVA
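As a rough illustration of ANOVA-based feature selection, the sketch below computes the one-way ANOVA F statistic for each compound and ranks compounds by it. This is the textbook formula, not MetaboAnalyst's implementation. The helper name `anova_f`, the compound names, and the values are invented for the example, and p-values are left out because the Python standard library has no F distribution.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a single feature.

    groups: list of lists, one list of measurements per phenotype group.
    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical concentrations for two compounds across three groups.
features = {
    "compound_A": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]],
    "compound_B": [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [2.0, 3.0, 1.0]],
}
f_scores = {name: anova_f(groups) for name, groups in features.items()}
# Compounds with the largest F separate the groups most clearly.
ranked = sorted(f_scores, key=f_scores.get, reverse=True)
```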
Feature selection - SAM
Data Download
Updates & Forecast
Recently upgraded
Support for multiple group analysis
One-way ANOVA & post-hoc analysis
To be added
To add some advanced methods for
Association analysis / ROC / OPLS
To enhance web interfacing with XCMS
Allow local installation
To be released by the end of this summer
Data Interpretation
Manual approach
Background knowledge plus literature search
Basic & Intuitive
Can be very accurate
Issues
Time-consuming
Subjective
Lack of statistical strength
Introducing MSEA
http://www.msea.ca
Metabolite Set Enrichment Analysis
Identify biologically meaningful patterns from
quantitative metabolomics data
Biologically meaningful
Associated with the same biological process (e.g. pathway)
Associated with the same genetic traits (e.g. SNP)
Co-regulated in the same pathological conditions (e.g. disease)
Located in the same cellular compartment
Enriched in certain tissues/ organs
The MSEA approach
Over-representation Analysis: Compound concentrations -> Compound selection (t-tests, ANOVA, PLS-DA) -> Important compound lists
Single Sample Profiling: Compound concentrations -> Compare to normal references -> Abnormal compounds
Quantitative Enrichment Analysis: Compound concentrations -> Assess metabolite set directly (GlobalTest)
All three: Find enriched biological themes (Metabolite set libraries) -> Biological interpretation
MSEA @ www.msea.ca
Over-representation analysis
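Over-representation analysis asks whether a metabolite set appears in a list of important compounds more often than chance would predict. The slides do not say which test MSEA applies. The sketch below assumes a one-sided hypergeometric test, a common choice for this kind of analysis. The function name `ora_p_value` and all counts are illustrative.

```python
from math import comb


def ora_p_value(universe_size, set_size, list_size, hits):
    """One-sided hypergeometric p-value, P(X >= hits).

    universe_size: number of compounds in the reference library
    set_size:      number of library compounds in the metabolite set
    list_size:     number of compounds in the submitted list
    hits:          number of submitted compounds that fall in the set
    """
    upper = min(set_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the probabilities of seeing `hits` or more set members by chance.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 compounds in the library, 4 of them in the set,
# 3 compounds submitted, 2 of which belong to the set.
p = ora_p_value(universe_size=10, set_size=4, list_size=3, hits=2)
```

In practice one p-value is computed per metabolite set in the chosen library, so the results need a multiple-testing correction before interpretation.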
Compound label standardization
Identify abnormal concentration
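Single sample profiling compares measured concentrations with normal reference values. A minimal sketch of that comparison follows, assuming each reference is given as a (low, high) range. The function name `flag_abnormal`, the compound names, and every number are placeholders, not real reference concentrations.

```python
def flag_abnormal(measured, reference_ranges):
    """Flag compounds whose concentration lies outside its reference range.

    measured:         dict of compound name -> measured concentration
    reference_ranges: dict of compound name -> (low, high) normal range
    Compounds without a reference range are skipped.
    """
    flags = {}
    for name, value in measured.items():
        if name not in reference_ranges:
            continue
        low, high = reference_ranges[name]
        if value < low:
            flags[name] = "low"
        elif value > high:
            flags[name] = "high"
    return flags


# Placeholder values for illustration only.
reference_ranges = {
    "compound_A": (10.0, 20.0),
    "compound_B": (1.0, 5.0),
    "compound_C": (100.0, 200.0),
}
measured = {
    "compound_A": 15.0,
    "compound_B": 9.0,
    "compound_C": 40.0,
    "compound_D": 3.0,
}
flags = flag_abnormal(measured, reference_ranges)
```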
Library selection
Result
Download
MSEA summary
More biologically motivated
Simultaneous biomarker identification and
interpretation
Automatic comparison with published data
Important patterns
Reference concentrations
Potential issues
Limited by the size and quality of the knowledge database
For pathway-based metabolite sets
Does not consider the pathway topology
Topology matters
Glycine, serine and threonine metabolism
Galactose metabolism
p = 1e-5
p = 1e-7
Introducing MetPA
http://metpa.metabolomics.ca
Pathway Analysis Tool
884 pathways covering 11 model organisms
Enrichment Analysis
Global Test
Global ANCOVA
Topology Analysis
Degree Centrality
Betweenness Centrality
Google-map style visualization
MetPA
11 pathway libraries (KEGG)
Combining Enrichment analysis & Topology analysis
Node importance measure: centrality
Degree Centrality
Local structure;
Highly connected (hub)
The Red nodes
Betweenness Centrality
Global structures;
Sits on many shortest
paths between other
nodes
The Blue node
Junker et al. BMC Bioinformatics 2006
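To make the two centrality measures concrete, here is a small sketch for an unweighted, undirected graph stored as an adjacency dictionary. Betweenness is computed with Brandes' algorithm, chosen here for illustration. The slides do not say how MetPA computes it, and real pathway graphs may be directed. The toy graph and node names are invented.

```python
from collections import deque


def degree_centrality(graph):
    """Number of direct neighbours of each node (local structure)."""
    return {v: len(neighbours) for v, neighbours in graph.items()}


def betweenness_centrality(graph):
    """Unnormalized betweenness via Brandes' algorithm (global structure)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies from the farthest nodes inwards.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was visited from both ends in an undirected graph.
    return {v: b / 2 for v, b in score.items()}


# Toy graph: two hubs (h1, h2) joined only through the node x.
edges = [("h1", "a"), ("h1", "b"), ("h1", "c"), ("h1", "x"),
         ("h2", "d"), ("h2", "e"), ("h2", "f"), ("h2", "x")]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
```

In this toy graph `x` has only two neighbours, yet every shortest path between the two halves runs through it, so its betweenness is nearly as high as that of the hubs.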
Point and Click
Lossless zooming
Result table
Downloads
MetPA summary
Combine statistical analysis and topological analysis
Results are closer to manual identification
Highly interactive visualization system
allows easy hierarchical navigation within a large
amount of information
Public Databases
HMDB
DrugBank
SMPDB
T3DB
Summary
1st & 2nd generation
• MetaboAnalyst: general data processing & analysis
3rd generation
• MSEA: Metabolite set enrichment analysis
4th generation
• MetPA: Metabolomics Pathway Analysis
5th generation
• Integrate with other omics data
Acknowledgement
• Dr. David Wishart
Alberta Ingenuity Fund (AIF)
The Human Metabolome Project (HMP)
University of Alberta, Canada