How to look for the needle in a haystack

Jianguo (Jeff) Xia
Wishart Research Group
University of Alberta, Canada
METABOLOMICS 2010
Metabolomics at the University of Alberta
• DI/GC/LC-MS, HPLC, NMR
• Software tools
• Compound databases
Outline
• Introduction
• Omics data overview
• Lessons from other omics research
• Project goals
• Web-based metabolomics tools
  I. General data processing & statistical analysis: MetaboAnalyst (http://www.metaboanalyst.ca)
  II. Identify functionally interesting patterns: MSEA (http://www.msea.ca)
  III. Metabolic pathway analysis: MetPA (http://metpa.metabolomics.ca)
• Public databases
• Summary
The ‘-omics’ data overview
• Genomics: DNA sequence (100,000 - 1,000,000)
• Transcriptomics: gene expression (10,000 - 100,000)
• Proteomics: protein expression / interaction (1,000 - 10,000)
• Metabolomics: compound concentration (100 - 1,000)
Common questions
1. Are there any interesting patterns present in my data?
2. What are the most important features associated with different phenotypes?
3. Is there a real difference between the groups?
4. Can I use this data to predict a phenotype?
5. How should I interpret these features / patterns?
6. How does my result compare with published data?
Common approaches
• 1st generation: classical statistics (t-tests, ANOVA), since 1950s
• 2nd generation: high-dimensional feature selection; machine learning (SAM, Limma; SVM, neural networks), since 1990s
• 3rd generation: group-based enrichment analysis (GSEA, GSA, Globaltest), since 2003
• 4th generation: pathway analysis (SPIA, TopoGSA), since 2007
Project Goals
• Provide well-established methods proven highly successful in other ‘omics’ studies
• Do not re-invent the wheel!
• Support traditional approaches
• Cheminformatics approaches
• Data processing & normalization procedures
• Easy to use
• Not command-line
• Target users: bench biologists
Identify influential algorithms
Identify the best practices
Nature Genetics 38, 500-501 (2006)
Metabolomics web applications
• General data processing & analysis
– MetaboAnalyst
– http://www.metaboanalyst.ca
• Metabolite Set Enrichment Analysis
– MSEA
– http://www.msea.ca
• Metabolomic Pathway Analysis
– MetPA
– http://metpa.metabolomics.ca
MetaboAnalyst
• http://www.metaboanalyst.ca
• General metabolomics data processing, normalization, and statistical analysis
• Supports two-group and multi-group analysis
• 20+ well-established methods
• Dynamic graphical presentation
• Automatic report generation
What MetaboAnalyst can do:
• Basic data processing
  – Peak picking, peak alignment, baseline filtering, etc.
• Data normalization
  – Probabilistic quotient normalization, scaling, etc.
• Data overview
  – PCA, heatmaps, etc.
• Identify important features
  – t-tests, ANOVA, SAM, etc.
• Classification
  – PLS-DA, random forest, SVM, etc.
Input data
• GC/LC-MS raw spectra
• MS / NMR peak lists
• MS / NMR spectra bins
• Metabolite concentrations
Data processing
• Peak detection
• Retention time correction
• Baseline filtering
• Peak alignment
• Data integrity check
• Missing value imputation
Data normalization
• Row-wise normalization (4)
• Column-wise normalization (4)
Data analysis
• Univariate analysis (4)
• Dimension reduction (2)
• Feature selection (2)
• Cluster analysis (4)
• Classification (2)
Data annotation
• Peak searching (3)
• Pathway mapping
Download
• Processed data
• PDF report
• Images
MetaboAnalyst
Data Upload
Data processing and integrity check
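A minimal sketch of what an integrity check and missing value imputation could look like, assuming samples in rows and features in columns. The slides do not say which rules MetaboAnalyst applies, so the checks (rectangular table, no negative values) and the fill rule (half of the smallest positive value in the table) are assumptions for illustration, and the function names are hypothetical.

```python
from typing import List, Optional

# Rows are samples, columns are features; None marks a missing value.
Table = List[List[Optional[float]]]


def check_integrity(table: Table) -> None:
    """Raise ValueError if the table is empty, ragged, or has negative values."""
    if not table or not table[0]:
        raise ValueError("empty table")
    width = len(table[0])
    for i, row in enumerate(table):
        if len(row) != width:
            raise ValueError(f"row {i} has {len(row)} values, expected {width}")
        for value in row:
            if value is not None and value < 0:
                raise ValueError(f"row {i} contains a negative value")


def impute_half_min(table: Table) -> List[List[float]]:
    """Replace missing values with half of the smallest positive value (assumed rule)."""
    positives = [v for row in table for v in row if v is not None and v > 0]
    if not positives:
        raise ValueError("no positive values to derive a fill value from")
    fill = min(positives) / 2.0
    return [[fill if v is None else float(v) for v in row] for row in table]


if __name__ == "__main__":
    raw = [[1.0, None], [4.0, 2.0]]
    check_integrity(raw)
    print(impute_half_min(raw))
```

Using a small positive fill value rather than zero keeps later log transforms and ratios defined; other imputation rules would fit the same interface.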
Data normalization (1)
Data normalization (2)
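A minimal sketch of one row-wise and one column-wise normalization, assuming samples in rows and features in columns. Probabilistic quotient normalization is named earlier in the talk; taking the per-feature median across all samples as the reference, and using autoscaling as the column-wise step, are assumptions. The function names are illustrative and are not MetaboAnalyst's API.

```python
from statistics import mean, median, stdev
from typing import List

Matrix = List[List[float]]  # rows = samples, columns = features


def pqn(samples: Matrix) -> Matrix:
    """Row-wise: divide each sample by its median quotient against a reference."""
    n_features = len(samples[0])
    # Assumed reference: the median of each feature across all samples.
    reference = [median(s[j] for s in samples) for j in range(n_features)]
    normalized = []
    for sample in samples:
        quotients = [
            sample[j] / reference[j]
            for j in range(n_features)
            if reference[j] > 0 and sample[j] > 0
        ]
        factor = median(quotients)  # most probable dilution factor
        normalized.append([value / factor for value in sample])
    return normalized


def autoscale(samples: Matrix) -> Matrix:
    """Column-wise: mean-center each feature and divide by its standard deviation."""
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # a constant feature gives sd = 0 and will fail
    return [
        [(value - m) / sd for value, m, sd in zip(sample, means, sds)]
        for sample in samples
    ]


if __name__ == "__main__":
    # Three samples that differ only by dilution end up identical after PQN.
    diluted = [[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [3.0, 6.0, 12.0]]
    print(pqn(diluted))
```

Row-wise methods correct for sample-to-sample differences such as dilution, while column-wise methods make features with very different concentration ranges comparable.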
Data Analysis
Clustering with PCA
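As a rough illustration of what a PCA scores plot is built from, the sketch below extracts the first principal component with power iteration on the covariance matrix. This is a teaching sketch using only the standard library, not how MetaboAnalyst computes PCA; the function name and the fixed iteration count are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # rows = samples, columns = features


def first_principal_component(
    samples: Matrix, iterations: int = 100
) -> Tuple[List[float], List[float]]:
    """Return (loadings, scores) for the first principal component."""
    n, p = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(p)]
    centered = [[s[j] - means[j] for j in range(p)] for s in samples]
    # Sample covariance matrix of the features.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Deterministic start vector; power iteration fails only if it happens
    # to be exactly orthogonal to the leading direction.
    start_norm = math.sqrt(sum((j + 1) ** 2 for j in range(p)))
    v = [(j + 1) / start_norm for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Scores are the projections of the centered samples onto the loadings.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


if __name__ == "__main__":
    loadings, scores = first_principal_component(
        [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    )
    print(loadings, scores)
```

Plotting the scores of the first two components against each other gives the familiar view in which samples from the same group tend to cluster together.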
Hierarchical clustering
Supervised approach – PLS-DA
Feature selection - ANOVA
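A minimal sketch of ANOVA-based feature ranking: compute the one-way ANOVA F statistic for every compound and sort by it. The table layout and function names are assumptions for illustration. Converting F to a p-value needs the F distribution, which the standard library does not provide, so the sketch ranks by F only.

```python
from typing import Dict, List, Sequence, Tuple


def anova_f(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    # A feature with zero within-group variance makes this division fail.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(
    table: Dict[str, List[float]], labels: List[str]
) -> List[Tuple[str, float]]:
    """table maps feature name -> one value per sample; labels give each sample's group."""
    ranked = []
    for name, values in table.items():
        by_group: Dict[str, List[float]] = {}
        for value, label in zip(values, labels):
            by_group.setdefault(label, []).append(value)
        ranked.append((name, anova_f(list(by_group.values()))))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
    toy = {
        "feature_1": [1, 2, 3, 2, 3, 4, 5, 6, 7],  # differs between groups
        "feature_2": [1, 2, 3, 1, 2, 3, 1, 2, 3],  # identical in every group
    }
    print(rank_features(toy, labels))
```

With hundreds of compounds tested at once, the resulting p-values would normally be adjusted for multiple testing before a cutoff is applied.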
Feature selection - SAM
Data Download
Updates & Forecast
• Recently upgraded
• Support for multiple group analysis
• One-way ANOVA & post-hoc analysis
• To be added
• Advanced methods for
  – Association analysis / ROC / OPLS
• Enhanced web interfacing with XCMS
• Local installation
  – To be released by the end of this summer
Data Interpretation
• Manual approach
• Background knowledge plus literature search
• Basic & intuitive
• Can be very accurate
• Issues
  – Time-consuming
  – Subjective
  – Lack of statistical strength
Introducing MSEA
• http://www.msea.ca
• Metabolite Set Enrichment Analysis
• Identify biologically meaningful patterns from quantitative metabolomics data
Biologically meaningful
• Associated with the same biological process (e.g. pathway)
• Associated with the same genetic traits (e.g. SNP)
• Co-regulated in the same pathological conditions (e.g. disease)
• Located in the same cellular compartment
• Enriched in certain tissues / organs
The MSEA approach
• Over-representation analysis: compound concentrations → compound selection (t-tests, ANOVA, PLS-DA) → important compound lists
• Single sample profiling: compound concentrations → compare to normal references → abnormal compounds
• Quantitative enrichment analysis: compound concentrations → assess metabolite sets directly (GlobalTest)
• All three: metabolite set libraries → find enriched biological themes → biological interpretation
MSEA @ www.msea.ca
Over-representation analysis
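Over-representation analysis is commonly implemented as a one-sided hypergeometric test: given a list of important compounds, how surprising is its overlap with each metabolite set? The sketch below assumes that test; the slides do not state which test MSEA uses, and the library contents, compound names, and function names here are invented for illustration.

```python
from math import comb
from typing import Dict, Iterable, List, Tuple


def hypergeom_pvalue(hits: int, list_size: int, set_size: int, universe_size: int) -> float:
    """P(overlap >= hits) when list_size compounds are drawn from the universe."""
    upper = min(list_size, set_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def over_representation(
    compounds: Iterable[str],
    library: Dict[str, List[str]],
    universe_size: int,
) -> List[Tuple[str, int, float]]:
    """Return (set name, hits, p-value) per metabolite set, smallest p first.

    Assumes every query compound belongs to the universe being counted.
    """
    query = set(compounds)
    results = []
    for name, members in library.items():
        hits = len(query & set(members))
        p = hypergeom_pvalue(hits, len(query), len(set(members)), universe_size)
        results.append((name, hits, p))
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Hypothetical library with placeholder compound identifiers.
    toy_library = {
        "set_a": ["c1", "c2", "c3", "c4", "c5"],
        "set_b": ["c6", "c7", "c8", "c9", "c10"],
    }
    for row in over_representation(["c1", "c2", "c3", "c4", "c5"], toy_library, 10):
        print(row)
```

The choice of universe (all compounds in the library, or only those the platform can measure) changes the p-values, so it should be stated alongside any result.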
Compound label standardization
Identify abnormal concentration
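A minimal sketch of the single sample profiling idea: compare each measured concentration with a normal reference range and keep the compounds that fall outside it. The range format, the compound names, and all numbers are placeholders for illustration, not real reference concentrations.

```python
from typing import Dict, Tuple


def flag_abnormal(
    measured: Dict[str, float],
    reference: Dict[str, Tuple[float, float]],
) -> Dict[str, str]:
    """Return {compound: "low" | "high"} for values outside their reference range.

    Compounds without a reference range are skipped rather than flagged.
    """
    flags: Dict[str, str] = {}
    for compound, value in measured.items():
        if compound not in reference:
            continue
        low, high = reference[compound]
        if value < low:
            flags[compound] = "low"
        elif value > high:
            flags[compound] = "high"
    return flags


if __name__ == "__main__":
    # Placeholder values only.
    sample = {"compound_a": 5.0, "compound_b": 50.0, "compound_c": 0.1, "compound_d": 1.0}
    normal = {"compound_a": (1.0, 10.0), "compound_b": (1.0, 10.0), "compound_c": (1.0, 10.0)}
    print(flag_abnormal(sample, normal))
```

The flagged compounds then form the input list for over-representation analysis against the metabolite set libraries.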
Library selection
Result
Download
MSEA summary
• More biologically motivated
• Simultaneous biomarker identification and interpretation
• Automatic comparison with published data
• Important patterns
• Reference concentrations
• Potential issues
• Limited by the size and quality of the knowledge database
• For pathway-based metabolite sets
  – Does not consider the pathway topology
Topology matters
Glycine, serine and threonine metabolism
Galactose metabolism
p = 1e-5
p = 1e-7
Introducing MetPA
• http://metpa.metabolomics.ca
• Pathway analysis tool
• 884 pathways covering 11 model organisms
• Enrichment analysis
  – Global Test
  – Global ANCOVA
• Topology analysis
  – Degree centrality
  – Betweenness centrality
• Google Maps-style visualization
MetPA
11 pathway libraries (KEGG)
Combining Enrichment analysis & Topology analysis
Node importance measure: centrality
• Degree centrality
• Local structure
• Highly connected (hub)
• The red nodes
• Betweenness centrality
• Global structure
• Sits on many shortest paths between other nodes
• The blue node
Junker et al. BMC Bioinformatics 2006
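A minimal sketch of the two centrality measures on an unweighted, undirected graph, using Brandes' algorithm for betweenness. The final function shows one plausible way to turn node importance into a pathway-level score (importance of the matched compounds divided by total importance in the pathway); that scoring rule is an assumption, not something stated on the slides, and all names are illustrative.

```python
from collections import deque
from typing import Dict, Iterable, List

Graph = Dict[str, List[str]]  # adjacency lists; every edge listed in both directions


def degree_centrality(graph: Graph) -> Dict[str, int]:
    """Number of direct neighbors per node (local structure)."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness_centrality(graph: Graph) -> Dict[str, float]:
    """Number of shortest paths between other node pairs passing through each node."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                        # nodes in order of increasing distance
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        n_paths = {v: 0 for v in graph}   # shortest-path counts from source
        dist = {v: -1 for v in graph}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                      # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):         # accumulate from the farthest nodes back
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair was counted from both endpoints.
    return {v: s / 2.0 for v, s in score.items()}


def pathway_impact(importance: Dict[str, float], matched: Iterable[str]) -> float:
    """Assumed score: matched importance as a fraction of total importance."""
    total = sum(importance.values())
    if total == 0:
        return 0.0
    return sum(importance[m] for m in set(matched) if m in importance) / total


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
    print(degree_centrality(chain))
    print(betweenness_centrality(chain))
```

On the five-node chain in the usage example, the three inner nodes all have degree 2, yet the middle node has the highest betweenness, which is the distinction between local and global structure drawn on the slide.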
Point and Click
Lossless zooming
Result table
Downloads
MetPA summary
• Combines statistical analysis and topological analysis
• Results are closer to manual identification
• Highly interactive visualization system
• Allows easy hierarchical navigation within a large amount of information
Public Databases
• HMDB
• DrugBank
• SMPDB
• T3DB
Summary
• 1st & 2nd generation: MetaboAnalyst (general data processing & analysis)
• 3rd generation: MSEA (metabolite set enrichment analysis)
• 4th generation: MetPA (metabolomics pathway analysis)
• 5th generation: integrate with other omics data
Acknowledgement
• Dr. David Wishart
• Alberta Ingenuity Fund (AIF)
• The Human Metabolome Project (HMP)
• University of Alberta, Canada