- Cal State LA - Instructional Web Server

Pathway Analysis
Michael Sneddon
Southern California Bioinformatics Institute
August 20, 2004
Project Overview
• BioDiscovery, Inc. specializes in microarray data analysis software
– Image Analysis
– Data Analysis
– Data Management
• How can microarray data be used to find information
about biological pathways?
• Project: explore different ways to extract information
about biological pathways from microarray data.
CONFIDENTIAL
Sample Microarray Data
• Microarrays can provide information on differential expression
between conditions. The most differentially expressed genes are
singled out for further study.
• Gene 3 would be selected for further study.

          Conditions
Genes     Healthy   Healthy   Infected   Infected
Gene 1    95        101       112        106
Gene 2    150       175       163        145
Gene 3    123       118       212        248
Gene 4    64        73        50         58
Gene 5    284       253       258        270
Gene 6    100       89        92         94
Gene 7    78        86        132        125
Gene 8    184       170       145        153
Gene 9    138       146       130        185
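To make the selection step concrete, here is a minimal sketch that ranks the nine sample genes by how strongly they differ between the two conditions. The measure used (log2 fold change of the group means) is an assumption chosen for brevity; the slides do not say which measure produced the selection, and a later slide uses a t-test P-value instead. The function names are illustrative.

```python
import math

# Expression values copied from the sample table: (healthy, infected) replicates.
SAMPLE_DATA = {
    "Gene 1": ([95, 101], [112, 106]),
    "Gene 2": ([150, 175], [163, 145]),
    "Gene 3": ([123, 118], [212, 248]),
    "Gene 4": ([64, 73], [50, 58]),
    "Gene 5": ([284, 253], [258, 270]),
    "Gene 6": ([100, 89], [92, 94]),
    "Gene 7": ([78, 86], [132, 125]),
    "Gene 8": ([184, 170], [145, 153]),
    "Gene 9": ([138, 146], [130, 185]),
}


def mean(values):
    return sum(values) / len(values)


def log2_fold_change(healthy, infected):
    """log2(mean infected / mean healthy); positive means higher when infected."""
    return math.log2(mean(infected) / mean(healthy))


def rank_by_differential_expression(data):
    """Return (gene, log2 fold change) pairs, largest absolute change first."""
    changes = {gene: log2_fold_change(h, i) for gene, (h, i) in data.items()}
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_by_differential_expression(SAMPLE_DATA)
for gene, change in ranking:
    print(f"{gene}: log2 fold change = {change:+.2f}")
```

With this measure Gene 3 comes out on top, which agrees with the slide's remark that Gene 3 would be selected.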
A Different Approach
Difficulties With Old Approach
• In some experiments, no gene is significantly differentially expressed.
• In others, many genes are significantly differentially expressed.
• Not making use of prior knowledge.
A Different Approach
• Look for affected biological processes, sometimes called pathways,
instead of individual genes.
• Need a way to convert a list of differential gene expression values
into scores for pathways.
• The way to do that is through a scoring metric.
Method of Ranking Pathways
Microarray Data + Annotations → Scoring Metric → A score for each pathway
• The score indicates how much the pathway was affected by the condition.
• Many different scoring metrics are available.
A Simple Metric
Pathway: Photosynthesis

   Gene Name       P-Value
1) 200078_s_at     .13458
2) 201172_x_at     .05124
3) 205473_at       .98341
4) 208678_at       .46123
5) 214244_s_at     .00032
6) 230565_at       .00341
7) 36994_at        .28952
8) 39144_s_at      .17345

5 genes have a P-value below 0.2 out of 8 genes in this pathway.
Score = (# of genes below 0.2) / (total # of genes in pathway)
Score = 5/8 = 0.625
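The simple metric can be sketched in a few lines. The P-values are the ones listed for the Photosynthesis pathway; the function name and the `threshold` parameter are illustrative, not part of any existing tool.

```python
# P-values for the eight genes of the example pathway, as listed on the slide.
PHOTOSYNTHESIS_P_VALUES = {
    "200078_s_at": 0.13458,
    "201172_x_at": 0.05124,
    "205473_at": 0.98341,
    "208678_at": 0.46123,
    "214244_s_at": 0.00032,
    "230565_at": 0.00341,
    "36994_at": 0.28952,
    "39144_s_at": 0.17345,
}


def simple_pathway_score(p_values, threshold=0.2):
    """Fraction of the pathway's genes whose P-value is below the threshold."""
    below = sum(1 for p in p_values.values() if p < threshold)
    return below / len(p_values)


score = simple_pathway_score(PHOTOSYNTHESIS_P_VALUES)
print(f"Score = {score}")
```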
Project Goals
I. Analyze and compare different scoring metrics
– How similar are the different metrics?
– Which metric produces the most biologically significant
results?
– When should we use a particular metric over another?
II. Explore known ranking metrics
– How and why do they work?
– Is there a way to improve them or design a better one?
The Metrics Investigated
• Enrichment – the original method used to rank
pathways; it is still widely used today
• GSEA (Gene Set Enrichment Analysis) – a recently
published* method using a Kolmogorov-Smirnov statistic
• Shams 1, Shams 2, and Shams 3 – potential BioDiscovery
scoring metrics
* Mootha et al., “PGC-1alpha-responsive genes involved in oxidative
phosphorylation are coordinately downregulated in human diabetes.”
Nat Genet. 2003 Jul;34(3):267-73.
Part I: Compare the metrics
Compare each metric to all the others to see if they
produce similar results.
If they are very similar, it doesn’t matter which one we use.
If they are different, which one is correct? Or are they both correct?
How to Compare Metrics
Wrote a program that does the following:
(1) Rank the same pathways using different metrics
(2) Take the top pathways from each ranking
(3) Count the number of pathways that are in common
among the top pathways being considered
(4) Construct a % similarity score = # of pathways in
common divided by the number of top pathways taken from each ranking
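The four steps can be sketched as follows. This is not the program described above, only a minimal illustration of the same idea. It assumes each metric is given as a mapping from pathway name to score and that a higher score means a more affected pathway; the pathway names and scores are made up.

```python
def top_pathways(scores, k):
    """Names of the k highest-scoring pathways for one metric."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def top_k_similarity(scores_a, scores_b, k):
    """Pathways shared by both top-k lists, divided by k."""
    shared = top_pathways(scores_a, k) & top_pathways(scores_b, k)
    return len(shared) / k


# Hypothetical scores for six pathways under two metrics.
metric_x = {"A": 9.0, "B": 8.0, "C": 7.0, "D": 6.0, "E": 1.0, "F": 0.5}
metric_y = {"A": 120.0, "B": 10.0, "C": 300.0, "D": 15.0, "E": 250.0, "F": 200.0}

similarity = top_k_similarity(metric_x, metric_y, k=4)
print(f"Similarity of the top 4 pathways: {similarity:.0%}")
```

Here the top four pathways are A, B, C, D under the first metric and C, E, F, A under the second, so two of four are shared.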
An Example
SHAMS I
 #  PATHWAY NAME                        SCORE
 1  tachykinin signaling pathway        2.051
 2  endothelin receptor activity        1.983
 3  initiation factor 4F complex        1.965
 4  nucleotide-sugar metabolism         1.807
 5  sarcoglycan complex                 1.829
 6  ubiquitin C-terminal activity       1.664
 7  odorant binding                     1.539
 8  interleukin-6 receptor binding      1.507
 9  cysteine-type peptidase activity    1.488
10  clathrin binding                    1.432
11  female germ-cell nucleus            1.422
12  AMP deaminase activity              1.415
13  anticoagulant activity              1.337
14  protease activator activity         1.336
15  malate metabolism                   1.326

GSEA
 #  PATHWAY NAME                        SCORE
 1  nucleotide-sugar metabolism         298.4
 2  endothelin-B receptor activity      270.5
 3  endothelin receptor activity        266.7
 4  odorant binding                     221.3
 5  interleukin-6 receptor binding      218.1
 6  histone deacetylation               181.3
 7  clathrin binding                    176.4
 8  female germ-cell nucleus            162.9
 9  AMP deaminase activity              159.2
10  cholecystokinin receptor activity   150.3
11  malate metabolism                   144.0
12  malate activity                     138.9
13  malate dehydrogenase                113.5
14  delta-opioid receptor activity      112.8
15  vasculogenesis                      109.0

Compare the top 12 pathways from each metric.
An Example
SHAMS I
 #  PATHWAY NAME                        SCORE
 1  tachykinin signaling pathway        2.051
 2  endothelin receptor activity        1.983
 3  initiation factor 4F complex        1.965
 4  nucleotide-sugar metabolism         1.807
 5  sarcoglycan complex                 1.829
 6  ubiquitin C-terminal activity       1.664
 7  odorant binding                     1.539
 8  interleukin-6 receptor binding      1.507
 9  cysteine-type peptidase activity    1.488
10  cholecystokinin receptor activity   1.385
11  female germ-cell nucleus            1.422
12  AMP deaminase activity              1.415
13  anticoagulant activity              1.337
14  protease activator activity         1.336
15  malate metabolism                   1.326

GSEA
 #  PATHWAY NAME                        SCORE
 1  nucleotide-sugar metabolism         298.4
 2  endothelin-B receptor activity      270.5
 3  endothelin receptor activity        266.7
 4  odorant binding                     221.3
 5  interleukin-6 receptor binding      218.1
 6  histone deacetylation               181.1
 7  clathrin binding                    176.4
 8  female germ-cell nucleus            162.9
 9  clathrin binding                    159.2
10  malate metabolism                   150.3
11  AMP deaminase activity              144.0
12  malate activity                     138.9
13  malate dehydrogenase                113.5
14  delta-opioid receptor activity      112.8
15  vasculogenesis                      109.0

6 matches out of 12 total pathways = 50% similarity
Repeat The Process
First, take the top 10 pathways.
Then take the top 20 pathways.
Then take the top 30 pathways.
.
.
.
Continue until a pattern is seen.
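Repeating the comparison at growing cut-offs amounts to a small loop. The sketch below assumes each metric is a mapping from pathway name to score, with higher scores ranked first; the 100 pathway names and both sets of scores are synthetic, generated only so the example runs.

```python
def top_pathways(scores, k):
    """Names of the k highest-scoring pathways for one metric."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def similarity_curve(scores_a, scores_b, cutoffs):
    """Return (cut-off, similarity) pairs, one per cut-off value."""
    curve = []
    for k in cutoffs:
        shared = top_pathways(scores_a, k) & top_pathways(scores_b, k)
        curve.append((k, len(shared) / k))
    return curve


# Synthetic data: 100 hypothetical pathways ranked differently by two metrics.
names = [f"pathway_{i}" for i in range(100)]
metric_a = {name: 100 - i for i, name in enumerate(names)}
metric_b = {name: 100 - (i * 37) % 100 for i, name in enumerate(names)}

cutoffs = list(range(10, 101, 10))  # top 10, top 20, ..., top 100
curve = similarity_curve(metric_a, metric_b, cutoffs)
for k, similarity in curve:
    print(f"top {k:3d}: {similarity:.2f}")
```

Plotting these pairs for every pair of metrics gives one line per comparison. Once the cut-off reaches the total number of pathways, the similarity is necessarily 1.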
Example Graph of Results
[Figure: graph of % similarity (y-axis, 0 to 0.5) against cut-off value
(x-axis, 10 to 100, out of 2646 pathways). Series: Shams 1 vs. Shams 2,
Shams 1 vs. Shams 3, Shams 1 vs. GSEA, Shams 1 vs. Enrichment,
Shams 2 vs. Shams 3, Shams 2 vs. GSEA, Shams 2 vs. Enrichment,
Shams 3 vs. GSEA, Shams 3 vs. Enrichment, GSEA vs. Enrichment,
Random Sample.]
Between Shams 1 and Shams 2, the top 20 pathways have about 36% similarity.
Results
• No two metrics were very similar (i.e., 85%+ similarity) in any
dataset tested.
• Percent similarities differed greatly between different
datasets – no two metrics demonstrated a consistent
amount of similarity.
Since the metrics ranked the pathways differently,
which metrics are correct? Or are they all correct?
• Begin by verifying and understanding what has
already been researched – GSEA.
Part II: Exploring a Metric: GSEA
• Gene Set Enrichment Analysis
• A result of the collaboration of many individuals
from a number of institutions including MIT and
Harvard.
• Devised in order to identify the pathways that
are significantly affected in individuals with type
2 diabetes compared to healthy individuals.
• How, exactly, does GSEA work?
• Is our implementation correct?
How GSEA works
• (1) Rank the genes based on differential expression

#     Gene            P-Value (T-test)
1     217245_at       0.011737478
2     204157_s_at     0.011873747
3     208670_s_at     0.01204891
4     203569_s_at     0.012177919
5     211432_s_at     0.012267433
6     215551_at       0.012284646
7     206226_at       0.012354957
8     201662_s_at     0.01257829
9     210322_x_at     0.012579987
10    216776_at       0.012583022
11    220405_at       0.012684445
...   ...             ...

• (2) Compute a score for each pathway based on where the genes of
that pathway appear.
If pathway one contains these three genes [first group marked on the
slide] and pathway two contains these three genes [second group marked
on the slide], then pathway one is given a higher score than pathway two.
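A position-based score of this kind can be sketched with a Kolmogorov-Smirnov-style running sum: walk down the ranked list, step up at each gene in the pathway, step down otherwise, and keep the highest point reached. The step sizes below follow the weighting I understand the published method to use, but treat them as an assumption; this is not BioDiscovery's implementation. The 11 probe IDs come from the slide and are treated here as the complete ranked list, and the two pathway memberships are hypothetical.

```python
import math


def enrichment_score(ranked_genes, pathway_genes):
    """Maximum of a running sum over the ranked gene list.

    Assumes ranked_genes has no duplicates and is ordered from most to
    least differentially expressed.
    """
    members = set(pathway_genes) & set(ranked_genes)
    n_total, n_hits = len(ranked_genes), len(members)
    if n_hits == 0 or n_hits == n_total:
        return 0.0
    step_up = math.sqrt((n_total - n_hits) / n_hits)    # gene is in the pathway
    step_down = math.sqrt(n_hits / (n_total - n_hits))  # gene is not in the pathway
    running_sum = 0.0
    best = 0.0
    for gene in ranked_genes:
        running_sum += step_up if gene in members else -step_down
        best = max(best, running_sum)
    return best


RANKED = [
    "217245_at", "204157_s_at", "208670_s_at", "203569_s_at",
    "211432_s_at", "215551_at", "206226_at", "201662_s_at",
    "210322_x_at", "216776_at", "220405_at",
]
pathway_one = RANKED[:3]   # hypothetical: three genes at the top of the list
pathway_two = RANKED[-3:]  # hypothetical: three genes at the bottom

score_one = enrichment_score(RANKED, pathway_one)
score_two = enrichment_score(RANKED, pathway_two)
print(f"pathway one: {score_one:.3f}, pathway two: {score_two:.3f}")
```

Because pathway one's genes sit at the top of the ranking, its running sum climbs immediately and it receives the higher score.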
Importance of a P-Value
• A metric will always produce a ranking.
• Is the ranking we get significant or could it have been
generated randomly?
Answer: We need to compute a P-value to make sure
that the score we get is unlikely to have been produced
by chance.
Constructing a P-value
(1) Permute class labels 1000 times
(2) Rank the pathways with each different permutation
(3) Create a histogram of top values based on the
permutations
(4) Figure out where in the histogram the actual data lies –
shows how significant the score is.
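The four steps can be sketched as a small permutation test. Everything specific here is an assumption made for illustration: the expression values, the two pathways, and the stand-in scoring function (mean absolute difference between class means), which takes the place of whichever metric is actually being tested.

```python
import random


def class_means(values, labels):
    """Mean expression for label 0 (healthy) and label 1 (infected)."""
    healthy = [v for v, lab in zip(values, labels) if lab == 0]
    infected = [v for v, lab in zip(values, labels) if lab == 1]
    return sum(healthy) / len(healthy), sum(infected) / len(infected)


def pathway_score(genes, expression, labels):
    """Stand-in metric: average |infected mean - healthy mean| over the genes."""
    diffs = []
    for gene in genes:
        healthy_mean, infected_mean = class_means(expression[gene], labels)
        diffs.append(abs(infected_mean - healthy_mean))
    return sum(diffs) / len(diffs)


def top_score(pathways, expression, labels):
    """Score of the best-ranked pathway for a given labelling."""
    return max(pathway_score(g, expression, labels) for g in pathways.values())


def permutation_p_value(pathways, expression, labels, n_permutations=1000, seed=0):
    rng = random.Random(seed)
    observed = top_score(pathways, expression, labels)
    shuffled = list(labels)
    null_scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # (1) permute the class labels
        null_scores.append(top_score(pathways, expression, shuffled))  # (2), (3)
    # (4) locate the observed score within the permutation distribution
    at_least_as_large = sum(score >= observed for score in null_scores)
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# Hypothetical data: six healthy samples followed by six infected samples.
labels = [0] * 6 + [1] * 6
expression = {
    "g1": [100, 104, 98, 101, 97, 103, 200, 195, 210, 205, 198, 202],
    "g2": [50, 52, 49, 51, 48, 50, 90, 95, 88, 92, 91, 94],
    "g3": [80, 79, 81, 80, 82, 78, 81, 80, 79, 82, 78, 80],
    "g4": [60, 61, 59, 60, 62, 58, 59, 60, 61, 58, 62, 60],
}
pathways = {"pathway A": ["g1", "g2"], "pathway B": ["g3", "g4"]}

observed, p_value = permutation_p_value(pathways, expression, labels)
print(f"observed top score = {observed:.2f}, P-value = {p_value:.4f}")
```

The list `null_scores` is what the histogram of top values would be drawn from. Adding 1 to both numerator and denominator is a common convention that keeps the estimated P-value from being exactly zero.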
Constructing a P-value
[Figure: histogram of the GSEA scores obtained from the permutations.
X-axis: GSEA score (bins from 100 to 340, then "More"); y-axis: number
of permutations (0 to 60).]
If the actual score falls here [within the bulk of the histogram], the
score is not significant.
But if the actual score falls here [in the far tail], the score is
significant.
Implementation
• BioDiscovery already had an
implementation of the GSEA scoring metric.
• What I did:
– Tweaked the code so that it works better and functions
more like the original published method.
– Extended the code to compute a P-value to measure the
significance of GSEA scores.
Results of GSEA analysis
• A better understanding of how GSEA operates, especially
in comparison to other potential metrics.
• A good implementation of the GSEA metric.
• An implementation of a permutation analysis to judge the
significance of calculated scores.
Next steps
• Extend the permutation analysis implemented for GSEA
to all the metrics, to verify the significance of their
results.
• Submit these significant results to biologists to see which
metrics make the most sense.
• Final Step: Integrate the best metrics and the
permutation analysis into one application for biologists.
Acknowledgments
Special Thanks to:
• Dr. Soheil Shams
• Dr. Bruce Hoff
• Keala Chan
• The staff of BioDiscovery, Inc.
• The professors of SoCalBSI
• The students of SoCalBSI
Funding Provided by:
• National Science Foundation
• National Institutes of Health
Works Cited
• Mootha et al., “PGC-1alpha-responsive genes involved in
oxidative phosphorylation are coordinately downregulated in
human diabetes.” Nat Genet. 2003 Jul;34(3):267-73.
• Damian D, Gorfine M., “Statistical concerns about the GSEA
procedure.” Nat Genet. 2004 Jul;36(7):663; author reply 663.
• Confidential Documents of BioDiscovery, Inc.
• http://www.biocarta.com
• http://www.geneontology.org
• http://www.affymetrix.com