Part 5: Linking Microarray Data with Survival Analysis
Use of microarray data via model-based classification
in the study and prediction of survival from lung cancer
(Ben-Tovim Jones et al., 2005)
Problems
• Censored Observations – the time of occurrence of the event
(death) has not yet been observed.
• Small Sample Sizes – study limited by patient numbers
• Specific Patient Group – is the study applicable to other
populations?
• Difficulty in integrating different studies (different
microarray platforms)
A Case Study: The Lung Cancer data sets from
CAMDA’03
Four independently acquired lung cancer data sets
(Harvard, Michigan, Stanford and Ontario).
The challenge: To integrate information from different
data sets (2 Affy chips of different versions, 2 cDNA arrays).
The final goal: To make an impact on cancer biology and
eventually patient care.
“Especially, we welcome the methodology of survival analysis
using microarrays for cancer prognosis (Park et al.
Bioinformatics: S120, 2002).”
Methodology of Survival Analysis using Microarrays
Cluster the tissue samples (e.g. using hierarchical clustering), then
compare the survival curves for each cluster using a non-parametric
Kaplan-Meier analysis (Alizadeh et al. 2000).
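The Kaplan-Meier curves compared in this kind of analysis can be sketched as follows. This is a minimal product-limit estimator written for illustration only: the function name `kaplan_meier` and the toy follow-up times are assumptions, not data or code from the studies cited.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time for each tissue/patient
    events : 1 if the event (death or recurrence) was observed, 0 if censored
    Returns a list of (time, estimated survival) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t (events and censorings).
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            surv *= 1.0 - n_events / n_at_risk
            curve.append((t, surv))
        n_at_risk -= n_leaving
    return curve


# Hypothetical toy data: five patients, two of them censored.
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(t, round(s, 3))
```

In practice one such curve would be computed per cluster and the curves compared with a test such as the log-rank test, which is not implemented here.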
Park et al. (2002), Nguyen and Rocke (2002) used partial least
squares with the proportional hazards model of Cox.
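For reference, the proportional hazards model of Cox has the standard form below. The symbols $h_0$, $x$ and $\beta$ are generic textbook notation rather than notation taken from these slides; in the partial least squares approaches, $x$ would be a small number of components constructed from the expression data rather than the individual genes.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\text{hazard ratio for a unit increase in } x_j = \exp(\beta_j)
```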
Unsupervised vs. Supervised Methods
Semi-supervised approach of Bair and Tibshirani (2004), to combine
gene expression data with the clinical data.
AIM: To link gene-expression data with survival from lung cancer
in the CAMDA’03 challenge
A CLUSTER ANALYSIS
We apply a model-based clustering approach to classify tumour
tissues on the basis of microarray gene expression.
B SURVIVAL ANALYSIS
The association between the clusters so formed and patient
survival (recurrence) times is established.
C DISCRIMINANT ANALYSIS
We demonstrate the potential of the clustering-based prognosis
as a predictor of the outcome of disease.
Lung Cancer
Approx. 80% of lung cancer patients have NSCLC (of which
adenocarcinoma is the most common form).
All patients diagnosed with NSCLC are treated on the basis of
stage at presentation (tumour size, lymph node involvement and
presence of metastases).
Yet 30% of patients with resected stage I lung cancer will die of
metastatic cancer within 5 years of surgery.
Want a prognostic test for early-stage lung adenocarcinoma to
identify patients who are more likely to recur and who would
therefore benefit from adjuvant therapy.
Lung Cancer Data Sets
(see http://www.camda.duke.edu/camda03)
Wigle et al. (2002), Garber et al. (2001), Bhattacharjee et al. (2001),
Beer et al. (2002).
[Figure: Heat Map for 2880 Ontario Genes (39 Tissues); axes: Genes, Tissues]
[Figure: Heat Maps for the 20 Ontario Gene-Groups (39 Tissues); axes: Genes, Tissues]
Tissues are ordered as:
Recurrence (1-24) and Censored (25-39)
[Figure: Expression Profiles for Useful Metagenes (Ontario 39 Tissues).
Panels: Gene Group 1, Gene Group 2, Gene Group 19, Gene Group 20.
Axes: Log Expression Value against Tissues.
Legend: Our Tissue Cluster 1, Our Tissue Cluster 2; tissues ordered as
Recurrence (1-24), Censored (25-39)]
Tissue Clusters
CLUSTER ANALYSIS via EMMIX-GENE of 20
METAGENES yields TWO CLUSTERS:
CLUSTER 1 (31): 23 (recurrence) plus 8 (censored): Poor-prognosis
CLUSTER 2 (8): 1 (recurrence) plus 7 (censored): Good-prognosis
SURVIVAL ANALYSIS:
LONG-TERM SURVIVOR (LTS) MODEL
$$S(t) = \mathrm{pr}\{T > t\} = p_1 S_1(t) + p_2,$$
where $T$ is the time to recurrence and $p_1 = 1 - p_2$ is the
prior probability of recurrence.
Adopt Weibull model for the survival function for
recurrence S1(t).
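The long-term survivor mixture can be evaluated numerically as in the sketch below. The shape/scale parametrization of the Weibull survival function and the parameter values are assumptions made for illustration; the slides give neither the parametrization nor the fitted values.

```python
import math


def weibull_survival(t, shape, scale):
    """Weibull survival function S1(t) = exp(-(t / scale) ** shape).

    The shape/scale parametrization is an assumption of this sketch.
    """
    return math.exp(-((t / scale) ** shape))


def lts_survival(t, p1, shape, scale):
    """Long-term survivor model: S(t) = p1 * S1(t) + p2, with p2 = 1 - p1."""
    p2 = 1.0 - p1
    return p1 * weibull_survival(t, shape, scale) + p2


# Hypothetical parameter values, chosen only to show the plateau at p2.
p1, shape, scale = 0.7, 1.5, 600.0
for t in (0.0, 300.0, 1000.0, 5000.0):
    print(t, round(lts_survival(t, p1, shape, scale), 3))
```

The curve starts at 1 and levels off at p2 for large t, which is the feature that distinguishes the LTS model from an ordinary Weibull fit.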
[Figure: Fitted LTS Model vs. Kaplan-Meier]
[Figures: PCA of Tissues Based on Metagenes; axes: First PC, Second PC]
[Figures: PCA of Tissues Based on All Genes (via SVD); axes: First PC, Second PC]
[Figure: Cluster-Specific Kaplan-Meier Plots]
Survival Analysis for Ontario Dataset
• Nonparametric analysis:
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (SE)
1         29               8                 665 ± 85.9
2         8                7                 1388 ± 155.7
A significant difference between Kaplan-Meier estimates for
the two clusters (P=0.027).
• Cox’s proportional hazards analysis:
Variable                      Hazard ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2       6.78 (0.9 – 51.5)       0.06
Tumor stage (I vs. II&III)    1.07 (0.57 – 2.0)       0.83
Discriminant Analysis (Supervised Classification)
A prognosis classifier was developed to predict the class
of origin of a tumor tissue with a small error rate after
correction for the selection bias.
A support vector machine (SVM) was adopted to identify
important genes that play a key role in predicting the
clinical outcome, using all the genes, and the metagenes.
A cross-validation (CV) procedure was used to calculate
the prediction error, after correction for the selection bias.
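The correction for selection bias amounts to repeating the gene selection inside every cross-validation fold, so that the held-out tissues play no part in choosing the genes. The sketch below illustrates that idea only. It swaps in a simple mean-difference filter and a nearest-centroid classifier in place of the SVM with RFE, and all function names and the simulated data are assumptions of the sketch.

```python
import random
import statistics


def rank_genes(X, y, k):
    """Indices of the k genes with the largest standardized mean difference.

    Assumes both classes (0 and 1) are present in y.
    """
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        spread = statistics.pstdev(a) + statistics.pstdev(b) + 1e-9
        scores.append((abs(statistics.mean(a) - statistics.mean(b)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def centroid_predict(X_train, y_train, genes, x):
    """Nearest-centroid prediction using only the selected genes."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [statistics.mean(r[j] for r in rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k, n_folds=10, select_inside=True):
    """Cross-validated error rate.

    select_inside=True  : genes are re-selected on each training fold
                          (corrected for selection bias).
    select_inside=False : genes are selected once on all tissues
                          (optimistically biased).
    """
    n = len(X)
    genes_all = rank_genes(X, y, k)  # only used in the biased variant
    errors = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        test_idx = [i for i in range(n) if i % n_folds == fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        genes = rank_genes(X_tr, y_tr, k) if select_inside else genes_all
        for i in test_idx:
            errors += centroid_predict(X_tr, y_tr, genes, X[i]) != y[i]
    return errors / n


# Simulated data: 40 tissues, 50 genes, only gene 0 carries class information.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[10.0 * lab + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(49)]
     for lab in y]
err_corrected = cv_error(X, y, k=3, select_inside=True)
err_biased = cv_error(X, y, k=3, select_inside=False)
print(err_corrected, err_biased)
```

When no gene is truly informative, the biased variant tends to report an error rate well below the corrected one, which is the effect the correction is meant to remove.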
ONTARIO DATA (39 tissues): Support Vector Machine
(SVM) with Recursive Feature Elimination (RFE)
[Figure: Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector
Machine (SVM) against log2 (number of genes), applied to g=2 clusters
(G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39)]
STANFORD DATA
918 genes based on 73 tissue samples from 67 patients.
Row and column normalized, retained 451 genes after
select-genes step. Used 20 metagenes to cluster tissues.
Retrieved histological groups.
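The normalization and metagene steps can be sketched as below. The slides do not spell out the details, so this sketch assumes that each column (tissue) is standardized first and each row (gene) second, and that a metagene is the average profile of the genes in a group; the function names and the tiny matrices are illustrative only.

```python
import statistics


def standardize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s if s > 0 else 0.0 for v in values]


def normalize_rows_and_columns(matrix):
    """matrix[g][t] holds the expression of gene g in tissue t.

    Columns (tissues) are standardized first, then rows (genes);
    this order is an assumption of the sketch.
    """
    cols = [standardize(list(col)) for col in zip(*matrix)]
    rows = [list(row) for row in zip(*cols)]
    return [standardize(row) for row in rows]


def metagenes(matrix, groups):
    """Average the rows in each gene group to give one metagene per group."""
    result = []
    for group in groups:
        rows = [matrix[g] for g in group]
        result.append([statistics.mean(col) for col in zip(*rows)])
    return result


# Hypothetical 3-gene by 3-tissue matrix and two gene groups.
raw = [[1.0, 2.0, 4.0], [2.0, 1.0, 0.0], [3.0, 6.0, 5.0]]
normalized = normalize_rows_and_columns(raw)
meta = metagenes(normalized, [[0, 1], [2]])
print(meta)
```

Clustering the tissues on a handful of metagenes rather than on hundreds of genes keeps the number of parameters in the mixture model manageable.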
[Figure: Heat Maps for the 20 Stanford Gene-Groups (73 Tissues); axes: Genes, Tissues]
Tissues are ordered by their histological classification:
Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal
(48-52), Squamous cell (53-68), Small cell (69-73)
STANFORD CLASSIFICATION:
Cluster 1: 1-19 (good prognosis)
Cluster 2: 20-26 (long-term survivors)
Cluster 3: 27-35 (poor prognosis)
[Figure: Heat Maps for the 15 Stanford Gene-Groups (35 Tissues); axes: Genes, Tissues]
Tissues are ordered by the Stanford classification into AC groups: AC
group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)
[Figure: Expression Profiles for Top Metagenes (Stanford 35 AC Tissues).
Panels: Gene Group 1, Gene Group 2, Gene Group 3, Gene Group 4.
Axes: Log Expression Value against Tissues.
Legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, Misallocated]
[Figures: Cluster-Specific Kaplan-Meier Plots]
Survival Analysis for Stanford Dataset
• Kaplan-Meier estimation:
Cluster   No. of Tissues   No. of Censored   Mean time to Failure (SE)
1         17               10                37.5 ± 5.0
2         5                0                 5.2 ± 2.3
A significant difference in survival between clusters (P<0.001)
• Cox’s proportional hazards analysis:
Variable                        Hazard ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1&2      13.2 (2.1 – 81.1)       0.005
Grade 3 vs. grades 1 or 2       1.94 (0.5 – 8.5)        0.38
Tumor size                      0.96 (0.3 – 2.8)        0.93
No. of tumors in lymph nodes    1.65 (0.7 – 3.9)        0.25
Presence of metastases          4.41 (1.0 – 19.8)       0.05
Survival Analysis for Stanford Dataset
• Univariate Cox’s proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          1.37 (0.44)        0.002
2          -0.24 (0.31)       0.44
3          0.14 (0.34)        0.68
4          -1.01 (0.56)       0.07
5          0.66 (0.65)        0.31
6          -0.63 (0.50)       0.20
7          -0.68 (0.57)       0.24
8          0.75 (0.46)        0.10
9          -1.13 (0.50)       0.02
10         0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
Survival Analysis for Stanford Dataset
• Multivariate Cox’s proportional hazards analysis (metagenes):
Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031
The final model consists of four metagenes.
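The coefficients and standard errors reported for the multivariate analysis can be turned into hazard ratios and Wald tests as sketched below. The helper name `wald_summary` is illustrative, and the use of a two-sided normal (Wald) test is an assumption: the slides do not say which test produced the reported P-values, so the recomputed values should only be expected to agree approximately.

```python
import math

# (metagene, coefficient, standard error, reported P-value), copied from the
# multivariate Cox analysis of the Stanford metagenes.
fitted = [
    (1, 3.44, 0.95, 0.0003),
    (2, -1.60, 0.62, 0.010),
    (8, -1.55, 0.73, 0.033),
    (11, 1.16, 0.54, 0.031),
]


def wald_summary(coef, se, z_crit=1.96):
    """Hazard ratio, approximate 95% CI and two-sided Wald P-value."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return {
        "hazard_ratio": math.exp(coef),
        "ci": (math.exp(coef - z_crit * se), math.exp(coef + z_crit * se)),
        "z": z,
        "p_value": p_value,
    }


for metagene, coef, se, reported_p in fitted:
    s = wald_summary(coef, se)
    print(metagene, round(s["hazard_ratio"], 2),
          tuple(round(v, 2) for v in s["ci"]),
          round(s["p_value"], 4), reported_p)
```

A positive coefficient corresponds to a hazard ratio above 1 (higher metagene expression, higher hazard) and a negative coefficient to a hazard ratio below 1.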
STANFORD DATA: Support Vector Machine
(SVM) with Recursive Feature Elimination (RFE)
[Figure: Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector
Machine (SVM) against log2 (number of genes), applied to g=2 clusters]
CONCLUSIONS
We applied a model-based clustering approach to
classify tumors using their gene signatures into:
(a) clusters corresponding to tumor type
(b) clusters corresponding to clinical outcomes
for tumors of a given subtype
In (a), almost perfect correspondence between
cluster and tumor type, at least for non-AC
tumors (but not in the Ontario dataset).
CONCLUSIONS (cont.)
The clusters in (b) were identified with clinical
outcomes (e.g. recurrence/recurrence-free and
death/long-term survival).
We were able to show that gene-expression
data provide prognostic information, beyond
that of clinical indicators such as stage.
CONCLUSIONS (cont.)
Based on the tissue clusters, a discriminant analysis
using support vector machines (SVM) demonstrated
further the potential of gene expression as a tool for
guiding treatment and patient care for lung
cancer patients.
This supervised classification procedure was used to
provide marker genes for prediction of clinical
outcomes.
(In addition to those provided by the cluster-genes
step in the initial unsupervised classification.)
LIMITATIONS
Small number of tumors available (e.g. Ontario and
Stanford datasets).
Clinical data available for only subsets of the tumors;
often for only one tumor type (AC).
High proportion of censored observations limits
comparison of survival rates.