Essential Bioinformatics and Biocomputing (LSM2104)


CZ5225: Modeling and Simulation in Biology
Lecture 6, Microarray Cancer Classification
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1,
National University of Singapore
Why Microarray?
• Although there have been some improvements over the
past 30 years, there is still no general way of:
– Identifying new cancer classes
– Assigning tumors to known classes
• The paper introduces two general approaches:
– Class prediction for a new tumor
– Class discovery of new, unknown subclasses
– Without using prior biological knowledge
Why Microarray?
• Why do we need to classify cancers?
– The general way of treating cancer is to:
• Categorize the cancers into different classes
• Use a specific treatment for each class
• Traditional way:
– Morphological appearance
Why Microarray?
• Why are traditional ways not enough?
– Some tumors in the same class have completely
different clinical courses
• Maybe a more accurate classification is needed
– Assigning new tumors to known cancer classes is not easy
• e.g. assigning an acute leukemia tumor to one of:
– AML (acute myeloid leukemia)
– ALL (acute lymphoblastic leukemia)
Cancer Classification
• Class discovery
– Identifying new cancer classes
• Class Prediction
– Assigning tumors to known classes
Cancer Genes and Pathways
• 15 cancer-related pathways, 291 cancer genes, 34 angiogenesis genes,
12 tumor immune tolerance genes
• Nature Medicine 10, 789-799 (2004); Nature Reviews Cancer 4, 177-183 (2004);
6, 613-625 (2006); Critical Reviews in Oncology/Hematology 59, 40-50 (2006)
• http://bidd.nus.edu.sg/group/trmp/trmp.asp
Disease outcome prediction with microarray
[Figure: expression profiles of patients and normal persons are fed to an SVM,
which selects the most discriminative genes; these signatures (predictor-genes)
give better predictive power and clues to disease genes and drug targets]
Disease outcome prediction with microarray
Expected features of signatures:
Composition:
• Certain percentages of cancer genes, genes in cancer pathways,
and angiogenesis genes
Stability:
• Similar set of predictor-genes across different patient compositions,
measured under the same or similar conditions
How many genes should be in a signature?
Class                                                        | No. of genes or pathways
Cancer genes (oncogenes, tumor suppressors, stability genes) | 219
Cancer pathways                                              | 15
Angiogenesis                                                 | 34
Cancer immune tolerance                                      | 15
Class Prediction
• How could one use an initial collection of samples
belonging to known classes to create a class predictor?
– Gathering samples
– Hybridizing RNAs to the microarray
– Obtaining the quantitative expression level of each gene
– Identifying informative genes via neighborhood analysis
– Weighted votes
Neighborhood Analysis
• We want to identify the genes whose expression patterns are
strongly correlated with the class distinction to be predicted,
while ignoring the other genes
– Each gene is represented by an expression vector consisting of its
expression level in each sample
– Count the number of genes having various levels of correlation with
the idealized class vector c
– Compare with the correlation obtained for randomly permuted versions of c
• The results show an unusually high density of correlated genes!
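As a concrete sketch of neighborhood analysis, the fragment below (my own illustration, not the paper's code) scores each gene with the Golub-style signal-to-noise correlation P(g, c) = (mean1 - mean2)/(std1 + std2), counts genes whose |P(g, c)| exceeds a threshold, and compares that count with what random permutations of the class labels would give. The function names and the threshold are invented for this example.

```python
import numpy as np

def signal_to_noise(expr, labels):
    """Golub-style correlation P(g, c) for each gene:
    (mean in class 0 - mean in class 1) / (std in class 0 + std in class 1)."""
    c0, c1 = expr[:, labels == 0], expr[:, labels == 1]
    return (c0.mean(axis=1) - c1.mean(axis=1)) / (c0.std(axis=1) + c1.std(axis=1))

def permutation_neighborhood(expr, labels, threshold, n_perm=200, seed=0):
    """Count genes with |P(g, c)| >= threshold for the real class vector c,
    and compare with the average count under random permutations of c."""
    rng = np.random.default_rng(seed)
    observed = int(np.sum(np.abs(signal_to_noise(expr, labels)) >= threshold))
    perm_counts = [
        int(np.sum(np.abs(signal_to_noise(expr, rng.permutation(labels))) >= threshold))
        for _ in range(n_perm)
    ]
    return observed, float(np.mean(perm_counts))
```

An "unusually high density of correlated genes" corresponds to the observed count being far above the permutation average.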
Idealized expression pattern
Neighborhood analysis
Class Predictor
• The general approach:
– Choose a set of informative genes based on
their correlation with the class distinction
– Each informative gene casts a weighted vote
for one of the classes
– Sum the votes to determine the winning class
and the prediction strength
Computing Votes
• Each gene Gi votes for AML or ALL depending on:
– whether the expression level xi of the gene in the new sample is
nearer to the mean of Gi in AML or in ALL
• The value of the vote is wi·vi, where:
• wi reflects how well Gi is correlated with the class distinction
• vi = | xi – (AML mean + ALL mean) / 2 |
• The prediction strength reflects the margin of victory:
– (Vwin – Vlose) / (Vwin + Vlose)
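The weighted-vote computation can be written out directly from the formulas on this slide. This is a minimal sketch, not the paper's exact code: the weight wi is taken as the signal-to-noise correlation, and the function names and nearest-mean tie handling are my own simplifications.

```python
import numpy as np

def train_vote_params(expr, labels):
    """Per-gene class means and weights.  expr: genes x samples;
    labels: 0 = ALL, 1 = AML.  w_i is the signal-to-noise correlation."""
    all_s, aml_s = expr[:, labels == 0], expr[:, labels == 1]
    mu_all, mu_aml = all_s.mean(axis=1), aml_s.mean(axis=1)
    w = (mu_all - mu_aml) / (all_s.std(axis=1) + aml_s.std(axis=1))
    return mu_all, mu_aml, w

def predict_with_strength(x, mu_all, mu_aml, w):
    """Each gene votes for the class whose mean is nearer to x_i; the vote
    value is |w_i| * v_i with v_i = |x_i - (mu_ALL_i + mu_AML_i)/2|.
    Prediction strength = (V_win - V_lose) / (V_win + V_lose)."""
    v = np.abs(w) * np.abs(x - (mu_all + mu_aml) / 2.0)
    nearer_all = np.abs(x - mu_all) < np.abs(x - mu_aml)
    v_all, v_aml = v[nearer_all].sum(), v[~nearer_all].sum()
    label = "ALL" if v_all > v_aml else "AML"
    strength = (max(v_all, v_aml) - min(v_all, v_aml)) / (v_all + v_aml)
    return label, strength
```

A strength near 1 means nearly all weighted votes fell on one side; a strength near 0 means the sample sits between the two classes (the "uncertain" calls on the next slides).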
Class Predictor
Evaluation
• Data
– Initial samples
• 38 bone marrow samples (27 ALL, 11 AML) obtained at the
time of diagnosis
– Independent samples
• 34 leukemia samples, consisting of 24 bone marrow and 10
peripheral blood samples (20 ALL, 14 AML)
• Validation of gene voting
– Initial samples
• 36 of the 38 samples were predicted as either AML or ALL and
two as uncertain; all 36 predictions agree with the clinical diagnosis
– Independent samples
• 29 of the 34 samples are strongly predicted, with 100% accuracy
Validation of Gene Voting
An early kind of analysis: unsupervised
learning → learning disease sub-types
[Scatter plot of samples in p53–Rb expression space]
Sub-type learning: seeking ‘natural’
groupings & hoping that they will be useful…
[Scatter plot: clusters of samples in p53–Rb expression space]
E.g., for treatment
[Scatter plot in p53–Rb space: one cluster responds to treatment Tx1,
the other does not]
The ‘one-solution fits all’ trap
[Scatter plot in p53–Rb space: the same clusters split differently into
responders and non-responders to treatment Tx2]
A more modern view:
supervised learning
[Diagram: train instances with variables A–E (rows A1…E1 through An…En)
are fed to an inductive algorithm, which outputs a classifier or regression
model; the model is then applied to application instances and scored for
classification performance]
Predictive Biomarkers & Supervised Learning
[Same diagram as the previous slide: the variables used by the learned
classifier are the predictive biomarkers]
Predictive Biomarkers & Supervised Learning
A more modern view 2:
Unsupervised learning as structure learning
[Diagram: train instances with variables A–E are fed to an inductive
algorithm, which outputs a structure (a graph over A–E) and is scored
for performance]
Causative biomarkers
& (structural) unsupervised learning
[Same diagram: the variables connected to the outcome in the learned
structure are the causative biomarkers]
Supervised learning:
the geometrical interpretation
[Scatter plot in p53–Rb space: cancer patients (+P1…+P5) and normals;
an SVM classifier line separates the two groups; a new case falling on
the cancer side is classified as cancer, one on the normal side as normal]
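The geometry above can be reproduced on toy data with a minimal linear SVM. The sketch below is my own Pegasos-style sub-gradient trainer on the hinge loss, not the lecture's implementation; the two features stand in for the p53 and Rb expression axes, and the cluster locations are invented.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=300, seed=0):
    """Tiny linear SVM trained by stochastic sub-gradient descent on the
    hinge loss (Pegasos-style).  X: samples x 2; y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w, b, t = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)                 # decreasing step size
            if y[i] * (X[i] @ w + b) < 1:         # margin violated
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:                                  # only shrink (regularize)
                w = (1 - eta * lam) * w
    return w, b

def classify(X, w, b):
    """+1 on one side of the separating line, -1 on the other."""
    return np.sign(X @ w + b)
```

The learned (w, b) define the separating line w·x + b = 0; a new case is classified by which side of the line it falls on, exactly as in the figure.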
If 2D looks good, what happens in 3D?
• 10,000-50,000 (regular gene expression
microarrays, aCGH, and early SNP arrays)
• 500,000 (tiled microarrays, SNP arrays)
• 10,000-300,000 (regular MS proteomics)
• >10,000,000 (LC-MS proteomics)
This is the ‘curse of dimensionality’ problem
Problems associated with high-dimensionality
(especially with small samples)
• Some methods do not run at all (classical regression)
• Some methods give bad results
• Very slow analysis
• Very expensive/cumbersome clinical application
Solution 1: dimensionality reduction
[Scatter plot: Gene X (0–100) vs Gene Y (0–400) for normal subjects and
cancer patients; the first principal component (PC1) lies along the line
PC1: 3X-Y=0]
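Dimensionality reduction by principal components can be computed directly with an SVD of the centered data matrix. The sketch below is my own numpy-only illustration (not the lecture's software): when the data lie roughly along Y ≈ 3X, as in the plot, the recovered PC1 points along the direction (1, 3).

```python
import numpy as np

def first_principal_component(X):
    """Unit vector along the direction of maximal variance (PC1).
    X: samples x features, e.g. columns = (Gene X, Gene Y)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]                                 # rows of vt are the PCs

def project_on_pc1(X):
    """1-D representation of the data: projection onto PC1."""
    pc1 = first_principal_component(X)
    return (X - X.mean(axis=0)) @ pc1
```

Replacing thousands of correlated gene coordinates with a handful of such projections is exactly the "Solution 1" idea: fewer dimensions, most of the variance kept.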
Solution 2: feature selection
[Figure: graph over variables A–T; feature selection keeps only the small
subset of variables relevant to the target]
Another (very real and unpleasant) problem
Over-fitting
• Over-fitting (a model to your data) = building a model that
is good on the original data but fails to generalize well to
fresh data
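Over-fitting is easy to demonstrate on synthetic data. The sketch below is my own illustration, unrelated to the lecture's datasets: a line and a degree-9 polynomial are fit to noisy linear data. The flexible model always achieves the lower training error, but its error on fresh test points exposes the lack of generalization.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Least-squares polynomial fit; returns (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return (mse(y_train, np.polyval(coeffs, x_train)),
            mse(y_test, np.polyval(coeffs, x_test)))

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 12)
x_test = np.linspace(-0.95, 0.95, 12)
truth = lambda x: 2 * x                     # the underlying signal is a line
y_train = truth(x_train) + rng.normal(0, 0.3, 12)
y_test = truth(x_test) + rng.normal(0, 0.3, 12)

tr1, te1 = fit_and_score(x_train, y_train, x_test, y_test, degree=1)
tr9, te9 = fit_and_score(x_train, y_train, x_test, y_test, degree=9)
# The degree-9 model fits the 12 training points almost perfectly (lower
# training error than the line) but does worse on the fresh test points.
```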
Over-fitting is directly related to the complexity of
decision surface (relative to the complexity of
modeling task)
[Plot of outcome of interest Y vs predictor X: an overly complex decision
surface follows the training data closely but misses the test data]
Over-fitting is also caused by multiple
validations & small samples
General population:
AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%

Training & validation phase:
1. Modeling sample MS_1 (sample used for training & validation):
   train Model_1, Model_2, Model_3; validate →
   AUC of Model_1 = 88%, Model_2 = 76%, Model_3 = 63%
2. Modeling sample MS_n (sample not used for training & validation):
   train Model_1, Model_2, Model_3; validate →
   AUC of Model_1 = 61%, Model_2 = 87%, Model_3 = 67%

Validation with independent dataset:
3. Independent evaluation sample ES_1 (a sample in which over-fitting is detected):
   evaluate → AUC of Model_1 = 65%
4. Independent evaluation sample ES_2 (a sample in which over-fitting is not detected):
   evaluate → AUC of Model_1 = 84%
Over-fitting is also caused by multiple
validations & small samples
General population:
AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%

Training & validation phase:
1. Modeling sample MS_1 (sample not used for training & validation):
   train Model_1, Model_2, Model_3; validate →
   AUC of Model_1 = 88%, Model_2 = 76%, Model_3 = 63%
2. Modeling sample MS_n (sample used for training & validation):
   train Model_1, Model_2, Model_3; validate →
   AUC of Model_1 = 61%, Model_2 = 87%, Model_3 = 67%

Validation with independent dataset:
3. Independent evaluation sample ES_1 (a sample falsely detecting over-fitting):
   evaluate → AUC of Model_2 = 74%
4. Independent evaluation sample ES_2 (a sample not detecting over-fitting):
   evaluate → AUC of Model_2 = 90%
A method to produce realistic performance
estimates: nested n-fold cross-validation
Dataset: predictor variables plus an outcome variable, split into parts P1, P2, P3

Outer loop: cross-validation for performance estimation
  Training set | Testing set | C | Accuracy
  P1, P2       | P3          | 1 | 89%
  P1, P3       | P2          | 2 | 84%
  P2, P3       | P1          | 1 | 76%
  Average accuracy: 83%

Inner loop: cross-validation for model selection (shown for one outer training set)
  Training set | Validation set | C=1 | C=2
  P1           | P2             | 86% | 70%
  P2           | P1             | 84% | 90%
  Average accuracy               | 85% | 80%
  → Choose C=1 since it maximizes accuracy
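The nested-CV scheme above can be sketched in a few lines. This is my own minimal illustration: the slide's hyperparameter C is played here by the k of a plain k-nearest-neighbour classifier, and the fold construction is simplified (random outer folds, 2-fold inner loop). The key property is preserved: the outer test fold never influences model selection.

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k):
    """Plain k-nearest-neighbour majority vote (stand-in classifier)."""
    preds = []
    for x in Xte:
        d = np.linalg.norm(Xtr - x, axis=1)
        nn = ytr[np.argsort(d)[:k]]
        preds.append(np.bincount(nn).argmax())
    return np.array(preds)

def nested_cv(X, y, ks=(1, 3), n_outer=3, seed=0):
    """Outer loop estimates performance; inner loop picks k using only
    the outer training part, so the test fold stays untouched."""
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(X)), n_outer)
    outer_accs = []
    for i in range(n_outer):
        test = outer_folds[i]
        train = np.concatenate([outer_folds[j] for j in range(n_outer) if j != i])
        # Inner loop: 2-fold CV on the training part to choose k.
        half = len(train) // 2
        inner = [(train[:half], train[half:]), (train[half:], train[:half])]
        best_k, best_acc = ks[0], -1.0
        for k in ks:
            accs = [np.mean(knn_predict(X[a], y[a], X[b], k) == y[b])
                    for a, b in inner]
            if np.mean(accs) > best_acc:
                best_k, best_acc = k, float(np.mean(accs))
        pred = knn_predict(X[train], y[train], X[test], best_k)
        outer_accs.append(float(np.mean(pred == y[test])))
    return float(np.mean(outer_accs))      # realistic performance estimate
```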
How well does supervised learning work in practice?
Datasets
• Bhattacharjee2 – Lung cancer vs normals [GE/DX]
• Bhattacharjee2_I – Lung cancer vs normals on common genes between Bhattacharjee2 and Beer [GE/DX]
• Bhattacharjee3 – Adenocarcinoma vs squamous [GE/DX]
• Bhattacharjee3_I – Adenocarcinoma vs squamous on common genes between Bhattacharjee3 and Su [GE/DX]
• Savage – Mediastinal large B-cell lymphoma vs diffuse large B-cell lymphoma [GE/DX]
• Rosenwald4 – 3-year lymphoma survival [GE/CO]
• Rosenwald5 – 5-year lymphoma survival [GE/CO]
• Rosenwald6 – 7-year lymphoma survival [GE/CO]
• Adam – Prostate cancer vs benign prostate hyperplasia and normals [MS/DX]
• Yeoh – Classification between 6 types of leukemia [GE/DX-MC]
• Conrads – Ovarian cancer vs normals [MS/DX]
• Beer_I – Lung cancer vs normals (common genes with Bhattacharjee2) [GE/DX]
• Su_I – Adenocarcinoma vs squamous (common genes with Bhattacharjee3) [GE/DX]
• Banez – Prostate cancer vs normals [MS/DX]
Methods: Gene Selection Algorithms
• ALL – no feature selection
• LARS – LARS
• HITON_PC – HITON_PC
• HITON_PC_W – HITON_PC + wrapping phase
• HITON_MB – HITON_MB
• HITON_MB_W – HITON_MB + wrapping phase
• GA_KNN – GA/KNN
• RFE – RFE with validation of feature subset with optimized polynomial kernel
• RFE_Guyon – RFE with validation of feature subset with linear kernel (as in Guyon)
• RFE_POLY – RFE (with polynomial kernel) with validation of feature subset with optimized polynomial kernel
• RFE_POLY_Guyon – RFE (with polynomial kernel) with validation of feature subset with linear kernel (as in Guyon)
• SIMCA – SIMCA (Soft Independent Modeling of Class Analogy): PCA-based method
• SIMCA_SVM – SIMCA with validation of feature subset by SVM
• WFCCM_CCR – Weighted Flexible Compound Covariate Method (WFCCM) applied as in the Clinical Cancer Research paper by Yamagata (analysis of microarray data)
• WFCCM_Lancet – WFCCM applied as in the Lancet paper by Yanagisawa (analysis of mass-spectrometry data)
• UAF_KW – univariate with Kruskal-Wallis statistic
• UAF_BW – univariate with ratio of between-group to within-group sum of squares
• UAF_S2N – univariate with signal-to-noise statistic
Classification Performance
(average over all tasks/datasets)
How well do dimensionality reduction and
feature selection work in practice?
Number of Selected Features
(average over all tasks/datasets)
[Bar chart comparing the number of selected features (0–10,000) for ALL,
LARS, HITONgp_PC, HITONgp_MB, HITONgp_PC_W, HITONgp_MB_W, GA_KNN, RFE,
RFE_Guyon, RFE_POLY, RFE_POLY_Guyon, SIMCA, SIMCA_SVM, WFCCM_CCR, UAF_KW,
UAF_BW, UAF_S2N]
Number of Selected Features
(zoom on most powerful methods)
[Bar chart, 0–100 selected features: RFE_Guyon, RFE_POLY_Guyon, RFE_POLY,
RFE, LARS, HITONgp_PC, HITONgp_MB, HITONgp_PC_W, HITONgp_MB_W, GA_KNN]
Number of Selected Features
(average over all tasks/datasets)