Molecular Classification of Cancer:
Class Discovery and Class Prediction by Gene
Expression Monitoring
T.R. Golub et al., Science 286, 531 (1999)
Introduction
Why is identification of cancer class (tumor subtype) important?
Cancers of identical grade can have widely
variable clinical courses (e.g. acute
lymphoblastic leukemia or acute myeloid
leukemia).
Traditional Methods:
Morphological appearance.
Enzyme-based histochemical analyses.
Immunophenotyping.
Cytogenetic analysis.
Topics of Discussion
Class Prediction (supervised learning).
Class Discovery (unsupervised learning).
Class Prediction
How could one use an initial collection of
samples belonging to known classes to create a
class predictor?
Identification of Informative Genes via
Neighborhood Analysis.
Weighted Vote
Neighborhood Analysis
Why do we want to start with
informative genes?
To be readily applied in a clinical
setting.
Highly instructive
Neighborhood Analysis
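1. Represent each gene g by its expression vector across the n samples:
   v(g) = (e1, e2, ..., en)
2. Represent the class distinction by an idealized pattern:
   c = (c1, c2, ..., cn), where ci = 1 or 0 according to whether sample i belongs to class 1 or class 2
3. Compute the correlation between v(g) and c. Possible measures:
   a. Euclidean distance
   b. Pearson correlation coefficient
   c. P(g,c) = [µ1(g) - µ2(g)]/[σ1(g) + σ2(g)], where µk(g) and σk(g) are the mean and standard deviation of the expression of gene g in class k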
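The P(g,c) score can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the gene names and expression values are hypothetical toy data, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the slides do not say which estimator is used.

```python
from statistics import mean, stdev

def signal_to_noise(expr, labels):
    """P(g,c) = (mu1 - mu2) / (sigma1 + sigma2) for a single gene.

    expr   -- expression levels of the gene, one per sample (v(g))
    labels -- class vector c: 1 for class 1, 0 for class 2
    """
    class1 = [e for e, c in zip(expr, labels) if c == 1]
    class2 = [e for e, c in zip(expr, labels) if c == 0]
    # Assumption: sample standard deviation; the slides do not specify.
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))

# Hypothetical toy data: 3 genes measured in 6 samples.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],   # high in class 1
    "gene_b": [5.0, 4.0, 6.0, 5.0, 6.0, 4.0],     # uninformative
    "gene_c": [1.0, 2.0, 3.0, 9.0, 10.0, 11.0],   # high in class 2
}

scores = {g: signal_to_noise(v, labels) for g, v in expression.items()}
# Rank genes by the magnitude of their correlation with the class vector.
ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

Genes with a large positive score are candidates for class 1 markers, and genes with a large negative score for class 2 markers; a score near zero marks a gene that carries little information about the class distinction.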
Class Predictor via Gene Voting
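1. Parameters (ag, bg) are defined for each informative gene g.
2. ag = P(g,c)
3. bg = [µ1(g) + µ2(g)]/2
4. vg = ag(xg - bg), where xg is the expression level of gene g in the new sample. A positive vg is a vote for class 1, a negative vg a vote for class 2.
5. V1 = ∑ |vg| over genes with vg > 0
6. V2 = ∑ |vg| over genes with vg < 0
7. PS = (Vwin - Vlose)/(Vwin + Vlose), where Vwin is the larger and Vlose the smaller of V1 and V2
8. The sample is assigned to the winning class if PS > threshold; otherwise it is called uncertain.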
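The voting scheme above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name predict, the gene names, the parameter values, and the default threshold of 0.3 are all hypothetical, since the slides do not state a threshold value.

```python
def predict(sample, genes, threshold=0.3):
    """Weighted vote over informative genes.

    sample    -- maps gene name -> expression level x_g in the new sample
    genes     -- maps gene name -> (a_g, b_g)
    threshold -- minimum prediction strength (assumed value, not from the slides)
    Returns (label, prediction_strength).
    """
    v1 = v2 = 0.0
    for g, (a, b) in genes.items():
        vote = a * (sample[g] - b)      # v_g = a_g (x_g - b_g)
        if vote > 0:
            v1 += vote                  # votes for class 1
        else:
            v2 += -vote                 # votes for class 2 (absolute value)
    total = v1 + v2
    if total == 0:
        return "uncertain", 0.0
    ps = abs(v1 - v2) / total           # (V_win - V_lose) / (V_win + V_lose)
    winner = "class 1" if v1 > v2 else "class 2"
    return (winner if ps > threshold else "uncertain"), ps

# Hypothetical parameters for two informative genes.
genes = {"gene_a": (4.0, 6.0), "gene_c": (-4.0, 6.0)}

clear_label, clear_ps = predict({"gene_a": 9.0, "gene_c": 2.0}, genes)
tied_label, tied_ps = predict({"gene_a": 7.0, "gene_c": 7.0}, genes)
```

In the first call both genes vote for class 1, so the prediction strength is 1; in the second the two votes cancel, the prediction strength is 0, and the sample is called uncertain.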
Data
Initial Samples: 38 bone marrow samples (27
ALL, 11 AML) obtained at the time of diagnosis.
Independent Samples: 34 leukemia samples, consisting of
24 bone marrow and 10 peripheral blood
samples (20 ALL and 14 AML).
Validation of Gene Voting
Initial Samples: 36 of the 38 samples were predicted as either
AML or ALL and two were called uncertain. All 36 predictions
agree with the clinical diagnosis.
Independent Samples: strong predictions were made for 29 of
the 34 samples, with 100% accuracy.
Class Discovery
Can cancer classes be discovered
automatically based on gene
expression?
Cluster tumors by gene expression
Determine whether the putative
classes produced are meaningful.
Cluster tumors
Self-Organizing Map (SOM)
Mathematical cluster analysis for recognizing
and classifying features in complex,
multidimensional data (similar to the k-means
approach)
Choose a geometry of “nodes”.
Nodes are mapped into K-dimensional
space, initially at random.
Iteratively adjust the nodes.
Adjusting the nodes
Randomly select a data point P.
Move the nodes in the direction of P.
The closest node Np is moved the most.
Other nodes are moved depending on their
distance from Np in the initial geometry.
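The node-adjustment loop can be sketched in plain Python as below. This is a toy illustration, not the implementation used in the paper: the one-dimensional chain geometry, the linearly decaying learning rate, the shrinking neighborhood radius, and the data points are all assumptions chosen to keep the example short.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(points, n_nodes=2, iterations=2000, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Nodes are mapped into the data space, initially at random.
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for i in range(iterations):
        progress = i / iterations
        rate = 0.5 * (1.0 - progress)              # assumed decay schedule
        radius = (n_nodes / 2) * (1.0 - progress)  # assumed shrinking neighborhood
        p = rng.choice(points)                     # randomly select a data point P
        winner = min(range(n_nodes), key=lambda k: sq_dist(nodes[k], p))
        for k in range(n_nodes):
            grid_dist = abs(k - winner)            # distance in the chain geometry
            if grid_dist == 0:
                pull = rate                        # closest node moves the most
            elif grid_dist <= radius:
                pull = rate / (1 + grid_dist)      # neighbors move less
            else:
                continue
            nodes[k] = [n + pull * (x - n) for n, x in zip(nodes[k], p)]
    return nodes

def nearest_node(point, nodes):
    return min(range(len(nodes)), key=lambda k: sq_dist(nodes[k], point))

# Hypothetical two-dimensional data with two well-separated groups.
group_a = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.0, 0.3)]
group_b = [(9.8, 10.1), (10.2, 9.9), (10.0, 10.3), (9.7, 9.6)]
nodes = train_som(group_a + group_b)
clusters = [nearest_node(p, nodes) for p in group_a + group_b]
```

After training, each sample is assigned to its nearest node, so the nodes play the role of cluster centers. On data this well separated, the two groups end up on different nodes.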
SOM
Validation of SOM
Two-cluster SOM (clusters A1 and A2) on the initial dataset:
24 of the 25 samples in cluster A1 were ALL
10 of the 13 samples in cluster A2 were AML
Validation of SOM
How could one evaluate the putative cluster if the
“right” answer were not known?
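Assumption: class discovery could be tested
by class prediction. If the putative classes reflect true
structure, a predictor built on them should perform well.
Testing of Assumption:
• Construct predictors based on clusters A1 and
A2.
• Construct predictors based on random clusters
for comparison.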
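One way to picture this test is leave-one-out cross-validation of a predictor built on the putative cluster labels, compared against the same procedure on shuffled labels. The sketch below is a simplification under stated assumptions: it votes with every gene instead of selecting informative ones, it omits the prediction-strength threshold, and the data, function names, and seed are hypothetical.

```python
import random
from statistics import mean, stdev

def build_predictor(samples, labels):
    """Return {gene_index: (a_g, b_g)} from the signal-to-noise score."""
    params = {}
    for g in range(len(samples[0])):
        x1 = [s[g] for s, c in zip(samples, labels) if c == 1]
        x2 = [s[g] for s, c in zip(samples, labels) if c == 0]
        spread = stdev(x1) + stdev(x2)
        if spread > 0:
            a = (mean(x1) - mean(x2)) / spread
            b = (mean(x1) + mean(x2)) / 2
            params[g] = (a, b)
    return params

def vote(sample, params):
    """Class 1 if the summed weighted votes are positive, else class 0."""
    total = sum(a * (sample[g] - b) for g, (a, b) in params.items())
    return 1 if total > 0 else 0

def loocv_accuracy(samples, labels):
    """Hold out each sample in turn and predict it from the rest."""
    hits = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predictor = build_predictor(train, train_labels)
        hits += vote(samples[i], predictor) == labels[i]
    return hits / len(samples)

# Hypothetical data: 8 samples x 2 genes, with putative cluster labels.
samples = [
    [10.0, 1.0], [9.0, 2.0], [11.0, 1.5], [10.5, 0.5],
    [1.0, 10.0], [2.0, 9.0], [1.5, 11.0], [0.5, 10.5],
]
cluster_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Random clusters of the same sizes, for comparison.
random_labels = cluster_labels[:]
random.Random(1).shuffle(random_labels)

cluster_accuracy = loocv_accuracy(samples, cluster_labels)
random_accuracy = loocv_accuracy(samples, random_labels)
```

If the clusters reflect real structure, cluster_accuracy should be high and should typically exceed the accuracy obtained from random labelings of the same sizes.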
Validation of SOM
Predictors based on clusters A1 and A2
yield 34 accurate predictions, one error, and
three uncertain calls.
Searching for Finer Classes
Use SOM to divide the initial samples into four
clusters (denoted B1 to B4).
B1 corresponds to AML, B2 corresponds to T-lineage ALL,
and B3 and B4 correspond to B-lineage ALL.