Class Discovery & Prediction
Molecular
Classification
of Cancer
Class Discovery and Class
Prediction by Gene Expression
Monitoring
Overview
Motivation
Microarray Background
Our Test Case
Class Prediction
Class Discovery
Motivation
Importance of cancer classification
Cancer classification has historically
relied on specific biological insights
We will discuss a systematic and
unbiased approach for recognizing
tumor subtypes
Microarray Background
Microarrays enable simultaneous
measurement of the expression levels
of thousands of genes in a sample
Microarray:
– Glass slide with a matrix of thousands of
spots printed on to it
– Each spot contains probes which bind to a
specific gene
Microarray Background (cont.)
The process:
– RNA samples are taken from the
test subjects
– Samples are dyed with fluorescent
colors and placed on the Microarray
– Hybridization of DNA and cDNA
The result:
– Spots in the array
are dyed in shades
of red to green
Microarray Background (cont.)
          Sample 1   Sample 2
Gene 1    1.04       2.08
Gene 2    3.2        10.5
Gene 3    3.34       1.05
Gene 4    1.85       0.09
Microarray data is translated into an n x p table
(p – number of genes, n – number of samples)
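A minimal sketch of how such a table could be held in memory, assuming NumPy is available. The variable names are illustrative; the values are the four example genes and two samples shown on this slide.

```python
import numpy as np

# Example values from the slide, laid out with genes as rows.
genes = ["Gene 1", "Gene 2", "Gene 3", "Gene 4"]
samples = ["Sample 1", "Sample 2"]
by_gene = np.array([
    [1.04, 2.08],
    [3.20, 10.5],
    [3.34, 1.05],
    [1.85, 0.09],
])

# Transposing gives the n x p layout: n samples as rows, p genes as columns.
X = by_gene.T
n, p = X.shape
print(n, p)
```

With this layout, X[i] is the expression profile of sample i and X[:, j] is the expression of gene j across all samples.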
Demonstration
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Our Test Case
38 bone marrow samples from acute
leukemia patients (27 ALL, 11 AML)
RNA from the samples was hybridized
to microarrays containing probes for
6817 human genes
For each gene, an expression level was
obtained
Class Prediction
Initial collection of samples belonging to
known classes
Goal: create a “class predictor” to
classify new samples
– Look for “informative genes”
– Make a prediction based on these genes
– Test the validity of the predictor
Informative genes
Genes whose expression pattern is
strongly correlated with the class
distinction
[Figure: a strongly correlated gene and a poorly correlated gene]
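The slides do not state how the correlation is measured. As a sketch, assume a signal-to-noise style score per gene: the difference of the class means divided by the sum of the class standard deviations. The function name and toy data below are illustrative only.

```python
import numpy as np

def signal_to_noise(X, y):
    """Score each gene by (mean_1 - mean_0) / (std_1 + std_0).

    X: (n_samples, n_genes) expression matrix.
    y: (n_samples,) array of 0/1 class labels.
    Positive scores mean higher expression in class 1.
    Assumes no gene is constant within both classes (no zero denominator).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 1], X[y == 0]
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0))

# Toy data (made up): gene 0 tracks the class, gene 1 does not.
X = np.array([[9.0, 5.0], [11.0, 1.0], [1.0, 5.0], [3.0, 1.0]])
y = np.array([1, 1, 0, 0])
scores = signal_to_noise(X, y)
```

Here gene 0 gets a large score and gene 1 a score of zero, matching the strongly/poorly correlated contrast on the slide.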
Neighborhood Analysis
Are the observed correlations stronger
than would be expected by chance?
C represents the AML/ALL
class distinction
C* is a random permutation of C.
Represents a random class
distinction
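One way to sketch this comparison is a permutation test: count the genes whose score exceeds a threshold under the real labels C, then repeat the count under many shuffled labelings C*. The score function, the threshold, and the synthetic data are assumptions made for illustration, not values from the study.

```python
import numpy as np

def snr(X, y):
    """Per-gene score (mean_1 - mean_0) / (std_1 + std_0); an assumed measure."""
    a, b = X[y == 1], X[y == 0]
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0))

def neighborhood_analysis(X, y, threshold, n_perm=200, seed=0):
    """Count genes with |score| >= threshold for the real labels C and for
    n_perm random permutations C* of those labels."""
    rng = np.random.default_rng(seed)
    observed = int(np.sum(np.abs(snr(X, y)) >= threshold))
    permuted = np.array([
        int(np.sum(np.abs(snr(X, rng.permutation(y))) >= threshold))
        for _ in range(n_perm)
    ])
    return observed, permuted

# Synthetic data (not from the study): 20 samples, 100 genes,
# with the first 10 genes shifted upward in class 1.
rng = np.random.default_rng(1)
y = np.array([1] * 10 + [0] * 10)
X = rng.normal(size=(20, 100))
X[y == 1, :10] += 6.0

observed, permuted = neighborhood_analysis(X, y, threshold=1.0)
print(observed, np.median(permuted))
```

If the observed count is well above the counts obtained for the permuted labels, the correlations are stronger than expected by chance.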
Application to the Test Case
Roughly 1100 genes were more highly
correlated with the AML-ALL class distinction
than would be expected by chance
Make a Prediction
Use a fixed subset of “informative
genes” (most correlated with the class
distinction)
Make a prediction on the basis of the
expression level of these genes in a
new sample
Prediction Algorithm
Each gene $G_i$ votes, depending on whether
its expression level $X_i$ in the sample is closer
to $\mu_{AML}$ or $\mu_{ALL}$
The magnitude of the vote is $W_i V_i$
– $W_i$ reflects how well the gene is correlated with
the class distinction
– $V_i = \left| X_i - \frac{\mu_{AML} + \mu_{ALL}}{2} \right|$
reflects the deviation of $X_i$ from the average of $\mu_{ALL}$
and $\mu_{AML}$
Prediction Algorithm (cont.)
The votes for each class are summed to
obtain total votes $V_{AML}$ and $V_{ALL}$
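The voting step can be sketched as follows. Each gene's vote has magnitude weight times the distance of its expression level from the midpoint of the two class means, and goes to the class whose mean is closer. The function name, the tie-breaking rule, and the numbers are illustrative assumptions.

```python
import numpy as np

def gene_votes(x, mu_aml, mu_all, w):
    """Return the total votes (V_AML, V_ALL) for one sample.

    x: expression levels of the informative genes in the new sample.
    mu_aml, mu_all: per-gene mean expression in each training class.
    w: non-negative per-gene weights.
    """
    x, mu_aml, mu_all, w = (np.asarray(a, dtype=float)
                            for a in (x, mu_aml, mu_all, w))
    v = np.abs(x - (mu_aml + mu_all) / 2.0)   # deviation from the midpoint
    votes = w * v                             # magnitude of each gene's vote
    # A gene votes AML if x is closer to the AML mean; ties go to ALL here.
    for_aml = np.abs(x - mu_aml) < np.abs(x - mu_all)
    return votes[for_aml].sum(), votes[~for_aml].sum()

# Made-up example with two genes.
v_aml, v_all = gene_votes(x=[9.0, 7.0],
                          mu_aml=[10.0, 2.0],
                          mu_all=[2.0, 8.0],
                          w=[1.0, 0.5])
print(v_aml, v_all)
```

In this example the first gene votes AML with magnitude 3.0 and the second votes ALL with magnitude 1.0.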
Prediction Algorithm (cont.)
The prediction strength is calculated:
$PS = \frac{V_{win} - V_{lose}}{V_{win} + V_{lose}}$
The sample is assigned to the winning
class provided that the PS exceeds a
predetermined threshold
(0.3 in the test case)
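A small sketch of the decision rule, assuming the two vote totals have already been computed. The function name and the "uncertain" label are illustrative; the 0.3 default is the threshold quoted on the slide.

```python
def prediction_strength(v_aml, v_all, threshold=0.3):
    """Return (label, PS), where PS = (V_win - V_lose) / (V_win + V_lose).

    The label is "uncertain" unless PS exceeds the threshold.
    """
    v_win, v_lose = max(v_aml, v_all), min(v_aml, v_all)
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total > 0 else 0.0
    if ps <= threshold:
        return "uncertain", ps
    return ("AML" if v_aml > v_all else "ALL"), ps

label, ps = prediction_strength(3.0, 1.0)      # clear margin
weak_label, weak_ps = prediction_strength(1.2, 1.0)  # narrow margin
print(label, ps, weak_label, weak_ps)
```

PS ranges from 0 (votes evenly split) to 1 (all votes for one class), so the threshold controls how often the predictor declines to make a call.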
Testing the Validity of Class
Predictors
Cross Validation
– withhold a sample
– build a predictor based on the remaining
samples
– predict the class of the withheld sample
– repeat for each sample
Assess accuracy on an independent set
of samples
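The cross-validation loop can be sketched generically. The nearest-class-mean predictor below is only a stand-in so the example runs; it is not the weighted-voting predictor from the slides. All names and data are illustrative.

```python
import numpy as np

def leave_one_out(X, y, fit, predict):
    """Withhold each sample in turn, fit on the rest, predict the withheld one."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    predictions = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit(X[keep], y[keep])        # built without sample i
        predictions.append(predict(model, X[i]))
    return np.array(predictions)

# Stand-in predictor (an assumption): assign to the nearest class mean.
def fit_means(X, y):
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_nearest(model, x):
    return min(model, key=lambda label: np.linalg.norm(x - model[label]))

# Made-up data with two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
y = np.array(["ALL", "ALL", "ALL", "AML", "AML", "AML"])
preds = leave_one_out(X, y, fit_means, predict_nearest)
accuracy = float(np.mean(preds == y))
print(accuracy)
```

Any gene selection belongs inside fit, so that the withheld sample never influences which genes the predictor uses.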
Application to the Test Case
50 genes most
highly correlated
with the AML-ALL
distinction were
chosen
A class predictor
based on these
genes was built
Application to the Test Case
Performance in cross validation:
– Out of 38 samples there were 36
predictions and 2 uncertainties (PS < 0.3)
– 100% accuracy
– PS median 0.77
Application to the Test Case
(cont.)
Performance on an independent set of
samples:
– Out of 34 samples there were 29
predictions and 5 uncertainties (PS < 0.3)
– 100% accuracy
– PS median 0.73
Comments
Why 50 genes?
– Large enough to be robust against noise
– Small enough to be readily applied in a clinical
setting
– Predictors based on 10 to 200 genes all
performed well
Genes useful for cancer class prediction
may also provide insight into cancer
pathogenesis and pharmacology
Comments (cont.)
Creation of a new predictor involves
expression analysis of thousands of
genes
Application of the predictor then
requires only monitoring the expression
level of a few informative genes
Class Discovery
Cluster tumors by gene expression
– Apply a clustering technique to produce
presumed classes
Evaluation of the Classes:
– Are the classes meaningful?
– Do they reflect true structure?
Clustering Technique - SOMs
SOMs – Self Organizing Maps
Well suited for identifying a small
number of prominent classes
– Find an optimal set of “centroids”
– Partition the data set according to the centroids
– Each centroid defines a cluster consisting of the
data points nearest to it
We won't go into details about the
calculation of SOMs
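For readers who want a feel for the idea anyway, here is a minimal sketch of a one-dimensional SOM, assuming a linear decay of the learning rate and neighborhood width. The parameter values, function names, and toy data are illustrative and are not taken from the study.

```python
import numpy as np

def train_som_1d(X, n_nodes=2, n_iter=500, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D self-organizing map and return its centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start the centroids at randomly chosen data points.
    centroids = X[rng.choice(len(X), size=n_nodes, replace=False)].copy()
    grid = np.arange(n_nodes)                 # node positions on the map
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)               # learning rate decays to 0
        sigma = max(sigma0 * (1.0 - frac), 1e-3)  # neighborhood shrinks
        x = X[rng.integers(len(X))]           # pick a random sample
        winner = np.argmin(np.linalg.norm(centroids - x, axis=1))
        # Nodes close to the winner on the map are pulled along with it.
        influence = np.exp(-((grid - winner) ** 2) / (2.0 * sigma ** 2))
        centroids += lr * influence[:, None] * (x - centroids)
    return centroids

def assign(X, centroids):
    """Label each data point with the index of its nearest centroid."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Made-up data: two tight groups of three points each.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
centroids = train_som_1d(X)
labels = assign(X, centroids)
print(labels)
```

Each centroid ends up near one group, and the nearest-centroid assignment gives the partition into clusters described above.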
Application of a two-cluster
SOM to the test case
Class A1:
24 ALL, 1 AML
Class A2:
10 AML, 3 ALL
Quite effective at automatically discovering the
two types of leukemia
Not perfect
Evaluation of the Classes
How can we evaluate such classes if
the “right” answer is not already known?
Hypothesis: class discovery can be
tested by class prediction
– If the classes reflect true structure, then a
class predictor based on them should
perform well
Let’s test this hypothesis...
Validity of Predictors Based on
A1 and A2
Predictors based on different numbers
of informative genes performed well
For example: a 20-gene predictor
Validity of Predictors Based on
A1 and A2 (cont.)
Performance on
independent
samples:
– PS median 0.61
– Prediction made for
74% of samples
Validity of Predictors Based on
A1 and A2 (cont.)
Performance in
cross validation:
– 34 accurate
predictions with
high prediction
strength
– One error
– Three uncertain predictions
[Figure annotations: the one cross-validation error; 2 of the 3 uncertain cross-validation predictions]
Iterative Procedure
Use a SOM to initially cluster the data
Construct a predictor
Remove samples that are not correctly
predicted in cross-validation
Use the remaining samples to generate
an improved predictor
Test on an independent data set
Validity of Predictors Based on
Random Clusters
Performance:
– Poor accuracy in
cross validation
– Low PS on
independent
samples
Conclusion
The AML-ALL distinction could have
been automatically discovered and
confirmed without previous biological
knowledge
Application of a 4-cluster SOM
to the Test Case
Evaluation of the Classes
Complement approach:
– Construct class predictors to distinguish
each class from its complement
Pair-wise approach:
– Construct class predictors to distinguish
between each pair of classes $C_i$, $C_j$
– Perform cross validation only on samples
in $C_i$ and $C_j$
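The two approaches differ only in how the two-class problems are set up. The following sketch builds the binary tasks for each approach from a list of cluster labels; the function names and the example labels are illustrative assumptions.

```python
from itertools import combinations
import numpy as np

def complement_tasks(labels):
    """One task per class: True for samples in the class, False for the rest."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    return {c: labels == c for c in classes}

def pairwise_tasks(labels):
    """One task per pair (ci, cj): indices of the samples in either class,
    and a target that is True for the samples in ci."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    tasks = {}
    for ci, cj in combinations(classes, 2):
        idx = np.flatnonzero((labels == ci) | (labels == cj))
        tasks[(ci, cj)] = (idx, labels[idx] == ci)
    return tasks

# Made-up cluster labels for six samples.
labels = ["B1", "B2", "B3", "B4", "B1", "B3"]
comp = complement_tasks(labels)
pairs = pairwise_tasks(labels)
idx_34, target_34 = pairs[("B3", "B4")]
print(len(comp), len(pairs), idx_34, target_34)
```

With four classes this gives four complement tasks and six pairwise tasks; in the pairwise case only the samples returned in idx take part in cross validation.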
Evaluation of the Classes
Class predictors distinguished the
classes from one another, with the
exception of B3 versus B4
Conclusion
The results suggest the merging of
classes B3 and B4
The distinction corresponding to AML,
B-ALL and T-ALL was confirmed
Uses of Class Discovery
Identify fundamental subtypes of any
cancer
Search for fundamental mechanisms
that cut across distinct types of cancers
Questions?
Thank you for listening