Class Discovery & Prediction


Molecular Classification of Cancer:
Class Discovery and Class Prediction by Gene Expression Monitoring
Overview
Motivation
 Microarray Background
 Our Test Case
 Class Prediction
 Class Discovery

Motivation
Importance of cancer classification
 Cancer classification has historically
relied on specific biological insights
 We will discuss a systematic and
unbiased approach for recognizing
tumor subtypes

Microarray Background
Microarrays enable simultaneous
measurement of the expression levels
of thousands of genes in a sample
 Microarray:

– Glass slide with a matrix of thousands of
spots printed on to it
– Each spot contains probes which bind to a
specific gene
Microarray Background (cont.)

The process:
– DNA samples are taken from the
test subjects
– Samples are dyed with fluorescent
colors and placed on the Microarray
– Hybridization of DNA and cDNA

The result:
– Spots in the array
are dyed in shades
of red to green
Microarray Background (cont.)

          Sample 1   Sample 2
Gene 1    1.04       2.08
Gene 2    3.2        10.5
Gene 3    3.34       1.05
Gene 4    1.85       0.09
Microarray data is translated into an n x p table
(p – number of genes, n – number of samples)
Demonstration
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Our Test Case
38 bone marrow samples from acute
leukemia patients (27 ALL, 11 AML)
 RNA from the samples was hybridized
to microarrays containing probes for
6817 human genes
 For each gene, an expression level was
obtained

Class Prediction
Initial collection of samples belonging to
known classes
 Goal: create a “class predictor” to
classify new samples

– Look for “informative genes”
– Make a prediction based on these genes
– Test the validity of the predictor
Informative genes

Genes whose expression pattern is
strongly correlated with the class
distinction
[Figure labels: strongly correlated; poorly correlated]
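The slides do not say how "correlation with the class distinction" is measured. As a minimal sketch, assume a signal-to-noise style score: the difference of the class means divided by the sum of the class standard deviations. The function name `signal_to_noise` and the toy expression values are made up for illustration.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels, pos="AML", neg="ALL"):
    """Assumed score for one gene: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    a = [v for v, c in zip(values, labels) if c == pos]
    b = [v for v, c in zip(values, labels) if c == neg]
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

# Toy data: six samples, two hypothetical genes.
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
strong = [9.0, 10.0, 11.0, 1.0, 2.0, 3.0]  # expression tracks the class labels
weak = [5.0, 1.0, 9.0, 2.0, 8.0, 5.0]      # expression unrelated to the labels

print(signal_to_noise(strong, labels))  # large positive score
print(signal_to_noise(weak, labels))    # score near zero
```

A large absolute score marks a gene as informative; the sign indicates in which class it is more highly expressed.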
Neighborhood Analysis

Are the observed correlations stronger
than would be expected by chance?
C represents the AML/ALL
class distinction
C* is a random permutation of C.
Represents a random class
distinction
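The comparison of C with random permutations C* can be sketched as a permutation test. This is only an illustration under stated assumptions: the score is the signal-to-noise style measure (not specified in the slides), the threshold and the toy data are invented, and the names `neighborhood_size` and `permuted_sizes` are hypothetical.

```python
import random
from statistics import mean, stdev

def score(values, labels):
    """Assumed correlation measure: (mean_AML - mean_ALL) / (sd_AML + sd_ALL)."""
    aml = [v for v, c in zip(values, labels) if c == "AML"]
    all_ = [v for v, c in zip(values, labels) if c == "ALL"]
    return (mean(aml) - mean(all_)) / (stdev(aml) + stdev(all_))

def neighborhood_size(genes, labels, threshold):
    """Count genes whose |score| against `labels` reaches `threshold`."""
    return sum(1 for g in genes if abs(score(g, labels)) >= threshold)

def permuted_sizes(genes, labels, threshold, n_perm=200, seed=0):
    """Neighborhood sizes under random class distinctions C*."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_perm):
        c_star = labels[:]
        rng.shuffle(c_star)  # same class counts, random assignment
        sizes.append(neighborhood_size(genes, c_star, threshold))
    return sizes

# Toy data (made up): 5 genes shifted in AML, 15 pure-noise genes.
rng = random.Random(1)
C = ["AML"] * 5 + ["ALL"] * 5
genes = [[rng.gauss(8.0 if (c == "AML" and g < 5) else 0.0, 1.0) for c in C]
         for g in range(20)]

observed = neighborhood_size(genes, C, threshold=1.0)
null = permuted_sizes(genes, C, threshold=1.0)
# Fraction of random distinctions with a neighborhood at least as large.
p_value = sum(1 for s in null if s >= observed) / len(null)
print(observed, p_value)
```

If few or no permutations produce as many highly scoring genes as the real labels do, the observed correlations are stronger than chance alone would explain.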
Application to the Test Case

Roughly 1100 genes were more highly
correlated with the AML-ALL class distinction
than would be expected by chance
Make a Prediction
Use a fixed subset of “informative
genes” (most correlated with the class
distinction)
 Make a prediction on the basis of the
expression level of these genes in a
new sample

Prediction Algorithm

Each gene $G_i$ votes, depending on whether
its expression level $X_i$ in the sample is closer
to $\mu_{AML}$ or $\mu_{ALL}$
The magnitude of the vote is $W_i V_i$
– $W_i$ reflects how well the gene is correlated with
the class distinction
– $V_i = \left| X_i - \frac{\mu_{AML} + \mu_{ALL}}{2} \right|$
reflects the deviation of $X_i$ from the average of $\mu_{AML}$
and $\mu_{ALL}$
Prediction Algorithm (cont.)

The votes for each class are summed to
obtain total votes $V_{AML}$ and $V_{ALL}$
Prediction Algorithm (cont.)

The prediction strength is calculated:
$PS = \frac{V_{win} - V_{lose}}{V_{win} + V_{lose}}$

The sample is assigned to the winning
class provided that the PS exceeds a
predetermined threshold
(0.3 in the test case)
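The voting scheme and the prediction strength can be sketched as follows. This is a rough illustration, not the presenters' implementation: the function name `predict`, the per-gene class means, and the weights are invented, and each weight is simply taken as a non-negative number standing in for $W_i$.

```python
def predict(sample, mu_aml, mu_all, weights, ps_threshold=0.3):
    """Weighted vote over informative genes; returns (label, PS)."""
    votes = {"AML": 0.0, "ALL": 0.0}
    for x, m_aml, m_all, w in zip(sample, mu_aml, mu_all, weights):
        v = abs(x - (m_aml + m_all) / 2)  # V_i: distance from the midpoint
        side = "AML" if abs(x - m_aml) < abs(x - m_all) else "ALL"
        votes[side] += abs(w) * v         # vote magnitude W_i * V_i
    v_win, v_lose = max(votes.values()), min(votes.values())
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total else 0.0
    winner = max(votes, key=votes.get)
    # Assign the winning class only if PS exceeds the threshold.
    return (winner if ps > ps_threshold else "uncertain"), ps

# Hypothetical class means and weights for three informative genes.
mu_aml = [10.0, 2.0, 8.0]
mu_all = [2.0, 10.0, 4.0]
weights = [1.0, 1.0, 0.5]

print(predict([9.0, 3.0, 7.0], mu_aml, mu_all, weights))  # all genes agree
print(predict([7.0, 7.0, 6.5], mu_aml, mu_all, weights))  # genes disagree
```

In the first sample every gene votes for the same class, so PS is 1. In the second the votes nearly cancel, PS falls below 0.3, and the call is reported as uncertain.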
Testing the Validity of Class
Predictors

Cross Validation
– withhold a sample
– build a predictor based on the remaining
samples
– predict the class of the withheld sample
– repeat for each sample

Assess accuracy on an independent set
of samples
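The cross-validation loop above can be sketched generically. The `train` and `predict` callables are placeholders; the nearest-class-mean classifier used here is a deliberately simple stand-in for the weighted-voting predictor, and the data are made up.

```python
def leave_one_out(samples, labels, train, predict):
    """Return (true_label, predicted_label) for each withheld sample."""
    results = []
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]  # withhold sample i
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)           # rebuilt from scratch each time
        results.append((labels[i], predict(model, samples[i])))
    return results

# Stand-in classifier on one feature: nearest class mean.
def train_means(xs, ys):
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in sorted(set(ys))}

def predict_nearest(model, x):
    return min(model, key=lambda c: abs(x - model[c]))

xs = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
ys = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
results = leave_one_out(xs, ys, train_means, predict_nearest)
accuracy = sum(t == p for t, p in results) / len(results)
print(accuracy)
```

Because the predictor is rebuilt inside the loop, any gene selection belongs inside `train` as well, so the withheld sample never influences the predictor that classifies it.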
Application to the Test Case

50 genes most
highly correlated
with the AML-ALL
distinction were
chosen
 A class predictor
based on these
genes was built
Application to the Test Case

Performance in cross validation:
– Out of 38 samples there were 36
predictions and 2 uncertainties (PS < 0.3)
– 100% accuracy
– PS median 0.77
Application to the Test Case
(cont.)

Performance on an independent set of
samples:
– Out of 34 samples there were 29
predictions and 5 uncertainties (PS < 0.3)
– 100% accuracy
– PS median 0.73
Comments

Why 50 genes?
– Large enough to be robust against noise
– Small enough to be readily applied in a clinical
setting
– Predictors based on between 10 and 200 genes all
performed well

Genes useful for cancer class prediction
may also provide insight into cancer
pathogenesis and pharmacology
Comments (cont.)
Creation of a new predictor involves
expression analysis of thousands of
genes
 Application of the predictor then
requires only monitoring the expression
level of a few informative genes

Class Discovery

Cluster tumors by gene expression
– Apply a clustering technique to produce
presumed classes

Evaluation of the Classes:
– Are the classes meaningful?
– Do they reflect true structure?
Clustering Technique - SOMs

SOMs – Self Organizing Maps
Well suited for identifying a small
number of prominent classes
– Find an optimal set of “centroids”
– Partition the data set according to the centroids
– Each centroid defines a cluster consisting of the
data points nearest to it

We won't go into details about the
calculation of SOMs
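The centroid idea can still be illustrated without the SOM machinery. The sketch below is not a SOM: it swaps in a plain k-means-style refinement (assign each point to its nearest centroid, then move each centroid to the mean of its points) purely to show how a set of centroids partitions the data. All names and data points are invented.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def partition(points, centroids):
    """Cluster index for every point, given fixed centroids."""
    return [nearest(p, centroids) for p in points]

def refine(points, centroids, n_iter=10):
    """k-means-style stand-in for SOM training (no neighborhood coupling)."""
    for _ in range(n_iter):
        assign = partition(points, centroids)
        new = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(old)  # keep a centroid that attracted no points
        centroids = new
    return centroids

# Toy 2-D "expression profiles" forming two obvious groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centroids = refine(points, [points[0], points[-1]])
clusters = partition(points, centroids)
print(clusters)
```

A real SOM additionally places the centroids on a grid and updates grid neighbors together during training; that coupling is omitted here.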
Application of a two-cluster
SOM to the test case
Class A1:
24 ALL, 1 AML
Class A2:
10 AML, 3 ALL


Quite effective at automatically discovering the
two types of leukemia
Not perfect
Evaluation of the Classes
How can we evaluate such classes if
the “right” answer is not already known?
 Hypothesis: class discovery can be
tested by class prediction

– If the classes reflect true structure, then a
class predictor based on them should
perform well

Let’s test this hypothesis...
Validity of Predictors Based on
A1 and A2
Predictors based on different numbers
of informative genes performed well
 For example: a 20-gene predictor

Validity of Predictors Based on
A1 and A2 cont.

Performance on
independent
samples:
– PS median 0.61
– Prediction made for
74% of samples
Validity of Predictors Based on
A1 and A2 cont.

Performance in
cross validation:
– 34 accurate
predictions with
high prediction
strength
– One error
– Three uncertain predictions
[Figure labels: the one cross-validation error; 2 of the 3 cross-validation uncertain predictions]
Iterative Procedure
Use a SOM to initially cluster the data
 Construct a predictor
 Remove samples that are not correctly
predicted in cross-validation
 Use the remaining samples to generate
an improved predictor
 Test on an independent data set

Validity of Predictors Based on
Random Clusters

Performance:
– Poor accuracy in
cross validation
– Low PS on
independent
samples
Conclusion

The AML-ALL distinction could have
been automatically discovered and
confirmed without previous biological
knowledge
Application of a 4-cluster SOM
to the Test Case
Evaluation of the Classes

Complement approach:
– Construct class predictors to distinguish
each class from its complement

Pair-wise approach:
– Construct class predictors to distinguish
between each pair of classes Ci,Cj
– Perform cross validation only on samples
in Ci and Cj
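The two evaluation schemes differ only in how the samples are relabeled before a predictor is built. A minimal sketch, with hypothetical function names and made-up cluster labels:

```python
from itertools import combinations

def complement_tasks(labels):
    """One task per class: that class ("in") versus everything else ("out")."""
    return {c: ["in" if y == c else "out" for y in labels]
            for c in sorted(set(labels))}

def pairwise_tasks(labels):
    """One task per pair (Ci, Cj): indices of the samples in Ci or Cj only."""
    tasks = {}
    for ci, cj in combinations(sorted(set(labels)), 2):
        tasks[(ci, cj)] = [k for k, y in enumerate(labels) if y in (ci, cj)]
    return tasks

# Toy cluster assignments for six samples.
labels = ["B1", "B2", "B3", "B4", "B1", "B3"]
print(complement_tasks(labels)["B1"])
print(pairwise_tasks(labels)[("B1", "B3")])
```

With four classes the complement approach yields four predictors, each cross-validated on all samples, while the pair-wise approach yields six, each cross-validated only on the samples of its two classes.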
Evaluation of the Classes

Class predictors distinguished the
classes from one another, with the
exception of B3 versus B4
Conclusion
The results suggest the merging of
classes B3 and B4
 The distinction corresponding to AML,
B-ALL and T-ALL was confirmed

Uses of Class Discovery
Identify fundamental subtypes of any
cancer
 Search for fundamental mechanisms
that cut across distinct types of cancers

Questions?

Thank you for listening