Class Prediction Based on
Gene Expression Data
Issues in the Design and Analysis of Microarray Experiments
Michael D. Radmacher, Ph.D.
Biometric Research Branch
National Cancer Institute
One Potential of Gene Expression Data
• Specimens will be distinguishable by their gene
expression profiles
• NCI Director’s Challenge: Toward a Molecular
Classification of Tumors
– “This challenge is intended to lay the groundwork for changing
the basis of tumor classification from morphological to
molecular characteristics.”
– Purpose is “...to define comprehensive profiles of molecular
alterations in tumors that can be used to identify subsets of
patients.”
• So one important goal is: Classification
What is meant by “Classification”?
Two important and distinct answers:
Class Discovery
• Identification of previously
unknown classes of specimens
• Use of “unsupervised” methods
– Hierarchical Clustering
– k-means Clustering
– SOMs
– Others
• Prevalent method used in
literature for analysis of gene
expression data.
Class Prediction
• Assignment of specimens into
known classes
• Use of “supervised” methods
– Logistic Regression
– CART
– Discriminant Analysis
– Others
• Class prediction is more
powerful than class discovery
for distinguishing specimens
based on a priori defined
classes.
Example of Class Discovery:
Distinct Types of Diffuse Large B-Cell Lymphoma
• DLBCL is clinically heterogeneous
• Specimens were clustered based on their
expression profiles of GC B-cell associated
genes.
• Two subgroups were discovered:
– GC B-like DLBCL
– Activated B-like DLBCL
(Figures and information taken from Alizadeh et al., Nature 403:503-11, 2000)
Study of Gene Expression in Breast Tumors
(NHGRI, J. Trent)
cDNA Microarrays
Parallel Gene Expression Analysis
6526 genes/tumor
• How similar are the gene
expression profiles of
BRCA1 and BRCA2 (+) and
sporadic breast cancer
patient biopsies?
• Can we identify a set of
genes that distinguish the
different tumor types?
• Tumors studied:
– 7 BRCA1 +
– 8 BRCA2 +
– 7 Sporadic
BRCA1 +/- and BRCA2 +/- Classification:
Results from Hierarchical Clustering
[Figure: two dendrograms, "BRCA1 Clustering" and "BRCA2 Clustering", with each tumor labeled by mutation status (BRCA1+/BRCA1- and BRCA2+/BRCA2-).]
Class Prediction Paradigm
1. Begin with a data set that can be separated into known groups.
2. Choose a method of class prediction.
3. Perform class prediction on the data set using “leave-one-out”
cross-validation.
• Leave one specimen out of data set.
• Build the class predictor using remaining data.
• Predict class of the left out specimen.
• Repeat so that a prediction is made for every specimen.
4. Use a permutation test to determine if there is a significant
difference in expression patterns between the groups.
• Permute class labels among specimens.
• Perform class prediction on the permuted data.
• Repeat many times.
• Report the % of permuted sets with an error rate equivalent to or less
than that for the actual data set.
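The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function names (`loocv_errors`, `permutation_p`), the number of permutations, and the nearest-centroid predictor used as a stand-in for "a method of class prediction" are all assumptions made for the example.

```python
import random
from statistics import mean

def fit_centroid(X, y):
    """Stand-in predictor (an assumption, not the slides' method):
    store the mean expression profile of each class."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        model[c] = [mean(col) for col in zip(*rows)]
    return model

def predict_centroid(model, x):
    """Assign x to the class whose mean profile is closest."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, model[c]))
    return min(model, key=sq_dist)

def loocv_errors(X, y, fit, predict):
    """Step 3: leave each specimen out, rebuild the predictor on the
    remaining data, and count wrong predictions."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)  # rebuilt from scratch every time
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors

def permutation_p(X, y, fit, predict, n_perm=200, seed=0):
    """Step 4: fraction of label permutations whose cross-validated
    error count is equal to or less than that of the actual labels."""
    rng = random.Random(seed)
    observed = loocv_errors(X, y, fit, predict)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if loocv_errors(X, y_perm, fit, predict) <= observed:
            hits += 1
    return observed, hits / n_perm
```

Because `fit` is called again on every training set, any step placed inside it (including gene selection) is repeated without the left-out specimen.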
The Compound Covariate Predictor (CCP)
• We consider only genes that are differentially expressed between
the two groups (using a two-sample t-test with small α).
• The CCP
– Motivated by J. Tukey, Controlled Clinical Trials, 1993
– Simple approach that may serve better than complex multivariate
analysis
– A compound covariate is built from the basic covariates (log-ratios)
$$\mathrm{CCP}_i = \sum_j t_j x_{ij}$$
$t_j$ is the two-sample t-statistic for gene j.
$x_{ij}$ is the log-ratio measure of sample i for gene j.
The sum is over all differentially expressed genes.
• Threshold of classification: midpoint of the CCP means for the two
classes.
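A compound covariate predictor along these lines might look as follows. This is a hedged sketch under stated assumptions: the names `fit_ccp` and `predict_ccp` are made up for the example, the t-statistic is the pooled-variance form, and genes are selected with a cutoff on |t| (`t_cutoff`) as a simple stand-in for "p-value below α".

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic for two lists of log-ratios."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # no variation within either class: treat gene as uninformative
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def ccp_score(weights, x):
    """Compound covariate: sum of t_j * x_ij over the selected genes."""
    return sum(t * x[j] for j, t in weights.items())

def fit_ccp(X, y, t_cutoff):
    """X: list of samples (each a list of log-ratios); y: 0/1 class labels.
    Keeps genes with |t| >= t_cutoff (assumed stand-in for 'p < alpha')."""
    class1 = [x for x, c in zip(X, y) if c == 1]
    class0 = [x for x, c in zip(X, y) if c == 0]
    weights = {}
    for j in range(len(X[0])):
        t = t_statistic([x[j] for x in class1], [x[j] for x in class0])
        if abs(t) >= t_cutoff:
            weights[j] = t
    # Threshold: midpoint of the mean CCP values of the two classes.
    m1 = mean(ccp_score(weights, x) for x in class1)
    m0 = mean(ccp_score(weights, x) for x in class0)
    return weights, (m1 + m0) / 2

def predict_ccp(model, x):
    """Class 1 has the larger mean CCP, since t_j is (class 1 - class 0)."""
    weights, threshold = model
    return 1 if ccp_score(weights, x) > threshold else 0
```

With the t-statistic defined as class 1 minus class 0, the mean CCP of class 1 is never below that of class 0, so a sample is assigned to class 1 when its score exceeds the midpoint.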
BRCA1 +/- and BRCA2 +/- Classification:
Results from Class Prediction with CCP
Classification       α        Number of           Number of misclassified    % of permutations with m
                              significant genes   tumors (m)                 or fewer misclassifications
BRCA1+ vs. BRCA1-    0.0001   9                   1 (0 BRCA1+, 1 BRCA1-)     0.3
BRCA2+ vs. BRCA2-    0.0001   11                  4 (3 BRCA2+, 1 BRCA2-)     4.0
Sample Size Considerations for
Accurate Class Prediction
[Figure: two panels, "BRCA1 +/- Classification" and "BRCA2 +/- Classification". Axis and legend labels: size of training set (bootstrap data), 40 to 200; values from 0.0 to 0.2; mean; 95th percentile; misclassification error rate of original data set.]
Summary
• Class discovery and prediction methods have distinct goals.
• When class information is known, class prediction is a more
powerful method for detecting differences.
• BRCA1 and BRCA2 mutation positive tumors have
distinguishable gene expression patterns.
– BRCA1 distinction is stronger than BRCA2.
– Some biological insight concerning misclassified specimens.
– Not at level of clinical classification yet.
• Sample size issues
Collaborators
NCI
Richard Simon
NHGRI
Mike Bittner
Yidong Chen
David Duggan
Ingrid Hedenfalk
Jeff Trent