Recursive partitioning for tumor classification with gene


Recursive Partitioning for
Tumor Classification with Gene
Expression Microarray Data
Heping Zhang, Chang-Yung Yu,
Burton Singer, Momiao Xiong
Presented by Weihua Huang
Data used in the article
Expression profiles of 2,000 genes using an Affymetrix
oligonucleotide array in 22 normal and 40 colon cancer
tissues
The response is binary, indicating normal or cancer tissue, and
the predictor variables are the expression levels of the 2,000 genes
Classification Tree Using Recursive Partitioning
Goal:
To partition the feature space into disjoint regions by growing a
tree so that the tissues within each region are homogeneous in
terms of the response.
Algorithm:
Start with a root node containing the study sample and split it
into smaller and smaller nodes according to whether a particular
selected predictor is above a chosen cutoff value. At each
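splitting step, the selected predictor and its cutoff value
are chosen to maximize the reduction in node impurity
ΔI = P(A) I(A) − P(A_L) I(A_L) − P(A_R) I(A_R),
where A is the node being split, A_L and A_R are its left and right
daughter nodes, P(·) is the probability of a tissue falling into a
node, and I(·) is the node impurity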
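The split search described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`entropy`, `impurity_reduction`, `best_split`) are hypothetical, entropy is assumed as the impurity measure, P(·) is estimated by the share of the sample in each node, and candidate cutoffs are taken as midpoints between consecutive observed values.

```python
import numpy as np

def entropy(p):
    """Entropy impurity of a node in which a fraction p of tissues is normal."""
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def impurity_reduction(y_parent, y_left, y_right, n_total):
    """Delta I = P(A) I(A) - P(A_L) I(A_L) - P(A_R) I(A_R).

    P(.) is estimated (an assumption) by the fraction of the n_total
    study tissues that fall into the node.
    """
    def weighted(y):
        return (len(y) / n_total) * entropy(np.mean(y)) if len(y) else 0.0
    return weighted(y_parent) - weighted(y_left) - weighted(y_right)

def best_split(X, y, n_total=None):
    """Scan every predictor and cutoff; return (predictor, cutoff, Delta I).

    X: tissues x genes array; y: 1 = normal, 0 = cancer.
    """
    n_total = len(y) if n_total is None else n_total
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])  # sorted distinct expression levels
        for c in (values[:-1] + values[1:]) / 2:  # midpoints as cutoffs
            right = X[:, j] > c  # tissues above the cutoff go right
            gain = impurity_reduction(y, y[~right], y[right], n_total)
            if gain > best[2]:
                best = (j, c, gain)
    return best
```

Growing a tree would apply `best_split` recursively to the two daughter nodes, passing the size of the full study sample as `n_total`.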
Classification Tree using Recursive Partitioning
Node impurity:
One measure of node impurity is the entropy
function:
- P log(P) - (1-P) log(1-P),
where P is the probability of a tissue being normal within the
node
• Minimum impurity (= 0)
When all tissues within the node are of the same type (P = 0 or 1)
• Maximum impurity (= log 2)
When half of the tissues within the node are normal and half are
cancer (P = 0.5)
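The two extremes can be checked with a short derivation. The following is a worked sketch assuming the natural logarithm and the usual convention that 0 log 0 = 0:

```latex
\begin{align*}
I(P)   &= -P\log P - (1-P)\log(1-P), \qquad 0 \le P \le 1,\\
I'(P)  &= -\log P - 1 + \log(1-P) + 1 = \log\frac{1-P}{P},\\
I'(P)  &= 0 \iff P = \tfrac{1}{2},\\
I''(P) &= -\frac{1}{P} - \frac{1}{1-P} < 0
          \quad\text{(so } P = \tfrac{1}{2} \text{ is a maximum)},\\
I\!\left(\tfrac{1}{2}\right)
       &= -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2,\\
I(0)   &= I(1) = 0
          \quad\text{since } \lim_{P\to 0^{+}} P\log P = 0.
\end{align*}
```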
Results From Classification Tree on the Data
Fig. 1. Classification tree for tissue types using expression data from three
genes (M26383, R15447, M28214)
Another Way to Visualize the Recursive Partitioning
Fig. 3. A scatterplot of expression data from R15447 and M28214 for a
subset of tissues (node 3 in Fig. 1).
Results from Recursive partitioning
Quality of the tree-based classification:
Using the localized 5-fold cross-validation error rate:
• The same genes are kept at the same nodes; only the cutoff
values are chosen anew.
• Randomly divide the 40 cancer tissues into 5 subsamples
of 8, and the 22 normal tissues into 5 subsamples of
4, 4, 4, 5, and 5. Four subsamples each from the cancer and
normal tissues are used to choose the cutoff values for
the three splits. The remaining samples are used to
count the tissues misclassified under the new cutoff
values.
• The error rate is between 6% and 8% in two runs of
cross-validation, which is much better than that obtained by
existing analyses.
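The localized cross-validation scheme above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the article: it handles a single split on one fixed gene rather than the three splits of Fig. 1, it picks the cutoff by minimizing training misclassifications (a swapped-in criterion), and it assumes cancer tissues lie above the cutoff. The names `stratified_folds`, `choose_cutoff`, and `localized_cv_error` are hypothetical.

```python
import numpy as np

def stratified_folds(y, k=5, seed=0):
    """Split cancer (y == 1) and normal (y == 0) tissues into k subsamples each."""
    rng = np.random.default_rng(seed)
    cancer = np.array_split(rng.permutation(np.flatnonzero(y == 1)), k)
    normal = np.array_split(rng.permutation(np.flatnonzero(y == 0)), k)
    # Fold i holds the i-th cancer subsample and the i-th normal subsample.
    return [np.concatenate([c, n]) for c, n in zip(cancer, normal)]

def choose_cutoff(x, y):
    """Cutoff on one gene with the fewest training errors.

    Assumes tissues above the cutoff are called cancer and that x has
    at least two distinct values.
    """
    values = np.unique(x)
    cutoffs = (values[:-1] + values[1:]) / 2
    errors = [np.sum((x > c) != (y == 1)) for c in cutoffs]
    return cutoffs[int(np.argmin(errors))]

def localized_cv_error(x, y, k=5, seed=0):
    """The gene stays fixed; only its cutoff is chosen anew in each fold."""
    wrong = 0
    for test in stratified_folds(y, k, seed):
        train = np.setdiff1d(np.arange(len(y)), test)
        c = choose_cutoff(x[train], y[train])
        wrong += np.sum((x[test] > c) != (y[test] == 1))
    return wrong / len(y)
```

With 40 cancer and 22 normal tissues, `np.array_split` yields cancer subsamples of 8 and normal subsamples of sizes 5, 5, 4, 4, and 4, matching the partition on the slide up to ordering.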
Correlation Analysis on Genes
Functional expressions from various genes are
correlated.
Examine the correlation patterns of the three
selected genes in Fig. 1.
Correlation Between the Three Selected Genes and the
Remaining Expression Data
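One way to compute the correlations between a few selected genes and all remaining genes is sketched below. It is an illustration only: the function name `correlation_with_selected` is hypothetical, Pearson correlation is assumed, and every gene is assumed to have non-constant expression across tissues.

```python
import numpy as np

def correlation_with_selected(X, selected):
    """Pearson correlation of each selected gene with every gene.

    X: tissues x genes expression matrix.
    selected: column indices of the selected genes.
    Returns an array of shape (len(selected), number of genes).
    """
    # Standardize each gene (column) to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # The average product of standardized values is the Pearson correlation.
    return Z[:, selected].T @ Z / X.shape[0]
```

For the colon data, `X` would be the 62 x 2,000 expression matrix and `selected` the columns of the three genes in Fig. 1; genes with a large absolute correlation are candidates for carrying similar information.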
Another Tree Based on a Different Set of Three Genes
Fig. 6. Classification tree for tissue types using expression data from three
genes (R87126, T62947, X15183)
Correlation Matrix Among Genes in Fig. 1 and Fig. 6
Advantages of the Classification Tree
1. Efficient with a large number of genes
2. Automatically selects valuable and user-friendly
genes as predictors
3. More precise than some other classification
methods, such as support vector machines and linear
discriminant analysis
Conclusions:
1. It is likely that the information contained in a
large number of genes can be captured by a
small optimal set of genes without significant
loss of information.
2. The classification precision of recursive
partitioning is important for clinical applications.