Classification (Supervised Clustering)

Naomi Altman
Nov '06
Objective
Starting from a sample from known groups:
1) Select a set of genes that identify the groups
2) Compute a function of the expression values that can be used
to classify a new sample.
e.g. Normal and cancer prostate tissues from 24 patients
1) a) Find the set of genes that may be involved in the disease
process (differential expression analysis).
b) Find a set of genes that mark the disease (possibly not all of the genes involved).
2) Take a sample from a new patient - does this person have
prostate cancer?
The Main Picture
For Linear Discriminant Analysis
[Figure: separating hyperplane and linear discrimination direction. To classify a new point, see which side of the hyperplane it lies on.]
The Main Picture
For Support Vector Machines
[Figure: separating hyperplane. To classify a new point, see which side of the hyperplane it lies on.]
The Main Picture
For Quadratic Discriminant Analysis
[Figure: To classify a new point, see which side of the separating hypercurve it lies on.]
The Main Picture
For Recursive Partitioning
[Figure: To classify a new point, see the classification of its partition of the space.]
Linear and Quadratic Discriminant Analysis,
Logistic Regression
Each sample belongs to A or B.
Linear and quadratic discriminant analysis are essentially
regressions on a 0/1 indicator variable.
Suppose we have samples of sizes m from A and n from B.
Sample ts (t = group, s = sample within group) has gene expression values $Y_{1ts}, \dots, Y_{Gts}$.
For each group we can compute the mean expression value of each gene, $\bar{Y}_{it}$, the variance of each gene, $s^2_{it}$, and the covariance between genes, $s_{ijt}$.
We can also compute the pooled variance and covariance of each
gene, which is essentially the average over the 2 groups.
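As a concrete illustration of these summaries, here is a minimal NumPy sketch; the array names, group sizes, and simulated values are hypothetical, not from the lecture.

```python
import numpy as np

# Hypothetical data: m samples from group A, n from group B, G genes.
rng = np.random.default_rng(0)
m, n, G = 12, 12, 5
Y_A = rng.normal(loc=0.0, size=(m, G))   # rows = samples, columns = genes
Y_B = rng.normal(loc=1.0, size=(n, G))

# Group mean expression for each gene (Ybar_A, Ybar_B)
mean_A = Y_A.mean(axis=0)
mean_B = Y_B.mean(axis=0)

# Per-group variance-covariance matrices: variances s^2_it on the diagonal,
# covariances s_ijt off the diagonal
S_A = np.cov(Y_A, rowvar=False)
S_B = np.cov(Y_B, rowvar=False)

# Pooled variance-covariance matrix: essentially the average over the 2 groups,
# weighted by degrees of freedom
S = ((m - 1) * S_A + (n - 1) * S_B) / (m + n - 2)
```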
Linear Discriminant Analysis
1
(YA  YB )' S Y*s
where S is the pooled variance matrix
is the linear discriminant function.
In the simplest case, we classify each sample depending on
whether it is above (A) or below (B) the midpoint of the line, which
is
1
(YA  YB )' S 1 (YA  YB )
2
If the 2 conditions are not equally likely, we may wish to weight
so that we classify new samples proportionally to the expected
percentages.
1
1
1
(YA  YB )' S Y*s  (YA  YB )' S 1 (YA  YB )  ln( 1 /  2 )
2
2
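A sketch of the two-group rule above, under the same kind of simulated setup; here the priors are simply the sample proportions, though in practice they would come from the expected percentages.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, G = 15, 10, 4
Y_A = rng.normal(0.0, 1.0, size=(m, G))
Y_B = rng.normal(1.0, 1.0, size=(n, G))

mean_A, mean_B = Y_A.mean(axis=0), Y_B.mean(axis=0)
S = ((m - 1) * np.cov(Y_A, rowvar=False) +
     (n - 1) * np.cov(Y_B, rowvar=False)) / (m + n - 2)

# Linear discriminant direction: S^{-1} (Ybar_A - Ybar_B)
a = np.linalg.solve(S, mean_A - mean_B)

# Midpoint cut-off: (1/2)(Ybar_A - Ybar_B)' S^{-1} (Ybar_A + Ybar_B)
cutoff = 0.5 * a @ (mean_A + mean_B)

# Prior weighting: ln(pi_1 / pi_2) shifts the cut-off
pi_1, pi_2 = m / (m + n), n / (m + n)

def classify(y_new):
    score = a @ y_new - cutoff + np.log(pi_1 / pi_2)
    return "A" if score >= 0 else "B"

print(classify(rng.normal(0.0, 1.0, size=G)))
```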
Linear Discriminant Analysis
This is extended to p groups by considering the discriminant score, which is based on another SVD and is similar to multivariate ANOVA.
1. Consider the covariance matrix of the sample means, weighted by the sample sizes:
between variance: $\sum_t n_t (\bar{Y}_{it} - \bar{Y}_i)^2 / (p - 1)$
between covariance: $\sum_t n_t (\bar{Y}_{it} - \bar{Y}_i)(\bar{Y}_{jt} - \bar{Y}_j) / (p - 1)$
Assemble these into the between-group variance matrix B.
2. Consider the pooled covariance matrix S (which in this context is often called W, for the within-group variance matrix).
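A sketch of assembling B and W from data; the p groups and their sizes below are simulated and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
p, G = 3, 4                          # p groups, G genes
sizes = [10, 12, 8]
groups = [rng.normal(loc=k, size=(n_t, G)) for k, n_t in enumerate(sizes)]

grand_mean = np.vstack(groups).mean(axis=0)

# Between-group variance matrix B:
#   B_ij = sum_t n_t (Ybar_it - Ybar_i)(Ybar_jt - Ybar_j) / (p - 1)
B = sum(n_t * np.outer(Y_t.mean(axis=0) - grand_mean,
                       Y_t.mean(axis=0) - grand_mean)
        for Y_t, n_t in zip(groups, sizes)) / (p - 1)

# Within-group variance matrix W: the pooled covariance matrix S
W = sum((n_t - 1) * np.cov(Y_t, rowvar=False)
        for Y_t, n_t in zip(groups, sizes)) / (sum(sizes) - p)
```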
Linear Discriminant Analysis
Now consider the SVD of $S^{-1/2} B S^{-1/2}$. (It is symmetric, so the left and right eigenvectors are the same.)
The first eigenvector is the direction of greatest separation of the
means, in terms of the axes of the ellipses defining the groups.
The 2nd eigenvector is the direction of 2nd greatest separation that is orthogonal to the first, etc.
The rank of B is p-1, so there are only p-1 non-zero eigenvalues.
Each sample is assigned to the group with nearest mean in the
eigenvector coordinates.
This is equivalent to looking at the combinations of the pairwise
discriminant functions and mapping every sample to the group
with the nearest mean.
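The eigen-analysis and nearest-mean assignment can be sketched via a generalized eigenproblem, which is equivalent to the SVD of $S^{-1/2} B S^{-1/2}$; the data below are hypothetical simulated values, not from the lecture.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
p, G = 3, 4
sizes = [10, 12, 8]
groups = [rng.normal(loc=k, size=(n_t, G)) for k, n_t in enumerate(sizes)]
means = [Y_t.mean(axis=0) for Y_t in groups]
grand_mean = np.vstack(groups).mean(axis=0)

B = sum(n_t * np.outer(m_t - grand_mean, m_t - grand_mean)
        for m_t, n_t in zip(means, sizes)) / (p - 1)
W = sum((n_t - 1) * np.cov(Y_t, rowvar=False)
        for Y_t, n_t in zip(groups, sizes)) / (sum(sizes) - p)

# Solve B v = lambda W v; the eigenvectors are W-orthonormal, so Euclidean
# distance in these coordinates measures separation "in terms of the axes
# of the ellipses" (the within-group covariance).
vals, vecs = eigh(B, W)
directions = vecs[:, ::-1][:, :p - 1]     # only p-1 non-zero eigenvalues

def classify(y_new):
    z = y_new @ directions
    centers = np.array([m_t @ directions for m_t in means])
    return int(np.argmin(np.linalg.norm(centers - z, axis=1)))

print(classify(rng.normal(loc=1.0, size=G)))   # index of the nearest group mean
```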
Linear Discriminant Analysis
As in the 2-group case, you can weight the discriminant scores
by the prior probability of group membership
[Figures: SVD projection, LDA projection, LDA regions]
Quadratic Discriminant Analysis
QDA is very similar to linear discriminant analysis, except that every group is allowed to have its own variance matrix, allowing the ellipses to have different orientations.
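A brief sketch with scikit-learn's QDA implementation; the two simulated groups below are hypothetical and chosen so their ellipses have different orientations.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(4)
# Two hypothetical groups with differently oriented covariance ellipses
X_A = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=30)
X_B = rng.multivariate_normal([2, 0], [[1.0, -0.8], [-0.8, 1.0]], size=30)
X = np.vstack([X_A, X_B])
y = np.array(["A"] * 30 + ["B"] * 30)

# QDA estimates a separate variance matrix per group, giving a quadratic boundary
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(qda.predict([[1.0, 0.5]]))
```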
Logistic Regression
Let t be the probability of membership in group t.
Use maximum likelihood to fit
log(t/(1- t)) = b0 + SbiYits
Classify a sample into group t if the predicted
log(t/(1- t)) is the maximum over all groups.
Again, we can weight by prior probability.
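A sketch of the logistic (multinomial) version using scikit-learn, on hypothetical simulated expression values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
# Hypothetical data: 3 groups, 20 samples each, 5 genes
X = np.vstack([rng.normal(loc=k, size=(20, 5)) for k in range(3)])
y = np.repeat([0, 1, 2], 20)

# Maximum-likelihood fit of the log-odds of group membership as a linear
# function of the expression values; predict() picks the group whose
# predicted log-odds (equivalently, probability) is largest.
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict(X[:3]))
print(model.predict_proba(X[:1]))   # estimated membership probabilities
```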
Recursive Partitioning
[Figure: classification tree for the iris data. First split: petal length (PL) < 2.45; second split: petal width (PW) < 1.75. Leaf class counts (setosa, versicolor, virginica): se (50 0 0), ve (0 49 5), vi (0 1 45).]
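The tree in the figure appears to be the classic iris example; a sketch that reproduces those splits with scikit-learn's recursive partitioning:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A depth-2 tree recovers the splits shown above: petal length < 2.45,
# then petal width < 1.75
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))

# A new flower is classified by the partition (leaf) it falls into
new_flower = [[5.0, 3.5, 1.4, 0.2]]   # sepal length/width, petal length/width
print(iris.target_names[tree.predict(new_flower)])
```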
Assessing Accuracy
Count the number of misclassifications of the training sample
(optimistic).
Cross-validation: set aside a fraction of the data (the test data).
"Train" using the remainder of the sample, with the same rule
used for the complete data.
Count the number of misclassifications of the test data.
Repeat.
About 1/3 test data appears to be best.
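A sketch of this repeated hold-out scheme, using about 1/3 of the samples as test data each time; the iris data and LDA classifier here are stand-ins for an actual microarray study.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)

# Repeatedly set aside ~1/3 of the data, train on the remainder with the
# same rule, and count misclassifications on the test data.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=1/3, random_state=0)
error_rates = []
for train_idx, test_idx in splitter.split(X, y):
    model = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
    error_rates.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))

print(f"estimated misclassification rate: {np.mean(error_rates):.3f}")
```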
But...
a) If the number of genes exceeds the number of samples, we always "overfit" - e.g. with logistic regression we can almost always achieve perfect classification.
b) The rank of S is at most min(number of genes, number of samples - number of groups), so when genes outnumber samples S is not invertible (LDA) and neither are the within-group variance matrices (QDA).
c) Most of the methods use all of the genes.
i.e. With microarray data, we will need to select a smaller set of
genes to work with.
For medical diagnostics we often want a very small set of
markers.
Reducing the Number of Genes
1. With n samples, use the n - k most significantly differentially expressed genes (a sketch follows this list).
2. Cluster the genes and take the most significantly differentially expressed gene in each cluster.
3. Add variables to your discrimination function stepwise.
4. PAM - shrink the group centers toward the overall center, and then apply a robust QDA with moderated variance estimates (like SAM). The method ends up with the within-group centroid equal to the overall centroid for most genes, so the differences among groups rely only on the remaining genes, which are the only genes used in the QDA.
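A sketch of approach 1 (filter to the most significantly differentially expressed genes, then classify), on hypothetical simulated data; putting the filter inside the pipeline refits the gene selection within each cross-validation fold, which avoids overly optimistic error estimates.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
# Hypothetical data: 40 samples, 2000 genes, only the first 20 genes differ
n, G = 40, 2000
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, G))
X[y == 1, :20] += 1.5

# Keep the k most significantly differentially expressed genes (F-test filter),
# then run LDA on that reduced set; the selection is redone inside each fold.
clf = make_pipeline(SelectKBest(f_classif, k=20), LinearDiscriminantAnalysis())
print(cross_val_score(clf, X, y, cv=5).mean())
```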
Problem (all methods): replicability is often lost when studies are repeated, e.g. we can tell the difference between ALL and AML in all studies, but different discriminant functions are required, and maybe different genes.