Transcript 1017
Presenter: Yanlin Wu
Advisor: Professor Geman
Date: 10/17/2006
Is cross-validation valid for
small-sample classification?
Ulisses M. Braga-Neto and
Edward R. Dougherty
Background
What is the classification problem?
How do we evaluate the accuracy of a classifier, i.e., measure the error of a classification model?
Different error-measurement methods
Things to pay attention to
Classification problem
In statistical pattern recognition, a feature vector X ∈ R^d and a label Y ∈ R, which takes on numerical values representing the different classes; for the two-class problem, Y ∈ {0,1}.
A classifier is a function g : R^d → {0,1}.
The error rate of g is ε[g] = P[g(X) ≠ Y] = E(|Y − g(X)|).
Bayes classifier: g_BAY(x) = 1 if P(Y=1 | X=x) > 1/2, and g_BAY(x) = 0 otherwise.
For any g, ε[g_BAY] ≤ ε[g], so g_BAY is the optimal classifier.
Training data
The feature-label distribution F is unknown – we use training data S_n = {(X_1, Y_1), …, (X_n, Y_n)} to design a classifier.
A classification rule is a mapping g : {R^d × {0,1}}^n × R^d → {0,1}.
A classification rule maps the training data S_n into the designed classifier g(S_n, ·).
The true error of a designed classifier is its error rate given a fixed training dataset:
ε_n = ε[g(S_n, ·)] = E_F(|Y − g(S_n, X)|)
where E_F indicates expectation with respect to F.
Training data (continued)
The expected error rate over the data is given by
E[ε_n] = E_{F_n} E_F(|Y − g(S_n, X)|)
where F_n is the joint distribution of the data S_n.
It is also called unconditional error of the
classification rule.
Question: How can we measure the true error of a model, since we do not have access to the universe of observations to test it on, i.e., we do not know F?
Answer: Error estimation
methods have been developed.
Error estimation techniques
For all methods, the final model M is built based on all n observations, and then these n observations are used again to estimate the error of the model.
Types:
Re-substitution
Holdout Method
Cross-validation
Bootstrap
Re-substitution
Re-use the same training sample to
measure error
ε̂_resub = (1/n) Σ_{i=1}^n |y_i − g(S_n, x_i)|
This error tends to be biased low, and can be made arbitrarily close to zero by overfitting the model, since the same data are reused to measure the error.
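The estimator above can be sketched in a few lines of Python. The nearest-mean classifier is a hypothetical stand-in (the slides do not fix a classification rule); the point is only that the same n points are used for both design and error counting.

```python
# Resubstitution sketch: design g on S_n, then count errors on S_n itself.
# nearest_mean_classifier is a hypothetical rule used only for illustration.

def nearest_mean_classifier(xs, ys):
    """Return g(.) assigning the label of the closer class mean (1-D features)."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

def resub_error(xs, ys, rule):
    """eps_resub = (1/n) * sum_i |y_i - g(S_n, x_i)|."""
    g = rule(xs, ys)
    return sum(abs(y - g(x)) for x, y in zip(xs, ys)) / len(xs)

xs = [0.1, 0.3, 0.2, 2.0, 2.2, 1.9]
ys = [0, 0, 0, 1, 1, 1]
print(resub_error(xs, ys, nearest_mean_classifier))  # 0.0 on this separable toy set
```

With a complex enough rule the same quantity can be driven to zero by overfitting, which is exactly the low bias the slide warns about.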
Holdout Method
For large samples, randomly choose a subset S_n^t ⊂ S_n as test data, design the classifier on S_n \ S_n^t, and estimate its error by applying it to S_n^t.
This is an unbiased estimator of E[ε_{n−n_t}], with respect to expectation over S_n^t.
Holdout Method Comments
This error can be slightly biased high due to not
using all n observations to build the classifier. This
bias will tend to decrease as n increases.
The choice of what percentage of the n observations goes into S_n^t is important, and is itself affected by n.
The holdout method can be run multiple times, with the accuracy estimates from all the runs averaged.
Impractical with small samples.
Cross-Validation
Algorithm:
Split the data into k mutually exclusive subsets S^(i), then build the model on k−1 of them and measure the error on the remaining one.
Each subset acts as the testing set once.
The error is the average of these k error measures:
ε̂_cvk = (1/n) Σ_{i=1}^k Σ_{j=1}^{n/k} |y_j^(i) − g(S_n \ S^(i), x_j^(i))|
When k = n, this is called "leave-one-out cross-validation".
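The k-fold procedure above can be sketched as follows. The 1-nearest-neighbor rule is a hypothetical choice for illustration, and folds are formed by simple striding rather than random splitting.

```python
# k-fold cross-validation sketch: each fold is held out once while the
# classifier is designed on the remaining k-1 folds; errors are averaged.
# one_nn is a hypothetical 1-nearest-neighbor rule on 1-D features.

def one_nn(train):
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def cv_error(data, k):
    """Average of |y - g(S_n \\ S^(i), x)| over the k held-out folds."""
    folds = [data[i::k] for i in range(k)]
    errors = 0
    for i in range(k):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        g = one_nn(train)
        errors += sum(abs(y - g(x)) for x, y in folds[i])
    return errors / len(data)

data = [(0.0, 0), (0.2, 0), (0.4, 0), (2.0, 1), (2.2, 1), (2.4, 1)]
print(cv_error(data, k=3))          # 0.0 for this well-separated sample
print(cv_error(data, k=len(data)))  # leave-one-out when k = n
```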
Cross-Validation (continued)
Stratified Cross-Validation: the classes are
represented in each fold in the same
proportion as in the original data – there is
evidence that this improves the estimator
The k-fold cross-validation estimator is unbiased as an estimator of E[ε_{n−n/k}].
The leave-one-out estimator is nearly unbiased as an estimator of E[ε_n].
Cross-Validation Comments
May be biased high, for the same reason as the holdout method.
Often used when n is small, in which case the holdout method would become even more biased.
Very computationally intensive, especially
for large k and n.
Bootstrap Method
Based on the notion of an ‘empirical distribution’ F*,
which puts mass 1/n on each of the n data points
A bootstrap sample Sn* from F* consists of n
equally-likely draws with replacement from the
original data Sn
The probability that any given data point will not appear in S_n* is (1 − 1/n)^n ≈ e^−1 for n ≫ 1.
A bootstrap sample of size n therefore contains on average (1 − e^−1)n ≈ 0.632n of the original data points.
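The 0.632 fact is easy to check numerically. A small Monte Carlo sketch in Python (index sampling only, no classifier involved):

```python
# Draw n indices with replacement and count the distinct ones; the fraction
# of distinct points should settle near 1 - e^-1 ~ 0.632 for n >> 1.

import random

def distinct_fraction(n, reps, rng):
    total = 0
    for _ in range(reps):
        sample = [rng.randrange(n) for _ in range(n)]
        total += len(set(sample))
    return total / (reps * n)

frac = distinct_fraction(n=100, reps=2000, rng=random.Random(0))
print(round(frac, 2))  # close to 1 - e^-1 ~ 0.63
```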
Bootstrap Method (continued)
Bootstrap zero estimator:
ε̂_0 = E_{F*}(|Y − g(S_n*, X)| : (X, Y) ∈ S_n \ S_n*)
In practice, E_{F*} has to be approximated by a sample mean based on independent replicates S_n^{*b}, for b = 1, …, B, where B is recommended to be between 25 and 200:
ε̂_0 = [Σ_{b=1}^B Σ_{i=1}^n |y_i − g(S_n^{*b}, x_i)| I_{P_i^{*b}=0}] / [Σ_{b=1}^B Σ_{i=1}^n I_{P_i^{*b}=0}]
where P_i^{*b} is the actual proportion of times a data point (x_i, y_i) appears in S_n^{*b}.
Bootstrap Method (continued)
The bootstrap zero estimator tends to be a high-biased estimator of E[ε_n].
The 0.632 bootstrap estimator tries to correct this bias:
ε̂_b632 = (1 − 0.632) ε̂_resub + 0.632 ε̂_0
Bias-corrected bootstrap estimator:
ε̂_bbc = ε̂_resub + (1/B) Σ_{b=1}^B Σ_{i=1}^n (1/n − P_i^{*b}) |y_i − g(S_n^{*b}, x_i)|
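A sketch of the 0.632 estimator, combining resubstitution with the zero estimator (errors counted only on points left out of each bootstrap sample, i.e., P_i^{*b} = 0). The 1-nearest-neighbor rule and the toy data are hypothetical, and B is kept small for speed.

```python
# 0.632 bootstrap sketch: eps_b632 = 0.368*eps_resub + 0.632*eps_0.

import random

def one_nn(train):
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def b632_error(data, B, rng):
    n = len(data)
    g_full = one_nn(data)
    resub = sum(abs(y - g_full(x)) for x, y in data) / n
    errs, count = 0.0, 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        boot = [data[i] for i in idx]
        g = one_nn(boot)
        for i, (x, y) in enumerate(data):
            if i not in idx:            # left-out point: P_i^{*b} = 0
                errs += abs(y - g(x))
                count += 1
    eps0 = errs / count if count else 0.0
    return 0.368 * resub + 0.632 * eps0

data = [(0.0, 0), (0.3, 0), (0.6, 0), (2.0, 1), (2.3, 1), (2.6, 1)]
print(b632_error(data, B=50, rng=random.Random(1)))  # small for this easy problem
```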
Bootstrap Comments
Computationally intensive.
The choice of B is important.
Tends to be slightly more accurate than cross-validation in some situations, but tends to have greater variance.
Classification procedure
Assess gene expressions with microarrays
Determine genes whose expression levels
can be used as classifier variables
Apply a rule to design the classifier from
the sample data
Apply an error estimation procedure
Error estimation challenges
What if the number of training samples is
remarkably small?
The error estimation will be greatly
impacted by small samples.
A dilemma: unbiased or small variance?
Prefer small variance: an unbiased
estimator with large variance is of little use
Error estimators under small samples
Holdout: impractical with small samples.
Resubstitution: always low-biased.
Cross-validation: has higher variance than resubstitution or bootstrap.
The variance problem of cross-validation makes its use questionable for very small samples.
Variability affecting error estimation
There is internal variance Var[ε̂ | S_n] and variability due to the random training sample. The latter is much larger than the internal variance.
Error-counting estimates, such as resubstitution and cross-validation, can only change in 1/n increments.
Variability (continued)
In cross-validation, test samples are not
independent samples. This adds variance to the
estimate.
Surrogate problem: the original designed classifier is assessed in terms of surrogate classifiers, designed by the classification rule applied to reduced data. If these surrogate classifiers differ too much from the original classifier too often, the estimate may be far from the true error rate.
Experimental Setup
Classification rules:
linear discriminant analysis (LDA)
3-nearest-neighbor (3NN)
decision trees (CART)
Error estimators:
resubstitution (resub)
cross-validation: leave-one-out (loo), 5-fold c-v (cv5),
10-fold c-v (cv10) and repeated 10-fold c-v (cv10r)
Bootstrap: 0.632 bootstrap (b632) and the bias-corrected bootstrap (bbc)
Study terms of error estimators
Study the performance of an error estimator ε̂ via the deviation distribution of ε_n − ε̂.
Estimator bias: E[ε_n − ε̂]
Confidence we can have in our estimates from actual samples: Var[ε_n − ε̂]
The root-mean-square (RMS) error: sqrt(E[(ε_n − ε̂)^2])
Quartiles of the deviation distribution: less affected by outliers than the mean.
Linear Discriminant Analysis (LDA)
We need the class posteriors Pr(G | X) for optimal classification. Suppose f_k(x) is the class-conditional density of X in the class G = k, and let π_k be the prior probability of class k, with Σ_{k=1}^K π_k = 1. A simple application of Bayes theorem gives us
Pr(G = k | X = x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l
Suppose we model each class density as multivariate Gaussian:
f_k(x) = (2π)^{−p/2} |Σ_k|^{−1/2} exp(−(1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k))
LDA (continued)
Assume the classes have a common covariance matrix Σ_k = Σ; this gives LDA.
In comparing two classes k and l, it is sufficient to look at the log-ratio
log [Pr(G=k | X=x) / Pr(G=l | X=x)] = log [f_k(x) / f_l(x)] + log(π_k/π_l)
= log(π_k/π_l) − (1/2)(μ_k + μ_l)^T Σ^{−1}(μ_k − μ_l) + x^T Σ^{−1}(μ_k − μ_l)
The linear discriminant function
δ_k(x) = x^T Σ^{−1} μ_k − (1/2) μ_k^T Σ^{−1} μ_k + log π_k
is an equivalent description of the decision rule, with G(x) = argmax_k δ_k(x).
LDA (continued)
Estimate the parameters of the Gaussian distributions:
1. π̂_k = N_k / N, where N_k is the number of class-k observations
2. μ̂_k = Σ_{g_i = k} x_i / N_k
3. Σ̂ = Σ_{k=1}^K Σ_{g_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (N − K)
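A minimal univariate sketch (p = 1, common variance σ²) of these estimates and the resulting discriminant δ_k(x) = xμ_k/σ² − μ_k²/(2σ²) + log π_k. This is a deliberate simplification of the multivariate formulas, written in plain Python; the toy data are hypothetical.

```python
# Univariate LDA sketch: estimate pi_k, mu_k and the pooled variance
# (N - K denominator), then classify by the largest discriminant value.

from math import log

def lda_fit(xs, ys):
    classes = sorted(set(ys))
    N, K = len(xs), len(set(ys))
    pi, mu = {}, {}
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        pi[k] = len(xk) / N
        mu[k] = sum(xk) / len(xk)
    var = sum((x - mu[y]) ** 2 for x, y in zip(xs, ys)) / (N - K)
    def g(x):
        return max(classes,
                   key=lambda k: x * mu[k] / var - mu[k] ** 2 / (2 * var) + log(pi[k]))
    return g

g = lda_fit([0.0, 0.4, 0.8, 3.0, 3.4, 3.8], [0, 0, 0, 1, 1, 1])
print(g(0.5), g(3.1))  # 0 1
```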
Figure 1: Three Gaussian distributions with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability of each class (Bayes decision boundaries).
Figure 2: 30 samples drawn from each Gaussian distribution, and fitted LDA decision boundaries.
KNN: Nearest-Neighbor Methods
Nearest-neighbor methods use those observations in the training set T closest in input space to x to form Ŷ. Specifically, the k-nearest-neighbor fit for Ŷ is defined as follows:
Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance (other distances can also be defined). So in words, we find the observations with x_i closest to x in input space, and average their responses.
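A direct Python sketch of the fit above: average the labels of the k nearest training points and, for a two-class problem, classify by thresholding at 1/2. The toy training set is hypothetical.

```python
# k-nearest-neighbor sketch: Yhat(x) = (1/k) * sum of y_i over the k
# nearest x_i under Euclidean distance; classify at the 0.5 threshold.

from math import dist  # Python 3.8+

def knn_predict(train, x, k):
    nearest = sorted(train, key=lambda p: dist(p[0], x))[:k]
    avg = sum(y for _, y in nearest) / k
    return 1 if avg > 0.5 else 0

train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
print(knn_predict(train, (0.5, 0.5), k=3))  # 0
print(knn_predict(train, (5.5, 5.5), k=3))  # 1
```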
Figure 1: 15-nearest-neighbor classifier
Figure 2: 1-nearest-neighbor classifier
Figure 3: 7-nearest-neighbor classifier
Figure 4: Misclassification curves (training size = 20, test size = 10000)
Decision tree (CART)
Decide how to split (conditional Gini or
conditional Entropy)
Decide when to stop splitting
Decide how to prune the tree
Use Training sample:
Pessimistic Pruning/ Minimal Error Pruning/
Error-based Pruning/Cost Complexity Pruning
Use Pruning Sample:
Error Reduced Pruning
Simulation (synthetic data)
Six sample sizes: 20 to 120 in increments of 20
Total experimental conditions: 3*6*6=108
For each experimental condition and sample size,
compute the empirical deviation distribution using
1000 replications with different sample data drawn
from an underlying model.
True error
Computed exactly for LDA
By Monte-Carlo computation for 3NN and CART
Simulation (synthetic data)
Empirical deviation distribution for selected simulations
(synthetic data). beta fits, n = 20.
Simulation (synthetic data)
Empirical deviation distribution for selected simulations.
variance as a function of sample size.
Simulation (synthetic data)
Cross-validation: slightly high-biased; its main drawback is high variability. It also tends to produce large outliers.
Resubstitution is low-biased, but shows
smaller variance than cross-validation
0.632 bootstrap proved to be the best
Also need to consider computational cost
Simulation (patient data)
Microarrays from breast tumor samples from 295 patients: 115 good-prognosis, 180 poor-prognosis.
Use log-ratio gene expression values associated with the top p = 2 and top p = 5 genes. In each case, 1000 observations of size n = 20 and 40 were drawn independently from the pool of 295 microarrays.
Sampling was stratified.
True error for each observation of size n: holdout estimator; the 295 − n sample points not drawn are used as the test set (a good approximation given the large test sample).
Simulation (patient data)
Empirical deviation distribution for selected simulations
(patient data). beta fits, n = 20.
Simulation (patient data)
The observations are not independent, but
only weakly dependent
The results obtained with the patient data
confirm the general conclusions obtained
with the synthetic data
Conclusion
Cross-validation error estimation is much less biased than resubstitution, but has excessive variance. Bootstrap methods provide improved performance with respect to variance, but at a high computational cost and often with increased bias (though much less than resubstitution).
My own opinion
Since the universal distribution of the training sample is unknown, the true error can only be defined with respect to the training sample. So if the number of training samples is very small, or the sampling method used to obtain them is not carried out correctly, the training samples may fail to represent the population. In that case, classifiers and error estimates based on such a small number of samples cannot provide useful information about the underlying classification problem.
Outlier-sum statistic method
Robert Tibshirani, Trevor
Hastie
Background
What is an outlier?
Common methods to detect outliers
Outliers in cancer gene study
T-statistic in outlier study
COPA (Cancer Outlier Profile Analysis)
What is an Outlier?
Definition: An outlier is an unusual value in
a dataset; one that does not fit the typical
pattern of data
Sources of outliers:
Recording or measurement errors
Natural variation of the data (valid data)
Outlier analysis
Issues:
If an outlier is a true error and is not dealt with, results can be severely biased.
If an outlier is valid data and is removed, valuable information regarding important patterns in the data is lost.
Objective: Identify outliers, then decide
how to deal with them.
Outlier detection
Visual inspection of data – not applicable
for large complex datasets
Automated methods:
Normal distribution-based method
Median Absolute Deviation
Distance-based method
…
Normal distribution-based method
Works on one variable at a time: X_k, k = 1, …, p.
Assume a normal distribution for each variable.
Algorithm:
The i-th observation's value for variable X_k (i = 1, …, n) is x_ik.
Sample mean for variable X_k: X̄_k = (1/n) Σ_i x_ik
Sample standard deviation for X_k: S_k = sqrt(Σ_i (x_ik − X̄_k)^2 / (n − 1))
Calculate z_ik for each i = 1, …, n: z_ik = (x_ik − X̄_k) / S_k
Label x_ik an outlier if |z_ik| > 3; about 0.25% of values will be labeled if the normality assumption is correct.
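A plain-Python sketch of this rule. Note that with the n − 1 denominator, |z_ik| can never exceed (n − 1)/√n, so very small samples cannot trigger the |z| > 3 cutoff at all; the hypothetical data below therefore uses n = 31.

```python
# z-score outlier rule: z = (x - mean)/sd, flag |z| > 3.

from statistics import mean, stdev

def zscore_outliers(xs, cutoff=3.0):
    m, s = mean(xs), stdev(xs)  # stdev uses the n-1 denominator
    return [x for x in xs if abs((x - m) / s) > cutoff]

sample = [1.0, -1.0] * 15 + [15.0]  # 30 inliers plus one clear outlier
print(zscore_outliers(sample))  # [15.0]
```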
Normal distribution-based method
Very dependent on the assumption of normality.
X̄_k and S_k are themselves not robust to outliers:
Many positive outliers inflate X̄_k, shrinking the z_ik.
Many outliers inflate S_k, also shrinking the z_ik.
The z_ik values are therefore small when there are real outliers in the data, so fewer outliers will be detected.
Handles only numeric-valued variables (the same holds for the other methods).
Robust Normal Method
Deals with the robustness problem.
Same as the normal distribution method, but:
Use the trimmed mean or median instead of X̄_k.
Use the trimmed standard deviation instead of S_k.
Calculate z_ik = (x_ik − X̄_k^R) / S_k^R and still use the |z_ik| > 3 labeling rule (the R superscript represents robust versions of the mean and standard deviation).
Median Absolute Deviation (MAD)
Another method for dealing with the robustness problem.
Use the median as a robust estimate of the mean.
Use the MAD as a robust estimate of the standard deviation.
Calculate D_ik (i = 1, …, n): D_ik = |x_ik − median(X_k)|
Calculate the MAD: MAD = median(D_1k, …, D_nk)
Calculate the modified z_ik value: z_ik = (x_ik − median(X_k)) / (1.4826 × MAD)
Label x_ik as an outlier if |z_ik| > 3.5.
Note: the factor 1.4826 is used because E[1.4826 × MAD] = σ under normality.
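A sketch of the MAD rule on hypothetical data. On this sample the extreme value inflates the classical mean and standard deviation so much that its ordinary z-score is only about 2.8 (masked by the |z| > 3 rule), while the MAD-based score flags it easily.

```python
# MAD-based rule: z = (x - median)/(1.4826*MAD), flag |z| > 3.5.

from statistics import median

def mad_outliers(xs, cutoff=3.5):
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    return [x for x in xs if abs(x - med) / (1.4826 * mad) > cutoff]

data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.9, 10.1, 50.0]
print(mad_outliers(data))  # [50.0]
```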
Distance Based Method
Non-parametric (no assumption of normality).
Multidimensional: detects outliers across all attributes at once (instead of one attribute at a time).
Algorithm:
Calculate the distance between all pairs of observations; the Euclidean distance from observation i to j is
d_ij = sqrt(Σ_{k=1}^p (x_ik − x_jk)^2)
Label observation i an outlier if fewer than r% of the total observations are within distance d of i (r and d are parameters).
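A brute-force sketch of the distance-based rule on hypothetical 2-D points; the O(n²) pairwise loop is exactly why the next slide calls the method computationally intensive.

```python
# Distance-based rule: observation i is an outlier if fewer than r% of all
# observations lie within Euclidean distance d of it.

from math import dist

def distance_outliers(points, d, r):
    n = len(points)
    out = []
    for i, p in enumerate(points):
        near = sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= d)
        if near / n < r / 100:
            out.append(p)
    return out

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (9, 9)]
print(distance_outliers(pts, d=2.0, r=50))  # [(9, 9)]
```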
Distance Based Method
Computationally intensive, particularly for larger and larger samples.
The time required for each distance calculation grows with the number of attributes.
The choice of d and r is not obvious; trying different values for a particular dataset further increases the computation.
Outliers in cancer studies
In cancer studies, mutations can often amplify or turn off gene expression in only a minority of samples, i.e., produce outliers.
The t-statistic may yield a high false discovery rate (FDR) when trying to detect changes that occur in a small number of samples.
COPA & PPST have been developed.
Is there a better method?
t-statistic method
For two normal distributions N(θ1, σ²) and N(θ2, σ²), the standardized estimate of (θ2 − θ1) follows a t distribution.
Algorithm:
Let X_ij be the expression value for gene i and sample j, with two groups: 1 = normal, 2 = disease.
Compute a two-sample t-statistic T_i for each gene:
T_i = (x̄_i2 − x̄_i1) / s_i
Here x̄_ik is the mean of gene i in group k and s_i is the pooled within-group standard deviation of gene i.
Call a gene significant if |T_i| exceeds some threshold c.
Use permutations of the sample labels to estimate the false discovery rate (FDR) for different c.
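The per-gene statistic can be sketched directly. Here s_i is taken as the pooled within-group standard deviation scaled by the usual sqrt(1/n1 + 1/n2) factor (an assumption about the exact convention); the two toy groups are hypothetical.

```python
# Two-sample t-statistic for one gene: T = (xbar2 - xbar1)/s, with s the
# pooled within-group sd times sqrt(1/n1 + 1/n2).

from math import sqrt
from statistics import mean

def t_stat(group1, group2):
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    s = sqrt(ss / (n1 + n2 - 2)) * sqrt(1 / n1 + 1 / n2)
    return (m2 - m1) / s

normal = [1.0, 1.2, 0.8, 1.1, 0.9]
disease = [2.0, 2.2, 1.8, 2.1, 1.9]
print(round(t_stat(normal, disease), 2))  # 10.0
```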
t-statistic method
The t-statistic method is a normal-distribution-based method, and so is strongly affected by outliers.
The t-statistic has no procedure for dealing with outliers.
It is not well suited to cancer studies where mutations occur in only a minority of samples.
COPA
Cancer Outlier Profile Analysis
Algorithm:
Gene expression values are median-centered, setting each gene's median expression value to zero.
The median absolute deviation (MAD) is calculated and scaled to 1 by dividing each gene expression value by its MAD.
The 75th, 90th, and 95th percentiles of the transformed expression values are tabulated for each gene; genes are then rank-ordered by their percentile scores, providing a prioritized list of outlier profiles.
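The COPA transform for a single gene can be sketched as below. The nearest-rank percentile rule is a simplifying assumption (COPA tabulates the 75th/90th/95th percentiles); the flat and spiked expression vectors are hypothetical.

```python
# COPA sketch for one gene: median-center, scale by MAD, then read off a
# high percentile (here the 90th) as the outlier-profile score.

from statistics import median

def copa_score(values, pct=90):
    med = median(values)
    mad = median(abs(v - med) for v in values)
    transformed = sorted((v - med) / mad for v in values)
    idx = min(len(transformed) - 1, int(round(pct / 100 * len(transformed))))
    return transformed[idx]  # crude nearest-rank percentile

flat = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0]
spiked = flat[:-2] + [6.0, 7.0]  # outlier expression in a couple of samples
print(copa_score(spiked) > copa_score(flat))  # True
```

Because the centering and scaling use the median and MAD, the two spiked values barely move the transform's baseline and survive to dominate the percentile score.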
COPA
median and MAD were
used for transformation
as opposed to mean and
standard deviation so that
outlier expression values
do not unduly influence
the distribution estimates,
and are thus preserved
post-normalization
Outlier-sum statistic
Idea: improve performance when "abnormal" gene expressions occur in only a small number of samples.
Propose another method besides COPA.
Compare it with COPA.
Outlier-sum statistic
Algorithm
Let med_i and mad_i be the median and median absolute deviation of the values for gene i.
Standardize each gene: x'_ij = (x_ij − med_i) / mad_i
Let q_i(r) be the r-th percentile of the x'_ij values for gene i, and let the interquartile range be IQR(i) = q_i(75) − q_i(25). Values greater than the limit q_i(75) + IQR(i) are defined to be outliers.
The outlier-sum statistic is defined as the sum of the values in the disease group that are beyond this limit:
W_i = Σ_{j ∈ C2} x'_ij I[x'_ij > q_i(75) + IQR(i)]
In real applications, one might expect negative as well as positive outliers. Hence define
W'_i = Σ_{j ∈ C2} x'_ij I[x'_ij < q_i(25) − IQR(i)]
Set the outlier-sum to the larger of W_i and W'_i in absolute value. This is called the "two-sided outlier-sum statistic".
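The two-sided statistic for one gene can be sketched as follows. The crude nearest-rank quartile helper is a simplifying assumption, and the normal/disease expression vectors are hypothetical.

```python
# Two-sided outlier-sum sketch: standardize by median/MAD, then sum the
# disease-group values beyond q75 + IQR (or below q25 - IQR).

from statistics import median

def quartiles(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 4], s[(3 * n) // 4]  # crude q25/q75 for the sketch

def outlier_sum(all_vals, disease_vals):
    med = median(all_vals)
    mad = median(abs(v - med) for v in all_vals)
    std = lambda v: (v - med) / mad
    q25, q75 = quartiles([std(v) for v in all_vals])
    iqr = q75 - q25
    w_hi = sum(std(v) for v in disease_vals if std(v) > q75 + iqr)
    w_lo = sum(std(v) for v in disease_vals if std(v) < q25 - iqr)
    return w_hi if abs(w_hi) >= abs(w_lo) else w_lo

normal = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1]
disease = [0.0, 0.1, -0.05, 5.0, 6.0, 0.05, -0.1, 5.5]  # outliers in a subset
print(outlier_sum(normal + disease, disease) > 0)  # True
```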
Simulation study
Generate 1000 genes and 30 samples, all
values drawn from a standard normal
distribution.
Add 2 units to gene 1 for k of the samples
in the second group.
Compute the p-value and compare the
median, mean and standard deviation of
the p-values between different methods.
Simulation result
For k = 15 (all samples in group 2 differentially expressed), the t-statistic performs best; this continues down to k = 8.
For smaller values of k, the outlier-sum statistic yields better results than COPA and the t-statistic.
Application to the skin data
12625 genes and 58 cancer patients: 14 with radiation sensitivity and 44 without. The group of 44 is used as the normal class.
Apply the outlier-sum statistic within the SAM (Significance Analysis of Microarrays) approach.
Experiment result
The outlier-sum statistic has a lower false discovery rate (FDR) near the right of the plot, but the FDR there may be too high for the method to be useful in practice.
Top 12 genes called by the outlier-sum statistic.
Conclusion
The outlier-sum statistic exhibits better
performance than simple t-statistic
thresholding and COPA when some gene
expressions are unusually high in some
but not all samples.
Otherwise, t-statistic performs well.
My point of view
More test examples are needed to test the theory and see how far this method will work.
In the simulation study, the values are drawn from a standard normal distribution. Will the variance of this distribution affect the simulation result? (Exactly 2 units were added to simulate the abnormal gene expression.)
In the simulation, only one gene was made to contain outliers. What if more than one gene does? In other words, if another gene also exhibits unusually high expression in some samples but is irrelevant to the classification problem, will it affect the outlier-sum statistic?
Reference
U. M. Braga-Neto and E. R. Dougherty, Bioinformatics 20, 374-380.
S. A. Tomlins et al., Science 310, 644-648.
Robert Tibshirani and Trevor Hastie, Biostatistics Advance Access, May 15, 2006.
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer.
Class notes from 'Data Mining', by Paul Maiste.
Introduction to Data Mining, by Pang-Ning Tan et al., Addison-Wesley.
Class notes from 'Machine Learning', by Donald Geman.