
Cross-validation, resampling, bootstrapping, bagging, boosting, etc.
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 12a, April 15, 2014
Cross-validation
• Cross-validation is a model validation
technique for assessing how the results of a
statistical analysis will generalize to an
independent data set.
• The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems such as overfitting.
K-fold
• The original sample is randomly partitioned into k equal-sized subsamples.
• Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.
• The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data.
– The k results from the folds can then be averaged (usually) to produce a single estimate. A minimal base-R sketch is shown below.
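A minimal sketch of k-fold cross-validation in base R (not from the original slides; the built-in mtcars data and the lm() formula are arbitrary illustrations):

# assign each row of mtcars to one of k folds at random
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv.errors <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]                 # k-1 folds for training
  test  <- mtcars[folds == i, ]                 # held-out fold for validation
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  cv.errors[i] <- mean((test$mpg - pred)^2)     # fold-level MSE
}
mean(cv.errors)                                 # averaged over the k folds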
Leave out subsample
• As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data.
• i.e., K = n-fold cross-validation
• Leave out > 1 = bootstrapping and jackknifing. A minimal LOOCV sketch using boot::cv.glm is shown below.
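A minimal LOOCV sketch (not from the original slides), assuming the boot package; cv.glm defaults to K = n, i.e., leave-one-out:

library(boot)
# a gaussian glm is an ordinary linear model; mtcars is built-in example data
fit <- glm(mpg ~ wt + hp, data = mtcars)
loocv <- cv.glm(mtcars, fit)        # K defaults to nrow(mtcars): leave-one-out
loocv$delta                         # raw and adjusted cross-validation prediction error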
boot(strapping)
• Generate replicates of a statistic applied to data (parametric and nonparametric).
– For the nonparametric bootstrap, possible resampling methods are: the ordinary bootstrap, the balanced bootstrap, antithetic resampling, and permutation.
• For nonparametric multi-sample problems, stratified resampling is used:
– this is specified by including a vector of strata in the call to boot.
– importance resampling weights may be specified.
A minimal call to boot is sketched below.
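A minimal nonparametric bootstrap sketch (not from the original slides), assuming the boot package and the built-in rivers data:

library(boot)
set.seed(1)
# the statistic must accept the data and a vector of resampled indices
med.fun <- function(data, idx) median(data[idx])
b <- boot(data = rivers, statistic = med.fun, R = 1000)   # 1000 bootstrap replicates
b                            # prints bootstrap estimates of bias and standard error
boot.ci(b, type = "perc")    # percentile bootstrap confidence interval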
Jackknifing
• Systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set.
• From this new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.
• Often the log(variance) is used [instead of the variance], especially for non-normal distributions.
A base-R leave-one-out jackknife sketch is shown below.
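A leave-one-out jackknife sketch in base R (not from the original slides), estimating the bias and variance of the sample median of the built-in rivers data:

x <- rivers
n <- length(x)
theta.hat  <- median(x)                                    # estimate on the full sample
theta.jack <- sapply(1:n, function(i) median(x[-i]))       # recompute leaving one out
bias.jack  <- (n - 1) * (mean(theta.jack) - theta.hat)     # jackknife bias estimate
var.jack   <- (n - 1) / n * sum((theta.jack - mean(theta.jack))^2)  # jackknife variance
c(bias = bias.jack, variance = var.jack)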
Repeat-random-subsample
• Random split of the dataset into training and
validation data.
– For each such split, the model is fit to the training
data, and predictive accuracy is assessed using
the validation data.
• Results are then averaged over the splits.
• Note: for this method the results will vary if the analysis is repeated with different random splits. A minimal base-R sketch is shown below.
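A minimal repeated-random-subsample (Monte Carlo cross-validation) sketch in base R (not from the original slides; the 2/3 training split and the lm() formula are arbitrary choices):

set.seed(1)
R <- 20                                        # number of random splits
errs <- replicate(R, {
  idx  <- sample(nrow(mtcars), size = round(2/3 * nrow(mtcars)))   # training rows
  fit  <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  pred <- predict(fit, newdata = mtcars[-idx, ])
  mean((mtcars$mpg[-idx] - pred)^2)            # validation MSE for this split
})
mean(errs)                                     # average over the splits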
Advantage?
• The advantage of K-fold over repeated
random sub-sampling is that all observations
are used for both training and validation, and
each observation is used for validation
exactly once.
– 10-fold cross-validation is commonly used
• The advantage of repeated random sub-sampling over k-fold cross-validation is that the proportion of the training/validation split does not depend on the number of iterations (folds).
Disadvantage
• The disadvantage of repeated random sub-sampling is that some observations may never be selected for the validation subsample, whereas others may be selected more than once.
– i.e., validation subsets may overlap.
coleman
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev     Y
1    3.83    28.87    7.20      26.6      6.19 37.01
2    2.89    20.10  -11.71      24.4      5.17 26.51
3    2.86    69.05   12.32      25.7      7.04 36.51
4    2.92    65.40   14.28      25.7      7.10 40.70
5    3.06    29.59    6.31      25.4      6.15 37.10
6    2.07    44.82    6.16      21.6      6.41 33.90
Lab11_11_2014.R
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+        folds = folds, costArgs = list(trim = 0.1))
             CV
 [1,] 0.9880672
 [2,] 0.9525881
 [3,] 0.8989264
 [4,] 1.0177694
 [5,] 0.9860661
 [6,] 1.8369717
 [7,] 0.9550428
 [8,] 1.0698466
 [9,] 1.3568537
[10,] 0.8313474
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
Lab11b_12_2014.R
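The calls that produce cvFits are not shown on this slide; a plausible reconstruction (hedged), following the cvTools package examples and reusing the folds defined above, compares LS, MM, and LTS fits on the coleman data:

library(cvTools)
library(robustbase)
# least-squares fit
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)
# MM-type robust fit
fitLmrob <- lmrob(Y ~ ., data = coleman)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)
# least trimmed squares fit
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)
# combine the cross-validation results for comparison
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)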
> cvFits
5-fold CV results:
  Fit       CV
1  LS 1.674485
2  MM 1.147130
3 LTS 1.291797

Best model:
  CV
"MM"
50 and 75% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
   Fit reweighted      raw
1  0.5  1.291797 1.640922
2 0.75  1.065495 1.232691

Best model:
reweighted        raw
    "0.75"     "0.75"
Tuning
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
                        tuning = tuning, cost = rtmspe, folds = folds,
                        costArgs = list(trim = 0.1))
cvFitsLmrob
5-fold CV results:
  tuning.psi       CV
1       3.14 1.179620
2       3.44 1.156674
3       3.88 1.169436
4       4.68 1.133975

Optimal tuning parameter:
   tuning.psi
CV       4.68
Lab11b_18
library(MASS)   # mammals data
library(boot)   # cv.glm() and glm.diag()
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] 0.4918650 0.4916571
> (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] 0.4967271 0.4938003
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))
[1] 0.491865
Lab11b_18
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable
# an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] 0.1886792 0.1886792
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] 0.2264151 0.2228551
Cross validation - resampling
• Estimating the precision of sample statistics
(medians, variances, percentiles) by using
subsets of available data (jackknifing) or
drawing randomly with replacement from a
set of data points (bootstrapping)
• Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests); a minimal sketch is shown below this list.
• Validating models by using random subsets
(bootstrapping, cross validation)
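A minimal two-sample permutation-test sketch in base R (not from the original slides; comparing mean mpg between transmission types in mtcars is purely illustrative):

set.seed(1)
x <- mtcars$mpg[mtcars$am == 1]            # manual transmission
y <- mtcars$mpg[mtcars$am == 0]            # automatic transmission
obs <- mean(x) - mean(y)                   # observed difference in means
pooled <- c(x, y)
perm <- replicate(5000, {
  lab <- sample(length(pooled), length(x)) # randomly re-assign the group labels
  mean(pooled[lab]) - mean(pooled[-lab])
})
mean(abs(perm) >= abs(obs))                # two-sided permutation p-value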
Bootstrap aggregation (bagging)
• Bagging improves the stability and accuracy of machine learning algorithms used in statistical classification and regression (a hand-rolled sketch follows below).
• Also reduces variance and helps to avoid
overfitting.
• Usually applied to decision tree methods, but
can be used with any type of method.
– Bagging is a special case of the model averaging
approach.
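To make the mechanism concrete, a hand-rolled bagging sketch (not from the original slides), assuming the rpart package and the built-in cars data: bootstrap the rows, fit a regression tree to each bootstrap sample, and average the predictions.

library(rpart)
set.seed(1)
B <- 25
fits <- lapply(1:B, function(b) {
  idx <- sample(nrow(cars), replace = TRUE)      # bootstrap sample of the rows
  rpart(dist ~ speed, data = cars[idx, ])        # base learner: a regression tree
})
preds  <- sapply(fits, predict, newdata = cars)  # one column of predictions per tree
bagged <- rowMeans(preds)                        # aggregate by averaging
head(bagged)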
Ozone
[Figure: 10 of 100 bootstrap samples and their average]
• Shows improvements for unstable
procedures (Breiman, 1996): e.g. neural nets,
classification and regression trees, and
subset selection in linear regression
• … can mildly degrade the performance of
stable methods such as K-nearest neighbors
Bagging (BreastCancer)
library(mlbench)
library(adabag)    # provides bagging() and predict.bagging(); attaches rpart
data(BreastCancer)
l <- length(BreastCancer[,1])
sub <- sample(1:l, 2*l/3)
BC.bagging <- bagging(Class ~ ., data = BreastCancer[,-1],
    mfinal = 20, control = rpart.control(maxdepth = 3))
BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub,-1])
BC.bagging.pred$confusion
               Observed Class
Predicted Class benign malignant
      benign       142         2
      malignant      8        81
BC.bagging.pred$error
[1] 0.04291845
A little later
> data(BreastCancer)
> l <- length(BreastCancer[,1])
> sub <- sample(1:l, 2*l/3)
> BC.bagging <- bagging(Class ~ ., data = BreastCancer[,-1], mfinal = 20,
+     control = rpart.control(maxdepth = 3))
> BC.bagging.pred <- predict.bagging(BC.bagging, newdata = BreastCancer[-sub,-1])
> BC.bagging.pred$confusion
               Observed Class
Predicted Class benign malignant
      benign       147         1
      malignant      7        78
> BC.bagging.pred$error
[1] 0.03433476
Bagging (Vehicle)
> data(Vehicle)
> l <- length(Vehicle[,1])
> sub <- sample(1:l, 2*l/3)
> Vehicle.bagging <- bagging(Class ~ ., data = Vehicle[sub, ], mfinal = 40,
+     control = rpart.control(maxdepth = 5))
> Vehicle.bagging.pred <- predict.bagging(Vehicle.bagging, newdata = Vehicle[-sub, ])
> Vehicle.bagging.pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   63   10    8   0
          opel    1   42   27   0
          saab    0   18   30   0
           van    5    7    9  62
> Vehicle.bagging.pred$error
[1] 0.3014184
Weak models …
• A weak learner: a classifier which is only
slightly correlated with the true classification
(it can label examples better than random
guessing)
• A strong learner: a classifier that is arbitrarily
well-correlated with the true classification.
• Can a set of weak learners create a single
strong learner?
Boosting
• … reducing bias in supervised learning
• Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier.
– The weak learners are typically weighted in some way that is usually related to their accuracy.
• After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight.
• Thus, future weak learners focus more on the examples that previous weak learners misclassified. A sketch using adabag's boosting() is shown below.
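As an illustration (not from the original slides), AdaBoost.M1-style boosting of shallow trees via the adabag package, analogous to the bagging calls above; the Vehicle data and the mfinal/maxdepth settings are arbitrary choices:

library(adabag)     # boosting() and predict.boosting(); attaches rpart
library(mlbench)
data(Vehicle)
set.seed(1)
sub <- sample(nrow(Vehicle), 2 * nrow(Vehicle) / 3)
Vehicle.boost <- boosting(Class ~ ., data = Vehicle[sub, ], mfinal = 25,
                          control = rpart.control(maxdepth = 3))
Vehicle.boost.pred <- predict.boosting(Vehicle.boost, newdata = Vehicle[-sub, ])
Vehicle.boost.pred$confusion   # confusion matrix on the held-out third
Vehicle.boost.pred$error       # misclassification error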
Diamonds
require(ggplot2)   # or load package first
data(diamonds)
head(diamonds)     # look at the data!
ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x=price)) + geom_vline(xintercept=12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
[Figure: output of ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))]
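Note: Expensive, modeled on the next slide, is not a column of the stock diamonds data; presumably it was derived earlier, for example from the 12000 price threshold marked in the histogram above. A hypothetical sketch of one such derivation (an assumption, not the author's code):

# hypothetical: flag diamonds above the 12000 price threshold shown by geom_vline above
diamonds$Expensive <- as.numeric(diamonds$price > 12000)
table(diamonds$Expensive)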
Using diamonds… boost (glm)
> library(mboost)   # provides glmboost()
> mglmboost <- glmboost(as.factor(Expensive) ~ ., data = diamonds,
+     family = Binomial(link = "logit"))
> summary(mglmboost)
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = as.factor(Expensive) ~ ., data = diamonds,
    family = Binomial(link = "logit"))
Negative Binomial Likelihood
Loss function: {
f <- pmin(abs(f), 36) * sign(f)
p <- exp(f)/(exp(f) + exp(-f))
y <- (y + 1)/2
-y * log(p) - (1 - y) * log(1 - p)
}
Using diamonds… boost (glm)
> summary(mglmboost) #continued
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.339537
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients
from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept)       carat   clarity.L
 -1.5156330   1.5388715   0.1823241
attr(,"offset")
[1] -1.339537

Selection frequencies:
      carat (Intercept)   clarity.L
       0.50        0.42        0.08
example
cars.gb <- blackboost(dist ~ speed, data = cars,
control = boost_control(mstop = 50))
### plot fit
plot(dist ~ speed, data = cars)
lines(cars$speed, predict(cars.gb), col = "red")
Blackboosting (cf. brown)
Gradient boosting for optimizing arbitrary loss functions where regression
trees are utilized as base-learners.
> cars.gb
Model-based Boosting
Call:
blackboost(formula = dist ~ speed, data = cars, control =
boost_control(mstop = 50))
Squared Error (Regression)
Loss function: (y - f)^2
Number of boosting iterations: mstop = 50
Step size: 0.1
Offset: 42.98
Number of baselearners: 1
Assignment 7
• Later today
Admin info (keep/ print this slide)
• Class: ITWS-4963/ITWS 6965
• Hours: 12:00pm-1:50pm Tuesday/ Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.