
Cross-validation, Revisiting Regression – local models, and non-parametric…
Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 12a, April 19, 2016
1
coleman
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev     Y
1    3.83    28.87    7.20      26.6      6.19 37.01
2    2.89    20.10  -11.71      24.4      5.17 26.51
3    2.86    69.05   12.32      25.7      7.04 36.51
4    2.92    65.40   14.28      25.7      7.10 40.70
5    3.06    29.59    6.31      25.4      6.15 37.10
6    2.07    44.82    6.16      21.6      6.41 33.90
2
Cross-validation package cvTools
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+        folds = folds, costArgs = list(trim = 0.1))
             CV
 [1,] 0.9880672
 [2,] 0.9525881
 [3,] 0.8989264
 [4,] 1.0177694
 [5,] 0.9860661
 [6,] 1.8369717
 [7,] 0.9550428
 [8,] 1.0698466
 [9,] 1.3568537
[10,] 0.8313474
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
3
Evaluating?
> cvFits
5-fold CV results:
   Fit       CV
1   LS 1.674485
2   MM 1.147130
3  LTS 1.291797

Best model:
  CV
"MM"
4
LS, LTS, MM?
• The breakdown value of an estimator is defined as
the smallest fraction of contamination that can
cause the estimator to take on values arbitrarily far
from its value on the uncontaminated data.
• The breakdown value of an estimator can be used
as a measure of the robustness of the estimator.
Rousseeuw and Leroy (1987) and others introduced
high breakdown value estimators for linear
regression.
• LTS – see http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_rreg_sect018.htm#statug.rreg.robustregfltsest
• MM – see http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_rreg_sect019.htm
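A hedged sketch of how a comparison like the cvFits object on the previous slide can be built with cvTools (the slides do not show this code; function and argument names follow the cvTools documentation, and the fold seed is arbitrary):
library(robustbase)   # lmrob() and ltsReg()
library(cvTools)
data(coleman)
set.seed(1234)
folds <- cvFolds(nrow(coleman), K = 5, R = 10)   # reuse the same folds for every fit
# LS: ordinary least squares
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)
# MM: MM-estimator
fitLmrob <- lmrob(Y ~ ., data = coleman)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)
# LTS: least trimmed squares
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)
# combine; printing the result reports the best model, as on the previous slide
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits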
5
50 and 75% subsets
# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
6
cvFitsLts (50/75)
> cvFitsLts
5-fold CV results:
   Fit reweighted      raw
1  0.5   1.291797 1.640922
2 0.75   1.065495 1.232691

Best model:
reweighted        raw
    "0.75"     "0.75"
7
Tuning
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
                        tuning = tuning, cost = rtmspe, folds = folds,
                        costArgs = list(trim = 0.1))
8
cvFitsLmrob
5-fold CV results:
  tuning.psi       CV
1       3.14 1.179620
2       3.44 1.156674
3       3.88 1.169436
4       4.68 1.133975

Optimal tuning parameter:
   tuning.psi
CV       4.68
9
Lab on Friday
library(boot)                      # cv.glm() and glm.diag()
data(mammals, package = "MASS")    # the mammals brain/body data
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] 0.4918650 0.4916571
> (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] 0.4967271 0.4938003
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2 / (1 - mammals.diag$h)^2))
[1] 0.491865
10
Cost functions, etc.
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable
# an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] 0.1886792 0.1886792
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] 0.2264151 0.2228551
11
cvTools
• http://cran.r-project.org/web/packages/cvTools/cvTools.pdf
• Very powerful and flexible package for CV
(regression) but very much a black box!
• If you use it, become very, very familiar with
the outputs and be prepared to experiment…
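Because it is a black box, it helps to start from the generic cvFit() interface and look at everything it returns; a minimal sketch (not from the slides), with arguments as documented in cvTools:
library(cvTools)
data(coleman, package = "robustbase")
set.seed(1234)
# pass the fitting function, its formula/data, the observed response and a loss function
cv <- cvFit(lm, formula = Y ~ ., data = coleman, y = coleman$Y,
            cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1))
cv        # the summary CV estimate
cv$reps   # the individual replications behind it - inspect these, not just the summary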
12
Diamonds
require(ggplot2)   # or load the package first
data(diamonds)
head(diamonds)     # look at the data!
ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x=price)) + geom_vline(xintercept=12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
13
ggplot(diamonds, aes(clarity)) +
geom_freqpoly(aes(group = cut, colour = cut))
14
bodyfat
library(mboost)                        # glmboost(), gamboost(), cvrisk()
data("bodyfat", package = "TH.data")   # assumption: in current packages the bodyfat data ship with TH.data
## regular linear model using three variables
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
## Estimate same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
# We consider all available variables as potential predictors.
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
# or one could essentially call:
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"])             ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+"))) ## build formula
15
Compare linear models
> coef(lm1)
(Intercept)     hipcirc kneebreadth    anthro3a
-75.2347840   0.5115264   1.9019904   8.9096375
> coef(glm1, off2int = TRUE)   ## off2int adds the offset to the intercept
(Intercept)     hipcirc kneebreadth    anthro3a
-75.2073365   0.5114861   1.9005386   8.9071301
Conclusion?
16
> fm
DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth +
    anthro3a + anthro3b + anthro3c + anthro4
> coef(glm2, which = "") ## select all
 (Intercept)          age    waistcirc      hipcirc elbowbreadth  kneebreadth
 -98.8166077    0.0136017    0.1897156    0.3516258    0.3841399    1.7365888
    anthro3a     anthro3b     anthro3c      anthro4
   3.3268603    3.6565240    0.5953626    0.0000000
attr(,"offset")
[1] 30.78282
17
> gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat,
+                  control = boost_control(trace = TRUE))
[ 1] .................................................. -- risk: 515.5713
[ 53] ..............................................
Final risk: 460.343
> set.seed(123)       ## set seed to make results reproducible
> cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation
18
> cvm
Cross-validated Squared Error (Regression)
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
control = boost_control(trace = TRUE))
        1         2         3         4         5         6         7         8         9        10
109.44043  93.90510  80.59096  69.60200  60.13397  52.59479  46.11235  40.80175  36.32637  32.66942
       11        12        13        14        15        16        17        18        19        20
 29.66258  27.07809  24.99304  23.11263  21.55970  20.40313  19.16541  18.31613  17.59806  16.96801
       21        22        23        24        25        26        27        28        29        30
 16.48827  16.07595  15.75689  15.47100  15.21898  15.06787  14.96986  14.86724  14.80542  14.74726
       31        32        33        34        35        36        37        38        39        40
 14.68165  14.68648  14.64315  14.67862  14.68193  14.68394  14.75454  14.80268  14.81760  14.87570
       41        42        43        44        45        46        47        48        49        50
 14.90511  14.92398  15.00389  15.03604  15.07639  15.10671  15.15364  15.20770  15.23825  15.30189
       51        52        53        54        55        56        57        58        59        60
 15.31950  15.35630  15.41134  15.46079  15.49545  15.53137  15.57602  15.61894  15.66218  15.71172
       61        62        63        64        65        66        67        68        69        70
 15.72119  15.75424  15.80828  15.84097  15.89077  15.90547  15.93003  15.95715  15.99073  16.03679
       71        72        73        74        75        76        77        78        79        80
 16.06174  16.10615  16.12734  16.15830  16.18715  16.22298  16.27167  16.27686  16.30944  16.33804
       81        82        83        84        85        86        87        88        89        90
 16.36836  16.39441  16.41587  16.43615  16.44862  16.48259  16.51989  16.52985  16.54723  16.58531
       91        92        93        94        95        96        97        98        99       100
 16.61028  16.61020  16.62380  16.64316  16.64343  16.68386  16.69995  16.73360  16.74944  16.75756
Optimal number of boosting iterations: 33
19
> mstop(cvm) ## extract the optimal mstop
[1] 33
> gam2[ mstop(cvm) ] ## set the model automatically to the optimal mstop
Model-based Boosting
Call:
gamboost(formula = DEXfat ~ ., data = bodyfat, baselearner = "bbs",
control = boost_control(trace = TRUE))
Squared Error (Regression)
Loss function: (y - f)^2
Number of boosting iterations: mstop = 33
Step size: 0.1
Offset: 30.78282
Number of baselearners: 9
20
plot(cvm)
21
> names(coef(gam2)) ## displays the selected base-learners at iteration 30
[1] "bbs(waistcirc, df = dfbase)" "bbs(hipcirc, df = dfbase)" "bbs(kneebreadth, df = dfbase)"
[4] "bbs(anthro3a, df = dfbase)" "bbs(anthro3b, df = dfbase)" "bbs(anthro3c, df = dfbase)"
[7] "bbs(anthro4, df = dfbase)"
> gam2[1000, return = FALSE] # return = FALSE just suppresses "print(gam2)"
[ 101] .................................................. -- risk: 423.9261
[ 153] .................................................. -- risk: 397.4189
[ 205] .................................................. -- risk: 377.0872
[ 257] .................................................. -- risk: 360.7946
[ 309] .................................................. -- risk: 347.4504
[ 361] .................................................. -- risk: 336.1172
[ 413] .................................................. -- risk: 326.277
[ 465] .................................................. -- risk: 317.6053
[ 517] .................................................. -- risk: 309.9062
[ 569] .................................................. -- risk: 302.9771
[ 621] .................................................. -- risk: 296.717
[ 673] .................................................. -- risk: 290.9664
[ 725] .................................................. -- risk: 285.683
[ 777] .................................................. -- risk: 280.8266
[ 829] .................................................. -- risk: 276.3009
[ 881] .................................................. -- risk: 272.0859
[ 933] .................................................. -- risk: 268.1369
[ 985] ..............
Final risk: 266.9768
22
> names(coef(gam2)) ## displays the selected base-learners, now at iteration 1000
[1] "bbs(age, df = dfbase)"          "bbs(waistcirc, df = dfbase)"   "bbs(hipcirc, df = dfbase)"
[4] "bbs(elbowbreadth, df = dfbase)" "bbs(kneebreadth, df = dfbase)" "bbs(anthro3a, df = dfbase)"
[7] "bbs(anthro3b, df = dfbase)"     "bbs(anthro3c, df = dfbase)"    "bbs(anthro4, df = dfbase)"
> glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat,
+                  family = QuantReg(tau = 0.5), control = boost_control(mstop = 500))
> coef(glm3, off2int = TRUE)
(Intercept)     hipcirc kneebreadth    anthro3a
-63.5164304   0.5331394   0.7699975   7.8350858
23
More local methods…
24
Why local?
25
Sparse?
26
Remember this one?
[Scatterplot of log(bronx$SALE.PRICE) (y-axis, roughly 0–15) against log(bronx$GROSS.SQUARE.FEET) (x-axis, roughly 6–14)]
How would you apply local methods here?
27
SVM-type
• One-class-classification: this model tries to
find the support of a distribution and thus
allows for outlier/novelty detection;
• epsilon-regression: here, the data points lie in
between the two borders of the margin which
is maximized under suitable conditions to
avoid outlier inclusion;
• nu-regression: with analogous modifications of
the regression model as in the classification
case.
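These three formulations correspond to the type argument of svm() in e1071; a minimal sketch (the iris data here are only an illustration, not from the slides):
library(e1071)
x <- as.matrix(iris[, 1:4])
# one-class classification: learn the support of the distribution (novelty detection)
m.one <- svm(x, y = NULL, type = "one-classification", nu = 0.1)
# epsilon- and nu-regression on a single predictor
m.eps <- svm(Petal.Width ~ Petal.Length, data = iris, type = "eps-regression", epsilon = 0.1)
m.nu  <- svm(Petal.Width ~ Petal.Length, data = iris, type = "nu-regression", nu = 0.5)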
28
Reminder SVM and margin
29
Loss functions…
[Plots of the loss functions used for classification, outlier detection, and regression]
30
Regression
• By using a different loss function called the ε-insensitive loss function,
||y − f(x)||_ε = max{0, ||y − f(x)|| − ε}, SVMs can also perform regression.
• This loss function ignores errors that are smaller than a certain
threshold ε > 0, thus creating a tube around the true output.
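A hedged version of the "lm v. svm" comparison on the next two slides, using the built-in cars data (an assumption, since the slides' own data are not shown); epsilon sets the half-width of the insensitive tube:
library(e1071)
fit.lm  <- lm(dist ~ speed, data = cars)
fit.svm <- svm(dist ~ speed, data = cars, type = "eps-regression",
               kernel = "radial", cost = 1, epsilon = 0.1)
plot(dist ~ speed, data = cars)
abline(fit.lm, col = "blue")                                # global linear fit
ord <- order(cars$speed)
lines(cars$speed[ord], predict(fit.svm)[ord], col = "red")  # SVM regression curve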
31
Example lm v. svm
32
33
Again SVM in R
• e1071 – the svm() function in e1071 provides
a rigid interface to libsvm along with
visualization and parameter tuning methods.
• kernlab features a variety of kernel-based
methods and includes a SVM method based
on the optimizers used in libsvm and bsvm
• Package klaR includes an interface to
SVMlight, a popular SVM implementation that
additionally offers classification tools such as
Regularized Discriminant Analysis.
• svmpath – you get the idea…
34
Knn is local – right?
• Nearest neighbors is a simple algorithm that
stores all available cases and predicts the
numerical target based on a similarity measure
(e.g., a distance function).
• KNN has been used in statistical estimation
and pattern recognition as a non-parametric
technique since the early 1970s.
35
Distance…
• A simple implementation of KNN regression
is to calculate the average of the numerical
target of the K nearest neighbors.
• Another approach uses an inverse distance
weighted average of the K nearest
neighbors. Choosing K!
• KNN regression uses the same distance
functions as KNN classification.
• knn.reg (in FNN) and also kknn
• http://cran.r-project.org/web/packages/kknn/kknn.pdf
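A minimal KNN-regression sketch assuming knn.reg() from the FNN package (the slide names the function but no data; cars is used purely for illustration):
library(FNN)
x <- scale(cars$speed)                      # scale predictors before distance-based methods
# omitting 'test' gives leave-one-out predictions of the numeric target
fit5 <- knn.reg(train = x, y = cars$dist, k = 5)
fit5$PRESS                                  # leave-one-out prediction sum of squares
# "Choosing K!": compare a few values of k on the same criterion
sapply(c(2, 5, 10), function(k) knn.reg(train = x, y = cars$dist, k = k)$PRESS)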
36
Classes of local regression
• Locally (weighted) scatterplot smoothing
– LOESS
– LOWESS
• Fitting is done locally: the fit at a point x is
made using points in a neighborhood of x,
weighted by their distance from x (with
differences in 'parametric' variables being
ignored when computing the distance)
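A short loess() sketch on the built-in cars data (an assumption; the deck's own example data are not shown), previewing the span and family choices discussed on the next slides:
fit.ls  <- loess(dist ~ speed, data = cars, span = 0.75)                        # weighted least squares
fit.sym <- loess(dist ~ speed, data = cars, span = 0.75, family = "symmetric")  # M-estimation, more resistant
plot(dist ~ speed, data = cars)
ord <- order(cars$speed)
lines(cars$speed[ord], fitted(fit.ls)[ord], col = "red")
lines(cars$speed[ord], fitted(fit.sym)[ord], col = "blue", lty = 2)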
37
38
Classes of local regression
• The size of the neighborhood is controlled by
α (set by span).
• For α < 1, the neighbourhood includes
proportion α of the points, and these have
tricubic weighting (proportional to
(1 - (dist/maxdist)^3)^3).
• For α > 1, all points are used, with the
'maximum distance' assumed to be α^(1/p) times
the actual maximum distance for p explanatory
variables.
39
Classes of local regression
• For the default family, fitting is by (weighted)
least squares. For family="symmetric" a few
iterations of an M-estimation procedure with
Tukey's biweight are used.
• Be aware that as the initial value is the
least-squares fit, this need not be a very resistant
fit.
• It can be important to tune the control list to
achieve acceptable speed.
40
Friedman (supsmu in modreg)
• is a running lines smoother which chooses
between three spans for the lines.
• The running lines smoothers are symmetric,
with k/2 data points each side of the predicted
point, and values of k as 0.5 * n, 0.2 * n and
0.05 * n, where n is the number of data
points.
• If span is specified, a single smoother with
span span * n is used.
41
Friedman
• The best of the three smoothers is chosen by
cross-validation for each prediction. The best
spans are then smoothed by a running lines
smoother and the final prediction chosen by
linear interpolation.
• “For small samples (n < 40) or if there are
substantial serial correlations between
observations close in x-value, then a
prespecified fixed span smoother (span > 0)
should be used. Reasonable span values are
0.2 to 0.4.”
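A small supsmu() sketch (the modreg functions were long ago merged into stats); the default span = "cv" chooses among the three spans by cross-validation, while a fixed span follows the advice quoted above:
with(cars, {
  plot(speed, dist)
  lines(supsmu(speed, dist), col = "red")                        # spans chosen by cross-validation
  lines(supsmu(speed, dist, span = 0.3), col = "blue", lty = 2)  # prespecified fixed span
})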
42
Local non-param
• lplm (in Rearrangement)
• Local nonparametric method, local linear
regression estimator with box kernel (default),
for conditional mean functions
43
Ridge regression
• Addresses ill-posed regression problems
using filtering approaches (e.g. high-pass)
• Often called “regularization”
• lm.ridge (in MASS)
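A minimal lm.ridge() sketch following the longley example in the MASS documentation; lambda is the regularization parameter:
library(MASS)
fit <- lm.ridge(Employed ~ ., data = longley, lambda = seq(0, 0.1, 0.001))
MASS::select(fit)   # HKB, L-W and GCV suggestions for lambda
plot(fit)           # coefficient paths as lambda varies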
44
• Quantile regression
– is desired if conditional quantile functions are of
interest. One advantage of quantile regression,
relative to the ordinary least squares regression,
is that the quantile regression estimates are more
robust against outliers in the response
measurements.
– In practice we often prefer using different
measures of central tendency and statistical
dispersion to obtain a more comprehensive
analysis of the relationship between variables.
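One widely used implementation is rq() in the quantreg package (an assumption: the slide names no package); a minimal sketch fitting the conditional median and an upper quantile:
library(quantreg)
data(engel)                                               # food expenditure vs. household income
fit50 <- rq(foodexp ~ income, tau = 0.50, data = engel)   # conditional median (robust to outliers in foodexp)
fit90 <- rq(foodexp ~ income, tau = 0.90, data = engel)   # conditional 90th percentile
coef(fit50); coef(fit90)                                  # slopes differ across the distribution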
45
More…
• Partial Least Squares Regression (PLSR)
• mvr (in pls)
• Principal Component Regression (PCR)
• Canonical Powered Partial Least Squares
(CPPLS)
46
• PCR creates components to explain the
observed variability in the predictor variables,
without considering the response variable at
all.
• On the other hand, PLSR does take the
response variable into account, and therefore
often leads to models that are able to fit the
response variable with fewer components.
• Whether or not that ultimately translates into
a better model, in terms of its practical use,
depends on the context.
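A short sketch with the pls package (plsr() and pcr() are front-ends to mvr()), using the yarn data that ships with pls; cross-validated RMSEP makes the "fewer components" point concrete:
library(pls)
data(yarn)        # NIR spectra (predictor matrix) and density (response)
set.seed(1)
fit.plsr <- plsr(density ~ NIR, ncomp = 6, data = yarn, validation = "CV")
fit.pcr  <- pcr(density ~ NIR, ncomp = 6, data = yarn, validation = "CV")
RMSEP(fit.plsr)   # PLSR usually reaches low error with few components ...
RMSEP(fit.pcr)    # ... while PCR ignores the response when building components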
47
Splines
• smooth.spline, splinefun (stats, modreg) and
ns (in splines)
– http://www.inside-r.org/r-doc/splines
• a numeric function that is piecewise-defined
by polynomial functions, and which
possesses a sufficiently high degree of
smoothness at the places where the
polynomial pieces connect (which are known
as knots)
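A minimal sketch of the functions named above: smooth.spline() for smoothing, splinefun() for interpolation through given points, and ns() for a natural-spline basis inside a linear model:
library(splines)
ss <- smooth.spline(cars$speed, cars$dist)        # smoothing spline, df chosen by (generalized) cross-validation
f  <- splinefun(1:10, (1:10)^2)                   # interpolating cubic spline through the points
f(2.5)                                            # evaluate between the knots
fit <- lm(dist ~ ns(speed, df = 4), data = cars)  # natural cubic spline basis in a regression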
48
Splines
• For interpolation, splines are often preferred
to polynomial interpolation - they yield
similar results to interpolating with
higher-degree polynomials while avoiding
instability due to overfitting
• Features: simplicity of their construction, their
ease and accuracy of evaluation, and their
capacity to approximate complex shapes
• Most common: cubic spline, i.e., of order 3—
in particular, cubic B-spline
49
cars
50
Smoothing/ local …
• https://web.njit.edu/all_topics/Prog_Lang_Docs/html/library/modreg/html/00Index.html
• http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
51
Lab on Friday…
• And reminder – Assignment 7 due on Apr. 29
– Friday 5pm
• Next week – mixed models! i.e. optimizing…
• Open lab on Fri. 29th (no new work)….
52