
Support Vector Machines,
Decision Trees, Cross-validation
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 11a, April 7, 2014
1
Reading?
2
Probabilities…
library(kernlab)
data(promotergene)
## create test and training set
ind <- sample(1:dim(promotergene)[1],20)
genetrain <- promotergene[-ind, ]
genetest <- promotergene[ind, ]
## train a support vector machine
gene <- ksvm(Class~., data=genetrain, kernel="rbfdot",
             kpar=list(sigma=0.015), C=70, cross=4, prob.model=TRUE)
## predict gene type probabilities on the test set
genetype <- predict(gene,genetest,type="probabilities")
3
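A hedged follow-up to the code above: predict() with its default type returns hard class labels rather than probabilities, which can be compared against the held-out labels (object names are the ones defined on this slide).
## follow-up sketch: hard labels instead of probabilities
genepred <- predict(gene, genetest)      # predicted class for each test sequence
table(genepred, genetest$Class)          # confusion matrix against the true classes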
Result
> genetype
               +             -
[1,] 0.205576217 0.794423783
[2,] 0.150094660 0.849905340
[3,] 0.262062226 0.737937774
[4,] 0.939660586 0.060339414
[5,] 0.003164823 0.996835177
[6,] 0.502406898 0.497593102
[7,] 0.812503448 0.187496552
[8,] 0.996382257 0.003617743
[9,] 0.265187582 0.734812418
[10,] 0.998832291 0.001167709
[11,] 0.576491204 0.423508796
[12,] 0.973798660 0.026201340
[13,] 0.098598411 0.901401589
[14,] 0.900670101 0.099329899
[15,] 0.012571774 0.987428226
[16,] 0.977704079 0.022295921
[17,] 0.137304637 0.862695363
[18,] 0.972861575 0.027138425
[19,] 0.224470227 0.775529773
[20,] 0.004691973 0.995308027
4
Glass
library(e1071)
library(rpart)
data(Glass, package="mlbench")
index <- 1:nrow(Glass)
testindex <- sample(index, trunc(length(index)/3))
testset <- Glass[testindex,]
trainset <- Glass[-testindex,]
svm.model <- svm(Type ~ ., data = trainset, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[,-10])
5
> table(pred = svm.pred, true = testset[,10])
true
pred 1 2 3 5 6 7
1 12 9 1 0 0 0
2 6 19 6 5 2 2
3 1 0 2 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 1 0
7 0 1 0 0 0 4
6
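As a hedged follow-up, e1071 also ships classAgreement(), which summarizes a confusion matrix; a minimal sketch using the table above:
tab <- table(pred = svm.pred, true = testset[,10])
classAgreement(tab)$diag   # proportion of test cases on the diagonal, i.e. overall accuracy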
Now what?
# now what happens?
> rpart.model <- rpart(Type ~ ., data = trainset)
> rpart.pred <- predict(rpart.model, testset[,-10], type = "class")
7
General idea behind trees
• Although the basic philosophy of all classifiers based on
decision trees is the same, there are many ways to
construct them.
• Among the key points in selecting an algorithm to build
decision trees, some should be highlighted for their
importance:
– The criterion for choosing the feature to test at each
node (see the Gini sketch below)
– How to compute the partition of the set of examples
– When to decide that a node is a leaf
– The criterion for selecting the class to assign to each
leaf
8
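A minimal sketch of one common choice for the first key point, the Gini impurity used by rpart for classification splits. The gini() helper is ours (not from a package), it assumes the Glass trainset defined earlier, and the 0.335 threshold on Ba is the split rpart itself chooses later in the deck.
gini <- function(y) {              # Gini impurity of a set of class labels
  p <- table(y) / length(y)
  1 - sum(p^2)
}
gini(trainset$Type)                          # impurity before splitting
gini(trainset$Type[trainset$Ba <  0.335])    # impurity in the left child
gini(trainset$Type[trainset$Ba >= 0.335])    # impurity in the right child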
• Decision trees have some important advantages, including:
– They can be applied to any type of data
– The final structure of the classifier is quite simple and can
be stored and handled in a graceful manner
– They handle conditional information very efficiently,
subdividing the space into sub-spaces that are handled
individually
– They are normally robust and insensitive to
misclassification in the training set
– The resulting trees are usually quite understandable and
can easily be used to obtain a better understanding of the
phenomenon in question. This is perhaps the most
important of all the advantages listed
9
Stopping – leaves on the tree
• A number of stopping conditions can be used
to stop the recursive process. The algorithm
stops when any one of the conditions is true
(a sketch of how rpart exposes these thresholds
follows the list):
– All the samples belong to the same class, i.e.
have the same label; the node is already
"pure"
– Stop if most of the points are already of the same
class. This is a generalization of the first
condition, with some error threshold
– There are no remaining attributes on which the
samples may be further partitioned
– There are no samples left for the branch's test attribute
10
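A hedged sketch of how rpart exposes these stopping rules as thresholds; the values shown are just rpart's defaults, not a recommendation, and the Glass trainset from the earlier slides is assumed.
rpart.model2 <- rpart(Type ~ ., data = trainset,
    control = rpart.control(minsplit = 20,   # do not try to split nodes smaller than this
                            minbucket = 7,   # never create a leaf smaller than this
                            cp = 0.01))      # stop when no split improves fit by this factor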
Recursive partitioning
• Recursive partitioning is a fundamental tool in data mining. It
helps us explore the structure of a set of data, while
developing easy to visualize decision rules for predicting a
categorical (classification tree) or continuous (regression
tree) outcome.
• The rpart programs build classification or regression models
of a very general structure using a two stage procedure; the
resulting models can be represented as binary trees.
11
Recursive partitioning
• The tree is built by the following process:
– first the single variable is found which best splits
the data into two groups ('best' will be defined
later). The data is separated, and then this
process is applied separately to each sub-group,
and so on recursively until the subgroups either
reach a minimum size or until no improvement
can be made.
– the second stage of the procedure consists of using
cross-validation to trim back the full tree
(a minimal sketch follows below).
12
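A minimal sketch of the two stages, assuming the Glass trainset from the earlier slides; the same grow-then-prune pattern is applied to the kyphosis tree later in the deck.
full <- rpart(Type ~ ., data = trainset, cp = 0.001)   # stage 1: grow a deliberately large tree
best <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]  # CP with lowest cross-validated error
pruned <- prune(full, cp = best)                       # stage 2: trim back using cross-validation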
Why are we careful doing this?
• Because we will USE these trees, i.e. apply
them to make decisions about what things
are and what to do with them!
13
> printcp(rpart.model)
Classification tree:
rpart(formula = Type ~ ., data = trainset)
Variables actually used in tree construction:
[1] Al Ba Mg RI
Root node error: 92/143 = 0.64336
n= 143
        CP nsplit rel error  xerror     xstd
1 0.206522      0   1.00000 1.00000 0.062262
2 0.195652      1   0.79348 0.92391 0.063822
3 0.050725      2   0.59783 0.63043 0.063822
4 0.043478      5   0.44565 0.64130 0.063990
5 0.032609      6   0.40217 0.57609 0.062777
6 0.010000      7   0.36957 0.51087 0.061056
14
plotcp(rpart.model)
15
> rsq.rpart(rpart.model)
Classification tree:
rpart(formula = Type ~ ., data = trainset)
Variables actually used in tree construction:
[1] Al Ba Mg RI
Root node error: 92/143 = 0.64336
n= 143
        CP nsplit rel error  xerror     xstd
1 0.206522      0   1.00000 1.00000 0.062262
2 0.195652      1   0.79348 0.92391 0.063822
3 0.050725      2   0.59783 0.63043 0.063822
4 0.043478      5   0.44565 0.64130 0.063990
5 0.032609      6   0.40217 0.57609 0.062777
6 0.010000      7   0.36957 0.51087 0.061056
Warning message:
In rsq.rpart(rpart.model) : may not be applicable for this method
16
rsq.rpart
17
> print(rpart.model)
n= 143
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 143 92 2 (0.3 0.36 0.091 0.056 0.049 0.15)
2) Ba< 0.335 120 70 2 (0.35 0.42 0.11 0.058 0.058 0.0083)
4) Al< 1.42 71 33 1 (0.54 0.28 0.15 0.014 0.014 0)
8) RI>=1.517075 58 22 1 (0.62 0.28 0.086 0.017 0 0)
16) RI< 1.518015 21 1 1 (0.95 0 0.048 0 0 0) *
17) RI>=1.518015 37 21 1 (0.43 0.43 0.11 0.027 0 0)
34) RI>=1.51895 25 10 1 (0.6 0.2 0.16 0.04 0 0)
68) Mg>=3.415 18 4 1 (0.78 0 0.22 0 0 0) *
69) Mg< 3.415 7 2 2 (0.14 0.71 0 0.14 0 0) *
35) RI< 1.51895 12 1 2 (0.083 0.92 0 0 0 0) *
9) RI< 1.517075 13 7 3 (0.15 0.31 0.46 0 0.077 0) *
5) Al>=1.42 49 19 2 (0.082 0.61 0.041 0.12 0.12 0.02)
10) Mg>=2.62 33 6 2 (0.12 0.82 0.061 0 0 0) *
11) Mg< 2.62 16 10 5 (0 0.19 0 0.37 0.37 0.062) *
3) Ba>=0.335 23 3 7 (0.043 0.043 0 0.043 0 0.87) *
18
Tree plot
> plot(rpart.model,compress=TRUE)
> text(rpart.model, use.n=TRUE)
19
plot(x, uniform=FALSE, branch=1, compress=FALSE, nspace, margin=0, minbranch=.3, ...)
And if you are brave
summary(rpart.model)
… pages….
20
Wait, did anyone LOOK at the data?
> names(Glass)
[1] "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe"
"Type"
> head(Glass)
RI
Na Mg Al Si K
Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
21
rpart.pred
> rpart.pred
 91 163 138 135 172  20   1 148 169 206 126 157 107  39 150
  1   1   2   1   5   2   1   1   5   7   2   1   7   1   1
203 151 110  73 104  85  93 144 160 145  89 204   7  92  51
  7   2   2   2   1   2   2   2   2   1   2   7   1   2   1
186  14 190  56 184  82 125  12 168 175 159  36 117 114 154
  7   1   7   2   2   2   1   1   5   5   2   1   1   1   1
 62 139   5  18  98  27 183  42  66 155 180  71  83 123  11
  7   2   1   1   1   1   5   1   1   1   5   2   1   2   2
195 101 136  45 130   6  72  87 173 121   3
  7   2   1   1   5   2   1   2   5   1   2
Levels: 1 2 3 5 6 7
22
plot(rpart.pred)
23
Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)
24
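A hedged follow-up: the dendrogram can be cut into a fixed number of groups with cutree(); three groups here is an arbitrary illustrative choice.
groups <- cutree(hs, k = 3)   # assign each canton to one of 3 clusters
table(groups)                 # cluster sizes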
ctree
require(party)
swiss_ctree <- ctree(Fertility ~ Agriculture +
Education + Catholic, data = swiss)
plot(swiss_ctree)
25
26
hclust for iris
27
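The code behind this figure is not on the slide; a minimal sketch that reproduces an hclust dendrogram for iris, clustering on the four numeric measurements only:
diris <- dist(as.matrix(iris[, 1:4]))   # distances on the numeric columns
hiris <- hclust(diris)
plot(hiris, labels = iris$Species, cex = 0.5)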
plot(iris_ctree)
[Figure: plot of iris_ctree — root split on Petal.Length (p < 0.001) at 1.9; the right branch splits on Petal.Width at 1.7 and then Petal.Length at 4.8; terminal Node 2 (n = 50), Node 5 (n = 46), Node 6 (n = 8) and Node 7 (n = 46) are drawn as bar charts of class probabilities (setosa, versicolor, virginica).]
28
Ctree
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, data=iris)
> print(iris_ctree)
Conditional inference tree with 4 terminal nodes
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150
1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
2)* weights = 50
1) Petal.Length > 1.9
3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
5)* weights = 46
4) Petal.Length > 4.8
6)* weights = 8
3) Petal.Width > 1.7
7)* weights = 46
29
> plot(iris_ctree, type="simple")
30
New dataset to work with trees
fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class",
    data=kyphosis)
printcp(fitK) # display the results
plotcp(fitK) # visualize cross-validation results
summary(fitK) # detailed summary of splits
# plot tree
plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")
text(fitK, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis")
# might need to convert to PDF (distill)
31
32
> pfitK<- prune(fitK, cp= fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])
> plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
33
> text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
> post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")
> fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
> plot(fitK, main="Conditional Inference Tree for Kyphosis")
34
> plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")
35
randomForest
> require(randomForest)
> fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
> print(fitKF)
# view results
Call:
randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 20.99%
Confusion matrix:
        absent present class.error
absent      59       5   0.0781250
present     12       5   0.7058824
> importance(fitKF) # importance of each predictor
       MeanDecreaseGini
Age            8.654112
Number         5.584019
Start         10.168591
Random forests improve predictive accuracy by generating a large number of
bootstrapped trees (based on random samples of variables), classifying a case
using each tree in this new "forest", and deciding a final predicted outcome by
combining the results across all of the trees (an average in regression, a
majority vote in classification).
36
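Two hedged follow-ups using functions that ship with randomForest (object names as above):
predKF <- predict(fitKF, kyphosis)     # majority-vote prediction from the 500 trees
table(predKF, kyphosis$Kyphosis)       # resubstitution confusion matrix (optimistic; see OOB above)
getTree(fitKF, k = 1, labelVar = TRUE) # inspect a single bootstrapped tree from the forest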
More on another dataset.
# Regression Tree Example
library(rpart)
# build the tree
fitM <- rpart(Mileage~Price + Country + Reliability + Type,
method="anova", data=cu.summary)
printcp(fitM) # display the results
….
Root node error: 1354.6/60 = 22.576
n=60 (57 observations deleted due to missingness)
        CP nsplit rel error  xerror     xstd
1 0.622885      0   1.00000 1.03165 0.176920
2 0.132061      1   0.37711 0.51693 0.102454
3 0.025441      2   0.24505 0.36063 0.079819
4 0.011604      3   0.21961 0.34878 0.080273
5 0.010000      4   0.20801 0.36392 0.075650
37
Mileage…
plotcp(fitM) # visualize cross-validation results
summary(fitM) # detailed summary of splits
38
par(mfrow=c(1,2))
rsq.rpart(fitM) # visualize cross-validation results
39
# plot tree
plot(fitM, uniform=TRUE, main="Regression Tree for Mileage")
text(fitM, use.n=TRUE, all=TRUE, cex=.8)
# prune the tree
pfitM<- prune(fitM, cp=0.01160389) # from cptable
# plot the pruned tree
plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")
text(pfitM, use.n=TRUE, all=TRUE, cex=.8)
40
?????
41
# Conditional Inference Tree for Mileage
fit2M <- ctree(Mileage~Price + Country +
Reliability + Type, data=na.omit(cu.summary))
42
Example
n <- 150 # number of data points
p <- 2 # dimension
sigma <- 1 # variance of the distribution
meanpos <- 0 # centre of the distribution of positive examples
meanneg <- 3 # centre of the distribution of negative examples
npos <- round(n/2) # number of positive examples
nneg <- n-npos # number of negative examples
# Generate the positive and negative examples
xpos <- matrix(rnorm(npos*p,mean=meanpos,sd=sigma),npos,p)
xneg <- matrix(rnorm(nneg*p,mean=meanneg,sd=sigma),nneg,p)
x <- rbind(xpos,xneg)
# Generate the labels
y <- matrix(c(rep(1,npos),rep(-1,nneg)))
43
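A quick sketch to visualize the simulated data before training; the colours are an arbitrary choice (black for the positive class, red for the negative class).
plot(x, col = ifelse(y > 0, 1, 2), pch = 19, xlab = "x1", ylab = "x2")
legend("topleft", legend = c("positive (+1)", "negative (-1)"), col = c(1, 2), pch = 19)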
Train/ test
ntrain <- round(n*0.8) # number of training examples
tindex <- sample(n,ntrain) # indices of training samples
xtrain <- x[tindex,]
xtest <- x[-tindex,]
ytrain <- y[tindex]
ytest <- y[-tindex]
istrain=rep(0,n)
istrain[tindex]=1
44
Example
svp <- ksvm(xtrain,ytrain,type="C-svc", kernel='vanilladot',
C=100,scaled=c())
> alpha(svp)
[[1]]
[1] 100.00000 100.00000 100.00000 100.00000 58.00009 88.71759
100.00000 69.28250
> svp
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 100
Linear (vanilla) kernel function.
Number of Support Vectors : 8
Objective Function Value : -713.6363
Training error : 0.025
45
Example
> alphaindex(svp)
[[1]]
[1] 2 14 31 35 38 60 68 92
> b(svp)
[1] -4.20671
plot(svp,data=xtrain)
46
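A hedged follow-up: evaluate the trained svp on the held-out test set (objects as defined on the previous slides):
ypred <- predict(svp, xtest)            # predicted labels for the test points
table(ytest, ypred)                     # confusion matrix
sum(ypred == ytest) / length(ytest)     # test-set accuracy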
Cross-validation
• Cross-validation is a model validation
technique for assessing how the results of a
statistical analysis will generalize to an
independent data set.
• It is mainly used in settings where the goal is
prediction, and one wants to estimate how
accurately a predictive model will perform in
practice.
• I.e. predictive and prescriptive analytics…
47
Cross-validation
In a prediction problem, a model is usually
given a dataset of known data on which training
is run (the training dataset), and a dataset of
unknown (previously unseen) data against which
the model is tested (the testing dataset).
Sound familiar?
48
Cross-validation
The goal of cross validation is to define a
dataset to "test" the model in the training phase
(i.e., the validation dataset), in order to limit
problems like overfitting
and to give insight into how the model will
generalize to an independent data set (i.e., an
unknown dataset, for instance from a real
problem), etc.
49
Common types of x-validation
• K-fold (a minimal sketch follows this slide)
• 2-fold (do you know this one?)
• Rep-random-subsample
• Leave-out subsample
• Lab on Friday… to try these out
50
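A minimal k-fold sketch on the simulated (x, y) data from the earlier slides; k = 5 is an arbitrary choice, and kernlab can also do this internally via the cross= argument to ksvm (as in the promotergene example).
k <- 5
folds <- sample(rep(1:k, length.out = nrow(x)))   # random fold assignment
cv.err <- sapply(1:k, function(i) {
  fit <- ksvm(x[folds != i, ], y[folds != i], type = "C-svc",
              kernel = "vanilladot", C = 100, scaled = c())
  mean(predict(fit, x[folds == i, ]) != y[folds == i])   # error on the held-out fold
})
mean(cv.err)   # cross-validated estimate of the error rate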
Admin info (keep/ print this slide)
• Class: ITWS-4963/ITWS 6965
• Hours: 12:00pm-1:50pm Tuesday/ Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
  – Schedule, lectures, syllabus, reading, assignments, etc.
51
Ozone
> library(e1071)
> library(rpart)
> data(Ozone, package="mlbench") # beware of "()"
> ## split data into a train and test set
> index <- 1:nrow(Ozone)
> testindex <- sample(index, trunc(length(index)/3))
> testset <- na.omit(Ozone[testindex,-3])
> trainset <- na.omit(Ozone[-testindex,-3])
> svm.model <- svm(V4 ~ ., data = trainset, cost = 1000, gamma = 0.0001)
> svm.pred <- predict(svm.model, testset[,-3])
> crossprod(svm.pred - testset[,3]) / length(testindex)  # mean squared prediction error
See: http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
52