Decision Trees and Random Forest

Decision Tree
Dr. Jieh-Shan George YEH
[email protected]
Decision Tree
• Recursive partitioning is a fundamental tool in
data mining.
• It helps us explore the structure of a set of data,
while developing easy to visualize decision rules
for predicting a categorical (classification tree) or
continuous (regression tree) outcome.
• A decision tree is an algorithm that can handle
continuous or categorical dependent (DV) and
independent variables (IV).
Decision Tree
Advantages to using trees
Simple to understand and interpret.
People are able to understand decision tree models after a
brief explanation.
Requires little data preparation.
Other techniques often require data normalization, the
creation of dummy variables, and the removal of blank
values.
Able to handle both numerical and categorical data.
Advantages to using trees
Uses a white box model.
If a given situation is observable in a model, the explanation
for the condition is easily expressed with Boolean logic.
Possible to validate a model using statistical tests.
That makes it possible to account for the reliability of the
model.
Performs well with large data in a short time.
Some things to consider when coding
the model…
 Splits. Gini or information.
 Type of DV (method). Classification (class), regression (anova),
count (poisson), survival (exp).
 Minimum number of observations required to attempt a split (minsplit).
 Minimum number of observations in a terminal node (minbucket).
 Cross-validation (xval). Used more in model building than in
exploration.
 Complexity parameter (cp). This value is used for pruning. A
smaller tree is less detailed, but may have less error on new data.
(These options map onto rpart() arguments, as sketched below.)
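A minimal sketch, not from the original slides, of how these options map onto an rpart() call; the data frame mydata and outcome y are hypothetical placeholders, and the control values shown are rpart's defaults.
library(rpart)
# Hypothetical data 'mydata' with outcome 'y'; control values are rpart defaults.
fit <- rpart(y ~ ., data = mydata,
             method = "class",              # "class", "anova", "poisson", or "exp"
             parms = list(split = "gini"),  # split criterion: "gini" or "information"
             control = rpart.control(
               minsplit  = 20,   # minimum observations needed to attempt a split
               minbucket = 7,    # minimum observations in a terminal node
               xval      = 10,   # number of cross-validations
               cp        = 0.01  # complexity parameter used for pruning
             ))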
R has many packages for similar/same
endeavors
 party.
 rpart. Comes with R.
 C50.
 Cubist.
 rpart.plot. Makes rpart plots much nicer.
Dataset iris
• The iris dataset has been used for classification in many research
publications. It consists of 50 samples from each of three classes
of iris flowers [Frank and Asuncion, 2010]. One class is linearly
separable from the other two, while the latter are not linearly
separable from each other.
There are five attributes in the dataset:
Sepal.Length in cm,
Sepal.Width in cm,
Petal.Length in cm,
Petal.Width in cm, and
Species: Iris Setosa, Iris Versicolour, and Iris Virginica.
Sepal.Length, Sepal.Width, Petal.Length and Petal.Width
are used to predict the Species of flowers.
str(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
plot(iris, col=iris$Species)
http://cran.r-project.org/web/packages/party/party.pdf
CTREE: CONDITIONAL INFERENCE
TREE
Conditional Inference Trees
Description
Recursive partitioning for continuous, censored, ordered, nominal and
multivariate response variables in a conditional inference framework.
Usage
ctree(formula, data, subset = NULL, weights = NULL, controls =
ctree_control(), xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)
Arguments
formula
a symbolic description of the model to be fit. Note that symbols
like : and - will not work and the tree will make use of all variables
listed on the rhs of formula.
data
a data frame containing the variables in the model.
subset
an optional vector specifying a subset of observations to be used
in the fitting process.
weights
an optional vector of weights to be used in the fitting process. Only
non-negative integer valued weights are allowed.
controls
an object of class TreeControl, which can be obtained
using ctree_control.
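A minimal sketch, not from the slides, of passing controls to ctree() via ctree_control(); the mincriterion and maxdepth values here are illustrative assumptions.
library(party)
# Illustrative settings only: require a stronger test statistic for splitting
# and limit the depth of the tree.
ct <- ctree(Species ~ ., data = iris,
            controls = ctree_control(mincriterion = 0.99, # 1 - p-value that must be exceeded to split
                                     maxdepth = 3))        # maximum tree depth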
• Before modeling, the iris data is split below into
two subsets: training (70%) and test (30%).
• The random seed is set to a fixed value below to
make the results reproducible.
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE,
prob=c(0.7, 0.3))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
install.packages("party")
library(party)
# Species is the target variable; all other
# variables are independent variables.
myFormula <- Species ~ Sepal.Length +
Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data=trainData)
Prediction Table
# check the prediction
table(predict(iris_ctree), trainData$Species)
print(iris_ctree)
Conditional inference tree with 4 terminal nodes

Response:  Species
Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations:  112

1) Petal.Length <= 1.9; criterion = 1, statistic = 104.643
  2)*  weights = 40
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 48.939
    4) Petal.Length <= 4.4; criterion = 0.974, statistic = 7.397
      5)*  weights = 21
    4) Petal.Length > 4.4
      6)*  weights = 19
  3) Petal.Width > 1.7
    7)*  weights = 32
plot(iris_ctree)
plot(iris_ctree, type="simple")
# predict on test data
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)
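As a small hedged addition (not in the slides), the overall test-set accuracy can also be computed directly from the predictions.
# proportion of test observations classified correctly
mean(testPred == testData$Species)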
Issues on ctree()
• The current version of ctree() does not handle
missing values well, in that an instance with a
missing value may sometimes go to the left
sub-tree and sometimes to the right. This
might be caused by surrogate rules.
• When a variable exists in the training data and is
fed into ctree() but does not appear in the
built decision tree, the test data must also
include that variable in order to make predictions.
Otherwise, a call to predict() would fail.
Issues on ctree()
• If the value levels of a categorical variable in the
test data differ from those in the training
data, prediction on the test data would also fail.
• One way to get around the above issue is, after
building a decision tree, to call ctree() to build a new
decision tree with data containing only those
variables existing in the first tree, and to explicitly set
the levels of categorical variables in the test data to the
levels of the corresponding variables in the training data,
as sketched below.
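A minimal sketch of that workaround, assuming a hypothetical categorical variable myVar present in both data sets; the rebuilt formula uses only Petal.Length and Petal.Width, the variables appearing in the tree printed earlier.
# Align factor levels of the test data with the training data (myVar is hypothetical).
testData$myVar <- factor(testData$myVar, levels = levels(trainData$myVar))
# Rebuild the tree with only the variables used by the first tree, then predict.
iris_ctree2 <- ctree(Species ~ Petal.Length + Petal.Width, data = trainData)
testPred2 <- predict(iris_ctree2, newdata = testData)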
More info
#Edgar Anderson's Iris Data
help("iris")
#Conditional Inference Trees
help("ctree")
#Class "BinaryTree"
help("BinaryTree-class")
#Visualization of Binary Regression Trees
help("plot.BinaryTree")
http://cran.r-project.org/web/packages/rpart/rpart.pdf
RPART: RECURSIVE PARTITIONING AND
REGRESSION TREES
Recursive partitioning for classification,
regression and survival trees
data("bodyfat", package="TH.data")
dim(bodyfat)
set.seed(1234)
ind <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
bodyfat.train <- bodyfat[ind==1,]
bodyfat.test <- bodyfat[ind==2,]
# train a decision tree
library(rpart)
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
control = rpart.control(minsplit = 10))
attributes(bodyfat_rpart)
print(bodyfat_rpart$cptable)
print(bodyfat_rpart)
plot(bodyfat_rpart)
text(bodyfat_rpart, use.n=T)
• Select the tree with the minimum prediction
error.
opt <- which.min(bodyfat_rpart$cptable[,"xerror"])
cp <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
print(bodyfat_prune)
plot(bodyfat_prune)
text(bodyfat_prune, use.n=T)
• After that, the selected tree is used to make
predictions, and the predicted values are
compared with the actual labels.
• The call abline(a=0, b=1) draws the diagonal line
y = x. The predictions of a good model are expected
to be equal to or very close to the actual values,
that is, most points should lie on or close to
the diagonal line.
DEXfat_pred <- predict(bodyfat_prune,
newdata=bodyfat.test)
xlim <- range(bodyfat$DEXfat)
plot(DEXfat_pred ~ DEXfat, data=bodyfat.test,
xlab="Observed",
ylab="Predicted", ylim=xlim, xlim=xlim)
abline(a=0, b=1)
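As a hedged addition (not in the slides), the agreement between predicted and observed values can also be summarized numerically.
# correlation between predicted and observed DEXfat
cor(DEXfat_pred, bodyfat.test$DEXfat)
# root mean squared error of the predictions
sqrt(mean((DEXfat_pred - bodyfat.test$DEXfat)^2))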
More info
#Recursive Partitioning and Regression Trees
help("rpart")
#Control for Rpart Fits
help("rpart.control")
#Prediction of Body Fat by Skinfold Thickness, Circumferences, and Bone Breadths
??TH.data::bodyfat
http://cran.r-project.org/web/packages/C50/C50.pdf
C5.0
C50
library(C50)
myFormula <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
iris_C5.0 <- C5.0(myFormula, data=trainData)
summary(iris_C5.0)
C5imp(iris_C5.0)
C5.0testPred <- predict(iris_C5.0, testData)
table(C5.0testPred, testData$Species)
predict(iris_C5.0, testData, type = "prob")
More info
#C5.0 Decision Trees and Rule-Based Models
help("C5.0")
#Control for C5.0 Models
help("C5.0Control")
#Summaries of C5.0 Models
help("summary.C5.0")
#Variable Importance Measures for C5.0 Models
help("C5imp")
http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf
PLOT RPART MODELS. AN ENHANCED
VERSION OF PLOT.RPART
rpart.plot
library(rpart.plot)
data(ptitanic) #Titanic data
tree <- rpart(survived ~ ., data=ptitanic, cp=.02)
# cp=.02 because want small tree for demo
rpart.plot(tree, main="default rpart.plot\n(type = 0, extra = 0)")
prp(tree, main="type = 4, extra = 6", type=4, extra=6,
faclen=0)
# faclen=0 to print full factor names
rpart.plot
rpart.plot(tree, main="extra = 106, under = TRUE",
extra=106, under=TRUE, faclen=0)
# the old way for comparison
plot(tree, uniform=TRUE, compress=TRUE, branch=.2)
text(tree, use.n=TRUE, cex=.6, xpd=NA) # cex is a guess,
# depends on your window size
title("rpart.plot for comparison", cex=.6)
rpart.plot(tree, box.col=3, xflip=FALSE)
More info
#Titanic data with passenger names and other
details removed.
help("ptitanic")
# Plot an rpart model.
help("rpart.plot")
# Plot an rpart model. A superset of rpart.plot.
help("prp")