Transcript CART

CART: Classification and
Regression Trees
Chris Franck
LISA Short Course
March 26, 2013
Outline
• Overview of LISA
• Overview of CART
• Classification tree description
– Examples – iris and skull data.
• Regression tree description
– Examples – simulated and car data
• Going further
– Mention cross validation, pruning, cost-complexity
In addition to CART, these statistical and
practical principals will be discussed
• R programming.
• Importance of exploratory data analysis.
• Use trees to predict outcomes for newly
collected data.
• Graphical Comparison with regression.
• Performance assessment on simulated data.
• Importance of model validation (brief).
Laboratory for Interdisciplinary Statistical
Analysis
LISA helps VT researchers benefit from the use of
Statistics
Collaboration:
Visit our website to request personalized statistical advice and assistance with:
Experimental Design • Data Analysis • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, SPSS...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
Educational Short Courses: Designed to help graduate students apply statistics in their research
Walk-In Consulting: M-F 1-3 PM GLC Video Conference Room for questions requiring <30 mins
Also 3-5 PM Port (Library/Torg Bridge) and 9-11 AM ICTAS Café X
All services are FREE for VT researchers. We assist with research—not class projects or homework.
www.lisa.stat.vt.edu
4
Tree-based methods
• Imagine you have two predictor variables 𝑋1 ,
𝑋2 , and you want to predict the outcome
variable 𝑌.
• 𝑌 may be continuous or categorical.
• The idea behind CART is to split the predictor
space into a number of regions based on
𝑋1 , 𝑋2 such that 𝑦 is well-characterized in each
region.
The above idea is simple, although some of the
language surrounding CART can sound technical.
• Classification: When 𝑦 is categorical, guessing 𝑦 ′ 𝑠
category (using predictors) is referred to as
classification.
• Regression: when 𝑦 is continuous, characterizing
average behavior of y as a function of predictors is
known as regression.
• Recursive binary partitioning: A technique to split the
predictor space: Binary because we split it into two
pieces, recursive because we do it over and over, and
partition means that the resulting regions do not
overlap and they cover entire predictor space.
Imagine the predictor space of 𝑋1 and
𝑋2 as a rectangular box.
• The outcome variable 𝑦
is a third dimension
coming towards you
from the slides.
• We want to split this
rectange up into
regions such that every
regions does a good
job characterizing 𝑦.
How can CART divide the data space?
Example 1 iris data
• In Rstudio, type ‘?iris’ (no quotes) to open the
help file on the iris data. Preceding built-in
data objects or functions with ‘?’ in R opens
the help file.
• Install the ‘tree’ package
– Tools -> Install Packages… -> type ‘tree’ -> click
install
• Open ‘CART course code’
Iris data review
• Exploratory data analysis (e.g. graphical plots) is generally a
good first step when you encounter a new data set.
• We built and snipped a classification tree for irises, which
had a misclassification rate of 2.667% (only 4 incorrect
predictions)
• We only used two predictors. (Would you have guessed
based on EDA?)
• Tree diagrams and wireframes can be useful ways to
visualize the tree.
– Tree diagrams work regardless of the number of 𝑋 variables
used in the tree.
– Wireframes work with 2 𝑋 variables.
Tree splits are chosen to minimize
Deviance at each step
• Statistical deviance generally encapsulates the
difference between two probability measures.
• In CART, deviance measures the heterogeneity
(e.g. misclassification, variability) at each node.
• Classification trees: 𝐷𝑖 = −2 𝑘 𝑛𝑖𝑘 log(𝑝𝑖𝑘 )
• Regression trees: 𝐷𝑖 =
2
(𝑦
−
𝜇
)
𝑗 𝑗
𝑗
– Where 𝑖 is index for terminal nodes, 𝑗 = 1, … , 𝑛 is
index for data, 𝑘 indexes the classes in each leaf. (See
skull, simulation examples for an illustration).
Another example – Tibetan skulls
Description from Hand et. al. (1996).
Skull data review
• We grew another classification tree
• Predicted an outcome based on new data
• Looked at deviance calculation
Under the hood
• CART uses a greedy algorithm.
• At each step the chosen split is the one which.
maximizes classification/ minimizes error.
• Similar to forward variable selection in
regression.
• The splitting continues until nodes become
“too small” or deviance explained by a new
split is small relative to starting deviance (see
‘?tree.control’ for more details)
Final two examples: Regression trees
• Similar to classification trees but for
continuous outcomes.
• Simulated example – when we know the
correct answer, does the method work?
• Motor trend car data
Advantages of CART
• Can be used to characterize outcomes as a
function of many predictors
• Simple, yet powerful.
• Tree can be visualized easily in high
dimension.
• Classification is highly similar to regression in
CART. (more similar than orginary least
squares versus logistic regression in my
opinion).
Caveats of CART
• Trees tend to overfit data
– We saw low classification error rates and good
deviance performance for the data used to
construct the tree.
– Would the trees we built necessarily predict new
irises, skulls, or cars as well?
• Small changes to input data could result in
major changes to tree structure (homework).
Cross validation is typically used to
assess overfitting
• K-fold cross validation: A technique which
assesses the predictive value of a model (tree
in this case) for new data.
– Split the data into k (say 10) parts.
– Withhold one part (validation set), grow the tree
using other 9 parts (training set).
– Assess predictive accuracy on the validation part
using the tree.
– Repeat, holding all 10 parts out in turn.
How big should the tree be?
• Too big will overfit data
• Too small might miss important structures
• Generally, cost-complexity pruning can be
used.
– Make a big tree (say until no node has more than
5 observations)
– Consider all subtrees which can be achieved by
pruning the big tree. Choose tree which satisfies a
cost-complexity criterion (see ?prune.tree in R,
references for more detail).
Random Forests – another tree-based
technique
• Basic idea is to sample data with replacement
(i.e. bootstrap sample).
• 1/3 of each sample is left out.
• 2/3 of data used to build a tree, then
performance of tree determined based on holdout data
• Grow a large number of trees, each of which
“votes” for a certain classification.
• See
http://www.stat.berkeley.edu/~breiman/Random
Forests/cc_home.htm
References
• The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Hastie T, Tibshirani R,
Friedman J. Second Edition 2009.
• Morant GM. A First Study of the Tibetan Skull
Biometrika, Vol. 14, No. 3/4 (Mar., 1923), pp. 193-260
• Discussion of “deviance”
http://stats.stackexchange.com/questions/6581/whatis-deviance-specifically-in-cart-rpart
• Hand DJ, Daly F, McConway K, Lunn D, Ostrowski E. A
Handbook of Small Data Sets, Chapman and Hall 1996.