7class - Meetup


Applied Data Mining
Basic Decision Trees in R
Myles Gartland
Rockhurst University
April 8, 2016
Data Mining: Concepts and Techniques
To GUI or Not to GUI
• CLI (command-line interface)
• IDE (integrated development environment)
• GUI (graphical user interface)
Classification vs. Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– constructs a model from the training set and the
class labels of a classifying attribute, then uses
the model to classify new data
• Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of each test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test-set samples that are
correctly classified by the model
• The test set must be independent of the training set; otherwise
over-fitting will occur
– If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
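For instance, the accuracy rate on a held-out test set can be computed directly; a small sketch with made-up labels:

```r
# Hypothetical true labels and model predictions for a five-sample test set
actual    <- c("yes", "no", "yes", "yes", "no")
predicted <- c("yes", "no", "no",  "yes", "no")

# Accuracy rate: percentage of test samples the model classifies correctly
accuracy <- mean(actual == predicted) * 100
accuracy  # 80: four of the five test samples match
```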
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim
is to establish the existence of classes or clusters in
the data
Process (1): Model Construction

Training data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

A classification algorithm is applied to the training data to construct
the classifier (model), here expressed as a rule:

IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
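The induced rule can be written as a small R function and checked against the training table (a sketch; the data frame below just transcribes the slide):

```r
# Training data from the slide
train <- data.frame(
  name    = c("Mike", "Mary", "Bill", "Jim", "Dave", "Anne"),
  rank    = c("Assistant Prof", "Assistant Prof", "Professor",
              "Associate Prof", "Assistant Prof", "Associate Prof"),
  years   = c(3, 7, 2, 7, 6, 3),
  tenured = c("no", "yes", "yes", "yes", "no", "no")
)

# The classifier induced on the slide: rank = 'professor' OR years > 6
predict_tenured <- function(rank, years) {
  ifelse(rank == "Professor" | years > 6, "yes", "no")
}

# The rule reproduces every training label
all(predict_tenured(train$rank, train$years) == train$tenured)  # TRUE
```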
Process (2): Using the Model in Prediction

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

The classifier is first checked against the testing data, then applied to
unseen data, e.g. (Jeff, Professor, 4) → Tenured?
Decision Tree with rpart
• Recursive partitioning and regression trees.
• Non-trademarked version of CART
• At each node, rules based on variables' values are
evaluated to find the split that best differentiates
observations with respect to the dependent variable
• Once a rule is selected and splits a node into
two, the same process is applied to each
"child" node (i.e. it is a recursive procedure)
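A minimal rpart call, using the built-in iris data as a stand-in for real data (rpart ships with R as a recommended package):

```r
library(rpart)

# Grow a classification tree predicting Species from the four measurements
fit <- rpart(Species ~ ., data = iris, method = "class")

print(fit)                                # the induced split rules
head(predict(fit, iris, type = "class"))  # predicted class labels
```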
• Splitting stops when CART detects no further
gain can be made, or some pre-set stopping
rules are met. (Alternatively, the data are split
as much as possible and then the tree is later
pruned.)
• Each branch of the tree ends in a terminal
node. Each observation falls into one and
exactly one terminal node, and each terminal
node is uniquely defined by a set of rules.
cp (complexity parameter)
• Any split that does not decrease the overall lack
of fit by a factor of cp is not attempted. For
instance, with anova splitting, this means that the
overall R-squared must increase by cp at each
step. The main role of this parameter is to save
computing time by pruning off splits that are
obviously not worthwhile. Essentially, the user
informs the program that any split which does
not improve the fit by cp will likely be pruned off
by cross-validation, and hence the program
need not pursue it.
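As a sketch of how cp is used in practice: over-grow the tree with a small cp, inspect the cross-validated error with printcp(), then prune back:

```r
library(rpart)

# Deliberately over-grow the tree by setting cp very low
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.001))

printcp(fit)  # cross-validated error (xerror) at each candidate cp

# Prune back to the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```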
What is a Decision Tree
The process
• Start with the business question: “What are
the characteristics of people who are likely to
purchase a personal savings account?”
• Build the model
– Create training and validation samples
– Build a functional formula: SA ~ EV
– Build the tree: rpart(formula, data, control)
– Test for accuracy
– Translate the results back into business rules
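The steps above can be sketched end to end; the bank data frame and its columns here are made up for illustration (SA stands in for the savings-account outcome):

```r
library(rpart)
set.seed(42)

# Hypothetical customer data; SA marks whether a savings account was purchased
n <- 200
bank <- data.frame(
  age    = sample(20:70, n, replace = TRUE),
  income = round(rnorm(n, mean = 50, sd = 15)),
  SA     = factor(sample(c("yes", "no"), n, replace = TRUE))
)

# Create training and validation samples (70/30 split)
idx   <- sample(n, 0.7 * n)
train <- bank[idx, ]
valid <- bank[-idx, ]

# Build the tree and test for accuracy on the validation sample
fit  <- rpart(SA ~ ., data = train, method = "class")
pred <- predict(fit, valid, type = "class")
mean(pred == valid$SA)   # validation accuracy
```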
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
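Information gain, the heuristic mentioned above, can be computed by hand for the earlier tenure table (a sketch, using base R only):

```r
# Shannon entropy of a label vector, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

tenured <- c("no", "yes", "yes", "yes", "no", "no")
rank    <- c("Assistant Prof", "Assistant Prof", "Professor",
             "Associate Prof", "Assistant Prof", "Associate Prof")

# Gain(rank) = H(tenured) - weighted average of H within each rank value
h_parent <- entropy(tenured)
h_cond   <- sum(sapply(split(tenured, rank),
                       function(s) length(s) / length(tenured) * entropy(s)))
gain <- h_parent - h_cond
gain   # about 0.21 bits gained by splitting on rank
```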
Random Forest
• A random forest is an ensemble learning
method for classification (and regression)
that operates by constructing a multitude of
decision trees at training time and
outputting the class that is the mode of the
classes output by the individual trees.
Random Forest Limits
• Missing data are not allowed (remove or impute them first)
• Predictors cannot have more than 32 levels
• rf.model <- randomForest(DV ~ ., data = ,
  importance = TRUE, proximity = TRUE)
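A runnable version of the call above, on the built-in iris data with Species as the DV; this assumes the randomForest package is installed (install.packages("randomForest")):

```r
library(randomForest)
set.seed(42)

# importance and proximity matrices are stored for later inspection
rf.model <- randomForest(Species ~ ., data = iris,
                         importance = TRUE, proximity = TRUE)

print(rf.model)        # OOB error estimate and confusion matrix
importance(rf.model)   # per-variable importance measures
```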