Decision Trees


COT5230 Data Mining
Week 7: Decision Trees
Monash University (Australia's International University)
Decision Trees
7.1
Outline
• Why use Decision Trees?
• What is a Decision Tree?
• Examples
• Use as a data mining technique
• Popular Models
– CART
– CHAID
– ID3 & C4.5
Why use Decision Trees? - 1
• Whereas neural networks compute a mathematical function of their inputs to generate their outputs, decision trees use logical rules, e.g.:
IF
  the vehicle has a 2-door FRAME AND
  the vehicle has at least six CYLINDERS AND
  the BUYER is less than 50 years old AND
  the COST of the vehicle is > $30,000 AND
  the vehicle COLOUR is red
THEN
  the BUYER is likely to be MALE
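A rule like this translates directly into code; a minimal sketch, with illustrative field names that are not from the lecture:

```python
def likely_male_buyer(vehicle, buyer_age):
    """Return True when every condition of the rule holds.

    `vehicle` is a dict with hypothetical keys (doors, cylinders,
    cost, colour); `buyer_age` is the buyer's age in years.
    """
    return (vehicle["doors"] == 2
            and vehicle["cylinders"] >= 6
            and buyer_age < 50
            and vehicle["cost"] > 30_000
            and vehicle["colour"] == "red")

# A vehicle matching every condition fires the rule:
car = {"doors": 2, "cylinders": 6, "cost": 35_000, "colour": "red"}
print(likely_male_buyer(car, buyer_age=42))  # True
```

Each conjunct of the IF-part becomes one boolean test, which is why decision-tree output is easy to explain to a customer.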
Why use Decision Trees? - 2
• For some applications, accuracy of classification or prediction is sufficient, e.g.:
– A direct-mail firm needing a model to identify customers who will respond to mail
– Predicting the stock market using past data
• In other applications it is better (sometimes essential) that the decision be explained, e.g.:
– Rejection of a credit application
– Medical diagnosis
• Humans generally require explanations for most decisions
Why use Decision Trees? - 3
• Example: when a bank rejects a credit card application, it is better to explain to the customer that it was because:
– he is not a permanent resident of Australia, AND
– he has been residing in Australia for < 6 months, AND
– he does not have a permanent job.
• This is better than saying:
"We are very sorry, but our neural network thinks that you are not a credit-worthy customer!!!" (in which case the customer might become angry and move to another bank)
What is a Decision Tree? - 1
[Slide shows a diagram of an example decision tree]
What is a Decision Tree? - 2
• Built from the root node (top) to the leaf nodes (bottom)
• A record first enters the root node
• A test is applied to determine which child node it should go to next
– A variety of algorithms for choosing the initial test exist; the aim is to discriminate best between the target classes
• The process is repeated until the record arrives at a leaf node
• The path from the root to a leaf node provides an expression of a rule
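The root-to-leaf traversal described above can be sketched in a few lines; the tree structure and field names below are illustrative, not from the lecture:

```python
def classify(record, node):
    """Walk a record from the root down to a leaf, returning the leaf class.

    An internal node is a dict {"test": fn, "yes": subtree, "no": subtree};
    a leaf is simply the class label (a hypothetical representation).
    """
    while isinstance(node, dict):
        branch = "yes" if node["test"](record) else "no"
        node = node[branch]
    return node

# Tiny illustrative tree: split first on number of doors, then on age.
tree = {
    "test": lambda r: r["doors"] == 2,
    "yes": {"test": lambda r: r["age"] < 50, "yes": "male", "no": "female"},
    "no": "female",
}
print(classify({"doors": 2, "age": 30}, tree))  # male
```

The sequence of "yes" branches taken on the way down is exactly the rule the path expresses.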
Building a Decision Tree - 1
• Algorithms for building decision trees (DTs) begin by trying to find the test which does the "best job" of splitting the data into the desired classes
– The desired classes have to be identified at the start
• Example: we need to describe the car-purchasing profiles of males and females. The DT-building algorithm examines the car purchase details database to find the best splitting criterion among:
– Engine type
– No_of_doors
– Age_of_customer
– Colour
– Cost
– Type
• The DT algorithm may discover that the No_of_doors variable is best for separating males and females
Building a Decision Tree - 2
• The process is repeated to discover the best splitting criterion for the records assigned to each node
[Diagram: the root node splits on No_of_doors (4 vs 2); each child node is then split on Engine type (V6 vs V4)]
• Once built, the effectiveness of a decision tree can be measured by applying it to a collection of previously unseen records and observing the percentage of correctly classified records
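Measuring effectiveness on unseen records can be sketched as follows; the one-rule predictor and the holdout set are made up for illustration:

```python
def accuracy(tree_predict, holdout):
    """Fraction of previously unseen records the tree classifies correctly."""
    correct = sum(1 for record, label in holdout if tree_predict(record) == label)
    return correct / len(holdout)

# Illustrative one-rule "tree" and a holdout set of (record, label) pairs.
predict = lambda r: "male" if r["doors"] == 2 else "female"
holdout = [({"doors": 2}, "male"),
           ({"doors": 4}, "female"),
           ({"doors": 2}, "female")]
print(accuracy(predict, holdout))  # 2 of 3 records classified correctly
```

The key point is that the holdout records must not have been used to build the tree, otherwise the accuracy estimate is optimistic.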
Example - 1
• Requirement: classify customers who churn (do not renew their phone contracts)

Phone Technology (50 churners, 50 non-churners)
├─ new → Time a Customer (30 churners, 50 non-churners)
│   ├─ <= 2.3 years → 5 churners, 40 non-churners
│   └─ > 2.3 years → Age (25 churners, 10 non-churners)
│       ├─ <= 35 → 20 churners, 0 non-churners
│       └─ > 35 → 5 churners, 10 non-churners
└─ old → 20 churners, 0 non-churners
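A new customer can be scored against a tree like this one; a hypothetical function mirroring the example tree (thresholds and leaf counts taken from the slide, the function itself is illustrative):

```python
def churn_risk(phone_tech, tenure_years, age):
    """Follow the example churn tree; return the (churners, non_churners)
    counts observed at the leaf the customer falls into."""
    if phone_tech == "old":
        return (20, 0)            # old phone: everyone in this leaf churned
    if tenure_years <= 2.3:
        return (5, 40)            # new phone, short tenure: mostly loyal
    if age <= 35:
        return (20, 0)            # new phone, long tenure, young: all churned
    return (5, 10)                # new phone, long tenure, older

print(churn_risk("new", tenure_years=3.0, age=30))  # (20, 0): high churn risk
```

The leaf counts double as a crude probability estimate, e.g. 5/(5+40) ≈ 11% churn for short-tenure customers with new phones.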
Example - 2
• The number of records in a given parent node equals the sum of the records contained in its child nodes
• Quite easy to understand how the model is being built (unlike NNs)
• Easy to use the model
– say, for a target marketing campaign
• Provides intuitive ideas about the customer base
– e.g.: "Customers who have been with the company for a couple of years and have new phones are pretty loyal"
Use as a data mining technique - 1
• Exploration
– Analyzing the splitting criteria selected by the algorithm may provide interesting insights which can be acted upon
– e.g. if the following rule was identified:
IF
  time a customer < 1.1 years AND
  sales channel = telesales
THEN chance of churn is 65%
– it might be worthwhile conducting a study on the way the telesales operators are making their calls
Use as a data mining technique - 2
• Exploration (continued)
– Gleaning information from rules that fail
– e.g. from the car example we obtained the rule:
IF
  No_of_doors = 2 AND
  Engine_type >= V6 AND
  Customer_Age > 50
THEN there are almost no customers
– Can this rule be useful?
» A) Perhaps we can attempt to build up this non-existent market. If this is possible, we have an edge over competitors since we have a head start in this knowledge
» B) We can remove these customers from our direct marketing campaign
Use as a data mining technique - 3
• Exploration (continued)
– Again from the car example, we noticed that:
» there was no combination of rules to reliably discriminate between males and females for 4-door V4-engine vehicles
» do we consider this an occasion where it was not possible to achieve our objective?
– From this failure we have learnt that it is not important whether the customer is male or female for this category of vehicles
– Perhaps we were asking the wrong question all along; this warrants further analysis
Use as a data mining technique - 4
• Data pre-processing
– Decision trees are very robust at handling different predictor types (numeric/categorical) and run quickly, so they can be good for a first pass over the data in a data mining operation
– This will create a subset of the possibly useful predictors, which can then be fed into another model, say a neural network
• Prediction
– Once the decision tree is built, it can then be used as a prediction tool by applying it to a set of new data
Popular Decision Tree Models: CART - 1
• CART: Classification & Regression Trees, developed in 1984 by a team of researchers (Leo Breiman et al.) from Stanford University and the University of California, Berkeley
– Used in the DM software Darwin, from Thinking Machines Corporation (recently bought by Oracle)
• Uses an entropy metric to determine the split point (information theory; Shannon 1949)
• Measure of disorder = −Σ p log(p)
  where p is the probability of that prediction value occurring in a particular node of the tree
• CART produces a binary tree
Popular Decision Tree Models: CART - 2
• Consider the "Churn" problem from the earlier example
– At a certain node there are 100 customers to split
– The algorithm will try each predictor using the measure
of disorder (MOD)
– For each predictor the algorithm will calculate the MOD
for several values to identify the optimum
– e.g.: take age as the predictor and split the records according to age <= 32 and age > 32
» 100 total customers
» 30 who churn
» 70 who don't churn
MOD = −(0.3 log₂(0.3) + 0.7 log₂(0.7)) ≈ 0.88
– CART will select the predictor with the lowest MOD as
the split point
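The measure-of-disorder arithmetic above can be checked in a few lines; a minimal sketch using base-2 logarithms, which is the base consistent with the entropy figures elsewhere in the lecture:

```python
import math

def mod(proportions):
    """Measure of disorder: -sum(p * log2(p)) over the class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# 30 churners and 70 non-churners in a 100-customer node:
print(round(mod([0.3, 0.7]), 3))  # 0.881
```

A pure node (all one class) scores 0, and a 50/50 node scores the maximum of 1.0, so lower MOD after a split means a better split.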
Node splitting
A good split

  Left node               Right node
  Name    Churned?        Name    Churned?
  Jim     Yes             Bob     No
  Sally   Yes             Betty   No
  Steve   Yes             Sue     No
  Joe     Yes             Alex    No

A bad split

  Left node               Right node
  Name    Churned?        Name    Churned?
  Jim     Yes             Bob     No
  Sally   Yes             Betty   No
  Steve   No              Sue     Yes
  Joe     No              Alex    Yes
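The intuition behind good and bad splits can be quantified with the entropy measure described earlier; a minimal sketch:

```python
import math

def entropy(labels):
    """-sum(p * log2(p)) over the class proportions in a node."""
    total = len(labels)
    ent = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        ent -= p * math.log2(p)
    return ent

good = [["Yes"] * 4, ["No"] * 4]           # each child node is pure
bad = [["Yes", "Yes", "No", "No"]] * 2     # each child is a 50/50 mix
print([entropy(node) for node in good])  # [0.0, 0.0] -- no disorder
print([entropy(node) for node in bad])   # [1.0, 1.0] -- maximum disorder
```

The good split drives the disorder in each child to zero, while the bad split leaves it at the maximum, which is exactly why the algorithm prefers the former.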
Popular Decision Tree Models: CHAID
• CHAID: Chi-squared Automatic Interaction Detector, developed by J. A. Hartigan in 1975
– Widely used since it is distributed as part of the popular statistical packages SAS and SPSS
• Differs from CART in the way it identifies the split points: instead of the information measure, it uses the chi-squared test (a statistical test for identifying independent variables)
• All predictors must be categorical, or put into categorical form by binning
• The accuracy of the two methods, CHAID and CART, has been found to be similar
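CHAID's split-point test can be sketched with a hand-rolled Pearson chi-squared statistic; the contingency table below (branches of a candidate split vs target classes) is hypothetical:

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table
    (rows = branches of a candidate split, columns = target classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical split: branch 1 holds 30 churners / 10 loyal customers,
# branch 2 holds 20 churners / 40 loyal customers.
print(round(chi_squared([[30, 10], [20, 40]]), 2))  # 16.67
```

A larger statistic means the branch membership and the target class are further from independent, i.e. a more informative split; CHAID picks the candidate with the most significant statistic.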
Popular Decision Tree Models: ID3 & C4.5
• ID3: Iterative Dichotomiser, developed by the Australian researcher Ross Quinlan in 1979
– Used in the data mining software Clementine of Integral Solutions Ltd. (taken over by SPSS)
• ID3 picks predictors and their splitting values on the basis of the gain in information provided
– Gain is the difference between the amount of information needed to make a correct prediction before and after the split has been made
– If the amount of information required is much lower after the split is made, then the split is said to have decreased the disorder of the original data
ID3 & C4.5 continued - 2
Ten records, five '+' and five '−', can be split in two candidate ways:
• Split A: left node ++++−          right node +−−−−
• Split B: left node +++++ −−−−     right node −
By using the entropy −Σ p log(p) (base 2):

Split   Left entropy                            Right entropy                           Start entropy
A       −4/5 log(4/5) − 1/5 log(1/5) = 0.72     −4/5 log(4/5) − 1/5 log(1/5) = 0.72     −5/10 log(5/10) − 5/10 log(5/10) = 1
B       −5/9 log(5/9) − 4/9 log(4/9) = 0.99     −1/1 log(1/1) = 0                       −5/10 log(5/10) − 5/10 log(5/10) = 1
ID3 & C4.5 continued - 3
Split   Weighted entropy                      Gain
A       (5/10)×0.72 + (5/10)×0.72 = 0.72      0.28
B       (9/10)×0.99 + (1/10)×0 = 0.89         0.11
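The weighted-entropy and gain figures for the two candidate splits can be verified with a short sketch (base-2 entropy assumed):

```python
import math

def entropy(pos, neg):
    """-sum(p * log2(p)) for a node holding `pos` '+' and `neg` '-' records."""
    ent = 0.0
    for count in (pos, neg):
        if count:
            p = count / (pos + neg)
            ent -= p * math.log2(p)
    return ent

def gain(children):
    """Information gain of a split, given (pos, neg) counts per child node."""
    total = sum(p + n for p, n in children)
    start = entropy(sum(p for p, _ in children), sum(n for _, n in children))
    weighted = sum((p + n) / total * entropy(p, n) for p, n in children)
    return start - weighted

print(round(gain([(4, 1), (1, 4)]), 2))  # split A: 0.28
print(round(gain([(5, 4), (0, 1)]), 2))  # split B: 0.11
```

Split A's larger gain reflects that it reduces the starting disorder of 1.0 much more than split B does, so ID3 chooses A.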
• Split A will be selected
• C4.5 introduces a number of extensions to ID3:
– Handles unknown field values in the training set
– Tree pruning method
– Automated rule generation
Strengths and Weaknesses
• Strengths of decision trees
– Able to generate understandable rules
– Classify with very little computation
– Handle both continuous and categorical data
– Provide a clear indication of which variables are most important for prediction or classification
• Weaknesses
– Not appropriate for estimation or prediction of continuous values (income, interest rates, etc.)
– Problematic with time-series data (much preprocessing required); can be computationally expensive