Decision Tree Models in Data Mining
Matthew J. Liberatore
Thomas Coghlan
Decision Trees in Data Mining
Decision Trees can be used to predict a categorical or a continuous target (called regression trees in the latter case)
Like logistic regression and neural networks, decision trees can be applied for classification and prediction
Unlike these methods, no equations are estimated
A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable
The rules are of an IF-THEN form – for example:
If Risk = Low, then predict on-time payment of a loan
Decision Tree Approach
A decision tree represents a hierarchical segmentation of the data
The original segment is called the root node and is the entire data set
The root node is partitioned into two or more segments by applying a series of simple rules over an input variable
Each resulting segment can be further partitioned into sub-segments, and so on
For example, risk = low, risk = not low
Each rule assigns the observations to a segment based on its input value
For example, risk = low can be partitioned into income = low and income = not low
The segments are also called nodes, and the final segments are called leaf nodes or leaves
Decision Tree Example – Loan Payment
[Tree diagram] The root node splits on Income at $30k:
Income < $30k: split on Age at 25 (Age < 25: not on-time; Age >= 25: on-time)
Income >= $30k: split on Credit Score at 600 (Credit Score < 600: not on-time; Credit Score >= 600: on-time)
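As a rough sketch in Python (field names are hypothetical; the thresholds are taken from the example above), the tree's IF-THEN rules amount to nested conditionals:

    # Sketch: the example tree's rules written as nested IF-THEN logic
    def predict_payment(income, age, credit_score):
        if income < 30_000:
            return "on-time" if age >= 25 else "not on-time"
        else:
            return "on-time" if credit_score >= 600 else "not on-time"

    print(predict_payment(income=25_000, age=30, credit_score=550))  # on-time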
Growing the Decision Tree
Growing the tree involves successively partitioning the data (recursive partitioning)
If an input variable is binary, then the two categories can be used to split the data
If an input variable is interval, a splitting value is used to classify the data into two segments
For example, if household income is interval and there are 100 possible incomes in the data set, then there are 100 possible splitting values
For example, income < $30k, and income >= $30k
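A minimal sketch of the idea in Python (hypothetical income values; Enterprise Miner enumerates candidate splits internally):

    # Each distinct observed income is a candidate splitting value
    incomes = [22_000, 28_000, 31_000, 45_000, 60_000]

    for split in sorted(set(incomes)):
        left = [x for x in incomes if x < split]
        right = [x for x in incomes if x >= split]
        print(f"income < {split}: {len(left)} cases; income >= {split}: {len(right)} cases")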
Evaluating the Partitions
When the target is categorical, a chi-square statistic is computed for each partition of an input variable
A contingency table is formed that maps responders and non-responders against the partitioned input variable
For example, the null hypothesis might be that there is no difference between people with income < $30k and those with income >= $30k in making an on-time loan payment
The lower the significance or p-value, the stronger the evidence for rejecting this hypothesis, meaning that this income split is a discriminating factor
Contingency Table
                    Payment on-time    Payment not on-time    Total
Income < $30k
Income >= $30k
Total
(each cell holds the count of observations in that income group with that payment outcome)
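A small sketch of building such a table in Python with pandas (hypothetical records, not the loan data from the slides):

    import pandas as pd

    df = pd.DataFrame({
        "income_lt_30k": [True, True, False, False, True, False],
        "on_time":       [False, True, True, True, False, True],
    })

    # 2x2 contingency table with row/column totals
    table = pd.crosstab(df["income_lt_30k"], df["on_time"], margins=True)
    print(table)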
Chi-Square Statistic
The chi-square statistic computes a measure of how different the number of observations is in each of the four cells as compared to the expected number
The p-value associated with the null hypothesis is computed
Enterprise Miner then computes the logworth of the p-value: logworth = -log10(p-value)
The split that generates the highest logworth for a given input variable is selected
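A sketch of the computation in Python using scipy (the cell counts are hypothetical):

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: income < $30k, income >= $30k; columns: on-time, not on-time
    observed = np.array([[12, 18],
                         [35,  5]])

    chi2, p_value, dof, expected = chi2_contingency(observed)
    logworth = -np.log10(p_value)
    print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4g}, logworth = {logworth:.2f}")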
Growing the Tree
In our loan payment example, we have three interval-valued input variables: income, age, and credit score
We compute the logworth of the best split for each of these variables
We then select the variable that has the highest logworth and use its split – suppose it is income
Under each of the two income nodes, we then find the logworth of the best split of age and credit score and continue the process, subject to meeting the threshold on the significance of the chi-square value for splitting and other stopping criteria (described later)
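A rough sketch of this variable-selection step in Python (best_split_logworth and choose_split_variable are hypothetical helpers built on the chi-square computation above, not Enterprise Miner's actual routine):

    import numpy as np
    from scipy.stats import chi2_contingency

    def best_split_logworth(values, target):
        # Try each observed value as a split point and keep the best logworth
        best = float("-inf")
        for split in sorted(set(values))[1:]:
            left = values < split
            table = np.array([
                [np.sum(left & (target == 1)), np.sum(left & (target == 0))],
                [np.sum(~left & (target == 1)), np.sum(~left & (target == 0))],
            ])
            if (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
                continue  # degenerate table, skip this split
            _, p, _, _ = chi2_contingency(table)
            best = max(best, -np.log10(p))
        return best

    def choose_split_variable(inputs, target):
        # inputs: dict of variable name -> numpy array; target: 0/1 numpy array
        return max(inputs, key=lambda name: best_split_logworth(inputs[name], target))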
Other Splitting Criteria for a Categorical Target
The gini and entropy measures are based on how heterogeneous the observations are at a given node, that is, the mix of responders and non-responders at the node
Let p1 and p0 represent the proportion of responders and non-responders at a node, respectively
If two observations are chosen (with replacement) from a node, the probability that they are either both responders or both non-responders is p1² + p0²
The gini index = 1 – (p1² + p0²), the probability that the two observations are different
Best case is a gini index of 0 (all observations are the same)
An index of ½ means both groups are equally represented
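A small sketch of the gini calculation in Python:

    def gini_index(p1):
        p0 = 1.0 - p1
        return 1.0 - (p1**2 + p0**2)

    print(gini_index(0.5))  # 0.5: equal mix of responders and non-responders
    print(gini_index(1.0))  # 0.0: all observations the same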
Other Splitting Criteria for a Categorical Target
The rarity of an event is defined as: -log2(pi)
Entropy sums up the rarity of response and non-response over all observations; per observation this averages to entropy = -p1*log2(p1) - p0*log2(p0)
Entropy ranges from the best case of 0 (all responders or all non-responders) to 1 (equal mix of responders and non-responders)
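A small sketch of the entropy calculation in Python:

    import math

    def entropy(p1):
        terms = [p for p in (p1, 1.0 - p1) if p > 0]  # treat 0*log2(0) as 0
        return -sum(p * math.log2(p) for p in terms)

    print(entropy(0.5))  # 1.0: equal mix of responders and non-responders
    print(entropy(1.0))  # 0.0: all observations the same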
Splitting Criteria for a Continuous (Interval) Target
An F-statistic is used to measure the degree of separation of a split for an interval target, such as revenue
Similar to the sum of squares discussion under multiple regression, the F-statistic is based on the ratio of the sum of squares between the groups and the sum of squares within groups, both adjusted for the number of degrees of freedom
The null hypothesis is that there is no difference in the target mean between the two groups
As before, the logworth of the p-value is computed
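A sketch of the F-test for one candidate split in Python using scipy (hypothetical revenue values):

    import numpy as np
    from scipy.stats import f_oneway

    left_revenue = np.array([12.1, 9.8, 11.5, 10.2])    # e.g. cases with income < $30k
    right_revenue = np.array([18.4, 21.0, 17.6, 19.9])  # e.g. cases with income >= $30k

    f_stat, p_value = f_oneway(left_revenue, right_revenue)
    logworth = -np.log10(p_value)
    print(f"F = {f_stat:.2f}, p-value = {p_value:.4g}, logworth = {logworth:.2f}")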
Some Adjustments
The more possible splits of an input variable, the less accurate the p-value (bigger chance of rejecting the null hypothesis)
If there are m splits, the Bonferroni adjustment adjusts the p-value of the best case by subtracting log10(m) from the logworth
If the Time of Kass Adjustment property is set to Before, then the p-values of the splits are compared with the Bonferroni adjustment applied
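A small sketch of the Bonferroni adjustment in Python:

    import math

    def adjusted_logworth(p_value, m_splits):
        return -math.log10(p_value) - math.log10(m_splits)

    # e.g. a best p-value of 0.001 found after trying 100 candidate splits
    print(adjusted_logworth(0.001, 100))  # 3.0 - 2.0 = 1.0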
Some Adjustments
Setting the Split Adjustment property to Yes means that the significance of the p-value can be adjusted by the depth of the tree
For example, at the fourth split, a calculated p-value of 0.04 becomes 0.04 * 2^4 = 0.64, making the split statistically insignificant
This leads to rejecting more splits, limiting the size of the tree
Tree growth can also be controlled by setting:
Leaf Size property (minimum number of observations in a leaf)
Split Size property (minimum number of observations to allow a node to be split)
Maximum Depth property (maximum number of generations of nodes)
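A sketch of the depth adjustment implied by this example (the 2-to-the-depth multiplier is inferred from the numbers above, not taken from Enterprise Miner documentation):

    def depth_adjusted_p(p_value, depth):
        return p_value * (2 ** depth)

    print(depth_adjusted_p(0.04, 4))  # 0.64, no longer significant at the 0.05 level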
Some Results
The posterior probabilities are the proportions of responders and non-responders at each node
A node is classified as a responder or non-responder depending on which posterior probability is larger
In selecting the best tree, one can use Misclassification, Lift, or Average Squared Error
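A small sketch of classifying a leaf by its posterior probabilities in Python:

    def classify_leaf(n_responders, n_non_responders):
        p_responder = n_responders / (n_responders + n_non_responders)
        return "responder" if p_responder >= 0.5 else "non-responder"

    print(classify_leaf(7, 3))  # responder (posterior probability 0.7)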
Creating a Decision Tree Model in Enterprise Miner
Open the bankrupt project, and create a new diagram called Bankrupt_DecTree
Drag and drop the bankrupt data node and the Decision Tree node (from the Model tab) onto the diagram
Connect the nodes
Select ProbChisq for the Criterion under Splitting Rule
Change Use Input Once to Yes (otherwise, the same variable can appear more than once in the tree)
Under Subtree, select Misclassification for Assessment Measure
Keep defaults under P-Value Adjustment and Output Variables
Under Score, set Variable Selection to No (otherwise variables with importance values greater than 0.05 are set as rejected and not considered by the tree)
The Decision Tree has only one split, on RE/TA. The misclassification rate is 0.15 (3/20), with 2 false negatives and 1 false positive. The cumulative lift is somewhat lower than the best cumulative lift, and starts out at 1.777 vs. the best value of 2.000.
Under Subtree, set Method to Largest and rerun. The results show that another split is added, using EBIT/TA. However, the misclassification rate is unchanged at 0.15. This result shows that setting Method to Assessment and Misclassification for Assessment Measure finds the smallest tree having the lowest misclassification rate.
Model Comparison
The Model Comparison node under the Assess tab can be used to compare several different models
Create a diagram called Full Model that includes the bankrupt data node connected into the regression, decision tree, and neural network nodes
Connect the three model nodes into the Model Comparison node, and connect it and the bankrupt_score data node into a Score node
For Regression, set Selection Model to none; for Neural Network, set Model Selection Criterion to Average Error, and the Network properties as before; for Decision Tree, set Assessment Measure as Average Squared Error, and the other properties as before. This puts each of the models on a similar basis for fit. For Model Comparison, set Selection Criterion as Average Squared Error.
Neural Network is selected, although Regression is nearly identical in average squared error. The Receiver Operating Characteristic (ROC) curve shows sensitivity (true positives) vs. 1-specificity (false positives) for various cutoff probabilities of a response. The chart shows that no matter what the cutoff probabilities are, regression and neural network classify 100% of responders as responders (sensitivity) and 0% of non-responders as responders (1-specificity). Decision tree performs reasonably well, as indicated by the area above the diagonal line.
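A small sketch of computing ROC points (sensitivity vs. 1-specificity) in Python with scikit-learn (hypothetical scores, not the bankrupt data):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
    y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.65, 0.3, 0.2, 0.4])

    fpr, tpr, thresholds = roc_curve(y_true, y_score)  # fpr = 1 - specificity
    print("area under the ROC curve:", roc_auc_score(y_true, y_score))
    for f, t, th in zip(fpr, tpr, thresholds):
        print(f"cutoff {th:.2f}: sensitivity = {t:.2f}, 1-specificity = {f:.2f}")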