Transcript Slide 1
Chapter 8
Decision Tree Algorithms
Rule Based
Suitable for automatic generation
Contents
Presents the concept of decision tree models
Discusses the concept of rule interestingness
Demonstrates decision tree rules on a case
Reviews real applications of decision tree models
Shows the application of decision tree models to larger
data sets
Demonstrates See5 decision tree analysis in the
appendix
8-2
Problems
Grocery stores have a massive data problem in inventory
control, dealt with to a high degree by bar-coding.
The massive database of transactions can be mined to
monitor customer demand.
Decision trees provide a means to obtain product-specific forecasting models in the form of rules (IF-THEN) that are easy to implement.
Decision trees can be used by grocery stores in a
number of policy decisions, including ordering,
inventory replenishment, and evaluating alternative
promotion campaigns.
8-3
Decision tree
A decision tree refers to a tree structure of rules (often
association rules).
The decision tree modeling process involves collecting
those variables that the analyst thinks might bear on the
decision at issue, and analyzing these variables for
their ability to predict the outcome.
The algorithm automatically determines which
variables are most important, based on their ability to
sort the data into the correct output category.
Decision trees have a relative advantage over ANN and
GA in that a reusable set of rules is provided,
explaining the model's conclusions.
8-4
Decision Tree Applications
Classifying loan applications, screening potential
consumers, and rating job applicants.
Decision trees provide a way to implement the rule-based
system approach.
(Figure: taxonomy of supervised learning models with a tree structure)
Decision trees (categorical attributes): ID3, C4.5/C5, CART
Regression trees (continuous attributes): CART, M5, Cubist
Rule induction / association models: CN2, ITRULE
8-5
Types of Trees
Classification tree
Variable values are classes; a finite set of conditions
Regression tree
Variable values are continuous numbers; used for prediction or estimation
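A minimal sketch of the two tree types using scikit-learn (an assumed library choice; the chapter's own tools are See5/C5.0 and CART). The toy applicant data and field meanings are invented for illustration.
```python
# A hedged sketch, assuming scikit-learn is available; data values are invented.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the outcome takes one of a finite set of classes.
X = [[25, 30000], [40, 55000], [60, 42000], [35, 28000]]   # e.g. age, income
y_class = ["late", "on-time", "on-time", "late"]
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y_class)
print(clf.predict([[30, 35000]]))

# Regression tree: the outcome is a continuous number (prediction or estimation).
y_value = [1200.0, 450.0, 300.0, 980.0]                     # e.g. a continuous target
reg = DecisionTreeRegressor().fit(X, y_value)
print(reg.predict([[30, 35000]]))
```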
8-6
Rule Induction
Automatically process data
Classification (logical, easier)
Regression (estimation, messier)
Search through data for patterns & relationships
Pure knowledge discovery
Assumes no prior hypothesis
Disregards human judgment
8-7
Decision trees
Logical branching
Historical:
ID3 – an early rule-generating system
C4.5
See5
Branches:
Different possible values
Nodes:
Points from which branches emanate
8-8
Decision tree operation
A bank may have a database of past loan applicants for
short-term loans (see Table 4.4).
The bank's policy treats applicants differently by age
group, income level, and risk.
A tree sorts the possible combinations of these
variables. An exhaustive tree enumerates all
combinations of variable values, as shown in Table 8.1.
8-9
Decision tree operation
8-10
Decision tree operation
A rule-based system model would require that bank
loan officers with respected judgment be
interviewed to classify the decision for each of these
combinations of variables.
Some combinations can be reduced directly in the decision
tree.
8-11
Rule interestingness
Data, even categorical data, can potentially involve many
rules.
See Table 8.1: 3 × 3 × 3 = 27 combinations. With 10 variables,
each with 4 possible values (4^10 = 1,048,576), the number of
combinations grows to over a million, which is unreasonable to enumerate.
Decision tree models identify the most useful rules in terms
of predicting outcomes. Rule effectiveness is measured in
terms of confidence and support.
Confidence is the degree of accuracy of a rule.
Support is the degree to which the antecedent
condition occurs in the data.
8-12
Tanagra Example
8-13
Support & Confidence
Support for an association rule indicates the proportion
of records covered by the set of attributes in the
association rule.
Example: if a rule covered 10 records out of 10 million book
purchases, its support would be 10/10,000,000 = 0.000001, a
very small support measure. These concepts are often used
in the form of threshold levels in machine learning systems.
Minimum confidence levels and support levels can be
specified to retain rules identified by the decision tree.
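A small sketch of how support and confidence could be computed over transaction records; the transactions, item names, and rule below are hypothetical.
```python
# Support and confidence for a hypothetical rule "book_A -> book_B" over made-up transactions.
transactions = [
    {"book_A", "book_B"},
    {"book_A"},
    {"book_B", "book_C"},
    {"book_A", "book_B", "book_C"},
]
antecedent, consequent = {"book_A"}, {"book_B"}

n = len(transactions)
rule_count = sum(1 for t in transactions if (antecedent | consequent) <= t)
antecedent_count = sum(1 for t in transactions if antecedent <= t)

support = rule_count / n                       # share of all records covered by the rule
confidence = rule_count / antecedent_count     # accuracy of the rule when the antecedent occurs
print(support, confidence)                     # 0.5 and about 0.667 here
```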
8-14
Machine learning
Rule-induction algorithms can automatically process
categorical data (and can also work on continuous data). A
clear outcome is needed.
Rule induction works by searching through data for
patterns and relationships.
Machine learning starts with no assumptions, looking
only at input data and results.
Recursive partitioning algorithms split the original
data into finer and finer subsets, leading to a decision
tree, as the sketch below illustrates.
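A minimal ID3-style sketch of recursive partitioning on categorical data (an illustration, not the book's See5/C5.0 implementation); records are assumed to be dicts of attribute values, and the function names and tiny data set are invented.
```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of outcome labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_entropy(rows, labels, attr):
    # entropy of the outcome after splitting on attr, weighted by subset size
    n = len(rows)
    total = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        total += len(subset) / n * entropy(subset)
    return total

def grow_tree(rows, labels, attrs):
    # stop when the outcome is pure or no attributes remain; predict the majority class
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # pick the attribute whose split leaves the least entropy (most information)
    best = min(attrs, key=lambda a: weighted_entropy(rows, labels, a))
    branches = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = grow_tree([rows[i] for i in idx],
                                    [labels[i] for i in idx],
                                    [a for a in attrs if a != best])
    return (best, branches)

# Tiny illustrative call (made-up records in the spirit of the loan example):
rows = [{"risk": "low", "age": "young"}, {"risk": "high", "age": "middle"},
        {"risk": "low", "age": "old"}, {"risk": "high", "age": "young"}]
labels = ["on-time", "late", "on-time", "late"]
print(grow_tree(rows, labels, ["risk", "age"]))
```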
8-15
Cases
20 past loan application cases in Table 8.3.
8-16
Cases
Automatic machine learning begins with identifying
those variables that offer the greatest likelihood of
distinguishing between the possible outcomes.
For each of the three variables, the outcome
probabilities are illustrated in Table 8.5 (next slide).
Most data mining packages use an entropy measure to
gauge the discriminating power of each variable for
splitting the data (chi-square measures can also be used
to select variables).
8-17
Cases
Table 8.4 Grouped data
Table 8.5 Combination outcomes
8-18
Entropy formula
Inform = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]
Where p is the number of positive examples and n is the
number of negative examples in the training set for
each value of the attribute.
The lower the measure (entropy), the greater the
information content
Can be used to automatically select the variable with the most
productive rule potential
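A one-function sketch of the measure above; treating p = 0 or n = 0 as contributing 0 is an assumed convention (the next slide only notes that log2 is then undefined).
```python
import math

def inform(p, n):
    # Information measure: -(p/(p+n))log2(p/(p+n)) - (n/(p+n))log2(n/(p+n));
    # returning 0 when p or n is 0 is an assumed convention for the undefined log2(0).
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * math.log2(p / t) - (n / t) * math.log2(n / t)

# Young age group: 8 on-time (p) and 4 late (n), weighted by its 12/20 share of cases
print(round(inform(8, 4) * 12 / 20, 3))   # 0.551, matching the worked value on the next slides
```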
8-19
Entropy formula
Inform = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]
The entropy formula has a problem if either p or n is 0,
because log2(0) is undefined.
Entropy for each Age category generated by the
formula is shown in Table 8.6.
Category Young: [-(8/12)(-0.585) - (4/12)(-1.585)] × (12/20) = 0.551
The lower the entropy measure, the greater the information content
(the greater the agreement probability)
Rule: If (Risk=low) then Predict on-time payment
Else predict late
8-20
Entropy
Young: [-(8/12)(-0.585) - (4/12)(-1.585)] × (12/20) = 0.551
Middle: [-(4/5)(-0.322) - (1/5)(-2.322)] × (5/20) = 0.180
Old: [-(3/3)(0) - (0/3)(0)] × (3/20) = 0.000
Age (sum): 0.731
Income: 0.782
Risk: 0.446 (lowest)
By the measures, Risk has the greatest information content. If Risk is
low, the data indicates a 1.0 probability that the applicant will pay the
loan back on time.
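A short sketch that aggregates the per-category measures into the attribute-level value for Age, using the counts readable from the worked figures above (Young 8/4, Middle 4/1, Old 3/0 on-time/late).
```python
import math

def inform(p, n):
    # Same information measure as on the previous slides; 0 by convention when p or n is 0.
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * math.log2(p / t) - (n / t) * math.log2(n / t)

age_counts = {"young": (8, 4), "middle": (4, 1), "old": (3, 0)}
total = sum(p + n for p, n in age_counts.values())            # 20 cases

age_measure = sum(inform(p, n) * (p + n) / total for p, n in age_counts.values())
print(round(age_measure, 3))   # 0.731, matching the Age sum above

# Repeating the same aggregation for Income and Risk gives 0.782 and 0.446 (per the slide);
# the attribute with the lowest measure (Risk) is selected to split on first.
```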
8-21
Evaluation
Two types of errors may occur:
1. Applicants rated as low risk may actually not pay on
time (from the data, the probability of this case is 0.0)
2. Applicants rated as high or average risk might
actually have paid if given a loan (from the data, the
probability of this happening is 5/20 = 0.25)
Expected error = 0.25 × 0.5 (the probability of being wrong) = 0.125
Test the model using another set of data.
8-22
Evaluation
The entropy measure for Age, given that Risk was not
low, is 0.99, while the same calculation for Income is
1.971. Age has greater discriminating power.
If Age is Middle, the one case did not pay on time.
If (Risk is NOT low) AND (Age = Middle)
Then Predict late
Else Predict on-time
8-23
Evaluation
For the last variable, Income, given that Risk was not low
and Age was not Middle, there are nine cases left, shown
in Table 8.8.
A third rule takes advantage of the cases with a
unanimous outcome:
If (Risk NOT low) AND (Age NOT Middle) AND (Income high)
Then Predict late
Else Predict on-time
See page 141 for more explanation
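The three rules above are easy to express directly as IF-THEN code; this minimal sketch uses invented value spellings rather than anything taken from the tables.
```python
# The three derived rules as straightforward IF-THEN logic; value spellings are
# illustrative assumptions, not taken from the tables.
def predict_payment(risk, age, income):
    if risk == "low":
        return "on-time"      # Rule 1
    if age == "middle":
        return "late"         # Rule 2: Risk not low, Age middle
    if income == "high":
        return "late"         # Rule 3: Risk not low, Age not middle, Income high
    return "on-time"          # else branch of Rule 3

print(predict_payment(risk="average", age="young", income="high"))   # -> late
```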
8-24
Rule accuracy
The expected accuracy of the three rules is shown in Table 8.9.
The expected error is 0.375 (1 - 0.625).
An additional rule could be generated for the case of Risk not low,
Age = Young, and Income not high: four cases with low income (p of
on-time = 0.5) and four cases with average income (p of on-time =
0.75).
The greater discrimination is provided by average income, resulting in
the following rule:
If (Risk NOT low) AND (Age NOT Middle) AND (Income average)
Then Predict on-time
Else Predict either
8-25
Rule accuracy
There is no added accuracy obtained with this rule,
as shown in Table 8.10.
The expected error is 0.15, the same as without the rule.
When machine learning methods encounter no
improvement, they generally stop.
8-26
Rule accuracy
Table 8.11 shows the results.
8-27
Rule accuracy
8-28
Inventory Prediction
Groceries
Maybe over 100,000 SKUs
Barcode data input
Data mining to discover patterns
Random sample of over 1.6 million records
30 months
95 outlets
Test sample 400,000 records
Rule induction more workable than regression
28,000 rules
Very accurate, up to 27% improvement
8-29
Clinical Database
Headache
Over 60 possible causes
Exclusive reasoning uses negative rules
Use when symptom absent
Inclusive reasoning uses positive rules
Probabilistic rule induction expert system
Headache: Training sample over 50,000 cases, 45
classes, 147 attributes
Meningitis: 1,200 samples on 41 attributes, 4
outputs
8-30
Clinical Database
Used AQ15, C4.5
Average accuracy 82%
Expert System
Average accuracy 92%
Rough Set Rule System
Average accuracy 70%
Using both positive & negative rules from rough sets
Average accuracy over 90%
8-31
Software Development Quality
Telecommunications company
Goal: find patterns in modules being developed likely
to contain faults discovered by customers
Typical module several million lines of code
Probability of fault averaged 0.074
Apply greater effort for those
Specification, testing, inspection
8-32
Software Quality
Preprocessed data
Reduced data
Used CART
(Classification & Regression Trees)
Could specify prior probabilities
First model 9 rules, 6 variables
Better at cross-validation
But variable values not available until late
Second model 4 rules, 2 variables
About same accuracy, data available earlier
8-33
Rules and evaluation
8-34
Rules and evaluation
The second model's rules
The two models were very close in accuracy. The first model was better
at cross-validation accuracy, but its variables were not available until
just prior to release.
The second model had the advantage of being based on data
available at an earlier stage and required less extensive data
reduction.
See also page 146 for the expert system.
8-35
Applications of methods to larger data sets
An expenditure application to find the characteristics of
potential customers for each expenditure category.
A simple case is to categorize clothing expenditures (or
other expenditures in the data set) per year as a 2-class
classification problem.
Data preparation and data transformation: see page 154.
Comparison of Apriori, C4.5, and C5.0.
8-36
Fuzzy Decision Trees
Have assumed distinct (crisp) outcomes
Many data points not that clear
Fuzzy: Membership function represents belief
(between 0 and 1)
Fuzzy relationships have been incorporated in
decision tree algorithms
8-37
Fuzzy Example
Definitions (membership values):
Age: Young 0.3, Middle 0.9, Old 0.2
Income: Low 0.0, Average 0.8, High 0.3
Risk: Low 0.1, Average 0.8, High 0.3
Sums will not necessarily equal 1.0
If ambiguous, select the alternative with the larger
membership value
Aggregate with the mean
8-38
Fuzzy Model
IF Risk = Low THEN On-time
Membership function: 0.1
IF Risk NOT Low AND Age = Middle THEN Late
Risk NOT Low: MAX(0.8, 0.3) = 0.8
Age Middle: 0.9
Membership function: Mean = 0.85
8-39
Fuzzy Model cont.
IF Risk NOT Low AND Age NOT Middle AND Income = High THEN Late
Risk NOT Low: MAX(0.8, 0.3) = 0.8
Age NOT Middle: MAX(0.3, 0.2) = 0.3
Income High: 0.3
Membership function: Mean = (0.8 + 0.3 + 0.3)/3 ≈ 0.467
8-40
Fuzzy Model cont.
IF Risk NOT Low AND Age NOT Middle AND Income NOT High THEN On-time
Risk NOT Low: MAX(0.8, 0.3) = 0.8
Age NOT Middle: MAX(0.3, 0.2) = 0.3
Income NOT High: MAX(0.0, 0.8) = 0.8
Membership function: Mean = (0.8 + 0.3 + 0.8)/3 ≈ 0.633
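A minimal sketch that pulls the four fuzzy rule evaluations together, assuming the membership values from the definitions slide, MAX for the NOT conditions, and the mean for aggregation, as used above.
```python
# Fuzzy membership evaluation of the four rules, using the definitions slide values;
# NOT is taken as the MAX over the remaining categories and each rule aggregates
# its condition memberships with the mean, as on the preceding slides.
age = {"young": 0.3, "middle": 0.9, "old": 0.2}
income = {"low": 0.0, "average": 0.8, "high": 0.3}
risk = {"low": 0.1, "average": 0.8, "high": 0.3}

def not_(memberships, excluded):
    return max(v for k, v in memberships.items() if k != excluded)

def mean(values):
    return sum(values) / len(values)

rules = {
    "Rule 1: Risk low -> on-time": risk["low"],
    "Rule 2: Risk not low, Age middle -> late": mean([not_(risk, "low"), age["middle"]]),
    "Rule 3: Risk not low, Age not middle, Income high -> late":
        mean([not_(risk, "low"), not_(age, "middle"), income["high"]]),
    "Rule 4: Risk not low, Age not middle, Income not high -> on-time":
        mean([not_(risk, "low"), not_(age, "middle"), not_(income, "high")]),
}
for name, value in rules.items():
    print(f"{name}: {value:.3f}")
# Rule 4 has the largest membership (about 0.633), so the conclusion is on-time.
```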
8-41
Fuzzy Model cont.
Highest membership function is 0.633, for Rule 4
Conclusion: On-time
8-42
Decision Trees
Very effective & useful
Automatic machine learning
Thus unbiased (but omits judgment)
Can handle very large data sets
Not affected much by missing data
Lots of software available
8-43