Issues Related to Decision Tree

Download Report

Transcript Issues Related to Decision Tree

Issues Related to Decision Tree
CIT365: Data Mining & Data Warehousing
Bajuna Salehe
The Institute of Finance Management:
Computing and IT Dept.
Which Splitting Test Condition?
• This depends on the attribute types
– Nominal
– Ordinal
– Continuous (interval/ratio)
• It also depends of ways to split
– 2 – way (Binary) split
– Multi-way split
Nominal Attributes
• Since a nominal attribute can produced many
values, its test condition can be expressed in
two ways, as shown in Figure below
Nominal Attributes
• For multi-way split, the number of outcomes
depends on the number of distinct values for
the corresponding attribute.
• For example, if an attribute such as marital
status has three distinct values — single,
married, or divorced — its test condition will
produce a three-way split, as shown in Figure
2 below:
Nominal Attributes
• Multi – way split uses as many partitions as
distinct values
Binary Split
• Divides values into two subsets
• Must find optimal partitioning
• e.g. C(3,2) = 3 possibilities. Diagram below
depicts this
Ordinal Attributes
• Ordinal attributes may also produce binary or
multi-way splits.
• Grouping of ordinal attribute values is also
allowed as long as it does not violate the
implicit order property among the attribute
values.
• Example of grouping shirt-size attribute is
shown in the diagram below
Ordinal Attributes
• The groupings shown in Figures (a) and (b) preserve
the order among the attribute values whereas the
grouping shown in Figure (c) violate such property
because it combines the attribute values Small and
Large into the same partition while Medium and Extra
Large into another.
Splitting Continuous Attributes
• Discretization—Convert to ordinal attribute.
Example for continuous attributes, the test
condition can be expressed as a comparison
test (A < v?) or (A ≥ v?) with binary outcomes
as shown in the diagram below
When to Stop Splitting?
• Stop expanding when
– All records belong to same class
– All records have “similar” attribute values
Classification Rules
• Represent the knowledge in the form of IFTHEN rules
• One rule is created for each path from the
root to a leaf
• Each attribute-value pair along a path forms a
conjunction
• The leaf node holds the class prediction
Classification Rules
• Consider the tree below:
Classification Rules
• Rules are easier for humans to understand
• Example from above tree, the rules would be:
IF age = “<=30” AND student = “no” THEN
buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN
buys_computer = “yes”
IF age = “31…40”
THEN
buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent”
THEN buys_computer = “yes”
IF age = “<=30” AND credit_rating = “fair” THEN
buys_computer = “no”
Why Decision Tree Classifier?
• Inexpensive to train/construct
• Extremely FAST @ classifying unknown
records
– Linear in # attributes
• Easy to interpret for small trees
Why Decision Tree Classifier?
• Accuracy comparable to other classification
techniques for many simple data sets
• Example: C4.5
– Simple Depth-first construction
– Uses Information Gain (entropy)
– Sorts Continuous Attributes @ each node
– Data must fit in memory!
• unsuitable for large datasets
• needs out-of-core sorting
Decision Boundary
1
0.9
x < 0.43?
0.8
0.7
Yes
No
y
0.6
y < 0.33?
y < 0.47?
0.5
0.4
Yes
0.3
0.2
:4
:0
0.1
No
:0
:4
Yes
:0
:3
0
0
0.1
0.2
0.3
0.4
0.5
x
0.6
0.7
0.8
0.9
1
• Border line between two neighboring regions of different classes is known as
decision boundary
• Decision boundary is parallel to axes because test condition involves a single
attribute at-a-time
No
:4
:0
Classification Issues
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
Model Underfitting/Overfitting
• The errors of a classification model are divided into
two types
– Training errors: the number of misclassification
errors committed on training records.
– Generalization errors: the expected error of the
model on previously unseen records.
• A good model must have both errors low.
• Model underfitting: both type of errors are large
when the decision tree is too small.
• Model overfitting: training error is small but
generalization error is large, when the decision tree
is too large.
Underfitting and Overfitting (Example)
500 circular and 500
triangular data points.
Circular points:
0.5  sqrt(x12+x22)  1
Triangular points:
sqrt(x12+x22) > 0.5 or
sqrt(x12+x22) < 1
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
An Example: Training Dataset
An Example: Testing Dataset
An Example: Two Models, M1 and M2
0% training error
30% testing error
20% training error
10% testing error
(human, warm-blooded, yes, no, no)
(dolphin, warm-blooded, yes, no, no)
(spiny anteater, warm-blooded, no, yes, yes)
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict
correctly the class labels of that region
- Insufficient number of training records in the region causes the decision tree
to predict the test examples using other training records that are irrelevant to
the classification task
An Example: A Smaller Training Dataset
The training dataset does not provide enough information. All
warm-blooded vertebrates that do not hibernate are nonmammals.
An Example: A Model for Table 4.5
0% training error
30% testing error
(human, warm-blooded, yes, no, no)
(dolphin, warm-blooded, yes, no, no)
Notes on Overfitting
• Overfitting results in decision trees that are
more complex than necessary
• Training error no longer provides a good
estimate of how well the tree will perform on
previously unseen records
• Need new ways for estimating generalization
errors
– We cannot use a tree with smaller training error
to reduce generalization errors!
How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if number of instances is less than some userspecified threshold
• Stop if class distribution of instances are independent
of the available features (e.g., using  2 test)
• Stop if expanding the current node does not improve
impurity measures (e.g., Gini or information gain).
How to Address Overfitting (Con’d)
• Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up
fashion
– If generalization error improves after trimming,
replace sub-tree by a leaf node.
– Class label of leaf node is determined from
majority class of instances in the sub-tree
Avoiding Overfitting In
Classification
• An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoiding overfitting
– Prepruning: Halt tree construction early
• Do not split a node if this would result in a measure of the
usefullness of the tree falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree
to give a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which
is the “best pruned tree”
Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among
competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
– Rather than how fast it takes to classify or build models, scalability, etc.
• Confusion Matrix:
PREDICTED CLASS
a: TP (true positive)
Class=Yes Class=No
ACTUAL Class=Yes
CLASS
Class=No
a
b
c
d
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation
(Con’d)
PREDICTED CLASS
Class=Yes
ACTUAL
CLASS
Class=No
Class=Yes
a
(TP)
b
(FN)
Class=No
c
(FP)
d
(TN)
ad
TP  TN
Accuracy 

a  b  c  d TP  TN  FP  FN
• Most widely-used metric:
Missing Values
• In practice, the data set may contain missing
attribute values due to the following reasons.
– Some attributes may be costly to measure. E.g.,
certain medical tests may be far too expensive to
be performed on every patient. As a result, the
medical record for some patients may not contain
the results of such tests.
– Second, some attribute values may be
untrustworthy. E.g., measurements taken from a
known faulty sensor are often invalid, and thus,
should be ignored.
Missing Values
– Third, some attributes may be optional. An
attribute such as customer satisfaction, for
example, may be applicable only to those
customers who have prior experience using the
services of a company.
• Missing values may affect the performance of
classification models if not handled
appropriately.
Missing Values
• One of the approach that can be used for
handling missing values is to completely
ignore records with missing values.
• However this approach may not be desirable
especially if the number of training records is
limited.
Summary
• Classification is an extensively studied
problem
• Classification is probably one of the most
widely used data mining techniques with a lot
of extensions
• Classification techniques can be categorized as
either lazy or eager
Summary
• Scalability is still an important issue for
database applications: thus combining
classification with database techniques should
be a promising topic
• Research directions: classification of nonrelational data, e.g., text, spatial, multimedia,
etc. classification of skewed data sets