
Classification with Decision Trees II
Instructor: Qiang Yang
Hong Kong University of Science and Technology
[email protected]
Thanks: Eibe Frank and Jiawei Han
1
Part II: Industrial-strength algorithms
• Requirements for an algorithm to be useful in a wide range of real-world applications:
  • Can deal with numeric attributes
  • Doesn't fall over when missing values are present
  • Is robust in the presence of noise
  • Can (at least in principle) approximate arbitrary concept descriptions
• Basic schemes (may) need to be extended to fulfill these requirements
2
Decision trees
• Extending ID3 to deal with numeric attributes: pretty straightforward
• Dealing sensibly with missing values: a bit trickier
• Stability for noisy data: requires a sophisticated pruning mechanism
• End result of these modifications: Quinlan's C4.5
  • Best-known and (probably) most widely-used learning algorithm
  • Commercial successor: C5.0
3
Numeric attributes
• Standard method: binary splits (e.g. temp < 45)
• Difference to nominal attributes: every attribute offers many possible split points
• Solution is a straightforward extension:
  • Evaluate info gain (or other measure) for every possible split point of the attribute
  • Choose the "best" split point
  • Info gain for the best split point is the info gain for the attribute
• Computationally more demanding
4
An example
• Split on temperature attribute from weather data:

  Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
  Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

• E.g. 4 yeses and 2 nos for temperature < 71.5, and 5 yeses and 3 nos for temperature ≥ 71.5
• Info([4,2],[5,3]) = (6/14)*info([4,2]) + (8/14)*info([5,3]) = 0.939 bits
• Split points are placed halfway between values
• All split points can be evaluated in one pass!
5
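A simple Python check of these numbers (it evaluates each candidate midpoint separately rather than in a single pass; entropy and split_info are illustrative helper names, not part of C4.5):

import math

def entropy(counts):
    """Entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_info(values, labels, threshold):
    """Weighted average entropy after a binary split at `threshold`."""
    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    info = 0.0
    for side in (left, right):
        counts = [side.count(c) for c in set(labels)]
        info += len(side) / len(values) * entropy(counts)
    return info

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["Y", "N", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "N"]

# Candidate split points: halfway between consecutive distinct values
distinct = sorted(set(temps))
candidates = [(a + b) / 2 for a, b in zip(distinct[:-1], distinct[1:])]
for t in candidates:
    print(t, round(split_info(temps, play, t), 3))
# The split at 71.5 yields 0.939 bits, matching the slide.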
Avoiding repeated sorting
• Instances need to be sorted according to the values of the numeric attribute considered
  • Time complexity for sorting: O(n log n)
  • Does this have to be repeated at each node?
• No! Sort order from parent node can be used to derive sort order for children
  • Time complexity of derivation: O(n)
  • Only drawback: need to create and store an array of sorted indices for each numeric attribute
6
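A sketch of the O(n) derivation, assuming the parent's sorted index array and an assignment of each instance to a child are already available (the names here are illustrative):

def child_sort_orders(parent_sorted_idx, child_of):
    """Derive each child's sorted index array from the parent's in one O(n) pass.

    parent_sorted_idx: instance indices sorted by the numeric attribute at the parent
    child_of: mapping from instance index to the child it is sent to
    """
    orders = {}
    for i in parent_sorted_idx:          # visit instances in the parent's sorted order
        orders.setdefault(child_of[i], []).append(i)
    return orders                        # each child's list is already sorted by the attribute

# Example: parent order by some numeric attribute, instances 0..5 split on another attribute
parent_order = [3, 0, 5, 1, 4, 2]
goes_to = {0: "left", 1: "left", 2: "right", 3: "right", 4: "left", 5: "right"}
print(child_sort_orders(parent_order, goes_to))
# {'right': [3, 5, 2], 'left': [0, 1, 4]}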
Notes on binary splits
• Information in nominal attributes is computed using one multi-way split on that attribute
• This is not the case for binary splits on numeric attributes
  • The same numeric attribute may be tested several times along a path in the decision tree
  • Disadvantage: tree is relatively hard to read
• Possible remedies: pre-discretization of numeric attributes or multi-way splits instead of binary ones
7
Example of Binary Split
[Tree diagram: the same numeric attribute tested repeatedly along one path, e.g. Age<=3, Age<=5, Age<=10]
8
Missing values
• C4.5 splits instances with missing values into pieces (with weights summing to 1)
  • A piece going down a particular branch receives a weight proportional to the popularity of the branch
  • Info gain etc. can be used with fractional instances, using sums of weights instead of counts
• During classification, the same procedure is used to split instances into pieces
  • Probability distributions are merged using weights (sketched below)
9
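A minimal sketch of the classification-time procedure, assuming a simple node structure (the class and field names are illustrative, not C4.5's actual data structures):

class Node:
    def __init__(self, attribute=None, children=None, weights=None, distribution=None):
        self.attribute = attribute        # attribute tested at this node (None for a leaf)
        self.children = children or {}    # attribute value -> child Node
        self.weights = weights or {}      # attribute value -> fraction of training weight down that branch
        self.distribution = distribution  # leaf only: class -> probability

def classify(node, instance, weight=1.0):
    """Return a class distribution; an instance with a missing value is split into
    fractional pieces, one per branch, weighted by branch popularity."""
    if node.attribute is None:                      # leaf: return its weighted distribution
        return {c: weight * p for c, p in node.distribution.items()}
    value = instance.get(node.attribute)            # None means the value is missing
    if value is not None:
        pieces = [(node.children[value], weight)]
    else:                                           # send a piece down every branch
        pieces = [(child, weight * node.weights[v]) for v, child in node.children.items()]
    dist = {}
    for child, w in pieces:
        for c, p in classify(child, instance, w).items():
            dist[c] = dist.get(c, 0.0) + p          # merge distributions using the weights
    return dist

# Example: outlook is missing; branch weights 5/14, 4/14, 5/14 as in the weather data
leaf_yes = Node(distribution={"yes": 1.0})
leaf_no = Node(distribution={"no": 1.0})
tree = Node(attribute="outlook",
            children={"sunny": leaf_no, "overcast": leaf_yes, "rainy": leaf_yes},
            weights={"sunny": 5/14, "overcast": 4/14, "rainy": 5/14})
print(classify(tree, {"humidity": "high"}))   # {'no': 0.357..., 'yes': 0.642...}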
Stopping Criteria
• When all cases have the same class. The leaf node is labeled by this class.
• When there is no available attribute. The leaf node is labeled by the majority class.
• When the number of cases is less than a specified threshold. The leaf node is labeled by the majority class.
10
Pruning
• Pruning simplifies a decision tree to prevent overfitting to noise in the data
• Two main pruning strategies:
  1. Postpruning: takes a fully-grown decision tree and discards unreliable parts
  2. Prepruning: stops growing a branch when information becomes unreliable
• Postpruning preferred in practice because of early stopping in prepruning
11
Prepruning
• Usually based on a statistical significance test
• Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
• Most popular test: chi-squared test
• ID3 used the chi-squared test in addition to information gain
  • Only statistically significant attributes were allowed to be selected by the information gain procedure
12
The Weather example: Observed Count

Observed counts of Outlook vs. Play (total count in table = 3):

                   Play = Yes   Play = No   Outlook subtotal
  Outlook Sunny         2            0              2
  Outlook Cloudy        0            1              1
  Play subtotal         2            1
13
The Weather example: Expected Count

If Outlook and Play were independent, the cell counts would be (row subtotal * column subtotal / total):

                   Play = Yes        Play = No         Outlook subtotal
  Outlook Sunny    2*2/3 = 1.33      2*1/3 = 0.67             2
  Outlook Cloudy   1*2/3 = 0.67      1*1/3 = 0.33             1
  Play subtotal         2                 1            Total count in table = 3
14
Question: How different are the observed and expected counts?
• If the chi-squared value is very large, then A1 and A2 are not independent, that is, they are dependent!
• Degrees of freedom: if the table has n*m cells, then freedom = (n-1)*(m-1)
• If all attributes at a node are independent of the class attribute, then stop splitting further.
15
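A small Python check of the chi-squared computation for the observed/expected tables above (the function name is illustrative; the 3.84 critical value for 1 degree of freedom at the 5% level is the standard tabulated value, everything else comes from the slides):

def chi_squared(observed):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (o - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

observed = [[2, 0],   # Sunny:  Yes, No
            [0, 1]]   # Cloudy: Yes, No
chi2, dof = chi_squared(observed)
print(chi2, dof)      # 3.0, 1
# 3.0 < 3.84 (5% critical value for 1 degree of freedom), so the association is not
# statistically significant and prepruning would stop splitting here.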
Postpruning
• Builds full tree first and prunes it afterwards
  • Attribute interactions are visible in the fully-grown tree
• Problem: identification of subtrees and nodes that are due to chance effects
• Two main pruning operations:
  1. Subtree replacement
  2. Subtree raising
• Possible strategies: error estimation, significance testing, MDL principle
16
Subtree replacement
• Bottom-up: a tree is considered for replacement once all its subtrees have been considered
17
Subtree raising
• Deletes a node and redistributes instances
• Slower than subtree replacement (Worthwhile?)
18
Estimating error rates
• A pruning operation is performed if it does not increase the estimated error
• Of course, error on the training data is not a useful estimator (would result in almost no pruning)
• One possibility: use a hold-out set for pruning (reduced-error pruning)
• C4.5's method: use the upper limit of a 25% confidence interval derived from the training data
  • Standard Bernoulli-process-based method
19
Training Set

  Outlook    Temperature  Humidity  Windy  Class
  sunny      hot          high      false  N
  sunny      hot          high      true   N
  overcast   hot          high      false  P
  rain       mild         high      false  P
  rain       cool         normal    false  P
  rain       cool         normal    true   N
  overcast   cool         normal    true   P
  sunny      mild         high      false  N
  sunny      cool         normal    false  P
  rain       mild         normal    false  P
  sunny      mild         normal    true   P
  overcast   mild         high      true   P
  overcast   hot          normal    false  P
  rain       mild         high      true   N
20
Post-pruning in C4.5
• Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
  • Method 1: compute accuracy using examples not seen by the algorithm.
  • Method 2: estimate accuracy using the training examples: consider classifying E examples incorrectly out of N examples as observing E events in N trials of the binomial distribution.
    • For a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E, N) with CF% confidence.
21
Pessimistic Estimate
• Usage in Statistics: sampling error estimation
• Example:
  • population: 1,000,000 people, could be regarded as infinite
  • population mean: percentage of left-handed people
  • sample: 100 people
  • sample mean: 6 left-handed
  • How to estimate the REAL population mean?
[Figure: sampling distribution with a 75% confidence interval; lower limit L0.25(100,6) ≈ 2, observed value 6, upper limit U0.25(100,6) ≈ 10]
22
Pessimistic Estimate
• Usage in Decision Trees (DT): error estimation for a node in the DT
• Example:
  • unknown testing data: could be regarded as an infinite universe
  • population mean: percentage of errors made by this node
  • sample: 100 examples from the training data set
  • sample mean: 6 errors on the training data set
  • How to estimate the REAL average error rate?
• Heuristic! But works well...
[Figure: same 75% confidence interval picture; lower limit L0.25(100,6) ≈ 2, observed value 6, upper limit U0.25(100,6) ≈ 10]
23
C4.5's method
• Error estimate for a subtree is the weighted sum of error estimates for all its leaves
• Error estimate for a node:

  e = ( f + z^2/(2N) + z * sqrt( f/N - f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )

• If c = 25% then z = 0.69 (from the normal distribution)
• f is the error on the training data
• N is the number of instances covered by the leaf
24
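As a quick check, this sketch plugs the leaf statistics from the later example slide (f = 0.33 with N = 6, f = 0.5 with N = 2, combined in the ratio 6:2:6) into the formula; the function name is illustrative:

import math

def pessimistic_error(f, N, z=0.69):
    """C4.5-style error estimate for a node: upper limit of the confidence interval (z = 0.69 for c = 25%)."""
    return (f + z * z / (2 * N)
            + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

e1 = pessimistic_error(2 / 6, 6)   # f = 0.33, N = 6  -> about 0.47
e2 = pessimistic_error(1 / 2, 2)   # f = 0.5,  N = 2  -> about 0.72
# Combining the three leaves in the ratio 6:2:6 gives about 0.51
print(round(e1, 2), round(e2, 2), round((6 * e1 + 2 * e2 + 6 * e1) / 14, 2))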
Example for Estimating Error
• Consider a subtree rooted at Outlook with 3 leaf nodes:
  • Sunny: Play = yes: (0 errors, 6 instances)
  • Overcast: Play = yes: (0 errors, 9 instances)
  • Cloudy: Play = no: (0 errors, 1 instance)
• U0.25(6,0) = 0.206, U0.25(9,0) = 0.143, U0.25(1,0) = 0.750
• The estimated error for this subtree is 6*0.206 + 9*0.143 + 1*0.750 = 3.273
• If the subtree is replaced with the single leaf "yes", the estimated error is 16*U0.25(16,1) = 16*0.157 = 2.512
• Since 2.512 < 3.273, the pruning is performed and the tree is merged (see next page)
25
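For leaves with zero observed errors the upper limit has a closed form under the exact-binomial reading of U_CF: the error rate p at which observing 0 errors in N trials has probability CF, i.e. U_CF(0, N) = 1 - CF^(1/N). A quick check against the values used above (the general E > 0 case needs the full binomial distribution and is not shown here):

def u_cf_zero_errors(N, CF=0.25):
    # U_CF(0, N) = 1 - CF**(1/N): the rate at which 0 errors in N trials has probability CF
    return 1.0 - CF ** (1.0 / N)

for N in (6, 9, 1):
    print(N, round(u_cf_zero_errors(N), 3))
# 6 -> 0.206, 9 -> 0.143, 1 -> 0.75, matching U0.25(6,0), U0.25(9,0), U0.25(1,0) above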
Example continued
[Diagram: the Outlook subtree (sunny -> yes, overcast -> yes, cloudy -> no) is replaced by the single leaf "yes"]
26
Example
[Diagram: a subtree whose three leaves have f = 0.33, e = 0.47; f = 0.5, e = 0.72; f = 0.33, e = 0.47. Combined using the ratios 6:2:6, this gives 0.51. The parent node has f = 5/14, e = 0.46, which is lower, so the subtree would be replaced by a single leaf.]
27
Complexity of tree induction
• Assume m attributes, n training instances and a tree depth of O(log n)
• Cost for building a tree: O(m n log n)
• Complexity of subtree replacement: O(n)
• Complexity of subtree raising: O(n (log n)^2)
  • Every instance may have to be redistributed at every node between its leaf and the root: O(n log n)
  • Cost for redistribution (on average): O(log n)
• Total cost: O(m n log n) + O(n (log n)^2)
28
The CART Algorithm
• Splitting criterion (standard deviation reduction): SDR = sd(T) - Σ_i (|T_i| / |T|) * sd(T_i)
• Standard deviation: sd(T) = sqrt( Σ_x P(x) * (x - μ)^2 )
• Linear regression prediction for an instance x^(1): y^(1) = w_0*x_0^(1) + w_1*x_1^(1) + w_2*x_2^(1) + ... + w_k*x_k^(1) = Σ_{j=0..k} w_j*x_j^(1)
• Least-squares weights: W = (X^T X)^(-1) X^T y
29
Numeric prediction
• Counterparts exist for all schemes that we previously discussed
  • Decision trees, rule learners, SVMs, etc.
• All classification schemes can be applied to regression problems using discretization
  • Prediction: weighted average of intervals' midpoints (weighted according to class probabilities)
• Regression more difficult than classification (i.e. percent correct vs. mean squared error)
30
Regression trees
• Differences to decision trees:
  • Splitting criterion: minimizing intra-subset variation
  • Pruning criterion: based on a numeric error measure
  • Leaf node predicts the average class value of the training instances reaching that node
• Can approximate piecewise constant functions
• Easy to interpret
• More sophisticated version: model trees
31
Model trees
• Regression trees with linear regression functions at each node
• Linear regression applied to instances that reach a node after the full regression tree has been built
• Only a subset of the attributes is used for LR
  • Attributes occurring in the subtree (+ maybe attributes occurring in the path to the root)
• Fast: overhead for LR not large because usually only a small subset of attributes is used in the tree
32
Smoothing
• Naïve method for prediction outputs the value of the LR model at the corresponding leaf node
• Performance can be improved by smoothing predictions using internal LR models
  • Predicted value is a weighted average of the LR models along the path from root to leaf
  • Smoothing formula: p' = (n*p + k*q) / (n + k), where p is the prediction passed up from below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant
• The same effect can be achieved by incorporating the internal models into the leaf nodes
33
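A tiny sketch of applying the smoothing formula along the path from a leaf to the root; the value k = 15 is an assumed smoothing constant and the numbers are made up for illustration:

def smooth(leaf_prediction, path, k=15.0):
    """Filter a leaf's prediction up the path using p' = (n*p + k*q) / (n + k).

    path: (q, n) pairs from the leaf's parent up to the root, where q is the prediction of
    the internal node's LR model and n the number of training instances reaching the node below.
    k = 15 is an assumed default smoothing constant.
    """
    p = leaf_prediction
    for q, n in path:
        p = (n * p + k * q) / (n + k)
    return p

print(smooth(3.4, [(3.9, 20), (4.2, 60)]))   # leaf value 3.4 pulled toward the internal models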
Building the tree
• Splitting criterion: standard deviation reduction

  SDR = sd(T) - Σ_i (|T_i| / |T|) * sd(T_i)

• Termination criteria (important when building trees for numeric prediction):
  • Standard deviation becomes smaller than a certain fraction of the sd for the full training set (e.g. 5%)
  • Too few instances remain (e.g. fewer than four)
34
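A small sketch of computing SDR for one candidate split; the function name and toy numbers are only for illustration:

from statistics import pstdev

def sdr(parent, subsets):
    """Standard deviation reduction: sd(T) minus the size-weighted sd of the subsets."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)

# Toy target values at a node, split into two candidate subsets
targets = [3.0, 3.2, 2.9, 7.8, 8.1, 8.0]
left, right = targets[:3], targets[3:]
print(round(sdr(targets, [left, right]), 3))   # large reduction: the split separates low from high values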
Pruning
• Pruning is based on the estimated absolute error of the LR models
• Heuristic estimate: (n + v) / (n - v) * average_absolute_error, where n is the number of training instances reaching the node and v is the number of parameters in the model
• LR models are pruned by greedily removing terms to minimize the estimated error
• Model trees allow for heavy pruning: often a single LR model can replace a whole subtree
• Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree
35
Nominal attributes
• Nominal attributes are converted into binary attributes (that can be treated as numeric ones)
  • Nominal values are sorted using the average class value
  • If there are k values, k-1 binary attributes are generated
    • The ith binary attribute is 0 if an instance's value is one of the first i in the ordering, 1 otherwise
• It can be proven that the best split on one of the new attributes is the best binary split on the original
• But M5' only does the conversion once (a sketch of the conversion follows)
36
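A sketch of the conversion under these rules; nominal_to_binary is an illustrative helper, not M5's actual code:

from collections import defaultdict

def nominal_to_binary(values, targets):
    """Encode a k-valued nominal attribute as k-1 binary attributes,
    ordering the values by their average (numeric) class value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    order = sorted(counts, key=lambda v: sums[v] / counts[v])   # sort by average class value
    rank = {v: r for r, v in enumerate(order)}
    k = len(order)
    # the i-th binary attribute is 0 if the value is among the first i values in the ordering, else 1
    return [[0 if rank[v] < i else 1 for i in range(1, k)] for v in values]

values = ["red", "blue", "green", "red", "blue"]
targets = [10.0, 2.0, 5.0, 12.0, 4.0]
print(nominal_to_binary(values, targets))
# [[1, 1], [0, 0], [1, 0], [1, 1], [0, 0]]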
Missing values
• Modified splitting criterion:

  SDR = (m / |T|) * [ sd(T) - Σ_i (|T_i| / |T|) * sd(T_i) ]

  (m is the number of instances with known values for the attribute being considered)
• Procedure for deciding into which subset the instance goes: surrogate splitting
  • Choose the attribute for splitting that is most highly correlated with the original attribute
  • Problem: complex and time-consuming
  • Simple solution: always use the class
  • Testing: replace the missing value with the average
37
Pseudo-code for M5'
• Four methods:
  • Main method: MakeModelTree()
  • Method for splitting: split()
  • Method for pruning: prune()
  • Method that computes error: subtreeError()
• We'll briefly look at each method in turn
• Linear regression method is assumed to perform attribute subset selection based on error
38
MakeModelTree()

MakeModelTree(instances)
{
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)
}
39
split()

split(node)
{
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05*SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)
}
40
prune()

prune(node)
{
  if node.type = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF
}
41
subtreeError()

subtreeError(node)
{
  l = node.left; r = node.right
  if node.type = INTERIOR then
    return (sizeof(l.instances)*subtreeError(l)
            + sizeof(r.instances)*subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)
}
42
Model tree for servo data
43
Variations of CART
• Applying Logistic Regression
  • Predict the probability of "True" or "False" instead of making a numerical-valued prediction
  • Predict a probability value (p) rather than the outcome itself
  • The odds ratio p / (1 - p) is modeled through its logarithm as a linear function of the attributes:

    log( p / (1 - p) ) = Σ_i W_i * X_i

  • Equivalently: p = 1 / ( 1 + e^( -Σ_i W_i * X_i ) )
44
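A minimal sketch of turning a weight vector into a probability with the logistic function; the weights and attribute values here are made up for illustration:

import math

def predict_probability(weights, x):
    """p = 1 / (1 + e^(-W·X)) for a weight vector W and attribute vector X."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.5, -1.2, 2.0]   # hypothetical learned coefficients
x = [1.0, 0.3, 0.8]          # attribute values for one instance
print(round(predict_probability(weights, x), 3))   # probability that the class is "True"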
Other Trees
• Classification Trees
  • Current node: Q = min(p0, p1)
  • Children nodes (L, R): Q = pL * min(pL0, pL1) + pR * min(pR0, pR1)
• Decision Trees
  • Current node: Q = -p0*log(p0) - p1*log(p1)
  • Children nodes (L, R): Q = pL*QL + pR*QR
• GINI index used in CART (STD = p0*p1)
  • Current node: Q = p0*p1
  • Children nodes (L, R): Q = pL*QL + pR*QR
(Notation: p0, p1 are the class proportions at a node; pL, pR are the fractions of instances sent to the left and right children; pL0, pL1, pR0, pR1 are the class proportions within each child; QL, QR are the children's Q values.)
45
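A short sketch comparing the three node measures Q listed above as a function of the class proportions (the function name is illustrative; base-2 logarithm is used for the entropy):

import math

def impurities(p1):
    """Return the three node measures for a two-class node with class-1 proportion p1."""
    p0 = 1.0 - p1
    misclassification = min(p0, p1)                              # classification trees
    entropy = -sum(p * math.log2(p) for p in (p0, p1) if p > 0)  # decision trees
    gini = p0 * p1                                               # GINI index used in CART
    return misclassification, entropy, gini

for p1 in (0.0, 0.1, 0.3, 0.5):
    print(p1, [round(q, 3) for q in impurities(p1)])
# All three measures are 0 for a pure node and largest at p1 = 0.5.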
Efforts on Scalability
• Most algorithms assume data can fit in memory.
• Recent efforts focus on disk-resident implementations for decision trees.
• Random sampling
• Partitioning
• Examples:
  • SLIQ (EDBT'96 -- [MAR96])
  • SPRINT (VLDB'96 -- [SAM96])
  • PUBLIC (VLDB'98 -- [RS98])
  • RainForest (VLDB'98 -- [GRG98])
47