Data Mining: Decision Trees
Flisha Fernandez
DATA MINING
1
WHAT IS DATA MINING?
Data Mining is the discovery of hidden
knowledge, unexpected patterns and new rules in
large databases
Automating the process of searching for patterns in the data
Emily Thomas, State University of New
York
2
OBJECTIVE OF DATA MINING
Any of four types of relationships are sought:
Classes – stored data is used to locate data in
predetermined groups
Clusters – data items grouped according to logical
relationships or consumer preferences
Associations – Walmart example (beer and diapers)
Sequential Patterns – data mined to anticipate
behavior patterns and trends
anderson.ucla.edu
3
DATA MINING TECHNIQUES
Rule Induction – extraction of if-then rules
Nearest Neighbors
Artificial Neural Networks – models that learn
Clustering
Genetic Algorithms – concept of natural evolution
Factor Analysis
Exploratory
Stepwise Regression
Data Visualization – usage of graphic tools
Decision Trees
Emily Thomas, State University of New
York
4
CLASSIFICATION
A major data mining operation
Given one attribute (e.g. Wealth), try to predict its value for new people by means of
some of the other available attributes
Applies to categorical outputs
Flisha Fernandez
Categorical attribute: an attribute which takes
on two or more discrete values, also known as a
symbolic attribute
Real attribute: a column of real numbers, also
known as a continuous attribute
5
DATASET
[Figure: a sample dataset with real (continuous) attributes, symbolic (categorical) attributes, and a classification column]
Kohavi 1995
6
Flisha Fernandez
DECISION TREES
7
DECISION TREES
Also called classification trees or regression trees
Based on recursive partitioning of the sample
space
Tree-shaped structures that represent sets of
decisions/data which generate rules for the
classification of a (new, unclassified) dataset
The resulting classification tree becomes an input
for decision making
anderson.ucla.edu / Wikipedia
8
DECISION TREES
A Decision Tree is where:
The nonleaf nodes are labeled with attributes
The arcs out of a node labeled with attribute A
are labeled with each of the possible values
of the attribute A
The leaves of the tree are labeled with classifications
Learning Decision Trees
9
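The structure described above can be sketched as a nested dictionary. The following Python snippet is not part of the original slides; the `tree` literal and `classify` helper are illustrative assumptions, using the golf tree that appears later in this deck.

```python
# A decision tree as a nested dict: a non-leaf node maps an attribute name to
# {attribute value: subtree}; a leaf is simply a classification label (a string).
tree = {"Outlook": {"Sunny": {"Humidity": {"Normal": "Play", "High": "Don't Play"}},
                    "Overcast": "Play",
                    "Rain": {"Windy": {"False": "Play", "True": "Don't Play"}}}}

def classify(node, example):
    """Follow the arc matching the example's value for each tested attribute until a leaf."""
    while isinstance(node, dict):
        attribute = next(iter(node))                  # the attribute this node is labeled with
        node = node[attribute][example[attribute]]    # take the arc labeled with the example's value
    return node

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Windy": "False"}))  # Don't Play
```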
ADVANTAGES
Simple to understand and interpret
Requires little data preparation (other
techniques need data normalization, dummy
variables, etc.)
Can support both numerical and categorical data
Uses white box model (explained by boolean
logic)
Reliable – possible to validate model using
statistical tests
Robust – large amounts of data can be analyzed
in a short amount of time
Wikipedia
10
Flisha Fernandez
DECISION TREE EXAMPLE
11
DAVID’S DEBACLE
Objective: Come up with an optimized staff schedule
Status Quo: Sometimes people play golf, sometimes they do not
Means: Predict when people will play golf and when they will not
From Wikipedia
12
DAVID’S DATASET
Outlook  | Temp | Humidity | Windy | Play
Sunny    | 85   | 85       | No    | No
Sunny    | 80   | 90       | Yes   | No
Overcast | 83   | 78       | No    | Yes
Rain     | 70   | 96       | No    | Yes
Rain     | 68   | 80       | No    | Yes
Rain     | 65   | 70       | Yes   | No
Overcast | 64   | 65       | Yes   | Yes
Sunny    | 72   | 95       | No    | No
Sunny    | 69   | 70       | No    | Yes
Rain     | 75   | 90       | No    | Yes
Sunny    | 75   | 70       | Yes   | Yes
Overcast | 72   | 90       | Yes   | Yes
Overcast | 81   | 75       | No    | Yes
Rain     | 71   | 80       | Yes   | No
Quinlan 1989
13
DAVID’S DIAGRAM
[Figure: how the 14 examples split. The root (Play: 9, Don’t: 5) splits on outlook?
SUNNY branch (Play: 2, Don’t: 3) splits on humid?: HUMIDITY <= 70 gives Play: 2, Don’t: 0; HUMIDITY > 70 gives Play: 0, Don’t: 3
OVERCAST branch gives Play: 4, Don’t: 0
RAIN branch (Play: 3, Don’t: 2) splits on windy?: WINDY gives Play: 0, Don’t: 2; NOT WINDY gives Play: 3, Don’t: 0]
From Wikipedia
14
DAVID’S DECISION TREE
[Figure: the final decision tree.
Root node: Outlook? (non-leaf nodes are labeled with attributes)
Arcs are labeled with the possible values: Sunny, Overcast, Rain
Sunny leads to Humidity?: Normal leads to Play, High leads to Don’t Play
Overcast leads to Play
Rain leads to Windy?: False leads to Play, True leads to Don’t Play
Leaves are labeled with classifications]
Flisha Fernandez
15
DAVID’S DECISION
Dismiss staff when it is
Sunny AND Hot
Rainy AND Windy
Hire extra staff when it is
Cloudy
Sunny AND Not So Hot
Rainy AND Not Windy
From Wikipedia
16
Flisha Fernandez
DECISION TREE INDUCTION
17
DECISION TREE INDUCTION ALGORITHM
Basic Algorithm (Greedy Algorithm)
Simon Fraser University, Canada
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they
are discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a certain
measure (e.g., information gain)
18
ISSUES
How should the attributes be split?
What is the appropriate root?
What is the best split?
Decision Tree Learning
When should we stop splitting?
Should we prune?
19
TWO PHASES
Tree Growing (Splitting)
Splitting data into progressively smaller subsets
Analyzing the data to find the independent variable
(such as outlook, humidity, windy) that when used as
a splitting rule will result in nodes that are most
different from each other with respect to the
dependent variable (play)
Tree Pruning
DBMS Data Mining Solutions Supplement
20
SPLITTING
Flisha Fernandez
21
SPLITTING ALGORITHMS
Random
Information Gain
Information Gain Ratio
GINI Index
DBMS Data Mining Solutions Supplement
22
DATASET
AIXploratorium
23
RANDOM SPLITTING
AIXploratorium
24
RANDOM SPLITTING
Disadvantages
Trees can grow huge
Hard to understand
Less accurate than smaller trees
AIXploratorium
25
SPLITTING ALGORITHMS
Random
Information Gain
Information Gain Ratio
GINI Index
DBMS Data Mining Solutions Supplement
26
INFORMATION ENTROPY
A measure of the uncertainty associated with a
random variable
A measure of the average information content the
recipient is missing when they do not know the
value of the random variable
A long string of repeating characters has an
entropy of 0, since every character is predictable
Example: Coin Toss
Wikipedia
Independent fair coin flips have an entropy of 1 bit
per flip
A double-headed coin has an entropy of 0 – each toss
of the coin delivers no information
27
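As a quick illustration of the definition above, here is a minimal Python sketch (not from the original slides); the `entropy` helper name is an assumption.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over the nonzero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin: 1 bit per flip
print(entropy([1.0]))        # a double-headed coin: 0 bits, each toss delivers no information
```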
INFORMATION GAIN
A good measure for deciding the relevance of an
attribute
The value of Information Gain is the reduction in
the entropy of X achieved by learning the state of
the random variable A
Can be used to define a preferred sequence
(decision tree) of attributes to investigate to most
rapidly narrow down the state of X
Used by ID3 and C4.5
Wikipedia
28
CALCULATING INFORMATION GAIN
First compute the information content:
I(p, n) = – (p / (p + n)) log2(p / (p + n)) – (n / (p + n)) log2(n / (p + n))
Ex. Attribute Thread = New, Skips = 3, Reads = 7
– 0.3 * log 0.3 – 0.7 * log 0.7 = 0.881 (using log base 2)
Ex. Attribute Thread = Old, Skips = 6, Reads = 2
– 0.75 * log 0.75 – 0.25 * log 0.25 = 0.811 (using log base 2)
Then compute the expected entropy after the split:
E(A) = Σ_{i=1}^{v} ((p_i + n_i) / (p + n)) · I(p_i, n_i)
Information Gain is Gain(A) = I(p, n) – E(A)
Of 18 threads, 10 are new and 8 are old, and overall I(9, 9) = 1.0, so
Gain = 1.0 – ((10/18) * 0.881 + (8/18) * 0.811) = 0.150
Flisha Fernandez
29
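The thread example above can be reproduced with a short Python sketch (an illustration, not from the original slides); the `info` helper mirrors the I(p, n) formula.

```python
import math

def info(p, n):
    """I(p, n): entropy in bits of a node with p examples of one class and n of the other."""
    total = p + n
    return -sum(k / total * math.log2(k / total) for k in (p, n) if k > 0)

i_new = info(3, 7)                               # Thread = New: 3 skips, 7 reads -> ~0.881
i_old = info(6, 2)                               # Thread = Old: 6 skips, 2 reads -> ~0.811
e_thread = (10 / 18) * i_new + (8 / 18) * i_old  # expected entropy E(Thread)
print(round(info(9, 9) - e_thread, 3))           # Gain = I(9, 9) - E(Thread) ~ 0.15
```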
INFORMATION GAIN
AIXploratorium
30
TEST DATA
Where | When | Fred Starts | Joe Offense | Joe Defense | Opp C | Outcome
Away  | 9pm  | No          | Forward     | Center      | Tall  | ??
AIXploratorium
31
DRAWBACKS OF INFORMATION GAIN
Prefers attributes with many values (e.g. real attributes)
For example, it prefers an attribute like AudienceSize with values {1, 2, 3, ..., 150, 151, ..., 1023, 1024, ...}
But attributes with more values are not necessarily better
Example: credit card number
Has a high information gain because it uniquely identifies each customer
But deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before
Wikipedia / AIXploratorium
32
SPLITTING ALGORITHMS
Random
Information Gain
Information Gain Ratio
GINI Index
DBMS Data Mining Solutions Supplement
33
INFORMATION GAIN RATIO
Works by penalizing multiple-valued attributes
Gain ratio takes the number and size of branches into account when choosing an attribute
It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to)
The intrinsic information is large when the data is spread evenly across the branches and small when all the data belong to one branch
http://www.it.iitb.ac.in/~sunita
34
DATASET
ID
Humidit
y
Windy Play?
A
sunny
hot
high
false
No
B
sunny
hot
high
true
No
C
overcast
hot
high
false
Yes
D
rain
mild
high
false
Yes
E
rain
cool
normal
false
Yes
F
rain
cool
normal
true
No
G
overcast
cool
normal
true
Yes
H
sunny
mild
high
false
No
I
sunny
cool
normal
false
Yes
J
rain
mild
normal
false
Yes
K
sunny
mild
normal
true
Yes
L
overcast
mild
high
true
Yes
M
overcast
hot
normal
false
Yes
N
rain
mild
high
true
No
http://www.it.iitb.ac.in/~sunita
Outlook
Temperatur
e
35
CALCULATING GAIN RATIO
Intrinsic information: entropy of the distribution of instances into branches
Gain ratio (Quinlan ’86) normalizes the information gain by the intrinsic information of the split:
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
IntrinsicInfo(S, A) = – Σ_i (|S_i| / |S|) · log2(|S_i| / |S|)
http://www.it.iitb.ac.in/~sunita
36
COMPUTING GAIN RATIO
Example: intrinsic information for the ID code attribute, which splits the 14 instances into 14 singleton branches:
intrinsic_info([1, 1, ..., 1]) = 14 × (–1/14 × log2(1/14)) = 3.807 bits
The importance of an attribute decreases as its intrinsic information gets larger
Example of gain ratio:
gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246
http://www.it.iitb.ac.in/~sunita
37
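A small Python sketch of the intrinsic-information calculation above (an illustration, not from the slides); `intrinsic_info` is an assumed helper name.

```python
import math

def intrinsic_info(branch_sizes):
    """IntrinsicInfo(S, A) = -sum(|Si|/|S| * log2(|Si|/|S|)) over the branches of the split."""
    total = sum(branch_sizes)
    return -sum(s / total * math.log2(s / total) for s in branch_sizes if s > 0)

split_info_id = intrinsic_info([1] * 14)  # ID code sends each of the 14 rows to its own branch: ~3.807 bits
print(round(0.940 / split_info_id, 3))    # gain ratio for ID_code ~ 0.247 (the slide rounds to 0.246)
```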
DATASET
ID
Humidit
y
Windy Play?
A
sunny
hot
high
false
No
B
sunny
hot
high
true
No
C
overcast
hot
high
false
Yes
D
rain
mild
high
false
Yes
E
rain
cool
normal
false
Yes
F
rain
cool
normal
true
No
G
overcast
cool
normal
true
Yes
H
sunny
mild
high
false
No
I
sunny
cool
normal
false
Yes
J
rain
mild
normal
false
Yes
K
sunny
mild
normal
true
Yes
L
overcast
mild
high
true
Yes
M
overcast
hot
normal
false
Yes
N
rain
mild
high
true
No
http://www.it.iitb.ac.in/~sunita
Outlook
Temperatur
e
38
INFORMATION GAIN RATIOS
Outlook: Info = 0.693; Gain = 0.940 – 0.693 = 0.247; Split info = info([5,4,5]) = 1.577; Gain ratio = 0.247/1.577 = 0.156
Temperature: Info = 0.911; Gain = 0.940 – 0.911 = 0.029; Split info = info([4,6,4]) = 1.557; Gain ratio = 0.029/1.557 = 0.019
Humidity: Info = 0.788; Gain = 0.940 – 0.788 = 0.152; Split info = info([7,7]) = 1.000; Gain ratio = 0.152/1.000 = 0.152
Windy: Info = 0.892; Gain = 0.940 – 0.892 = 0.048; Split info = info([8,6]) = 0.985; Gain ratio = 0.048/0.985 = 0.049
http://www.it.iitb.ac.in/~sunita
39
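The figures above can be recomputed from the per-branch class counts of the weather dataset. The following Python sketch is an illustration, not from the slides; the `H` helper and the hard-coded counts are assumptions read off the dataset table.

```python
import math

def H(counts):
    """Entropy in bits of a class-count vector, e.g. H([9, 5]) ~ 0.940."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (yes, no) class counts in each branch, read off the 14-row weather dataset
branches = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],   # sunny, overcast, rain
    "Temperature": [(2, 2), (4, 2), (3, 1)],   # hot, mild, cool
    "Humidity":    [(3, 4), (6, 1)],           # high, normal
    "Windy":       [(6, 2), (3, 3)],           # false, true
}

base = H([9, 5])                                # info for the whole dataset
for name, parts in branches.items():
    sizes = [sum(p) for p in parts]
    info = sum(s / sum(sizes) * H(p) for s, p in zip(sizes, parts))
    gain = base - info                          # information gain
    split_info = H(sizes)                       # intrinsic information of the split
    print(name, round(gain, 3), round(gain / split_info, 3))
```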
MORE ON GAIN RATIO
“Outlook” still comes out top
However, “ID code” has a greater gain ratio
Standard fix: an ad hoc test to prevent splitting on that type of attribute
Problem with gain ratio: it may overcompensate
May choose an attribute just because its intrinsic information is very low
Standard fix:
First, only consider attributes with greater than average information gain
Then, compare them on gain ratio
http://www.it.iitb.ac.in/~sunita
40
INFORMATION GAIN RATIO
[Figure: the decision tree built using information gain ratio]
AIXploratorium
41
TEST DATA
Whe
n
Fred
Joe
Starts Offense
Joe
Defense
Opp
C
Outcom
e
Away
9pm
No
Forward
Tall
??
Center
AIXploratorium
Wher
e
42
SPLITTING ALGORITHMS
Random
Information Gain
Information Gain Ratio
GINI Index
DBMS Data Mining Solutions Supplement
43
GINI INDEX
All attributes are assumed continuous-valued
Assumes there exist several possible split values for each attribute
May need other tools, such as clustering, to get the possible split values
Can be modified for categorical attributes
Used in CART, SLIQ, SPRINT
Flisha Fernandez
44
GINI INDEX
If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 – Σ_{j=1}^{n} p_j²
where p_j is the relative frequency of class j in node T
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to split the node
[email protected]
45
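A minimal Python sketch of the two formulas above (an illustration, not from the slides); the split counts in the last line are hypothetical.

```python
def gini(counts):
    """gini(T) = 1 - sum(p_j^2), where p_j is the relative frequency of class j in node T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted Gini of a two-way split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n1, n2 = sum(counts1), sum(counts2)
    return n1 / (n1 + n2) * gini(counts1) + n2 / (n1 + n2) * gini(counts2)

print(gini([3, 3]))                # 0.5: records equally distributed over two classes
print(gini_split([4, 0], [5, 5]))  # hypothetical split: one pure node, one mixed node, ~0.357
```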
GINI INDEX
Maximum (1 – 1/n_c) when records are equally distributed among all classes, implying least interesting information
Minimum (0.0) when all records belong to one class, implying most interesting information
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
[email protected]
46
EXAMPLES FOR COMPUTING GINI
GINI(t) = 1 – Σ_j [p(j | t)]²
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
[email protected]
47
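A quick Python check of the worked values above (not from the slides):

```python
for c1, c2 in ((0, 6), (1, 5), (2, 4), (3, 3)):
    n = c1 + c2
    g = 1.0 - (c1 / n) ** 2 - (c2 / n) ** 2   # GINI(t) = 1 - sum of squared class probabilities
    print((c1, c2), round(g, 3))              # 0.0, 0.278, 0.444, 0.5
```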
STOPPING RULES
Pure nodes
Maximum tree depth, or maximum number of
nodes in a tree
Because of overfitting problems
Minimum number of elements in a node
considered for splitting, or its near equivalent
Minimum number of elements that must be in a
new node
A threshold for the purity measure can be
imposed such that if a node has a purity value
higher than the threshold, no partitioning will be
attempted regardless of the number of
observations
DBMS Data Mining Solutions Supplement
/ AI Depot
48
OVERFITTING
The generated tree may overfit the training data
Too many branches, some may reflect
anomalies due to noise or outliers
The result is poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do
not split a node if this would result in the
goodness measure falling below a threshold
Flisha Fernandez
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully
grown” tree—get a sequence of progressively
pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”
49
PRUNING
Used to make a tree more general, more accurate
Removes branches that reflect noise
DBMS Data Mining Solutions Supplement
50
ORDER OF PRUNING
Prune away the "weakest link": the nodes that add least to the overall accuracy of the tree
Contribution to the overall tree is a function of both increase in accuracy and size of node
Accuracy gain is weighted by share of sample
Small nodes tend to get removed before large ones
If several nodes have the same contribution they all prune away simultaneously
Hence more than two terminal nodes could be cut off in one pruning
Sequence determined all the way back to the root node
Need to allow for the possibility that the entire tree is bad
If the target variable is unpredictable we will want to prune back to the root . . . the "no model" solution
Flisha Fernandez
51
CROSS-VALIDATION
Give the learner only a subset of the training sample
This gives us a measure of the quality of the decision trees produced by each splitting algorithm
Remember that more examples lead to better estimates
AIXploratorium
Ex: Training on the first 15 games, testing on the last 5
52
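A minimal holdout-split sketch in Python (an illustration, not from the slides); `games` is a hypothetical stand-in for the 20 basketball game records.

```python
def holdout_split(examples, n_train):
    """Train on the first n_train examples, hold out the rest for testing."""
    return examples[:n_train], examples[n_train:]

games = [{"game": i} for i in range(20)]   # hypothetical stand-in for 20 game records
train, test = holdout_split(games, 15)     # train on the first 15, test on the last 5
print(len(train), len(test))               # 15 5
```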
Flisha Fernandez
ALGORITHMS
53
DECISION TREE METHODS
ID3 and C4.5 Algorithms
Developed by Ross Quinlan
Classification and Regression Trees (CART)
Segments a dataset using 2-way splits
Chi Square Automatic Interaction Detection (CHAID)
Segments a dataset using chi square tests to create multi-way splits
And many others
anderson.ucla.edu
54
ID3 ALGORITHM
Developed by Ross Quinlan
Iterative Dichotomiser 3
Based on Occam’s Razor
Prefers smaller decision trees over larger ones
Does not always produce the smallest tree; it is a heuristic
Summarized as follows:
Wikipedia
Take all unused attributes and compute their entropy with respect to the training samples
Choose the attribute for which entropy is minimum (i.e., information gain is maximum)
Make a node containing that attribute
55
ID3 ALGORITHM
Create a root node for the tree
If all examples are positive, return the single-node tree Root, with label = +
If all examples are negative, return the single-node tree Root, with label = -
If the number of predicting attributes is empty, then return the single-node tree Root, with label = most common value of the target attribute in the examples
Otherwise Begin
A = the attribute that best classifies the examples
Decision tree attribute for Root = A
For each possible value, vi, of A:
Add a new tree branch below Root, corresponding to the test A = vi
Let Examples(vi) be the subset of examples that have the value vi for A
If Examples(vi) is empty, then below this new branch add a leaf node with label = most common target value in the examples
Else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes – {A})
End
Return Root
Wikipedia
56
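The pseudocode above can be turned into a short recursive implementation. The Python sketch below is an illustration, not the original algorithm's code: it greedily splits on the attribute with maximum information gain and uses the shapes dataset from the next slide, and as a simplification it only creates branches for attribute values that actually occur in the current subset.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, target, attributes):
    """examples: list of dicts; returns a leaf label or {attribute: {value: subtree}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                       # all examples are in one class
        return labels[0]
    if not attributes:                              # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):                                 # information gain of splitting on attr
        remainder = sum(
            count / len(examples) *
            entropy([e[target] for e in examples if e[attr] == value])
            for value, count in Counter(e[attr] for e in examples).items())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)                # attribute that best classifies the examples
    rest = [a for a in attributes if a != best]
    return {best: {value: id3([e for e in examples if e[best] == value], target, rest)
                   for value in set(e[best] for e in examples)}}

data = [  # shapes dataset from the following slide
    {"Size": "Medium", "Color": "Blue", "Shape": "Brick", "Class": "Yes"},
    {"Size": "Small", "Color": "Red", "Shape": "Sphere", "Class": "Yes"},
    {"Size": "Large", "Color": "Green", "Shape": "Pillar", "Class": "Yes"},
    {"Size": "Large", "Color": "Green", "Shape": "Sphere", "Class": "Yes"},
    {"Size": "Small", "Color": "Red", "Shape": "Wedge", "Class": "No"},
    {"Size": "Large", "Color": "Red", "Shape": "Wedge", "Class": "No"},
    {"Size": "Large", "Color": "Red", "Shape": "Pillar", "Class": "No"},
]
print(id3(data, "Class", ["Size", "Color", "Shape"]))   # splits on Shape first, then Color for Pillar
```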
EXAMPLE DATA
Size   | Color | Shape  | Class
Medium | Blue  | Brick  | Yes
Small  | Red   | Sphere | Yes
Large  | Green | Pillar | Yes
Large  | Green | Sphere | Yes
Small  | Red   | Wedge  | No
Large  | Red   | Wedge  | No
Large  | Red   | Pillar | No
http://www.cse.unsw.edu.au/
57
CHOOSING ATTRIBUTES
The order in which attributes are chosen
determines how complicated the tree is
ID3 uses information theory to determine the
most informative attribute
A measure of the information content of a message is
the inverse of the probability of receiving the
message:
information1(M) = 1/probability(M)
http://www.cse.unsw.edu.au/
Taking logs (base 2) makes information correspond to
the number of bits required to encode a message:
information(M) = -log2(probability(M))
58
INFORMATION
The information content of a message should be
related to the degree of surprise in receiving the
message
Messages with a high probability of arrival are
not as informative as messages with low
probability
Learning aims to predict accurately, i.e. reduce
surprise
Probabilities are multiplied to get the probability
of two or more things both/all happening. Taking
logarithms of the probabilities allows information
to be added instead of multiplied
http://www.cse.unsw.edu.au/
59
ENTROPY
Different messages have different probabilities of
arrival
Overall level of uncertainty (termed entropy) is:
-Σi Pi log2Pi
Frequency can be used as a probability estimate
e.g. if there are 5 positive examples and 3 negative
examples in a node the estimated probability of
positive is 5/8 = 0.625
http://www.cse.unsw.edu.au/
60
LEARNING
Learning tries to reduce the information content
of the inputs by mapping them to fewer outputs
Hence we try to minimize entropy
http://www.cse.unsw.edu.au/
61
SPLITTING CRITERION
Work out entropy based on distribution of classes
Try splitting on each attribute
Work out expected information gain for each
attribute
Choose best attribute
http://www.cse.unsw.edu.au/
62
EXAMPLE DATA
Color
Shape
Class
Medium
Blue
Brick
Yes
Small
Red
Sphere
Yes
Large
Green
Pillar
Yes
Large
Green
Sphere
Yes
Small
Red
Wedge
No
Large
Red
Wedge
No
Large
Red
Pillar
No
http://www.cse.unsw.edu.au/
Size
63
EXAMPLE
Initial decision tree is one node with all examples
There are 4 positive examples and 3 negative examples
i.e. the probability of positive is 4/7 = 0.57; the probability of negative is 3/7 = 0.43
Entropy for Examples is: – (0.57 * log 0.57) – (0.43 * log 0.43) = 0.99
Evaluate possible ways of splitting
http://www.cse.unsw.edu.au/
64
EXAMPLE
Try a split on Size, which has three values: large, medium and small
(The example dataset is shown again on the slide)
There are four instances with Size = large
There are two large positive examples and two large negative examples
The probability of positive is 0.5
The entropy for Large is: – (0.5 * log 0.5) – (0.5 * log 0.5) = 1
http://www.cse.unsw.edu.au/
65
EXAMPLE
(The example dataset is shown again on the slide)
There is one small positive and one small negative
The entropy for Small is: – (0.5 * log 0.5) – (0.5 * log 0.5) = 1
There is only one medium positive and no medium negatives
The entropy for Medium is 0
The expected information for a split on Size is:
(4/7) * 1 + (2/7) * 1 + (1/7) * 0 = 0.86
http://www.cse.unsw.edu.au/
66
EXAMPLE
The expected information gain for Size is:
0.99 – 0.86 = 0.13
Checking the information gains on Color and Shape:
Color has an information gain of 0.52
Shape has an information gain of 0.7
Therefore split on Shape
Repeat for all subtrees
http://www.cse.unsw.edu.au/
67
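The three gains quoted above can be checked with a short Python calculation (not from the original notes); the row tuples, column indices, and `H` helper are assumptions based on the example data table.

```python
import math
from collections import Counter

def H(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

rows = [("Medium", "Blue", "Brick", "Yes"), ("Small", "Red", "Sphere", "Yes"),
        ("Large", "Green", "Pillar", "Yes"), ("Large", "Green", "Sphere", "Yes"),
        ("Small", "Red", "Wedge", "No"), ("Large", "Red", "Wedge", "No"),
        ("Large", "Red", "Pillar", "No")]

base = H([r[3] for r in rows])                        # ~0.985 (the slides round to 0.99)
for col, name in ((0, "Size"), (1, "Color"), (2, "Shape")):
    expected = sum(
        len(group) / len(rows) * H([r[3] for r in group])
        for group in ([r for r in rows if r[col] == v] for v in set(r[col] for r in rows)))
    print(name, round(base - expected, 2))            # Size 0.13, Color 0.52, Shape 0.7
```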
OUTPUT TREE
http://www.cse.unsw.edu.au/
68
WINDOWING
ID3 can deal with very large data sets by performing induction on subsets or windows onto the data:
1. Select a random subset of the whole set of training instances
2. Use the induction algorithm to form a rule to explain the current window
3. Scan through all of the training instances looking for exceptions to the rule
4. Add the exceptions to the window
Repeat steps 2 to 4 until there are no exceptions left
http://www.cse.unsw.edu.au/
69
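A rough sketch of that windowing loop in Python (an illustration, not from the original notes); `induce`, `classify`, and the `"Class"` key are assumed callables and field names supplied by the caller.

```python
import random

def windowed_induction(training, induce, classify, window_size=20):
    """induce(examples) builds a tree; classify(tree, example) returns a predicted label."""
    window = random.sample(training, min(window_size, len(training)))  # 1. random subset
    while True:
        tree = induce(window)                                          # 2. explain the window
        exceptions = [e for e in training                              # 3. scan for exceptions
                      if e not in window and classify(tree, e) != e["Class"]]
        if not exceptions:                                             # no exceptions left: done
            return tree
        window.extend(exceptions)                                      # 4. add exceptions, repeat
```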
NOISY DATA
http://www.cse.unsw.edu.au/
Frequently, training data contains "noise" (examples
which are misclassified)
In such cases, one is likely to end up with a part of the
decision tree which considers say 100 examples, of
which 99 are in class C1 and the other is apparently
in class C2
If there are any unused attributes, we might be able
to use them to elaborate the tree to take care of this
one case, but the subtree we would be building would
in fact be wrong, and would likely misclassify real
data
Thus, particularly if we know there is noise in the
training data, it may be wise to "prune" the decision
tree to remove nodes which, statistically speaking,
seem likely to arise from noise in the training data
A question to consider: How fiercely should we prune?
70
PRUNING ALGORITHM
Approximate expected error assuming that we
prune at a particular node
Approximate backed-up error from children
assuming we did not prune
If expected error is less than backed-up error,
prune
http://www.cse.unsw.edu.au/
71
EXPECTED ERROR
If we prune a node, it becomes a leaf labelled with the majority class, C
Laplace error estimate: E(S) = (N – n + k – 1) / (N + k)
http://www.cse.unsw.edu.au/
S is the set of examples in a node
k is the number of classes
N examples in S
C is the majority class in S
n out of N examples in S belong to C
72
BACKED-UP ERROR
Let the children of Node be Node1, Node2, etc.
BackedUpError(Node) = Σ_i P_i × Error(Node_i)
The probabilities P_i can be estimated by the relative frequencies of attribute values in the sets of examples that fall into the child nodes
http://www.cse.unsw.edu.au/
73
PRUNING EXAMPLE
http://www.cse.unsw.edu.au/
74
ERROR CALCULATION FOR PRUNING
The left child of b has class frequencies [3, 2], error = 0.429
The right child has an error of 0.333
The static error estimate E(b) is 0.375, calculated using the Laplace error estimate formula with N=6, n=4, and k=2
The backed-up error is: (5/6) × 0.429 + (1/6) × 0.333 = 0.413
(5/6 and 1/6 because there are 4+2=6 examples handled by node b, of which 3+2=5 go to the left subtree and 1 to the right subtree)
Since the backed-up estimate of 0.413 is greater than the static estimate of 0.375, we prune the tree and use the static error of 0.375
http://www.cse.unsw.edu.au/
75
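The numbers above can be reproduced with a small Python sketch (an illustration, not from the original notes); `laplace_error` and `backed_up_error` are assumed helper names.

```python
def laplace_error(N, n, k):
    """Laplace error estimate for a leaf: (N - n + k - 1) / (N + k),
    with N examples at the node, n in the majority class, and k classes."""
    return (N - n + k - 1) / (N + k)

def backed_up_error(child_stats, k=2):
    """Children's errors weighted by the share of examples reaching each child."""
    total = sum(N for N, _ in child_stats)
    return sum(N / total * laplace_error(N, n, k) for N, n in child_stats)

static = laplace_error(6, 4, 2)              # static estimate at node b: 0.375
backed = backed_up_error([(5, 3), (1, 1)])   # children with frequencies [3, 2] and [1, 0]: ~0.413
print(backed > static)                       # True, so node b is pruned
```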
C4.5 ALGORITHM
An extension of the ID3 algorithm
Statistical classifier
Uses the concept of Information Entropy
Wikipedia
76
C4.5 REFINEMENTS OVER ID3
Splitting criterion is Information Gain Ratio
Post-pruning after induction of trees, e.g. based
on test sets, in order to increase accuracy
Allows for attributes that have a whole range of
discrete or continuous values
Handles training data with missing attribute
values by replacing them with the most common
or the most probable value
informatics.sussex.ac.uk
77