Decision Trees


Data Mining and
Knowledge Acquisition
— Chapter 5 —
BIS 541
2012/2013 Spring
1
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary
2
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
3
Prediction Problems: Classification vs. Numeric Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
- Numeric Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis: is a tumor cancerous or benign?
  - Fraud detection: is a transaction fraudulent?
  - Web page categorization: which category does a page belong to?
4
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of a test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set, otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
5
Classification Process (1): Model Construction

Training Data:

NAME    RANK            YEARS   TENURED
Mike    Assistant Prof    3     no
Mary    Assistant Prof    7     yes
Bill    Professor         2     yes
Jim     Associate Prof    7     yes
Dave    Assistant Prof    6     no
Anne    Associate Prof    3     no

Classification Algorithms -> Classifier (Model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
6
Classification Process (2): Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof    2     no
Merlisa  Associate Prof    7     no
George   Professor         5     yes
Joseph   Assistant Prof    7     yes

Unseen Data: (Jeff, Professor, 4) -> Classifier -> Tenured?
7
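To make the two-step process concrete, here is a minimal Python sketch (not part of the original slides) that hard-codes the classifier learned from the training data, IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’, estimates its accuracy on the test set, and then classifies the unseen tuple (Jeff, Professor, 4).

```python
# Model usage: apply the learned rule to the independent test set.
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def predict(rank, years):
    """Classifier (model) represented as a single classification rule."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Accuracy rate: percentage of test set samples classified correctly.
correct = sum(predict(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy = {correct}/{len(test_set)}")   # 3/4: Merlisa is misclassified

# Classify unseen data.
print("Jeff ->", predict("Professor", 4))        # yes
```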
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary
8
Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
9
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
10
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary
11
Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal nodes denote a test on an attribute
  - Branches represent outcomes of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Once the tree is built
  - Use of decision tree: classifying an unknown sample
12
Training Dataset
This follows an example from Quinlan’s ID3 (Han Table 7.1)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Original data from DMPML Ch 4 Sec 4.3 pp 8-9, 89-94
13
Output: A Decision Tree for “buys_computer”

age? (14 cases)
- <=30 (5 cases): student?
  - no -> no
  - yes -> yes
- 31..40 (4 cases): yes
- >40 (5 cases): credit rating?
  - excellent -> no
  - fair -> yes
14
In Practice

[Timeline figure: for the i-th data object, the input is observed at time Ii (in the past) and the output Oi is assigned at the current time.]

- For the i-th data object,
  - at time Ii, the input information is known
  - at time Oi, the output is assigned (yes/no)
- For all data objects in the training data set (i = 1..14), both input and output are known
15
In Practice

[Timeline figure: for a new customer n, the input In is known at the current time, while the output On lies in the future.]

- For a new customer n,
  - at the current time, the input information is known
  - but the output O is not known
- The customer is yet to be classified as (yes or no) before her actual buying behavior is realized
- The value of a data mining study lies in predicting buying behavior beforehand
16
ID3 Algorithm for Decision Tree Induction
- ID3 algorithm, Quinlan (1986)
  - Tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
17
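The recursion described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Quinlan's original code: training examples are assumed to be a list of dicts, and information gain (developed on the following slides) is used as the attribute selection measure.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    """Reduction in entropy obtained by partitioning on attr."""
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        part = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(part) / len(examples) * entropy(part)
    return entropy(labels) - remainder

def id3(examples, attributes, target):
    """Top-down, recursive, divide-and-conquer tree construction."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:            # stop: all samples in the same class
        return labels[0]
    if not attributes:                   # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    return {best: {value: id3([ex for ex in examples if ex[best] == value],
                              [a for a in attributes if a != best], target)
                   for value in {ex[best] for ex in examples}}}

# Usage (with the buys_computer data as a list of dicts):
# id3(rows, ["age", "income", "student", "credit_rating"], "buys_computer")
```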
Entropy of a Simple Event
- The expected information of an event s (the average amount of information needed to predict it) is
  - I(s) = -log2 ps = log2(1/ps)
  - where ps is the probability of event s
- The value is high for less probable events and low for more probable ones
  - as ps goes to zero, event s becomes rarer and rarer, and its information approaches infinity
  - as ps goes to one, event s becomes more and more certain, and its information approaches zero
- Hence entropy is a measure of the randomness, disorder, or rareness of an event
18
Computational Formulas

[Plot: entropy -log2 p versus probability p; the curve passes through (1, 0), (0.5, 1), and (0.25, 2).]

log2 1 = 0, log2 2 = 1
log2 0.5 = log2 (1/2) = log2 2^(-1) = -log2 2 = -1
log2 0.25 = -2

Change of base: loga x = logb x / logb a for any bases a and b.
Choosing b = e makes it convenient to compute logarithms in any base:
log2 x = loge x / loge 2, or, dropping the e,
log2 x = ln x / ln 2
19
Entropy of a simple event
- Entropy can be interpreted as the number of bits needed to encode that event
- If p = 1/2,
  - -log2 1/2 = 1 bit is required to encode the event
  - 0 and 1 are equally probable
- If p = 1/4,
  - -log2 1/4 = 2 bits are required to encode the event
  - 00, 01, 10, 11 are equally likely; only one represents the specific event
- If p = 1/8,
  - -log2 1/8 = 3 bits are required to encode the event
  - 000, 001, 010, 011, 100, 101, 110, 111 are equally likely; only one represents the specific event
- If p = 1/3,
  - -log2 1/3 = -(loge 1/3)/loge 2 = 1.58 bits are required to encode the event
20
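The bit counts above follow directly from I(s) = -log2 ps; here is a minimal sketch using only Python's standard library.

```python
import math

def self_information(p):
    """I(s) = -log2(p): bits needed to encode an event of probability p."""
    return -math.log(p, 2)               # equivalently math.log(1 / p) / math.log(2)

for p in (1/2, 1/4, 1/8, 1/3):
    print(f"p = {p:.3f} -> {self_information(p):.2f} bits")
# 1.00, 2.00, 3.00 and 1.58 bits, matching the values above
```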
Composite events
- Consider the four events A, B, C, D
  - A: tossing a fair coin where PH = PT = 1/2
  - B: tossing an unfair coin where PH = 1/4, PT = 3/4
  - C: tossing an unfair coin where PH = 0.001, PT = 0.999
  - D: tossing an unfair coin where PH = 0.0, PT = 1.0
- Which of these events is most certain?
- How much information is needed to guess the result of each toss in A to D?
- What is the expected information to predict the outcome of events A, B, C, D respectively?
21
Entropy of composite events
- Average or expected information is highest when each outcome is equally likely
  - Event A
- The expected information required to guess the outcome falls as the probability of head becomes either 0 or 1
  - as PH goes to 0 and PT goes to 1 (moving to case D), the composite event of tossing the coin becomes more and more certain
  - so for case D no information is needed, as the answer is already known to be tail
- What is the expected information to predict the outcome when the probability of head is p in general?
  - ent(S) = p[-log2 p] + (1-p)[-log2(1-p)]
  - ent(S) = -p log2 p - (1-p) log2(1-p)
- This is the weighted average of the simple events' information, weighted by their probabilities
- The lower the entropy, the less information is needed to predict the outcome of the event
22
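A small sketch, assuming only Python's math module, that evaluates ent(S) = -p log2 p - (1-p) log2(1-p) for the four coin events A to D; the numeric values for B and C are computed here and are not stated on the original slide.

```python
import math

def binary_entropy(p):
    """ent(S) = -p*log2(p) - (1-p)*log2(1-p), taking 0*log2(0) as 0."""
    return sum(q * -math.log(q, 2) for q in (p, 1 - p) if q > 0)

for name, p_head in [("A", 0.5), ("B", 0.25), ("C", 0.001), ("D", 0.0)]:
    print(f"{name}: ent = {binary_entropy(p_head):.3f} bits")
# A: 1.000, B: 0.811, C: 0.011, D: 0.000
# The expected information falls as the toss becomes more and more certain.
```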
Examples
- When the event is certain: pH = 1, pT = 0 or pH = 0, pT = 1
  - ent(S) = -1*log2(1) - 0*log2 0 = -1*0 - 0*log2 0 = 0
  - Note that lim_{x -> 0+} x log2 x = 0
- For a fair coin, pH = 0.5, pT = 0.5
  - ent(S) = -(1/2)log2(1/2) - (1/2)log2(1/2) = -1/2*(-1) - 1/2*(-1) = 1
  - Ent(S) is 1 when p = 0.5 and 1-p = 0.5
- If head or tail probabilities are unequal
  - entropy is between 0 and 1
23
[Plot: entropy of tossing a coin, -p log2 p - (1-p) log2(1-p), versus the probability of head p; the curve is 0 at p = 0 and p = 1 and reaches its maximum of 1 at p = 0.5.]
24
In general
- Entropy as a measure of the (im)purity of a sample variable S is defined as
  - Ent(S) = Σ_{s∈S} ps (-log2 ps) = -Σ_{s∈S} ps log2 ps
  - s is a value of S, an element of the sample space
  - ps is the estimated or subjective probability of any s∈S
  - Note that Σ_s ps = 1
25
Information needed to classify an object
- Class entropy is computed in a similar manner
- Entropy is 0 if all members of S belong to the same class
  - no information is needed to classify an object
- The entropy of a partition is the weighted average of the entropies of all classes
- Total number of objects: S
- Suppose there are two classes C1 and C2 with cardinalities S1 and S2
  - I(S1,S2) = -(S1/S)*log2(S1/S) - (S2/S)*log2(S2/S)
- Or in general, with m classes C1, C2, …, Cm
  - I(S1,S2,…,Sm) = -(S1/S)*log2(S1/S) - (S2/S)*log2(S2/S) - … - (Sm/S)*log2(Sm/S)
  - Probability of an object belonging to class i: Pi = Si/S
26
Example cont.: At the root of the tree
- There are 14 cases, S = 14
  - 9 yes, denoted by Y: S1 = 9, py = 9/14
  - 5 no, denoted by N: S2 = 5, pn = 5/14
- Ys and Ns are almost equally likely
- How much information, on average, is needed to classify a person as buyer or not, without knowing any characteristics such as age, income, …?
  - I(S1,S2) = (S1/S)(-log2 S1/S) + (S2/S)(-log2 S2/S)
  - I(9,5) = (9/14)(-log2 9/14) + (5/14)(-log2 5/14)
    = (9/14)(0.637) + (5/14)(1.485) = 0.940 bits of information
  - close to 1
27
Expected information to classify with an attribute (1)
- An attribute A with n distinct values a1, a2, …, an partitions the dataset into n distinct parts
  - Si,j is the number of objects in partition i (i = 1..n) with class Cj (j = 1..m)
- The expected information to classify an object knowing the value of attribute A is the
  - weighted average of the entropies of all partitions
  - weighted by the frequency of each partition i
28
Expected information to classify with an attribute (2)
- Ent(A) = I(A) = Σ_{i=1..n} (ai/S) * I(S_i,1, …, S_i,m)
  = (a1/S)*I(S_1,1, …, S_1,m) + (a2/S)*I(S_2,1, …, S_2,m) + … + (an/S)*I(S_n,1, …, S_n,m)
- I(S_i,1, …, S_i,m) is the entropy of partition i:
  I(S_i,1, …, S_i,m) = Σ_{j=1..m} (Sij/ai)(-log2 Sij/ai)
  = -(S_i1/ai)*log2(S_i1/ai) - (S_i2/ai)*log2(S_i2/ai) - … - (S_im/ai)*log2(S_im/ai)
- ai = Σ_{j=1..m} Sij = number of objects in partition i
- Here Sij/ai is the probability of class j in partition i
29
Information Gain
- The gain in information from using the distinct values of attribute A is the reduction in entropy, i.e., in the information needed to classify an object
  - Gain(A) = I(S1, …, Sm) - I(A)
  - = average information without knowing A - average information knowing A
- E.g., knowing such characteristics as age interval or income interval:
  - how much does it help to classify a new object?
- Can information gain be negative?
- Is it always greater than or equal to zero?
30
Another Example
- Pick a student at random at BU
- What is the chance that she is staying in a dorm?
  - Initially we have no specific information about her
- If I ask for her initials
  - Does it help us in predicting the probability of her staying in a dorm?
  - No
- If I ask for her address and record the city
  - Does it help us in predicting the chance of her staying in a dorm?
  - Yes
31
Attribute selection at the root
- There are four attributes
  - age, income, student, credit rating
- Which of these provides the highest information in classifying a new customer, or equivalently,
- which of these results in the highest information gain?
32
Testing age at the root node (14 cases: 9Y, 5N)
- <=30: 5 cases, 2Y 3N, I(2,3) = 0.971, decision N
- 31..40: 4 cases, 4Y 0N, I(4,0) = 0, decision Y
- >40: 5 cases, 3Y 2N, I(3,2) = 0.971, decision Y

Accuracy: 10/14
Entropy(age) = (5/14)*I(2,3) + (4/14)*I(4,0) + (5/14)*I(3,2)
 = 5/14*(-3/5 log2 3/5 - 2/5 log2 2/5)
 + 4/14*(-4/4 log2 4/4 - 0/4 log2 0/4)
 + 5/14*(-3/5 log2 3/5 - 2/5 log2 2/5)
 = 0.694
Gain(age) = 0.940 - 0.694 = 0.246
33
Expected information for age <= 30
- If age is <= 30, the information needed to classify a new customer is
  - I(2,3) = 0.971 bits, since the training data tells us that, knowing age <= 30,
    - with probability 0.4 a customer buys
    - but with probability 0.6 she does not
  - I(2,3) = -(3/5)log2 3/5 - (2/5)log2 2/5
    = 0.6*0.737 + 0.4*1.322 = 0.971
- But what is the weight of the age range <= 30?
  - 5 out of 14 samples are in that range
  - (5/14)*I(2,3) is the weighted information needed to classify a customer as buyer or not
34
Information gain by age
- gain(age) = I(Y,N) - Entropy(age)
  = 0.940 - 0.694
  = 0.246 is the information gain,
- or the reduction of entropy in classifying a new object
- Knowing the age of the customer increases our ability to classify her as buyer or not, i.e.,
- it helps us predict her buying behavior
35
Attribute Selection by Information Gain Computation
- Class Y: buys_computer = “yes”; Class N: buys_computer = “no”
- I(Y, N) = I(9, 5) = 0.940
- Compute the entropy for age (using the training data of Table 7.1):

  age      Yi   Ni   I(Yi, Ni)
  <=30     2    3    0.971
  31…40    4    0    0
  >40      3    2    0.971

  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

- (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s
- Gain(age) = I(Y, N) - E(age) = 0.246
- Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
36
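The gain figures above can be reproduced with a short script over the 14-tuple training set; this is an illustrative sketch, not part of the original slides, and its output matches the slide values up to rounding.

```python
from collections import Counter
from math import log2

# Training set of Table 7.1: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),     ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),  (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),     (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    expected = 0.0
    for value in {r[col] for r in rows}:              # weighted entropy of each partition
        part = [r[-1] for r in rows if r[col] == value]
        expected += len(part) / len(rows) * entropy(part)
    return entropy([r[-1] for r in rows]) - expected

for name, col in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
    print(f"Gain({name}) = {info_gain(data, col):.3f}")
# Gain(age)=0.247, Gain(income)=0.029, Gain(student)=0.152, Gain(credit_rating)=0.048
```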
Testing income at the root node (14 cases: 9Y, 5N)
- low: 4 cases, 3Y 1N, I(3,1) = ?, decision Y
- medium: 6 cases, 4Y 2N, I(4,2) = ?, decision Y
- high: 4 cases, 2Y 2N, I(2,2) = 1, decision ?

Accuracy: 9/14
Entropy(income) = 4/14*(-3/4 log2 3/4 - 1/4 log2 1/4)
 + 6/14*(-4/6 log2 4/6 - 2/6 log2 2/6)
 + 4/14*(-2/4 log2 2/4 - 2/4 log2 2/4)
 = 0.911
Gain(income) = 0.940 - 0.911 = 0.029
37
Testing student at the root node (14 cases: 9Y, 5N)
- no: 7 cases, 3Y 4N, I(3,4) = ?, decision N
- yes: 7 cases, 6Y 1N, I(6,1) = ?, decision Y

Accuracy: 10/14
Entropy(student) = 7/14*(-3/7 log2 3/7 - 4/7 log2 4/7)
 + 7/14*(-6/7 log2 6/7 - 1/7 log2 1/7)
 = 0.789
Gain(student) = 0.940 - 0.789 = 0.151
38
Testing credit rating at the root node (14 cases: 9Y, 5N)
- fair: 8 cases, 6Y 2N, I(6,2) = ?, decision Y
- excellent: 6 cases, 3Y 3N, I(3,3) = 1, decision ?

Accuracy: 9/14
Entropy(credit_rating) = 8/14*(-6/8 log2 6/8 - 2/8 log2 2/8)
 + 6/14*(-3/6 log2 3/6 - 3/6 log2 3/6)
 = 0.892
Gain(credit_rating) = 0.940 - 0.892 = 0.048
39
Comparing gains
- Information gains for the attributes at the root node:
  - Gain(age) = 0.246
  - Gain(income) = 0.029
  - Gain(student) = 0.151
  - Gain(credit_rating) = 0.048
- Age provides the highest gain in information
- Age is chosen as the attribute at the root node
- Branch according to the distinct values of age
40
After Selecting age (first level of the tree)

age? (14 cases)
- <=30 (5 cases): continue
- 31..40 (4 cases): yes
- >40 (5 cases): continue
41
Attribute selection at the age <= 30 node (5 cases: 2Y, 3N)

income:        high -> 0Y 2N,  medium -> 1Y 1N,  low -> 1Y 0N
student:       no -> 0Y 3N,   yes -> 2Y 0N
credit rating: fair -> 1Y 2N,  excellent -> 1Y 1N

Information gain for student is the highest, as knowing whether the customer is a student provides perfect information to classify her buying behavior.
42
Attribute selection at the age > 40 node (5 cases: 3Y, 2N)

income:        high -> 0Y 0N,  medium -> 2Y 1N,  low -> 1Y 1N
student:       no -> 1Y 1N,   yes -> 2Y 1N
credit rating: fair -> 3Y 0N,  excellent -> 0Y 2N

Information gain for credit rating is the highest, as knowing whether the customer's credit rating is fair or excellent provides perfect information to classify her buying behavior.
43
Exercise
- Calculate all information gains at the second level of the tree, that is, after branching on the distinct values of age
44
Output: A Decision Tree for “buys_computer”

age? (14 cases)
- <=30 (5 cases): student?
  - no -> no (3N)
  - yes -> yes (2Y)
- 31..40 (4 cases): yes
- >40 (5 cases): credit rating?
  - excellent -> no (2N)
  - fair -> yes (3Y)

Accuracy: 14/14 on the training set
45
Advantages and Disadvantages of Decision Trees
- Advantages:
  - Easy to understand and map nicely to production rules
  - Suitable for categorical as well as numerical inputs
  - No statistical assumptions about the distribution of attributes
  - Generation, and application to classify unknown outputs, is very fast
- Disadvantages:
  - Output attributes must be categorical
  - Unstable: slight variations in the training data may result in different attribute selections and hence different trees
  - Numerical input attributes lead to complex trees, as attribute splits are usually binary
  - Not suitable for non-rectangular regions, such as regions separated by linear or nonlinear combinations of attributes
    - i.e., by lines (in 2 dimensions), planes (in 3 dimensions), or in general hyperplanes (in n dimensions)
46
A classification problem that decision trees are not suitable to classify

[Scatter plot of income versus age: the two classes X and O are separated by a diagonal line AA. Decision trees are not suitable for this problem.]
47
Other Attribute Selection Measures
- Gain ratio
- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
48
Gain Ratio
- Add another attribute, transaction ID (TID)
  - for each observation TID is different
- E(TID) = (1/14)*I(1,0) + (1/14)*I(1,0) + … + (1/14)*I(1,0) = 0
- gain(TID) = 0.940 - 0 = 0.940
  - the highest gain, so TID would be the test attribute
  - which makes no sense
- Use gain ratio rather than gain
- Split information: a measure of the information value of a split
  - without considering class information
  - only the number and size of child nodes
  - a kind of normalization for information gain
49
- Split information = Σ_i (-Si/S) log2(Si/S)
  - the information needed to assign an instance to one of the branches
- Gain ratio = gain(S) / split information(S)
- In the previous example:
- Split info and gain ratio for TID:
  - split info(TID) = [-(1/14) log2(1/14)]*14 = 3.807
  - gain ratio(TID) = (0.940 - 0.0)/3.807 = 0.247
- Split info for age: I(5,4,5) =
  - -(5/14) log2(5/14) - (4/14) log2(4/14) - (5/14) log2(5/14) = 1.577
  - gain ratio(age) = gain(age)/split info(age) = 0.246/1.577 = 0.156
50
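A small sketch reproducing the split-information and gain-ratio numbers above; the gain values themselves (0.940 for TID, 0.246 for age) are taken from the earlier slides.

```python
from math import log2

def split_info(partition_sizes):
    """Entropy of the partition sizes themselves, ignoring class information."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s)

gain_tid, gain_age = 0.940, 0.246
print(round(split_info([1] * 14), 3))                                    # 3.807 for TID
print(round(split_info([5, 4, 5]), 3))                                   # 1.577 for age
print("gain ratio(TID) =", round(gain_tid / split_info([1] * 14), 3))    # ~0.247
print("gain ratio(age) =", round(gain_age / split_info([5, 4, 5]), 3))   # ~0.156
```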
Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  - gini(T) = 1 - Σ_{j=1..n} pj²
  - where pj is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  - gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
51
Gini index (CART) Example
- Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”
  - gini(D) = 1 - (9/14)² - (5/14)² = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
  - gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
- but gini_{medium,high} is 0.30 and thus the best since it is the lowest
  - D1: {medium, high}, D2: {low}: gini index 0.300
  - D1: {low, high}, D2: {medium}: gini index 0.315
  - Highest gini is for D1: {low, medium}, D2: {high}
52
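A minimal sketch of the two Gini formulas, assuming class labels are given as Python lists; enumerating the candidate binary partitions of an attribute, as required for a full split search, is left out.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2"""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)"""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["yes"] * 9 + ["no"] * 5          # the buys_computer data: 9 yes, 5 no
print(round(gini(labels), 3))              # 0.459, as above
```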
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example:
  IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  IF age = “31…40” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
  IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
53
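Assuming the nested-dict tree representation used in the ID3 sketch earlier, the following illustrative helper walks each root-to-leaf path and prints one IF-THEN rule per path.

```python
def extract_rules(tree, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path of a nested-dict decision tree."""
    if not isinstance(tree, dict):                       # leaf: class prediction
        lhs = " AND ".join(f'{attr} = "{val}"' for attr, val in conditions) or "TRUE"
        print(f'IF {lhs} THEN buys_computer = "{tree}"')
        return
    (attribute, branches), = tree.items()                # single test attribute per node
    for value, subtree in branches.items():
        extract_rules(subtree, conditions + ((attribute, value),))

extract_rules({"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                       "31..40": "yes",
                       ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}}}})
```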
Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross validation, e.g., 10-fold cross validation
- Use all the data for training
  - but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle
  - halting growth of the tree when the encoding is minimized
54
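An illustrative sketch, assuming scikit-learn and synthetic stand-in data, of the first two approaches above (a 2/3 vs. 1/3 holdout split and 10-fold cross validation); it is not part of the original slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Approach 1: separate training (2/3) and testing (1/3) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
print("holdout accuracy:", round(clf.score(X_test, y_test), 3))

# Approach 2: 10-fold cross validation over all the data.
scores = cross_val_score(DecisionTreeClassifier(criterion="entropy"), X, y, cv=10)
print("10-fold CV accuracy:", round(scores.mean(), 3))
```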
Enhancements to basic decision tree induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
55
Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the “best pruned tree”
56
- If the error rate of a subtree is higher than the error obtained by replacing the subtree with its most frequent leaf or branch
  - prune the subtree
- How to estimate the prediction error?
  - Do not use the training samples
    - pruning always increases the error on the training sample
  - Estimate the error based on a test set
    - cost-complexity or reduced-error pruning
57
Example
- Labour negotiation data
  - dependent variable, or class to be predicted:
    - acceptability of contract: good or bad
  - independent variables:
    - duration
    - wage increase in 1st year: <2.5%, >2.5%
    - working hours per week: <36, >36
    - health plan contribution: none, half, full
- Figure 6.2 of WF shows a branch of a decision tree
58
wage increase 1st year?
- >2.5
- <2.5: working hours per week?
  - <=36: 1 bad, 1 good
  - >36: health plan contribution?
    - none: 4 bad, 2 good (a)
    - half: 1 bad, 1 good (b)
    - full: 4 bad, 2 good (c)
59
AnswerTree
- Variables
  - measurement levels
  - case weights
  - frequency variables
- Growing methods
- Stopping rules
- Tree parameters
  - costs, prior probabilities, scores and profits
- Gain summary
- Accuracy of tree
- Cost-complexity pruning
60
Variables
- Categorical variables
  - nominal or ordinal
- Continuous variables
- All growing methods accept all types of variables
  - QUEST requires that the target variable be nominal
- Target and predictor variables
  - target variable (dependent variable)
  - predictor (independent) variables
- Case weight and frequency variables
61
Case weight and frequency variables
- CASE WEIGHT VARIABLES
  - give unequal treatment to the cases
  - Ex: direct marketing
    - 10,000 households respond and 1,000,000 do not respond
    - use all responders but only 1% of the nonresponders (10,000)
    - case weight 1 for responders and case weight 100 for nonresponders
- FREQUENCY VARIABLES
  - count of a record representing more than one individual
62
Tree-Growing Methods
- CHAID
  - Chi-squared Automatic Interaction Detector
  - Kass (1980)
- Exhaustive CHAID
  - Biggs, de Ville, Suen (1991)
- CR&T
  - Classification and Regression Trees
  - Breiman, Friedman, Olshen, and Stone (1984)
- QUEST
  - Quick, Unbiased, Efficient Statistical Tree
  - Loh, Shih (1997)
63
CHAID
- evaluates all values of a potential predictor
- merges values that are judged to be statistically homogeneous
- target variable
  - continuous: F test
  - categorical: chi-square test
- not binary: can produce more than two categories at any particular level
  - wider trees than the binary methods
  - works for all types of variables
  - accepts case weights and frequency variables
  - treats missing values as a single valid category
64
Exhaustive CHAID
- CHAID may not find the optimal split for a variable, as it stops merging categories once
  - all remaining categories are statistically different
- Exhaustive CHAID continues merging until two super-categories are left
- It examines the series of merges for each predictor
  - the set of categories giving the strongest association with the target
  - computes an adjusted p value for that association
- best split for each predictor
  - choose the predictor based on p values
- otherwise identical to CHAID
  - longer to compute
  - safer to use
65
CART
- Binary tree-growing algorithm
  - may not present the data efficiently
- partitions the data into two subsets
  - the same predictor variable may be used several times at different levels
- misclassification costs
- prior probability distributions
- computation can take a long time with large data sets
- surrogate splitting for missing values
66
Steps in the CART analysis
- At the root node t = 1, search for a split s* giving the largest decrease in impurity
  - Δi(s*, 1) = max_{s∈S} Δi(s, 1)
- Split node 1 into t = 2 and t = 3 using s*
- Repeat the split-searching process in t = 2 and t = 3
- Continue until one of the stopping rules is met
67
Stopping rules
- all cases have identical values for all predictors
- the node becomes pure; all cases have the same value of the target variable
- the depth of the tree has reached its prespecified maximum value
- the number of cases in a node is less than a prespecified minimum parent node size
- a split at the node would produce a child node with fewer cases than a prespecified minimum child node size
- for CART only: the maximum decrease in impurity is less than a prespecified value
68
QUEST
- variable selection and split-point selection are done separately
- computationally more efficient than CART
69
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively faster learning speed (than other classification methods)
  - convertible to simple and easy to understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods
70
Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT’96, Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB’96, J. Shafer et al.)
  - constructs an attribute-list data structure
- PUBLIC (VLDB’98, Rastogi & Shim)
  - integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
71