Decision Tree Generation Algorithm: ID3


Data Mining: Classification
Classification
• What is Classification?
– Classifying tuples in a database
– In a training set E
• each tuple consists of the same set of attributes as the tuples in the large database W
• additionally, each tuple has a known class identity
– Derive the classification mechanism from the training set E, and then use this mechanism to classify general data (in W)
Learning Phase
• Learning
– Training data are analyzed by a classification algorithm
– In the example, the class label attribute is credit_rating
– The classifier is represented in the form of classification rules
Testing Phase
• Testing (Classification)
– Test data are used to estimate the accuracy of the classification rules
– If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples (see the sketch after this list)
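To make the two phases concrete, here is a minimal sketch in Python using scikit-learn's DecisionTreeClassifier; the attribute values, class labels, and the 0.8 acceptance threshold are illustrative assumptions, not part of the slides.

# Sketch of the learning and testing phases (assumes scikit-learn).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Training set E: tuples with known class identities (values are illustrative).
X = [[30, 0], [45, 1], [25, 0], [52, 1], [38, 1], [29, 0]]
y = ["fair", "excellent", "fair", "excellent", "excellent", "fair"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Learning phase: derive the classifier from the training data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Testing phase: estimate accuracy on held-out test data.
accuracy = clf.score(X_test, y_test)
print("estimated accuracy:", accuracy)
if accuracy >= 0.8:  # acceptance threshold is an assumption
    # Apply the classifier to new data tuples (in W).
    print(clf.predict([[40, 1]]))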
Classification by Decision Tree
A top-down decision tree generation algorithm: ID3 and its extended version C4.5 (Quinlan '93). Reference: J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
Decision Tree Generation
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes (see the sketch after this list)
• Attribute Selection
– Favoring the partitioning that makes the majority of examples belong to a single class
• Tree Pruning (Overfitting Problem)
– Aiming at removing tree branches that may lead to
errors when classifying test data
• Training data may contain noise, …
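A minimal sketch of this recursive generation, assuming examples are (attribute-dict, class-label) pairs and leaving the goodness function as a pluggable choose_attribute parameter:

# Sketch of top-down decision tree generation (recursive partitioning).
from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    # examples: list of (attribute_dict, class_label) pairs
    classes = [label for _, label in examples]
    # Stop splitting when the node is pure or no attributes remain.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]  # leaf: majority class
    attr = choose_attribute(examples, attributes)     # goodness function
    subtree = {}
    for value in {tup[attr] for tup, _ in examples}:  # k-ary split
        subset = [(t, c) for t, c in examples if t[attr] == value]
        remaining = [a for a in attributes if a != attr]
        subtree[value] = build_tree(subset, remaining, choose_attribute)
    return (attr, subtree)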
Another Example
ID   Eye    Hair   Height  Oriental
 1   Black  Black  Short   Yes
 2   Black  White  Tall    Yes
 3   Black  White  Short   Yes
 4   Black  Black  Tall    Yes
 5   Brown  Black  Tall    Yes
 6   Brown  White  Short   Yes
 7   Blue   Gold   Tall    No
 8   Blue   Gold   Short   No
 9   Blue   White  Tall    No
10   Blue   Black  Short   No
11   Brown  Gold   Short   No
• After the analysis, can you classify the
following patterns?
– (Black, Gold, Tall)
– (Blue, White, Short)
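One way to answer is to compute the information gain of each attribute over the table above; the following sketch uses the standard entropy-based gain (the table data are copied verbatim, everything else is illustrative):

import math
from collections import Counter

# The 11 training examples from the table: (Eye, Hair, Height) -> Oriental.
rows = [
    ("Black", "Black", "Short", "Yes"), ("Black", "White", "Tall", "Yes"),
    ("Black", "White", "Short", "Yes"), ("Black", "Black", "Tall", "Yes"),
    ("Brown", "Black", "Tall", "Yes"), ("Brown", "White", "Short", "Yes"),
    ("Blue", "Gold", "Tall", "No"), ("Blue", "Gold", "Short", "No"),
    ("Blue", "White", "Tall", "No"), ("Blue", "Black", "Short", "No"),
    ("Brown", "Gold", "Short", "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for v in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

for i, name in enumerate(["Eye", "Hair", "Height"]):
    print(name, round(gain(rows, i), 3))
# Eye yields the largest gain: Black -> all Yes, Blue -> all No, Brown mixed.
# So (Black, Gold, Tall) classifies as Yes and (Blue, White, Short) as No.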
• Example distributions
[Figure: distributions of the training examples over the attribute values — Eye ∈ {Black, Brown, Blue}, Hair ∈ {Black, White, Gold}, Height ∈ {Short, Tall} — with + marking Oriental = Yes, ─ marking No, and ? marking combinations unseen in the training set]
Decision Tree
Decision Tree Generation
• Attribute Selection (Split Criterion)
– Information Gain (ID3/C4.5/See5)
– Gini Index (CART/IBM Intelligent Miner)
– Inference Power
• These measures are also called goodness functions and are used to select the attribute to split on at a tree node during the tree generation phase
Decision Tree Generation
• Branching Scheme
– Determining the tree branch to which a sample
belongs
– Binary vs. K-ary Splitting
• When to stop the further splitting of a node
– Impurity Measure
• Labeling Rule
– A node is labeled as the class to which most samples at the node belong (see the sketch after this list)
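A one-line sketch of the labeling rule:

from collections import Counter

def label_node(classes_at_node):
    # Label the node with the class of the majority of its samples.
    return Counter(classes_at_node).most_common(1)[0][0]

print(label_node(["Yes", "Yes", "No"]))  # -> "Yes"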
Decision Tree Generation Algorithm: ID3
ID: Iterative Dichotomiser
• At each node, ID3 selects the attribute with the highest information gain, computed from the entropy of the class distribution (Equation 7.1 below)
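For reference, the entropy (Equation 7.1) and the information gain used by ID3 can be written as follows (standard definitions; the notation s, s_i, s_ij follows common data mining texts):

\[
I(s_1, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i
\tag{7.1}
\]
where s is the total number of samples, s_i the number of samples in class C_i, and p_i = s_i / s. For an attribute A with v distinct values,
\[
E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \dots, s_{mj}),
\qquad
Gain(A) = I(s_1, \dots, s_m) - E(A),
\]
and ID3 splits each node on the attribute with the largest Gain(A).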
Another Example
Gini Index
• If a data set T contains examples from n classes, the gini index, gini(T), is defined as
\[
gini(T) = 1 - \sum_{j=1}^{n} p_j^{2}
\]
where p_j is the relative frequency of class j in T.
• If a data set T of size N is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data, gini_split(T), is defined as
\[
gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)
\]
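A direct transcription of the two formulas above as a Python sketch (class labels may be any hashable values):

from collections import Counter

def gini(labels):
    # gini(T) = 1 - sum_j p_j^2
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(labels1, labels2):
    # Weighted gini of a binary split of T into T1 and T2.
    n = len(labels1) + len(labels2)
    return len(labels1) / n * gini(labels1) + len(labels2) / n * gini(labels2)

print(gini(["Yes"] * 6 + ["No"] * 5))       # gini of the 11-example table above
print(gini_split(["Yes"] * 6, ["No"] * 5))  # a perfect split gives 0.0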
Inference Power of an Attribute
• A feature that is useful in inferring the group identity of a data tuple is said to have good inference power for that group identity.
• In Table 1, given the attributes (features) “Gender”, “Beverage”, and “State”, try to find their inference power with respect to “Group id” (a sketch follows)
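Table 1 is not reproduced here, so the data below are purely hypothetical; the sketch scores each attribute by how often a tuple's group matches the majority group among tuples sharing its attribute value (information gain or the gini index would serve equally well):

from collections import Counter, defaultdict

def inference_power(rows, attr, target):
    # Fraction of tuples whose target value equals the majority target
    # value among tuples sharing the same attribute value.
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[target])
    hits = sum(Counter(v).most_common(1)[0][1] for v in groups.values())
    return hits / len(rows)

# Hypothetical stand-in for Table 1.
rows = [
    {"Gender": "M", "Beverage": "tea", "State": "CA", "Group id": 1},
    {"Gender": "F", "Beverage": "coffee", "State": "CA", "Group id": 2},
    {"Gender": "M", "Beverage": "tea", "State": "NY", "Group id": 1},
    {"Gender": "F", "Beverage": "tea", "State": "NY", "Group id": 2},
]
for a in ("Gender", "Beverage", "State"):
    print(a, inference_power(rows, a, "Group id"))
# In this made-up data, Gender perfectly determines Group id (score 1.0),
# so it has the best inference power of the three attributes.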
Generating Classification Rules