Decision Tree Classification - North Dakota State University


Basics of Decision Trees
• A flow-chart-like hierarchical tree structure
– Often restricted to a binary structure
• Root: represents the entire dataset
• A node without children is called a leaf node; otherwise it is called an internal node.
– Internal nodes: denote a test on an attribute
– Branch (split): represents an outcome of the test
• Univariate split (based on a single attribute)
• Multivariate split
– Leaf nodes: represent class labels or class distributions (see the node sketch after this list)
• Most decision tree generation consists of two phases
– Tree construction
– Tree pruning
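As a rough illustration of the structure just described, here is a minimal Python sketch of the two node kinds; the class and field names are assumptions of this transcript, not from the slides:

class Leaf:
    """Leaf node: holds a class label (or a class distribution)."""
    def __init__(self, label):
        self.label = label

class Internal:
    """Internal node: a univariate test on one splitting attribute; each branch
    (outcome of the test) leads to a child subtree."""
    def __init__(self, attribute, branches):
        self.attribute = attribute   # splitting attribute
        self.branches = branches     # maps an outcome of the test to a child node

# Example: a one-level tree that tests salary and routes to two leaves.
tree = Internal("salary", {"<= 50K": Leaf("C"), "> 50K": Leaf("B")})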
Example (training dataset)
Record ID   Salary   Age   Employment   Group
1           30K      30    Self         C
2           40K      35    Industry     C
3           70K      50    Academia     C
4           60K      45    Self         B
5           70K      30    Academia     B
6           60K      35    Industry     A
7           60K      35    Self         A
8           70K      30    Self         A
9           40K      45    Industry     C
• Three predictor attributes: salary, age, employment
• Class label attribute: group
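For the code sketches later in these notes, the same nine training records can be written as plain Python dicts; the field names and the numeric salary encoding (in thousands) are illustrative assumptions:

# Training dataset from the table above; "group" is the class label.
TRAINING_DATA = [
    {"id": 1, "salary": 30, "age": 30, "employment": "Self",     "group": "C"},
    {"id": 2, "salary": 40, "age": 35, "employment": "Industry", "group": "C"},
    {"id": 3, "salary": 70, "age": 50, "employment": "Academia", "group": "C"},
    {"id": 4, "salary": 60, "age": 45, "employment": "Self",     "group": "B"},
    {"id": 5, "salary": 70, "age": 30, "employment": "Academia", "group": "B"},
    {"id": 6, "salary": 60, "age": 35, "employment": "Industry", "group": "A"},
    {"id": 7, "salary": 60, "age": 35, "employment": "Self",     "group": "A"},
    {"id": 8, "salary": 70, "age": 30, "employment": "Self",     "group": "A"},
    {"id": 9, "salary": 40, "age": 45, "employment": "Industry", "group": "C"},
]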
Example (univariate split)
Salary
├── <= 50K: Group C
└── > 50K: Age
    ├── <= 40: Employment
    │   ├── Academia, Industry: Group B
    │   └── Self: Group A
    └── > 40: Group C
• Each internal node of a decision tree is labeled with a predictor attribute, called the splitting attribute.
• Each leaf node is labeled with a class label.
• Each edge originating from an internal node is labeled with a splitting predicate that involves only the node's splitting attribute.
• The combined information about splitting attributes and splitting predicates at a node is called the split criterion.
Building Decision Trees
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (continuous-valued attributes are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Tree-Building Algorithm
Input: dataset S, split-selection-method CL
Output: decision tree for S
Top-Down Decision Tree Induction Schema (Binary
Splits):
BuildTree(dataset S, split-selection-method CL)
(1) If (all points in S are in the same class) then return;
(2) Use CL to evaluate splits for each attribute;
(3) Use the best split found to partition S into S1 and S2;
(4) BuildTree(S1, CL);
(5) BuildTree(S2, CL)
• We can build the whole tree by calling:
BuildTree(dataset TrainingData, split-selection-method CL)
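A minimal Python rendering of this schema, assuming records are dicts with a "group" class label (as in the dataset sketch above) and that the split-selection method CL returns a (description, predicate) pair; these conventions are assumptions, not part of the original schema:

def build_tree(S, CL):
    """BuildTree(dataset S, split-selection-method CL) with binary splits."""
    labels = {rec["group"] for rec in S}
    # (1) If all points in S are in the same class, return a leaf.
    if len(labels) == 1:
        return {"leaf": labels.pop()}
    # (2)-(3) Use CL to evaluate splits and partition S into S1 and S2.
    description, predicate = CL(S)
    S1 = [rec for rec in S if predicate(rec)]
    S2 = [rec for rec in S if not predicate(rec)]
    if not S1 or not S2:  # degenerate split: fall back to a majority-class leaf
        majority = max(labels, key=lambda c: sum(r["group"] == c for r in S))
        return {"leaf": majority}
    # (4)-(5) Recurse on both partitions.
    return {"split": description, "predicate": predicate,
            "yes": build_tree(S1, CL), "no": build_tree(S2, CL)}

The whole tree is then built by calling build_tree(TRAINING_DATA, CL), mirroring the call above.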
Split Selection
• Information gain / Gain ratio (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed to be continuous-valued
– Assumes there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes
Information Gain (ID3)
• T – the training set; S – any set of cases
• freq(Cj, S) – the number of cases in S that belong to class Cj
• |S| – the number of cases in set S
• The information of set S is defined as

    info(S) = - sum_{j=1..k} ( freq(Cj, S) / |S| ) * log2( freq(Cj, S) / |S| )

• Consider a similar measurement after T has been partitioned in accordance with the n outcomes of a test X. The expected information requirement can be found as the weighted sum over the subsets:

    infoX(T) = sum_{i=1..n} ( |Ti| / |T| ) * info(Ti)

• The quantity

    gain(X) = info(T) - infoX(T)

measures the information that is gained by partitioning T in accordance with the test X. The gain criterion, then, selects a test to maximize this information gain.
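These formulas translate directly into code; a sketch under the same record conventions as before, where the outcome argument (mapping a record to one of the n outcomes of test X) is an illustrative assumption:

import math
from collections import Counter

def info(S):
    """info(S): information (entropy) of the class distribution in S."""
    counts = Counter(rec["group"] for rec in S)
    return -sum((c / len(S)) * math.log2(c / len(S)) for c in counts.values())

def info_x(T, outcome):
    """infoX(T): weighted information after partitioning T by the outcomes of test X."""
    partitions = {}
    for rec in T:
        partitions.setdefault(outcome(rec), []).append(rec)
    return sum(len(Ti) / len(T) * info(Ti) for Ti in partitions.values())

def gain(T, outcome):
    """gain(X) = info(T) - infoX(T)."""
    return info(T) - info_x(T, outcome)

# Example: information gain of the binary test "salary <= 50K" on the training data:
# gain(TRAINING_DATA, lambda rec: rec["salary"] <= 50)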
Gain Ratio (C4.5)
• The gain criterion has a serious deficiency – it has a strong bias in favor of tests with many outcomes.
• We have

    split info(X) = - sum_{i=1..n} ( |Ti| / |T| ) * log2( |Ti| / |T| )

This represents the potential information generated by dividing T into n subsets, whereas the information gain measures the information relevant to classification that arises from the same division.
• Then,

    gain ratio(X) = gain(X) / split info(X)

expresses the proportion of information generated by the split that is useful, i.e., that appears helpful for classification.
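Continuing the previous sketch (it reuses gain() from there), split info and gain ratio can be computed the same way:

import math
from collections import Counter

def split_info(T, outcome):
    """split info(X): potential information generated by dividing T into n subsets."""
    sizes = Counter(outcome(rec) for rec in T)
    return -sum((n / len(T)) * math.log2(n / len(T)) for n in sizes.values())

def gain_ratio(T, outcome):
    """gain ratio(X) = gain(X) / split info(X); gain() is from the previous sketch."""
    si = split_info(T, outcome)
    return gain(T, outcome) / si if si > 0 else 0.0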
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - sum_{j=1..n} pj^2

where pj is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2, with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split data is defined as

    ginisplit(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)

• The attribute that provides the smallest ginisplit(T) is chosen to split the node (this requires enumerating all possible split points for each attribute).
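A corresponding sketch for the Gini computations, again under the same record conventions; the enumeration of candidate split points is left out:

from collections import Counter

def gini(T):
    """gini(T) = 1 - sum_j pj^2, with pj the relative frequency of class j in T."""
    if not T:
        return 0.0
    counts = Counter(rec["group"] for rec in T)
    return 1.0 - sum((c / len(T)) ** 2 for c in counts.values())

def gini_split(T, predicate):
    """ginisplit(T) = (N1/N)*gini(T1) + (N2/N)*gini(T2) for a binary split."""
    T1 = [rec for rec in T if predicate(rec)]
    T2 = [rec for rec in T if not predicate(rec)]
    N = len(T)
    return len(T1) / N * gini(T1) + len(T2) / N * gini(T2)

# The split with the smallest gini_split value is chosen, e.g. after enumerating
# candidate thresholds for each attribute:
# gini_split(TRAINING_DATA, lambda rec: rec["age"] <= 40)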
Pruning Decision Trees
• Why prune?
– Overfitting
• Too many branches, some of which may reflect anomalies due to noise or outliers
• The result is poor accuracy on unseen samples
• Two approaches to pruning
– Prepruning: halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold
• It is difficult to choose an appropriate threshold
– Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees (a sketch follows this list)
• Use a set of data different from the training data to decide on the "best pruned tree"
• Well-known pruning methods:
– Reduced Error Pruning
– Pessimistic Error Pruning
– Minimum Error Pruning
– Critical Value Pruning
– Cost-Complexity Pruning
– Error-Based Pruning
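As one possible postpruning illustration, a rough sketch of reduced error pruning over the dict-based tree from the BuildTree sketch; the helper functions and the bottom-up strategy shown here are assumptions, not the exact published method:

from collections import Counter

def predict(tree, rec):
    """Classify one record by walking the dict-based tree from the BuildTree sketch."""
    while "leaf" not in tree:
        tree = tree["yes"] if tree["predicate"](rec) else tree["no"]
    return tree["leaf"]

def errors(tree, data):
    """Number of records in data that the tree misclassifies."""
    return sum(predict(tree, rec) != rec["group"] for rec in data)

def reduced_error_prune(tree, train, validation):
    """Bottom-up: replace a subtree with a majority-class leaf whenever that does
    not increase the error on the validation records that reach it."""
    if "leaf" in tree or not train or not validation:
        return tree
    t_yes = [r for r in train if tree["predicate"](r)]
    t_no = [r for r in train if not tree["predicate"](r)]
    v_yes = [r for r in validation if tree["predicate"](r)]
    v_no = [r for r in validation if not tree["predicate"](r)]
    tree["yes"] = reduced_error_prune(tree["yes"], t_yes, v_yes)
    tree["no"] = reduced_error_prune(tree["no"], t_no, v_no)
    leaf = {"leaf": Counter(r["group"] for r in train).most_common(1)[0][0]}
    return leaf if errors(leaf, validation) <= errors(tree, validation) else tree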
Extracting Rules from Decision Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
– IF salary = ">50K" AND age = ">40" THEN group = "C"
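A small sketch of this rule extraction over the dict-based tree from the BuildTree sketch (the output format is illustrative):

def extract_rules(tree, conditions=()):
    """One IF-THEN rule per root-to-leaf path: the tests along a path form the
    conjunction, and the leaf supplies the predicted class."""
    if "leaf" in tree:
        antecedent = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {antecedent} THEN group = {tree['leaf']}"]
    return (extract_rules(tree["yes"], conditions + (tree["split"],)) +
            extract_rules(tree["no"], conditions + (f"NOT ({tree['split']})",)))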
Why Decision Tree Induction
• Compared to a neural network or a Bayesian classifier, a decision tree is easily interpreted/comprehended by humans
• While training neural networks can take large amounts of time and thousands of iterations, inducing decision trees is efficient and is thus suitable for large training sets
• Decision tree generation algorithms do not require additional information besides that already contained in the training data
• Decision trees display good classification accuracy compared to other techniques
• Decision tree induction can use SQL queries for accessing databases
Tree Quality Measures
• Accuracy
• Complexity
– Tree size
– Number of leaf nodes
• Computational speed
• Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Some Published Methods (can Ptrees improve
speed/accuracy of these?)
• SLIQ (EDBT'96 – Mehta et al.)
– builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB'96 – J. Shafer et al.)
– constructs an attribute-list data structure
• PUBLIC (VLDB'98 – Rastogi & Shim)
– integrates tree splitting and tree pruning: stops growing the tree earlier
• RainForest (VLDB'98 – Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that determine the quality of the tree
– builds an AVC-list (attribute, value, class label)
• BOAT
• CLOUDS
Other Issues
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values (a small sketch follows this list)
– Assign the most common value of the attribute
– Assign a probability to each of the possible values
• Attribute (feature) construction
– Create new attributes based on existing ones that are sparsely represented
– This reduces fragmentation, repetition, and replication
• Incremental tree induction
• Integration of data warehousing techniques
• Different data access methods
• Bias in split selection
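A minimal sketch of the first missing-value strategy listed above (assigning the most common value); the function name and record format are illustrative:

from collections import Counter

def fill_missing(records, attribute):
    """Replace missing values of one attribute with its most common observed value
    (the alternative strategy would weight each possible value by its probability)."""
    observed = [r[attribute] for r in records if r.get(attribute) is not None]
    if not observed:
        return records
    most_common = Counter(observed).most_common(1)[0][0]
    for r in records:
        if r.get(attribute) is None:
            r[attribute] = most_common
    return records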
Decision Tree Induction Using P-trees
Basic Ideas
– Calculate information gain, gain ratio or gini index by
using the count information recorded in P-trees.
– P-tree generation replaces sub-sample set creation.
– Use P-trees to determine whether all the samples are in the same class.
– No additional database scans are required.
Using P-trees to Calculate
Information Gain/Gain Ratio
C – class label attribute
Ps – P-tree of set S
freq(Cj, S) = rc{Ps ^ Pc(Vcj)}
|S| = rc{Ps}
|Ti| = rc{PT ^ P(VXi)}
|T| = rc{PT}
So every formula for Information Gain and Gain Ratio can be calculated directly from P-tree root counts.
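To illustrate the idea only, the root-count operations can be mimicked with plain integer bitmaps, where rc{...} becomes a popcount and ^ a bitwise AND; this stands in for, and is not, the actual bSQ/P-tree structure:

def rc(bitmap):
    """rc{P}: root count, i.e. the number of 1-bits; bitmaps stand in for P-trees here."""
    return bin(bitmap).count("1")

def freq(P_s, P_c):
    """freq(Cj, S) = rc{Ps ^ Pc(Vcj)}: cases that are both in S and in class Cj."""
    return rc(P_s & P_c)

# Example with the 9-record training set (bit i-1 represents record i):
P_T = 0b111111111      # mask of the whole training set T
P_C = 0b100000111      # mask of class C (records 1, 2, 3 and 9)
print(rc(P_T))         # |T| = 9
print(freq(P_T, P_C))  # freq(C, T) = 4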
P-Classifier versus ID3
[Chart: classification cost with respect to the dataset size]