DM – classification algos


Data Mining Algorithms
Classification
Classification Outline
Goal: Provide an overview of the classification
problem and introduce some of the basic
algorithms
Classification Problem Overview
Classification Techniques
– Regression
– Distance
– Decision Trees
– Rules
– Neural Networks
Classification Problem
Given a database D = {t1, t2, …, tn} and a set
of classes C = {C1, …, Cm}, the Classification
Problem is to define a mapping f: D → C
where each ti is assigned to one class.
Actually divides D into equivalence
classes.
Prediction is similar, but may be viewed as
having an infinite number of classes.
Classification vs. Prediction
Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
Prediction:
– models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications:
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification Examples
Teachers classify students’ grades as A,
B, C, D, or F.
Predict when a disaster will strike
Identify individuals with credit risks.
Speech recognition
Pattern recognition
Classification Ex: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
The same rules drawn as a decision tree: the root tests x >= 90 (grade A);
otherwise x >= 80 (B); otherwise x >= 70 (C); otherwise x >= 60 (D); otherwise F.
Classification Ex: Letter
Recognition
View letters as constructed from 5 components:
Letters A, B, C, D, E, and F.
Classification Techniques
Approach:
1. Create specific model by evaluating
training data (or using domain experts’
knowledge).
2. Apply model developed to new data.
Classes must be predefined
Most common techniques use Decision
Trees, Neural Networks, or are based on
distances or statistical methods.
Classification—A 2 Step Process
Model construction: describing a set of
predetermined classes
– Each tuple/sample is assumed to belong to a
predefined class, as determined by the class
label attribute
– The set of tuples used for model construction:
training set
– The model is represented as classification rules,
decision trees, or mathematical formulae
Classification—A 2 Step Process
Model usage: for classifying future or
unknown objects
– Estimate accuracy of the model
The known label of test sample is compared
with the classified result from the model
Accuracy rate is the percentage of test set
samples that are correctly classified by the
model
Test set is independent of training set,
otherwise over-fitting will occur
Classification Process (1): Model
Construction
Training Data:

NAME   RANK            YEARS   REGULAR
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classification Algorithms produce the Classifier (Model):

IF rank = ‘professor’ OR years > 6 THEN REGULAR = ‘yes’
Classification Process (2): Use the
Model in Prediction
The classifier is applied to the Testing Data and to unseen data:

NAME     RANK            YEARS   REGULAR
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → REGULAR? YES
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
Unsupervised learning (clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues in Classification
Missing Data
– Ignore
– Replace with assumed value
Measuring Performance
– Classification accuracy on test data
– Confusion matrix
– OC Curve
Height Example Data
Name       Gender   Height   Output1   Output2
Kristina   F        1.6m     Short     Medium
Jim        M        2m       Tall      Medium
Maggie     F        1.9m     Medium    Tall
Martha     F        1.88m    Medium    Tall
Stephanie  F        1.7m     Short     Medium
Bob        M        1.85m    Medium    Medium
Kathy      F        1.6m     Short     Medium
Dave       M        1.7m     Short     Medium
Worth      M        2.2m     Tall      Tall
Steven     M        2.1m     Tall      Tall
Debbie     F        1.8m     Medium    Medium
Todd       M        1.95m    Medium    Medium
Kim        F        1.9m     Medium    Tall
Amy        F        1.8m     Medium    Medium
Wynette    F        1.75m    Medium    Medium
Classification Performance
True Positive
False Positive
False Negative
True Negative
Confusion Matrix Example
Using the height data example with Output1 as the correct classification
and Output2 as the actual assignment:

Actual                 Assignment
Membership      Short   Medium   Tall
Short           0       4        0
Medium          0       5        3
Tall            0       1        2
Classifier Accuracy Measures
                  Predicted C1      Predicted C2
Actual C1         True positive     False negative
Actual C2         False positive    True negative

classes              buy_computer = yes   buy_computer = no   total   recognition(%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.42
Accuracy of a classifier M, acc(M): percentage of test set tuples that are
correctly classified by the model M
– Error rate (misclassification rate) of M = 1 – acc(M)
– Given m classes, CMi,j, an entry in a confusion matrix, indicates # of
tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis)
sensitivity = t-pos/pos
/* true positive recognition rate */
specificity = t-neg/neg
/* true negative recognition rate */
precision = t-pos/(t-pos + f-pos)
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
– This model can also be used for cost-benefit analysis
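These measures follow directly from the confusion-matrix counts. A rough Python sketch, using the buy_computer counts from the table above (variable names are only illustrative):

```python
# Confusion-matrix counts from the buy_computer example above
tp, fn = 6954, 46      # actual yes: predicted yes / predicted no
fp, tn = 412, 2588     # actual no:  predicted yes / predicted no
pos, neg = tp + fn, fp + tn

accuracy    = (tp + tn) / (pos + neg)       # 0.9542
error_rate  = 1 - accuracy
sensitivity = tp / pos                      # true positive recognition rate (~0.9934)
specificity = tn / neg                      # true negative recognition rate (~0.8627)
precision   = tp / (tp + fp)

# accuracy expressed as a weighted combination of sensitivity and specificity
accuracy_check = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(accuracy, sensitivity, specificity, precision, accuracy_check)
```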
Evaluating the Accuracy of a Classifier
or Predictor (I)
Holdout method
– Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the accuracies
obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets,
each of approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized data
– Stratified cross-validation: folds are stratified so that class dist. in
each fold is approx. the same as that in the initial data
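A minimal sketch of the k-fold idea; train_fn and predict_fn are placeholders for whatever classifier is being evaluated (libraries such as scikit-learn provide ready-made versions of this):

```python
import random

def k_fold_cross_validation(data, labels, k, train_fn, predict_fn):
    """Estimate accuracy by k-fold cross-validation.

    data/labels are parallel lists; train_fn builds a model from a list of
    (x, label) pairs and predict_fn(model, x) returns a predicted label.
    """
    indices = list(range(len(data)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k mutually exclusive subsets

    accuracies = []
    for i in range(k):
        test_idx = set(folds[i])
        train = [(data[j], labels[j]) for j in indices if j not in test_idx]
        model = train_fn(train)
        correct = sum(predict_fn(model, data[j]) == labels[j] for j in folds[i])
        accuracies.append(correct / len(folds[i]))
    return sum(accuracies) / k                  # average accuracy over the folds
```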
Metrics for Performance
Evaluation
Focus on the predictive capability of a model
– Rather than how fast it takes to classify or build
models, scalability, etc.
Confusion Matrix:

                            PREDICTED CLASS
                            Class=Yes   Class=No
ACTUAL CLASS    Class=Yes   a (TP)      b (FN)
                Class=No    c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance
Evaluation…
                            PREDICTED CLASS
                            Class=Yes   Class=No
ACTUAL CLASS    Class=Yes   a (TP)      b (FN)
                Class=No    c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If model predicts everything to be class 0,
accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does
not detect any class 1 example
Cost Matrix
                            PREDICTED CLASS
C(i|j)                      Class=Yes    Class=No
ACTUAL CLASS    Class=Yes   C(Yes|Yes)   C(No|Yes)
                Class=No    C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i
Computing Cost of Classification
Cost Matrix:

                            PREDICTED CLASS
C(i|j)                      +       -
ACTUAL CLASS    +           -1      100
                -           1       0

Model M1:

                            PREDICTED CLASS
                            +       -
ACTUAL CLASS    +           150     40
                -           60      250

Accuracy = 80%
Cost = 3910

Model M2:

                            PREDICTED CLASS
                            +       -
ACTUAL CLASS    +           250     45
                -           5       200

Accuracy = 90%
Cost = 4255
Cost vs Accuracy
Count:

                            PREDICTED CLASS
                            Class=Yes   Class=No
ACTUAL CLASS    Class=Yes   a           b
                Class=No    c           d

N = a + b + c + d
Accuracy = (a + d) / N

Cost:

                            PREDICTED CLASS
                            Class=Yes   Class=No
ACTUAL CLASS    Class=Yes   p           q
                Class=No    q           p

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N - a - d)
     = q N - (q - p)(a + d)
     = N [q - (q - p) × Accuracy]
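As a small sketch, the cost of a model is just the cell-wise product of the confusion counts and the cost matrix; the numbers below are the model M1 counts and C(i|j) values from the slide above:

```python
# Cost matrix C[(actual, predicted)] and confusion counts for model M1
cost = {('+', '+'): -1, ('+', '-'): 100,
        ('-', '+'): 1,  ('-', '-'): 0}
counts = {('+', '+'): 150, ('+', '-'): 40,
          ('-', '+'): 60,  ('-', '-'): 250}

total_cost = sum(counts[k] * cost[k] for k in counts)                        # 3910
accuracy = (counts[('+', '+')] + counts[('-', '-')]) / sum(counts.values())  # 0.80
print(total_cost, accuracy)
```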
Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
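A sketch of these cost-sensitive measures computed from the a, b, c, d cells of a 2-class confusion matrix; the example counts reuse model M1 from above, and the weights w1..w4 default to 1:

```python
def cost_sensitive_measures(a, b, c, d, w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    """a=TP, b=FN, c=FP, d=TN; w1..w4 weight the four confusion-matrix cells."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)   # = 2a / (2a + b + c)
    weighted_acc = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)
    return precision, recall, f_measure, weighted_acc

print(cost_sensitive_measures(a=150, b=40, c=60, d=250))
```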
Statistical-Based Algorithms: Regression
Assume data fits a predefined function
Determine best values for regression
coefficients c0, c1, …, cn.
Assume an estimate: y = c0 + c1x1 + … + cnxn + e
Estimate error using the mean squared error (MSE) over the
training set: MSE = (1/n) Σ (yi - ŷi)², where ŷi is the predicted value.
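As a rough sketch of estimating such coefficients by ordinary least squares (the data here is made up purely for illustration):

```python
import numpy as np

# Illustrative 1-D training data: y is roughly linear in x with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Design matrix with a column of ones for the intercept c0
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of (c0, c1)
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coeffs

mse = np.mean((y - y_hat) ** 2)    # mean squared error on the training set
print(coeffs, mse)
```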
Linear Regression Poor Fit
Classification Using
Regression
Division: Use regression function to
divide area into regions.
Prediction: Use regression function to
predict a class membership function.
Input includes desired class.
Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Bayesian Theorem
Given training data D, the posterior probability of a
hypothesis h, P(h|D), follows Bayes' theorem:
P(h|D) = P(D|h) · P(h) / P(D)
MAP (maximum a posteriori) hypothesis:
h_MAP = arg max (h in H) P(h|D) = arg max (h in H) P(D|h) · P(h)
Practical difficulty: require initial knowledge of
many probabilities, significant computational
cost
Bayesian classification
The classification problem may be
formalized using a-posteriori probabilities:
P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
E.g. P(class=N |
outlook=sunny,windy=true,…)
Idea: assign to sample X the class label C
such that P(C|X) is maximal
Estimating a-posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
P(X) is constant for all classes
P(C) = relative freq of class C samples
C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
Problem: computing P(X|C) is infeasible!
Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in
class C
If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density
function
Computationally easy in both cases
Play-tennis example: estimating P(xi|C)
Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
rain       mild          high       false   P
rain       cool          normal     false   P
rain       cool          normal     true    N
overcast   cool          normal     true    P
sunny      mild          high       false   N
sunny      cool          normal     false   P
rain       mild          normal     false   P
sunny      mild          normal     true    P
overcast   mild          high       true    P
overcast   hot           normal     false   P
rain       mild          high       true    N

P(p) = 9/14
P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 2/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high,
false>
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
= 3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
= 2/5·2/5·4/5·2/5·5/14 = 0.018286
Sample X is classified in class n (don’t
play)
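A compact sketch of this naive Bayes computation, with the priors and conditional probabilities taken from the play-tennis tables above:

```python
# Class priors and conditional probabilities from the play-tennis tables
prior = {'p': 9/14, 'n': 5/14}
cond = {
    'p': {'outlook': {'sunny': 2/9, 'overcast': 4/9, 'rain': 3/9},
          'temperature': {'hot': 2/9, 'mild': 4/9, 'cool': 3/9},
          'humidity': {'high': 3/9, 'normal': 6/9},
          'windy': {'true': 3/9, 'false': 6/9}},
    'n': {'outlook': {'sunny': 3/5, 'overcast': 0.0, 'rain': 2/5},
          'temperature': {'hot': 2/5, 'mild': 2/5, 'cool': 1/5},
          'humidity': {'high': 4/5, 'normal': 2/5},
          'windy': {'true': 3/5, 'false': 2/5}},
}

def classify(sample):
    """Return the class maximizing P(X|C)·P(C) under the naive independence assumption."""
    scores = {}
    for c in prior:
        score = prior[c]
        for attribute, value in sample.items():
            score *= cond[c][attribute][value]
        scores[c] = score
    return max(scores, key=scores.get), scores

x = {'outlook': 'rain', 'temperature': 'hot', 'humidity': 'high', 'windy': 'false'}
print(classify(x))   # ('n', {'p': ~0.0106, 'n': ~0.0183}) -> don't play
```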
Overview of Naive Bayes
The goal of Naive Bayes is to work out whether a new
example is in a class given that it has a certain
combination of attribute values. We work out the
likelihood of the example being in each class given
the evidence (its attribute values), and take the highest
likelihood as the classification.
Bayes Rule (E: the evidence, i.e. the event that has occurred):
P[H|E] = P[E|H] · P[H] / P[E]
P[H] is called the prior probability (of the
hypothesis).
P[H|E] is called the posterior probability (of the
hypothesis given the evidence)
Overview of Naive Bayes
For each class k, work out:
P[Hk|E] = P[E|Hk] · P[Hk] / P[E]
• Our Hypotheses are:
• H1: ‘the example is in class A’
• H2: ‘the example is in class B’ etc.
• Our Evidence is the attribute values of a particular new example that is
presented:
• E1=x : ‘the example has value x for attribute A1’
• E2=y : ‘the example has value y for attribute A2’
• ...
• En=z : ‘the example has value z for attribute An’
• Note that, assuming the attributes are equally important and
independent, we estimate the joint probability of that
combination of attribute values as:
P[E|Hk] = P[E1=x|Hk] × P[E2=y|Hk] × … × P[En=z|Hk]
• The goal is then to find the hypothesis (i.e. the class k) for which the
value of P[Hk|E] is at a maximum.
Overview of Naive Bayes
• For categorical variables we use simple proportions.
P[Ei=x|Hk] = (no. of training examples in class k having value x for attribute Ai)
             / (number of training examples in class k)
• For continuous variables we assume a normal (Gaussian) distribution, and use
the mean (μ) and standard deviation (σ) to compute the conditional probabilities:
P[Ei=x|Hk] = (1 / (√(2π) σk)) · e^( -(x - μk)² / (2σk²) )
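A small sketch of this Gaussian conditional probability; μk and σk would be estimated from the class-k training examples, and the values below are only illustrative:

```python
import math

def gaussian_conditional(x, mu_k, sigma_k):
    """P[Ei = x | Hk] under a normal assumption for a continuous attribute."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma_k)
    exponent = -((x - mu_k) ** 2) / (2 * sigma_k ** 2)
    return coeff * math.exp(exponent)

# e.g. attribute value 1.8 for a class whose values have mean 1.75 and std 0.1
print(gaussian_conditional(1.8, mu_k=1.75, sigma_k=0.1))
```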
Worked Example 1
Take the following training data, from bank loan
applicants:
ApplicantID   City    Children   Income   Status
1             Delhi   Many       Medium   DEFAULTS
2             Delhi   Many       Low      DEFAULTS
3             Delhi   Few        Medium   PAYS
4             Delhi   Few        High     PAYS

• P[City=Delhi | Status = DEFAULTS] = 2/2 = 1
• P[City=Delhi | Status = PAYS] = 2/2 = 1
• P[Children=Many | Status = DEFAULTS] = 2/2 = 1
• P[Children=Few | Status = DEFAULTS] = 0/2 = 0
• etc.
Worked Example 1
Summarizing, we have the following probabilities:

Probability of...    ... given DEFAULTS    ... given PAYS
City=Delhi           2/2 = 1               2/2 = 1
Children=Few         0/2 = 0               2/2 = 1
Children=Many        2/2 = 1               0/2 = 0
Income=Low           1/2 = 0.5             0/2 = 0
Income=Medium        1/2 = 0.5             1/2 = 0.5
Income=High          0/2 = 0               1/2 = 0.5

and P[Status = DEFAULTS] = 2/4 = 0.5
    P[Status = PAYS] = 2/4 = 0.5

The probability of Income=Medium given the applicant DEFAULTs
= (the number of applicants with Income=Medium who DEFAULT)
  / (the number of applicants who DEFAULT)
= 1/2 = 0.5
Worked Example 1
Now, assume a new example is presented where
City=Delhi, Children=Many, and Income=Medium:
First, we estimate the likelihood that the example is a defaulter, given its
attribute values: P[H1|E] = P[E|H1]·P[H1] (denominator omitted)
P[Status = DEFAULTS | Delhi, Many, Medium]
= P[Delhi|DEFAULTS] x P[Many|DEFAULTS] x P[Medium|DEFAULTS] x P[DEFAULTS]
= 1 x 1 x 0.5 x 0.5 = 0.25
Then we estimate the likelihood that the example is a payer, given its
attributes: P[H2|E] = P[E|H2]·P[H2] (denominator omitted)
P[Status = PAYS | Delhi, Many, Medium]
= P[Delhi|PAYS] x P[Many|PAYS] x P[Medium|PAYS] x P[PAYS]
= 1 x 0 x 0.5 x 0.5 = 0
As the conditional likelihood of being a defaulter is higher (because
0.25 > 0), we conclude that the new example is a defaulter.
Worked Example 1
Now, assume a new example is presented where
City=Delhi, Children=Many, and Income=High:
First, we estimate the likelihood that the example is a defaulter, given
its attribute values:
P[Status = DEFAULTS | Delhi,Many,High] =
P[Delhi|DEFAULTS] x P[Many|DEFAULTS] x P[High|DEFAULTS] x
P[DEFAULTS] = 1 x 1 x 0 x 0.5
=0
Then we estimate the likelihood that the example is a payer, given its
attributes:
P[Status = PAYS | Delhi,Many,High] =
P[Delhi|PAYS] x P[Many|PAYS] x P[High|PAYS] x
P[PAYS] =
1 x 0 x 0.5 x 0.5 = 0
As the conditional likelihood of being a defaulter is the same as that for
being a payer, we can come to no conclusion for this example.
Worked Example 2
Take the following training data, for credit card authorizations:
TransactionID   Income      Credit      Decision
1               Very High   Excellent   AUTHORIZE
2               High        Good        AUTHORIZE
3               Medium      Excellent   AUTHORIZE
4               High        Good        AUTHORIZE
5               Very High   Good        AUTHORIZE
6               Medium      Excellent   AUTHORIZE
7               High        Bad         REQUEST ID
8               Medium      Bad         REQUEST ID
9               High        Bad         REJECT
10              Low         Bad         CALL POLICE
Assume we’d like to determine how to classify a new transaction,
with Income = Medium and Credit=Good.
Worked Example 2
Our conditional probabilities are:
Probability of...    ... given AUTHORIZE   ... given REQUEST ID   ... given REJECT   ... given CALL POLICE
Income=Very High     2/6                   0/2                    0/1                0/1
Income=High          2/6                   1/2                    1/1                0/1
Income=Medium        2/6                   1/2                    0/1                0/1
Income=Low           0/6                   0/2                    0/1                1/1
Credit=Excellent     3/6                   0/2                    0/1                0/1
Credit=Good          3/6                   0/2                    0/1                0/1
Credit=Bad           0/6                   2/2                    1/1                1/1

Our class probabilities are:
P[Decision = AUTHORIZE] = 6/10
P[Decision = REQUEST ID] = 2/10
P[Decision = REJECT] = 1/10
P[Decision = CALL POLICE] = 1/10
Worked Example 2
Our goal is now to work out, for each class, the conditional probability
of the new transaction (with Income=Medium & Credit=Good) being in
that class. The class with the highest probability is the classification we
choose.
Our conditional probabilities (again, ignoring Bayes’s
denominator) are:
P[Decision = AUTHORIZE | Income=Medium &
Credit=Good]
= P[Income=Medium|Decision=AUTHORIZE] x P[Credit=Good|Decision=AUTHORIZE] x
P[Decision=AUTHORIZE]
= 2/6 x 3/6 x 6/10 = 36/360 = 0.1
P[Decision = REQUEST ID | Income=Medium &
Credit=Good]
= P[Income=Medium|Decision=REQUEST ID] x P[Credit=Good|Decision=REQUEST ID] x
P[Decision=REQUEST ID]
= 1/2 x 0/2 x 2/10 = 0
Worked Example 2
P[Decision = REJECT | Income=Medium & Credit=Good]
= P[Income=Medium|Decision=REJECT] x
P[Credit=Good|Decision=REJECT] x P[Decision=REJECT]
= 0/1 x 0/1 x 1/10 = 0
P[Decision = CALL POLICE | Income=Medium &
Credit=Good]
= P[Income=Medium|Decision=CALL POLICE] x
P[Credit=Good|Decision=CALL POLICE] x
P[Decision=CALL POLICE]
= 0/1 x 0/1 x 1/10 = 0
The highest of these probabilities is the first, so we
conclude that the decision for our new transaction
should be AUTHORIZE.
Weaknesses
• Naive Bayes assumes that variables are equally
important and that they are independent which is often
not the case in practice.
• Naive Bayes is damaged by the inclusion of redundant
(strongly dependent) attributes. e.g. if people with high
income have expensive houses, then including both
income and house-price in the model would unfairly
multiply the effect of having low income.
• Sparse data: If some attribute values are not present in the
data, then a zero probability for P[E|H] might exist. This
would lead P[H|E] to be zero no matter how high P[E|H] is
for other attribute values. Small positive values which
estimate the so-called ‘prior probabilities’ are often used
to correct this.
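One common form of that correction is Laplace (add-k) smoothing; a minimal sketch, with illustrative counts:

```python
def smoothed_conditional(count_value_in_class, count_class, n_distinct_values, k=1):
    """Laplace-style estimate of P[Ei = x | Hk].

    Adds k to every value's count so that attribute values never seen in a
    class get a small positive probability instead of zero.
    """
    return (count_value_in_class + k) / (count_class + k * n_distinct_values)

# e.g. Income=High was never seen among the 2 DEFAULTS examples,
# and Income has 3 distinct values (Low, Medium, High):
print(smoothed_conditional(0, 2, 3))   # 0.2 instead of 0
```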
Classification Using Distance
Place items in the class to which they are
“closest”.
Must determine distance between an
item and a class.
Classes represented by:
– Centroid: central value.
– Medoid: representative point.
– Individual points
Algorithm: KNN
K Nearest Neighbor (KNN):
Training set includes classes.
Examine K items near item to be
classified.
New item placed in the class containing the
largest number of its K nearest items.
O(q) for each tuple to be classified.
(Here q is the size of the training set.)
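A minimal KNN sketch over numeric feature vectors; Euclidean distance is assumed and the tiny training set below is only illustrative:

```python
import math
from collections import Counter

def knn_classify(training, new_item, k=3):
    """training: list of (feature_vector, class_label); returns predicted label.

    Scans the whole training set, so classifying one tuple is O(q),
    where q is the training-set size.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbours = sorted(training, key=lambda t: distance(t[0], new_item))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Illustrative use with a few of the height examples (features = [height in metres])
train = [([1.6], 'Short'), ([2.0], 'Tall'), ([1.9], 'Medium'),
         ([1.88], 'Medium'), ([1.7], 'Short'), ([2.2], 'Tall')]
print(knn_classify(train, [1.85], k=3))   # 'Medium'
```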
Classification Using Decision
Trees
Partitioning based: Divide search space
into rectangular regions.
Tuple placed into class based on the
region within which it falls.
DT approaches differ in how the tree is
built: DT Induction
Internal nodes associated with attribute
and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART
Decision Tree
Given:
– D = {t1, …, tn} where ti=<ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C={C1, …., Cm}
Decision or Classification Tree is a tree
associated with D such that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate that can
be applied to the attribute at the parent
– Each leaf node is labeled with a class, Cj
Comparing DTs
Balanced
Deep
DT Issues
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data
Pruning
DECISION TREES
An internal node represents a test on an attribute.
A branch represents an outcome of the test, e.g.,
Color=red.
A leaf node represents a class label or class label
distribution.
At each node, one attribute is chosen to split
training examples into distinct classes as much
as possible
A new case is classified by following a matching
path to a leaf node.
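As a sketch, such a tree can be represented as nested mappings from attribute values to subtrees, and a new case classified by walking the matching path; the tree used below is the play-tennis tree shown later in this section:

```python
# A tree is either a class label (leaf) or an (attribute, {value: subtree}) pair.
tree = ('Outlook', {
    'sunny':    ('Humidity', {'high': 'N', 'normal': 'P'}),
    'overcast': 'P',
    'rain':     ('Windy', {'true': 'N', 'false': 'P'}),
})

def classify(tree, case):
    """Follow the path matching the case's attribute values down to a leaf."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[case[attribute]]
    return tree

print(classify(tree, {'Outlook': 'sunny', 'Humidity': 'high', 'Windy': 'false'}))  # N
```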
Training Set
Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
rain       mild          high       false   P
rain       cool          normal     false   P
rain       cool          normal     true    N
overcast   cool          normal     true    P
sunny      mild          high       false   N
sunny      cool          normal     false   P
rain       mild          normal     false   P
sunny      mild          normal     true    P
overcast   mild          high       true    P
overcast   hot           normal     false   P
rain       mild          high       true    N
Example
Outlook?
– sunny → Humidity? (high → N, normal → P)
– overcast → P
– rain → Windy? (true → N, false → P)
Building Decision Tree
Top-down tree construction
– At start, all training examples are at the root.
– Partition the examples recursively by choosing
one attribute each time.
Bottom-up tree pruning
– Remove subtrees or branches, in a bottom-up
manner, to improve the estimated accuracy on
new cases.
Use of decision tree: Classifying an unknown
sample
– Test the attribute values of the sample against the
decision tree
Training Dataset
age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair
>40      medium   no        excellent
Output: A Decision Tree for “buys_computer”

age?
– <=30 → student? (no → no, yes → yes)
– 31…40 → yes
– >40 → credit rating? (excellent → yes, fair → no)
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning
– majority voting is employed for classifying the leaf
– There are no samples left
Choosing the Splitting Attribute
At each node, available attributes are
evaluated on the basis of separating the
classes of the training examples. A
Goodness function is used for this purpose.
Typical goodness functions:
– information gain (ID3/C4.5)
– information gain ratio
– gini index
Which attribute to select?
A criterion for attribute selection
Which is the best attribute?
– The one which will result in the smallest
tree
– Heuristic: choose the attribute that
produces the “purest” nodes
Popular impurity criterion: information gain
– Information gain increases with the average
purity of the subsets that an attribute
produces
Strategy: choose attribute that results in
greatest information gain
Information Gain (ID3/C4.5)
Select the attribute with the highest information
gain
Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P
and n elements of class N
– The amount of information, needed to decide if an
arbitrary example in S belongs to P or N is defined as
I(p, n) = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))
Information Gain in Decision Tree
Induction
Assume that using attribute A a set S will be
partitioned into sets {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N,
the entropy, or the expected information needed to
classify objects in all subtrees Si is
E(A) = Σ (i = 1 to v) [ (pi + ni) / (p + n) ] · I(pi, ni)
The encoding information that would be gained
by branching on A:
Gain(A) = I(p, n) - E(A)
Example: attribute “Outlook”
“Outlook” = “Sunny”:
info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
“Outlook” = “Overcast”:
info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
(Note: 0 log(0) is not normally defined; it is taken to be 0 here.)
“Outlook” = “Rainy”:
info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
Expected information for the attribute:
info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Computing the information gain
Information gain: information before
splitting – information after splitting
gain(" Outlook" )  info([9,5] ) - info([2,3] , [4,0], [3,2])  0.940 - 0.693
 0.247 bits
Information gain for attributes from weather
data:
gain(" Outlook" )  0.247 bits gain(" Temperatur e" )  0.029 bits
gain(" Humidity" )  0.152 bits
gain(" Windy" )  0.048 bits
Continuing to split
gain(" Temperatur e" )  0.571 bits
gain(" Humidity" )  0.971 bits
gain(" Windy" )  0.020 bits
The final decision tree
Note: not all leaves need to be pure; sometimes
identical instances have different classes
Splitting stops when data can’t be split any further
Highly-branching attributes
Problematic: attributes with a large
number of values (extreme case: ID code)
Subsets are more likely to be pure if there
is a large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
Another problem: fragmentation
The gain ratio
Gain ratio: a modification of the information gain
that reduces its bias on high-branch attributes
Gain ratio takes number and size of branches
into account when choosing an attribute
– It corrects the information gain by taking the intrinsic
information of a split into account
– Also called split ratio
Intrinsic information: entropy of distribution of
instances into branches
– (i.e. how much info do we need to tell which branch
an instance belongs to)
Gain Ratio
Intrinsic information should be
– Large when data is evenly spread
– Small when all data belong to one branch
Gain ratio normalizes the info gain by this intrinsic information:
IntrinsicInfo(S, A) = - Σ (|Si| / |S|) log2 (|Si| / |S|)
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
Computing the gain ratio
Example: intrinsic information for ID code:
info([1,1,…,1]) = 14 × (-1/14 × log(1/14)) = 3.807 bits
Importance of attribute decreases as
intrinsic information gets larger
Example of gain ratio:
gain_ratio(“Attribute”) = gain(“Attribute”) / intrinsic_info(“Attribute”)
Example:
gain_ratio(“ID_code”) = 0.940 bits / 3.807 bits = 0.246
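A sketch of the gain-ratio computation (info() is the same entropy helper as in the information-gain sketch above):

```python
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(gain_value, branch_sizes):
    """Divide the information gain by the intrinsic information of the split."""
    intrinsic = info(branch_sizes)      # entropy of the branch-size distribution
    return gain_value / intrinsic

# ID code splits 14 examples into 14 branches of size 1
print(gain_ratio(0.940, [1] * 14))      # ~0.246

# Outlook splits them into branches of sizes 5, 4, 5
print(gain_ratio(0.247, [5, 4, 5]))     # ~0.157 with these rounded gains (table below shows 0.156)
```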
Gain ratios for weather data
Attribute      Info    Gain                    Split info               Gain ratio
Outlook        0.693   0.940 - 0.693 = 0.247   info([5,4,5]) = 1.577    0.247/1.577 = 0.156
Temperature    0.911   0.940 - 0.911 = 0.029   info([4,6,4]) = 1.362    0.029/1.362 = 0.021
Humidity       0.788   0.940 - 0.788 = 0.152   info([7,7]) = 1.000      0.152/1.000 = 0.152
Windy          0.892   0.940 - 0.892 = 0.048   info([8,6]) = 0.985      0.048/0.985 = 0.049
Classification Using Rules
Perform classification using If-Then rules
Classification Rule: r = <a,c>
Antecedent, Consequent
May be generated from other techniques
(DT, NN) or generated directly.
Algorithms: Gen, RX, 1R, PRISM
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
Generating Rules Example
Avoid Overfitting in Classification
The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– The result is poor accuracy for unseen samples
Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split
a node if this would result in the goodness measure
falling below a threshold
Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to
decide which is the “best pruned tree”
Approaches to Determine
the Final Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is
minimized
Enhancements to basic
decision tree induction
Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
Attribute construction
– Create new attributes based on existing ones that are
sparsely represented
– This reduces fragmentation, repetition, and replication
Decision Tree vs. Rules
Tree has implied
order in which
splitting is performed.
Tree created based
on looking at all
classes.
Rules have no
ordering of
predicates.
Only need to look at
one class to generate
its rules.