Classification
Spring 2003
Classification task
• Input: a training set of tuples, each labeled
with one class label
• Output: a model (classifier) that assigns a
class label to each tuple based on the other
attributes
• The model can be used to predict the class
of new tuples, for which the class label is
missing or unknown
What is Classification
• Data classification is a two-step process
– first step: a model is built describing a
predetermined set of data classes or concepts
– second step: the model is used for classification
• Each tuple is assumed to belong to a
predefined class, as determined by one of
the attributes, called the class label attribute
• Data tuples are also referred to as samples,
examples, or objects
Train and test
• The tuples (examples, samples) are divided
into training set + test set
• Classification model is built in two steps:
– training - build the model from the training set
– test - check the accuracy of the model using test
set
Train and test
• Kinds of models:
– if-then rules
– logical formulae
– decision trees
• Accuracy of models:
– the known class of test samples is matched
against the class predicted by the model
– accuracy rate = % of test set samples correctly
classified by the model
Training step

The classification algorithm takes the training data and produces a classifier (model):

Training data:
Age   Car Type   Risk
20    Combi      High
18    Sports     High
40    Sports     High
50    Family     Low
35    Minivan    Low
30    Combi      High
32    Family     Low
40    Combi      Low

Classifier (model):
if age < 31 or Car Type = Sports then Risk = High
Test step

Test data is fed to the classifier (model), and the predicted class is compared with the actual one:

Age   Car Type   Risk (actual)   Risk (predicted)
27    Sports     High            High
34    Family     Low             Low
66    Family     High            Low
44    Sports     High            High
Classification (prediction)

New data with an unknown class label is fed to the classifier (model), which predicts the Risk class:

Age   Car Type   Risk (predicted)
27    Sports     High
34    Minivan    Low
55    Family     Low
34    Sports     High
Classification vs.
Prediction
• There are two forms of data analysis that
can be used to extract models describing
data classes or to predict future data trends:
– classification: predicts categorical labels
– prediction: models continuous-valued functions
Comparing Classification
Methods (1)
• Predictive accuracy: this refers to the ability of the
model to correctly predict the class label of new or
previously unseen data
• Speed: this refers to the computation costs
involved in generating and using the model
• Robustness: this is the ability of the model to
make correct predictions given noisy data or data
with missing values
Comparing Classification
Methods (2)
• Scalability: this refers to the ability to construct
the model efficiently given large amounts of data
• Interpretability: this refers to the level of
understanding and insight that is provided by the
model
• Simplicity:
– decision tree size
– rule compactness
• Domain-dependent quality indicators
Problem formulation

Given records in the database with a class label, find a model for each class.

Age   Car Type   Risk
20    Combi      High
18    Sports     High
40    Sports     High
50    Family     Low
35    Minivan    Low
30    Combi      High
32    Family     Low
40    Combi      Low

Resulting model (decision tree):
Age < 31?
  Y -> Risk = High
  N -> Car Type is sports?
         Y -> Risk = High
         N -> Risk = Low
Classification techniques
• Decision Tree Classification
• Bayesian Classifiers
• Neural Networks
• Statistical Analysis
• Genetic Algorithms
• Rough Set Approach
• k-nearest neighbor classifiers
Classification by Decision
Tree Induction
• A decision tree is a tree structure, where
– each internal node denotes a test on an attribute,
– each branch represents the outcome of the test,
– leaf nodes represent classes or class distributions
Example:
Age < 31?
  Y -> Risk = High
  N -> Car Type is sports?
         Y -> Risk = High
         N -> Risk = Low
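As an aside, here is a minimal sketch (not from the slides) of inducing such a tree with scikit-learn on the Age / Car Type / Risk table above; pandas, the column names, and the one-hot encoding are my own assumptions, and the induced splits should be equivalent to the tree shown, though the exact thresholds may differ.

```python
# Sketch only: fit a decision tree to the toy Age / Car Type / Risk table.
# Assumes pandas and scikit-learn are installed; names are illustrative.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

train = pd.DataFrame({
    "Age":     [20, 18, 40, 50, 35, 30, 32, 40],
    "CarType": ["Combi", "Sports", "Sports", "Family",
                "Minivan", "Combi", "Family", "Combi"],
    "Risk":    ["High", "High", "High", "Low", "Low", "High", "Low", "Low"],
})

# One-hot encode the categorical attribute so the tree can test it.
X = pd.get_dummies(train[["Age", "CarType"]])
y = train["Risk"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```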
Decision Tree Induction (1)
• A decision tree is a class discriminator that
recursively partitions the training set until each
partition consists entirely or dominantly of
examples from one class.
• Each non-leaf node of the tree contains a split
point, which is a test on one or more attributes and
determines how the data is partitioned
Decision Tree Induction (2)
• Basic algorithm: a greedy algorithm that
constructs decision trees in a top-down recursive
divide-and-conquer manner.
• Many variants:
– from machine learning (ID3, C4.5)
– from statistics (CART)
– from pattern recognition (CHAID)
• Main difference: split criterion
Decision Tree Induction (3)
• The algorithm consists of two phases:
– build an initial tree from the training data such that each leaf node is pure
– prune this tree to increase its accuracy on test data
Tree Building
• In the growth phase the tree is built by recursively
partitioning the data until each partition is either
"pure" (contains members of the same class) or
sufficiently small.
• The form of the split used to partition the data depends on the type of the attribute used in the split:
– for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
– for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊆ domain(A)
Tree Building Algorithm
MakeTree(Training Data T)
{
  Partition(T);
}

Partition(Data S)
{
  if (all points in S are in the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition S into S1 and S2;
  Partition(S1);
  Partition(S2);
}
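A hedged Python rendering of the Partition skeleton above (a sketch, not the original code); best_split is a placeholder for the split-evaluation step and must return a True/False mask over S.

```python
# Sketch of the recursive Partition skeleton; `best_split` is a placeholder
# for the split-evaluation step (it must return a True/False mask over S).
def partition(S, labels, best_split, depth=0):
    classes = set(labels)
    if len(classes) <= 1:                       # all points in S are in the same class
        print("  " * depth + f"leaf: {classes}")
        return
    mask = best_split(S, labels)                # evaluate splits, keep the best one
    S1 = [s for s, m in zip(S, mask) if m]
    y1 = [y for y, m in zip(labels, mask) if m]
    S2 = [s for s, m in zip(S, mask) if not m]
    y2 = [y for y, m in zip(labels, mask) if not m]
    if not S1 or not S2:                        # degenerate split: stop growing
        print("  " * depth + f"leaf: {classes}")
        return
    print("  " * depth + "split")
    partition(S1, y1, best_split, depth + 1)
    partition(S2, y2, best_split, depth + 1)

# e.g. with a fixed "age < 31" split on the toy table from the earlier slides:
ages = [20, 18, 40, 50, 35, 30, 32, 40]
risk = ["High", "High", "High", "Low", "Low", "High", "Low", "Low"]
partition(ages, risk, lambda S, y: [a < 31 for a in S])
```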
Tree Building Algorithm
• While growing the tree, the goal at each node is to
determine the split point that "best" divides the
training records belonging to that node
• To evaluate the goodness of the split some
splitting indices have been proposed
Split Criteria
• Gini index (CART, SPRINT)
– select the attribute that minimizes the impurity of a split
• Information gain (ID3, C4.5)
– measure the impurity of a split using entropy
– select the attribute that maximizes the entropy reduction
• χ² contingency table statistic (CHAID)
– measures the correlation between each attribute and the class label
– select the attribute with maximal correlation
Gini index (1)

Consider a sample training set where each record represents a car-insurance applicant. We want to build a model of what makes an applicant a high or low insurance risk.

Training set:
RID   Age   Car Type   Risk
0     23    family     high
1     17    sport      high
2     43    sport      high
3     68    family     low
4     32    truck      low
5     20    family     high

The model (classifier) built can be used to screen future insurance applicants by classifying them into the High or Low risk categories.
Gini index (2)
SPRINT algorithm:

Partition(Data S)
{
  if (all points in S are of the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition S into S1 and S2;
  Partition(S1);
  Partition(S2);
}

Initial call: Partition(Training Data)
Gini index (3)
• Definition:
    gini(S) = 1 − Σ_j p_j²
where:
– S is a data set containing examples from n classes
– p_j is the relative frequency of class j in S
• E.g., for two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements:
    p_Pos = p/(p+n)
    p_Neg = n/(n+p)
    gini(S) = 1 − p_Pos² − p_Neg²
Gini index (4)
• If dataset S is split into S1 and S2, then the splitting index is defined as follows:
    gini_SPLIT(S) = (p1+n1)/(p+n) · gini(S1) + (p2+n2)/(p+n) · gini(S2)
where p1, n1 (respectively p2, n2) denote the number of Pos-elements and Neg-elements in the dataset S1 (respectively S2).
• With this definition, the "best" split point is the one with the lowest value of the gini_SPLIT index.
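The two formulas translate directly into a short Python sketch (helper names are my own, not from the slides):

```python
# gini(S) and gini_SPLIT(S) as defined above; helper names are my own.
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_j p_j^2, with p_j the relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(part1, part2):
    """Weighted Gini index of a binary split of S into part1 and part2."""
    n = len(part1) + len(part2)
    return len(part1) / n * gini(part1) + len(part2) / n * gini(part2)

# Two classes Pos/Neg with equal counts: maximal impurity for two classes is 0.5.
print(gini(["Pos", "Pos", "Neg", "Neg"]))          # 0.5
print(gini_split(["Pos", "Pos"], ["Neg", "Neg"]))  # 0.0 - a perfect split
```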
Example (1)
Training set:
RID   Age   Car Type   Risk
0     23    family     high
1     17    sport      high
2     43    sport      high
3     68    family     low
4     32    truck      low
5     20    family     high
Example (1)
Attribute list for ‘Age’:
Age   RID   Risk
17    1     high
20    5     high
23    0     high
32    4     low
43    2     high
68    3     low

Attribute list for ‘Car Type’:
Car Type   RID   Risk
family     0     high
sport      1     high
sport      2     high
family     3     low
truck      4     low
family     5     high
Example (2)

• Possible values of a split point for the Age attribute are:
  Age ≤ 17, Age ≤ 20, Age ≤ 23, Age ≤ 32, Age ≤ 43, Age ≤ 68

Tuple count:
            High   Low
Age ≤ 17    1      0
Age > 17    3      2

G(Age ≤ 17) = 1 − (1² + 0²) = 0
G(Age > 17) = 1 − ((3/5)² + (2/5)²) = 1 − 13/25 = 12/25
G_SPLIT = (1/6) · 0 + (5/6) · (12/25) = 2/5
Example (3)

Tuple count:
            High   Low
Age ≤ 20    2      0
Age > 20    2      2

G(Age ≤ 20) = 1 − (1² + 0²) = 0
G(Age > 20) = 1 − ((1/2)² + (1/2)²) = 1/2
G_SPLIT = (2/6) · 0 + (4/6) · (1/2) = 1/3

Tuple count:
            High   Low
Age ≤ 23    3      0
Age > 23    1      2

G(Age ≤ 23) = 1 − (1² + 0²) = 0
G(Age > 23) = 1 − ((1/3)² + (2/3)²) = 1 − (1/9) − (4/9) = 4/9
G_SPLIT = (3/6) · 0 + (3/6) · (4/9) = 2/9
Example (4)

Tuple count:
            High   Low
Age ≤ 32    3      1
Age > 32    1      1

G(Age ≤ 32) = 1 − ((3/4)² + (1/4)²) = 1 − (10/16) = 6/16 = 3/8
G(Age > 32) = 1 − ((1/2)² + (1/2)²) = 1/2
G_SPLIT = (4/6) · (3/8) + (2/6) · (1/2) = (1/4) + (1/6) = 5/12

The lowest value of G_SPLIT is for Age ≤ 23, thus we have a split point at Age = (23+32)/2 = 27.5.
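As a cross-check, a small sketch (repeating a gini helper so it runs on its own) that scans the candidate Age thresholds on the six-record training set and confirms that Age ≤ 23 gives the lowest G_SPLIT:

```python
# Scan the candidate Age split points on the six-record training set.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

ages  = [23, 17, 43, 68, 32, 20]
risks = ["high", "high", "high", "low", "low", "high"]

for t in sorted(set(ages))[:-1]:                  # Age <= 17, 20, 23, 32, 43
    left  = [r for a, r in zip(ages, risks) if a <= t]
    right = [r for a, r in zip(ages, risks) if a > t]
    g = len(left) / 6 * gini(left) + len(right) / 6 * gini(right)
    print(f"Age <= {t}: G_SPLIT = {g:.3f}")
# The minimum, 2/9 ~ 0.222, is at Age <= 23, matching the split point chosen above.
```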
Example (5)

Decision tree after the first split of the example set:
Age ≤ 27.5 -> Risk = High
Age > 27.5 -> Risk = Low
Example (6)

Attribute lists are divided at the split point.

Attribute lists for Age ≤ 27.5:
Age   RID   Risk        Car Type   RID   Risk
17    1     high        family     0     high
20    5     high        sport      1     high
23    0     high        family     5     high

Attribute lists for Age > 27.5:
Age   RID   Risk        Car Type   RID   Risk
32    4     low         sport      2     high
43    2     high        family     3     low
68    3     low         truck      4     low
Example (7)

Evaluating splits for categorical attributes: we have to evaluate the splitting index for each of the 2^N combinations, where N is the cardinality of the categorical attribute.

Tuple count (partition Age > 27.5):
                      High   Low
Car type = {sport}    1      0
Car type = {family}   0      1
Car type = {truck}    0      1

G(Car type ∈ {sport}) = 1 − 1² − 0² = 0
G(Car type ∈ {family}) = 1 − 0² − 1² = 0
G(Car type ∈ {truck}) = 1 − 0² − 1² = 0
Example (8)
G(Car type ∈ {sport, family}) = 1 − (1/2)² − (1/2)² = 1/2
G(Car type ∈ {sport, truck}) = 1/2
G(Car type ∈ {family, truck}) = 1 − 0² − 1² = 0

G_SPLIT(Car type ∈ {sport}) = (1/3) · 0 + (2/3) · 0 = 0
G_SPLIT(Car type ∈ {family}) = (1/3) · 0 + (2/3) · (1/2) = 1/3
G_SPLIT(Car type ∈ {truck}) = (1/3) · 0 + (2/3) · (1/2) = 1/3
G_SPLIT(Car type ∈ {sport, family}) = (2/3) · (1/2) + (1/3) · 0 = 1/3
G_SPLIT(Car type ∈ {sport, truck}) = (2/3) · (1/2) + (1/3) · 0 = 1/3
G_SPLIT(Car type ∈ {family, truck}) = (2/3) · 0 + (1/3) · 0 = 0
Example (9)

The lowest value of G_SPLIT is for Car type ∈ {sport}, thus this is our split point. Decision tree after the second split of the example set:

Age ≤ 27.5 -> Risk = High
Age > 27.5 -> Car type ∈ {sport}         -> Risk = High
              Car type ∈ {family, truck} -> Risk = Low
Information Gain (1)
• The information gain measure is used to select the
test attribute at each node in the tree
• The attribute with the highest information gain (or
greatest entropy reduction) is chosen as the test
attribute for the current node
• This attribute minimizes the information needed to
classify the samples in the resulting partitions
Information Gain (2)
• Let S be a set consisting of s data samples.
Suppose the class label attribute has m distinct
values defining m classes, Ci (for i=1, ..., m)
• Let si be the number of samples of S in class Ci
• The expected information needed to classify a
given sample is given by
I(s_1, s_2, ..., s_m) = − Σ_i p_i log2(p_i)
where p_i is the probability that an arbitrary sample
belongs to class C_i and is estimated by s_i/s.
Information Gain (3)
• Let attribute A have v distinct values, {a1, a2, ...,
av}. Attribute A can be used to partition S into {S1,
S2, ..., Sv}, where Sj contains those samples in S
that have value aj of A
• If A were selected as the test attribute, then these
subsets would correspond to the branches grown
from the node containing the set S
Information Gain (4)
• Let sij be the number of samples of the class Ci in
a subset Sj. The entropy, or expected information
based on the partitioning into subsets by A, is
given by:
E(A) = Σ_{j=1..v} (s_1j + s_2j + ... + s_mj)/s · I(s_1j, s_2j, ..., s_mj)
• The smaller the entropy value, the greater the
purity of the subset partitions.
Information Gain (5)
• The term (s1j + s2j +...+smj)/s acts as the weight of
the jth subset and is the number of samples in the
subset (i.e. having value aj of A) divided by the
total number of samples in S. Note that for a given
subset Sj,
I(s_1j, s_2j, ..., s_mj) = − Σ_i p_ij log2(p_ij)
where p_ij = s_ij/|S_j| is the probability that a
sample in S_j belongs to class C_i.
Information Gain (6)
The encoding information that would be gained by
branching on A is
Gain(A) = I(s1, s2, ..., sm) – E(A)
Gain(A) is the expected reduction in entropy
caused by knowing the value of attribute A
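The formulas I(s_1, ..., s_m), E(A) and Gain(A) translate into a short sketch (helper names are mine, not from the slides):

```python
# Entropy-based information gain, following I(...), E(A) and Gain(A) above.
import math
from collections import Counter

def info(labels):
    """I(s_1, ..., s_m) = -sum_i p_i log2(p_i) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(attribute_values, labels):
    """Gain(A) = I(S) - E(A), where E(A) is the weighted info of the partitions."""
    n = len(labels)
    e = 0.0
    for v in set(attribute_values):
        subset = [y for a, y in zip(attribute_values, labels) if a == v]
        e += len(subset) / n * info(subset)
    return info(labels) - e
```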
Example (1)
RID   Age      Income   student   credit_rating   buys_computer
1     <=30     high     no        fair            no
2     <=30     high     no        excellent       no
3     31..40   high     no        fair            yes
4     >40      medium   no        fair            yes
5     >40      low      yes       fair            yes
6     >40      low      yes       excellent       no
7     31..40   low      yes       excellent       yes
8     <=30     medium   no        fair            no
9     <=30     low      yes       fair            yes
10    >40      medium   yes       fair            yes
11    <=30     medium   yes       excellent       yes
12    31..40   medium   no        excellent       yes
13    31..40   high     yes       fair            yes
14    >40      medium   no        excellent       no
Example (2)
• Let us consider the training set of tuples above, taken from the customer database.
• The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m = 2).
  C1 corresponds to yes: s1 = 9
  C2 corresponds to no:  s2 = 5
  I(s1, s2) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Example (3)
• Next, we need to compute the entropy of each attribute. Let us start with the attribute age:
  for age = ‘<=30’:   s11 = 2, s21 = 3, I(s11, s21) = 0.971
  for age = ’31..40’: s12 = 4, s22 = 0, I(s12, s22) = 0
  for age = ‘>40’:    s13 = 3, s23 = 2, I(s13, s23) = 0.971
Example (4)
The entropy of age is:
E(age) = 5/14 · I(s11, s21) + 4/14 · I(s12, s22) + 5/14 · I(s13, s23) = 0.694

The gain in information from such a partitioning would be:
Gain(age) = I(s1, s2) − E(age) = 0.246
Example (5)
• We can similarly compute:
  Gain(income) = 0.029,
  Gain(student) = 0.151, and
  Gain(credit_rating) = 0.048
Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute’s values.
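These numbers can be re-checked with direct arithmetic (a sketch, not part of the slides):

```python
# Re-check I(9, 5), E(age) and Gain(age) from the buys_computer example.
import math

def I(*counts):
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

i_total = I(9, 5)                                          # ~0.940
e_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)   # ~0.694
print(f"I = {i_total:.3f}, E(age) = {e_age:.3f}, Gain(age) = {i_total - e_age:.3f}")
# Prints I = 0.940, E(age) = 0.694, Gain(age) = 0.247 (0.246 on the slide,
# which subtracts the already-rounded values).
```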
Example (6)

Tree after the first split on age:
age = ‘<=30’   -> buys_computer: yes, no (mixed; needs further splitting)
age = ’31..40’ -> buys_computer: yes
age = ‘>40’    -> buys_computer: yes, no (mixed; needs further splitting)
Example (7)

Final decision tree:
age = ‘<=30’   -> student?        no -> no,  yes -> yes
age = ’31..40’ -> yes
age = ‘>40’    -> credit_rating?  excellent -> no,  fair -> yes
Entropy vs. Gini index

• The Gini index tends to isolate the largest class from all other classes
• Entropy tends to find groups of classes that add up to 50% of the data

Class distribution: A 40, B 30, C 20, D 10

Gini-style split:     if age < 40   yes -> {A 40}          no -> {B 30, C 20, D 10}
Entropy-style split:  if age < 65   yes -> {A 40, D 10}    no -> {B 30, C 20}
Tree pruning
• When a decision tree is built, many of the branches
will reflect anomalies in the training data due to
noise or outliers.
• Tree pruning methods typically use statistical
measures to remove the least reliable branches,
generally resulting in faster classification and an
improvement in the ability of the tree to correctly
classify independent test data
Tree pruning
• Prepruning approach (stopping): a tree is
‘pruned’ by halting its construction early (i.e. by
deciding not to further split or partition the subset
of training samples). Upon halting, the node
becomes a leaf. The leaf holds the most frequent
class among the subset samples
• Postpruning approach (pruning): removes
branches from a ‘fully grown’ tree. A tree node is
pruned by removing its branches. The lowest
unpruned node becomes a leaf and is labeled by
the most frequent class among its former branches
Extracting Classification Rules
from Decision Trees
• The knowledge represented in decision trees can
be extracted and represented in the form of
classification IF-THEN rules.
• One rule is created for each path from the root to a
leaf node
• Each attribute-value pair along a given path forms
a conjunction in the rule antecedent; the leaf node
holds the class prediction, forming the rule
consequent
Extracting Classification Rules
from Decision Trees
• The decision tree of Example (7) can be converted to
classification rules:
IF age = ‘<=30’   AND student = ‘no’              THEN buys_computer = ‘no’
IF age = ‘<=30’   AND student = ‘yes’             THEN buys_computer = ‘yes’
IF age = ’31..40’                                 THEN buys_computer = ‘yes’
IF age = ‘>40’    AND credit_rating = ‘excellent’ THEN buys_computer = ‘no’
IF age = ‘>40’    AND credit_rating = ‘fair’      THEN buys_computer = ‘yes’
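Written out as executable code, the extracted rules amount to a small function (a sketch; the string values follow the encodings used in the training table):

```python
# The five extracted rules as a Python function (sketch only).
def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age == ">40"
    return "no" if credit_rating == "excellent" else "yes"

print(buys_computer("<=30", "yes", "fair"))        # yes
print(buys_computer(">40", "no", "excellent"))     # no
```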
Other Classification Methods
• There are a number of classification methods in the
literature:
– Bayesian classifiers
– Neural-network classifiers
– k-nearest neighbor classifiers
– Association-based classifiers
– Rough and fuzzy sets
Classification Based on Concepts
from Association Rule Mining
• We may apply the "quantitative rule mining"
approach to discover classification rules –
associative classification
• It mines rules of the form condset => y, where
condset is a set of items (or attribute-value pairs)
and y is a class label.
Bayesian classifiers

• A Bayesian classifier is a statistical classifier. It can predict the
probability that a given sample belongs to a particular class.
• Bayesian classification is based on the Bayes theorem of
a-posteriori probability.
• Let X be a data sample whose class label is unknown. Each
sample is represented by an n-dimensional vector, X = (x1, x2, ..., xn).
• The classification problem may be formulated using a-posteriori
probabilities as follows: determine P(C|X), the probability that
the sample X belongs to a specified class C.
• P(C|X) is the a-posteriori probability of C conditioned on X.
Bayesian classifiers
• Example:
Given a set of samples describing credit applicants
P(Risk=low|Age=38, Marital_Status=divorced,
Income=low, children=2) is the probability that a credit
applicant X=(38, divorced, low, 2) is the low credit risk
applicant.
• The idea of Bayesian classification is to assign to a new
unknown sample X the class label C such that P(C| X) is
maximal.
Bayesian classifiers

• The main problem is how to estimate the a-posteriori
probability P(C|X).
• By the Bayes theorem:
    P(C|X) = (P(X|C) * P(C)) / P(X),
where P(C) is the a-priori probability of C, that is, the
probability that any given sample belongs to the class C,
P(X|C) is the a-posteriori probability of X conditioned on
C, and P(X) is the a-priori probability of X.
• In our example, P(X|C) is the probability that X=(38,
divorced, low, 2) given the class Risk=low, P(C) is the
probability of the class C, and P(X) is the probability that
the sample X=(38, divorced, low, 2).
Bayesian classifiers

• Suppose a training database D consists of n samples, and
suppose the class label attribute has m distinct values
defining m distinct classes C_i, for i = 1, ..., m.
• Let s_i denote the number of samples of D in class C_i.
• A Bayesian classifier assigns an unknown sample X to the
class C_i that maximizes P(C_i|X). Since P(X) is constant
for all classes, the class C_i for which P(C_i|X) is
maximized is the class C_i for which P(X|C_i) * P(C_i) is
maximized.
• P(C_i) may be estimated by s_i/n (the relative frequency of the
class C_i), or we may assume that all classes have the
same probability: P(C_1) = P(C_2) = ... = P(C_m).
Bayesian classifiers

• The remaining problem is how to compute P(X|C_i).
• Given a large dataset with many predictor attributes, it
would be very expensive to compute P(X|C_i); therefore,
to reduce the cost of computing P(X|C_i), the assumption
of class conditional independence, or, in other words, the
attribute independence assumption, is made.
• The assumption states that there are no dependencies
among predictor attributes, which leads to the following
formula:
    P(X|C_i) = Π_{j=1..n} P(x_j|C_i)
Bayesian classifiers

• The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i)
can be estimated from the dataset:
– if the j-th attribute is categorical, then P(x_j|C_i) is estimated
as the relative frequency of samples of the class C_i
having value x_j for the j-th attribute,
– if the j-th attribute is continuous, then P(x_j|C_i) is estimated
through the Gaussian density function.
• Due to the class conditional independence assumption, the
Bayesian classifier is also known as the naive Bayesian
classifier.
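A minimal categorical naive Bayes sketch along these lines (my own helper names, relative-frequency estimates, no smoothing):

```python
# Sketch of a categorical naive Bayesian classifier: P(C_i) * prod_j P(x_j|C_i),
# with both factors estimated as relative frequencies (no smoothing).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    n = len(labels)
    prior = {c: k / n for c, k in Counter(labels).items()}     # P(C_i)
    cond = defaultdict(Counter)                                # (class, attr) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(c, j)][v] += 1
    return prior, cond

def classify(x, prior, cond):
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            counts = cond[(c, j)]
            s *= counts[v] / sum(counts.values())              # P(x_j | C_i)
        return s
    return max(prior, key=score)                               # argmax_i P(C_i|X)

# Tiny illustrative usage (two attributes: age group, student):
rows = [("<=30", "no"), ("<=30", "yes"), ("31..40", "no"), (">40", "yes")]
labels = ["no", "yes", "yes", "yes"]
prior, cond = train_nb(rows, labels)
print(classify(("<=30", "yes"), prior, cond))                  # "yes"
```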
Bayesian classifiers

• The assumption makes the computation feasible. Moreover,
when the assumption is satisfied, the naive Bayesian
classifier is optimal, that is, it is the most accurate classifier
in comparison to all other classifiers.
• However, the assumption is seldom satisfied in practice,
since attributes are usually correlated.
• Several attempts have been made to apply Bayesian
analysis without assuming attribute independence. The
resulting models are called Bayesian networks or
Bayesian belief networks.
• Bayesian belief networks combine Bayesian analysis with
causal relationships between attributes.
k-nearest neighbor classifiers
• Nearest neighbor classifier belongs to instance-based
learning methods.
• Instance-based learning methods differ from other
classification methods discussed earlier in that they do not
build a classifier until a new unknown sample needs to be
classified.
• Each training sample is described by an n-dimensional vector
representing a point in an n-dimensional space called the
pattern space. When a new unknown sample has to be
classified, a distance function is used to determine the
member of the training set which is closest to the unknown
sample.
k-nearest neighbor classifiers
• Once the nearest training sample is located in the pattern space, its
class label is assigned to the unknown sample.
• The main drawback of this approach is that it is very sensitive to noisy
training samples.
• The common solution to this problem is to adopt the k-nearest
neighbor strategy.
• When a new unknown sample has to be classified, the classifier
searches the pattern space for the k training samples which are closest
to the unknown sample. These k training samples are called the k
"nearest neighbors" of the unknown sample and the most common
class label among k "nearest neighbors" is assigned to the unknown
sample.
• To find the k "nearest neighbors" of the unknown sample a
multidimensional index is used (e.g. R-tree, Pyramid tree, etc.).
k-nearest neighbor classifiers

• Two different issues need to be addressed regarding the
k-nearest neighbor method:
– the distance function, and
– the transformation from a sample to a point in the pattern space.
• The first issue is to define the distance function. If the
attributes are numeric, most k-nearest neighbor classifiers
use Euclidean distance.
• Instead of the Euclidean distance, we may also apply other
distance metrics like Manhattan distance, maximum of
dimensions, or Minkowski distance.
k-nearest neighbor classifiers
• The second issue is how to transform a sample to a point in
the pattern space.
• Note that different attributes may have different scales and
units, and different variability. Thus, if the distance metric
is used directly, the effects of some attributes might be
dominated by other attributes that have larger scale or
higher variability.
• A simple solution to this problem is to weight the various
attributes. One common approach is to normalize all
attribute values into the range [0, 1].
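Putting the pieces of this section together, a small k-NN sketch (assumed function names; min-max normalization into [0, 1], Euclidean distance, majority vote among the k nearest, and made-up sample data):

```python
# Sketch of a k-nearest neighbor classifier with [0, 1] min-max normalization.
import math
from collections import Counter

def min_max_fit(points):
    """Return per-dimension (low, high) pairs used to rescale into [0, 1]."""
    return [(min(col), max(col)) for col in zip(*points)]

def min_max_apply(point, ranges):
    return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, (lo, hi) in zip(point, ranges))

def knn_classify(train_points, train_labels, query, k=3):
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative usage on made-up (age, income) samples:
points = [(20, 30000), (25, 32000), (60, 90000), (55, 85000)]
labels = ["high", "high", "low", "low"]
ranges = min_max_fit(points)
norm = [min_max_apply(p, ranges) for p in points]
query = min_max_apply((23, 31000), ranges)        # normalize the query the same way
print(knn_classify(norm, labels, query, k=3))     # "high"
```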
k-nearest neighbor classifiers
• This solution is sensitive to outliers, since a
single outlier could cause virtually all other values to be
contained in a small subrange.
• Another common approach is to apply a standardization
transformation, such as subtracting the mean from the
value of each attribute and then dividing by its standard
deviation.
• Recently, another approach was proposed, which consists of
applying the robust space transformation called the
Donoho-Stahel estimator – this estimator has some important and
useful properties that make it very attractive for
different data mining applications.
Classifier accuracy
• The accuracy of a classifier on a given test set of samples
is defined as the percentage of test samples correctly
classified by the classifier, and it measures the overall
performance of the classifier.
• Note that the accuracy of the classifier is not estimated on
the training dataset, since it would not be a good indicator
of the future accuracy on new data.
• The reason is that the classifier generated from the training
dataset tends to overfit the training data, and any estimate
of the classifier's accuracy based on that data will be
overoptimistic.
Classifier accuracy
• In other words, the classifier is more accurate on the data
that was used to train the classifier, but very likely it will
be less accurate on independent set of data.
• To predict the accuracy of the classifier on new data, we
need to assess its accuracy on an independent dataset that
played no part in the formation of the classifier.
• This dataset is called the test set.
• It is important to note that the test dataset should not be
used in any way to build the classifier.
Classifier accuracy
• There are several methods for estimating classifier
accuracy. The choice of a method depends on the amount
of sample data available for training and testing.
• If there are a lot of sample data, then the following simple
holdout method is usually applied.
• The given set of samples is randomly partitioned into two
independent sets, a training set and a test set (typically,
70% of the data is used for training, and the remaining
30% is used for testing)
• Provided that both sets of samples are representative, the
accuracy of the classifier on the test set will give a good
indication of accuracy on new data.
Classifier accuracy
• In general, it is difficult to say whether a given set of
samples is representative or not, but at least we may ensure
that the random sampling of the data set is done in such a
way that the class distribution of samples in both training
and test set is approximately the same as that in the initial
data set.
• This procedure is called stratification
Testing – large dataset

The available examples are divided randomly: 70% into a training set
(used to develop one tree) and 30% into a test set (used to check accuracy).
Classifier accuracy
• If the amount of data for training and testing is limited, the
problem is how to use this limited amount of data for
training to get a good classifier and for testing to obtain a
correct estimation of the classifier accuracy?
• The standard and very common technique of measuring the
accuracy of a classifier when the amount of data is limited
is k-fold cross-validation
• In k-fold cross-validation, the initial set of samples is
randomly partitioned into k approximately equal mutually
exclusive subsets, called folds, S_1, S_2, ..., S_k.
Classifier accuracy
• Training and testing is performed k times. At each
iteration, one fold is used for testing while the remaining k−1
folds are used for training. So, at the end, each fold has
been used exactly once for testing and k−1 times for training.
• The accuracy estimate is the overall number of correct
classifications from the k iterations divided by the total
number of samples N in the initial dataset.
• Often, the k-fold cross-validation technique is combined
with stratification and is then called stratified k-fold cross-validation.
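A sketch of the procedure (train_fn and classify_fn are placeholders for any classifier; stratification is omitted for brevity):

```python
# Sketch of k-fold cross-validation accuracy estimation (no stratification).
import random
from collections import Counter

def k_fold_accuracy(samples, labels, train_fn, classify_fn, k=10, seed=0):
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]                 # k roughly equal folds
    correct = 0
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in idx if i not in test_set]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(classify_fn(model, samples[i]) == labels[i]
                       for i in test_idx)
    return correct / len(samples)                         # total correct / N

# e.g. with a trivial majority-class "model":
train_fn = lambda xs, ys: Counter(ys).most_common(1)[0][0]
classify_fn = lambda model, x: model
print(k_fold_accuracy(list(range(10)), ["a"] * 7 + ["b"] * 3,
                      train_fn, classify_fn, k=5))
```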
Testing – small dataset (cross-validation)

Repeat 10 times: the available examples are split randomly into a 90%
training set and a 10% test set; the training sets are used to develop 10
different trees, and each corresponding test set is used to check accuracy.
Classifier accuracy
• There are many other methods of estimating classifier
accuracy on a particular dataset
• Two popular methods are leave-one-out cross-validation
and bootstrapping
• Leave-one-out cross-validation is simply N-fold cross-validation,
where N is the number of samples in the initial dataset
• At each iteration, a single sample from the dataset is left
out for testing, and the remaining samples are used for
training. The result of testing is either success or failure.
• The results of all N evaluations, one for each sample from
the dataset, are averaged, and that average represents the
final accuracy estimate.
Classifier accuracy
• Bootstrapping is based on sampling with replacement
• The initial dataset is sampled N times, where N is the total
number of samples in the dataset, with replacement, to
form another set of N samples for training.
• Since some samples in this new "set" will be repeated,
some samples from the initial dataset will not appear in
this training set. These samples form the test set.
• Both estimation methods are especially interesting for
estimating classifier accuracy on small datasets. In
practice, the standard and most popular technique of
estimating classifier accuracy is stratified
tenfold cross-validation.
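For illustration, forming one bootstrap training set and its out-of-bag test set can be sketched as follows (index-level only; names are mine):

```python
# Sketch: one bootstrap sample of size N plus the left-out samples as a test set.
import random

def bootstrap_split(n, seed=0):
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]      # sample N times with replacement
    chosen = set(train_idx)
    test_idx = [i for i in range(n) if i not in chosen]   # samples never drawn
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(10)
print(train_idx, test_idx)   # roughly a third of the indices end up in the test set
```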
Requirements
• Focus on mega-induction
• Handle both continuous and categorical data
• No restriction on:
– number of examples
– number of attributes
– number of classes
Applications
• Treatment effectiveness
• Credit approval
• Store location
• Target marketing
• Insurance company (fraud detection)
• Telecommunication company (client classification)