Transcript Slides

Mining Changes of Classification
by Correspondence Tracing
Ke Wang
Senqiang Zhou
Simon Fraser University
{wangk, szhoua}@cs.sfu.ca
Wai Chee Ada Fu
Jeffrey Xu Yu
The Chinese University of Hong Kong
[email protected]
[email protected]
The Problem



- Mining changes of classification patterns as the data changes
- What we have: the old classifier and the new data
- What we want: the changes of classification characteristics in the new data
- Example:
  - Old: members with a large family → shop frequently
  - New: members with a large family → shop infrequently
Targets

- State the changes explicitly
  - Simply comparing the old and new classifiers does not work
- Distinguish the important changes; otherwise users could be overwhelmed
  - We are interested in the changes that cause the change of classification accuracy
Related Work



- [GGR99]: transform the two classifiers into a common specialization; the changes are measured by the amount of work required for the transformation
  - Human users hardly measure changes in this way
  - Does not address the primary goal: accuracy
Related Work (Cont.)

- [LHHX00]: extract changes by requiring the new classifier to be similar to the old one
  - Uses the same splitting attributes or the same splitting points in the decision tree construction
  - This puts a severe restriction on mining important changes
Our Approach: Basic Idea

- For each old rule o, trace the corresponding new rules in the new classifier through the examples they both classify
- Example: n1 and n2 are the corresponding rules of o
  [Figure: old rule o overlaps new rules n1 and n2 in the new data set]
The Algorithm


- Input: the old classifier and the new data
- Output: the changes of classification patterns
- Step 1: build the new classifier
- Step 2: identify the corresponding new rules for each old rule
- Step 3: present the changes
Identify The Corresponding Rules

- Given an old rule o:
  - Collect the examples classified by o
  - For each such example, identify the new rule n that classifies it
- Characteristic change: o → <n1, n2, …, nk>
  [Figure: the examples classified by old rule o are split between new rules n1 and n2]
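Step 2 (identifying the corresponding rules) can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: rules are assumed to be predicates over examples, and the new classifier's rule list is assumed ordered so that the first matching rule classifies an example.

```python
def trace_correspondence(old_rules, new_rules, new_data):
    """For each old rule o, find its corresponding new rules:
    the new rules that classify the examples o classifies."""
    def first_match(rules, example):
        # the rule list is assumed ordered: the first rule
        # covering the example is the one that classifies it
        for r in rules:
            if r(example):
                return r
        return None

    correspondence = {}
    for o in old_rules:
        corresponding = []              # <n1, n2, ..., nk>, in order found
        for x in new_data:
            if not o(x):                # keep only examples classified by o
                continue
            n = first_match(new_rules, x)
            if n is not None and n not in corresponding:
                corresponding.append(n)
        correspondence[o] = corresponding
    return correspondence
```

With toy rules matching the example slide (o: A4=1; n1: A3=1 and A4=1; n2: A3=2 and A4=1), the sketch traces o to <n1, n2> on data covered by both new rules.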
Any Important Changes?



Given O <n1,n2, …, nk>, which
changes are more important?
Hint: users usually are interested in the
changes causing the change of
classification accuracy.
Basic idea: measure the importance of
changes based on the estimated
accuracy of rules on future data
Pessimistic Error Estimation

- Consider an old rule o that classifies N examples (in the new data set), E of them wrongly
  - Observed error rate: E/N
  - How about the error rate in the whole population?
- Given a confidence level CF, the upper bound on the error rate is U_CF(N, E)
- Details in [Q93]
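U_CF(N, E) is the upper limit of a one-sided binomial confidence interval: the error rate p at which observing at most E errors in N examples has probability CF. A minimal sketch by bisection follows (C4.5 itself [Q93] uses closed-form approximations; this is not Quinlan's code):

```python
from math import comb

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(e + 1))

def ucf(n, e, cf=0.25):
    """Pessimistic upper bound U_CF(N, E) on the true error rate:
    the p solving P(X <= E | N, p) = CF, found by bisection
    (the CDF is decreasing in p)."""
    lo, hi = e / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid        # tail probability still above CF: p is larger
        else:
            hi = mid
    return lo
```

At CF = 25% this reproduces the figures used later in the talk: ucf(7, 4) ≈ 0.75 and ucf(5, 0) ≈ 0.24.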
Estimating Quantitative Change

- Consider a rule pair <o, n_i>, where o classifies (N_o, E_o) and n_i classifies (N_n, E_n)
- Quantitative change:
  - Step 1: estimate the error rates for both rules
  - Step 2: calculate the decrease of the error rate
  - Step 3: calculate the increase of accuracy
An Example
Consider <o, n1, n2>:
- o: A4=1 → C3 (N=7, E=4)
- n1: A3=1, A4=1 → C3 (N=5, E=0), 3 examples classified by o
- n2: A3=2, A4=1 → C2 (N=6, E=0), 4 examples classified by o
Assume the new data set has 18 examples and CF = 25%. Consider the quantitative changes of <o, n1>:
- The estimated error rate of o: U_CF(7, 4) = 75%
- The estimated error rate of n1: U_CF(5, 0) = 24%
- The decrease of the error rate: 75% - 24% = 51%
- The increase of accuracy (o, n1) = (3/18) * 51% = 8.5%
- The increase of accuracy (o, <n1, n2>) = 8.5% + 12% = 20.5%
Types of Changes


- Global changes: both the characteristic change and the quantitative change are large
- Alternative changes: the characteristic change is large but the quantitative change is small
Types of Changes (Cont.)


- Target changes: similar characteristics but different classes (targets)
- Interval changes: a shift of boundary points due to the emergence of new cutting points
Experiments

- Two data sets:
  - German Credit Data from the UCI repository [MM96]
  - IPUMS Census Data [IPUMS]
- Goal: verify whether our algorithm can find the important changes that are "supposed" to be found
Methodologies

- For the German Credit data: plant some changes into the original data and check whether the proposed method finds them
- For the IPUMS census data: apply the proposed method to find the changes across years or across different sub-populations
- Classifiers are built using the C4.5 algorithm
Summaries of German Data

- Data description:
  - 2 classes: bad and good
  - 20 attributes, 13 of them categorical
  - 1000 examples, 700 of which are "good"
- Changes planted:
  - Target change
  - Interval change
  - Etc.
Planting Target Change
Consider the examples classified by the old rule:
- Personal-status = single-male, Foreign = no → Credit = good (23, 0)
Changes planted:
- If Liable-people = 1, change the class from good to bad
- 12 examples changed
The Changes Found
Old rule: Personal-status = single-male, Foreign = no → Credit = good
Corresponding new rules found:
- Liable-people > 1, Foreign = no → Credit = good (Δ = 0.54%)
- Personal-status = single-male, Liable-people ≤ 1, Foreign = no → Credit = bad (Δ = 0.48%)
Planting Interval Change
Consider the examples classified by the old rule:
- Status = 0DM, Duration > 11, Foreign = yes → Credit = bad (164, 47)
Changes planted:
- Increase the Duration value by 6 (months) for each example classified by the old rule
- 164 examples changed
The Changes Found
Old rule: Status = 0DM, Duration > 11, Foreign = yes → Credit = bad
Corresponding new rule found:
- Status = 0DM, Duration > 16, Foreign = yes → Credit = bad (Δ = 1.20%)
Summaries of IPUMS Data




- Take "vetstat" as the class
- 3 data sets: 1970, 1980 and 1990
- Each data set contains the examples for several races
- The proposed method is applied to find the changes across years or across different sub-populations
Interesting Changes Found
A. 1970-black vs 1990-black
- Old: 35 < age ≤ 54 → Vetstat = yes
- New: 40 < age ≤ 72, sex = male → Vetstat = yes (Δ = 1.20%)
B. 1990-black vs 1990-chinese
- Old: 40 < age ≤ 72, sex = male → Vetstat = yes
- New: bplg = china, incss ≤ 5748 → Vetstat = no (Δ = 4.56%)
Conclusion




- Extracting changes from potentially very dissimilar old and new classifiers by correspondence tracing
- Ranking the importance of changes
- Presenting the changes
- Experiments on real-life data sets
References



- [GGR99] V. Ganti, J. Gehrke, and R. Ramakrishnan. A framework for measuring changes in data characteristics. In PODS, 1999.
- [IPUMS] http://www.ipums.umn.edu/
- [LHHX00] B. Liu, W. Hsu, H.S. Han, and Y. Xia. Mining changes for real-life applications. In DaWaK, 2000.
References (Cont.)


- [MM96] C.J. Merz and P. Murphy. UCI repository of machine learning databases.
- [Q93] J.R. Quinlan. C4.5: Programs for Machine Learning. 1993.