Classification Using
Statistically Significant Rules
Sanjay Chawla
School of IT
University of Sydney
(joint work with Florian Verhein and Bavani Arunasalam)
1
Overview
• Data Mining Tasks
• Associative Classifiers
• Support and Confidence for Imbalanced
Datasets
• The use of exact tests to mine rules
• Experiments
• Conclusion
2
Data Mining
• Data Mining research has settled into an
equilibrium involving four tasks
[Diagram: the four tasks – Pattern Mining (Association Rules), Classification, Clustering, and Anomaly/Outlier Detection – spanning the DB and ML communities, with Associative Classifiers linking Pattern Mining and Classification]
3
Outlier Detection
• Outlier Detection (Anomaly Detection) can
be studied from two aspects
– Unsupervised
• Nearest Neighbor or K-Nearest Neighbor Problem
– Supervised
• Classification for imbalanced data set
– Fraud Detection
– Medical Diagnosis
4
Association Rules (Agrawal, Imielinski and Swami, SIGMOD 93)
• An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → Beer
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
6
From “Introduction to Data Mining”, Tan, Steinbach and Kumar
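A minimal Python sketch (not from the slides) that recomputes these two metrics on the example transactions above:

```python
# Support and confidence of {Milk, Diaper} -> {Beer} on the example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(transactions)   # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)         # 2/3 ~= 0.67
print(support, confidence)
```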
Association Rule Mining (ARM)
Task
• Given a set of transactions T, the goal of association rule
mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
7
Mining Association Rules
• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
8
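A minimal Python sketch of the rule-generation step. It assumes the frequent itemsets and their supports have already been computed; the `support` dictionary and the rule tuple layout are hypothetical inputs chosen for illustration, not details from the slides:

```python
from itertools import combinations

def generate_rules(itemset, support, minconf):
    """Step 2: enumerate binary partitions X -> (itemset - X) of a frequent
    itemset and keep the high-confidence ones.  `support` maps frozensets to
    relative supports; the itemset and all of its subsets are assumed present."""
    rules = []
    items = list(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            X = frozenset(antecedent)
            conf = support[frozenset(itemset)] / support[X]
            if conf >= minconf:
                rules.append((X, frozenset(itemset) - X, conf))
    return rules
```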
Frequent Itemset Generation
[Itemset lattice over {A, B, C, D, E}: every candidate itemset from the null set up to ABCDE]

Given d items, there are 2^d possible candidate itemsets
9
From “Introduction to Data Mining”, Tan, Steinbach and Kumar
Reducing Number of
Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets
must also be frequent
• Apriori principle holds due to the following
property of the support measure:
∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the
support of its subsets
– This is known as the anti-monotone property of
support
10
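To make the pruning concrete, here is a small, illustrative Apriori-style miner in Python. It is a sketch of the standard algorithm, not the authors' code: candidates at level k+1 are generated and counted only if all of their immediate subsets survived level k, which is exactly the anti-monotone pruning described above.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Illustrative level-wise Apriori over a list of item sets."""
    n = len(transactions)
    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]): sum(i in t for t in transactions) / n for i in items}
    freq = {s: v for s, v in freq.items() if v >= minsup}
    frequent, level = dict(freq), set(freq)
    while level:
        k = len(next(iter(level)))
        candidates = set()
        for a in level:
            for b in level:
                c = a | b
                # Anti-monotone pruning: keep (k+1)-candidates whose immediate
                # subsets are all frequent at level k.
                if len(c) == k + 1 and all(c - {x} in level for x in c):
                    candidates.add(c)
        level = set()
        for c in candidates:
            sup = sum(c <= t for t in transactions) / n
            if sup >= minsup:
                frequent[c] = sup
                level.add(c)
    return frequent
```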
Illustrating Apriori Principle
[Itemset lattice over {A, B, C, D, E}: once an itemset is found to be infrequent, all of its supersets are pruned from the candidate space]
From “Introduction to Data Mining”, Tan, Steinbach and Kumar
11
Classification using ARM
TID | Items | Gender
1 | Bread, Milk | F
2 | Bread, Diaper, Beer, Eggs | M
3 | Milk, Diaper, Beer, Coke | M
4 | Bread, Milk, Diaper, Beer | M
5 | Bread, Milk, Diaper, Coke | F

In a classification task we want to predict the class label (Gender) using the attributes.
A good (albeit stereotypical) rule is {Beer, Diaper} → Male, whose support is 60% and confidence is 100%.
12
Classification Based on Association
(CBA): Liu, Hsu and Ma (KDD 98)
• Mine association rules of the form A → c on
the training data
• Prune or Select rules using a heuristic
• Rank rules
– Higher confidence; higher support; smallest
antecedent
• New data is passed through the ordered
rule set
– Apply first matching rule or variation thereof
13
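A hedged Python sketch of this ranking-and-first-match scheme. The rule tuple layout and the default class are assumptions made for illustration, not details from the slides:

```python
def rank_rules(rules):
    """CBA-style ordering: higher confidence first, then higher support,
    then smaller antecedent.  Each rule is assumed to be
    (antecedent_set, class_label, support, confidence)."""
    return sorted(rules, key=lambda r: (-r[3], -r[2], len(r[0])))

def classify(instance_items, ranked_rules, default_class):
    """Apply the first rule whose antecedent is contained in the instance."""
    for antecedent, label, _, _ in ranked_rules:
        if antecedent <= instance_items:
            return label
    return default_class
```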
Several Variations
Acronym | Authors | Forum | Comments
CMAR | Li, Han, Pei | ICDM 01 | Uses Chi^2
-- | Antonie, Zaiane | DMKD 04 | Pos and neg rules
FARMER | Cong et al. | SIGMOD 04 | Row enumeration
Top-K | Cong et al. | SIGMOD 05 | Limit no. of rules
CCCS | Arunasalam et al. | SIGKDD 06 | Support-free
14
Downsides of Support (3)
• Support is biased towards the majority
class
– E.g.: classes = {yes, no}, sup({yes}) = 90%
– minSup > 10% wipes out any rule predicting “no”
– Suppose X → no has confidence 1 and support 3%. The rule is discarded if minSup > 3%, even though it perfectly predicts 30% of the instances in the minority class!
• In summary, support has many downsides – especially for classification.
17
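A tiny worked check of that arithmetic in Python, using made-up counts that match the slide's percentages (the total of 1000 records is an assumption):

```python
# Assumed toy counts matching the slide: 1000 records, 900 "yes" / 100 "no".
n_total, n_no = 1000, 100
n_X_and_no = 30          # X -> no holds with confidence 1, so sigma(X) = 30
support = n_X_and_no / n_total          # 0.03 -> pruned by any minSup > 3%
minority_coverage = n_X_and_no / n_no   # 0.30 -> yet it covers 30% of class "no"
print(support, minority_coverage)
```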
Downside of Confidence (1)

      | C  | ¬C | Total
A     | 20 | 5  | 25
¬A    | 70 | 5  | 75
Total | 90 | 10 | 100

Conf(A → C) = 20/25 = 0.8
Support(A → C) = 20/100 = 0.2

Correlation between A and C:
corr(A, C) = s(A, C) / (s(A) · s(C)) = 0.20 / (0.25 × 0.90) ≈ 0.89 < 1

Thus, when the data set is imbalanced, a high-support, high-confidence rule does not necessarily mean that the antecedent and the consequent are positively correlated.
18
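The same numbers, recomputed in a short Python snippet (the cell names a, b, c, d follow the table layout above):

```python
# 2x2 contingency table from the slide: rows A / not-A, columns C / not-C.
a, b = 20, 5      # A and C, A and not-C
c, d = 70, 5      # not-A and C, not-A and not-C
n = a + b + c + d

conf = a / (a + b)                               # 0.8
supp = a / n                                     # 0.2
corr = supp / (((a + b) / n) * ((a + c) / n))    # 0.2 / (0.25 * 0.9) ~= 0.89 < 1
print(conf, supp, corr)
```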
Downside of Confidence (2)
• Reasonable to expect that for “good rules”
the antecedent and consequent are not
independent!
• Suppose
  – P(Class = Yes) = 0.9
  – P(Class = Yes | X) = 0.9
• Then Conf(X → Yes) = 0.9 looks high, yet X and the class are independent (lift = 0.9/0.9 = 1), so the rule carries no information.
19
Complement Class Support (CCS)

CCS(A → C) = σ(A, ¬C) / σ(¬C)

The following are equivalent for a rule A → C:
1. A and C are positively correlated
2. CCS(A → C) is less than the support of the antecedent (A)
3. Conf(A → C) is greater than the support of the consequent (C)
20
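A short Python check of these equivalences on the slide-18 table (the σ counts below are taken directly from that table), shown as a minimal sketch:

```python
# Counts from the slide-18 table (rows A / not-A, columns C / not-C).
sigma_AC, sigma_AnotC = 20, 5
sigma_A, sigma_C, sigma_notC, n = 25, 90, 10, 100

ccs = sigma_AnotC / sigma_notC                              # 0.5
pos_corr = (sigma_AC / n) > (sigma_A / n) * (sigma_C / n)   # False: 0.2 > 0.225
cond2 = ccs < sigma_A / n                                   # False: 0.5 < 0.25
cond3 = (sigma_AC / sigma_A) > sigma_C / n                  # False: 0.8 > 0.9
print(pos_corr, cond2, cond3)                               # all three agree
```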
Downsides of Confidence (3)
Another useful observation
• Higher confidence (support) for a rule in the minority class implies higher correlation, and lower correlation in the minority class implies lower confidence, but neither implication holds for the majority class.
• Confidence (and support) therefore tends to favour the majority class.
21
Statistically Significant Rules
• Support is a computationally efficient
measure (anti-monotonic)
– Tendency to “force” a statistical interpretation
on support
• Let's start with a statistically correct
approach
– And “force” it to be computationally
efficient
22
Exact Tests
• Let the class variable be {0,1}.
• Suppose we have two rules X → 1 and Y → 1
• We want to determine whether the two rules are different, i.e., whether they have different effects on “causing” or “associating with” 1 or 0
– e.g., medicine and placebo
23
Exact Tests
Table [a, b; c, d]:

      | X          | Y              | Total
1     | x [a]      | m1 − x [b]     | m1
0     | n1 − x [c] | n2 − m1 + x [d]| m2
Total | n1         | n2             | N = n1 + n2 = m1 + m2

We assume X and Y are binomial random variables with the same parameter p.
We want to determine the probability that a specific table instance occurs “purely by chance”.
24
Exact Tests
P(X = x | X + Y = m1)
  = P(X = x, Y = m1 − x) / P(X + Y = m1)
  = P(X = x) · P(Y = m1 − x) / P(X + Y = m1)
  = [ C(n1, x) p^x (1 − p)^(n1 − x) · C(n2, m1 − x) p^(m1 − x) (1 − p)^(n2 − m1 + x) ]
    / [ C(n1 + n2, m1) p^(m1) (1 − p)^(n1 + n2 − m1) ]
  = C(n1, x) · C(n2, m1 − x) / C(n1 + n2, m1)

We can calculate the exact probability of a specific table instance without resorting to asymptotic approximations.
This can be used to calculate the p-value of [a, b; c, d].
25
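The final expression is the hypergeometric probability, so it can be evaluated directly. A sketch using scipy; the concrete table values are assumptions chosen for illustration:

```python
from scipy.stats import hypergeom

# Assumed example table [a, b; c, d] = [8, 2; 2, 8]:
# n1 = a + c exposures to X, n2 = b + d to Y, row-1 total m1 = a + b.
a, b, c, d = 8, 2, 2, 8
n1, n2, m1 = a + c, b + d, a + b

# P(X = a | X + Y = m1) = C(n1, a) * C(n2, m1 - a) / C(n1 + n2, m1)
p_table = hypergeom.pmf(a, n1 + n2, n1, m1)
print(p_table)
```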
Fisher Exact Test
• Given a table, [a,b;c,d], Fisher Exact Test
will find the probability (p-value) of
obtaining the given table or a more
positively associated table under the
assumption that X and Y come from the
same distribution.
26
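The full test is available off the shelf. A sketch using scipy.stats.fisher_exact, where alternative="greater" gives the one-sided p-value for "the given table or a more positively associated one"; the table values are again assumptions for illustration:

```python
from scipy.stats import fisher_exact

# Assumed table [a, b; c, d]: rows = class 1 / class 0, columns = X / Y.
table = [[8, 2],
         [2, 8]]

# One-sided p-value: probability of this table or one more positively
# associated (larger a), under the null that X and Y behave identically.
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(odds_ratio, p_value)
```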
Forcing “anti-monotonic”
• We test a rule X → 1 against all its immediate generalizations {X − z → 1 : z ∈ X}
• The p-value of the rule is
  p-value(X → 1) = min_{z ∈ X} { p[a, b; c, d] }
• The rule is significant if
  – p-value < significance level (typically 0.05)
• Use a bottom-up approach and only test rules whose immediate generalizations are significant
• Webb [06] has used Fisher Exact Tests for generic association rule mining
27
Example
• Suppose we have already determined that the rules (A1 = a1) → 1 and (A2 = a2) → 1 are significant.
• Now we want to test whether
  – X = (A1 = a1) ∧ (A2 = a2) → 1 is significant
• Then we carry out a FET on X against X − {A1 = a1}, and on X against X − {A2 = a2}.
• If the minimum of their p-values is less than the significance level we keep the rule X → 1, otherwise we discard it.
28
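A hedged Python sketch of this pruning step. The records representation (a list of (item-set, class) pairs) and the exact 2×2 table built in contingency() are assumptions made for illustration; the paper's precise table construction may differ:

```python
from scipy.stats import fisher_exact

def contingency(X, z, records):
    """One plausible 2x2 table (an assumption, not necessarily the paper's
    construction): records matching X vs. records matching X - {z} but not z,
    split by class 1 / class 0."""
    X_minus = X - {z}
    with_z = [(items, y) for items, y in records if X <= items]
    without_z = [(items, y) for items, y in records
                 if X_minus <= items and z not in items]
    a = sum(y == 1 for _, y in with_z)
    b = sum(y == 1 for _, y in without_z)
    c = sum(y == 0 for _, y in with_z)
    d = sum(y == 0 for _, y in without_z)
    return [[a, b], [c, d]]

def is_significant(X, records, alpha=0.05):
    """Keep X -> 1 only if the minimum FET p-value over its immediate
    generalizations is below the significance level."""
    p_values = [fisher_exact(contingency(X, z, records), alternative="greater")[1]
                for z in X]
    return min(p_values) < alpha
```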
Ranking Rules
• We have already observed that
– a high-confidence rule for the majority class may be “more” negatively correlated than the same rule predicting the other class
– a highly positively correlated rule that predicts the minority class may have lower confidence than the same rule predicting the other class
30
Experiments: Random Dataset
• Attributes independent and uniformly distributed.
• It makes no sense to find any rules, other than by chance.
• However, minSup = 1% and minConf = 0.5 mines 4149 of the 13092 possible rules (over 31%).
• Using our FET technique with the standard significance level we find only 11 (0.08%).
31
Experiments: Balanced Dataset
• Similar performance (within 1%)
32
Experiments: Balanced Dataset
• But mines only 0.06% of the number of rules
• By searching only 0.5% of the search space
• And using only 0.4% of the time
33
Experiments: Imbalanced Dataset
• Higher performance than support-confidence techniques
• Using 0.07% of the search space and time, and 0.7% of the number of rules
34
Contributions
• Strong evidence and arguments against the use
of support and confidence for imbalanced
classification
• Simple technique for using Fisher’s Exact test
for finding positively associated and statistically
significant rules.
– Uses on average 0.4% of the time, searches only 0.5% of the search space, and finds only 0.06% of the rules, compared to support-confidence techniques.
– Similar performance on balanced datasets, higher on
imbalanced datasets.
• Parameter free (except for significance level)
35
References
• Verhein and Chawla, Classification Using
Statistically Significant Rules
http://www.it.usyd.edu.au/~chawla
• Arunasalam and Chawla, CCCS: A Top-Down Associative Classifier for Imbalanced Class Distribution [ACM SIGKDD 2006, pp. 517-522]
36