Mining Approximate Frequent Itemsets in the Presence of Noise

Mining Approximate Frequent Itemsets in the Presence of Noise
By: J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel and J. Prins
Presentation by: Apurv Awasthi
Title Statement
This paper introduces an approach to noise-tolerant frequent itemset mining over the binary matrix representation of a database.
Index

 Introduction to Frequent Itemset Mining
• Frequent Itemset Mining
• Binary Matrix Representation Model
• Problems
 Motivation
 Proposed Model
 Proposed Algorithm
 AFI Mining vs. Exact Frequent Itemset Mining
 Related Works
 Experimental Results
 Discussion
 Conclusion
Introduction to Frequent Itemset Mining

 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
 Originally developed to discover association rules
 Applications
• Bio-molecular applications: DNA sequence analysis, protein structure analysis
• Business applications: market basket analysis, sales campaign analysis
The Binary Matrix Representation Model

 A model for representing relational databases
 Rows correspond to objects (transactions)
 Columns correspond to attributes of the objects (items)
• ‘1’ indicates presence
• ‘0’ indicates absence
 Frequent itemset mining is a key technique for analyzing such data
 Apply the Apriori algorithm (a small sketch follows the example table below)
Item →        I1  I2  I3  I4  I5
Transaction
T1             1   0   1   1   0
T2             0   1   1   0   1
T3             1   1   1   0   1
T4             0   1   0   0   1
T5             1   0   0   0   0
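As a concrete illustration (not from the paper), the example matrix above can be written in code with a helper that computes exact support; the item/transaction values are taken directly from the table, everything else is a minimal sketch.

import numpy as np

# Binary matrix view of the example database above:
# rows = transactions T1..T5, columns = items I1..I5.
D = np.array([
    [1, 0, 1, 1, 0],   # T1
    [0, 1, 1, 0, 1],   # T2
    [1, 1, 1, 0, 1],   # T3
    [0, 1, 0, 0, 1],   # T4
    [1, 0, 0, 0, 0],   # T5
])

def exact_support(cols):
    """Fraction of transactions containing every item in the itemset (exact model)."""
    return D[:, cols].all(axis=1).mean()

print(exact_support([1, 4]))   # {I2, I5} occurs in 3 of 5 transactions -> 0.6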
Problem with Frequent Itemset Mining

• The traditional model for mining frequent itemsets requires that every item must occur in each supporting transaction
• NOT a practical assumption!
• Real data is typically subject to noise
• Reasons for noise
  o Human error
  o Measurement error
  o Vagaries of human behavior
  o Stochastic nature of the studied biological behavior
Effect of Noise

 Fragmentation of patterns by noise
• Discover multiple small fragments of the true itemset
• Miss the true itemset itself!
 Example
• An exact frequent itemset mining algorithm will miss the main itemset ‘A’
• Instead, it observes three fragmented itemsets – Itemsets 1, 2 and 3
• The fragmented itemsets may not satisfy the minimum support criterion and will therefore be discarded
Mathematical Proof of Fragmentation

From: "Significance and Recovery of Block Structures in Binary Matrices with Noise" by X. Sun & A.B. Nobel

With probability 1,

    M(Y) ≤ 2·log_a(n) − 2·log_a(log_a(n))

when n is sufficiently large, i.e. in the presence of noise only a fraction of the initial block of 1s can be recovered.

Where:
• Matrix X: contains the actual values recorded in the absence of any noise
• Matrix Z: binary noise matrix whose entries are independent Bernoulli random variables, Z(i,j) ~ Bern(p) for 0 ≤ p ≤ 0.5
• Y = X xor Z, and a = (1 − p)^(-1)
• M(Y): the largest k such that Y contains k transactions having k common items
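To get a feel for the bound, here is a small, purely illustrative calculation; the noise level p and matrix size n below are assumptions, not values from the paper.

import math

# Evaluate the fragmentation bound M(Y) <= 2*log_a(n) - 2*log_a(log_a(n)),
# with a = 1/(1 - p), for a hypothetical noise level p and n transactions.
p = 0.1
n = 10_000
a = 1.0 / (1.0 - p)

def log_a(x):
    return math.log(x) / math.log(a)

bound = 2 * log_a(n) - 2 * log_a(log_a(n))
print(f"Largest intact square block of 1s: about {bound:.0f} rows/items")

So even if the entire matrix were originally all 1s, a 10% noise rate leaves only a block of roughly this size fully intact, which is why exact mining recovers only fragments.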
Motivation

 The failure of classical frequent itemset mining to detect simple patterns in the presence of random errors (i.e. noise) compromises the ability of these algorithms to detect associations, cluster items or build classifiers when such errors are present.
Possible Solutions

 Let the itemset sub-matrix contain a small fraction of 0s
• DRAWBACK: this admits “free riders” like column h (for matrix C) and row 6 (for matrix B)
• SOLUTION: limit the number of 0s in each row and each column
Proposed Model

1. Use Approximate Frequent Itemsets (AFI)

 AFI characteristics
• The sub-matrix contains a large fraction of 1s
• A supporting transaction should contain most of the items, i.e. the number of 0s in every row must fall below a user-defined threshold (εr)
• A supporting item should occur in most of the transactions, i.e. the number of 0s in every column must fall below a user-defined threshold (εc)
• Number of supporting rows ≥ minimum support

Mathematical definition
• For a given binary matrix D having I0 items and T0 transactions,
an itemset I c I0 is an approximate frequent itemset AFI(єr,єc) if
there exists a set of transactions T c T0 with |T| ≥ |T0|.minsup
such that
• Similarly, define weak AFI(є)
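The definition translates directly into a check on a candidate sub-matrix. Below is a minimal sketch (my own, not the paper's code) of such a check; the function name and the use of numpy are assumptions for illustration.

import numpy as np

def is_afi(D, rows, cols, eps_r, eps_c, minsup):
    """Check whether the sub-matrix D[rows, cols] satisfies AFI(eps_r, eps_c).

    D      -- full 0/1 transaction-by-item matrix
    rows   -- candidate supporting transactions T
    cols   -- candidate itemset I
    minsup -- minimum support as a fraction of all transactions
    """
    sub = D[np.ix_(rows, cols)]
    zeros_per_row = 1.0 - sub.mean(axis=1)   # fraction of 0s per transaction
    zeros_per_col = 1.0 - sub.mean(axis=0)   # fraction of 0s per item
    return ((zeros_per_row <= eps_r).all()
            and (zeros_per_col <= eps_c).all()
            and len(rows) >= minsup * D.shape[0])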
AFI Example

 A, B and C are weak AFI(0.25)
 A: valid AFI(0.25, 0.25)
 B: weak AFI(*, 0.25)
 C: weak AFI(0.25, *)
Drawback of AFI

 The AFI criteria violate the Apriori property!
• Apriori property: all sub-itemsets of a frequent itemset must be frequent
• But a sub-itemset of an AFI need not be an AFI, e.g. A is a valid AFI for minSupport = 4, but {b,c,e}, {b,c,d} etc. are not valid AFIs (see the sketch after this list)
• PROBLEM – minimum support can no longer be used as a pruning technique
• SOLUTION – a generalization of the Apriori property for noisy conditions (called Noise Tolerant Support Pruning)
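The slide's matrix A is shown as a figure, so here is a hypothetical 4x4 sub-matrix with the same flavour (made-up values, not the slide's) showing why the Apriori property fails: the 4-itemset satisfies the row constraint for all four transactions, but its 3-item subset cannot gather four supporting transactions under the same εr.

import numpy as np

# Each row and each column has exactly one 0, so with eps_r = eps_c = 0.25 and
# minsup = 4 the itemset {a, b, c, d} is a valid AFI over transactions t1..t4.
A = np.array([
    [1, 1, 1, 0],   # t1
    [1, 1, 0, 1],   # t2
    [1, 0, 1, 1],   # t3
    [0, 1, 1, 1],   # t4
])

def rows_within(sub, eps_r):
    """True if every row's fraction of 0s is at most eps_r."""
    return (1.0 - sub.mean(axis=1) <= eps_r).all()

print(rows_within(A, 0.25))          # True: {a,b,c,d} keeps all 4 transactions
print(rows_within(A[:, :3], 0.25))   # False: for {a,b,c}, rows t2..t4 have 1/3 zeros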
Proposed Model

1. Use Approximate Frequent Itemsets (AFI)
2. Noise Tolerant Support Pruning – to prune and generate candidate itemsets
3. 0/1 Extension – to count the support of a noise-tolerant itemset based on the support sets of its sub-itemsets
Noise Tolerant Support Pruning

 For given εr, εc and minsup, a noise-tolerant pruning support is defined for each length-k itemset and is used in place of minsup when pruning candidates
Proof: see the extra slides
0/1 Extensions

 Starting from singleton itemsets, generate (k+1)-itemsets from k-itemsets in a sequential (level-wise) manner
 The number of 0s allowed in an itemset grows with the length of the itemset in a discrete manner
 1-Extension: if ⌊εr·(k+1)⌋ = ⌊εr·k⌋ (no additional 0 is allowed), then the transaction set of a (k+1)-itemset I is the intersection of the transaction sets of its length-k subsets
 0-Extension: if ⌊εr·(k+1)⌋ = ⌊εr·k⌋ + 1 (one additional 0 is allowed), then the transaction set of a (k+1)-itemset I is the union of the transaction sets of its length-k subsets
Proof: see the extra slides
Proposed Algorithm
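The algorithm itself was presented as pseudocode in a figure. The following is a rough, simplified sketch (not the authors' implementation) of the level-wise loop it describes, assuming the 0/1-extension rule compares ⌊εr·k⌋ with ⌊εr·(k+1)⌋; it uses a plain transaction count in place of the paper's noise-tolerant support threshold and omits the final column-wise (εc) filtering step.

from math import floor
from itertools import combinations

def afi_mine(tx_sets, n_items, eps_r, minsup_count, max_len):
    """tx_sets[i] = set of transaction ids that contain item i exactly."""
    # Level 1: each single item starts with its exact transaction set.
    level = {(i,): set(tx_sets[i]) for i in range(n_items)}
    results = dict(level)
    for k in range(1, max_len):
        nxt = {}
        for itemset in combinations(range(n_items), k + 1):
            subsets = list(combinations(itemset, k))
            if not all(s in level for s in subsets):
                continue                      # candidate pruning, Apriori-style
            if floor(eps_r * (k + 1)) == floor(eps_r * k):
                # 1-extension: no extra 0 allowed, intersect the subset supports
                trans = set.intersection(*(level[s] for s in subsets))
            else:
                # 0-extension: one extra 0 allowed, union the subset supports
                trans = set.union(*(level[s] for s in subsets))
            if len(trans) >= minsup_count:    # simplified support check
                nxt[itemset] = trans
        level = nxt
        results.update(level)
    return results

The 0/1-extension step is what lets the algorithm grow noise-tolerant transaction sets from the sets already computed for the sub-itemsets, avoiding repeated database scans.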
AFI vs. Exact Frequent Itemset

AFI Mining (εr, εc = 1/3; n = 8; minsup = 1)

AFI vs. Exact Frequent Itemset

Exact Frequent Itemset Mining (MinSup = 0.5, i.e. 4 of the n = 8 transactions)

Transaction  Items
T1           a, b, c
T2           a, b
T3           a, c
T4           b, c
T5           a, b, c, d
T6           d
T7           b, d
T8           a

1-candidates (Itemset: Support): a: 5, b: 5, c: 4, d: 3
Freq 1-itemsets: a: 5, b: 5, c: 4
2-candidates (Itemset: Support): ab: 3, ac: 3, bc: 3
Freq 2-itemsets: none
AFI vs. Exact Frequent Itemset – Result

• Approximate Frequent Itemset mining generates the frequent itemset {a, b, c}
• Exact Frequent Itemset mining cannot generate any frequent itemset in the presence of noise for the given minimum support value
Related Works

 Yang et al. (2001) proposed two error-tolerant models, termed weak error-tolerant itemsets (weak ETI, which is equivalent to weak AFI) and strong ETI (which is equivalent to AFI(εr, *)).
DRAWBACKS
• No efficient pruning technique – relies on heuristics and sampling techniques
• Does not preclude columns of 0s

 Steinbach et al. (2004) proposed a “support envelope”, a tool for exploration and visualization of the high-level structures of association patterns. A symmetric ETI model is proposed such that the same fraction of errors is allowed in both rows and columns.
DRAWBACKS
• Uses the same error coefficient for rows and columns, i.e. εr = εc
• Admits only a fixed number of 0s in the itemset; the fraction of noise does not vary with the size of the itemset sub-matrix
Related Works

 Seppänen and Mannila (2004) proposed mining dense itemsets in the presence of noise, where dense itemsets are itemsets with a sufficiently large sub-matrix that exceeds a given density threshold of attributes present.
DRAWBACKS
• Enforces the constraint that all sub-itemsets of a dense itemset must be frequent – it will fail to identify larger itemsets that have sufficient support when some of their sub-itemsets do not
• Requires repeated scans of the database
Experimental Results – Scalability

 Scalability
• Database of 10,000 transactions and 100 items
• Run time increases as noise tolerance increases
• Reducing the item-wise error constraint leads to a greater reduction in run time than reducing the transaction-wise error constraint
Experimental Results – Synthetic Data

 Quality testing for a single cluster
• Create data with an embedded pattern
• Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2
Experimental Results – Synthetic Data

 Quality testing for multiple clusters
• Create data with multiple embedded patterns
• Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2 (a data-generation sketch follows below)
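The synthetic-data setup described above is easy to reproduce. The sketch below is illustrative only; the matrix dimensions and block size are assumptions, not the paper's actual experimental parameters.

import numpy as np

rng = np.random.default_rng(0)

def embed_and_flip(n_rows=200, n_cols=50, block=(40, 10), p=0.05):
    """Create a 0/1 matrix with one embedded all-1s block, then flip each
    entry independently with probability p (Y = X xor Z, Z ~ Bern(p))."""
    X = np.zeros((n_rows, n_cols), dtype=int)
    X[:block[0], :block[1]] = 1                 # embedded pattern (block of 1s)
    Z = (rng.random(X.shape) < p).astype(int)   # Bernoulli(p) noise matrix
    return np.bitwise_xor(X, Z)

Y = embed_and_flip(p=0.1)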
Experimental Results – Real World Data

 Zoo Data Set
• The database contains 101 instances and 18 attributes
• All instances are classified into 7 classes, e.g. mammals, fish, etc.

 Exact: generated subsets of the animals in each class, then found subsets of their common features
 ETI (εr): identified "fins" and "domestic" as common features, which is NOT necessarily true
 AFI (εr, εc): only AFI was able to recover 3 classes with 100% accuracy
Discussion

 Advantages
• Flexibility of placing constraints independently along rows and columns
• Generalized Apriori technique for pruning
• Avoids repeated scans of the database by using 0/1 extension
Summary

 The paper outlines an algorithm for mining approximate frequent itemsets from noisy data
 It introduces
• an AFI model
• a generalized Apriori property for pruning
 The proposed algorithm generates more useful itemsets than existing algorithms and is also computationally more efficient
Thank You!
Extra Slides for Questions
Applying Apriori Algorithm
Database D

TID   Items
T1    a, c, d
T2    b, c, e
T3    a, b, c, e
T4    b, e
Min_sup = 2

Scan D → 1-candidates (Itemset: Sup): a: 2, b: 3, c: 3, d: 1, e: 3
Freq 1-itemsets: a: 2, b: 3, c: 3, e: 3

2-candidates: ab, ac, ae, bc, be, ce
Scan D → counting (Itemset: Sup): ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
Freq 2-itemsets: ac: 2, bc: 2, be: 3, ce: 2

3-candidates: bce
Scan D → Freq 3-itemsets: bce: 2
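For completeness, here is a compact exact Apriori sketch (not the paper's code) that reproduces the tables above on this four-transaction database.

from itertools import combinations

transactions = [set("acd"), set("bce"), set("abce"), set("be")]
min_sup = 2

def apriori(transactions, min_sup):
    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]) for i in items}
    frequent = {}
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}  # scan D
        freq = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(freq)
        # Join step: (k+1)-candidates whose k-item subsets are all frequent
        cands = {a | b for a in freq for b in freq if len(a | b) == len(a) + 1}
        level = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, len(c) - 1))}
    return frequent

print(apriori(transactions, min_sup))   # includes frozenset({'b','c','e'}): 2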
Noise Tolerant Support Pruning - Proof
0/1 Extensions – Proof

 The number of zeroes allowed in an itemset grows with the length of the itemset