Research issues on association rule mining


Research issues on association rule mining
Loo Kin Kong
26th February, 2003

Plan

- Recent trends on data mining
- Association rule interestingness
- Association rule mining on data streams
- Research directions
- Conclusion
Association rules

- First proposed in [Agrawal et al. 94]
- Given a database D of transactions, which contains only binary attributes
- For an itemset x, the support of x is defined as
  supp(x) = fraction of transactions in D containing x
- An association rule is of the form I → J, where:
  - I ∩ J = ∅
  - supp(I ∪ J) ≥ minsupp
  - supp(I ∪ J) / supp(I) ≥ minconf
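To make the definitions concrete, here is a minimal Python sketch (added for illustration; the toy database and the itemsets I and J are made up):

# Support and confidence over a toy transaction database.
D = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]

def supp(itemset):
    # Fraction of transactions in D containing the itemset.
    return sum(itemset <= t for t in D) / len(D)

I, J = {"bread"}, {"milk"}
print(supp(I | J))            # supp(I ∪ J) = 2/4 = 0.5
print(supp(I | J) / supp(I))  # conf(I → J) = 0.5 / 0.75 ≈ 0.667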
Recent trends on association rule mining

- Association rule interestingness
- Association rule mining on data streams
- Privacy-preserving mining [Rizvi et al. 02]
- New data structures to improve the efficiency of finding frequent itemsets [Relue et al. 01]
Association rule interestingness – overview

- Problems with association rule mining:
  - Too many rules mined
  - Mined rules may contain redundant or trivial rules
- Subjective approaches aim at:
  - Minimizing the human effort involved
- Objective approaches aim at:
  - Filtering out uninteresting rules based on some predefined interestingness measure
Subjective approaches

- Rule templates [Klemettinen et al. 94]
  - A rule template specifies which attributes may occur in the LHS and RHS of a rule
  - e.g., any rule of the form "…" & (any number of conditions) → "…" is uninteresting
- By elimination [Sahar 99]
  - For a rule r = A → B, r' = a → b is an ancestor rule if a ⊆ A and b ⊆ B; r' is said to cover r.
  - An ancestor rule can be classified as one of the following:
    - True-Not-Interesting (TNI)
    - Not-True-Interesting (NTI)
    - Not-True-Not-Interesting (NTNI)
    - True-Interesting (TI)
Objective approaches

- Statistical / problem-specific measures
  - Entropy gain, lift, …
- Pruning redundant rules by the maximum entropy principle [Jaroszewicz 02]
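As a concrete instance of one such measure (a sketch added here, with made-up numbers), lift compares how often I and J co-occur against what independence would predict:

# Lift of a rule I → J: supp(I ∪ J) / (supp(I) * supp(J)).
# A value near 1 suggests independence; well above 1 suggests association.
def lift(supp_ij, supp_i, supp_j):
    return supp_ij / (supp_i * supp_j)

print(lift(0.06, 0.10, 0.12))  # 5.0: I and J co-occur far more than chance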
Probability

- A finite probability space is a pair (S, P), in which:
  - S is a finite non-empty set
  - P is a mapping P: S → [0,1] satisfying Σs∈S P(s) = 1
- Each s ∈ S is called an event
- P(s), also denoted by ps, is the probability of the event s
- The self-information of s is defined as I(s) = −log P(s)
Entropy

- A partition U is a collection of mutually exclusive elements whose union equals S
  - Each element contains one or more events
- The measure of uncertainty that any event of a partition U would occur is called the entropy of the partition U:
  H(U) = −p1 log p1 − p2 log p2 − … − pN log pN
  where p1, …, pN are respectively the probabilities of the events a1, …, aN of U
- H(U) is maximal when p1 = p2 = … = pN = 1/N
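A minimal Python sketch of the entropy formula (added for illustration; the probability vectors are made up):

import math

def entropy(probs):
    # H(U) = -sum(p * log2(p)) over the probabilities of the partition.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: maximal, log2(4) = 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))      # skewed: lower, about 1.36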
The maximum entropy method (MEM)

- The MEM determines the probabilities pi of the events in a partition U, subject to various given constraints.
- By the MEM, when some of the pi's are unknown, they must be chosen so as to maximize the entropy of U, subject to the given constraints.
Definitions

- A constraint C is a pair C = (I, p), where:
  - I is an itemset
  - p ∈ [0,1] is the probability of I occurring in a transaction
- The set of constraints generated by an association rule I → J is defined as
  C(I → J) = {(I, supp(I)), (I ∪ J, supp(I ∪ J))}
- A rule K → J is a sub-rule of I → J if K ⊂ I
I-nonredundancy

- A rule I → J is considered I-nonredundant with respect to R, where R is a set of association rules, if:
  - I = ∅, or
  - I(CI,J(R), I → J) is larger than some threshold, where I(·) is either Iact(·) or Ipass(·), and CI,J(R) is the set of constraints induced by all sub-rules of I → J in R
Pruning redundant association rules

Input: A set R of association rules
1. For each singleton Ai in the database
2.   Ri = {∅ → Ai}
3.   k = 1
4.   For each rule I → Ai ∈ R, |I| = k, do
5.     If I → Ai is I-nonredundant w.r.t. Ri then
6.       Ri = Ri ∪ {I → Ai}
7.   k = k + 1
8.   Goto 4
9. R = ∪i Ri
Association rule interestingness: let's face it...

- "Interesting" is a subjective notion...
- ... in fact, one may argue that there does not exist a truly objective interestingness measure...
  - Domain knowledge is needed at some stage to determine what is interesting
  - This is because we are trying to model what is interesting
- ... but "objective" interestingness measures are still worth studying
  - They can act as a filter before any human intervention is required
Interesting or uninteresting?

- Consider the association rule:
  r = I → J, supp(r) = 1%, conf(r) = 100%
- A question: is r interesting or uninteresting?
- Considering the support and/or confidence of one single rule may not be enough to determine whether the rule is interesting
- So we try to compare a rule with some other rule(s)
Observation: comparing a family of rules

- For a maximal frequent itemset I:
  - The set of rules I' → {i}, where i ∈ I, I' ⊆ I \ {i}, forms a family of rules
- For example, for the maximal frequent itemset {abcde}, the rules
  abcd → e, with conf = supp({abcde}) / supp({abcd})
  abc → e, with conf = supp({abce}) / supp({abc})
  abd → e, with conf = supp({abde}) / supp({abd})
  ...
  are in a family
[Figure: the itemset lattice over {a, b, c, d, e}; the nodes containing "e" (blue half) sit above their counterparts without "e" (orange half)]
Observation: comparing a family of rules (cont'd)

- The blue half of the lattice is obtained by appending the item "e" to each node in the orange half
- The family of rules captures how the item "e" affects the support of the orange half of the lattice
- Idea:
  - We may compare the confidences of the rules in a family to find any "unusually" high or low confidences
  - We can use some statistical tests to perform the comparison; no need for complicated statistical models (e.g., MEM)
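One possible realization of this idea (a sketch added here; the slides leave the exact test open, so a simple leave-one-out outlier criterion is assumed, and the confidences below are made up):

# Flag rules whose confidence lies far from the mean of the rest of the family.
from statistics import mean, stdev

family = {"abcd->e": 0.90, "abc->e": 0.88, "abd->e": 0.91, "acd->e": 0.35}
for rule, conf in family.items():
    others = [c for r, c in family.items() if r != rule]
    mu, sigma = mean(others), stdev(others)
    if abs(conf - mu) > 2 * sigma:   # crude "unusually high or low" test
        print(f"{rule}: conf {conf:.2f} vs family mean {mu:.2f} -- unusual")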
Association rule mining on data streams

- In some new applications, data come as a continuous "stream"
  - The sheer volume of a stream over its lifetime is huge
  - Queries require timely answers
- Examples:
  - Stock ticks
  - Network traffic measurements
- A method for finding approximate frequency counts on data streams is proposed in [Manku et al. 02]
Goals of the paper

- Some notation:
  - Let N denote the current length of the stream
  - Let s ∈ (0,1) denote the support threshold
  - Let ε ∈ (0,1) denote the error tolerance
- The algorithm ensures that:
  - All itemsets whose true frequency exceeds sN are reported (i.e., no false negatives)
  - No itemset whose true frequency is less than (s − ε)N is output
  - Estimated frequencies are less than the true frequencies by at most εN
The simple case: finding frequent items

- Each transaction in the stream contains only 1 item
- Two algorithms were proposed, namely:
  - the Sticky Sampling algorithm
  - the Lossy Counting algorithm
- Features of the algorithms:
  - Sampling techniques are used
  - The frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
  - For Lossy Counting, all frequent items are reported
Lossy Counting Algorithm

- The incoming data stream is conceptually divided into buckets of 1/ε transactions each
- Counts are kept in a data structure D
- Each entry in D is of the form (e, f, Δ), where:
  - e is the item
  - f is the frequency of e in the stream since the entry was inserted into D
  - Δ is the maximum possible count of e in the stream before e was added to D
Lossy Counting Algorithm (cont'd)

1. D ← ∅; N ← 0
2. w ← 1/ε; b ← 1
3. e ← next transaction; N ← N + 1
4. if (e, f, Δ) exists in D do
5.   f ← f + 1
6. else do
7.   insert (e, 1, b − 1) into D
8. endif
9. if N mod w = 0 do
10.   prune(D, b); b ← b + 1
11. endif
12. Goto 3

where:
D: the set of all counts
N: current length of the stream
e: transaction (itemset)
w: bucket width
b: current bucket id

function prune(D, b)
  for each entry (e, f, Δ) in D do
    if f + Δ ≤ b do
      remove the entry from D
    endif
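A direct Python transcription of the pseudocode above, for the single-item case (added for illustration; epsilon, s and the example stream are arbitrary choices):

import math

def lossy_count(stream, epsilon):
    D = {}                          # e -> (f, delta)
    w = math.ceil(1 / epsilon)      # bucket width (step 2)
    b, N = 1, 0                     # current bucket id, stream length
    for e in stream:                # step 3
        N += 1
        if e in D:                  # steps 4-5
            f, delta = D[e]
            D[e] = (f + 1, delta)
        else:                       # steps 6-7
            D[e] = (1, b - 1)
        if N % w == 0:              # steps 9-11: prune at bucket boundary
            D = {x: (f, d) for x, (f, d) in D.items() if f + d > b}
            b += 1
    return D, N

eps, s = 0.1, 0.2
D, N = lossy_count(list("abcabcabda" * 50), eps)
# Report items whose estimated frequency is at least (s - eps) * N.
print({e: f for e, (f, d) in D.items() if f >= (s - eps) * N})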
Lossy Counting

- Lossy Counting guarantees that:
  - When a deletion occurs, b ≤ εN
  - If an entry (e, f, Δ) is deleted, fe ≤ b, where fe is the actual frequency count of e
  - Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN
  - Finally, f ≤ fe ≤ f + εN
The more complex case: finding frequent itemsets

- The Lossy Counting algorithm is extended to find frequent itemsets
- Transactions in the data stream contain any number of items
- Essentially the same as the case for single items, except:
  - Multiple buckets (β of them, say) are processed in a batch
  - Each entry in D is of the form (set, f, Δ)
  - Transactions read in are (wisely) expanded to their subsets
Association rule mining on data streams: food for thought

- Challenges of mining data streams:
  - Fast updates
  - Data are usually not permanently stored (but may be buffered)
  - Fast response to queries
  - Minimal resources (e.g., the number of counts kept)
- Possible interesting problems concerning association rule mining on data streams:
  - More efficient/accurate algorithms for finding association rules on data streams
  - Mining changes in frequency counts
The lattice structure

- A bottleneck of the algorithm proposed in [Manku et al. 02] is that it needs to expand a transaction to its subsets for counting
  - For example, for a transaction {abcde}, we may need to count the itemsets {a}, {b}, {c}, {d}, {e}, {ab}, {ac}, ...
  - Hence updates are expensive (although queries can be fast)
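The blow-up is easy to quantify (an illustrative Python snippet): a transaction with n items has 2^n − 1 non-empty subsets.

from itertools import combinations

t = sorted(set("abcde"))
subsets = [frozenset(c) for r in range(1, len(t) + 1)
           for c in combinations(t, r)]
print(len(subsets))  # 31 = 2^5 - 1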
The lattice structure (cont’d)
[Figure: the full itemset lattice over {a, b, c, d, e}]
Conclusion

- Both association rule interestingness and mining on data streams are challenging problems
- Research on rule interestingness can make association rule mining a more efficient tool for knowledge discovery
- Association rule mining on data streams is an emerging application area and a promising direction for research
References

[Agrawal et al. 94] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. VLDB 1994.
[Jaroszewicz 02] S. Jaroszewicz and D. A. Simovici. Pruning Redundant Association Rules Using Maximum Entropy Principle. PAKDD 2002.
[Klemettinen et al. 94] M. Klemettinen et al. Finding Interesting Rules from Large Sets of Discovered Association Rules. CIKM 1994.
[Manku et al. 02] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002.
[Relue et al. 01] R. Relue, X. Wu and H. Huang. Efficient Runtime Generation of Association Rules. CIKM 2001.
[Rizvi et al. 02] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. VLDB 2002.
[Sahar 99] S. Sahar. Interestingness Via What Is Not Interesting. KDD 1999.
Q&A