Dynamically Maintaining Frequent Items Over A Data Stream

Download Report

Transcript Dynamically Maintaining Frequent Items Over A Data Stream

Finding Recent Frequent
Itemsets Adaptively over
Online Data Streams
J. H, Chang and W.S. Lee, in Proc. Of the
9th ACM International Conference on
Knowledge Discovery and Data Ming, 2003.
Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date: 2004.8.12
1
Introduction
• This paper proposes a method of finding
recent frequent itemsets:
– Significant itemsets are maintained by a prefixtree lattice structure called monitoring lattice.
– Decaying the old occurrence count of each
itemset as time goes by.
– Minimize the number of significant itemsets:
• delayed-insertion
• pruning operations
2
Preliminaries (1)
• Data Stream can be defined:
–
–
–
–
I={i1, i2, …, in}:a set of current items.
e:itemset, a set of item.
Tid:transaction id, Tk generate at the kth turn.
Dk=<T1, T2, …, Tk>, When new transaction Dk
is generated.
– |D|k:the number of transactions in Dk.
– Ck(e):the number of transactions in Dk that
contain the itemset e.
– Sk(e):Support of itemset e in Dk.
3
Preliminaries (2)
• Decay rate:the reducing rate of a weight
for a fixed decay-unit.
• d=b-(1/h), (b>1, h≧1, b-1≦d<1)
– decay-unit:the chunk of information to be
decayed together.
– decay-base b:the amount of weight
reduction per a decay-unit and greater than 1.
– decay-base-life h:defined by the number of
decay-units that makes the current weight be
b-1.
4
Preliminaries (3)
• The total number of transactions |D|k in
the current data stream Dk:
1

if k  1
| D |k  
| D | k 1 d  1 if k  2
– The value of |D|k converges to 1/(1-d) as the
value k increases infinitely.
• The count Ck(e) of an itemset e in the
current data stream Dk:
1 if e  Tk
C k (e)  C k 1 (e)  d  Wk (e), Wk (e)  
0 otherwise
5
Count Estimation of
an itemset (1)
• The maximum possible count of an
itemset is estimated by the minimum
value among the maximum possible
counts of all of its subsets.
6
Count Estimation of
an itemset (2)
• Definition 1:
– P(e) :a set of itemset e’s subsets
– Pm (e) :a set of e’s m-subsets
– PmC (e) : a set of counts for e’s m-subsets
• Definition 2:
– Union-itemset e1  e2 is composed of all items
that are members of either e1 or e2.
– Intersection-itemset e1  e2 is composed of all
items that are members of both e1 and e2.
7
Count Estimation of
an itemset (3)
• exclusively distributed (LED): the
items of an itemset appear together in as
many transactions as possible.
• most exclusively distributed (MED):
the items of an itemset appear exclusively
as many transactions as possible.
• The maximum count of n-itemset e:
C
max
(e)  min( P (e))
C
n1
8
Count Estimation of
an itemset (4)
• Two itemsets e1, e2:
 max( 0, C (e1 )  C (e2 )  C (e1  e2 )) if e1  e2  
C (e1  e2 )  
if e1  e2  
max( 0, C (e1 )  C (e2 ) | D |)
The minimum count of Cmin(e) can be
min
•
estimated by (n-1)-subset union:
C min( (e)  max({ C min ( i   j ) |  i ,  j  Pn1 (e) and i  j})
i  j  e
• Estimation error:
– E(e)=Cmax(e)-Cmin(e)
9
estDec Method (1)
• Every node in a monitoring lattice
maintains a triple (cnt, err, MRtid) for
its corresponding itemset e:
– cnt:count of e.
– err:maximum error count of e
– Mrtid: the most recent transacrion id
that contain e
10
estDec Method (2)
• estDec Method is composed of four
phase:
– Phase
– Phase
– Phase
– Phase
phase
Ⅰ:parameter updating phase
Ⅱ:count updating phase
Ⅲ:Delayed insertion phase
Ⅳ:frequent itemset selection
11
estDec Method (3)
Phase I:|D|k is updated.
• Phase II:the counts of those itemsets in ML that
appear in Tk are updated.
– Sprn:threshold for pruning.
– If a 1-itemset is pruned from ML, it is impossible to
estimate its count later.
12
estDec Method (4)
• Phase III: Find new itemset that has high possibility to
become frequent. Two cases insert new itemset to a ML:
– new 1-itemset, the cnt of 1-itemset is actual.
– Itemset e Cmax(e)/|D|k ≧ Sins, Sins:threshold for
delayed-insertion.
• cntt_for_subsets=(1-d|e|-1)/(1-d)
• max_xnt_before_subsets=Sins*(|D|k-(|e|-1))*d|e|-1)
• Cupper(e)=Max_xnt_before_subsets+ Cntt_for_subsets
13
estDec Method (5)
• Phase IV:produces all current frequent
itemsets in ML.
– itemset e is frequent if its current
support (cnt * d(k-MRtid))/|D|k is greater
than Smin
– its current support error:
• (err*d(k-MRtid))/|D|k
14
estDec Method (6)
• Force-pruning operation:
– all insignificant itemsets in ML can be
pruned
– perform when the current size of ML
reaches a threshold.
15
Experimental (1)
• Performance of the estDec method for the data set
T10.I4.D1000K
– Sins is denoted p%, the actual value=Smin*p%.
– Force-pruning operation perform in every 1,000
transactions.
– (a) memory usage (b) performance time of Phases I~III
(c) performance time of Phases IV
16
Experimental (2)
• Accuracy of mining result
– Average support error
• ASE(RestDec|RdApriori)
17
Experimental (3)
• The adaptability of the estDec method for the
change of information in a data stream.
– Coverage rate CR(X)
• |R|:total nmber of frequent itemdets in ML
18