frequent itemset

Download Report

Transcript frequent itemset

A Sliding Window Method for
Finding Recently Frequent Itemsets
over Online Data Streams
Joong Hyuk Chang and won Suk Lee, Proc. of the 9’th
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (SIGKDD’03)
Adviser: Jia-Ling Koh
Speaker: Yu-ting Kung
Introduction
• Most of mining algorithms or frequency
approximation algorithm for a data
stream don’t able to extract the recent
change of information in a data stream
adaptively.
2
Introduction (Cont.)
• In this paper,
– Propose a sliding window method of finding
recently frequent itemsets over an online
data stream
3
Sliding Window Method
• Idea:
– Define significant itemset:
• An itemset whose current support is greater than
or equal to an error parameter  is a significant
itemset
– Monitoring only significant itemsets
4
SW Method (Cont.)
• Two different phases
– Window initialization phase:
• Actives while the number of transactions generated so
far in a data stream is less than or equal to a predefined
window size.
• Insert new transaction in CTL (current transaction list)
• No extracted transation
– Window sliding phase:
• Actives after the CTL becomes full
• Insert new transaction in CTL (current transaction list)
• The oldest transaction is extracted from the CTL
5
SW Method (Cont.)
•
Five steps:
1. Appending a transaction
2. Counting updating and insertion of new
itemsets
3. Extracting a transaction
4. Pruning of itemsets
5. Frequent itemset selection
6
Step1: Appending a transaction
• Content
– The transaction Tk is appended to the
current transaction list CTL
7
Step2: Counting updating and
insertion of new itemsets
• Content
– For an itemset e that appears in the Tk
with an entry (e, f, t):
f: count of the itemset
t: TID which makes the itemset
be newly inserted into the
monitoring lattice
Case 1 its corresponding node is in the monitoring lattice:
 e.f = e.f + 1
Case 2 its corresponding node isn’t in the monitoring lattice:
 e is inserted into the monitoring lattice with (e, 1, k)
8
Step3: Extracting a transaction
• When this step is done?
– Only in the window sliding phase
• Content
– Extract the oldest transaction in CTL
– Update the entry (e, f, t) of this node in the
monitoring lattice:
If t <= wfirst
e.f = e.f -1;
Wfirst : the TID of the first
transaction of the current
window
Else
e.f = e.f;
9
Step4: Pruning of itemsets
• Therom:
– Given an error parameter  , the maximum
possible count Ckmax (e) of an itemset with its
entry (e, f, t) is found as follows:
C
max
k
if t  w first
f
(e)  
 f  (t  w first )    otherwise
10
Step4: Pruning of itemsets
• When this step is done?
– Periodically or when it is needed
• Content
– For an itemset e with entry (e, f, t) in the
monitoring lattice:

max
w
If Ck (e) D k   ,
Then it can be regarded as an insignificant itemset
Prune it !!
11
Step5: Frequent itemset selection
• When this step is done?
– The up-to-date set of recently frequent
itemsets is requested.
• Content
– For an itemset e with an entry (e, f, t) in the
monitoring lattice:
max
w
If its Ck (e) D  Smin ,
k
 it is a frequent itemset !!
12
For Example
• Data Stream
Tid
1
Tid
A
B
Items
1
2
3
4
5
A
B
D
A
B
A
B
A
Tid
2
3
4
5
6
7
8
9
10
A
B
D
A
B
A
B
A
A
C
A
E
A
C
E
A
E
Items
Items
(a) D1
Tid
1
(c) D10
(b) D5
1
2
3
4
5
6
7
8
9
A
B
D
A
B
A
B
A
A
C
A
E
A
C
E
Items
1
0
1
1
A
E
A
D
Tid
1
2
3
4
5
6
7
8
9
A D
B
A
B
A
B
A
A
C
A
E
A
C
E
1
0
1
1
1
2
1
3
1
4
1
5
A
E
A
D
B
A
B
A
Items
(d) D11
(e) D15
13
For Example (Cont.)
• Initial value
–
–
–
–
Smin = 0.5
 = 0.25 (0.5 x Smin)
Window size = 10
Step4 is performed in every 5 transactions.
14
Step1,2
e
f
t
A
B
AB
1
1
1
1
1
1
(a) After T1 (AB)
Step1,2
e
f
t
A
B
AB
D
1
1
1
1
1
1
1
2
f
t
A
B
AB
D
2
2
2
1
1
1
1
2
Step1,2
(b) After T2 (D)
Step1,2
(b.2) After T4 (AB)
e
Step1,2
(b.1) After T3 (AB)
Step4
(b.3) After T5 (A)
(b.3) After prning
D is pruned from the
monitoring lattice, becasue
Step1,2
recursively
(C) After T10 (AE)
15
e
f
t
A
B
C
D
E
AB
AC
AE
AD
9
3
2
1
3
3
2
2
1
1
1
6
11
7
1
6
7
11
Step1,2
Step3
Step4
(d) After T11 (AD)
Step1,2,3,4
(e) After Step4 for T15
16
Experiment Result
• Data souce
– T5.I4.D1000K-I
– T5.I4.D1000K-II
17
Experiment (Cont.)
• Memory usage in the window sliding
phase
18
Experiment (Cont.)
• Average support error
– Measure the relative accuracy of the proposed
method
– When two sets of mining results R1  {( ei , S1 (ei )) | S1 (ei )  Smin }
and R  {(e , S (e )) | S (e )  S } are given for the same
data set, the average support error ASE(R2|R1)
is defined:
2
2
2
ASE ( R2 | R1 ) 
2
2
2
 S (e
1 m
em R1  R1  R2
min
)
{| S (e
em R1  R2
2
m
)  S1 (em ) |} 
 S (e
2
em R2  R1  R2
m
)
R1
19
Experiment (Cont.)
• Average support error of the mining result of the
proposed method with respect to that of the
Apriori algorithm on the transactions within the
current window
20
Experiment (Cont.)
• The average processing time(Step1-Step4)
of the sliding window method in each
interval
21
Experiment (Cont.)
• The average processing time for
Step5
22
Experiment (Cont.)
• The memory usage of the window
sliding phase by varying the size of
the window
23
Experiment (Cont.)
• The average processing time of the
sliding window method by varying the
size of a window
24
Conclusion
•
The result of the proposed method guarantees
the following:
–
–
–
All itemsets whose true supports are greater than or
equal to a minimum support Smin are found
No itemset whose true support is less than (Smin-  )
is found as a recently frequent itemset
For each itemeset, the difference between its
estimated support and its true support is less than 
25