Mining Closed Frequent Itemsets in Data Streams

Download Report

Transcript Mining Closed Frequent Itemsets in Data Streams

CFI-Stream: Mining Closed
Frequent Itemsets in Data
Streams
Nan Jiang,Le Gruenwald
SIGKDD’06
報告者:林靜怡
2006/10/04
Introduction




mining Closed frequent itemsets
computes and maintains closed
itemsets online and incrementally
perform the closure checking
output the current closed frequent
itemsets in real time based on users’
specified thresholds
Definition




D:data stream
I = { i 1 , i 2 , …, in } :a set of n elements,
called items
T: subsets of all the transactions
X: subsets of all the items appearing
in a data stream
Definition




C(X):the smallest closed set containing
X
Definition 1
An itemset X is said to be closed if and
only if C(X)= f(g(X)) = f•g(X) = X
Algorithm





CFI-Stream algorithm
DIrect Update (DIU) tree
perform the closure checking online
over a data stream sliding window
Conditions need to check for closed
itemsets
check when performing addition and
deletion operations on the DIU tree
DIU tree


maintain the current closed itemsets
k levels in the DIU tree, each level i
stores the closed i-itemsets
DIU tree

Each node in the DIU tree stores



a closed itemset
its current support information
links to its parent and children nodes
Add a Transaction to the DIU Tree
T1:original transaction set
t:new arrived transaction
 Conditions to Check for Closed Itemsets
(1)
t is in the T1, if the largest itemset X it
contains is not currently in the DIU tree
->check for all X’s subsets Y, which are in
T1
(2)
when t is not in T1, for each its subset
Y, if Y is in T1, we need to check
Closure Checking for Addition
1
C,D
2
A,B
3
A,B,C
4
A,B,C
ABC
C 32
2
1
3
AB
21
11
CD
CD
Delete a Transaction in DIU
Tree
Conditions to Check for Closed Itemsets
 When the number of the transactions
with same itemset of X is equal to zero,
if Y is a subset of X, and Y is a closed
itemset in the original transaction set

Closure Checking for Deletion
1
C,D
2
A,B
3
A,B,C
4
A,B,C
C
23
AB
ABC
2
32
1
CD
Experiment

Synthetic datasets
T10.I6.D100K and T5.I4.D100K
Experiment
Experiment