Finding Maximal Frequent Itemsets over Online Data Streams

Transcript Finding Maximal Frequent Itemsets over Online Data Streams

Finding Maximal Frequent
Itemsets over Online Data
Streams Adaptively
Daesu Lee,Wonsuk Lee
IEEE,ICDM’05
報告者：林靜怡
2006/05/05
Introduction



Confine the memory usage of a data
mining process
estDec
- prefix tree
estDec+
- CP-tree
CP-tree




Compressed-prefix tree
Effectively used in finding frequent or
maximal frequent itemsets
a node of a CP-tree can maintain the
information of several itemsets together
size of a CP-tree can be flexibly
controlled by merging or splitting nodes
CP-tree(Conti.)

Two consecutive nodes by a prefix tree
are merged in a CP-tree when the
current support difference between
their corresponding itemsets is less than
or equal to a merging gap threshold
δ (0,1)
Definition








Pk：a prefix tree
S ：a subtree of Pk
v：the itemset represented by the root of S
j：an itemset represented by a node of S
δ：merging gap threshold
|S|：number of nodes in S
：the total number of transactions in the current data stream
e
e
subtree S is a mergeable subtree and compressed
into a node of a CP-tree Pk that is equivalent to Qk
CP-node structure


A node m of CP-tree Qk maintains four entries
m(τ, π, , )
Item-listτ：
- m.τ[1]：root node of S，the shortest
itemset of the node m and denoted by
- m.τ[|S|]：the right-most leaf node in the
lowest level of S，the longest itemset of
the node m and denoted by
CP-node structure



Parent-index list π
- y’s parent is x
-
Largest count
- the current count of the shortest itemset
Smallest count
- the current count of the longest itemset
CP-tree
|10-9|/10 = 0.1
0.1 <= 0.2
|10-5|/10 = 0.5
0.5 > 0.2
Merged-count Estimation




Given the item-list
of a node m in
a CP-tree
：the itemset represented by ij (1<=j<=n)
the current counts of the remaining itemsets can be
estimated by a formula
f(m, j)：a count estimation function that can model
the count
in terms of the counts
the shortest and longest itemsets
and
of
CP-tree Maintenance



：the parent node of m
Node-merge
- a new significant itemset e is identified by
the inserting-count estimation process, so
that a new node for the itemset needs to be
inserted as a child of the node m.
Node-split
-
Finding Maximal Frequent
Itemsets


estDec+ Method
Adaptive Memory Utilization
estDec+ Method




Parameter updating phase
Count updating & node restructuring phase
Itemset insertion phase
Maximal frequent itemset selection phase

Parameter updating phase
The total number of transactions in the current
data stream
is updated.

Count updating & node restructuring phase
：m’s parent
prune split merge -
Itemset insertion phase



insert any new significant itemset
which has not been maintained in
any insignificant item whose current
support is less than
is filtered out in the
transaction
=>
Traversed
to find out any new significant
itemset induced by the items in
Maximal frequent itemset
selection phase

retrieves all the currently frequent or
maximal frequent itemsets by traversing the
monitoring tree

Force-pruning
- all the nodes whose largest counts
are less than
- performed periodically
Adaptive Memory Utilization



In order to minimize the estimation
error caused by the merged-count
estimation, it is very important to keep
the value ofδas small as possible.
The size of a CP-tree is inversely
proportional to the value of δ.
the value of should be dynamically
changed in the parameter update phase
Adaptive Memory Utilization




：the upper bound of desired
memory usage
：the lower bound of desired
memory usage
：Confined memory space
：current memory usage
Experiment






Data sets：T10.I4.D1000K and WebLog
1.8 GHz Pentium PC
512 MB Memory
Linux 7.3
= 0.001
Count estimation function f(m,j)
Experiment
Experiment
Experiment

Finding Maximal Frequent Itemsets over Online Data Streams

Transcript Finding Maximal Frequent Itemsets over Online Data Streams

Directory