Transcript Document

LCM ver.3: Collaboration of
Array, Bitmap and Prefix Tree for
Frequent Itemset Mining

Takeaki Uno      National Institute of Informatics, JAPAN
Masashi Kiyomi   National Institute of Informatics, JAPAN
Hiroki Arimura   Hokkaido University, JAPAN

20/Aug/2005 Open Source Data Mining ’05
Computation of Pattern Mining

For frequent itemsets and closed itemsets, enumeration methods are almost optimal:

  TIME = (#iterations) × (time of one iteration) + I/O

(#iterations) is not much larger than (#solutions); it is linearly bounded
(linear time in #solutions, polynomial delay). The time of one iteration
depends on coding techniques:
• frequency counting
• data structure reconstruction
• enumeration algorithms
• closure operation
• pruning, ...

Goal: clarify the features of these techniques on real-world datasets.
For which cases (parameters) is which technique good?
"Theoretical intuitions/evidence" are important.

We focus on the data structure and the computation within an iteration.
Motivation

• Some data structures have been proposed for storing huge datasets
  and accelerating the computation (frequency counting)
• Each has its own advantages and disadvantages

1. Bitmap
   Good: dense data, large support
   Bad:  sparse data, small support

2. Prefix Tree
   Good: non-sparse data, structured data
   Bad:  sparse data, non-structured data

3. Array List (with deletion of duplications)
   Good: non-dense data
   Bad:  very dense data

[figure: one example database shown as a bitmap, as a prefix tree, and as array lists]

Datasets have both a dense part and a sparse part.
How can we fit?
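To make the trade-off concrete, here is a minimal sketch (my illustration, not the authors' code) of one toy database stored both as bitmaps and as array lists, with the same frequency count done in each representation:

```python
# Sketch (not the authors' code): one toy database stored as bitmaps
# and as array lists, with a frequency count done in each representation.
ITEMS = {"a": 0, "b": 1, "c": 2, "d": 3}          # item -> bit position
transactions = [{"a", "b", "c"}, {"a", "c", "d"}, {"c", "d"}]

# 1. Bitmap: one machine word per transaction, bit i set iff item i occurs.
bitmaps = [sum(1 << ITEMS[x] for x in t) for t in transactions]

# 3. Array list: sorted item indices per transaction.
array_lists = [sorted(ITEMS[x] for x in t) for t in transactions]

# Frequency of the pattern {a, c} in both representations:
pattern = (1 << ITEMS["a"]) | (1 << ITEMS["c"])
freq_bitmap = sum((b & pattern) == pattern for b in bitmaps)   # AND + compare
freq_lists = sum({ITEMS["a"], ITEMS["c"]}.issubset(t) for t in array_lists)
```

Both counts agree; the bitmap version does one word-AND per transaction, which is why it wins on dense data but wastes space on sparse data.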
Observations

• Usually, databases satisfy a power law
  → the part consisting of a few items is dense, and the rest is very sparse
  [figure: transactions × items matrix with a small dense block and a large sparse region]

• Using reduced conditional databases, in almost all iterations the size of
  the database is very small (it shrinks quickly with the recursion depth)
  → quick operations for small databases are very efficient
Idea of Combination

• Use a bitmap and array lists for the dense and sparse parts
• Use a prefix tree of constant size for frequency counting

• Choose a constant c
• F = the c items of largest frequency
• Split each transaction T into two parts:
    a dense part composed of the items in F,
    a sparse part composed of the items not in F
• Store the dense part in a bitmap and the sparse part in an array list
→ We can take all their advantages

[figure: transactions split into a c-item dense block and a sparse remainder]
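The splitting step can be sketched as follows; `split_database`, the tuple layout, and the item encoding are my assumptions for illustration, not LCM v3's actual interface:

```python
# Hypothetical sketch of the splitting step; split_database and the
# (dense_bitmap, sparse_list) layout are my assumptions, not LCM v3's code.
from collections import Counter

def split_database(transactions, c):
    freq = Counter(x for t in transactions for x in t)
    F = [x for x, _ in freq.most_common(c)]       # c items of largest frequency
    bit_of = {x: i for i, x in enumerate(F)}
    split = []
    for t in transactions:
        dense = sum(1 << bit_of[x] for x in t if x in bit_of)   # bitmap word
        sparse = sorted(x for x in t if x not in bit_of)        # array list
        split.append((dense, sparse))
    return F, split
```

With c chosen as a small constant, the dense words feed the 2^c-node complete prefix tree, and everything else stays in plain array lists.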
Complete Prefix Tree

We use the complete prefix tree:
the prefix tree including all patterns.

[figure: complete prefix tree over items a, b, c, d; nodes labeled by the
bit vectors 0000 .. 1111]

The parent of a pattern is obtained by clearing its highest bit
(Ex. 010110 → 000110)
→ no pointer is needed

Ex) transactions {a,b,c}, {a,c,d}, {c,d}

We construct the complete prefix tree for the dense parts of the transactions.
If c is small, then its size 2^c is not huge.
Complete Prefix Tree (cont.)

→ Any prefix tree is a subtree of the complete prefix tree.

Ex) transactions {a,b,c,d}, {a}, {a,d}
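The pointer-free parent rule is a one-line bit trick; this is a sketch of the rule from the slide, not the authors' code:

```python
# The parent of a nonempty pattern, encoded as a bit vector, is obtained by
# clearing its highest set bit; no parent pointer needs to be stored.
def parent(pattern: int) -> int:
    return pattern & ~(1 << (pattern.bit_length() - 1))
```

For the slide's example, `parent(0b010110)` gives `0b000110`.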
Frequency Counting

• Frequency of a pattern (vertex) = # descendant leaves
• Occurrences after adding item i = patterns whose ith bit is 1

→ A bottom-up sweep is good:
   linear time in the size of the prefix tree

[figure: complete prefix tree with descendant-leaf counts at each node]
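A sketch of the bottom-up sweep, assuming the complete prefix tree is stored as a flat array of size 2^c indexed by pattern bit vectors (my layout, not necessarily LCM v3's):

```python
# Sketch of the bottom-up sweep over the complete prefix tree, stored as a
# flat array of size 2^c indexed by pattern bit vectors (assumed layout).
def sweep(counts):
    # Visit nodes from the largest pattern down to 1, adding each node's
    # count to its parent: afterwards counts[p] = # descendant leaves of p.
    for p in range(len(counts) - 1, 0, -1):
        counts[p & ~(1 << (p.bit_length() - 1))] += counts[p]
    return counts

# Items a..d on bits 0..3; dense parts of transactions {a,b,c}, {a,c,d}, {c,d}
counts = [0] * (1 << 4)
for t in (0b0111, 0b1101, 0b1100):
    counts[t] += 1
sweep(counts)
```

One pass over the 2^c array, i.e. linear time in the size of the prefix tree, as the slide claims.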
“Constant Size” Dominates

• How many iterations take a "constant-size database" as input?

                                 Small supp. (1M solutions)  Large supp. (1K solutions)
  dataset                        const. DB / strategy chg.   const. DB / strategy chg.
  pumsb                          99.9%     / 0.1%            99%       / 0.2%
  pumsb*, connect, chess,
    accidents                    99%       / 0.1 - 0.5%      95 - 99%  / 0.2 - 1%
  kosarak, mushroom,
    BMS-WebView2, T40I10D100K    90 - 99%  / 1 - 2%          30 - 90%  / 2 - 4%
  retail, BMS-pos, T10I4D100K    30 - 90%  / 3 - 5%          30 - 90%  / 3 - 5%

"Small iterations" dominate the computation time;
"strategy change" is not a heavy task.
More Advantages

• Reconstruction of prefix trees is a heavy task
  → the complete prefix tree needs no reconstruction
• Coding prefix trees is not easy
  → the complete prefix tree is easy to code
• The radix sort used for detecting identical transactions is heavy
  when the data is dense
  → bitmaps for the dense parts accelerate the radix sort
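To illustrate why packed dense parts help duplicate detection, here is a hedged sketch; a real implementation would radix-sort the words, while this uses a dictionary and sorting for brevity:

```python
# Illustration only: with the dense part packed into a single integer,
# detecting identical transactions costs one word comparison per dense part
# instead of a per-item pass; here duplicates are merged into weights.
def merge_duplicates(split_transactions):
    # split_transactions: list of (dense_bitmap, sparse_items_tuple)
    weights = {}
    for key in split_transactions:
        weights[key] = weights.get(key, 0) + 1    # identical rows collapse
    return sorted(weights.items())                # sorted for reproducibility
```

The point is not the hashing but that the packed dense word makes the expensive comparisons cheap.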
For Closed/Maximal Itemsets

• Compute the closure/maximality by storing the previously obtained itemsets
  → no additional function is needed
• Depth-first search (closure-extension type)
  → needs the prefix of each itemset

By taking the intersection/weighted union of the prefixes at each node of the
prefix tree, we can compute them efficiently (from LCM v2).

[figure: complete prefix tree with a prefix attached to each node]
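For reference, the closure operation itself (the standard definition, not LCM's optimized prefix-based computation) can be sketched as:

```python
# Standard closure operation (not LCM's prefix-based optimization):
# closure(P) = intersection of all transactions that contain P.
def closure(pattern, transactions):
    occ = [t for t in transactions if pattern <= t]   # occurrence set of P
    return set.intersection(*occ) if occ else set(pattern)
```

For example, with transactions {a,b,c}, {a,c,d}, {c,d}, the closure of {a} is {a,c}.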
Experiments

We applied the data structure to LCM2.
CPU, memory, OS: Pentium4 3.2GHz, 2GB memory, Linux
Compared with: FP-growth, afopt, MAFIA, PatriciaMine, kDCI,
nonordfp, aim2, DCI-closed
(all of these marked high scores at the FIMI04 competition)
14 datasets from the FIMI repository

Memory usage decreased to half for dense datasets,
but not for sparse datasets.
Experimental Results

[results charts omitted from the transcript]
Discussion and Future Work

• The combination of bitmaps and array lists reduces memory usage
  efficiently for dense datasets
• Using prefix trees for a constant number of items is sufficient to
  speed up frequency counting on non-sparse datasets
• The data structure is orthogonal to other methods for closed/maximal
  itemset mining: maximality checks, pruning, closure operations, etc.

Future work: other pattern mining problems
• Bitmaps and prefix trees are not so efficient for semi-structured data
  (semi-structured patterns have huge variation, and are hard to represent
  by bits or to share)
• Simplify the techniques so that they can be applied easily
• Stable memory allocation (no need for dynamic allocation)