Transcript Document

Approach to Data Mining from
Algorithm and Computation
Takeaki Uno, ETH Switzerland, NII Japan
Hiroki Arimura, Hokkaido University, Japan
Frequent Pattern Mining
・ Data mining is an important tool for analyzing data in many
scientific and industrial areas
・ The aim of data mining is to
“find something interesting or valuable”
・ But we do not know in advance what is interesting or valuable
・ So we give some criteria that interesting or valuable things should
satisfy, and find all patterns satisfying them
Image of Pattern Mining
・ Pattern mining is the problem of finding all patterns in a given
(possibly structured) database that satisfy the given constraints
[Figure: examples of structured databases (an XML database of persons and families, a set of chemical compounds) from which interesting patterns are extracted]
Frequent pattern mining is the problem of enumerating all patterns that
appear frequently, i.e., at least a given threshold number of times, in the database
Approach from…
・ In the real world, the input database is usually huge and the output
patterns are even more numerous, so efficient computation is very important
・ Much research has been done, but much of it is based on “databases,
data engineering, modeling”, not algorithms
Ex.) how to compress the data, how to execute queries fast,
which model is good, etc.
・ Here we want to separate the problems:
from the algorithmic view, what is important? what can we do?
Distinguish the Focus and the Problems
・ “My algorithm is very fast for these datasets”
- but the data is very artificial, or contains only a few items…
・ “The algorithm might not work for huge datasets”
- it is difficult to be fast for both small and huge inputs
・ We would like to distinguish the techniques and problems:
- scalability
- I/O
- huge datasets
- data compression
・ These techniques should be “orthogonal”
From the Algorithm Theory…
・ Here we focus only on the case where the input fits in memory

TIME = (#iterations) × (time per iteration) + I/O

- scalability: output-sensitive computation time
(bad if it takes a long time for a small output)
- memory use should depend only on the input size
- reduce the computation time of each iteration
- reduce the input of each iteration (bottom-wideness); this is so important!!!
Bottom Wideness
・ Enumeration algorithms usually have a recursive tree structure,
so there are many iterations at the deeper levels
・ A procedure reduces the input before passing it to the recursive calls
(input size ≒ time of an iteration)
・ The total computation time is roughly halved by just one reduction of the input
・ Recursively reducing the input ⇒ the computation time is greatly reduced
Advantage of Bottom Wideness
・ Suppose the recursion tree has exponentially many iterations at the lower
levels (ex. 2 × (#iterations at level i) ≦ #iterations at level i+1)
- if an iteration near the root costs O(n^3) and an iteration at the bottom costs O(1),
the amortized computation time is O(1) per output!!
- if the root costs O(n^5) and the bottom costs O(n),
the amortized computation time is O(n) per output!!
Computation time per output depends only on the bottom levels:
⇒ reduce the computation time at the lower levels by reducing the input
Frequent Itemset Mining
・ Transaction database D:
a database composed of transactions defined on an itemset E,
i.e., ∀t ∈ D, t ⊆ E
Examples: basket data, links of web pages, words in documents
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
・ A subset P of E is called an itemset
occurrence of P: a transaction in D including P
denotation Occ(P) of P: the set of occurrences of P
・ |Occ(P)| is called the frequency of P
Ex.) denotation of {1,2} = { {1,2,5,6,7,9}, {1,2,7,8,9} }
(a small sketch of these definitions follows below)
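As a concrete illustration of these definitions (not part of the original slides), here is a minimal Python sketch of Occ(P) and frequency on the example database above:

# Minimal sketch of the definitions above: Occ(P) and frequency of an itemset P.
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def occ(P, D):
    """Denotation Occ(P): the transactions of D that include the itemset P."""
    return [t for t in D if set(P) <= t]

def frequency(P, D):
    """|Occ(P)|: the number of transactions including P."""
    return len(occ(P, D))

print(occ({1, 2}, D))        # [{1,2,5,6,7,9}, {1,2,7,8,9}]
print(frequency({1, 2}, D))  # 2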
Frequent Itemset
・ Given a minimum support θ,
frequent itemset: an itemset s.t. (frequency) ≧ θ
(a subset of items included in at least θ transactions)
Ex.) θ = 3, D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
patterns included in at least 3 transactions:
{1} {2} {7} {9}
{1,7} {1,9} {2,7} {2,9} {7,9}
{1,7,9} {2,7,9}
Techniques for Efficient Mining
・ There are many techniques for fast mining
for search strategy:
- apriori
- backtracking
for speeding up iterations:
- down project
- pruning by infrequent subsets
- bitmap
- occurrence deliver
for database reduction & bottom wideness:
- FP-tree (trie, prefix tree)
- filtering (unification)
- conditional (projected) database
- trimming of database
Search Strategies
・ Frequent itemsets form
a connected component of the itemset lattice
- Apriori algorithms generate
itemsets level by level
⇒ pruning by infrequent subsets
⇒ much memory use
- Backtracking algorithms generate
itemsets in a depth-first manner
⇒ small memory use
⇒ matches down project, etc.
[Figure: itemset lattice over {1,2,3,4}; the frequent itemsets form a connected component around φ]
Apriori takes a long time and much memory when the output is large
Apriori and Backtracking
apriori:
Set k := 0, O_0 := {φ}
While (O_k ≠ φ) {
  for each candidate P∪e with P ∈ O_k {
    if all (P∪e)−f ∈ O_k then
      compute Occ(P∪e)
      if |Occ(P∪e)| ≧ θ then insert P∪e into O_{k+1}
  }
  k := k+1
}
backtracking:
Backtrack (P, Occ(P)):
  for each e > tail(P) {
    compute Occ(P∪e)
    if |Occ(P∪e)| ≧ θ then
      Backtrack ( P∪e, Occ(P∪e) )
  }
[Figure: itemset lattice; apriori searches it level by level, backtracking searches it depth-first]
(a runnable sketch of the backtracking scheme follows below)
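The backtracking scheme above can be written in a few lines of Python (a plain re-implementation for illustration, not the authors' code); it uses down project, computing Occ(P∪{e}) by filtering Occ(P):

# Sketch of depth-first (backtracking) frequent itemset mining.
def backtrack(P, occ_P, items, theta, out):
    out.append(set(P))                            # P is frequent: report it
    tail = P[-1] if P else -1
    for e in items:
        if e <= tail:                             # only items larger than tail(P)
            continue
        occ_Pe = [t for t in occ_P if e in t]     # down project: Occ(P∪{e}) from Occ(P)
        if len(occ_Pe) >= theta:
            backtrack(P + [e], occ_Pe, items, theta, out)

D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
items = sorted(set().union(*D))
out = []
backtrack([], D, items, theta=3, out=out)
print(out)    # the 11 frequent itemsets of the earlier example, plus the empty set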
Speeding Up Iterations
The bottleneck in an iteration is computing Occ(P∪e)
- down project: Occ(P∪e) = Occ(P) ∩ Occ(e)
⇒ O(||D≧P||): the size of the part of the database larger than tail(P)
- pruning by infrequent subsets
⇒ |P| search queries + O(c × ||D≧P||)
- bitmap: compute Occ(P) ∩ Occ(e) by AND operations
⇒ (n − tail(P)) × m/32 operations (n: #items, m: #transactions)
- occurrence deliver: compute Occ(P∪e) for all e by one scan of D(P)≧P
⇒ O(||D(P)≧P||), where D(P) is the set of transactions including P
bitmap is slow if the database is sparse; pruning is slow for huge output
occurrence deliver is fast if the threshold (minimum support) is small
(a small bitmap sketch follows below)
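A tiny sketch of the bitmap technique (using Python integers as bit vectors over the transactions; the variable names are ours): each item's occurrence set becomes a bit vector, and an intersection becomes a single AND.

# Sketch of the bitmap technique: one bit per transaction, intersection by AND.
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

bitmap = {}                                   # item -> bit vector over the transactions
for pos, t in enumerate(D):
    for e in t:
        bitmap[e] = bitmap.get(e, 0) | (1 << pos)

occ_12 = bitmap[1] & bitmap[2]                # Occ({1,2}) by one AND
print(bin(occ_12).count("1"))                 # 2, the frequency of {1,2}

With 32- or 64-bit words each AND costs about m/32 (or m/64) operations, which is fast when the database is dense but wasteful when it is sparse, as noted above.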
Occurrence Deliver
・ Compute the denotations of P∪{i} for all i's at once,
by one scan of the occurrences of P
Ex.) D = { A:{1,2,5,6,7,9}, B:{2,3,4,5}, C:{1,2,7,8,9}, D:{1,7,9}, E:{2,7,9}, F:{2} }, P = {1,7}
[Figure: each occurrence of P delivers itself to the bucket of every item it contains that is larger than tail(P)]
・ This checks the frequency of all items to be added in time linear in the database size
・ Generating the recursive calls in reverse direction, we can re-use the memory
(a sketch follows below)
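A sketch of occurrence deliver in Python (our own illustration): one scan of the occurrences of P fills a bucket for every item e > tail(P), which yields Occ(P∪{e}) for all candidate items at once.

# Sketch of occurrence deliver: each occurrence of P delivers itself to the
# bucket of every item it contains that is larger than tail(P).
from collections import defaultdict

def occurrence_deliver(occ_P, tail):
    bucket = defaultdict(list)                # item e -> Occ(P ∪ {e})
    for t in occ_P:
        for e in t:
            if e > tail:
                bucket[e].append(t)
    return bucket

D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
occ_P = [t for t in D if {1, 7} <= t]         # Occ({1,7}), as in the example above
buckets = occurrence_deliver(occ_P, tail=7)
print({e: len(ts) for e, ts in buckets.items()})   # item 9 occurs 3 times, item 8 once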
Database Reductions
・ The conditional (projected) database reduces the database by removing
unnecessary items and transactions for the deeper levels (a small sketch follows below)
・ filtering: remove infrequent items and items included in all transactions
(linear time), then unify identical transactions and keep their multiplicities
(O(||D|| log ||D||) time)
・ FP-tree (prefix tree): after removing infrequent items, identical transactions
and prefixes are unified automatically; compact if the database is dense and large
[Figure: a small database reduced with θ = 3; after filtering and unification it becomes 3,5 ×3; 5,6; 6 ×2]
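A minimal sketch of the filtering step of database reduction (our own toy example, not the one in the slide): infrequent items are removed and identical transactions are unified, keeping a multiplicity for each.

# Sketch of filtering: drop infrequent items, then unify equal transactions.
from collections import Counter

def reduce_database(D, theta):
    item_count = Counter(e for t in D for e in t)
    keep = {e for e, c in item_count.items() if c >= theta}      # frequent items only
    return Counter(tuple(sorted(set(t) & keep)) for t in D)      # transaction -> multiplicity

D = [{1,3,5}, {1,2,5,6}, {1,4,6}, {1,2,6}, {5,6}, {6}]           # hypothetical tiny database
print(reduce_database(D, theta=3))   # e.g. the reduced transaction (1,6) has multiplicity 2

Frequencies in the reduced database must then be counted with these multiplicities; applying the reduction recursively at the deeper levels is exactly where bottom-wideness pays off.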
Summary of Techniques
・ The database stays dense and large even at the bottom levels of the
computation ⇔ the support is large
・ The number of output solutions is huge ⇔ the support is small
Prediction:
- apriori will be slow when the support is small
- conditional databases are fast when the support is small
- bitmap will be slow for sparse datasets
- FP-tree will be a bit slow for sparse datasets,
and fast for large supports
Results from FIMI 04 (sparse datasets)
[Plots: computation time of bitmap, apriori, FP-tree, and conditional-database implementations on sparse FIMI'04 datasets; the gap between conditional database and FP-tree is roughly O(n) vs O(n log n)]
Conditional database is good, bitmap is slow
FP-tree ⇒ good for large supports, occurrence deliver ⇒ good for small supports
Results on Dense Datasets
[Plots: computation time of bitmap, apriori, FP-tree, and conditional-database implementations on dense datasets; #nodes in the FP-tree ≒ (||D|| after filtering)/6]
Apriori is still slow for middle supports; FP-tree and conditional database are good
Summary on Computation
・ We can understand the reasons for efficiency from the algorithmic view
- reduce the input of each iteration, exploiting bottom-wideness
- reduce the computation time of each iteration
(probably, the combination of conditional database, patricia tree, and
occurrence deliver will be good)
・ We can make similar observations for other pattern mining problems:
sequence mining, string mining, tree mining, graph mining, etc.
Next we look at closed patterns, which represent groups of similar patterns;
we begin with itemsets
Modeling: Closed Itemsets [Pasquier et al. 1999]
・ Usually, #frequent itemsets becomes huge when we mine deeply
⇒ we want to decrease the number of itemsets in some way
・ There are many ways to do this, ex., giving scores,
grouping similar itemsets, looking at other parameters, etc.
But we would like to approach it from theory
Here we introduce closed patterns
・ Consider the itemsets having the same denotation
⇒ we would say “they carry the same information”
・ We focus only on the maximal one among them, called the closed pattern
(= the intersection of the occurrences in the denotation)
Example of Closed Itemsets
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }, θ = 3
patterns included in at least 3 transactions:
{1} {2} {7} {9}
{1,7} {1,9} {2,7} {2,9} {7,9}
{1,7,9} {2,7,9}
among them, the closed ones are {2}, {7,9}, {1,7,9}, and {2,7,9}
(a closure sketch follows below)
・ In general,
#frequent closed itemsets ≦ #frequent itemsets
In particular, “<<” holds if the database has some structure
(databases with structure tend to have a huge number of frequent itemsets,
so this is an advantage)
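The closure operation itself is just the intersection of the occurrences; a small Python sketch (ours) on the example database:

# Sketch of the closure operation: closure(P) = intersection of all transactions
# in Occ(P) (P is assumed to have at least one occurrence).
D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def closure(P, D):
    occ_P = [t for t in D if set(P) <= t]
    return set.intersection(*occ_P)

print(closure({1}, D))   # {1, 7, 9}: the closed itemset with the same denotation as {1}
print(closure({7}, D))   # {7, 9}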
Difference in #itemsets
[Plots: #frequent itemsets vs. #frequent closed itemsets on the retail and BMS-WebView2 datasets, for supports from 0.32% down to 0.01%]
#frequent closed itemsets << #frequent itemsets
when the threshold θ is small
Closure Extension of Itemsets
・ Usual backtracking does not work for closed itemsets,
because there can be big gaps between closed itemsets on the itemset lattice
・ On the other hand, any closed itemset can be obtained from another one by
“adding an item and taking the closure (the maximal itemset with the same denotation)”
- the closure of P is the closed itemset having the same denotation as P,
computed by taking the intersection of the transactions in Occ(P)
[Figure: itemset lattice over {1,2,3,4}; closure extensions connect the closed itemsets]
This gives an adjacency structure defined on closed itemsets, so
we can perform a graph search on it, using memory to record the discovered itemsets
PPC Extension
・ Closure extension gives us an acyclic adjacency structure,
but it is not enough for a memory-efficient algorithm
(we would need to store the discovered itemsets in memory)
・ We introduce the PPC (prefix-preserving closure) extension to obtain a tree structure
PPC extension:
a closure extension P’ obtained from P+e is a PPC extension
⇔ the prefixes of P’ and P (the parts smaller than e) are the same
・ Any closed itemset is a PPC extension of
exactly one other closed itemset (a sketch follows below)
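Below is a compact sketch of closed-itemset enumeration by PPC extension, in the spirit of LCM but not the authors' implementation. One detail that the slide leaves implicit is taken from the original papers: the added item e must also be larger than the "core index" of P (here simply the item passed down from the parent call); with this restriction every frequent closed itemset is generated exactly once, so no discovered itemsets need to be stored.

# Sketch of frequent closed itemset enumeration by PPC extension (LCM-style).
def ppc_mine(P, occ_P, core, items, theta, out):
    out.append(set(P))                                 # P is a frequent closed itemset
    for e in items:
        if e in P or e <= core:                        # core-index restriction
            continue
        occ_Q = [t for t in occ_P if e in t]           # Occ(P ∪ {e}) by down project
        if len(occ_Q) < theta:
            continue
        Q = set.intersection(*occ_Q)                   # closure of P ∪ {e}
        if {i for i in Q if i < e} != {i for i in P if i < e}:
            continue                                   # prefix not preserved: not a PPC extension
        ppc_mine(Q, occ_Q, e, items, theta, out)       # the core index of Q is e

D = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
items = sorted(set().union(*D))
out = []
ppc_mine(set.intersection(*D), D, 0, items, theta=3, out=out)
print(out)   # φ, {1,7,9}, {2}, {2,7,9}, {7,9}: each frequent closed itemset exactly once

In practice each recursive call would also build the reduced (conditional) database described earlier and use occurrence deliver instead of scanning all items.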
Example of PPC Extension
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
・ closure extension ⇒ acyclic
・ ppc extension ⇒ tree
[Figure: the closed itemsets φ, {2}, {7,9}, {1,7,9}, {2,5}, {1,2,7,9}, {2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}; closure-extension edges form an acyclic graph, PPC-extension edges form a tree]
Ex.) {1,2,7,9}, {1,2,7}, and {1,2} are included exactly in
{1,2,5,6,7,9} and {1,2,7,8,9}, so they all have the same denotation and closure {1,2,7,9}
{1,7,9}, {1,7}, and {1} are included exactly in
{1,7,9}, {1,2,5,6,7,9}, and {1,2,7,8,9}, so their closure is {1,7,9}
For Efficient Computation
・ Computing the closure takes a long time
・ We use database reduction, based on the fact that
“if P’ is the PPC extension of P by P+e, and P’’ is the PPC extension of P’ by P’+f,
then e < f”; thus the prefix part is needed only for the intersection (closure) computation!
[Figure: an example database reduced with respect to e = 5; infrequent items are removed and transactions with identical reduced parts are unified with multiplicities, e.g. 2,5,7,9 ×2 and 5,6,7 ×2]
Experiment: vs. Frequent Itemset (sparse)
[Plots: #frequent itemsets, #frequent closed itemsets, and computation time per 1000 itemsets on the BMS-WebView2 and BMS-POS datasets, for various supports]
Computation time per itemset is very stable
There is no big difference in computation time between mining frequent itemsets and frequent closed itemsets
Experiment: vs. Frequent Itemset (dense)
[Plots: #frequent itemsets, #frequent closed itemsets, and computation time per 1000 itemsets on the accidents and connect datasets, for various supports]
Computation time per itemset is very stable
There is no big difference in computation time
Comparison with Other Methods
・ There are roughly two methods to enumerate closed patterns
frequent pattern based: enumerate all frequent patterns and
output only the closed ones (+ some pruning);
closedness is checked by keeping all discovered itemsets in memory
closure based: compute closed patterns by the closure operation, and avoid
duplication by keeping all discovered itemsets in memory
・ If the number of solutions is small, the frequent pattern based method is fast,
since the search for checking closedness takes very little time
vs. Other Implementations (sparse)
Large minimum support ⇒ frequent pattern based methods are good
Small minimum support ⇒ PPC extension is good
vs. Other Implementations (dense)
Small minimum support ⇒ PPC extension and database
reduction are good
Extending Closed Patterns
・ There are several mining problems for which we can introduce
closed patterns (the union of occurrences must be unique!!)
- Ranked trees (labeled trees without siblings of the same label)
- Motifs (strings with wildcards)
Ex.) the motif AB??EF?H appears in ABCDEFGH and in ABZZEFZH
For these problems, PPC extension also works similarly, together with
conditional databases and occurrence deliver
Conclusion
・ We overviewed the techniques for frequent pattern mining as
enumeration algorithms, and showed that
- the complexity of one iteration and bottom-wideness are important
・ We showed that closed patterns are probably a valuable model,
and can be enumerated efficiently
Future work
・ Develop efficient algorithms and implementations for other basic
mining problems
・ Extend the class of problems in which closed patterns work well