Transcript PP-tree

From Path Tree To Frequent Patterns:
A Framework for Mining Frequent
Patterns
Yabo Xu, Jeffrey Xu Yu, Guimei Liu, Hongjun Lu, Proc. of the
2002 IEEE International Conference on Data Mining (ICDM’02)
Adviser:Jia-Ling Koh
Speaker: Yu-ting Kung
Introduction

In this paper, the main tasks (for a multi-user
environment) are:
1.
2.
3.
4.
Constructing an initial tree for a transactional
database (in memory)
Mining using the tree constructed in memory
Converting in-memory tree  a disk-based tree
Loading a portion of the tree on disk into main
memory for mining (mining is the same as 2)
2
Introduction(Cont.)

Data structures─PP-tree


A novel coded prefix-path tree
Two representations:
1.
2.

Memory–based pp-tree
Disk-based pp-tree
Mining algorithm─PP-Mine


Upon the memory-based pp-tree
Outperforms FP-growth
3
Transaction Database
 Example: (min_sup threshold  2 )
( a:3, b:1, c:3, d:3, e:3, f:1, g:2, h:1, i:1)
4
A Coded Prefix-Path Tree
F: a set of frequent 1-items in total order (like frequency order)
1.
2.
3.
4.
PP-tree: an order tree
Node:
labelled for a frequent item in F
Children of a node:
listed following the order
The rank N of a PP-tree: (N= 5)
the number of frequent 1-itemset
5
A Complete Prefix-Path Tree
1.
pp tree (rank N):
a PP-tree with 2 n nodes
2.
Node is encoded in:
pre-order traversal
3.
Shaded subtree:
a PP-tree
6
PP-tree Representations


Memory-based representation ─ PPM-tree
Disk-based representation ─ PPD-tree

Represented as (T , F , I , m )
T: tree structure in disk
 F: stores N frequent 1-itemset
 I: index indicating the ranges of codes in disk-pages
  m : min_sup uesd to build PPD-tree on disk


See Figure 3 (next page)
7
PP-tree Representation-Fig3
item:count
Code of range
code:count
8
How to built a PPD-tree?

Construction


A PPM-tree with  m in memory (task1)
Conversion


PPM-tree  PPD-tree
Using coding scheme
9
PP-Mine: Mining in-Memory

Based on two properties:
(ij, ik: a single item prefix-path)
( ,  ,  : a prefix-path in general which are possible empty)
1.
Property1 (push-down)
10
PP-Mine (Cont.)
2.

Property 2 (push-right)
Example: Figure 4 (next page)
11
PP-Mine (Cont.)
12
PP-Mine Algorithm: Example
13
Experiment(1)

Data Sourse



Three Algorithms to be compared




Sparse dataset─T25I20D100K(10K items)
Dense dataset ─ T40I10D1K(101 items)
PP-Mine
FP-growth
H-Mine
Compare the only mining-phase
14
Experiment Result(1)
15
Experiment Result(2)



Data Sourse─T40I10D100K(59 items)
 m = 50%
Two Algorithms to be compared



PP-Mine
FP-growth
Compare


t(FP)─the time for FP-growth to construct a FP-tree
t(PP) ─the time for PP-load to load a sub PPD-tree +
the time to construct a small PPM-tree
16
Experiment Result(2)
17
Conclusion

PP-Mine algorithm outperforms FP-tree


Reduce both I/O cost and CPU cost
PP-Mine algorithm outperforms H-mine

Minimizes counting cost
18
Coverage

Definition
A coverage of a prefix-path  -prefix is
defined as all the  -prefixes that contain
 -prefix (including  -prefix itself)
19