Transcript Slide 1

Cache-conscious Frequent Pattern
Mining on a Modern Processor
Amol Ghoting, Gregory Buehrer, and Srinivasan
Parthasarathy
Data Mining Research Laboratory, CSE
The Ohio State University
Daehyun Kim, Anthony Nguyen, Yen-Kuang Chen, and
Pradeep Dubey
Intel Corporation
Copyright 2005, Data Mining Research Lab, The Ohio State University
Roadmap
• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions
Motivation
• Data mining applications
  – Rapidly growing segment in commerce and science
  – Interactive response time is important
  – Compute- and memory-intensive
• Modern architectures
  – Memory wall
  – Instruction-level parallelism (ILP)
[Figure: FP-Growth performance saturates; speedups of 2.4x and 1.6x are marked. Note: experiment conducted on specialized hardware.]
Contributions
• We characterize the performance and memory
access behavior of three state-of-the-art
frequent pattern mining algorithms
• We improve the performance of the three
frequent pattern mining algorithms
– Cache-conscious prefix tree
• Spatial locality + hardware pre-fetching
• Path tiling to improve temporal locality
• Co-scheduling to improve ILP on a simultaneous
multi-threaded (SMT) processor
Roadmap
• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions
Frequent pattern mining (1)
• Finds groups of items that co-occur frequently in a transactional
data set
• Example:
  – Customer 1: milk, bread, cereal
  – Customer 2: milk, bread, eggs, sugar
  – Customer 3: milk, bread, butter
  – Customer 4: eggs, sugar
• Minimum support = 2
• Frequent patterns:
– Size 1: (milk), (bread), (eggs), (sugar)
– Size 2: (milk & bread), (eggs & sugar)
– Size 3: None
• Search space traversal: breadth-first or depth-first
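The example above can be reproduced with a brute-force miner (an illustrative sketch only; the algorithms in this talk use far more efficient prefix-tree methods):

```python
from itertools import combinations

# The slide's example: four customer transactions, minimum support = 2.
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread", "eggs", "sugar"},
    {"milk", "bread", "butter"},
    {"eggs", "sugar"},
]

def frequent_itemsets(transactions, min_support):
    """Count every candidate itemset against all transactions (brute force)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
                found_any = True
        if not found_any:
            break  # no frequent size-k itemset => none of size k+1 (Apriori property)
    return frequent

result = frequent_itemsets(transactions, min_support=2)
```

The early break is the Apriori property at work: once no itemset of a given size is frequent, no superset can be.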
Frequent pattern mining (2)
• Algorithms under study
– FP-Growth (based on the pattern-growth
method)
• Winner of the 2003 Frequent Itemset Mining
Implementations (FIMI) evaluation
– Genmax (depth-first search space traversal)
• Maximal pattern miner
– Apriori (breadth-first search space traversal)
• All algorithms use a prefix tree as a data set
representation
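To make the breadth-first style concrete, Apriori's candidate generation can be sketched as follows (a minimal sketch, not Borgelt's implementation; `generate_candidates` is a hypothetical helper name):

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Apriori join step: merge two sorted frequent k-itemsets that share
    their first k-1 items, then prune any candidate that has an infrequent
    k-item subset."""
    frequent = set(frequent_k)
    candidates = []
    for a in frequent_k:
        for b in frequent_k:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # prune: every k-subset of the candidate must itself be frequent
                if all(sub in frequent for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

cands = generate_candidates([("a", "b"), ("a", "c"), ("b", "c")])
```

Each generated level is then counted against the prefix-tree representation of the data set.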
Roadmap
• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions
Setup
• Intel Xeon processor
  – 2 GHz, with 4 GB of main memory
  – 4-way 8 KB L1 data cache
  – 8-way 512 KB on-chip L2 cache
  – 8-way 2 MB L3 cache
• Intel VTune Performance Analyzer used to collect performance data
• Implementations
  – FIMI repository
    • FP-Growth (Gösta Grahne and Jianfei Zhu)
    • Genmax (Gösta Grahne and Jianfei Zhu)
    • Apriori (Christian Borgelt)
– Custom memory managers
Execution time breakdown
FP-Growth:  Count-FPGrowth() – 61%   Project-FPGrowth() – 31%   Other – 8%
Genmax:     Count-GM() – 91%         Other – 9%
Apriori:    Count-Apriori() – 70%    Candidate-Gen() – 25%      Other – 5%

The dominant Count functions all perform support counting in a prefix tree;
Count-GM() and Count-Apriori() are similar to Count-FPGrowth().
Operation mix
                                      Count-FPGrowth()  Count-GM()  Count-Apriori()
Integer ALU operations per instruction      0.65            0.64         0.34
Memory operations per instruction           0.72            0.69         0.66

Note: the columns need not sum to 1.
Memory access behavior
                            Count-FPGrowth()  Count-GM()  Count-Apriori()
L1 hit rate                       89%             87%          86%
L2 hit rate                       43%             42%          49%
L3 hit rate                       39%             40%          27%
L3 misses per instruction         0.03            0.03         0.04
CPU utilization                   9%              9%           8%
Roadmap
• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions
FP-tree
[Figure: an FP-tree over items a, b, c, f, m, p with minimum support = 3. Each
node stores ITEM, COUNT, NODE POINTER, CHILD POINTERS, and PARENT POINTER.]
• Indexed on the largest common prefix, which typically results in a
  compressed data set representation
• Node pointers allow for fast searching of items
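A minimal sketch of the node layout and insertion logic, with field names taken from the slide (this is illustrative, not the authors' code):

```python
class FPNode:
    """One FP-tree node with the fields named on the slide."""
    def __init__(self, item, parent):
        self.item = item        # ITEM
        self.count = 0          # COUNT
        self.parent = parent    # PARENT POINTER
        self.children = {}      # CHILD POINTERS, keyed by item
        self.node_link = None   # NODE POINTER: next node holding the same item

def insert(root, transaction, header):
    """Insert one frequency-sorted transaction; shared prefixes reuse
    existing nodes, which is what compresses the data set."""
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            # thread the new node onto the header list for fast item search
            child.node_link = header.get(item)
            header[item] = child
        child.count += 1
        node = child

root, header = FPNode(None, None), {}
for t in [["f", "c", "a", "m", "p"], ["f", "c", "a", "b", "m"], ["f", "b"]]:
    insert(root, t, header)
```

After the three inserts, the shared prefix f-c-a is stored once with per-node counts, and the header list for m links both m nodes.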
FP-Growth
• Basic step:
  – For each item in the FP-tree, build its conditional FP-tree
  – Recursively mine the conditional FP-tree
[Figure: the conditional FP-tree for m is the single path r → f:3 → c:3 → a:3;
it is then mined recursively. Items p, f, c, a, and b are processed similarly.]
• A dynamic data structure, and only two node fields are used through the
  bottom-up traversal
  – Poor spatial locality
• Large data structure
  – Poor temporal locality
• Pointer de-referencing
  – Poor instruction-level parallelism (ILP)
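The bottom-up step can be sketched on a hand-built fragment of the slide's tree: each occurrence of m climbs its parent pointers to collect the conditional pattern base (a sketch under assumed field names, not the FIMI implementation):

```python
class Node:
    """Minimal node: just the ITEM, COUNT, and PARENT POINTER fields that
    the bottom-up traversal actually touches."""
    def __init__(self, item, count, parent):
        self.item, self.count, self.parent = item, count, parent

# Fragment of the slide's FP-tree: r -> f:3 -> c:3 -> a:3, with m:2 under a
# and m:1 under a -> b:1 (minimum support = 3).
root = Node(None, 0, None)
f = Node("f", 3, root)
c = Node("c", 3, f)
a = Node("a", 3, c)
m_top = Node("m", 2, a)
b = Node("b", 1, a)
m_bot = Node("m", 1, b)

def prefix_paths(occurrences):
    """Climb parent pointers from each occurrence of an item to the root,
    yielding (prefix path, count) pairs: the conditional pattern base."""
    base = []
    for node in occurrences:
        path, p = [], node.parent
        while p.parent is not None:   # stop at the root
            path.append(p.item)
            p = p.parent
        base.append((path[::-1], node.count))
    return base

base = prefix_paths([m_top, m_bot])

# Counting items over the base with minimum support 3 keeps only f, c, a:
# the conditional FP-tree for m is the single path f:3 - c:3 - a:3.
support = {}
for path, count in base:
    for item in path:
        support[item] = support.get(item, 0) + count
frequent_in_base = {i for i, s in support.items() if s >= 3}
```

Note that only the item and parent fields are read during the climb, even though every pointer dereference drags a whole node into cache.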
Cache-conscious prefix tree
• Improves cache line utilization
– Smaller node size + DFS allocation
• Allows the use of hardware cache line pre-fetching for bottom-up traversals
[Figure: the cache-conscious prefix tree over items a, b, c, f, m, p, with
nodes allocated in DFS order. Per-item header lists hold the node pointers
and per-item count lists hold the node counts; the node fields shown are
ITEM, COUNT, PARENT POINTER, CHILD POINTERS, and NODE POINTER.]
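The layout change can be sketched as a struct-of-arrays transform (an assumed representation; the paper's actual node format may differ): nodes are numbered in DFS order in flat arrays, counts are split into a separate array, and a bottom-up walk then moves through strictly decreasing indices, which long cache lines and the hardware prefetcher can exploit.

```python
def dfs_flatten(children):
    """children: nested dict {item: (count, child_dict)}. Returns parallel
    arrays (items, parents, counts) with nodes numbered in DFS order; a
    parent's index is always smaller than its children's."""
    items, parents, counts = [], [], []
    def visit(kids, parent_idx):
        for item, (count, grandkids) in kids.items():
            idx = len(items)
            items.append(item)
            parents.append(parent_idx)
            counts.append(count)   # counts live in a separate "count list"
            visit(grandkids, idx)
    visit(children, -1)
    return items, parents, counts

tree = {"f": (3, {"c": (3, {"a": (3, {"m": (2, {}), "b": (1, {"m": (1, {})})})})})}
items, parents, counts = dfs_flatten(tree)

# Bottom-up walk from the deepest m (index 5): indices strictly decrease,
# so the traversal sweeps memory in one direction.
path, i = [], 5
while i != -1:
    path.append(items[i])
    i = parents[i]
```

Because a node's ancestors always sit at lower indices, every bottom-up path is a monotone sweep through the arrays rather than a pointer chase across the heap.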
Path tiling to improve temporal locality
• DFS order enables breaking the tree into tiles based on node addresses
  – These tiles can partially overlap
• Maximize tree reuse
  – All tree accesses are restructured to operate on a tile-by-tile basis
[Figure: the tree partitioned by address range into Tiles 1 through N.]
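Path tiling can be sketched as follows (illustrative interface): every bottom-up walk is advanced only while its node index stays inside the active tile and is parked otherwise, so each tile's nodes are brought into cache once and reused by all walks.

```python
def tiled_bottom_up(parents, starts, tile_size):
    """parents: DFS-ordered parent-index array (root's parent is -1).
    starts: one starting node index per walk. Tiles are processed from the
    highest address range downward, since in a DFS layout a parent's index
    is always smaller than its child's."""
    n = len(parents)
    cursors = list(starts)
    visited = [[] for _ in starts]
    top = ((n - 1) // tile_size) * tile_size
    for tile_start in range(top, -1, -tile_size):
        tile_end = tile_start + tile_size
        for w, cur in enumerate(cursors):
            # advance this walk only while it stays inside the active tile
            while cur != -1 and tile_start <= cur < tile_end:
                visited[w].append(cur)
                cur = parents[cur]
            cursors[w] = cur
    return visited

# A six-node DFS-ordered tree (f, c, a, m, b, m); two walks, tiles of 2 nodes.
paths = tiled_bottom_up([-1, 0, 1, 2, 2, 4], starts=[5, 3], tile_size=2)
```

Each walk still visits its full root path, but all work inside tile k finishes before any walk touches tile k-1.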
Improving ILP using SMT
• Simultaneous multi-threading (SMT)
  – Intel hyper-threading (HT) maintains two hardware contexts on chip
  – Improves instruction-level parallelism (ILP): when one thread waits,
    the other thread can use CPU resources
• Identifying independent threads is not good enough
  – Unlikely to hide long cache-miss latency
  – Can lead to cache interference (conflicts)
• Solution: restructure the multi-threaded computation to reuse the cache on
  a tile-by-tile basis, with the two threads performing different
  computation on the same data
[Figure: Thread 1 and Thread 2 move together through Tiles 1 to N.]
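The co-scheduling idea can be sketched with two Python threads (a toy model: `sum` and `len` stand in for the two different computations over the same tile):

```python
import threading

def co_schedule(tiles, work_a, work_b):
    """Run two different computations over the SAME sequence of tiles, with
    a barrier after each tile so neither thread races ahead: both threads
    touch the same tile at the same time and share it in cache instead of
    evicting each other's working set."""
    barrier = threading.Barrier(2)
    out_a, out_b = [], []
    def run(work, out):
        for tile in tiles:
            out.append(work(tile))
            barrier.wait()   # stay in lockstep, tile by tile
    ta = threading.Thread(target=run, args=(work_a, out_a))
    tb = threading.Thread(target=run, args=(work_b, out_b))
    ta.start(); tb.start()
    ta.join(); tb.join()
    return out_a, out_b

tiles = [[1, 2], [3, 4], [5]]
sums, sizes = co_schedule(tiles, sum, len)
```

The barrier is the essential part: without it, the two hardware contexts would drift onto different tiles and interfere in the shared cache rather than cooperate.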
Speedup for FP-Growth
(Synthetic data set)
[Figure: speedup results for FP-Growth on the synthetic data set.]
Speedup for FP-Growth
(Real data set)
• For FP-Growth:
  – L1 hit rate improves from 89% to 94%
  – L2 hit rate improves from 43% to 98%
• Speedup:
  – Genmax – up to 4.5x
  – Apriori – up to 3.5x
[Figure: speedup results on the real data set.]
Roadmap
• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions
Related work (1)
• Data mining algorithms
  – Characterizations
    • Self-organizing maps – Kim et al. [WWC99]
    • C4.5 – Bradford and Fortes [WWC98]
    • Sequence mining, graph mining, outlier detection, clustering, and
      decision tree induction – Ghoting et al. [DAMON05]
  – Memory placement techniques for association rule mining –
    Parthasarathy et al. [SIGKDD98, KAIS01]
    • Considered the effects of memory pooling and coarse-grained spatial
      locality on association rule mining algorithms in serial and parallel
      settings
Related work (2)
• Database algorithms
  – DBMS on modern hardware – Ailamaki et al. [VLDB99, VLDB2001]
  – Cache-sensitive search trees and B+-trees – Rao and Ross [VLDB99, SIGMOD00]
  – Prefetching for B+-trees and hash join – Chen et al. [SIGMOD00, ICDE04]
Ongoing and future work
• Algorithm re-design for next generation
architectures
– e.g. graph mining on multi-core architectures
• Cache-conscious optimizations for other data
mining and bioinformatics applications on
modern architectures
– e.g. classification algorithms, graph mining
• Out-of-core algorithm designs
• Microprocessor design targeted at data
mining algorithms
Conclusions
• Characterized the performance of three popular
frequent pattern mining algorithms
• Proposed a tile-able cache-conscious prefix tree
– Improves spatial locality and allows for cache line prefetching
– Path tiling improves temporal locality
• Proposed novel thread-based decomposition for
improving ILP by utilizing SMT
– Overall, up to 4.8-fold speedup
• Effective algorithm design in data mining needs to
take into account modern architectural designs.
Thanks
• We would like to acknowledge the following grants:
  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-DE-FG02-04ER25611