PowerPoint Transcript

A Lightweight Hybrid
Hardware/Software Approach for
Object-Relative Memory Profiling
Licheng Chen, Zehan Cui, Yungang Bao,
Mingyu Chen, Yongbing Huang, and Guangming Tan
Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)
ISPASS 2012
April 2, 2012
Background
• Memory behavior is a key factor in the performance of a program.
• Understanding memory behavior is important for identifying bottlenecks in both the architecture and the application.
• For example,
– The TLB is an essential component of the memory system
– Applications' working sets tend to be larger and larger, leading to serious TLB misses
– Study 1: TLB misses can degrade system performance by 5~14% [Bhargava'08]
– Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee'08]
Done by memory profiling
Memory Profiling
• Memory profiling collects memory behavior information during the execution of a program.
• Profiling can be performed
– for different hardware components
– at different software levels
[Figure: profiling targets. Software levels: Whole System, Application, Function, Objects (Array, List, etc.). Hardware components: TLB/Cache/DRAM]
Object Memory Profiling
• An object refers to a group of data stored as a unit [Wu'04]
– Object traces distinguish regular patterns from mixed and irregular traces
[Figure: a whole-system trace is narrowed to an application trace and further to per-object traces]
• Valuable for optimization
– Memory trace compression
– Data layout
– Object-level prefetching
– Cache partition [Soft-OLP, PACT 2009]
Current Profiling Approaches
• Existing approaches
– Compiler-driven: requires re-compilation/re-linking and source code
– Instrumentation: heavy overhead
– Simulation: accuracy problems, slow
– Performance counters: lack of detailed information
• None of them can observe the page table walks caused by TLB misses
• We propose a hybrid hardware/software approach
for object memory profiling
– Accurate: real application & real system
– Lightweight
– Track page table walks at object-level
Outline
• Background
• Design and Implementation
• Experimental Results
• Conclusion
An Overview
[Overview figure: the physical address trace is translated into per-process virtual address traces, which are then mapped to per-object access patterns, e.g., the Matrix object at VA 0x1f05000]
HMTT
• Hybrid Memory Trace Toolkit
– A DDR3 SDRAM compatible memory trace monitoring system
– Adopts hardware snooping technology
Memory trace record: <time_stamp, r/w, phy_addr> (a sketch of such a record follows below)
Advantages:
• Platform independent
• Negligible overhead
• Full-system real memory traces, including the OS and page table walks
[Photo: the HMTT board, showing the PCIe cable connector; the monitored DIMM is plugged on the other side]
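For illustration, a minimal sketch of how one trace record <time_stamp, r/w, phy_addr> could be represented and read back offline. The field widths and the read_record() helper are assumptions; the slides do not specify HMTT's on-disk encoding.

#include <stdint.h>
#include <stdio.h>

/* Assumed layout of one HMTT trace record: <time_stamp, r/w, phy_addr>.
 * The real bit-level format may differ. */
struct hmtt_record {
    uint64_t time_stamp;   /* tick count when the DRAM request was snooped */
    uint8_t  is_write;     /* 0 = read, 1 = write */
    uint64_t phy_addr;     /* physical address of the request */
};

/* Read one record from a raw trace file; returns 1 on success, 0 at EOF. */
static int read_record(FILE *f, struct hmtt_record *r)
{
    return fread(r, sizeof(*r), 1, f) == 1;
}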
Challenge (1)
• How to translate the physical address trace into the virtual address trace of a specific process?
• Modify the OS kernel to obtain and dump the page tables
• Look up each phy_addr in the dumped page table
• Generate the virtual trace of each process
Challenge (2)
• How to synchronize hardware and software when a page table update occurs in the kernel?
• Physical page allocation/free in the kernel
• Triggers annotations in the OS VM module
• Update the dumped page table
• Send a sync_tag to the hardware (sketched below)
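A heavily simplified sketch of the synchronization idea, assuming the kernel annotation writes to a reserved, uncached region that HMTT snoops so the tag appears in-line in the trace. The function name, the tag encoding, and the hmtt_sync_region pointer are all hypothetical; the actual kernel modifications are not shown on the slide.

#include <linux/types.h>

extern volatile u64 *hmtt_sync_region;   /* assumed mapping of the reserved region */

/* Hypothetical annotation called from the OS VM module whenever a physical
 * page is allocated to or freed from a process. The stores below reach DRAM
 * and are recorded by HMTT as a sync_tag in the trace stream. */
static inline void hmtt_sync_tag(int pid, unsigned long pfn, int is_alloc)
{
    /* the dumped page table kept for offline translation is updated here (omitted) */
    *hmtt_sync_region = ((u64)pid << 32) | (is_alloc ? 1 : 0);   /* made-up encoding */
    *hmtt_sync_region = (u64)pfn;                                /* affected frame   */
}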
Challenge (3)
• How to map virtual addresses to objects without modifying source code?
[Figure: a malloc(0x1000) call that creates object matrix is intercepted as matrix = mymalloc(0x1000); the returned virtual address range is recorded in the Object-VA mapping table]
• The role of malloc() is to map virtual addresses to objects
• Use dynamic library overriding (interposition) to replace malloc() (a sketch follows below)
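A minimal sketch of the interposition, assuming an LD_PRELOAD shim: it forwards to the real malloc() and records the returned range for the Object-VA mapping table. The record_object() logger, and how its output reaches the analysis tool, are assumptions.

/* build: gcc -shared -fPIC -o libobjmap.so objmap.c -ldl
 * run:   LD_PRELOAD=./libobjmap.so ./app                  */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);
static __thread int in_hook;             /* guard against re-entrant logging */

/* Assumed logger: append <start VA, size> to the Object-VA mapping table. */
static void record_object(void *p, size_t size)
{
    fprintf(stderr, "obj %p size %zu\n", p, size);
}

void *malloc(size_t size)
{
    if (!real_malloc)
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    void *p = real_malloc(size);
    if (!in_hook) {
        in_hook = 1;
        record_object(p, size);
        in_hook = 0;
    }
    return p;
}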
Put them all together
[Figure: putting it all together. The physical address trace, interleaved with sync_tags and page-walk accesses, is translated through the dumped page table into per-process virtual address traces; the Object-VA mapping table then yields per-object access patterns, e.g., the Matrix object at VA 0x1f05000]
Use the page table to distinguish three types of memory access (classification sketched below):
• sync_tag → update the dumped page table
• an access to the page table itself → a page table walk due to a TLB miss
• any other memory access → translated to a virtual address
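For illustration, a sketch of this three-way classification during offline trace processing; the helper predicates, built from the HMTT configuration and the dumped page table, are hypothetical names rather than the tool's actual API.

#include <stdint.h>

enum access_kind { SYNC_TAG, PAGE_WALK, NORMAL_ACCESS };

/* Assumed helpers derived from the dumped page table and the HMTT setup. */
int is_sync_tag_addr(uint64_t pa);     /* lies in the reserved sync-tag region?  */
int is_page_table_page(uint64_t pa);   /* physical page holding page-table data? */
int to_virtual(uint64_t pa, int *pid, uint64_t *va);   /* 0 on success */

enum access_kind classify(uint64_t pa, int *pid, uint64_t *va)
{
    if (is_sync_tag_addr(pa))
        return SYNC_TAG;        /* apply the pending update to the dumped page table */
    if (is_page_table_page(pa))
        return PAGE_WALK;       /* a hardware page walk caused by a TLB miss */
    to_virtual(pa, pid, va);    /* ordinary access: translate to a virtual address */
    return NORMAL_ACCESS;
}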
Evaluation Methodology
Processor: Intel Xeon E5504, 2.0GHz, 2 sockets, 4 cores per socket (8 cores in total)
Private cache:
– L1 D-Cache: 32KB, 8-way, 64B/line; L1 I-Cache: 32KB, 4-way, 64B/line
– L2: 256KB, 8-way, 64B/line
Shared cache:
– L3: 4MB, 16-way, 64B/line
TLB (private):
– DTLB0: 64 entries for 4KB pages; 32 entries for huge pages (2MB)
– TLB1: 512 entries for 4KB pages
Memory: DDR3-800 RDIMM, dual-rank, plugged into Socket 0, 4GB total (0.25GB reserved for HMTT configuration and buffer, 3.75GB available to the system)
Operating system: CentOS 5.3, Linux kernel 2.6.32.18
Benchmarks: multithreaded PARSEC 2.1; a custom hybrid MPI/pthread BFS implementation of Graph500-1.2
Validation
• For the SpMV benchmark (CSR format): y = ax * xhost (a generic CSR SpMV sketch follows below)
– Our system is able to distinguish the regular access pattern from the irregular one
• Micro-benchmark:
– The measured error is less than 2%
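For context, a generic CSR sparse matrix-vector product of the kind validated here: row_ptr, col_idx, val, and y are streamed sequentially (regular pattern), while x is gathered through the column indices (irregular pattern). The array names are generic, not necessarily those used in the benchmark.

/* y = A * x with A stored in CSR format. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* irregular, index-driven reads of x */
        y[i] = sum;                          /* sequential writes to y */
    }
}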
Overhead
• Two main sources of overhead:
– Dumping page table traces: +dump_pt
– Dumping the object-VA mapping: +dump_obj
• Only objects >= 4KB are monitored; they account for most memory references
[Chart: normalized overhead of Origin, +dump_pt, and +dump_obj (y-axis 0.96 to 1.06); the added overheads are below 2% and 1%, respectively]
Case Study 1: BFS (Breadth-First Search)
• The column object incurs about 71% of the page walks, making it the key object
• Optimization: use huge pages for the column object (allocation sketched at the end of this slide)
[Chart: percentage of page walks per object (rowstarts, column, pred, oldq), 0% to 120%, versus number of threads]
– Speedup: about 12% with 8 threads, 8% with 128 threads
[Chart: normalized speedup w/o hugetlb vs. w/ hugetlb for 1 to 128 threads; an 8.18% gain is marked on the chart]
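A minimal sketch of the optimization, assuming the system has huge pages configured (e.g., via vm.nr_hugepages): back the key object with 2MB huge pages through MAP_HUGETLB so far fewer TLB entries cover it. The fallback path and the usage line are assumptions; the slides do not show the actual allocation code.

#include <stdlib.h>
#include <sys/mman.h>

/* Allocate `bytes` backed by 2MB huge pages when possible,
 * falling back to the normal allocator (4KB pages) otherwise. */
static void *alloc_huge(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    return malloc(bytes);
}

/* e.g. the BFS column array: column = alloc_huge(nedges * sizeof(int64_t)); */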
Case Study 2: Canneal (PARSEC)
• Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design
• Two objects contribute most of the memory accesses: _elements and _locations
[Chart: number of memory requests for the main objects in canneal with 1, 2, 4, and 8 threads]
The number of memory accesses stays almost unchanged as the thread count increases.
Case Study 2: Canneal
• The _elements object contributes most of the increase in page walks
• Optimization: place the _elements object in huge pages to reduce TLB misses; speedup is about 5% with 8 threads
[Charts: number of page walks for total, _elements, and _locations with 1 to 8 threads; normalized speedup w/o vs. w/ hugetlb with 1 to 8 threads]
A Visual Demo of the HMTT
Conclusion
• We have designed and implemented a hybrid hardware/software approach to conduct object-relative memory profiling.
– Accurate: real application & real system
– Lightweight
– Track page table walks at object-level
• We demonstrate two case studies to show how the
approach can help users better understand memory
behavior and optimize performance.
• We intend to use this approach to analyze virtual machines on real machines.
Thanks & Questions?
Extra Slides
Memory Profiling Approaches
Approach              Low overhead   Page walks   Accurate   Detailed
Instrument            ×              ×            √          √
Simulator             ×              ×            *          √
Performance Counter   √              *            √          ×
Compiler              √              ×            √          √
Hybrid H/S            √              √            √          √
Note: √ = Yes, × = No, * = Maybe
Reverse Page Table
• Physical address → pid, virtual address
[Figure: the reverse page table is an array indexed by physical page number (0 to N-1); each entry holds a list of (virtual address, pid) pairs such as (Vaddr1, pid1), (Vaddr1', pid1'), ..., (Vaddrk, pidk)]
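A simplified sketch of this structure, assuming one chained list of (pid, virtual page) mappings per physical frame; sizing, shared-page policy, and updates on sync_tags are left out and are not specified on the slide.

#include <stdint.h>
#include <stdlib.h>

/* One mapping of a physical frame into some process's address space.
 * A frame can be shared, so entries form a small linked list. */
struct rmap_entry {
    int pid;
    uint64_t vpage;                 /* virtual page number */
    struct rmap_entry *next;
};

/* Reverse page table: indexed directly by physical frame number 0..N-1. */
struct rpt {
    size_t nframes;
    struct rmap_entry **frames;
};

static void rpt_insert(struct rpt *t, uint64_t pfn, int pid, uint64_t vpage)
{
    struct rmap_entry *e = malloc(sizeof(*e));
    e->pid = pid;
    e->vpage = vpage;
    e->next = t->frames[pfn];
    t->frames[pfn] = e;
}

/* Translate a physical address to (pid, virtual address); returns 0 on success. */
static int rpt_lookup(const struct rpt *t, uint64_t pa, int *pid, uint64_t *va)
{
    uint64_t pfn = pa >> 12;                 /* 4KB pages assumed */
    if (pfn >= t->nframes || t->frames[pfn] == NULL)
        return -1;
    struct rmap_entry *e = t->frames[pfn];   /* shared pages would yield several hits */
    *pid = e->pid;
    *va  = (e->vpage << 12) | (pa & 0xFFF);
    return 0;
}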
Validation
Access the object with different patterns (a sketch of one such pattern appears after the results table):
• a0: all read accesses, forward
• a1: 3/4 reads and 1/4 writes, forward
• a2: 2/4 reads and 2/4 writes, forward
• a3: 1/4 reads and 3/4 writes, backward
• a4: all write accesses, backward
Object size: 256MB, access step: 64B, requests: 4M per object
Obj   Read        Write       Measured R:W   Expected R:W   Error
a0    4,194,370   0           4:0            4:0            0%
a1    4,194,310   1,048,576   4:1            4:1            0%
a2    4,194,369   2,096,927   4:2            4:2            0%
a3    4,194,303   3,087,379   4:2.94         4:3            2.04%
a4    4,194,436   4,149,586   4:3.96         4:4            1.01%
[Figure: visualized access patterns of objects a0 and a4]
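A sketch of what one of the patterns above could look like (a1: 3/4 reads and 1/4 writes, forward traversal in 64-byte steps); the actual micro-benchmark loop is not shown on the slide, so this is an assumed reconstruction.

#include <stddef.h>

static volatile long sink;   /* keeps the reads from being optimized away */

/* a1-style pattern: walk the object forward, one 64-byte step per access,
 * issuing 3 reads followed by 1 write in every group of four accesses. */
void pattern_a1(char *obj, size_t size)
{
    const size_t step = 64;
    size_t n = 0;
    for (size_t i = 0; i + step <= size; i += step, n++) {
        if (n % 4 == 3)
            obj[i] = (char)n;        /* 1/4 writes */
        else
            sink += obj[i];          /* 3/4 reads  */
    }
}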
HMTT Configuration Space
• A reserved physical memory region
• Can be accessed from source code and from binaries (a user-space sketch follows below)
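A minimal user-space sketch of touching such a reserved region, assuming it is reachable through /dev/mem; the base address below is a placeholder, not the real HMTT configuration address, and the meaning of the written word is made up.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define HMTT_CFG_BASE 0xF0000000UL   /* placeholder physical base address */
#define HMTT_CFG_SIZE 4096

/* Map the reserved configuration page and write one control word to it. */
int hmtt_cfg_write(uint64_t value)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    volatile uint64_t *cfg = mmap(NULL, HMTT_CFG_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, HMTT_CFG_BASE);
    close(fd);
    if (cfg == MAP_FAILED)
        return -1;
    cfg[0] = value;                  /* e.g. post a marker or start/stop tracing */
    munmap((void *)cfg, HMTT_CFG_SIZE);
    return 0;
}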