ASPLOS POSTER


Authors: Kenichi Imazato, Naoto Fukumoto,
Koji Inoue, Kazuaki Murakami
(Kyushu University)
Performance Balancing:
An Adaptive Helper-Thread Execution for Many-Core Era
1. Concept

Goal: High-performance parallel processing on a chip multiprocessor (CMP)

• Conventional approach: all cores execute a parallel program.
  – The performance improvement is very small even if we increase the number of cores from 6 to 8: with a 1MB L2 cache, Cholesky gains only 6.9%, whereas a perfect L2 cache (100% hit rate) keeps scaling.
  [Figure: Speedup of Cholesky vs. number of cores (0 to 8), perfect L2 cache vs. 1MB L2 cache]
• Our approach: core management considering the balance of processor-memory performance (Performance Balancing).
  – Some cores are used to improve the memory performance: they execute helper threads for prefetching.
  – If the effect of the software prefetching is larger than the negative impact of the TLP throttling, we can improve the CMP performance.
  [Figure: Target CMP. Computing cores (core 0 … core N) execute application threads and helper cores execute helper threads; each core has an MSB, an L1D$, and an L1I$, and shares an on-chip L2 cache over a shared bus; main memory is off-chip.]

3. Analysis
The execution cycles with helper threads are modeled as:

  CC^{ht}_{N-m} = [ (f/(N-m) + (1-f)) / (f/N + (1-f)) ] * (1 - k_N) * CC_N + (1 - r_ht) * k_N * CC_N

  Speedup = CC_N / CC^{ht}_{N-m}

where
  f    : the fraction of operations that can be parallelized
  N    : the number of cores on a chip
  m    : the number of helper cores
  k_N  : the fraction of main-memory access time when all cores are used to execute the application threads
  r_ht : the reduction rate of L2 cache misses achieved by the helper cores
  CC_N : the execution time in clock cycles on N-core execution

• f ↓, k_N ↑ ⇒ the benchmark program benefits more from our approach.
• m ↓, r_ht ↑ ⇒ our approach is more effective.

[Figure: Performance analysis of Cholesky (f = 0.73, k_N = 0.45): predicted speedup of the proposal over the conventional execution as a function of r_ht (0 to 1.0) and m (1 to 7), ranging from speed-down (about 0.5x) to speed-up (about 2.5x).]
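Under these definitions the model can be evaluated directly. A minimal sketch (function names are ours; r_ht is treated here as an independent parameter, although in practice it depends on m), normalizing CC_N to 1.0:

```python
# Numeric sketch of the performance-balancing model above.
# Variables f, k_n, r_ht, m, n follow the poster's definitions.

def predicted_speedup(f, k_n, r_ht, m, n):
    """Speedup = CC_N / CC^{ht}_{N-m}."""
    assert 0 < m < n
    # Computation slowdown from running on N-m instead of N cores (Amdahl).
    amdahl_ratio = (f / (n - m) + (1 - f)) / (f / n + (1 - f))
    # CC^{ht}_{N-m} normalized to CC_N = 1: compute part plus residual memory part.
    cc_ht = amdahl_ratio * (1 - k_n) + (1 - r_ht) * k_n
    return 1.0 / cc_ht

def best_helper_count(f, k_n, r_ht, n):
    """The statically 'best number of helper cores' under the model."""
    return max(range(1, n), key=lambda m: predicted_speedup(f, k_n, r_ht, m, n))

# Cholesky on the poster's 8-core CMP: f = 0.73, k_N = 0.45.
print(predicted_speedup(0.73, 0.45, 0.0, 1, 8))  # < 1: no miss reduction => speed-down
print(predicted_speedup(0.73, 0.45, 0.9, 1, 8))  # > 1: strong prefetching => speed-up
print(best_helper_count(0.73, 0.45, 0.9, 8))
```

Note that with r_ht held fixed, fewer helper cores always look better, matching the observation that m ↓ makes the approach more effective.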
• Helper cores work for the computing cores.
• By exploiting profile information, a compiler can statically optimize the number of helper cores.
• By monitoring the processor and memory performance, the OS determines the number of helper cores and the type of prefetchers.
  – If the memory performance is quite low, the OS increases the number of helper cores.

2. Architectural Support

• For prefetching, helper cores need information about the cache misses caused by the computing cores. ⇒ Introduce the MSB.
• Miss Status Buffer (MSB): records information about the cache misses caused by the computing cores.
  – Each entry consists of a core ID, a PC value, and the associated miss address.
  – Each core has an MSB. → Helper threads can be executed on any core.
• The cache-miss information can be obtained by snooping the coherence traffic.
• By referring to the MSB, a helper thread emulates hardware prefetchers.

4. Preliminary Evaluation

• Assumption: the number of execution threads is fixed for the whole program execution, and the best number of helper cores is given.
• Simulation parameters: 8 in-order cores, 1MB L2 cache, 300-clock-cycle main-memory latency.
• Models:
  – BASE: all cores are used to execute the application threads.
  – PB-GS (PB-LS): the model supporting performance balancing, which executes global (local) stride prefetching as helper threads.
• Our approach improves performance by up to 47% (Cholesky).

[Figure: Relative execution cycles and reduction rate of the L2 cache miss rate for BASE, PB-GS, and PB-LS on Cholesky, FMM, LU, Ocean, Radix, and Raytrace.]
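As an illustration of Section 2, the following behavioral sketch shows how a helper thread might scan MSB entries and derive stride prefetch addresses. The poster specifies only the MSB entry format (core ID, PC value, miss address); the data structures, function names, and the per-PC stride policy (in the spirit of PB-LS's local stride prefetch) are our assumptions:

```python
from collections import namedtuple, defaultdict

# MSB entry fields as given on the poster; the container types are ours.
MSBEntry = namedtuple("MSBEntry", ["core_id", "pc", "miss_addr"])

def stride_prefetch_candidates(msb_entries, degree=2):
    """For each (core, load PC) whose last three misses form a constant
    stride, return the next `degree` addresses to prefetch."""
    history = defaultdict(list)            # (core_id, pc) -> miss addresses
    for e in msb_entries:
        history[(e.core_id, e.pc)].append(e.miss_addr)
    prefetches = []
    for addrs in history.values():
        if len(addrs) >= 3:
            s1 = addrs[-1] - addrs[-2]
            s2 = addrs[-2] - addrs[-3]
            if s1 == s2 and s1 != 0:       # confirmed constant stride
                prefetches += [addrs[-1] + s1 * i for i in range(1, degree + 1)]
    return prefetches

# Core 0's load at PC 0x400 misses on a 64-byte stride,
# so the helper thread would prefetch 0x10c0 and 0x1100 next:
entries = [MSBEntry(0, 0x400, a) for a in (0x1000, 0x1040, 0x1080)]
print(stride_prefetch_candidates(entries))
```

A global-stride variant (as in PB-GS) could be sketched the same way by keying the history on the PC alone, merging miss streams across computing cores.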