Alternatives to Eager Hardware Cache
Coherence on Large-Scale CMPs
Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc
Acknowledgments
 Research Team
– Dr. Tom Ashby (now at IMEC)
– Christian Fensch
– Pedro Jimenez
– Constantino Ribeiro
– Marius Kuhn
 Funding
– UK – EPSRC
– EC – HiPEAC (NE)
– EC – SARC (IP)
Why Cache Coherence?
 When is it necessary?
– Shared-memory parallel applications (including OS)
– Thread migration under multiprogramming
– Implementing synchronization (e.g., spin-locks and Dekker's algorithm)
 When is it not so important?
– Sequential applications with no migration
– Applications with little locality or data reuse
– Message-passing parallel applications
Hardware Cache Coherence?
 What is good about it?
– Portability and transparency: no need to compile applications
for different hierarchies
– High performance: high selectivity of invalidations, low
traffic, low latency operations
 What is not so good about it?
– Extra hardware: for coherence state, for controllers, and for
suitable interconnects
– Overkill:
 When coherence is not necessary (see previous slide)
 When software can do a reasonable, or even better, job
Hardware Cache Coherence?
 Extra hardware:
– Shared bus fabrics and crossbars will take up a large amount
of chip real estate (easily >50% overhead for 32 processors
or more by some accounts)
– Protocols and controllers for distributed directories and
multi-segment buses are complex and difficult to verify
 Overkill:
– Communication in applications does not occur at all times,
but hardware assumes so
– Hardware is always there even when not needed
Current CMPs
 Example: IBM Power4
– 2 8-issue cores with
private 32KB L1
– Crossbar to 3 L2 banks
1.5MB total
– 174M transistors in
0.22micron on 412mm2
 and Power5
– 2 8-issue 2-way SMT
cores with private 32KB
L1
– Crossbar to 3 L2 banks
1.9MB total
– 276M transistors in
0.13micron on 389mm2
Current CMPs
 Example: Sun T1
– 8 1-issue 4-way MT
cores with private 8KB
L1
– Crossbar to 4 L2 banks
3MB total
– ?M transistors in
0.09micron on ?mm2
 and T2
– 8 1-issue 8-way MT
cores
– Crossbar to 8 L2 banks
4MB total
– ?M transistors in
0.065micron on ?mm2
Where are we heading?
 Moore’s Law applied to number of cores? Not yet.
– Multithreading delays the need for more cores, but unlikely
to go beyond 8-way
– An ever larger shared on-chip cache (i.e., with a directory) is the easy solution; but how large can/should we go?
 But what if we want it to be so?
– Must find scalable and area-efficient solutions to cache
coherence problem
– Must find ways to make the most out of the off-chip data
bandwidth
– Must (ideally) make architecture that is good for various
workload types
Is it time to re-think traditional hardware cache coherence?
Our Tiled Architecture Approach
 Simple single- or dual-issue, possibly multithreaded,
cores
 Point-to-point interconnect
 Cache coherence mostly handled by OS and
compiler/programmer with minimal hardware support
 Cache coherence can be turned off with little hardware
waste for certain applications
 Reconfigurable NUCA caches maximize the usefulness
of the on-chip storage
Outline
 Motivation
 Page level placement and remote cache access
– Design
– Evaluation
– Related Work and Conclusions
 Software self invalidations and writebacks
– Design
– Evaluation
– Related Work and Conclusions
 Current and Future Directions
Baseline Tiled Architecture
 Single issue RISC processors
 Point-to-point interconnects
 Caches are only for local data
storage and there is no
hardware cache coherence
 Shown to be good for ILP- and DLP-based applications (e.g., MIT's Raw)
 But, no support for shared-memory parallel applications
Key Ideas
 There is only one copy of each datum across all L1 caches (thus, no coherence problem)
– Later we will relax this constraint a little bit
 Data is allocated to L1 caches on a page level
– OS controls placement
– New hardware tables ensure that location of data can be
easily determined
– Placement can change at quiescent points (e.g., barriers)
 Hardware supports remote cache accesses
Should work well if most data is placed locally and the
distances/latency on chip are not too high
Mechanism Overview
 Data placement
– There is a spectrum of options, ranging from static per-page mapping (lowest hardware complexity, lowest mapping flexibility) to dynamic per-line mapping (highest hardware complexity, highest mapping flexibility); our proposal sits in between
– Dynamic per line is what previous proposals for NUCA-like
and adaptive shared/private cache schemes do
– Static per page is very restrictive and compiler/OS has to get
the mapping absolutely right
– Our proposal: dynamic OS controlled mapping per page
Data Placement
 New level of indirection allows OS to map any given
page to any L1 cache
 OS extensions
– Page table is extended to keep owner tile ID alongside (per
page) virtual-to-physical address translation (note that this
exposes the chip structure to the OS)
– Default mapping policy is first-touch
 Hardware extensions
– TLB-like table that caches the most recent translations and is
used to locate the line upon loads and stores (MAP)
– Instead of a directory coherence controller, a device to
handle remote cache access requests (RAC)
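A minimal C sketch of the OS and hardware structures just described; the struct layouts, field widths, and the first-touch helper are illustrative assumptions, not the actual design.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical extended page-table entry: the usual virtual-to-physical
 * translation plus the ID of the owner tile (this exposes the chip
 * structure to the OS). */
typedef struct {
    uint64_t vpn;         /* virtual page number                   */
    uint64_t ppn;         /* physical page number                  */
    uint8_t  owner_tile;  /* tile whose L1 holds this page's data  */
    bool     valid;
} ext_pte_t;

/* Hypothetical MAP entry: a TLB-like cache of recent page-to-tile
 * translations, consulted on every load/store to pick local vs. remote L1. */
typedef struct {
    uint64_t vpn;
    uint8_t  owner_tile;
    bool     valid;
} map_entry_t;

/* Default first-touch policy: the first tile to touch an unowned page
 * becomes its owner. */
static void first_touch(ext_pte_t *pte, uint8_t touching_tile) {
    if (!pte->valid) {
        pte->owner_tile = touching_tile;
        pte->valid = true;
    }
}
```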
Data Placement
 Why not simply divide the physical address space
across tiles and use the virtual-to-physical mapping to
place pages?
– More restrictive
– Fragmentation in main memory
– More difficult to change mapping
[Figure: page placement across the per-tile L1 caches via virtual and physical addresses]
Data Placement
 Difficulty:
– L1 caches are often virtually-indexed to speed up local
processor access
– Remote cache requests with physical addresses would require
some inverse translation at the remote tile
 Solution:
– Ship virtual addresses over the on-chip interconnect
– TLBs only keep virtual-to-physical translations for the data allocated locally
Load/Store Mechanism
1. Virtual address (v-addr) is used to index L1, TLB, and MAP
2. If MAP entry points to local L1 then proceed with cache and TLB lookups as usual
3. If MAP entry points to remote L1 then abort L1 lookup and ignore TLB result (including miss)
4. v-addr is sent over network to home tile's RAC
5. v-addr is used to index both L1 and TLB
6. Remote cache and TLB miss/hits are handled as usual
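The six steps can be condensed into a small software model; the arrays and helpers below stand in for the MAP, the per-tile L1/TLB lookups, and the RAC request, and are assumptions for illustration only.

```c
#include <stdint.h>

#define NUM_TILES  16
#define PAGE_SHIFT 12
#define MAP_SIZE   256

/* Toy MAP: virtual page number -> owner tile (direct-mapped for brevity). */
static uint8_t map_owner[MAP_SIZE];

static uint8_t map_lookup(uint64_t vaddr) {
    return map_owner[(vaddr >> PAGE_SHIFT) % MAP_SIZE];
}

/* Toy per-tile storage standing in for each tile's L1 cache and TLB. */
static uint64_t tile_data[NUM_TILES][1024];

uint64_t load(uint64_t vaddr, uint8_t my_tile) {
    uint8_t home = map_lookup(vaddr);            /* 1. index L1, TLB, and MAP      */
    if (home == my_tile) {
        return tile_data[my_tile][vaddr % 1024]; /* 2. local: normal L1/TLB lookup */
    }
    /* 3. Remote: the local L1 lookup is aborted and the TLB result ignored.
     * 4. The *virtual* address is shipped to the home tile's RAC.
     * 5.-6. The home tile indexes its own L1 and TLB with the virtual address,
     *       handles hits/misses as usual, and replies with the data.            */
    return tile_data[home][vaddr % 1024];
}
```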
Allowing Migration of Data
 To migrate pages:
– Invalidate MAP entries
– Write-back dirty cache lines
– Invalidate cache contents
– The following access will generate a (possibly) new mapping
 Must be done at a quiescent state: e.g., barrier
 Instructions for cache writeback and invalidation already exist in most ISAs; all we need to add is invalidation of the MAP as well
 Note that it is not necessary to invalidate the TLBs
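A sketch of the remapping sequence at a quiescent point; the three functions are hypothetical wrappers around the existing writeback/invalidate instructions and the new MAP-invalidate operation, not a real API.

```c
/* Hypothetical wrappers around cache-control instructions (which most ISAs
 * already provide) and the new MAP-invalidate operation. */
extern void map_invalidate_all(void);
extern void cache_writeback_dirty(void);
extern void cache_invalidate_all(void);

/* Remapping at a quiescent point (e.g., a barrier). */
void remap_at_quiescent_point(void) {
    map_invalidate_all();     /* invalidate MAP entries        */
    cache_writeback_dirty();  /* write back dirty cache lines  */
    cache_invalidate_all();   /* invalidate cache contents     */
    /* TLBs are left untouched: virtual-to-physical translations remain
     * valid. The next access to each page generates a (possibly) new
     * owner mapping, e.g., by first touch.                              */
}
```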
Allowing Migration of Data
[Figure: tiles P0, P1, and P2, each with an L1, a TLB, and a MAP; memory page pa is mapped to P0]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (remote)
3. P0 st 0xA2
4. Barrier (write-back + inv. L1, inv. MAP)
Allowing Migration of Data
[Figure: after the barrier, page pa is remapped; memory and P1's MAP now map pa→P1]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (remote)
3. P0 st 0xA2
4. Barrier (write-back + inv. L1, inv. MAP)
5. P1 ld 0xA1 (pa→P1)
6. P0 ld 0xA0 (remote)
Allowing Replication of Read-Only Data
 Controlled replication: multiple readers but single, fixed,
writer
 Caveat: assume release-consistent (RC) programs
 Operation
– Processors reading data in a page get a local mapping in MAP
– First write traps to OS and tile becomes the home/owner
– Subsequent reads with old local mappings can continue
– Subsequent writes, and reads without mapping, are directed to home tile
– At locks must also invalidate MAP entries
– At barriers same procedure as for migration
 Key features
– OS does not have to keep track of sharers
– Ownership of modified data is not transferred
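A C sketch of the OS-side policy for controlled replication under the RC assumption above; NO_OWNER, the struct, and both helpers are illustrative names, not the real interface.

```c
#include <stdint.h>

#define NO_OWNER 0xffu

/* Hypothetical per-page record for controlled replication: any number of
 * readers, but a single fixed writer/home once the first write happens.
 * No list of sharers is kept. */
typedef struct {
    uint8_t owner_tile;   /* NO_OWNER until the first write traps to the OS */
} repl_page_t;

/* Read without an existing MAP entry: map locally if the page is still
 * read-only, otherwise direct reads to the home tile. Old local read
 * mappings on other tiles stay usable until the next lock acquire/barrier. */
uint8_t map_for_read(repl_page_t *p, uint8_t reader_tile) {
    return (p->owner_tile == NO_OWNER) ? reader_tile : p->owner_tile;
}

/* Write: the first write traps to the OS and that tile becomes home/owner;
 * every subsequent write is directed to this fixed home (ownership of
 * modified data is never transferred). */
uint8_t map_for_write(repl_page_t *p, uint8_t writer_tile) {
    if (p->owner_tile == NO_OWNER)
        p->owner_tile = writer_tile;
    return p->owner_tile;
}
```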
Allowing Replication of Data
[Figure: tiles P0, P1, and P2; memory maps page pa→P0 while P1 holds a local read mapping pa→P1]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
Allowing Replication of Data
[Figure: memory still maps pa→P0; after its first write, P2's MAP holds pa→P2]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
5. P0 ld 0xA2 (pa→P0)
6. P2 ld 0xA1 (pa→P2)
7. P2 st 0xA1 (owner)
8. P0 acq. lock (invalidate MAP)
Allowing Replication of Data
[Figure: after the lock acquire, P0's MAP is invalidated; memory and P2's MAP now map pa→P2]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
5. P0 ld 0xA2 (pa→P0)
6. P2 ld 0xA1 (pa→P2)
7. P2 st 0xA1 (owner)
8. P0 acq. lock (invalidate MAP)
9. P0 ld 0xA1 (pa→P2)
Evaluation Environment
 Simulator:
– PowerPC IS
– Pipelined single-issue cores and network modeled with Liberty
– Caches and memory borrowed from SimpleScalar
– All latencies modeled in detail; contention modeled in detail, except at internal nodes in the network
 Compiler: gcc 3.4.4
 Applications: Splash-2
 Systems simulated:
– Proposed scheme with and without migration+sharing
(NUCA-Dist and NUCA-Dist+MS)
– Idealized distributed directory coherent system (Dir-Coh)
Overall Performance
[Chart: speedup of NUCA-Dist+MS and the idealized Dir-Coh system on the Splash-2 benchmarks (cholesky, FFT, LU, radix, barnes, fmm, ocean, radiosity, raytrace, water-nsq, water-spt, volrend); annotated gaps range from no gap up to 42%]
NUCA-Dist+MS performs close to the idealized directory system (within 42% in the worst case, 19% on average)
Memory Access Breakdown
[Chart: memory access breakdown (local hit, remote hit, local miss, remote miss) for cholesky, FFT, and LU under Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh with 16 and 32 tiles]
Off-chip accesses very rare for all systems
Memory Access Breakdown
[Chart: memory access breakdown for radix, barnes, and fmm under Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh with 16 and 32 tiles]
Migration and replication significantly reduce the amount of remote cache accesses
Memory Access Breakdown
[Chart: memory access breakdown for ocean, radiosity, and raytrace under Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh with 16 and 32 tiles]
Memory Access Breakdown
[Chart: memory access breakdown for water-nsq, water-spt, and volrend under Seq, NUCA-Dist, NUCA-Dist+MS, and Dir-Coh with 16 and 32 tiles]
NUCA-Dist+MS has on avg. 90% of local accesses
Ovhd. of Writebacks and Invalidations
 Recall:
– At barriers: write back dirty cache lines, invalidate cache and MAP
– At lock acquires: write back dirty cache lines, invalidate MAP

Overhead (%):
Benchmark   Barriers   Locks
cholesky    <0.01      2.93
FFT         0.1        <0.01
LU          4.03       <0.01
radix       2.75       4.23
barnes      0.65       1.8
fmm         0.7        0.9
ocean       22.38      0.43
radiosity   <0.01      17.4
raytrace    0.38       8.07
volrend     <0.01      <0.01
water-nsq   0.1        1.08
water-spt   3.83       0.5

Writebacks and invalidations don't seem to be a major bottleneck
Related Work
 Tiled architectures:
– Raw (MIT); TRIPS (U. Texas); Cyclops (IBM); Wavescalar (U.
Washington); Vector-Thread (MIT)
 Shared/Private on-chip caches
– NUCA (U. Texas); CMP-NUCA (U. Wisconsin); NuRAPID
(Purdue); Victim Replication (MIT); Cooperative Caching (U.
Wisconsin)
 Alternatives to Snooping and Directory Coherence:
– Token Coherence and Ring Coherence (U. Wisconsin)
 Software DSM
Conclusions
 Alternative to full hardware distributed directory cache
coherence that divides work between OS and hardware
 Mostly acceptable performance degradation:
– Within 50% for all applications and within 15% for 8 out of 12
applications w.r.t. an ideal hardware coherent system
– Small number of remote cache accesses (10% on average and
38% maximum)
 Likely much less complex than distributed directory
coherence:
– Only request-response transactions; no forwarding or multiple
invalidations necessary
– Less storage required compared to directory state
Outline
 Motivation
 Page level placement and remote cache access
– Design
– Evaluation
– Related Work and Conclusions
 Software self invalidations and writebacks
– Design
– Evaluation
– Related Work and Conclusions
 Current and Future Directions
Hardware vs. Software Cache Coherence
 Hardware
+ Fast communication and possibly more accurate invalidations
– Eager write propagation at the time of store
– Unnecessary invalidations on false sharing
– No exploitation of knowledge of memory consistency model,
synchronization points, and communication patterns
+ Has been done commercially for many years and proven to
work well
Hardware vs. Software Cache Coherence
 Software
+ Compiler/programmer may know exactly what needs to be
communicated and when
– Exact sharing information not usually easy to identify statically
– Hard to deal with hardware artifacts such as false sharing,
speculative execution, and prefetching
– Never really left the niche of research prototypes and DSPs
– Only worked for loop-based data-parallel scientific applications
+ Less hardware required
Software Coh. for Release Consistency
 Our approach: look at the communication requirements
of the memory consistency model (RC in our case)
– Synchronization points explicitly defined
– No communication between threads happens in-between
synchronization points
– Lock acquires:
 Data modified by the previous lock owner must be made visible to the acquirer (transitively)
– Lock releases:
 Data modified must be made visible to future lock owners (transitively)
– Barriers:
 Data modified by all threads before the barrier must be visible to all threads after the barrier
Software Coh. for Release Consistency
 To enforce cache coherence on RC
– Lock acquires:
 Write back dirty lines
 Invalidate cache contents
– Lock release:
 Write back dirty lines
– Barrier:
 Write back dirty lines
 Invalidate cache contents
– Use per-word dirty bits to perform selective writeback in
memory to handle false sharing
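A minimal sketch of how these rules might be attached to the synchronization primitives; the lock/barrier calls and cache-control intrinsics are hypothetical placeholders, and the selective writeback is assumed to use the per-word dirty bits.

```c
/* Hypothetical primitives: cache-control operations and a generic
 * lock/barrier API; none of these names are a real interface. */
extern void cache_writeback_dirty(void);   /* selective, per-word-dirty writeback */
extern void cache_invalidate_all(void);
extern void lock_acquire(volatile int *lock);
extern void lock_release(volatile int *lock);
extern void barrier_wait(void);

void sw_coh_acquire(volatile int *lock) {
    lock_acquire(lock);
    cache_writeback_dirty();   /* write back dirty lines     */
    cache_invalidate_all();    /* invalidate cache contents  */
}

void sw_coh_release(volatile int *lock) {
    cache_writeback_dirty();   /* write back dirty lines     */
    lock_release(lock);
}

void sw_coh_barrier(void) {
    cache_writeback_dirty();   /* write back dirty lines     */
    cache_invalidate_all();    /* invalidate cache contents  */
    barrier_wait();            /* ordering w.r.t. the wait is an assumption */
}
```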
Evaluation Environment
 Simulator:
– Sparc IS
– Non-pipelined single-issue cores, caches, and network
modeled with Simics
– All latencies modeled in detail; contention modeled in detail at
bus
 Compiler: Sun C compiler
 Applications: Splash-2 (scientific&engineering) and
ALPBench (media)
 Systems simulated:
– Proposed scheme with self-invalidations and writebacks
(flush)
– Hardware coherent system (MSI)
Overall Performance
For a surprising number of applications software
coherence performs
close (within 10%) to MSI
Overall Performance
Somewhat worse results for larger caches (256KB vs. 64KB), but still fairly close to MSI
Overall Performance
Even better results with a shared L2 cache
Related Work
 Software cache coherence:
– Pfister et al., ICPP’85; Cheong and Veidenbaum, ISCA’88; Cytron et al., ICPP’88
– Compiler-directed, only successful with regular data-parallel
applications
 Most recent/relevant work: shared-regions, Sandhu et al., PPoPP’93
– Also based on explicit synchronization
– Required user help to identify critical sections and the
associated data
 Software DSM
Conclusions
 Revisited software-only cache coherence mechanisms
 Pure software-based cache coherence for release-consistent applications performs surprisingly well
– Within 50% for all applications and within 10% for 9 out of 13
applications w.r.t. hardware coherent system
 Some applications still suffer due to unnecessary cache
invalidations and writebacks
 Hardware needed to perform more selective cache
invalidation
Outline
 Motivation
 Page level placement and remote cache access
– Design
– Evaluation
– Related Work and Conclusions
 Software self invalidations and writebacks
– Design
– Evaluation
– Related Work and Conclusions
 Current and Future Directions
Current and Future Directions
 Tiled architectures
– Hybrid hardware-software cache coherence schemes
– Mechanisms to support streaming applications
– Reconfigurable on-chip cache hierarchies
 Speculative parallelization
– Tuning of common microarchitectural features
– Automatic compilation and optimization with gcc
– Software-only speculative parallelization
 Performance prediction and design space exploration
– With machine learning techniques
 Database Systems
– JIT code generation and optimization for relational queries
Alternatives to Eager Hardware Cache
Coherence on Large-Scale CMPs
Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc
Implementing Locks
 Straightforward to implement Compare&Swap-style primitives (there is only one copy of the lock variable)
 Load-link/Store-conditional primitives are not as easy to
implement without cache coherence
Coherence guarantees that
all caches will be notified
of any conflicting store
Absence of coherence
makes it impossible to
detect conflict
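For example, a spin lock built directly on a Compare&Swap executed at the lock's home tile; remote_cas() is a hypothetical stand-in for that remote operation, not part of the actual design.

```c
#include <stdbool.h>

/* Hypothetical remote CAS: performed atomically at the home tile's L1,
 * where the single copy of the lock variable lives. */
extern bool remote_cas(volatile int *addr, int expected, int desired);

void spin_lock(volatile int *lock) {
    while (!remote_cas(lock, 0, 1))
        ;  /* each attempt is a remote request to the lock's home tile */
}

void spin_unlock(volatile int *lock) {
    /* The store is routed to the single copy at the home tile, so no
     * invalidations of other caches are needed. */
    *lock = 0;
}
```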
Implementing Locks
 Solution:
– Place RESERVE register in the tile that is home to the lock
– RESERVE register is shared by all locks in pages assigned to
the tile
– Must not allow new LL requests to overwrite the RESERVE
register (to avoid livelock)
 Must also add a timeout mechanism to prevent deadlock when there
is no matching SC
– Must add an ID tag to the RESERVE register to identify the
processor with the pending LL (to avoid simultaneous
acquisition of a lock)
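A sketch of the RESERVE register logic handled at the lock's home tile; the struct fields, the timeout value, and the clearing policy on a failing SC are assumptions drawn from the bullets above.

```c
#include <stdint.h>
#include <stdbool.h>

/* One RESERVE register per tile, shared by all locks in pages homed there. */
typedef struct {
    uint64_t addr;        /* address reserved by a pending LL               */
    uint8_t  proc_id;     /* processor that issued the LL                   */
    uint32_t timeout;     /* cycles left before the reservation expires     */
    bool     valid;
} reserve_reg_t;

/* Load-linked arriving at the home tile: take the reservation only if the
 * register is free, so a new LL cannot overwrite a pending one (avoids
 * livelock); the timeout clears stale reservations with no matching SC. */
bool handle_ll(reserve_reg_t *r, uint64_t addr, uint8_t proc_id) {
    if (r->valid)
        return false;                  /* busy: requester must retry        */
    r->addr = addr;
    r->proc_id = proc_id;
    r->timeout = 1000;                 /* illustrative timeout value        */
    r->valid = true;
    return true;
}

/* Store-conditional arriving at the home tile: succeeds only for the
 * processor holding the reservation on that address, preventing
 * simultaneous acquisition of the lock. */
bool handle_sc(reserve_reg_t *r, uint64_t addr, uint8_t proc_id) {
    if (r->valid && r->addr == addr && r->proc_id == proc_id) {
        r->valid = false;              /* matching SC consumes the reservation */
        return true;                   /* store is performed                   */
    }
    return false;                      /* no matching reservation: SC fails    */
}
```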
Evaluation Environment
 Configurations simulated:
Evaluation Environment
 Applications: Splash-2 benchmarks with reference input
sets (except for radiosity)
Memory Latencies
[Chart: fraction of loops (%) in each speedup range (0.5<S≤1, 1<S≤2, 2<S≤3, 3<S≤4) for mesa, art, equake, ammp, vpr, mcf, crafty, vortex, bzip2, and the average]
NUCA-Dist+MS performs very close (within x% of the ideal coherent system)
Evaluation Environment
 Configurations simulated: 2 to 16 processors
Evaluation Environment
 Applications: Splash-2 applications and ALPBench
benchmarks, both with reference input sets (except
MPGDEC/ENC)
Sources of Largest Errors (top 10%)
Source of error                      Number   Error (%)
Incorrect IR workload estimation     4        54~116
Unknown iteration count (i<P)        3        54~61
Unknown inner loop iteration count   2        98~161
Biased conditional                   1        136
Application Characteristics
Application   Loops                        % of Seq. Time   Spec Data Size (KB)
TREE          accel_10                     94               <1
WUPWISE       muldeo_200’, muldoe_200’     41               12,000
MDG           interf_1000                  86               <1
LUCAS         mers_mod_square (line 444)   20               4,000
AP3M          Shgravll_700                 78               3,000
Breaking News!!
 Intel has just announced an 80-core chip at ISSCC
– Simple cores
– Point-to-point interconnect
– No cache coherence (?)