Transcript rd-club
Design and Management of 3D CMPs
using
Network-in-Memory
Feihui Li et al.
Penn State University
(ISCA – 2006)
News..
Moral of the story…
• 3D technology helps reduce wire delays
– Exploit it in as many ways as you can!
– They chose L2 caches
• Also, 3D leads to on-chip hotspots.
– Arrange units intelligently to reduce localized hotspots.
Major Results/Contributions
• First 3D CMP design space exploration
• Proposal of 3D NUCA L2 caches for CMPs
– Comparison with the existing 2D counterparts
– 3D works better even without data migration
• Proposal of NoCs as the means of communication between L2 banks
– “Efficiently exploit fast vertical interconnects”
Basics…
Typical Network-on-Chip architecture
Major types of integration
Proposed: 3D Network-in-Memory
[Figure: pillar node. Labels: processing element (L2 cache bank or CPU), NIC, b bits, single-stage router (R), NoC, dTDMA bus, NoC/bus interface, b-bit dTDMA bus (communication pillar); the pillar is orthogonal to the slide.]
dTDMA Bus (Dynamic Time-Division Multiple Access)
The dTDMA Bus as the Communication Pillar
• Do not use multi-hop routing for vertical communication
– Vertical distance is so small
• Use dTDMA bus (VLSID 2006)
– Efficient/fast bus
– Small area/power overhead
[Figure labels: layers, 10~100 um, 1500 um, Router, dTDMA Bus, Arbiter]
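As a rough illustration of the dynamic time-division idea named on this slide, the sketch below models an arbiter that sizes each bus frame to the number of clients currently requesting the bus, so idle clients consume no slots. The class `DtdmaArbiter` and its methods are hypothetical names; this is not the arbiter from the VLSID 2006 work, only a toy model of handing out slots among active clients.

```python
class DtdmaArbiter:
    """Toy model of a dynamic TDMA bus arbiter (illustrative assumption only)."""

    def __init__(self, num_clients):
        self.num_clients = num_clients
        self.active = set()  # clients that currently have data to send

    def request(self, client):
        """Mark a client as active so it receives a slot in later frames."""
        if not 0 <= client < self.num_clients:
            raise ValueError("unknown client")
        self.active.add(client)

    def release(self, client):
        """Mark a client as idle so it no longer receives a slot."""
        self.active.discard(client)

    def frame(self):
        """Return the slot order for the next frame: one slot per active client."""
        return sorted(self.active)


arb = DtdmaArbiter(num_clients=4)  # e.g. one client per layer on a pillar
arb.request(0)
arb.request(2)
first = arb.frame()    # two active clients, so a two-slot frame
arb.release(0)
second = arb.frame()   # the frame shrinks to a single slot
```

The point of the sketch is that the frame length tracks demand: with a fixed TDMA schedule all four clients would own a slot whether or not they had traffic.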
Proposals (1)
• Inter-die “communication pillars”
• Integration of dTDMA buses and NoC routers for a fast communication interface; a typical NoC fails due to
– increased complexity
– contention issues
– increased power/area overhead
– multi-hop vertical comm.
3D Benefit: Increased Locality
[Figure: nodes within 1, 2 and 3 hops of a CPU, comparing the 2D vicinity with the 3D vicinity reached through a dTDMA pillar.]
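The locality benefit can be put into numbers with a small counting sketch. It assumes an unbounded 2D mesh per layer, a CPU sitting on a pillar, and a pillar bus that reaches any other layer in a single hop; these simplifications and the function names are illustrative assumptions, not figures taken from the paper.

```python
def nodes_within_2d(hops):
    """Nodes reachable within `hops` mesh hops on an unbounded 2D mesh,
    counting the source node itself."""
    if hops < 0:
        return 0
    return 2 * hops * (hops + 1) + 1


def nodes_within_3d(hops, layers):
    """Same count with `layers` stacked meshes and the source on a pillar.

    Assumption: the pillar bus reaches any other layer in one hop, after
    which hops - 1 mesh hops remain on that layer.
    """
    same_layer = nodes_within_2d(hops)
    other_layers = (layers - 1) * nodes_within_2d(hops - 1)
    return same_layer + other_layers


for h in (1, 2, 3):
    print(h, nodes_within_2d(h), nodes_within_3d(h, layers=2))
```

Under these assumptions, two layers put 18 nodes within 2 hops of the CPU, against 13 for a single 2D layer.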
Proposals (2)
• Cannot increase # of pillars arbitrarily
– Depends on via density
– Router complexity
• So, CPUs share pillars
– Stacking of CPUs also has to be considered
• CPU placement algorithm
– Stack CPUs across dies so as to
• Maintain decent access hop-count
• Manage thermal profile
CPU placement example
Not stacking CPUs directly on top of one another helps to solve the localized hotspot problem.
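A minimal sketch of the placement constraint described above: CPUs that share a pillar are kept within one mesh hop of it, spread over the layers, and never placed in the same vertical column. The offsets, function names and pillar coordinates are hypothetical, and the sketch does not model temperature; it is not the placement algorithm from the paper.

```python
# Distinct in-plane offsets around a pillar; each is one mesh hop from it.
OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def place_cpus(pillars, cpus_per_pillar, layers):
    """Return a list of (layer, x, y) CPU positions.

    CPUs sharing a pillar are spread round-robin over the layers and given
    different (x, y) offsets, so none sits directly above another.
    """
    if cpus_per_pillar > len(OFFSETS):
        raise ValueError("at most one CPU per offset around a pillar")
    placement = []
    for (px, py) in pillars:
        for i in range(cpus_per_pillar):
            dx, dy = OFFSETS[i]
            placement.append((i % layers, px + dx, py + dy))
    return placement


def vertically_stacked(placement):
    """True if any two CPUs share the same (x, y) column."""
    columns = [(x, y) for (_, x, y) in placement]
    return len(columns) != len(set(columns))


# 8 CPUs over 2 layers, sharing 2 pillars (pillar coordinates are made up).
cpus = place_cpus(pillars=[(2, 2), (6, 2)], cpus_per_pillar=4, layers=2)
```

The pillars in this example are far enough apart that their offset positions do not collide; a real placement would also have to check that and weigh the thermal profile.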
3D L2 Caches
• Clusters – Cache banks + tag array
– Some clusters have CPUs, others don’t.
Cache Management
• Search
• Placement & Replacement
• Cache Line Migration
L2 Cache Management
Simulation Environment
• Simics + in-house NoC simulator
• All CPUs issue in-order
– 8 CPUs, SPARC ISA
– Directory-based protocol for coherence between L1s and the L2
• HS3d for temperature modeling
• 64 MB and 32 MB L2 caches
Performance
[Figure: IPC (axis 0 to 3.5) per benchmark (ammp, apsi, art, equake, fma3d, galgel, mgrid, swim, wupwise) for CMP-DNUCA, CMP-DNUCA-3D and CMP-SNUCA-3D.]
Important Results
Important Results (2)
Impact of # of “pillars” on access latency
Important Results (3)
Final Word
• 3D is feasible & scalable… and has arrived.
• Localized hotspots can be solved by placing hotter units apart.
• Power savings + performance gain even without data migration
– No numbers to support the claim(!)
– Would that help the temperature issue as well?
Potential HPCA Submission
• An evaluation of temperature and IPC for a single-core 3D processor
• Leverage clustered architectures for “temperature aware” processor designs.
– Basic premise: stacking cooler units (caches) on top of hotter units
• Better thermal profile of processor
Proposals
[Figure: three candidate arrangements (Arch 1, Arch 2, Arch 3) of cache banks and clusters.]
Proposals (2)
• Cache banks (both data and instruction) are either
– 2-way word-interleaved, or
– Replicated
• Present study done for 8-cluster architecture
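To make the interleaved option concrete, the sketch below shows how a byte address could be mapped to one of two word-interleaved banks. The 8-byte word size and the names `WORD_BYTES`, `NUM_BANKS` and `bank_for_address` are assumptions for illustration; the slides do not give these details.

```python
WORD_BYTES = 8   # assumed word size; the slides do not state one
NUM_BANKS = 2    # 2-way interleaving, as on the slide


def bank_for_address(addr):
    """Bank that holds the word containing byte address `addr`."""
    return (addr // WORD_BYTES) % NUM_BANKS


# Consecutive words alternate between bank 0 and bank 1.
banks = [bank_for_address(a) for a in (0, 8, 16, 24)]
```

With replication, by contrast, every bank would hold a full copy, so any bank could serve a given address and no mapping of this kind would be needed.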
Results (Performance)
2-way word interleaved caches
Results (Performance)
Replicated caches
Traffic Analysis
[Figure: number of accesses (axis 0 to 25,000,000) per benchmark for Arch 1. Series: RINGHOPCOUNT, TOTALD2DHOPCOUNT, INTERCLUSTER RINGHOP FOR CACHE. Benchmarks: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise.]
Traffic Analysis (2)
[Figure: number of accesses (axis 0 to 25,000,000) per benchmark for Arch 2. Series: RINGHOPCOUNT, TOTALD2DHOPCOUNT, INTERCLUSTER RINGHOP FOR CACHE. Benchmarks: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise.]
Results (Thermal)
[Figure: peak temperature of the hottest unit (C), axis 0 to 400, per benchmark for BASE, ARCH 1 and ARCH 2.]