
In memory of Stamatis
The Paradigm Shift to Multi-Cores:
Opportunities and Challenges
Per Stenstrom
Department of Computer Science & Engineering
Chalmers University of Technology
Sweden
An Unwanted Paradigm Shift
[Figure: single-processor performance over time; annual performance growth slowed from 60% to 30%.]
• Clock frequency couldn’t be pushed higher
• Traditional parallelism exploitation didn’t pay off
The Easy Way Out: Replicate
Year        2006   2009   2012   2015
# Cores        4     16     64    256
# Threads     16     64    256   1024

• Moore’s Law: 2X cores every 18 months
  – Implication: About a hundred cores in five years
• BUT: Software can only make use of one!
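The doubling trend in the table can be checked with a quick sketch (the 2006 baseline of 4 cores is taken from the table above; the 18-month doubling period is the rule of thumb quoted on the slide):

```python
def projected_cores(year, base_year=2006, base_cores=4, doubling_years=1.5):
    """Core count projected by doubling every 18 months (1.5 years)."""
    doublings = (year - base_year) / doubling_years
    return round(base_cores * 2 ** doublings)

for year in (2006, 2009, 2012, 2015):
    print(year, projected_cores(year))   # reproduces the table's cores row
```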
Main Challenges
– Programmability
– Scalability
We want to seamlessly scale up application performance within the power envelope.
Vision: Multiple Cores = One Processor
[Diagram: application SW (existing and new) runs on a system software infrastructure over a multi-core chip of processors (P) and memory (M).]
Requires a concerted action across layers:
programming model, compiler, architecture
How can Architects Help?
[Diagram: processors (P) sharing a cache hierarchy; two opportunity areas are highlighted: on-chip cache management and support for enhancing programmability.]
• What is the best use of the many transistors?
”Inherent” Speculative Parallelism
[Islam et al. ICPP 2007]
[Figure: speedup (0 to 24) from inherent speculative parallelism across the benchmarks autocor00, fbital00, viterb00, rgbcmy01, rgbhpg01, rgbyiq01, conven00, fft00, cjpeg and djpeg, with two geometric means (GMean 1, GMean 2); one group is marked as representative of what is possible today.]
Scaling beyond eight cores will need manual efforts
Three Hard Steps
1. Decomposition – goal: expose concurrency, but beware of thread-management overhead
2. Assignment – goal: balance load and reduce communication
3. Orchestration – goal: coordinate threads to reduce communication and synchronization costs
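The three steps show up in even a tiny parallel program. A minimal sketch in Python (illustrative only: CPython threads add no CPU speedup, but the decompose/assign/orchestrate structure is the same in any threading model):

```python
import threading

def parallel_sum(data, n_threads=4):
    """Sum `data` using the three hard steps."""
    # 1. Decomposition: split the work into chunks (expose concurrency,
    #    but keep chunks big enough to amortize thread-management overhead).
    chunk = (len(data) + n_threads - 1) // n_threads
    partial = [0] * n_threads

    def worker(i):
        # 2. Assignment: thread i statically owns one contiguous slice
        #    (balanced load, no communication between workers).
        partial[i] = sum(data[i * chunk:(i + 1) * chunk])

    # 3. Orchestration: spawn all workers, then one join barrier; the
    #    only synchronization cost is the final reduction.
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)

print(parallel_sum(list(range(1000))))  # 499500
```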
Transactional Memory
Transactional memory provides a safety net for data races and hence simplifies coordination.

[Diagram: thread T1 loads A while thread T2 stores to A; the conflict squashes T1's transaction, which then re-executes the load.]

• Research is warranted into high-productivity programming interfaces
• Transactional memory is a good starting point
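As a software analogue of the squash-and-re-execute behaviour on the slide, here is a toy optimistic "transaction" on a single versioned variable (purely illustrative: real transactional memory tracks full read/write sets in hardware or an STM runtime; `TVar` and `atomic_add` are invented names for this sketch):

```python
import threading

class TVar:
    """A versioned variable supporting optimistic read / validated commit."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self._lock = threading.Lock()

    def read(self):
        return self.value, self.version      # LD A (plus the version we saw)

    def try_commit(self, seen_version, new_value):
        # Commit only if nobody wrote since we read; otherwise "squash".
        with self._lock:
            if self.version != seen_version:
                return False                  # conflict: caller re-executes
            self.value, self.version = new_value, self.version + 1
            return True                       # ST A committed atomically

def atomic_add(var, delta):
    while True:                               # re-execute until commit succeeds
        val, ver = var.read()
        if var.try_commit(ver, val + delta):
            return

a = TVar(0)
threads = [threading.Thread(target=atomic_add, args=(a, 1)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a.value)  # 8 -- no increment is lost despite the race on A
```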
Transistors can Help Programmers
Recall the ”hard steps”: decomposition, assignment, orchestration.
Opportunities abound:
• Low-overhead spawning mechanisms
• Load balancing supported in HW
• Communication balancing supported in HW
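A software stand-in for hardware-assisted load balancing is a shared work queue from which idle workers pull tasks, so uneven task sizes balance dynamically (a sketch only; the squaring "task body" is a hypothetical placeholder for real work, and the slide's point is that hardware could make this dispatch nearly free):

```python
import queue
import threading

def process_all(tasks, n_workers=4):
    """Run tasks on n_workers threads with dynamic (pull-based) assignment."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)                      # all tasks enqueued before workers start
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()    # idle worker grabs the next task
            except queue.Empty:
                return                # queue drained: this worker is done
            r = t * t                 # hypothetical stand-in for real work
            with lock:
                results.append(r)

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sorted(results)

print(process_all(range(5)))  # [0, 1, 4, 9, 16]
```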
Processor/Memory Gap
[Diagram: three nodes, each with processors (P) over a cache hierarchy and a memory; the processor-memory (P-M) speed gap keeps widening.]
How to bridge it?
Adaptive Shared Caches
[Dybdahl & Stenstrom HPCA 2007]
[Diagram: two processors (P1, P2), each with a private L1, above three L2 organizations: shared, private, and adaptive hybrid.]

              Shared   Private   Adaptive Hybrid
Conflicts       --       +++          +++
Speed           --       +++          +++
Utilization    +++       ---          +++
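The conflict column can be illustrated with a toy trace-driven sketch (this is not the adaptive mechanism of the paper, just two fixed configurations showing why a streaming core hurts its neighbour in a shared LRU cache; all sizes and traces here are invented for the example):

```python
from collections import OrderedDict

class LRUCache:
    """Fully-associative LRU cache of `capacity` lines; counts misses."""
    def __init__(self, capacity):
        self.capacity, self.lines, self.misses = capacity, OrderedDict(), 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)      # hit: refresh recency
        else:
            self.misses += 1                  # miss: fill, maybe evict LRU
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)

# Core 1 cycles a small 4-line working set; core 2 streams with no reuse.
trace, nxt = [], 1000
for i in range(100):
    trace.append(("c1", i % 4))
    for _ in range(4):
        trace.append(("c2", nxt))
        nxt += 1

def core1_misses(cache_for):
    """Replay the trace; cache_for maps core id to its cache (possibly shared)."""
    m = 0
    for core, addr in trace:
        c = cache_for[core]
        before = c.misses
        c.access(addr)
        if core == "c1":
            m += c.misses - before
    return m

shared = LRUCache(8)                               # one 8-line cache for both
print(core1_misses({"c1": shared, "c2": shared}))  # 100: streaming evicts c1
p1, p2 = LRUCache(4), LRUCache(4)                  # 4 private lines each
print(core1_misses({"c1": p1, "c2": p2}))          # 4: working set fits
```

An adaptive hybrid, per the paper, moves the partition boundary at runtime: it keeps the private-like protection from conflicts while reclaiming capacity a core is not using.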
Scaling-Up Off-chip Bandwidth
[Diagram: several multi-core chips, each with its own cache hierarchy, all funnelling traffic to memory through an off-chip bandwidth bottleneck.]
BW does not scale with Moore’s law unless optics or
other disruptive technologies change the rules
Memory/Cache Link Compression
[Thuresson & Stenstrom, IEEE TC to appear]
[Figure: results (0 to 25) for gzip, vpr, gcc, perl and their average (AVER), comparing ”swc-data + oh” against ”swc + PVC”.]
Our combined scheme yields a 3X reduction in bandwidth
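Link compression pays off because memory traffic has high value locality. A toy illustration of the idea (this is not the scheme of the cited paper; it is a simple frequent-value table in which the most common 32-bit words cross the link as a short index, with one escape code for everything else):

```python
from collections import Counter

def compression_ratio(words, table_size=7):
    """Toy frequent-value link compression.

    The table_size most common words are sent as a 3-bit index; all other
    words as the 3-bit escape code plus the full 32-bit word.  Returns
    uncompressed bits / compressed bits.  (3 bits give 8 codes: 7 table
    entries + 1 escape, hence the default table_size of 7.)
    """
    frequent = {w for w, _ in Counter(words).most_common(table_size)}
    sent_bits = sum(3 if w in frequent else 3 + 32 for w in words)
    return 32 * len(words) / sent_bits

# Synthetic trace: memory traffic is rich in repeated words (zeros, small
# integers, recurring pointers), so a tiny table captures most transfers.
trace = [0] * 700 + [1] * 200 + list(range(100, 200))
print(round(compression_ratio(trace), 1))  # roughly 5X on this toy trace
```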
Summary
• Multi-cores promise scalable performance under
a manageable power envelope, but are hard to
program
• Providing scalable application performance in the future requires research at all levels
– Architecture (processor, cache, interconnect)
– Compiler
– Programming model
These topics are dealt with in the FET SARC IP
and in the HiPEAC network of excellence