
Parallelism: A Serious Goal or a Silly Mantra
(some half-thought-out ideas)
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
– Break the problem → Use half the energy
– 1000 mickey mouse cores
– Hardware is sequential
– Server throughput (how many pins?)
– What about GPUs and Databases?
• Current barriers to exploiting parallelism (or are they?)
– Dark silicon
– Amdahl’s Law
– The Cloud
• The answer
– The fundamental concept vis-à-vis parallelism
– What it means re: the transformation hierarchy
It starts with the raw material (Moore’s Law)
• The first microprocessor (Intel 4004), 1971
– 2,300 transistors
– 108 kHz
• The Pentium chip, 1993
– 3.1 million transistors
– 66 MHz
• Today
– more than one billion transistors
– Frequencies in excess of 5 GHz
• Tomorrow ?
And what we have done with this raw material
[Figure: number of transistors per chip over time, showing more and more of the transistor budget going to cache alongside the microprocessor logic]
Too many people do not realize:
Parallelism did not start with Multi-core
• Pipelining
• Out-of-order Execution
• Multiple operations in a single microinstruction
• VLIW (horizontal microcode exposed to the software)
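Pipelining alone shows how much parallelism pre-multicore hardware already exploited. A toy cycle-count model makes the point (the 5-stage, 1000-instruction numbers are illustrative, not from the talk):

```python
# Illustrative sketch: why pipelining is parallelism.
# A k-stage pipeline overlaps instructions, so n instructions take
# roughly n + k - 1 cycles instead of n * k.

def unpipelined_cycles(n_instructions, stages):
    """Each instruction occupies the whole datapath for `stages` cycles."""
    return n_instructions * stages

def pipelined_cycles(n_instructions, stages):
    """After a (k - 1)-cycle fill, one instruction completes per cycle."""
    return n_instructions + stages - 1

n, k = 1000, 5
speedup = unpipelined_cycles(n, k) / pipelined_cycles(n, k)
print(f"{speedup:.2f}x")  # approaches k = 5 for large n
```

The same overlapping idea, at different granularities, underlies out-of-order execution and VLIW.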
One thousand mickey mouse cores
• Why not a million? Why not ten million?
• Let’s start with 16
– What if we could replace 4 with one more powerful core?
• …and we learned:
– One more powerful core is not enough
– Sometimes we need several
– Morphcore was born
– BUT not all morphcore (fixed function vs. flexibility)
The Asymmetric Chip Multiprocessor (ACMP)
[Figure: three 16-tile chip layouts. "Tile-Large" approach: 4 large cores. "Niagara" approach: 16 Niagara-like small cores. ACMP approach: 1 large core plus 12 Niagara-like small cores]
Large Core vs. Small Core
Large Core:
• Out-of-order
• Wide fetch, e.g. 4-wide
• Deeper pipeline
• Aggressive branch predictor (e.g. hybrid)
• Many functional units
• Trace cache
• Memory dependence speculation
Small Core:
• In-order
• Narrow fetch, e.g. 2-wide
• Shallow pipeline
• Simple branch predictor (e.g. gshare)
• Few functional units
Throughput vs. Serial Performance
[Figure: speedup vs. 1 large core (y-axis, 0 to 9) as a function of degree of parallelism (x-axis, 0 to 1), for the Niagara, Tile-Large, and ACMP configurations]
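Curves like these can be approximated with a simple area-budget model in the style of Hill and Marty's multicore Amdahl analysis. The specific assumptions below (a 16-tile budget, a large core costing 4 tiles and delivering 2x a small core's performance, perfect load balancing) are mine, not numbers from the talk:

```python
# Speedup vs. one large core for the three 16-tile chip organizations.
# Assumptions: large core = 4 tiles at 2x small-core performance;
# a fraction f of the work is parallel, (1 - f) is serial.

def exec_time(f, serial_perf, parallel_perf):
    """Normalized execution time: serial part plus parallel part."""
    return (1 - f) / serial_perf + f / parallel_perf

def speedup_vs_one_large(f, serial_perf, parallel_perf):
    one_large = exec_time(f, 2, 2)  # everything runs on one large core
    return one_large / exec_time(f, serial_perf, parallel_perf)

f = 0.9
niagara    = speedup_vs_one_large(f, 1, 16)  # 16 small cores
tile_large = speedup_vs_one_large(f, 2, 8)   # 4 large cores (4 x perf 2)
acmp       = speedup_vs_one_large(f, 2, 14)  # 1 large + 12 small

# ACMP keeps the large core for the serial part and still has many
# small cores for the parallel part, so it wins at moderate f.
```

Under these assumptions the ACMP leads at f = 0.9; only at f = 1 does the all-small-core Niagara organization pull ahead.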
Server throughput
• The Good News: Not a software problem
– Each core runs its own problem
• The Bad News: How many pins?
– Memory bandwidth
• More Bad News: How much energy?
– Each core runs its own problem
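The "how many pins?" worry is simple arithmetic. A back-of-envelope sketch (every number below is an illustrative assumption, not a figure from the talk):

```python
# Why pin/memory bandwidth, not software, limits server throughput:
# independent per-core jobs multiply memory demand, but the pins do not.

cores           = 64    # independent jobs, one per core
bw_per_core_gbs = 2.0   # GB/s each job demands from DRAM
pin_count       = 128   # pins available for the memory interface
gbs_per_pin     = 0.5   # achievable bandwidth per pin

demand = cores * bw_per_core_gbs  # 128 GB/s wanted
supply = pin_count * gbs_per_pin  # 64 GB/s available
# With these numbers the cores see only half the bandwidth they demand,
# so they stall -- wasting both throughput and energy.
```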
What about GPUs and Databases?
• In theory, absolutely!
• GPUs (SMT + SIMD + Predication)
– Provided there are no conditional branches (Divergence)
– Provided memory accesses line up nicely (Coalescing)
• Databases
– Provided there are no critical sections
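The divergence caveat is easy to see in a toy lockstep model (the warp size and path costs below are illustrative assumptions, not from the talk):

```python
# Why branch divergence hurts SIMD/GPU throughput: a "warp" of lanes
# executes in lockstep, so if lanes disagree at a branch the hardware
# runs each path serially, masking off the lanes on the other path.

def warp_steps(take_branch, then_cost, else_cost):
    """Cycles for one if/else executed by a warp of lockstep lanes."""
    any_then = any(take_branch)           # some lane takes the branch
    any_else = not all(take_branch)       # some lane falls through
    return then_cost * any_then + else_cost * any_else

uniform   = warp_steps([True] * 32, then_cost=10, else_cost=10)
divergent = warp_steps([i % 2 == 0 for i in range(32)], 10, 10)
# uniform warp: 10 cycles; divergent warp: 20 -- both paths, serially.
```

Uncoalesced memory accesses hurt for the analogous reason: one lockstep access fans out into many serialized memory transactions.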
Dark Silicon
• Too many transistors: we cannot power them all
– All those cores powered down
– All that parallelism wasted
• Not really: The Refrigerator! (aka: Accelerators)
– Fork (in parallel)
– Although not all at the same time!
Amdahl’s Law
• The serial bottleneck always limits performance
• Heterogeneous cores AND control over them can minimize the effect
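The serial-bottleneck claim is the textbook formula; stated in a few lines (the parallel fraction f and core count n are the usual textbook parameters, not numbers from the talk):

```python
# Amdahl's Law: with a fraction f of the work parallelized over n cores,
# speedup = 1 / ((1 - f) + f / n). The serial fraction (1 - f) is the
# bottleneck: no amount of cores can push speedup past 1 / (1 - f).

def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

s = amdahl_speedup(0.9, 1000)  # even 1000 cores give under 10x
limit = 1.0 / (1.0 - 0.9)      # 10.0: the n -> infinity bound
```

This is exactly why heterogeneous cores help: a faster core shrinks the (1 - f) term that the core count cannot touch.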
The Cloud
• It is behind the curtain; how do we manage it?
• Answer: the on-chip run-time system
• Answer: Pragmas beyond the Cloud
The fundamental concept:
Synchronization
Problem
Algorithm
Program
ISA (Instruction Set Arch)
Microarchitecture
Circuits
Electrons
At every layer we synchronize
• Algorithm: task dependencies
• ISA: sequential control flow (implicit)
• Microarchitecture: ready bits
• Circuits: clock cycle (implicit)
Who understands this?
• Should this be part of students’ parallelism education?
• Where should it come in the curriculum?
• Can students even understand these different layers?
Parallel to Sequential to Parallel
• Guri says: think sequential, execute parallel
– i.e. don’t throw away 60 years of computing experience
– The original HPS model of out-of-order execution
– Synchronization is obvious: restricted data flow
• At the higher level, parallel at larger granularity
– Pragmas in JAVA? Who would have thought!
– Dave Kuck’s CEDAR project, vintage 1985
– Synchronization is necessary: coarse-grain data flow
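"Think sequential, execute parallel" can be sketched in a few lines. The four-instruction program and the ready-set scheduler below are my illustrative toy, in the spirit of the HPS restricted-data-flow model, not code from the talk:

```python
# Instructions are written in sequential program order, but a ready-bit
# scheduler fires any instruction whose source operands are available.

# Each instruction: (destination, sources). Program order: i0..i3.
program = [
    ("a", []),          # i0: a = load
    ("b", []),          # i1: b = load       (independent of i0)
    ("c", ["a", "b"]),  # i2: c = a + b      (waits on i0 and i1)
    ("d", ["a"]),       # i3: d = a * 2      (only waits on i0)
]

ready = set()           # "ready bits": values produced so far
cycles = []             # which instructions fired in each cycle
pending = list(enumerate(program))
while pending:
    # Fire every instruction whose sources are all ready (one "cycle").
    fired = [(i, (dst, srcs)) for i, (dst, srcs) in pending
             if all(s in ready for s in srcs)]
    pending = [p for p in pending if p not in fired]
    for _, (dst, _srcs) in fired:
        ready.add(dst)
    cycles.append([i for i, _ in fired])

# The four sequential instructions complete in two data-flow cycles:
# cycle 0 fires i0 and i1; cycle 1 fires i2 and i3.
```

The synchronization is exactly the ready bits: nothing else is needed, which is why it is "obvious" at this layer.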
Can we do more?
• The run-time system – part of the chip design
– The chip knows the chip resources
– On-chip monitoring can supply information
– The run-time system can direct the use of those resources
• The Cloud – the other extreme, and today’s be-all
– How do we harness its capability?
– What is needed from the hierarchy to make it work?
My message
• Parallelism is a serious goal
IF we want to solve the most challenging problems
(Cure cancer, predict tsunamis)
• Telling people to think parallel is nice, but often silly
• Examining the transformation hierarchy
and seeing where we can leverage it
seems to me a sounder approach