CISC 879 : Advanced Parallel Programming

Importance of Single-core in Multicore Era
Toshinori Sato, Hideki Mori, Rikiya Yano, Takanori Hayashida
- Fukuoka University, Japan
Published: Thirty-fifth Australasian Computer Science Conference
Vaibhav Naidu
Dept. of Computer & Information Sciences
University of Delaware
Outline
• Introduction
• Motivation
• Searching for the best Multicore
• Single-core Performance improvement
• Results
• Conclusion
Introduction
• Pollack’s rule:
Processor performance is proportional to the
square root of the area of the processor.
• Amdahl’s law:
The speedup from using multiple processors in
parallel computing is limited by the time needed for the
sequential fraction of the program.
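Both rules can be written as one-line formulas. The sketch below is a minimal illustration, not code from the paper: the function names and the sample values (a core with 4 times the baseline area, a 90% parallel program on 16 processors) are assumptions chosen for this example.

```python
import math


def pollack_performance(area: float) -> float:
    """Pollack's rule: performance relative to a baseline core of area 1."""
    return math.sqrt(area)


def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's law: speedup over one processor for a given parallel fraction."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


# Hypothetical sample values, chosen only for illustration.
print(f"4x area core: {pollack_performance(4):.2f}x performance")
print(f"90% parallel, 16 processors: {amdahl_speedup(0.9, 16):.2f}x speedup")
```

Quadrupling the area only doubles performance, and with a 10% sequential fraction the speedup stays below 10 no matter how many processors are added.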
Motivation
Spending additional transistors on more cores per
chip might not be the best choice.
What would be the best configuration of a
multicore processor?
How do we improve single-core performance?
Searching for the best Multicore
• As the number of transistors on a chip increases, the
flexibility to determine a processor configuration
also increases.
• With this flexibility, it is not obvious which
configuration is best, for example how many cores
it should have.
Searching for the best Multicore
Processor Topologies:
1. Single-core:
To improve performance in the future, one option is to increase
the size of the core. All transistors on the chip are used by a
single core.
2. Many-core:
The core microarchitecture is fixed and multiple copies of the
core are integrated on the chip.
Searching for the best Multicore
3. Heterogeneous Multicore:
Only one core becomes large and other cores remain small.
4. Scalable Multicore:
A collection of small cores that can logically fuse together to
compose a high-performance large core.
5. Dynamically Configurable:
The processor cores can combine together to form a larger
core.
Single-core vs Many-core
• Single-core:
As the core becomes larger, the performance gained per
unit of area shows diminishing returns (Pollack’s rule).
• Many-core:
If only a small fraction of the code is parallelizable, the
speedup is limited (Amdahl’s law).
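The trade-off above can be made concrete by giving both designs the same area budget. This is a sketch under stated assumptions: area is counted in multiples of the baseline core, the single core follows Pollack's rule, the many-core uses one baseline core per unit of area and follows Amdahl's law, and the function names and parallel fractions are hypothetical.

```python
import math


def single_core(area: int) -> float:
    """One large core using the whole area budget (Pollack's rule)."""
    return math.sqrt(area)


def many_core(area: int, parallel_fraction: float) -> float:
    """`area` baseline cores, each with performance 1 (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / area)


# Hypothetical parallel fractions and area budgets.
for f in (0.5, 0.9, 0.99):
    for area in (4, 16, 64):
        print(f"f={f:<4} area={area:<2} "
              f"single={single_core(area):5.2f} many={many_core(area, f):5.2f}")
```

In this model, with half of the code sequential the large single core wins at 16 times the baseline area, while with 99% parallel code the many-core wins.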
Single-core vs Many-core
Figure: X-axis: area as a multiple of the baseline processor; Y-axis: performance improvement rate.
Single-core vs Heterogeneous Multicore
• Heterogeneous Multicore:
Heterogeneous multicores are widely studied for
improving energy efficiency.
Parallelized portions are executed by multiple
small cores and hard-to-parallelize portions are
executed by a big strong core.
Interestingly, the performance is equivalent
regardless of the big core’s size.
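The division of work described above can be sketched as a simple analytical model. The assumptions here are mine, in the spirit of Hill and Marty's "Amdahl's Law in the Multicore Era" and not necessarily the exact model used in the paper: the big core occupies `big_core_area` baseline units and follows Pollack's rule, the remaining area holds one baseline core per unit, the sequential part runs only on the big core, and the parallel part runs only on the small cores. All names are hypothetical.

```python
import math


def heterogeneous(area: int, big_core_area: int, parallel_fraction: float) -> float:
    """Speedup over one baseline core for one big core plus many small cores."""
    if not 0 < big_core_area < area:
        raise ValueError("the big core must leave room for at least one small core")
    big_core_perf = math.sqrt(big_core_area)   # Pollack's rule
    small_cores = area - big_core_area         # one baseline core per unit of area
    sequential_time = (1.0 - parallel_fraction) / big_core_perf
    parallel_time = parallel_fraction / small_cores
    return 1.0 / (sequential_time + parallel_time)


# Hypothetical budget of 16 baseline units and a 90% parallel program.
for big in (1, 4, 8):
    print(f"big core area {big}: speedup {heterogeneous(16, big, 0.9):.2f}")
```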
Single-core vs Heterogeneous Multicore
Figure: X-axis: area as a multiple of the baseline processor; Y-axis: performance improvement rate.
Heterogeneous Multicore vs Scalable Homogeneous
• Scalable Homogeneous:
They have a smaller number of larger cores.
Using 3 large cores is sometimes preferable
to using 6 small cores.
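The 3-versus-6 comparison can be tried out with a small model. This is a sketch under assumptions that are not stated on the slide: the area budget is 6 baseline units, every core has the same area `core_area`, each core's performance follows Pollack's rule, the sequential part runs on one core, and the parallel part is spread evenly over all cores. Function and parameter names are hypothetical.

```python
import math


def homogeneous(area: int, core_area: int, parallel_fraction: float) -> float:
    """Speedup over one baseline core for area/core_area identical cores."""
    if area % core_area != 0:
        raise ValueError("area must be a multiple of core_area")
    cores = area // core_area
    core_perf = math.sqrt(core_area)           # Pollack's rule per core
    sequential_time = (1.0 - parallel_fraction) / core_perf
    parallel_time = parallel_fraction / (core_perf * cores)
    return 1.0 / (sequential_time + parallel_time)


# Hypothetical parallel fractions; 6 small cores vs 3 cores of twice the area.
for f in (0.5, 0.99):
    six_small = homogeneous(6, 1, f)
    three_large = homogeneous(6, 2, f)
    print(f"f={f}: 6 small = {six_small:.2f}, 3 large = {three_large:.2f}")
```

Under these assumptions the three larger cores come out ahead when half of the program is sequential, and the six small cores come out ahead when 99% of it is parallel.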
Heterogeneous Multicore vs Scalable Homogeneous
Figure: X-axis: area as a multiple of the baseline processor; Y-axis: performance improvement rate.
Heterogeneous vs Dynamically Configurable
• Dynamically Configurable:
They dynamically configure the cores and the size
of each core.
Heterogeneous vs Dynamically Configurable
Figure: X-axis: area as a multiple of the baseline processor; Y-axis: performance improvement rate.
Heterogeneous vs Dynamically Configurable
• Dynamic reconfiguration suffers a penalty of approx. 25%
(0.8 DC-n & 0.8 DC-8).
• As the number of cores increases, it becomes
difficult to combine all cores due to the increasing
complexity of interconnects.
• The red dashed line represents the current technology.
• The 0.8 DC-8 is the most practical dynamically
configurable processor, and its performance is not
as good as that of the heterogeneous multicore.
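A rough model helps show why limiting fusion to 8 cores matters. Everything below is an assumption made for illustration and is not confirmed by the slides: "DC-n" is read as "all n baseline cores can fuse" and "DC-8" as "at most 8 can fuse"; the 0.8 is read as a multiplicative performance factor on the fused core (1/0.8 = 1.25, i.e. about 25% more time); the fused core follows Pollack's rule; and the parallel part runs on all baseline cores without penalty.

```python
import math


def dynamically_configurable(area: int, parallel_fraction: float,
                             max_fused: int, factor: float = 0.8) -> float:
    """Speedup over one baseline core; names and model are hypothetical."""
    fused = min(area, max_fused)                    # cores combined for sequential code
    sequential_perf = factor * math.sqrt(fused)     # Pollack's rule with a penalty factor
    sequential_time = (1.0 - parallel_fraction) / sequential_perf
    parallel_time = parallel_fraction / area        # all baseline cores run in parallel
    return 1.0 / (sequential_time + parallel_time)


# Hypothetical budget of 64 baseline cores and a 90% parallel program.
area, f = 64, 0.9
print(f"0.8 DC-n: {dynamically_configurable(area, f, max_fused=area):.2f}")
print(f"0.8 DC-8: {dynamically_configurable(area, f, max_fused=8):.2f}")
```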
Single-Core performance improvement
• Increasing clock frequency has been the easiest
way to improve performance.
• But a higher clock frequency requires a higher supply
voltage, resulting in serious power and temperature problems.
• What is needed is a technique that increases the clock
frequency without increasing the supply voltage.
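The cost of raising the voltage can be estimated with the standard first-order CMOS relation, in which dynamic power scales with capacitance times voltage squared times frequency. This relation is textbook background rather than something taken from the slides, and the function name and scale factors below are hypothetical.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Dynamic power relative to baseline, assuming P is proportional to C * V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


# Hypothetical 20% frequency increase, with and without a matching voltage increase.
print(f"frequency only: {relative_dynamic_power(1.0, 1.2):.2f}x power")
print(f"frequency and voltage: {relative_dynamic_power(1.2, 1.2):.2f}x power")
```

In this model, keeping the voltage fixed makes power grow only linearly with frequency, which is the motivation for boosting the clock without touching the supply voltage.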
Cool Turbo Boost
• Intel’s Turbo Boost Technology increases the
supply voltage and thus the clock frequency.
• Cool Turbo Boost technology does not require an
increase in supply voltage.
• When hardware becomes smaller and less complex,
there is an opportunity to increase its clock
frequency (e.g. Intel Atom).
Cool Turbo Boost
• Datapath:
A collection of functional units, such as arithmetic
logic units or multipliers, that perform data processing
operations, together with registers and buses.
• When the datapath becomes small, its computing
performance is degraded.
• If the performance loss is not compensated by the
clock frequency boost, then the processor
performance is diminished.
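The compensation condition can be stated numerically. The sketch below assumes that performance is the product of instructions per cycle (IPC) and clock frequency, that narrowing the datapath removes a fraction `ipc_loss` of the IPC, and that the boost multiplies the frequency by `boost_ratio`. The names and the 20% loss are hypothetical and are not figures from the paper.

```python
def relative_performance(ipc_loss: float, boost_ratio: float) -> float:
    """Performance relative to the wide datapath at the base clock."""
    return (1.0 - ipc_loss) * boost_ratio


def break_even_boost(ipc_loss: float) -> float:
    """Smallest boost ratio that fully compensates the IPC loss."""
    return 1.0 / (1.0 - ipc_loss)


# Hypothetical 20% IPC loss from the narrow datapath.
loss = 0.2
print(f"break-even boost ratio: {break_even_boost(loss):.2f}")
for boost in (1.1, 1.25, 1.4):
    print(f"boost {boost}: relative performance {relative_performance(loss, boost):.2f}")
```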
Cool Turbo Boost
• Instruction level parallelism (ILP):
The number of operations in a computer program
that can be performed simultaneously.
• When ILP is small, a small datapath is enough;
otherwise, the datapath should not be reduced.
• Hence, the datapath is dynamically configured
according to the ILP in each program phase.
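The per-phase decision can be sketched as a toy simulation. This is an illustration under assumptions of my own, not the authors' mechanism: a phase is a pair of instruction count and available ILP, IPC is capped by the smaller of the ILP and the datapath width, the wide datapath issues 4 operations per cycle and the narrow one 2, and only the narrow configuration runs at the boosted clock. All names and numbers are hypothetical.

```python
def phase_time(instructions: float, ilp: float, width: int, frequency: float) -> float:
    """Time for one phase; IPC is capped by both the ILP and the datapath width."""
    ipc = min(ilp, width)
    return instructions / (ipc * frequency)


def total_time(phases, boost_ratio: float, adaptive: bool = True,
               wide: int = 4, narrow: int = 2) -> float:
    """Total time over all phases, with or without ILP-based reconfiguration."""
    total = 0.0
    for instructions, ilp in phases:
        if adaptive and ilp <= narrow:
            # Low-ILP phase: the narrow datapath loses nothing, so boost the clock.
            total += phase_time(instructions, ilp, narrow, boost_ratio)
        else:
            # High-ILP phase: keep the wide datapath at the base clock.
            total += phase_time(instructions, ilp, wide, 1.0)
    return total


# Hypothetical program with one low-ILP phase and one high-ILP phase.
phases = [(1000, 1.5), (1000, 3.5)]
fixed_time = total_time(phases, boost_ratio=1.4, adaptive=False)
adaptive_time = total_time(phases, boost_ratio=1.4)
print(f"fixed wide datapath: {fixed_time:.1f} time units")
print(f"ILP-adaptive datapath: {adaptive_time:.1f} time units")
```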
Cool Turbo Boost
• Multiple Clustered-Core Processor (MCCP):
Configures its datapath according to the ILP and
thread level parallelism (TLP) in the program.
• The authors configure MCCP so that its clock
frequency is increased when its datapath is
configured to be small.
Results
• Six programs from SPECint2000 are used and
executed for 2 billion instructions.
Figures: Narrow Datapath Results; Cool Turbo Boosting Results (X-axis: boosting ratio; Y-axis: normalized single-core performance).
Results
• The average performance loss of the narrow datapath is
36.1%, and that of Cool Turbo Boost is only 4.2%.
• When the boosting ratio reaches 1.4 and 1.6, the
performance is improved by 5.0% and 8.7% on
average, respectively.
• For the group represented by parser (gzip, vpr and
parser), the performance of Cool Turbo Boost is not good,
whereas for the group represented by vortex (gcc and
vortex), the performance is better regardless of the
boosting ratio.
Conclusion
• The paper investigates the best multicore configuration
for the near future; the winner is the heterogeneous
multicore.
• It shows that single-core performance is the
key to improving the performance of the
heterogeneous multicore in the near future.
• The average performance improvement using
Cool Turbo Boost is only 5%. Hence,
further study is needed in this area.
Questions?
Thank you