
Increased Scalability and Power Efficiency through
Multiple Speed Pipelines
Emil Talpes, Diana Marculescu
Electrical and Computer Engineering Department
Carnegie Mellon University
Flywheel
• This microarchitecture relies on multiple parts of the processor running at different clock frequencies
• Essentially, we assume a fast execution core that is fed by a slow pipeline
• This slow section can be decoupled, leaving the core to run as fast as possible
• More efficient than the baseline in terms of energy per computation
International Symposium on Computer Architecture, May 2005
Outline
• Motivation
• Flywheel
• Microarchitectural design
• Experimental results
• Conclusions
Moore's Law
"Moore's Law has been the name given to everything that changes exponentially in the industry. I say, if Gore invented the Internet, I invented the exponential."
– Gordon Moore
The number of transistors that can be placed on a chip will double every 2 years
• Transistors get smaller, faster, cheaper
• Allows faster clock speeds and increased parallelism (temporal and spatial)
Design problems
• Larger structures – slower access
• Wider pipelines – increased complexity, harder to keep fed with instructions
Future problems
• Interconnect-dominated structures – wires do not scale as transistors do
• Power consumption – limited capabilities for cooling and power delivery
Optimal clock frequency
Is it worth increasing the clock speed further, or is it a lost cause?
Sprangle et al. (ISCA 2002)
• Use a Pentium 4-like microarchitecture (3-way, x86 machine)
• Estimate that performance can still be improved up to about 50 pipeline stages (running at about twice the clock speed of the Pentium 4)
Hartstein et al. (ISCA 2002)
• Use a 4-way machine running the S/390 ISA
• Estimate that the optimal pipeline length is around 25 stages, still longer than the currently used pipelines
Getting there
When trying to increase the clock frequency further:
• Transistors become faster than wires – Horowitz (Proc. of the IEEE, 2001)
• Interconnect-dominated structures do not scale well with smaller transistors
• Assuming that transistors will switch twice as fast in a next-generation process technology, designs will not work at twice their current clock frequency
• Stages scale differently with faster transistors, but also with wider pipelines – Palacharla et al. (ISCA 1997), Agarwal et al. (ISCA 2000)
Getting there
Issue Window:
• One of the structures limited mostly by wire speed
• Wake-up logic is limited mostly by the speed of propagating signals on the tag lines
• Selection logic is limited mostly by transistor speed
Flywheel
Why not allow different sections of the pipeline to work at their own maximum clock frequency?
• Imagine an engine whose moving parts spin at different speeds
• Dual Clock Issue Window – decouple the front-end from the slow scheduling logic by using a mixed-clock interface
• Pre-Scheduled Execution – decouple the back-end using an Execution Cache (turn off the Issue Window)
Pre-Scheduled Execution
• Decouple the back-end stages by caching traces after the scheduling logic, already scheduled for execution
• Speed up execution whenever instructions can be retrieved from the Execution Cache
• Shut down the front-end, including the slow Issue Window
Benefits
Performance
• Faster front-end – smaller branch mispredict penalty, more parallelism
• Faster execution core – traces are created more slowly, but can be re-executed faster
Power consumption
• Most of the time, the entire front-end can be shut down
Outline
• Motivation
• Flywheel
• Microarchitectural design
  • Dual Clock Issue Window
  • Execution Cache
  • Register File and Register Renaming
• Experimental results
• Conclusions
Dual Clock Issue Window
• Decouples the front-end, allowing it to use a different clock signal
• Instructions coming from the faster front-end must be synchronized with the slower back-end clock
• Advantages:
  • Limits the mispredict penalty
  • Improves bandwidth by filling a large Issue Window faster
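As a rough behavioral sketch (not the paper's circuit-level interface), the decoupling can be modeled as a bounded queue written on the fast front-end clock and drained on the slower back-end clock. The class and method names, the capacity, and the 2:1 clock ratio below are illustrative assumptions:

```python
from collections import deque

class MixedClockQueue:
    """Behavioral sketch of a mixed-clock interface: a fast producer
    (the front-end) fills the queue, a slow consumer (the back-end)
    drains it. Capacity loosely models the Issue Window size."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.q = deque()

    def produce(self, instr):
        # Front-end clock edge: insert if the window is not full.
        if len(self.q) < self.capacity:
            self.q.append(instr)
            return True
        return False  # otherwise the front-end stalls

    def consume(self):
        # Back-end clock edge: drain one instruction if available.
        return self.q.popleft() if self.q else None

# Fast front-end (fires every cycle) feeding a slow back-end
# (fires every other cycle):
win = MixedClockQueue(capacity=4)
issued = []
for t in range(12):
    win.produce(("op", t))      # front-end clock edge
    if t % 2 == 0:              # back-end clock edge (half the rate)
        instr = win.consume()
        if instr is not None:
            issued.append(instr)
# The back-end issues half as many instructions as the front-end
# produced; once the window fills, the front-end stalls.
```

The point of the sketch is the decoupling: the producer and consumer loops never reference each other's clock, only the shared bounded queue.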
Execution Cache
Caching instructions out of order is the central idea of the Flywheel microarchitecture
• Instructions cannot be retrieved based on their memory address, so the EC must be trace-based
• To maintain significant instruction reordering capabilities, traces must be long
• Because of branch mispredictions, fixing a standard trace length is not economical
• Must be efficient in terms of bandwidth and power consumption
Execution Cache
• Stores instructions in issue order
• Execution Cache = Tag Array + Data Array
• Both are set-associative arrays
• Tag Array
  • Stores information about the traces existing in the cache
  • Accessed only when a new trace is started
• Data Array
  • Instructions are organized in groups that can be sent directly to execution
  • Accessed each cycle – uses sub-banking to reduce power
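The Tag Array / Data Array split can be sketched behaviorally as follows. This is a simplified model with assumed names; the real arrays are set-associative hardware structures, not Python containers:

```python
class ExecutionCache:
    """Behavioral sketch (illustrative, not the actual circuit): the
    Tag Array maps a trace's starting PC to its first Block; the Data
    Array holds Blocks of already-scheduled instructions."""
    def __init__(self):
        self.tags = {}   # starting PC -> index of first Block in data array
        self.data = []   # list of Blocks (each Block: a list of Issue Units)

    def insert_trace(self, start_pc, blocks):
        # Record where this trace's Blocks begin, then append them.
        self.tags[start_pc] = len(self.data)
        self.data.extend(blocks)

    def start_trace(self, pc):
        # The Tag Array is accessed only when a new trace is started.
        return self.tags.get(pc)  # None on a miss

    def read_block(self, idx):
        # The Data Array is accessed each cycle, one Block per access.
        return self.data[idx]

ec = ExecutionCache()
ec.insert_trace(0x400, [["add", "mul"], ["ld"]])  # a two-Block trace
first = ec.start_trace(0x400)  # one tag lookup at trace start
blk = ec.read_block(first)     # then per-cycle data-array reads
```

Note how the tag lookup happens once per trace while block reads happen every cycle, which is why the slides single out the Data Array for sub-banking.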
Execution Cache Microarchitecture
• Block
  • Fixed maximum length
  • Contains an arbitrary number of Issue Units
• Issue Unit – a group of independent instructions that can be issued in parallel
• Write
  • Each Issue Unit is captured at trace creation and inserted into a Fill Buffer
  • When enough instructions have accumulated, a Block is written to the Data Array
• Read
  • One Block is read at a time
  • The next Block's location is known, so the read can start early to hide the latency
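The write path above can be sketched as a small model. The names and the tiny Block size are illustrative; the actual Fill Buffer is a hardware structure:

```python
class FillBuffer:
    """Sketch of trace creation (illustrative): Issue Units captured
    at scheduling time accumulate here; once the next unit would
    exceed the fixed maximum Block length, the buffered units are
    written out as one Block."""
    def __init__(self, block_size):
        self.block_size = block_size  # fixed maximum Block length
        self.units = []               # Issue Units buffered so far
        self.count = 0                # instructions buffered so far

    def capture(self, issue_unit, data_array):
        # An Issue Unit is a group of independent instructions that
        # can issue in parallel; a Block holds whole Issue Units only.
        if self.count + len(issue_unit) > self.block_size:
            data_array.append(self.units)  # write the full Block
            self.units, self.count = [], 0
        self.units.append(issue_unit)
        self.count += len(issue_unit)

data_array = []
fb = FillBuffer(block_size=4)
for unit in [["add", "sub"], ["mul"], ["ld", "st"], ["br"]]:
    fb.capture(unit, data_array)
# The first two units (3 instructions) form one Block; the third
# unit would overflow it, so the Block is written and a new one starts.
```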
Register Renaming
• Shutting down the front-end means all the work performed there must be reused
  • We cannot rename registers for out-of-order instructions
  • We must use the renaming done at trace creation
• A special register renaming mechanism
  • Must perform renaming inside each trace (always starting from a predefined initial state)
  • Must be able to restore the initial state when a trace ends
• Conceptual algorithm similar to the "rotational remap" renaming scheme – Nair et al. (ISCA 1997)
Associative Register File
• An architected register = a pool of N physical registers
• The pool is physically organized as a circular queue
• For each write – use the next register in the pool as the destination and increment IDX
• For each read – use the last allocated register (IDX)
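A minimal behavioral sketch of one such pool, assuming N = 4 and illustrative method names:

```python
class RegisterPool:
    """Sketch of one architected register backed by a circular pool
    of N physical registers (illustrative). Each write allocates the
    next physical register and advances IDX; each read returns the
    most recently allocated one."""
    def __init__(self, n):
        self.phys = [None] * n
        self.idx = -1  # no allocation yet

    def write(self, value):
        # Allocate the next register in the pool, wrapping around.
        self.idx = (self.idx + 1) % len(self.phys)
        self.phys[self.idx] = value
        return self.idx  # physical destination chosen for this write

    def read(self):
        # Read the last allocated register (IDX).
        return self.phys[self.idx]

r1 = RegisterPool(n=4)
r1.write(10)  # lands in physical register 0
r1.write(20)  # lands in physical register 1; the older value at
              # register 0 stays live for still-in-flight readers
```

The circular allocation is what lets renaming inside a trace always start from the same predefined state: replaying the trace replays the same sequence of IDX increments.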
Adaptive Register File
• Each conceptual renaming pool occupies a contiguous area in the Register File
• Each pool is defined by a Base address and a length
• Pools are resized to eliminate capacity stalls
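A sketch of the Base/length bookkeeping, assuming adjacent pools trade registers at their shared boundary; the resizing policy shown is an illustrative assumption, not the paper's exact mechanism:

```python
class AdaptiveRegisterFile:
    """Illustrative sketch of the adaptive organization: each renaming
    pool is a contiguous Register File region described by a Base
    address and a length; resizing shifts a boundary between two
    adjacent pools to avoid capacity stalls."""
    def __init__(self, size, num_pools):
        self.regs = [None] * size
        step = size // num_pools
        # Each pool: [Base, length], laid out contiguously.
        self.pools = [[i * step, step] for i in range(num_pools)]

    def resize(self, donor, receiver, amount):
        # Move `amount` registers from the end of `donor` to the
        # adjacent pool `receiver` (receiver = donor + 1), shifting
        # the receiver's Base down so both stay contiguous.
        assert receiver == donor + 1
        assert self.pools[donor][1] > amount
        self.pools[donor][1] -= amount
        self.pools[receiver][0] -= amount
        self.pools[receiver][1] += amount

arf = AdaptiveRegisterFile(size=16, num_pools=4)  # four pools of 4
arf.resize(donor=0, receiver=1, amount=2)         # pool 1 grows to 6
```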
Two-Level Register Renaming
• Requires two renaming levels
  • The first stage uses the same conceptual algorithm
  • The second stage remaps each pool entry onto a contiguous Register File structure
• Introduces an extra pipeline stage
• Does not need associative Register File accesses
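The second-level remap can be sketched as a direct address computation; the names are illustrative, and the first-stage pool index is assumed to come from the conceptual rotational scheme:

```python
def rename_two_level(arch_reg, pool_idx, pools):
    """Sketch of the two-level scheme (illustrative): the first stage
    yields an (architected register, pool index) pair as in the
    conceptual algorithm; the second stage remaps that pair onto a
    contiguous Register File region, so the physical address is a
    simple computation and no associative access is needed."""
    base, length = pools[arch_reg]
    # Direct, non-associative address into the Register File:
    return base + (pool_idx % length)

# Hypothetical pool table: architected register -> (Base, length)
pools = {0: (0, 4), 1: (4, 4)}
addr = rename_two_level(arch_reg=1, pool_idx=5, pools=pools)
```

The extra pipeline stage in the slides corresponds to this second mapping step; the payoff is that the Register File itself can be indexed directly.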
Alternative execution path usage
[Chart: alternative path usage (%) per benchmark (ijpeg, gcc, gzip, vpr, mesa, equake, parser, vortex, bzip2, turb3d, average) for EC access latencies of 2 to 5 cycles]
• Alternative path usage is very high in some cases
• On average, the processor spends 90% of the time fetching from the EC
• During these intervals:
  • Increase the clock speed of the execution core
  • Turn off the front-end to save power
• Back-end frequency contributes most to the overall performance
Execution parallelism
[Chart: normalized performance per benchmark for EC access latencies of 2 to 5 cycles]
• Execution parallelism remains about the same
• At the same clock frequency, overall performance is comparable to the baseline
• A pipelined EC access decreases performance (about 3% per additional stage)
Relative performance
[Chart: normalized performance per benchmark for configurations from (FE 0%, BE 50%) to (FE 100%, BE 50%)]
• Results assume a 50% faster execution core (BE 50%) and an up to 100% faster front-end (FE 0% to FE 100%)
• Scalability is slightly super-linear: (FE 50%, BE 50%) runs about 54% faster than the baseline
• Front-end frequency has a much smaller but noticeable effect, about a 15% performance increase from (FE 0%, BE 50%) to (FE 100%, BE 50%)
Energy
[Chart: normalized energy per benchmark for configurations from (FE 0%, BE 50%) to (FE 100%, BE 50%)]
• About 28% average energy savings, due to the low utilization of the front-end
Power consumption
[Chart: normalized power per benchmark for configurations from (FE 0%, BE 50%) to (FE 100%, BE 50%)]
• Even though the overall energy is reduced, it must be spent over a smaller execution window
• In most cases, power increases slightly (from 2% to 15%)
• The power increase is much smaller than the performance increase (45% to 62%)
• Inside the same power envelope, this microarchitecture is significantly faster than the baseline
Scalability
[Chart: normalized energy per benchmark at the 130nm, 90nm, and 60nm technology nodes]
• Leakage energy is not affected by turning off the front-end
• Energy efficiency will decrease as leakage increases
Conclusion
• Better scalability, even though the Issue Window remains untouched
• Significant improvement can be obtained by speeding up the front-end of the pipeline, but…
• Most of the improvement comes from speeding up the execution core (Pre-Scheduled Execution)
• The resulting microarchitecture obtains better performance than the corresponding baseline for the same power consumption
• Allows higher clock speeds at the cost of a smaller power increase than the corresponding fully synchronous, superscalar, out-of-order processor
Previous work
Execution Cache
• E. Talpes and D. Marculescu, "Power Reduction Through Work Reuse", in Proceedings of ISLPED, 2001
• E. Talpes and D. Marculescu, "Impact of Technology Scaling on Energy Aware Execution Cache-based Microarchitectures", in Proceedings of ISLPED, 2004
• E. Talpes and D. Marculescu, "Reusing Scheduled Instructions to Improve the Power Efficiency of a Superscalar Processor", in IEEE Transactions on VLSI Systems, January 2005
Dual Clock Issue Windows and GALS microarchitectures
• E. Talpes and D. Marculescu, "Reducing the Asynchronous Communication Effect on GALS Microarchitectures", in Proceedings of ISLPED, 2003
• E. Talpes and D. Marculescu, "Towards a Multiple Clock / Voltage Island Design Style for Power Aware Processors", in IEEE Transactions on VLSI Systems
• E. Talpes, V. S. Rapaka, and D. Marculescu, "Mixed-Clock Issue Queue Design for Energy Aware High Performance Cores", in Proceedings of ASP-DAC, 2004
Flywheel Microarchitecture
• E. Talpes, "Flywheel: Increased Scalability and Power Efficiency through Multiple Speed Pipelines", PhD dissertation, Carnegie Mellon University, December 2004