Low power Design Strategies - High Performance Computing Group

Low power Design Strategies
Daniele Folegnani
Talk outline
• Why Low Power is Important
• Power Consumption in CMOS Circuits
• New Trends for Future Microprocessors
• Low Power Strategies
• Power Consumption Evaluation of a Superscalar Processor
• An Architectural Technique to Reduce the Power Consumption of the Issue Logic
• Conclusions
Why Low Power is Important
High performance microprocessors
PowerPC704 consumes 85 W
Alpha 21364 consumes 100 W
Problems involved: thermal runaway, gate dielectric, junction fatigue, electromigration diffusion, electrical parameter shifts, silicon interconnection fatigue, package related failure.
THE FUNCTIONALITY AND THE CLOCK SPEED CAN BE LIMITED
Thermal and Power dissipation costs
[Figure: dissipation cost versus power in Watt (both axes 0 to 60); labels: Total dissipation cost, CPU, 1$/1W]
Low performance processors
• High demand for portable devices (mobile phones, laptops, smart cards, videogames, etc.) >>> 95% of production !!!
• Extensive use of multimedia features
Problems involved: >>> Battery life !!!
Battery energy will not grow drastically in the near future, for technology and safety reasons (today's batteries already hold as much energy as a grenade !!!)
• One of the selling points is hours of use and hours of standby
• Techniques are needed that improve energy efficiency without penalizing performance
Power Consumption in CMOS Circuits
• Static
  Theoretically 0; in practice, leakage and subthreshold currents exist in transistors
• Dynamic
  • Transients (the linear zone)
  • Capacitance switching: THE MOST IMPORTANT FACTOR
$P = \frac{1}{2} C V^{2} f$
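As a minimal sketch of the capacitance switching formula above, the following Python function evaluates P = 1/2 C V^2 f. The capacitance, voltage and frequency values are hypothetical and chosen only for illustration; they do not come from the talk.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Switching power P = 1/2 * C * V^2 * f, in watts."""
    return 0.5 * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating point (assumed values, not taken from the slides).
base = dynamic_power(1e-9, 1.8, 500e6)
# Same circuit and clock at half the supply voltage.
half_voltage = dynamic_power(1e-9, 0.9, 500e6)

print(base, half_voltage, half_voltage / base)
```

Because the voltage enters squared, halving it cuts the switching power to a quarter at the same capacitance and frequency, which is why voltage scaling is such an effective lever.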
New Trends for Future Microprocessors
[Figure: Power and Perf versus microarch complexity (x axis 0 to 4,5; y axis 0 to 2,5)]
Moore's Law: the number of transistors doubles every 18 months
Power is proportional to DIE AREA and FREQUENCY
• In the same technology, a new architecture takes 2-3X the die area
• Changing technology implies 2X frequency
SCALING TECHNOLOGY ...
• Decreasing voltage (0.7 scaling factor)
• Decreasing die area (0.5 scaling factor)
• Increasing C per unit area by 43% !!!
This implies that power density increases by 40% every generation !!!
Temperature is a function of power density and determines the type of cooling system needed.
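The 40% figure can be checked with a short calculation. This sketch assumes that power density scales as (C per unit area) x V^2 x f, following the dynamic power formula P = 1/2 C V^2 f, and that the frequency factor is the 2X quoted for a technology change. The variable names are illustrative.

```python
# Per-generation scaling factors quoted on the slide.
voltage_scale = 0.7          # supply voltage
cap_per_area_scale = 1.43    # capacitance per unit area (+43%)
frequency_scale = 2.0        # clock frequency (2X per technology change)

# Assumption: power density ~ (C per unit area) * V^2 * f.
power_density_scale = cap_per_area_scale * voltage_scale ** 2 * frequency_scale

print(f"power density grows by {100 * (power_density_scale - 1):.1f}% per generation")
```

Under these assumptions the product is about 1.40, which matches the 40% increase stated above.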
VARIABLES
• PEAK POWER (worst case)
Today's packages can sustain a power dissipation over 100 W for up to 100 msec
>>> cheaper package if peaks are reduced
• ENERGY SPENT (for a workload)
More closely correlated with battery life
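The two variables measure different things, which a small sketch can make concrete. The power trace and sample period below are hypothetical; peak power is the largest sample, while energy is the sum of the samples multiplied by the sample period.

```python
def peak_power(trace_w):
    """Worst-case instantaneous power in the trace, in watts."""
    return max(trace_w)


def energy_spent(trace_w, dt_s):
    """Energy over the trace in joules: each power sample times the sample period."""
    return sum(p * dt_s for p in trace_w)


# Hypothetical trace: four power samples taken every 50 ms (assumed values).
trace = [20.0, 80.0, 120.0, 40.0]
dt = 0.05

print(peak_power(trace), energy_spent(trace, dt))
```

Flattening the 120 W sample would relax the packaging requirement, while the battery only cares about the total of roughly 13 J.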
Low Power Strategies
• OS level: PARTITIONING, POWER DOWN
• Software level: REGULARITY, LOCALITY, CONCURRENCY
  (compiler technology for low power, instruction scheduling)
• Architecture level: PIPELINING, REDUNDANCY, DATA ENCODING
  (ISA, architectural design, memory hierarchy, HW extensions, etc.)
• Circuit/logic level: LOGIC STYLES, TRANSISTOR SIZING, ENERGY RECOVERY
  (logic families, conditional clocking, adiabatic circuits, asynchronous design)
• Technology level: threshold reduction, multi-threshold devices, etc.
Power Consumption Estimation
[Figure: error estimation and power consumption (scale 0 to 30) across levels of abstraction: Arch, RTL, Circuit, Layout]
Architectural estimation has a relatively high error rate (no view of the total area, circuit types, technology, block activity, etc.), yet
IMPORTANT DESIGN DECISIONS MUST BE MADE AT THE ARCHITECTURAL LEVEL
• Accurate power evaluation is only done in late design phases
• Good feedback is needed between all the design phases
- Correlation between power estimates from low level to high level: TRY TO IMPROVE ACCURACY AT HIGH LEVEL
- Critical path based power consumption analysis (CIRCUIT TYPES, TECHNOLOGY, ACTIVITY FACTOR)
- Thermal image based correlation analysis (HOTTEST SPOTS LOCATION, COOLEST SPOTS LOCATION, TEMPERATURE DIFFERENCES, TEMPERATURE DISTRIBUTION)
Architectural Power Evaluation
[ G.Cai, Intel ]
• Architectural design partition
• Power consumption evaluation at block level
- Power density of blocks (SPICE simulation, statistical input set, technology and circuit types definition)
- Activity of blocks and sub-blocks (running benchmarks)
- Area (feedback from VLSI design, circuits and technology defined)
• TRY TO DEFINE SCALING FACTORS THAT ALLOW THE ARCHITECTURAL POWER SIMULATOR TO BE REMAPPED WHEN TECHNOLOGY, AREA AND CIRCUIT TYPES CHANGE
• TRY TO REDUCE THE ESTIMATION ERROR AT HIGH LEVEL
POW OUT ORDER
• Technology assumed: CMOS 0.18 micron
• 5 types of circuit logic (static, dynamic, SRAM, clock distribution, PLA)
• 32 architectural blocks, each with an associated area
• Blocks built with custom design
• Two types of power density (active and inactive power density)
$P_j(i) = \sum_k a_k(i) \, A_k \, APD_k + \sum_k \left(1 - a_k(i)\right) A_k \, IPD_k$
$E_j = \sum_i P_j(i)$
Power Consumption Evaluation of a Superscalar Processor
Architectural parameters:
• 4 instr. fetch, issue and commit
• 128 entries instruction queue size
• I-Cache 128 Kbytes, direct mapped, 32 byte line, 1 cycle hit, 3 cycle miss
• D-Cache 128 Kbytes, 4-way set associative, 32 byte line, 1 cycle hit, 3 cycle miss
• UL2-Cache 1024 Kbytes, 4-way set associative, 64 byte line, 3 cycle hit
• Combined predictor of 1K entries, with Gshare with 1K 2-bit counters and 8 bit global history, and bimodal pred. of 2K entries with 2-bit counters
• 4 int ALU, 4 fp ALU, 1 int mul/div, 1 fp mul/div
• Out of order issue, oldest ready first selection policy
Block               applu       swim    tomcatv       wave     su2cor     hydro4    Avg (%)
inst dec          340,952     336,48    351,133    341,916    344,195    349,875      2,751
BTB               143,346    119,679    195,688    149,563    156,424    187,337      1,269
TLB                63,492     53,009     86,676     66,246     69,284     82,977      0,562
IL1               677,202    565,397     924,48    706,574    738,986     855,03      5,955
DL1               621,026    518,495     847,79    647,961    677,684    811,613      5,497
UL2              1353,916   1130,387   1848,292   1412,638   1477,439   1769,421     11,986
rename table     1627,983   1672,283   1725,565   1668,854   1724,716   1738,017     13,539
instr queue       3124,82   3136,195   3282,977   3160,858   3170,821    3269,56     25,201
ROB              3429,455   3445,777   3394,513   3489,683   3221,504   3348,813     27,099
int FU            111,612    109,172    110,205    112,288    103,285    108,362      0,873
fp FU             147,722    144,934    145,859    148,617    136,701     143,42      1,156
I/O logic         244,201    203,884    333,371    254,793    266,481    319,145      2,161
Other             189,273    180,103     214,89    192,667    200,816    242,276      1,951
Total           12075,833  11615,795  13461,439  12352,658  12288,336  13225,846        100

Block                perl         li    m88ksim     vortex   compress        gcc    Avg (%)
inst dec          345,874    334,607    355,221     346,63    333,885    349,317      2,761
BTB               164,525    114,926    107,751    169,376    108,817    109,063      1,045
TLB                72,873     50,904     47,726     75,021     48,198     84,184      0,511
IL1                777,26    542,941    509,046    800,174    514,082    897,905      5,456
DL1               712,783    497,902    466,819    733,796    471,437     823,42      5,004
UL2              1490,043   1040,843   1017,726   1412,638   1477,439   1795,162     11,117
rename table     1812,117   1879,014   1742,077   1999,771   1027,793   1773,833     13,819
instr queue      3351,535   3420,231   3214,888   3645,328   3106,252   2906,202     26,524
ROB              3247,036   3227,355   3315,514   3558,173   3225,669    3499,98     27,104
int FU            105,166    100,344    104,111    114,311    103,075     109,19      0,859
fp FU              139,19    132,808    137,794    151,294    136,423    144,517      1,136
I/O logic         280,028    195,786    183,564    288,545     185,38    323,788      1,967
Other              267,57    227,973    178,148     364,64    403,729    549,925      2,697
Total           12766,783  11360,385  11360,385  13659,697  11142,179  13366,486        100
An Architectural Technique to Reduce the Power Consumption of the Issue Logic
• IQ + ROB are responsible for about 53% of the power consumption
• The cache hierarchy is not the dominant power consumer in the superscalar paradigm
• Power consumption is almost independent of the instruction mix
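The first two bullets can be checked against the Avg (%) columns of the per-block tables. The sketch below copies those percentages for the floating point benchmarks (applu to hydro4) and the integer benchmarks (perl to gcc); the SpecFP / SpecINT labels and the helper name `share` are illustrative.

```python
# Avg (%) values copied from the per-block power tables.
fp_avg = {"IL1": 5.955, "DL1": 5.497, "UL2": 11.986,
          "instr queue": 25.201, "ROB": 27.099}
int_avg = {"IL1": 5.456, "DL1": 5.004, "UL2": 11.117,
           "instr queue": 26.524, "ROB": 27.104}

ISSUE = ("instr queue", "ROB")
CACHES = ("IL1", "DL1", "UL2")


def share(avg, blocks):
    """Combined percentage of total power for the given blocks."""
    return sum(avg[b] for b in blocks)


for name, avg in (("SpecFP", fp_avg), ("SpecINT", int_avg)):
    print(name, round(share(avg, ISSUE), 3), round(share(avg, CACHES), 3))
```

IQ + ROB come to about 52.3% and 53.6% for the two groups, against roughly 23.4% and 21.6% for the three cache levels combined.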
TRENDS IN SUPERSCALAR
• Increasing issue width (IW)
• The size of the instruction window grows more than linearly with respect to the IW
• The area of the IQ grows more than linearly with respect to the number of entries
IQ power contribution may grow in the future
Every cycle the wakeup logic broadcasts the result tags through the result buses to all the entries, and each entry compares them with its own operand tags to find a match
THE ISSUE ENGINE SPENDS A LARGE AMOUNT OF POWER EVERY CYCLE ONLY TO CHECK WHETHER SOME INSTRUCTIONS ARE AVAILABLE FOR EXECUTION
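To give a feel for the cost of the wakeup step, here is a simplified software sketch that counts tag comparisons in one cycle. It assumes two source operands per entry and one result tag per issued instruction (four per cycle); both are assumptions for illustration, and real wakeup logic performs these comparisons in parallel hardware rather than in a loop.

```python
def wakeup(entries, result_tags):
    """Broadcast result tags to every entry; return the number of comparisons.

    Each entry is a dict with "src" (two operand tags) and "ready" (two flags).
    """
    comparisons = 0
    for entry in entries:
        for pos, src in enumerate(entry["src"]):
            for tag in result_tags:
                comparisons += 1          # one comparator activation
                if src == tag:
                    entry["ready"][pos] = True
    return comparisons


# Hypothetical 128-entry queue; entry i waits on tags i and i + 1.
queue = [{"src": [i, i + 1], "ready": [False, False]} for i in range(128)]
count = wakeup(queue, [0, 1, 2, 3])

print(count, queue[0]["ready"], queue[3]["ready"])
```

Under these assumptions every cycle costs 128 x 2 x 4 = 1024 comparisons, whether or not any entry becomes ready, and the count shrinks in proportion to the number of entries that take part in the broadcast.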
Considering
• In periods of execution with high parallelism, just a part of the IQ may be enough to satisfy the IW
• In periods of execution with poor parallelism, some parts of the IQ may not provide any useful instruction ready to execute
The issue engine is very power inefficient
[Figure: Issue in the window, SpecFP (APPLU, HYDRO, SU2COR, SWIM, TOMCATV, WAVE5): percentage per window part (1 part to 4 part)]
[Figure: Issue in the window, SpecINT (COMPRESS, GCC, LI, M88KSIM, PERL, VORTEX): percentage per window part (1 part to 4 part)]
[Figure: Commit in the window, SpecFP (APPLU, HYDRO, SU2COR, SWIM, TOMCATV, WAVE5): percentage per window part (1 part to 4 part)]
[Figure: Commit in the window, SpecINT (COMPRESS, GCC, LI, M88KSIM, PERL, VORTEX): percentage per window part (1 part to 4 part)]
Dynamically Resizing the Instruction Queue
• We propose a run-time mechanism that adapts the size of the IQ based on its contribution to IPC
• We avoid the wake-up function in the parts that are temporarily disabled
• Resize decisions are commit based
IQ implemented as a circular FIFO with head and tail pointers, no collapsing
What we do is ...
Partition the queue into 16 parts of 8 entries
Define a new pointer for the queue, called the limit pointer
• At start time it has the same value as the head pointer, and it is updated along with the head pointer
• When a resize decision is made, an offset (one portion) is added to or subtracted from it
• The zone between the head and the limit pointer is the disabled zone (no wake-up)
• If the tail grows beyond the limit, we allow the correct wake-up and stop insertion until the limit reaches the tail
Heuristic to reduce size
• Collect statistics about the instructions committed from the youngest portion of the queue every time quantum (1000 cycles)
• We propose to add a bit to each ROB entry that is set at dispatch time if the physical position of the instruction in the IQ is in the current youngest part
• The resize decision is threshold-based >>> 0.025 of IPC in the current portion
No limit to cut
Heuristic to increase size
• Blind >>> grow by one portion every 5 time quanta and let the cut approach decide whether the decision was correct (period of high parallelism or not)
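The two heuristics can be sketched as a small controller that is called once per quantum. The constants (16 portions of 8 entries, 1000-cycle quantum, 0.025 IPC threshold, growth every 5 quanta) come from the slides. The class name, the one-portion lower bound, and the choice to let the blind growth take priority in the fifth quantum are assumptions made for this illustration.

```python
PORTIONS = 16          # queue partitioned in 16 parts
PORTION_ENTRIES = 8    # of 8 entries each
QUANTUM = 1000         # cycles between resize decisions
IPC_THRESHOLD = 0.025  # minimum IPC contribution of the youngest portion
GROW_PERIOD = 5        # blind growth every 5 quanta


class ResizeController:
    """Hypothetical controller tracking how many IQ portions are enabled."""

    def __init__(self):
        self.enabled = PORTIONS
        self.quanta = 0

    def end_of_quantum(self, youngest_commits):
        """Update the size from the commits seen in the youngest portion."""
        self.quanta += 1
        contribution = youngest_commits / QUANTUM  # IPC from youngest portion
        if self.quanta % GROW_PERIOD == 0:
            # Blind growth (assumed to take priority over a cut).
            self.enabled = min(PORTIONS, self.enabled + 1)
        elif contribution < IPC_THRESHOLD:
            # Cut one portion; the floor of one portion is an assumption.
            self.enabled = max(1, self.enabled - 1)
        return self.enabled * PORTION_ENTRIES


ctrl = ResizeController()
sizes = [ctrl.end_of_quantum(youngest_commits=10) for _ in range(5)]
print(sizes)
```

With only 10 commits per quantum from the youngest portion (0.01 IPC), the queue shrinks for four quanta and then grows back by one portion in the fifth, after which the cut heuristic is free to undo the growth if it did not help.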
Results
[Figure: IPC (0 to 3.5) per benchmark (Applu, Swim, Tomcatv, Wave, Su2cor, Hydro, Perl, Li, M88ksim, Vortex, Compress, Gcc, Avg) for IQ sizes 128, dynamic and 64]
Benchmark    % PowSav   % TPowSav   Avg entries
Applu          62.907      16.276          47.5
Swim           30.000       8.100          89.6
Tomcatv        58.613      14.291          52.9
Wave           41.989      10.721          74.2
Su2cor         61.821      15.946          48.9
Hydro          66.129      16.339          43.3
Perl           55.379      14.517          57.1
Li             59.847      17.383          51.4
M88ksim        65.890      18.620          43.6
Vortex         61.084      16.278          49.8
Compress       65.658      18.287          43.9
Gcc            60.243      13.088          50.9
Avg            57.463      14.987          54.4
Conclusions
• Power consumption is a new constraint in the design of computer systems, alongside cost and performance
• The problem must be attacked at different levels of abstraction
• Power decisions must be made in the early steps of the design
• There is a need for power estimation models and tools, especially at the architectural level
Q&A ?