Transcript Lec12-Power
Low-power computer architecture
Dr. Avi Mendelson
Disclaimer
No Intel proprietary information is disclosed.
Every future estimate or projection is only speculation.
Responsibility for all opinions and conclusions falls on the author only.
It does not mean you cannot trust them…
© Dr. Avi Mendelson
Agenda
The power crisis
Power consumption
Power density and thermal limitations
General solutions and directions
Moore’s law
“Doubling the number of transistors on a manufactured
die every year” - Gordon Moore, Intel Corporation
[Figure: transistors per die (log scale, 10^2 to 10^9) versus year (1970 to 2000). Memory: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M. Microprocessor: 4004, 8080, 8086, 80286, i386™, i486™, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4. Source: Intel]
In the Last 25 Years Life was Easy (*)
Doubling of transistor density every 30 months
Increasing die sizes, allowed by
Increasing Wafer Size
Process technology moving from “black art” to “manufacturing science”
Doubling of transistors every 18 months
Die size, old vs. new architecture (same Tech):
Old Arch       mm (linear)   New Arch       mm (linear)   Ratio
i386C          6.5           i486           11.5          3.1
i486C          9.5           Pentium®       17            3.2
Pentium®       12.2          Pentium® Pro   17.3          2.1
Pentium® III   10.3          Next Gen       ?             2-3
Implications: (in the same technology)
1. New Arch ~ 2-3X die area of the last Arch
2. Provides 1.5-1.7X integer performance of the last Arch
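The Ratio figures listed above appear to be die-area ratios, i.e. the square of the new-to-old linear dimension. A minimal Python sketch under that assumption (the function name `area_ratio` is illustrative and not from the lecture):

```python
def area_ratio(old_side_mm: float, new_side_mm: float) -> float:
    """Die-area ratio of a square die, computed from its linear (side) size."""
    return (new_side_mm / old_side_mm) ** 2


# (old arch, old side in mm, new arch, new side in mm), as listed on the slide
pairs = [
    ("i386C", 6.5, "i486", 11.5),
    ("i486C", 9.5, "Pentium", 17.0),
    ("Pentium", 12.2, "Pentium Pro", 17.3),
]

for old, old_mm, new, new_mm in pairs:
    print(f"{old} -> {new}: {area_ratio(old_mm, new_mm):.2f}x die area")
```

The computed values land close to the slide's 3.1, 3.2 and 2.1, which supports reading the column as an area ratio.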
(*) Source: Fred Pollack, Micro-32
Suddenly, the power monster appears in all different market segments
Processor Power Evolution
[Figure: max power (Watts, log scale 1 to 100) by processor generation: i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, and “?” for the next generation]
Traditionally: each new generation increases power
Compactions: higher performance at lower power
Used to be “one size fits all”: start with high power and shrink to Mobile
The power crisis – power consumption
Source: CoolChips, Micro-32
Power challenges per segment
Servers
  Power-related system cost drivers: thermal cost, delivery cost, form factor
  Price drivers: performance, perf/inch^3
  Optimization point: max performance @ thermal constraint
Desktops
  Power-related system cost drivers: thermal cost, delivery cost
  Price drivers: performance, noise, perf/$$
  Optimization point: max performance @ thermal constraint
Mobile
  Power-related system cost drivers: thermal cost, delivery cost, form factor, battery size
  Price drivers: performance, noise, perf/Kg., battery life
  Optimization point: max performance @ thermal constraint; max battery life
Handhelds
  Power-related system cost drivers: form factor, battery size, battery cost
  Price drivers: performance, battery life
  Optimization point: max battery life; max perf/power to meet application’s need
Power & Energy
Power
Dynamic power: consumed by transistors during switching.
$P = a C V^2 f$ - work done per time unit (Watts)
(a: activity, C: capacitance, V: voltage, f: frequency)
Static power (leakage): consumed by all “inactive” transistors; it depends on temperature and voltage.
Power-aware architectures -> aim to reduce peak power
Energy
Power consumed over some period of time (Joules).
Energy-aware architectures -> aim to reduce average power consumption
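To make the distinction between power and energy concrete, here is a small Python sketch of the dynamic-power formula and of energy as power over time. The numeric values are hypothetical and chosen only for illustration; they do not describe any real processor.

```python
def dynamic_power(activity: float, capacitance_f: float,
                  voltage_v: float, frequency_hz: float) -> float:
    """Dynamic (switching) power in watts: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def energy(power_w: float, seconds: float) -> float:
    """Energy in joules for a constant power draw over a time interval."""
    return power_w * seconds


# Hypothetical operating point (assumption, not from the lecture)
p = dynamic_power(activity=0.1, capacitance_f=100e-9,
                  voltage_v=1.5, frequency_hz=1e9)
print(f"dynamic power: {p:.1f} W")
print(f"energy over 10 s: {energy(p, 10):.0f} J")
```

Peak power constrains delivery and cooling, while energy (power accumulated over time) is what drains a battery.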
Power Evolution (Theoretical)
[Figure: active power and leakage power (Watts, 0 to 250) per process generation]
For a 15 mm/side die (225 mm^2)
Assume 2X frequency increase each generation
Future process numbers are estimated
Why high power matters
Power Limitations
Higher power -> higher current
– Cannot exceed platform power delivery constraints
Higher power -> higher temperature
– Cannot exceed the thermal constraints (e.g., Tj < 100°C)
– Increases leakage.
– The heat must be controlled in order to avoid electromigration and other “chemical” reactions of the silicon
Energy
Affects battery life.
Consumer devices – the processor may consume most of the
energy
Mobile computers (Laptops) - the system (display, disk, cooling,
energy supplier, etc) consumes most of the energy
Affects the cost of Electricity
Power Density
[Figure: power density (Watts/cm^2, log scale 1 to 1000) by processor generation: i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle] *
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” –
Fred Pollack, Intel Corp. Micro32 conference key note - 1999.
Why do power and power density increase over time?
How do we keep up with Moore’s Law?
Every 18 months, on average, we introduce a new process
The new process scales the linear dimensions of the transistors by 0.7 (ideal shrink)
As a result, on the same die area, we can have
more transistors, each of them running at higher
frequency
One may mistakenly think that this is the reason
for the increase in power and power density.
Scaling theory--1 of 2
Scaling factors: width $W = 0.7$, length $L = 0.7$, $t_{ox} = 0.7$
Lateral and vertical dimensions reduce by 30%
Area cap: $C_a = \frac{0.7 \times 0.7}{0.7} = 0.7$
Fringing cap: $C_f = 0.7$
Total cap: $C = 0.7$
Capacitance (area and fringing) reduces by 30%
Die area: $X \times Y = 0.7 \times 0.7 = 0.7^2 \approx 0.5$
Die area reduces 50%
Scaling theory--2 of 2
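Capacitance per transistor: $\frac{C}{\text{Transistor}} = \frac{0.7}{1} = 0.7$
Capacitance per transistor reduces 30%
Capacitance per unit area: $\frac{C}{\text{Area}} = \frac{0.7}{0.7 \times 0.7} = \frac{1}{0.7} \approx 1.43$
Capacitance per unit area increases 43%
$V_{dd} = 0.7$, $V_t = 0.7$
$I \propto \frac{W}{t_{ox}} (V_{dd} - V_t) = \frac{0.7}{0.7} \times 0.7 = 0.7$
$T \propto \frac{C V_{dd}}{I} = \frac{0.7 \times 0.7}{0.7} = 0.7$
$\text{Power} \propto C V^2 f = 0.7 \times 0.7^2 \times \frac{1}{0.7} = 0.7^2 \approx 0.5$
Delay reduces 30%, power reduces 50%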
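The ideal-scaling arithmetic on these two slides can be checked numerically. The sketch below assumes every dimension and voltage scales by 0.7, that drive current follows I ~ (W/tox)(Vdd - Vt), and that frequency is the inverse of gate delay; the variable names are illustrative only.

```python
S = 0.7  # ideal linear scaling factor per process generation

width = length = t_ox = S   # lateral and vertical dimensions
vdd = S                     # supply voltage
overdrive = S               # (Vdd - Vt), since both scale by 0.7

total_cap = width * length / t_ox          # area cap; fringing cap scales alike
die_area = S * S                           # X * Y
cap_per_area = total_cap / die_area        # capacitance per unit area
current = (width / t_ox) * overdrive       # I ~ (W / tox) * (Vdd - Vt)
delay = total_cap * vdd / current          # T ~ C * Vdd / I
frequency = 1.0 / delay
power = total_cap * vdd ** 2 * frequency   # P ~ C * V^2 * f
power_density = power / die_area

print(f"capacitance:     {total_cap:.2f}x")
print(f"die area:        {die_area:.2f}x")
print(f"cap / unit area: {cap_per_area:.2f}x")
print(f"delay:           {delay:.2f}x")
print(f"power:           {power:.2f}x")
print(f"power density:   {power_density:.2f}x")
```

Under these ideal assumptions power density stays constant, which is why the shrink alone does not explain the observed growth in power density.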
Ideal Scenarios...
                     Ideal “Shrink” (same arch)   Ideal new arch (same die size)
#Xistors             Same                         2X
Size                 0.5X                         1X
Frequency            1.5X                         1.5X
Power                0.5X                         1X
IPC (instr./cycle)   1X                           2X
Performance          1.5X                         3X
Power density        1X                           1X
Process Technologies – Reality
But in reality:
New process is not ideal anymore
New designs squeeze frequency to 2X per process
New designs use more transistors (2X-3X to get 1.5X-1.7X perf)
So, every new process and architecture generation:
Power goes up about 2X
Power density goes up 30%~80%
This is bad, and…
Will get worse in future process generations:
Voltage (Vdd) will scale down less
Leakage is going through the roof
Die increases in order to maintain performance boost
[Figure: die sizes of the Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4 processors across silicon process technologies 1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ]
Put it all together: power and power density are a real threat to Moore’s law
Complex algorithms lead to denser power:
Timing pressure leads to faster/bigger/power-hungrier gates
Dense random logic
Designers place units that communicate with each other close together. This creates “regions” with high activity factors -> hot spots.
Power is not distributed evenly over the chip. A failure can happen if a single point reaches the max power point.
Many modern processors are power limited
Some implications
We can’t build microprocessors with ever
increasing power density and die sizes
The constraint is power – not manufacturability
The design of any future micro-processor should
take power into consideration. We need to
distinguish between different aspects of power:
Power delivery
Max power (TJ)
Power density - hot spots
Energy – static + dynamic
Power and Energy aware design should take care
of each of these aspects
One-size does not fit all anymore
General solutions and directions
Assume that one size does not fit all.
For different segments there may be different
solutions (although many of them share the
same principle of operation).
Embedded systems vs. Laptops
Embedded systems
Most of the power is consumed by the CPU
Usually not thermally limited.
What we really care about is battery life and meeting the timing
limitations.
In real time systems we can take advantage of known “deadlines”
Laptops (Mobile systems)
We are thermally limited.
We can not use deadlines (most of the time).
We need to optimize for max battery life and max performance in a
given power envelope.
How to extend Battery life: Voltage Scaling
Within a given voltage range, higher voltage allows higher freq.
Used for trading power and frequency. Either
Statically, at manufacturing time
Dynamically, at run time (e.g., Intel’s SpeedStep® Technology)
Actual range depends on specific
design and process technology
Examples*:
Intel® XScale™ processors run from 0.75V (150MHz/50mW) to 1.65V (800MHz/900mW)
Intel mobile Pentium® III processors are sold from 1.1V (600MHz) to 1.7V (1GHz)
[Figure: XScale processor frequency (MHz) and power (mWatt), 0 to 1000, vs. voltage, 0.5V to 1.9V]
* Source: Intel Corp. (http://developer.intel.com)
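The two XScale operating points quoted above show how much cheaper each clock cycle becomes at low voltage. The Python sketch below uses only those quoted figures; the helper name `energy_per_cycle_nj` is illustrative.

```python
# Operating points quoted on the slide: (voltage V, frequency MHz, power mW)
low = (0.75, 150.0, 50.0)
high = (1.65, 800.0, 900.0)


def energy_per_cycle_nj(freq_mhz: float, power_mw: float) -> float:
    """Energy per clock cycle in nanojoules (mW / MHz = nJ per cycle)."""
    return power_mw / freq_mhz


e_low = energy_per_cycle_nj(low[1], low[2])
e_high = energy_per_cycle_nj(high[1], high[2])

print(f"frequency ratio (high/low): {high[1] / low[1]:.2f}x")
print(f"power ratio (high/low):     {high[2] / low[2]:.1f}x")
print(f"energy per cycle at {low[0]} V:  {e_low:.3f} nJ")
print(f"energy per cycle at {high[0]} V: {e_high:.3f} nJ")
```

The high-voltage point runs about 5.3X faster but draws 18X the power, so each cycle costs roughly 3.4X more energy.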
Voltage Scaling (cont.)
Huge effect on dynamic power:
20% freq reduction -> 20% voltage reduction
About 36% energy reduction: $a C (0.8V)^2 = 0.64\,a C V^2$
About 50% power reduction: $a C (0.8V)^2 (0.8f) \approx 0.51\,a C V^2 f$
Even more impressive if we recall:
20% freq hit -> only 10%-15% performance hit*
Voltage scaling can be used to trade performance for power
Reduce the power consumption when performance needs can be relaxed, e.g., if deadlines are known and we have enough “dead time”, we can lower the voltage at the expense of a longer execution time.
BUT it has technology limitations
* Depends mainly on core-to-bus frequency ratio and cache sizes.
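The 20% example can be reproduced with a few lines of Python. This is a sketch that assumes activity (a) and capacitance (C) stay unchanged and that voltage can be lowered in proportion to frequency; the function name is illustrative.

```python
def scaled_dynamic(voltage_scale: float, freq_scale: float) -> tuple[float, float]:
    """Return (energy_scale, power_scale) for dynamic power P = a*C*V^2*f.

    Energy per operation scales with V^2; power scales with V^2 * f.
    """
    energy_scale = voltage_scale ** 2
    power_scale = voltage_scale ** 2 * freq_scale
    return energy_scale, power_scale


energy_scale, power_scale = scaled_dynamic(voltage_scale=0.8, freq_scale=0.8)
print(f"energy: {energy_scale:.2f}x ({(1 - energy_scale) * 100:.0f}% reduction)")
print(f"power:  {power_scale:.2f}x ({(1 - power_scale) * 100:.0f}% reduction)")
```

Because voltage enters squared and frequency enters linearly, power falls with roughly the cube of the scaling factor while performance falls at most linearly.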
How to extend battery life: energy Efficiency
Energy per task
Proportional to # of processed instructions per task
Proportional to the average energy consumed per instruction
“Energy per (retired) instruction” = b*W, where
b: ratio of total to retired number of processed instructions
W: average energy spent in processing an instruction
Both figures deteriorate with every new microarchitecture, since speculation increases and complexity grows
In that respect: high-performance modern microarchitectures are less energy-efficient
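A short Python sketch of the b*W relation follows. The instruction counts and per-instruction energies are hypothetical values chosen to illustrate the trend, not measurements of any processor.

```python
def energy_per_retired_instruction(processed: int, retired: int,
                                   avg_energy_nj: float) -> float:
    """Energy per retired instruction = b * W.

    b = processed / retired (total vs. retired instructions),
    W = average energy spent processing one instruction (nJ).
    """
    b = processed / retired
    return b * avg_energy_nj


# Hypothetical simple core: little speculation, cheap instructions
simple = energy_per_retired_instruction(processed=105, retired=100, avg_energy_nj=1.0)
# Hypothetical aggressive core: more wrong-path work, costlier instructions
aggressive = energy_per_retired_instruction(processed=130, retired=100, avg_energy_nj=1.5)

print(f"simple core:     {simple:.2f} nJ per retired instruction")
print(f"aggressive core: {aggressive:.2f} nJ per retired instruction")
```

Both factors multiply, so modest growth in speculation and in per-instruction energy compounds into a much larger energy cost per useful instruction.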
Improving Hot Spots
Clustering
Build your system as a clustered architecture (e.g., Alpha)
Design your system so that it exceeds the Max-Power allowed only when all clusters are active
Most of the time, not all the clusters are active
“Smart scheduling” will spread the thermal hotspots among different clusters.
In VLIW based architectures, compilers can help
Alpha hot spots
Area 30%
Freq. 50%
Power 67%
Source - CoolChips-99
Power Complexity Metrics
$P = a C V^2 f$
Metrics: suppose we introduce a new feature that consumes extra x power and gains y performance:
1. Power/Perf (~ Energy), assuming same technology (same C) and same voltage
For battery life, energy bills.
For a given power envelope – without voltage scaling.
2. Power/Perf^2 (~ Energy*Delay)
Balance performance and power needs.
3. Power/Perf^3 (~ Energy*Delay^2)
For a given power envelope – with voltage scaling,
assuming that we can (1) trade frequency and voltage scaling, and (2) lower the voltage as much as we wish
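The three metrics can rank the same feature differently. The Python sketch below assumes x and y are fractional increases (for example 0.2 for 20% more power), which is one reading of the slide; the function name and the sample feature are illustrative.

```python
def metric_ratios(extra_power: float, perf_gain: float) -> dict[str, float]:
    """New/old ratio of each metric for a feature that multiplies power by
    (1 + extra_power) and performance by (1 + perf_gain).

    A ratio below 1.0 means the feature improves that metric.
    """
    p = 1.0 + extra_power
    s = 1.0 + perf_gain
    return {
        "Power/Perf": p / s,          # ~ Energy
        "Power/Perf^2": p / s ** 2,   # ~ Energy * Delay
        "Power/Perf^3": p / s ** 3,   # ~ Energy * Delay^2
    }


# Hypothetical feature: 20% more power for 10% more performance
ratios = metric_ratios(extra_power=0.20, perf_gain=0.10)
for name, value in ratios.items():
    verdict = "better" if value < 1.0 else "worse"
    print(f"{name}: {value:.3f} ({verdict})")
```

In this example the feature loses on energy alone but wins on the Energy*Delay and Energy*Delay^2 metrics, which is why the choice of metric has to match the target segment.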
E*D product (lower is better)
E = energy / instruction = Power * sec / instruction = Watt / MIPS
D = sec / instruction = 1 / MIPS
E*D ~ Watt / MIPS^2
[Figure: delay, energy (pJ), and ExD vs. Vdd (volts)]
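As a worked example of the E*D definition, the Python sketch below computes energy per instruction, delay per instruction, and their product from a power and MIPS figure. The two design points are hypothetical.

```python
def energy_delay(power_w: float, mips: float) -> tuple[float, float, float]:
    """Return (E, D, E*D): joules/instruction, seconds/instruction, product."""
    instr_per_sec = mips * 1e6
    e = power_w / instr_per_sec   # E = Watt / (instructions per second)
    d = 1.0 / instr_per_sec       # D = seconds per instruction
    return e, d, e * d


# Hypothetical design points (assumptions for illustration only)
e_a, d_a, ed_a = energy_delay(power_w=10.0, mips=1000.0)
e_b, d_b, ed_b = energy_delay(power_w=20.0, mips=1500.0)

print(f"A: E = {e_a * 1e9:.1f} nJ, D = {d_a * 1e9:.2f} ns, E*D = {ed_a:.2e}")
print(f"B: E = {e_b * 1e9:.1f} nJ, D = {d_b * 1e9:.2f} ns, E*D = {ed_b:.2e}")
```

Design B spends more energy per instruction yet has the lower (better) E*D product, because the delay term rewards its higher throughput.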
Leakage control
Leakage depends on: technology, area, voltage and temperature.
High temperature -> high leakage -> high power -> higher temperature
Leakage will be very significant in future microarchitectures.
Large caches contribute to performance but may increase power due to leakage.
Larger caches: better performance, but higher leakage -> slower clock -> lower performance.
Leakage makes the major difference between clock gating and deep sleep modes (where power is disconnected)
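The temperature-leakage feedback loop can be illustrated with a toy fixed-point iteration in Python. Everything numeric here is an assumption made for the sketch: the exponential leakage model, the doubling interval, the thermal resistance, and the 150°C cutoff are not from the lecture.

```python
from typing import Optional


def steady_state_temperature(p_dynamic_w: float, p_leak_ref_w: float,
                             t_ambient_c: float = 40.0,
                             r_thermal_c_per_w: float = 0.5,
                             t_ref_c: float = 60.0,
                             doubling_c: float = 25.0,
                             t_limit_c: float = 150.0) -> Optional[float]:
    """Iterate T -> leakage(T) -> total power -> T until it settles.

    Assumed toy model: leakage equals p_leak_ref_w at t_ref_c and doubles
    every doubling_c degrees. Returns None on thermal runaway.
    """
    t = t_ambient_c
    for _ in range(500):
        p_leak = p_leak_ref_w * 2.0 ** ((t - t_ref_c) / doubling_c)
        t_new = t_ambient_c + r_thermal_c_per_w * (p_dynamic_w + p_leak)
        if t_new > t_limit_c:
            return None  # the loop feeds itself: thermal runaway
        if abs(t_new - t) < 1e-6:
            return t_new
        t = t_new
    return None


stable = steady_state_temperature(p_dynamic_w=40.0, p_leak_ref_w=5.0)
runaway = steady_state_temperature(p_dynamic_w=40.0, p_leak_ref_w=60.0)
print("modest leakage settles at:", stable)
print("heavy leakage settles at: ", runaway)
```

With modest leakage the loop converges to a stable temperature; with heavy leakage each step raises the temperature further and the model never settles.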
Design for power: Out Of Order Execution
OOO architecture was found to be very efficient in masking the effect of L1 cache misses.
Aggressive OOO and wider machines require more registers and memory ports
This consumes a lot of power
Can we slow down the access to the cache and let the OOO solve the performance problem?
Can we simplify the OOO mechanisms, assuming that the memory subsystem limits the performance?
How aggressive should we be with speculation (branch prediction, value prediction, etc.)?
Pentium Pro Power Breakdown
Actual computation: less than 25%!
What can be done:
Trace cache
Many low-level improvements
[Figure: pie chart of Pentium Pro power by unit]
Fetch          14%
Misc           23%
Decode         14%
RAT             4%
MOB             4%
ROB             7%
External Bus    6%
RS              5%
Clock           5%
Data $          7%
FP Exec         5%
Int Exec        6%
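The breakdown can be tallied in Python to see how small the computational share is. Which units count as "actual computation" is not stated on the slide; the grouping below (execution units, optionally plus the data cache) is an assumption.

```python
# Percent of total power per unit, as listed on the slide
breakdown = {
    "Fetch": 14, "Misc": 23, "Decode": 14, "RAT": 4, "MOB": 4, "ROB": 7,
    "External Bus": 6, "RS": 5, "Clock": 5, "Data $": 7,
    "FP Exec": 5, "Int Exec": 6,
}

total = sum(breakdown.values())
execution = breakdown["Int Exec"] + breakdown["FP Exec"]
execution_plus_data_cache = execution + breakdown["Data $"]

print(f"total:                       {total}%")
print(f"execution units:             {execution}%")
print(f"execution units + data cache: {execution_plus_data_cache}%")
```

Under either grouping the share stays below the 25% quoted on the slide, with the rest going to fetch, decode, scheduling, and other overhead.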
SMT
Single CPU µArch augmented to look like 2 or more CPUs to the software
Adds ~10% logic to CPU (Alpha experience)
Average power increases <10%.
Can increase performance of two threads by 20-50% relative to running the same applications sequentially.
Looks like a good tradeoff between power and performance.
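The SMT tradeoff can be quantified with the Power/Perf ratio. The Python sketch below takes the slide's figures at face value, using 10% as an upper bound on the power increase; the function name is illustrative.

```python
def power_per_perf_ratio(power_scale: float, perf_scale: float) -> float:
    """New/old Power/Perf ratio; below 1.0 means less energy per unit of work."""
    return power_scale / perf_scale


POWER_SCALE = 1.10  # upper bound: average power increases by less than 10%

low_gain = power_per_perf_ratio(POWER_SCALE, perf_scale=1.20)
high_gain = power_per_perf_ratio(POWER_SCALE, perf_scale=1.50)

print(f"Power/Perf at 20% speedup: {low_gain:.2f}x")
print(f"Power/Perf at 50% speedup: {high_gain:.2f}x")
```

Even at the low end of the quoted speedup range, the ratio is below 1.0, so SMT reduces the energy spent per unit of work.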
MT - Implications on power
The area and the power consumption of register files and memory elements within the processor increase significantly due to aggressive out-of-order and aggressive SMT (Alpha, CoolChips ’99)
Increases the power at the hot spot; not a fit for thermally limited segments (where performance is needed).
May better tolerate cache misses, so power-aware caches can be used
Hot spots may force us to use more aggressive clustering
Question?