Closing the Power Gap between ASIC and Custom

David Chinnery, Kurt Keutzer
Outline
• Motivation for focusing on reducing ASIC power
• The power gap between ASIC and custom
• Where does the power go?
• What can we do about it?
• Conclusions on automating low power techniques

Why power?
• Battery life is limited by power (e.g. laptop, mobile phone)
• Costs for packaging and cooling increase rapidly with power dissipation (e.g. plastic vs. ceramic package, heatsink, fan)
• Higher temperatures degrade performance and reliability
– Circuits are slower, with more leakage, at higher temperature
– Less reliable due to increased rate of electromigration
• Increasing integration increases power demand in portable applications (e.g. mp3 player/PDA/mobile phone combined)
• Performance is now limited by power even for high end microprocessors

Power of high performance chips has increased
As device dimensions (W, L, Tox) are scaled down by a factor k, for high performance:
• If supply voltage Vdd and threshold voltage Vth are fixed, then power/unit area ∝ k^3
• If Vdd and Vth are scaled down linearly and […], then power/unit area ∝ k^0.7
Further voltage scaling may be limited …
[Figure: power/unit area (W/cm2, 1 to 1000) vs. scaling factor k (1/um, 1 to 10) for microprocessors and digital signal processors; data from ISSCC chips 1982-2002] [Kuroda OYO BUTURI 2004]

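To get a feel for the two scaling trends quoted on this slide, here is a minimal Python sketch. It only evaluates the two power laws (exponents 3 and 0.7, taken from the slide); the function name and the sample values of k are illustrative assumptions.

```python
def power_density_growth(k: float, exponent: float) -> float:
    """Relative power per unit area after scaling dimensions down by factor k,
    assuming power/unit area is proportional to k**exponent."""
    return k ** exponent


if __name__ == "__main__":
    # Sample scaling factors are arbitrary, chosen only for illustration.
    for k in (1.0, 2.0, 4.0):
        fixed = power_density_growth(k, 3.0)    # Vdd and Vth held fixed
        scaled = power_density_growth(k, 0.7)   # Vdd and Vth scaled down
        print(f"k={k}: fixed voltages x{fixed:.1f}, scaled voltages x{scaled:.2f}")
```

Even with voltage scaling the exponent stays positive, so power density still rises with each generation, only far more slowly.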
Impact of voltage scaling on power
Major components of power: Ptotal = Pdynamic + Pleakage
• Dynamic power is due to switching of capacitances
– Reducing Vdd gives a quadratic reduction in Pdynamic
• But transistor drive current depends on Vdd [Chen in Trans. on Electron Devices 1997]
– Must reduce Vth to maintain drive current
• But reducing Vth increases subthreshold leakage current, which is the major contributor to Pleakage
• Must look for other ways to reduce power
[Figures: CMOS gate schematics labeled Vdd, 0V, Vth,p, Vth,n and Cload, showing the dynamic power path and the subthreshold leakage path]

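The quadratic dependence of dynamic power on Vdd can be sketched in a few lines of Python. This assumes the usual first-order form Pdynamic = a·C·Vdd²·f (activity a, capacitance C, clock frequency f, the symbols named in the extra slides); any constant factor depends on how activity is defined, and the example voltages are made up.

```python
def dynamic_power(activity: float, capacitance: float, vdd: float, frequency: float) -> float:
    """First-order switching power in watts: a * C * Vdd^2 * f (assumed form)."""
    return activity * capacitance * vdd ** 2 * frequency


def dynamic_power_ratio(vdd_new: float, vdd_old: float) -> float:
    """Ratio of dynamic power after changing only Vdd (a, C, f unchanged)."""
    return (vdd_new / vdd_old) ** 2


if __name__ == "__main__":
    # Hypothetical example: lowering Vdd from 1.2 V to 1.0 V.
    print(f"dynamic power ratio: {dynamic_power_ratio(1.0, 1.2):.3f}")
```

This ignores the delay penalty of the lower Vdd, which is exactly why Vth has to come down with it and leakage goes up.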
Automate low power techniques
• Custom designers can try to optimize the design at all levels
• Electronic design automation (EDA) tools for ASICs
– Most of the design optimization is high level
– Fast time-to-market and lower design cost
– Increasingly important to reduce design cost for larger chips
• What is the power gap between (automated) ASIC design and custom design?
– We need to characterize the contributing factors
– Can we close the power gap?
– Identify custom techniques that can be used in an EDA flow

What is our metric for power?
• Power
– Fixed performance constraint (clock frequency or throughput, e.g. 30 frames/s for MPEG2)
– Reduce the power and meet the performance constraint
• Energy efficiency
– No performance constraint
– Throughput/unit power, 1/(P·T·CPI), e.g. MIPS/mW, where P is power and T is the clock period
– Cycles per instruction (CPI) accounts for the impact of architectural choices (e.g. stalled pipeline stages)
– Energy/operation is the inverse of throughput/unit power
– Maximize throughput/unit power or minimize energy/operation

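Since energy/operation is the inverse of throughput/unit power, a MIPS/mW figure converts directly into energy per instruction. A small Python sketch of the unit conversion (the function name is an assumption for illustration):

```python
def energy_per_instruction_nj(mips_per_mw: float) -> float:
    """Convert throughput/unit power (MIPS/mW) to energy/operation (nJ/instruction)."""
    # 1 MIPS = 1e6 instructions/s and 1 mW = 1e-3 J/s,
    # so 1 MIPS/mW = 1e9 instructions per joule.
    instructions_per_joule = (mips_per_mw * 1e6) / 1e-3
    return 1e9 / instructions_per_joule


if __name__ == "__main__":
    for efficiency in (0.5, 1.0, 2.0):
        nj = energy_per_instruction_nj(efficiency)
        print(f"{efficiency} MIPS/mW -> {nj:.2f} nJ/instruction")
```

So 1 MIPS/mW is 1 nJ per instruction, and doubling MIPS/mW halves the energy per instruction.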
What is the power gap? ARM cores
• ×2 to ×3 gap between custom and hard macro ARMs
• ×1.3 to ×1.4 gap between ARM7TDMI-S and ARM7TDMI
• ×3 to ×4 overall from synthesizable to custom ARMs
[Figure: Comparison of Custom and Hard Macro ARM Implementations; Dhrystone 2.1 MIPS/mW (0.0 to 3.0) vs. process technology (0.60 to 0.13 um); labeled points: XScale, StrongARM, Burd]

What is the power gap? DCT/IDCT blocks
• ×4 to ×7 gap between ASIC and custom discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) blocks, after scaling linearly for technology [Fanucci ICECS 2002]
– We assumed power reduces linearly with technology
• To get 30 frames/s MPEG2 with a general purpose processor would require two ARM9 cores and would consume 15× the power [Fanucci ICECS 2002]
– Application-specific hardware substantially reduces power

Breakdown of power by functionality
Typical breakdown of on-chip power consumption for an embedded microprocessor
• Clock 20% to 40%
• Memory 20% to 40%
• Control + datapath 40% to 60%
• Input/output to off-chip ~5%
• Most of the power is in the datapath, control, clock tree and memory
– Techniques focus on reducing this power
– Several companies provide custom memory for ASIC processes, so we won’t discuss memory here

Summary of factors' effect on active power
Automated designs are higher power than custom because of …

                                                ASIC design quality
Factor                                          typical   excellent
Microarchitecture (pipelining, parallelism)     ×2.6      ×1.3
Clock gating and power gating                   ×1.6      ×1.0
Logic design                                    ×1.2      ×1.0
High speed logic styles (DCVSL, PTL, domino)    ×1.3      ×1.3
Technology mapping                              ×1.4      ×1.0
Cell sizing and wire sizing                     ×1.6      ×1.1
Voltage scaling, multi-Vth, multi-Vdd           ×4.0      ×1.0
Floorplanning and placement                     ×1.5      ×1.1
Process variation and process technology        ×2.6      ×1.2

Microarchitecture
leverage for voltage scaling and sizing
Increase throughput/cycle to allow Vdd reduction
• Pipelining inserts registers, increasing throughput
• Limited by
– Reduction in instructions/cycle (1/CPI) due to branch misprediction, waiting to read or write memory, etc.
– Power and delay for registers, data forwarding logic, and branch prediction
• Parallelism increases throughput in exchange for increased area
• Limited by
– Routing, multiplexing, control overheads
[Figure: registers inserted between the instruction fetch, instruction decode, ALU, memory access and write back stages]

Microarchitecture: pipelining model
leverage for voltage scaling and sizing
• Pipeline power model [Hartstein 2003]:
– n stages, a factor of 1.1 for latch growth vs. n, and 0.05 for register power
• Minimum stage delay:
– ASIC pipelining overhead of 10 FO4 (register delay) + 10 FO4 (imbalance)
– Custom pipelining overhead of 2.6 FO4 total, same combinational delay of 175 FO4
• CPI penalty of 0.025/stage for custom, and 0.05/stage for ASICs
• Add fits for dynamic and leakage power with voltage scaling and sizing
• At a 40 FO4 delay constraint (500MHz for Leff=0.1um), ASIC is ×2.6 worse
[Figure: 1/(energy/operation) curves for ASIC and custom; 0.050 vs. 0.019 => ×2.6]

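The timing side of this model can be sketched in Python. The sketch assumes the clock period is t_combinational/n + t_overhead with the logic divided evenly across n stages, and that the CPI penalty accumulates linearly with the number of stages; both are simplifying assumptions, and it does not reproduce the power fits behind the ×2.6 figure.

```python
import math


def min_stages(t_comb_fo4: float, t_overhead_fo4: float, t_clock_fo4: float) -> int:
    """Fewest pipeline stages that meet the clock period, assuming
    period = t_comb / n + t_overhead (all delays in FO4)."""
    usable = t_clock_fo4 - t_overhead_fo4
    if usable <= 0:
        raise ValueError("pipelining overhead alone exceeds the clock period")
    return math.ceil(t_comb_fo4 / usable)


def total_cpi_penalty(stages: int, penalty_per_stage: float) -> float:
    """CPI penalty, assumed to grow linearly with pipeline depth."""
    return stages * penalty_per_stage


if __name__ == "__main__":
    # Numbers from the slide: 175 FO4 of logic, 40 FO4 clock period.
    for name, overhead, penalty in (("ASIC", 20.0, 0.05), ("custom", 2.6, 0.025)):
        n = min_stages(175.0, overhead, 40.0)
        print(f"{name}: {n} stages, CPI penalty {total_cpi_penalty(n, penalty):.3f}")
```

Under these assumptions the ASIC needs 9 stages against 5 for custom, so it pays for more registers and a larger CPI penalty before any voltage scaling is considered.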
Microarchitecture
leverage for voltage scaling and sizing
Custom IDCT – pipelining to reduce Vdd [Xanthopoulos JSSC’99]
• With pipeline: Vdd=1.32V, 20% power overhead 2
• Without pipeline: Vdd=2.2V to meet throughput
Parallel datapaths [Bhavnagarwala IEEE Trans. VLSI’00]
• ×2 to ×4 reduction in power from lowering Vdd, with parallel datapaths added to maintain throughput
Microarchitecture speed gap is ×1.8 (typical) to ×1.3 (excellent)
• At a tight delay constraint, this corresponds to about ×2.6 to ×1.3 worse power due to higher Vdd, lower Vth, and wider gates to compensate

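A rough check of the custom IDCT example, as a Python sketch: it assumes power is purely dynamic and scales with Vdd², with the 20% pipelining overhead applied as a multiplier at the same throughput. The paper's own reported saving may differ from this first-order estimate.

```python
def pipelined_power_reduction(vdd_unpipelined: float, vdd_pipelined: float,
                              pipeline_overhead: float) -> float:
    """Estimated power reduction factor from pipelining to allow a lower Vdd.

    Assumes dynamic power ~ Vdd^2 at fixed throughput, and that pipelining
    adds `pipeline_overhead` (e.g. 0.20 for 20%) extra power.
    """
    voltage_gain = (vdd_unpipelined / vdd_pipelined) ** 2
    return voltage_gain / (1.0 + pipeline_overhead)


if __name__ == "__main__":
    factor = pipelined_power_reduction(2.2, 1.32, 0.20)
    print(f"estimated power reduction: x{factor:.2f}")
```

With the slide's voltages this gives a little over ×2, which falls in the same ×2 to ×4 range as the parallel datapath result.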
Clock gating
×1.6 to ×1.0
• Clock signal has high activity, 2. Logic has lower activity, ~0.1.
• Turn off clocks to inactive modules
• Some DCT/IDCT registers are active < 3% of the time; clock gating and avoiding computation reduce power by ×10 [August SOC’01]
• Typical savings are up to ×1.6 power reduction
• Power minimization tools automatically insert gated clocks
• Designer can make microarchitectural/algorithm decisions
– E.g. reduce precision for DCT/IDCT coefficients
– Precomputation of control signals reduces power by ×1.4 to ×3.3 [Hsu ISLPED’02]
• ASICs can do this
[Figure: add and shift units sharing a clock, before and after inserting clock gating controlled by select_add and select_shift]

Power gating
reduces leakage in standby
• Turn off leakage path in inactive modules
– May need to preserve the state registers
• Can reduce standby leakage by 3 orders of magnitude [Mutoh JSSC’95]
• Other approaches
– Reverse biasing the substrate
– Setting input vectors to low leakage states gives ×1.4 leakage reduction [Lee DAC’03]
• Just now getting ASIC methodology support
– Need large sleep transistors to turn off power
– Sleep transistors reduce available supply voltage
[Figure: add and shift units with clock, select_add and select_shift signals]

High speed logic styles
leverage for voltage scaling and sizing
Low power designs use mostly static CMOS logic
• Static CMOS logic is low leakage, robust
• PMOS pullup series transistors are slow
Faster custom logic styles speed up critical paths
• Custom can use slack from higher speed (×1.4) to reduce power by lowering Vdd
– ASIC power ×1.3 worse than custom at a tight delay constraint due to logic style
[Figure: power vs. delay for a 32-bit adder [Tiwari DAC’98], labeled "domino 22% higher" and "static 25% lower"; schematics of static CMOS (PMOS transistors in series: slow, larger capacitance), DCVSL, PTL and domino gates]

Technology mapping
×1.4 to ×1.0
Technology mapping tools don’t target low power
• We found that targeting minimum area for multipliers can result in ×1.3 power; delay is a poor choice
Technology mapping techniques to reduce active power
• ×1.0 – ASICs can do as well as custom, if tools improve
[Figure: two equivalent logic implementations annotated with switching activities (1/2 at the inputs; 3/8 and 7/32 at internal nodes); the second has lower activity]

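The activity values in the figure (1/2, 3/8, 7/32) are what a simple probabilistic model gives for AND gates with random inputs. A minimal Python sketch, assuming independent inputs and no correlation between consecutive cycles, so that activity = 2·p·(1−p) for a node that is 1 with probability p. The exact netlists in the figure are not recoverable here, so this only reproduces the individual node values.

```python
from fractions import Fraction


def and_probability(*input_probs: Fraction) -> Fraction:
    """Probability that an AND gate output is 1, assuming independent inputs."""
    p = Fraction(1)
    for q in input_probs:
        p *= q
    return p


def switching_activity(p_one: Fraction) -> Fraction:
    """Expected transitions per cycle for a node that is 1 with probability
    p_one, assuming consecutive cycles are independent."""
    return 2 * p_one * (1 - p_one)


if __name__ == "__main__":
    half = Fraction(1, 2)
    print("primary input:", switching_activity(half))
    print("2-input AND:  ", switching_activity(and_probability(half, half)))
    print("3-input AND:  ", switching_activity(and_probability(half, half, half)))
```

A mapper that targets power can use estimates like these to keep high-activity nodes on small capacitances or hide them inside complex cells.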
Cell sizing and wire sizing
×1.6 to ×1.1
• ×1.35 power reduction on Xtensa processor at 325MHz by (mostly sizing) power minimization with Design Compiler and 0.13um library [internship at Tensilica]
• Can do better than Design Compiler (DC) with cell sizing via linear program (LP) (global optimization vs. greedy “pin-hole” optimization), about ×1.1 to ×1.2 power reduction [Chinnery, Keutzer, to appear at ISLPED’05]

ISCAS'85   # logic            Minimum      Power (mW) at 1.1 Tmin   Power (mW) at 1.2 Tmin
netlist    levels   # cells   delay (ns)   DC       LP              DC       LP
c17        4        10        0.094        1.11     1.08            0.86     0.76
c432       24       259       0.733        2.78     2.25            2.22     1.76
c499       25       644       0.701        5.83     4.62            4.98     3.76
c880       23       484       0.700        3.37     3.49            2.83     2.61
c1355      27       764       0.778        6.88     5.53            5.97     4.12
c1908      33       635       0.999        3.26     3.11            2.67     2.44
c2670      23       1164      0.649        9.23     8.63            8.08     6.90
c3540      36       1283      1.054        6.69     5.79            5.60     4.70
c5315      34       1956      0.946        10.39    9.51            8.82     7.81
c6288      113      3544      3.305        6.91     6.07            6.08     4.78
c7552      31       2779      0.847        18.02    16.65           15.60    13.63
Average savings vs. Design Compiler:       10% (1.1 Tmin)           16% (1.2 Tmin)

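The average savings can be recomputed from the per-circuit power numbers. The Python sketch below assumes the slide's average is the mean of the per-circuit percentage savings of LP over DC; under that assumption the values come out close to the quoted 10% and 16%.

```python
# (DC, LP) power in mW for c17 through c7552, copied from the slide.
POWER_AT_1_1_TMIN = [(1.11, 1.08), (2.78, 2.25), (5.83, 4.62), (3.37, 3.49),
                     (6.88, 5.53), (3.26, 3.11), (9.23, 8.63), (6.69, 5.79),
                     (10.39, 9.51), (6.91, 6.07), (18.02, 16.65)]
POWER_AT_1_2_TMIN = [(0.86, 0.76), (2.22, 1.76), (4.98, 3.76), (2.83, 2.61),
                     (5.97, 4.12), (2.67, 2.44), (8.08, 6.90), (5.60, 4.70),
                     (8.82, 7.81), (6.08, 4.78), (15.60, 13.63)]


def average_saving_percent(pairs):
    """Mean of per-circuit savings (DC - LP) / DC, in percent (assumed metric)."""
    savings = [(dc - lp) / dc for dc, lp in pairs]
    return 100.0 * sum(savings) / len(savings)


if __name__ == "__main__":
    print(f"1.1 Tmin: {average_saving_percent(POWER_AT_1_1_TMIN):.1f}%")
    print(f"1.2 Tmin: {average_saving_percent(POWER_AT_1_2_TMIN):.1f}%")
```

The LP result is slightly worse than DC on one circuit at 1.1 Tmin, which is why the average at the tighter constraint is lower.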
Cell sizing and wire sizing
×1.6 to ×1.1
• Cell libraries lack fine-grained sizes and skewed P:N drives
– [Hurat SNUG’01] Generate new cells: ×1.2 power reduction and ×1.15 faster for bus controller, ×1.4 MHz/mW
• Simultaneous buffer and wire sizing reduced clock tree power by ×2.7 [Gong ISLPED’96]
– ×1.1 to ×1.2 reduction in total power
– Not available for ASIC interconnect yet
• Up to ×1.6 gap due to cell sizing and wire sizing; can reduce to ×1.1 using a library with finely-grained sizes, a good sizing tool, and design-specific cells
[Figure: optimizing the transistor sizes of a cell between Vdd and GND]

Dynamic supply and substrate biasing
×4.0 to ×1.0
• Change Vdd based on processor load
– ×10 more energy efficient at low performance [Burd ISSCC’00]
– Adaptive voltage scaling with the ARM11 gives ×1.7 power reduction for voice, SMS, web applications [National Semiconductor, ARM ’02]
• Reduce Vdd and bias substrate to lower Vth
– ×1.7 reduction in power, same speed [Hamada CICC’98]
– Increase Vth in standby to reduce leakage
• These are complicated to automate for ASICs
– Dynamic voltage requires accurate knowledge of path delays
[Figure: energy (mW/MIPS) vs. MIPS [Burd ISSCC 2000]]

Multiple supply and threshold voltages
×4.0 to ×1.0
Basic idea: high speed where critical, low power elsewhere
• Dual Vdd reduces power by ×1.7 after substrate biasing/lower Vdd [Usami JSSC’98]
– ×2 reduction in clock tree power by using low Vdd
• Separate voltage islands – different speeds and Vdd [Lackey ICCAD’02]
– Turn off Vdd to modules not in use, reduces leakage by ×500
– ×1.25 to ×3 average power reduction, depending on activities
• Dual Vth can give ×3 to ×6 reduction in leakage [Sirichotiyakul DAC’99]
ASICs are limited to Vdd and Vth offered by library and foundry
• Can’t change Vth to design-specific optimal point
• Standard cell libraries characterized at only two or three Vdd
• Dual Vdd requires level converters and dual Vdd layout

Floorplanning and placement
×1.5 to ×1.1
Poor floorplanning and cell placement, inaccurate wire loads
• ×1.5 worse power than custom
We compared partitioning a design into 50K vs. 200K gate modules from 0.25um to 0.13um
• 42% longer wires for 200K partitions
• Interconnect is 20% to 40% of total power [Sylvester ICCAD’98]
• ×1.1 to ×1.2 increase in total power due to wiring, and gates will be upsized to drive the longer wires
[Figure: automatic place and route vs. block partitioned layout [Hauck Micro. Report ’01]]

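The ×1.1 to ×1.2 estimate follows from the two numbers on this slide. A Python sketch, assuming interconnect power scales linearly with wire length and ignoring the extra power from upsized gates (so it sits at the low end of the estimate):

```python
def total_power_increase(wire_length_increase: float, interconnect_fraction: float) -> float:
    """Factor by which total power grows when wires get longer.

    Assumes interconnect power is proportional to wire length and the rest
    of the power is unchanged (no gate upsizing).
    """
    return 1.0 + wire_length_increase * interconnect_fraction


if __name__ == "__main__":
    # 42% longer wires; interconnect is 20% to 40% of total power.
    for fraction in (0.20, 0.40):
        factor = total_power_increase(0.42, fraction)
        print(f"interconnect {fraction:.0%} of power -> total power x{factor:.2f}")
```

Upsizing the gates that drive the longer wires adds further power on top of these factors.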
Floorplanning and placement
×1.5 to ×1.1
• Bit slices – can reduce wire length by 70% or more vs. automated place-and-route
– Up to ×1.4 energy reduction from being faster with lower wiring capacitance [Chang SM Thesis MIT’98]
– ×1.5 energy reduction from bit slicing and some logic optimization [Stok, Puri, Bhattacharya, Cohn]
• Manual place-and-route achieves 10% shorter wires and is ×1.1 faster, about ×1.1 energy reduction [Chang SM Thesis MIT’98]
• ASICs still ×1.1 higher power than custom due to layout
[Figure: automatic place-and-route vs. tiled bit-slices vs. custom layout]

Process variation impact on power
×2.6 to ×1.2
• ASICs are designed to work at the worst case delay and worst case power corners for the process – typical delay and power are less
– Simulated power was ×1.7 actual power for custom DCT/IDCT
• Up to a factor of 1.75 between worst and best (average power of 80 chip samples in 0.3um)
[Figure: spread in chip power, annotated with ×1.75 and ×1.5 [Takahashi JSSC’98]]

Process variation impact on power
×2.6 to ×1.2
• Binning would leave a gap of ×1.4 between low and high bins
• We found a gap of ×1.2 between low speed (high power) and high speed (low power, after derating for Vdd and frequency) bins of 0.18 and 0.13um Intel and AMD PC chips
– ASICs don’t speed bin (they scan test, no speed test)
[Figure: ×1.4 gap between the low power bin and the higher power bin]

Process technology
×2.6 to ×1.2
• Low power libraries are more expensive
– 5% to 10% transistor width shrinks to reduce capacitances
– Copper has 40% lower resistivity than aluminum
– Low-k dielectric reduces wire capacitances – we estimate about a ×1.1 reduction in total power with a low-k dielectric
– Silicon-on-insulator is ×1.1 to ×1.3 faster, ×1.4 power reduction [Narendra Symp. VLSI 2001]
• We compared cell libraries in UMC 0.13um vs. IBM 0.13um process
– IBM cells about ×1.05 faster, ×1.6 higher active power, UMC had ×17 leakage
Overall impact of process variation and technology
• ×2.6 ASIC power relative to custom for worst case conditions and a cheap process
• ×1.2 in a low power process, typical conditions, no speed binning

Low power design conclusions
• Typical ASIC is ×3 to ×7 less energy efficient than custom
– We assumed ASIC and custom designs can use the same microarchitectural and logic design techniques. These are the biggest levers for reducing power.
– Can get ×10 or more going from general purpose hardware to application-specific hardware.
– E.g. Fast Fourier transform implementations as discussed in Andrew Chang’s paper.
• The largest factor for the power gap is voltage scaling
– responsible for up to ×4
• Process and microarchitecture can be large factors, about ×2.6 each

Low power design conclusions
By incorporating custom techniques, ASICs can get within
• ×3 at a high performance target
– Can’t use custom logic styles
– ASIC speed penalty drags down efficiency, as higher Vdd, lower Vth, and upsized gates are needed to meet performance target
• ×1.5 at a lower performance target (~×2 slower)
– Make full use of scaling down Vdd and Vth

Low power ASIC design example
0.13um DSP example [Stok, Puri, Bhattacharya, Cohn]
• 240,000 gates implementing Hilbert transform, FIR filter, and fast Fourier transform, with 42KB register array
• Technology mapping, logic design (carry save adders), bit slicing, and physical synthesis gave a ×1.86 increase in efficiency
• A fine grained standard cell library gave another ×1.16
• Voltage scaling gave another factor of ×1.46
• ×3.1 increase in MHz/mW overall
The third speaker, Ruchir Puri, will discuss some of their recent low power work at IBM.

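The overall gain in this example is the product of the three step gains. A quick Python check, assuming the steps compose multiplicatively (which the slide's wording, "gave another", suggests):

```python
import math

# Step gains in MHz/mW quoted on the slide.
STEP_GAINS = {
    "mapping, logic design, bit slicing, physical synthesis": 1.86,
    "fine grained standard cell library": 1.16,
    "voltage scaling": 1.46,
}


def overall_gain(gains) -> float:
    """Combined efficiency gain, assuming the steps multiply."""
    return math.prod(gains)


if __name__ == "__main__":
    print(f"overall: x{overall_gain(STEP_GAINS.values()):.2f}")
```

The product is about ×3.15, in line with the ×3.1 overall figure.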
Extra slides

Impact of voltage scaling on power
Ptotal = Pdynamic + Pshort circuit + Pstatic
• Short circuit power when switching is 10% or less of Ptotal
• Dynamic power is due to switching of capacitances
– Reducing Vdd gives a quadratic reduction in Pdynamic
• But transistor drive current depends on Vdd [Chen in Trans. on Electron Devices 1997]
– Must reduce Vth to maintain drive current
• But reducing Vth increases subthreshold leakage current, which is the major contributor to Pstatic
(Clock frequency f; gate switching activity a; capacitance C; transistor length L; transistor gate oxide thickness Tox; temperature T; constants […], t, Io, and m.)
[Figures: CMOS gate schematics labeled Vdd, 0V, Vth,p, Vth,n and Cload, showing the short circuit current, dynamic power and subthreshold leakage paths]

ITRS leakage power trends
• Further Vdd voltage scaling will be limited
• Can’t scale down Vth much further due to large subthreshold leakage currents
• Gate tunneling leakage through thin gate oxide Tox is also becoming a significant cause of leakage
• Must also look to other low power techniques
[Figure: power/die area (W/cm2, 0.001 to 1000) vs. technology (0.13 to 0.022 um), showing total power and leakage for high speed (fast, low Vth) and low power (slow, high Vth) devices; leakage is increasing. From International Technology Roadmap for Semiconductors data for 2001-2016 (assuming activity of 0.1, ignoring interconnect).]

Summary of factors affecting (active) power
Automated designs are higher power than custom because of …

                                                ASIC design quality
Factor                                          typical   excellent
Microarchitecture (pipelining, parallelism)     ×2.6      ×1.3
Memory                                          ×1.4      ×1.0
Clock gating and power gating                   ×1.6      ×1.0
Logic design                                    ×1.2      ×1.0
High speed logic styles (DCVSL, PTL, domino)    ×1.3      ×1.3
Technology mapping                              ×1.4      ×1.0
Cell sizing and wire sizing                     ×1.6      ×1.1
Voltage scaling, multi-Vth, multi-Vdd           ×4.0      ×1.0
Floorplanning and placement                     ×1.5      ×1.1
Process variation and process technology        ×2.6      ×1.2

Memory – reduce cache misses
×1.4 to ×1.0
• Larger caches consume more power, but reduce cache misses
– On a miss the pipeline stalls, waiting many cycles for a read/write to off-chip memory
• Caches with higher associativity (e.g. 8-way vs. direct mapped) consume more power; associativity also affects the likelihood of a cache miss
• [Duarte ASIC/SOC 2001]
– Sub-banking: only precharge the needed section of the cache bank, ×1.32 energy savings
– Software optimizations to reduce cache misses gave on average a ×1.6 reduction in power
• 90% of the StrongARM area was caches; increasing the transistor length in the caches by 12% reduced leakage by ×20 [Montanaro JSSC’96]
• ASICs can do this, custom memory is available for ASICs
[Figure: processor, on-chip cache and write buffer, and slower off-chip memory]

Logic design
×1.2 to ×1.0
Logic design refers to the topology and logic structure to implement functional units
• Logic switching activity of a carry select adder was ×1.8 worse than a 32-bit carry lookahead [Callaway VLSI Signal Proc.’92]
• 0.13um 64-bit radix-2 compound domino adder was slower and about ×1.3 energy compared to radix-4 [Zlatanovici ESSC’03]
• We implemented an algorithm to reduce switching activity in multipliers, reduced energy by ×1.1 for 64-bit [Ito ICCD’03]
• Given similar design constraints, ASIC designers can choose the same logic design as custom, ×1.0
[Figure: carry save adder followed by a ripple carry adder, summing 4-bit inputs x, y and z to produce (x+y+z) bits 0 to 4]

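The carry save structure in the figure can be illustrated with a short Python sketch operating on integers as bit vectors. It shows only the arithmetic idea (three operands reduced to a sum word and a carry word with no carry propagation, then one carry-propagate add); the function names are illustrative and it does not model switching activity or energy.

```python
def carry_save_add(x: int, y: int, z: int):
    """Reduce three operands to (sum word, carry word) using bitwise full adders.

    Each bit position is handled independently, so no carry ripples here.
    """
    partial_sum = x ^ y ^ z                       # per-bit sum output
    carry = ((x & y) | (x & z) | (y & z)) << 1    # per-bit carry, weight doubled
    return partial_sum, carry


def add_three(x: int, y: int, z: int) -> int:
    """x + y + z as a carry save stage followed by a single carry-propagate
    add (the role played by the ripple carry adder in the figure)."""
    partial_sum, carry = carry_save_add(x, y, z)
    return partial_sum + carry


if __name__ == "__main__":
    print(add_three(0b1011, 0b0110, 0b1101))  # same as 11 + 6 + 13
```

Because only the final stage propagates carries, the structure is shallower than chaining two ripple carry adders, and it is equally available to ASIC and custom designers.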