SoC low power


1
Low Power Techniques
경종민
[email protected]
2
Contents
1. Introduction
2. Why low power?
3. Future Opportunities for Low-Power
4. How to reduce power
3
1. Introduction
1) Drivers for IC progress
[Figure: radar chart comparing implementation styles (FPGA, µP, Full custom) along the axes Delay, Power, Size, Reliability^-1, Flexibility^-1 (Programmability), Design TAT, and Cost]
• Silicon is the winner, and among the many silicon technologies, CMOS is
the winner.
• It will remain so for at least the next 25 years.
4
There's no show stopper! (in technology)
e.g., quantum mechanics/thermodynamics (min. switching energy, power dissipation),
electromagnetics (speed of light),
materials, etc.
Except for the multi-billion-dollar investment cost!
Moore's law will keep being honored.
Why?
1. No insurmountable obstacle exists.
2. People believe in it and behave accordingly.
• A huge opportunity exists, but only if we do well in exploiting
1) cross-breeding, co-utilization and co-development among
interacting technologies
2) technology sharing over networks
5
2) Big Picture : If power reduction is THE goal, you need
to visit all areas to achieve it.
Metrics (columns): Speed | Power | Design time | Fab. cost
Design levels (rows): algorithm, architecture, logic, circuit, device, process, material, S/W, programmability
6
Analogy: vertical engineer vs. horizontal engineer
If you want to sell a graphics chip, you need to do everything that helps
achieve it, from design and application to marketing, etc.
[Figure: grid of products (columns) vs. activities (rows)]
Products: µP-core, wireless, graphics, giga-bit switch, MPEG, RAMBUS
Activities: marketing, application, legal affairs (IP), manufacturing, verification, design, testing, simulation, process tuning
Labels: Horizontal engineer (spans the products), Vertical engineer (spans the activities)
7
2. Why low power?
1) Battery technology progresses slowly: 5-8x improvement in 200 years
200 years ago: lead-acid battery, 25 Wh/kg
→ now: lithium-polymer battery, 200 Wh/kg
By comparison, semiconductor technology has improved 10^6 times in 30 years (CPU speed) and 4 times every 3 years (memory density) → Still a wild, wild frontier stretching before
us!
2) Heat dissipation problem:
You don't want a big cooling tower for each IC!
3) Energy saving:
minimize the amount of energy consumption and the recirculation
period; otherwise our earth will be EXHAUSTED.
4) Convenience:
too many wires around: a mess
8
3. Future Opportunities for Low-Power
1) PDA(Personal Digital Assistant)
telephone, pager, pen-based input, schedule keeper,
audio/video entertainment, fax, video camera, data security
with fingerprint and/or voice recognition, speech recognition,
appl. S/W, teleconferencing…
[Figure: PDA linked to an application server and an RF base station; function sharing for "low-power"ing the PDA]
2) Tablet (descendant of the current notebook)
9
3) Virtual Reality(VR) headset for Games
: allows you to move around, only if there’s no wire.
: delegate complex processing to fixed server, while
performing only video decompression.
4) Military :
No chance for wires, and no heavy batteries, as you are too busy.
– Information warfare :
1) Soldier locates enemy tank using laser rangefinder with GPS
2) request(for airstrike) to control officers
3) aircraft nearby gets command
10
5) Pico-cell based home network for Games
[Figure: home network. External access: FTTH, satellite, xDSL, cable, cellular. A home I/F connects home automation (temp control, security), the home cellular A/V digital network, PDA, cellular video-phone, phone & TV, HDTV, VCR, game, camera, printer.]
Get all available service,
Allow all possible communications among home devices,
But with no messy wires.
11
6) Medical Uses
pacemaker (implanted)
health monitor
hearing aids
7) GPS (for travellers/explorers, drivers (car, ship, boat), soldiers …)
8) RF ID (for identifying people, animals, cars …)
passive type: resonant LC circuits
active type (no battery, draws RF power from RF field)
9) Smart cards:
national ID card, cash withdrawal
encryption, COS (card OS)
12
4. How to reduce Power
• By all means possible: algorithm, S/W, architectures, data
representation, logic & circuits, place & route, clock, process,
library, material
1) algorithm:
adjusting the # of taps (N) in FIR filters by measuring noise power.
[Figure: filter transfer function for N=10 and for N=6 (low power)]
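A minimal sketch of the tap-adjustment idea, assuming a simple two-level policy. The threshold value and the function names are hypothetical, and the assumption that low measured noise power allows the shorter filter is made for illustration only; the slide gives just the two filter lengths N=10 and N=6.

```python
def choose_num_taps(noise_power, threshold=0.01, low_taps=6, high_taps=10):
    """Pick the FIR length from measured noise power.

    Hypothetical policy: use the short (low-power) filter when the
    measured noise power is at or below the threshold.
    """
    return high_taps if noise_power > threshold else low_taps


def fir_filter(samples, coeffs):
    """Direct-form FIR: y[k] = sum_i coeffs[i] * samples[k - i]."""
    out = []
    for k in range(len(samples)):
        acc = 0.0
        for i, c in enumerate(coeffs):
            if k - i >= 0:
                acc += c * samples[k - i]
        out.append(acc)
    return out


# Each output sample costs one multiply-accumulate per tap, so the
# N=6 filter performs 6/10 of the arithmetic of the N=10 filter.
taps_quiet = choose_num_taps(noise_power=0.001)
taps_noisy = choose_num_taps(noise_power=0.1)
```

The switching activity of the datapath scales roughly with the number of multiply-accumulates, which is why shortening the filter when conditions allow saves power.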
13
2) Software : similar to the case when reducing code size &
improving speed of execution
– instruction selection and ordering to minimize bus switching → compiler's job
– minimize memory space & access (reduce cache miss)
– codesign for low power
– slow down clock
– halt clock
– lower VDD
– Shut down
14
3) Architectures
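• Parallel architecture
[Figure: a single unit running at VDD, f is replaced by n parallel units, each running at VDD/n, f/n, with outputs combined by a MUX]
– Switching power
Single unit: $P_1 = C V_{DD}^2 f$
n parallel units plus MUX: $P_2 = (1+\alpha)\, n\, C \left(\frac{V_{DD}}{n}\right)^2 \frac{f}{n} = (1+\alpha)\frac{P_1}{n^2} \approx \frac{P_1}{n^2}$
(α: overhead, e.g., the MUX)
For the same speed:
$t \sim \frac{C V_{DD}}{i} \sim \frac{C V_{DD}}{(V_{DD}-V_T)^2} \sim \frac{1}{V_{DD}} \;\Rightarrow\; f \propto V_{DD}$
Sacrifice area for low power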
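The parallel-architecture estimate can be checked numerically. The sketch below assumes the idealized scaling used on the slide (f proportional to VDD, threshold voltage neglected); the capacitance, voltage, frequency, and 10% overhead values are illustrative assumptions, and `overhead` is a name chosen here for the overhead factor.

```python
def switching_power(c, vdd, f):
    """Dynamic switching power P = C * VDD^2 * f."""
    return c * vdd ** 2 * f


def parallel_power(c, vdd, f, n, overhead=0.0):
    """Power of n parallel units, each running at VDD/n and f/n.

    `overhead` is the fractional extra power of the MUX and routing
    (an assumed parameter).
    """
    return (1.0 + overhead) * n * switching_power(c, vdd / n, f / n)


# Illustrative numbers only: 1 pF switched capacitance, 3.3 V, 100 MHz.
p1 = switching_power(c=1e-12, vdd=3.3, f=100e6)
p2 = parallel_power(c=1e-12, vdd=3.3, f=100e6, n=2, overhead=0.1)
ratio = p2 / p1  # equals (1 + overhead) / n**2
```

With n = 2 and 10% overhead, the ratio is 1.1/4 = 0.275: throughput is preserved because two units at f/2 together deliver f, at the cost of roughly doubled area.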
15
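• Pipelining
[Figure: logic split into n stages separated by latches; the pipelined circuit runs at VDD/n, f]
$P_1 = C V_{DD}^2 f$
$P_2 = (1+\beta)\, C \left(\frac{V_{DD}}{n}\right)^2 f \approx \frac{P_1}{n^2}$
i) If VDD is scaled to 1/n, speed also becomes 1/n.
ii) With n pipeline stages, the logic complexity of each stage becomes 1/n, so speed (throughput) becomes n times.
iii) Overall, speed stays the same. (β is the pipelining overhead, e.g., mismatch among stage delays …)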
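As a rough worked example of the pipelining estimate, assume n = 4 stages and an overhead factor of 0.2. Both values are illustrative assumptions, and β is the name used here for the overhead symbol.

```latex
\frac{P_2}{P_1}
  = \frac{(1+\beta)\, C \left(V_{DD}/n\right)^2 f}{C\, V_{DD}^2\, f}
  = \frac{1+\beta}{n^2}
  = \frac{1.2}{16}
  = 0.075
```

Unlike the parallel case, the capacitance is not replicated n times; the cost is the added latches and the stage-delay mismatch captured by β.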
16
• Minimizing switching power consumption on buses: $P = \alpha C V^2 f$
• Effective capacitance ($\alpha C$) → activity-driven bus placement
[Figure: physical capacitance vs. distance from core to pads, with buses ordered by decreasing activity α]
– SRAM data, address bus: mostly READ operations, mostly sequential access → α: small
– Display data → α: large
– α sets the priority for placing buses (route, layer)
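One way to read activity-driven placement is as a matching problem: give the lowest-capacitance routes to the busiest buses so that the sum of α·C is minimized. The sketch below is a greedy pairing under that assumption; the bus names, activity factors, and route capacitances are hypothetical values, not taken from the slide.

```python
def assign_buses(activities, route_caps):
    """Pair the most active bus with the lowest-capacitance route, and so on."""
    buses = sorted(activities, key=activities.get, reverse=True)
    routes = sorted(route_caps, key=route_caps.get)
    return dict(zip(buses, routes))


def effective_capacitance(assignment, activities, route_caps):
    """Sum of activity * physical capacitance over all placed buses."""
    return sum(activities[b] * route_caps[r] for b, r in assignment.items())


# Hypothetical activity factors and route capacitances (pF).
activities = {"display_data": 0.5, "sram_data": 0.3, "sram_addr": 0.1}
route_caps = {"short_route": 1.0, "mid_route": 2.0, "long_route": 4.0}

placement = assign_buses(activities, route_caps)
total = effective_capacitance(placement, activities, route_caps)
```

Pairing the largest activity with the smallest capacitance is optimal for this simple cost by the rearrangement inequality; a real router would also have to respect timing and congestion constraints.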
17
• V (voltage swing) reduction
[Figure: low-V and high-V regions connected through I/F circuits; labels: large C, small C]
- low-swing bus
e.g., GTL (Xerox), CTT (Mosaid), JTL (Jedec), LVTTL, LVCTT …
- charge-recycling bus: $\Delta V \approx 0.1\, V_{DD}$
18
• Bus-invert encoding:
- send inverted signals when the majority of bits are switching,
and de-invert at the receiver.
[Figure: source data → EX-OR → DATA bus → EX-OR → received data; polarity decision logic drives the polarity signal sent alongside the bus]
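The encoding rule can be sketched in a few lines. This is an illustrative software model, not the circuit from the slide: the 8-bit width, the assumption that the bus starts at all zeros, and the function names are choices made here.

```python
def popcount(x):
    """Number of 1 bits in x."""
    return bin(x).count("1")


def bus_invert_encode(words, width=8):
    """Return (bus_value, polarity) pairs; polarity=1 means 'inverted'."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial state of the bus lines
    encoded = []
    for w in words:
        w &= mask
        if popcount(w ^ prev) > width // 2:
            sent, polarity = (~w) & mask, 1  # majority would toggle: invert
        else:
            sent, polarity = w, 0
        encoded.append((sent, polarity))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width=8):
    """Undo the inversion using the polarity signal."""
    mask = (1 << width) - 1
    return [(~s) & mask if p else s for s, p in encoded]


def data_line_toggles(values):
    """Bit toggles on the data lines, starting from an all-zero bus."""
    return sum(popcount(a ^ b) for a, b in zip([0] + values, values))


words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words)
decoded = bus_invert_decode(encoded)
```

For this sequence the data lines toggle 4 times with encoding versus 20 without; the extra polarity line adds its own transitions, which this model does not count.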
19
• F (frequency) lowering:
– Distribute an f/N master clock, and multiply it by N using a local PLL before distribution within each block.
[Figure: f/N master clock feeding several PLLs]
20
4) Data representation
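• Gray code vs. binary 2's (or 1's) complement
# of toggles ratio (counting through all $2^n$ codes): $\frac{B_n}{G_n} = \frac{2(2^n - 1)}{2^n} \approx 2$
• Signed magnitude vs. 2's complement
– Signed magnitude: only the sign bit changes at a zero crossing.
– 2's complement: full switching at a zero crossing.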
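The toggle ratio for counters can be verified by brute force. The sketch below counts bit flips over one full cyclic count (including the wrap-around back to zero) for an n-bit binary counter and for the reflected Gray code; the function names are chosen here for illustration.

```python
def to_gray(b):
    """Reflected binary Gray code of b."""
    return b ^ (b >> 1)


def cycle_toggles(codes):
    """Total bit flips over one full cycle, including the wrap-around step."""
    nxt = codes[1:] + codes[:1]
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, nxt))


def toggle_counts(n):
    """Return (binary toggles, Gray toggles) for an n-bit counter."""
    binary = list(range(2 ** n))
    gray = [to_gray(b) for b in binary]
    return cycle_toggles(binary), cycle_toggles(gray)


b4, g4 = toggle_counts(4)
```

For n = 4 this gives 30 toggles for binary and 16 for Gray, matching 2(2^n - 1) and 2^n.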
21
5) Logic
• Signal gating: masks unwanted switching activity so that it does
not propagate forward and cause unnecessary power dissipation.
• The additional power due to control signal generation
should be small. The control signal needs to switch
less often than the signal it gates.
22
• Logic encoding ; binary vs. Gray code for
counters
23
24
• State encoding
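[Figure: two state-transition diagrams of the same three-state machine, with transition probabilities 0.3, 0.4, 0.1, 0.1, 0.1. (M1) uses state codes 11, 01, 00; (M2) uses 01, 00, 11.]
E(M) = expected # of bit switchings per transition
E(M1) = 2(0.3+0.4) + 1(0.1+0.1) = 1.6
E(M2) = 1(0.3+0.4+0.1) + 2(0.1) = 1.0
- assigning don't cares to either 1 or 0 for low switching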
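The expectation is the probability-weighted Hamming distance between the codes of the source and destination states. The sketch below reproduces the two values; because the original diagram is not recoverable, the state names A, B, C and the exact edge list are assumptions chosen to be consistent with the probabilities and the two sums given.

```python
def expected_switching(transitions, encoding):
    """Expected number of state-bit flips per transition."""
    return sum(
        p * bin(encoding[src] ^ encoding[dst]).count("1")
        for src, dst, p in transitions
    )


# Assumed topology: (source, destination, probability); probabilities sum to 1.
transitions = [
    ("A", "B", 0.3),
    ("B", "A", 0.4),
    ("A", "C", 0.1),
    ("C", "B", 0.1),
    ("C", "C", 0.1),  # self-loop: no bits switch
]

m1 = {"A": 0b11, "B": 0b00, "C": 0b01}  # frequent A<->B edge flips 2 bits
m2 = {"A": 0b01, "B": 0b00, "C": 0b11}  # frequent A<->B edge flips 1 bit

e_m1 = expected_switching(transitions, m1)
e_m2 = expected_switching(transitions, m2)
```

The design choice is visible in the encodings: M2 gives the two states joined by the most probable transitions codes that differ in a single bit.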
25
• Precomputation logic ;
– saves power by masking uninfluential input signals into the
combinational logic with g(x), precomputation logic.
– I.e., for the output f(x), there may be some conditions under
which f(x) is independent of some set of input signals
latched in R2, which can then be disabled according to g(x).
26
ex.) Binary comparator : f(A,B) = 1 if A>B
g(x) = $A_n \oplus B_n$ (XOR of the most significant bits)
27
• Systematic method to derive a pre-computation function, g(x),
given f(x), R1 and R2
• Let f(p1, … pm, x1, …, xn) be Boolean function where p1,…, pm
are pre-computed inputs corresponding to R1, and x1,…,xn are
gated inputs corresponding to R2.
• Let $f_{x_i}$ ($f_{\bar{x}_i}$) be the Boolean function obtained by setting $x_i = 1$ ($x_i = 0$)
in f.
• Define $U_{x_i} f$ (= universal quantification of f w.r.t. $x_i$) $= f_{x_i} \cdot f_{\bar{x}_i}$
• Then $U_{x_i} f = 1$ implies f = 1 regardless of the value of $x_i$,
because $U_{x_i} f = 1$ means $f_{x_i} = f_{\bar{x}_i} = 1$ in Shannon's
decomposition of f w.r.t. $x_i$:
$f = x_i \cdot f_{x_i} + \bar{x}_i \cdot f_{\bar{x}_i}$
28
• Let $g_1 = U_{x_1} U_{x_2} \dots U_{x_n} f$
Then $g_1 = 1$ implies that f = 1 regardless of the values of x1 … xn.
I.e., $g_1 = 1$ is one of the conditions where f is independent of the
input values of x1 … xn.
• Similarly, $g_0 = U_{x_1} U_{x_2} \dots U_{x_n} \bar{f}$
$g_0 = 1$ implies that f = 0 regardless of x1, …, xn.
• Then $g = g_1 + g_0$ is the pre-computation function.
I.e., if g = 1, we can disable the loading of x1, …, xn into R2
because output f is independent of the gated inputs.
• g, computed this way, may not be the unique pre-computation function, but it contains the largest number of 1's
in its truth table among all pre-computation functions.
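The derivation of g can be checked by brute-force enumeration on a small case. The sketch below assumes a 2-bit comparator f(A,B) = 1 if A > B, with the most significant bits (A1, B1) as the pre-computed inputs in R1 and the least significant bits (A0, B0) as the gated inputs in R2; the function names are chosen here for illustration.

```python
from itertools import product


def precomputation_function(f, num_p, num_x):
    """Return {p: g(p)} with g = g1 + g0.

    g1(p) = 1 when f(p, x) = 1 for every x (universal quantification of f);
    g0(p) = 1 when f(p, x) = 0 for every x (universal quantification of not f).
    """
    g = {}
    for p in product((0, 1), repeat=num_p):
        outputs = [f(p, x) for x in product((0, 1), repeat=num_x)]
        g1 = all(v == 1 for v in outputs)
        g0 = all(v == 0 for v in outputs)
        g[p] = int(g1 or g0)
    return g


def comparator(p, x):
    """2-bit comparator: 1 if A > B, with p = (A1, B1) and x = (A0, B0)."""
    a1, b1 = p
    a0, b0 = x
    return int(2 * a1 + a0 > 2 * b1 + b0)


g = precomputation_function(comparator, num_p=2, num_x=2)
```

The resulting g is 1 exactly when A1 and B1 differ, i.e., g = A1 XOR B1: when the top bits differ the comparison is already decided, so loading of the lower bits can be disabled.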
29
• Examples 1)
Precomputation architecture based on Shannon’s
decomposition;
$f(x_1, \dots, x_n) = x_i \cdot f_{x_i} + \bar{x}_i \cdot f_{\bar{x}_i}$
30
• Ex 2)
Latch-based pre-computation architecture:
31
6) Low Power Circuits
• Use static rather than dynamic logic
to avoid unnecessary precharge
• Low static power
– self reverse bias for reducing subthreshold current
[Figure: self-reverse-biased word line drivers. Labels: VDD, S, I1, I2, Pc(Wc), X, Pdi; S=0 (active), S=1 (standby). Inset: ln ID vs. VGS with the active and standby operating points.]
32
• Compromise between dynamic and leakage power dissipation
33
• Multi-VT (threshold):
speed-critical part: low VT
power-critical part: high VT
- by back-gate bias: routing is difficult
- by additional implant
• Adiabatic computing:
Power dissipation is due to the voltage drop across R → reduce it
by gradual rise & fall of inputs
→ multi-step clock waveform
[Figure: RC charging circuit (R, C)]
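The benefit of a multi-step waveform can be seen from a short energy argument. The derivation below assumes an ideal capacitor C charged through R to a final voltage V in N equal steps, with each step allowed to settle fully; N and V are symbols introduced here.

```latex
% One abrupt step of height V dissipates in R:
E_{1} = \tfrac{1}{2} C V^{2}
% N equal steps of height V/N, each fully settled:
E_{N} = N \cdot \tfrac{1}{2} C \left(\frac{V}{N}\right)^{2} = \frac{C V^{2}}{2N} = \frac{E_{1}}{N}
```

The dissipation falls in proportion to the number of steps, at the cost of a slower transition and a more complex clock generator.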
34
• Delay vs. power supply voltage(Td vs. VDD)
$T_d \propto V_{DD}^{-1}$
35
• Power delay product(Energy) vs. delay for various circuits
36
7) Power reduction in clock network
• Why bother with clock network?
– In synchronous circuit, clock is generally the highest
frequency signal.
– And, clock typically drives a large load as it has to reach
many sequential elements.
– In the Alpha chip, the clock network accounts for
40% of total power consumption.
• Clock gating:
– Most popular method for power reduction of clock signals
– effective when some functional module(ALU, memory or FPU,
etc) is not required for some extended period.
– Gated clock suffers additional gate
delay due to gating function.
37
• Reduced clock swing:
– Conventional vs. half-swing clocking
38
– Charge sharing circuit for half-swing clock
When CLK is low: $V_H = \frac{C_1 + C_A}{C_1 + C_4 + C_A + C_B}\, V_{dd}$
When CLK is high: $V_H = \frac{C_2 + C_A}{C_2 + C_3 + C_A + C_B}\, V_{dd}$
⇒ $V_H \approx 0.5\, V_{dd}$ if $C_A = C_B \gg C_1, C_2, C_3, C_4$
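A quick numeric check of the charge-sharing expressions shows how close VH stays to half of Vdd. The capacitor values below are illustrative assumptions (10 pF tanks, sub-pF clock loads), and the parameter names `c_top` and `c_bottom` are introduced here: they stand for C1 and C4 when CLK is low, and C2 and C3 when CLK is high.

```python
def half_swing_vh(vdd, ca, cb, c_top, c_bottom):
    """Mid-rail voltage VH set by capacitive charge sharing."""
    return (c_top + ca) / (c_top + c_bottom + ca + cb) * vdd


VDD = 3.3
CA = CB = 10e-12                              # large tank capacitors (assumed)
C1, C2, C3, C4 = 0.1e-12, 0.2e-12, 0.1e-12, 0.2e-12  # small loads (assumed)

vh_clk_low = half_swing_vh(VDD, CA, CB, c_top=C1, c_bottom=C4)
vh_clk_high = half_swing_vh(VDD, CA, CB, c_top=C2, c_bottom=C3)
```

With these values both results lie within about 1% of Vdd/2, which illustrates why CA and CB must be much larger than the clock load capacitances.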
39
– Simple charge sharing circuit
40
• Tri-state keeper circuit:
– Floating node with its potential somewhere between
GND and VDD is noise-sensitive and can cause DC
power dissipation in the fanin gate
– Floating bus suppressor circuit
41
• Blocking gate
– Gates connected to a floating node (floating because its
driver is powered down) can experience a large short-circuit
current.
• Use a blocking NAND gate as below:
42
• Reduction of switching activity:
– guarded evaluation:
• adding latches or blocking gates before C/L if its outputs
are not used.
• Ex).
43
– Careful bus multiplexing for positively correlated data streams
– Aggressive bus multiplexing for negatively correlated data
streams
44
8) process :
• VDD reduction → reduce VT
• Reduce standby current → VT not too small
(these two requirements conflict)
• Reduce leakage current → junction profile, steep subthreshold slope (small subthreshold
swing)
• Reduce switching power → reduce parasitic C
(same goal as for high speed)
retrograded channel
trench
sidewall spacer for S/D implant
45
9) Library :
• Small size, various sizes for tr. sizing for delay balancing
→ to reduce glitches
• Long interconnects on a low-C layer
→ to reduce buffer size
[Figure labels: large C, small C]
10) Material
low-ε inter-layer dielectric
low-ρ material for interconnect → copper