Lower Power Synthesis - VADA


Lower Power
Logic/Circuit/Layout Design
1998. 6.7
Prof. Jun-Dong Cho, Sungkyunkwan University
http://vlsicad.skku.ac.kr
SungKyunKwan Univ.
VADA Lab.
Transition Probability
• Transition probability: the probability of a transition at the output of a gate, given a change at the inputs.
• For temporally uncorrelated data, use signal probabilities.
• Example: F = X'Y + XY'
– Signal probability of F: Pf = Px(1-Py) + (1-Px)Py
– Transition probability of F = 2Pf(1-Pf)
– Assumes independence of the inputs
• Use BDDs to compute these.
• Reference: Najm'91
• For temporally correlated data this is not true, e.g., when every 1 on an input is immediately followed by a 0.
• In that case the switching probabilities must be computed taking the temporal correlations into account.
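The XOR example above can be checked numerically. The following is a minimal sketch, assuming independent and temporally uncorrelated inputs as the slide does; the function names are illustrative and not part of the original material.

```python
def xor_signal_prob(px, py):
    """Signal probability of F = X'Y + XY' for independent inputs."""
    return px * (1 - py) + (1 - px) * py


def transition_prob(pf):
    """Transition probability 2*Pf*(1-Pf) for temporally uncorrelated data."""
    return 2 * pf * (1 - pf)


pf = xor_signal_prob(0.5, 0.5)      # signal probability of F
tf = transition_prob(pf)            # switching activity of F
skewed = transition_prob(xor_signal_prob(0.0, 0.25))
```

Equiprobable inputs give the largest switching activity; skewing the input probabilities away from 0.5 lowers it.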
Technology Mapping
• Implementing a Boolean network in terms of gates from a given library
• Popular technique: tree-based mapping
• Library gates and circuits are decomposed into canonical patterns
• Pattern matching and dynamic programming find the best cover
• NP-complete for general DAG circuits
• Ref: Keutzer'87, Rudell'89
• Idea: high transition probability points are hidden within gates
Low Power Cell Mapping
• Example of a high switching activity node
• Internal mapping in a complex gate
[Figure: circuit with inputs A, B, C, D, internal node Q, and output Y, shown as separate gates and as a single complex gate]
Signal Probability vs. Power
Power: P(x) ∝ p(x)(1 - p(x))
[Figure: power versus signal probability p(x) over the range 0.0 to 1.0, with the regions p(x) < 0.5 and p(x) > 0.5 on either side of 0.5]
Spatial Correlation
[Figure: two circuits computing z from the intermediate signals x and y, with P(b) = P(c) = P(d) = 0.5 at the primary inputs and P(x) = P(y) = 0.25; the output probability is P(z) = 0.4375 in one circuit and P(z) = 0.375 in the other]
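The two output probabilities on this slide can be reproduced by exhaustive enumeration. This is only a sketch under an assumed gate structure that the transcript does not state: x and y are taken to be 2-input ANDs feeding an OR, once with four distinct inputs and once with input b shared by both ANDs. That assumption yields exactly 0.4375 and 0.375.

```python
from itertools import product


def output_probability(func, n_inputs):
    """Exact P(func = 1) when every input is 1 with probability 0.5."""
    ones = sum(func(*bits) for bits in product((0, 1), repeat=n_inputs))
    return ones / 2 ** n_inputs


# Hypothetical structures (assumed, not given in the source):
def independent(a, b, c, d):
    return (a & b) | (c & d)      # x = a.b, y = c.d share no input


def shared(a, b, c):
    return (a & b) | (b & c)      # x = a.b, y = b.c both depend on b


p_independent = output_probability(independent, 4)
p_shared = output_probability(shared, 3)
```

Treating x and y as independent when they share an input would overestimate P(z), which is why spatial correlation has to be tracked during probability propagation.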
Low Activity XOR Function
GLITCH (Spurious transitions)
• 15-20% of the total
power is due to
glitching.
Glitches
Logic Transformation
Logic Transformation
• Use a signal with low switching activity to reduce the activity on a highly active signal.
• This is done by adding a redundant connection from the gate with low activity (source gate) to the gate with high switching activity (target gate).
• Signals a, b, and g1 have very high switching activity, and most of the time their value is zero.
• Suppose c and g1 are selected as the source and target of a new connection. The new connection is undetectable (redundant), hence the function of the new circuit remains the same.
• Signal c has long runs of zeros, and zero is the controlling value of the AND gate g1, so most of the switching activity at the inputs of g1 will not be seen at the output; the switching activity of gate g1 is reduced.
• Adding a redundant connection to a circuit may make some irredundant connections redundant.
• By adding the new connection, the connections from c to g3 become redundant.
Logic Transformation
High-Performance Power Distribution
• (S: switching probability; C: capacitance)
• Start with all logic at the lowest power level; then successive iterations of delay calculation, identification of the failing blocks, and powering up are done until either all of the nets pass their delay criteria or the maximum power level is reached.
• Voltage drops in ground and supply wires use up a more serious fraction of the total noise margin.
Hazard Generation in Logic Circuits
• Static hazard: a transient pulse of width w (= the delay of the inverter).
• Dynamic hazard: the transient consists of three edges, two rising and one falling, with w of two units.
• Each input can have several arriving paths.
GATED-CLOCK D-FLIP-FLOP
• Flip-flops present a large internal capacitance on the internal clock node.
• If the DFF output does not switch, the DFF does not have to be clocked.
Frequency Reduction
◈ Power saving
– Reduces capacitance on the clock network
– Reduces internal power in the affected registers
– Reduces the need for muxes (data recirculation)
◈ Opportunity
– Large opportunity for power reduction, dependent on:
• number of registers gated
• percentage of time the clock is enabled
◈ Cost
– Testability
– Complicates clock tree synthesis
– Complicates clock skew balancing
Frequency Reduction
Clock Gating Example - When D is not equal to Q
[Figure, Before Clock Gating: the FSM (inputs clk, reset) drives load_en into the 32-bit register data_reg (data_in to data_out), which is clocked directly by clk]
[Figure, After Clock Gating: load_en passes through a latch to give load_en_latched, which is combined with clk to form clk_en; clk_en clocks the 32-bit register data_reg]
Frequency Reduction
◈ Clock Gating Example - Before Code
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity nongate is
  port(clk, rst : in  std_logic;
       data_in  : in  std_logic_vector(31 downto 0);
       data_out : out std_logic_vector(31 downto 0));
end nongate;

architecture behave of nongate is
  signal load_en  : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin
  -- FSM: modulo-10 counter with synchronous, active-low reset
  FSM : process
  begin
    wait until clk'event and clk='1';
    if rst='0' then
      count <= 0;
    elsif count=9 then
      count <= 0;
    else
      count <= count+1;
    end if;
  end process FSM;

  -- load_en is asserted for one cycle out of every ten
  enable_logic : process(count)
  begin
    if (count=9) then
      load_en <= '1';
    else
      load_en <= '0';
    end if;
  end process enable_logic;

  -- the register is clocked every cycle; load_en selects load or hold
  datapath : process
  begin
    wait until clk'event and clk='1';
    if load_en='1' then
      data_reg <= data_in;
    end if;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_nongate of nongate is
  for behave
  end for;
end cfg_nongate;
Frequency Reduction
◈ Clock Gating Example - After Code
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity gate is
  port(clk, rst : in  std_logic;
       data_in  : in  std_logic_vector(31 downto 0);
       data_out : out std_logic_vector(31 downto 0));
end gate;

architecture behave of gate is
  signal load_en, load_en_latched, clk_en : std_logic;
  signal data_reg : std_logic_vector(31 downto 0);
  signal count    : integer range 0 to 15;
begin
  -- FSM: modulo-10 counter with synchronous, active-low reset
  FSM : process
  begin
    wait until clk'event and clk='1';
    if rst='0' then
      count <= 0;
    elsif count=9 then
      count <= 0;
    else
      count <= count+1;
    end if;
  end process FSM;

  enable_logic : process(count)
  begin
    if (count=9) then
      load_en <= '1';
    else
      load_en <= '0';
    end if;
  end process enable_logic;

  -- latch that is transparent while clk is low, so the enable cannot
  -- change (and glitch clk_en) while clk is high
  deglitch : process(clk, load_en)
  begin
    if (clk='0') then
      load_en_latched <= load_en;
    end if;
  end process deglitch;

  clk_en <= clk and load_en_latched;

  -- the register sees a clock edge only when a load is requested
  datapath : process
  begin
    wait until clk_en'event and clk_en='1';
    data_reg <= data_in;
  end process datapath;

  data_out <= data_reg;
end behave;

configuration cfg_gate of gate is
  for behave
  end for;
end cfg_gate;
Frequency Reduction
◈ Clock Gating Example - Report
Frequency Reduction
◈ 4-bit Synchronous & Ripple counter - code
 4-bit Synchronous Counter
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity BINARY is
  Port ( clk   : In std_logic;
         reset : In std_logic;
         count : BUFFER UNSIGNED (3 downto 0));
end BINARY;

architecture BEHAVIORAL of BINARY is
begin
  -- all four bits are clocked by clk; reset is asynchronous, active low
  process(reset, clk)
  begin
    if (reset = '0') then
      count <= "0000";
    elsif (clk'event and clk = '1') then
      if (count = UNSIGNED'("1111")) then
        count <= "0000";
      else
        count <= count + UNSIGNED'("1");
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_BINARY_BLOCK_BEHAVIORAL of BINARY is
  for BEHAVIORAL
  end for;
end CFG_BINARY_BLOCK_BEHAVIORAL;
Frequency Reduction
 4-bit Ripple Counter
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity RIPPLE is
  Port ( clk   : In std_logic;
         reset : In std_logic;
         count : BUFFER UNSIGNED (3 downto 0));
end RIPPLE;

architecture BEHAVIORAL of RIPPLE is
  signal count0, count1, count2 : std_logic;
begin
  -- each lower bit serves as the clock of the next higher bit
  process(count)
  begin
    count0 <= count(0);
    count1 <= count(1);
    count2 <= count(2);
  end process;

  -- bit 0 toggles on every rising edge of clk
  process(reset, clk)
  begin
    if (reset = '0') then
      count(0) <= '0';
    elsif (clk'event and clk = '1') then
      if (count(0) = '1') then
        count(0) <= '0';
      else
        count(0) <= '1';
      end if;
    end if;
  end process;

  -- bit 1 toggles on every rising edge of bit 0
  process(reset, count0)
  begin
    if (reset = '0') then
      count(1) <= '0';
    elsif (count0'event and count0 = '1') then
      if (count(1) = '1') then
        count(1) <= '0';
      else
        count(1) <= '1';
      end if;
    end if;
  end process;

  -- bit 2 toggles on every rising edge of bit 1
  process(reset, count1)
  begin
    if (reset = '0') then
      count(2) <= '0';
    elsif (count1'event and count1 = '1') then
      if (count(2) = '1') then
        count(2) <= '0';
      else
        count(2) <= '1';
      end if;
    end if;
  end process;

  -- bit 3 toggles on every rising edge of bit 2
  process(reset, count2)
  begin
    if (reset = '0') then
      count(3) <= '0';
    elsif (count2'event and count2 = '1') then
      if (count(3) = '1') then
        count(3) <= '0';
      else
        count(3) <= '1';
      end if;
    end if;
  end process;
end BEHAVIORAL;

configuration CFG_RIPPLE_BLOCK_BEHAVIORAL of RIPPLE is
  for BEHAVIORAL
  end for;
end CFG_RIPPLE_BLOCK_BEHAVIORAL;
Frequency Reduction
◈ 4-bit Synchronous & Ripple counter - Report
Bus-Invert Coding for Low Power I/O
An eight-bit bus on which all eight lines toggle at the same time and which has a high peak (worst-case) power dissipation.
• There are 16 transitions over 16 clock cycles (an average of 1 transition per clock cycle).
Peak Power Dissipation
An eight-bit bus on which the eight lines toggle at different moments and which has a low peak power dissipation. There are the same 16 transitions over 16 clock cycles and thus the same average power dissipation.
Bus-Invert - Coding for low power
• The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value. When invert = 1 the bus value is the inverted data value.
• The peak power dissipation can then be decreased by half by coding the I/O as follows:
1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value.
2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value).
3. Otherwise, let invert = 0 (and let the next bus value equal the next data value).
4. At the receiver side the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored encoded as it is (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
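The four coding steps can be sketched in software as follows. This is an illustrative model only, not the hardware encoder: the function names, the all-zero initial bus state, and the sample data are assumptions made for the example.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def bus_invert_encode(words, n=8):
    """Encode n-bit words; returns a list of (bus_value, invert) pairs."""
    mask = (1 << n) - 1
    bus, inv = 0, 0                       # assumed initial bus state
    coded = []
    for data in words:
        data &= mask
        # Step 1: Hamming distance, counting the present invert line
        distance = popcount(bus ^ data) + inv
        if distance > n / 2:              # Step 2: send inverted data
            bus, inv = ~data & mask, 1
        else:                             # Step 3: send data unchanged
            bus, inv = data, 0
        coded.append((bus, inv))
    return coded


def bus_invert_decode(coded, n=8):
    """Step 4: the receiver conditionally inverts the bus value."""
    mask = (1 << n) - 1
    return [(~bus & mask) if inv else bus for bus, inv in coded]


def max_transitions(coded):
    """Largest number of lines (data + invert) toggling in one time slot."""
    worst, prev = 0, (0, 0)
    for bus, inv in coded:
        worst = max(worst, popcount(prev[0] ^ bus) + (prev[1] ^ inv))
        prev = (bus, inv)
    return worst


sample = [0x00, 0xFF, 0x00, 0xFF, 0x0F, 0xF1, 0x3C, 0xC3]
coded = bus_invert_encode(sample)
decoded = bus_invert_decode(coded)
peak = max_transitions(coded)
```

For an 8-bit bus the peak never exceeds 4 toggling lines per time slot, because whenever the distance exceeds n/2 the inverted word is at most n/2 lines away from the present bus state.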
Example
A typical eight-bit synchronous data bus. The transitions between two consecutive time slots are "clean". There are 64 transitions for a period of 16 time slots. This represents an average of 4 transitions per time slot, or 0.5 transitions per bus line per time slot.
Bus encoding
The same sequence of data coded using the Bus-Invert method. There are now only 53 transitions over a period of 16 time slots. This represents an average of 3.3 transitions per time slot, or 0.41 transitions per bus line per time slot.
The maximum number of transitions for any time slot is now 4.
Comparisons
Comparison of unencoded I/O and coded I/O with one or more invert lines. The comparison looks at the average and maximum number of transitions per time slot, per bus line per time slot, and the I/O power dissipation for different bus widths.
Remarks
• The increase in the delay of the data-path: by looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolute lower number of transitions. It is also relatively easy to pipeline the bus activity. The extra pipeline stage and the extra latency must then be considered.
• The increased number of I/O pins: as mentioned before, ground bounce is a big problem for simultaneous switching in high-speed designs. That is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side effect of decreasing the maximum ground bounce by approximately 50%. Thus circuits using the Bus-Invert method can use a lower number of Vdd and GND pins, and by using the method the total number of pins might even decrease.
• The Bus-Invert method decreases the total power dissipation although both the total number of transitions increases (counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions get redistributed very nonuniformly, more on the low-capacitance side and less on the high-capacitance side.
References
[1] H. B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Addison-Wesley, 1990.
[2] T. K. Callaway, E. E. Swartzlander, "Estimating the Power Consumption of CMOS Adders", 11th Symp. on Comp. Arithmetic, pp. 210-216, Windsor, Ontario, 1993.
[3] A. P. Chandrakasan, S. Sheng, R. W. Brodersen, "Low-Power CMOS Digital Design", IEEE Journal of Solid-State Circuits, pp. 473-484, April 1992.
[4] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "HYPER-LP: A System for Power Minimization Using Architectural Transformations", ICCAD-92, pp. 300-303, Nov. 1992, Santa Clara, CA.
[5] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "An Approach to Power Minimization Using Transformations", IEEE VLSI for Signal Processing Workshop, pp. , 1992, CA.
[6] S. Devadas, K. Keutzer, J. White, "Estimation of Power Dissipation in CMOS Combinational Circuits", IEEE Custom Integrated Circuits Conference, pp. 19.7.1-19.7.6, 1990.
[7] D. Dobberpuhl et al., "A 200-MHz 64-bit Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, pp. 1555-1567, Nov. 1992.
[8] R. J. Fletcher, "Integrated Circuit Having Outputs Configured for Reduced State Changes", U.S. Patent no. 4,667,337, May 1987.
[9] D. Gajski, N. Dutt, A. Wu, S. Lin, High-Level Synthesis, Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[10] J. S. Gardner, "Designing with the IDT SyncFIFO: the Architecture of the Future", 1992 Synchronous (Clocked) FIFO Design Guide, Integrated Device Technology AN-60, pp. 7-10, 1992, Santa Clara, CA.
[11] A. Ghosh, S. Devadas, K. Keutzer, J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits", Proceedings of the 29th DAC, pp. 253-259, June 1992, Anaheim, CA.
[12] J. L. Hennessy, D. A. Patterson, Computer Architecture - A Quantitative Approach, Morgan Kaufmann Publishers, Palo Alto, CA, 1990.
[13] S. Kodical, "Simultaneous Switching Noise", 1993 IDT High-Speed CMOS Logic Design Guide, Integrated Device Technology AN-47, pp. 41-47, 1993, Santa Clara, CA.
[14] F. Najm, "Transition Density, A Stochastic Measure of Activity in Digital Circuits", Proceedings of the 28th DAC, pp. 644-649, June 1991, Anaheim, CA.
[16] A. Park, R. Maeder, "Codes to Reduce Switching Transients Across VLSI I/O Pins", Computer Architecture News, pp. 17-21, Sept. 1992.
[17] Rambus - Architectural Overview, Rambus Inc., Mountain View, CA, 1993.
[18] A. Shen, A. Ghosh, S. Devadas, K. Keutzer, "On Average Power Dissipation and Random Pattern Testability", ICCAD-92, pp. 402-407, Nov. 1992, Santa Clara, CA.
[19] M. R. Stan, "Shift register generators for circular FIFOs", Electronic Engineering, pp. 26-27, February 1991, Morgan Grampian House, London, England.
[20] M. R. Stan, W. P. Burleson, "Limited-weight codes for low power I/O", International Workshop on Low Power Design, April 1994, Napa, CA.
[21] J. Tabor, Noise Reduction Using Low Weight and Constant Weight Coding Techniques, Master's Thesis, EECS Dept., MIT, May 1990.
[22] W.-C. Tan, T. H.-Y. Meng, "Low-power polygon renderer for computer graphics", Int. Conf. on A.S.A.P., pp. 200-213, 1993.
[23] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective, Addison-Wesley Publishing Company, 1988.
[24] R. Wilson, "Low power and paradox", Electronic Engineering Times, pp. 38, November 1, 1993.
[25] J. Ziv, A. Lempel, "A universal Algorithm for Sequential Data Compression", IEEE Trans. on Inf. Theory, vol. IT-23, pp. 337-343, 1977.
34
DesignPower Gate Level Power Model
◈ Switching Power
– Power dissipated when a load capacitance (gate + wire) is charged or discharged at the driver's output
– If the technology library contains the correct capacitance value of the cell and the capacitive_load_unit attribute is specified, no additional information is needed for switching power modeling
– Output pin capacitance need not be modeled if the switching power is incorporated into the internal power
$$P_{sw} = \frac{V^2}{2} \sum_{\forall\,\text{nets}} \left[ C_i \times TR_i \right]$$
DesignPower Gate Level Power Model
◈ Internal Power
– Power dissipated internal to a library cell
– Modeled using an energy lookup table indexed by input transition time and output load
– Library cells may contain one or more internal energy lookup tables
$$P_{int} = \sum_{\forall\,\text{cells}} \left[ E_{int,i}(\text{output load}, \text{input transition}) \times TR_i \right]$$
DesignPower Gate Level Power Model
◈ Leakage Power
– The leakage power model supports a single value for each library cell
– State-dependent leakage power is not supported
$$P_{leak} = \sum_{\forall\,\text{cells}} P_{leak,i}$$
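The three components of the gate-level model can be combined as in the sketch below. It is a simplified illustration, not the DesignPower implementation: the function names and the sample numbers are assumptions, and the internal energy per transition is passed in directly rather than looked up from a library table.

```python
import math


def switching_power(vdd, nets):
    """P_sw = V^2/2 * sum(C_i * TR_i); nets holds (capacitance_F, toggles_per_s)."""
    return 0.5 * vdd ** 2 * sum(c * tr for c, tr in nets)


def internal_power(cells):
    """P_int = sum(E_int_i * TR_i); cells holds (energy_J_per_toggle, toggles_per_s).

    The energy is assumed to have been read from the cell's lookup table
    for its actual output load and input transition time.
    """
    return sum(e * tr for e, tr in cells)


def leakage_power(cell_leakages):
    """P_leak = sum of the single leakage value of each cell (watts)."""
    return sum(cell_leakages)


# Hypothetical two-net, two-cell design at 2.0 V
p_sw = switching_power(2.0, [(1e-12, 1e6), (2e-12, 1e6)])
p_int = internal_power([(0.5e-12, 1e6), (0.25e-12, 2e6)])
p_leak = leakage_power([1e-9, 2e-9])
p_total = p_sw + p_int + p_leak
```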
Operand Isolation
[Figure, top: register banks and combinational logic feeding Data_out under control of an FSM enable (EN); the combinational logic is marked "Significant Power Dissipation"]
• Combinational logic dissipates significant power when its output is unused
[Figure, bottom: the same datapath with a latch, gated by EN, placed in front of the combinational logic]
• Inputs to the combinational logic are held stable when the output is unused
Operand Isolation Example - Diagram
[Figure, Before: 8-bit inputs a and b feed ADD (Data_Add), whose result is multiplied by the 8-bit input c in MUL (16-bit Data_Mul) and stored in DataReg (output do). The FSM (clk, rst) produces Load_En, which is latched to Load_En_Latched and combined with clk to form Clk_En for DataReg.]
[Figure, After: a latch gated by the enable is inserted between ADD and MUL, so that MUL sees Iso_Data_Add instead of Data_Add; the FSM, enable latch and Clk_En are unchanged.]
Operand Isolation Example - Before Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic is
  Port(
    a, b, c : in std_logic_vector(7 downto 0);
    do      : out std_logic_vector(15 downto 0);
    rst     : in std_logic;
    clk     : in std_logic
  );
End Logic;

Architecture Behave of Logic is
  Signal Count : integer;
  Signal Load_En : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En : std_logic;
  Signal Data_Add : std_logic_vector(7 downto 0);
  Signal Data_Mul : std_logic_vector(15 downto 0);
Begin
  Process(clk,rst)          -- Counter Logic in FSM
  Begin
    If(clk='1' and clk'event) then
      If(rst='0') then
        Count <= 0;
      Elsif(Count=9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

  Process(Count)            -- Enable Logic in FSM
  Begin
    If(Count=9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Process(clk,Load_En)      -- Latch(for Deglitch) Logic
  Begin
    If(clk='0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  -- the adder and multiplier switch on every input change,
  -- even though the result is stored only once in ten cycles
  Data_Add <= a + b;
  Data_Mul <= Data_Add * c;

  Process(Data_Mul,Clk_En)  -- Data Reg Logic
  Begin
    If(Clk_En='1' and Clk_En'event) then
      Do <= Data_Mul;
    End If;
  End Process;
End Behave;

Configuration CFG_Logic of Logic is
  for Behave
  End for;
End CFG_Logic;
Operand Isolation Example - After Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic1 is
  Port(
    a, b, c : in std_logic_vector(7 downto 0);
    do      : out std_logic_vector(15 downto 0);
    rst     : in std_logic;
    clk     : in std_logic
  );
End Logic1;

Architecture Behave of Logic1 is
  Signal Count : integer;
  Signal Load_En : std_logic;
  Signal Load_En_Latched : std_logic;
  Signal Clk_En : std_logic;
  Signal Data_Add : std_logic_vector(7 downto 0);
  Signal Data_Mul : std_logic_vector(15 downto 0);
  Signal Iso_Data_Add : std_logic_vector(7 downto 0);
Begin
  Process(clk,rst)          -- Counter Logic in FSM
  Begin
    If(clk='1' and clk'event) then
      If(rst='0') then
        Count <= 0;
      Elsif(Count=9) then
        Count <= 0;
      Else
        Count <= Count + 1;
      End If;
    End If;
  End Process;

  Process(Count)            -- Enable Logic in FSM
  Begin
    If(Count=9) then
      Load_En <= '1';
    Else
      Load_En <= '0';
    End If;
  End Process;

  Data_Add <= a + b;

  -- Latch for Operand Isolation: the multiplier input is updated
  -- only when the enable rises, so MUL stays quiet otherwise
  Process(Load_En_Latched,Data_Add)
  Begin
    If(Load_En_Latched='1' and Load_En_Latched'event) then
      Iso_Data_Add <= Data_Add;
    End If;
  End Process;

  Data_Mul <= Iso_Data_Add * c;

  Process(clk,Load_En)      -- Latch(for Deglitch) Logic
  Begin
    If(clk='0') then
      Load_En_Latched <= Load_En;
    End If;
  End Process;

  Clk_En <= clk and Load_En_Latched;

  Process(Data_Mul,Clk_En)  -- Data Reg Logic
  Begin
    If(Clk_En='1' and Clk_En'event) then
      Do <= Data_Mul;
    End If;
  End Process;
End Behave;
Operand Isolation Example - Report
[Figure: power reports for the Before Code and the After Code]
Precomputation
• Power saving
– Reduces power dissipation of combinational logic
– Reduces internal power to precomputed registers
• Opportunity
– Can be significant, dependent on:
• percentage of time latch precomputation is successful
• Cost
– Increased area
– Impact on circuit timing
– Increased design complexity
• number of bits to precompute
– Testability
• may generate redundant logic
Precomputation
[Figure, top: n-bit and p-bit register banks feed the logic that produces Data_out. Entire function is computed.]
[Figure, bottom: the inputs are split into m bits and n-m bits held in separate register banks; a 1-bit enable EN computed from the m bits controls the register bank of the remaining n-m bits. Smaller function is defined, Enable is precomputed.]
Precomputation
• Before Precomputation Diagram
[Figure: 8-bit inputs a and b are registered on CLK and compared (a>b); the 1-bit result is registered to give Data_out]
Precomputation
• After Precomputation Diagram
[Figure: the most significant bits a(7) and b(7) are registered on CLK as before; the lower bits a(6:0) and b(6:0) are registered through a latch-gated clock derived from a(7) and b(7); the comparator (a>b) output is registered to give Data_out]
Precomputation
• Before Precomputation - Report
Precomputation
• After Precomputation - Report
Low power circuit techniques
• Power modeling on circuit level. Node activity. Speed and supply voltage. Flip-flops and latches.
• Driving large loads. Clocking and clock distribution.
• Low swing circuit techniques (adiabatic, carry select adder, Manchester carry chain).
Precomputation Example - Before Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_UNSIGNED.ALL;  -- provides ">" on std_logic_vector

Entity before_precomputation is
  port ( a, b  : in std_logic_vector(7 downto 0);
         CLK   : in std_logic;
         D_out : out std_logic);
end before_precomputation;

Architecture Behav of before_precomputation is
  signal a_in, b_in : std_logic_vector(7 downto 0);
  signal comp : std_logic;
Begin
  -- input registers: all 8 bits of both operands are loaded every cycle
  process (CLK)
  Begin
    if (CLK = '1' and CLK'event) then
      a_in <= a;
      b_in <= b;
    end if;
  end process;

  -- full 8-bit comparison
  comp <= '1' when (a_in > b_in) else '0';

  -- output register
  process (CLK)
  Begin
    if (CLK'event and CLK='1') then
      D_out <= comp;
    end if;
  end process;
end Behav;
Precomputation Example - After Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_UNSIGNED.ALL;  -- provides ">" on std_logic_vector

Entity after_precomputation is
  port ( a, b  : in std_logic_vector(7 downto 0);
         CLK   : in std_logic;
         D_out : out std_logic);
end after_precomputation;

Architecture Behav of after_precomputation is
  signal a_in, b_in : std_logic_vector(7 downto 0);
  signal pcom, pcom_D : std_logic;
  signal CLK_en, comp : std_logic;
Begin
  -- Precomputed enable. The slide reads "pcom <= a xor b", which does not
  -- type-check; the lower bits are needed only when the MSBs are equal,
  -- so the enable is written here as the XNOR of the two MSBs.
  pcom <= not (a(7) xor b(7));

  -- latch the enable while CLK is low to keep the gated clock glitch-free
  process (CLK, pcom)
  Begin
    if (CLK='0') then
      pcom_D <= pcom;
    end if;
  end process;

  CLK_en <= pcom_D and CLK;

  -- the MSBs are always loaded
  process (CLK)
  Begin
    if (CLK='1' and CLK'event) then
      a_in(7) <= a(7);
      b_in(7) <= b(7);
    end if;
  end process;

  -- the lower 7 bits are loaded only on the gated clock
  process (CLK_en)
  Begin
    if (CLK_en='1' and CLK_en'event) then
      a_in(6 downto 0) <= a(6 downto 0);
      b_in(6 downto 0) <= b(6 downto 0);
    end if;
  end process;

  comp <= '1' when (a_in > b_in) else '0';

  process (CLK)
  Begin
    if (CLK='1' and CLK'event) then
      D_out <= comp;
    end if;
  end process;
end Behav;
Peak Power Reduction
• Peak power is related to EMI
• Reducing concurrent switching reduces peak power
– Adjust the delay within the speed of the system clock in the bus/port driver
– Consider the power consumption of the delay element
– While the total power consumption is maintained, EMI is improved by the peak power reduction
$$E = V_{dd} \int I_{total}\,dt$$
• Before Peak Power Reduction
[Figure: I_total versus t for an n-bit-wide bus whose bits switch at the same time; energy E1]
• After Peak Power Reduction
[Figure: I_total versus t for the same n-bit-wide bus with the bit transitions spread over n-1 delay steps; energy E2]
Factoring Example
Function: f = ab + bc + cd
• The function f is not on the critical path.
• The signals a, b, c and d all have the same bit width.
• Signal b is a high activity net.
• The two factorings below are equivalent in both timing and area.
• Net result: network toggling and power are reduced.
[Figure: two gate-level networks over the inputs a, b, c, d with output f, both labeled f = b(a+c) + cd]
Low Power Logic Gate Resynthesis on Mapped Circuit
Hyun-Sang Kim, Jun-Dong Cho
School of Electrical, Electronic and Computer Engineering
Sungkyunkwan University
Low Power Logic Synthesis
[Figure: synthesis flow. RTL Description → Logic Synthesis (Technology Independent Optimization → Logic Equation → Technology Mapping → Connection of Gates → Resynthesis on Mapped Circuit) → Gate Level Description, with Timing & Power Analysis Tools alongside the flow]
Technology Mapping
[Figure: three mappings (a), (b), (c) of a network whose nodes are marked h or l]
h : high switching activity node
l : low switching activity node
Tree Decomposition
[Figure: two tree decompositions of f built from AND gates and primary inputs, (a) and (b), with the critical path and the output marked; one decomposition is labeled Low Power]
Huffman Algorithm
[Figure: Huffman tree over the leaves x1 ... x5 with internal nodes y1, y2, y3; node weights shown are 2, 3, 4, 4, 5, 8, 10, 13 and 23 (root)]
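The Huffman construction repeatedly merges the two lightest nodes. The sketch below is a generic version using the standard `heapq` module; the leaf weights 2, 3, 4, 4 and 10 are an assumption read off the node weights in the figure, and in the tree-decomposition setting they would stand for the switching activities of the inputs.

```python
import heapq


def huffman_merge_weights(leaf_weights):
    """Return the weights of the internal nodes in the order they are created."""
    heap = list(leaf_weights)
    heapq.heapify(heap)
    internal = []
    while len(heap) > 1:
        first = heapq.heappop(heap)     # lightest node
        second = heapq.heappop(heap)    # second lightest node
        merged = first + second
        internal.append(merged)
        heapq.heappush(heap, merged)
    return internal


internal_nodes = huffman_merge_weights([2, 3, 4, 4, 10])
total_cost = sum(internal_nodes)        # sum of all internal node weights
```

With these assumed leaves the merges produce internal weights 5, 8, 13 and 23, matching the numbers visible in the figure. Minimizing the sum of internal weights corresponds to keeping high-activity signals close to the root.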
Depth-Constrained Decomposition
Algorithm
problem : minimize \sum_{i=1}^{m} p_t(x_i)
input : input signal probabilities (p1, p2, ..., pn), height (h), number of leaf nodes (n), fanin limit per gate (k)
output : k-ary tree topology
Begin
  sort (signal probabilities p1, p2, ..., pn);
  while (n != 0)
    if (h > log_k n)
      assign k nodes to level L(=h+1);
      h = h-1, n = n-(k-1);       /* upward */
    else if (h < log_k n)
      assign k nodes to the previous level L(=h+2);
      h = h, n = n-(k-1);         /* downward */
    else /* h = log_k n */
      assign the remaining nodes to level L(=h+1);
      /* complete: all remaining nodes go to level L(=h+1), forming a complete k-ary tree */
  for (bottom level L; L>1; L--)
    min_edge_weight_matching (nodes in level L);
End
Example
[Figure: level-by-level construction for the inputs a = 0.1, b = 0.2, c = 0.3, d = 0.4, e = 0.5, f = 0.6 with internal nodes x and y. Level L=0 corresponds to h=1, level L=1 to h=2, level L=2 to h=3, followed by level L=3. At the bottom level the pairing is (a, b), (c, d) before matching and (a, d), (b, c) after matching; e and f are unchanged.]
After Decomposition
[Figure: bar chart for k1 = 2 comparing SIS, SIS+OURS and the Improvement Ratio; vertical axis "Value, Ratio" (0 to 16); horizontal axis "Fanin, Height" with fanins 6, 10 and 20 and heights from h=3 to h=9]
After Tech. Mapping
[Figure: bar chart for k1 = 3, k2 = 3 comparing SIS+LEVEL MAP, SIS+OURS+LEVEL MAP and the Improvement Ratio; vertical axis "Power(mW), Ratio" (0 to 80); horizontal axis "Fanin, Height" with fanins 6, 10, 15 and 20 and heights from h=2 to h=8]
Buffer Chain
• Delay analysis of buffer chain
$$(W/L)_{k+1} = a\,(W/L)_k$$
$$C_k = a^k C_{in} + a^{k-1} C_p$$
$$T_d = \sum_{k=1}^{n} t_d(k) = \sum_{k=1}^{n} a\,t_0 = n\,a\,t_0$$
$$P_k = C_k V_{dd}^2 f = V_{dd}^2 f\, a^{k-1} (a\,C_{in} + C_p)$$
$$P_T = \sum_{k=1}^{n} P_k = V_{dd}^2 f\,(a\,C_{in} + C_p)\,\frac{a^n - 1}{a - 1}$$
$$C_L = a^n C_{in}, \qquad n = \frac{\ln(C_L / C_{in})}{\ln(a)}$$
$$T_d = a\,t_0\,\frac{\ln(C_L / C_{in})}{\ln(a)}$$
$$\frac{\partial T_d}{\partial a} = 0 \;\Rightarrow\; (a)_{optimum} = e \approx 2.72, \qquad (n)_{optimum} = \ln(C_L / C_{in})$$
• Delay analysis considering parasitic capacitance, Cp: a = 2 ~ 10 (typical: e)
$$Eff = \frac{P_n}{P_T} = \frac{a^{n-1}}{\dfrac{a^n - 1}{a - 1}} = \frac{a^{n-1}}{a^{n-1} + a^{n-2} + \cdots + a + 1}$$
Ck, Pk: total capacitance and power at the output of the stage-k buffer
PT: power consumption of the buffer chain
Pn: power consumption of the load capacitance CL
Eff: power efficiency Pn/PT
[Figure: chain of n buffers of size 1, a, ..., a^(i-2), a^(i-1), ..., a^(n-1); the input sees Cin, stage 1 drives a·Cin, stage (i-1) drives a^(i-1)·Cin, stage i drives a^i·Cin, and stage n drives CL = a^n·Cin]
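The buffer-chain relations can be explored numerically. This is a minimal sketch of the formulas on the slide (Td = n·a·t0 with n = ln(CL/Cin)/ln(a), and Eff = a^(n-1)(a-1)/(a^n-1)); the function names and the sample values are illustrative assumptions, and n is kept as a real number rather than rounded to a whole number of stages.

```python
import math


def buffer_chain_delay(c_load, c_in, t0, a=math.e):
    """Return (n, Td) for a tapered chain with stage ratio a."""
    n = math.log(c_load / c_in) / math.log(a)
    return n, n * a * t0


def power_efficiency(a, n):
    """Eff = P_n / P_T = a^(n-1) * (a - 1) / (a^n - 1)."""
    return a ** (n - 1) * (a - 1) / (a ** n - 1)


# Hypothetical load 1000x larger than the input capacitance, t0 = 1 unit
_, delay_e = buffer_chain_delay(1000.0, 1.0, 1.0, a=math.e)
_, delay_2 = buffer_chain_delay(1000.0, 1.0, 1.0, a=2.0)
_, delay_10 = buffer_chain_delay(1000.0, 1.0, 1.0, a=10.0)
eff_small_ratio = power_efficiency(2.0, 3)
eff_large_ratio = power_efficiency(10.0, 3)
```

The delay is smallest at a = e, while the power efficiency grows with a, which is why larger ratios than e are attractive when power matters more than delay.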
Slew Rate
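• Determining rise/fall time
$$I_{mean} = \frac{2}{T}\left[\int_{t_1}^{t_2} I_{short}(t)\,dt + \int_{t_2}^{t_3} I_{short}(t)\,dt\right] = \frac{4}{T}\int_{t_1}^{t_2} I_{short}(t)\,dt = \frac{4}{T}\int_{t_1}^{t_2} \frac{\beta}{2}\,(V_{in}(t) - V_t)^2\,dt$$
where \beta_n = \beta_p = \beta and V_{tn} = -V_{tp} = V_t
$$P_{SC} = I_{mean} \cdot V_{dd} = \frac{\beta}{12}\,(V_{dd} - 2V_t)^3\,\tau f$$
where t_r = t_f = \tau
[Figure: input waveform Vin over one period T with rise time tr and fall time tf and the thresholds Vtn and Vdd + Vtp; the short-circuit current pulse between t1, t2 and t3 with its peak Imax and mean Imean]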
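The closed form for the short-circuit power follows from integrating the quadratic current over a linear input ramp. The sketch below checks that numerically; it assumes Vin(t) = Vdd·t/τ during the rising edge, t1 = τ·Vt/Vdd and t2 = τ/2, and the parameter values are arbitrary illustrations.

```python
import math


def short_circuit_power(beta, vdd, vt, tau, freq):
    """Closed form: P_SC = beta/12 * (Vdd - 2*Vt)^3 * tau * f."""
    return beta / 12.0 * (vdd - 2.0 * vt) ** 3 * tau * freq


def short_circuit_power_numeric(beta, vdd, vt, tau, freq, steps=10000):
    """Midpoint-rule evaluation of Vdd * (4/T) * integral of beta/2*(Vin-Vt)^2."""
    t1 = tau * vt / vdd          # input reaches Vt (assumed linear ramp)
    t2 = tau / 2.0               # input reaches Vdd/2
    dt = (t2 - t1) / steps
    integral = 0.0
    for k in range(steps):
        t = t1 + (k + 0.5) * dt
        vin = vdd * t / tau
        integral += 0.5 * beta * (vin - vt) ** 2 * dt
    i_mean = 4.0 * freq * integral   # 4/T with T = 1/f
    return i_mean * vdd


closed = short_circuit_power(1e-4, 3.3, 0.7, 1e-9, 1e8)
numeric = short_circuit_power_numeric(1e-4, 3.3, 0.7, 1e-9, 1e8)
```

The cubic dependence on (Vdd - 2Vt) and the linear dependence on τ show why sharp input edges and reduced supply voltage both cut short-circuit power.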
Slew Rate(Cont’d)
• Power consumption of short-circuit current in an oscillation circuit
[Figure: circuits and transfer curves relating Vi and Vo, each swinging between 0 and Vdd]
Pass Transistor Logic
• Reducing area/power
– Macro cells (a large part of chip area): XOR/XNOR/MUX (primitive) → pass transistor logic
– Does not use a charge/discharge scheme → appropriate for low power logic
• Pass transistor logic family
– CPL (Complementary Pass Transistor Logic)
– DPL (Dual Pass Transistor Logic)
– SRPL (Swing Restored Pass Transistor Logic)
• CPL
– Basic scheme
[Figure: n-MOS pass network with inputs A, B and their complements producing AB and its complement]
– Inverter buffering
[Figure: the same pass network followed by output inverters with a p-MOS latch connected to Vdd]
Pass Transistor Logic(Cont’d)
• DPL
– Pass transistor network + dual p-MOS
– Enables rail-to-rail swing
– Characteristics
• Increased input capacitance (delay)
• Increased driving ability, since two ON-paths exist
• Equals CPL in input loading capacitance
• SRPL
– Pass transistor network + cross-coupled inverter
– Restores the logic level
– Inverter size must not be too big
[Figure: DPL gate with inputs A, B and their complements; SRPL gate built from an n-MOS CPL network followed by cross-coupled inverters, producing AB and its complement]
Dynamic Logic
• Uses a precharge/evaluation scheme
• Family
– Domino logic
– NORA (NO RAce) logic
• Characteristics
– Decreased input loading capacitance
– Power consumption in the precharge clock
– Increased useless switching in the precharging period
• Basic architecture of Domino logic
[Figure: domino gate with precharge transistor P1, n-type logic block with inputs A and B, evaluation transistor N1, clock clk, input capacitance Cin and load CL; the clock waveform is divided into precharge and evaluation phases]
Input Pin Ordering
• Reorder the equivalent inputs to a transistor based on critical path delays and power consumption
• N-input primitive CMOS logic
– Symmetrical at the function level
– Asymmetrical at the transistor level
• capacitance of the output stage
• body effect
• Scheme
– The signal that has many transitions must be far from the output
– If it is hard to estimate the switching frequency, the pin ordering must be determined considering the paths and the path delay balance from the primary inputs to the transistor inputs
• Example of N-input CMOS logic
[Figure: series transistor stack with inputs A, B, C, D, output load CL and internal node capacitances C1, C2, C3]
Experiment with a TI gate array: for a 4-input NAND gate in TI's BiCMOS gate array library (with a load of 13 inverters), the delay varies by 20% and the power dissipation by 10% between a good and a bad ordering.
INPUT PIN Reordering
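[Figure: 4-input NAND gate with p-MOS transistors MPA, MPB, MPC, MPD connected to VDD, series n-MOS transistors MNA, MNB, MNC, MND, output load CL and internal node capacitances CB, CC, CD; panels (a), (b), (c), (d) show the input patterns discussed]
Simulation result (tcycle = 50 ns, tf/tr = 1 ns):
– when A is the critical input: 38.4 uW
– when D is the critical input: 47.2 uW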
Sensitization
• Definition
– Sensitization: an input signal that forces an output transition event
– Sensitization vector: the values of the other inputs for which one signal is sensitized
$$\frac{\partial Y}{\partial X_i} = [f]_{X_i=0} \oplus [f]_{X_i=1} = f(X_1, \ldots, X_{i-1}, 0, X_{i+1}, \ldots, X_n) \oplus f(X_1, \ldots, X_{i-1}, 1, X_{i+1}, \ldots, X_n)$$
• Example
$$Y = (X_1 + X_2) \cdot X_3$$
$$\frac{\partial Y}{\partial X_1} = [f]_{X_1=0} \oplus [f]_{X_1=1} = X_2 X_3 \oplus X_3 = \overline{X_2}\,X_3$$
[Figure: gate network with inputs X1, X2, X3 and output Y]
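The Boolean difference in the example can be computed by brute force over the truth table. The following is a small sketch; the helper name `boolean_difference` is illustrative, and the function Y = (X1 + X2)·X3 is the one from the slide.

```python
from itertools import product


def boolean_difference(f, i, n):
    """Map each assignment of the other n-1 inputs to f|Xi=0 XOR f|Xi=1."""
    table = {}
    for bits in product((0, 1), repeat=n):
        if bits[i] == 0:
            flipped = bits[:i] + (1,) + bits[i + 1:]
            others = bits[:i] + bits[i + 1:]
            table[others] = f(*bits) ^ f(*flipped)
    return table


def y(x1, x2, x3):
    return (x1 | x2) & x3


diff_x1 = boolean_difference(y, 0, 3)
# Sensitization vectors for X1: values of (X2, X3) that make Y follow X1
sensitization_vectors = [others for others, d in diff_x1.items() if d]
```

The only sensitization vector for X1 is X2 = 0, X3 = 1, which agrees with the closed form X2'·X3 on the slide.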
Sensitization(Cont’d)
• Considering sensitization in combinational logic: removes unnecessary transitions in the combinational logic
• Considering sensitization in sequential logic: also reduces the power consumption in the flip-flops
[Figure: combinational logic blocks with inputs X1 ... Xn and output Y, shown with and without an enable signal E on the inputs, and the corresponding sequential versions in which the inputs come from D flip-flops clocked by clk and enabled by E]
TTL-Compatible
• TTL level signal → CMOS input
• Characteristic curve of CMOS inverter
[Figure: TTL input driving a CMOS inverter (Vi, Vo) with Vdd = 3.3 V, and the transfer curve Vo versus Vi; labels: VIL = 0.8 V, 1.4 V, VIH = 2.0 V, Vdd = 3.3 V, currents IDTTL1 and IDTTL2]
Ileak = avg(Id1, Id2)
\[
P_{TTL} = N_{TTL} \cdot V_{dd} \cdot (I_{DTTL1} + I_{DTTL2})
\]
where N_TTL is the number of TTL-compatible input pads
TTL Compatible(Cont’d)
• CMOS output signal → TTL input
– Because of the sink current IOL, the CMOS output dissipates a large amount of heat
– Increased chip operating temperature
– Increased power consumption of the whole system
[Figure: output pad driving an input pad across the chip boundaries; labels: IOL, VOL]
INPUT PIN Reordering
◈ To reduce the power dissipation, one should place the input with low transition density near the ground end.
(a) If MNA turns off, only CL needs to be charged.
(b) If MND turns off, all of CL, CB, CC and CD need to be charged.
(c) If the critical input is rising and placed near the output node, the initial charge of CB, CC and CD is zero, and the delay time of CL discharging is less than in (d).
(d) If the critical input is rising and placed near the ground end, the charge of CB, CC and CD must discharge before the charge of CL discharges to zero.
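Cases (a) and (b) can be compared with a rough energy estimate. The sketch below is only an illustration under stated assumptions: the capacitance and supply values are hypothetical, the function name `supply_energy` is invented here, and every node is assumed to swing the full 0 to Vdd range (internal nodes of a real stack swing less).

```python
def supply_energy(capacitances, vdd):
    """Energy drawn from the supply to charge the listed capacitances from 0 V to vdd.

    Simplifying assumption: every node swings the full 0..vdd range.
    """
    return sum(capacitances) * vdd ** 2

# Hypothetical values, chosen only for illustration.
VDD = 3.3               # volts
CL = 50e-15             # output load, farads
CB = CC = CD = 5e-15    # internal node capacitances, farads

energy_a = supply_energy([CL], VDD)              # case (a): only CL is charged
energy_b = supply_energy([CL, CB, CC, CD], VDD)  # case (b): CL, CB, CC and CD are charged
ratio = energy_b / energy_a
print(f"case (b) / case (a) energy ratio: {ratio:.2f}")
```

With these assumed values, case (b) costs about 30% more energy per charging event than case (a), which is why the position of the frequently switching input matters.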
Conclusion
[Figure: two bar charts over word length (4bit, 8bit, 12bit, average). Left: % of instances with circuit states effects. Right: Power [pJ] with circuit states effects considered versus not considered. Annotated reductions: 9.0%, 4.0%, 12.0%]
Device Scaling by a Factor of S
• Constant scaled wire increases coupling capacitance by S and wire resistance by S
• Supply voltage by 1/S, threshold voltage by 1/S, current drive by 1/S
• Gate capacitance by 1/S, gate delay by 1/S
• Global interconnection delay, RC load + parasitics, by S
• Interconnect delay: 50-70% of clock cycle
• Area: 1/S^2
• Power dissipation by 1/S to 1/S^2 (P = nC Vdd^2 f, where nC is the sum of capacitance times number of transitions)
• SIA (Semiconductor Industry Association): in 2007, physical limitation: 0.1 µm
• 20 billion transistors, 10 square centimeters, 12 or 16 inch wafer
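The power scaling range quoted on this slide can be reproduced from P = nC Vdd^2 f. The following Python sketch is a worked example under assumptions: the function name `scaled_power` and the numeric values are hypothetical, and the clock frequency is assumed to rise by S because the gate delay falls by 1/S.

```python
def scaled_power(n_c, vdd, f, s):
    """Return (power before, power after) scaling by S for P = nC * Vdd^2 * f.

    Assumed rules: capacitance by 1/S, supply voltage by 1/S,
    and frequency by S (gate delay shrinks by 1/S).
    """
    p_before = n_c * vdd ** 2 * f
    p_after = (n_c / s) * (vdd / s) ** 2 * (f * s)
    return p_before, p_after

# Hypothetical operating point, for illustration only.
p_before, p_after = scaled_power(n_c=1e-9, vdd=3.3, f=100e6, s=2.0)
power_ratio = p_after / p_before
print(f"power ratio after scaling by S=2: {power_ratio:.3f}")
```

Under these assumptions the per-circuit power falls by 1/S^2 (0.25 for S = 2), the lower end of the range on the slide.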
Delay Variations at Low-Voltage
• At high supply voltages, the delay increases with temperature (mobility decreases with temperature), while at very low supply voltages the delay decreases with temperature (VT decreases with temperature).
• At low supply voltages, the delay ratio between large and minimum transistor widths W increases by a factor of several.
• Clock trees are delay-balanced by wire snaking in order to avoid clock skew. At low supply voltages, slight VT variations can significantly modify this delay balancing.
Quarter Micron Challenge
• Computers/peripherals (SOC): 1996 ($50 billion) → 1999 ($70 billion)
• Wiring dominates delay: wire R comparable to gate driver R; wire/wire coupling C > C to ground
• Push beyond 0.07 micron
• Quest for area (past), speed-speed (now), power-power-power (future)
• Accelerated increases of clock frequencies
• Signal integrity-based tools
• Design styles (chip + packages)
• System-level design (system partitioning)
• Synthesis with multiple constraints (power, area, timing)
• Partitioning/MCM
• Increasing speed limits complicate clock and power distribution
• Design bounded by wires, vias, via resistance, coupling
• Reverse scaling: adding area/spacing as needed: widening and thickening of wires, metal shielding and noise avoidance (adding metal)
CLOCK POWER CONSUMPTION
• Clock power consumption is as large as the logic power. Because the clock signal carries the heaviest load and switches at high frequency, clock distribution is a major source of power dissipation.
• In a microprocessor, 18% of the total power is consumed by clocking.
• Clock distribution is designed as a hierarchical clock tree, according to the decomposition principle.
Power Consumption per block in
typical microprocessor
Crosstalk
Solution for Clock Skew
• Dynamic effects on skew
• Capacitance coupling
• Supply voltage deviation (clock driver and receiver voltage difference)
• Capacitance deviation by circuit operation
• Global and local temperature
• Layout issues: clocks routed first
• Must be aware of all sources of delay
• Increased spacing
• Wider wires
• Insert buffers
• Specialized clocks need net matching
• Two approaches: single driver, H-tree driver
• Gated clocks: local clocks that are conditionally enabled so that the registers are only clocked during the write cycles. The clock is partitioned into different blocks and each block is clocked with its own clock.
• Gating the clocks to infrequently used blocks does not provide an acceptable level of power savings
• Divide the basic clock frequency to provide the lowest clock frequency needed to different parts of the circuit
• Clock distribution: large clock buffers waste power. Use smaller clock buffers with a well-balanced clock tree.
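The effect of a gated clock on register clocking activity can be illustrated by counting the clock pulses that actually reach a register. This Python sketch is a toy model under assumptions: the name `clock_pulses_delivered` and the write-enable trace are hypothetical, and the power of the gating logic itself is ignored.

```python
def clock_pulses_delivered(write_enable, gated):
    """Count clock pulses reaching a register over a sequence of cycles.

    write_enable: one 0/1 flag per cycle (1 = the register is written).
    gated: if True, the clock is delivered only during write cycles.
    """
    if gated:
        return sum(1 for enable in write_enable if enable)
    return len(write_enable)

# Hypothetical write-enable trace: 3 write cycles out of 8.
trace = [1, 0, 0, 1, 0, 0, 0, 1]
ungated = clock_pulses_delivered(trace, gated=False)
gated = clock_pulses_delivered(trace, gated=True)
saving = 1 - gated / ungated
print(f"ungated={ungated}, gated={gated}, clock activity saved={saving:.1%}")
```

In this trace the register sees 3 clock pulses instead of 8. The actual power saving depends on the capacitance of the gated clock branch and on the overhead of the enable logic, neither of which is modeled here.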
PowerPC Clocking Scheme
CLOCK DRIVERS IN THE DEC ALPHA 21164
DRIVER for PADS or LARGE CAPACITANCES
Off-chip power (drivers and pads) is increasing and is very difficult to reduce, because pad and driver sizes cannot be decreased with new technologies.
Layout-Driven Resynthesis for Lower Power
Low Power Process
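• Dynamic Power Dissipation
\[
P_d = a \cdot C_L \cdot V_{dd}^2 \cdot f
\]
\[
I_{ds} = \frac{\beta}{2} (V_{gs} - V_t)^2
\]
\[
C_{gate} = \sum_{i=1}^{n} C_{ox} \cdot (W \cdot L)
\]
\[
C_{in} = \sum_{j=1}^{m} (C_{gate})_j
\]
\[
C_{ov} = C_{GD0} \cdot W
\]
\[
C_{dj} = C_j \cdot AD + C_{jsw} \cdot PD
\]
\[
AD = W \cdot D, \quad PD = 2(W + D)
\]
[Figure: CMOS inverter (Vdd, Vin, Vo) annotated with capacitances C_djp, C_ovp, C_ovn, C_djn; drain region of width W and length D with junction capacitances C_jb and C_jsw]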
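The capacitance expressions on this slide translate directly into a small calculator. The Python sketch below is illustrative: the function names and all numeric process values are hypothetical, not taken from the slides, and the gate capacitance is computed for a single transistor.

```python
def gate_capacitance(cox, w, l):
    """C_gate = Cox * (W * L) for one transistor."""
    return cox * w * l

def overlap_capacitance(cgd0, w):
    """C_ov = CGD0 * W."""
    return cgd0 * w

def drain_junction_capacitance(cj, cjsw, w, d):
    """C_dj = Cj * AD + Cjsw * PD, with AD = W * D and PD = 2 * (W + D)."""
    area = w * d
    perimeter = 2 * (w + d)
    return cj * area + cjsw * perimeter

def dynamic_power(a, cl, vdd, f):
    """P_d = a * C_L * Vdd^2 * f."""
    return a * cl * vdd ** 2 * f

# Hypothetical values (SI units), for illustration only.
c_gate = gate_capacitance(cox=2.5e-3, w=2e-6, l=0.5e-6)
c_ov = overlap_capacitance(cgd0=3e-10, w=2e-6)
c_dj = drain_junction_capacitance(cj=5e-4, cjsw=3e-10, w=2e-6, d=1e-6)
p_d = dynamic_power(a=0.2, cl=c_gate + c_ov + c_dj, vdd=3.3, f=100e6)
print(f"C_gate={c_gate:.3e} F, C_ov={c_ov:.3e} F, C_dj={c_dj:.3e} F, P_d={p_d:.3e} W")
```

Lumping the three capacitances into one load is a simplification made for this sketch; a real extraction would attach each term to the node it loads.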
Crosstalk
• In deep-submicron layouts, some of the nets connecting modules can be so long that their resistance is comparable to the resistance of the driver.
• Each net in a mixed analog/digital circuit is classified according to its crosstalk sensitivity:
– 1. Noisy = high-impedance signals that can disturb other signals, e.g., clock signals.
– 2. High-Sensitivity = high-impedance analog nets; the most noise-sensitive nets, such as the input nets to operational amplifiers.
– 3. Mid-Sensitivity = low/medium-impedance analog nets.
– 4. Low-Sensitivity = digital nets that directly affect the analog part in some cells, such as control signals.
– 5. Non-Sensitivity = the most noise-insensitive nets, such as pure digital nets.
• The crosstalk between two interconnection wires also depends on the frequencies (i.e., signal activities) of the signals traveling on the wires. Recent deep-submicron designs require crosstalk-free channel routing.
Power Measure in Layout
• The average dynamic power consumed by a CMOS gate is given below, where C_l is the load capacitance at the output of the node, V_dd is the supply voltage, T_cycle is the global clock period, N is the number of transitions of the gate output per clock cycle, C_g is the load capacitance due to the input capacitance of the fanout gates, and C_w is the load capacitance due to the interconnection tree formed between the driver and its fanout gates.
\[
P_{av} = \frac{0.5\, V_{dd}^2}{T_{cycle}}\, C_l\, N = \frac{0.5\, V_{dd}^2}{T_{cycle}}\, (C_g + C_w)\, N
\]
• Logic synthesis for low power attempts to minimize SUM_i Cgi Ni.
• Physical design for low power tries to minimize SUM_i Cwi Ni.
• Here Cwi consists of Cxi + Csi, where Cxi is the capacitance of net i due to its crosstalk, and Csi is the substrate capacitance of net i. For low-power layout applications, power dissipation due to crosstalk is minimized by ensuring that wires carrying high-activity signals are placed sufficiently far from the other wires.
• Similarly, power dissipation due to substrate capacitance is proportional to the wirelength and its signal activity.
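The power measure and the layout cost SUM_i Cwi Ni can be evaluated numerically. The following Python sketch is a minimal illustration: the function names and the per-net values are hypothetical, and it assumes the average power is proportional to the switched capacitance, P_av = (0.5 Vdd^2 / Tcycle) (Cg + Cw) N.

```python
def average_power(vdd, t_cycle, c_g, c_w, n):
    """P_av = 0.5 * Vdd^2 / T_cycle * (C_g + C_w) * N for one gate."""
    return 0.5 * vdd ** 2 / t_cycle * (c_g + c_w) * n

def layout_cost(nets):
    """SUM_i C_wi * N_i over (wire capacitance, transitions per cycle) pairs."""
    return sum(c_w * n for c_w, n in nets)

# Hypothetical nets: (wire capacitance in farads, transitions per clock cycle).
nets = [(20e-15, 0.5), (80e-15, 0.1), (10e-15, 1.0)]
cost = layout_cost(nets)
p_example = average_power(vdd=3.3, t_cycle=10e-9, c_g=15e-15, c_w=20e-15, n=0.5)
print(f"layout cost = {cost:.3e} F, example gate power = {p_example:.3e} W")
```

The cost function shows why a placer should shorten high-activity nets first: the 10 fF net with one transition per cycle contributes as much as the 20 fF net with half the activity.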
Low-Power Layout Design Using Dual Supply Voltages
SungKyunKwan University
School of Electrical and Computer Engineering
김진혁, 이준성, 조준동
Contents
• Research objectives
• Research background
• Clustered Voltage Scaling structure
• Row by Row Power Supply structure
• Mix-And-Match Power Supply structure
• Level Converter structure
• Mix-And-Match Power Supply design flow
• Experimental results
• Conclusion
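Research Objectives and Background
• Propose a dual-voltage layout technique that reduces the power consumption of combinational circuits
• When dual-voltage cells are used, reduce the number of wires and tracks, which grows when cells of the same voltage are placed together in one cell row
• Implement a Level Converter circuit that uses the minimum number of transistors
• Apply Clustered Voltage Scaling [Usami, ’95], which uses dual voltages while maintaining device performance
• The proposed Mix-And-Match Power Supply layout structure improves on the existing Row by Row Power Supply [Usami, ’97] layout structure, reducing power and area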
Clustered Voltage Scaling
• Generates a low-power netlist
Slack(S_i) = R_i - A_i
[Figure: netlist of gates G1 to G11 between flip-flops (F/F) with level converters LC1 and LC2; gate slacks: S1 > 0, S2 < 0, S3 > 0, S4 > 0, S5 > 0, S6 > 0, S7 < 0, S8 < 0, S9 > 0, S10 < 0, S11 < 0; legend: VDDL, VDDH, Level Converter]
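The slack test behind Clustered Voltage Scaling can be sketched in a few lines. This Python sketch is a deliberately simplified assumption, not the published algorithm: it assigns VDDL to any gate whose slack is positive and VDDH otherwise, and it ignores the extra delay of a VDDL gate, the clustering step, and level converter insertion. The function names and timing values are hypothetical.

```python
def slack(required, arrival):
    """Slack(S_i) = R_i - A_i."""
    return required - arrival

def assign_supply(slacks):
    """Simplified rule (assumption): positive slack -> VDDL, otherwise VDDH."""
    return {gate: ("VDDL" if s > 0 else "VDDH") for gate, s in slacks.items()}

# Hypothetical (required time, arrival time) pairs in nanoseconds.
timing = {"G1": (10.0, 7.0), "G2": (10.0, 10.5), "G3": (8.0, 6.5)}
slacks = {gate: slack(r, a) for gate, (r, a) in timing.items()}
assignment = assign_supply(slacks)
print(assignment)
```

A practical flow would re-run timing analysis after each reassignment, because lowering the supply of one gate consumes slack on every path through it.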
Row by Row Power Supply Structure
[Figure: module of standard-cell rows, each row supplied by either VDDL or VDDH; labels: standard cell, module, VDDL, VDDH, VSS, VDDL cell, VDDH cell]
Mix-And-Match Power Supply Structure
[Figure: module of standard-cell rows in which VDDL cells and VDDH cells are mixed; labels: standard cell, module, VDDL, VDDH, VSS, VDDL cell, VDDH cell]
Structure Comparison
[Figure: three modules side by side: Conventional Circuit, RRPS, MAMPS; supply labels: VDDL, VDDH]
Level Converter Structure
• Number of transistors: 6 → 4
• Effective in terms of power and area
[Figure: existing and proposed level converter circuits; labels: VDDH, VDDL, VSS/VDDL, VSS/VDDH, IN, OUT, Vth = 1.5 V, Vth = 2.0 V]
Mix-And-Match Power Supply
Design Flow
Single voltage netlist
Multiple voltage scaling
Netlist with multiple supply voltage
(OPUS)
Assign supply voltage to each cell
Physical placement
(Aquarius XO)
Routing
Synthesis timing, power and area
(PowerMill)
Experimental Results
[Figure: bar charts of total power (%) and total area (%) for the conventional circuit, RRPS, and MAMPS; axis value: 100; annotated values: 47%, 10%, 15%, 2%]
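Conclusion
• Compared with the single-voltage circuit, a 49.4% power reduction was obtained, at the cost of a 5.6% area overhead
• 10% area reduction and 2% power reduction compared with the existing RRPS structure
• The proposed Level Converter achieves a 30% area reduction and a 35% power reduction compared with the existing Level Converter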
Low Power Design Tools
• Transistor Level Tools (5-10% of silicon)
– SPICE, PowerMill (Epic), ADM (Avanti/Anagram), Lsim Power Analyst (Mentor)
• Logic Level Tools (10-15%)
– Design Power and PowerGate (Synopsys), WattWatcher/Gate (Sente), PowerSim (System Sciences), POET (Viewlogic), and QuickPower (Mentor)
• Architectural (RTL) Level Tools (20-25%)
– WattWatcher/Architect (Sente): 20-25% accuracy
• Behavioral (spreadsheet) Level Tools (50-100%)
– Active area of academic research
Commercial synthesis systems
Research synthesis systems
A - Architectural synthesis.
L - Logic synthesis.
Low-Power CAD sites
• Alternative System Concepts, Inc.: 7X power reduction through optimization; contact http://www.ee.princeton.edu and Jake Karrfalt at [email protected] or (603) 437-2234. Reduction of glitch and clock power; modeling and optimization of interconnect power; power optimization for data-dominated designs with limited control flow.
• Mentor Graphics QuickPower: hierarchical determination of the overall benefit of exchanging blocks for lower power; powering down or disabling blocks when not in use by clock gating; choosing candidates for power-down and calculating the effect of the power-down logic. http://www.mentorg.com
• Synopsys's Power Compiler: http://www.synopsys.com/products/power/power_ds
• Sente's WattWatcher/Architect: first commercial tool operating at the architecture level (20-25% accuracy). http://www.powereda.com
• Behavioral tools: Hyper-LP (optimization), Explore (estimation) by J. Rabaey
Design Power(Synopsys)
• DesignPower(TM) provides a single, integrated environment for power analysis in multiple phases of the design process:
– Early, quick feedback at the HDL or gate level through probabilistic analysis.
– Improved accuracy through simulation-based analysis for gate level and library exploration.
• DesignPower estimates switching, internal cell and leakage power. It accepts user-defined probabilities, simulation toggle data or a combination of both as input. DesignPower propagates switching information through sequential devices, including flip-flops and latches.
• It supports sequential, hierarchical, gated-clock, and multiple-clock designs. For simulation toggle data, it links directly to Verilog and VHDL simulators, including Synopsys' VSS.
References
[1] Gary K. Yeap, "Practical Low Power Digital VLSI Design",
Kluwer Academic Publishers.
[2] Jan M. Rabaey, Massoud Pedram, "Low Power Design Methodologies",
Kluwer Academic Publishers.
[3] Abdellatif Bellaouar, Mohamed I. Elmasry, "Low-Power Digital VLSI Design
Circuits And Systems", Kluwer Academic Publishers.
[4] Anantha P. Chandrakasan, Robert W. Brodersen, "Low Power Digital CMOS
Design", Kluwer Academic Publishers.
[5] Dr. Ralph Cavin, Dr. Wentai Liu, "1996 Emerging Technologies : Designing
Low Power Digital Systems"
[6] Muhammad S. Elrabaa, Issam S. Abu-Khater, Mohamed I. Elmasry,
"Advanced Low-Power Digital Circuit Techniques",
Kluwer Academic Publishers.
References
• [BFKea94] R. Bechade, R. Flaker, B. Kauffmann, et al. "A 32b 66 MHz 1.8W Microprocessor". In IEEE Int. Solid-State Circuit Conference, pages 208-209, 1994.
• [BM95] Bohr and T. Mark. "Interconnect Scaling - The real limiter to high performance ULSI". In Proceedings of 1995 IEEE International Electron Devices Meeting, pages 241-242, 1995.
• [BSM94] L. Benini, P. Siegel, and G. De Micheli. "Saving Power by Synthesizing Gated Clocks for Sequential Circuits". IEEE Design and Test of Computers, 11(4):32-41, 1994.
• [GH95] S. Ganguly and S. Hojat. "Clock Distribution Design and Verification for PowerPC Microprocessor". In International Conference on Computer-Aided Design, page Issues in Clock Designs, 1995.
• [MGR96] R. Mehra, L. M. Guerra, and J. Rabaey. "Low Power Architecture Synthesis and the Impact of Exploiting Locality". In Journal of VLSI Signal Processing, 1996.