Low Power ASIC Design (ppt File) - VADA


Lower Power Design Guide
1998. 6.7
Prof. Jun-Dong Cho, SungKyunKwan University
http://vlsicad.skku.ac.kr
SungKyunKwan Univ.
VADA Lab.
1
Contents
1. Introduction
Trends for High-Level Lower Power Design
2. Power Management
Clock/Cache/Memory Management
3. Architecture Level Design
Architecture Trade offs, Transformation
4. RTL Level Design
Retiming, Loop-Unrolling, Clock Selection, Scheduling, Resource Sharing,
Register Allocation
5. Partitioning
6. Logic Level Design
7. Circuit Level Design
8. Quarter Sub Micron Layout Design
Lower Power Clock Designs
9. CAD tools
10. References
SungKyunKwan Univ.
VADA Lab.
2
1. Introduction
SungKyunKwan Univ.
VADA Lab.
3
Motivation
• Portable mobile (= ubiquitous = nomadic) systems
• Systems with limited room for heat sinks
• Lowering power with fixed performance: DSPs in modems and cellular phones
• Reliability: increasing power means increasing electromigration; 40-year reliability guarantee (product life cycle of the telecommunications industry)
SungKyunKwan Univ.
• Adding fans to remove heat causes reliability to plummet.
• Higher power leads to higher packaging costs: a 2-watt package can cost four times more than a 1-watt package.
• Myriad constraints: timing, power, testability, area, packaging, time-to-market.
• Ad-hoc design: lacking a systematic process that leads to universal applicability.
VADA Lab.
4
Power!Power!Power!
SungKyunKwan Univ.
VADA Lab.
5
Power Dissipation in VLSI’s
[Pie charts: power breakdown among I/O, memory, clock, and logic for each of the four chips listed below]
MPU1: low-end microprocessor for embedded use
MPU2: high-end CPU with large amount of cache
ASSP1: MPEG2 decoder
ASSP2: ATM switch
SungKyunKwan Univ.
VADA Lab.
6
Current Design Issues in Lower Power Problem
Energy-hungry functions handled by the network server:
• InfoPad (Univ. of California, Berkeley), weight < 1 pound
• 0.5 W (reflective color display) + 0.5 W (computation, communication, I/O support) = 1 W (Alpha chip: 25 W; StrongARM at 215 MHz, 2.0 V: 0.3 W)
• Runtime 50 hours; target: 100 MIPS/mW
• Deep submicron (0.35–0.18 μm) with low voltage for a portable full-motion video terminal; 0.5 μm: 40 AA NiMH; 1 μm: 1 AA NiMH
SungKyunKwan Univ.
• System-on-a-chip to reduce external interconnection capacitances
• Power management: shut down idle units
• Power optimization techniques in the software, architecture, logic/circuit, and layout phases to reduce operations, frequency, capacitance, and switching activity while maintaining the same throughput
VADA Lab.
7
Battery Trends
SungKyunKwan Univ.
VADA Lab.
8
Road-Map in Semiconductor
Device Integration
SungKyunKwan Univ.
VADA Lab.
9
Road-Map in Semiconductor Device Complexity
SungKyunKwan Univ.
VADA Lab.
10
Power Component
• Static: leakage current (<< 1%)
• Dynamic:
– Short-circuit power (10–30%): short-circuit current flows during transitions
– Switching (or capacitive) power (70–90%): charging/discharging of capacitive loads during transitions
SungKyunKwan Univ.
VADA Lab.
11
Vdd vs Delay
• Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and reduce the critical path.
• Scale down device sizes to compensate for delay (interconnects do not scale proportionately and can become dominant).
SungKyunKwan Univ.
VADA Lab.
12
Good Design Methodologies
SungKyunKwan Univ.
VADA Lab.
13
Synthesis and Optimization
Pareto point
SungKyunKwan Univ.
VADA Lab.
14
2. Power Management
SungKyunKwan Univ.
VADA Lab.
15
Power Consumption in Multimedia Systems
• LCD: 54.1%, HDD: 16.8%, CPU: 10.7%, VGA/VRAM: 9.6%, system logic: 4.5%, DRAM: 1.1%, others: 3.2%
• 5–55 mode:
– Display mode: CPU is in sleep mode (55 minutes); LCD (VRAM + LCDC) active
– CPU mode: display is idle (5 minutes); looking up data (data retrieval)
• Handwriting recognition: biggest power consumer (memory and system bus active)
SungKyunKwan Univ.
VADA Lab.
16
Power Management
• DPM (Dynamic Power Management): stops the clock switching of a specific unit by gating its clock regenerators. The clock regenerators produce two clocks, C1 and C2. Logic overhead: 0.3%; 10–20% power savings.
SungKyunKwan Univ.
• SPM (Static Power Management): saves power dissipation in the steady mode. When the system (or a subsystem) remains idle for a significant period of time, the entire chip (or subsystem) is shut down.
• Identify power-hungry modules and look for opportunities to reduce power.
• If f is increased, one has to increase the transistor size or Vdd.
VADA Lab.
17
Power Management
• Use the right supply and the right frequency for each part of the system. If one has to wait on the occurrence of some input, only a small circuit could wait and wake up the main circuit when the input occurs.
• Another technique is to reduce the basic frequency for tasks that can be executed slowly.
• The PowerPC 603 is a 2-issue processor (2 instructions read at a time) with 5 parallel execution units and 4 modes:
– Full-on mode for full speed
– Doze mode, in which the execution units are not running
– Nap mode, which also stops the bus clocking
– Sleep mode, which stops the clock generator, with or without the PLL (20–100 mW)
• Superpipelined MIPS R4200: 5-stage pipeline; MIPS R4400: 8-stage, 2 execution units, f/2 in reduced mode.
SungKyunKwan Univ.
VADA Lab.
18
TI
• Two DSPs, the TMS320C541 and TMS320C542, reduce power, chip count, and system cost for wireless communication applications.
• C54x DSPs (2.7 V, 5 V), the Low-Power Enhanced Architecture DSP (LEAD) family: with three different power-down modes, these devices are well suited for wireless communications products such as digital cellular phones, personal digital assistants, and wireless modems; low power on voice coding and decoding.
• The TMS320LC548 features:
– 15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
– 3.0- and 3.3-V operation
• 32K 16-bit words of RAM and 2K 16-bit words of boot ROM on-chip
• Integrated Viterbi accelerator that performs a Viterbi butterfly update in four instruction cycles for GSM channel decoding
• Powerful single-cycle instructions (dual-operand, parallel, and conditional instructions)
• Low-power standby modes
SungKyunKwan Univ.
VADA Lab.
19
Power Estimation Techniques
• Circuit simulation (SPICE): needs a set of input vectors; accurate, but limited by memory and time constraints.
• Monte Carlo: randomly generated input patterns; power per time interval T is normally distributed. Switch-level simulation (IRSIM): activity defined as the number of rising and falling transitions over the total number of inputs.
• PowerMill (transistor level): steady-state transitions, hazards and glitches, transient short-circuit current and leakage current; measures current density and voltage drop in the power net and identifies reliability problems caused by EM failures, ground bounce, and excessive voltage drops.
• DesignPower (Synopsys): simulation-based analysis is within 8–15% of SPICE in terms of percentage difference (probability-based analysis is within 15–20% of SPICE).
SungKyunKwan Univ.
VADA Lab.
20
Cache/Memory Management
• Clock and memory consume between 15% and 45% of the total power in digital computers.
• As block size increases, the energy required to service a miss increases due to increased memory access. External-memory access (530 mA) vs. on-chip access (300 mA): replace excessive accesses to background memory with foreground memory.
• Cache vertical partitioning (buffering): multi-level, variable-size caches; caches are powered down when idle.
• Cache horizontal partitioning (subarray access): several segments can be powered individually. Only the cache sub-bank where the requested data is located consumes power in each cache access.
• Use distributed memory instead of a single centralized memory.
• Exploit locality of reference to eliminate expensive data transfers across high-capacitance busses.
• Cache misses consume more energy (direct mapping or k-way associative mapping?); page faults consume more energy.
SungKyunKwan Univ.
VADA Lab.
21
Power Management
• Power management scheme by enabling the clock (each block's clock is gated by an enable signal)
• Power management scheme by adding a clock-generation block (a clock-management block produces a separate clock per block)
[Block diagrams: enable 1–3 gating the clocks of individual blocks vs. a clock-management block driven by clk]
SungKyunKwan Univ.
VADA Lab.
22
3. Architectural Level Design
SungKyunKwan Univ.
VADA Lab.
23
Architectural-level Synthesis
• Translate HDL models into sequencing graphs.
• Behavioral-level optimization:
– Optimize abstract models independently from the
implementation parameters.
•
Architectural synthesis and optimization:
– Create macroscopic structure:
• data-path and control-unit.
– Consider area and delay information
• Hardware compilation:
– Compile the HDL model into a sequencing graph.
– Optimize the sequencing graph.
– Generate the gate-level interconnection of the implementation for a cell library.
SungKyunKwan Univ.
VADA Lab.
24
Power Measure of P
SungKyunKwan Univ.
VADA Lab.
25
System-Level Solutions
•
•
•
•
•
Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
Temporal locality: average lifetimes of variables (less temporary storage; high probability that future accesses reference data used in the recent past).
Precompute the physical capacitance of interconnect and the switching activity (number of bus accesses).
Architecture-driven voltage scaling: choose a more parallel architecture.
Supply voltage scaling: lowering Vdd reduces energy but increases delays.
SungKyunKwan Univ.
VADA Lab.
26
Software Power Issues
• Up to 40% of the on-chip power is dissipated on the buses!
• System software: OS, BIOS, compilers.
• Software can affect energy consumption at various levels.
• Inter-instruction effects: the energy cost of an instruction varies depending on the previous instruction.
• For example, for XOR BX, 1; ADD AX, DX:
I_est = (319.2 + 313.6)/2 = 316.4 mA, I_obs = 323.2 mA
• The difference is defined as the circuit-state overhead.
• Need to specify the overhead as a function of pairs of instructions.
• Additional costs are due to pipeline stalls and cache misses.
• Instruction reordering to improve the cache hit ratio.
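A minimal sketch of this instruction-level energy model; only the XOR/ADD base currents (319.2 mA, 313.6 mA) and the observed 323.2 mA come from the slide's example, while the function and table names are illustrative.

```python
# Base current per instruction and measured circuit-state overhead per pair;
# values other than the slide's XOR/ADD example would have to be characterized.
BASE_CURRENT_mA = {"XOR": 319.2, "ADD": 313.6}
PAIR_OVERHEAD_mA = {("XOR", "ADD"): 323.2 - 316.4}   # observed minus naive estimate

def pair_current(prev, cur):
    """Estimated average current of a two-instruction sequence:
    mean of the base currents plus the circuit-state overhead for that pair."""
    naive = (BASE_CURRENT_mA[prev] + BASE_CURRENT_mA[cur]) / 2.0
    return naive + PAIR_OVERHEAD_mA.get((prev, cur), 0.0)

print(pair_current("XOR", "ADD"))  # 316.4 + 6.8 = 323.2 mA, the observed value
```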
SungKyunKwan Univ.
VADA Lab.
27
Avoiding Wasteful Computation
•
•
•
•
•
•
•
•
Preservation of data correlation
Distributed computing / locality of reference
Application-specific processing
Demand-driven operation
Transformation for memory size reduction
Suppose arrays A and C are already available in memory.
When A is consumed, another array B is generated; when C is consumed, a scalar value D is produced.
Memory size can be reduced by executing the j loop before the i loop, so that C is consumed before B is generated and the same memory space can be used for both arrays.
SungKyunKwan Univ.
VADA Lab.
28
Avoiding Wasteful Computation
SungKyunKwan Univ.
VADA Lab.
29
Architecture Lower Power Design
• Optimum supply-voltage architecture through hardware duplication (trading area for lower power) and/or pipelining
– Fewer, more complex instructions require less encoding but larger decode logic!
• Use small, complex instructions with a smaller instruction length (e.g., Hitachi SH: 16-bit fixed length, arithmetic instructions use only two operands; NEC V800: variable-length instruction decoding overhead)
• Superscalar: CPI < 1, parallel instruction execution. VLIW architecture.
SungKyunKwan Univ.
VADA Lab.
30
Variable Supply Voltage Block Diagram
• Computational work varies with time. An approach to reduce the energy consumption of such systems beyond shutdown involves the dynamic adjustment of the supply voltage based on the computational workload.
• The basic idea is to lower the power supply during periods of low workload, instead of running at a fixed supply and idling for some fraction of the time.
• The supply voltage and clock rate are increased during high-workload periods.
VADA Lab.
31
Power Reduction using Variable Supply
•Circuits with a fixed supply voltage
work at a fixed speed and idle if the
data sample requires less than the
maximum amount of computation.
• Power is reduced in a linear fashion
since the energy per operation is
fixed. If the work load for a given
sample period is less than peak, then
the delay of the processing element
can be increased by a factor of
1/workload without loss in throughput,
allowing the processor to operate at a
lower supply voltage. Thus, energy
per operation varies.
SungKyunKwan Univ.
VADA Lab.
32
Data Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged.
The averaged workload is then used as the effective workload to drive the power supply.
Using a ping-pong buffering scheme, data samples In+2 and In+3 are being buffered while In and In+1 are being processed.
SungKyunKwan Univ.
VADA Lab.
33
Architecture of Microcoded Instruction
Set Processor
SungKyunKwan Univ.
VADA Lab.
34
Power and Area
At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.
SungKyunKwan Univ.
VADA Lab.
35
Datapath Parallelization
SungKyunKwan Univ.
VADA Lab.
36
Memory Parallelization
To first order, P = C · (f/2) · Vdd².
SungKyunKwan Univ.
VADA Lab.
37
Pipelined Micro-P
SungKyunKwan Univ.
VADA Lab.
38
Architecture Trade-Off
P_pipeline = (1.15C)(0.58V)²(f) ≈ 0.39P
P_parallel = (2.15C)(0.58V)²(0.5f) ≈ 0.36P
PIPELINED Implementation
SungKyunKwan Univ.
VADA Lab.
39
Through WAVE PIPELINING
SungKyunKwan Univ.
VADA Lab.
40
Different Classes of RISC Micro-P
SungKyunKwan Univ.
VADA Lab.
41
Application Specific Coprocessor
•
•
•
•
•
•
DSPs are increasingly called upon to perform tasks for which they are not ideally suited, for example, Viterbi decoding.
They may also take considerably more energy than a custom solution.
Use the DSP for portions of algorithms for which it is well suited, and craft an application-specific coprocessor (i.e., custom hardware) for the other tasks.
This is an example of the difference between power and energy.
The application-specific coprocessor may actually consume more power than the DSP, but it may be able to accomplish the same task in far less time, resulting in a net energy saving.
Power consumption varies dramatically with the instruction being executed.
SungKyunKwan Univ.
VADA Lab.
42
Clock per Instruction (CPI)
SungKyunKwan Univ.
VADA Lab.
43
SUPERPIPELINE micro-P
SungKyunKwan Univ.
VADA Lab.
44
VLIW Architecture
The compiler takes the responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing these operations. VLIW instruction decoding is easier than superscalar instruction decoding due to the fixed format and the absence of instruction dependencies.
The fixed format may, however, impose more limitations on how operations can be combined.
Intel P6: CISC instructions are combined on chip into a set of micro-operations (i.e., a long instruction word) that can be executed in parallel.
As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one.
VLIW architectures, being simpler than N-issue superscalar machines, can be considered promising architectures for achieving high speed and low power simultaneously.
SungKyunKwan Univ.
VADA Lab.
45
Synchronous VS. Asynchronous SYSTEMS
•
• Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in computation but are required for synchronization purposes. With advances in technology, systems tend to get bigger and bigger, and as a result the delay on the clock wires can no longer be ignored. The problem of clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have to process new inputs. The biggest gate is the clock driver itself, which must switch.
• Asynchronous system (self-timed): an input signal (request) starts the computation on a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems are potentially responsive to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.
SungKyunKwan Univ.
VADA Lab.
46
Synchronous VS. Asynchronous SYSTEMS
•
•
•
•
•
•
•
More difficult to implement, requiring explicit synchronization between communicating blocks without clocks.
If a signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.
Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
Asynchronous designs are not widely used because designers cannot find the supporting design tools and methodologies they need.
The DCC error corrector of a compact cassette player saves 80% of the power compared with its synchronous counterpart.
Offers more architectural options/freedom; encourages distributed, localized control; offers more freedom to adapt the supply voltage.
SungKyunKwan Univ.
VADA Lab.
47
Asynchronous Modules
SungKyunKwan Univ.
VADA Lab.
48
Example: ABCS protocol
6% more logic
SungKyunKwan Univ.
VADA Lab.
49
Control Synthesis Flow
SungKyunKwan Univ.
VADA Lab.
50
PIPELINED SELF-TIMED micro P
SungKyunKwan Univ.
VADA Lab.
51
Programming Style
SungKyunKwan Univ.
VADA Lab.
52
Speed vs. Power Optimization
SungKyunKwan Univ.
VADA Lab.
53
VON NEUMANN VERSUS HARVARD
SungKyunKwan Univ.
VADA Lab.
54
Low Vdd Main Memories
SungKyunKwan Univ.
VADA Lab.
55
CACHE MEMORIES
SungKyunKwan Univ.
VADA Lab.
56
Low Power Memory
• Hierarchical word line: divide the memory into different blocks and access only the bit cells of the desired block.
• Selective precharge: many bit lines are discharged even when those locations are not accessed; precharge only the bit lines that will be accessed.
• Minimization of non-zero terms in the ROM table: zero terms do not switch bit lines and reduce capacitance on both bit lines and row lines.
– Inverted ROM: if the number of ones is very high, the whole ROM core can be inverted.
– Inverted row: a given row is inverted if more than half of its bits are non-zero terms; an extra bit is required to perform the encoding (see the sketch after this list).
– Sign-magnitude representation: when the ROM stores the coefficients of a digital filter, a significant fraction of the non-zero terms are due to the sign extension of the negative coefficients. The main drawback is that a conversion to two's complement is required at the end of a cycle, which slows down the ROM.
– Sign magnitude and inverted block.
• Difference encoding: reduces the size of the ROM core. If adjacent data values do not change significantly, the ROM core stores the difference between the data.
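A minimal sketch of the inverted-row encoding mentioned above, assuming the ROM content is given as rows of bits; the example rows are illustrative.

```python
# A row is stored inverted (with a one-bit flag) when more than half of its
# bits are non-zero, reducing the number of bit-line discharges on access.
def encode_rows(rows):
    encoded = []
    for row in rows:
        if sum(row) > len(row) // 2:                    # majority of bits are 1
            encoded.append((1, [b ^ 1 for b in row]))   # flag = 1, store inverted
        else:
            encoded.append((0, row))
    return encoded

def decode_row(flag, row):
    return [b ^ flag for b in row]

rows = [[1, 1, 1, 0], [0, 1, 0, 0]]
print(encode_rows(rows))   # first row stored inverted, second stored as-is
```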
SungKyunKwan Univ.
VADA Lab.
57
Low Power Memory
• Smaller ROMs: in a 102-tap filter, more than 70% of the coefficients are below 18 bits, yet the largest coefficient goes up to 24 bits. A better implementation can be achieved if the large coefficients are stored in a wide ROM with few addresses and the small coefficients are stored in a narrow ROM with many addresses. A similar approach is applied for locations in the ROM that are often accessed: locations that are accessed frequently are stored in a small, fast ROM, while the other locations are stored in a larger ROM.
• NMOS precharge: bit lines are precharged to Vdd − Vt. Drawbacks of this technique are degraded noise margins and the body-bias effect.
• Buffer sizing: a large set of buffers is required in the control logic to drive the address lines through the decoder, generate the control signals for the column multiplexers, drive the row lines, and drive the precharge signals.
• Voltage scaling:
T_delay ∝ C_L · Vdd / I = 2 · C_L · Vdd / (μ · C_ox · (W/L) · (Vdd − Vt)²)
SungKyunKwan Univ.
VADA Lab.
58
Memory Architecture
SungKyunKwan Univ.
VADA Lab.
59
Exploiting Locality for Low-Power Design
•A spatially local cluster: group of algorithm operations that are tightly
connected to each other in the flow graph representation.
• Two nodes are tightly connected to each other on the flow graph representation
if the shortest distance between them, in terms of number of edges traversed, is low.
•Power consumption (mW) in
the maximally time-shared
and fully-parallel versions of
the QMF sub-band coder filter
• Improvement of a factor of
10.5 at the expense of a 20%
increase in area
• The interconnect elements (buses, multiplexers, and buffers) consume 43% and 28% of the total power in the time-shared and parallel versions, respectively.
SungKyunKwan Univ.
VADA Lab.
60
Cascade filter layouts
(a) Non-local implementation from Hyper; (b) local implementation from Hyper-LP
SungKyunKwan Univ.
VADA Lab.
61
Stage-Skip Pipeline
• Power savings are achieved by stopping the instruction fetch and decode stages of the processor during loop execution, except for the loop's first iteration.
• DIB = decoded instruction buffer
• 40% power savings using a DSP or RISC processor.
SungKyunKwan Univ.
VADA Lab.
62
Stage-Skip Pipeline
•Selector: selects the output from
either the instruction decoder or
DIB
• The decoded instruction signals
for a loop are temporarily stored
in the DIB and are reused in each
iteration
of the loop.
•The power wasted in the
conventional pipeline is saved in
our pipeline by stopping the
instruction fetching and decoding
for each loop execution.
SungKyunKwan Univ.
VADA Lab.
63
Stage-Skip Pipeline
The majority of execution cycles in signal-processing programs are used for loop execution: 40% reduction in power with a 2% area increase.
SungKyunKwan Univ.
VADA Lab.
64
Parallel LIFO Scenario
SungKyunKwan Univ.
VADA Lab.
65
Parallel-serial Converter
SungKyunKwan Univ.
VADA Lab.
66
D- flip- flop Parallelization
SungKyunKwan Univ.
VADA Lab.
67
State Machine
SungKyunKwan Univ.
VADA Lab.
68
Frequency Multipliers and Dividers
SungKyunKwan Univ.
VADA Lab.
69
Data Reuse Exploration
•
•
•
MH(memory hierarchy) introduces copies of data from larger to smaller memories in
DFG.
Power consumption is decreased because data is now read mostly from smaller
memories, while it is increased because extra memory transfers are introduced.
Moreover, adding another layer of hierarchy has a negative effect on the area and
interconnect cost.
SungKyunKwan Univ.
VADA Lab.
70
Instruction Decoding
•
Architecture of Control Logic in
Microprocessor
– State transition diagram
– Binary code mapping
– Hardware implementation
[FSM diagram: states S0–Sn; combinational logic with primary inputs and outputs; a state register capturing the next state and feeding back the present state]
SungKyunKwan Univ.
VADA Lab.
71
Optimizing Power using Transformation
[Flow: the input flowgraph is transformed using local transformation primitives (associativity, distributivity, retiming, common sub-expression) and global transformation primitives (retiming, pipelining, look-ahead, associativity), guided by a search mechanism (simulated rejectionless, steepest descent, heuristics) and by power estimation, to produce the output flowgraph]
SungKyunKwan Univ.
VADA Lab.
72
Summary of Results
EXAMPLE     POWER REDUCTION   AREA INCREASE   OPTIMUM VOLTAGE
FIR11       11                1.1             1.5 V
DCT         8                 5               1.5 V
IIR7        7.5               6.4             1.4 V
VOLTERRA2   8.6               1               1.7 V
Optimum voltage for low-power is around 1.5V
SungKyunKwan Univ.
VADA Lab.
73
Data- flow based transformations
•
•
•
•
•
•
•
•
•
Tree Height reduction.
Constant and variable propagation.
Common subexpression elimination.
Code motion
Dead-code elimination
The application of algebraic laws such as commutativity, distributivity
and associativity.
Most of the parallelism in an algorithm is embodied in the loops.
Loop jamming, partial and complete loop unrolling, strength reduction
and loop retiming and software pipelining.
Retiming: maximize the resource utilization.
SungKyunKwan Univ.
VADA Lab.
74
Tree-height reduction
•Example of tree-height reduction
using commutativity and
associativity
SungKyunKwan Univ.
• Example of tree-height
reduction using
distributivity
VADA Lab.
75
Sub-expression elimination
• Logic expressions:
– Performed by logic optimization.
– Kernel-based methods.
•
Arithmetic expressions:
– Search for isomorphic patterns in the parse trees.
– Example:
a = x + y; b = a + 1; c = x + y;
becomes a = x + y; b = a + 1; c = a;
SungKyunKwan Univ.
VADA Lab.
76
Examples of other transformations
•
Dead-code elimination:
– a = x; b = x + 1; c = 2 * x;
– a = x; can be removed if a is not referenced.
• Operator-strength reduction:
– a = x^2; b = 3 * x;
– becomes a = x * x; t = x << 1; b = x + t;
• Code motion:
– for (i = 1; i < a * b) { }
– becomes t = a * b; for (i = 1; i < t) { }
SungKyunKwan Univ.
VADA Lab.
77
Strength reduction
[Expression-tree diagrams: X² + AX + B restructured as X(X + A) + B, and a similar restructuring for a higher-degree polynomial with coefficients A, B, C, reducing the number of multiplications (Horner form)]
SungKyunKwan Univ.
VADA Lab.
78
Strength Reduction
SungKyunKwan Univ.
VADA Lab.
79
Control- flow based transformations
• Model expansion:
– Expand subroutines to flatten the hierarchy.
– Useful to expand the scope of other optimization techniques.
– Problematic when a routine is called more than once.
– Example: x = a + b; y = a * b; z = foo(x, y);
with foo(p, q) { t = q - p; return t; }
– By expanding foo: x = a + b; y = a * b; z = y - x;
• Conditional expansion:
– Transform a conditional into parallel execution with the test at the end.
– Useful when the test depends on late signals.
– May preclude hardware sharing.
– Always useful for logic expressions.
– Example: y = ab; if (a) x = b + d; else x = bd;
can be expanded to x = a(b + d) + a'bd, and simplified to y = ab; x = y + d(a + b);
SungKyunKwan Univ.
VADA Lab.
80
Pipelining
SungKyunKwan Univ.
VADA Lab.
81
Associativity Transformation
SungKyunKwan Univ.
VADA Lab.
82
FIR Parallelization
SungKyunKwan Univ.
VADA Lab.
83
FIR PARALLELIZATION
SungKyunKwan Univ.
VADA Lab.
84
FIR Filter Parallelization
SungKyunKwan Univ.
VADA Lab.
85
FIR parallelization: two working phases
SungKyunKwan Univ.
VADA Lab.
86
IIR filter recursive function
SungKyunKwan Univ.
VADA Lab.
87
Recursive Function
SungKyunKwan Univ.
VADA Lab.
88
Interlaced Accumulation Programming for Low
Power
SungKyunKwan Univ.
VADA Lab.
89
4. Register Transfer Level Design
SungKyunKwan Univ.
VADA Lab.
90
FIR3 Block Diagram and Flow Graph
SungKyunKwan Univ.
VADA Lab.
91
High-Level Power Estimation
• P_core = P_DP + P_MEM + P_CNTR + P_PROC
• P_DP = P_REG + P_MUX + P_FU + P_INT, where P_REG is the power of the registers
• P_MUX is the power of the multiplexers
• P_FU is the power of the functional units
• P_INT is the power of the physical interconnect capacitance:
C_int = α · C_total / N, where α is the average activity (the total number of interconnect accesses multiplied by an average signal transition probability), C_total is the total estimated capacitance of the chip, and N is an estimate of the number of physical interconnects (HYPER).
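A minimal sketch of the interconnect term above in the HYPER style; the function name, parameters, and example numbers are illustrative, not from the slides.

```python
# C_int = alpha * C_total / N, then P = C_eff * Vdd^2 * f for the switched wires.
def interconnect_power(n_accesses, transition_prob, c_total, n_wires, vdd, f_sample):
    alpha = n_accesses * transition_prob      # average activity per sample period
    c_int = alpha * c_total / n_wires         # effective switched capacitance
    return c_int * vdd ** 2 * f_sample        # power of the interconnect

print(interconnect_power(n_accesses=200, transition_prob=0.25,
                         c_total=50e-12, n_wires=40, vdd=1.5, f_sample=1e6))
```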
SungKyunKwan Univ.
VADA Lab.
92
High-Level Power Estimation: PREG
• Compute the lifetimes of all the variables in the given VHDL code.
• Represent the lifetime of each variable as a vertical line from statement i through statement i + n in the column j reserved for the corresponding variable v_j.
• Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
• Estimate the minimal number N of sets of registers necessary to implement the code by using register sharing. Register sharing is applied whenever a group of variables has the same bit-width b_i.
• Select a possible mapping of variables into registers by using register sharing.
• Compute the number w_i of writes to the variables mapped to the same set of registers. Estimate n_i for each set of registers by dividing w_i by the number of statements S: n_i = w_i / S; hence TR_i^max = n_i · f_clk.
• Power in latches and flip-flops is consumed not only during output transitions, but also during all clock edges, by the internal clock buffers.
• The non-switching power P_NSK dissipated by the internal clock buffers accounts for 30% of the average power for 0.38-micron, 3.3 V operation.
• In total,
P_REG = Σ_{k=1}^{N} (P_k + P_NSK,k), where P_k = n_k · P_tk · TR_k and P_NSK,k = n_k · P_Nk · (f_clk − TR_k).
SungKyunKwan Univ.
VADA Lab.
93
PCNTR
• After scheduling, the control is defined and optimized by the hardware mapper, and further by the logic synthesis process before mapping to layout.
• Like interconnect, therefore, the control needs to be estimated statistically.
• Global control model:
C_FSM = α1 · N_states + α2; for a 1.2 μm technology, α1 is 4.9 fF and α2 is 22.1 fF.
The total number of transitions is strongly dependent on the number of states.
• Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.
C_lc = α0 + α1 · N_trans + α2 · N_states + α3 · B_f; for a 1.2 μm technology, α0 = 72, α1 = 0.15, α2 = 8.3, α3 = 0.55,
where N_trans is the number of transitions, N_states is the number of states, B_f is the bus factor, and C_lc is the capacitance switched in any local controller in one sample period. B_f is the ratio of the number of bus accesses to the number of busses.
SungKyunKwan Univ.
VADA Lab.
94
Ntrans
• The number of transitions depends on assignment, scheduling, optimizations, logic optimization, the standard cell library used, the amount of glitching, and the statistics of the inputs.
N_trans = β1 + β2 · (N_nodes + N_edges) + β3 · (S · N_Exu),
where N_trans is the number of transitions on the outputs of the local controllers, S is the number of control cycles per sample period, N_edges and N_nodes are the numbers of edges and nodes in the CDFG, and N_Exu is an estimate of the total number of execution units. For a 1.2 μm technology, β1 = 178.7, β2 = 7.2, β3 = 2.0.
SungKyunKwan Univ.
VADA Lab.
95
Behavioral Synthesis
• Loop unrolling: localizes the data to reduce the activity at the inputs of the functional units; two output samples are computed in parallel based on two input samples.
Y_{n-1} = X_{n-1} + A · Y_{n-2}
Y_n = X_n + A · Y_{n-1} = X_n + A · (X_{n-1} + A · Y_{n-2})
Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:
Y_{n-1} = X_{n-1} + A · Y_{n-2}
Y_n = X_n + A · X_{n-1} + A² · Y_{n-2}
The transformation yields a critical path of 3, so the voltage can be dropped.
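A minimal sketch of the unrolling above for the first-order recurrence y[n] = x[n] + A·y[n-1]; the coefficient value and input samples are illustrative.

```python
A = 0.5

def iir(x, y0=0.0):
    y, prev = [], y0
    for xn in x:
        prev = xn + A * prev          # y[n] = x[n] + A*y[n-1]
        y.append(prev)
    return y

def iir_unrolled(x, y0=0.0):          # assumes len(x) is even, for brevity
    A2 = A * A                         # constant propagated once
    y, prev = [], y0
    for i in range(0, len(x), 2):
        y1 = x[i] + A * prev
        y2 = x[i + 1] + A * x[i] + A2 * prev   # does not depend on y1 -> more parallelism
        y += [y1, y2]
        prev = y2
    return y

xs = [1.0, 2.0, 3.0, 4.0]
assert iir(xs) == iir_unrolled(xs)
```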
• Clock selection: choose the optimal system clock period, eliminate slacks, improve resource utilization, and enable greater voltage scaling.
• Module selection: for each operation, choose a library template.
• Flow-graph restructuring: pull operations off the critical cycle.
SungKyunKwan Univ.
VADA Lab.
96
High-Level Power Estimation: PMUX and PFU
SungKyunKwan Univ.
VADA Lab.
97
Critical Path
• Longest delay path from input to output in combinational logic
• Determines the operating clock frequency
• Resize transistors on non-critical paths (in-place optimization)
• Critical path in synchronous sequential logic:
t_cycle,min ≥ t_ff,max + t_logic,max + t_setup,max + t_skew,max
where t_cycle,min is the minimum clock period, t_ff,max the maximum flip-flop delay, t_logic,max the maximum critical-path delay of the combinational logic, t_setup,max the maximum flip-flop setup time, and t_skew,max the maximum clock skew.
[Diagram: two register-to-register paths (path A, path B) through combinational logic, with all flip-flops driven by clk]
SungKyunKwan Univ.
VADA Lab.
98
Loop Unrolling for Low Power
SungKyunKwan Univ.
VADA Lab.
99
Retiming
Flip-flop insertion to minimize hazard activity by moving a flip-flop within a circuit.
SungKyunKwan Univ.
VADA Lab.
100
Exploiting spatial locality for interconnect
power reduction
Global
Local
Adder1
Adder2
SungKyunKwan Univ.
VADA Lab.
101
Balancing maximal time-sharing and
fully-parallel implementation
A fourth-order
parallel-form
IIR filter
(a) Local assignment
(2 global transfers),
SungKyunKwan Univ.
(b) Non-local assignment
(20 global transfers)
VADA Lab.
102
Retiming/pipelining for Critical path
[Diagrams: a chain of biquad sections and a first-order section from input (in) to output (out).
(a) Minimal-area time-multiplexed solution (meeting the throughput constraint).
(b) "Fastest" solution obtained by retiming and pipelining, with delay (D) registers inserted between sections.]
Supply voltage can be reduced keeping throughput fixed.
SungKyunKwan Univ.
VADA Lab.
103
Effective Resource Utilization
[Dataflow graphs before and after retiming (operations numbered 1–8 with delays D), together with schedule tables showing which multipliers and adders are active in each cycle before and after retiming]
Can reduce interconnect capacitance.
SungKyunKwan Univ.
VADA Lab.
104
Hazard propagation elimination by
clocked sampling
By sampling a steady-state signal at a register input, no further glitches are propagated into the following combinational logic.
SungKyunKwan Univ.
VADA Lab.
105
Latched Retiming
SungKyunKwan Univ.
VADA Lab.
106
Latched retiming
SungKyunKwan Univ.
VADA Lab.
107
Regularity
•
Common patterns enable the design of less complex architecture and therefore
simpler interconnect structure (muxes, buffers, and buses). Regular designs
often have less control hardware.
[Dataflow graph with adders (A1, A2), a multiplier (M1), and a shifter (S1), mapped two ways:
(a) regular module assignment
(b) irregular module assignment, which requires extra multiplexers]
SungKyunKwan Univ.
VADA Lab.
108
Module Selection
•
•
•
•
Select the clock period, choose proper hardware modules for all operations (e.g., a Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that a minimal hardware cost is obtained under the given timing and throughput constraints.
Full pipelining can be ineffective because of clock-period mismatches with the execution times of the operators; performing operations in sequence without intermediate buffering can result in a reduction of the critical path.
By clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph can be maximized.
During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.
SungKyunKwan Univ.
VADA Lab.
109
Estimation
•
Estimate min and max bounds on the required resources to
–
–
•
•
delimit the design space min bounds to serve as an initial solution
serve as entries in a resource utilization table which guides the transformation,
assignment and scheduling operations
Max bound on execution time is tmax: topological ordering of DFG using ASAP
and ALAP
Minimum bounds on the number of resources for each resource class
Where NRi: the number of resources of class Ri
dRi : the duration of a single operation
ORi : the number of operations
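The bound itself is not reproduced in this transcript; given the symbols listed above, it is presumably the standard lower bound (a sketch, assuming an overall time budget of t_max control steps):

$$ N_{R_i} \;\ge\; \left\lceil \frac{O_{R_i} \cdot d_{R_i}}{t_{\max}} \right\rceil $$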
SungKyunKwan Univ.
VADA Lab.
110
Exploring the Design Space
•
•
•
•
•
Find the minimal-area solution subject to the timing constraints.
By checking the critical paths, determine whether the proposed graph violates the timing constraints. If so, retiming, pipelining, and tree-height reduction can be applied.
After an acceptable graph is obtained, the resource allocation process is initiated:
– change the available hardware (FUs, registers, busses)
– redistribute the time allocation over the sub-graphs
– transform the graph to reduce the hardware requirements.
Use a rejectionless probabilistic iterative search technique (a variant of simulated annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.
SungKyunKwan Univ.
VADA Lab.
111
Data path Synthesis
SungKyunKwan Univ.
VADA Lab.
112
Scheduling and Binding
•
•
•
•
The scheduling task selects the control step in which a given operation will happen, i.e., assigns each operation to an execution cycle.
Sharing: bind a resource to more than one operation.
– The operations must not execute concurrently.
The graph is scheduled hierarchically in a bottom-up fashion.
Power tradeoffs:
– Shorter schedules enable supply-voltage (Vdd) scaling
– The schedule directly impacts resource sharing
– Energy consumption depends on what the previous instruction was
– Reordering to minimize the switching on the control path
• Clock selection:
– Eliminate slacks
– Choose the optimal system clock period
SungKyunKwan Univ.
VADA Lab.
113
ASAP Scheduling
• Algorithm
SungKyunKwan Univ.
• HAL Example
VADA Lab.
114
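The ASAP algorithm itself appears only as a figure in the original slides; the following is a minimal sketch, assuming unit-delay operations and a DFG given as predecessor lists (the example graph is illustrative).

```python
# Each operation is placed in the earliest control step after all its predecessors.
def asap(preds):
    step = {}
    def visit(n):
        if n not in step:
            step[n] = 1 + max((visit(p) for p in preds[n]), default=0)
        return step[n]
    for n in preds:
        visit(n)
    return step

dfg = {"m1": [], "m2": [], "a1": ["m1", "m2"], "a2": ["a1"]}
print(asap(dfg))   # {'m1': 1, 'm2': 1, 'a1': 2, 'a2': 3}
```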
ALAP Scheduling
• Algorithm
SungKyunKwan Univ.
• HAL Example
VADA Lab.
115
Force Directed Scheduling
Used as a priority function.
Force is related to concurrency; operations are sorted for least force.
Mechanical analogy: Force = constant × displacement,
where the constant is the operation-type distribution and the displacement is the change in probability.
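A minimal sketch of the force computation implied above (self-force only; predecessor/successor forces are omitted for brevity). The distribution graph is the operation-type distribution and the change in scheduling probability is the displacement; the time frames and values are illustrative.

```python
# DG(t): expected number of operations of one type in each control step.
def distribution_graph(ops, n_steps):
    dg = [0.0] * (n_steps + 1)
    for lo, hi in ops:                       # each op's (ASAP, ALAP) time frame
        p = 1.0 / (hi - lo + 1)
        for t in range(lo, hi + 1):
            dg[t] += p
    return dg

def force(op, fixed_step, dg):
    lo, hi = op
    p_before = 1.0 / (hi - lo + 1)
    f = 0.0
    for t in range(lo, hi + 1):              # force = sum DG(t) * (delta probability)
        p_after = 1.0 if t == fixed_step else 0.0
        f += dg[t] * (p_after - p_before)
    return f

ops = [(1, 2), (1, 2), (2, 3)]               # three operations and their time frames
dg = distribution_graph(ops, 3)
print(force((1, 2), 1, dg), force((1, 2), 2, dg))   # pick the step with lower force
```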
SungKyunKwan Univ.
VADA Lab.
116
Force Directed Scheduling
SungKyunKwan Univ.
VADA Lab.
117
Example : Operation V6
SungKyunKwan Univ.
VADA Lab.
118
Force-Directed Scheduling
• Algorithm (Paulin)
SungKyunKwan Univ.
VADA Lab.
119
Force-Directed Scheduling Example
•
Probability of scheduling operations
into control steps
•
Operator cost for multiplications in a
•
Probability of scheduling operations
into control steps after operation o3 is
scheduled to step s2
•
Operator cost for multiplications in c
SungKyunKwan Univ.
VADA Lab.
120
List Scheduling
•
DFG with mobility labeling (inside <>)
•
ready operation list/resource constraint
SungKyunKwan Univ.
• The scheduled DFG
VADA Lab.
121
Static-List Scheduling
•
DFG
•
Partial schedule of five nodes
•
Priority list
The final
schedule
SungKyunKwan Univ.
VADA Lab.
122
Choosing Optimal Clock Period
SungKyunKwan Univ.
VADA Lab.
123
Supply Voltage Scaling
Lowering Vdd reduces
energy, but increase delays
SungKyunKwan Univ.
VADA Lab.
124
Scheduling Using Shut-Down: |a−b|
[Dataflow graphs for |a−b|: a comparison (a > b) selects, via multiplexers, which of the two subtractions (a−b or b−a) is needed; scheduling the comparison in an earlier control step allows the unneeded subtracter to be shut down.]
SungKyunKwan Univ.
VADA Lab.
125
Loop Scheduling
• Sequential Execution
• Partial loop unrolling
• Loop folding
SungKyunKwan Univ.
VADA Lab.
126
Loop folding
• Reduce execution delay of a
loop.
• Pipeline operations inside a loop.
• Overlap execution of
operations.
• Need a prologue and
epilogue.
• Use pipeline scheduling for loop
graph model.
SungKyunKwan Univ.
VADA Lab.
127
DFG Restructuring
• DFG2
SungKyunKwan Univ.
• DFG2 after redundant operation
insertion
VADA Lab.
128
Minimizing the bit transitions for constants
during Scheduling
SungKyunKwan Univ.
VADA Lab.
129
Control Synthesis
•Synthesize circuit that:
•Executes scheduled
operations.
•Provides
synchronization.
•Supports:
• Iteration.
• Branching.
• Hierarchy.
• Interfaces.
SungKyunKwan Univ.
VADA Lab.
130
Allocation
◆Bind a resource to more than one operation.
SungKyunKwan Univ.
VADA Lab.
131
Optimum binding
SungKyunKwan Univ.
VADA Lab.
132
Example
SungKyunKwan Univ.
VADA Lab.
133
RESOURCE SHARING
•
•
•
•
Parallel vs. time-shared buses (or execution units).
Resource sharing can destroy signal correlations and increase switching activity, so it should be done between operations that are strongly connected.
Map operations with correlated input signals to the same units.
Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify the interconnect (busses, multiplexers, buffers).
SungKyunKwan Univ.
VADA Lab.
134
Datapath interconnections
• Multiplexer-oriented datapath
SungKyunKwan Univ.
• Bus-oriented datapath
VADA Lab.
135
Sequential Execution
• Example of three micro-operations in the same clock period
SungKyunKwan Univ.
VADA Lab.
136
Insertion of Latch (out)
• Insertion of latches at the output ports of the functional units
SungKyunKwan Univ.
VADA Lab.
137
Insertion of Latch (in/out)
• Insertion of latches at both the input and output ports of the
functional units
SungKyunKwan Univ.
VADA Lab.
138
Overlapping Data Transfer(in)
• Overlapping read and write data transfers
SungKyunKwan Univ.
VADA Lab.
139
Overlapping of Data Transfer (in/out)
• Overlapping data transfer with functional-unit execution
SungKyunKwan Univ.
VADA Lab.
140
Register Allocation Using Clique Partitioning
• Scheduled DFG
• Lifetime intervals of variable
• Graph model
• Clique-partitioning solution
SungKyunKwan Univ.
VADA Lab.
141
Left-Edge Algorithm
• Register allocation using Left-Edge Algorithm
SungKyunKwan Univ.
VADA Lab.
142
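The left-edge algorithm referenced above appears only as a figure in the original slides; below is a minimal sketch, assuming variable lifetimes are given as (start, end) control-step intervals (the example intervals are illustrative).

```python
# Variables are sorted by their left edge; each is packed into the first register
# whose last assigned lifetime has already ended, otherwise a new register is opened.
def left_edge(lifetimes):
    registers = []
    for interval in sorted(lifetimes, key=lambda iv: iv[0]):
        for reg in registers:
            if reg[-1][1] <= interval[0]:    # previous value dead before this one is born
                reg.append(interval)
                break
        else:
            registers.append([interval])      # no free register -> allocate a new one
    return registers

print(left_edge([(1, 3), (2, 5), (3, 6), (5, 7)]))
# -> [[(1, 3), (3, 6)], [(2, 5), (5, 7)]]   (two registers suffice)
```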
Register Allocation: Left-Edge Algorithm
• Sorted variable lifetime intervals
SungKyunKwan Univ.
• Five-register allocation result
VADA Lab.
143
Register Allocation
•
•
•
•
•
•
•
Allocation : bind registers and functional modules to variables and
operations in the CDFG and specify the interconnection among
modules and registers in terms of MUX or BUS.
Reduce capacitance during allocation by minimizing the number of
functional modules, registers, and multiplexers.
Composite weight w.r.t transition activity and capacitance loads is
incorporated into CDFG.
Find the highest composite weight and merge the two nodes it joins,
i.e., maps the corresponding variable to the same register.
Allocation continues till no edges are left in the CDFG while updating
the composite weight values.
Set the maximum # of operations alive in any control step to be one.
Sequence operations/variables to enhance signal correlations
SungKyunKwan Univ.
VADA Lab.
144
Exploiting spatial locality for
interconnect power reduction
•
•
•
•
•
•
A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flowgraph representation.
Two nodes are tightly connected to each other in the flowgraph representation if the shortest distance between them, in terms of the number of edges traversed, is low.
A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware.
Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (over local buses) and relatively few occur between clusters (over global buses).
The partitioning information is passed to the architecture netlist and floorplanning tools.
Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to the other adder's inputs.
SungKyunKwan Univ.
VADA Lab.
145
Hardware Mapping
•
•
•
The last step in the synthesis process maps the allocated, assigned
and scheduled flow graph (called the decorated flow graph) onto the
available hardware blocks.
The result of this process is a structural description of the processor
architecture, (e.g., sdl input to the Lager IV silicon assembly
environment).
The mapping process transforms the flow graph into three structural
sub-graphs:
the data path structure graph
the controller state machine graph
the interface graph (between data path control inputs and the
controller output signals)
SungKyunKwan Univ.
VADA Lab.
146
Spectral Partitioning in High-Level Synthesis
•
•
•
•
•
•
•
The eigenvector placement obtained forms an ordering in which nodes tightly connected to each other are placed close together.
The relative distances are a measure of the tightness of the connections.
Use the eigenvector ordering to generate several partitioning solutions.
The area estimates are based on distribution graphs.
A distribution graph displays the expected number of operations executed in each time slot.
Local bus power: the number of local (intra-cluster) data transfers times the area of the cluster.
Global bus power: the number of global data transfers times the total area.
SungKyunKwan Univ.
VADA Lab.
147
Finding a good Partition
SungKyunKwan Univ.
VADA Lab.
148
Interconnection Estimation
•
•
•
For connection within a datapath (over-the-cell routing), routing between units
increases the actual height of the datapath by approximately 20-30% and that
most wire lengths are about 30-40% of the datapath height.
Average global bus length : square root of the estimated chip area.
The three terms represent white space, active area of the components, and
wiring area. The coefficients are derived statistically.
SungKyunKwan Univ.
VADA Lab.
149
Incorporating into HYPER-LP
SungKyunKwan Univ.
VADA Lab.
150
Experiments
SungKyunKwan Univ.
VADA Lab.
151
Datapath Generation
•
Register file recognition and the multiplexer reduction:
– Individual registers are merged as much as possible into register files
– reduces the number of bus multiplexers, the overall number of busses
(since all registers in a file share the input and output busses) and the
number of control signals (since a register file uses a local decoder).
•
•
•
Minimize the multiplexers and I/O buses simultaneously (clique partitioning is NP-complete, so simulated annealing is used).
Data-path partitioning is used to optimize the processor floorplan.
The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.
SungKyunKwan Univ.
VADA Lab.
152
Hardware Mapper
SungKyunKwan Univ.
VADA Lab.
153
Hyper's Basic Architecture Model
SungKyunKwan Univ.
VADA Lab.
154
Hyper's Crossbar Network
SungKyunKwan Univ.
VADA Lab.
155
Refined Architecture Model
SungKyunKwan Univ.
VADA Lab.
156
Bus Merging
SungKyunKwan Univ.
VADA Lab.
157
Fanin Bus Merging
SungKyunKwan Univ.
VADA Lab.
158
Fanout Bus merging
SungKyunKwan Univ.
VADA Lab.
159
Global bus Merging
SungKyunKwan Univ.
VADA Lab.
160
Test Example
SungKyunKwan Univ.
VADA Lab.
161
Control Signal Assignment
SungKyunKwan Univ.
VADA Lab.
162
Efficient High Level Synthesis Algorithm for
Lower Power Design
1998.5.19
Se-Jin Lim, Jun-Dong Cho
SungKyunKwan Univ.
VADA Lab.
Contents
• High-level synthesis
• Existing high-level low-power methods
• Minimum-cost allocation algorithm (minimum cost flow algorithm)
• Scheduling for low power
• Register and resource allocation method
• Experimental method and results
• Conclusion
SungKyunKwan Univ.
VADA Lab.
The Need for Low-Power Design
• The power consumption of ICs keeps increasing:
– growing number of transistors on a single chip
– growing functional complexity of circuits
– increasing clock speeds
• Systems that demand low power have recently emerged:
– battery-powered products such as portable cellular phones, pagers, notebook computers, PDAs, and LCDs
– ULSI microprocessors
– parallel computers
• Other factors:
– high cost of special cooling equipment and limited heat dissipation of circuits
– slow growth of battery lifetime
SungKyunKwan Univ.
VADA Lab.
High-Level Synthesis Flow
• System level: design specification → CDFG (control data flow graph)
• Behavioral level: low-power and fast scheduling (fast scheduling that enables resource sharing for low power)
• Architecture level: register allocation for low power (minimizing switching activity in registers); resource allocation for low power (minimizing switching activity in resources and interconnections)
• Logic/circuit level: datapath generation and controller synthesis
• Device/process level: write VHDL
SungKyunKwan Univ.
VADA Lab.
High-Level Synthesis
High-level synthesis maps a behavioral description of the circuit (e.g., a Verilog loop with clocked assignments such as for(I=0; I<=2; I=I+1) begin @(posedge clk); ... end) onto an RTL (register-transfer-level) architecture:
• instructions → scheduling (control)
• operations → hardware allocation (operators)
• variables → register sharing (registers)
• arrays → memory inferencing (memory)
• signals and constraints → control inferencing (controller, multiplexers)
SungKyunKwan Univ.
VADA Lab.
Basic High-Level Synthesis Flow
[Diagram: a CDFG containing +, *, and < operations is scheduled into control steps; resource allocation and module binding then map the operations onto components from a hardware library]
SungKyunKwan Univ.
VADA Lab.
Low-Power Methods Proposed at the High Level
• Operator sharing among sibling operations [Fang, 96]
• Resource sharing that considers data correlation [Gebotys, 97]
• Shutting down FUs (demand-driven operation) [Alidina, 94]
• Exploiting the regularity of operations [Rabaey, 96]
• Using dual supply voltages [Sarrafzadeh, 96]
• Minimizing spurious operations [Hwang, 96]
• Need to account for switching activity: depends on the sequence of operations/variables assigned to a resource
• Proposed algorithm: minimize switching activity using a minimum-cost flow algorithm, and minimize capacitance by simplifying the interconnection structure [Cho, 97]
SungKyunKwan Univ.
VADA Lab.
Register Power Model
Power(Register) = switching(x) · (C_out,Mux + C_in,Register) + switching(y) · (C_out,Register + C_in,DeMux)
Since switching(x) = switching(y), Power(Register) = switching(y) · C_total.
[Diagram: inputs i, j, k pass through a MUX into the register (signal x) and through a DeMUX to outputs i*, j*, k* (signal y), with control signals on both multiplexers]
SungKyunKwan Univ.
VADA Lab.
Determining the Number of Registers and Resources
[Scheduled DFG over four control steps using adders A1, A2 and multiplier M1; the lifetimes of variables a–h across the steps determine how many registers are needed]
SungKyunKwan Univ.
VADA Lab.
CDFG Representation of a Circuit
e = a + b; g = c + d; f = e + b; h = f * g;
[CDFG (control data flow graph): additions +1, +2, +3 and multiplication *1 producing h from inputs a, b, c, d]
SungKyunKwan Univ.
VADA Lab.
Minimizing Spurious Operations
[Two bindings of the same CDFG onto adders +1, +2 and multiplier *1: the first dissipates 2.14 mW (@25 MHz, 5 V), the second 1.49 mW (@25 MHz, 5 V)]
SungKyunKwan Univ.
VADA Lab.
173
Scheduling for Low Power
[Two schedules of the same operations over five control steps: one that ignores low power (adder A1 reused in consecutive steps) and one that considers low power (operations interleaved between adders A1 and A2 to reduce switching)]
SungKyunKwan Univ.
VADA Lab.
Weights and Capacities for Register Allocation
Input: V and a network for register sharing. Output: V disjoint paths.
C_ij = 1, B_ij = L; C_ij = 1, B_ij = W (as shown on the network edges)
W = −(M − Wa · N): edge weight
Wa: switching probability between the two nodes
N: scaling factor that converts the switching probability to an integer
M: an integer greater than or equal to the maximum value of Wa · N
L: an integer greater than or equal to the maximum value of W
V: number of registers
SungKyunKwan Univ.
VADA Lab.
Objective Function and Constraints of the Minimum-Cost Flow Algorithm
Minimize Z = Σ_ij B_ij · X_ij
subject to
Σ_k X_jk − Σ_i X_ij = V if j = s
Σ_k X_jk − Σ_i X_ij = 0 if j ≠ s, t
Σ_k X_jk − Σ_i X_ij = −V if j = t
0 ≤ X_ij ≤ C_ij
where s is the source, t the target (sink), i, j, k are nodes, C_ij the edge capacity, B_ij the edge cost, and V the amount of flow.
SungKyunKwan Univ.
VADA Lab.
Minimum-Cost Flow Algorithm
Step 1: flow = 0.
Step 2: Define modified costs B*_ij, determined by the flow currently on the network:
B*_ij = B_ij if 0 ≤ X_ij < C_ij
B*_ij = ∞ if X_ij = C_ij (flow saturation)
B*_ij = −B_ji if X_ji > 0
Step 3: Using the modified costs from Step 2, run a shortest-path algorithm from S to T and send δ units of flow along that path, where
δ = min(δ1, δ2), δ1 = min{C_ij − X_ij} over all forward edges, δ2 = min{X_ij} over all backward edges.
Step 4: Increase the current flow by δ and return to Step 2.
Step 5: Repeat until the amount of flow equals V.
SungKyunKwan Univ.
VADA Lab.
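A minimal sketch of the successive-shortest-path procedure above, assuming the network is given as a dictionary {(u, v): (capacity, cost)}; Bellman-Ford is used because residual costs can be negative. The example reuses the capacities and costs shown on the following slides and reproduces the minimum cost Z = 48, though the particular augmenting paths may differ from the slides.

```python
def min_cost_flow(edges, s, t, flow_target):
    nodes = {u for e in edges for u in e}
    cap = {e: c for e, (c, _) in edges.items()}
    cost = {e: w for e, (_, w) in edges.items()}
    for (u, v) in list(cap):                       # add empty reverse (residual) edges
        cap.setdefault((v, u), 0); cost.setdefault((v, u), -cost[(u, v)])
    flow, total_cost = 0, 0
    while flow < flow_target:
        dist = {n: float("inf") for n in nodes}; dist[s] = 0; prev = {}
        for _ in range(len(nodes) - 1):            # Bellman-Ford on the residual graph
            for (u, v), c in cap.items():
                if c > 0 and dist[u] + cost[(u, v)] < dist[v]:
                    dist[v] = dist[u] + cost[(u, v)]; prev[v] = u
        if dist[t] == float("inf"):
            break                                  # no augmenting path left
        path, n = [], t
        while n != s:
            path.append((prev[n], n)); n = prev[n]
        delta = min(flow_target - flow, min(cap[e] for e in path))
        for (u, v) in path:                        # push delta and update residuals
            cap[(u, v)] -= delta; cap[(v, u)] += delta; total_cost += delta * cost[(u, v)]
        flow += delta
    return flow, total_cost

edges = {("s", "1"): (2, 4), ("s", "2"): (2, 4), ("1", "2"): (1, 3),
         ("1", "t"): (8, 8), ("2", "t"): (8, 8)}
print(min_cost_flow(edges, "s", "t", 4))   # -> (4, 48), matching Z = 48 below
```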
Initial Network Graph Construction
[Network G with edge labels (capacity/cost): s→1: 2/4, s→2: 2/4, 1–2: 1/3, 1→T: 8/8, 2→T: 8/8; the residual graph Gr initially equals G and the flow graph Gf carries zero flow]
SungKyunKwan Univ.
VADA Lab.
Successive Shortest Paths: S–1–2–T, then S–1–T, then S–2–T, then S–2–1–T
[For each augmentation, the slides show the graph G, the residual graph Gr with modified costs, and the flow graph Gf after sending flow along the path]
Result: minimum cost Z = 48
SungKyunKwan Univ.
VADA Lab.
Forming the Network from the Register Compatibility Graph
[The compatibility graph over variables a–g is converted into a flow network from source s to sink T (before node splitting)]
SungKyunKwan Univ.
VADA Lab.
Node Splitting and Applying the Algorithm
[Each variable node v (a–g) is split into v and v* connected by a unit-capacity edge, so that at most one path passes through each variable; the minimum-cost flow algorithm is then applied to the resulting network]
SungKyunKwan Univ.
VADA Lab.
Result of Applying the Algorithm
PATH 1: S–a–e–f–T → REG1: {a, e, f}
PATH 2: S–b–T → REG2: {b}
PATH 3: S–c–g–T → REG3: {c, g}
PATH 4: S–d–T → REG4: {d}
SungKyunKwan Univ.
VADA Lab.
Resource Allocation for Low Power
When two operations (+1 and +2) share adder ADD1, the sequence of input pairs seen by the adder depends on the binding, e.g., (a,e)+(b,b) versus (a,b)+(b,e); the binding with the smaller switching activity is selected.
[Diagram: the two candidate bindings across control steps 0–2]
SungKyunKwan Univ.
VADA Lab.
Multiplexer Increase/Decrease Depending on Variable Storage
[If variables a and e are assigned to different registers, a multiplexer is needed at the ADD1 input; assigning them to the same register removes the multiplexer]
SungKyunKwan Univ.
VADA Lab.
Weights and Capacities for Resource Allocation
Input: V and a network for resource sharing. Output: V disjoint paths.
C_ij = 1, B_ij = L; C_ij = 1, B_ij = W
W = −[M − (Wa · N + Wmux · K)] (edge weight)
Wmux: interconnection-structure weight; K: normalization constant
Wmux = 0 if variables i and j are assigned to the same register and to the same input port of the module; Wmux = 1 if variables i and j are assigned to different registers and to the same input port of the module
V: number of resources
SungKyunKwan Univ.
VADA Lab.
Applying the Minimum-Cost Flow Algorithm to Resource Allocation
[The compatibility graph over additions +1, +2, +3 (edges labeled with the operand pairs they would share, e.g., (a–e, b–b) and (a–d, b–c)) is converted into a network; after node splitting and running the algorithm, PATH 1: S–1–2–T assigns +1 and +2 to adder 1, and PATH 2: S–3–T assigns +3 to adder 2]
SungKyunKwan Univ.
VADA Lab.
Final Datapath after Register and Resource Allocation
[Registers {a, e, f}, {b}, {c, g}, {d} feed adders +1 and +2 and multiplier *1, producing h]
SungKyunKwan Univ.
VADA Lab.
Experimental Flow
[Behavioral description of the circuit → simulation → flow-graph transformation → flow-graph database → estimation of minimum hardware bounds (e.g., 2 adders, 6 registers, 2 buses) → assignment/scheduling (adder and shifter usage per time step) → hardware mapping (registers, adder, shifter) → silicon compilation]
SungKyunKwan Univ.
VADA Lab.
Switching Rate Calculation (Hamming-distance ratio, Wsa)
• Functional simulation of the CDFG
• Exclusive-OR of the two successive variable values
• Normalized by the bit-width
• Faster to measure than at the logic or layout level
SungKyunKwan Univ.
VADA Lab.
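A minimal sketch of the Hamming-distance switching measure above, assuming the variable values are available from functional simulation; the sample values are illustrative.

```python
# Switching activity between two consecutive values of a variable:
# popcount(x XOR y) normalized by the bit-width.
def switching_ratio(x, y, bit_width):
    return bin((x ^ y) & ((1 << bit_width) - 1)).count("1") / bit_width

samples = [0x0F, 0x1F, 0x10]                 # values taken by one variable over time
pairs = zip(samples, samples[1:])
print(sum(switching_ratio(a, b, 8) for a, b in pairs) / (len(samples) - 1))
```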
Benchmark Circuit Characteristics

benchmark   mult(*)   add(+)   sub(-)   critical path   resource allocation
cascade     7         7        -        6               +(2), *(2), reg(7)
fir11       11        10       -        11              +(2), *(2), reg(7)
iir5        10        10       -        8               +(2), *(2), reg(7)
wave        7         8        3        8               +(2), *(2), sub(1), reg(7)

SungKyunKwan Univ.
VADA Lab.
Experimental Results
[Bar charts: power (mW) of the non-low-power vs. low-power implementations, and the resulting power reduction (%), for the Example, Cascade, Fir11, IIR, and Wave benchmark circuits]
SungKyunKwan Univ.
VADA Lab.
Datapath after placement & routing with Compass (using a 0.6-micron gate array)
SungKyunKwan Univ.
VADA Lab.
Conclusion
• Objective function based on Hamming distance to minimize switching activity
• Scheduling and resource allocation for low-power implementation
• Average power reduction of 15%
• Optimal results obtained within bounded time (polynomial-time, optimal-solution algorithm)
• Applicable to high-performance, low-power top-down high-level design (DSP, microcontroller, ASIC, etc.)
SungKyunKwan Univ.
VADA Lab.
Cascade Filter
[Signal-flow graph of the cascade filter: coefficient multiplications c0–c5, adders A1–A5, B, C1–C5, delay elements D, input IN[i], and output OUT]
SungKyunKwan Univ.
VADA Lab.
198
Cascade Filter Scheduling
[Schedule of the cascade filter over six control steps, showing which multiplications and additions execute in each step]
SungKyunKwan Univ.
VADA Lab.
199
Finite Impulse Response Filter
[Signal-flow graph of the 11-tap FIR filter: delayed inputs IN[i]…IN[i−10], coefficient multiplications c0–c10 producing B0–B9, and an adder chain A0–A9 producing OUT]
SungKyunKwan Univ.
VADA Lab.
200
FIR Scheduling
[Schedule of the FIR filter over 11 control steps, showing the multiplication and addition executed in each step]
SungKyunKwan Univ.
VADA Lab.
201
Infinite Impulse Response Filter
[Signal-flow graph of the IIR filter: coefficient multiplications c0–c9, adder/multiplier nodes A, C, and D, delay elements D, input IN, and output OUT]
SungKyunKwan Univ.
VADA Lab.
202
IIR Filter Scheduling
[Schedule of the IIR filter over eight control steps, showing the multiplications and additions executed in each step]
SungKyunKwan Univ.
VADA Lab.
203
References
[1] D. Gajski and N. Dutt, High-level Synthesis : Introduction to Chip and System
Design. Kluwer Academic Publishers, 1992.
[2] G. D. Micheli, Synthesis and Optimization of Digital Circuits. New York : McGraw
Hill. Inc, 1994.
[3] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-Power
CMOS digital
design", IEEE J. of Solid-State Circuits, pp. 473-484, 1992.
[4] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen,
"Optimizing power using transformation," IEEE Tr. on CAD/ICAS, pp. 12-31, Jan.
1995.
[5] E. Musoll and J. Cortadella, "Scheduling and resource binding for low power", Int'l
Symp. on System Synthesis, pp. 104-109, Apr. 1995.
[6] Y. Fang and A. Albicki, "Joint scheduling and allocation for low
power," in Proc.
of Int'l Symp. on Circuits & Systems, pp. 556-559, May. 1996.
[7] J. Monteiro and Pranav Ashar, "Scheduling techniques to enable
power
management", 33rd Design Automation Conference, 1996.
[8] R. S. Martin, J. P. Knight, "Optimizing Power in ASIC Behavioral
Synthesis",
IEEE Design & Test of Computers, pp. 58-70, 1995.
SungKyunKwan Univ.
VADA Lab.
204
[9] R. Mehra, J. Rabaey, "Exploting Regularity for Low Power Design", IEEE Custom Integrated
Circuits Conference, pp.177-182. 1996.
[10] A. Chandrakasan, T. Sheng, and R. W. Brodersen, "Low Power CMOS Digital Design",
Journal of Solid State Circuits, pp. 473-484, 1992.
[11] R. Mehra and J. Rabaey, "Behavioral level power estimation and
exploration," in Proc.
of Int'l Symp. on Low Power Design, pp. 197-202, Apr. 1994.
[12] A. Raghunathan and N. K. Jha, "An iterative improvement algorithm for low power data
path synthesis," in Proc. of Int'l Conf. on Computer-Aided Design, pp. 597-602, Nov. 1995.
[13] R. Mehra, J. Rabaey, "Low power architectural synthesis and the impact of exploiting
locality," Journal of VLSI Signal Processing, 1996.
[14] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, "Predictive system shutdown
and other architectural techniques for energy efficient programmable computation," IEEE Tr.
on VLSI Systems, pp. 42-55, Mar. 1996.
[15] A. Abnous and J. M. Rabaey, "Ultra low power domain specific multimedia processors," in
Proc. of IEEE VLSI Signal Processing Workshop, Oct. 1996.
SungKyunKwan Univ.
VADA Lab.
205
[16] M. C. Mcfarland, A. C. Parker, R. Camposano, "The high level synthesis of digital systems,"
Proceedings of the IEEE. Vol 78. No 2 , February, 1990.
[17] A. Chandrakasan, S. Sheng, R. Brodersen, "Low power CMOS digital design,", IEEE Solid
State Circuit, April, 1992.
[18] A. Chandrakasan, R. Brodersen, "Low power digital CMOS design, Kluwer Academic
Publishers, 1995.
[19] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, M. Papaefthymiou, "Precomputation based
sequential logic optimization for low power," IEEE International Conference
on
Computer Aided Design, 1994.
[20] J. Monteiro, S. Devadas and A. Ghosh, "Retiming sequential circuits for low power," In
Proceeding of the IEEE International Conference on Computer Aided Design, November,
1993.
[21] F. J. Kurdahi, A. C. Parker, "REAL: A Program for Register Allocation," in Proc. of the 24th
Design Automation Conference, ACM/IEEE, June. pp. 210-215, 1987.
SungKyunKwan Univ.
VADA Lab.
206
5. Partitioning
SungKyunKwan Univ.
VADA Lab.
207
Partitioning in VLSI CAD
• Partitioning is a technique widely used to solve diverse problems occurring in VLSI CAD. Applications of partitioning can be found in logic synthesis, logic optimization, testing, and layout synthesis.
• High-quality partitioning is critical in high-level synthesis. To be useful, high-level synthesis algorithms should be able to handle very large systems. Typically, designers partition high-level design specifications manually into procedures, each of which is then synthesized individually. However, logic decomposition of the design into procedures may not be appropriate for high-level and logic-level synthesis [60]. Different partitionings of the high-level specifications may produce substantial differences in the resulting IC chip areas and overall system performance.
• Some technology mapping programs use partitioning techniques to map a circuit specified as a network of modules performing simple Boolean operations onto a network composed of specific modules available in an FPGA.
SungKyunKwan Univ.
VADA Lab.
208
Partitioning in VLSI CAD
• Since the test generation problem for large circuits may be extremely computationally intensive, circuit partitioning may provide the means to speed it up. Generally, the problem of test pattern generation is NP-complete. To date, all test generation algorithms that guarantee finding a test for a given fault exhibit worst-case behavior requiring CPU times that increase exponentially with the circuit size. If the circuit can be partitioned into k parts (k not fixed), each of bounded size c, then the worst-case test generation time grows only linearly with the circuit size.
• Partitioning is often utilized in layout synthesis to produce and/or improve the placement of the circuit modules. Partitioning is used to find strongly connected subcircuits in the design, and the resulting information is utilized by some placement algorithms to place in mutual proximity components belonging to such subcircuits, thus minimizing delays and routing lengths.
SungKyunKwan Univ.
VADA Lab.
209
Partitioning in VLSI CAD
• Another important class of partitioning problems occurs at the system design level. Since IC packages can hold only a limited number of logic components and external terminals, the components must be partitioned into subcircuits small enough to be implemented in the available packages.
• Partitioning has been used as well to estimate some properties of physical IC designs, such as the expected IC area.
SungKyunKwan Univ.
VADA Lab.
210
Circuit Partitioning
• The early attempts to solve the circuit partitioning problem were based on the representation of the circuit as a graph G = (V,E), where V is a set of nodes (vertices) representing the fundamental components, such as gates, flip-flops, inputs and outputs, and E is a set of edges representing nets present in the network. Graph partitioning problems representing VLSI design problems usually involve separating the set of the graph nodes into disjoint subsets while optimizing some objective function defined on the graph vertices and edges. In the partitioned graph, edges can be divided into two classes: inter-subset edges, whose vertices belong to different subsets, and intra-subset edges, whose vertices belong to the same subset. The objective functions associated with graph partitioning problems usually treat these classes of edges in different ways.
• One classic graph partitioning problem is the minimum cut (mincut) problem. Its objective is to divide V into two disjoint parts, U and W, such that the number of inter-subset edges is minimized. The set e(U,W) is referred to as the cut set, and the number of edges in the cut set as the cut value.
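As a small illustration of these definitions, the following Python sketch (the function name and graph are purely illustrative, not from any tool mentioned here) counts the cut value of a two-way partition given as an edge list.

    # Minimal sketch: cut value of a bipartition (U, W) of a graph given as an edge list.
    def cut_value(edges, U):
        """edges: iterable of (u, v) pairs; U: set of nodes in the first part."""
        U = set(U)
        return sum(1 for u, v in edges if (u in U) != (v in U))

    # Example: a 4-node ring partitioned into {0, 1} and {2, 3} has cut value 2.
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(cut_value(ring, {0, 1}))  # -> 2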
SungKyunKwan Univ.
VADA Lab.
211
Circuit Partitioning
• graph and physical
representation
SungKyunKwan Univ.
VADA Lab.
212
VHDL example
process communication
Behavioral description
control/data flow graph
SungKyunKwan Univ.
VADA Lab.
213
Mincut Partitioning
• An exact solution to the mincut problem was provided by Ford and Fulkerson [11], who transformed the mincut problem into the maximum flow (maxflow) problem. The maxflow-mincut algorithm finds a maximum flow in a network; the maxflow value is equal to the mincut value. The first heuristic algorithm for a two-way graph partitioning into equal-sized subsets was proposed by Kernighan and Lin. Their method consists of choosing an initial partition randomly and reducing the cut value by exchanging appropriately selected pairs of nodes from the subsets. After exchanging the positions, nodes are locked in new positions. In subsequent steps, pairs of unlocked nodes are selected and exchanged until all nodes are locked. The execution of the algorithm stops when it reaches a local minimum.
• Most nets in digital circuits are multi-point connections among more than two modules (logic gates, flip-flops, etc.). Therefore, modeling VLSI circuit partitioning problems as graph partitioning problems may lead to poor results caused by inadequate representation of multi-point nets, which have to be decomposed into two-point connections. One way to approximate circuit partitioning problems is to transform the circuit into a weighted graph G' representation via a net model. For example, a multi-point net connecting n nodes may be modeled as a complete graph (clique) spanned on these nodes, i.e., containing all possible edges among these nodes.
SungKyunKwan Univ.
VADA Lab.
214
Clustering (Cont’d)
• Clustering based on criterion B below the first cut-line,
then criterion A
• Clustering based on criterion A below the second cut-line,
then criterion B
SungKyunKwan Univ.
VADA Lab.
215
Clustering Example
• Two-cluster Partition
• Three-cluster Partition
SungKyunKwan Univ.
VADA Lab.
216
Complexity of Partitioning
In general, computing the optimal
partitioning is an NP-complete problem,
which means that the best known algorithms
take time which is an exponential function of
n=|N| and p, and it is widely believed that no
algorithm whose running time is a
polynomial function of n=|N| and p exists
(see ``Computers and Intractability'', M.
Garey and D. Johnson, W. H. Freeman, 1979,
for details.) Therefore we need to use
heuristics to get approximate solutions for
problems where n is large. The picture below
illustrates a larger graph partitioning problem;
it was generated using the spectral
partitioning algorithm as implemented in the
graph partitioning software by Gilbert et al,
described below. The partition is N = Nblue U
Nblack, with red edges connecting nodes in the
two partitions.
SungKyunKwan Univ.
VADA Lab.
217
Edge Separator and Vertex Separator
Bisecting a graph G=(N,E) can be done in two
ways. In the last section, we discussed finding the
smallest subset Es of E such that removing Es
from E divided G into two disconnected subgraphs
G1 and G2, with nodes N1 and N2 respectively,
where N1 U N2 = N and N1 and N2 are disjoint
and equally large. (If the number of nodes is odd,
we obviously cannot make |N1|=|N2|. So we will
call Es an edge separator if |N1| and |N2| are
sufficiently close; we will be more explicit about
how different |N1| and |N2| can be only when
necessary.) The edges in Es connect nodes in N1
to nodes in N2. Since removing Es disconnects G,
Es is called an edge separator. The other way to
bisect a graph is to find a vertex separator, a
subset Ns of N, such that removing Ns and all
incident edges from G also results in two
disconnected subgraphs G1 and G2 of G. In other
words N = N1 U Ns U N2, where all three subsets
of N are disjoint, N1 and N2 are equally large, and
no edges connect N1 and N2.
SungKyunKwan Univ.
The following figure illustrates these ideas. The
green edges, Es1, form an edge separator, as well
as the blue edges Es2. The red nodes, Ns, are a
vertex separator, since removing them and the
incident edges (Es1, Es2, and the purple edges),
leaves two disjoint subgraphs.
Theorem. (Lipton, Tarjan, "A separator theorem for planar graphs",
SIAM J. Appl. Math., 36:177-189, April 1979). Let G=(N,E) be a
planar graph. Then we can find a vertex separator Ns, so that N =
N1 U Ns U N2 is a disjoint partition of N, |N1| <= (2/3)*|N|, |N2|
<= (2/3)*|N|, and |Ns| <= sqrt(8*|N|).
VADA Lab.
218
Kernighan and Lin Algorithm
• B. Kernighan and S. Lin ("An efficient heuristic procedure for partitioning graphs", The Bell System Technical Journal, pp. 291-308, Feb. 1970) proposed an algorithm which takes O(|N|^3) time per iteration. A more complicated and efficient implementation, which takes only O(|E|) time per iteration, was presented by C. Fiduccia and R. Mattheyses, "A linear-time heuristic for improving network partitions", Technical Report 82CRD130, General Electric Co., Corporate Research and Development Center, Schenectady, NY, 1982.
• We start with an edge-weighted graph G=(N,E,WE) and a partitioning G = A U B into equal parts: |A| = |B|. Let w(e) = w(i,j) be the weight of edge e=(i,j), where the weight is 0 if no edge e=(i,j) exists. The goal is to find equal-sized subsets X in A and Y in B, such that exchanging X and Y reduces the total cost of edges from A to B. More precisely, we let T = sum[ a in A and b in B ] w(a,b) = cost of edges from A to B, and seek X and Y such that new_A = A - X U Y and new_B = B - Y U X has a lower cost new_T. To compute new_T efficiently, we introduce:
  E(a) = external cost of a = sum[ b in B ] w(a,b)
  I(a) = internal cost of a = sum[ a' in A, a' != a ] w(a,a')
  D(a) = cost of a = E(a) - I(a), and analogously
  E(b) = external cost of b = sum[ a in A ] w(a,b)
  I(b) = internal cost of b = sum[ b' in B, b' != b ] w(b,b')
  D(b) = cost of b = E(b) - I(b)
  Then it is easy to show that swapping a in A and b in B changes T to new_T = T - ( D(a) + D(b) - 2*w(a,b) ) = T - gain(a,b).
  In other words, gain(a,b) = D(a) + D(b) - 2*w(a,b) measures the improvement in the partitioning obtained by swapping a and b. D(a') and D(b') also change to
  new_D(a') = D(a') + 2*w(a',a) - 2*w(a',b) for all a' in A, a' != a
  new_D(b') = D(b') + 2*w(b',b) - 2*w(b',a) for all b' in B, b' != b
SungKyunKwan Univ.
VADA Lab.
219
Kernighan and Lin Algorithm
(0) Compute T = cost of partition N = A U B          ... cost = O(|N|^2)
Repeat
  (1) Compute costs D(n) for all n in N              ... cost = O(|N|^2)
  (2) Unmark all nodes in G                          ... cost = O(|N|)
  (3) While there are unmarked nodes                 ... |N|/2 iterations
      (3.1) Find an unmarked pair (a,b) maximizing gain(a,b)          ... cost = O(|N|^2)
      (3.2) Mark a and b (but do not swap them)                       ... cost = O(1)
      (3.3) Update D(n) for all unmarked n, as though a and b had been swapped   ... cost = O(|N|)
      End while
  ... At this point, we have computed a sequence of pairs (a1,b1), ..., (ak,bk)
  ... and gains gain(1), ..., gain(k), where k = |N|/2, ordered by the order in which we marked them
  (4) Pick j maximizing Gain = sum_{i=1..j} gain(i)
      ... Gain is the reduction in cost from swapping (a1,b1), ..., (aj,bj)
  (5) If Gain > 0 then
      (5.1) Update A = A - {a1,...,aj} U {b1,...,bj}   ... cost = O(|N|)
      (5.2) Update B = B - {b1,...,bj} U {a1,...,aj}   ... cost = O(|N|)
      (5.3) Update T = T - Gain                        ... cost = O(1)
      End if
Until Gain <= 0
SungKyunKwan Univ.
VADA Lab.
220
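To make the gain bookkeeping above concrete, here is a minimal Python sketch of a single Kernighan-Lin pass on a small weighted graph (dictionary-of-dictionaries adjacency, assumed symmetric; all names are illustrative, not from a particular tool). It repeatedly picks the best unmarked pair and then applies the prefix of swaps with the largest cumulative gain.

    # Minimal sketch of one Kernighan-Lin pass (illustrative, not optimized).
    def kl_pass(w, A, B):
        """w[u][v] = edge weight (0 if absent, symmetric); A, B = current bipartition."""
        def wt(u, v):
            return w.get(u, {}).get(v, 0)
        def D(n, own, other):
            return sum(wt(n, x) for x in other) - sum(wt(n, x) for x in own if x != n)
        A, B = set(A), set(B)
        seq, gains = [], []
        ua, ub = set(A), set(B)                 # unmarked nodes
        while ua and ub:
            g, a, b = max(((D(a, A, B) + D(b, B, A) - 2 * wt(a, b), a, b)
                           for a in ua for b in ub), key=lambda t: t[0])
            seq.append((a, b)); gains.append(g)
            ua.discard(a); ub.discard(b)
            # Pretend a and b were swapped for the remaining D computations.
            A.discard(a); A.add(b); B.discard(b); B.add(a)
        # Choose the prefix of swaps with the largest cumulative gain.
        prefix, best, run = 0, 0, 0
        for j, g in enumerate(gains, 1):
            run += g
            if run > best:
                best, prefix = run, j
        return seq[:prefix], best

The caller would apply the returned swaps to the original partition and repeat passes until the best cumulative gain is no longer positive, as in the pseudocode above.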
Spectral Partitioning
This is a powerful but expensive technique,
based on techniques introduced by Fiedler in
the 1970s, but popularized in 1990 by A.
Pothen, H. Simon, and K.-P. Liou,
"Partitioning sparse matrices with
eigenvectors of graphs", SIAM J. Matrix
Anal. Appl., 11:430--452. We will first
describe the algorithm, and then give three
related justifications for its efficacy. Let
G=(N,E) be an undirected, unweighted
graph without self edges (i,i) or multiple
edges from one node to another. We define
two matrices related to this graph.
Definition The incidence matrix In(G) of G
is an |N|-by-|E| matrix, with one row for each
node and one column for each edge.
Suppose edge e=(i,j). Then column e of In(G)
is zero except for the i-th and j-th entries,
which are +1 and -1, respectively.
SungKyunKwan Univ.
Note that there is some ambiguity in this
definition, since G is undirected; writing edge
e=(i,j) instead of (j,i) is equivalent to
multiplying
column e of In(G) by -1. We will see that this
ambiguity will not be important to us.
Definition The Laplacian matrix L(G) of G is
an |N|-by-|N| symmetric matrix, with one row
and column for each node. It is defined as
follows.
(L(G))(i,j) = degree of node i if i=j (number
of incident edges) = -1 if i!=j and there is an
edge (i,j)
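The following short numpy sketch (assumed helper, not part of any referenced package) builds the Laplacian from an edge list and bisects the graph by the sign of the eigenvector of the second smallest eigenvalue, anticipating the spectral bisection algorithm described later.

    # Minimal sketch: graph Laplacian and bisection by the sign of the Fiedler vector.
    import numpy as np

    def spectral_bisect(n, edges):
        """n = number of nodes (0..n-1); edges = list of (i, j) pairs of an undirected graph."""
        L = np.zeros((n, n))
        for i, j in edges:
            L[i, i] += 1; L[j, j] += 1
            L[i, j] -= 1; L[j, i] -= 1
        vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
        v2 = vecs[:, 1]                          # eigenvector of the second smallest eigenvalue
        return [i for i in range(n) if v2[i] < 0], [i for i in range(n) if v2[i] >= 0]

    # Example: two triangles joined by one edge split cleanly into the two triangles.
    print(spectral_bisect(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]))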
VADA Lab.
221
Spatial Locality: Hardware Partitioning
• The interface logic should be properly partitioned for area and timing reasons. Minimization of global busses leads to lower bus capacitance, and thus lower interconnect power.
• Signal values within the clusters tend to be more highly correlated.
• The data path should be partitioned into parts of approximately equal size.
• In the DSP area, data paths tend to occupy far more area than the control paths.
• Wiring is still one of the dominant area consumers.
• The method used to identify clusters is based on the eigenvalues and eigenvectors of the Laplacian of the graph.
• The eigenvector corresponding to the second smallest eigenvalue provides a 1-D placement of the nodes which minimizes the mean-squared connection length.
SungKyunKwan Univ.
VADA Lab.
222
Spectral Partitioning in VLSI placement
SungKyunKwan Univ.
VADA Lab.
223
Spectral Partitioning in VLSI placement
• Setting the derivative of the Lagrangian, L, to zero gives:
  (Q - \lambda I) x = 0
• The solutions to the above equation are the eigenvalues \lambda of Q and the corresponding eigenvectors x.
• The smallest eigenvalue, 0, gives a trivial solution with all nodes at the same point. The eigenvector corresponding to the second smallest eigenvalue minimizes the cost function while giving a non-trivial solution.
SungKyunKwan Univ.
VADA Lab.
224
Key Ideas in Spectral Partitioning
SungKyunKwan Univ.
VADA Lab.
225
Spectral Partitioning
SungKyunKwan Univ.
VADA Lab.
226
Spectral Partitioning
The following theorem states some important facts about In(G) and L(G). It introduces the idea that the eigenvalues and eigenvectors of L(G) are related to the connectivity of G.
Theorem 1. Given a graph G, its associated matrices In(G) and L(G) have the following properties.
1. L(G) is a symmetric matrix. This means the eigenvalues of L(G) are real, and its eigenvectors are real and orthogonal.
2. Let e=[1,...,1]', where ' means transpose, i.e. the column vector of all ones. Then L(G)*e = 0.
3. In(G)*(In(G))' = L(G). This is independent of the signs chosen in each column of In(G).
4. Suppose L(G)*v = lambda*v, where v is nonzero. Then
   lambda = norm(In(G)'*v)^2 / norm(v)^2
          = ( sum over all edges e=(i,j) of (v(i) - v(j))^2 ) / ( sum_i v(i)^2 ),
   where norm(z)^2 = sum_i z(i)^2.
5. The eigenvalues of L(G) are nonnegative: 0 <= lambda1 <= lambda2 <= ... <= lambdan.
6. The number of connected components of G is equal to the number of eigenvalues lambda_i equal to 0. In particular, lambda2 != 0 if and only if G is connected.
SungKyunKwan Univ.
VADA Lab.
227
Spectral Partitioning
Compute the eigenvector v2 corresponding to lambda2 of L(G)
for each node n of G
    if v2(n) < 0
        put node n in partition N-
    else
        put node n in partition N+
    endif
endfor
First we show that this partition is at least
reasonable, because it tends to give
connected components N- and N+:
Theorem 2. (M. Fiedler, "A property of
eigenvectors of nonnegative symmetric
matrices and its application to graph
theory", Czech.Math. J. 25:619--637,
1975.) Let G be connected, and N- and
N+ be defined by the above algorithm.
Then N- is connected. If no v2(n) = 0,
N+ is also connected.
SungKyunKwan Univ.
There are a number of reasons lambda2 is called
the algebraic connectivity. Here is another.
Theorem 3. (Fiedler). Let G=(N,E) be a graph,
and G1=(N,E1) a subgraph, i.e. with the same
nodes and subset of the edges, so that G1 is "less
connected" than G. Then lambda2(L(G1)) <=
lambda2(L(G)), i.e. the algebraic connectivity of
G1 is also less than or equal to the algebraic
connectivity of G.
Motivation for spectral bisection, by analogy with
a vibrating string
How does a taut string vibrate when it is plucked?
From our background in either physics or music,
we know that it has certain modes of vibration or
harmonics. If we were to take snapshots of these
modes, they would look like this:
VADA Lab.
228
Spectral Partitioning
SungKyunKwan Univ.
VADA Lab.
229
Multilevel Kernighan-Lin
Gc is computed in step (1) of Recursive_partition as follows. We define a matching of a graph G=(N,E) as a subset Em of the edges E with the property that no two edges in Em share an endpoint. A maximal matching is one to which no more edges can be added and remain a matching. We can compute a maximal matching by a simple random algorithm:

let Em be empty
mark all nodes in N as unmatched
for i = 1 to |N|   ... visit the nodes in a random order
    if node i has not been matched,
        choose an edge e=(i,j) where j is also unmatched, and add it to Em
        mark i and j as matched
    end if
end for

Given a matching, Gc is computed as follows. We let there be a node r in Nc for each edge in Em. Then we construct Ec as follows:

for r = 1 to |Em|   ... for each node in Nc
    let (i,j) be the edge in Em corresponding to node r
    for each other edge e=(i,k) in E incident on i
        let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
        add the edge (r,rk) to Ec
    end for
    for each other edge e=(j,k) in E incident on j
        let ek be the edge in Em incident on k, and let rk be the corresponding node in Nc
        add the edge (r,rk) to Ec
    end for
end for
if there are multiple edges between pairs of nodes of Nc, collapse them into single edges
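A minimal Python sketch of this coarsening step follows (illustrative names only): it builds a random maximal matching and then the coarse graph, collapsing parallel edges as described above.

    # Minimal sketch: random maximal matching Em, then the coarse graph it induces.
    import random

    def coarsen(nodes, edges):
        adj = {n: set() for n in nodes}
        for i, j in edges:
            adj[i].add(j); adj[j].add(i)
        order = list(nodes)
        random.shuffle(order)                     # visit the nodes in a random order
        matched, Em = set(), []
        for i in order:
            if i in matched:
                continue
            free = [j for j in adj[i] if j not in matched]
            if free:
                j = random.choice(free)
                Em.append((i, j)); matched.update((i, j))
        # One coarse node per matched edge, plus one per node left unmatched.
        coarse_of, next_id = {}, 0
        for i, j in Em:
            coarse_of[i] = coarse_of[j] = next_id; next_id += 1
        for n in nodes:
            if n not in coarse_of:
                coarse_of[n] = next_id; next_id += 1
        # Multiple fine edges between the same coarse pair collapse into one coarse edge.
        Ec = {tuple(sorted((coarse_of[i], coarse_of[j])))
              for i, j in edges if coarse_of[i] != coarse_of[j]}
        return Em, coarse_of, sorted(Ec)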
SungKyunKwan Univ.
VADA Lab.
230
Multilevel Kernighan-Lin
Note that we can take node weights into account by letting the weight of the node in Nc formed from edge (i,j) be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS, and is illustrated on the right.
SungKyunKwan Univ.
VADA Lab.
231
Multilevel Kernighan-Lin
Given a partition (Nc+,Nc-) from step
(2) of Recursive_partition, it is
easily expanded to a partition
(N+,N-) in step (3) by associating
with each node in Nc+ or Nc- the
nodes of N that comprise it. This is
again shown below:
Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.
SungKyunKwan Univ.
VADA Lab.
232
Multilevel Spectral Partitioning
Now we turn to the divide-and-conquer
algorithm of Barnard and Simon, which is
based on spectral partitioning rather than
Kernighan-Lin. The expensive part of
spectral bisection is finding the eigenvector
v2, which requires a possibly large number
of matrix-vector multiplications with the
Laplacian matrix L(G) of the graph G. The
divide-and-conquer approach of
Recursive_partition will dramatically
decrease the cost. Barnard and Simon
perform step (1) of Recursive_partition,
computing Gc = (Nc,Ec) from G=(N,E),
slightly differently than above: They find a
maximal independent subset Nc of N. This
means that N contains Nc and E contains
Ec, no nodes in Nc are directly connected
by edges in E (independence), and Nc is as
large as possible (maximality).
SungKyunKwan Univ.
There is a simple "greedy" algorithm for
finding an Nc:
Nc = empty set
for i = 1 to |N|
if node i is not adjacent to any node
already in Nc
add i to Nc
end if
end for
This is shown below in the case where G is
simply a chain of 9 nodes with nearest
neighbor connections, in which case Nc
consists simply of every other node of N.
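The greedy construction above is easy to express directly; the following Python sketch (hypothetical helper, not from METIS or the Barnard-Simon code) reproduces the chain-of-9-nodes example.

    # Minimal sketch of the greedy maximal independent set used to pick the coarse nodes Nc.
    def greedy_mis(nodes, edges):
        adj = {n: set() for n in nodes}
        for i, j in edges:
            adj[i].add(j); adj[j].add(i)
        Nc = set()
        for n in nodes:                       # visit nodes in the given order
            if not (adj[n] & Nc):             # n is not adjacent to any node already in Nc
                Nc.add(n)
        return Nc

    # Example: a chain of 9 nodes keeps every other node, as described above.
    chain = [(i, i + 1) for i in range(8)]
    print(sorted(greedy_mis(range(9), chain)))   # -> [0, 2, 4, 6, 8]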
VADA Lab.
233
hMETIS
• hMETIS is a set of programs for partitioning hypergraphs such as those corresponding to VLSI circuits. The algorithms implemented by hMETIS are based on the multilevel hypergraph partitioning scheme described in [KAKS97].
• hMETIS produces bisections that cut 10% to 300% fewer hyperedges than those cut by other popular algorithms such as PARABOLI, PROP, and CLIP-PROP, especially for circuits with over 100,000 cells and circuits with non-unit cell areas. It is extremely fast: a single run of hMETIS is faster than a single run of simpler schemes such as FM, KL, or CLIP. Furthermore, because of its very good average cut characteristics, it produces high-quality partitionings in significantly fewer runs. It can bisect circuits with over 100,000 vertices in a couple of minutes on Pentium-class workstations.
• The performance of hMETIS on the new ISPD98 benchmark suite can be found in the paper by Chuck Alpert.
 http://www.users.cs.umn.edu/~karypis/metis/metis.html
SungKyunKwan Univ.
VADA Lab.
234
How good is Recursive Bisection?
• Horst D. Simon and Shang-Hua Teng, Report RNR-93-012, August 1993.
• The most commonly used p-way partitioning method is recursive bisection. It first "optimally" divides the graph (mesh) into two equal-sized pieces and then recursively divides the two pieces. We show that, due to the greedy nature and the lack of global information, recursive bisection may, in the worst case, produce a partition that is very far from the optimal one. Our negative result is complemented by two positive ones. First, we show that for some important classes of graphs that occur in practical applications, such as well-shaped finite element and finite difference meshes, recursive bisection is normally within a constant factor of the optimal one. Secondly, we show that if the balance condition is relaxed so that each block in the partition is bounded by (1+e)n/p, then there exists an approximately balanced recursive partitioning scheme that finds a partition whose cost is within an O(log p) factor of the cost of the optimal p-way partition.
SungKyunKwan Univ.
VADA Lab.
235
Partitioning Algorithm with Multiple
Constraints
1998. 5. 19
조준동
SungKyunKwan Univ.
VADA Lab.
236
Charging and Discharging Due to Switching
• Accounts for up to 90% of the total power dissipation.
[Figure: CMOS gate with a PMOS pull-up network and an NMOS pull-down network driving a load capacitance CL, showing the charge and discharge paths plus the short-circuit and leakage current components]
SungKyunKwan Univ.
VADA Lab.
237
Partitioning for Low Power
• Conventional method: minimize the number of edges crossing the cut.
• For low power: minimize the switching activity of the edges crossing the cut.
[Figure: the same netlist cut (a) by the number of crossing edges and (b) by the switching activity of the crossing edges, with edge activities of 0.25 and 0.75 annotated]
SungKyunKwan Univ.
VADA Lab.
238
Minimum-Cost Flow Algorithm
• A way of sending a given amount of flow to the desired destination at the least cost
  – each path has a capacity and a cost
• Max-flow min-cut: considers only the number of edges
• Min-cost flow: assigns each edge a weight based on its switching activity
  – cost: switching activity vs. number of edges
  – capacity: the maximum amount of flow an edge can carry
• A large capacity is given to low-cost edges so that they are selected first
  W_i = \alpha S_i + (1 - \alpha) C_i
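As a small illustration of this objective (not the min-cost flow solver itself), the Python sketch below combines switching activity S_i with a unit edge-count cost via the weight alpha, and evaluates the weighted cost of a candidate cut; all names are hypothetical.

    # Minimal sketch: activity-weighted edge costs for low-power partitioning.
    def edge_weights(edges, S, alpha=0.8):
        """S[e] = switching activity of edge e; unit edge-count cost for the second term."""
        return {e: alpha * S[e] + (1 - alpha) * 1.0 for e in edges}

    def cut_cost(edges, weights, U):
        """Total weighted cost of the edges crossing the cut defined by node set U."""
        U = set(U)
        return sum(weights[(u, v)] for u, v in edges if (u in U) != (v in U))

    edges = [('a', 'b'), ('b', 'c'), ('c', 'd')]
    S = {('a', 'b'): 0.25, ('b', 'c'): 0.75, ('c', 'd'): 0.25}
    w = edge_weights(edges, S)
    print(cut_cost(edges, w, {'a', 'b'}))   # the cut {a,b}|{c,d} pays for the high-activity edge b-c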
SungKyunKwan Univ.
VADA Lab.
239
Network and Mincost Flow
[Figure: flow network example with edges labeled "capacity / cost" (e.g., 15/30, 45/55, 100/10), illustrating a minimum-cost flow solution]
SungKyunKwan Univ.
VADA Lab.
240
Graph Transformation Algorithm
• The min-cost flow algorithm finds a path through the network
• To find the cut, the graph must be transformed
• Nodes are topologically sorted by level
[Figure: data-flow graph nodes arranged by level, Level 1 through Level 5]
SungKyunKwan Univ.
VADA Lab.
241
Graph Transformation Algorithm
• Added nodes and edges
[Figure: a source node and a sink node are inserted between Level (i) and Level (i+1); the legend distinguishes the newly created edges and nodes from the existing edges and nodes]
SungKyunKwan Univ.
VADA Lab.
242
Graph Transformation
[Figure: the leveled graph (Level 1 to Level 5) with the added source (S) and sink (T) nodes for the min-cost flow computation]
SungKyunKwan Univ.
VADA Lab.
243
Partitioning with constraints
W = \sum_{i=1}^{k} \sum_{j=1}^{k} C_{ij}, \quad i \neq j
A_{lower} \le A_i \le A_{upper}, \qquad P_i \le P_{upper}, \qquad 1 \le i \le k
SungKyunKwan Univ.
VADA Lab.
244
Algorithm
Input: Flow f, Network
Output: Partition the network into f subnetworks
Step 1:
    Push the flow f through the graph and run the minimum-cost flow algorithm;
    if every partition satisfies A_upper and P_upper, stop;
    otherwise set f = f + 1 and repeat Step 1 until the upper bounds are satisfied.
Step 2:
    If there are two partitions p and q that do not satisfy A_lower or P_lower, and
        A_lower <= A_p + A_q <= A_upper
        P_lower <= P_p + P_q <= P_upper
    then p and q can be merged; apply minimum-cost matching over all feasible {p,q} sets
    to reduce the number of partitions.
SungKyunKwan Univ.
VADA Lab.
245
참고문헌
[1] J.D.Cho and P.D.Franzon, "High-Performance Design Automation for Multi-Chip Modules and
Packages", World Scientific Pub. Co. 1996
[2] H.J.M.Veendrick, "Short-Circuit Dissipation of Static CMOS Circuitry and its Impact on the Design of
Buffer Circuits" IEEE JSSCC, pp.468-473, August, 1984
[3] H.B.Bakoglu, "Circuits, Interconnections and Packaging for VLSI", pp.81-112, Addison-Wesley
Publishing Co., 1990
[4] K.M.hall. "An r-dimensional quadratic placement algorithm", Management Sci., vol.17, pp.219-229,
Nov, 1970
[5] Cadence Design Systems. "A Vision for Multi-Chip Module design in the nineties", Tech. Rep.
Cadence Design Systems Inc., Santa Clara, CA, 1993
[6] R.Raghavan, J.Cohoon, and S.Shani. "Single Bend Wiring", Journal of Algorithms, 7(2):232-257, June,
1986
[7] Kernighan, B.W. and S.lin. "An efficient heuristic procedure to partition graphs" Bell System
Technical Journal, 492:291-307, Feb. 1970
[8] Wei, Y.C. and C.K.Cheng "Ratio-Cut Partitioning for Hierachical Designs", IEEE Trans. on ComputerAided Design. 40(7):911-921, 1991
[9] S.W.Hadley, B.L.Mark, and A.Vanelli, "An Efficient Eigenvector Approach for Finding Netlist
Partitions", IEEE Trans. on Computer-Aided Design, vol. CAD-11, pp.85-892, July, 1992
[10] L.R.Ford, Jr. and D.R.Fulkerson. "Flows in Networks", Princeton University Press, Princeton, NJ,
1962
[11] Liu H. and D.F.Wong, "Network Flow Based Multi-Way Partitioning With Area and Pin Constraints",
IEEE/ACM Symposium on Physical Design, pp. 12-17, 1997
[12] Kirkpatrick, S. Jr., C.Gelatt, and M.Vecchi. "Optimization by simulated annealing", Science,
220(4598):498-516, May, 1983
[13] Pedram, M. "Power Minimization in IC Design: Principles and Applications," ACM Trans. on Design
Automation of Electronics Systems, 1(1), Jan. pp. 3-56, 1996.
[14] A.H.Farrahi and M.Sarrafzadeh. "FPGA Technology Mapping for Power Minimizatioin", In
International Workshop on Field-Programmable Logic and Applications, pp66-77, Sep. 1994
[15] M.A.Breuer, "Min-Cut Placement", J.Design Automation and Fault-Tolerant Computing, pp.343-382,
Oct. 1977
SungKyunKwan Univ.
VADA Lab.
246
[16] M.Hanan and M.J.Kutrzberg. A Review of the Placement and the Quadratic Assignment
Problem, Apr. 1972.
[17] N.R.Quinn, "The Placement Problem as Viewed from the Physics of Classical
Mechanics", Proc. of the 12th Design Automation Conference, pp.173-178, 1975
[18] C.Sechen, and A.Sangiovanni-Vincentelli, "The TimberWolf placement and routing
package", IEEE Journal of Solid-State Circuits, Sc-20, pp.501-522, 1985
[19] K.Shahookar, and P.Mazumder, "A Genetic Approach to Standard Cell Placement", First
European Design Automation Conference, Mar. 1990
[20] J.D.Cho, S.Raje, M.Sarrafzadeh, M.Sriram, and S.M.Kang, "Crosstalk Minimum Layer
Assignment", In Proc. IEEE Custom Integr. Circuits Conf., San Diego, CA, pp.29.7.1-29.7.4,
1993
[21] J.M.Ho, M.Sarrafzadeh, G,Vijayan, and C.K.Wong. "Layer Assignment for Multi-Chip
Modules", IEEE Trans. on Computer-Aided Design, CAD-9(12):1272-1277, Dec., 1991
[22] G.Devaraj. "Distributed placement and crosstalk driven router for multichip modules", In
MS Thesis, Univ. of Cincinnati, 1994
[23] J.D.Cho. "Min-Cost Flow based Minimum-Cost Rectilinear Steiner Distance-Preserving
Tree", International Symposium on Physical Desigh, pp-82-87, 1997
[24] A.Vitttal and M.Marek-Sadowska. "Minimal Delay Interconnection Design using
Alphabetic Trees", In Design Automation Conference, pp.392-396, 1994
[25] M.C.Golumbic. "Algorithmic Graph Theory and Perfect Graph", pp.80-103, New York :
Academic. 1980
[26] R.Vemuri. "Genetic Algorithms for partitioning, placement, and layer assignment for
multichip modules", Ph.D. Thesis, Univ. of Cincinnati, 1994
[27] J.L.Kennington and R.V.Helgason, "Algorithms for Network Programmin", John Wiley,
1980
[28] J.Y.Cho and J.D.Cho "Improving Performance and Routability Estimation in MCM
Placement", In InterPack'97, Hawaii, June, 1997
[29] J.Y.Cho and J.D.Cho "Partitioning for Low Power Using Min-Cost Flow Algorithm",
submitted to the Korean Conference on Semiconductors, Feb. 1998
SungKyunKwan Univ.
VADA Lab.
247
6. Logic Level Design
SungKyunKwan Univ.
VADA Lab.
248
Node Transition Activity
SungKyunKwan Univ.
VADA Lab.
249
Low Activity XOR Function
SungKyunKwan Univ.
VADA Lab.
250
GLITCH (Spurious transitions)
• 15-20% of the total
power is due to
glitching.
SungKyunKwan Univ.
VADA Lab.
251
Glitches
SungKyunKwan Univ.
VADA Lab.
252
Hazard Generation in Logic Circuits
•Static hazard: A transient pulse
of width w (= the delay of the
inverter).
• Dynamic hazard: the transient
consists of three edges, two
rising and one falling with w of
two units.
• Each input can have several
arriving paths.
SungKyunKwan Univ.
VADA Lab.
253
High-Performance PowerDistribution
• (S: Switching probability; C: Capacitance)
• Start with all logic at the lowest power level; then, successive iterations of delay calculation, identifying the failing blocks, and powering up are done until either all of the nets pass their delay criteria or the maximum power level is reached.
• Voltage drops in ground and supply wires use up a more serious fraction of the total noise margin.
SungKyunKwan Univ.
VADA Lab.
254
Logic Transformation
• Use a signal with low switching activity to reduce the activity on a highly active signal.
• This is done by adding a redundant connection between the gate with low activity (source gate) and the gate with high switching activity (target gate).
• Signals a, b, and g1 have very high switching activity and most of the time their value is zero.
• Suppose c and g1 are selected as the source and target of a new connection l1; l1 is undetectable, hence the function of the new circuit remains the same.
• Signal c has a long run of zeros, and zero is the controlling value of the AND gate g1, so most of the switching activity at the inputs of g1 will not be seen at the output; thus the switching activity of the gate g1 is reduced.
• The redundant connection in a circuit may result in some irredundant connections becoming redundant.
• By adding l1, the connections from c to g3 become redundant.
SungKyunKwan Univ.
VADA Lab.
255
Logic Transformation
SungKyunKwan Univ.
VADA Lab.
256
Logic Transformation
SungKyunKwan Univ.
VADA Lab.
257
Frequency Reduction
◈ Power saving
  - Reduces capacitance on the clock network
  - Reduces internal power in the affected registers
  - Reduces the need for muxes (data recirculation)
◈ Opportunity
  - Large opportunity for power reduction, dependent on:
    - the number of registers gated
    - the percentage of time the clock is enabled
◈ Cost
  - Testability
  - Complicates clock tree synthesis
  - Complicates clock skew balancing
SungKyunKwan Univ.
VADA Lab.
258
GATED-CLOCK D-FLIP-FLOP
• Flip-flops present a large internal capacitance on the internal clock node.
• If the DFF output does not switch, the DFF does not have to be clocked.
SungKyunKwan Univ.
VADA Lab.
259
Frequency Reduction
Clock Gating Example - When D is not equal to Q
[Figure: clock gating example. Before clock gating, a 32-bit register data_reg recirculates its output through a load_en multiplexer controlled by the FSM and is clocked by clk every cycle. After clock gating, load_en is latched (load_en_latched) while clk is low and the register is clocked by clk_en = clk AND load_en_latched.]
SungKyunKwan Univ.
VADA Lab.
260
Frequency Reduction
◈ Clock Gating Example - Before Code
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity nongate is
port(clk,rst : in std_logic;
data_in : in std_logic_vector(31 downto 0);
data_out : out std_logic_vector(31 downto 0));
end nongate;
architecture behave of nongate is
signal load_en : std_logic;
signal data_reg : std_logic_vector(31 downto 0);
signal count : integer range 0 to 15;
begin
FSM : process
begin
wait until clk'event and clk='1';
if rst='0' then count <= 0;
elsif count=9 then
count <= 0;
else count <= count+1;
end if;
end process FSM;
SungKyunKwan Univ.
enable_logic : process(count,load_en)
begin
if(count=9) then
load_en <= '1';
else
load_en <= '0';
end if;
end process enable_logic;
datapath : process
begin
wait until clk'event and clk='1';
if load_en='1' then data_reg <= data_in;
end if;
end process datapath;
data_out <= data_reg;
end behave;
configuration cfg_nongate of nongate is
for behave
end for;
end cfg_nongate;
VADA Lab.
261
Frequency Reduction
◈ Clock Gating Example - After Code
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity gate is
port(clk,rst : in std_logic;
data_in : in std_logic_vector(31 downto 0);
data_out : out std_logic_vector(31 downto
0));
end gate;
architecture behave of gate is
signal load_en,load_en_latched,clk_en : std_logic;
signal data_reg : std_logic_vector(31 downto 0);
signal count : integer range 0 to 15;
begin
SungKyunKwan Univ.
VADA Lab.
262
Frequency Reduction
FSM : process
begin
wait until clk'event and clk='1';
if rst='0' then
count <= 0;
elsif count=9 then count <= 0;
else count <= count+1;
end if;
end process FSM;
enable_logic : process(count,load_en)
begin
if(count=9) then
load_en <= '1';
else load_en <= '0';
end if;
end process enable_logic;
deglitch : PROCESS(clk,load_en)
begin
SungKyunKwan Univ.
if(clk='0') then
load_en_latched <= load_en;
end if;
end process deglitch;
clk_en <= clk and load_en_latched;
datapath : process
begin
wait until clk_en'event and clk_en='1';
data_reg <= data_in;
end process datapath;
data_out <= data_reg;
end behave;
configuration cfg_gate of gate is
for behave
end for;
end cfg_gate;
VADA Lab.
263
Frequency Reduction
◈ Clock Gating Example - Report
SungKyunKwan Univ.
VADA Lab.
264
Frequency Reduction
◈ 4-bit Synchronous & Ripple counter - code
 4-bit Synchronous Counter
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
entity BINARY is
Port ( clk : In std_logic;
reset : In std_logic;
count : BUFFER UNSIGNED (3
downto 0));
end BINARY;
architecture BEHAVIORAL of BINARY is
begin
process(reset,clk,count)
begin
SungKyunKwan Univ.
if (reset = '0') then count <= "0000";
elsif (clk'event and clk = '1') then
if (count = UNSIGNED'("1111")) then
count <= "0000";
else count <=count+UNSIGNED'("1");
end if;
end if;
end process;
end BEHAVIORAL;
configuration
CFG_BINARY_BLOCK_BEHAVIORAL of
BINARY is
for BEHAVIORAL
end for;
end CFG_BINARY_BLOCK_BEHAVIORAL;
VADA Lab.
265
Frequency Reduction
 4-bit Ripple Counter
Library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
entity RIPPLE is
Port ( clk : In std_logic;
reset : In std_logic;
count : BUFFER UNSIGNED (3
downto 0));
end RIPPLE;
architecture BEHAVIORAL of RIPPLE is
signal count0, count1, count2 : std_logic;
begin
process(count)
begin
count0 <= count(0);
count1 <= count(1);
SungKyunKwan Univ.
count2 <= count(2);
end process;
process(reset,clk)
begin
if (reset = '0') then count(0) <= '0';
elsif (clk'event and clk = '1') then
if (count(0) = '1') then count(0) <= '0';
else count(0) <= '1';
end if;
end if;
end process;
process(reset,count0)
begin
if (reset = '0') then count(1) <= '0';
elsif (count0'event and count0 = '1') then
VADA Lab.
266
Frequency Reduction
if (count(1) = '1') then count(1) <= '0';
else count(1) <= '1';
end if;
end if;
end process;
process(reset,count1)
begin
if (reset = '0') then count(2) <= '0';
elsif (count1'event and count1 = '1') then
if (count(2) = '1') then count(2) <= '0';
else count(2) <= '1';
end if;
end if;
end process;
process(reset,count2)
begin
if (reset = '0') then count(3) <= '0';
elsif (count2'event and count2 = '1') then
if (count(3) = '1') then count(3) <= '0';
else count(3) <= '1';
end if;
end if;
end process;
end BEHAVIORAL;
configuration
CFG_RIPPLE_BLOCK_BEHAVIORAL of RIPPLE is
for BEHAVIORAL
end for;
end CFG_RIPPLE_BLOCK_BEHAVIORAL;
SungKyunKwan Univ.
VADA Lab.
267
Frequency Reduction
◈ 4-bit Synchronous & Ripple counter - Report
SungKyunKwan Univ.
VADA Lab.
268
Bus-Invert Coding for Low Power I/O
An eight-bit bus on which all
eight lines toggle at the same
time and which has a high peak
(worst-case) power dissipation.
•There are 16 transitions over
16 clock cycles (average 1
transition per clock cycle).
SungKyunKwan Univ.
VADA Lab.
269
Peak Power Dissipation
An eight-bit bus on which the
eight lines toggle at different
moments and which has a low
peak power dissipation. There
are the same 16 transitions
over 16 clock cycles and thus
the same average power
dissipation
SungKyunKwan Univ.
VADA Lab.
270
Bus-Invert - Coding for low power
• The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value; when invert = 1 the bus value is the inverted data value. The peak power dissipation can then be decreased by half by coding the I/O as follows:
1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value.
2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value).
3. Otherwise, let invert = 0 (and let the next bus value equal the next data value).
4. At the receiver side the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored encoded as it is (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
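A minimal Python sketch of this encoder follows (words are lists of 0/1; the function name and example data are illustrative only). It applies steps 1-3 above and returns the bus value together with the invert line.

    # Minimal sketch of bus-invert encoding for an n-bit bus.
    def bus_invert_encode(words, n):
        bus, invert = [0] * n, 0
        encoded = []
        for data in words:
            dist = sum(b != d for b, d in zip(bus, data)) + invert  # count the invert line too
            if dist > n / 2:
                invert, bus = 1, [1 - d for d in data]
            else:
                invert, bus = 0, list(data)
            encoded.append((list(bus), invert))
        return encoded

    # Example: an all-zero to all-one swing (8 transitions) is sent as "no bus change, invert = 1".
    print(bus_invert_encode([[1] * 8, [0] * 8], 8))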
SungKyunKwan Univ.
VADA Lab.
271
Example
A typical eight-bit synchronous data
bus. The transitions between two
consecutive time-slots are "clean".
There are 64 transitions for a period
of 16 time slots. This represents an
average of 4 transitions per time slot,
or 0.5 transitions per bus line per time
slot.
SungKyunKwan Univ.
VADA Lab.
272
Bus encoding
The same sequence of data coded
using the Bus
Invert method. There are now only
53 transitions over a period of 16
time slots. This represents an
average of 3.3 transitions per time
slot, or 0.41 transitions per bus line
per time slot.
The maximum number of
transitions for any time slot is now
4.
SungKyunKwan Univ.
VADA Lab.
273
Comparisons
Comparison of unencoded I/O and coded I/O with one or more invert lines.
The comparison looks at the average and maximum number of transitions
per time-slot, per bus-line per time-slot, and I/O power dissipation for
different bus-widths.
SungKyunKwan Univ.
VADA Lab.
274
Remarks
• The increase in the delay of the data path: by looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolutely lower number of transitions. It is also relatively easy to pipeline the bus activity. The extra pipeline stage and the extra latency must then be considered.
• The increased number of I/O pins: as was mentioned before, ground bounce is a big problem for simultaneous switching in high-speed designs. That is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side effect of decreasing the maximum ground bounce by approximately 50%. Thus circuits using the Bus-Invert method can use a lower number of Vdd and GND pins, and by using the method the total number of pins might even decrease.
• The Bus-Invert method decreases the total power dissipation although both the total number of transitions increases (by counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions get redistributed very nonuniformly: more on the low-capacitance side and less on the high-capacitance side.
SungKyunKwan Univ.
VADA Lab.
275
References
[1] H. B. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Addison-Wesley, 1990.
[2] T. K. Callaway, E. E. Swartzlander, "Estimating the Power Consumption of CMOS Adders", 11th Symp. on Comp. Arithmetic, pp. 210-216, Windsor, Ontario, 1993.
[3] A. P. Chandrakasan, S. Sheng, R. W. Brodersen, "Low-Power CMOS Digital Design", IEEE Journal of Solid-State Circuits, pp. 473-484, April 1992.
[4] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "HYPER-LP: A System for Power Minimization Using Architectural Transformations", ICCAD-92, pp. 300-303, Nov. 1992, Santa Clara, CA.
[5] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, R. W. Brodersen, "An Approach to Power Minimization Using Transformations", IEEE VLSI for Signal Processing Workshop, 1992, CA.
[6] S. Devadas, K. Keutzer, J. White, "Estimation of Power Dissipation in CMOS Combinational Circuits", IEEE Custom Integrated Circuits Conference, pp. 19.7.1-19.7.6, 1990.
[7] D. Dobberpuhl et al., "A 200-MHz 64-bit Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, pp. 1555-1567, Nov. 1992.
[8] R. J. Fletcher, "Integrated Circuit Having Outputs Configured for Reduced State Changes", U.S. Patent no. 4,667,337, May 1987.
SungKyunKwan Univ.
[9] D. Gajski, N. Dutt, A. Wu, S. Lin, High-Level Synthesis, Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
[10] J. S. Gardner, "Designing with the IDT SyncFIFO: the Architecture of the Future", 1992 Synchronous (Clocked) FIFO Design Guide, Integrated Device Technology AN-60, pp. 7-10, 1992, Santa Clara, CA.
[11] A. Ghosh, S. Devadas, K. Keutzer, J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits", Proceedings of the 29th DAC, pp. 253-259, June 1992, Anaheim, CA.
[12] J. L. Hennessy, D. A. Patterson, Computer Architecture - A Quantitative Approach, Morgan Kaufmann Publishers, Palo Alto, CA, 1990.
[13] S. Kodical, "Simultaneous Switching Noise", 1993 IDT High-Speed CMOS Logic Design Guide, Integrated Device Technology AN-47, pp. 41-47, 1993, Santa Clara, CA.
[14] F. Najm, "Transition Density, A Stochastic Measure of Activity in Digital Circuits", Proceedings of the 28th DAC, pp. 644-649, June 1991, Anaheim, CA.
VADA Lab.
276
References
[16] A. Park, R. Maeder, "Codes to Reduce Switching Transients Across VLSI I/O Pins", Computer Architecture News, pp. 17-21, Sept. 1992.
[17] Rambus - Architectural Overview, Rambus Inc., Mountain View, CA, 1993. Contact [email protected].
[18] A. Shen, A. Ghosh, S. Devadas, K. Keutzer, "On Average Power Dissipation and Random Pattern Testability", ICCAD-92, pp. 402-407, Nov. 1992, Santa Clara, CA.
[19] M. R. Stan, "Shift register generators for circular FIFOs", Electronic Engineering, pp. 26-27, February 1991, Morgan Grampian House, London, England.
[20] M. R. Stan, W. P. Burleson, "Limited-weight codes for low power I/O", International Workshop on Low Power Design, April 1994, Napa, CA.
[21] J. Tabor, Noise Reduction Using Low Weight and Constant Weight Coding Techniques, Master's Thesis, EECS Dept., MIT, May 1990.
[22] W.-C. Tan, T. H.-Y. Meng, "Low-power polygon renderer for computer graphics", Int. Conf. on A.S.A.P., pp. 200-213, 1993.
[23] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, A Systems Perspective, Addison-Wesley Publishing Company, 1988.
[24] R. Wilson, "Low power and paradox", Electronic Engineering Times, pp. 38, November 1, 1993.
[25] J. Ziv, A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Trans. on Inf. Theory, vol. IT-23, pp. 337-343, 1977.
SungKyunKwan Univ.
VADA Lab.
277
DesignPower Gate Level Power
Model
◈ Switching Power
  - Power dissipated when a load capacitance (gate + wire) is charged or discharged at the driver's output
  - If the technology library contains the correct capacitance value of the cell, and if the capacitive_load_unit attribute is specified, then no additional information is needed for switching power modeling
  - Output pin capacitance need not be modeled if the switching power is incorporated into the internal power
  P_{sw} = \frac{V^2}{2} \sum_{\text{all nets } i} C_i \cdot TR_i
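A short Python sketch of this sum (assuming capacitances in farads and toggle rates in transitions per second; the function name and numbers are illustrative):

    # Minimal sketch of the switching-power sum above.
    def switching_power(vdd, nets):
        """nets: list of (capacitance, toggle_rate) pairs; returns power in watts."""
        return 0.5 * vdd ** 2 * sum(c * tr for c, tr in nets)

    # Example: nets of 20 fF, 50 fF and 100 fF toggling at 10 MHz under Vdd = 3.3 V.
    print(switching_power(3.3, [(20e-15, 1e7), (50e-15, 1e7), (100e-15, 1e7)]))  # ~9.3 uW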
SungKyunKwan Univ.
VADA Lab.
278
DesignPower Gate Level Power
Model
◈ Internal Power
  - Power dissipated internal to a library cell
  - Modeled using an energy lookup table indexed by input transition time and output load
  - Library cells may contain one or more internal energy lookup tables
  P_{int} = \sum_{\text{all cells } i} E_{int,i}(\text{output load}, \text{input transition}) \cdot TR_i
SungKyunKwan Univ.
VADA Lab.
279
DesignPower Gate Level Power
Model
◈ Leakage Power
  - The leakage power model supports a single value for each library cell
  - State-dependent leakage power is not supported
  P_{leak} = \sum_{\text{all cells } i} P_{leak,i}
SungKyunKwan Univ.
VADA Lab.
280
Operand Isolation
• Combinational logic dissipates significant power when its output is unused.
• Inputs to the combinational logic are held stable (via a latch enabled by the FSM) when the output is unused.
[Figure: an n-bit datapath feeding an m-bit register bank controlled by an FSM enable, before and after inserting an operand-isolation latch in front of the combinational logic]
VADA Lab.
281
Operand Isolation Example - Diagram
[Figure: an 8-bit adder (a + b) feeding a multiplier (* c) and a 16-bit data register; after operand isolation, a latch gated by Load_En_Latched holds the adder output (Iso_Data_Add) stable, and the register is clocked by Clk_En]
SungKyunKwan Univ.
VADA Lab.
282
Operand Isolation Example - Before Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;

Entity Logic is
Port(
    a, b, c : in std_logic_vector(7 downto 0);
    do : out std_logic_vector(15 downto 0);
    rst : in std_logic;
    clk : in std_logic
);
End Logic;

Architecture Behave of Logic is
Signal Count : integer;
Signal Load_En : std_logic;
Signal Load_En_Latched : std_logic;
Signal Clk_En : std_logic;
Signal Data_Add : std_logic_vector(7 downto 0);
Signal Data_Mul : std_logic_vector(15 downto 0);
Begin
Process(clk,rst)    -- Counter Logic in FSM
Begin
    If(clk='1' and clk'event) then
        If(rst='0') then
            Count <= 0;
        Elsif(Count=9) then
            Count <= 0;
        Else
            Count <= Count + 1;
        End If;
    End If;
End Process;
SungKyunKwan Univ.
VADA Lab.
283
Operand Isolation Example - Before Code
Process(Count)
-- Enable Logic in FSM
Begin
If(Count=9) then
Load_En <= '1';
Else
Load_EN <= '0';
End If;
End Process;
Process(clk,Load_En)
-- Latch(for Deglitch) Logic
Begin
If(clk='0') then
Load_En_Latched <= Load_En;
End If;
End Process;
clk_En <= clk and Load_En_Latched;
SungKyunKwan Univ.
Data_Add <= a + b;
Data_Mul <= Data_Add * c;
Process(Data_Mul,Clk_En) -- Data Reg Logic
Begin
If(Clk_En='1' and Clk_En'event) then
Do <= Data_Mul;
End If;
End Process;
End Behave;
Configuration CFG_Logic of Logic is
for Behave
End for;
End CFG_Logic;
VADA Lab.
284
Operand Isolation Example - After Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Use IEEE.STD_LOGIC_SIGNED.ALL;
Entity Logic1 is
Port(
a, b, c : in std_logic_vector(7 downto 0);
do : out std_logic_vector(15 downto 0);
rst : in std_logic;
clk : in std_logic
);
End Logic1;
Architecture Behave of Logic1 is
Signal Count : integer;
Signal Load_En : std_logic;
Signal Load_En_Latched : std_logic;
Signal Clk_En : std_logic;
SungKyunKwan Univ.
Signal Data_Add : std_logic_vector(7 downto 0);
Signal Data_Mul : std_logic_vector(15 downto 0);
Signal Iso_Data_Add : std_logic_vector(7 downto 0);
Begin
Process(clk,rst)
-- Counter Logic in FSM
Begin
If(clk='1' and clk'event) then
If(rst='0') then
Count <= 0;
Elsif(Count=9) then
Count <= 0;
Else
Count <= Count + 1;
End If;
End If;
End Process;
VADA Lab.
285
Operand Isolation Example - After Code
Process(Count)
-- Enable Logic in FSM
Begin
If(Count=9) then
Load_En <= '1';
Else
Load_EN <= '0';
End If;
End Process;
Process(Load_En_Latched,Data_Add)
-- Latch
Begin
-- for Operand Isolation
If(Load_En_Latched='1' and
Load_En_Latched'event) then
Iso_Data_Add <= Data_Add;
End If;
End Process;
Data_Mul <= Iso_Data_Add * c;
Process(clk,Load_En)    -- Latch (for Deglitch) Logic
Begin
    If(clk='0') then
        Load_En_Latched <= Load_En;
    End If;
End Process;

Process(Data_Mul,Clk_En)    -- Data Reg Logic
Begin
    If(Clk_En='1' and Clk_En'event) then
        Do <= Data_Mul;
    End If;
End Process;

clk_En <= clk and Load_En_Latched;
Data_Add <= a + b;
End Behave;
SungKyunKwan Univ.
VADA Lab.
286
Operand Isolation Example - Report
Before Code
SungKyunKwan Univ.
After Code
VADA Lab.
287
Precomputation
•
Power saving
– Reduces power dissipation of combinational logic
– Reduces internal power to precomputed registers
• Opportunity
– Can be significant, dependent on;
• percentage of time latch precomputation is successful
•
Cost
– Increase area
– Impact circuit timing
– Increase design complexity
• number of bits to precompute
– Testability
• may generate redundant logic
SungKyunKwan Univ.
VADA Lab.
288
Precomputation
[Figure: on the left, an n-bit register bank drives logic that computes the entire p-bit function every cycle; on the right, m of the input bits feed a small precomputation block that generates a register enable, so the remaining n-m bits are loaded only when needed and a smaller function is evaluated.]
Entire function is computed (left); smaller function is defined and the enable is precomputed (right).
SungKyunKwan Univ.
VADA Lab.
289
Precomputation
• Before Precomputation Diagram
[Figure: before precomputation — two 8-bit registers feed an 8-bit comparator (a > b); the 1-bit result is registered to produce Data_out, all clocked by CLK]
SungKyunKwan Univ.
VADA Lab.
290
Precomputation
• After Precomputation Diagram
[Figure: after precomputation — the MSBs a(7) and b(7) are registered every cycle and examined first; only when they are equal are the lower bits a(6:0) and b(6:0) clocked into their registers (via a latch-gated clock), and the full a > b comparison is evaluated]
SungKyunKwan Univ.
VADA Lab.
291
Precomputation
• Before Precomputation - Report
SungKyunKwan Univ.
VADA Lab.
292
Precomputation
• After Precomputation - Report
SungKyunKwan Univ.
VADA Lab.
293
Precomputation Example - Before
Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Entity before_precomputation is
port ( a,b : in std_logic_vector(7 downto
0);
CLK: in std_logic;
D_out: out std_logic);
end before_precomputation;
Architecture Behav of before_precomputation is
signal a_in, b_in : std_logic_vector(7
downto 0);
signal comp : std_logic;
SungKyunKwan Univ.
Begin
process (a,b,CLK)
Begin
if (CLK = '1' and CLK'event)
then
a_in <= a;
b_in<= b;
end if;
if (a_in > b_in) then
comp <= '1';
else
comp <= '0';
end if;
if (CLK'event and CLK='1')
then
D_out <= comp;
end if;
end process;
end Behav;
VADA Lab.
294
Precomputation Example - After
Code
Library IEEE;
Use IEEE.STD_LOGIC_1164.ALL;
Entity after_precomputation is
port (a, b : in std_logic_vector(7 downto 0);
CLK : in std_logic;
D_out : out std_logic);
end after_precomputation;
Architecture Behav of after_precomputation is
signal a_in, b_in : std_logic_vector(7 downto 0);
signal pcom, pcom_D : std_logic;
signal CLK_en, comp : std_logic;
Begin
-- pcom = '1' when the MSBs are equal, i.e. when the lower bits are actually needed
-- (the original "pcom <= a xor b" had a type mismatch; this is the inferred intent)
pcom <= not (a(7) xor b(7));
CLK_en <= pcom_D and CLK;
process(a,b,CLK)
Begin
if (CLK='1' and CLK'event) then
a_in(7) <= a(7);
b_in(7) <= b(7);
end if;
if (CLK='0') then
pcom_D <= pcom;
end if;
SungKyunKwan Univ.
VADA Lab.
295
Precomputation - Example After
Code
if (CLK_en='1' and CLK_en'event) then
a_in(6 downto 0) <= a(6 downto 0);
b_in(6 downto 0) <= b(6 downto 0);
end if;
if (a_in > b_in) then
comp <= '1';
else
comp <= '0';
end if;
if (CLK='1' and CLK'event) then
D_out <= comp;
end if;
end process;
end Behav;
SungKyunKwan Univ.
VADA Lab.
296
Peak Power Reduction
• Peak power is related to EMI.
• Reducing concurrent switching reduces the peak power:
  – adjust the delay Δ of each bus/port driver within the speed of the system clock
  – consider the power consumption of the delay elements
  – while maintaining the same total power consumption, EMI is improved by the peak power reduction
• E = V_{dd} \int I_{total}\,dt
[Figure: before peak power reduction, all n bus drivers switch simultaneously, giving one large I_total spike (energy E1); after adding staggered delays Δ, 2Δ, ..., (n-1)Δ, the current spreads over the clock period with a lower peak but the same energy E2]
SungKyunKwan Univ.
VADA Lab.
297
Factoring Example
Function :
f = ad + bc + cd
The function f is not on the critical path.
The signal a,b,c and d are all the same bit width.
Signal b is a high activity net.
The two factorings below are equivalent from both a timing
and area criteria.
Net Result : network toggling and power is reduced.
f = b(a+c) + cd
f = b(a+c) + cd
[Figure: two equivalent gate networks implementing f = b(a+c) + cd that differ in which internal nodes the high-activity signal b drives]
SungKyunKwan Univ.
VADA Lab.
298
Block diagram of a low-voltage, high-speed LSI
• Power Management
Processor controls the
low-Vt circuit using the
sleep signal.
• Extend the sleep
period as much as
possible, because
leakage power is
reduced during this
time
SungKyunKwan Univ.
VADA Lab.
299
Operations of low-Vt LSI
On a request signal from an I/O device, the circuit computes and outputs the results, then waits for the next request signal. During the waiting period, the low-Vt circuit can sleep.
SungKyunKwan Univ.
VADA Lab.
300
Waking/Sleeping operation
Waking operation
SungKyunKwan Univ.
Sleeping operation
VADA Lab.
301
Creating sleep period: Operation during
calculation
• Heavy operations such as a voice CODEC, and light operations such as data collection, can be distributed between the low-Vt circuit and the PMP, and the low-Vt circuit can sleep while the PMP is executing the light operations.
• Reduces the power by 10%.
SungKyunKwan Univ.
VADA Lab.
302
Low Power Logic Gate
Resynthesis on Mapped
Circuit
김현상 조준동
School of Electrical, Electronic and Computer Engineering
SungKyunKwan University
SungKyunKwan Univ.
VADA Lab.
303
Transition Probability
• Transition probability: probability of a transition at the output of a gate, given a change at the inputs
• Use signal probabilities
• Example: F = X'Y + XY'
  – Signal probability of F: P_F = P_X(1-P_Y) + (1-P_X)P_Y
  – Transition probability of F: 2*P_F*(1-P_F)
  – Assumes independence of the inputs
• Use BDDs to compute these
• References: Najm'91
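A two-line Python sketch of the formulas above (illustrative helper name; assumes independent inputs):

    # Signal and transition probability of F = X'Y + XY' (XOR), per the formulas above.
    def xor_probs(px, py):
        pf = px * (1 - py) + (1 - px) * py          # signal probability of F
        return pf, 2 * pf * (1 - pf)                 # (signal prob, transition prob)

    print(xor_probs(0.5, 0.5))   # -> (0.5, 0.5)
    print(xor_probs(0.1, 0.5))   # -> (0.5, 0.5)  XOR with one random input stays random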
SungKyunKwan Univ.
VADA Lab.
304
Technology Mapping
•Implementing a Boolean network in terms of gates from a
given library
•Popular technique: Tree-based mapping
•Library gates and circuits decomposed into canonical
patterns
•Pattern matching and dynamic programming to find the best
cover
•NP-complete for general DAG circuits
•Ref: Keutzer’87, Rudell’89
•Idea: High transition probability points are hidden within
gates
SungKyunKwan Univ.
VADA Lab.
305
Low Power Cell Mapping
• Example of a high switching activity node
• Internal mapping in a complex gate
[Figure: the same function mapped two ways — with the high-activity node Y exposed between discrete gates, and with Y hidden inside a complex gate]
SungKyunKwan Univ.
VADA Lab.
306
Signal Probability vs. Power
Switching power: P(x) ∝ p(x)·(1 − p(x))
[Figure: plot of p(x)(1−p(x)) versus the signal probability p(x) from 0.0 to 1.0, peaking at p(x) = 0.5]
SungKyunKwan Univ.
VADA Lab.
307
Spatial Correlation
[Figure: z = x OR y with P(x) = P(y) = 0.25; when x and y are independent, P(z) = 0.4375, but when they share an input (spatial correlation from reconvergent fanout), P(z) = 0.375]
SungKyunKwan Univ.
VADA Lab.
308
Low Power Logic Synthesis
RTL Description
Logic Synthesis
Technology Independent
Optimization
Logic Equation
Timing & Power
Analysis Tools
Technology Mapping
Connection of Gates
Resynthesis on Mapped
Circuit
Gate Level Description
SungKyunKwan Univ.
VADA Lab.
309
Technology Mapping
h : high switching activity node
l : low switching activity node
[Figure: three candidate technology mappings (a), (b), (c) of the same subject graph, differing in whether the high-activity (h) nodes end up hidden inside gates or exposed on gate boundaries]
SungKyunKwan Univ.
VADA Lab.
310
Tree Decomposition
[Figure: two tree decompositions (a) and (b) of the same AND-tree function f, with the critical path, primary inputs, gates, and output marked; (b) is the low-power decomposition]
SungKyunKwan Univ.
VADA Lab.
311
Huffman Algorithm
[Figure: Huffman-style merge tree over leaf nodes x1–x5 and internal nodes y1–y3; the node weights shown include 2, 3, 4, 4, 5 at the leaves, 8, 10, 13 at the internal nodes, and 23 at the root]
SungKyunKwan Univ.
VADA Lab.
312
Depth-Constrained Decomposition
Algorithm
problem : minimize SUM from i=1 to m of p_t(x_i)
input : input signal probabilities (p1, p2, ..., pn), height (h), number of leaf nodes (n), fan-in limit per gate (k)
output : k-ary tree topology
Begin
    sort (signal probabilities p1, p2, ..., pn);
    while (n != 0)
        if (h > log_k n)
            assign k nodes to level L (= h+1);   /* allocate k nodes at level L = h+1 */
            h = h-1, n = n-(k-1);                /* upward */
        else if (h < log_k n)
            assign k nodes to level L (= h+2);   /* allocate k nodes at the previous level L = h+2 */
            h = h, n = n-(k-1);                  /* downward */
        else (h = log_k n)
            assign the remaining nodes to level L (= h+1);
            /* complete: allocate all remaining nodes at level L = h+1 and form a complete k-ary tree */
    for (bottom level L; L > 1; L--)
        min_edge_weight_matching (nodes in level L);
End
SungKyunKwan Univ.
VADA Lab.
313
Example
[Figure: a 3-level binary tree (levels L=0..3, heights h=1..3) over leaf signals a, b, c, d (probabilities 0.1, 0.2, 0.3, 0.4) and e, f (0.5, 0.6); before matching the leaves are paired as (a,b)(c,d), while minimum-edge-weight matching re-pairs them as (a,d)(b,c)]
SungKyunKwan Univ.
VADA Lab.
314
After Decomposition
[Figure: bar chart (k1 = 2) comparing SIS and SIS+OURS together with the improvement ratio, plotted against fan-in (6, 10, 20) and tree height (h = 3 to 9)]
SungKyunKwan Univ.
VADA Lab.
315
After Tech. Mapping
[Figure: bar chart (k1 = 3, k2 = 3) of power (mW) for SIS+LEVEL MAP versus SIS+OURS+LEVEL MAP together with the improvement ratio, plotted against fan-in (6, 10, 15, 20) and tree height (h = 2 to 8)]
SungKyunKwan Univ.
VADA Lab.
316
7. Circuit Level Design
SungKyunKwan Univ.
VADA Lab.
317
Buffer Chain
• Delay analysis of the buffer chain:
  (W/L)_{k+1} = \beta (W/L)_k
  C_k = \beta^k C_{in} + \beta^{k-1} C_p
  T_d = \sum_{k=1}^{n} t_d(k) = n \beta t_0
  P_k = C_k V_{dd}^2 f = V_{dd}^2 f \, \beta^{k-1} (\beta C_{in} + C_p)
  P_T = \sum_{k=1}^{n} P_k = V_{dd}^2 f \sum_{k=1}^{n} \beta^{k-1} (\beta C_{in} + C_p)
  C_L = \beta^n C_{in}, \qquad n_{opt} = \frac{\ln(C_L / C_{in})}{\ln \beta}
  T_d = \beta t_0 \, \frac{\ln(C_L / C_{in})}{\ln \beta}, \qquad \frac{\partial T_d}{\partial \beta} = 0 \;\Rightarrow\; \beta_{opt} = e \approx 2.72
• Delay analysis considering the parasitic capacitance C_p gives \beta = 2 \sim 10 (typical: e).
• Notation — C_k, P_k: total capacitance and power at the output of buffer stage k; P_T: power consumption of the whole buffer chain; P_n: power consumption of the load capacitance C_L; Eff: power efficiency P_n / P_T.
[Figure: chain of n inverters with sizes 1, \beta, ..., \beta^{n-1} driving input capacitances C_{in}, \beta C_{in}, ..., \beta^{n-1} C_{in} up to the load C_L = \beta^n C_{in}]
SungKyunKwan Univ.
VADA Lab.
318
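A minimal numeric sketch of the buffer-chain sizing above; C_in, C_L, t0, Vdd and f are assumed example values, not taken from the slides:

import math

C_in, C_L = 10e-15, 10e-12        # 10 fF input cap driving a 10 pF load
t0, Vdd, f = 50e-12, 3.3, 100e6   # unit-inverter delay, supply, frequency

alpha = math.e                                # optimum tapering factor
n = round(math.log(C_L / C_in))               # optimum number of stages, ln(CL/Cin)
sizes = [alpha ** k for k in range(n)]        # relative widths 1, alpha, ..., alpha^(n-1)
Td = n * alpha * t0                           # total delay of the chain
PT = Vdd**2 * f * alpha * C_in * (alpha**n - 1) / (alpha - 1)   # chain power, Cp neglected
Pn = Vdd**2 * f * C_L                         # power delivered to the load
print(n, [round(s, 1) for s in sizes], Td, Pn / PT)   # ~7 stages, total delay ~0.95 ns, efficiency ~0.6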
Slew Rate
•
Short-circuit current determined by the input rise/fall time
I_mean = (2/T) · [ ∫_{t1..t2} I_short(t) dt + ∫_{t2..t3} I_short(t) dt ]
       = (4/T) · ∫_{t1..t2} I_short(t) dt
       = (4/T) · ∫_{t1..t2} (β/2) · (V_in − V_t)² dt
where β_n = β_p = β and V_tn = −V_tp = V_t
P_SC = I_mean · Vdd = (β/12) · (Vdd − 2·V_t)³ · τ · f,   where t_r = t_f = τ
(a numeric sketch follows this slide)
[Figure: input waveform V_in over one period T with rise time t_r and fall time t_f; short-circuit current flows while V_tn < V_in < Vdd + V_tp, peaking at I_max around t2 with mean value I_mean between t1 and t3]
SungKyunKwan Univ.
VADA Lab.
319
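A minimal numeric example of the short-circuit power formula above; beta, Vdd, Vt, tau and f are assumed values, not taken from the slides:

beta = 100e-6       # A/V^2, combined transconductance factor (beta_n = beta_p)
Vdd, Vt = 3.3, 0.7  # volts
tau = 1e-9          # input rise/fall time, t_r = t_f = tau
f = 100e6           # switching frequency

P_sc = (beta / 12.0) * (Vdd - 2 * Vt) ** 3 * tau * f
print(P_sc)   # ~5.7 uW for these values; P_sc grows linearly with the slew time tau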
Slew Rate(Cont’d)
•
Power consumption of Short circuit current in Oscillation Circuit
[Figure: oscillation circuit and its Vi / Vo waveforms relative to Vdd, illustrating the short-circuit current drawn while both devices conduct during slow input transitions]
SungKyunKwan Univ.
VADA Lab.
320
Pass Transistor Logic
•
Reducing Area/Power
– Macro cells (a large part of the chip area): XOR/XNOR/MUX primitives → Pass-Tr. logic
– Does not use a full charge/discharge scheme → appropriate for low-power logic
•
Pass-Tr. logic family
– CPL (Complementary Pass Transistor Logic)
– DPL (Dual Pass Transistor Logic)
– SRPL (Swing Restored Pass Transistor Logic)
•
CPL
– Basic scheme: complementary nMOS pass-transistor network driven by A, B and their complements, producing AB and its complement
– Inverter buffering at the outputs, with a pMOS latch to Vdd for level restoration
[Figure: CPL AND/NAND cell, showing the nMOS pass network, the output inverters and the cross-coupled pMOS latch]
SungKyunKwan Univ.
VADA Lab.
321
Pass Transistor Logic(Cont’d)
•
DPL
– Pass-Tr. network + dual pMOS transistors
– Enables rail-to-rail swing
– Characteristics
  • increased input capacitance (delay)
  • increased driving ability, since there are always two ON paths
  • input loading capacitance comparable to CPL
•
SRPL
– Pass-Tr. network + cross-coupled inverters
– Restores the logic level
– The inverters must not be sized too large
[Figure: nMOS CPL network with cross-coupled output inverters producing AB and its complement]
SungKyunKwan Univ.
VADA Lab.
322
Dynamic Logic
•
Uses a precharge/evaluation scheme
•
Family
– Domino logic
– NORA (NO RAce) logic
•
Characteristics
– decreased input loading capacitance
– power consumed by the precharge clock
– useless switching increases during the precharge period
•
Basic architecture of Domino logic
[Figure: Domino stage, with a clocked pMOS precharge transistor P1 and an nMOS logic block (inputs A, B) above the footer transistor N1 driven by clk, driving the load CL and an output inverter toward the next stage]
SungKyunKwan Univ.
VADA Lab.
323
Input Pin Ordering
•
Reorder the equivalent inputs to a transistor stack based on critical-path delays and power consumption
•
N-input primitive CMOS logic
– symmetrical at the function level
– asymmetrical at the transistor level
  • capacitance of the output stage
  • body effect
•
Scheme
– A signal with many transitions should be placed far from the output node
– If the switching frequency is hard to estimate, determine the pin ordering by considering the paths and balancing the path delays from the primary inputs to the transistor inputs
•
Example of N-input CMOS logic
[Figure: 4-input nMOS stack with inputs A, B, C, D; output load CL and internal node capacitances C1, C2, C3]
Experiment with a TI gate array: for a 4-input NAND gate in TI's BiCMOS gate-array library (with a load of 13 inverters), the delay varies by 20% and the power dissipation by 10% between a good and a bad ordering.
SungKyunKwan Univ.
VADA Lab.
324
INPUT PIN Reordering
[Figure: 4-input NAND gate with pMOS transistors MPA-MPD in parallel and the nMOS stack MNA-MND in series; output load CL and internal node capacitances CB, CC, CD; input cases (a)-(d)]
Simulation result (tcycle = 50 ns, tf/tr = 1 ns):
– when A is the critical input: 38.4 uW
– when D is the critical input: 47.2 uW
SungKyunKwan Univ.
VADA Lab.
325
Sensitization
•
Definition
– sensitization : an input signal transition that forces an output transition event
– sensitization vector : the values of the other inputs when one signal is sensitized
∂Y/∂X_i = [f]_{Xi=0} ⊕ [f]_{Xi=1}
        = f(X_1, ..., X_{i-1}, 0, X_{i+1}, ..., X_n) ⊕ f(X_1, ..., X_{i-1}, 1, X_{i+1}, ..., X_n)
•
Example (a small sketch follows this slide)
Y = (X1 + X2) · X3
∂Y/∂X1 = [f]_{X1=0} ⊕ [f]_{X1=1} = X2·X3 ⊕ X3 = X2'·X3
SungKyunKwan Univ.
VADA Lab.
326
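A minimal sketch of the Boolean difference for the example above, evaluated by enumerating the remaining inputs:

from itertools import product

def f(x1, x2, x3):
    return (x1 | x2) & x3          # Y = (X1 + X2) · X3

# dY/dX1 = f|x1=0 XOR f|x1=1 : the (x2, x3) patterns for which a transition
# on X1 is propagated ("sensitized") to the output Y.
for x2, x3 in product([0, 1], repeat=2):
    diff = f(0, x2, x3) ^ f(1, x2, x3)
    print(f"x2={x2} x3={x3} -> dY/dX1 = {diff}")
# Only (x2=0, x3=1) sensitizes X1, matching dY/dX1 = X2'·X3.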
Sensitization(Cont’d)
•
Considering sensitization in combinational logic: remove unnecessary transitions in the combinational logic
•
Considering sensitization in sequential logic: also reduces the power consumption in the flip-flops
[Figure: combinational logic blocks with inputs X1 ... Xn driven through D flip-flops with enable E and output Y; enable logic derived from sensitization avoids clocking input values that cannot affect Y]
SungKyunKwan Univ.
VADA Lab.
327
TTL-Compatible
• TTL-level signal → CMOS input
• Characteristic curve of a CMOS inverter (Vdd = 3.3 V)
[Figure: TTL output driving a CMOS input buffer; the buffer threshold lies near 1.4 V (between VIL = 0.8 V and VIH = 2.0 V), and a TTL-high level below Vdd leaves both transistors partially conducting, drawing the static currents IDTTL1 and IDTTL2]
I_leak = avg(I_d1, I_d2)
P_TTL = N_TTL · Vdd · (I_DTTL1 + I_DTTL2)
where N_TTL : number of TTL-compatible input pads
SungKyunKwan Univ.
VADA Lab.
328
TTL Compatible(Cont’d)
•
CMOS output signal → TTL input
– Because of the sink current IOL, the CMOS output driver dissipates a large amount of heat
– Increased chip operating temperature
– Increased power consumption of the whole system
[Figure: CMOS output pad driving a TTL input across the chip boundary; the TTL load sinks IOL through the output driver at VOL]
SungKyunKwan Univ.
VADA Lab.
329
INPUT PIN Reordering
◈ To reduce the power dissipation one should place the input with low transition density near the ground end.
(a) If MNA turns off, only CL needs to be charged
(b) If MND turns off, all of CL, CB, CC and CD need to be charged
(c) If the critical input is rising and placed near the output node, the initial charge of CB, CC and CD is zero and the delay of discharging CL is less than in (d)
(d) If the critical input is rising and placed near the ground end, the charge of CB, CC and CD must discharge before the charge of CL can discharge to zero
SungKyunKwan Univ.
VADA Lab.
330
Low-Power Booth Multiplier Design
SungKyunKwan University
School of Electrical, Electronic and Computer Engineering
김 진 혁, 이 준 성, 조 준 동
SungKyunKwan Univ.
VADA Lab.
Modified Booth Multiplier
• Multibit recoding halves the number of partial products, enabling high-speed multiplication.
• Multiplicand : X, multiplier : Y
Recoded digit = Y_{2i-1} + Y_{2i} − 2·Y_{2i+1}   (Y_{-1} = 0)

Y2i+1  Y2i  Y2i-1 | Recoded digit | Operation on X
  0     0     0   |       0       |   0X
  0     0     1   |      +1       |  +1X
  0     1     0   |      +1       |  +1X
  0     1     1   |      +2       |  +2X
  1     0     0   |      -2       |  -2X
  1     0     1   |      -1       |  -1X
  1     1     0   |      -1       |  -1X
  1     1     1   |       0       |   0X

Recoded digit : operation on X
 0 : add 0 to the partial product
+1 : add X to the partial product
+2 : shift X left one position and add it to the partial product
-1 : add the two's complement of X to the partial product
-2 : take the two's complement of X and shift it left one position
< Generation and operation of recoded digits >
SungKyunKwan Univ.
VADA Lab.
Modified Booth Multiplier - Example
•
Example (a recoding sketch follows this slide)
(-107)  10010101 = X
(+105)  01101001 = Y
Partial products (sign-extended; each row is shifted left two positions relative to the previous one):   Bits recoded   Operation
  1111111110010101                                                                                          010            +1
  00000011010110                                                                                            100            -2
  000001101011                                                                                              101            -1
  1100101010                                                                                                011            +2
  1101010000011101 = P  (-11235)
SungKyunKwan Univ.
VADA Lab.
333
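A minimal sketch of the radix-4 recoding rule above (plain Python, not the hardware datapath); it reproduces the digits and the product of the worked example:

def booth_recode(y, nbits):
    """Radix-4 (modified Booth) recoding of an nbits-wide two's-complement multiplier.
    Returns digits in {-2, -1, 0, +1, +2}, least-significant digit first."""
    y &= (1 << nbits) - 1                                # two's-complement bit pattern
    bits = [0] + [(y >> i) & 1 for i in range(nbits)]    # prepend Y_{-1} = 0
    # digit_i = Y_{2i-1} + Y_{2i} - 2*Y_{2i+1}
    return [bits[i] + bits[i + 1] - 2 * bits[i + 2] for i in range(0, nbits, 2)]

X, Y, nbits = -107, 105, 8
digits = booth_recode(Y, nbits)
print(digits)                                            # [1, -2, -1, 2], as on the slide
product = sum(d * X * 4**i for i, d in enumerate(digits))
print(product)                                           # -11235 = X * Y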
Wallace Tree - 4:2 Compressor
[Figure (a): dot diagram of the partial-product array for X7..X0 × Y7..Y0 being reduced by 4:2 compressors in two stages, leaving two summands to be added (legend: zero, bit jumping a level, partial product, bit generated by a compressor)]
[Figure (b): block diagram, with two 4*8 partial-product generators (for X3..X0 and X7..X4, each fed by Y) followed by 8 4-2 compressors each (blocks A and B), a second stage of 11 4-2 compressors (block C), and a final 16-bit adder producing P15..P0]
SungKyunKwan Univ.
VADA Lab.
334
Multipliers - Area
•
16-bit Multiplier Area
Multiplier type  | Area (mm²) | Gate count
Array            |    4.2     |   2,378
Wallace          |    8.1     |   2,544
Modified Booth   |    8.5     |   3,375
SungKyunKwan Univ.
VADA Lab.
335
Multipliers - Power
•
Average Power Dissipation (16-bit)
Multiplier type  | Power (mW) | Logic transitions
Array            |   43.5     |   7,224
Wallace          |   32.0     |   3,793
Modified Booth   |   41.3     |   3,993
SungKyunKwan Univ.
VADA Lab.
336
Multipliers - Delay
•
Worst-Case Delay (16-bit)
Multiplier type  | Delay (ns) | Gate delays
Array            |   92.6     |    50
Wallace          |   54.1     |    35
Modified Booth   |   45.4     |    32
SungKyunKwan Univ.
VADA Lab.
337
Instruction Level Power Analysis
•
Estimate the power dissipation of instruction sequences and of a whole program (a small sketch of the model follows this slide)
•
Eb : base cost of the individual instructions
Es : circuit-state change effects
E_M = E_b + E_s
E_b = Σ_i B_i · N_i
E_s = Σ_{i,j} O_{i,j} · N_{i,j}
EM : the overall energy cost of a program
Bi : the base cost of a type-i instruction
Ni : the number of type-i instructions
Oi,j : the cost incurred when a type-i instruction is followed by a type-j instruction
Ni,j : the number of occurrences in which a type-i instruction is immediately followed by a type-j instruction
SungKyunKwan Univ.
VADA Lab.
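A minimal sketch of the instruction-level model above; the instruction trace and the cost tables are assumed example values, not the measured data from the RESULT slide:

from collections import Counter

# Assumed example costs in pJ.
base_cost = {"LOAD": 1.5, "ADD": 0.9, "SHIFT": 1.2}
state_cost = {("LOAD", "ADD"): 0.3, ("ADD", "SHIFT"): 0.4, ("SHIFT", "LOAD"): 0.5}

def program_energy(trace):
    """E_M = sum_i B_i*N_i + sum_{i,j} O_{i,j}*N_{i,j} for an instruction trace."""
    n_i = Counter(trace)                                   # N_i
    n_ij = Counter(zip(trace, trace[1:]))                  # N_{i,j}, adjacent pairs
    e_b = sum(base_cost[i] * n for i, n in n_i.items())
    e_s = sum(state_cost.get(pair, 0.0) * n for pair, n in n_ij.items())
    return e_b + e_s

trace = ["LOAD", "ADD", "SHIFT", "LOAD", "ADD"]
print(program_energy(trace))   # base energy plus inter-instruction overhead, in pJ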
Instruction ordering
•
Develop a technique of operand swapping
•
Recoding weight : the necessary operation cost of the operands
W_total = Σ_i W_i
W_i = W_b + W_inter
Wtotal : total recoding weight of an input operand
Wi : weight of the individual recoded digit i in the Booth multiplier
Wb : base weight of an instruction
Winter : inter-operation weight of instructions
•
Therefore, if an operand has the lower Wtotal, put it in the second input (the multiplier).
SungKyunKwan Univ.
VADA Lab.
RESULT
[Tables: instruction base cost [pJ] and circuit-state-change effects [pJ] when switching between LOAD, ADD, 2's complement and SHIFT, measured for the < 4 by 4 >, < 8 by 8 > and < 12 by 12 > multipliers; e.g. for the 12 by 12 multiplier the base costs are LOAD 4.81, ADD 2.83, 2's complement 2.55 and SHIFT 0.96 pJ]
SungKyunKwan Univ.
VADA Lab.
Conclusion
[Charts: (left) percentage of instruction instances with circuit-state effects and (right) average power [pJ] with and without considering circuit-state effects, for the 4-, 8- and 12-bit multipliers; operand ordering yields power reductions of 4.0%, 9.0% and 12.0% across the three designs]
SungKyunKwan Univ.
VADA Lab.
8. Layout Level Design
SungKyunKwan Univ.
VADA Lab.
342
Device Scaling of Factor of S
• Constant-scaled wire increases coupling capacitance by S and wire resistance by S
• Supply voltage by 1/S, threshold voltage by 1/S, current drive by 1/S
• Gate capacitance by 1/S, gate delay by 1/S
• Global interconnection delay (RC of load + parasitics) by S
• Interconnect delay: 50-70% of the clock cycle
• Area: 1/S²
• Power dissipation by 1/S - 1/S²
  (P = nCVdd²f, where nC is the sum of capacitance times #transitions)
• SIA (Semiconductor Industry Association): in 2007, physical limitation of 0.1 µm, 20 billion transistors, 10 square centimeters, 12- or 16-inch wafers
SungKyunKwan Univ.
VADA Lab.
343
Delay Variations at Low-Voltage
• At high supply voltage, the delay increases with temperature (mobility decreases with temperature), while at very low supply voltages the delay decreases with temperature (VT decreases with temperature).
• At low supply voltages, the delay ratio between large and minimum transistor widths W increases by several factors.
• Delay balancing of clock trees is based on wire snaking in order to avoid clock skew. In this case, at low supply voltages, slight VT variations can significantly modify the delay balancing.
SungKyunKwan Univ.
VADA Lab.
344
Quarter Micron Challenge
• Computers/peripherals (SOC): 1996 ($50 Billion) → 1999 ($70 Billion)
• Wiring dominates delay: wire R comparable to gate driver R; wire-to-wire coupling C > C to ground
• Push beyond 0.07 micron
• Quest for area (past), speed-speed (now), power-power-power (future)
• Accelerated increases of clock frequencies
• Signal-integrity-based tools
• Design styles (chip + packages)
• System-level design (system partitioning)
• Synthesis with multiple constraints (power, area, timing)
• Partitioning/MCM
• Increasing speed limits complicate clock and power distribution
• Design bounded by wires, vias, via resistance, coupling
• Reverse scaling: adding area/spacing as needed: widening and thickening of wires, metal shielding & noise avoidance - adding metal
SungKyunKwan Univ.
VADA Lab.
345
CLOCK POWER CONSUMPTION
• Clock power consumption is as large as the logic power; since the clock signal carries the heaviest load and switches at high frequency, clock distribution is a major source of power dissipation.
• In a microprocessor, 18% of the total power is consumed by clocking.
• Clock distribution is designed as a hierarchical clock tree, according to the decomposition principle.
SungKyunKwan Univ.
VADA Lab.
346
Power Consumption per block in
typical microprocessor
SungKyunKwan Univ.
VADA Lab.
347
Crosstalk
SungKyunKwan Univ.
VADA Lab.
348
Solution for Clock Skew
• Dynamic effects on skew
• Capacitance coupling
• Supply voltage deviation (clock driver and receiver voltage difference)
• Capacitance deviation caused by circuit operation
• Global and local temperature
• Layout issues: clocks routed first
• Must be aware of all sources of delay
• Increased spacing
• Wider wires
• Insert buffers
• Specialized clocks need net matching
• Two approaches: single driver, H-tree driver
• Gated clocks: local clocks that are conditionally enabled so that the registers are only clocked during the write cycles. The clock is partitioned into different blocks and each block is clocked with its own clock.
• Gating the clocks to infrequently used blocks does not provide an acceptable level of power savings
• Divide the basic clock frequency to provide the lowest clock frequency needed by different parts of the circuit
• Clock distribution: large clock buffers waste power. Use smaller clock buffers with a well-balanced clock tree.
SungKyunKwan Univ.
VADA Lab.
349
PowerPC Clocking Scheme
SungKyunKwan Univ.
VADA Lab.
350
CLOCK DRIVERS IN THE DEC ALPHA
21164
SungKyunKwan Univ.
VADA Lab.
351
DRIVER for PADS or LARGE CAPACITANCES
Off-chip power (drivers and pads) is increasing and is very difficult to reduce, as the pad and driver sizes cannot be decreased with the new technologies.
SungKyunKwan Univ.
VADA Lab.
352
Layout-Driven Resynthesis for Lower Power
SungKyunKwan Univ.
VADA Lab.
353
Low Power Process
• Dynamic Power Dissipation (a numeric sketch follows this slide)
P_d = α · C_L · Vdd² · f
I_ds = (β/2) · (V_gs − V_t)²
C_gate = C_ox · (W · L)
C_in = Σ_{j=1..m} (C_gate)_j
C_ov = C_GD0 · W
C_dj = C_j · AD + C_jsw · PD,   AD = W · D,   PD = 2 · (W + D)
[Figure: CMOS inverter driving Vo with the load contributions labeled: junction capacitances C_djp / C_djn, overlap capacitances C_ovp / C_ovn, and the drain geometry (width W, length D) defining the junction area AD (bottom capacitance C_jb) and perimeter PD (sidewall capacitance C_jsw)]
SungKyunKwan Univ.
VADA Lab.
354
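A minimal numeric sketch of the load-capacitance model above; all device parameters are assumed example values for a generic 0.35 µm-class process, not figures from the slides:

# Assumed per-area / per-width capacitances and sizes, for illustration only.
Cox   = 4.6e-15   # F/um^2, gate-oxide capacitance per area
CGD0  = 0.3e-15   # F/um,   gate-drain overlap capacitance per width
Cj    = 1.0e-15   # F/um^2, junction bottom capacitance per area
Cjsw  = 0.2e-15   # F/um,   junction sidewall capacitance per perimeter
W, L, D = 2.0, 0.35, 1.0          # um: device width, channel length, drain extension
fanout = 3
Vdd, f, alpha = 3.3, 100e6, 0.5   # supply, clock frequency, switching activity

C_gate = Cox * W * L                       # gate capacitance of one fanout input
C_in   = fanout * C_gate                   # input capacitance of the fanout gates
C_ov   = CGD0 * W                          # overlap capacitance
AD, PD = W * D, 2 * (W + D)                # drain area and perimeter
C_dj   = Cj * AD + Cjsw * PD               # drain junction capacitance
C_L    = C_in + 2 * (C_ov + C_dj)          # total load: fanout plus both driver drains
P_d    = alpha * C_L * Vdd**2 * f          # dynamic power of this node
print(C_L, P_d)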
Crosstalk
•
•
•
In deep-submicron layouts, some of the net lengths for connections between modules can be so long that their resistance is comparable to the resistance of the driver.
Each net in a mixed analog/digital circuit is classified by its crosstalk sensitivity:
– 1. Noisy = high-impedance signals that can disturb other signals, e.g., clock signals.
– 2. High-Sensitivity = high-impedance analog nets; the most noise-sensitive nets, such as the input nets to operational amplifiers.
– 3. Mid-Sensitivity = low/medium-impedance analog nets.
– 4. Low-Sensitivity = digital nets that directly affect the analog part in some cells, such as control signals.
– 5. Non-Sensitivity = the most noise-insensitive nets, such as pure digital nets.
The crosstalk between two interconnection wires also depends on the frequencies (i.e., signal activities) of the signals traveling on the wires. Recently, deep-submicron designs require crosstalk-free channel routing.
355
SungKyunKwan Univ.
VADA Lab.
Power Measure in Layout
•
The average dynamic power consumed by a CMOS gate is given below, where C_l is the load capacitance at the output of the node, V_dd is the supply voltage, T_cycle is the global clock period, N is the number of transitions of the gate output per clock cycle, C_g is the load capacitance due to the input capacitance of the fanout gates, and C_w is the load capacitance due to the interconnection tree formed between the driver and its fanout gates.
•
P_av = (0.5 · V_dd² · C_l · N) / T_cycle = (0.5 · V_dd² · (C_g + C_w) · N) / T_cycle
•
Logic synthesis for low power attempts to minimize Σ_i C_gi · N_i
•
Physical design for low power tries to minimize Σ_i C_wi · N_i
•
Here C_wi consists of C_xi + C_si, where C_xi is the capacitance of net i due to its crosstalk and C_si is the substrate capacitance of net i. For low-power layout applications, power dissipation due to crosstalk is minimized by ensuring that wires carrying high-activity signals are placed sufficiently far from the other wires. Similarly, power dissipation due to substrate capacitance is proportional to the wirelength and its signal activity. (A small sketch follows this slide.)
SungKyunKwan Univ.
VADA Lab.
356
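A minimal sketch of the layout-level power measure above; all per-net values are assumed examples, not measured data:

Vdd, Tcycle = 3.3, 10e-9          # 100 MHz clock

# Per-net data: (C_g, C_x, C_s, N) = fanout gate cap, crosstalk cap, substrate cap [F],
# and transitions per cycle.  These numbers are illustrative only.
nets = [
    (12e-15, 2e-15, 6e-15, 0.8),   # high-activity net
    (20e-15, 1e-15, 9e-15, 0.1),   # low-activity net
]

p_total = 0.0
layout_objective = 0.0
for Cg, Cx, Cs, N in nets:
    Cw = Cx + Cs                                    # wiring cap = crosstalk + substrate
    p_total += 0.5 * Vdd**2 * (Cg + Cw) * N / Tcycle
    layout_objective += Cw * N                      # the quantity physical design tries to minimize

print(p_total, layout_objective)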
Low-Power Layout Design Using Dual Supply Voltages
SungKyunKwan University
School of Electrical, Electronic and Computer Engineering
김 진 혁, 이 준 성, 조 준 동
SungKyunKwan Univ.
VADA Lab.
Contents
• Research objective
• Research background
• Clustered Voltage Scaling structure
• Row by Row Power Supply structure
• Mix-And-Match Power Supply structure
• Level Converter structure
• Mix-And-Match Power Supply design flow
• Experimental results
• Conclusion
SungKyunKwan Univ.
VADA Lab.
Research Objective and Background
•
Propose a dual-supply-voltage layout technique that reduces the power consumption of combinational circuits
•
When dual-voltage cells are used, reduce the extra wiring and the number of tracks caused by having to place cells of the same supply voltage in one cell row
•
Implement a Level Converter circuit that uses the minimum number of transistors
•
Apply Clustered Voltage Scaling [Usami, '95], which uses dual supply voltages while maintaining device performance
•
The proposed Mix-And-Match Power Supply layout structure improves on the existing Row by Row Power Supply [Usami, '97] layout structure, reducing both power and area
SungKyunKwan Univ.
VADA Lab.
359
Clustered Voltage Scaling
• Generate a low-power netlist
• Slack(Si) = Ri - Ai
[Figure: gate-level netlist G1-G11 between flip-flops, with each gate annotated by the sign of its slack Si; gates with positive slack are assigned the low supply VDDL, gates with negative slack keep VDDH, and level converters LC1, LC2 are inserted where VDDL signals must drive VDDH logic or flip-flops]
SungKyunKwan Univ.
VADA Lab.
Row by Row Power Supply Structure
[Figure: standard-cell module in which each cell row is supplied by a single voltage; whole rows are either VDDL or VDDH (sharing VSS), so VDDL cells and VDDH cells must be placed in separate rows]
SungKyunKwan Univ.
VADA Lab.
Mix-And-Match Power Supply Structure
[Figure: standard-cell module in which both VDDL and VDDH rails run along every cell row together with VSS, so VDDL cells and VDDH cells can be mixed within the same row]
SungKyunKwan Univ.
VADA Lab.
Structure Comparison
[Figure: supply-rail organization of a module in the conventional single-voltage circuit, in the RRPS structure (VDDL/VDDH assigned row by row) and in the MAMPS structure (VDDL and VDDH mixed within rows)]
SungKyunKwan Univ.
VADA Lab.
363
Level Converter Structure
• Number of transistors: 6 → 4
• More effective in terms of both power and area
[Figure: conventional 6-transistor level converter (Vth = 1.5 V) versus the proposed 4-transistor level converter (Vth = 2.0 V), converting a VDDL-swing input IN to a VDDH-swing output OUT]
SungKyunKwan Univ.
VADA Lab.
Mix-And-Match Power Supply
Design Flow
Single-voltage netlist
→ Multiple voltage scaling: netlist with multiple supply voltages (OPUS)
→ Assign a supply voltage to each cell
→ Physical placement (Aquarius XO)
→ Routing
→ Timing, power and area analysis (PowerMill)
SungKyunKwan Univ.
VADA Lab.
Experimental Results
[Charts: total area and total power of the conventional single-voltage circuit (100%), the RRPS structure and the proposed MAMPS structure; MAMPS shows roughly 10% lower area and 2% lower power than RRPS]
SungKyunKwan Univ.
VADA Lab.
Conclusion
• Compared with the single-voltage circuit, a 49.4% power reduction was obtained at the cost of a 5.6% area overhead
• 10% area reduction and 2% power reduction compared with the existing RRPS structure
• The proposed Level Converter achieves a 30% area reduction and a 35% power reduction compared with the conventional Level Converter
SungKyunKwan Univ.
VADA Lab.
9. CAD tools
SungKyunKwan Univ.
VADA Lab.
368
Low Power Design Tools
•
Transistor Level Tools (5-10% of silicon)
– SPICE, PowerMill(Epic), ADM(Avanti/Anagram), Lsim Power Analyst(mentor)
•
Logic Level Tools (10-15%)
– Design Power and PowerGate (Synopsys), WattWatcher/Gate (Sente), PowerSim
(System Sciences), POET (Viewlogic), and QuickPower (Mentor)
•
Architectural (RTL) Level Tools (20-25%)
– WattWatcher/Architect (Sente): 20-25% accuracy
•
Behavioral (spreadsheet) Level Tools (50-100%)
– Active area of academic research
SungKyunKwan Univ.
VADA Lab.
369
Commercial synthesis systems
SungKyunKwan Univ.
VADA Lab.
370
Research synthesis systems
A - Architectural synthesis.
L - Logic synthesis.
SungKyunKwan Univ.
VADA Lab.
371
Low-Power CAD sites
• Alternative System Concepts, Inc.: 7X power reduction through optimization; contact http://www.ee.princeton.edu and Jake Karrfalt at [email protected] or (603) 437-2234. Reduction of glitch and clock power; modeling and optimization of interconnect power; power optimization for data-dominated designs with limited control flow.
• Mentor Graphics QuickPower: hierarchical approach to determining the overall benefit of exchanging blocks for lower power; powering down or disabling blocks when not in use via gated clocks; choosing candidates for power-down and calculating the effect of the power-down logic. http://www.mentorg.com
• Synopsys's Power Compiler: http://www.synopsys.com/products/power/power_ds
• Sente's WattWatcher/Architect (the first commercial tool operating at the architecture level, 20-25% accuracy). http://www.powereda.com
• Behavioral tools: Hyper-LP (optimization), Explore (estimation) by J. Rabaey
SungKyunKwan Univ.
VADA Lab.
372
Design Power(Synopsys)
•
•
•
DesignPower(TM) provides a single, integrated environment for power
analysis in multiple phases of the design process:
–
Early, quick feedback at the HDL or gate level through probabilistic
analysis.
–
Improved accuracy through simulation-based analysis for gate level
and library exploration.
DesignPower estimates switching, internal cell and leakage power. It accepts
user-defined probabilities, simulation toggle data or a combination of both as
input. DesignPower propagates switching information through sequential
devices, including flip-flops and latches.
It supports sequential, hierarchical, gated-clock, and multiple-clock designs.
For simulation toggle data, it links directly to Verilog and VHDL simulators,
including Synopsys' VSS.
SungKyunKwan Univ.
VADA Lab.
373
10. References
SungKyunKwan Univ.
VADA Lab.
374
References
[1] Gary K. Yeap, "Practical Low Power Digital VLSI Design",
Kluwer Academic Publishers.
[2] Jan M. Rabaey, Massoud Pedram, "Low Power Design Methodologies",
Kluwer Academic Publishers.
[3] Abdellatif Bellaouar, Mohamed I. Elmasry, "Low-Power Digital VLSI Design
Circuits And Systems", Kluwer Academic Publishers.
[4] Anantha P. Chandrakasan, Robert W. Brodersen, "Low Power Digital CMOS
Design", Kluwer Academic Publishers.
[5] Dr. Ralph Cavin, Dr. Wentai Liu, "1996 Emerging Technologies : Designing
Low Power Digital Systems"
[6] Muhammad S. Elrabaa, Issam S. Abu-Khater, Mohamed I. Elmasry,
"Advanced Low-Power Digital Circuit Techniques",
Kluwer Academic Publishers.
SungKyunKwan Univ.
VADA Lab.
375
References
• [BFKea94] R. Bechade, R. Flaker, B. Kaumann, et al. "A 32b 66 MHz 1.8W Microprocessor". In IEEE Int. Solid-State Circuits Conference, pages 208-209, 1994.
• [BM95] M. T. Bohr. "Interconnect Scaling - The Real Limiter to High Performance ULSI". In Proceedings of the 1995 IEEE International Electron Devices Meeting, pages 241-242, 1995.
• [BSM94] L. Benini, P. Siegel, and G. De Micheli. "Saving Power by Synthesizing Gated Clocks for Sequential Circuits". IEEE Design and Test of Computers, 11(4):32-41, 1994.
• [GH95] S. Ganguly and S. Hojat. "Clock Distribution Design and Verification for PowerPC Microprocessors". In International Conference on Computer-Aided Design, 1995.
• [MGR96] R. Mehra, L. M. Guerra, and J. Rabaey. "Low Power Architecture Synthesis and the Impact of Exploiting Locality". In Journal of VLSI Signal Processing, 1996.
SungKyunKwan Univ.
VADA Lab.
376