lp1 - VADA Lab. Introduction

Download Report

Transcript lp1 - VADA Lab. Introduction

Lower Power Design Guide
1998. 6.7
성균관대학교 조 준 동 교수
http://vlsicad.skku.ac.kr
Contents
1. Intoduction
Trends for High-Level Lower Power Design
2. Power Management
Clock/Cache/Memory Management
3. Architecture Level Design
Architecture Trade offs, Transformation
4. RTL Level Design
Retiming, Loop-Unrolling, Clock Selection, Scheduling, Resource Sharing,
Register Allocation
5. partitioning
6. Logic Level Design
7. Circuit Level Design
8. Quarter Sub Micron Layout Design
Lower Power Clock Designs
9. CAD tools
10. References
1. Introduction
Motivation
•
•
•
•
Portable Mobile (=ubiquitous
=nomadic)
Systems with limited for heat
sinks
Lowering power with fixed
performance: DSPs in modems
and cellular phones
Reliability: Increasing power !
increasing electromigration, 40year reliability guarantee
(product life cycle of
telecommunication industries)
•
•
•
•
Adding fans to reduce power
cause reliability to plummet.
Higher power leads to higher
packaging costs: 2-watt
package can be four times
greater than a 1-watt package
Myriad Constraints: timing,
power, testability, area,
packaging, time-to-market.
Ad-Hoc Design: Lack a
systematic process leading to
universal applicability.
Power!Power!Power!
Power Dissipation in VLSI’s
I/O
I/O
memory
clock
MPU1
clock
memory
MPU1
I/O
ASSP1
memory
logic
MPU1: low-end microprocessor for embedded use
MPU2: high-end CPU with large amount of cache
ASSP1: MPEG2 decoder
ASSP2: ATM switch
clock
clock
logic
ASSP2
I/O
logic
memory
Current Design Issues in Lower Power Problem
Energy-hungry Function by Network
Server:
•
Infopad (univ. of California,
Berkeley), weight < 1 pound,
•
0.5W (reflective color display) +
0.5W (computation,communication,
I/O support) = 1W (Alpha chip: 25W
StrongARM: 215 MHz at 2.0V:0.3W)
•
runtime 50 hours, target:
100MIPS/mW.
•
Deep-sub micron (0.35 - 0.18) with
low voltage for portable full motion
video terminal; 0:5m : 40 AA NiMH;
1m : 1 AA NiMH
•
•
•
•
System-On-A-Chip to reduce
external Interconnection
Capacitances
Power Management: shut down
idle units
Power Optimization Techniques
in Software,
Architecture,Logic/Circuit,
Layout Phases to reduce
operations, frequency,
capacitance, switching activity
with maintaining the same
throughput.
Battery Trends
Road-Map in Semiconductor
Device Integration
Road-Map in Semiconductor Device Complexity
Power Component
•
•
Static: Leakage current(<< 1%)
Dynamic:
–
–
Short Circuit power(10-30%): Short
circuit ow during transitions,
Switching (or capacitive) power(7090%): Charging/discharging of
capacitive loads during transitions
Vdd vs Delay
•use architecture optimization to compensate for slower
operation, e.g., Parallel Processing and Pipelining for concurrent
increasing and critical path reducing.
•Scale down device sizes to compensate for delay (Interconnects
do not scale proportionately and can become dominant)
Good Design Methodologies
Synthesis and Optimization
Pareto point
2. Power Management
Power Consumption in Multimedia Systems
•
•
LCD: 54.1%, HDD 16.8%, CPU
10.7%, VGA/VRAM 9.6%, SysLogic
4.5%, DRAM 1.1%, Others: 3.2%
5-55 Mode:
–
–
•
Display mode: CPU is in sleepmode (55 minutes), LCD (VRAM +
LCDC)
CPU mode: Display is idle ( 5
minutes), Looking up - data retrival
Handwrite recognition - biggest
power (memory, system bus active)
Power Management
•
DPM
(Dynamic Power Management):
stops the clock switching of a
specific unit generated by
clock generators. The clock
regenerators produce two
clocks, C1 and C2 . The logic:
0.3%, 10-20% of power savings.
•
•
•
SPM
(Static Power Management):
saving of the power dissipation
in the steady mode. When the
system (or subsystem)
remains idle for a significant
period time, then the entire
chip
(or subsystem) is shut-down.
Identify power hungry modules
and look for opportunities to
reduce power
If f is increased, one has to
increase the transistor size or
Vdd.
•
•
•
•
Power
Management([email protected])
use right supply and right frequency to each part of the system If
one has to wait on the occurence of some input, only a small
circuit could wait and wake-up the main circuit when the input
occurs.
Another technique is to reduce the basic frequency for tasks that can
be executed slowly.
PowerPC 603 is a 2-issue (2 instructions read at a time) with 5 parallel
execution units. 4 modes:
– Full on mode for full speed
– Doze mode in which the execution units are not running
– Nap mode which also stops the bus clocking and the Sleep mode which
stops the clock generator
– Sleep mode which stops the clock generator with or without the PLL
(20-100mW).
•
Superpipelined MIPS R4200 : 5-stage pipleline, MIPS R4400: 8 stage,
2 execution units, f/2 in reduce mode.
TI
•
•
•
•
•
•
•
Two DSPs: TMS320C541, TMS320C542 reduce power and chip count and
system cost for wireless communication applications
C54X DSPs, 2.7V, 5V, Low-Power Enhanced Architecture DSP (LEAD) family:
Three different power down modes, these devices are well-suited for wireless
communications products such as digital cellular phones, personal digital
assistants, and wireless modem,low power on voice coding and decoding
The TMS320LC548 features:
– 15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
– 3.0- and 3.3-V operation
32K 16-bit words of RAM and 2K 16-bit words of boot ROM on-chip
Integrated Viterbi accelerator that reduces Viterbi butterfly update in four
instruction cycles for GSM channel decoding
Powerful single-cycle instructions (dual operand, parallel instructions, conditional
instructions)
Low-power standby modes
Power Estimation Techniques
•
•
•
•
Circuit Simulation (SPICE): a set of input vectors, accurate, memory
and time constraints
Monte Carlo: randomly generated input patterns, normal distributed
power per time interval T using a simulator switch level simulation
(IRSIM): defined as no. of rising and falling transitions over total
number of inputs
Powermill (transistor level): steady-state transitions, hazards and
glitches, transient short circuit current and leakage current; measures
current density and voltage drop in the power net and identifies
reliability problem caused by EM failures, ground bounce and
excessive voltage drops.
DesignPower (Synopsys): simulation-based analysis is within 8-15%
of SPICE in terms of percentage difference (Probability-based analysis
is within 15-20% of SPICE).
Cache/Memory Management
•
•
•
•
•
•
•
Clock and memory consumes between 15% to 45% of the total power in digital
computers
As block size increases, the energy required to service miss increases due to
increased memory access external-memory access (530 mA) vs. on-chip
access(300mA): Replacing excessive accesses to background memory by
foreground memory
Cache vertical partitioning (buffering): multi-level variable-size caches
Caches are powerdown when idle.
Cache horizontal partitioning (subarray access): several segments can be
powered individually. Only the cache sub-bank where the requested data is
located consumes power in each cache access.
Using distributed memory instead of a single centralized memory
Locality of reference to eliminate expensive data transfer across high
capacitance busses
Cache misses consume more energy (directed-mapping or k-associated
mapping?), page faults consume more energy
Power Management
•
Block Power Management (Sleep,
standby mode) Scheme by
Enabling Clock
•
Clock Power Management Scheme
by adding Clock Generation block
enable 1
block 1
block 1
clock management
enable 1
enable 2
block 1
clk
block 1
clk
enable 2
enable 3
enable 3
block 1
block 1
3. Architectural Level Design
Architectural-level Synthesis
• Translate HDL models into sequencing graphs.
• Behavioral-level optimization:
– Optimize abstract models independently from the
implementation parameters.
•
Architectural synthesis and optimization:
– Create macroscopic structure:
• data-path and control-unit.
– Consider area and delay information
• Hardware compilation:
– Compile HDL model into sequencing graph.
– Optimize sequencing graph.
– Generate gate-level interconnection for a cell library. of the
implementation.
Power Measure of P
System-Level Solutions
•
•
•
•
•
Spatial locality: an algorithm can be partitioned into natural clusters
based on connectivity
Temporal locality: average lifetimes of variables (less temporal
storage, probability of future accesses referenced in the recent past).
Precompute physical capacitance of Interconnect and switching
activity (number of bus accesses)
Architecture-Driven Voltage Scaling: Choose more parallel
architecture
Supply Voltage Scaling : Lowering V dd reduces energy, but
increase delays
Software Power Issues
•
•
•
•
•
•
•
•
•
Upto 40% of the on-chip power is dissipated on the buses !
System Software : OS, BIOS, Compilers
Software can affect energy consumption at various levels InterInstruction Effects
Energy cost of instruction varies depending on previous instruction
For example, XORBX 1; ADDAX DX;
Iest = (319:2+313:6)=2 = 316:4mA Iobs =323:2mA
The difference defined as circuit state overhead
Need to specify overhead as a function of pairs of instructions
Due to pipeline stalls, cache misses
Instruction reordering to improve cache hit ratio
Avoiding Wastful Computation
•
•
•
•
•
•
Preservation of data correlation
Distributed computing / locality of reference
Application-specific processing
Demand-driven operation
Bus-Inverted Coding
Transformation for memory size reduction
–
–
Consider arrays A and C are already available in memory
When A is consumed another array B is generated; when C is consumed a
scalar value D is produced.
– Memory Size can be reduced by executing the j loop before the i loop so
that C is consumed before B is generated and the same memory space
can be used for both arrays.
Avoiding Wastful Computation
Architecture Lower Power Design
• Optimum Supply Voltage Architecture through Hardware
Duplication (Trading Area for Lower Power) and/or
Pipelining
– complex and fewer instruction requires less encoding, but larger
decode logic!
• Use small complex instruction with smaller instruction
length (e.g., Hitachi SH: 16-bit fixed-length, arithmetic
instruction uses only two operands, NEC V800: variable-length
instruction decoding overhead )
• Superscalar: CPI < 1: parallel instruction execution. VLIW
architecture.
Variable Supply Voltage Block Diagram
•
•
•
Computational work varies with
time. An approach to reduce
the energy consumption of
such systems beyond shut
down involves the dynamic
adjustment of supply voltage
based on computational
workload.
The basic idea is to lower
power supply when the a
fixed supply for some
fraction of time.
The supply voltage and
clock rate are increased
during high workload period.
Power Reduction using Variable Supply
•Circuits with a fixed supply voltage
work at a fixed speed and idle if the
data sample requires less than the
maximum amount of computation.
Power is reduced in a linear fashion
since the energy per operation is
fixed.
• If the work load for a given
sample period is less than peak,
then the delay of the processing
element can be increased by a
factor of 1/workload without loss
in throughput, allowing the
processor to operate at a lower
supply voltage. Thus, energy per
operation varies.
Data Driven Signal Processing
The basic idea of averaging two
samples are buffered and their work
loads are averaged.
The averaged workload is then
used as the effective workload to
drive the power supply.
Using a pingpong buffering
scheme, data samples In +2, In +3
are being buffered while In, In +1
are being processed.
Architecture of Microcoded Instruction Set
Processor
Power and Area
1.5V and 10MHz clock rate: instruction and data memory accesses
account for 47% of the total power consumption.
Datapath Parallelization
Memory Parallelization
At first order P= C * f/2 * Vdd2
Pipelined Micro-P
Architecture Trade-Off
Ppipeline =
(1.15C)( 0.58V)2 (f)
= 0.39P
NON-PIPLELINED Implementation
Pparallel =
(2.15C)(0.58V)2 (0.5f)
= 0.36P
PIPLELINED Implementation
Through WAVE PIPELINING
Different Classes of RISC Micro-P
Application Specific Coprocessor
•
•
•
•
•
•
DSP's are increasingly called upon to perform tasks for which they are
not ideally suited, for example, Viterbi decoding.
They may also take considerably more energy than a custom solution.
Use the DSP for portions of algorithms for which it is well suited, and
craft an application-specic coprocessor (i.e., custom hardware) for
other tasks.
This is an example of the difference between power and energy
The application-specific coprocessor may actually consume a more
power than the DSP, but it may be able to accomplish the same task
in far less time, resulting in a net energy savings.
Power consumption varies dramatically with the instruction being
executed.
Clock per Instruction (CPI)
SUPERPIPELINE micro-P
VLIW Architecture
Compiler takes the responsibility for finding the
operations that can be issued in parallel and
creating a single very long instruction
containing these operations. VLIW instruction
decoding is easier than superscalar instruction
due to the fixed format and to no instruction
dependency.
The fixed format could present more limitations
to the combination of operations.
Intel P6: CISC instructions are combined on
chip to provide a set of micro-operations
(i.e., long instruction word) that can be
executed in parallel.
As power becomes a major issue in the
design of fast -Pro, the simple is the better
architecture.
VLIW architecture, as they are simpler than
N-issue machines, could be considered as
promising architectures to achieve
simultaneously
high-speed and low-power.
Synchronous VS. Asynchronous
SYSTEMS
•
•
Synchronous system: A signal path starts from a clocked flip- flop
through combinational gates and ends at another clocked flip- flop. The
clock signals do not participate in computation but are required for
synchronizing purposes. With advancement in technology, the systems
tend to get bigger and bigger, and as a result the delay on the clock
wires can no longer be ignored. The problem of clock skew is thus
becoming a bottleneck for many system designers. Many gates switch
unnecessarily just because they are connected to the clock, and not
because they have to process new inputs. The biggest gate is the clock
driver itself which must switch.
Asynchronous system (self-timed): an input signal (request) starts the
computation on a module and an output signal (acknowledge) signifies
the completion of the computation and the availability of the requested
data. Asynchronous systems are potentially response to transitions on
any of their inputs at anytime, since they have no clock with which to
sample their inputs.
Asynchronous SYSTEMS
• More difficult to implement, requiring explicit synchronization
between communication blocks without clocks
• If the signal feeds directly to conventional gate-level circuitry,
invalid logic levels could propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous
designs, may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used, designers can't find
the supporting design tools and methodologies they need.
• DCC Error Corrector of Compact cassette player saves power
of 80% as compared to the synchronous counterpart.
• Offers more architectural options/freedom encourages
distributed, localized control offers more freedom to adapt the
supply voltage
Asynchronous Modules
Example: ABCS protocol
6% more logics
Control Synthesis Flow
PIPELINED SELF-TIMED micro P
Programming Style
Speed vs. Power Optimization
VON NEUMANN VERSUS HARVARD
Low Vdd Main Memories
CACHE MEMORIES
Low Power Memory
•
•
•
Hierarchical Word Line: Divide the memory in different blocks and access the bit cells
of the desired block
Selective precharge: Many bit lines are discharged even when these locations are not
accessed. Only bit lines which will be accesses are precharged.
Minimization of Non-zero Terms in the ROM table: Zero terms do not switch bit
lines and reduce capacitance in both bit lines and row lines.
–
–
–
–
•
Inverted ROM: If the number of ones is very high, the whole ROM core can be inverted.
Inverted Row: A given row is inverted if more than half of the bits are non-zero terms. An
extra bit is required to perfoem encoding.
Sign magnitude representation: ROM is used to store the coefficients of a digital filter. As a
result, a significant amount of the non-zero terms are due to the sign extension of the negative
coefficients. The main drawback of this type is that a conversion to two’s complement is
required at the end of a cycle, which slows down the ROM.
Sign magnitude and inverted block:
Difference Encoding: reduce the size of the ROM core. If the value between adjacent
data do not change significantly, the ROM core stores the difference between the data.
Low Power Memory
• Smaller ROMS: in 102 tap filter, more than 70% of the coefficients are
below 18 bits. Still the largest coefficients are below 18 bits. Still the
largest coefficient goes up 24 bits. A better implementation can be
achieved if the large coefficients are stored in a wide ROM with fewer
address; the small coefficients are stored in narrow ROM with many
addresses. A similar approach is applied for locations in ROM which are
often accessed. Loations that are accesses frequently are stored in a small,
fast ROM, while the other locations are stored in a larger ROM.
• NMOS precharge: bit lines are precharged to Vdd - Vt. A drawback of
this technique is degradation of noise margins and the body bias effect.
• Buffer Sizing: a large set of buffers is required in the control logic to drive
the address lines through the decoder, generate the contol signals for the
column multiplexers, drive the row lines and drive the precharge signals.
• Voltage scaling:
CV
2C V
Tdelay 
L dd
I

L dd
Cox (W / L)(Vdd  Vt ) 2
Memory Architecture
Exploiting Locality for Low-Power Design
•A spatially local cluster: group of algorithm operations that are tightly
connected to each other in the flow graph representation.
• Two nodes are tightly connected to each other on the flow graph representation
if the shortest distance between them, in terms of number of edges traversed, is low.
•Power consumption (mW) in
the maximally time-shared
and fully-parallel versions of
the QMF sub-band coder filter
• Improvement of a factor of
10.5 at the expense of a 20%
increase in area
• The interconnect elements
(buses, multiplexers, and
buffers) consumes 43% and
28% of the total power in
the time-shared and parallel
versions.
Cascade filter layouts
(a)Non-local implementation from Hyper (b)Local implementation from
Hyper-LP
Stage-Skip Pipeline
•The power savings
is achieved by
stopping the
instruction fetch and
decode stages of the
processor during
the loop execution
except its first iteration.
•DIB = Decoded
Instruction Buffer
• 40 % power savings
using DSP or RISC
processor.
Stage-Skip Pipeline
•Selector: selects the
output from either the
instruction decoder or DIB
• The decoded instruction
signals for a loop are
temporarily stored in the
DIB and are reused in each
iteration of the loop.
•The power wasted in the
conventional pipeline is
saved in our pipeline by
stopping the instruction
fetching and decoding for
each loop execution.
Stage-Skip Pipeline
Majority of
execution
cycles in signal
processing
programs are
used for loop
execution :
40% reduction
in power with
area increase
2%.
Parallel LIFO Scenario
Parallel-serial Converter
D- flip- flop Parallelization
State Machine
Frequency Multipliers and Dividers
Data Reuse Exploration
• MH(memory hierarchy) introduces copies of data from
larger to smaller memories in DFG.
• Power consumption is decreased because data is now read
mostly from smaller memories, while it is increased
because extra memory transfers are introduced.
• Moreover, adding another layer of hierarchy has a negative
effect on the area and interconnect cost.
State/Instruction Encoding
•
Architecture of Control Logic in
Microprocessor
– State Transition Diagram
S0
e
– Binary Code Mapping
– Hardware Implementation
primary input
S1
primary output
Combinational
Logic
S2
S3
S4
Sn
present state
next state
state register
If e has higher switching prob. (e.g., S0 =branch,
S1=compare), then encode S0 and S1 with gray
code style.
Optimizing Power using Transformation
INPUT FLOWGRAPH
LOCAL TRANSFORMATION
PRIMITIVES
Associativity,
Distributivity,
Retiming,
Common Sub-expression
OUTPUT FLOWGRAPH
SEARCH MECHANISM
simulated Rejectionless,
Steepest Decent,
Heuristics
POWER
ESTIMATION
GLOBAL
TRANSFORMATION
PRIMITIVES
Retiming,
Pipelining,
Look-Ahead,
Associativity
Summary of Results
EXAMPLE
POWER
REDUCTION
AREA
INCREASE
OPTIMUM
VOLTAGE
FIR11
11
1.1
1.5V
DCT
8
5
1.5V
IIR7
7.5
6.4
1.4V
VOLTERRA2
8.6
1
1.7V
Optimum voltage for low-power is around 1.5V
Data- flow based transformations
•
•
•
•
•
•
Tree Height reduction.
Constant and variable propagation.
Common subexpression elimination.
Code motion
Dead-code elimination
The application of algebraic laws such as commutability,
distributivity and associativity.
• Most of the parallelism in an algorithm is embodied in the loops.
• Loop jamming, partial and complete loop unrolling, strength
reduction and loop retiming and software pipelining.
• Retiming: maximize the resource utilization.
Tree-height reduction
•Example of tree-height
• Example of tree-height
reduction using
reduction with distributivity
commutativity and
associativity
Sub-expression elimination
• Logic expressions:
– Performed by logic optimization.
– Kernel-based methods.
•
Arithmetic expressions:
–
–
–
–
Search isomorphic patterns in the parse trees.
Example:
a= x+ y; b = a+ 1; c = x+ y;
a= x+ y; b = a+ 1; c = a;
Examples of other transformations
•
Dead-code elimination:
– a= x; b = x+ 1; c = 2 * x;
– a= x; can be removed if not referenced.
•
Operator-strength reduction:
– a= x2 ; b = 3 * x;
– a= x * x; t = x<<1; b = x+ t;
•
Code motion:
– for ( i = 1; i < a * b) { }
– t = a * b; for ( i = 1; i < t) { }
Strength reduction
X
X
**
X
X
+
*
A
+
A
+*
X
B
X2 + AX + B
+
B
X(X + A) + B
X
X
X
+
*
+
*
*
X
*
A
+
X
A
+
*
C
+
*
X
+
+
B
X
+
+
* +
B
Strength Reduction
Control- flow based transformations
•
Model expansion.
– Expand subroutine flatten
hierarchy.
– Useful to expand scope of other
optimization techniques.
– Problematic when routine is
called more than once.
– Example:
– x= a+ b; y= a * b; z = foo( x, y) ;
– foo( p, q) {t =q-p; return(t);}
– By expanding foo:
– x= a+ b; y= a * b; z = y-x;
• Conditional expansion
• Transform conditional into
parallel execution with test at the
end.
• Useful when test depends on
late signals.
• May preclude hardware sharing.
• Always useful for logic
expressions.
• Example:
•y= ab; if ( a) x= b+d; else x= bd;
can be expanded to: x= a( b+ d) +
a’bd;
•y= ab; x= y+ d( a+ b);
Pipelining
Associativity Transformation
FIR Parallelization
FIR PARALLELIZATION
FIR Filter Parallelization
FIR parallelization: two working phases
IIR filter recursive function
Recursive Function
Interlaced Accumulation Programming for Low
Power
4. Register Transfer Level Design
FIR3 Block Diagram and Flow Graph
High-Level Power Estimation
• Pcore = PDP + PMEM + PCNTR + PPROC
• PDP = PREG +PMUX +PFU + +PFU, where PREG is the
power of the registers
• PMUX is the power of multiplexers
• PFU is the power of functional units
• PINT is the power of physical interconnet capacitance
Cint    Ctotal / N , where  is the average activity
(the total number of interconne ct accesses multiplied by
an average signal transitio n probabilit y), Ctotal is the total
estimated capacitanc e of the chip and N is an estimate of the
number of physical interconne ts (HYPER).
High-Level Power Estimation: PREG
•
•
•
•
•
•
•
•
•
Compute the lifetimes of all the variables in the given VHDL code.
Represent the lifetime of each variable as a vertical line from statement i through
statement i + n in the column j reserved for the corresponding varibale v j .
Determine the maximum number N of overlapping lifetimes computing the
maximum number of vertical lines intersecting with any horizontal cut-line.
Estimate the minimal number of N of set of registers necessary to implement the
code by using register sharing. Register sharing has to be applied whenever a
group of variables, with the same bit-width b i .
Select a possible mapping of variables into registers by using register sharing
Compute the number w i of write to the variables mapped to the same set of
registers. Estimate n i of each set of register dividing w i by the number of
statements S: i =wi/S; hence TR imax = n i f clk .
Power of latches and flip flops is consumed not only during output transitions,
but also during all clock edges by the internal clock buffers
The non-switching power PNSK dissipated by internal clock buffers accounts for
30% of the average power for the 0.38-micron and 3.3 V operating system.
In total,
N
PREG   ( Pk  PNSK ), Pk  nk PtkTRk  , PNSK  nk PNk ( f clk  TRk ) ,
k 1
PCNTR
• After scheduling, the control is defined and optimized by the hardware mapper and
further by the logic synthesis process before mapping to layout.
• Like interconnect, therefore, the control needs to be estimated statistically.
• Global control model:
CFSM  1 N states   2 ,
For a 1.2 technolo gy, 1 is 4.9fF and  2 is 22.1fF.
The total number of transitio ns is strongly dependent on the
number of states.
Local control model: the local controller account for a larger percentage of
the total capacitance than the global controller.
Clc   0  1 Ntrans   2 N states   3 B f ,
For a 1.2 tech.,  0,  72, 1,  0.15,  2,  8.3,  3,  0.55.
Where Ntrans is the number of tansitions, nstates is the number of states, Bf is the bus
factor, and Clc is the capacitance switched in any local controller in one sample
period. Bf is the ratio of the number of bus accesses to the number of busses.
Ntrans
•
•
The number of transitions depends on assignment, scheduling, optimizations, logic
optimization, the standard cell library used, the amount of glitchings and the statistics of
the inputs.
N trans   1   2 ( N nodes  N edges )   3 ( S  N Exu )
where N transis the number of transitio ns on the outputs of
the loal controller s, S is the number of control cycles per sample
period, N edges and N nodes are the number of edges and nodes in
the CDFG and N Exu is an estimate for the total number of
execution units. For a 1.2  tech.  1  178.7,  2  7.2,  3  2.0.
Behavioral Synthesis
•
loop unrolling : localize the data to reduce the activity of the inputs of the
functional units or two output samples are computed in parallel based on two
input samples.
Yn1  X n1  A  Yn2
Yn  X n  A  Yn1  X n  A  ( X n1  A  Yn2 )
Neither the capacitance switched nor the voltage is altered. However, loop unrolling
enables several other transformations (distributivity, constant propagation, and
pipelining). After distributivity and constant propagation,
Yn1  X n 1  A  Yn 2
Yn  X n  A  Yn1  A2  Yn2
The transformation yields critical path of 3, thus voltage can be dropped.
• Clock Selection : Choose optimal system clock period Eliminate slacks/improve resource
utilization and Enable greater voltage scaling
• Module selection : For each operation, choose library template
• Flow graph restructuring : pull out operations on the critical cycle.
High-Level Power Estimation: PMUX and PFU
Critical Path
•
•
•
Longest delayed path from input to
output in combinational logic
Determine operating clock
frequency
Resizing non-critical path transistor
(In-Place Optimization)
• Critical path in Synchronous
Sequential logic
D
Q
D
Q
D
Q
D
Q
path A
tcycle,min  t ff,max  tlogic,max  t setup,max  t skew,max
D
Q
D
Q
path B
tcycle,min : min.value of clock period
t ff,max : max.value of flipflop delay
tlogic,max : max.value of critical path delay
t setup,max : max.value of setup time of flipflop
t skew,max : max.value of clock skew
clk
Combinational
Logic
clk
Loop Unrolling for Low Power
Retiming
Flip- flop insertion to
minimize hazard
activity moving a
flip- flop in a circuit
Exploiting spatial locality for interconnect
power reduction
Global
Local
Adder1
Adder2
Balancing maximal time-sharing and
fully-parallel implementation
A fourth-order
parallel-form
IIR filter
(a) Local assignment
(2 global transfers),
(b) Non-local assignment
(20 global transfers)
Retiming/pipelining for Critical path
in
Biquad
Biquad
+
Biquad
1st
Order
out
Biquad
Biquad
+
D
+
+
Minimal Area Time-multiplexed
Solution
(meeting the throughput
constraint)
D
Retiming and pipelining
in
D
Biquad
Biquad
D
Biquad
+
D
1st
Order
D
Biquad
D
out
Biquad
+
D
D
+
D
+
D
"Fastest" Solution
Supply voltage can be reduced keeping throughput fixed.
Effective Resource Utilization
7
S
+
S
7
+
+
D
5
1
2
D
6
+
+
+
5
1
D
+
2
+
Retiming
D
D
3
4
D
3
Before
CYCLE
1
2
1
1
D
AFTER
Multipliers
Adder Multipliers Adder
1
1, 3
2
8
2, 4
5
6
1
6, 8
3
7
-
Can reducd interconnect capacitance.
7
4
5
4
6
Hazard propagation elimination by
clocked sampling
By sampling a steady state signal at a register input,
no more glitches are propagated through the next
combinational logics.
Latched Retiming
Latched retiming
Regularity
•
Common patterns enable the design of less complex architecture and therefore
simpler interconnect structure (muxes, buffers, and buses). Regular designs
often have less control hardware.
A1
A1
A1
A2
A2
A1
A2
A2
A1
A2
+
+
+
+
+
+
+
+
+
+
*
M1
*
M1
*
M1
<
S1
<
<<
*
S1
*
M1
*
M1
*
M1
<
S1
<
<<
S1
*
A1
+
M1
A2
+
S1
<<
*
(a)±ÔÄ¢Àû ¸ðµâÇÒ´ç
A1
+
MUX
M1
A2
+
MUX
*
<<
S1
(b)ºñ±ÔÄ¢Àû ¸ðµâÇÒ´ç
Module Selection
•
•
•
•
Select the clock period, choose proper hardware modules for all
operations(e.g., Wallace or Booth Multiplier), determine where to
pipeline (or where to put registers), such that a minimal hardware cost
is obtained under given timing and throughput constraints.
Full pipelining: ineffective clock period mismatches between the
execution times of the operators. performing operations in sequence
without immediate buffering can result in a reduction of the critical path.
Clustering operations into non-pipelining hardware modules, the
reusability of these modules over the complete computational graph be
maximized.
During clustering, more expensive but faster hardware may be
swapped in for operations on the critical path if the clustering violates
timing constraints
Estimation
•
Estimate min and max bounds on the required resources to
–
–
•
•
delimit the design space min bounds to serve as an initial solution
serve as entries in a resource utilization table which guides the transformation,
assignment and scheduling operations
Max bound on execution time is tmax: topological ordering of DFG using ASAP
and ALAP
Minimum bounds on the number of resources for each resource class
Where NRi: the number of resources of class Ri
dRi : the duration of a single operation
ORi : the number of operations
Exploring the Design Space
•
•
•
•
•
Find the minimal area solution constrained to the timing constraints
By checking the critical paths, it determine if the proposed graph
violates the timing constraints. If so, retiming, pipelining and tree height
reduction can be applied.
After acceptable graph is obtained, the resource allocation process is
initiated.
– change the available hardware (FU's, registers, busses)
– redistribute the time allocation over the sub-graphs
– transform the graph to reduce the hardware requirements.
Use a rejectionless probabilistic iterative search technique (a variant of
Simulated Annealing), where moves are always accepted. This
approach reduces computational complexity and gives faster
convergence.
Data path Synthesis
Scheduling and Binding
•
•
•
•
The scheduling task selects the control step, in which a given operation
will happen, i.e., assign each operation to an execution cycle
Sharing: Bind a resource to more than one operation.
– Operations must not execute concurrently.
Graph scheduled hierachically in a bottom-up fashion
Power tradeoffs
–
–
–
–
•
Shorter schedules enable supply voltage (Vdd) scaling
Schedule directly impacts resource sharing
Energy consumption depends what the previous instruction was
Reordering to minimize the switching on the control path
Clock selection
– Eliminate slacks
– Choose optimal system clock period
ASAP Scheduling
• Algorithm
• HAL Example
ALAP Scheduling
• Algorithm
• HAL Example
Force Directed Scheduling
Used as priority function.
Force is related to concurrency.
Sort operations for least force.
Mechanical analogy:
Force = constant displacement.
constant = operation-type distribution.
displacement = change in probability.
Force Directed Scheduling
Example : Operation V6
Force-Directed Scheduling
• Algorithm (Paulin)
Force-Directed Scheduling Example
•
Probability of scheduling operations
into control steps
•
Operator cost for multiplications in a
•
Probability of scheduling operations
into control steps after operation o3 is
scheduled to step s2
•
Operator cost for multiplications in c
List Scheduling
•
DFG with mobility labeling (inside <>)
•
ready operation list/resource constraint
• The scheduled DFG
Static-List Scheduling
•
DFG
•
Partial schedule of five nodes
•
Priority list
The final
schedule
Loop folding
• Reduce execution delay of a
loop.
• Pipeline operations inside a
loop.
• Overlap execution of
operations.
• Need a prologue and
epilogue.
• Use pipeline scheduling for loop
graph model.
DFG Restructuring
• DFG2
• DFG2 after redundant operation
insertion
Minimizing the bit transitions for constants
during Scheduling
Control Synthesis
•Synthesize circuit that:
•Executes scheduled
operations.
•Provides
synchronization.
•Supports:
• Iteration.
• Branching.
• Hierarchy.
• Interfaces.
Allocation
◆Bind a resource to more than one operation.
Optimum binding
Example
RESOURCE SHARING
•
•
•
•
Parallel vs. time-sharing buses (or execution units)
Resource sharing can destroy signal correlations and increase switching activity,
should be done between operations that are strongly connected.
Map operations with correlated input signals to the same units
Regularity: repeated patterns of computation (e.g., (+, * ), ( * ,*), (+,>))
simplifying interconnect (busses, multiplexers, buffers)
Datapath interconnections
• Bus-oriented datapath
• Multiplexer-oriented datapath
Sequential Execution
• Example of three micro-operations in the same clock period
Insertion of Latch (out)
• Insertion of latches at the output ports of the functional units
Insertion of Latch (in/out)
• Insertion of latches at both the input and output ports of the
functional units
Overlapping Data Transfer(in)
• Overlapping read and write data transfers
Overlapping of Data Transfer (in/out)
• Overlapping data transfer with functional-unit execution
Register Allocation Using Clique Partitioning
• Scheduled DFG
• Lifetime intervals of variable
• Graph model
• Clique-partitioning solution
Left-Edge Algorithm
• Register allocation using Left-Edge Algorithm
Register Allocation: Left-Edge Algorithm
• Sorted variable lifetime intervals
• Five-register allocation result
Register Allocation
•
•
•
•
•
•
•
Allocation : bind registers and functional modules to variables and
operations in the CDFG and specify the interconnection among
modules and registers in terms of MUX or BUS.
Reduce capacitance during allocation by minimizing the number of
functional modules, registers, and multiplexers.
Composite weight w.r.t transition activity and capacitance loads is
incorporated into CDFG.
Find the highest composite weight and merge the two nodes it joins,
i.e., maps the corresponding variable to the same register.
Allocation continues till no edges are left in the CDFG while updating
the composite weight values.
Set the maximum # of operations alive in any control step to be one.
Sequence operations/variables to enhance signal correlations
•
•
•
•
•
•
Exploiting spatial locality for
interconnect power reduction
A spatially local cluster: group of algorithm operations that are tightly connected
to each other in the flowgraph representation.
Two nodes are tightly connected to each other on the flowgraph representaion if
the shortest distance between them, in terms of number of edges traversed, is
low.
A spatially local assignment is a mapping of the algorithm operations to specific
hardware units such that no operations in different clusters share the same
hardware.
Partitioning the algorithm into spatially local clusters ensures that the majority of
the data transfers take place within clusters (with local bus) and relatively few
occur between clusters (with global bus).
The partitioning information is passed to the architecture netlist and
floorplanning tools.
Local: A given adder outputs data to its own inputs Global: A given adder
outputs data to the aother adder's inputs
Hardware Mapping
•
•
•
The last step in the synthesis process maps the allocated, assigned
and scheduled flow graph (called the decorated flow graph) onto the
available hardware blocks.
The result of this process is a structural description of the processor
architecture, (e.g., sdl input to the Lager IV silicon assembly
environment).
The mapping process transforms the flow graph into three structural
sub-graphs:
the data path structure graph
the controller state machine graph
the interface graph (between data path control inputs and the
controller output signals)
Spectral Partitioning in High-Level Synthesis
•
•
•
•
•
•
•
The eigenvector placement obtained forms an ordering in which nodes tightly
connected to each other are placed close together.
The relative distances is a measure of the tightness of connections.
Use the eigenvector ordering to generate several partitioning solutions
The area estimates are based on distribution graphs.
A distribution graph displays the expected number of operations executed in
each time slot.
Local bus power: the number of global data transfers times the area of the
cluster
Global bus power: the number of global data transfer times the total area:
Finding a good Partition
Interconnection Estimation
•
•
•
For connection within a datapath (over-the-cell routing), routing between units
increases the actual height of the datapath by approximately 20-30% and that
most wire lengths are about 30-40% of the datapath height.
Average global bus length : square root of the estimated chip area.
The three terms represent white space, active area of the components, and
wiring area. The coefficients are derived statistically.
Experiments
Datapath Generation
•
Register file recognition and the multiplexer reduction:
– Individual registers are merged as much as possible into register files
– reduces the number of bus multiplexers, the overall number of busses
(since all registers in a file share the input and output busses) and the
number of control signals (since a register file uses a local decoder).
•
•
•
Minimize the multiplexer and I/O bus, simultaneously (clique
partitioning is Np-complete, thus Simulated Annealing is used)
Data path partitioning is to optimize the processor floorplan
The core idea is to grow pairs of as large as possible isomorphic
regions from corresponding of seed nodes.
Hardware Mapper
Test Example
Control Signal Assignment
Incorporating into HYPER-LP