L12-high level synthe.. - VADA

Transcript L12-high level synthe.. - VADA

L12 : Lower Power High Level
Synthesis(3)
1999. 8
성균관대학교 조 준 동 교수
http://vada.skku.ac.kr
Matrix-vector product algorithm
Retiming
Flip- flop insertion to
minimize hazard
activity moving a
flip- flop in a circuit
Exploiting spatial locality for interconnect
power reduction
Global
Local
Adder1
Adder2
Balancing maximal time-sharing and
fully-parallel implementation
A fourth-order
parallel-form
IIR filter
(a) Local assignment
(2 global transfers),
(b) Non-local assignment
(20 global transfers)
Retiming/pipelining for Critical path
in
Biquad
Biquad
+
Biquad
1st
Order
out
Biquad
Biquad
+
D
+
+
Minimal Area Time-multiplexed
Solution
(meeting the throughput
constraint)
D
Retiming and pipelining
in
D
Biquad
Biquad
D
Biquad
+
D
1st
Order
D
Biquad
D
out
Biquad
+
D
D
+
D
+
D
"Fastest" Solution
Supply voltage can be reduced keeping throughput fixed.
Effective Resource Utilization
7
S
+
S
7
+
+
D
5
1
2
D
6
+
+
+
5
1
D
+
2
+
Retiming
D
D
3
4
D
3
Before
CYCLE
1
2
1
1
AFTER
Multipliers
Adder Multipliers Adder
1
1, 3
2
8
2, 4
5
6
1
6, 8
3
7
7
4
5
Can reducd interconnect capacitance.
D
4
6
Hazard propagation elimination by
clocked sampling
By sampling a steady state signal at a register input,
no more glitches are propagated through the next
combinational logics.
Regularity
•
Common patterns enable the design of less complex architecture and therefore
simpler interconnect structure (muxes, buffers, and buses). Regular designs
often have less control hardware.
A1
A1
A1
A2
A2
A1
A2
A2
A1
A2
+
+
+
+
+
+
+
+
+
+
*
M1
*
M1
*
M1
<
S1
<
<<
*
S1
*
M1
*
M1
*
M1
<
S1
<
<<
S1
*
A1
+
M1
A2
+
S1
<<
*
(a)±ÔÄ¢Àû ¸ðµâÇÒ´ç
A1
+
MUX
M1
A2
+
MUX
*
<<
S1
(b)ºñ±ÔÄ¢Àû ¸ðµâÇÒ´ç
Module Selection
•
•
•
•
Select the clock period, choose proper hardware modules for all
operations(e.g., Wallace or Booth Multiplier), determine where to
pipeline (or where to put registers), such that a minimal hardware cost
is obtained under given timing and throughput constraints.
Full pipelining: ineffective clock period mismatches between the
execution times of the operators. performing operations in sequence
without immediate buffering can result in a reduction of the critical path.
Clustering operations into non-pipelining hardware modules, the
reusability of these modules over the complete computational graph be
maximized.
During clustering, more expensive but faster hardware may be
swapped in for operations on the critical path if the clustering violates
timing constraints
Estimation
•
Estimate min and max bounds on the required resources to
–
–
•
•
delimit the design space min bounds to serve as an initial solution
serve as entries in a resource utilization table which guides the transformation,
assignment and scheduling operations
Max bound on execution time is tmax: topological ordering of DFG using ASAP
and ALAP
Minimum bounds on the number of resources for each resource class
Where NRi: the number of resources of class Ri
dRi : the duration of a single operation
ORi : the number of operations
Exploring the Design Space
•
•
•
•
•
Find the minimal area solution constrained to the timing constraints
By checking the critical paths, it determine if the proposed graph
violates the timing constraints. If so, retiming, pipelining and tree height
reduction can be applied.
After acceptable graph is obtained, the resource allocation process is
initiated.
– change the available hardware (FU's, registers, busses)
– redistribute the time allocation over the sub-graphs
– transform the graph to reduce the hardware requirements.
Use a rejectionless probabilistic iterative search technique (a variant of
Simulated Annealing), where moves are always accepted. This
approach reduces computational complexity and gives faster
convergence.
Data path Synthesis
Scheduling and Binding
•
•
•
•
The scheduling task selects the control step, in which a given operation
will happen, i.e., assign each operation to an execution cycle
Sharing: Bind a resource to more than one operation.
– Operations must not execute concurrently.
Graph scheduled hierachically in a bottom-up fashion
Power tradeoffs
–
–
–
–
•
Shorter schedules enable supply voltage (Vdd) scaling
Schedule directly impacts resource sharing
Energy consumption depends what the previous instruction was
Reordering to minimize the switching on the control path
Clock selection
– Eliminate slacks
– Choose optimal system clock period
ASAP Scheduling
• Algorithm
• HAL Example
ALAP Scheduling
• Algorithm
• HAL Example
Force Directed Scheduling
Used as priority function.
Force is related to concurrency.
Sort operations for least force.
Mechanical analogy:
Force = constant displacement.
constant = operation-type distribution.
displacement = change in probability.
Force Directed Scheduling
Example : Operation V6
Force-Directed Scheduling
• Algorithm (Paulin)
Force-Directed Scheduling Example
•
Probability of scheduling operations
into control steps
•
Operator cost for multiplications in a
•
Probability of scheduling operations
into control steps after operation o3 is
scheduled to step s2
•
Operator cost for multiplications in c
List Scheduling
•
DFG with mobility labeling (inside <>)
•
ready operation list/resource constraint
• The scheduled DFG
Static-List Scheduling
•
DFG
•
Partial schedule of five nodes
•
Priority list
The final
schedule
Divide-and-Conquer to minimize the
power consumption
•
•
•
•
Decompose a computation into strongly connected components
Any adjacent trivial SCCs are merged into a sub part;
Use pipelining to isolate the sub parts;
For each sub part
– Minimize the number of delays using retiming;
– If (the sub part is linear)
• Apply optimal unfolding;
– Else
• Apply unfolding after the isolation of nonlinear operations;
• Merge linear sub parts to further optimize;
• Schedule merged sub parts to minimize memory usage
Choosing Optimal Clock Period
SCC decomposition step
Using the standard depth-first
search-based algorithm
[Tarjan,1972] which has a low
order polynomial-time complexity.
For any pair of operations A and B
within an SCC, there exist both a
path from A to B and a path from B
to A. The graph formed by all the
SCCs is acyclic. Thus, the SCCs
can be isolated from each other
using pipeline delays, which
enables us to optimize each SCC
separately.
Idetifying SCC
• The first step of the approach is to identify the computation's
strongly connected components,.
Choosing Optimal Clock Period
Supply Voltage Scaling
Lowering Vdd reduces
energy, but increase delays
Multiple Supply Voltages
Filter Example
Shut-down을 이용한 Scheduling: |a-b|
a
1
b
>
a
b
-
b
1
2
3
>
a
b
Mux
b
a
-
a
1
Mux
a
-
2
a
b
2
3
b
a
b
b
a
>
-
Mux
Loop Scheduling
• Sequential Execution
• Partial loop unrolling
• Loop folding
Loop folding
• Reduce execution delay of a
loop.
• Pipeline operations inside a loop.
• Overlap execution of
operations.
• Need a prologue and
epilogue.
• Use pipeline scheduling for loop
graph model.
DFG Restructuring
• DFG2
• DFG2 after redundant operation
insertion
Minimizing the bit transitions for constants
during Scheduling
Control Synthesis
•Synthesize circuit that:
•Executes scheduled
operations.
•Provides
synchronization.
•Supports:
• Iteration.
• Branching.
• Hierarchy.
• Interfaces.
Allocation
◆Bind a resource to more than one operation.
Optimum binding
Example

L12-high level synthe.. - VADA

Transcript L12-high level synthe.. - VADA

Directory