L5-arch11 - VADA


L5: Lower Power Architecture Design
1999.8.2
Prof. Jun Dong Cho, SungKyunKwan University
http://vada.skku.ac.kr
SungKyunKwan Univ.
VADA Lab.
Architectural-level Synthesis
• Translate HDL models into sequencing graphs.
• Behavioral-level optimization:
– Optimize abstract models independently from the
implementation parameters.
• Architectural synthesis and optimization:
– Create macroscopic structure:
• data-path and control-unit.
– Consider area and delay information of the implementation.
• Hardware compilation:
– Compile HDL model into sequencing graph.
– Optimize sequencing graph.
– Generate gate-level interconnection for a cell library.
Architecture-Level Solutions
• Architecture-driven voltage scaling: choose a more parallel architecture. Lowering $V_{dd}$ reduces energy but increases delay.
• Regularity: to minimize the power in the control hardware and the interconnection network.
• Modularity: to exploit data locality through distributed processing units, memories, and control.
– Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
– Temporal locality: short average lifetimes of variables (less temporary storage; future accesses are likely to refer to data referenced in the recent past).
• Few memory references: since references to memories are expensive in terms of power.
• Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses).
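The voltage-scaling trade-off in the first bullet can be illustrated with a minimal first-order sketch. It assumes dynamic power P = C·f·Vdd² and a gate delay proportional to Vdd/(Vdd − Vt)². The threshold voltage, the reference supply, the capacitance overhead, and all function names are illustrative assumptions, not values taken from these slides.

```python
def dynamic_power(c_eff, freq, vdd):
    """First-order switching power: P = C * f * Vdd^2 (normalized units)."""
    return c_eff * freq * vdd ** 2


def gate_delay(vdd, vt=0.7):
    """Assumed delay model: delay proportional to Vdd / (Vdd - Vt)^2."""
    return vdd / (vdd - vt) ** 2


def scaled_vdd(vdd_ref, slowdown, vt=0.7, tol=1e-6):
    """Lowest Vdd whose delay is at most `slowdown` times the reference delay.

    The delay model decreases monotonically with Vdd above Vt,
    so a simple bisection is enough.
    """
    target = slowdown * gate_delay(vdd_ref, vt)
    lo, hi = vt + 1e-3, vdd_ref
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gate_delay(mid, vt) > target:
            lo = mid  # too slow at this voltage: raise it
        else:
            hi = mid  # fast enough: try a lower voltage
    return hi


VDD_REF = 5.0        # assumed reference supply voltage
CAP_OVERHEAD = 2.2   # assumed capacitance of two units plus routing and mux

# Two parallel units each run at half rate, so each may be twice as slow.
v_par = scaled_vdd(VDD_REF, slowdown=2.0)
p_ref = dynamic_power(1.0, 1.0, VDD_REF)
p_par = dynamic_power(CAP_OVERHEAD, 0.5, v_par)
print(f"parallel Vdd = {v_par:.2f} V, power ratio = {p_par / p_ref:.2f}")
```

Under these assumed numbers the parallel version runs at roughly 3.1 V and draws less than half the reference power, even though its capacitance more than doubles.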
Power Measure of P
Architecture Trade-off
Reference Data Path
Parallel Data Path
Pipelined Data Path
A Simple Data Path, Result4
Uni-processor Implementation
Multi-Processor Implementation
Datapath Parallelization
Memory Parallelization
To first order, $P = C \cdot \frac{f}{2} \cdot V_{dd}^2$
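The first-order expression above can be evaluated numerically. The slide does not define its symbols, so this minimal sketch takes one common reading: C is the switched capacitance of a memory bank, f is the overall access frequency (each bank is then accessed at f/2), and Vdd is the supply voltage. The capacitance, frequency, and voltages below are made-up illustrative values.

```python
def first_order_power(c, f, vdd):
    """Evaluate P = C * (f / 2) * Vdd^2 in watts (SI units assumed)."""
    return c * (f / 2.0) * vdd ** 2


C_BANK = 10e-12    # assumed switched capacitance per bank, in farads
F_ACCESS = 100e6   # assumed overall access frequency, in hertz

# Same bank at two hypothetical supply voltages: the relaxed timing of a
# half-rate bank may allow the lower one.
p_nominal = first_order_power(C_BANK, F_ACCESS, 3.3)
p_scaled = first_order_power(C_BANK, F_ACCESS, 2.5)
print(f"{p_nominal * 1e3:.3f} mW at 3.3 V, {p_scaled * 1e3:.3f} mW at 2.5 V")
```

Because the supply enters quadratically, any voltage reduction permitted by the halved access rate saves more than the frequency reduction alone.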
VLIW Architecture
VLIW - cont.
• The compiler takes responsibility for finding the operations that can be issued in parallel and for creating a single very long instruction containing these operations.
• VLIW instruction decoding is easier than superscalar instruction decoding because the format is fixed and the operations within an instruction have no dependencies on one another.
• The fixed format may place more limitations on how operations can be combined.
• Intel P6: CISC instructions are converted on chip into a set of micro-operations (in effect, a long instruction word) that can be executed in parallel.
• As power becomes a major issue in the design of fast microprocessors, the simpler the architecture, the better.
• VLIW architectures, being simpler than N-issue machines, can be considered promising for achieving high speed and low power simultaneously.
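The compiler's role described above can be sketched as a greedy packer that groups mutually independent operations into fixed-width bundles. This is a minimal illustration under stated assumptions, not the algorithm of any particular VLIW compiler: only true data dependencies are modeled, every operation takes one cycle, and the operation names, the `deps` table, and the issue width are hypothetical.

```python
def pack_vliw(ops, deps, width):
    """Greedily pack operations into long instruction words.

    ops   : operation names in program order
    deps  : maps an operation to the set of operations it depends on
    width : number of operation slots per long instruction
    """
    done = set()          # operations issued in earlier bundles
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == width:
                break
            # Only operations whose inputs come from earlier bundles qualify,
            # so no two operations in one bundle depend on each other.
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependencies")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles


# Hypothetical five-operation program for a 2-wide machine.
program = ["load_x", "load_y", "mul", "neg", "store"]
dependencies = {
    "mul": {"load_x", "load_y"},
    "neg": {"load_x"},
    "store": {"mul", "neg"},
}
bundles = pack_vliw(program, dependencies, width=2)
print(bundles)
```

The last bundle holds a single operation, so one slot goes unused there. That wasted slot is an instance of the limitation that a fixed instruction format places on combining operations.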
Synchronous vs. Asynchronous
• Synchronous system: a signal path starts at a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in the computation but are required for synchronization. As technology advances, systems grow larger, and the delay on the clock wires can no longer be ignored. Clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have new inputs to process. The biggest gate is the clock driver itself, which must switch every cycle.
• Asynchronous system (self-timed): an input signal (request) starts the computation in a module, and an output signal (acknowledge) signals the completion of the computation and the availability of the requested data. Asynchronous systems can potentially respond to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.
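The request/acknowledge behavior described above can be sketched in a few lines. The slides do not name a specific protocol, so this minimal model assumes a four-phase (return-to-zero) handshake. The class, its methods, and the activity counts are hypothetical, meant only to contrast a stage that is triggered on every clock tick with one that works only when a request arrives.

```python
class SelfTimedModule:
    """Hypothetical self-timed module with a four-phase req/ack handshake."""

    def __init__(self, func):
        self.func = func
        self.ack = False
        self.data_out = None
        self.activations = 0   # how many times the module actually computed

    def set_req(self, req, data_in=None):
        if req and not self.ack:
            # req+ starts the computation; ack+ signals that data is valid.
            self.data_out = self.func(data_in)
            self.activations += 1
            self.ack = True
        elif not req and self.ack:
            # req- is answered by ack-, returning both wires to zero.
            self.ack = False


def clocked_activity(slots):
    """A clocked stage is triggered in every slot, with or without new input."""
    return len(slots)


def run_self_timed(module, slots):
    """Feed a stream of time slots; None marks a slot with no new input."""
    results = []
    for value in slots:
        if value is None:
            continue                     # idle slot: no request, no switching
        module.set_req(True, value)      # req+ / ack+
        results.append(module.data_out)
        module.set_req(False)            # req- / ack-
    return results


slots = [1, None, None, 4, None, 2]
square = SelfTimedModule(lambda x: x * x)
results = run_self_timed(square, slots)
print(results, square.activations, clocked_activity(slots))
```

In this toy stream the clocked stage is triggered in all six slots, while the self-timed module computes only for the three slots that carry data.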
Asynchronous - Cont.
• More difficult to implement, requiring explicit synchronization between communicating blocks without clocks.
• If the signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used because designers cannot find the supporting design tools and methodologies they need.
• The asynchronous DCC error corrector of a compact cassette player consumes 80% less power than its synchronous counterpart.
• Offers more architectural options and freedom, encourages distributed, localized control, and offers more freedom to adapt the supply voltage.
S. Furber, M. Edwards, “Asynchronous Design Methodologies”, 1993.
Asynchronous design with adaptive scaling of the supply voltage
(a) Synchronous system
(b) Asynchronous system with adaptive scaling of the supply voltage
Asynchronous Pipeline
Pipelined Self-Timed µP
Hazard-free Circuits
6% more logic