L4-arch0 - VADA


L4: Architectural Level Design
Professor Jun-Dong Cho, Sungkyunkwan University
http://vlsicad.skku.ac.kr
System-Level Solutions
• Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
• Temporal locality: average lifetimes of variables (less temporary storage; future accesses are likely to reference data used in the recent past).
• Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses).
• Architecture-driven voltage scaling: choose a more parallel architecture.
• Supply voltage scaling: lowering V_dd reduces energy, but increases delay.
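These last two points follow from the standard first-order CMOS relations (a sketch; C_L is the switched capacitance and V_t the device threshold, neither defined on the slide):

\[
E_{\text{sw}} \propto C_L V_{dd}^{2}, \qquad
t_d \propto \frac{V_{dd}}{(V_{dd}-V_t)^{2}}
\]

so lowering V_dd gives a quadratic energy saving at the cost of longer gate delays, which a more parallel architecture can hide.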
Software Power Issues
• Up to 40% of the on-chip power is dissipated on the buses!
• System software: OS, BIOS, compilers.
• Software can affect energy consumption at various levels.
• Inter-instruction effects: the energy cost of an instruction varies depending on the previous instruction.
  – For example, XOR BX, 1; ADD AX, DX;
    I_est = (319.2 + 313.6)/2 = 316.4 mA, but I_obs = 323.2 mA.
  – The difference is defined as the circuit-state overhead.
  – The overhead needs to be specified as a function of pairs of instructions.
• Energy overheads also arise from pipeline stalls and cache misses.
• Instruction reordering can improve the cache hit ratio.
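A minimal C sketch of the instruction-pair energy model described above. The base currents come from the slide example; the pairwise circuit-state overhead of 13.6 mA is back-calculated from the slide's I_obs and is illustrative, not a measured table.

#include <stdio.h>

/* Base cost (average current, mA) per instruction class, from the slide example. */
enum { OP_XOR, OP_ADD, N_OPS };
static const double base_mA[N_OPS] = { 319.2, 313.6 };

/* Hypothetical circuit-state overhead (mA) for each ordered pair of instructions. */
static const double overhead_mA[N_OPS][N_OPS] = {
    /* next: XOR    ADD  */
    /* XOR */ { 0.0, 13.6 },
    /* ADD */ { 13.6, 0.0 },
};

/* Estimate the average current of a sequence: per-instruction base cost
 * plus the circuit-state overhead of each adjacent pair. */
static double estimate_mA(const int *seq, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += base_mA[seq[i]];
        if (i > 0)
            total += overhead_mA[seq[i - 1]][seq[i]];
    }
    return total / n;
}

int main(void)
{
    int seq[] = { OP_XOR, OP_ADD };
    printf("estimated average current: %.1f mA\n", estimate_mA(seq, 2));  /* 323.2 mA */
    return 0;
}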
Software Power Optimization
• Operand swapping
  – minimize the switching activity associated with the operands
  – attempt to swap the operands fed to the ALU or FPU
• Instruction packing
  – reduce cache misses, which carry a high power penalty
  – example: the Fujitsu DSP permits an ALU operation and a memory data transfer to be packed into one instruction
• Instruction ordering
  – attempt to minimize the energy associated with the circuit-state effect
  – reorder instructions to minimize the total power for a given overhead table
Software Power Optimization
• Minimizing memory access costs
  – minimize the number of memory accesses required by an algorithm
  – example: fusing the two loops lets B[i] stay in a register between f and g (a C version follows this slide)
    Before:
      FOR i := 1 TO N DO
        B[i] = f(A[i]);
      FOR i := 1 TO N DO
        C[i] = g(B[i]);
    After:
      FOR i := 1 TO N DO
        B[i] = f(A[i]);
        C[i] = g(B[i]);
      END_FOR;
• Memory bank assignment
  – formulated as a graph partitioning problem over the access graph of a code fragment
  – each group corresponds to a memory bank
  – the optimum code sequence can vary when dual loads are used
  [Figure: access graph for a code fragment, nodes a, b, c, d, e partitioned into Bank A and Bank B]
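A C rendering of the loop-fusion example above (f, g, the array sizes, and the computations are placeholders):

#include <stddef.h>
#include <stdio.h>

#define N 1024
static double A[N], B[N], C[N];

static double f(double x) { return x * 2.0; }   /* placeholder computation */
static double g(double x) { return x + 1.0; }   /* placeholder computation */

/* Before: two loops; every B[i] is written to memory and read back. */
static void before(void)
{
    for (size_t i = 0; i < N; i++) B[i] = f(A[i]);
    for (size_t i = 0; i < N; i++) C[i] = g(B[i]);
}

/* After: the fused loop keeps f(A[i]) in a register, removing the reload of B[i]. */
static void after(void)
{
    for (size_t i = 0; i < N; i++) {
        double b = f(A[i]);
        B[i] = b;            /* still stored if B is needed later */
        C[i] = g(b);         /* reuses the register copy instead of reloading B[i] */
    }
}

int main(void)
{
    before();
    after();
    printf("%f %f\n", B[0], C[0]);
    return 0;
}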
Power Management Mode
• APM system
  – supports power management, with easy control for applications and the OS
  – APM: Advanced Power Management
  – power states
    • Full On
    • APM Enabled
    • APM Standby
    • APM Suspend
    • Off
  [Figure: APM software stack — APM-aware applications and APM-aware device drivers, the Operating System with its APM Driver (OS dependent), the APM BIOS (OS independent), APM-BIOS-controlled hardware, and add-in devices]
Power Management Mode
• APM state transitions
  [Figure: APM state diagram — states Full On, APM Enabled, APM Standby, APM Suspend (Hibernation), and Off; transitions labelled APM Enable / Enable Call and APM Disable / Disable Call between Full On and APM Enabled, Short Inactivity / Standby Call into APM Standby, Long Inactivity / Suspend Interrupt / Suspend Call into APM Suspend, Off Switch / Off Call into Off, and On Switch back toward Full On; axis labels: Device Responsiveness Decrease, Power Usage Increase]
Power Management Mode
• MIPS 4200
  – Reduced power
    • clocks run at 1/4 of the bus clock frequency
• PowerPC 603
  – Doze
    • clock running only to the data cache, snooping logic, and time base/decrementer
  – Nap
    • clocks running to the time base/decrementer only
  – Sleep
    • all clocks stopped, no external input clock
• Hitachi SH7032
  – Sleep
    • CPU clocks stopped, peripherals remain clocked
  – Standby
    • all clocks stopped, peripherals initialized
Power Optimization
• Modeling and Technology
• Circuit Design Level
• Logic and Module Design Level
• Architecture and System Design Level
• Some Design Examples
  – ARM7TDMI
Some Design Examples
• ARM7TDMI core
  – size: 1 mm² @ 0.25 um
  – power:
    • 0.181 W @ 33 MHz, 5 V
    • 143 MIPS/W
  – features:
    • 32-bit addressing
    • 32x8 DSP multiplier
    • 32-bit register bank and ALU
    • 32-bit barrel shifter
  – Thumb instruction set
    • compressed encoding of the 32-bit ARM instructions
    • high code density
  Processor    Clock / Supply   System Power (W)   MIPS/W
  ARM7D        33 MHz, 5 V      0.165              185
  ARM7TDMI     33 MHz, 5 V      0.181              143
  PC403GA      40 MHz, 5 V      1                  39
  V810         25 MHz, 5 V      0.5                36
  68349        25 MHz, 5 V      0.96               9
  29200        16 MHz, 5 V      1.1                7
  486DX        33 MHz, 5 V      4.5                6
  i960SA       16 MHz, 5 V      1.25               4
Processor with Power Management
• Clock power management
– basic logical method
• gated clocking
– hardware method
• external pin + control register bit
– software method
• specific instructions + control register bit
Avoiding Wasteful Computation
• Preservation of data correlation
• Distributed computing / locality of reference
• Application-specific processing
• Demand-driven operation
• Transformation for memory size reduction
  – Consider arrays A and C that are already available in memory.
  – When A is consumed, another array B is generated; when C is consumed, a scalar value D is produced.
  – Memory size can be reduced by executing the j loop before the i loop, so that C is consumed before B is generated and the same memory space can be used for both arrays (see the sketch after this slide).
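A minimal C sketch of the transformation just described (loop bounds, f, and g are placeholders; the point is only that running the j loop first lets B reuse C's storage):

#include <stdio.h>

#define N 256

/* A and C are assumed to be resident in memory already. */
static int A[N], C[N];

static int f(int x) { return x * 3; }              /* consumes A, produces B (placeholder) */
static int g(int acc, int x) { return acc + x; }   /* consumes C, produces D (placeholder) */

int main(void)
{
    /* Run the j loop first, so C is fully consumed (dead) ... */
    int D = 0;
    for (int j = 0; j < N; j++)
        D = g(D, C[j]);

    /* ... then generate B in the storage that C no longer needs. */
    int *B = C;                           /* reuse C's memory space for B */
    for (int i = 0; i < N; i++)
        B[i] = f(A[i]);

    printf("D = %d, B[0] = %d\n", D, B[0]);
    return 0;
}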
Avoiding Wasteful Computation
Architecture Low Power Design
• Optimum supply-voltage architecture through hardware duplication (trading area for lower power) and/or pipelining
  – fewer, more complex instructions require less encoding, but larger decode logic!
    • use small complex instructions with a smaller instruction length (e.g., Hitachi SH: 16-bit fixed length, arithmetic instructions use only two operands; NEC V800: variable-length instructions incur decoding overhead)
    • Superscalar: CPI < 1, parallel instruction execution; VLIW architecture.
Variable Supply Voltage Block Diagram
• Computational work varies with time. An approach to reduce the energy consumption of such systems beyond shut-down involves dynamic adjustment of the supply voltage based on the computational workload.
• The basic idea is to lower the power supply voltage during periods of low workload, rather than running at a fixed supply and idling for some fraction of the time.
• The supply voltage and clock rate are increased during high-workload periods.
Power Reduction using Variable Supply
• Circuits with a fixed supply voltage work at a fixed speed and idle if the data sample requires less than the maximum amount of computation. Power is then reduced only in a linear fashion, since the energy per operation is fixed.
• If the workload for a given sample period is less than the peak, the delay of the processing element can be increased by a factor of 1/workload without loss of throughput, allowing the processor to operate at a lower supply voltage. Thus the energy per operation varies (see the relations below).
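A sketch of the energy relations implied here, for a normalized workload w <= 1 with N operations per sample period at peak (first-order, assuming the switched capacitance C_L is unchanged):

\[
E_{\text{fixed}}(w) \approx w\, N\, C_L V_{\text{fix}}^{2}
\qquad\text{vs.}\qquad
E_{\text{var}}(w) \approx w\, N\, C_L V(w)^{2},
\quad t_d\big(V(w)\big) = \tfrac{1}{w}\, t_d\big(V_{\text{fix}}\big),
\]

so the variable-supply case saves energy quadratically through the lower V(w), not just linearly through idling.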
Data Driven Signal Processing
• The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply.
• Using a ping-pong buffering scheme, data samples I_{n+2}, I_{n+3} are being buffered while I_n, I_{n+1} are being processed.
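A small C sketch of this workload-averaging idea (set_supply_for and workload_of are hypothetical hooks, not part of any real DVS interface):

#include <stdio.h>

/* Hypothetical hook: pick the supply/clock setting for a normalized workload (0..1). */
static void set_supply_for(double workload)
{
    printf("set supply for averaged workload %.2f\n", workload);
}

static double workload_of(int sample) { return (sample & 0xF) / 15.0; }  /* placeholder estimate */
static void process(int sample)        { (void)sample; /* placeholder DSP work */ }

int main(void)
{
    int input[] = { 3, 9, 15, 2, 7, 7, 1, 12 };
    int n = sizeof input / sizeof input[0];

    /* Ping-pong: while one pair of samples is processed, the next pair is buffered,
     * and its averaged workload decides the supply setting used to process it. */
    for (int i = 0; i + 1 < n; i += 2) {
        double avg = (workload_of(input[i]) + workload_of(input[i + 1])) / 2.0;
        set_supply_for(avg);        /* supply follows the averaged, not instantaneous, workload */
        process(input[i]);
        process(input[i + 1]);
    }
    return 0;
}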
Datapath Parallelization
Memory Parallelization
At first order, P = (1/2) · C · f · V_dd²
Pipelined Micro-P
Architecture Trade-Off
P_pipeline = (1.15 C)(0.58 V_dd)² (f)     ≈ 0.39 P
P_parallel = (2.15 C)(0.58 V_dd)² (0.5 f) ≈ 0.36 P
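A quick arithmetic check of the trade-off (the 1/2 activity factor and the reference C, V_dd, f cancel in the ratios):

\[
\frac{P_{\text{pipeline}}}{P} = 1.15 \times 0.58^{2} \approx 0.39,
\qquad
\frac{P_{\text{parallel}}}{P} = 2.15 \times 0.58^{2} \times 0.5 \approx 0.36 .
\]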
PIPELINED Implementation
Different Classes of RISC Micro-P
Application Specific Coprocessor
• DSPs are increasingly called upon to perform tasks for which they are not ideally suited, for example, Viterbi decoding.
• They may also take considerably more energy than a custom solution.
• Use the DSP for the portions of algorithms for which it is well suited, and craft an application-specific coprocessor (i.e., custom hardware) for the other tasks.
• This is an example of the difference between power and energy: the application-specific coprocessor may actually consume more power than the DSP, but it may be able to accomplish the same task in far less time, resulting in a net energy saving.
• Power consumption varies dramatically with the instruction being executed.
Clocks per Instruction (CPI)
SUPERPIPELINE micro-P
VLIW Architecture
The compiler takes the responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing these operations. VLIW instruction decoding is easier than superscalar instruction decoding, thanks to the fixed format and the absence of instruction dependencies. The fixed format, however, can place more limitations on how operations may be combined. Intel P6: CISC instructions are combined on chip into a set of micro-operations (i.e., a long instruction word) that can be executed in parallel. As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one. VLIW architectures, being simpler than N-issue superscalar machines, can therefore be considered promising architectures for achieving high speed and low power simultaneously.
Architecture Optimization
• Two's-complement architecture
  – correlator example
    • 64 MHz random input
    • 64 kHz accumulated output
    • length 1024
  – the accumulator acts as a low-pass filter
    • higher-order bits have little switching activity
  – high switching activity in the adder
    • all of the input bits to the adder switch each time the input changes sign
  [Figure: two's-complement accumulator — 4-bit in_latched register (CLK 64 MHz), 14-bit adder producing add_out, 14-bit current_sum register (output CLK 64 kHz)]
Transition Activity
• [Plot: transition activity (0–1.0) vs. bit position (0–12) for in_latched, add_out, and current_sum; the sign-extension region of add_out keeps the upper bits highly active]
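A small simulation sketch of the effect described above (illustrative only, not the slide's measured data): it feeds random 4-bit two's-complement samples into a 14-bit accumulator and counts per-bit toggles, showing that the sign-extended adder input stays busy in the upper bits while the accumulator output settles down.

#include <stdio.h>
#include <stdlib.h>

#define WIDTH   14
#define MASK    ((1u << WIDTH) - 1u)
#define SAMPLES 1024

/* Count how often each bit of 'now' differs from 'before'. */
static void count_toggles(unsigned before, unsigned now, long toggles[WIDTH])
{
    for (int b = 0; b < WIDTH; b++)
        toggles[b] += ((before ^ now) >> b) & 1u;
}

int main(void)
{
    unsigned sum = 0, prev_in = 0, prev_sum = 0;
    long in_toggles[WIDTH] = { 0 }, sum_toggles[WIDTH] = { 0 };
    srand(1);

    for (int n = 0; n < SAMPLES; n++) {
        int sample = (rand() % 15) - 7;                 /* 4-bit two's-complement input, -7..7 */
        unsigned in_ext = (unsigned)sample & MASK;      /* sign-extended to the adder width    */

        count_toggles(prev_in, in_ext, in_toggles);     /* adder-input activity                */
        prev_in = in_ext;

        sum = (sum + in_ext) & MASK;                    /* accumulate                          */
        count_toggles(prev_sum, sum, sum_toggles);      /* accumulator (current_sum) activity  */
        prev_sum = sum;
    }

    printf("bit   input-toggles   sum-toggles\n");
    for (int b = 0; b < WIDTH; b++)
        printf("%3d   %13ld   %11ld\n", b, in_toggles[b], sum_toggles[b]);
    return 0;
}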
Architecture Optimization
• Sign-magnitude architecture
  – low switching activity in the high-order bits
    • no sign extension is performed
    • the higher-order bits only need an incrementer
  – power is not sensitive to very rapid fluctuations in the input data
  Input pattern                        2's complement (mW)   Sign-magnitude (mW)
  constant (7,7,...)                   1.97                  2.25
  ramp (-7,-6,...,6,7,...)             2.13                  2.51
  random                               3.42                  2.51
  min->max->min (-7,+7,-7,+7,...)      5.28                  2.46
[Figure: sign-magnitude accumulator — 3-bit magnitude input with the sign bit routed to control, separate 13-bit POSACC and NEGACC accumulators with gated 64 kHz clocks (input clock 64 MHz), and a final 14-bit subtraction of NEGACC from POSACC]
Transition Activity
• [Plot: transition activity (0–1.0) vs. bit position (0–12) for sum (two's complement) and for suma, sumb, and suma + sumb (sign-magnitude)]
Architecture Optimization
[Figure: two adder topologies for IN + (IN >> 7) + (IN >> 8) — topology I adds IN and IN >> 7 first, topology II adds IN >> 7 and IN >> 8 first; SUM1 and SUM2 denote the first and second adder outputs]
• Ordering of input signals
  – the ordering of operations can result in reduced switching activity
  – example: multiplication with a constant, IN + (IN >> 7) + (IN >> 8) (see the C sketch after this slide)
  – topology II
    • the output of the first adder has a small amplitude -> lower switching activity
    • switches about 30% less
Transition Activity
• [Plot: transition activity (0–0.4) vs. bit position (0–12) for SUM1 and SUM2 under the two topologies]
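A C sketch of the two orderings above (purely illustrative; it only shows which partial sum each topology forms first — in hardware the arithmetic result is the same, the switching activity is not):

#include <stdint.h>
#include <stdio.h>

/* Constant multiplication decomposed into shifts and adds: IN * (1 + 2^-7 + 2^-8). */

static int32_t topology1(int32_t in)
{
    int32_t sum1 = in + (in >> 7);        /* first adder sees the full-amplitude IN      */
    int32_t sum2 = sum1 + (in >> 8);
    return sum2;
}

static int32_t topology2(int32_t in)
{
    int32_t sum1 = (in >> 7) + (in >> 8); /* first adder sees only small-amplitude terms */
    int32_t sum2 = sum1 + in;             /* -> lower switching activity on SUM1         */
    return sum2;
}

int main(void)
{
    int32_t x = 20000;
    printf("%d %d\n", topology1(x), topology2(x));   /* both compute the same value */
    return 0;
}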
Architecture Optimization
• Reducing glitching activity
  – static designs can exhibit spurious transitions due to the finite propagation delay from one logic block to the next
  – it is important to balance all signal paths and reduce the logic depth
  – multiple-input addition (see the sketch after the figure below)
    • 4-input case: the chained implementation has about 1.5x the glitching activity of the tree implementation
    • 8-input case: about 2.5x that of the tree implementation
[Figure: chained implementation ((A + B) + C) + D versus tree implementation (A + B) + (C + D) of a 4-input addition]
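The same 4-input addition written both ways, for reference (in hardware the difference is the adder structure and the balance of the signal paths, not the arithmetic result):

#include <stdio.h>

/* Chained 4-input addition: each adder waits on the previous one, so late-arriving
 * partial sums ripple through the whole chain and cause glitching in hardware. */
static int add4_chained(int a, int b, int c, int d)
{
    return ((a + b) + c) + d;
}

/* Tree 4-input addition: balanced paths, logic depth of two adders instead of three. */
static int add4_tree(int a, int b, int c, int d)
{
    return (a + b) + (c + d);
}

int main(void)
{
    printf("%d %d\n", add4_chained(1, 2, 3, 4), add4_tree(1, 2, 3, 4));  /* same result */
    return 0;
}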
Synchronous VS. Asynchronous SYSTEMS
• Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in the computation but are required for synchronization. With advances in technology, systems keep getting bigger, and as a result the delay on the clock wires can no longer be ignored; clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, not because they have to process new inputs, and the biggest gate of all is the clock driver itself, which must always switch.
• Asynchronous system (self-timed): an input signal (request) starts the computation in a module, and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems are potentially responsive to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.
Synchronous VS. Asynchronous SYSTEMS
• Asynchronous designs are more difficult to implement, requiring explicit synchronization between communicating blocks without clocks.
• If a signal feeds directly into conventional gate-level circuitry, invalid logic levels can propagate throughout the system.
• Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.
• Asynchronous designs are not widely used; designers cannot find the supporting design tools and methodologies they need.
• The DCC error corrector of the compact cassette player saves 80% of the power compared to its synchronous counterpart.
• Asynchronous design offers more architectural options and freedom, encourages distributed, localized control, and offers more freedom to adapt the supply voltage.
Asynchronous Modules
Example: ABCS protocol
6% more logic
Control Synthesis Flow
PIPELINED SELF-TIMED micro P
Programming Style
Speed vs. Power Optimization