Low-Power MP-SoC-1 - VADA


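Low-Power/High-Performance System SoC Design for
Next-Generation Mobile (DMB, WiBro) Platforms
Jun-Dong Cho
VADA Lab.
SungKyunKwan University
2006.8
School of Information and Communication Engineering, Sungkyunkwan University
© Jun-Dong Cho
Fall 2006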
Outline
• Low-power SoC design fundamentals
– Power metric
– Basic low-power design techniques
• Low-power design using reconfigurable architectures
– Reconfigurable Radio Systems (Software Defined Radio)
– Fine-grain: FPGA
– Medium-grain: FPFA
– Coarse-grain: Hybrid SoC
– Systolic Ring
• Low-power design through MP-SoC
– Homogeneous MP-SoC
– Heterogeneous MP-SoC
© Jun-Dong Cho, Summer 2006
Low-Power Design Fundamentals
Mobile terminal = small size + low power + functionality
Example devices:
• GPS
• Cochlear implant
• Cellular phone
• Noise cancellation headphones
• Medical watch
• Hearing aid
• Digital still camera
• Portable audio
• Digital radio
Future Mobile Computing
• Real-time mobile supercomputing
– Speech recognition, cryptography
– Augmented reality
• Requires the compute power of 16 Pentium 4 processors
– 2004: Intel P4 @ 3 GHz; 55M transistors, 122 mm², 0.09 µm
– 2014: 20 GHz, 0.03 µm
• High performance within a low power budget
– Requires (massive) parallelism
– Multi-processor systems
– Subsystem integration
Source: Mudge et al.
Issues for Next-Generation Handheld Terminals
Mobile Platform Architecture
Platform Layers and Classification
• Level 0: Foundation Platform
– Infrastructure & standards: basic architecture
– Processor core, peripheral/interface IP, bus: e.g., ARM PrimeXsys
• Level 1: Application-Specific Integration Platform
– Application-specific SoC: HW & SW
– Mobile platform, home platform
• Level 2: System Platform
– Terminal platform
– Handset case: RF + modem + AP + memory + MMI
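Platform Example (Scalable Multi-processors)
Low-Power PC and Embedded Processors
• High-end processor cores
– AMD: memory controller and northbridge integrated into the CPU; graphics processor integration through the ATI acquisition
• Removes the external-memory access bottleneck and reduces power consumption
– Intel: larger on-chip cache, improved prefetch mechanism
– IBM: multimedia functions integrated in the Cell processor
• As multi-core processors become common for high performance at low power, symmetric multiprocessors are expected to be adopted in next-generation handheld chipsets as well
• ARM plans to announce AMBA 4 as the successor to AXI; SonicsMX technology from Sonics, a NoC pioneer, has been adopted in OMAP and other platforms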
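ARM MPCore™ Architecture
Supporting Multiple Standards
• In 4G, a platform-based design process is important for reducing the manufacturing cost of a large number of components.
• To provide diverse services, H.264 and MPEG4 will be implemented on a single platform, WCDMA and CDMA2000 on a multi-DSP platform, and an MPSoC platform that can also accommodate WiBro, MoIP, and similar standards will be needed.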
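Reducing Power Consumption
• The design must keep peak power below 75 mW. To do so, parallel processing over processor cores of various sizes is used: selecting appropriately sized cores and scaling their voltage and frequency raises power efficiency.
• Intel, IBM, NEC, and others are actively researching this, and the need for efficient design methods is becoming more pressing.
Reducing Design Cost
• As transistor dimensions reach the nanometer scale, mask NRE reaches 1M$ and design NRE ranges from 10M$ to 75M$. To cut this excessive up-front design cost and to ease hardware verification, conventional ASIC chips are increasingly being replaced with embedded processors.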
Emb. Systems Prog. 2005: # of Processors per chip
MP-ARM Platform
Parallelism favors lower power solutions
P. G. Paulin et al., “Parallel Programming Models for a Multiprocessor SoC Platform Applied to Networking and Multimedia”, IEEE Transactions on VLSI Systems, Vol. 14, No. 7, July 2006
Parallelism Inside the Processor: Three Forms in Extensible Instruction Sets
Chris Rowen, President and CEO, Tensilica, Inc.
Multiple concurrent processors: much lower energy
Chris Rowen, President and CEO, Tensilica, Inc.
Keys to Efficient MP: Flexible range of topologies
Chris Rowen, President and CEO, Tensilica, Inc.
Different Multi-processor Design Flows
Chris Rowen, President and CEO, Tensilica, Inc.
Parallel Architectures
Why Low-Power Devices Are Needed
• Practical: reducing the power requirements of high-throughput portable applications
• Economic: reducing packaging costs and achieving memory savings
• Technical: excessive heat prevents the realization of high-density chips and limits their functionality
Driving Forces for Low Power: Deep-Submicron Technology
ADVANTAGES
• Smaller geometries
• Higher clock frequencies
DISADVANTAGES
• Higher power consumption
• Lower reliability
Dynamic Power
[Figure: static CMOS gate with PMOS and NMOS networks between V_DD and ground, input V_in, output V_o, load capacitance C_L, supply current i_DD]
• Average power consumption by a node cycling at each period T:
$P(t) = \frac{dE}{dt} = V_{DD} \, i_{DD}(t)$
$i_{DD}(t) = C_L \frac{dV_o}{dt}$
$E_{0 \to 1} = \int_0^{t_d} P(t)\,dt = V_{DD} C_L \int_0^{V_{DD}} dV_o = C_L V_{DD}^2$
• Average power consumed by a node with partial activity (only a fraction $\alpha_{0 \to 1}$ of the periods has a transition):
$P_{switching} = \alpha_{0 \to 1} C_L V_{DD}^2 f_{CLK}$
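CMOS Energy and Power
• Energy:
$E = C_L V_{DD}^2 P_{0 \to 1} + t_{sc} V_{DD} I_{peak} P_{0/1 \to 1/0} + V_{DD} I_{leak} / f_{clock}$
with $f_{0 \to 1} = P_{0 \to 1} \cdot f_{clock}$
• Power:
$P = C_L V_{DD}^2 f_{0 \to 1} + t_{sc} V_{DD} I_{peak} f_{0 \to 1} + V_{DD} I_{leak}$
– $C_L V_{DD}^2 f_{0 \to 1}$: dynamic power (~80% today and decreasing relatively)
– $t_{sc} V_{DD} I_{peak} f_{0 \to 1}$: short-circuit power (~5% today and decreasing absolutely)
– $V_{DD} I_{leak}$: leakage power (~15% today and increasing)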
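As a rough illustration of the three-term power equation on this slide, the sketch below evaluates each component for a single node. The function name `cmos_power` and every numeric value are assumptions chosen only to show the arithmetic; they are not taken from the slides.

```python
def cmos_power(c_load, vdd, p01, f_clock, t_sc, i_peak, i_leak):
    """Return (dynamic, short_circuit, leakage) power in watts for one node."""
    f01 = p01 * f_clock                        # effective 0->1 transition frequency
    dynamic = c_load * vdd ** 2 * f01          # C_L * VDD^2 * f_0->1
    short_circuit = t_sc * vdd * i_peak * f01  # t_sc * VDD * I_peak * f_0->1
    leakage = vdd * i_leak                     # VDD * I_leak
    return dynamic, short_circuit, leakage


# Illustrative values only (assumptions, not from the slides).
dyn, sc, leak = cmos_power(c_load=50e-15, vdd=1.8, p01=0.25, f_clock=100e6,
                           t_sc=50e-12, i_peak=100e-6, i_leak=1e-9)
total = dyn + sc + leak
print(f"dynamic={dyn:.3e} W  short-circuit={sc:.3e} W  leakage={leak:.3e} W")
print(f"dynamic share = {dyn / total:.1%}")
```

Changing `vdd` shows the quadratic dependence of the dynamic term, while the leakage term only scales linearly.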
QoS vs. Power
• How accurate should I make my FDCT?
Static Power
$P_{static} = V_{CC} \times N_{tr} \times I_{leak}$
(N_tr: number of transistors; I_leak: leakage current per transistor)
Maximum clock frequency is a function of the supply voltage
• Gate delay: $t = k \, C_L V_{dd} / (V_{dd} - V_t)^2$, e.g., V_dd = 3.3 V, V_t = 0.8 V
• Decreasing V_dd reduces power quadratically, while the run-time of algorithms increases only linearly => dynamic voltage scaling
• Crusoe (Transmeta): 32 voltage levels between 1.1 and 1.6 V, clock from 200 MHz to 700 MHz in increments of 33 MHz
• Voltage transitions take 20 ms
• Intel SpeedStep for Mobile Pentium III: 3 levels
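The trade-off behind dynamic voltage scaling can be made concrete with the delay model on this slide. The sketch below is a minimal illustration, assuming k = C_L = 1 (arbitrary units) and the slide's example values V_dd = 3.3 V and V_t = 0.8 V as the reference point; the function names are hypothetical.

```python
def gate_delay(vdd, vt=0.8, k=1.0, c_load=1.0):
    """Delay model from the slide: t = k * C_L * Vdd / (Vdd - Vt)^2."""
    return k * c_load * vdd / (vdd - vt) ** 2


def relative_energy(vdd, vdd_ref=3.3):
    """Switching energy per operation scales with Vdd^2."""
    return (vdd / vdd_ref) ** 2


for vdd in (3.3, 2.5, 1.8):
    slowdown = gate_delay(vdd) / gate_delay(3.3)
    print(f"Vdd={vdd:.1f} V  slowdown={slowdown:.2f}x  "
          f"energy={relative_energy(vdd):.2f}x")
```

Note that with this model the slowdown grows faster than linearly as V_dd approaches V_t, so the usable scaling range depends on how much timing slack the workload leaves.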
SCALING TREND
• Keeping pace with Gene’s Law: DSP chip energy efficiency (MIPS/Watt) doubles every 18 months
→ Technology & circuits: voltage islands; architecture: MPSoC
• Low cost
→ Integrate, but only when cost effective; push towards A & D integration
• High flexibility
→ Software radios, reconfigurable architectures
• Reduce static power in idle state
→ Variable Vdd, VT
Status of Low-Power Technology Development
(Developer: product; features)
• IBM Austin, DoD DARPA: DPM (PowerPC 405LP); power management, scheduling, and OS support for a portable processor (90% power reduction)
• Philips, STMicroelectronics, Atmel: PCF50606, a single-chip power management unit (for smart phones and wireless PDAs); programmed power management (70% power reduction)
• Atrenta: SpyGlass CAD tool; generates gated-clock structures from RTL-level HDL and SystemC
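Energy-Flexibility Gap
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility]
• Signal-processing ASIC: 200 MOPS/mW
• Reconfigurable architectures: 10-80 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• Embedded processor (ARM): 0.5 MOPS/mW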
Two Factors for Reducing Energy
1. C (switched capacitance)
– Redundant h/w extraction
– Locality of reference
– Demand-driven / data-driven computation
– Preservation of data correlations
– Power-down techniques (clock gating, dynamic power management)
– All-in-one approach (SoC)
2. Vdd
– Dynamic voltage scaling based on workload
– 2-D pipelining (systolic arrays)
– Parallel processing
Low-Power Design Techniques
• Voltage and process scaling
• Design methodologies
– Power-aware design flows and tools; trade area for lower power
• Architecture design
• Power-down techniques
– Clock gating, dynamic power management
• Dynamic voltage scaling based on workload
• Power-conscious RT/logic synthesis
• Better cell library design and resizing methods
– Capacitance reduction, threshold control, transistor layout
Power Analysis
• Fast and accurate analysis early in the design process
– Power budgeting
– Knowledge-based architectural and implementation decisions
– Package selection
– Identification of power-hungry modules
• Detailed and comprehensive analysis at the later stages
– Satisfaction of power budget and constraints
– Hot spots
Power Savings
Estimation Expectations
IBM’s PowerPC
• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
– Five instructions in parallel (IU, FPU, BPU, LSU, SRU), RISC
– FPU is pipelined so a multiply-add instruction can be issued every clock cycle
– Low-power 3.3-volt design
– 603e provides four software-controllable power-saving modes
• Copper processor with SOI
• IBM’s Blue Logic ASIC: new design reduces power by a factor of 10
Silicon-on-Insulator
• How does SOI reduce capacitance?
– Junction capacitance is eliminated: an insulating layer (similar to glass) is placed between the impurities and the silicon substrate
– Result: high performance, low power, low soft-error rate
Why Copper Processor?
• Motivation: Aluminum resists the flow of
electricity as wires are made thinner and
narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: Less power from batteries
• Chip Size: 60% smaller than Aluminum chip
Factors Influencing Ceff
• Circuit function
• Circuit technology
• Input probabilities
• Circuit topology
Some Basic Definitions
• Signal probability of a signal g(t) is given by
$P(g) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} g(t)\,dt$
• Signal activity of a logic signal g(t) is given by
$A(g) = \lim_{T \to \infty} \frac{n_g(T)}{T}$
where $n_g(T)$ is the number of transitions of g(t) in the time interval between –T/2 and T/2.
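In practice both quantities are estimated over a finite window rather than in the limit. The following sketch does this for a sampled 0/1 waveform; the function name `signal_stats`, the sample sequence, and the 1 ns sampling interval are illustrative assumptions.

```python
def signal_stats(samples, dt):
    """Estimate signal probability and activity from a sampled 0/1 waveform.

    samples: list of 0/1 values taken every dt seconds.
    """
    total_time = len(samples) * dt
    probability = sum(samples) / len(samples)  # fraction of time at logic 1
    transitions = sum(a != b for a, b in zip(samples, samples[1:]))
    activity = transitions / total_time        # transitions per second
    return probability, activity


p, a = signal_stats([0, 1, 1, 0, 1, 0, 0, 1], dt=1e-9)
print(f"P(g) ~ {p:.2f}, A(g) ~ {a:.3e} transitions/s")
```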
Factors Influencing Ceff: Circuit Function
• Assume that there are M mutually independent signals g1, g2, ..., gM, each having a signal probability Pi and a signal activity Ai, for i = 1, ..., M.
• For static CMOS, the signal probability at the output of a gate is determined according to the probability of 1s (or 0s) in the logic description of the gate:
– Inverter: input P1, output 1 - P1
– AND: inputs P1, P2, output P1P2
– OR: inputs P1, P2, output 1 - (1 - P1)(1 - P2)
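These three rules compose, so the signal probability at any node of a tree-structured network with independent inputs can be propagated gate by gate. A minimal sketch follows; the helper names and the example function y = (a AND b) OR (NOT c) are assumptions for illustration.

```python
def p_not(p1):
    """Inverter: output is 1 when the input is 0."""
    return 1.0 - p1


def p_and(p1, p2):
    """AND of two independent signals."""
    return p1 * p2


def p_or(p1, p2):
    """OR of two independent signals."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)


# y = (a AND b) OR (NOT c); the two OR inputs share no primary input,
# so the independence assumption holds.
pa, pb, pc = 0.5, 0.5, 0.5
py = p_or(p_and(pa, pb), p_not(pc))
print(f"P(y = 1) = {py}")
```

The rules stop being exact once a signal fans out and reconverges, because the gate inputs are then correlated.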
Factors Influencing Ceff: Circuit Function (Static CMOS)
• Transistors connected to the same input turn on and off simultaneously when the input changes.
• CL of a static CMOS gate is charged to VDD any time a 0→1 transition at the output node is required.
• CL of a static CMOS gate is discharged to ground any time a 1→0 transition at the output node is required.
[Figure: NOR gate]
Factors Influencing Ceff: Circuit Function (Static CMOS)
• State transition diagram of the NOR gate (for equiprobable independent inputs, p_Y = 1/4)
$\alpha = (1 - p_Y)\,p_Y + p_Y\,(1 - p_Y) = 3/8$
• State transition diagram of the NOR gate
$\alpha = p_{Y'}\,p_Y + p_Y\,p_{Y'} = 1/2$
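The 3/8 figure can be checked numerically. The sketch below assumes two independent inputs with probability 0.5 each and no correlation between consecutive cycles; both function names are hypothetical.

```python
from itertools import product


def nor_output_probability(p_a, p_b):
    """P(Y = 1) for Y = NOR(A, B) with independent inputs."""
    return sum(
        (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        for a, b in product((0, 1), repeat=2)
        if not (a or b)
    )


def transition_activity(p_y):
    """Probability that the output toggles between two independent cycles:
    (1 - p_Y) * p_Y for 0->1 plus p_Y * (1 - p_Y) for 1->0."""
    return (1 - p_y) * p_y + p_y * (1 - p_y)


p_y = nor_output_probability(0.5, 0.5)
alpha = transition_activity(p_y)
print(f"p_Y = {p_y}, alpha = {alpha}")
```

An output probability of 1/2 gives the maximum value of 1/2, which matches the second expression on the slide.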
Factors Influencing Ceff: Input Probabilities (Static CMOS)
• Signal activity calculation: Boolean difference
$\frac{\partial f}{\partial x_i} = f|_{x_i = 1} \oplus f|_{x_i = 0}$
It signifies the condition under which output f is sensitized to input x_i.
• If the primary inputs to function f are not spatially correlated, the signal activity at f is
$A(f) = \sum_{i=1}^{N} P\!\left(\frac{\partial f}{\partial x_i}\right) A(x_i)$
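For small functions the Boolean difference can be evaluated by enumerating the other inputs. The sketch below is one way to do that under the same independence assumption; the function names, the 2-input AND example, and the input activities are illustrative assumptions.

```python
from itertools import product


def boolean_difference_prob(f, n, i, p):
    """P(df/dx_i = 1): probability that f is sensitive to input i.

    f takes a tuple of n bits; p[j] is the signal probability of input j.
    """
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product((0, 1), repeat=n - 1):
        x = [0] * n
        weight = 1.0
        for j, b in zip(others, bits):
            x[j] = b
            weight *= p[j] if b else 1 - p[j]
        x[i] = 1
        f1 = f(tuple(x))
        x[i] = 0
        f0 = f(tuple(x))
        if f1 != f0:          # f|x_i=1 XOR f|x_i=0
            total += weight
    return total


def output_activity(f, p, a):
    """A(f) = sum_i P(df/dx_i) * A(x_i) for spatially uncorrelated inputs."""
    n = len(p)
    return sum(boolean_difference_prob(f, n, i, p) * a[i] for i in range(n))


def and2(x):
    return x[0] & x[1]


act = output_activity(and2, p=[0.5, 0.5], a=[0.2, 0.4])
print(f"A(f) for AND2 = {act:.3f}")
```

For the AND gate, the output is sensitive to one input exactly when the other input is 1, so each Boolean-difference probability is 0.5 here.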
Architecture Driven Supply
Voltage Scaling
• Strategy:
1. Modify the architecture of the system so as to make it faster.
2. Reduce VDD so as to restore the original speed. Power
consumption has decreased.
• The most common architectural changes rely on the
exploitation of parallelization and pipelining.
• Drawback:
The additional circuitry required to compensate the speed
degradation may dominate, and the power consumption
may increase.
• Consequence:
Parallelism and pipelining do not always pay off.
Parallel Architectures
$P_{par} = 0.36\,P_{ref}$
Parallel-Pipelined Architectures
$P_{par} = 0.2\,P_{ref}$
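Both ratios follow from P = C V² f once the capacitance, voltage, and frequency ratios are known. The slides do not state those ratios; the sketch below assumes the values commonly quoted for the classic adder-comparator datapath example (capacitance 2.15x, voltage 0.58x, frequency 0.5x for the parallel version; 2.5x, 0.4x, 0.5x for the parallel-pipelined version). The function name is hypothetical.

```python
def relative_power(cap_ratio, vdd_ratio, freq_ratio):
    """P / P_ref for P = C * Vdd^2 * f, each factor relative to the reference."""
    return cap_ratio * vdd_ratio ** 2 * freq_ratio


# Assumed ratios (not given on the slides).
p_par = relative_power(cap_ratio=2.15, vdd_ratio=0.58, freq_ratio=0.5)
p_parpipe = relative_power(cap_ratio=2.5, vdd_ratio=0.4, freq_ratio=0.5)
print(f"parallel: {p_par:.2f} P_ref, parallel-pipelined: {p_parpipe:.2f} P_ref")
```

The capacitance more than doubles because of the duplicated datapath and extra routing, but the quadratic voltage term dominates.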
Loop unrolling
Original loop:
for i = 2 to N - 1
    A(i) = A(i) + A(i - 1) A(i + 1)
Unrolled by two:
for i = 2 to N - 2 step 2
    A(i) = A(i) + A(i - 1) A(i + 1)
    A(i + 1) = A(i + 1) + A(i) A(i + 2)
Unrolling a first-order recurrence:
$Y_{n-1} = X_{n-1} + A \cdot Y_{n-2}$
$Y_n = X_n + A \cdot Y_{n-1} = X_n + A \cdot (X_{n-1} + A \cdot Y_{n-2})$
After algebraic transformation:
$Y_{n-1} = X_{n-1} + A \cdot Y_{n-2}$
$Y_n = X_n + A \cdot X_{n-1} + A^2 \cdot Y_{n-2}$
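A quick way to convince oneself that the transformed recurrence is equivalent is to run both forms on the same input. The sketch below does this for Y_n = X_n + A·Y_{n-1}; the function names and the test input are illustrative assumptions.

```python
def iir_direct(x, a, y0=0.0):
    """y[n] = x[n] + a * y[n-1], one sample per iteration."""
    y, prev = [], y0
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y0=0.0):
    """Two samples per iteration; both outputs depend only on y[n-2]."""
    assert len(x) % 2 == 0, "this sketch expects an even number of samples"
    a2 = a * a                                    # constant computed once
    y, prev2 = [], y0
    for n in range(0, len(x), 2):
        y_first = x[n] + a * prev2                       # Y_{n-1} on the slide
        y_second = x[n + 1] + a * x[n] + a2 * prev2      # Y_n on the slide
        y.extend((y_first, y_second))
        prev2 = y_second
    return y


x = [1.0, 2.0, 3.0, 4.0]
print(iir_direct(x, 0.5))
print(iir_unrolled(x, 0.5))
```

In the unrolled form the two outputs no longer depend on each other, so the critical path per pair of samples is shorter and the slack can be traded for a lower supply voltage.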
Loop Unrolling for Low Power
Algebraic Transformation and Constant Propagation
Loop Unrolling for Low Power
Encoding
• Bus-invert (BI) code
– Appropriate for random data patterns
– Redundant code (1 extra bus line)
– Reduces average transitions by up to 25%
Example (4-bit bus):
Data   Bus    inv
0000   0000   0
1010   1010   0
0100   1011   1
1111   1111   0
1010   1010   0
0100   1011   1
1101   0010   1
0011   0011   0
[Figure: bus-invert encoder and decoder; labels: X, majority voter, Z, D, inv]
R. J. Fletcher, “Integrated circuit having outputs configured for reduced state changes,” May 1987, U.S. Patent 4667337.
M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Tr. on VLSI Systems, Mar. 1995, pp. 49-58.
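The example sequence on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the bus and the invert line both start at 0 and that the Hamming distance counts the current invert line, which is the rule that matches the slide's values; the function name is hypothetical.

```python
def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_bit) pairs for the given data words."""
    mask = (1 << width) - 1
    bus, inv = 0, 0                      # assumed initial bus state
    encoded = []
    for word in words:
        # Distance between the current bus state (data lines plus invert line)
        # and the candidate next state "word with invert = 0".
        distance = bin((bus ^ word) & mask).count("1") + inv
        if distance > width / 2:
            bus, inv = ~word & mask, 1   # send the complement, raise invert
        else:
            bus, inv = word, 0
        encoded.append((bus, inv))
    return encoded


words = [0b0000, 0b1010, 0b0100, 0b1111, 0b1010, 0b0100, 0b1101, 0b0011]
encoded = bus_invert_encode(words, width=4)
for word, (bus, inv) in zip(words, encoded):
    print(f"{word:04b} -> {bus:04b} inv={inv}")
```

The receiver recovers each word by complementing the bus value whenever the invert line is 1.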
Supplying Appropriate Voltages Through Partitioning
• Partition the chip into multiple sub-units, each of which is designed to operate at a specific supply voltage
[Figure: partitioned chip; SLOW blocks at 3 V, FAST block at 5 V]
Using Vdd programmability
Wayne Burleson
• High Vdd to devices on critical paths
• Low Vdd to devices on non-critical paths
• Vdd off for inactive paths
[Figure: A – baseline fabric; B – fabric with Vdd-configurable interconnect]
This work builds on a similar idea for FPGAs described in:
Fei Li, Yan Lin and Lei He. Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004
Automated Low-Power Technique
Exploiting Multiple Supply Voltages
Applied to a Media Processor.
Voltage Scaling for Low Power
• Dynamic power ∝ freq × (C × Vdd²)
• For a CMOS gate, when Vdd ↓
– power ↓↓
– delay ↑
• How to minimize the delay penalty while enjoying the power gain?
Voltage Scaling for Low Power
• Minimizing the delay penalty due to voltage scaling
– Circuit level
• Lowering the threshold voltage
• Problem: standby leakage current increases
• Requires additional process steps or substrate bias control
– Architecture level
• Speed up (pipelining, concurrency), then scale down the supply voltage, or
• Match the supply voltage to the throughput requirement
• Area penalty resulting from parallel datapaths
=> Need for voltage scaling techniques that do not degrade performance.
• Introduction
• Dual-VDD approach techniques
• Clustered-Voltage-Scaling (CVS) Structure.
• Extended Clustered-Voltage-Scaling (ECVS) Structure.
• Algorithm for Structure Synthesis.
• Placement And Routing.
• Supply Voltage Reduction in Clock Tree.
• Design Flow And Power Analysis.
• Results And Discussions.
Active Power Reduction: Multiple Vdd
[Figure: slow and fast blocks assigned to high and low supply voltages]
• Vdd scaling slows the circuit down
• Mimic Vdd scaling with multiple Vdd
• Advantages:
– No need to change Vth or the fab process
– No need to create parallel datapaths, which cause an area penalty
• Challenges:
– Interface between low and high Vdd
– Delivery and distribution
Dual-VDD approach
Basics of Multiple Supply Voltage
• Use of dual supply voltages can help reduce the dynamic power consumed in a circuit.
• Uses the reduced voltage VDDL in non-critical paths.
• Applies the original voltage VDDH to critical paths.
[Figure: VDDL gates shown as shaded]
The Problem
• When a low-voltage (VDDL) gate drives a high-voltage (VDDH) gate, the PMOS is not turned off by the weak-1 input and conducts static current from the supply to ground.
• Inserting a level converter blocks the static current.
Level Converter
• Interfacing gates operating with different supply voltages requires the use of level converters.
• Inserting a level converter is the way to block the static current.
[Figure: level converter between a low-swing (0 to VDDL) input and a high-swing (0 to VDDH) output]
Basics of Multiple Supply Voltage
[Figure: VDDL gates shown as shaded]
• Level converters may incur substantial delay and power dissipation.
• Key: minimize the number of level converters by clustering.
Clustered Voltage Scaling (CVS)
The following kinds of connections are possible:
• Between VDDL gates
• Between VDDH gates
• VDDH to VDDL gates
=> None of these needs a level converter.
=> A level converter is used only from VDDL to VDDH gates.
[Figure: VDDL gates shown as shaded]
• The number of level converters needed is almost the same as the number of VDDL flip-flops (the interface between the output of a VDDL flip-flop and the input of a VDDH gate).
Extended Clustered Voltage Scaling (ECVS)
• Key: maximize the share of VDDL cells without increasing the number of level converters; this leads to a larger power reduction.
• This is done by optimizing the insertion points of the level converters.
– Excessive slack remains in the path from FF1 to G2.
– When the level converter (LC1) is moved up to the interface between G3 and G2, G3 through G5 can be VDDL.
Algorithm for Structure Synthesis
• Use a level-sort technique (a gate's level is determined by its depth from the flip-flops).
• Convert the FFs at level 0 to VDDL if timing can be satisfied.
• Move backward in ascending order of levels, checking whether VDDL can be used for each gate, with possible insertion of a level converter (check timing and power saving).
• If the timing meets the constraints, retry the replacement for everything at level 0.
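The backward traversal can be illustrated with a much simplified greedy pass. The sketch below is not the authors' algorithm: it implements only the plain CVS rule (a gate may move to VDDL only if everything it drives is already VDDL or a flip-flop), ignores level-converter delay and power, and uses a hypothetical four-gate netlist with made-up delays.

```python
def longest_path(order, fanin, delay):
    """Longest combinational path delay; 'order' must be topological."""
    arrival = {}
    for g in order:
        arrival[g] = delay[g] + max((arrival[p] for p in fanin[g]), default=0.0)
    return max(arrival.values())


def cvs_assign(order, fanin, d_high, d_low, t_clk):
    """Greedy VDDL assignment, walking from the outputs back toward the inputs."""
    fanout = {g: [] for g in order}
    for g in order:
        for p in fanin[g]:
            fanout[p].append(g)
    low = set()
    for g in reversed(order):
        if any(s not in low for s in fanout[g]):
            continue                     # would drive a VDDH gate directly
        trial = low | {g}
        delay = {x: d_low[x] if x in trial else d_high[x] for x in order}
        if longest_path(order, fanin, delay) <= t_clk:
            low = trial                  # timing still met: keep g at VDDL
    return low


# Hypothetical netlist: g1 and g2 feed g3; g4 is an independent short path.
order = ["g1", "g2", "g3", "g4"]
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": []}
d_high = {g: 1.0 for g in order}         # assumed gate delay at VDDH
d_low = {g: 1.5 for g in order}          # assumed gate delay at VDDL
low = cvs_assign(order, fanin, d_high, d_low, t_clk=2.5)
print(sorted(low))
```

In this toy case g3 and g4 absorb the available slack, after which g1 and g2 must stay at VDDH to meet the clock period.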
III. PLACEMENT AND ROUTING
• Compared possible layout architectures and examined
strong points and problems at each architecture.
• Contents
– Simplest Architecture
– Proposed Architecture
– Result of P&R
Simplest Architecture
• Strength
– Easy to generate layout
• Weakness
– Long wire length
– Area increase
– Degradation in performance
Proposed Architecture
• Strength
– Wire length is minimized
– High performance
– Minimized area
• Weakness
– Not easy to generate layout
Result of P&R for dual-VDD
• Strength
– No need to create new patterns for VDDL cells
– The VDDL supply terminal of the level-converter cell is automatically connected to the neighboring VDDL power line
IV. SUPPLY VOLTAGE REDUCTION IN CLOCK TREE
• The majority of the power in a chip is dissipated in the clock circuitry.
• Reducing the power of the clock circuitry by lowering its supply voltage is quite effective.
• Contents
– A. Clock Tree Structure with Dual Supply Voltages
– B. Procedure of Generating Clock Tree
A. Clock Tree Structure
• In this structure, the root driver of the clock operates at VDDH, while the clock buffers in the clock tree operate at VDDL.
B. Procedure of Generating Clock Tree
– Partition the flip-flops into groups so that the load is balanced.
– Build a clock tree while controlling the number of clock-buffer stages and the buffer sizes so that the clock skew is minimized.
– Control the interconnect capacitance in the clock tree by controlling the actual location of each clock buffer in the layout, to minimize the clock skew.
Design flow
• RTL design and logic synthesis are performed with a single supply voltage, VDDH.
• ECVS structure synthesis is performed on the gate-level mapped netlist resulting from the logic synthesis.
• The result of the ECVS structure synthesis is used to generate a layout with dual VDDs.
Power analysis
• First, a Verilog simulation of the entire system description is run, capturing toggle information at the input pins of the submodules.
• Next, a PowerMill simulation analyzes the power of each submodule, using the toggle information captured in the Verilog simulation as stimuli.
VI. RESULTS AND DISCUSSIONS
• Contents
– A. Application Example
– B. Optimal Voltage of VDDL
– C. Results from ECVS Structure Synthesis
– D. Power Reduction
– E. Clock Skew in Dual VDD Clock-Tree
– F. Performance
– G. Area Overhead
– H. Noise Analysis and Avoidance Scheme
A. Application Example
– Mpact media processor chip
– Digital audio/video applications
• MPEG2 decoding
• Real-time MPEG1 encoding
• Personal video conferencing
• 28.8 kb/s fax/modem functions
– Main frequency is 75 MHz
– Dual-VDD approach applied to seven random-logic modules
B. Optimal Voltage of VDDL
• At a lower VDDL, each VDDL cell consumes less power, but fewer cells can be replaced with VDDL ones.
• At a higher VDDL, more cells can be replaced with VDDL ones, but the power saved per VDDL cell is smaller.
C. Results from ECVS Structure Synthesis
• ECVS structure synthesis was performed on each individual random-logic module.
• 76% of the cells were replaced with VDDL ones on average.
• The path-delay distributions of the original design and of the design with the ECVS structure are compared in the figure.
• The ECVS technique has pushed the center of the distribution toward the right, effectively spending the excessive slack remaining in the design without causing timing violations.
D. Power Reduction
• The power was reduced by 39-57% (47% on average).
• Power reduction at the flip-flops makes the biggest contribution.
• Key: flip-flops make the biggest contribution to the total power reduction. The power was reduced by 28% on average by using this scheme; with this flip-flop circuit, the skew between the clock and its inverse is minimized without complicating the design.
• Key: the clock tree accounts for the major part of the power of the entire clock system; in the clock tree, consisting of clock buffers and interconnections, the power was reduced by 73%.
E. Clock Skew in Dual VDD Clock-Tree
• In the row-by-row architecture, clock buffers in a clock tree operating at VDDL are placed in VDDL rows.
• The degree of freedom in buffer placement increases.
• Goal: minimize the clock skew.
F. Performance
• Timing violations caused by the increase in wire length were very few.
• Gate sizing was performed at the layout phase to reduce the delay.
• The design meets the same performance constraints as the original design.
G. Area Overhead
• Reduction in the degree of freedom in cell placement.
• Increase in the total cell count resulting from inserting level converters.
• Increase in the power-line area resulting from adding VDDL power lines.
• Chip size increased by 7%, becoming 7.8 × 7.3 mm².
VII. CONCLUSION
• The dual-VDD technique reduces power with a small area overhead while maintaining circuit performance and clock skew.
• Dual VDD flow: structure synthesis, P&R, clock-tree generation.