Design Productivity Crisis

Transcript Design Productivity Crisis

ORION2.0: A Fast and Accurate NoC
Power and Area Model for Early-Stage
Design Space Exploration
Andrew B. Kahng¶
Bin Li‡
Li-Shiuan Peh‡
Kambiz Samadi¶
¶ University of California, San Diego
‡ Princeton University
April 21, 2009
1
Outline
Motivation
ORION2.0 Framework
Dynamic Power Modeling
Leakage Power Modeling
Area Modeling
Validation and Significance Assessment
Conclusions
2
Motivation
 Many-core chips → NoCs are needed to interconnect the cores → power efficiency of NoCs is important
 Performance used to be the primary concern
 Now power efficiency is critical
 28% of total power in the Intel 80-core Teraflops chip is due to the interconnection network (routers + links)
 → Need rapid power estimation to trade off alternative architectures
 Rapid power-area tradeoffs at the architectural level
Our Goal: Develop accurate models that are easily usable by system-level designers early in the design cycle
3
Related Work
 Real-chip power measurements (Isci et al. 03)
 RTL-level NoC power estimation (A. Banerjee et al. 07, and N. Banerjee et al. 04)
 Simulation is slow
 Requires detailed RTL modeling → not suitable for early-stage NoC design space exploration
 Architectural-level power estimation
 Interconnection network (Patel et al. 97): the model is not instantiated with architectural parameters → not suitable for exploring tradeoffs in router microarchitecture
 Uniprocessor power modeling (Wattch: Brooks et al. 00 and SimplePower: Ye et al. 00)
 NoC power modeling (ORION 1.0: Wang et al. 02)
 ORION 1.0
 Has been widely used
 Supports early-stage design space exploration for NoC power-performance tradeoff analysis
4
ORION 1.0 Modeling Methodology
 Power models derived for major building blocks
(FIFO, Crossbar, and arbiter)
 For each component, a canonical structure is
described in terms of architectural and technological
parameters
 Detailed analysis is performed to determine
parameterized capacitance equations
 Capacitance equations and switch activity estimation
are combined to determine power consumption
 Power models are based on detailed estimates of
gate and wire capacitance and switching activity
5
Limitations of ORION 1.0
Parameters

Parameter    Value      Description           ORION 1.0   ORION 2.0
B            16         #buffers              yes         yes
F            39         flit-width            yes         yes
P            5          #ports                yes         yes
V            2          #virtual channels     yes         yes
X            5          #crossbar ports       yes         yes
tech         65nm       technology node       yes         yes
fclk         5.1 GHz    clock frequency       yes         yes
Vdd          1.2V       supply voltage        yes         yes
Npipeline               #pipeline stages      no          yes
App                     application domain    no          yes
D                       chip dimension        no          yes

Power (mW)

Component    ORION 1.0    Intel 80-core
Buffer       25.2         203.3
Crossbar     53.2         138.6
Arbiter      11.1         64.7
Link         -            212.5
Clock        -            304.9
Total        89.5         924

Per-component difference: up to 8.1X. Total difference: 10.3X.
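The two ratios quoted on this slide can be re-derived from the table values. The short Python check below is only an arithmetic sketch; the dictionary names are illustrative and not part of ORION. It reproduces the 10.3X total difference and shows that the 8.1X figure corresponds to the buffer.

```python
# Router power in mW, as listed on the slide.
# The slide gives no ORION 1.0 value for link or clock power,
# so those components are simply absent from the ORION 1.0 dictionary.
orion1_mw = {"buffer": 25.2, "crossbar": 53.2, "arbiter": 11.1}
intel_mw = {"buffer": 203.3, "crossbar": 138.6, "arbiter": 64.7,
            "link": 212.5, "clock": 304.9}

orion1_total = sum(orion1_mw.values())
intel_total = sum(intel_mw.values())

# Ratio of measured total power to the ORION 1.0 estimate.
total_ratio = intel_total / orion1_total

# Per-component ratios, only for components that ORION 1.0 reports.
component_ratios = {name: intel_mw[name] / orion1_mw[name] for name in orion1_mw}
worst_component = max(component_ratios, key=component_ratios.get)

print(f"ORION 1.0 total: {orion1_total:.1f} mW, Intel total: {intel_total:.1f} mW")
print(f"total ratio: {total_ratio:.1f}X")
print(f"largest per-component ratio: {worst_component}, "
      f"{component_ratios[worst_component]:.1f}X")
```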
6
Outline
Motivation
 ORION2.0 Framework
 Dynamic Power Modeling
 Leakage Power Modeling
 Area Modeling
 Validation and Significance Assessment
 Conclusions
7
ORION 2.0: Accurate NoC Router Models
Inputs to ORION 2.0:
architectural parameters
• # of ports; # of buffers
• # of xbar ports; # of VCs
• voltage, frequency
circuit implementation & buffering scheme
• SRAM and register FIFO
• MUX-tree and matrix crossbar
• different arbitration schemes
• hybrid buffering scheme
technology parameters
• interconnect parameters
• device parameters
• scaling factors for future technologies
• …
 Built on top of ORION 1.0
 Uses our automatic/semi-automatic flows to obtain technology inputs
 Provides significant accuracy improvement compared with ORION 1.0
[Figure: router block diagram with input links and buffers (inI/Buf I, inE/Buf E, inW/Buf W, inN/Buf N, inS/Buf S), a crossbar with output links (outI, outE, outW, outN, outS), and an arbiter with request signals (reqI, reqE, reqW, reqN, reqS) and grant signals (grantI, grantE, grantW, grantN, grantS)]
8
ORION 2.0 Improvements
ORION 1.0
Power Subcomponents
• Buffer (SRAM-based)
• Arbiter (dynamic power)
• Crossbar
• Links (dynamic power)
Area (router)

ORION 2.0
Model Infrastructure
• Application-specific technology-level adjustment
• Updated capacitance and transistor sizes
Buffer
• SRAM-based
• Flip-flop-based
Arbiter
• VC allocator model
• Leakage power
Crossbar
Links
• Hybrid buffering
• Leakage power
Clock
Area
• More accurate router area model
• Link area model
9
Model Technology Inputs
 Inputs for power calculation
 Leakage current values (obtained from Liberty (.lib) / SPICE)
 Input capacitance for different repeater size (Liberty, Predictive
Technology Models (PTM))
 Inputs for area calculation
 Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
 Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
 We also provide data for (1) high-performance (HP), and (2)
low-power (LOP) device types for 90nm and 65nm
 Scaling factors for 45nm and 32nm technologies were
obtained from ITRS 2007 / MASTAR5.0
10
Outline
Motivation
ORION2.0 Framework
 Dynamic Power Modeling
 Leakage Power Modeling
 Area Modeling
 Validation and Significance Assessment
 Conclusions
11
Dynamic Power Modeling
 Dynamic Power: Switching Capacitance
 Clock power:
 $P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$
 $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
 Physical links: power is due to charging and discharging of the capacitive load
 $P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$; $C_{load} = C_{ground} + C_{coupling} + C_{input}$
 Register-based FIFO: implemented as shift registers
 Virtual channel allocator: added two models
 Other components: we use ORION 1.0 models with updated transistor and technology parameters
12
Clock Power (1)
 Clock power depends heavily on the distribution topology → we assume an H-tree topology
 $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
 Memory structures: the precharge circuitry is a capacitive load on the clock network
 due to the precharge transistor $T_c$
 $C_{chg} = C_g(T_c) + C_d(T_c)$
 $C_{\text{sram-fifo}} = (P_r + P_w) \times F \times B \times C_{chg}$
 where $P_r$, $P_w$, $B$, and $F$ are #read ports, #write ports, #buffers, and flit-width, respectively
 Pipeline registers: due to different stages in a router
 assume D-flip-flop (DFF) as the building block for pipeline registers
 $C_{\text{pipeline-registers}} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is DFF capacitance
 Register-based FIFO: due to DFF capacitance used in registers
 $C_{\text{register-fifo}} = F \times B \times C_{ff}$
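The clock-load terms on this slide translate directly into code. The Python sketch below is not taken from the ORION sources. The function names are hypothetical, the capacitance values are placeholders, and the activity factor of 1 for the clock is an assumption. Only B = 16, F = 39, Vdd = 1.2V and fclk = 5.1 GHz come from the parameter table earlier in the talk.

```python
def sram_fifo_clock_cap(read_ports, write_ports, flit_width, buffers, c_chg):
    """C_sram-fifo = (Pr + Pw) * F * B * Cchg (precharge load on the clock)."""
    return (read_ports + write_ports) * flit_width * buffers * c_chg


def pipeline_register_clock_cap(n_pipeline, flit_width, c_ff):
    """C_pipeline-registers = Npipeline * F * Cff."""
    return n_pipeline * flit_width * c_ff


def register_fifo_clock_cap(flit_width, buffers, c_ff):
    """C_register-fifo = F * B * Cff."""
    return flit_width * buffers * c_ff


def clock_power(c_clk, vdd, freq, activity=1.0):
    """P_clk = activity * C_clk * Vdd^2 * f (activity = 1 is an assumption)."""
    return activity * c_clk * vdd ** 2 * freq


# Placeholder device capacitances (farads); NOT real technology data.
C_CHG = 1e-15
C_FF = 2e-15

# SRAM-based FIFO router: B = 16 buffers, F = 39 bits,
# one read and one write port, and an assumed 5 pipeline stages.
c_clk = (sram_fifo_clock_cap(1, 1, flit_width=39, buffers=16, c_chg=C_CHG)
         + pipeline_register_clock_cap(5, flit_width=39, c_ff=C_FF))

# Wiring capacitance is left out here, so this covers only the component loads.
print(f"C_clk (components only): {c_clk:.3e} F")
print(f"P_clk: {clock_power(c_clk, vdd=1.2, freq=5.1e9) * 1e3:.2f} mW")
```

Which FIFO term applies depends on the buffer implementation chosen: an SRAM-based router contributes the precharge load, while a register-based one contributes the DFF load.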
13
Clock Power (2)
 Wiring load: due to (1) wiring and (2) clock tree buffers
 Example: 5-level H-tree clock distribution:
$C_{wire} = \left(1 \cdot 8 \cdot \frac{D}{2} + 2 \cdot 4 \cdot \frac{D}{2} + 4 \cdot 2 \cdot \frac{D}{2} + 8 \cdot 1 \cdot \frac{D}{2} + 16 \cdot \frac{D}{2}\right) \times C_w = 24 \times D \times C_w$
 where $D$ and $C_w$ are chip dimension and per-unit-length wire capacitance, respectively
 the capacitive contribution of the clock buffers requires an estimate of the number of buffer stages, $k$:
$k = \sqrt{\frac{0.4 \times R_{int} \times C_{int}}{0.7 \times R_d \times C_{gate}}}$
 where $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are clock tree network wire resistance, wire capacitance, drive resistance, and input gate capacitance of a minimum-size inverter, respectively
$R_{int} = \rho \times 24 \times \frac{D}{w}$
$C_{int} = 24 \times D \times w \times C_{area} + 2 \times 24 \times D \times C_{fringe}$
 where $\rho$, $C_{area}$, and $C_{fringe}$ are resistivity, unit area capacitance, and unit fringe capacitance, respectively
 $C_{\text{clock-wiring}} = k \times C_{gate} + C_{wire}$
 Clock leakage power is due to clock buffers
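The wiring-load estimate can be sketched as a small Python function. This is an illustration under stated assumptions, not ORION code. It takes the total H-tree wire length as 24 × D, as in the resistance and capacitance expressions on this slide. It assumes the buffer-count expression has the usual square-root form. It also treats ρ as a resistance per square so that ρ × length / w is in ohms. The function and parameter names are hypothetical.

```python
import math

# Total wire length of the 5-level H-tree, in units of the chip dimension D,
# as used in the R_int and C_int expressions on the slide.
H_TREE_LENGTH_FACTOR = 24.0


def clock_wiring_capacitance(chip_dim, wire_width, rho, c_area, c_fringe,
                             c_w, r_d, c_gate):
    """Return (k, C_clock-wiring) for a 5-level H-tree.

    rho is treated as resistance per square (an assumption of this sketch).
    """
    length = H_TREE_LENGTH_FACTOR * chip_dim

    # Clock-tree wire resistance and capacitance.
    r_int = rho * length / wire_width
    c_int = length * wire_width * c_area + 2.0 * length * c_fringe

    # Number of buffer stages; square-root form assumed here.
    k = math.sqrt((0.4 * r_int * c_int) / (0.7 * r_d * c_gate))

    # Wire load plus the input capacitance of k minimum-size buffers.
    c_wire = length * c_w
    return k, k * c_gate + c_wire


# Placeholder values in SI units; NOT real technology data.
k, c_clock_wiring = clock_wiring_capacitance(
    chip_dim=10e-3, wire_width=1e-6, rho=0.1,
    c_area=4e-5, c_fringe=4e-11, c_w=2e-10, r_d=1e4, c_gate=1e-15)
print(f"buffer stages k = {k:.1f}, C_clock-wiring = {c_clock_wiring:.3e} F")
```

In practice k would be rounded to an integer number of stages; the sketch keeps the real value so the formula stays visible.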
14
Repeater and Wire Power Models
 Repeaters (buffers) are used in links and clock tree network
 Leakage power has two main components: (1) sub-threshold leakage, and
(2) gate-tunneling current
 Depending on design conditions, we compute the leakage power at different temperatures: (1) 25°C, (2) 80°C, and (3) 110°C
 Both components depend linearly on device size
$p_s = (p_{sn} + p_{sp}) / 2$
$p_{sn} = k_{0n} + k_{1n} \times w_n$
$p_{sp} = k_{0p} + k_{1p} \times w_p$
 Dynamic power can be calculated as:
$p_d = \alpha \times c_l \times v_{dd}^2 \times f$
$c_l = c_i + c_g + c_c$
 $p_d$, $\alpha$, $c_l$, $v_{dd}$, and $f$ are dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively
 Load capacitance is composed of the input capacitance of the next repeater ($c_i$), and the ground ($c_g$) and coupling ($c_c$) capacitances of the driven wire
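The repeater model on this slide is a pair of closed-form expressions, so a minimal Python sketch is enough to show how it would be evaluated. The function names and the numeric values are illustrative assumptions; the real coefficients k0 and k1 come from fitting to library or SPICE data, which is not reproduced here.

```python
def repeater_leakage_power(w_n, w_p, k0n, k1n, k0p, k1p):
    """Average of the NMOS and PMOS leakage terms, each linear in device width."""
    p_sn = k0n + k1n * w_n
    p_sp = k0p + k1p * w_p
    return (p_sn + p_sp) / 2.0


def repeater_dynamic_power(activity, c_in_next, c_ground, c_coupling, vdd, freq):
    """Dynamic power: activity * load capacitance * vdd^2 * f.

    The load is the input capacitance of the next repeater plus the ground
    and coupling capacitances of the wire segment being driven.
    """
    c_load = c_in_next + c_ground + c_coupling
    return activity * c_load * vdd ** 2 * freq


# Placeholder coefficients and capacitances; NOT fitted technology data.
leak = repeater_leakage_power(w_n=1.0, w_p=2.0,
                              k0n=0.1, k1n=0.5, k0p=0.2, k1p=0.25)
dyn = repeater_dynamic_power(activity=0.5, c_in_next=1e-15, c_ground=2e-15,
                             c_coupling=3e-15, vdd=1.0, freq=1e9)
print(f"leakage (arbitrary units): {leak}, dynamic power: {dyn:.2e} W")
```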
15
Interconnect Optimization: Buffering
 Conventional delay-optimal buffering → unrealistic buffer sizes → high dynamic / leakage power → suboptimal
[Figure: Pareto-optimal frontier of the power-delay tradeoff of a 5mm interconnect in 90nm / 65nm]
 Our approach: iterative optimization of hybrid
objective (power + delay)
 Search for optimal number and size of repeaters
 Can be extended for other interconnect optimizations (e.g.,
wire sizing and driver sizing)
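To make the hybrid objective concrete, the Python sketch below searches over the number and size of repeaters for a weighted sum of normalized delay and power. Everything beyond that idea is an assumption of this sketch rather than the method used in the talk: an exhaustive grid search stands in for the iterative optimization, the delay uses a textbook Elmore-style expression, and all technology values are placeholders. Only the 5mm wire length is taken from the slide.

```python
def repeated_wire_delay(n, size, length, r_w, c_w, r_d, c_g):
    """Elmore-style delay of a wire split into n segments, one repeater each."""
    seg = length / n
    return n * (0.69 * (r_d / size) * (c_w * seg + size * c_g)
                + r_w * seg * (0.38 * c_w * seg + 0.69 * size * c_g))


def repeated_wire_power(n, size, length, c_w, c_g, p_leak, activity, vdd, freq):
    """Dynamic power of wire and repeater inputs, plus repeater leakage."""
    dynamic = activity * (c_w * length + n * size * c_g) * vdd ** 2 * freq
    return dynamic + n * size * p_leak


def search_buffering(weight, length, r_w, c_w, r_d, c_g, p_leak,
                     activity, vdd, freq, max_n=40, max_size=200):
    """Return (delay, power, n, size) minimizing the hybrid objective.

    weight = 1.0 gives pure delay optimization; smaller weights favor power.
    Delay and power are normalized to the delay-optimal design.
    """
    designs = []
    for n in range(1, max_n + 1):
        for size in range(1, max_size + 1):
            d = repeated_wire_delay(n, size, length, r_w, c_w, r_d, c_g)
            p = repeated_wire_power(n, size, length, c_w, c_g, p_leak,
                                    activity, vdd, freq)
            designs.append((d, p, n, size))
    d_ref, p_ref, _, _ = min(designs)  # delay-optimal point as reference
    return min(designs,
               key=lambda x: weight * x[0] / d_ref + (1 - weight) * x[1] / p_ref)


# Placeholder per-unit-length and device values in SI units; NOT technology data.
tech = dict(length=5e-3, r_w=1e5, c_w=2e-10, r_d=1e4, c_g=1e-15,
            p_leak=1e-8, activity=0.15, vdd=1.0, freq=1e9)
delay_opt = search_buffering(1.0, **tech)
hybrid = search_buffering(0.5, **tech)
print("delay-optimal (delay, power, n, size):", delay_opt)
print("hybrid        (delay, power, n, size):", hybrid)
```

With any weight below 1, the selected design can never consume more power than the delay-optimal one, which is the behavior the slide argues for.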
16
Virtual Channel Allocator Model
 Provides three virtual channel (VC) allocation models
 Traditional two-stage VC allocator model
 Most widely used
 Power consumption increases rapidly as the number of VCs increases
[Figure: two-stage VC allocator. With 5 ports and 2 VCs per port, stage 1 has 40 2:1 arbiters in total and stage 2 has 10 8:1 arbiters in total. With 5 ports and 4 VCs per port, stage 1 has 80 4:1 arbiters in total and stage 2 has 20 16:1 arbiters in total.]
 Add One-stage VC allocator model
 Lower power consumption
 Lower matching probability
 Add VC selection model
 Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", ICCD’07
 Low power and high performance
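The arbiter counts in the two-stage allocator figure on this slide are consistent with a simple pattern: with P ports and V VCs per port, stage 1 uses P × V × (P − 1) arbiters of size V:1, and stage 2 uses P × V arbiters of size (P − 1) × V : 1. These expressions are inferred from the two configurations shown, not stated on the slide, so the Python sketch below should be read as an assumption that happens to match both examples.

```python
def two_stage_vc_allocator(ports, vcs):
    """Arbiter counts and sizes for a two-stage VC allocator.

    The expressions are inferred from the two examples in the talk
    (5 ports with 2 or 4 VCs per port); they are not quoted from it.
    Returns ((stage1_count, stage1_inputs), (stage2_count, stage2_inputs)).
    """
    stage1 = (ports * vcs * (ports - 1), vcs)
    stage2 = (ports * vcs, (ports - 1) * vcs)
    return stage1, stage2


for vcs in (2, 4):
    (n1, in1), (n2, in2) = two_stage_vc_allocator(ports=5, vcs=vcs)
    print(f"5 ports, {vcs} VCs: stage 1 = {n1} x {in1}:1 arbiters, "
          f"stage 2 = {n2} x {in2}:1 arbiters")
```

Both the number of arbiters and their size grow with the number of VCs, which is why power consumption rises quickly for this allocator.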
17
Outline
Motivation
ORION2.0 Framework
Dynamic Power Modeling
 Leakage Power Modeling
 Area Modeling
 Validation and Significance Assessment
 Conclusions
18
Leakage Power Modeling
 Leakage Power: Subthreshold and Gate
 From 65nm and beyond, gate leakage becomes significant
$I_{leak}(Block) = \sum_i \sum_s Prob(i,s) \times \left( W_{sub}(i,s) \times I'_{sub}(i,s) + W_{gate}(i,s) \times I'_{gate}(i,s) \right)$
 $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are subthreshold and gate leakage currents per unit transistor width for a specific technology
 $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
 Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
 Leakage currents are computed at different transistor junction temperatures: (1) 110°C, (2) 80°C, and (3) 25°C
 Same methodology as in ORION 1.0
 Leakage current values are all obtained through SPICE simulation using
foundry SPICE models
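The block leakage expression is a probability-weighted sum over components and input states. A minimal Python sketch of that sum follows; the data layout (`LeakageState`, a dictionary keyed by component name) and all numbers are hypothetical, chosen only to show the bookkeeping, not taken from any SPICE characterization.

```python
from dataclasses import dataclass


@dataclass
class LeakageState:
    """One input state s of a component i."""
    prob: float    # probability of the component being in this state
    w_sub: float   # effective width for subthreshold leakage
    i_sub: float   # subthreshold leakage current per unit width
    w_gate: float  # effective width for gate leakage
    i_gate: float  # gate leakage current per unit width


def block_leakage_current(components):
    """Sum Prob * (Wsub * I'sub + Wgate * I'gate) over components and states."""
    return sum(st.prob * (st.w_sub * st.i_sub + st.w_gate * st.i_gate)
               for states in components.values()
               for st in states)


# Hypothetical two-state inverter; values are placeholders in arbitrary units.
components = {
    "INVx1": [
        LeakageState(prob=0.5, w_sub=1.0, i_sub=2.0, w_gate=1.0, i_gate=0.5),
        LeakageState(prob=0.5, w_sub=2.0, i_sub=1.0, w_gate=2.0, i_gate=0.25),
    ],
}
print("block leakage current:", block_leakage_current(components))
```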
19
Arbiter Leakage Power Model
 Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
 Example: matrix arbiter
 with R requesters → one R×R matrix to keep the priorities
$gnt_n = req_n \times \prod_{i<n} \left(\overline{req_i} + \overline{m_{in}}\right) \times \prod_{i>n} \left(\overline{req_i} + m_{ni}\right)$
 grant logic can be implemented as a tree of NOR and INV gates, and the R×R matrix can be constructed using DFFs
$I_{leak}(Arbiter_{matrix}) = I_{leak}(NOR2) \times ((2R - 1)R) + I_{leak}(INV) \times R + I_{leak}(DFF) \times \frac{R(R - 1)}{2}$
$P_{leak}(Arbiter_{matrix}) = I_{leak}(Arbiter_{matrix}) \times V_{dd}$
 NOR2, INV, and DFF represent 2-input NOR gate, inverter gate, and DFF, respectively
 Further details on modeling methodology in Chen et al. 2003
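The matrix-arbiter leakage expression on this slide reduces to counting gates and multiplying by per-gate leakage. The Python sketch below uses the gate counts as they appear here ((2R − 1)R NOR2 gates, R inverters, R(R − 1)/2 DFFs). The function name and the per-gate currents are placeholders, not characterized values.

```python
def matrix_arbiter_leakage(requesters, i_nor2, i_inv, i_dff, vdd):
    """Return (leakage current, leakage power) of a matrix arbiter.

    Gate counts follow the expression on the slide:
    (2R - 1) * R NOR2 gates, R inverters, and R * (R - 1) / 2 DFFs.
    """
    r = requesters
    n_nor2 = (2 * r - 1) * r
    n_inv = r
    n_dff = r * (r - 1) // 2  # r * (r - 1) is always even
    i_leak = i_nor2 * n_nor2 + i_inv * n_inv + i_dff * n_dff
    return i_leak, i_leak * vdd


# Placeholder per-gate leakage currents (amperes); NOT characterized values.
i_leak, p_leak = matrix_arbiter_leakage(requesters=5, i_nor2=10e-9,
                                        i_inv=5e-9, i_dff=40e-9, vdd=1.2)
print(f"I_leak = {i_leak:.3e} A, P_leak = {p_leak:.3e} W")
```

For a 5-requester arbiter this counts 45 NOR2 gates, 5 inverters, and 10 DFFs.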
20
Outline
Motivation
ORION2.0 Framework
Dynamic Power Modeling
Leakage Power Modeling
 Area Modeling
 Validation and Significance Assessment
 Conclusions
21
Router Area Model
 As number of cores increases, the area occupied by communication
components becomes significant (19% of total tile area in the Intel 80-core
Teraflops Chip)
 Gate area model by Yoshida et al. (DAC’04)
 Link area model by Carloni et al. (ASPDAC’08)
Matrix arbiter:
$Area_{arbiter} = (Area_{NOR2x1} \times 2(R-1)R) + (Area_{DFF} \times R(R-1)/2) + (Area_{INVx1} \times R)$
22
Repeater and Wire Area Models
 For existing technologies, the area of a repeater can be
calculated as:
ar = τ0 + τ1 × (wn + wp)
 ar denotes repeater area; τ0 and τ1 are coefficients obtained by linear regression; wn and wp are the widths of the NMOS and PMOS transistors, respectively
 For future technologies, feature size (F), contacted pitch (CP),
row height (RH), and cell width (CW) can be used to estimate
the area:
NF = (wp + wn + 2 × F) / RH
CW = NF × (F + CP) + CP
ar = RH × CW
 Wiring area can be calculated as:
aw = (n × (ww + sw) + sw) × L
 aw denotes wire area, n is the bit width of the bus, and ww, sw, L are
wire width, spacing and wire length
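The three area expressions on this slide can be written as small Python helpers. This is a sketch with hypothetical function names and unit-free placeholder inputs; NF is not defined on the slide and is treated here simply as the intermediate quantity given by its formula.

```python
def repeater_area_existing(w_n, w_p, tau0, tau1):
    """Existing technologies: ar = tau0 + tau1 * (wn + wp), regression-based."""
    return tau0 + tau1 * (w_n + w_p)


def repeater_area_future(w_n, w_p, feature, contacted_pitch, row_height):
    """Future technologies: area from feature size, contacted pitch, row height."""
    nf = (w_p + w_n + 2 * feature) / row_height           # NF on the slide
    cell_width = nf * (feature + contacted_pitch) + contacted_pitch
    return row_height * cell_width


def wire_area(n_bits, wire_width, spacing, length):
    """aw = (n * (ww + sw) + sw) * L for an n-bit bus of length L."""
    return (n_bits * (wire_width + spacing) + spacing) * length


# Placeholder inputs in arbitrary consistent units; NOT technology data.
print(repeater_area_existing(w_n=1.0, w_p=2.0, tau0=0.5, tau1=0.25))
print(repeater_area_future(w_n=1.0, w_p=2.0, feature=0.5,
                           contacted_pitch=1.5, row_height=4.0))
print(wire_area(n_bits=4, wire_width=1.0, spacing=1.0, length=10.0))
```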
23
Outline
Motivation
ORION2.0 Framework
Dynamic Power Modeling
Leakage Power Modeling
Area Modeling
 Validation and Significance Assessment
 Conclusions
24
ORION2.0: Validations and Results
 Validation: Two Intel NoC Chips
 (1) Intel 80-core Teraflops: high-performance many-core design
 (2) Intel SCC: ultra low-power communication core
 ORION2.0 offers significant accuracy improvement
                       Intel 80-core        Intel SCC
                       v1.0      v2.0       v1.0      v2.0
%diff (total power)    -85.3     -6.5       +202.4    +11.0
%diff (total area)     -80.9     -23.6      +31.9     +25.3

%diff (ORION 2.0 vs. Intel 80-core)
Component    %diff
Buffer       -14.8
Crossbar     16.9
Arbiter      -9.0
Clock        -20.9
Link         8.8

[Figure: pie charts of router power breakdown]
Component    ORION 1.0    ORION 2.0    Intel 80-core
FIFO         28%          21%          23%
Crossbar     60%          21%          16%
Arbiter      12%          7%           7%
Clock        0%           30%          36%
Link         0%           21%          18%
25
Impact on System-Level Design
 Testcases
 VPROC: video processor with 42 cores and 128-bit datawidth
 dVOPD: dual video object plane decoder with 26 cores and 128-bit
datawidth
SoC      P (mW)          A (mm²)         # routers     max. # router ports   max. # hops
         v1.0    v2.0    v1.0    v2.0    v1.0   v2.0   v1.0   v2.0           v1.0   v2.0
VPROC    0.875   0.924   2.043   2.329   33     25     8      12             6      5
dVOPD    0.412   0.486   1.217   1.343   18     16     6      6              11     10
[Figure: synthesized NoC topologies built from routers R1, R2, …]
 System-level Impact: Communication-Driven Synthesis in
COSI-OCC
 Accurate ORION 2.0 models lead to better-performing NoC
 Relative power due to additional port not as high in ORION 2.0 vs. 1.0
26
Conclusions
 Accurate models can drive effective NoC design
space exploration
 ORION 1.0 is inaccurate for current and future
technology nodes
 Proposed accurate power and area models for
network routers (ORION 2.0)
 Presented a reproducible methodology for extracting
inputs to our models
 Maintained the ORION 1.0 interface while significantly improving the accuracy of the models → switching to ORION 2.0 is easy!
27
ORION 2.0 Release
 ORION 2.0 Website: http://www.princeton.edu/~peh/orion.html
28
System-Level NoC Power Modeling Example
Polaris Toolchain (design-space exploration tool)
Step 1: Trident (synthetic traffic generation)
Step 2: LUNA (high-level on-chip network analysis)
Step 3: ORION (power and area models)
Input: microarchitecture parameters
Outputs: power consumption, CMOS area, and performance (latency) projections for NoC designs
V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI’07
