Design Productivity Crisis

ORION2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration
Andrew B. Kahng¶, Bin Li‡, Li-Shiuan Peh‡, Kambiz Samadi¶
¶ University of California, San Diego
‡ Princeton University
April 21, 2009
1
Outline
• Motivation
• ORION2.0 Framework
• Dynamic Power Modeling
• Leakage Power Modeling
• Area Modeling
• Validation and Significance Assessment
• Conclusions
2
Motivation
• Many-core chips → NoCs are needed to interconnect the cores → power efficiency of NoCs is important
  • Performance used to be the primary concern
  • Now power efficiency is critical
  • 28% of total power in the Intel 80-core Teraflops chip is due to interconnection networks (routers + links)
• Need rapid power estimation to trade off alternative architectures
  • Rapid power-area tradeoffs at the architectural level
• Our goal: develop accurate models that are easily usable by system-level designers early in the design cycle
3
Related Work
• Real-chip power measurements (Isci et al. 03)
• RTL-level NoC power estimation (A. Banerjee et al. 07, and N. Banerjee et al. 04)
  • Simulation is slow
  • Requires detailed RTL modeling → not suitable for early-stage NoC design space exploration
• Architectural-level power estimation
  • Interconnection network (Patel et al. 97): the model is not instantiated with architectural parameters → not suitable for exploring tradeoffs in router microarchitecture
  • Uniprocessor power modeling (Wattch: Brooks et al. 00 and SimplePower: Ye et al. 00)
  • NoC power modeling (ORION 1.0: Wang et al. 02)
• ORION 1.0
  • Has been widely used
  • Used for early-stage design space exploration and NoC power-performance tradeoff analysis
4
ORION 1.0 Modeling Methodology
• Power models are derived for the major building blocks (FIFO, crossbar, and arbiter)
• For each component, a canonical structure is described in terms of architectural and technological parameters
• Detailed analysis is performed to determine parameterized capacitance equations
• Capacitance equations and switching activity estimation are combined to determine power consumption
• Power models are based on detailed estimates of gate and wire capacitance and switching activity
5
Limitations of ORION 1.0

Parameter    ORION 1.0   ORION 2.0   Value     Description
B            yes         yes         16        #buffers
F            yes         yes         39        flit-width
P            yes         yes         5         #ports
V            yes         yes         2         #virtual channels
X            yes         yes         5         #crossbar ports
tech         yes         yes         65nm      technology node
fclk         yes         yes         5.1 GHz   clock frequency
Vdd          yes         yes         1.2V      supply voltage
Npipeline    no          yes                   #pipeline stages
App          no          yes                   application domain
D            no          yes                   chip dimension

Power (mW)
Component    V1      Intel 80-core
Buffer       25.2    203.3
Crossbar     53.2    138.6
Arbiter      11.1    64.7
Link                 212.5
Clock                304.9
Total        89.5    924

Per-component power: up to 8.1X diff. Total power: 10.3X diff.
6
ORION 2.0: Accurate NoC Router Models
• Built on top of ORION 1.0
• Uses our automatic/semi-automatic flows to obtain technology inputs
• Provides significant accuracy improvement compared with ORION 1.0
Inputs to ORION 2.0:
• architectural parameters
  • # of ports; # of buffers
  • # of xbar ports; # of VC
  • voltage, frequency
• circuit implementation & buffering scheme
  • SRAM and register FIFO
  • MUX-tree and matrix crossbar
  • different arbitration schemes
  • hybrid buffering scheme
• technology parameters
  • interconnect parameters
  • device parameters
  • scaling factors for future technologies
  • …
[Figure: router block diagram with input links (inI, inE, inW, inN, inS), buffers (Buf I, Buf E, Buf W, Buf N, Buf S), crossbar, output links (outI, outE, outW, outN, outS), and an arbiter with request signals (reqI, reqE, reqW, reqN, reqS) and grant signals (grantI, grantE, grantW, grantN, grantS)]
8
ORION 2.0 Improvements
ORION 1.0:
• Power subcomponents
  • Buffer (SRAM-based)
  • Arbiter (dynamic power)
  • Crossbar
  • Links (dynamic power)
• Area (router)
ORION 2.0:
• Model infrastructure
  • Application-specific technology-level adjustment
  • Updated capacitance and transistor sizes
• Buffer
  • SRAM-based
  • Flip-flop-based
• Arbiter
  • VC allocator model
  • Leakage power
• Crossbar
• Links
  • Hybrid buffering
  • Leakage power
• Clock
• Area
  • More accurate router area model
  • Link area model
9
Model Technology Inputs
• Inputs for power calculation
  • Leakage current values (obtained from Liberty (.lib) / SPICE)
  • Input capacitance for different repeater sizes (Liberty, Predictive Technology Models (PTM))
• Inputs for area calculation
  • Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
  • Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
• We also provide data for (1) high-performance (HP) and (2) low-power (LOP) device types for 90nm and 65nm
• Scaling factors for 45nm and 32nm technologies were obtained from ITRS 2007 / MASTAR5.0
10
Dynamic Power Modeling
• Dynamic power: switching capacitance
• Clock power:
  • $P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$
  • $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
• Physical links: due to charging and discharging of the capacitive load
  • $P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$; $C_{load} = C_{ground} + C_{coupling} + C_{input}$
• Register-based FIFO: implemented as shift registers
• Virtual channel allocator: added two models
• Other components: we use ORION 1.0 models with updated transistor and technology parameters
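The switching-power expression used for the clock and the links can be sketched in a few lines of Python. This is only an illustration of the formula on this slide: the function names and the capacitance and activity values are hypothetical, while the 1.2 V supply and 5.1 GHz clock are the values listed earlier in the deck.

```python
def link_load_capacitance(c_ground: float, c_coupling: float, c_input: float) -> float:
    """C_load = C_ground + C_coupling + C_input, all in farads."""
    return c_ground + c_coupling + c_input


def dynamic_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """Switching power in watts: P = alpha * C * Vdd^2 * f."""
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical per-wire capacitances and activity factor (illustration only).
c_load = link_load_capacitance(c_ground=80e-15, c_coupling=100e-15, c_input=20e-15)
p_link = dynamic_power(alpha=0.5, c_load=c_load, vdd=1.2, freq=5.1e9)
print(f"C_load = {c_load * 1e15:.1f} fF, P_d = {p_link * 1e3:.4f} mW per wire")
```

The same `dynamic_power` helper applies to the clock network once the clock capacitance has been summed from its components.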
12
Clock Power (1)
• Clock power depends heavily on the distribution topology → we assume an H-tree topology
• $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
• Memory structures: the precharge circuitry is a capacitive load on the clock network
  • due to precharge transistor $T_c$: $C_{chg} = C_g(T_c) + C_d(T_c)$
  • $C_{\text{sram-fifo}} = (P_r + P_w) \times F \times B \times C_{chg}$
  • where $P_r$, $P_w$, $F$, $B$ are #read ports, #write ports, flit-width, and #buffers, respectively
• Pipeline registers: due to different stages in a router
  • assume a D-flip-flop (DFF) as the building block for pipeline registers
  • $C_{\text{pipeline-registers}} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is the DFF capacitance
• Register-based FIFO: due to the capacitance of the DFFs used in the registers
  • $C_{\text{register-fifo}} = F \times B \times C_{ff}$
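As a rough sketch, the three clocked-load terms on this slide can be evaluated as follows. The class name, the port counts, the pipeline depth, and the per-cell capacitances are assumptions made for illustration; F is taken as flit-width and B as #buffers, following the parameter list earlier in the deck (the two enter only as a product in the FIFO terms).

```python
from dataclasses import dataclass


@dataclass
class RouterClockLoad:
    read_ports: int       # P_r
    write_ports: int      # P_w
    flit_width: int       # F
    buffers: int          # B
    pipeline_stages: int  # N_pipeline

    def c_sram_fifo(self, c_chg: float) -> float:
        """C_sram-fifo = (P_r + P_w) * F * B * C_chg."""
        return (self.read_ports + self.write_ports) * self.flit_width * self.buffers * c_chg

    def c_pipeline_registers(self, c_ff: float) -> float:
        """C_pipeline-registers = N_pipeline * F * C_ff."""
        return self.pipeline_stages * self.flit_width * c_ff

    def c_register_fifo(self, c_ff: float) -> float:
        """C_register-fifo = F * B * C_ff."""
        return self.flit_width * self.buffers * c_ff


# F = 39 and B = 16 come from the parameter list; everything else is hypothetical.
router = RouterClockLoad(read_ports=1, write_ports=1, flit_width=39,
                         buffers=16, pipeline_stages=5)
C_CHG_FF = 1.0  # precharge load per cell, in fF (hypothetical)
C_FF_FF = 2.0   # DFF clock load, in fF (hypothetical)

sram = router.c_sram_fifo(C_CHG_FF)
pipe = router.c_pipeline_registers(C_FF_FF)
reg = router.c_register_fifo(C_FF_FF)
print(f"sram-fifo {sram} fF, pipeline-registers {pipe} fF, register-fifo {reg} fF")
```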
13
Clock Power (2)
• Wiring load: due to (1) wiring and (2) clock tree buffers
• Example: 5-level H-tree clock distribution:
  $C_{wire} \approx \left( 1 \cdot 8 \cdot \frac{D}{2} + 2 \cdot 4 \cdot \frac{D}{2} + 4 \cdot 2 \cdot \frac{D}{2} + 8 \cdot 1 \cdot \frac{D}{2} + 16 \cdot \frac{D}{2} \right) \times C_w = 24 \times D \times C_w$
  • where $D$ and $C_w$ are chip dimension and per-unit-length wire capacitance, respectively
  • the capacitive contribution of the clock buffers requires an estimate of the number of buffer stages, $k$:
  $k = \frac{0.4 \times R_{int} \times C_{int}}{0.7 \times R_d \times C_{gate}}$
  • where $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are the clock tree network wire resistance, wire capacitance, drive resistance, and input gate capacitance of a minimum-size inverter, respectively
  $R_{int} = \rho \times 24 \times \frac{D}{w}$
  $C_{int} = 24 \times D \times w \times C_{area} + 2 \times 24 \times D \times C_{fringe}$
  • where $\rho$, $C_{area}$, and $C_{fringe}$ are resistivity, unit area capacitance, and unit fringe capacitance, respectively
  • $C_{\text{clock-wiring}} = k C_{gate} + C_{wire}$
• Clock leakage power is due to clock buffers
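The clock-wiring expressions on this slide can be strung together as below. This is a sketch under stated assumptions: every numeric input is hypothetical, `rho` is treated as a sheet resistance so that rho × length / width comes out in ohms, and the wire capacitance uses the total H-tree length of 24 × D that appears in the R_int and C_int expressions. The buffer-stage ratio is coded exactly as transcribed; if the original slide wrapped that ratio in a square root, as Bakoglu-style repeater formulas do, `buffer_stages` would need the same change.

```python
HTREE_LENGTH_FACTOR = 24  # total wire length = 24 * D, per the R_int / C_int expressions


def wire_resistance(rho: float, d: float, w: float) -> float:
    """R_int = rho * 24 * D / w."""
    return rho * HTREE_LENGTH_FACTOR * d / w


def wire_capacitance(d: float, w: float, c_area: float, c_fringe: float) -> float:
    """C_int = 24*D*w*C_area + 2*24*D*C_fringe."""
    return HTREE_LENGTH_FACTOR * d * w * c_area + 2 * HTREE_LENGTH_FACTOR * d * c_fringe


def buffer_stages(r_int: float, c_int: float, r_d: float, c_gate: float) -> float:
    """k = (0.4 * R_int * C_int) / (0.7 * R_d * C_gate), as transcribed."""
    return (0.4 * r_int * c_int) / (0.7 * r_d * c_gate)


def clock_wiring_capacitance(k: float, c_gate: float, c_wire: float) -> float:
    """C_clock-wiring = k * C_gate + C_wire."""
    return k * c_gate + c_wire


# Hypothetical inputs in SI units (illustration only).
D, W = 1e-3, 1.2e-6             # chip dimension and wire width, metres
RHO = 0.01                      # ohm/square (assumed sheet resistance)
C_AREA, C_FRINGE = 3e-5, 4e-11  # F/m^2 and F/m
R_D, C_GATE = 1e4, 2e-15        # minimum-size inverter drive resistance and input cap
C_W = 2e-10                     # per-unit-length wire capacitance, F/m

r_int = wire_resistance(RHO, D, W)
c_int = wire_capacitance(D, W, C_AREA, C_FRINGE)
k = buffer_stages(r_int, c_int, R_D, C_GATE)
c_wire = HTREE_LENGTH_FACTOR * D * C_W
c_clock_wiring = clock_wiring_capacitance(k, C_GATE, c_wire)
print(f"R_int={r_int:.1f} ohm, C_int={c_int:.3e} F, k={k:.2f}, "
      f"C_clock-wiring={c_clock_wiring:.3e} F")
```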
14
Repeater and Wire Power Models
• Repeaters (buffers) are used in links and in the clock tree network
• Leakage power has two main components: (1) sub-threshold leakage and (2) gate-tunneling current
  • Depending on design conditions, we compute the leakage power at different temperature conditions: (1) 25°C, (2) 80°C, and (3) 110°C
  • Both components depend linearly on device size:
    $p_s = (p_{sn} + p_{sp}) / 2$
    $p_{sn} = k_{0n} + k_{1n} \times w_n$
    $p_{sp} = k_{0p} + k_{1p} \times w_p$
• Dynamic power can be calculated as:
    $p_d = a \times c_l \times v_{dd}^2 \times f$
    $c_l = c_i + c_g + c_c$
  • $p_d$, $a$, $c_l$, $v_{dd}$, and $f$ are dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively
  • Load capacitance is composed of the input capacitance of the next repeater ($c_i$) and the ground ($c_g$) and coupling ($c_c$) capacitances of the wire driven
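A minimal sketch of the repeater leakage and dynamic power expressions above. The regression coefficients, widths, and capacitances are placeholders; in the flow described in this deck they would come from Liberty or SPICE characterization at the chosen temperature.

```python
def repeater_leakage(k0n: float, k1n: float, wn: float,
                     k0p: float, k1p: float, wp: float) -> float:
    """p_s = (p_sn + p_sp) / 2, with each term linear in device width."""
    p_sn = k0n + k1n * wn
    p_sp = k0p + k1p * wp
    return (p_sn + p_sp) / 2


def repeater_dynamic(a: float, c_i: float, c_g: float, c_c: float,
                     vdd: float, f: float) -> float:
    """p_d = a * c_l * vdd^2 * f, with c_l = c_i + c_g + c_c."""
    c_l = c_i + c_g + c_c
    return a * c_l * vdd ** 2 * f


# Hypothetical coefficients (nW and nW/um) and widths (um).
p_s = repeater_leakage(k0n=1.0, k1n=2.0, wn=3.0, k0p=0.5, k1p=1.5, wp=6.0)

# Hypothetical capacitances (F), supply (V), and frequency (Hz).
p_d = repeater_dynamic(a=0.25, c_i=10e-15, c_g=50e-15, c_c=40e-15, vdd=1.0, f=1e9)
print(f"leakage = {p_s} nW, dynamic = {p_d * 1e6:.1f} uW")
```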
15
Interconnect Optimization: Buffering
• Conventional delay-optimal buffering → unrealistic buffer sizes → high dynamic / leakage power → suboptimal
• Our approach: iterative optimization of a hybrid objective (power + delay)
  • Search for the optimal number and size of repeaters
  • Can be extended to other interconnect optimizations (e.g., wire sizing and driver sizing)
[Figure: Pareto-optimal frontier of the power-delay tradeoff of a 5mm interconnect in 90nm / 65nm]
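To make the hybrid-objective idea concrete, here is a small sketch that searches over repeater count and size. It is not the authors' optimizer: the slide describes an iterative optimization, whereas this uses a plain exhaustive grid search, and the Elmore-style delay model, the leakage constant, and all electrical values are assumptions chosen only for illustration.

```python
from itertools import product

# Hypothetical line and device parameters (illustration only).
R_D, C_G = 10e3, 1e-15     # unit-size repeater drive resistance (ohm) and input cap (F)
R_W, C_W = 500.0, 1e-12    # total wire resistance (ohm) and capacitance (F)
VDD, FREQ, ALPHA = 1.2, 1e9, 0.15
P_LEAK_UNIT = 50e-9        # leakage of a unit-size repeater (W)


def delay(n: int, s: int) -> float:
    """Assumed Elmore-style delay of a line split into n segments, repeater size s."""
    r_seg, c_seg = R_W / n, C_W / n
    per_seg = 0.7 * (R_D / s) * (c_seg + s * C_G) + r_seg * (0.4 * c_seg + 0.7 * s * C_G)
    return n * per_seg


def power(n: int, s: int) -> float:
    """Dynamic power of wire plus repeater inputs, plus repeater leakage."""
    return ALPHA * (C_W + n * s * C_G) * VDD ** 2 * FREQ + n * s * P_LEAK_UNIT


def optimize(weight: float, counts=range(1, 21), sizes=range(1, 201)):
    """Return (n, s) minimizing weight*delay + (1-weight)*power, both normalized."""
    candidates = list(product(counts, sizes))
    d_min = min(delay(n, s) for n, s in candidates)
    p_min = min(power(n, s) for n, s in candidates)
    return min(candidates,
               key=lambda ns: weight * delay(*ns) / d_min
               + (1 - weight) * power(*ns) / p_min)


delay_opt = optimize(1.0)   # conventional delay-optimal buffering
hybrid = optimize(0.5)      # hybrid objective
print("delay-optimal (n, s):", delay_opt, " hybrid (n, s):", hybrid)
```

Sweeping `weight` between 0 and 1 traces out a power-delay tradeoff curve of the kind the slide refers to.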
16
Virtual Channel Allocator Model
• Provides three virtual channel (VC) allocation models
• Traditional two-stage VC allocator model
  • Most widely used
  • Power consumption increases rapidly as the number of VCs increases
  [Figure: two-stage VC allocator. With 5 ports and 2 VCs per port: Stage 1 has 40 arbiters in total (2:1 arbiters) and Stage 2 has 10 arbiters in total (8:1 arbiters). With 5 ports and 4 VCs per port: Stage 1 has 80 arbiters in total (4:1 arbiters) and Stage 2 has 20 arbiters in total (16:1 arbiters).]
• Added one-stage VC allocator model
  • Lower power consumption
  • Lower matching probability
• Added VC selection model
  • Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", ICCD07
  • Low power and high performance
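The arbiter counts in the two-stage allocator figure grow quickly with the number of VCs. The sketch below reproduces the two configurations shown; the closed-form expressions are inferred from those two data points (5 ports with 2 and 4 VCs) rather than stated on the slide, so treat them as an assumption.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    arbiter_inputs: int  # each arbiter is an (arbiter_inputs : 1) arbiter
    count: int           # number of such arbiters in the stage


def two_stage_vc_allocator(ports: int, vcs: int) -> tuple[Stage, Stage]:
    """Arbiter sizes and counts, inferred from the figure's two examples."""
    stage1 = Stage(arbiter_inputs=vcs, count=ports * vcs * (ports - 1))
    stage2 = Stage(arbiter_inputs=(ports - 1) * vcs, count=ports * vcs)
    return stage1, stage2


for v in (2, 4):
    s1, s2 = two_stage_vc_allocator(ports=5, vcs=v)
    print(f"5 ports, {v} VCs: stage 1 = {s1.count} x {s1.arbiter_inputs}:1, "
          f"stage 2 = {s2.count} x {s2.arbiter_inputs}:1")
```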
17
Leakage Power Modeling
• Leakage power: subthreshold and gate
  • From 65nm and beyond, gate leakage becomes significant
  $I_{leak}(Block) = \sum_i \sum_s Prob(i,s) \times \left( W_{sub}(i,s) \times I'_{sub}(i,s) + W_{gate}(i,s) \times I'_{gate}(i,s) \right)$
  • $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are the subthreshold and gate leakage currents per unit transistor width for a specific technology
  • $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
  • Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
  • Leakage currents are computed at different transistor junction temperatures: (1) 110°C, (2) 80°C, and (3) 25°C
• Same methodology as in ORION 1.0
  • Leakage current values are all obtained through SPICE simulation using foundry SPICE models
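The block leakage sum can be illustrated with a short sketch. The record type and every number below are hypothetical; real values would come from the SPICE characterization mentioned above. The double sum over components i and input states s is flattened into one list of (i, s) entries.

```python
from typing import Iterable, NamedTuple


class LeakageState(NamedTuple):
    prob: float    # Prob(i, s): probability of component i being in input state s
    w_sub: float   # W_sub(i, s): effective width for subthreshold leakage
    i_sub: float   # I'_sub(i, s): subthreshold leakage per unit width
    w_gate: float  # W_gate(i, s): effective width for gate leakage
    i_gate: float  # I'_gate(i, s): gate leakage per unit width


def block_leakage_current(states: Iterable[LeakageState]) -> float:
    """I_leak(Block) = sum over (i, s) of Prob * (W_sub*I'_sub + W_gate*I'_gate)."""
    return sum(s.prob * (s.w_sub * s.i_sub + s.w_gate * s.i_gate) for s in states)


# A single inverter with two equally likely input states (made-up values, nA and um).
inverter_states = [
    LeakageState(prob=0.5, w_sub=1.0, i_sub=10.0, w_gate=2.0, i_gate=1.0),
    LeakageState(prob=0.5, w_sub=2.0, i_sub=20.0, w_gate=1.0, i_gate=1.0),
]
i_leak = block_leakage_current(inverter_states)
print(f"I_leak = {i_leak} nA")
```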
19
Arbiter Leakage Power Model
• Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
• Example: matrix arbiter
  • with R requesters → one R×R matrix to keep the priorities
  $gnt_n = req_n \times \prod_{i<n} \left( \overline{req_i} + \overline{m_{in}} \right) \times \prod_{i>n} \left( \overline{req_i} + m_{ni} \right)$
  • the grant logic can be implemented as a tree of NOR and INV gates, and the R×R matrix can be constructed using DFFs
  $I_{leak}(Arbiter_{matrix}) = I_{leak}(NOR2) \times ((2R - 1)R) + I_{leak}(INV) \times R + I_{leak}(DFF) \times \frac{R(R - 1)}{2}$
  $P_{leak}(Arbiter_{matrix}) = I_{leak}(Arbiter_{matrix}) \times V_{dd}$
  • NOR2, INV, and DFF represent a 2-input NOR gate, an inverter gate, and a DFF, respectively
• Further details on the modeling methodology in Chen et al. 2003
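For a concrete feel of the matrix-arbiter leakage expression, the sketch below evaluates it for a 5-requester arbiter. The per-gate leakage currents are made-up numbers; only the gate counts follow the formula on this slide.

```python
def matrix_arbiter_leakage_current(r: int, i_nor2: float, i_inv: float, i_dff: float) -> float:
    """I_leak = I(NOR2)*((2R-1)*R) + I(INV)*R + I(DFF)*R*(R-1)/2."""
    return i_nor2 * ((2 * r - 1) * r) + i_inv * r + i_dff * (r * (r - 1) / 2)


def matrix_arbiter_leakage_power(r: int, i_nor2: float, i_inv: float,
                                 i_dff: float, vdd: float) -> float:
    """P_leak = I_leak * Vdd."""
    return matrix_arbiter_leakage_current(r, i_nor2, i_inv, i_dff) * vdd


# Hypothetical per-gate leakage currents in nA; Vdd in volts.
i_arb = matrix_arbiter_leakage_current(r=5, i_nor2=2.0, i_inv=1.0, i_dff=10.0)
p_arb = matrix_arbiter_leakage_power(r=5, i_nor2=2.0, i_inv=1.0, i_dff=10.0, vdd=1.2)
print(f"I_leak = {i_arb} nA, P_leak = {p_arb:.1f} nW")
```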
20
Router Area Model
• As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
• Gate area model by Yoshida et al. (DAC'04)
• Link area model by Carloni et al. (ASPDAC'08)
• Example: matrix arbiter
  $Area_{arbiter} = (Area_{NOR2x1} \times 2(R-1)R) + (Area_{DFF} \times (R(R-1)/2)) + (Area_{INVx1} \times R)$
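The matrix arbiter area expression can be evaluated the same way as the power models. In this sketch the cell areas are hypothetical; in practice they would be read from a Liberty library.

```python
def matrix_arbiter_area(r: int, area_nor2: float, area_dff: float, area_inv: float) -> float:
    """Area = A(NOR2x1)*2*(R-1)*R + A(DFF)*R*(R-1)/2 + A(INVx1)*R."""
    return (area_nor2 * 2 * (r - 1) * r
            + area_dff * (r * (r - 1) / 2)
            + area_inv * r)


# Hypothetical cell areas in um^2 for a 5-requester arbiter.
area = matrix_arbiter_area(r=5, area_nor2=1.5, area_dff=6.0, area_inv=1.0)
print(f"matrix arbiter area = {area} um^2")
```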
22
Repeater and Wire Area Models
• For existing technologies, the area of a repeater can be calculated as:
  $a_r = \tau_0 + \tau_1 \times (w_n + w_p)$
  • $a_r$ denotes repeater area; $\tau_0$ and $\tau_1$ are coefficients obtained by linear regression; $w_n$ and $w_p$ are the widths of the NMOS and PMOS devices, respectively
• For future technologies, feature size (F), contacted pitch (CP), row height (RH), and cell width (CW) can be used to estimate the area:
  $N_F = (w_p + w_n + 2 \times F) / RH$
  $CW = N_F \times (F + CP) + CP$
  $a_r = RH \times CW$
• Wiring area can be calculated as:
  $a_w = (n \times (w_w + s_w) + s_w) \times L$
  • $a_w$ denotes wire area, $n$ is the bit width of the bus, and $w_w$, $s_w$, and $L$ are wire width, spacing, and wire length, respectively
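The three area expressions on this slide translate directly into code. The sketch below uses made-up coefficients and dimensions in arbitrary but consistent length units; the function names are illustrative only.

```python
def repeater_area_existing(tau0: float, tau1: float, wn: float, wp: float) -> float:
    """a_r = tau0 + tau1 * (wn + wp), with regression coefficients tau0 and tau1."""
    return tau0 + tau1 * (wn + wp)


def repeater_area_future(wn: float, wp: float, f: float, cp: float, rh: float) -> float:
    """N_F = (wp + wn + 2F)/RH; CW = N_F*(F + CP) + CP; a_r = RH * CW."""
    n_f = (wp + wn + 2 * f) / rh
    cw = n_f * (f + cp) + cp
    return rh * cw


def wire_area(n: int, ww: float, sw: float, length: float) -> float:
    """a_w = (n*(ww + sw) + sw) * L for an n-bit bus."""
    return (n * (ww + sw) + sw) * length


# Hypothetical inputs (illustration only).
a_existing = repeater_area_existing(tau0=0.5, tau1=0.2, wn=1.0, wp=2.0)
a_future = repeater_area_future(wn=1.0, wp=2.0, f=0.5, cp=1.5, rh=2.0)
a_bus = wire_area(n=4, ww=0.25, sw=0.25, length=100.0)
print(a_existing, a_future, a_bus)
```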
23
ORION2.0: Validations and Results
• Validation: two Intel NoC chips
  • (1) Intel 80-core Teraflops: high-performance many-core design
  • (2) Intel SCC: ultra low-power communication core
• ORION2.0 offers significant accuracy improvement

                       Intel 80-core       Intel SCC
                       v1.0     v2.0       v1.0      v2.0
%diff (total power)    -85.3    -6.5       +202.4    +11.0
%diff (total area)     -80.9    -23.6      +31.9     +25.3

%diff (ORION 2.0 vs. Intel 80-core)
Component    Intel 80-core
Buffer       -14.8
Crossbar     16.9
Arbiter      -9.0
Clock        -20.9
Link         8.8

[Figure: pie charts of the power breakdown]
Component    ORION 1.0   ORION 2.0   Intel 80-core
FIFO         28%         21%         23%
Crossbar     60%         21%         16%
Arbiter      12%         7%          7%
Clock        0%          30%         36%
Link         0%          21%         18%
25
Impact on System-Level Design
• Testcases
  • VPROC: video processor with 42 cores and 128-bit datawidth
  • dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth
• System-level impact: communication-driven synthesis in COSI-OCC
  • Accurate ORION 2.0 models lead to a better-performing NoC
  • Relative power due to an additional port is not as high in ORION 2.0 as in ORION 1.0

SoC      P (mW)          A (mm²)         # routers     max. # router ports   max. # hops
         v1.0    v2.0    v1.0    v2.0    v1.0   v2.0   v1.0   v2.0           v1.0   v2.0
VPROC    0.875   0.924   2.043   2.329   33     25     8      12             6      5
dVOPD    0.412   0.486   1.217   1.343   18     16     6      6              11     10

[Figure: network diagrams with routers labeled R1 and R2]
26
Conclusions
• Accurate models can drive effective NoC design space exploration
• ORION 1.0 is inaccurate for current and future technology nodes
• Proposed accurate power and area models for network routers (ORION 2.0)
• Presented a reproducible methodology for extracting inputs to our models
• Maintained the ORION 1.0 interface while significantly improving the accuracy of the models → switching to ORION 2.0 is easy!
27
ORION 2.0 Release
• ORION 2.0 Website: http://www.princeton.edu/~peh/orion.html
28
System-Level NoC Power Modeling Example
Polaris Toolchain: design-space exploration tool
• Step 1: Trident, synthetic traffic generation
• Step 2: LUNA, high-level on-chip network analysis
• Step 3: ORION, power and area models
• Input: microarchitecture parameters
• Outputs: power consumption, CMOS area, performance (latency) → NoC design projections
V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI'07