Design Productivity Crisis
ORION2.0: A Fast and Accurate NoC
Power and Area Model for Early-Stage
Design Space Exploration
Andrew B. Kahng¶
Bin Li‡
Li-Shiuan Peh‡
Kambiz Samadi¶
¶ University of California, San Diego
‡ Princeton University
April 21, 2009
Outline
Motivation
ORION2.0 Framework
Dynamic Power Modeling
Leakage Power Modeling
Area Modeling
Validation and Significance Assessment
Conclusions
Motivation
Many-core chips → NoCs needed to interconnect many-core chips → power efficiency of NoCs is important
Performance was the primary concern
Now power efficiency is critical
28% of total power in the Intel 80-core Teraflops chip is due to
interconnection networks (routers + links)
Need rapid power estimation to trade off alternative
architectures
Rapid power-area tradeoffs at the architectural level
Our goal: develop accurate models that system-level designers
can easily use early in the design cycle
Related Work
Real-chip power measurements (Isci et al. 03)
RTL-level NoC power estimations (A. Banerjee et al. 07, and N.
Banerjee et al. 04)
Simulation is slow
Requires detailed RTL modeling → not suitable for early-stage NoC
design space exploration
Architectural-level power estimation
Interconnection network (Patel et al. 97): model is not instantiated with
architectural parameters → not suitable for exploring tradeoffs in router
microarchitecture
Uniprocessor power modeling (Wattch: Brooks et al. 00 and
SimplePower: Ye et al. 00)
NoC power modeling (ORION 1.0: Wang et al. 02)
ORION 1.0 has been widely used for early-stage design space exploration
in NoC power-performance tradeoff analysis
ORION 1.0 Modeling Methodology
Power models derived for major building blocks
(FIFO, Crossbar, and arbiter)
For each component, a canonical structure is
described in terms of architectural and technological
parameters
Detailed analysis is performed to determine
parameterized capacitance equations
Capacitance equations and switch activity estimation
are combined to determine power consumption
Power models are based on detailed estimates of
gate and wire capacitance and switching activity
Limitations of ORION 1.0
Parameters

Parameter   ORION 1.0   ORION 2.0   Value     Description
B           yes         yes         16        #buffers
F           yes         yes         39        flit-width
P           yes         yes         5         #ports
V           yes         yes         2         #virtual channels
X           yes         yes         5         #crossbar ports
tech        yes         yes         65nm      technology node
fclk        yes         yes         5.1 GHz   clock frequency
Vdd         yes         yes         1.2V      supply voltage
Npipeline   no          yes                   #pipeline stages
App         no          yes                   application domain
D           no          yes                   chip dimension

Power (mW)

Component   ORION 1.0 (V1)   Intel 80-core
Buffer      25.2             203.3
Crossbar    53.2             138.6
Arbiter     11.1             64.7
Link        -                212.5
Clock       -                304.9
Total       89.5             924

Per component: up to 8.1X diff.
Total: 10.3X diff.
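The two "diff." factors on this slide can be reproduced from the power figures it lists. The short sketch below only re-does that arithmetic; the dictionary names are illustrative, and Link and Clock are omitted from the ORION 1.0 side because the slide gives no ORION 1.0 estimate for them.

```python
# Power figures (mW) as listed on this slide.
orion1_mw = {"Buffer": 25.2, "Crossbar": 53.2, "Arbiter": 11.1}
intel_mw = {"Buffer": 203.3, "Crossbar": 138.6, "Arbiter": 64.7,
            "Link": 212.5, "Clock": 304.9}

orion1_total = sum(orion1_mw.values())
intel_total = sum(intel_mw.values())

# Ratio of the Intel 80-core figure to the ORION 1.0 estimate.
ratios = {name: intel_mw[name] / est for name, est in orion1_mw.items()}
total_ratio = intel_total / orion1_total

print(f"Total: {total_ratio:.1f}X")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}X")
```

The largest per-component ratio comes from the buffer, which is where the "up to 8.1X" figure originates.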
ORION 2.0: Accurate NoC Router Models
Inputs to ORION 2.0:
Architectural parameters
• # of ports; # of buffers
• # of xbar ports; # of VC
• voltage, frequency
Circuit implementation & buffering scheme
• SRAM and register FIFO
• MUX-tree and Matrix crossbar
• different arbitration schemes
• hybrid buffering scheme
Technology parameters
• interconnect parameters
• device parameters
• scaling factors for future technologies
• …
Built on top of ORION 1.0
Uses our automatic/semi-automatic flows to obtain technology inputs
Provides significant accuracy improvement compared with ORION 1.0
[Figure: router block diagram with input buffers (Buf I, E, W, N, S), a crossbar, an arbiter with request signals (reqI, reqE, reqW, reqN, reqS) and grant signals (grantI, grantE, grantW, grantN, grantS), and links on inputs inI, inE, inW, inN, inS and outputs outI, outE, outW, outN, outS]
ORION 2.0 Improvements
ORION 1.0
Power subcomponents
• Buffer (SRAM-based)
• Arbiter (dynamic power)
• Crossbar
• Links (dynamic power)
Area (router)

ORION 2.0
Model infrastructure
• Application-specific technology-level adjustment
• Updated capacitance and transistor sizes
Buffer
• SRAM-based
• Flip-flop-based
Arbiter
• VC allocator model
• Leakage power
Crossbar, Links
• Hybrid buffering
• Leakage power
Clock
Area
• More accurate router area model
• Link area model
Model Technology Inputs
Inputs for power calculation
Leakage current values (obtained from Liberty (.lib) / SPICE)
Input capacitance for different repeater size (Liberty, Predictive
Technology Models (PTM))
Inputs for area calculation
Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
We also provide data for (1) high-performance (HP), and (2)
low-power (LOP) device types for 90nm and 65nm
Scaling factors for 45nm and 32nm technologies were
obtained from ITRS 2007 / MASTAR5.0
Dynamic Power Modeling
Dynamic Power: Switching Capacitance
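Clock power:
$P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$ ($\alpha$: activity factor)
$C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
Physical links: power is due to charging and discharging of the capacitive
load
$P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$; $C_{load} = C_{ground} + C_{coupling} + C_{input}$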
Register-based FIFO: implemented as shift registers
Virtual channel allocator: added two models
Other components: we use ORION 1.0 models with updated
transistor and technology parameters
Clock Power (1)
Clock power heavily depends on its distribution topology
we assume an H-tree topology
$C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
Memory structures: the precharge circuitry places a capacitive load on the
clock network, due to precharge transistor $T_c$
$C_{chg} = C_g(T_c) + C_d(T_c)$
$C_{\text{sram-fifo}} = (P_r + P_w) \times F \times B \times C_{chg}$
where $P_r$, $P_w$, $B$, $F$ are #read ports, #write ports, #buffers, and flit-width,
respectively
Pipeline registers: due to different stages in a router
assume D-flip-flop (DFF) as the building block for pipeline registers
$C_{\text{pipeline-registers}} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is DFF capacitance
Register-based FIFO: due to DFF capacitance used in
registers
$C_{\text{register-fifo}} = F \times B \times C_{ff}$
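As a minimal sketch of the three storage-related clock loads on this slide, the functions below evaluate the formulas directly. The function and argument names are illustrative, the capacitance values and the pipeline depth of 5 are hypothetical placeholders rather than ORION 2.0 technology data, and the 16 buffers and 39-bit flits are taken from the Intel 80-core parameters listed earlier in the deck.

```python
def c_sram_fifo(read_ports, write_ports, flit_width, num_buffers, c_g_tc, c_d_tc):
    """Clock load of an SRAM FIFO: (Pr + Pw) * F * B * Cchg."""
    c_chg = c_g_tc + c_d_tc  # gate + drain capacitance of precharge transistor Tc
    return (read_ports + write_ports) * flit_width * num_buffers * c_chg


def c_pipeline_registers(n_pipeline, flit_width, c_ff):
    """Clock load of the pipeline registers: Npipeline * F * Cff."""
    return n_pipeline * flit_width * c_ff


def c_register_fifo(flit_width, num_buffers, c_ff):
    """Clock load of a register-based FIFO: F * B * Cff."""
    return flit_width * num_buffers * c_ff


# Hypothetical capacitances in farads (placeholders, not foundry data).
C_G_TC, C_D_TC, C_FF = 1.0e-15, 1.0e-15, 3.0e-15

sram = c_sram_fifo(1, 1, flit_width=39, num_buffers=16,
                   c_g_tc=C_G_TC, c_d_tc=C_D_TC)
pipe = c_pipeline_registers(n_pipeline=5, flit_width=39, c_ff=C_FF)
regs = c_register_fifo(flit_width=39, num_buffers=16, c_ff=C_FF)

print(f"SRAM FIFO: {sram:.3e} F, pipeline: {pipe:.3e} F, register FIFO: {regs:.3e} F")
```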
Clock Power (2)
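Wiring load: due to (1) wiring and (2) clock tree buffers
Example: 5-level H-tree clock distribution:
$C_{wire} = \left(16 \cdot \frac{D}{2} + 1 \cdot 8 \cdot \frac{D}{2} + 2 \cdot 4 \cdot \frac{D}{2} + 4 \cdot 2 \cdot \frac{D}{2} + 8 \cdot 1 \cdot \frac{D}{2}\right) \times C_w = 24 \times D \times C_w$
where $D$, $C_w$ are chip dimension and per-unit-length wire capacitance,
respectively
The capacitive contribution of clock buffers requires an estimate of the number of
buffer stages, $k$:
$k = \sqrt{\frac{0.4 \times R_{int} \times C_{int}}{0.7 \times R_d \times C_{gate}}}$
where $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are clock tree network wire resistance, wire
capacitance, drive resistance, and input gate capacitance of a minimum size
inverter, respectively
$R_{int} = \rho \times \frac{24 \times D}{w}$
$C_{int} = 24 \times D \times w \times C_{area} + 2 \times 24 \times D \times C_{fringe}$
where $\rho$, $C_{area}$, and $C_{fringe}$ are resistivity, unit area, and unit fringe capacitances,
respectively
$C_{\text{clock-wiring}} = k \times C_{gate} + C_{wire}$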
Clock leakage power is due to clock buffers
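The clock-wiring load can be sketched as follows. This is an illustration under stated assumptions: the five H-tree terms are read as summing to a total wire length of 24 × D (consistent with the factor 24 in the expressions for Rint and Cint), `width` is taken to be the wire width w, the buffer-stage count k is assumed to use the square-root (Bakoglu-style) form, and all function and argument names are hypothetical.

```python
import math


def h_tree_wire_length(chip_dim):
    """Total wire length of the 5-level H-tree example, in units of chip_dim."""
    # 16*D/2 + 1*8*D/2 + 2*4*D/2 + 4*2*D/2 + 8*1*D/2 = 24*D
    multipliers = [16, 1 * 8, 2 * 4, 4 * 2, 8 * 1]
    return sum(m * chip_dim / 2 for m in multipliers)


def clock_wiring_capacitance(chip_dim, c_w, rho, width, c_area, c_fringe,
                             r_d, c_gate):
    """Return (C_clock_wiring, k) for the H-tree example."""
    length = h_tree_wire_length(chip_dim)
    c_wire = length * c_w
    r_int = rho * length / width
    c_int = length * width * c_area + 2 * length * c_fringe
    # Assumed square-root form for the number of buffer stages.
    k = math.sqrt((0.4 * r_int * c_int) / (0.7 * r_d * c_gate))
    return k * c_gate + c_wire, k


# Unit-valued inputs make the arithmetic easy to follow by hand.
c_total, k = clock_wiring_capacitance(chip_dim=1.0, c_w=1.0, rho=1.0, width=1.0,
                                      c_area=1.0, c_fringe=1.0, r_d=1.0, c_gate=1.0)
print(f"wire length = {h_tree_wire_length(1.0)} D, k = {k:.2f}, C = {c_total:.2f}")
```

The sketch leaves k as a real number; whether to round it to a whole number of buffer stages is a modeling choice the slide does not specify.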
Repeater and Wire Power Models
Repeaters (buffers) are used in links and clock tree network
Leakage power has two main components: (1) sub-threshold leakage, and
(2) gate-tunneling current
Depending on design conditions, leakage power is computed at one of three
temperature conditions: (1) 25°C, (2) 80°C, and (3) 110°C
Both components depend linearly on device size
$p_s = (p_{sn} + p_{sp}) / 2$
$p_{sn} = k_{0n} + k_{1n} \times w_n$
$p_{sp} = k_{0p} + k_{1p} \times w_p$
Dynamic power can be calculated as:
$p_d = a \times c_l \times v_{dd}^2 \times f$
$c_l = c_i + c_g + c_c$
pd, a, cl, vdd and f are dynamic power, activity factor, load capacitance, supply
voltage and frequency, respectively
Load capacitance is composed of the input capacitance of the next repeater (ci),
ground (cg) and coupling (cc) capacitances of the wire driven
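A minimal sketch of the repeater power model on this slide follows. The coefficient values k0n, k1n, k0p, k1p and the capacitances are hypothetical placeholders (in practice they would come from the Liberty/SPICE characterization described earlier); only the 1.2 V supply and 5.1 GHz clock are taken from the Intel 80-core parameters in the deck.

```python
def repeater_leakage_power(w_n, w_p, k0n, k1n, k0p, k1p):
    """Average of the NMOS and PMOS leakage terms, each linear in device width."""
    p_sn = k0n + k1n * w_n
    p_sp = k0p + k1p * w_p
    return (p_sn + p_sp) / 2


def repeater_dynamic_power(activity, c_i, c_g, c_c, vdd, freq):
    """Dynamic power: a * cl * vdd^2 * f, with cl = ci + cg + cc."""
    c_l = c_i + c_g + c_c
    return activity * c_l * vdd ** 2 * freq


# Hypothetical regression coefficients and widths (arbitrary consistent units).
p_leak = repeater_leakage_power(w_n=1.0, w_p=2.0,
                                k0n=0.1, k1n=0.5, k0p=0.2, k1p=0.25)

# Hypothetical capacitances (farads) for one repeater stage and its wire.
p_dyn = repeater_dynamic_power(activity=0.5, c_i=50e-15, c_g=100e-15,
                               c_c=150e-15, vdd=1.2, freq=5.1e9)

print(f"leakage = {p_leak:.3f} (arbitrary units), dynamic = {p_dyn * 1e3:.4f} mW")
```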
Interconnect Optimization: Buffering
Conventional delay-optimal buffering → unrealistic buffer
sizes → high dynamic / leakage power → suboptimal
[Figure: Pareto-optimal frontier of the power-delay tradeoff of a
5mm interconnect in 90nm / 65nm]
Our approach: iterative optimization of hybrid
objective (power + delay)
Search for optimal number and size of repeaters
Can be extended for other interconnect optimizations (e.g.,
wire sizing and driver sizing)
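The idea of trading delay against power when choosing the number and size of repeaters can be illustrated with a small search. This is a sketch and not the authors' optimizer: it swaps the iterative optimization for an exhaustive grid search, assumes a simple Elmore-style delay model, and uses hypothetical technology constants throughout. The weight parameter of the hybrid objective is likewise an assumption.

```python
from itertools import product

# Hypothetical technology constants (illustrative only, not ORION 2.0 data).
R_WIRE = 2.0e5     # wire resistance per metre (ohm/m)
C_WIRE = 2.0e-10   # wire capacitance per metre (F/m)
R_DRV = 1.0e4      # drive resistance of a minimum-size repeater (ohm)
C_GATE = 1.0e-15   # input capacitance of a minimum-size repeater (F)
P_LEAK = 1.0e-8    # leakage power of a minimum-size repeater (W)
VDD, FREQ, ACTIVITY = 1.2, 1.0e9, 0.5


def delay(length, n, size):
    """Elmore-style delay of a wire split into n equal repeated segments."""
    seg = length / n
    r_seg, c_seg = R_WIRE * seg, C_WIRE * seg
    per_stage = (0.7 * (R_DRV / size) * (c_seg + size * C_GATE)
                 + r_seg * (0.4 * c_seg + 0.7 * size * C_GATE))
    return n * per_stage


def power(length, n, size):
    """Dynamic plus leakage power of the buffered wire."""
    c_total = C_WIRE * length + n * size * C_GATE
    return ACTIVITY * c_total * VDD ** 2 * FREQ + n * size * P_LEAK


def optimize(length, weight, counts=range(1, 21), sizes=range(1, 201)):
    """Grid search for (n, size) minimizing weight*power + (1-weight)*delay.

    Both terms are normalized to the delay-optimal solution.
    """
    candidates = list(product(counts, sizes))
    n0, s0 = min(candidates, key=lambda ns: delay(length, *ns))
    d_ref, p_ref = delay(length, n0, s0), power(length, n0, s0)

    def cost(ns):
        return (weight * power(length, *ns) / p_ref
                + (1 - weight) * delay(length, *ns) / d_ref)

    return min(candidates, key=cost)


LENGTH = 5e-3  # the 5mm interconnect mentioned on the slide
delay_opt = optimize(LENGTH, weight=0.0)
hybrid = optimize(LENGTH, weight=0.5)
print("delay-optimal (n, size):", delay_opt, " hybrid (n, size):", hybrid)
```

By construction, the hybrid solution never uses more power than the delay-optimal one, at the cost of equal or longer delay.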
Virtual Channel Allocator Model
Provides three virtual channel (VC) allocation models
Traditional two-stage VC allocator model
Most widely used
Power consumption increases rapidly as the number of VCs increases
[Figure: two-stage VC allocator
5 ports, 2 VCs per port: Stage 1 uses 2:1 arbiters (totally 40 arbiters); Stage 2 uses 8:1 arbiters (totally 10 arbiters)
5 ports, 4 VCs per port: Stage 1 uses 4:1 arbiters (totally 80 arbiters); Stage 2 uses 16:1 arbiters (totally 20 arbiters)]
Add One-stage VC allocator model
Lower power consumption
Lower matching probability
Add VC selection model
Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC
Router with a Novel Switch Allocator in 65nm CMOS", ICCD07
Low power and high performance
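The arbiter counts quoted for the traditional two-stage allocator follow a simple pattern. The closed form below is inferred from the two configurations shown on the slide (5 ports with 2 and 4 VCs per port) and is not stated in the deck, so treat it as an assumption; the function name is hypothetical.

```python
def two_stage_vc_allocator(ports, vcs):
    """Arbiter counts and sizes for a two-stage VC allocator.

    Inferred pattern: each of the ports*vcs input VCs has one vcs:1 arbiter
    per other port (stage 1), and each of the ports*vcs output VCs has one
    ((ports-1)*vcs):1 arbiter (stage 2).
    """
    return {
        "stage1_count": ports * vcs * (ports - 1),
        "stage1_inputs": vcs,
        "stage2_count": ports * vcs,
        "stage2_inputs": (ports - 1) * vcs,
    }


for vcs in (2, 4):
    cfg = two_stage_vc_allocator(ports=5, vcs=vcs)
    print(f"5 ports, {vcs} VCs: "
          f"{cfg['stage1_count']} x {cfg['stage1_inputs']}:1 arbiters (stage 1), "
          f"{cfg['stage2_count']} x {cfg['stage2_inputs']}:1 arbiters (stage 2)")
```

Doubling the VC count doubles both the number of arbiters and their width, which is one way to see why power grows rapidly with the number of VCs.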
Leakage Power Modeling
Leakage Power: Subthreshold and Gate
From 65nm and beyond gate leakage becomes significant
I’sub(i,s) and I’gate(i,s) are subthreshold and gate leakage currents per unit
transistor width for a specific technology
Wsub(i,s) and Wgate(i,s) are the effective widths of component i at input
state s for subthreshold and gate leakage, respectively
Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
Leakage currents are computed at different transistor junction
temperatures: (1) 110°C, (2) 80°C, and (3) 25°C
$I_{leak}(Block) = \sum_i \sum_s Prob(i,s) \times \left( W_{sub}(i,s) \times I'_{sub}(i,s) + W_{gate}(i,s) \times I'_{gate}(i,s) \right)$
Same methodology as in ORION 1.0
Leakage current values are all obtained through SPICE simulation using
foundry SPICE models
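The block leakage sum can be sketched as a double loop over components and input states. The data layout (a dictionary of per-state records) and every numeric value below are hypothetical; real values would come from the SPICE characterization mentioned on this slide.

```python
def block_leakage_current(components):
    """Sum over components i and input states s of
    Prob(i,s) * (Wsub(i,s) * I'sub(i,s) + Wgate(i,s) * I'gate(i,s))."""
    total = 0.0
    for states in components.values():
        for s in states:
            total += s["prob"] * (s["w_sub"] * s["i_sub"]
                                  + s["w_gate"] * s["i_gate"])
    return total


# Hypothetical characterization of a single inverter with two input states.
# Widths are effective widths; currents are per unit transistor width.
block = {
    "INVx1": [
        {"prob": 0.5, "w_sub": 1.0, "i_sub": 10e-9, "w_gate": 0.5, "i_gate": 2e-9},
        {"prob": 0.5, "w_sub": 2.0, "i_sub": 8e-9, "w_gate": 1.0, "i_gate": 1e-9},
    ],
}

i_leak = block_leakage_current(block)
print(f"I_leak = {i_leak * 1e9:.1f} nA")
```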
Arbiter Leakage Power Model
Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
Example: matrix arbiter
with R requesters, one R×R matrix keeps the priorities
$gnt_n = req_n \times \prod_{i<n} (req_i + m_{in}) \times \prod_{i>n} (req_i + m_{ni})$
grant logic can be implemented as a tree of NOR and INV gates and the RxR
matrix can be constructed using DFF
$I_{leak}(Arbiter_{matrix}) = I_{leak}(NOR2) \times ((2R-1)R) + I_{leak}(INV) \times R + I_{leak}(DFF) \times \frac{R(R-1)}{2}$
$P_{leak}(Arbiter_{matrix}) = I_{leak}(Arbiter_{matrix}) \times V_{dd}$
NOR2, INV, and DFF represent 2-input NOR gate, inverter gate, and DFF,
respectively
Further details on modeling methodology in Chen et al. 2003
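As a worked instance of the matrix arbiter leakage model, the sketch below evaluates the gate counts and the resulting current and power for R requesters. The per-gate leakage currents are hypothetical unit values chosen so the gate counts are visible in the result; the function names are illustrative.

```python
def matrix_arbiter_leakage_current(r, i_nor2, i_inv, i_dff):
    """Leakage current of a matrix arbiter with r requesters."""
    nor2_count = (2 * r - 1) * r   # NOR2 gates in the grant logic
    inv_count = r                  # one inverter per requester
    dff_count = r * (r - 1) // 2   # DFFs holding the priority matrix
    return i_nor2 * nor2_count + i_inv * inv_count + i_dff * dff_count


def matrix_arbiter_leakage_power(r, i_nor2, i_inv, i_dff, vdd):
    """Leakage power: leakage current times supply voltage."""
    return matrix_arbiter_leakage_current(r, i_nor2, i_inv, i_dff) * vdd


# With unit currents the result equals the total gate count: 45 + 5 + 10.
i_total = matrix_arbiter_leakage_current(5, i_nor2=1.0, i_inv=1.0, i_dff=1.0)
p_total = matrix_arbiter_leakage_power(5, 1.0, 1.0, 1.0, vdd=1.2)
print(i_total, p_total)
```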
Router Area Model
As number of cores increases, the area occupied by communication
components becomes significant (19% of total tile area in the Intel 80-core
Teraflops Chip)
Gate area model by Yoshida et al. (DAC’04)
Link area model by Carloni et al. (ASPDAC’08)
Example: matrix arbiter
$Area_{arbiter} = Area_{NOR2x1} \times 2(R-1)R + Area_{DFF} \times \frac{R(R-1)}{2} + Area_{INVx1} \times R$
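The matrix arbiter area expression can be evaluated the same way as any gate-count model. In this sketch the cell areas are hypothetical unit values, so the printed result is simply the number of cells the formula implies for R = 5.

```python
def matrix_arbiter_area(r, area_nor2, area_dff, area_inv):
    """Area of a matrix arbiter with r requesters, from per-cell areas."""
    return (area_nor2 * 2 * (r - 1) * r      # NOR2x1 cells
            + area_dff * (r * (r - 1) // 2)  # DFF cells
            + area_inv * r)                  # INVx1 cells


# Unit cell areas: 40 NOR2x1 + 10 DFF + 5 INVx1 cells.
area = matrix_arbiter_area(5, area_nor2=1.0, area_dff=1.0, area_inv=1.0)
print(area)
```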
Repeater and Wire Area Models
For existing technologies, the area of a repeater can be
calculated as:
ar = τ0 + τ1 × (wn + wp)
ar denotes repeater area; τ0 and τ1 are coefficients obtained by linear
regression; wn, wp are the widths of the NMOS and PMOS devices, respectively
For future technologies, feature size (F), contacted pitch (CP),
row height (RH), and cell width (CW) can be used to estimate
the area:
NF = (wp + wn + 2 × F) / RH
CW = NF × (F + CP) + CP
ar = RH × CW
Wiring area can be calculated as:
aw = (n × (ww + sw) + sw) × L
aw denotes wire area, n is the bit width of the bus, and ww, sw, L are
wire width, spacing and wire length
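The three area expressions on this slide translate directly into small functions. This is a sketch with hypothetical input values; NF is left unrounded exactly as the slide writes it, and the function and argument names are illustrative.

```python
def repeater_area_existing(w_n, w_p, tau0, tau1):
    """Existing technologies: ar = tau0 + tau1 * (wn + wp)."""
    return tau0 + tau1 * (w_n + w_p)


def repeater_area_future(w_n, w_p, feature, contacted_pitch, row_height):
    """Future technologies: area estimated from F, CP, and RH."""
    n_f = (w_p + w_n + 2 * feature) / row_height
    cell_width = n_f * (feature + contacted_pitch) + contacted_pitch
    return row_height * cell_width


def wire_area(n_bits, wire_width, wire_spacing, length):
    """Bus wiring area: aw = (n * (ww + sw) + sw) * L."""
    return (n_bits * (wire_width + wire_spacing) + wire_spacing) * length


# Hypothetical inputs in arbitrary consistent length units.
a_existing = repeater_area_existing(w_n=1.0, w_p=2.0, tau0=0.5, tau1=0.25)
a_future = repeater_area_future(w_n=1.0, w_p=2.0, feature=0.5,
                                contacted_pitch=1.5, row_height=4.0)
a_wire = wire_area(n_bits=4, wire_width=1.0, wire_spacing=1.0, length=10.0)
print(a_existing, a_future, a_wire)
```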
ORION2.0: Validations and Results
Validation: Two Intel NoC Chips
(1) Intel 80-core Teraflops: high-performance many-core design
(2) Intel SCC: ultra low-power communication core
ORION2.0 offers significant accuracy improvement
% difference from the Intel chips:

                      Intel 80-core       Intel SCC
                      v1.0     v2.0       v1.0      v2.0
%diff (total power)   -85.3    -6.5       +202.4    +11.0
%diff (total area)    -80.9    -23.6      +31.9     +25.3

Router power breakdown (Intel 80-core test case):

Component   Intel 80-core   ORION 1.0   ORION 2.0
FIFO        23%             28%         21%
Crossbar    16%             60%         21%
Arbiter     7%              12%         7%
Clock       36%             0%          30%
Link        18%             0%          21%

%diff (ORION 2.0 vs. Intel 80-core), per component:

Component   %diff
Buffer      -14.8
Crossbar    16.9
Arbiter     -9.0
Clock       -20.9
Link        8.8
Impact on System-Level Design
Testcases
VPROC: video processor with 42 cores and 128-bit datawidth
dVOPD: dual video object plane decoder with 26 cores and 128-bit
datawidth
SoC     P (mW)          A (mm²)         # routers     max. # router ports   max. # hops
        v1.0    v2.0    v1.0    v2.0    v1.0   v2.0   v1.0   v2.0           v1.0   v2.0
VPROC   0.875   0.924   2.043   2.329   33     25     8      12             6      5
dVOPD   0.412   0.486   1.217   1.343   18     16     6      6              11     10
[Figure: synthesized NoC topologies built from routers R1 and R2]
System-level Impact: Communication-Driven Synthesis in
COSI-OCC
Accurate ORION 2.0 models lead to better-performing NoC
Relative power due to additional port not as high in ORION 2.0 vs. 1.0
Conclusions
Accurate models can drive effective NoC design
space exploration
ORION 1.0 is inaccurate for current and future
technology nodes
Proposed accurate power and area models for
network routers (ORION 2.0)
Presented a reproducible methodology for extracting
inputs to our models
Maintained the ORION 1.0 interface while significantly
improving the accuracy of the models → switching to
ORION 2.0 is easy!
ORION 2.0 Release
ORION 2.0 Website: http://www.princeton.edu/~peh/orion.html
System-Level NoC Power Modeling Example
Polaris toolchain: design-space exploration tool
Input: microarchitecture parameters
Step 1: Trident (synthetic traffic generation)
Step 2: LUNA (high-level on-chip network analysis)
Step 3: ORION (power and area models)
Output: NoC designs projections (power consumption, CMOS area, performance (latency))
V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI’07