NoC Physical Implementation
Federico Angiolini
[email protected]
DEIS Università di Bologna
Physical Implementation and NoCs
NoCs and physical implementation flows are closely related topics
On the one hand, NoCs are designed to alleviate back-end issues (structured wiring)
On the other hand, back-end properties critically affect NoC behaviour and effectiveness
ASIC Synthesis Flow
A Typical ASIC Design Flow
Design Space Exploration → RTL Coding → Logic Synthesis → Placement → Routing
Ideally, one-shot linear flow
In practice, iterations needed to fix issues
Validation failures
Bad quality of results
No timing closure
Basics of a Back-End Flow
RTL code (circuit description)
→ Analysis → GTech (connected network of logic blocks)
→ Logic Synthesis → Netlist (connected network of gates)
→ Placement → Placed Netlist (placed network of gates)
→ Routing → Layout (placed and routed network of gates)
Tech Libs feed the synthesis and physical steps
Major vendors: Synopsys, Mentor, Magma, Cadence
Notes on Tech Libraries
Encapsulate foundry capabilities
Typical content: boolean gates,
flip-flops, simple gates
But, in lots of variations: fan-in,
driving strength, speed, power...
Describe: function, delay, area,
power, physical shape...
Often many libraries per process:
high-perf/low-power; best/worst;
varying VDD; varying VT
Analysis of the Hardware
Description
Normally, a very swift step
Input: Verilog/VHDL description
Output: circuit description in terms of “adders”, “muxes”, “registers”, “boolean gates”, etc. (GTech = Generic Technology)
The output is not optimized for any metric
It just translates the specification into an abstract circuit
Logic Synthesis
Takes minutes to hours
Input: GTech description
Output: circuit description in terms of “HSFFX4”, “LPNOR2X2”, “LLINVX32”, etc. (i.e.: specific gates of a specific tech library)
Output is...
Complying with timing specs (e.g. “at 500 MHz”)
Optimized for area and power
...How Does This Work?
Based on the GTech description, paths are identified:
register-to-register
input-to-register
register-to-output
input-to-output
Along each path, GTech blocks are replaced with actually available gates from the technology library
The outcome is called a netlist
Delay is analyzed first – and some paths are detected as critical
Example: Critical Paths
(figure from “Adventures in ASIC Digital Design”)
Based on the chosen library gates and the netlist, path 1 → 6 is the longest and violates the timing constraints
Netlist Optimization
The synthesis process optimizes critical paths until timing constraints are met, e.g.:
Use faster gates instead of lower-power ones
Play with driving strength (as in buffering)
Refactor combinational logic to minimize the number of gates to be traversed
Once timing is met, analyze non-critical paths
Optimize them for area and power, even if they become slower
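As an illustration of this two-phase loop, here is a small Python toy with invented delay and power numbers (not how a real synthesis engine works): gates on a violating path are first swapped to a faster, hungrier variant until the path fits in the clock period, then any speed that is no longer needed is given back.

path = [{"delay": 0.9, "power": 1.0, "fast": False} for _ in range(4)]
FAST_DELAY, FAST_POWER = 0.5, 2.5      # invented fast gate variant
SLOW_DELAY, SLOW_POWER = 0.9, 1.0      # invented slow/low-power variant
period = 3.0

def total_delay():
    return sum(g["delay"] for g in path)

# Phase 1: upsize gates until the path meets the clock period
for gate in path:
    if total_delay() <= period:
        break
    gate.update(delay=FAST_DELAY, power=FAST_POWER, fast=True)

# Phase 2: downsize any fast gate whose speed is no longer required
for gate in path:
    if gate["fast"] and total_delay() - FAST_DELAY + SLOW_DELAY <= period:
        gate.update(delay=SLOW_DELAY, power=SLOW_POWER, fast=False)

print(f"delay {total_delay():.1f} / {period} ns, "
      f"power {sum(g['power'] for g in path):.1f}")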
Placement
Step 1: Floorplanning
Place macro-blocks onto a “rectangle” (→ the chip)
e.g. processors, memories...
Step 2: Detailed placement
Align the single gates of each macro-block into “rows”
Typically aiming at 85% row utilization
Example: xpipes Placement Approach
Floorplan = mix of
hard macros for IP cores
soft macros for NoC blocks
Routing
Step 1: Clock tree insertion
Bring the clock to all flip-flops
Step 2: Power network insertion
Bring the VDD, GND nets across the chip
Typically over the top metal layers
Either as a ring (small designs) or a grid (bigger designs)
Step 3: Logic routing
Actually connect gates to each other
Typically over the bottom metal layers
Example: Binary Clock Tree
Issue: minimizing skew
Critical at high frequencies
Consumes a large amount of power
(courtesy of Shobha Vasudevan)
Issue with Traditional Flow
Major problem with traditional flow...
...wiring is not considered during synthesis!!!
Outdated assumption: wiring delay is negligible
Partial fix: wireload models
Consider fan-out of gates
If small, assume short wiring at outputs, and a bit of extra delay
If large, assume long wiring at outputs, and a noticeable extra delay
Still grossly inaccurate
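The sketch below shows, in spirit, what such a fan-out-based wireload model does: with no placement information, the wire capacitance at a gate output is simply guessed from its fan-out and turned into an extra-delay estimate. All coefficients are invented for illustration.

def wireload_extra_delay_ps(fan_out, cap_per_load_ff=2.0, ps_per_ff=1.5):
    # Guess the wire capacitance from fan-out alone, then turn it into delay.
    estimated_wire_cap_ff = fan_out * cap_per_load_ff
    return estimated_wire_cap_ff * ps_per_ff

for fo in (1, 4, 16):
    print(f"fan-out {fo:2d}: ~{wireload_extra_delay_ps(fo):.0f} ps extra delay assumed")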
Physical Synthesis
Currently envisioned solution: physical synthesis
Merge placement with logic synthesis:
Initial, quick logic synthesis
Coarse-grained placement
Incremental synthesis & placement until convergence
Drastically better (more predictable) results
Still may not suffice... also integrate the routing step??
Flow: RTL → quick logic synthesis → initial netlist → quick placement → initial placed netlist → incremental synthesis & placement → final placed netlist
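The convergence loop can be pictured as below; this is a purely conceptual Python sketch with made-up numbers, not a description of any vendor tool. The point is only that synthesis and placement take turns, each iteration refining timing with increasingly realistic wire information.

def physical_synthesis(max_iters=10, target_slack_ns=0.0):
    slack_ns = -1.0       # after the quick initial synthesis + placement
    wire_accuracy = 0.5   # how realistic the current wire estimates are
    for i in range(1, max_iters + 1):
        slack_ns += 0.3 * wire_accuracy                  # incremental synthesis pass
        wire_accuracy = min(1.0, wire_accuracy + 0.1)    # incremental placement pass
        if slack_ns >= target_slack_ns:
            return i, slack_ns
    return max_iters, slack_ns

iterations, slack = physical_synthesis()
print(f"converged after {iterations} iterations, slack = {slack:+.2f} ns")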
Advanced Back-End Flow
RTL code (circuit description)
→ Analysis → GTech (connected network of logic blocks)
→ Physical Synthesis → Placed Netlist (placed network of gates)
→ Routing → Layout (placed and routed network of gates)
Tech Libs feed the synthesis and physical steps
Major vendors: Synopsys, Mentor, Magma, Cadence
Some Observations on the
Physical Implementation of NoCs
Study 1: Cross-Benchmarking NoCs vs. Traditional Interconnects
Study performance, area, power of a NoC implementation as opposed to traditional bus interconnects
Plain shared bus
Hierarchical bus
130nm technology
Note: based on an old, unoptimized version of the NoC architecture
AMBA AHB Shared Bus
Diagram: masters M0–M9 and traffic generators T0–T4 on a single AMBA AHB bus, with private slaves P0–P9 and shared slaves S10–S14
Baseline architecture
Ten ARM cores, five traffic generators, fifteen slaves (fully populated bus)
ARM cores: running a pipelined multimedia benchmark
Traffic generators:
Streaming traffic towards a memory (DSP-like)
Periodically querying some slaves (IOCtrl-like)
AMBA AHB Multilayer
Diagram: five AHB layers (Layer 0–4), each with two masters and one traffic generator (e.g. M0, M1, T0) plus private slaves (e.g. P0, P1); shared slaves S10–S14 reached through the AMBA AHB crossbar
Dramatically improves performance:
Intra-cluster traffic to private slaves (P0–P9) is bound within each layer, reducing congestion
Shared slaves (S10–S14) can be accessed in parallel
Representative 5x5 Multilayer configuration (up to 8x8 allowed)
xpipes (Quasi-)Mesh
Layout diagram: masters M0–M9, traffic generators T0–T4, private slaves P0–P9 and shared slaves S10–S14 placed on a quasi-mesh, cores of roughly 1 mm² each, 130nm technology
Excellent bandwidth
Balanced architecture, no max frequency bottlenecks
Very regular topology: easy to floorplan
Overhead in area & power due to many links and buffers
NoCs vs. Traditional Interconnects - Performance
Chart: time to complete the functional benchmark (execution time, ms) vs. cache size (256 B, 1 kB, 4 kB), for the AMBA AHB shared bus, AMBA AHB multilayer, xpipes mesh (21 bit, 3 buffers) and xpipes mesh (38 bit, 3 buffers)
Shared buses are totally collapsing
NoCs are 10-15% faster than hierarchical buses
Observation #1:
NoCs are much more scalable and can provide better performance under severe load.
NoCs vs. Traditional Interconnects - Summary
Cross-benchmarking layout results, AMBA vs. NoCs:
Max layout frequency: AMBA Multilayer 370 MHz | xpipes 21-bit qmesh 793 MHz | xpipes 38-bit qmesh 793 MHz
Frequency predictability: -23% | -6% | -6%
Bandwidth: 24 GB/s | 87 GB/s | 158 GB/s
Functional benchmark execution time: baseline | ~10% faster | ~15% faster
Cell area: 0.52 mm2 | 1.7 mm2 | 2.1 mm2
Power: 75 mW | 376 mW | 473 mW
Energy (NoC + 5W cores): 5.08 mJ | 5.17 mJ | 4.96 mJ
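The energy row combines the two power figures: interconnect power plus the assumed 5 W drawn by the cores, integrated over the benchmark execution time. The execution times in this short sketch are placeholders, not the measured values behind the table.

CORE_POWER_W = 5.0                       # assumption stated in the table header

def total_energy_mj(interconnect_power_mw, exec_time_ms):
    total_power_w = CORE_POWER_W + interconnect_power_mw / 1000.0
    return total_power_w * exec_time_ms  # W * ms = mJ

print(total_energy_mj(75, 1.0))          # e.g. a 75 mW interconnect over a 1 ms run
print(total_energy_mj(473, 0.85))        # a hungrier NoC, but a ~15% faster run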
Observation #2:
NoCs are dramatically more
predictable than traditional
interconnects.
Observation #3:
NoCs are better in performance and
physical design, but be careful about
area and power!
Bandwidth or Latency?
Chart: overall bandwidth (GB/s) vs. cache size (256 B, 1 kB, 4 kB): AMBA Multilayer ~24 GB/s, xpipes 21-bit qmesh ~87 GB/s, xpipes 38-bit qmesh ~158 GB/s
NoC bandwidth is much higher (44 links, ~1 GHz)
But this is only an indirect clue of performance
Charts: processor-perceived latency (ns) vs. cache size for posted writes and for short reads, AMBA AHB multilayer vs. xpipes meshes (21 bit and 38 bit, 3 buffers)
The NoC latency penalty/gain depends on the transaction type
Penalty on short reads
Gain on posted writes
Observation #4:
Latency matters more than raw bandwidth. NoCs have to be careful about some transaction types.
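For reference, the aggregate-bandwidth figures quoted above follow from simple arithmetic: number of links times link width times clock frequency. The parameters below are only an example of that arithmetic, not the exact xpipes configuration behind the 87 and 158 GB/s figures.

def aggregate_bandwidth_gbs(num_links, flit_bits, freq_ghz):
    # One flit per link per cycle, expressed in gigabytes per second.
    return num_links * flit_bits * freq_ghz / 8.0

print(aggregate_bandwidth_gbs(num_links=44, flit_bits=38, freq_ghz=0.793))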
Area, Power Budget Analysis
Breakdowns for the 38-bit qmesh:
a. Area: Switches 48%, NI initiators 24%, NI targets 24%, clock trees and spare cells 4%
b. Power: xpipes clock tree 35%, Switches 31%, NI initiators 14%, NI targets 11%, OCP clock tree 9%
Observation #5:
Clock trees are negligible in area, but eat
up almost half of the power budget.
Study 2: Implementation of NoCs in 90 and 65nm
Study behaviour of NoCs as they are implemented in cutting-edge technologies
Observe behaviour of tech libraries, tools, architecture and links as they are scaled from one technology node to another
Link Design Constraints
Chart: power to drive a 38-bit (plus flow control) unidirectional link, for 65nm lowest-power and 65nm power/performance libraries
Observation #6:
Long links (unless custom designed) become either infeasible, or too power-hungry. Keep them segmented.
Link Repeaters/Relay Stations
Wire segmentation by topology design
Put more switches, closer together
Adds a lot of overhead
Wire segmentation by repeater insertion
Flops/relay stations to break links
Details are strictly related to flow control
Diagram: sender-to-receiver links broken by relay stations, with VALID/(N)ACK and VALID/STALL flow-control signals repeated on each segment
Observation #7:
Architectural provisions may be
needed to tackle physical-level
issues. These may impact
performance, so they should be
accounted for in advance.
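A minimal cycle-level Python sketch of a stall-based relay station, illustrating the idea rather than the actual xpipes circuit: a single register stage that holds its flit while the downstream segment is stalled and propagates the stall upstream.

class RelayStation:
    def __init__(self):
        self.flit, self.valid = None, False

    def cycle(self, in_flit, in_valid, downstream_stall):
        # Returns (out_flit, out_valid, upstream_stall) for this clock cycle.
        out_flit, out_valid = self.flit, self.valid
        if self.valid and downstream_stall:
            upstream_stall = True                        # full and blocked: stall the sender
        else:
            self.flit, self.valid = in_flit, in_valid    # capture the incoming flit
            upstream_stall = False
        return out_flit, out_valid, upstream_stall

rs = RelayStation()
print(rs.cycle("flit0", True, False))   # (None, False, False): stage is filling
print(rs.cycle("flit1", True, True))    # ('flit0', True, True): held, sender stalled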
Wireload Models and 65nm
Wireload models used to guesstimate propagation delay during logic synthesis are inaccurate
As seen, for 130nm, they are 6 to 23% off from the actually achievable post-placement timing
In 65nm, the problem is dramatically worse
No timing closure after placement (-50% frequency, huge runtimes...)
Traditional logic synthesis tools (e.g. Synopsys Design Compiler) are insufficient
Physical synthesis, however, works great
Observation #8:
Physical synthesis is compulsory for
next-generation nodes.
Placement in Soft Macros
In our experiments, placement&routing is extremely sensitive to soft macro area
Fences too tight: flow fails
Fences too wide: tool produces bad results
Solution: accurate component area models
Involves work, since area depends on architectural parameters (cardinality, buffering...)
Observation #9:
Thorough characterization of the
components may be key to the convergence
of the flow for a whole topology.
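A sketch of what such a component area model might look like for a switch: a simple analytical formula in the architectural parameters, whose coefficients would be back-annotated from real layouts. The coefficients below are invented, purely to show the shape of the model and how a soft-macro fence size could be derived from it.

def switch_area_um2(n_in, n_out, flit_bits, buf_depth,
                    a_xbar=12.0, a_buf=4.5, a_fixed=2000.0):
    crossbar = a_xbar * n_in * n_out * flit_bits       # crossbar / mux logic
    buffers = a_buf * n_in * buf_depth * flit_bits     # input buffering
    return a_fixed + crossbar + buffers

area = switch_area_um2(n_in=6, n_out=6, flit_bits=38, buf_depth=6)
fence = 1.15 * area        # ~15% margin: neither too tight nor too loose
print(f"estimated cell area {area:.0f} um2, fence target {fence:.0f} um2")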
65nm Degrees of Freedom
Chart: relative frequency and relative power of 90 nm and 65 nm libraries, HP vs. LP variants; the spread between variants ranges from 2.7X to 11X
LP and HP libraries differ in gate design, VT, VDD...
Observation #10:
There is no such thing as a “65nm library”. Power/performance degrees of freedom span across one order of magnitude. It is the designer’s (or the tools’) responsibility to pick the right library choice.
Technology Scaling within Modules
Chart: relative frequency, relative area and relative power of a 6x6 switch (38 bits, 6 buffers), 90 nm HP vs. 65 nm HP
Within modules, scaling looks great
+25% frequency
-46% area
-52% power
Technology Scaling on
Topologies
Three designs for max frequency:
65 nm, 1 mm2 cores
90 nm, 1 mm2 cores
65 nm, 0.4 mm2 cores
Mesh Scaling
Scaling of meshes (max perf. corner):
Max layout frequency: 90nm, 1 mm2 cores 1 GHz | 65nm, 1 mm2 cores 1.25 GHz | 65nm, 0.4 mm2 cores 1.25 GHz
Max bandwidth: 228 GB/s | 285 GB/s | 285 GB/s
Cell area: 1.31 mm2 | 0.64 mm2 | 0.63 mm2
Power/MHz: 0.785 mW/MHz | 0.416 mW/MHz | 0.396 mW/MHz
Links
Always short (<1.2 mm) → non-pipelined
However
90 nm 1 mm2: 3.1 mW
65 nm 1 mm2: 3.6 mW (tightest fit → more buffering)
65 nm 0.4 mm2: 2.2 mW
Power shifting from switches/NIs to links (buffering)
High-Radix Switch Feasibility
Chart: switch frequency (MHz), estimated after synthesis vs. after P&R, for switch radixes from 2x2 up to 30x30
High-radix switches become too slow
10x10 is maximum realistic size
For sizes 26x26 and 30x30, P&R is unfeasible!
Clock Skew in High-Radix Switches
Chart: absolute clock tree skew (ns) and relative skew (%) for switch radixes from 2x2 up to 30x30
A single switch is still a small entity
Skew can be confined to <10%, typically <5%
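The relative-skew numbers are just the absolute skew expressed as a fraction of the clock period, as in this short sketch (the figures are illustrative):

def relative_skew(abs_skew_ns, freq_ghz):
    period_ns = 1.0 / freq_ghz
    return abs_skew_ns / period_ns

print(f"{relative_skew(0.04, 1.0):.1%}")   # 40 ps of skew at 1 GHz  -> 4.0%
print(f"{relative_skew(0.16, 0.6):.1%}")   # 160 ps at 600 MHz       -> 9.6%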
A Complete NoC Synthesis Flow
Design of a NoC-Based System
Software Services: Mapping, QoS, middleware...
Architecture: Packeting, buffering, flow control...
CAD Tools
Physical Implementation: Synchronization, wires, power...
All these items are key opportunities and challenges
Strict interaction/feedback mandatory!...
CAD tools must guide designers to best results
The Design Tool Dilemma
Automatically find topology and architectural parameters so that
Design constraints are satisfied
Area, power, latency are minimized
A hypercube? A torus? Or do I want a custom topology?
Custom Topology & Mapping
Objectives
Design fully application-specific custom topologies
Generate deadlock-free networks
Optimize architectural parameters of the NoC (frequency, flit size), tuning based upon application requirements
Physical design awareness
Leverage accurate analytical models for area and power, back-annotated from layouts
Integrated floorplanner to achieve design closure while also considering wiring complexity
The xpipes NoC Design Flow
Inputs: Application Traffic / Task Graph, user objectives (power, hop delay), constraints (area, power, hop delay, wire length), NoC area and power models, NoC component library, IP core models
Topology Synthesis (SunFloor) includes: Floorplanner, NoC Router
System specs → Platform Generation (xpipesCompiler) → SystemC code → RTL synthesis → Placement & Routing → to fab
SystemC code also feeds Architectural Simulation and FPGA Emulation
Floorplanning specifications and area/power characterization are back-annotated into the flow
Example: Task Graph
Diagram: task graph with cores VLD, INV SCAN, ACDC PRED, VOP MEM, STRIPE MEM, RLD, IDCT, UP SAMP, IQUANT, PAD, VOP REC, ARM
Captures communication among system cores
Source/destination pairs
Required bandwidth
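Such a graph is naturally captured as a list of (source, destination, required bandwidth) edges, as in the Python sketch below. The edges and bandwidth values are placeholders, not the real profile of the application on the slide; the per-core totals are the kind of figure topology synthesis starts from.

from collections import defaultdict

comm_graph = [                       # (source, destination, MB/s) - illustrative
    ("VLD", "RLD", 60), ("RLD", "INV SCAN", 60),
    ("INV SCAN", "ACDC PRED", 60), ("ACDC PRED", "IQUANT", 60),
    ("IQUANT", "IDCT", 60), ("IDCT", "UP SAMP", 120),
    ("UP SAMP", "VOP REC", 300), ("VOP MEM", "VOP REC", 300),
    ("VOP REC", "PAD", 300), ("ARM", "STRIPE MEM", 40),
]

load = defaultdict(int)              # total bandwidth each core must source/sink
for src, dst, bw in comm_graph:
    load[src] += bw
    load[dst] += bw

for core, bw in sorted(load.items(), key=lambda kv: -kv[1]):
    print(f"{core:10s} {bw:4d} MB/s")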
Measuring xpipes Performance
topology specs + xpipes library → xpipesCompiler (fabric instantiation) → topology SystemC
Architectural simulation: cycle-accurate simulation platform, traffic generators → architectural statistics, traffic logs → performance figures
HDL translation: RTL SystemC Converter → topology HDL
Fabric synthesis: Synopsys Physical Compiler + tech library → topology netlist
Place & route: Synopsys Astro → topology floorplan, area figures
Verification, power modeling: Mentor ModelSim, Synopsys PrimePower → power figures
Example Layout
Floorplan is automatically generated
Black areas = IP cores
Colored areas = NoC
Over-the-cell routing allowed in this example
65nm design