Transcript Synthesis

STRUCTURED IC
SYNTHESIS
Contents
Introduction
Switch Models of Transistors
Architectures
Advantages / Disadvantages
Remarks
Introduction
Structured ASICs include everything between
an FPGA and a standard cell-based design
Structured ASICs are used mainly for mid-volume designs
The design task for structured ASICs is to map
the circuit into a fixed arrangement of known
cells
Properties
Low NRE cost
– Implementation engineering effort
– Mask tooling charges
High performance
Low power consumption
Less Complex
– Fewer layers to fabricate
Short time to market
– Pre-made cell blocks available for placing
Architecture
Two Main Levels
– Structured Elements
Combinational and
sequential function
blocks
Can be a logical or
storage element
– Array of Structured
Elements
Uniform or non-uniform
array styles
A fixed arrangement of
structured elements
Main Implementation Steps
1. RTL Design
– Register transfer level design
2. Logical synthesis
– Maps RTL into structured elements
3. Design for Test insertion
– Improves testability and fault coverage
4. Placement
– Maps each structured element onto array elements
– Places each element into a fixed arrangement
5. Physical synthesis
– Improves the timing of the layout
– Optimizes the placement of each element
6. Clock synthesis
– Distributes the clock network
– Minimizes the clock skew and delay
7. Routing
– Inserts the wiring between the elements
Implementation Issues
Logical synthesis, placement and routing
all depend on the target structure element
architecture and hence add more
complexity to the design process.
The completeness of the target structured
ASIC library also affects what specifically
can be implemented from the design.
FPGA vs. Standard Cell ASIC

FPGA                          Standard Cell ASIC
Easy to Design                Difficult to Design
Short Development Time        Long Development Time
Low NRE Costs                 High NRE Costs
Design Size Limited           Support Large Designs
Design Complexity Limited     Support Complex Designs
Performance Limited           High Performance
High Power Consumption        Low Power Consumption
High Per-Unit Cost            Low Per-Unit Cost (at high volume)
There are options in between (sometimes referred to as structured
ASICs) that combine the best of both
Structured ASIC Architectures
Fine-Grained
Structured elements
contain unconnected
discrete components
Could include
transistors, resistors,
and others
Structured ASIC Architectures
Medium-Grained
Structured elements contain generic logic
Could include gates, MUX’s, LUT’s or flip-flops
Structured ASIC Architectures
Hierarchical
Use mini structured elements that contain
only gates, MUX’s, and LUT’s
It does not contain storage elements like
flip-flops
This mini element is then combined with
registers or flip-flops
Architecture Comparison
Fine-grained requires many connections in
and out of a structured element
Higher granularities reduce connections to
the structured element but decrease the
functionality it can support
Clearly, each individual design will benefit
differently at varying granularities
Structured ASIC Advantages
Largely Prefabricated
– Components are “almost”
connected in a variety of
predefined configurations
– Only a few metal layers
are needed for fabrication
– Drastically reduces
turnaround time
[Figure: layer stack with two routing layers on top of three pre-routed layers]
Structured ASIC Advantages
Easier and faster to design than standard
cell ASIC’s
– Multiple global and local clocks are
prefabricated
– No skew problems that need to be addressed
– Signal integrity and timing issues are
inherently addressed
Structured ASIC Advantages
Capacity, performance, and power
consumption closer to that of a standard
cell ASIC
Faster design time, reduced NRE costs,
and quicker turnaround
Therefore, the per-unit cost is reasonable
for several hundreds to 100k unit
production runs
Structured ASIC Disadvantages
Lack of adequate design tools
– Expensive
– Altered from traditional ASIC tools
These new architectures have not yet
been subject to formal evaluation and
comparative analysis
– Tradeoffs between 3-, 4-, and 5-input LUT’s
– Tradeoffs between sizes of distributed RAM
Technology Comparison
Generally speaking
– 100:33:1 ratio between the number of gates in
a given area for standard cell ASIC’s,
structured ASIC’s, and FPGA’s, respectively
– 100:75:15 ratio for performance (based on
clock frequency)
– 1:3:12 ratio for power
Design Tools
Many companies are using existing standard
cell-based CAD tools
– They add product specific placement tools
– To maximize benefits, we need CAD tools designed
specifically for structured ASIC’s
– Need updated algorithms to exploit the modularity of
structured ASIC’s
– Clock aware design
Need architectural evaluation and analysis tools
Embedded Clocks (Sometimes)
2 main clocks
– Accessible from anywhere
Embedded Clocks
8 local clocks
– Chip divided into 4 regions
– 4 local clocks can be assigned to each region
– Region divided into 4 subregions
– Each subregion assigned 2 local clocks
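The limits above can be checked mechanically for a given placement. The following is only a sketch: the tuple encoding of flip-flops, the clock names, and the function name `clock_violations` are assumptions made for illustration, not part of any vendor flow.

```python
from collections import defaultdict

MAX_LOCAL_PER_REGION = 4      # from the slide: 4 local clocks per region
MAX_LOCAL_PER_SUBREGION = 2   # from the slide: 2 local clocks per subregion

def clock_violations(flops, main_clocks):
    """Return (regions, subregions) that use too many local clocks.

    flops: iterable of (clock_name, region, subregion) tuples
           (hypothetical encoding, region and subregion in 0..3).
    main_clocks: set of clock names carried by the main clock networks.
    """
    per_region = defaultdict(set)
    per_subregion = defaultdict(set)
    for clock, region, subregion in flops:
        if clock in main_clocks:
            continue  # main clocks are accessible from anywhere
        per_region[region].add(clock)
        per_subregion[(region, subregion)].add(clock)

    bad_regions = sorted(r for r, clocks in per_region.items()
                         if len(clocks) > MAX_LOCAL_PER_REGION)
    bad_subregions = sorted(s for s, clocks in per_subregion.items()
                            if len(clocks) > MAX_LOCAL_PER_SUBREGION)
    return bad_regions, bad_subregions

flops = [
    ("clk_a", 0, 0), ("clk_b", 0, 0), ("clk_c", 0, 0),  # 3 local clocks in one subregion
    ("sys", 0, 1), ("clk_a", 1, 2),
]
print(clock_violations(flops, main_clocks={"sys"}))
```

Here region 0 stays within its limit of 4 local clocks, but subregion (0, 0) needs 3 local clocks and is reported as a violation.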
More Clock Signals Needed?
Use a custom layer to implement an
additional clock signal
Custom layer is limited, so it may not be
feasible
Try to avoid this as much as possible
Assigning Clock Signal
Main/local clock assignment
– Which clock should be the main clock?
– Which clock should be the local clock?
Region clock assignment
– Which local clock should be assigned to each
region?
Do we need a custom clock?
– We generally do not want it
3 methods to solve this
Number Based Heuristics
Method 1
Assign 2 most used clocks as main clocks
Other clocks are local clocks
Assign local clocks to subregions based
on I/O positions
Perform placement
Problems
May not be possible
What about delay optimization?
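The first step of Method 1 (the most used clocks become main clocks) can be sketched as below. The input format (one clock name per flip-flop) and the alphabetical tie-break are assumptions; the slide does not say how ties are resolved.

```python
from collections import Counter

def split_main_and_local(flop_clocks, num_main=2):
    """Split clocks into main and local by how many flip-flops use them.

    flop_clocks: one clock name per flip-flop (hypothetical input format).
    """
    usage = Counter(flop_clocks)
    # Sort by descending usage, then by name so ties are deterministic.
    ranked = sorted(usage, key=lambda clock: (-usage[clock], clock))
    return ranked[:num_main], ranked[num_main:]

main, local = split_main_and_local(
    ["sys", "sys", "sys", "mem", "mem", "uart", "spi"])
print("main:", main)
print("local:", local)
```

The sketch covers only the counting step. Assigning the local clocks to subregions by I/O position, and the delay concerns listed under Problems, are not modeled.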
Placement Based Clock Optimization
Method 2
1. Perform placement without clock constraints
Based on interconnect delays
2. Clock assignment as result of step 1
Which clock should be the main clock?
Which local clock should be assigned to each
region?
3. Move violating FFs to other regions
4. Map FFs to embedded positions
Placement Based Clock Optimization
Method 2
Problems
Moving FFs to different regions will
drastically increase interconnect delays
Huge performance loss
How do we solve this?
Design Flow
Partition
Front-end
physical design
Floorplanning
Placement
Routing
Back-end
physical design
Extraction and
Verification
Floorplanning Based Clock Optimization
Circuit Partitioning (consider clock and delay domain)
→ Floorplanning (not using embedded clock constraints)
→ Embedded clock constraint violation?
No → Done
Yes → Regional clock assignment (based on current floorplanning)
→ Incremental floorplanning (use embedded clock constraints) → Done
Let’s look at some basics
Series and Parallel Transistor Networking
nMOS: 1 = ON
pMOS: 0 = ON
Series: both must be ON
Parallel: either can be ON
[Figure: two-transistor networks between terminals a and b, with gate inputs g1 and g2]
Network               g1g2=00  g1g2=01  g1g2=10  g1g2=11
(a) nMOS in series    OFF      OFF      OFF      ON
(b) pMOS in series    ON       OFF      OFF      OFF
(c) nMOS in parallel  OFF      ON       ON       ON
(d) pMOS in parallel  ON       ON       ON       OFF
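The switch rules (nMOS on at 1, pMOS on at 0, series needs every switch on, parallel needs at least one) can be turned into a small sketch that enumerates all gate input combinations. The function names are chosen here for illustration only.

```python
from itertools import product

def nmos(g):
    """An nMOS switch conducts when its gate is 1."""
    return g == 1

def pmos(g):
    """A pMOS switch conducts when its gate is 0."""
    return g == 0

def series(*switches):
    """A series network conducts only if every switch conducts."""
    return all(switches)

def parallel(*switches):
    """A parallel network conducts if at least one switch conducts."""
    return any(switches)

def table(device, combine):
    """ON/OFF state of a two-transistor network for g1g2 = 00, 01, 10, 11."""
    return ["ON" if combine(device(g1), device(g2)) else "OFF"
            for g1, g2 in product((0, 1), repeat=2)]

for name, device, combine in [
    ("nMOS series", nmos, series),
    ("pMOS series", pmos, series),
    ("nMOS parallel", nmos, parallel),
    ("pMOS parallel", pmos, parallel),
]:
    print(name, table(device, combine))
```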
Example: NOR Cell
Activity:
– Sketch a 4-input CMOS NOR gate
[Figure: 4-input CMOS NOR gate with inputs A, B, C, D and output Y]
Compound Gates
Compound gates can do any inverting
function
Ex: $Y = \overline{(A \cdot B) + (C \cdot D)}$ (AND-AND-OR-INVERT, AOI22)
[Figure: steps (a)-(f) building the AOI22 gate from series and parallel transistor networks on inputs A, B, C, D with output Y]
CMOS O3AI
$Y = \overline{(A + B + C) \cdot D}$
[Figure: CMOS O3AI schematic with inputs A, B, C, D and output Y]
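A compound gate can be checked at switch level: the nMOS pull-down network implements the un-inverted expression, and the pMOS pull-up network is its series/parallel dual. The sketch below does this for AOI22 and O3AI. The helper `cmos_output` is a name introduced here, and the model ignores all electrical effects.

```python
from itertools import product

def cmos_output(pull_down_on, pull_up_on):
    """Resolve Y from the two networks; exactly one of them must conduct."""
    if pull_down_on == pull_up_on:
        raise ValueError("networks are not complementary")
    return 1 if pull_up_on else 0

def aoi22(a, b, c, d):
    # Pull-down (nMOS): (A series B) parallel (C series D)
    pull_down = bool((a and b) or (c and d))
    # Pull-up (pMOS): (A parallel B) series (C parallel D)
    pull_up = bool((not a or not b) and (not c or not d))
    return cmos_output(pull_down, pull_up)

def o3ai(a, b, c, d):
    # Pull-down (nMOS): (A parallel B parallel C) series D
    pull_down = bool((a or b or c) and d)
    # Pull-up (pMOS): (A series B series C) parallel D
    pull_up = bool((not a and not b and not c) or not d)
    return cmos_output(pull_down, pull_up)

# Exhaustive check over all 16 input combinations.
for a, b, c, d in product((0, 1), repeat=4):
    assert aoi22(a, b, c, d) == int(not ((a and b) or (c and d)))
    assert o3ai(a, b, c, d) == int(not ((a or b or c) and d))
print("both gates match their Boolean expressions")
```

Because `cmos_output` raises when both or neither network conducts, the loop also confirms that each pull-up is the proper complement of its pull-down.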
Gate Layout
Layout can be very time consuming
– Design gates to fit together nicely
– Build a library of standard cells
Standard cell design methodology
– VDD and GND should abut (standard
height)
– Adjacent gates should satisfy design rules
– nMOS at bottom and pMOS at top
– All gates include well and substrate
contacts
Example: Inverter
Example: NAND3
Horizontal n-diffusion and p-diffusion strips
Vertical polysilicon gates
Metal1 VDD rail at top
Metal1 GND rail at bottom
32 λ by 40 λ
Stick Diagrams
Stick diagrams help plan layout quickly
– Need not be to scale
– Draw with color pencils or dry-erase markers
Wiring Tracks
A wiring track is the space required for a wire
– 4 λ width, 4 λ spacing from neighbor = 8 λ pitch
Transistors also consume one wiring track
Well spacing
Wells must surround transistors by 6 λ
– Implies 12 λ between opposite transistor flavors
– Leaves room for one wire track
Area Estimation
Estimate area by counting wiring tracks
– Multiply by 8 to express in λ
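As a quick worked example of the track-counting rule: the counts of 4 tracks wide and 5 tracks high used below are an assumption, picked because they reproduce the 32 λ by 40 λ NAND3 quoted earlier.

```python
TRACK_PITCH_LAMBDA = 8  # 4 lambda wire width + 4 lambda spacing

def estimate_area(tracks_wide, tracks_high):
    """Return (width, height, area) in lambda units from wiring-track counts."""
    width = tracks_wide * TRACK_PITCH_LAMBDA
    height = tracks_high * TRACK_PITCH_LAMBDA
    return width, height, width * height

print(estimate_area(4, 5))
```

This is only a first-order estimate. A real layout still has to satisfy every design rule.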
Example: O3AI
Sketch a stick diagram for O3AI and estimate area
– $Y = \overline{(A + B + C) \cdot D}$
Placement
Problem
– Given a netlist, and fixed-shape cells (small, standard cell), find
the exact location of the cells to minimize area and wire-length
– Consistent with the standard-cell design methodology
Row-based, no hard-macros
– Modules:
Usually fixed, equal height (exception: double height cells)
Some fixed (I/O pads)
Connected by edges or hyperedges
Objectives
– Cost components: area, wire length
Additional cost components: timing, congestion
Placement Cost Components
Area
– Would like to pack all the modules very tightly
Wire length (half-perimeter of the hypernet bounding box)
– Minimize average wire length
– Would result in tight packing of modules with high connectivity
Overlap
– Could be prohibited by the moves, or used as penalty
– Keep the cells from overlapping (moves cells apart)
Timing
– Not a 1-1 correspondence with wire length minimization, but
consistent on average
Congestion
– Measure of routability
– Tends to move cells apart
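The half-perimeter wire length named above is simple to compute. In this sketch, cells are treated as points and each net is a tuple of cell names; both are simplifying assumptions, since real placers use pin locations.

```python
def hpwl(net, positions):
    """Half-perimeter of the bounding box enclosing all cells on one net."""
    xs = [positions[cell][0] for cell in net]
    ys = [positions[cell][1] for cell in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets, positions):
    """Sum of half-perimeter wire lengths over all nets (hyperedges)."""
    return sum(hpwl(net, positions) for net in nets)

positions = {"a": (0, 0), "b": (3, 4), "c": (1, 2)}
nets = [("a", "b", "c"), ("b", "c")]
print(total_wirelength(nets, positions))  # 7 + 4 = 11
```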
Importance of Placement
Placement: fundamental problem in physical design
Glue of the physical synthesis
Became very active again in recent years:
– 9 new academic placers for WL min. since 2000
– Many other publications to handle timing, routability, etc.
Reasons:
– Serious interconnect issues (delay, routability, noise) in deep-submicron
design
Placement determines interconnect to the first order
Need placement information even in early design stages (e.g., logic synthesis)
Need to have a good placement solution
– Placement problem becomes significantly larger
– Cong et al. [ASPDAC-03, ISPD-03, ICCAD-03] point out that existing placers
are far from optimal, not scalable, and not stable
[© He]
Placement can Make A Difference
MCNC benchmark circuit e64 (contains 230 4-LUTs), placed onto an
FPGA.
[Figure panels: random initial placement; final placement; after detailed routing]
[© He]
Design Types
ASICs
– Lots of fixed I/Os, few macros, millions of standard cells
– Placement densities : 40-80% (IBM)
– Flat and hierarchical designs
SoCs
– Many more macro blocks, cores
– Datapaths + control logic
– Can have very low placement densities : < 20%
Micro-Processor (μP) Random Logic Macros (RLM)
– Hierarchical partitions are placement instances (5-30K)
– High placement densities: 80%-98% (low whitespace)
– Many fixed I/Os, relatively few standard cells
– Recall “Partitioning w Terminals” DAC ’99, ISPD ’99, ASPDAC ’00
[© He]
Requirements for Placers
Must handle 4-10M cells, 1000s macros
– 64 bits + near-linear asymptotic complexity
– Scalable/compact design database (OpenAccess)
Accept fixed ports/pads/pins + fixed cells
Place macros, esp. with var. aspect ratios
– Non-trivial heights and widths
(e.g., height=2rows)
Honor targets and limits for net length
Respect floorplan constraints
Handle a wide range of placement densities
(from <25% to 100% occupied), ICCAD `02
[© He]
Placement Footprints:
Standard Cell:
Data Path:
IP - Floorplanning
[© He]
Placement Footprints:
Core
Reserved areas
IO
Control
Mixed Data Path &
sea of gates:
[© He]
Placement Footprints:
Perimeter IO
Area IO
[© He]
Unconstrained
Placement
[© He]
Floor planned
Placement
[© He]
VLSI Global Placement
Examples
bad
placement
good
placement
[© He]
Placement Algorithms
Top-Down
– Partitioning-based placement
– Recursive bi-partitioning or quadrisection
– Cut direction?
– Partition vs. physical location
Iterative
– Simulated annealing
– OR: Force directed
– Start with an initial placement, iteratively
improve wire-length / area
Constructive
– Start with a few cells in the center, and
place highly connected adjacent modules
around them
Simulated Annealing Placement
Cost
– Area (usually fixed # of rows, variable row width)
– Wirelength (Euclidean or Manhattan)
– Cell overlap (penalty increases with temperature)
Moves
– Exchange two cells within a radius R
(R temperature dependent?)
– Displace a cell within a row
– Flip a cell horizontally
Low vs. High temperature
– If used as a post processing, start with low-temp
Post-processing?
– Might be needed if there are still overlaps
Case Study: TimberWolf
C. Sechen and A. Sangiovanni-Vincentelli, “The TimberWolf Placement and Routing Package”, IEEE Journal of Solid-State Circuits, vol. SC-20, no. 2, 1985, pp. 510-522
C. Sechen and A. Sangiovanni-Vincentelli, “TimberWolf 3.2: A New Standard Cell Placement and Global Routing Package”, 23rd DAC, 1986, pp. 432-439
TimberWolf
Stage 1
– Modules are moved between different rows as well as within the same row
– Module overlaps are allowed
– When the temperature is reduced below a certain value, stage 2 begins
Stage 2
– Remove overlaps
– Annealing process continues, but only interchanges adjacent modules within the same row
[© He]
Solution Space
All possible arrangements of modules into
rows possibly with overlaps
Neighboring Solutions
Three types of moves:
M1: Displace a module to a new location
M2: Interchange two modules
M3: Change the orientation of a module
[Figure: modules 1-4 illustrating the three move types, including the axis of reflection]
[© He]
Move Selection
M1: Displacement
M2: Interchange
M3: Reflection
TimberWolf first tries to select a move between M1 and M2
Prob(M1)=4/5
Prob(M2)=1/5
If a move of type M1 is chosen (for a certain module) and
it is rejected, then a move of type M3 (for the same
module) will be chosen with probability 1/10
Restrictions on:
How far a module can be displaced
What pairs of modules can be interchanged
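The move-selection probabilities can be sketched as below. This is not TimberWolf's code: the function names are invented for illustration, and acceptance, rejection, and the range restrictions are left out.

```python
import random

def choose_move(rng):
    """Pick M1 (displacement) with probability 4/5, otherwise M2 (interchange)."""
    return "M1" if rng.random() < 4 / 5 else "M2"

def fallback_after_rejected_m1(rng):
    """After a rejected M1, try M3 (reflection) on the same module with probability 1/10."""
    return "M3" if rng.random() < 1 / 10 else None

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {"M1": 0, "M2": 0}
for _ in range(10_000):
    counts[choose_move(rng)] += 1
print(counts)  # roughly 8000 M1 and 2000 M2
```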
[© He]
Move Restriction
Range Limiter
– At the beginning, the rectangular window R is very large, big enough to contain the whole chip
– The window size shrinks slowly as the temperature decreases. In fact, the
height and width of R ∝ log(T)
– Stage 2 begins when the window size is so small that no inter-row
module interchanges are possible
Cost Function
Cost = $C_1 + C_2 + C_3$
– $C_1 = \sum_i (a_i w_i + b_i h_i)$, where $w_i$ and $h_i$ are the width and height of the bounding box of net i
– $a_i$, $b_i$ are horizontal and vertical weights, respectively
– $a_i = 1$, $b_i = 1$ gives the half-perimeter of the bounding box
– Critical nets: increase both $a_i$ and $b_i$
– Double metal technology: over-the-cell routing is possible, so
fewer feed-through cells are needed
– Vertical wiring is “cheaper” than horizontal wiring, so use
smaller vertical weights, i.e. $b_i < a_i$
[© He]
Cost Function (Cont’d)
C2: Penalty function for module overlaps
O(i,j) = amount of overlap in the X-dimension
between modules i and j
$C_2 = \sum_{i \neq j} \left[ O(i,j) + a \right]^2$
a: offset parameter to ensure $C_2 \to 0$ when $T \to 0$
C3: Penalty function that controls the row lengths
Desired row length = d(r)
l(r) = sum of the widths of the modules in row r
$C_3 = b \sum_r \left| l(r) - d(r) \right|$
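The three cost terms can be sketched as follows, reading C2 as a sum of squared (overlap + offset) terms and C3 as a weighted sum of absolute row-length errors. Two choices here are assumptions, not taken from the slides: each overlapping pair is counted once, and pairs with no overlap contribute nothing. The data layouts are also hypothetical.

```python
def c1(nets):
    """Weighted wire length; nets is a list of (a_i, b_i, w_i, h_i) tuples."""
    return sum(a * w + b * h for a, b, w, h in nets)

def x_overlap(m, n):
    """Overlap in x between two modules given as (x_left, width)."""
    return max(0, min(m[0] + m[1], n[0] + n[1]) - max(m[0], n[0]))

def c2(row_modules, offset):
    """Overlap penalty for the modules of one row (assumed pair counting)."""
    total = 0
    for i, m in enumerate(row_modules):
        for n in row_modules[i + 1:]:
            overlap = x_overlap(m, n)
            if overlap > 0:
                total += (overlap + offset) ** 2
    return total

def c3(rows, desired, weight):
    """Row-length penalty; rows holds module widths per row, desired holds d(r)."""
    return weight * sum(abs(sum(widths) - d) for widths, d in zip(rows, desired))

nets = [(1, 1, 4, 2), (2, 2, 3, 1)]
row_modules = [(0, 4), (3, 4), (10, 2)]   # the first two overlap by 1
rows, desired = [[4, 4, 2], [5]], [8, 8]

cost = c1(nets) + c2(row_modules, offset=1) + c3(rows, desired, weight=2)
print(cost)  # 14 + 4 + 10 = 28
```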
Annealing Schedule
– $T_k = r(k) \cdot T_{k-1}$, k = 1, 2, 3, …
– r(k) increases from 0.8 to a maximum of 0.94 and
then decreases to 0.1
– At each temperature, a total number of K•n
attempts is made
– n= number of modules
– K= user specified constant
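A sketch of the schedule follows. The slides give only the end points of r(k) (0.8, 0.94, 0.1), so the piecewise-linear shape and the `peak_at` parameter are assumptions made here to get something runnable.

```python
def cooling_rate(k, num_steps, peak_at=0.5):
    """Hypothetical r(k): linear 0.8 -> 0.94 up to the peak, then linear 0.94 -> 0.1."""
    frac = k / num_steps
    if frac <= peak_at:
        return 0.8 + (0.94 - 0.8) * (frac / peak_at)
    return 0.94 - (0.94 - 0.1) * ((frac - peak_at) / (1 - peak_at))

def schedule(t0, num_steps):
    """Temperatures T_0 .. T_num_steps with T_k = r(k) * T_(k-1)."""
    temps = [t0]
    for k in range(1, num_steps + 1):
        temps.append(cooling_rate(k, num_steps) * temps[-1])
    return temps

def attempts_per_temperature(K, n):
    """K * n move attempts at each temperature (n modules, user constant K)."""
    return K * n

temps = schedule(t0=1000.0, num_steps=20)
print(["%.3g" % t for t in temps[:5]])
print(attempts_per_temperature(K=100, n=500))
```

Since r(k) stays below 1 throughout, the temperature falls at every step. It falls slowest near the peak of r(k), which is where the schedule spends most of its effort.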
[© He]
Force-Directed Placement
Model
– Wires simulated as springs
(if the only force, what will happen?)
Forceij = Weightij x distanceij.
– Cell sizes as repellant forces
– [Eisenmann, DAC’98]:
“vacant” regions work as “attracting” forces
“overcrowded” regions work as “repelling” forces
Algorithm
– Solve a set of linear equations to find an intermediate solution
(module locations)
– Repeat the process until equilibrium
Force-Directed Placement (cont.)
Model (details):
– Cell distances: either
– OR:
– Forces:
– Objective: find x,y coordinates for all cells such that total force exerted
on each cell is zero.
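The zero-force condition leads to a linear system: at equilibrium each free cell sits at the weighted average of its neighbors. The sketch below solves a 1-D instance by Gauss-Seidel iteration, an iterative solver picked here for brevity. It assumes fixed pads at both ends and at least one spring per free cell; y coordinates would be handled the same way.

```python
def equilibrium(fixed, free, springs, iterations=200):
    """Spring equilibrium in one dimension.

    fixed: {name: x} for fixed pads, free: list of movable cell names,
    springs: {(u, v): weight}. All names are hypothetical.
    """
    x = dict(fixed)
    x.update({name: 0.0 for name in free})
    neighbors = {name: [] for name in free}
    for (u, v), w in springs.items():
        if u in neighbors:
            neighbors[u].append((v, w))
        if v in neighbors:
            neighbors[v].append((u, w))
    for _ in range(iterations):
        for name in free:
            total_w = sum(w for _, w in neighbors[name])
            # Zero net force: sum_j w_ij * (x_j - x_i) = 0
            x[name] = sum(w * x[j] for j, w in neighbors[name]) / total_w
    return {name: x[name] for name in free}

pos = equilibrium(
    fixed={"pad_left": 0.0, "pad_right": 12.0},
    free=["c1", "c2"],
    springs={("pad_left", "c1"): 1.0, ("c1", "c2"): 1.0, ("c2", "pad_right"): 1.0},
)
print(pos)  # c1 near 4, c2 near 8
```

Without the fixed pads, the only zero-force solution puts every cell at the same point. That is why spring forces alone are not enough.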
Force-Directed Placement (cont.)
Avoiding overlaps or collapsing in one point?
– Use fixed boundary I/O cells
– Use repelling force between cells that are not connected by a net
– Do not allow a move that results in overlap
– Use repelling “field” forces from congested areas to sparse ones
[Eisenmann, DAC’98]
Problems with force directed:
– Overlap still might occur (cell sizes are modeled artificially)
– Flat design, no hierarchy
Partitioning-based Placement
Simultaneously perform:
– Circuit partitioning
– Chip area partitioning
– Assign circuit partitions to chip slots
Problem:
– Circuit partitioning unaware of the physical location
– Solution: Terminal propagation (add dummy terminals)
[She99] p.239
Partitioning-based Placement
More problems:
– Direction of the cut? [Yildiz, DAC’01]
[Figure: cut-direction examples, panels (a)-(d)]
– How to handle fixed blocks? (area assigned to a partition might
not be enough)
– How to correct a bad decision made at a higher level?
Advantages:
– Hierarchical, scalable
– Inherently apt for congestion minimization, easily extendable to
timing optimization