System on Chip - Ohio University


Digital Integrated Circuits: A Design Perspective
System on a Chip Design
Application-Specific Integrated Circuits: Introduction
Jun-Dong Cho
SungKyunKwan Univ., Dept. of ECE, VADA Lab.
http://vada.skku.ac.kr
Contents
• Why ASIC?
• Introduction to System-on-Chip design
• Hardware and software co-design
• Low-power ASIC designs
Why ASIC? The design-productivity gap grows!
• Complexity increases 40% per year
• Design productivity increases only 15% per year
• Integration of a PCB onto a single die
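A quick sketch of why this gap matters: at the growth rates quoted above, the shortfall between what can be built and what can be designed compounds year over year. (The rates are from the slide; the compounding model is the usual illustrative one.)

```python
# Design-productivity gap: complexity compounds at 40%/yr,
# productivity at only 15%/yr, so the gap widens multiplicatively.
for years in (1, 5, 10):
    gap = 1.40**years / 1.15**years
    print(f"after {years:2d} yr: complexity outgrows productivity by {gap:.1f}x")
```

After a decade the gap is roughly sevenfold, which is the economic argument for design reuse and higher abstraction levels made throughout this deck.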
Silicon in 2010
Die area: 2.5 × 2.5 cm; voltage: 0.6 V; technology: 0.07 µm

Memory          Density (Gbits/cm²)   Access time (ns)
DRAM            8.5                   10
DRAM (Logic)    2.5                   10
SRAM (Cache)    0.3                   1.5

Logic           Density (Mgates/cm²)  Max. ave. power (W/cm²)  Clock rate (GHz)
Custom          25                    54                       3
Std. Cell       10                    27                       1.5
Gate Array      5                     18                       1
Single-Mask GA  2.5                   12.5                     0.7
FPGA            0.4                   4.5                      0.25
ASIC Principles
• Value-added ASICs for huge-volume opportunities; standard parts for quick time-to-market applications
• Economics of design
  – fast prototyping, low volume
  – custom design, labor intensive, high volume
• CAD tools needed to achieve the design strategies
  – system-level design: concept to VHDL/C
  – physical design: VHDL/C to silicon, timing closure (Monterey, Magma, Synopsys, Cadence, Avant!)
• Design strategies: hierarchy, regularity, modularity, locality
ASIC Design Strategies
• Design is a continuous tradeoff to achieve performance specs with adequate results in all the other parameters
• Performance specs – function, timing, speed, power
• Size of die – manufacturing cost
• Time to design – engineering cost and schedule
• Ease of test generation & testability – engineering cost, manufacturing cost, schedule
ASIC Flow
Structured ASIC Designs
• Hierarchy: subdivide the design into many levels of sub-modules
• Regularity: subdivide into the maximum number of similar sub-modules at each level
• Modularity: define sub-modules unambiguously, with well-defined interfaces
• Locality: maximize local connections, keeping critical paths within module boundaries
ASIC Design Options
• Programmable logic
• Programmable interconnect
• Reprogrammable gate arrays
• Sea-of-gates & gate-array design
• Standard-cell design
• Full custom mask design
• Symbolic layout
• Process migration – retargeting designs
ASIC Design Methodologies

                    Custom     Cell-based  Prediffused  Prewired
Density             Very High  High        High         Medium–Low
Performance         Very High  High        High         Medium–Low
Flexibility         Very High  High        Medium       Low
Design time         Very Long  Short       Short        Very Short
Manufacturing time  Medium     Medium      Short        Very Short
Cost – low volume   Very High  High        High         Low
Cost – high volume  Low        Low         Low          High
Why SOC?
• SOC specs come from system engineers rather than RTL descriptions
• SOC will bridge the hardware/software gap and their implementation in novel, energy-efficient silicon architectures
• In SOC design, chips are assembled at the level of IP blocks (design reuse) and IP interfaces rather than at the gate level
CMOS density now allows complete system-on-a-chip solutions
Dedicated logic P core
RAM & ROM
DMA
phone
phone
book
book
keypad
intfc
S/P
control
protocol
Source:
Brodersen, ICASSP ‘98
Also like to add

Demod
and
sync
Viterbi
Equal.
speech
quality
enhancement
A
D
digital
down
conv
Analog
de-intl
&
decoder
voice
recognition

RPE-LTP
speech
decoder
DSP core
How do we design these chips?
FPGA
Reconfigurable
Interconnect
Possible Single-Chip Radio Architectures
Software Radio
  GOAL: simplify the system design process — seek architectures flexible enough that hardware and protocols can be designed independently
  APPROACH: minimize the use of dedicated logic
Universal Radio
  GOAL: maximize bandwidth efficiency and battery life — seek architectures that perform complex algorithms very fast with minimal energy
  APPROACH: minimize the use of programmable logic
Why is SOC design so scary?
60 GHz SiGe Transceiver for Wireless LAN Applications
A low-power 30 GHz LNA is designed as the front end of the receiver. A wideband, high-gain response is realized by a 2-stage design using a stagger-tuned technique. The simulated performance predicts a forward gain of |S21| > 20 dB over a 6 GHz range, with an input match of |S11| < −30 dB and an output match of |S22| < −10 dB. The mixer consists of a single-balanced Gilbert cell. A fully integrated differential 25 GHz VCO is used, in conjunction with the mixer, to downconvert the RF input to a 5 GHz IF.
[Figure: 30 GHz receiver layout consisting of the LNA, mixer, and VCO]
Wideband CMOS LC VCO
A 1.8 GHz wideband LC VCO implemented in 0.18 µm bulk CMOS has been successfully designed, fabricated, and measured. This VCO uses a 4-bit array of switched capacitors and a small accumulation-mode varactor to achieve a measured tuning range exceeding 2:1 (73%) and a worst-case tuning sensitivity of 270 MHz/V. The amplitude reference level is programmable by means of a 3-bit DAC.
[Figure: VCO die photograph]
A High-Level View of an Industry-Standard Design Flow
(source: Hitachi, Prof. R. W. Brodersen)
Front-end: HDL entry → synthesis
Back-end: floor-plan → place & route → physical verification (DRC & LVS) → done
(with a "good?" check, and possible loop-back, after every step)
Problems with this flow:
• Every step can loop back to every other step
• Each step can take hours or days for a 100,000-line description
• The HDL description contains no physical information
• Different engineers handle the front-end and back-end design
How have semiconductor companies made this flow work?
A More Accurate Picture of the Standard Flow
(Source: IBM Semiconductor, Prof. R. Newton)
• Architecture (10 months): partition the chip into functional units and generate bit-true test vectors to specify the behavior of each unit. TOOLS: Matlab, C, SPW, (VCC). FREEZE the test vectors.
• Front-end (10 months): enter HDL code that matches the test vectors. TOOLS: HDL simulators, Design Compiler. FREEZE the HDL code.
• Back-end (2 months): create a floor-plan and tweak the tools until a successful mask layout is created. TOOLS: Design Compiler, floor-planners, placers, routers, clock-tree generators, physical verification.
• Fabrication (2 months)
How can we improve this flow?
Common Fabric for IP Blocks
• Soft IP blocks are portable, but not as predictable as hard IP.
• Hard IP blocks are very predictable, since a specific physical implementation can be characterized, but they are hard to port since they are often tied to a specific process.
• A common fabric is required for both portability and predictability.
• Wide availability: Cell-Based Array, a metal-programmable architecture that provides the performance of a standard cell and is optimized for synthesis.
Four main applications
• Set-top box: mobile multimedia system, base station for the home local-area network
• Digital PC-TV: concurrent use of TV, 3D graphics, and Internet services
• Set-top box LAN service: wireless home networks, multi-user wireless LAN
• Navigation system: steering and control of traffic and/or goods transportation
PC-Multimedia Applications
Types of System-on-a-Chip Designs
Physical gap
• Timing-closure problem: layout-driven logic and RT-level synthesis
• Energy efficiency requires locality of computation and storage: a match for stream-based data processing of speech, images, and multimedia-system packets.
• Next-generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers bridge the physical gap.
Circular Y-Chart
SOC Co-Design Challenges
• Current systems are complex and heterogeneous, containing many different types of components
• Half of the chip can be filled with 200 low-power, RISC-like processors (ASIPs) interconnected by field-programmable buses, embedded in 20 Mbytes of distributed DRAM and flash memory; the other half: ASIC
• Computational power will come not from multi-GHz clocking but from parallelism, with clocks below 200 MHz. This greatly simplifies design for correct timing, testability, and signal integrity.
Bridging the architectural gap
• One million gates of reconfigurable logic, one million gates of hardwired logic
• 50 GIPS for programmable components, or 500 GIPS for dedicated hardware
• Product reliability: design at a level far above the RT level, with reuse factors in excess of 100
• Trade-off: 100 MOPS/watt (microprocessor) vs. 100 GOPS/watt (hardwired); reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)
Why Lower Power?
• Portable systems
  – long battery life
  – light weight
  – small form factor
• IC priority list
  – power dissipation
  – cost
  – performance
• Technology direction: reduced voltage/power designs based on mature high-performance IC technology, with high integration to minimize size, cost, and power
Microprocessor Power Dissipation
[Plot: power (W) vs. year, 1980–2000 — from the i286 and i386 DX below 5 W, through the i486 family, Pentium (P5/P6), PowerPC 601/604/750, and Pentium II/III, up to the Alpha 21064/21164/21264 near 50 W]
Levels for Low Power Design
• System: hardware–software partitioning, power-down
• Algorithm: complexity, concurrency, locality, regularity, data representation
• Architecture: parallelism, pipelining, signal correlations, instruction-set selection, data representation
• Circuit/Logic: sizing, logic style, logic design
• Technology: threshold reduction, scaling, advanced packaging, SOI
Possible Power Savings at Different Design Levels

Level of Abstraction   Expected Saving
Algorithm              10 – 100 times
Architecture           10 – 90%
Logic Level            20 – 40%
Layout Level           10 – 30%
Device Level           10 – 30%
Power-hungry Applications
• Signal compression: HDTV standard, ADPCM, vector quantization, H.263, 2-D motion estimation, MPEG-2 storage management
• Digital communications: shaping filters, equalizers, Viterbi decoders, Reed–Solomon decoders
New Computing Platforms
P ≈ k·C·F·V²
• SOC power efficiency of more than 10 GOPS/W
  – higher on-chip system integration: COTS: 100 W, SOC: 10 W (inter-chip capacitive loads, I/O buffers)
  – speed & performance: shorter interconnect, fewer drivers, faster devices, more efficient processing architectures
• Mixed-signal systems
• Reuse of IP blocks
• Multiprocessor, configurable computing
• Domain-specific, combined memory–logic
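The dynamic-power relation P ≈ k·C·F·V² above is what makes on-chip integration pay off: both the switched capacitance C and the supply V drop. A minimal sketch, with made-up capacitance and activity numbers purely for illustration (only the quadratic V dependence is from the slide):

```python
# Dynamic (switching) power: activity factor k, switched capacitance C (F),
# clock frequency F (Hz), supply voltage V (V).  Illustrative values only.
def dynamic_power(k, C, F, V):
    return k * C * F * V**2

p_cots = dynamic_power(k=0.2, C=10e-9, F=100e6, V=3.3)  # board-level: big C, 3.3 V
p_soc  = dynamic_power(k=0.2, C=2e-9,  F=100e6, V=1.0)  # on-chip: small C, 1.0 V
print(f"COTS: {p_cots:.2f} W, SOC: {p_soc:.2f} W, ratio: {p_cots/p_soc:.0f}x")
```

Note that the voltage term alone accounts for a factor of (3.3/1.0)² ≈ 11 of the difference; the rest comes from eliminating inter-chip capacitive loads and I/O buffers.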
Low Power Design Flow I
[Flow diagram: system-level specification → function partitioning and HW/SW allocation (system-level power analysis) → behavioral description → power-driven behavioral transformation (behavioral-level power analysis) → power-conscious behavioral description → high-level synthesis and optimization → RT-level design (RT-level power analysis). Software functions go through processor selection, software-level power analysis, and software optimization.]
Low Power Design Flow II
[Flow diagram: RT-level description (data path, controller, memory, processor control and steering logic) → RTL mapping against an RTL macrocell library → logic synthesis and optimization with a standard-cell library → gate-level description (gate-level power analysis) → switch-level description (switch-level power analysis).]
Three Factors Affecting Energy
• Reducing waste by hardware simplification: redundant-hardware extraction, locality of reference, demand-driven/data-driven computation, application-specific processing, preservation of data correlations, distributed processing
• All-in-one approach (SOC): I/O pin and buffer reduction
• Voltage-reducible hardware
  – 2-D pipelining (systolic arrays)
  – SIMD parallel processing: useful for data with parallel structure
  – VLIW: flexible approach
IBM’s PowerPC Low-Power Architecture
• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
  – the 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
  – the FPU is pipelined, so a multiply–add instruction can be issued every clock cycle
  – low-power 3.3-volt design
• Small, simple instructions with smaller instruction length
  – IBM’s PowerPC 603e is RISC
• Superscalar: CPI < 1
  – the 603e issues as many as three instructions per cycle
• Low-power management
  – the 603e provides four software-controllable power-saving modes
• Copper process with SOI
  – IBM’s Blue Logic ASIC: the new design reduces power by a factor of 10
Power-Down Techniques
Lowering the voltage along with the clock actually lowers the energy per operation of the microprocessor, reducing the energy required to perform a fixed amount of work.
Implementing Digital Systems: H/W and S/W Co-design
Three Co-Design Approaches
(IFIP International Conference FORTE/PSTV ’98, Nov. ’98, N. S. Voros et al., “Hardware software co-design of embedded systems using multiple formalisms for application development”)
• ASIP co-design: builds a specific programmable processor for an application and translates the application into software code. HW/SW partitioning includes the instruction-set design.
• HW/SW synchronous-system co-design: a software processor as a master controller and a set of hardware accelerators as co-processors. Vulcan, Codes, Tosca, Cosyma.
• HW/SW for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors. Behavioral decomposition, process allocation, and communication transformation. CoWare (powerful), Siera (reuse), Ptolemy (DSP).
Mixing H/W and S/W
• Argument: mixed hardware/software systems represent the best of both worlds — high performance, flexibility, design reuse, etc.
• Counterpoint: from a design standpoint, it is the worst of both worlds
  – Simulation: problems of verification and test become harder
  – Interface: too many tools, too many interactions, too much heterogeneity
  – Hardware/software partitioning is “AI-complete”!
  – (MIT, Stanford: by analogy with “NP-complete”) A term used to describe problems in artificial intelligence, indicating that a solution presupposes a solution to the “strong AI problem” (that is, the synthesis of human-level intelligence). A problem that is AI-complete is just too hard.
Low-Power Partitioning Approach
• Different hardware resources are invoked according to the instruction executed at a specific point in time
• During execution of an add operation, the ALU and registers are used, but the multiplier is idle
• Non-active resources still consume energy, since the corresponding circuits continue to switch
• Calculate the wasted energy
• Add application-specific cores and run them selectively: whenever one core is performing, all the other cores are shut down
ASIP (Application-Specific Instruction Processor) Design
• Given a set of applications, determine the micro-architecture of the ASIP (i.e., the configuration of functional units in the datapaths and the instruction set)
• To accurately evaluate the performance of the processor on a given application, one must compile the application program onto the processor datapath and simulate the object code
• The micro-architecture of the processor is a design parameter!
ASIP Design Flow
Cross-Disciplinary Nature
• Software for low power: loop transformations lead to much higher temporal and spatial locality of data. Code size becomes an important objective; software will eventually become a part of the chip.
• Behavior–platform–compiler co-design: co-designed with C++ or Java, describing both the hardware and software implementation
• Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute, http://www.eesi.tue.nl/english)
VLSI Signal Processing Design Methodology
• Pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering
• Bit-serial, bit-parallel, and digit-serial architectures; carry-save architecture
• Redundant and residue systems
• Viterbi decoders, motion compensation, 2-D filtering, and data-transmission systems
Low Power DSP
• DO-loop dominant workloads:
  – VSELP vocoder: 83.4%
  – 2-D 8×8 DCT: 98.3%
  – LPC computation: 98.0%
• DO-loop power minimization ⇒ DSP power minimization
(VSELP: Vector-Sum Excited Linear Prediction; LPC: Linear Predictive Coding)
Deep-Submicron Design Flows
• Rapid evaluation of complex designs for area and performance
• Timing convergence via estimated routing parasitics
• In-place timing repair without resynthesis
• Shorter design intervals, minimum iterations
• Block-level design and place-and-route
• Localized changes without disturbance
• Integration of complex projects and design reuse
SOC CAD Companies
• Avant!  www.avanticorp.com
• Cadence  www.cadence.com
• Duet Tech  www.duettech.com
• Escalade  www.escalade.com
• Logic Vision  www.logicvision.com
• Mentor Graphics  www.mentor.com
• Palmchip  www.palmchip.com
• Sonics  www.sonicsinc.com
• Summit Design  www.summit-design.com
• Synopsys  www.synopsys.com
• Top Down Design Solutions  www.topdown.com
• Xynetix Design Systems  www.xynetix.com
• Zuken-Redac  www.redac.co.uk
Design Technology for Low Power Radio Systems
Rhett Davis
Dept. of EECS, Univ. of Calif., Berkeley
http://bwrc.eecs.berkeley.edu
Domain of Interest
• Highly integrated system-on-a-chip solutions (SOCs)
• Wireless communications with associated processing, e.g., multimedia processing, compression, switching, etc.
• Primary computation is high-complexity dataflow with a relatively small amount of control
Why Systems-on-a-Chip (SOC)?
State-of-the-art CMOS is easily able to implement complete systems (what was on a board before)
– A microprocessor core is only 1–2 mm² (1–2% of the area of a $4 chip)
– Portability (size) is critical to meet the cost, power, and size requirements of future wireless systems
– Chips will be required to support the complete application (wireless internet, multimedia)
– Dedicated stand-alone computation is replacing general-purpose processors as the semiconductor-industry driver
Cellular Phones: An Example
[Block diagram: small-signal RF, power RF, power management, analog baseband, digital baseband (DSP + MCU)]
Digital cellular market (phones shipped): 1996: 48M, 1997: 86M, 1998: 162M, 1999: 260M, 2000: 435M
(Courtesy Mike McMahon, Texas Instruments)
Cellular Phone Baseband SOC
[Die plot: ROM, RAM, MCU, DSP, gates, analog]
2000+ phones on each 8″ wafer @ 0.15 µm Leff — 1 million baseband chips per day!
(Courtesy Mike McMahon, Texas Instruments)
Wireless System Design Issues
• It is now possible to use CMOS to integrate all digital radio functions — but what is the “best” architectural way to use CMOS?
• Computation rates for wireless systems will easily range up to 100s of GOPS in signal processing
  – What’s keeping us from achieving this in silicon?
  – What can we do about it?
Computational Efficiency Metrics
• Definition: MOPS — millions of algorithmically defined arithmetic operations (e.g., multiply, add, shift); in a GP processor, several instructions per “useful” operation
• Figures of merit
  – MOPS/mW — energy efficiency (battery life)
  – MOPS/mm² — area efficiency (cost)
Optimizing these “efficiencies” is the basic goal, assuming functionality is met
Energy-Efficiency of Architectures
[Chart: energy efficiency (MOPS/mW or MIPS/mW, log scale 0.1–1000) vs. flexibility (coverage)]
• Dedicated HW, direct-mapped: 100–1000 MOPS/mW
• Reconfigurable processor/logic: potential of 10–100 MOPS/mW
• ASIPs, DSPs: 1–10 MIPS/mW
• Embedded processors / microprocessors: 0.1–1 MIPS/mW
Software Processors: Energy Trends
[Plot: clock frequency (MHz) vs. year, 1991–1996 — from the i386/i486 at 33–66 MHz through the Pentium, PowerPC, MIPS, HP PA, and UltraSparc families, up to the Alpha 21164 at 300 MHz]
The primary means of performance increase for software processors has been raising the clock rate — at the cost of decreasing energy efficiency (E = C·V_DD²)
Software Processors: Area Trends
• Increasing clock rate results in a memory bottleneck — addressed by bringing memory on-chip
• Area is increasingly dominated by memory — degrading MOPS/mm²
• A 16×16 multiplier takes 0.05 mm²; a DSP processor with one multiplier takes 25 mm²
Why time-multiplex to save area if the overhead is much greater than the area saved?
Parallelism is the answer, but…
• Not by putting von Neumann processors in parallel and programming with a sequential language
  – attempts to do this have failed over and over again
  – the parallel-computer compiler problem is very difficult
• Not by trying to capture parallelism at the instruction level
  – superscalar, VLIW, etc. are very inefficient
  – hardware can’t figure out the parallelism from a sequential language either
The problem is the initial sequential description (e.g., C), which is poorly matched to highly parallel applications
What is really happening…
Starting with a parallel algorithmic description, re-entering it using a sequential description, and then trying to rediscover the parallelism:

for (i = 0; i < num; i++) {
    a = a * c[i];
    b[i] = sin(a * pi) + cos(a * pi);
}
outfil = b[num - 1] * indata;

We take this path so that we can use an architecture that is orders of magnitude less efficient in energy and area??
What can a fully parallel CMOS solution potentially do?
• In 0.25 µm, a multiplier requires 0.05 mm² and 7 pJ per operation at 1 V. Adders and registers are about 10 times smaller and 10 times lower energy.
• Let’s implement a 50 mm², 0.25 µm chip using adders, registers, and multipliers
• We can have 2000 adders/registers and 200 multipliers in less than half of the chip; also assume 1/3 of the power goes into clocks
• A 25 MHz clock (1 V) gives ~50 GOPS at ~100 mW
• That is ~500 MOPS/mW and ~1000 MOPS/mm²
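The arithmetic above can be checked in a few lines; the per-unit figures are from the slide, and the assumption that every unit does one operation per cycle is the slide's implicit fully parallel operating point:

```python
# Back-of-envelope check of the fully parallel 0.25 um, 1 V estimate.
MULT_E, ADD_E = 7e-12, 0.7e-12    # J/op: adders/registers ~10x lower energy
n_add, n_mult = 2000, 200
f = 25e6                           # 25 MHz clock

ops_per_sec = (n_add + n_mult) * f               # every unit busy each cycle
logic_power = (n_add * ADD_E + n_mult * MULT_E) * f
total_power = logic_power * 1.5                  # clocks take 1/3 of the total

gops = ops_per_sec / 1e9
mops_per_mw  = (ops_per_sec / 1e6) / (total_power * 1e3)
mops_per_mm2 = (ops_per_sec / 1e6) / 50          # 50 mm^2 die
print(f"{gops:.0f} GOPS at {total_power*1e3:.0f} mW: "
      f"{mops_per_mw:.0f} MOPS/mW, {mops_per_mm2:.0f} MOPS/mm^2")
```

This lands at roughly 55 GOPS and 105 mW, i.e., the ~50 GOPS / ~100 mW / ~500 MOPS/mW / ~1000 MOPS/mm² quoted on the slide.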
Start with a parallel description of the algorithm, then directly map it into hardware.
[Diagram: dataflow graph mapped onto S reg, X reg, Mult1, Mult2, Mac1, Mac2, and an add/sub/shift unit]
Results in fully parallel solutions

                        64-pt FFT:       16-state Viterbi:   64-pt FFT:          16-state Viterbi:
                        energy per       energy per          transforms/s per    decode rate per
                        transform (nJ)   decoded bit (nJ)    area (Trans/ms/mm²) area (kb/s/mm²)
Direct-mapped hardware  1.78             0.022               2,200               200,000
FPGA                    683              5.5                 1.8                 100
Low-power DSP           436              19.6                4.3                 50
High-performance DSP    1,700            108                 10                  150

(numbers taken from vendor-published benchmarks)
Orders of magnitude lower efficiency, even for an optimized processor architecture
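The "orders of magnitude" claim follows directly from the FFT energy column of the table above:

```python
# Energy per 64-point FFT transform (nJ), from the benchmark table.
fft_energy_nj = {
    "direct-mapped": 1.78,
    "FPGA": 683,
    "low-power DSP": 436,
    "high-performance DSP": 1700,
}
base = fft_energy_nj["direct-mapped"]
for arch, e in fft_energy_nj.items():
    print(f"{arch:22s} {e/base:6.0f}x energy per transform")
```

Even the best processor option (the low-power DSP) spends over 200× the energy of direct-mapped hardware per transform, and the high-performance DSP nearly 1000×.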
Reasons software solutions seem attractive
(1) Believed to reduce time-to-system-implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they can’t
change
(4) Difficulty in getting dedicated SOC chips designed
Are these good reasons???
(1) Believed to reduce time-to-system-implementation
• Software decreases the time to get a first prototype, but the time to a fully verified system is much longer (hardware is often ready, but software still needs to be done)
• Limitations of the software prototype often set the ultimate limit of the system performance
• Software solutions can be shipped with bugs — not a real option for SOC
(2) Need flexibility
• Software is not always flexible
  – can be hard to verify
• Flexibility does not imply software programmability
  – domain-specific design can have multiple modules, coefficients, and local state control (the factor of 100 in efficiency) to address a range of applications
  – reconfiguration of interconnect can achieve flexibility with high levels of efficiency
Flexibility without software
[Log–log plots vs. FFT size (all results scaled to 0.18 µm): energy per transform (J), and transforms per second per mm², for function-specific reconfigurable hardware, data-path reconfigurable processors, FPGAs, low-power DSPs, and high-performance DSPs]
Reasons software solutions seem attractive
(1) Believed to reduce time-to-system implementation
(2) Provides flexibility
(3) Locks the customers into an architecture they can’t
change
(4) Difficulty in getting dedicated SOC chips designed
Standard DSP-ASIC Design Flow
[Flow: algorithm design (floating-point simulation; sequential) → system/architecture design (fixed-point simulation; mixed sequential & structural) → hardware/front-end design (RTL code; integer-only, structural with sequential leaf cells) → physical/back-end design (mask layout; single-wire connectivity with timing constraints)]
Problems:
• Three translations of design data
• Requirements for re-verification at each stage
• Uncontrolled looping when the pipeline stalls
⇒ Prohibitively long design time for direct-mapped architectures
Direct Mapping Design Flow
[Flow: algorithm/system → front-end simulation → back-end floorplan → automated flow (RTL libraries) → mask layout, with performance estimates fed back]
• Encourages iterations of layout
• Controls looping
• Reduces the flow to a single phase
• Depends on fast automation
Déjà vu???
• An automated style of design with parameterized modules processed through foundries is just the reincarnation of good old Silicon Compilation of more than 10 years ago
• What happened?
  – a decline of research into design methodologies
  – a single dominant flow resulted: the Verilog–Synopsys–standard-cell flow
  – lack of tool flows to support alternative styles of design
  – the research community lost access to technology and moved to highly sub-optimal processor and FPGA solutions
Capturing Design Decisions
[Diagram: register file, MAC, add, and shift units]
Categories:
• Function — basic input–output behavior
• Signal — physical signals and types
• Circuit — transistors
• Floorplan — physical positions
How to get layout and performance estimates in a day?
Simplified View of the Flow
[Flow: dataflow graph → elaborate (with macro library) → netlist → merge with floorplan → autoLayout → route → layout]
New software:
• Generation of netlists from a dataflow graph
• Merging of the floorplan from the last iteration
• Automatic routing and performance analysis
• Automation of the flow as a dependency graph (UNIX MAKE program)
Why Simulink?
[Diagram: time-multiplexed FIR filter — SRAM (TAP_COEF in, D/A/Q/WEN ports), CONTROL (addr, wen, reset_acc), MAC (X, B, Z, RESET in; Y out)]
• Simulink is an easy sell to algorithm developers
• Closely integrated with the popular system design tool Matlab
• Successfully models digital and analog circuits
Modeling Datapath Logic
• Discrete-time (cycle-accurate)
• Fixed-point types (bit-true)
[Diagram: multiply/accumulate — inputs A, B → MULT (S12) → ADD (S18) → REG → output Z, with a MUX selecting CONST 0 on RESET]
• Completely specifies function and signal decisions
• No need for RTL
Modeling Control Logic
[Statechart: address generator / MAC reset — init (entry: addr=0; wen=1) → incr (during: addr++; reset_acc=0) → on [addr==15] → restart (entry: addr=0; wen=0; reset_acc=1)]
• Extended finite-state-machine editor
• Co-simulation with the dataflow graph
• New software: Stateflow-to-VHDL translator
• No need for RTL
Specifying Circuit Decisions
[Diagram: time-multiplexed FIR filter — the CONTROL block is treated as a black box whose RTL code comes from the Stateflow–VHDL translator; the SRAM and MAC macros map to RTL code, data-path generator code, or custom modules]
• Macro choices are embedded in the dataflow graph
• Cross-check simulations are required
Hierarchy Hardened Progressively
[Flow: system-level design environment → estimate performance (power, area, delay) → layout and characterize new hard macro → hard-macro characterization libraries]
• Macro characterization is saved for fast estimates
• Each level of hierarchy becomes a new hard macro
• Higher levels of hierarchy are adjusted
• When the top level of the hierarchy is hardened, the design is done
Capturing Floorplan Decisions
[Floorplan: parallel pipelined FIR filter]
• Commercial physical design tools are used
• Instance names in the floorplan match the dataflow graph
• Placements are merged on each iteration
• Manhattan distance can be used for parasitic estimates
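A minimal sketch of such a Manhattan-distance parasitic estimate. The per-mm capacitance and the pin coordinates below are assumed, illustrative numbers, not values from the slides:

```python
# Estimate wiring parasitics from floorplan placements using L1 distance.
def manhattan_mm(p1, p2):
    """Manhattan (L1) distance between two placements (x, y), in mm."""
    return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

C_PER_MM = 0.2e-12  # F/mm — assumed wiring capacitance per unit length

def estimated_wire_cap(src, dst):
    """Lumped-capacitance estimate for a net between two macro pins."""
    return manhattan_mm(src, dst) * C_PER_MM

# e.g., a MAC output at (0.3, 0.1) mm driving a register at (1.1, 0.6) mm
cap = estimated_wire_cap((0.3, 0.1), (1.1, 0.6))
print(f"estimated net capacitance: {cap*1e15:.0f} fF")
```

Because instance names in the floorplan match the dataflow graph, every net's endpoints are known before routing, so such estimates can drive early timing and power numbers.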
Reduced Impact of Interconnect
[Plot: wire delay / FO4 inverter delay vs. VDD (0.4–1.8 V) in 0.18 µm, for 1 mm and 5 mm M6 wires; the ratio stays below ~0.5]
Long wires can be modeled as lumped capacitances
Race-Immune Clock Tree Synthesis
t_skew(max) < t_clk-Q(min) − t_hold(max)
Hierarchical clock tree synthesis — example clock tree (0.18 µm, VDD = 1 V):
• Stages: 22
• Sinks: 7650
• Skew: 320 ps
• Clock power: 2.8 mW
• Logic power: 21 mW
• Race margin: 580 ps
Demonstrated on a 600k-transistor design
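The race-immunity inequality can be checked mechanically per clock tree. The slide gives only the skew (320 ps) and the resulting margin (580 ps), which together imply t_clk-Q(min) − t_hold(max) = 900 ps; the individual clk-to-Q and hold values below are assumed for illustration:

```python
# Min-delay (race) check: safe when max skew < min clk-to-Q - max hold.
def race_margin_ps(t_skew_max, t_clkq_min, t_hold_max):
    """Positive margin means the clock tree is race-immune."""
    return (t_clkq_min - t_hold_max) - t_skew_max

# Assumed split of the 900 ps budget implied by the slide's numbers.
margin = race_margin_ps(t_skew_max=320, t_clkq_min=1000, t_hold_max=100)
print(f"race margin = {margin} ps, race-immune = {margin > 0}")
```

With the slide's 320 ps skew, this reproduces the quoted 580 ps margin.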
Example 1: Macro Hardening — Parallel Pipelined FIR Filter
• Area in 0.25 µm: 1.4 mm²
• Power @ 25 MHz (1 V, PowerMill): 13.0 mW
• Critical-path delay (1 V, PathMill): 18.0 ns
• Cells: 21k; transistors: 240k
• Execution time: 3 hours (elaborate/route), 9 hours (characterization)
• Disk space: 180 MB (elaborate/route), 1.5 GB (characterization)
Most time/disk space is spent on extraction and power simulation
Example 2: Test Chip — Parallel Pipelined FIR Filter
(8× decimation filter for a 12-bit 200 MHz ΣΔ converter)
• 300k transistors, 0.25 µm, 1.0 V, 25 MHz
• 6.8 mm², 14 mW
• 2-phase clock, 3 layers of P&R hierarchy
TDMA Baseband Receiver
[Blocks: control, rotate & correlate, carrier detection, frequency estimation]
• 600k transistors, 0.18 µm, 1.0 V, 25 MHz
• 1.1 mm², 21 mW
• Single-phase clock, 5 clock domains, 2 layers of P&R hierarchy
Conclusions
• Direct-mapped hardware is the most efficient use of silicon
• Direct-mapped hardware can be easier to design and verify than embedded hardware/software systems
• Don’t translate design data — refine it
• Design with dataflow graphs, not sequential code
• Design-flow automation speeds up design-space exploration
Embedded Processor Architectures and (Re)Configurable Computing
Vandana Prabhu
Professor Jan M. Rabaey
Jan 10, 2000
PicoRadio Architecture
[Blocks: FPGA, embedded µP, dedicated FSM, dedicated DSP, reconfigurable datapath]
Reconfigurable Computing: Merging Efficiency and Versatility
• Spatially programmed connection of processing elements
• “Hardware” customized to the specifics of the problem
• Direct map of problem-specific dataflow and control
• Circuits “adapted” as problem requirements change
Matching Computation and Architecture
[Diagram: address generators, memories, and MACs implementing a convolution, plus a control processor]
Two models of computation, two architectural models:
• communicating processes + data-flow
• sequential control + data-driven
Implementation Fabrics for Data Processing
[Example: digital baseband receiver — adaptive pilot correlators (C0 ... C_L-1), adaptive data correlator, acquisition and timing recovery, channel-coefficient estimates, signal update block; workload: 300 million multiplications/s and 357 million add–subtracts/s]
• DSP: power 460 mW / 1500 mW, area 1089 mm² / 3600 mm²
• Direct mapped: power 3 mW / 10 mW, area 1.3 mm² / 5 mm²
• Pleiades: power 18.49 mW / 62.33 mW, area 5.44 mm² / 21.34 mm²
16 Mmacs/mW!
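A cross-check of the "16 Mmacs/mW" headline: dividing the receiver's ~300 million multiplications/s by the listed power numbers gives the MAC efficiency of each fabric. Pairing the 16 Mmacs/mW figure with the Pleiades power is my inference from the numbers, not stated on the slide:

```python
# MAC efficiency = workload / power, using the figures listed above.
macs_per_s = 300e6  # ~300 million multiplications/s for the receiver
for fabric, mw in {"direct mapped": 3, "Pleiades": 18.49, "DSP": 460}.items():
    print(f"{fabric:14s} {macs_per_s / 1e6 / mw:7.1f} Mmacs/mW")
```

The Pleiades row comes out at ~16.2 Mmacs/mW, matching the slide's claim, with direct-mapped hardware another ~6× better and the DSP roughly 25× worse.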
Software Methodology Flow
[Flow: algorithms + area & timing constraints → kernel detection → transformations for low power → behavioral estimation/exploration (power & timing estimation of various kernel implementations, using processor & accelerator PDA models) → partitioning for reconfigurable HW (Marlene Wan) → software compilation, reconfigurable-hardware mapping, and interface code generation → executable intermediate form, with premapped kernels and interconnect optimization]
Maia: Reconfigurable Baseband Processor for Wireless
• 0.25 µm technology: 4.5 mm × 6 mm
• 1.2 million transistors
• 40 MHz at 1 V
• 1 mW VCELP voice coder
• Hardware: 1 ARM-8; 8 SRAMs & 8 AGPs; 2 MACs; 2 ALUs; 2 in-ports and 2 out-ports; 14×8 FPGA
Implementation Fabrics for Protocols
A protocol = an extended FSM
[Example: Intercom TDMA MAC — states idle / RACH req / RACH akn / slot-set update, with read/write access to buffers and a 2×16 slot-set table]

          ASIC         FPGA         ARM8
Power     0.26 mW      2.1 mW       114 mW
Energy    10.2 pJ/op   81.4 pJ/op   n × 457 pJ/op

• ASIC: 1 V, 0.25 µm CMOS process
• FPGA: 1.5 V, 0.25 µm low-energy CMOS FPGA
• ARM8: 1 V, 25 MHz processor; n = 13,000
• Ratio: 1 : 8 : >400
Idea: exploit the model of computation — concurrent finite-state machines communicating through message passing
Low-Power FPGA
Low-energy embedded FPGA (Varghese George)
• Test chip
  – 8×8 CLB array
  – 5-input, 3-output CLB
  – 3-level interconnect hierarchy
  – 4 mm² in 0.25 µm ST CMOS
  – 0.8 and 1.5 V supply
• Simulation results
  – 125 MHz toggle frequency
  – 50 MHz 8-bit adder
  – energy 70 times lower than a comparable Xilinx part
An Energy-Efficient µP System

• Dynamic Voltage Scaling (Trevor Pering & Tom Burd)
• Integrated dc-dc converter sets the processor speed
• Lower speed, lower voltage, lower energy
[Figure: workload trace before and after DVS; idle time is converted into
slower, lower-voltage execution]
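The energy argument behind dynamic voltage scaling can be sketched numerically: dynamic energy for a task of N cycles is roughly E = C_eff * Vdd^2 * N, so a task that tolerates a lower clock can also run at a lower supply and finish with quadratically less energy. The capacitance and voltage values below are illustrative, not measurements from the Pering/Burd system.

```c
/* Sketch of the DVS energy arithmetic. Halving the supply voltage for a
 * fixed cycle count cuts dynamic energy by 4x, which is why converting
 * idle time into slow, low-voltage execution pays off. */
#include <assert.h>

static double dvs_energy(double c_eff, double vdd, double cycles)
{
    return c_eff * vdd * vdd * cycles;   /* E = C * V^2 * N */
}
```

For example, a 1000-cycle task at 2.0 V costs four times the energy of the same task at 1.0 V, even though both execute the same number of cycles; the slower run merely takes longer in wall-clock time.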
Xtensa Configurable Processor

• Xtensa (Tensilica, Inc.) for embedded CPUs
  – Configurability lets the designer keep hardware overhead minimal
  – ISA (compatible with 32-bit RISC) can be extended for software
    optimizations
  – Fully synthesizable
  – Complete HW/SW suite
• VCC modeling for exploration (Vandana Prabhu)
  – Requires mapping the "fuzzy" instructions of the VCC processor model
    to the real ISA
  – Requires multiple models depending on the memory configuration
  – ISS simulation to validate the accuracy of the model
Microprocessor Optimizations for Network Protocols

• Implements the transport layer on a configurable processor
  – TDMA control and channel-usage management
• The upper layer of the protocol is dominated by processor control flow
  – Memory routines, branches, procedure calls
• Artifacts of the code-generation tools are significant
  – Excessively modular code introduces procedure calls
  – Dynamic memory allocation is used
[Chart: total execution time broken down into memory routines (calloc,
memcpy) and other]
• Configurable processor
  – Increased size of the register file
  – Customized instructions help the datapath but not control
• Efficient implementation is needed at both the code-generation and
  architecture levels!
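One code-generation-level fix for the calloc/memcpy overhead the profile points to is replacing per-packet heap allocation with a fixed pool sized at compile time. The following C sketch is illustrative only; the pool size, packet size, and function names are invented, not taken from the protocol code.

```c
/* Sketch: a static buffer pool as a substitute for per-packet calloc.
 * Allocation is a bounded scan with no heap calls, which suits a
 * control-flow-dominated protocol layer. */
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 4
#define PKT_BYTES  64

static unsigned char pool[POOL_SLOTS][PKT_BYTES];
static int pool_used[POOL_SLOTS];

static void *pkt_alloc(void)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        if (!pool_used[i]) { pool_used[i] = 1; return pool[i]; }
    return NULL;                      /* pool exhausted */
}

static void pkt_free(void *p)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        if (p == (void *)pool[i]) pool_used[i] = 0;
}
```

The design choice here mirrors the slide's argument: the cost of a worst-case-sized static pool is area (memory), which the architecture can afford, in exchange for removing the allocator's control-flow overhead from the critical path.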
Implementation Methodology for Reconfigurable Wireless Protocols
(Kevin Camera & Tim Tuan; Suetfei Li & Tim Tuan)

• Changing granularity within the protocol stack requires an estimation
  tool for energy-efficient implementation
• Software exploration on processors
  – Exploring Xtensa's TIE
• Hardware exploration on FPGA platforms
  – Optimal FPGA architecture
  – Alternatively, a "reconfigurable FSM" analogous to the Pleiades
    approach for datapath kernels
TCI - A First Generation PicoNode

• Tensilica embedded processor
• Memory sub-system
• Sonics backplane
• Baseband processing
• Configurable logic (physical layer)
• Programmable protocol stack
The System-on-a-Chip Nightmare: the "Board-on-a-Chip" Approach

[Diagram: CPU, DSP, DMA, MPEG, memory controller, and bridge hanging off a
system bus, plus a peripheral bus, custom interfaces, I/O blocks, and
ad-hoc control wires]

Courtesy of Sonics, Inc.
The Communications Perspective (Mike Sheets)

[Diagram: the same cores (CPU, DSP, DMA, MPEG, MEM, I/O) viewed as
communicating agents rather than bus masters and slaves]

Communications-based design. Example: "The Silicon Backplane" (Sonics, Inc.)
• Open Core Protocol (OCP)
• A SiliconBackplane Agent per core
• Guaranteed-bandwidth arbitration
Summary

• Design for low energy impacts all stages of the design process: the
  earlier, the better
• Energy reduction requires clear communication and computation
  abstractions
• Efficient and abstract modeling of energy at the behavior and
  architecture levels is crucial
• Efficient hardware implementation of the protocol stack
• Beat the SoC monster!
Targeting Tiled Architectures in Design Exploration

Lilian Bossuet(1), Wayne Burleson(2), Guy Gogniat(1),
Vikas Anand(2), Andrew Laffely(2), Jean-Luc Philippe(1)

(1) LESTER Lab, Université de Bretagne Sud, Lorient, France
    {lilian.bossuet, guy.gogniat, jeanluc.philippe}@univ-ubs.fr
(2) Department of Electrical and Computer Engineering,
    University of Massachusetts, Amherst, USA
    {burleson, vanand, alaffely}@ecs.umass.edu
Design Space Exploration: Motivations

• Design solutions for new telecommunication and multimedia applications
  targeting embedded systems
• Optimization and reduction of SoC power consumption
• Increased computing performance
  – Increase parallelism
  – Increase speed
• Flexibility
  – Take run-time reconfiguration into account
  – Target multi-granularity (heterogeneous) architectures
Design Space Exploration: Flow

• Progressive design-space reduction:
  – iterative exploration
  – refinement of the architecture model
  – increasing accuracy of the performance estimation
• One level of abstraction for each level of estimation accuracy
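The progressive-reduction idea can be sketched as interval-based pruning: as the estimation error shrinks at each refinement level, any candidate architecture whose optimistic bound is still worse than some other candidate's pessimistic bound is discarded. This is only an illustration of the principle, not the LESTER tool's actual algorithm; the costs and error margins are invented.

```c
/* Sketch: prune candidate architectures whose cost interval
 * [cost - err, cost + err] cannot beat the best pessimistic bound.
 * Tightening err across iterations shrinks the design space. */
#include <assert.h>

typedef struct { double cost; int alive; } cand_t;

static int prune(cand_t *c, int n, double err)
{
    double best_upper = 1e30;
    int alive = 0;
    /* best pessimistic (upper) bound among surviving candidates */
    for (int i = 0; i < n; i++)
        if (c[i].alive && c[i].cost + err < best_upper)
            best_upper = c[i].cost + err;
    /* drop candidates whose optimistic (lower) bound is already worse */
    for (int i = 0; i < n; i++) {
        if (c[i].alive && c[i].cost - err > best_upper)
            c[i].alive = 0;
        alive += c[i].alive;
    }
    return alive;
}
```

Calling `prune` repeatedly with a decreasing `err` mimics the flow above: coarse, cheap estimates eliminate the obviously bad solutions first, and the expensive accurate estimates are only spent on the survivors.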
Reconfigurable Architectures

• Bridge the flexibility gap between ASICs and microprocessors
  [Hartenstein, DATE 2001]
• Energy-efficient alternative to low-power programmable DSPs
  [Rabaey, ICASSP 1997; FPL 2000]
• Run-time reconfigurable [Compton & Hauck 1999]
• => A key ingredient for future silicon platforms
  [Schaumont et al., DAC 2001]
Design Space of Reconfigurable Architectures

Reconfigurable architectures (R-SoC):
• Fine grain (FPGA)
  – Island topology: Xilinx Virtex, Xilinx Spartan, Atmel AT40K,
    Lattice ispXPGA, Altera Stratix, Altera Cyclone
  – Hierarchical topology: Altera Apex
• Multi-granularity (heterogeneous)
  – Processor + coarse-grain coprocessor: Chameleon, REMARC, Morphosys,
    Pleiades
  – Processor + fine-grain coprocessor: Garp, FIPSOC, Triscend E5,
    Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSLIC
• Coarse grain (systolic), tile-based architectures
  – Mesh topology: aSoC, E-FPFA, RAW
  – Linear topology: Systolic Ring, CHESS, RaPiD, MATRIX, PipeRench,
    KressArray, Systolix PulseDSP
  – Hierarchical topology: DART, FPFA
A Target Architecture: aSoC

• Adaptive System-on-a-Chip (aSoC)
• Tiled architecture containing many heterogeneous processing cores (RISC,
  DSP, FPGA, motion estimation, Viterbi decoder)
• Mesh communication network controlled by a statically determined
  communication schedule
• A scalable architecture
FPGA in System-on-a-Chip

• Fast time-to-market
• Post-fabrication customization
  – Broadened application domain
  – Run-time reconfiguration
  – Bug fixes
  – Upgrades
• But 10x-100x worse:
  – Area
  – Performance
  – Power

Mark L. Chang [email protected]
aSoC Architecture

• Heterogeneous cores (uProc, MUL, FPGA, ...)
• Point-to-point connections
• Communication interface per tile
[Tile diagram: each tile couples a core to a communication interface with
North/South/East/West ports and a ctrl block]
aSoC Communications Interface

• Interface crossbar
  – inter-tile transfer
  – tile-to-core transfer
• Interconnect/instruction memory
  – contains the instructions that configure the interface crossbar
    (cycle-by-cycle)
• Interface controller
  – selects the instruction (program counter)
• Coreports
  – data interface and storage for transfers with the tile IP core
• Dynamic voltage and frequency selection
  – dynamic power management (local frequency & voltage)
[Diagram: North/South/East/West inputs and outputs routed through the
interface crossbar; the instruction memory, PC, and local configuration
decoder issue settings such as "North to South & East"; coreports connect
the crossbar to the core]
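The cycle-by-cycle configuration of the interface crossbar can be sketched as a tiny interpreter: the interface controller's program counter walks an instruction memory whose entries say which input port drives each output port that cycle. The port encoding and the one-instruction schedule below are illustrative, not the real aSoC instruction format.

```c
/* Sketch: statically scheduled crossbar routing, one instruction per
 * communication cycle, with the schedule repeating when the PC wraps. */
#include <assert.h>

enum { P_NORTH, P_SOUTH, P_EAST, P_WEST, P_CORE, NPORTS };

typedef struct {
    int src_for[NPORTS];       /* which input drives each output */
} xbar_instr_t;

typedef struct {
    const xbar_instr_t *imem;  /* interconnect/instruction memory */
    int len, pc;               /* schedule length and program counter */
} comm_if_t;

/* One communication cycle: route inputs to outputs per the schedule. */
static void comm_cycle(comm_if_t *ci, const int in[NPORTS], int out[NPORTS])
{
    const xbar_instr_t *insn = &ci->imem[ci->pc];
    for (int p = 0; p < NPORTS; p++)
        out[p] = in[insn->src_for[p]];
    ci->pc = (ci->pc + 1) % ci->len;   /* wrap: schedule repeats */
}
```

Because the schedule is fixed at compile time, every transfer's latency is known in advance, which is what makes the aSoC network "fast and predictable" compared with dynamically arbitrated buses.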
aSoC Exploration

• Type of tiles
• Number of each type of tile
• Placement of the tiles
• Internal architecture of the reconfigurable tiles (FPGA core)
• Communication scheduling
Design Space Exploration: Goals

• Goal: rapid exploration of various architectural solutions to be
  implemented on heterogeneous reconfigurable architectures (aSoC), in
  order to select the most efficient architecture for one or several
  applications
• Takes place before architectural synthesis (algorithmic specification in
  a high-level abstraction language)
• Estimations are based on a functional architecture model (generic,
  technology-independent)
• Iterative exploration flow to progressively refine the architecture
  definition, from a coarse model to a dedicated model
Design Exploration Flow Targeting Tiled Architectures

[Flow diagram: a C specification is parsed (C to HCDFG) into HCDFG graphs
of the application; together with a model of the aSoC architectures
(tiles T1, T2 hosting functions F1, F2; THF and HF models), these feed
Application Analysis, then Tile Exploration, then the aSoC Builder, static
communication scheduling, and aSoC Analysis, yielding the final model of
the aSoC architecture. The tile-exploration step produces, for each
function/tile pair (Fi, Tj), a performance triple Tij, Cij, Occij.]
Application Analysis

• Use of algorithmic metrics and dedicated scheduling algorithms to
  highlight suitable target architectures
• Algorithmic metrics:
  – Characterize the application orientation
    • Processing
    • Memory
    • Control
  – Characterize the application's potential parallelism
    • Processing
    • Memory
Tile Exploration: Three Steps

• Projection:
  – Link between necessary resources (application) and available resources
    (tile)
  – Uses an allocation algorithm based on communication-cost reduction
• Composition:
  – Takes the function scheduling into account to estimate additional
    resources (registers, muxes, ...)
• Estimation:
  – Performance-interval computation (lower and upper bounds)
  – Speed / resource-utilization / power characterization
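The estimation step's performance interval can be illustrated with the classic scheduling bounds: a function's latency on a tile is at least the larger of its critical path and its resource bound (operation count divided by the number of parallel operators), and at most its fully serial execution time. This is a hedged sketch of that arithmetic with invented inputs, not the tool's actual cost model.

```c
/* Sketch: lower/upper latency bounds for one function on one tile.
 * n_ops      - operations in the function's HCDFG
 * crit_path  - length of the longest dependence chain (cycles)
 * n_alus     - parallel operators available on the tile */
#include <assert.h>

typedef struct { int lower, upper; } perf_interval_t;

static int max_i(int a, int b) { return a > b ? a : b; }

static perf_interval_t estimate(int n_ops, int crit_path, int n_alus)
{
    perf_interval_t p;
    /* cannot beat the critical path or the resource bound (ceiling) */
    p.lower = max_i(crit_path, (n_ops + n_alus - 1) / n_alus);
    p.upper = n_ops;          /* one op per cycle, fully serial */
    return p;
}
```

An interval like this is exactly what the flow's tile-exploration table (Tij, Cij, Occij) needs: cheap to compute per function/tile pair, yet tight enough to rank candidate tiles before any synthesis is run.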
aSoC Builder

• Environment: AppMapper
• Partition and assignment
  – based on run-time estimation
• Compilation
  – Communication scheduling
  – Core compilation
• Generates the tile configurations
  – Communication instructions
  – Bitstreams (for reconfigurable tiles)
  – RISC instructions
aSoC Analysis

• Uses the results of the previous steps
  – Function scheduling
  – Tile allocation
  – Communication scheduling
• Complete estimation of the proposed solution
  – Global execution time
  – Global power consumption
  – Total area
Power-Aware System on a Chip

A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson
University of Massachusetts Amherst
{alaffely, jliang, tessier, moritz, burleson}@ecs.umass.edu
Boston Area Architecture Conference, 30 Jan 2003

This material is based upon work supported by the National Science
Foundation under Grant No. 9988238. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National
Science Foundation.
Adaptive System-on-a-Chip

• Tiled architecture with mesh interconnect
  – Point-to-point communication pipeline
• Low-overhead core interface
  – On-chip bus substitute for streaming applications
• Allows for heterogeneous cores
  – Differing sizes, clock rates, voltages
• Based on static scheduling
  – Fast and predictable
[Tile diagram: processor, multiplier, and FPGA cores, each attached to the
mesh via a communication interface with North/South/East/West ports and a
ctrl block]
aSoC Implementation

• 0.18 µm technology, full custom
[Layout plot: approximately 2500 λ x 3000 λ]
Some Results

• 9- and 16-core systems tested on IIR filtering, MPEG encoding, and image
  processing applications
  – ~2x the performance of the CoreConnect bus (burst and hierarchical)
  – ~1.5x the performance of an oblivious routing network(1) (dynamic
    routing)
  – Maximum speedup is 5x

(1) W. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer
Networks Using Virtual Channels", IEEE Transactions on Parallel and
Distributed Systems, April 1993.