Recent Progress in Field Programmable Gate Arrays


Recent Progress in Field Programmable Gate Arrays:
Hardware, CAD software, evaluation boards, and reconfigurable computing
Marek Perkowski
Chengdu, June 2008
Programmable Logic
The simplest programmable logic devices (PLDs) are PALs (see the 22V10 figure on the next page).
My students use PLDs in their first year at PSU.
What is the next step in the evolution of PLDs?
More gates!
How do we get more gates? We could put several PALs on one chip and put an interconnection matrix between them.
This is called a Complex PLD (CPLD).
22V10 PLD
Cypress CPLD
Programmable interconnect matrix.
Each logic block is similar to a 22V10.
Any other approaches?
Another approach to building a “better” PLD is to place a lot of
primitive gates on a die, and then place programmable interconnect
between them:
FPGA Technology
1. Bird’s Eye View of FPGA Technology
2. FPGAs in 2004: Virtex-4 Introduction
3. Software and Design
4. Special Problems and Solutions
Field Programmable Gate Arrays
The FPGA approach is to arrange primitive logic elements (logic
cells) in rows and columns with programmable routing
between them.
What constitutes a primitive logic element? Lots of different
choices can be made! The primitive element must form a
“complete logic family”, meaning that any logic function can be built from it.
• A primitive gate like a NAND gate
• A 2/1 mux (this happens to be a complete logic family)
• A lookup table (e.g., a 16x1 lookup table can implement any
4-input logic function).
Often one of the above is combined with a DFF to form the
primitive logic element.
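The two less obvious choices above can be illustrated with a small software model. This is only a sketch: the function names (make_lut, lut_read, mux2, and so on) and the sample 4-input function are made up for illustration, and the mux-based gates assume the constants 0 and 1 are available as inputs.

```python
from itertools import product

def make_lut(func, n_inputs=4):
    """Fill a LUT (2**n_inputs stored bits) from an n-input Boolean function."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]

def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form the address of the stored bit."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return table[address]

def mux2(sel, a, b):
    """2:1 multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a

# A 4-input function chosen only for illustration: (a AND b) XOR (c OR d).
example = make_lut(lambda a, b, c, d: (a & b) ^ (c | d))

# NOT, AND and OR built from 2:1 muxes with constant inputs,
# which is why the mux counts as a complete logic family.
def mux_not(x):
    return mux2(x, 1, 0)

def mux_and(x, y):
    return mux2(x, 0, y)

def mux_or(x, y):
    return mux2(x, y, 1)
```

A 16x1 table has one stored bit per input combination, so changing the 16 bits changes the function without changing the hardware.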
Other FPGA features
Besides primitive logic elements and programmable routing, some FPGA families add other features:
Embedded memory: many hardware applications need memory for data storage, so many FPGAs include blocks of RAM for this purpose.
Dedicated logic for carry generation or other arithmetic functions.
Phase locked loops for clock synchronization, division, and multiplication.
Altera Flex 10K FPGA Family
Altera Flex 10K FPGA Family (cont)
Dedicated memory
16 x1 LUT
DFF
Embedded Array Block
Memory block; can be configured as:
256 x 8, 512 x 4, 1024 x 2, 2048 x 1
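The four configurations listed above all describe the same amount of storage with different aspect ratios. A quick check, with the names EAB_CONFIGS and describe introduced here only for illustration:

```python
# (words, bits per word) for each configuration listed on the slide.
EAB_CONFIGS = [(256, 8), (512, 4), (1024, 2), (2048, 1)]

def describe(words, width):
    """Return address width, data width and total capacity of one configuration."""
    return {
        "address_bits": (words - 1).bit_length(),
        "data_bits": width,
        "total_bits": words * width,
    }

summary = [describe(words, width) for words, width in EAB_CONFIGS]
for (words, width), info in zip(EAB_CONFIGS, summary):
    print(f"{words} x {width}: {info}")
```

Each halving of the data width adds one address bit, and the capacity stays at 2048 bits.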
EPROM/EEPROM Technology
EPROM can be reprogrammed and needs no external storage.
EPROM cannot be re-programmed in circuit.
EEPROM can be re-programmed in circuit.
EEPROM consumes about 2X the area of EPROM.
Erasable PLD (EPLD)
SOP-based PAL logic array.
I/Os: input, output, or bidirectional.
Registers: configurable as D, T, JK, or SR FFs.
Programmable clock to each FF.
Programming the FPGA
Configuration.
Readback - design verification and
debugging.
Security - a security-bit to prevent
readback.
Advantages and Disadvantages of FPGA
Advantages:
Fast turnaround.
Low NRE (non-recurring engineering) charges.
Low risk.
Effective design verification.
Low testing cost.
Disadvantages:
Chip size & cost.
Slow speed.
CPLD versus FPGA
                          CPLD                                 FPGA
Interconnect style        Continuous                           Segmented
Architecture and timing   Predictable                          Unpredictable
Software compile times    Short                                Long
In-system performance     Fast                                 Moderate
Power consumption         High                                 Moderate
Applications addressed    Combinational and registered logic   Registered logic only
Source: Altera
FPGAs
What? - Programmable logic +
programmable routing = FPGAs.
Why? - Zero NREs, easy bug fixes, and
short time-to-market.
How?
Comparison of Different Design Technologies
               Custom   Std Cells   Gate Arrays   FPGAs
Design time    Long     Short       Short         Short
Fabrication    Long     Long        Short         None
Chip area      Small    Med.        Large         Very large
Design cost    High     Med.        Low           Very low
Unit cost      Low      Low         Med.          High
Design cycle   Long     Med.        Short         Very short
Emerging FPGA-based
Applications
Low-volume production.
Urgent time-to-market competition.
Rapid prototyping.
Logic emulation.
Custom-computing hardware.
Reconfigurable computing.
Design
Considerations
Target architecture.
Fixed logic and routing resources.
Fixed I/O pins.
Long signal delays.
FPGA Selection Criteria
Density.
Speed.
Price.
Flexibility.
COSTS of Technologies
Lower Cost
Moore’s Law is alive
Smaller geometries, larger wafers, and lower defect density (= higher yield)
continue to achieve lower cost per function
LUT + flip-flop: $1.00 in 1990, $0.002 in 2003
State-of-the-art: 90 nm on 300 mm wafers
Spartan-3 uses this technology for lowest cost
Rapid price reductions, intense competition
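The two price points quoted above imply a steady yearly decline, which is easy to work out. The sketch below uses only the slide's figures; the assumption of a constant yearly rate is a simplification made here.

```python
price_1990 = 1.00    # dollars per LUT + flip-flop in 1990 (from the slide)
price_2003 = 0.002   # dollars per LUT + flip-flop in 2003 (from the slide)
years = 2003 - 1990

# Overall cost ratio across the whole period.
total_reduction = price_1990 / price_2003

# Constant yearly price multiplier that would produce that ratio.
annual_factor = (price_2003 / price_1990) ** (1 / years)
annual_decline_pct = (1 - annual_factor) * 100

print(f"{total_reduction:.0f}x cheaper over {years} years")
print(f"about {annual_decline_pct:.0f}% cheaper each year")
```

Under that assumption the price per function falls by roughly 38% every year, which adds up to a 500x reduction over the thirteen years.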
Changing costs of FPGAs and
technologies
More Logic and Better Features:
>100,000 LUTs & flip-flops
>200 BlockRAMs, and the same number of 18 x 18 multipliers
1156 pins (balls) with >800 GP I/O
50 I/O standards, incl. LVDS with internal termination
16 low-skew global clock lines
Multiple clock management circuits
On-chip microprocessor(s) and Gbps transceivers
Gate count is really a meaningless metric
A Bird’s Eye View…
Higher Speed
Smaller and faster transistors
90 nm technology, using 193 nm ultra-violet light
Cu interconnect ( instead of Al ) was easily achieved
Low-K dielectric progress is disappointing
System speed: up to 500 MHz,
Mainly through smart interconnects, clock management,
dedicated circuits, flexible I/O.
Integrated transceivers running at 10 Gigabits/sec
Speeding up general-purpose logic is getting difficult
A Bird’s Eye View…
Better tools
Back-End Place&Route and XST synthesis
VHDL and Verilog becoming entry point
IP/Cores speed up design and verification
Embedded Software Development Tools
support architectures and merge HW and SW
Domain-Specific Languages
System Generator bridges the gap between
Matlab/Simulink and FPGA circuit description
ASIC-size FPGAs need ASIC-like tools
ASICs Are Losing Ground
Mask set >$1M + design + verification + risk
Source:IBM
ASICs are only for extreme designs:
Extreme volume, speed, size, low power
SPGA
Allow multiple building blocks.
Logic.
Memory.
Data path.
Applications Using SPGAs
Intellectual property (IP).
Communication & networking.
Graphical processing.
Embedded processing.
Designing with SPGAs
A team-based approach.
Understanding how to use SPGA system
features will be the key to pulling the entire
design into a single device.
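CMOS PLD Market Share
[Pie chart. Vendors shown: Xilinx, Altera, AMD, Lattice, Actel, AT&T, Cypress, Other. Slices shown: 31%, 24%, 15%, 11%, 6%, 5%, 5%, 3% (the pairing of vendors with slices is not preserved).]
Source: Dataquest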
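CMOS Logic Market
[Pie chart. Segments shown: Std logic, Programmable, GA, Std cell, Custom, Chipset. Slices shown: 30%, 29%, 14%, 10%, 9%, 8% (the pairing of segments with slices is not preserved).]
But the market share is growing
Source: Dataquest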
FPGAs Growth
[Chart: FPGA market in millions of US dollars (axis 0 to 2500), 1996 to 2000.]
Source: Integrated Circuit Engineering
CMOS Programmable-logic Market
[Chart: market in billions of US dollars (axis 0 to 5), 1997 to 2000.]
Source: Dataquest
Rapid Prototyping
What?
Why?
How?
What is prototyping?
Basic components: FPGAs (Field Programmable Gate Arrays) and FPICs (Field Programmable Interconnect Devices).
Hardware: boards, boxes, and cabinets.
Software: methodologies and CAD tools.
Product Development Cycle
Market survey
Customer
acceptance
Product development
Production
Pressures on Today’s
Product Development
Time-to-market!
Design complexity!
Why Do We Need Prototyping?
Design verification.
Limited production.
Concurrent engineering.
This requires cooperation of engineers, computer science specialists, and marketing.
Design Verification
Specification
Functionality &
requirements
?
Final product
Final functionality
& performance
Design Process
Specification
System-level design
RTL design
Logic-level design
Physical-level design
Final chips
Simulation
Fast prototyping
Formal verification
Logic emulation
Formal verification is just one of the options
Verification Alternatives
                           Modeling   System        Prepare
                           accuracy   integration   time      Speed
Event Driven Simulation    High       No            Short     Slow
Cycle-Based Simulation     Med.       No            Short     Med.
Behavioral Simulation      Low        No            Short     Med.
Hardware Accelerated Sim   Varies     No            Med.      Med. Fast
Breadboarding              Med.       Yes           Long      Very Fast
Emulation or Prototyping   Med.       Yes           Med.      Very Fast
A Minute in the Life of a 100K Gates Design
Time needed to reproduce one minute of real operation:
1 minute            Actual hardware at 50 MHz
10 minutes          Logic emulator or prototype at 5 MHz
100 to 2K           HW accelerator at 250M evals/sec
1 month (50K)       Cycle-based simulator at 1K insts/sec
3 months (120K)     Compiled-code logic simulator at 125 MIPS
1.5 years (800K)    Event-driven logic simulator at 125 MIPS
We need FPGA emulation because simulation is too slow
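The figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes that the numbers 50K, 120K and 800K are slowdown factors relative to real time (minutes of simulation per minute of real operation); that reading is an interpretation, not something the slide states.

```python
MINUTES_PER_DAY = 60 * 24

def wall_clock_days(slowdown):
    """Days needed to reproduce one minute of real operation at a given slowdown."""
    return slowdown / MINUTES_PER_DAY

# Emulator: running at 5 MHz instead of the real 50 MHz is a 10x slowdown,
# so one minute of real operation takes ten minutes.
emulator_slowdown = 50e6 / 5e6

# Slowdown factors assumed from the slide (see the note above).
simulators = [
    ("cycle-based simulator", 50_000),
    ("compiled-code simulator", 120_000),
    ("event-driven simulator", 800_000),
]
for name, slowdown in simulators:
    print(f"{name}: about {wall_clock_days(slowdown):.0f} days")
```

The results (about 35, 83 and 556 days) line up with the slide's 1 month, 3 months and 1.5 years.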
Development with Prototyping
[Timeline diagram with three tracks: SW (design, code, integration, debug), HW (design, build, integration, debug), and CHIP (design, fab, debug). A small gap and a big gap are marked.]
Development with Prototyping
You speed up development through parallelism
[Timeline diagram with three parallel tracks: SW (design, code, system integration & SW debug), HW (design, build, HW integration & debug), and CHIP (design, chip debug, fab), followed by final integration.]
How to Develop a Prototype Using FPDs (Field Programmable Devices)
Custom-designed prototyping board.
Logic-emulation systems.
Field-programmable printed-circuit-boards.
FPGA State of the Art 2004
90-nanometer manufacturing technology
Ten Gigahertz serial I/O (SerDes) in silicon
0.07 femtosecond asynchronous data capture window causes 1.5 ns metastable delay
Issues in FPGA
Technologies
Complexity of Logic Element
How many inputs/outputs for the logic element?
Does the basic logic element contain a FF? What type?
Interconnect
How fast is it? Does it offer ‘high speed’ paths that cross the
chip? How many of these?
Can I have on-chip tri-state busses?
How routable is the design? If 95% of the logic elements are
used, can I route the design?
More routing means more routability, but less room for
logic elements
Issues in FPGA Technologies (cont)
Macro elements
Are there SRAM blocks? Is the SRAM dual ported?
Is there fast adder support (e.g., fast carry chains)?
Is there fast logic support (e.g., cascade chains)?
What other types of macro blocks are available (fast
decoders? register files?)
Clock support
How many global clocks can I have?
Are there any on-chip Phase Locked Loops (PLLs) or
Delay Locked Loops (DLLs) for clock synchronization or
clock multiplication?
Issues in FPGA Technologies (cont)
What type of IO support do I have?
TTL, CMOS are a given
Support for mixed 5V, 3.3v IOs?
3.3 v internal, but 5V tolerant inputs?
Support for new low voltage signaling standards?
GTL+, GTL (Gunning Transceiver Logic) - used on Pentium II
HSTL - High Speed Transceiver Logic
SSTL - Stub Series Terminated Logic
USB - IO used for Universal Serial Bus (differential signaling)
AGP - IO used for Accelerated Graphics Port
Maximum number of IO? Package types?
Ball Grid Array (BGA) for high density IO
Altera FPGA Family Summaries
Now we discuss some popular families
Altera Flex10K/10KE
LEs (Logic elements) have 4-input LUTS (look-up tables)
+1 FF
Fast Carry Chain between LE’s, Cascade chain for logic
operations
Large blocks of SRAM available as well
Altera Max7000/Max7000A
EEPROM based, very fast (Tpd = 7.5 ns)
Basically a PLD architecture with programmable
interconnect.
Max 7000A family is 3.3 v
Xilinx FPGA Family Summaries
Virtex Family
SRAM Based
Largest device has 1M gates
Configurable Logic Blocks (CLBs) have two 4-input
LUTS, 2 DFFs
Four onboard Delay Locked Loops (DLLs) for clock
synchronization
Dedicated RAM blocks (LUTs can also function as RAM).
Fast Carry Logic
XC4000 Family
Predecessor of Virtex
No DLLs, No dedicated RAM blocks
Actel FPGA Family
Summaries
MXDS Family
Fine grain Logic Elements that contain Mux logic +
DFF
Embedded Dual Port SRAM
One Time Programmable (OTP) - means that no
configuration loading on powerup, no external serial
ROM
AntiFuse technology for programming (AntiFuse
means that you program the fuse to make the
connection).
Fast (Tpd = 7.5 ns)
Low density compared to Altera, Xilinx - maximum
number of gates is 36,000
Cypress CPLDs
Ultra37000 Family
32 to 512 Macrocells
Fast (Tpd 5 to 10ns depending on number of
macrocells)
Very good routing resources for a CPLD
Evolution
                           1965   1980   1995   2010(?)
Max Clock Rate (MHz)       1      10     100    1000
Min IC Geometries (µ)      -      5      0.5    0.05
# of IC Metal Layers       1      2      3      12
PC Board Trace Width (µ)   2000   500    100    25
# of PC-Board Layers       1-2    2-4    4-8    10-20
Every 5 years: system speed doubles, IC geometry shrinks 50%
Every 7-8 years: PC-board min trace width shrinks 50%
The Ever-Shrinking Circuitry
Number of LUTs + flip-flops + routing that fit on the cross section of a human hair
2000: 2 LUTs in Virtex-II (150 nm)
2002: 3 LUTs in Virtex-IIPro (130 nm)
2004: 4 LUTs in Virtex-4 (90 nm)
2005: 8 LUTs = one CLB in 65 nm
Moore’s law is alive and well in FPGAs
Middle-of-the-Road Xilinx FPGAs
Year   Device       LUTs + flip-flops
1990   XC3042       288
1994   XC4005       512
1998   XC4013XL     1,152
2000   XCV300       6,144
2002   XC2V1000     10,240
2004   XC2VP30      27,382
2005   XC4V60-LX    53,248
Same price for each: One day’s engineering salary
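The first and last devices in this list give a rough doubling time for logic capacity at constant price. A minimal sketch; the helper name doubling_time is made up here, and using only the two endpoints ignores the uneven steps in between.

```python
import math

def doubling_time(count_start, count_end, years):
    """Years per doubling, assuming steady exponential growth between two points."""
    return years / math.log2(count_end / count_start)

# XC3042 (1990, 288) to XC4V60-LX (2005, 53,248), figures from the slide.
dt = doubling_time(288, 53_248, 2005 - 1990)
print(f"capacity at the same price doubles about every {dt:.1f} years")
```

The result is close to two years per doubling, in line with the Moore's law claims made elsewhere in the talk.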
Thirteen Years of Progress of Xilinx
Devices
200x More Logic
plus memory, µP,
DSP, MGT
40x Faster
50x Lower Power
per function x MHz
500x Lower Cost
per function
Moore Meets Einstein
[Chart: clock frequency in MHz and trace length in cm per 1/4 clock period, both on a doubling scale from 1 to 2048, plotted against year from ’65 to ’10.]
Speed Doubles Every 5 Years…
...but the speed of light never changes
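The quantity plotted on this slide, trace length per quarter clock period, follows directly from the clock frequency. The sketch below assumes a signal velocity of 15 cm/ns on a pc-board (about half the vacuum speed of light, which is roughly 30 cm/ns); the velocity value and the function name are assumptions made here, not figures from the slide.

```python
def quarter_period_trace_cm(clock_mhz, velocity_cm_per_ns=15.0):
    """Distance a signal travels in one quarter of a clock period."""
    period_ns = 1000.0 / clock_mhz
    return velocity_cm_per_ns * period_ns / 4

for mhz in (1, 10, 100, 1000):
    print(f"{mhz:5d} MHz: {quarter_period_trace_cm(mhz):8.2f} cm")
```

At 1 MHz the distance is tens of meters and trace length is irrelevant; at 1000 MHz it is a few centimeters, comparable to the size of a board.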
FPGAs in 2003
1000 to 80,000 LUTs and flip-flops,
millions of bits in dual-ported RAMs
Low-skew Global Clocks,
Frequency synthesis, 50 ps phase control
18 Kbit BlockRAMs and 18 x 18 multipliers
FPGAs are not glue-logic anymore
FPGAs in 2003
300+ MHz system clock,
800 MHz I/O
3+ Gigabit transceivers
Embedded hard and soft microprocessors
Design security: Triple-DES encryption
VHDL/Verilog entry, synthesis, auto place and route
FPGAs are a compelling alternative to ASICs
FPGAs in 2004
Virtex-4 in September 2004
4th Generation Advanced Logic
ASMBL™ Column-Based Architecture
500 MHz SmartRAM™ BRAM/FIFO
Integrated 450 MHz PowerPC Cores
0.6 - 11.1 Gbps RocketIO™
Integrated Tri-Mode Ethernet MAC Cores
Integrated System Monitor
500 MHz Xesium™ Clocking
500 MHz Xtreme DSP™ Slice
SelectIO with ChipSync™ Technology: 1 Gbps LVDS, 600 Mbps SE
New ASMBL™ Columnar Architecture
Enables “Dial-In” Resource Allocation Mix: Logic, DSP, BRAM, I/O, MGT, DCM, PowerPC
Made possible by Flip-Chip Packaging
I/O Columns Distributed throughout the Device
FPGA Innovation: Virtex-4
90 nm technology, triple-oxide, 1.2-V Vccint supply
General-purpose I/O up to 1 Gbps, Vcco = 1.5, 2.5, or 3.3 V
0.6 to 11.1 Gigabit/sec RocketIO transceivers
Advanced Silicon Modular Block architecture
Three sub-families:
V4-LX for logic-intensive applications
V4-SX for DSP-intensive applications
V4-FX with PPC micros and multi-gigabit transceivers
Common architecture for diverse applications
FPGA Innovation: Virtex-4
Higher Performance:
500 MHz for all sub-blocks
More Versatility
New innovative functions
Higher Level of Integration
More LUTs, flip-flops, RAMs, multipliers
Lower Cost
Smaller area = lower cost per function
Lower Power per ( Function times MHz )
FPGA Innovation: Virtex-4
Flip-chip packaging:
lower pin-inductance, stiffer Vcc distribution
Lower power per function and MHz
Triple-oxide gates, multiple thresholds,
smaller size, lower Vcc, better design
Better clocking, less skew, more flexibility
Better configuration control, partial reconfiguration
Robust configuration cell, SEU tolerant like 130 nm
FPGA Innovation: Virtex-4
Improved I/O Flexibility and Performance
Supports >50 standards, on-chip termination
Source-synchronous and system-synchronous
Serializer/deserializer behind each pin
Programmable delay available for each pin
> 1Gbps SelectI/O on each pin
>10 Gbps transceivers on dedicated pins (-FX family only)
Source-synchronous I/O improves performance
Serial I/O saves pins and pc-board area
FPGA Innovation: Virtex-4
Faster logic and memory
500+ MHz operation of all on-chip functions
32-bit arithmetic
48-bit adders and synchronous loadable counters
Up to 72-bit wide memory
4- to 36-bit wide FIFO control in each BlockRAM
Operates with fully independent write and read clocks
Reliable EMPTY and FULL outputs
also ALMOST Empty and ALMOST Full
FIFOs need no fabric resources and no design expertise
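The flag behaviour described above can be modelled in a few lines of software. This is a behavioural sketch only: the class name FifoModel and the almost_margin threshold are hypothetical, and it does not model the independent write and read clocks of the real hardware.

```python
from collections import deque

class FifoModel:
    """Behavioural FIFO with EMPTY, FULL, ALMOST EMPTY and ALMOST FULL flags."""

    def __init__(self, depth, almost_margin=2):
        self.depth = depth
        self.almost_margin = almost_margin  # hypothetical threshold for the ALMOST flags
        self._data = deque()

    @property
    def empty(self):
        return len(self._data) == 0

    @property
    def full(self):
        return len(self._data) == self.depth

    @property
    def almost_empty(self):
        return len(self._data) <= self.almost_margin

    @property
    def almost_full(self):
        return len(self._data) >= self.depth - self.almost_margin

    def write(self, word):
        """Store a word; refuse (return False) when the FIFO is full."""
        if self.full:
            return False
        self._data.append(word)
        return True

    def read(self):
        """Return the oldest word, or None when the FIFO is empty."""
        if self.empty:
            return None
        return self._data.popleft()
```

The ALMOST flags give the producer or consumer a few words of warning, which matters when stopping a data stream takes several clock cycles.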
Advanced Clocking
Proper clocking is extremely important
for performance and reliability
Most designs need many global clock lines
with minimal clock delay and clock skew
Digital Clock Manager (DCM) provides:
Four-phase outputs,
Frequency multiplication and division
Fine phase adjustment
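The DCM functions listed above come down to simple ratios. The sketch below assumes the four phase outputs are at 0, 90, 180 and 270 degrees and that the synthesized frequency is the input multiplied by M and divided by D; the function names and the example values are illustrative, not taken from a data sheet.

```python
from fractions import Fraction

def synthesized_mhz(f_in_mhz, multiply, divide):
    """Output frequency of an M/D frequency synthesizer."""
    return Fraction(f_in_mhz) * multiply / divide

def phase_delays_ns(f_mhz, phases_deg=(0, 90, 180, 270)):
    """Delay of each phase output relative to the 0 degree output."""
    period_ns = Fraction(1000) / f_mhz
    return {deg: period_ns * deg / 360 for deg in phases_deg}

print(synthesized_mhz(100, 7, 2))   # 100 MHz in, M = 7, D = 2
print(phase_delays_ns(100))         # quarter-period steps of a 10 ns clock
```

Exact fractions are used so that the ratios do not pick up floating-point rounding.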
Advanced I/O
>50 Different Output Standards
(strength, voltage, input threshold, etc)
multiple parallel output transistors
which are either fully on or fully off,
Nothing is ever analog, except in LVDS
Digitally Controlled Impedance (DCI)
for series termination of transmission-line drivers
Adjusts pull-up and pull-down strength to match an external resistor
One external pull-up and pull-down resistor per bank
V2Pro and Virtex-4 can “update-only-if-necessary”
System Synchronous
System-synchronous: the clock arrives “simultaneously” at all chips
Typically used below 200 MHz clock rate
On-chip clock distribution DCM
Zero clock delay controls set-up time and avoids hold time requirements
The traditional design methodology
Source Synchronous
Each data bus has its own clock trace
typically used at 200 to 800 MHz clock rate
On-chip clock-distribution DCM
centers the clock in the data eye
Adds more unidirectional-only clock lines
The only way above 300 MHz
Serial Transceiver Technology
Virtex-II Pro to Virtex-II Pro: 3.125 Gbps over each pair, 32b @ 78 MHz parallel interface at each end
Serial Transceiver Technology
Virtex-4 to Virtex-4: up to 11.1 Gbps over each pair, 64b @ 168 MHz parallel interface at each end
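The serial rates and parallel interface figures on these two slides are tied together by the line coding overhead. The sketch below assumes 8b/10b coding for the 3.125 Gbps link and 64b/66b coding for the 11.1 Gbps link; the slides name both codes but do not say which one applies to which figure, so that pairing is an assumption.

```python
def line_rate_gbps(width_bits, clock_mhz, payload_bits, coded_bits):
    """Serial line rate for a parallel interface after line coding."""
    payload_mbps = width_bits * clock_mhz
    return payload_mbps * coded_bits / payload_bits / 1000.0

# 32 bits at 78.125 MHz (the slide rounds this to 78 MHz), 8b/10b coded.
v2pro = line_rate_gbps(32, 78.125, 8, 10)

# 64 bits at 168 MHz, 64b/66b coded.
v4 = line_rate_gbps(64, 168, 64, 66)

print(f"Virtex-II Pro: {v2pro:.3f} Gbps, Virtex-4: {v4:.3f} Gbps")
```

With these assumptions the two results come out at 3.125 Gbps and about 11.09 Gbps, matching the quoted line rates.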
RocketIO™ Multi-Gigabit Transceiver
8 to 24 per device
Transmitter: TXDATA (8-64b wide, 78 MHz to 700 MHz), FIFO, encode, serializer, transmit buffer, serial out
TX clock generator: REFCLK with 16X/20X multiplier
Receiver: serial in (622 Mb/s to 11.1 Gb/s), receive buffer, deserializer, comma detect and word alignment, decode, elastic buffer, RXDATA (8-64b wide)
RX clock generator
Programmable Features:
64b/66b or 8b/10b EnDec
Comma Detect
Rx and Tx FIFO
Pre-Emphasis
Receiver Equalization
Output Swing
On-Chip Termination
Channel bonding
AC & DC Coupling
Virtex-4 Capabilities
Any type of design runs at >400 MHz
Pipelining provides extra performance “for free”
Synchronous is best, but 32 clocks are available
Gigabit serial saves pins and board area
On-chip termination for board signal integrity
I/O features support double-data rate operation
and source-synchronous design
Virtex-4 Capabilities
Popular functions are hard-wired
for lower cost, higher performance, and ease-of-use:
microprocessors, FIFOs, serial I/O, clock management, etc.
Many pre-tested soft cores are available
Some are free, some for a fee
One-hot state machines are preferred
But MicroBlaze and PicoBlaze may be better
Massive parallelism enhances DSP,
Up to 1024 fast two’s complement multipliers per chip,
faster than dedicated DSP chips, but needs system-rethinking
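On the one-hot state machine point above: one-hot encoding spends one flip-flop per state instead of a packed binary code, which suits FPGAs because flip-flops are plentiful and the next-state logic stays shallow. A small illustration, with the state names and helper functions invented for this sketch:

```python
from enum import IntEnum

class State(IntEnum):
    # Hypothetical four-state controller, used only as an example.
    IDLE = 0
    LOAD = 1
    RUN = 2
    DONE = 3

def one_hot(state):
    """One-hot code: exactly one bit set, at the position of the state index."""
    return 1 << state

def binary_bits(n_states):
    """Flip-flops needed for a packed binary encoding."""
    return max(1, (n_states - 1).bit_length())

def one_hot_bits(n_states):
    """Flip-flops needed for a one-hot encoding."""
    return n_states

def is_valid_one_hot(code):
    """True when exactly one bit of the code is set."""
    return code != 0 and code & (code - 1) == 0

for s in State:
    print(f"{s.name:5s} binary={int(s):02b} one-hot={one_hot(s):04b}")
```

Checking the current state takes a single bit test in one-hot form, whereas a binary code needs a decoder over all state bits.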
2004 Challenges
Technology moves rapidly: 130, 90, 65 nm
Multiple Vcc, lower voltage - higher current
Lower Vcc makes decoupling very critical
Moore’s law becomes more difficult to sustain
Leakage current has increased significantly
Triple-oxide transistors and clever design provide relief
Signal integrity on pc-boards is crucial
“homebrew” prototyping would waste money and time
Use Standard Evaluation Boards Instead