Programmable Logic (PPT Slides)


ENG241
Digital Design
Week #12
Programmable Logic Technologies
Week #12 Topics
• The Von Neumann Architecture
• What is Programmable Logic?
• Classification of Programmable Logic
• Field Programmable Gate Arrays
• FPGA CAD
• Applications
• Summary
2
Resources

Chapter #10, Mano Sections

10.3 Programmable Implementations Tech
3
The Von Neumann Computer

Principle
In 1945, the mathematician Von Neumann (VN) showed in a study of computation that a computer could have a simple, fixed structure capable of executing any kind of program, given a properly programmed control unit, without the need for hardware modification.
ENIAC: the first general-purpose electronic computer (1946)
4
The Von Neumann Computer

Structure
• An arithmetic and logic unit (ALU), also called the datapath, for program execution
• A control unit (control path) featuring a program counter for controlling program execution
• A memory for storing program and data
• The memory consists of words of equal length
[Figure: Von Neumann structure. Processor (central processing unit): datapath, registers, control unit, instruction register, PC, address register. Memory: data and instructions. Links between them: address and data.]
5
The Von Neumann Computer


Coding
• A program is coded as a set of instructions to be executed sequentially
Program execution
• Instruction Fetch (IF): the next instruction to be executed is fetched from memory
• Decode (D): the instruction is decoded (which operation?)
• Read operand (R): operands are read from memory
• Execute (EX): the operation is executed on the ALU
• Write result (W): results are written back to memory
• Instructions execute in a cycle (IF, D, R, EX, W)
What is the problem with this computing paradigm?
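The execution cycle above can be sketched as a toy simulator. This is only an illustration: the four-field instruction format, the opcodes ADD/MUL/HALT, and the memory layout are hypothetical and not taken from the slides. One memory holds both instructions and data, and every instruction passes through IF, D, R, EX, W before the next one starts.

```python
# Toy Von Neumann machine: one memory holds both instructions and data.
# The instruction format and opcodes are hypothetical, for illustration only.

def run(memory, max_steps=100):
    pc = 0                      # program counter
    trace = []                  # one entry per completed instruction
    for _ in range(max_steps):
        instr = memory[pc]      # IF: fetch the next instruction from memory
        pc += 1
        op, dst, src1, src2 = instr            # D: decode the operation
        if op == "HALT":
            break
        a, b = memory[src1], memory[src2]      # R: read operands from memory
        if op == "ADD":                        # EX: execute on the "ALU"
            result = a + b
        elif op == "MUL":
            result = a * b
        else:
            raise ValueError(f"unknown opcode {op!r}")
        memory[dst] = result                   # W: write the result back
        trace.append((op, "IF", "D", "R", "EX", "W"))
    return memory, trace

# Program computing (A*x + B)*x + C with A=2, B=3, C=4, x=5.
memory = [
    ("MUL", 12, 8, 11),    # t = A * x
    ("ADD", 12, 12, 9),    # t = t + B
    ("MUL", 12, 12, 11),   # t = t * x
    ("ADD", 12, 12, 10),   # t = t + C
    ("HALT", 0, 0, 0),
    None, None, None,      # unused words
    2, 3, 4, 5,            # data: A, B, C, x at addresses 8..11
    0,                     # result t at address 12
]
final_memory, trace = run(memory)
print(final_memory[12])    # prints 69
print(len(trace))          # prints 4
```

Every instruction touches the same memory several times and the operations run strictly one after another, which is the kind of behaviour the question on this slide is pointing at.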
6
Bottlenecks in VN Architecture
7
The Von Neumann Computer



Advantages:
• Simplicity.
• Flexibility: any well-coded program can be executed.
Drawbacks:
• Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing).
• Resource efficiency: only one part of the hardware resources is required to execute an instruction; the rest remains idle.
• Memory access: memories are about 5 times slower than the processor.
How to compensate for deficiencies?
8
Improving Performance of VN (GPPs)
1. Technology Scaling
   • Improve performance (increase clock frequency!)
2. Improving Instruction Set of Processor
3. Application Specific Processors (DSP)
4. Use of Hierarchical Memory System
   • Cache can enhance speed
5. Multiplicity of Functional Units (H/W)
   • Adders/Multipliers/Dividers (CDC-6600)
6. Pipelining within CPU (H/W)
   • A four-stage pipeline (IF/ID/OF/EX)
7. Overlap CPU & I/O Operations (H/W)
   • DMA (Direct Memory Access) can be used to enhance performance
8. Time Sharing (S/W)
   • Multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism & Multithreading (S/W) (H/W)
   • Compilers/Multi-core systems
ENGG3380, ENGG4540
9
Spatial vs. Temporal Computing
Von Neumann Architecture
(Ax + B)x + C
Temporal (Processor)
10
Spatial vs. Temporal Computing
Von Neumann Architecture
Ax^2 + Bx + C
Spatial (ASIC or FPGA)
(Ax + B)x + C
Temporal (Processor)
ENGG3050
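The contrast between the two forms can be sketched in a few lines. The step and level counts below assume one time unit per operation, which is a simplification made here and not a figure from the slides. The temporal version reuses one ALU for four sequential steps; the spatial version spends five operators so that independent operations happen at the same time.

```python
def temporal(A, B, C, x):
    """(A*x + B)*x + C on one shared ALU: four operations, one per step."""
    t = A * x      # step 1
    t = t + B      # step 2
    t = t * x      # step 3
    t = t + C      # step 4
    return t, 4    # value, number of sequential steps

def spatial(A, B, C, x):
    """A*x^2 + B*x + C with one operator per operation, grouped by level."""
    # Level 1: two multipliers operate at the same time.
    x2, bx = x * x, B * x
    # Level 2: a multiplier and an adder operate at the same time.
    ax2, bx_c = A * x2, bx + C
    # Level 3: the final adder.
    y = ax2 + bx_c
    return y, 3, 5  # value, levels on the longest path, operators used

print(temporal(2, 3, 4, 5))   # prints (69, 4)
print(spatial(2, 3, 4, 5))    # prints (69, 3, 5)
```

Both forms give the same value. The spatial form trades more hardware (three multipliers, two adders) for a shorter dependency chain, and a circuit laid out this way could also accept a new x while earlier ones are still moving through the levels.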
11
Temporal vs. Spatial Based Computing
Temporal-based execution
(software)
Spatial-based execution
(reconfigurable computing)
Ability to extract parallelism (or concurrency) from
algorithm descriptions is the key to acceleration
using reconfigurable computing
12
Gerald Estrin Fix-Plus Machine
• Attempts to build a flexible hardware structure that can be dynamically modified at run-time to compute a desired function are almost as old as the other computing paradigms.
• In 1959, Gerald Estrin, at UCLA, introduced the concept of reconfigurable computing with the Fix-Plus Machine.
Estrin at work.
Substantial efforts on Reconfiguration
Programmable Logic I



We learnt in the first part of this course that any combinational logic circuit can be implemented as a sum of minterms (a sum-of-products, SOP, form).
If we can control which AND gates are used, and also control the inputs to the OR gate, then we can design a programmable logic circuit.
Remember when we used a decoder to implement any Boolean function? That was a form of programmable logic!
[Figure: a programmable AND array and a programmable OR array.]
14
I. Programmable AND Array
o If we remove fuses Faf and Fbt, this disconnects the complementary version of input ‘a’ and the true version of input ‘b’.
o This leaves the device to perform its new function: y = a AND b’.
o The process of removing fuses is typically referred to as programming the device (blowing or burning the device).
o Devices based on fusible-link technology are said to be One Time Programmable (OTP).
o Remember: FPGAs are not based on this type of technology.
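[Figure: inputs a and b, each in true and complemented (NOT) form, reach an AND gate through fuses; pull-up resistors connect to logic 1. With fuses Fat and Fbf left intact, y = a & !b.]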
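The fuse behaviour described on this slide can be modelled as follows. This is a minimal sketch: the function name and the dictionary representation are made up for illustration, and it assumes that an AND input whose fuse has been removed is held at logic 1 by its pull-up resistor.

```python
def and_array(a, b, fuses):
    """AND of every signal whose fuse is intact; blown fuses read as logic 1."""
    signals = {"Fat": a, "Faf": 1 - a, "Fbt": b, "Fbf": 1 - b}
    y = 1
    for name, value in signals.items():
        # The pull-up supplies logic 1 when the fuse is gone.
        y &= value if fuses[name] else 1
    return y

# Unprogrammed device: all four fuses intact, so y = a & !a & b & !b = 0.
unprogrammed = {"Fat": True, "Faf": True, "Fbt": True, "Fbf": True}

# Programmed device: Faf and Fbt removed, leaving y = a & !b.
programmed = {"Fat": True, "Faf": False, "Fbt": False, "Fbf": True}

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_array(a, b, unprogrammed), and_array(a, b, programmed))
```

Only the input combination a = 1, b = 0 produces 1 on the programmed device, and there is no way to restore a removed fuse, which is why such devices are one-time programmable.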
15
Decoders: Implementing Logic

Example: Implement the following Boolean function
S(A2,A1,A0) = SUM(m(1,2,4,7))
1. Since there are three inputs, we need a 3-to-8 line decoder.
2. The decoder generates the eight minterms for inputs A0, A1, A2.
3. An OR gate forms the logical sum of the required minterms.
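The three steps above can be checked with a short sketch. The function names are chosen here for illustration; the decoder is modelled as a list of eight outputs, exactly one of which is 1.

```python
def decoder_3to8(a2, a1, a0):
    """3-to-8 line decoder: output i is 1 exactly when the inputs encode i."""
    index = (a2 << 2) | (a1 << 1) | a0
    return [1 if i == index else 0 for i in range(8)]

def S(a2, a1, a0):
    """S = SUM(m(1,2,4,7)): OR together the selected decoder outputs."""
    m = decoder_3to8(a2, a1, a0)
    return m[1] | m[2] | m[4] | m[7]

for value in range(8):
    a2, a1, a0 = (value >> 2) & 1, (value >> 1) & 1, value & 1
    print(a2, a1, a0, S(a2, a1, a0))
```

Minterms 1, 2, 4 and 7 are the input combinations with an odd number of 1s, so S here is the three-input XOR (the sum output of a full adder). Choosing a different set of decoder outputs for the OR gate gives a different function with the same decoder.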
16
II. Programmable OR Array
17
Programmable Boolean Functions
Multiplexers can also be used to realize Boolean
functions since they consist of an array of AND gates
followed by an OR gate.
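As a sketch of this idea, the function S(A2,A1,A0) = SUM(m(1,2,4,7)) from the decoder example can be built with a 4:1 multiplexer. Using A2 and A1 as select lines and feeding A0 or its complement to the data inputs is one common arrangement, assumed here for illustration; the function names are hypothetical.

```python
def mux(data, select):
    """N:1 multiplexer: AND each data input with its select code, then OR."""
    out = 0
    for i, d in enumerate(data):
        out |= d & (1 if i == select else 0)
    return out

def S_with_mux(a2, a1, a0):
    # One data input per (A2, A1) pair, each a function of A0 only:
    #   00 -> minterm 1 -> A0      01 -> minterm 2 -> A0'
    #   10 -> minterm 4 -> A0'     11 -> minterm 7 -> A0
    data = [a0, 1 - a0, 1 - a0, a0]
    return mux(data, (a2 << 1) | a1)

for value in range(8):
    a2, a1, a0 = (value >> 2) & 1, (value >> 1) & 1, value & 1
    print(a2, a1, a0, S_with_mux(a2, a1, a0))
```

Changing the four data inputs (to 0, 1, A0 or A0') changes the function without touching the multiplexer itself.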
18
Classification of PLDs
Programmable
Or Array
Programmable
AND array
Programmable
AND array
Programmable
Or Array
19
Classification
Programmable Logic Devices
The first programmable ICs were generically referred to as programmable logic devices (PLDs).
PLDs
  Simple PLDs (SPLDs)
    PROMs
    PLAs
    PALs
    GALs
    etc.
  Complex PLDs (CPLDs)
20
Programmable Logic Array (PLA)
[Figure: PLA circuit. The output stage acts like a programmable inverter: tied to 0, F1 is not inverted; tied to 1, F1 is inverted.]
21
Complex PLDs (CPLDs)
The integration of several Simple PLD blocks with a programmable interconnect on a single chip yields a CPLD.
[Figure: PLD blocks and I/O blocks connected through an interconnection matrix.]
22
III. SRAM FPGAs:
Memory units can be used to implement a Boolean function
by storing the output of the truth table in the memory and
accessing the values by using variables of the truth table
as address lines.
A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 1
0 0 1 0 | 1
0 0 1 1 | 1
0 1 0 0 | 0
0 1 0 1 | 1
0 1 1 0 | 1
0 1 1 1 | 1
1 0 0 0 | 0
1 0 0 1 | 1
1 0 1 0 | 1
1 0 1 1 | 1
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | (not shown)
[Figure: LUT implementation (A, B, C, D address a LUT that outputs Z) next to the equivalent gate implementation.]
23
Generic FPGA architecture:
Configurable Logic Block (CLB): LUT + FF
Connection
Block
Wire segments
Switch Block
Routing Channels
I/O pad
24
SRAM based Programmable Cell
o There are two main versions of semiconductor RAM devices: Dynamic RAM (DRAM) and Static RAM (SRAM).
o SRAM cells can be used to switch NMOS transistors on or off.
o This is very useful for controlling multiplexers, routing, etc.
SRAM
25
Pass Transistor
o An SRAM cell can drive the gate (G) terminal of an NMOS transistor.
o If the SRAM cell (M) = 1, the signal passes from source (S) to drain (D).
o An SRAM cell can be attached to the select line of a MUX to control it.
26
Look Up Table (LUT)
o A LUT can realize any Boolean function of its inputs.
o Assume the function to be realized is y = (a & b) | !c. This can be achieved by loading the LUT with the appropriate output values.
Required function: y = (a & b) | !c
Truth table and programmed LUT (the y column is the content of the SRAM cells):
a b c | y
0 0 0 | 1
0 0 1 | 0
0 1 0 | 1
0 1 1 | 0
1 0 0 | 1
1 0 1 | 0
1 1 0 | 1
1 1 1 | 1
[Figure: the eight SRAM cells feed an 8:1 multiplexer whose select inputs are a, b, c (addresses 000 to 111) and whose output is y.]
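The LUT on this slide can be sketched as a list of eight configuration bits read through an address formed from a, b and c. The helper names `program_lut` and `read_lut` are hypothetical; the function y = (a & b) | !c is the one stated on the slide.

```python
def program_lut(func, n_inputs):
    """Fill the SRAM cells with func evaluated at every input combination."""
    cells = []
    for address in range(2 ** n_inputs):
        # Most significant address bit corresponds to the first input.
        bits = [(address >> (n_inputs - 1 - i)) & 1 for i in range(n_inputs)]
        cells.append(func(*bits) & 1)
    return cells

def read_lut(cells, *inputs):
    """Act as the multiplexer: the inputs select one SRAM cell."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return cells[address]

cells = program_lut(lambda a, b, c: (a & b) | (1 - c), 3)
print(cells)                     # prints [1, 0, 1, 0, 1, 0, 1, 1]
print(read_lut(cells, 1, 1, 1))  # prints 1
print(read_lut(cells, 0, 1, 1))  # prints 0
```

Loading a different list of eight bits turns the same structure into any other three-input function, which is what "programming" a LUT means.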
27
Configurable Logic Block (CLB)
A configurable logic block consists of a lookup table (LUT), a register that can act as a flip-flop or a latch, and a multiplexer, along with a few other elements.
[Figure: inputs a, b, c, d; a 3-input LUT with output y; a mux; a flip-flop with clock input and output q.]
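A simplified model of such a block is sketched below. The wiring is an assumption based only on the labels in the figure: a, b, c address the 3-input LUT, the multiplexer picks either the LUT output y or the extra input d, and the flip-flop captures the selected value on each clock. The class and attribute names are hypothetical, and real CLBs contain more elements than this.

```python
class CLB:
    """Simplified configurable logic block: 3-input LUT, mux, flip-flop."""

    def __init__(self, lut_cells, use_lut=True):
        self.lut_cells = list(lut_cells)   # 8 configuration bits for the LUT
        self.use_lut = use_lut             # configuration bit driving the mux
        self.q = 0                         # flip-flop state

    def lut(self, a, b, c):
        return self.lut_cells[(a << 2) | (b << 1) | c]

    def clock(self, a, b, c, d):
        """One active clock edge: the flip-flop captures the mux output."""
        y = self.lut(a, b, c)
        self.q = y if self.use_lut else d
        return self.q

# LUT contents for y = (a & b) | !c
clb = CLB([1, 0, 1, 0, 1, 0, 1, 1])
print(clb.clock(1, 1, 1, 0))   # prints 1
print(clb.clock(0, 0, 1, 1))   # prints 0

bypass = CLB([1, 0, 1, 0, 1, 0, 1, 1], use_lut=False)
print(bypass.clock(0, 0, 1, 1))  # prints 1
```

Everything that distinguishes one configuration from another (the eight LUT bits and the mux select) is stored state, which is what the SRAM cells hold in an SRAM-based FPGA.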
28
29
Xilinx CLB
Switch Matrix
o Connections between CLBs and IOBs are made using wiring segments in both horizontal and vertical channels lying between the various blocks.
o Four segments meet at each switch point, where there are 6 pass transistors.
30
Xilinx IOB
31
CAD for FPGAs
[Figure: FPGA CAD flow with the stages Design Entry, Synthesis, Logic Optimization, Mapping to k-LUT, Packing LUTs to CLBs, Placement, Routing, Simulation, and Configure an FPGA.]
32
CAD for FPGAs: Place & Route
[Figure: the same CAD flow stages as on the previous slide.]
33
Programming an FPGA?
[Figure: a circuit with inputs A to F and outputs f1, f2, f3 taken through technology mapping, placement, and routing.]
34
FPGA Placement Problem
• Input – A technology mapped netlist of Configurable
Logic Blocks (CLB) realizing a given circuit.
• Output – CLB netlist placed in a two dimensional array
of slots such that total wirelength is minimized.
[Figure: a CLB netlist with inputs i1 to i4, blocks 1 to 10, and outputs f1 and f2, shown next to its placement on the FPGA.]
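The placement objective can be illustrated with a very small example. Two things here are assumptions, not taken from the slides: wirelength is estimated with the half-perimeter of each net's bounding box (a commonly used estimate), and the best placement is found by trying every assignment, which only works for a handful of blocks. The block names and nets are made up.

```python
from itertools import permutations

def hpwl(nets, position):
    """Total half-perimeter wirelength of all nets for a given placement."""
    total = 0
    for net in nets:
        xs = [position[block][0] for block in net]
        ys = [position[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def best_placement(blocks, nets, slots):
    """Exhaustively try every assignment of blocks to slots (tiny cases only)."""
    best_cost, best_pos = None, None
    for perm in permutations(slots, len(blocks)):
        pos = dict(zip(blocks, perm))
        cost = hpwl(nets, pos)
        if best_cost is None or cost < best_cost:
            best_cost, best_pos = cost, pos
    return best_cost, best_pos

# Hypothetical netlist: four CLBs connected in a chain, placed on a 2x2 array.
blocks = ["c1", "c2", "c3", "c4"]
nets = [("c1", "c2"), ("c2", "c3"), ("c3", "c4")]
slots = [(x, y) for x in range(2) for y in range(2)]

naive = dict(zip(blocks, slots))          # blocks dropped into slots in order
print(hpwl(nets, naive))                  # prints 4
best_cost, best_pos = best_placement(blocks, nets, slots)
print(best_cost)                          # prints 3
```

The number of assignments grows factorially with the number of blocks, so practical placers rely on heuristics such as simulated annealing instead of exhaustive search.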
35
Global vs. Detailed Routing

Global routing
[Figure: grid of logic blocks (LB) and switch blocks (SB) illustrating global routing.]
Detailed routing
[Figure: logic blocks (LB) and switch blocks (SB) illustrating detailed routing.]
36
Remember!
• Programmable Lookup Tables (LUTs)
• Programmable routing structure
The main bottleneck in state-of-the-art fine-grain FPGAs is the routing enabled by pass transistors!
37
Remember!
• Programmable Lookup Tables (LUTs)
• Programmable routing structure
[Figure: a LUT with inputs x, y, z and output f, built from SRAM cells that hold the truth-table values.]
Look-up tables are flexible but require a lot of configuration data and suffer from power dissipation!
38
Fine Grain FPGAs: Spartan2
o 4K bit RAM blocks
o Large amount of logic
o Program stored in SRAM
39
Medium Grain: Xilinx Virtex
• Virtex-II FPGA introduced, followed by Virtex-II Pro in 2003
  – 444 18x18 multipliers & 18 Kbit block RAMs introduced
  – Gbit serial I/O communications & PowerPC processors introduced
  – Complex floating-point algorithm implementation now possible
• Virtex-II / Pro
  – 44,000 logic slices
  – 444 18 Kbit BRAMs
  – 444 18x18 multipliers
  – 2 PowerPC processors
  – 20 Gbit I/O
  – 1164 max user I/O
40
Zynq - Extensible Processing Platform
41
Dynamic Partial Reconfiguration


Partial Reconfiguration is the ability to dynamically modify blocks of logic while the remaining logic continues to operate without interruption.
Computation sequences are not known at compile time. The system decides on, or reacts dynamically to, application-driven reconfiguration requests.
[Figure: a full bit file is loaded through the configuration port; partial bit files for functions A1 to A3, B1 and B2, and C1 and C2 are loaded through the configuration port or ICAP.]
42
Methods for executing algorithms
Hardware (Application Specific Integrated Circuits)
Advantages:
• very high performance and efficiency
Disadvantages:
• not flexible (can’t be altered after fabrication)
• expensive
Reconfigurable computing
Advantages:
• fills the gap between hardware and software
• much higher performance than software
• higher level of flexibility than hardware
Software-programmed processors
Advantages:
• software is very flexible to change
Disadvantages:
• performance can suffer if clock is not fast
• instruction set fixed by hardware
43
Reconfigurable Devices
Reconfigurable Devices (RD) are used in many different ways:
1. Rapid prototyping
2. Non-frequently reconfigured systems
3. Frequently reconfigured systems
4. High Performance Computing (acceleration of complex algorithms)
44
1. Rapid prototyping

• Testing hardware in real conditions before fabrication
• Software simulation
  – Relatively inexpensive
  – Slow
  – Accuracy?
• Hardware emulation
  – Hardware testing under real operating conditions
  – Fast
  – Accurate
  – Allows several iterations
APTIX System Explorer
ITALTEL FLEXBENCH
45
2. Non-Frequent Reconfiguration
46
3. Frequently Reconfigured
Computing systems that are able to adapt their behaviour and structure to changing operating and environmental conditions, time-varying optimization objectives, and physical constraints like changing protocols, new standards, or dynamically changing operation conditions of technical systems.
47
4. Algorithm Acceleration
Real Time Video Processing
- Single precision floating point calculations
- 36 GFlops + 40 GOPs sustained performance on a single PCI card
- >200 times power reduction over Xeon
Gravity Simulation
- N-body computation
- Single precision floating point
- 20 GFlops sustained performance
- 100 times faster than a 2.4 GHz Pentium 4 CPU
48
fMRI and Real-time Human Body Imaging
• Technique for determining which parts of the brain are activated
by different types of physical sensation or activity – “brain
mapping”
• High- and low-resolution scans compared using numerous FFTs
– Typically post-processed
– Much error correction needed due to subject movement
– 3D data representation requires a good deal of conventional processing
• Studying how RC devices can achieve real-time processing
Figures c/o University of Oxford, UK
49
Image Registration
• In computer vision, sets of data acquired by sampling the
same scene or object at different times, or from different
perspectives, will be in different coordinate systems.
• Image registration is the process of transforming the
different sets of data into one coordinate system.
• Registration is necessary in order to be able to compare or
integrate the data obtained from different measurements.
50
Biomechanical Kinematics
• Knee-joint simulation*
– Build a generic model to predict human movement (jumping, walking, etc)
– Used to study joint replacement stresses without risking patient injury
– Biomechanical simulations frequently use costly optimization methods
– Studying how RC-based parallel processing can increase performance
Figures c/o UF Computational Biomechanics Lab
51
Satellite Imaging
• Satellite imaging is used for mapping, environmental studies and defense applications
• High data rate and low power demands of space require cutting-edge technology such as RC to provide the required processing capabilities
• Including RC devices in the processing chain will eventually enhance performance
c/o US Air Force
c/o LANL
GMTI processing chain: Receive Cube → Pulse Compression → Doppler Processing → Space-Time Adaptive Processing (STAP) → Constant False Alarm Rate (CFAR) → Send Results
Corner Turn; partitioned along range dimension; partitioned along pulse dimension
c/o LANL
52
…Towards a safe use of on board Support Systems and Services:
The AIDE Integrated Project
Adaptive Integrated Driver Vehicle Interface
Microphone
GPS
antenna
Sensor box for
Curve Warning,
Navigation, DVE
Compact PC
for Curve Warning
Real time controller
for Gateway
Industrial PC
for ICA, HMI,
Speech I/O,
Navigation, etc.
Haptic barrel key
Vehicle server
PC for DVE
Radar sensor
for Frontal Collision Warning
CRF Demonstrator Vehicle
Car Radio / CD
CMOS Camera for Lane
Departure Warning
USB MP3 player
Data processing
unit for Frontal
Collision
Warning
Navigation System
BT link to Nomadic Devices
Reconfigurable
LCD Display
AIDE Integrated Project’s OEMs:
Volvo, CRF, PSA, Renault, DaimlerChrysler, Ford, BMW, SEAT, OPEL
Image Processing
Unit for Lane
Departure Warning
53
ITS
Driving Assistance - Information Support
www.seeingmachines.com
Summary
o Programmable logic comes in different flavors such as PLDs, CPLDs and FPGAs.
o Field Programmable Gate Arrays are a technology introduced in the 1980s that allows engineers to implement their designs without the need to fabricate a chip, as we do with Application Specific Integrated Circuits (ASICs).
o The main components of an FPGA are the CLBs, IOBs and programmable interconnect (Fine Grain FPGAs).
o Newer FPGAs also include block memory, processors and multipliers (we start to call these Coarse Grain FPGAs).
o FPGAs are applied in HPC, embedded systems, cars, appliances, and many other areas.
55
Programmable Logic Array (PLA)
o The PLA is similar in concept to the PROM, except that the PLA does not provide full decoding of the variables and does not generate all the minterms.
o The decoder is replaced by an array of AND gates that can be programmed to generate product terms of the input variables.
o The product terms are then selectively connected to OR gates to provide the sum of products for the required Boolean functions.
57
Programming
• Programming the PLA can be specified in tabular form
• The table has 3 sections:
  1. product terms,
  2. connections between inputs and AND gates,
  3. outputs
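A PLA programming table of this kind can be written down and evaluated directly. The table below is a hypothetical example made up for illustration (F1 = AB' + AC, F2 = AC + BC), not the one from the slide. Each row gives a product term, its connections to the inputs A, B, C, and the outputs it feeds.

```python
# Each row: (connections to inputs A, B, C; outputs fed by this product term)
# '1' = true input connected, '0' = complemented input connected,
# '-' = input not connected to this AND gate.
PLA_TABLE = [
    ("10-", {"F1"}),          # product term 1: A B'
    ("1-1", {"F1", "F2"}),    # product term 2: A C (shared by both outputs)
    ("-11", {"F2"}),          # product term 3: B C
]

def pla(inputs, table=PLA_TABLE, outputs=("F1", "F2")):
    """Evaluate the AND plane row by row, then OR terms into the outputs."""
    result = {name: 0 for name in outputs}
    for pattern, feeds in table:
        term = all(p == "-" or int(p) == bit for p, bit in zip(pattern, inputs))
        if term:
            for name in feeds:
                result[name] = 1
    return result

for value in range(8):
    a, b, c = (value >> 2) & 1, (value >> 1) & 1, value & 1
    print(a, b, c, pla((a, b, c)))
```

Only three product terms are generated instead of all eight minterms, and the term AC is shared between the two outputs, which is the saving a PLA offers over a PROM.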
58