Lecture 6: Vector

Lecture 14:
(Re)configurable Computing
Case Studies
Prof. Jan Rabaey
Computer Science 252, Spring 2000
The contributions of Andre Dehon (Caltech) and Ravi Subramanian (Morphics)
to this slide set are gratefully acknowledged
JR.S00 1
Summary of Previous Class
• Configurable Computing using “programming
in space” versus “programming in time” for
traditional instruction-set computers
• Key design choices
– Computational units and their granularity
– Interconnect Network
– (Re)configuration time and frequency
• Next class: Some practical examples of
reconfigurable computers
JR.S00 2
Applicability of Configurable Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded
processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator coprocessors to embedded micros and DSPs
– Chameleon, Pleiades, Morphics
JR.S00 3
Stand-Alone Computational
Engines
UCLA Mojave System
Template Matching for Automatic
Target Recognition
I960 Board
JR.S00 4
As Programmable Interface and I/O
Processor
• Logic used in place of
– ASIC environment
customization
– external FPGA/PLD
devices
• Example
– bus protocols
– peripherals
– sensors, actuators
• Case for:
– Always have some system
adaptation to do – varying
glue logic requirements
– Modern chips have
capacity to hold processor
+ glue logic
– Reduces part count
– Value added must now be
accommodated on chip
(formerly board level)
JR.S00 5
Example: Interface/Peripherals
• Triscend E5
JR.S00 6
Model: IO Processor
• Array dedicated to servicing IO channel
– sensor, lan, wan, peripheral
• Provides
– protocol handling
– stream computation
» compression, encrypt
• Looks like IO peripheral to processor
• Case for:
– many protocols, services
– only need few at a time
– dedicate attention, offload processor
JR.S00 7
IO Processing
• Single threaded processor
– cannot continuously monitor multiple data pipes (src,
sink)
– need some minimal, local control to handle events
– for performance or real-time guarantees, may need to
service event rapidly
– E.g. checksum (decode) and acknowledge packet
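As a rough illustration of the "checksum and acknowledge" event mentioned above, the sketch below verifies a 16-bit ones'-complement checksum (the scheme used by the Internet checksum) and decides whether a packet should be acknowledged. The packet layout and the names `ones_complement_sum` and `should_ack` are assumptions made for this example; the lecture does not specify a protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 16-bit ones'-complement sum over big-endian words.
 * An odd trailing byte is padded with a zero low byte. */
static uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8);
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Returns 1 if the packet (payload followed by its checksum field) is
 * intact and should be acknowledged, 0 otherwise. */
static int should_ack(const uint8_t *packet, size_t len)
{
    return ones_complement_sum(packet, len) == 0xFFFFu;
}
```

A small array sitting next to the IO channel could run this kind of check on every packet without interrupting the single-threaded host.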
JR.S00 8
NAPA 1000 Block Diagram
[Figure: NAPA 1000 block diagram. Blocks:
– TBT: ToggleBus™ Transceiver
– System Port
– External Memory Interface
– CR32: CompactRISC™ 32 Bit Processor
– RPC: Reconfigurable Pipeline Cntr
– BIU: Bus Interface Unit
– PMA: Pipeline Memory Array
– CR32 Peripheral Devices
– SMA: Scratchpad Memory Array
– ALP: Adaptive Logic Processor
– CIO: Configurable I/O]
Source: National Semiconductor
JR.S00 9
NAPA 1000 as IO Processor
[Figure: NAPA1000 used as an IO processor. Labels: system host; System Port; NAPA1000; CIO; application-specific sensors, actuators, or other circuits; Memory Interface; ROM & DRAM]
Source: National Semiconductor
JR.S00 10
I/O Stream Processor
Combines Trimedia VLIW with
Configurable media co-processors
Philips Nexperia NX-2700
A programmable HDTV
media processor
JR.S00 11
Model: Instruction Augmentation
• Observation: Instruction Bandwidth
– Processor can only describe a small number of basic
computations in a cycle
» $I$ bits → $2^I$ operations
– This is a small fraction of the operations one could do even in
terms of $w \times w \to w$ ops
» $w \cdot 2^{2^{2w}}$ operations
– Processor could have to issue $w \cdot 2^{(2^{2w} - I)}$ operations just to
describe some computations
– An a priori selected base set of functions could be very bad for
some applications
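To get a feel for the size of this gap, the sketch below takes the slide's count of w·2^(2^(2w)) possible operations at face value and computes how many bits it would take to name one of them, for comparison with an I-bit instruction word. The function name `name_bits` is an assumption for illustration, and the logarithm of w is rounded up.

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed to name one of w * 2^(2^(2w)) operations:
 * 2^(2w) + ceil(log2(w)). Computed in the log domain because the
 * count itself overflows any machine integer for w > 2. */
static uint64_t name_bits(unsigned w)
{
    unsigned lg = 0;
    assert(w >= 1 && w <= 31);
    while ((1u << lg) < w)      /* lg = ceil(log2(w)) */
        lg++;
    return ((uint64_t)1 << (2 * w)) + lg;
}
```

For 8-bit operands this gives 65,539 bits, against the 32 bits or so of a typical instruction word, which is why a fixed base set of operations covers only a vanishing fraction of what the datapath could compute.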
JR.S00 12
Instruction Augmentation
• Idea:
– provide a way to augment the processor’s instruction set
– with operations needed by a particular application
– close semantic gap / avoid mismatch
• What’s required:
– some way to fit augmented instructions into stream
– execution engine for augmented instructions
» if programmable, has own instructions
– interconnect to augmented instructions
JR.S00 13
“First” Instruction Augmentation
• PRISM
– Processor Reconfiguration through Instruction Set
Metamorphosis
• PRISM-I
– 68010 (10MHz) + XC3090
– can reconfigure FPGA in one second!
– 50-75 clocks for operations
[Athanas+Silverman: Brown]
JR.S00 14
PRISM-1 Results
Raw kernel speedups
JR.S00 15
PRISM
• FPGA on bus
• access as memory mapped peripheral
• explicit context management
• some software discipline for use
• …not much of an “architecture” presented to user
JR.S00 16
[Razdan+Smith: Harvard]
PRISC
• Takes next step
– what would it look like if we put it on chip?
– how to integrate it into the processor ISA?
• Architecture:
– couple into register file as “superscalar” functional unit
– flow-through array (no state)
JR.S00 17
PRISC
• ISA Integration
– add expfu instruction
– 11 bit address space for user-defined expfu instructions
– fault on pfu instruction mismatch
» trap code to service instruction miss
– all operations occur in a single clock cycle
– easily works with processor context switch
» no state + fault on mismatch pfu instr
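The fault-on-mismatch behaviour can be sketched in software as follows. This is a toy model under stated assumptions, not PRISC's actual implementation: the `pfu_model` structure, the function-pointer table standing in for stored configurations, and the two sample functions are all hypothetical. Only the `expfu` name and the 11-bit function id come from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PFU_ID_BITS 11
#define PFU_ID_MAX  (1u << PFU_ID_BITS)     /* 2048 user-defined functions */

/* A PFU function is stateless: the result depends only on the operands. */
typedef uint32_t (*pfu_fn)(uint32_t a, uint32_t b);

typedef struct {
    pfu_fn   table[PFU_ID_MAX]; /* stands in for configurations kept in memory */
    unsigned loaded_id;         /* function the PFU currently implements */
    int      valid;             /* 0 until the first configuration is loaded */
    unsigned faults;            /* number of instruction misses taken */
} pfu_model;

/* Two sample "hardware" functions (assumed, for illustration only). */
static uint32_t hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b, n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

static uint32_t and_not(uint32_t a, uint32_t b) { return a & ~b; }

/* expfu: execute user function `id` on two register operands. */
static uint32_t expfu(pfu_model *p, unsigned id, uint32_t a, uint32_t b)
{
    assert(p != NULL && id < PFU_ID_MAX && p->table[id] != NULL);
    if (!p->valid || p->loaded_id != id) {
        p->faults++;            /* mismatch: trap code reloads the PFU */
        p->loaded_id = id;
        p->valid = 1;
    }
    return p->table[id](a, b);
}
```

Because the PFU holds no state, nothing needs to be saved on a context switch: the next `expfu` with a different id simply faults and reloads.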
JR.S00 18
PRISC Results
• All compiled
• working from MIPS
binary
• <200 4LUTs ?
– 64x3
• 200MHz MIPS base
Razdan/Micro27
JR.S00 19
Chimaera
• Start from PRISC idea
– integrate as functional unit
– no state
– RFUOPs (like expfu)
– stall processor on instruction miss, reload
• Add
– manage multiple instructions loaded
– more than 2 inputs possible
[Hauck: Northwestern]
JR.S00 20
Chimaera Architecture
• “Live” copy of register
file values feed into array
• Each row of array may
compute from register
values or intermediates
(other rows)
• Tag on array to indicate
RFUOP
Results
• Compress 1.11
• Eqntott 1.8
• Life 2.06 (160 hand parallelization)
[Hauck/FCCM97]
JR.S00 21
Instruction Augmentation
• Small arrays with limited state
– so far, for automatic compilation
» reported speedups have been small
– open
» discover less-local recodings which extract greater
benefit
JR.S00 22
GARP
Identified Problems:
• Single-cycle flow-through
– not most promising usage style
• Moving data through Register File to/from
array
– can present a limitation
» bottleneck to achieving high computation rate
[Hauser+Wawrzynek: UCB]
JR.S00 23
GARP
• Integrate as coprocessor
– similar bandwidth to processor as FU
– own access to memory
• Support multi-cycle operation
– allow state
– cycle counter to track operation
• Fast operation selection
– cache for configurations
– dense encodings, wide path to memory
JR.S00 24
GARP
• ISA -- coprocessor operations
– issue gaconfig to make a particular configuration resident
(may be active or cached)
– explicitly move data to/from array
» 2 writes, 1 read
– processor suspend during coprocessor operation
» cycle count tracks operation
– array may directly access memory
» processor and array share memory space
• cache/mmu keeps consistency
» can exploit streaming data operations
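The coprocessor interface above can be sketched as a small software model. The instruction names `gaconfig`, `mtga`, and `mfga` appear in these slides; everything else here is an assumption for illustration: the argument lists, the four-entry configuration cache with round-robin replacement, and the toy multiply-accumulate "configuration" that the array runs once per cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_CACHE_SLOTS 4          /* assumed size, not from the lecture */

typedef struct {
    int      cached[CFG_CACHE_SLOTS]; /* configuration ids held on chip, -1 = empty */
    int      active;                  /* active configuration, -1 = none */
    unsigned next_victim;             /* round-robin replacement (assumed) */
    unsigned slow_loads;              /* configurations fetched from memory */
    uint32_t reg[2];                  /* array registers written by mtga */
    uint32_t acc;                     /* array state persists across cycles */
    unsigned cycles_left;             /* cycle counter tracking the operation */
} garp_model;

static void garp_init(garp_model *g)
{
    assert(g != NULL);
    for (unsigned i = 0; i < CFG_CACHE_SLOTS; i++)
        g->cached[i] = -1;
    g->active = -1;
    g->next_victim = 0;
    g->slow_loads = 0;
    g->reg[0] = g->reg[1] = 0;
    g->acc = 0;
    g->cycles_left = 0;
}

/* gaconfig: make configuration `id` active; load from memory only on a miss. */
static void gaconfig(garp_model *g, int id)
{
    assert(g != NULL && id >= 0);
    g->active = id;
    g->acc = 0;
    for (unsigned i = 0; i < CFG_CACHE_SLOTS; i++)
        if (g->cached[i] == id)
            return;                           /* cached: fast switch */
    g->cached[g->next_victim] = id;           /* miss: fetch and replace */
    g->next_victim = (g->next_victim + 1) % CFG_CACHE_SLOTS;
    g->slow_loads++;
}

/* mtga: write one array register and arm the cycle counter. */
static void mtga(garp_model *g, unsigned r, uint32_t value, unsigned cycles)
{
    assert(g != NULL && g->active >= 0 && r < 2);
    g->reg[r] = value;
    g->cycles_left = cycles;
}

/* mfga: the processor stalls until the counter reaches zero, then reads. */
static uint32_t mfga(garp_model *g)
{
    assert(g != NULL && g->active >= 0);
    while (g->cycles_left > 0) {              /* array runs, processor waits */
        g->acc += g->reg[0] * g->reg[1];
        g->cycles_left--;
    }
    return g->acc;
}
```

Unlike the single-cycle PFU model, the array here keeps state (`acc`) between cycles, which is what makes multi-cycle and pipelined operations possible.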
JR.S00 25
GARP
• Processor Instructions
JR.S00 26
GARP Array
• Row oriented logic
– denser for datapath
operations
• Dedicated path for
– processor/memory data
• Processor does not have to be
involved in the
array↔memory path
JR.S00 27
GARP Results
• General results
– 10-20x on stream, feedforward operation
– 2-3x when data dependencies limit
pipelining
[Hauser+Wawrzynek/FCCM97]
JR.S00 28
PRISC/Chimaera … GARP
• PRISC/Chimaera
– basic op is single cycle:
expfu (rfuop)
– no state
– could conceivably have
multiple PFUs?
– Discover parallelism =>
run in parallel?
– Can’t run deep pipelines
• GARP
– basic op is multicycle
» gaconfig
» mtga
» mfga
– can have state/deep
pipelining
– ? Multiple arrays viable?
– Identify mtga/mfga w/ corr
gaconfig?
JR.S00 29
Common Theme
• To get around instruction expression limits
– define new instruction in array
» many bits of config … broad expressability
» many parallel operators
– give array configuration short “name” which processor
can call out
» …effectively the address of the operation
But: the impact of using reconfiguration at the instruction level seems limited
→ Explore opportunities at larger granularity levels
(basic block, task, process)
JR.S00 30
Applicability of Configurable Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded
processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator coprocessors to embedded micros and DSPs
– Chameleon, Pleiades, Morphics
JR.S00 31
Example: Chameleon Reconfigurable CoProcessor (network, communication applications)
[Figure: Chameleon block diagram. Labels: JTAG debugging port; PCI interface; memory controller; data memory; wide internal communications bus; bus manager & DMA controllers; ARC CPU; instruction cache; background configuration plane; configuration bit stream; reconfigurable logic (array of 32-bit data path operators & control logic); multiple banks of I/O; local store memory (LSM)]
JR.S00 32
Reconfigurable Processor Tools Flow
[Figure: tools flow. Labels: customer application / IP (C code); C compiler; ARC object code; RTL HDL; synthesis & layout; configuration bits; linker; Chameleon executable; C model simulator; C debugger; development board]
JR.S00 33
Heterogeneous Reconfiguration
[Figure: four styles of reconfiguration, side by side:
– Reconfigurable Logic (CLBs): bit-level operations, e.g. encoding
– Reconfigurable Datapaths (mux, reg0, reg1, adder, buffer): dedicated data paths, e.g. filters, AGU
– Reconfigurable Arithmetic (AddrGen, Memory, MAC): arithmetic kernels, e.g. convolution
– Reconfigurable Control (program memory, data memory, datapath, instruction decoder & controller): RTOS, process management]
JR.S00 34
Multi-granularity Reconfigurable Architecture:
The Berkeley Pleiades Architecture
[Figure: Pleiades block diagram. Labels: control processor; satellite processors (arithmetic processors, dedicated arithmetic, configurable datapath, configurable logic); communication network; network interface; configuration bus; configuration]
• Computational kernels are “spawned” to satellite processors
• Control processor supports RTOS and reconfiguration
• Order(s) of magnitude energy-reduction over traditional programmable architectures
JR.S00 35
Matching Computation and Architecture
[Figure: a convolution mapped onto AddressGen, Memory, and MAC units next to a control processor. Other labels: L, G, C]
Two models of computation: communicating processes + data-flow
Two architectural models: sequential control + data-driven
JR.S00 36
Execution Model of a Data-Flow Kernel
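Embedded processor code segment:
for(i=1;i<=L;i++)
    for(k=i;k<=L;k++)
        phi[i][k]= phi[i-1][k-1]
            +in[NP-i]*in[NP-k]
            -in[NA-1-i]*in[NA-1-k];
[Figure: the loop nest is handed off between two code segments of the embedded processor (start, end) and runs on satellites: AddrGen, MEM: in, AddrGen, MPY, MPY, ALU, ALU, MEM: phi]
• Distributed control and memory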
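For reference, the loop nest from this slide can be written as a self-contained sequential C function. The sizes `L`, `NP`, and `NA` below are small illustrative values chosen for this sketch, not the ones used in the lecture, and the function name `covariance_update` is hypothetical.

```c
#include <assert.h>

enum { L = 4, NP = 4, NA = 12 };    /* assumed sizes, for illustration only */

/* Fills phi[i][k] for 1 <= i <= k <= L using the slide's recurrence.
 * Row 0 of phi (entries 0..L-1) must be initialised by the caller. */
static void covariance_update(long phi[L + 1][L + 1], const int in[NA])
{
    assert(NP >= L && NA - 1 >= L);  /* keeps every index into in[] >= 0 */
    for (int i = 1; i <= L; i++)
        for (int k = i; k <= L; k++)
            phi[i][k] = phi[i - 1][k - 1]
                      + (long)in[NP - i] * in[NP - k]
                      - (long)in[NA - 1 - i] * in[NA - 1 - k];
}
```

Each iteration needs two multiplies, an add and a subtract, plus two independent address streams into `in`, which is roughly the set of satellites named in the figure: two MPY units, two ALUs, and two address generators.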
JR.S00 37
Reconfigurable Kernels for W-CDMA
• Dominant kernel M(MTX)
requires array of MACs and
segmented memories
• Additional operations such as
sqrt(x), 1/x, and Trellis decoding
may be implemented using
FPGA or CORDIC satellite
JR.S00 38
Inter-Satellite Communication
• Data-driven execution
– A satellite processor is enabled only when input data is ready
• Data sources generate data of different types: scalars, vectors,
matrices
• Data computing processors handle data inputs of different types
[Figure: data sources (embedded processor, AddrGen, Memory) send streams of 1 or n elements, closed by an end-of-vector token, to data computing processors (MPY, MAC)]
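A minimal sketch of data-driven execution, assuming a simple token format: a MAC satellite fires only when both input streams hold a token, accumulates products, and emits its result when it sees an end-of-vector token. The `token`, `stream`, and `mac_satellite` types and the `mac_step` function are hypothetical names for this example, not part of the Pleiades design.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int value; int end_of_vector; } token;

/* A simple input queue standing in for an inter-satellite link. */
typedef struct { const token *buf; size_t len, pos; } stream;

typedef struct { long acc; unsigned firings; } mac_satellite;

static int ready(const stream *s) { return s->pos < s->len; }

/* One scheduling step. Returns 1 and writes *out when a dot product is
 * complete; returns 0 if the satellite did not fire or is still accumulating. */
static int mac_step(mac_satellite *m, stream *a, stream *b, long *out)
{
    assert(m != NULL && a != NULL && b != NULL && out != NULL);
    if (!ready(a) || !ready(b))
        return 0;                       /* not enabled: stays idle */
    token x = a->buf[a->pos++];
    token y = b->buf[b->pos++];
    m->acc += (long)x.value * y.value;
    m->firings++;
    if (x.end_of_vector || y.end_of_vector) {
        *out = m->acc;                  /* vector finished: emit a scalar */
        m->acc = 0;
        return 1;
    }
    return 0;
}
```

The same satellite handles vectors of any length, because the length travels with the data as the end-of-vector token rather than being programmed into the unit.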
JR.S00 39
Impact of Architectural Choice
[Figure: three log-scale bar charts comparing Pleiades, TMS320LC54x, TMS320C2xx, and StrongARM: normalized energy per stage [nJ], normalized delay per stage [s], and normalized energy*delay per stage [Js*e-14]]
Example: 16 point Complex
Radix-2 FFT (Final Stage)
JR.S00 40
Adaptive Multi-User Detector for W-CDMA
Pilot Correlator Unit Using LMS
[Figure: dataflow graph of the pilot correlator with two sections, Filter and Coefficient Update. Node types: AG, MEM (alternating), MAC, MUL, ADD, SUB, ACC. Signals: s_r, s_i, Zmf_r, Zmf_i, y_r, y_i]
JR.S00 41
Architecture Comparison
LMS Correlator at 1.67 MSymbols Data Rate
Complexity: 300 Mmult/sec and 357 Macc/sec
16 Mmacs/mW!
Note: TMS implementation requires 36 parallel processors to meet data rate; validity questionable
JR.S00 42
Maia: Reconfigurable Baseband Processor for
Wireless
• 0.25um tech: 4.5mm x 6mm
• 1.2 Million transistors
• 40 MHz at 1V
• 1 mW VCELP voice coder
• Hardware
• 1 ARM-8
• 8 SRAMs & 8 AGPs
• 2 MACs
• 2 ALUs
• 2 In-Ports and 2 Out-Ports
• 14x8 FPGA
JR.S00 43
Reconfigurable Interconnect Exploration
[Figure: three interconnect options: mesh; hierarchical mesh (modules grouped into clusters); multi-bus (N inputs, B buses, M outputs)]
JR.S00 44
Software Methodology Flow
[Figure: software methodology flow. Stages: algorithms (C++); kernel detection; estimation/exploration (behavioral proc & accelerator PDA models, power & timing estimation of various kernel implementations, premapped kernels); partitioning; software compilation; reconfig. hardware mapping; interface code generation. Other labels: SUIF + C-IF; C++ module libraries]
JR.S00 45
Hardware-Software Exploration
Macromodel call
JR.S00 46
Implementation Fabrics for Protocols
A protocol = Extended FSM
[Figure: extended FSM of the Intercom TDMA MAC. States and events: idle, RACH req, RACH akn, slotset update, read, write. Memory side: Slot_Set_Tbl (2x16) and buffers (BUF), with signals R_ENA, W_ENA, addr, slot_set <31:0>, Slot_no <5:0>, Slot Pkt start/end]
JR.S00 47
Intercom TDMA MAC
Implementation alternatives
          ASIC         FPGA         ARM8
Power     0.26mW       2.1mW        114mW
Energy    10.2pJ/op    81.4pJ/op    n*457pJ/op

• ASIC: 1 V, 0.25 µm CMOS process
• FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA
• ARM8: 1 V, 25 MHz processor; n = 13,000
• Ratio: 1 - 8 - >> 400
JR.S00 48
The Software-Defined Radio
FPGA
Embedded uP
Dedicated FSM
Dedicated
DSP
Reconfigurable
DataPath
JR.S00 49
An Industrial Example:
Basestation for Cellular Wireless
[Figure: antenna systems for the 800 MHz and 1900 MHz bands (frequency blocks labeled A through F) feed RF/IF tuner and block-spectrum A/D chains, repeated for multiple sectors and multiple bands, onto a high-speed digital bus]
Modular/Parameterizable
•per carrier
•per TDMA time-slot
•per CDMA code
JR.S00 50
BTS Signal Processing Platforms
[Figure: A/D and D/A conversion attach to a high-speed digital bus. Labels: comm agents; hardwired ASICs (Standard A, F1; Standard A, F2; ...); N DSP/CPUs (Standard B, F1; ...); CPU]
JR.S00 51
Basestation of the Next Generation
[Figure labels: Wideband RF; 10/100 or Gbit; ATM; Data Networks]
JR.S00 52
Coexistence of Multiple Standards In
Product Supply Chain
2G
– GSM
– DCS1800
– PCS1900
– IS-95
– IS-54B
– IS-136
– PDC
2.5G
– GPRS
– HCSD
– IS-95 MDR
– IS-95 HDR
– IS-136 HS
3G
– ETSI UTRA
– ARIB W-CDMA
– TIA cdma2000
– W-TDMA (UWC)
Trend from 2G to 3G: circuit → packet, voice → data, narrowband → wideband
JR.S00 53
Wideband CDMA: MOPS?
No, GOPS!
Single 384 kbps ARIB W-CDMA Channel
Function                   MIPS
Digital RRC Channel        3600
Searcher                   2100
RAKE                       1050
Maximal Ratio Combiner       24
Channel Estimator            12
AGC, AFC                     10
Deinterleaver                15
Turbo Coder                  90
TOTAL                      6901
Source: J. Kohnen et al. “Baseband Solution for WCDMA,” Proc. IEEE
Communication Theory Workshop, May 1999, Aptos, USA.
JR.S00 54
HW Multistandard Solutions
The common approach to hardware design involves
multiple ASICs to support each standard.
[Figure: a programmable DSP and control processor sit beside several digital hardwired ASICs, each a unique combination with its own analog IF/RF front end]
• Hardwired implementation is not scalable or upgradeable to new standards.
• This approach costs time in a time-to-market dominated world.
• Creating new chipsets for every technology combination critically challenges
available design resources!
JR.S00 55
SW Multistandard Solution
Applying instruction-set processor architectures to
all baseband processing would be desirable...
[Figure: several analog IF/RF front ends feeding a programmable DSP and control processor]
…but is simply not a good implementation for base stations:
- Unacceptably high cost per channel
- Unacceptably large power per channel
This is definitely not a viable implementation for terminals
JR.S00 56
The Law of Diminishing Returns
• More transistors are being thrown at improving
general-purpose CPU and DSP performance
• Fundamental bounds are being pushed
– limits on instruction-level parallelism
– limits on memory system performance
• Returns per transistor are diminishing
– new architectures realizing only 2-3 instructions/clock
– increasingly large caches to hide DRAM latency
JR.S00 57
Embedded Logic
Evolution
– Increasing fixed-function hardwired content of systems
– Core+Logic becomes de-facto design architecture
– Move to deep sub-micron technology
» rapidly increasing product integration cycles
» increasingly constrained design resources
» sharp increases in cost of “trying out” an idea- NRE
– Design methodologies optimized for random logic, homogeneous architectures,
and lower speed signal processing (i.e. control-flow dominated systems)
– Verification issues dominate design cycle time
Growing Design Cycle Times At Odds With Shrinking Product Cycle
Times
JR.S00 58
FPGA the Solution?
Cellular Handset Using Current FPGA
JR.S00 59
Some Interesting Observations
Don’t use more transistors to stretch
general-purpose performance,
whether for CPUs, DSPs, or
reconfigurable logic.
Don’t use more time to design
dedicated hardwired solutions in cases where
mass customization
is what the market demands.
JR.S00 60
View The “Reconfigurability” Problem From The
System Level
What Are the Application-Specific Performance Needs?
– What are the applications targeted?
– What algorithms are essential to achieving the performance goals?
– What are the functions at the heart of these algorithms?
– Which functions yield poor price-performance with general-purpose
MOPS and system memory models?
– What is the embedded systems programmer’s model?
– Best performance at what cost:
» Area - instruction-level parallelism, memory hierarchy
» Power- energy requirements on a function basis
» Time- quality and ease of programming for app development
» Pain- forward opportunity vs backward compatibility
JR.S00 61
Successfully Using Reconfigurability
Application-Specific Leverage
Focus first on applications and constituent algorithms, not the
silicon architecture!
Wireless Communications Transceiver Signal Processing
Minimize the hardware reconfigurability to a constrained set
Maximize the software parameterizability and ease of use of the
programmer’s model for flexibility
Define optimal architecture for efficient implementation
JR.S00 62
Application-Specific MOPS in Digital
Communications
RF/IF
Digital
Downconversion
and
Channelization
TDMA
Wideband Signal
Processing Engine
Programmable
DSP
Wideband Channel
Decoder Engine
CDMA
Wideband Signal
Processing Engine
Microprocessor
JR.S00 63
Morphics’ DRL Architecture
Heterogeneous Multiprocessing Engine Using Application-Specific Reconfigurable Logic
[Figure: dataflow between kernels of large and small granularity; each kernel is drawn with inputs, an output, and Clk and Enable signals]
JR.S00 64
DRL Kernels
DATA MEMORY
DATA SEQUENCER
PARAMETERIZABLE
CONFIGURABLE
ALU
JR.S00 65
Mapping Software to Target Architecture
DATA MEMORY
DATA SEQUENCER
PARAMETERIZABLE
CONFIGURABLE
ALU
DATA MEMORY
DATA SEQUENCER
PARAMETERIZABLE
CONFIGURABLE
ALU
JR.S00 66
Programmer’s Guide
Document provided with each
processor to enable
application development,
configuration, and system
control via host processor.
Includes complete
• description of each API function call
• system control functions
• variables, parameters
• coding examples, and performance realized (i.e. ROC curves)
/* morphics soft API usage examples: */

/* search a pilot set in search set */
/* maintenance mode.                */
...
set_ptr = get_next_set();
if (no_need_to_throttle())
{
    search_set(set_ptr);
}
...
void search_set(PILOT_TYPE *pilot)
{
    /* search threshold is assumed to be set in other places */
    morphics_searcher_set_win_size(pilot->win_size);
    morphics_searcher_set_pn(pilot->pn);
    morphics_searcher_set_int_len(pilot->int_len);
}

/* finger re-assignments */
...
fing[i].pos = morphics_demod_get_fin_pos(i);
...
distance = calc_fing_movement(&fing[i], new_pn);
...
fing[i].slew = distance;
fing[i].pn = new_pn;
morphics_demod_set_fing_iq(i, fing[i].pn);
while ( slew_not_done(morphics_demod_get_status()) )
{
    morphics_demod_set_fing_slew(i, fing[i].slew);
}
...
JR.S00 67
Key Pieces of Design Methodology
System-level Profiling
• Analyze sequences of operations (arithmetic, memory access, etc)
• Analyze communication bottlenecks
• Key flexible parameters (algorithm vs. architecture parameters)
Architecture-level Profiling
• ALU/kernel definition (sequences of operators)
• Memory profile
• Type of configurability required for flexibility
• Macro-sequencer development
Implementation
• SW: programmer’s model developed at architecture specification stage
• SW: API proven out via behavioral models & demonstrator hardware
• VLSI: focus on regular predictable timing and routability
• VLSI: embedded reconfigurability in an ASIC flow
JR.S00 68
CDMA Modem Analysis
JR.S00 69
MOPS Breakdown Analysis
JR.S00 70
Prototype Demonstrator
RISC
Microprocessor
Configurable Kernel(s)
Data Router
JR.S00 71
Summary
• Configurable computing is finding its way into the
embedded processor space
• Best suited (so far) for
– Flexible I/O and Interface functionality
– providing task-level acceleration of “parameterizable” functions
• Improvement of IP seems limited
• Software flow still subject to improvement
• Might become more interesting with the emergence of
low-current devices (TFT, organic transistors,
molecular computing)
DO NOT FORGET CONFIGURATION OVERHEAD
JR.S00 72