ECE 497NC: Unconventional Computer Architecture


ECE 412: Microcomputer Laboratory
Lecture 16: Accelerator Design on the XUP Board
Fall 2006
Objectives
• Understand accelerator design considerations in a practical FPGA environment
• Learn the details of the XUP platform required for efficient accelerator design
Four Fundamental Models of Accelerator Design

[Figure: four CPU-FPGA integration models]
(a) Base: the application drives the FPGA directly, with no OS service (as in simple embedded systems).
(b) The accelerator is exposed as an OS service that the application calls into.
(c) The accelerator is an mmap()ed I/O device: the application maps its registers into user space and accesses them directly (a user-space sketch follows).
(d) The accelerator is a virtualized device behind a device driver, with OS scheduling support.
Hybrid Hardware/Software Execution Model
• Hardware accelerator as a DLL
– Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
– The DLL approach enables transparent interchange of software and hardware components (see the sketch below)

[Figure: compile-time and runtime flow. At compile time, compiler analysis/transformations turn source code into a soft object, while human-designed hardware is synthesized into a hard object; both are packaged in a DLL. At runtime, the linker/loader binds the application to either object as a user-level function or a device driver, and a kernel runtime resource manager in the Linux OS allocates the CPU, memory, devices, and FPGA accelerators.]

• Application-level execution model
– Deep compiler analysis and transformations generate CPU code, hardware library stubs, and synthesized components
– FPGA bitmaps act as the hardware counterpart of existing software modules
– The same dynamic-linking interfaces and stubs apply to both the software and the hardware implementation
• OS resource management
– Services (API) for allocation, partial reconfiguration, saving and restoring status, and monitoring
– The multiprogramming scheduler can prefetch hardware accelerators in time for their next use
– Access to the new hardware is controlled to ensure trust under private or shared use
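
A minimal sketch of the DLL interchange idea: the same symbol is resolved from either a software library or a hardware-stub library at load time, so the caller never knows which one it got. The library names (libdither_sw.so, libdither_hw.so) and the function signature are assumptions for illustration; the slides do not give the real interface:

/* Build with: cc demo.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

typedef long (*dither_fn)(const short *in, short *out, int n);

int main(int argc, char **argv)
{
    /* Pick the implementation: the hardware stub drives the FPGA,
     * the software library is plain C code. */
    const char *lib = (argc > 1 && argv[1][0] == 'h')
                        ? "./libdither_hw.so" : "./libdither_sw.so";

    void *h = dlopen(lib, RTLD_NOW);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    dither_fn dither = (dither_fn)dlsym(h, "audio_linear_dither");
    if (!dither) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    short in[64] = {0}, out[64];
    dither(in, out, 64);   /* caller is oblivious to HW vs. SW */

    dlclose(h);
    return 0;
}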
MP3 Decoder: Madplay Library Dithering as a DLL
• The Madplay shared library's dithering function is provided both as a software DLL and as an FPGA DLL
– Profiling shows audio_linear_dither() accounts for 97% of application time
– The DL (dynamic linker) can switch each call to the hardware or the software implementation
• The library is used by ~100 video and audio applications

[Figure: the application's Decode MP3 Block stage calls Read Sample / Write Sample through a DL stub, which dispatches either to the software dithering DLL or, via the OS and the sound driver, to the hardware dithering DLL on the FPGA. Both implementations chain the same stages: noise shaping, biasing, random generator, dithering, clipping, and quantization. The hardware pipeline produces a sample in 6 cycles and feeds the AC'97 codec.]
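
For orientation, a minimal software sketch of the stages named above (noise shaping, bias, dither, clip, quantize), with an error-feedback structure and made-up constants; this is not madplay's actual audio_linear_dither() code:

#include <stdint.h>
#include <stdlib.h>

static int32_t prev_error = 0;          /* noise-shaping state */

int16_t dither_sample(int32_t sample)   /* 28-bit fixed-point in */
{
    sample += prev_error;               /* noise shaping (error feedback) */
    sample += 1 << 11;                  /* bias: round rather than truncate */
    sample += rand() & 0xFFF;           /* random dither in the LSBs */

    if (sample >  0x7FFFFFF) sample =  0x7FFFFFF;   /* clip to 28 bits */
    if (sample < -0x8000000) sample = -0x8000000;

    int16_t out = (int16_t)(sample >> 12);          /* quantize to 16 bits */
    prev_error  = sample - ((int32_t)out << 12);    /* keep the error */
    return out;
}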
CPU-Accelerator Interconnect Options

[Figure: Virtex-II Pro block diagram. A motion-estimation accelerator with BRAM connects to the PowerPC either over the PLB, which it shares with the DDR controller (DDR RAM), or over the OCM interconnect.]

Access latency (bus cycles):

Bus   First access (read / write)   Pipelined (read / write)   Arbitration
PLB   21 / 20                       3 / 3                      15
OCM   4 / 3                         2 / 2                      -
DCR   3 / 3                         3 / 3                      -

• PLB (Processor Local Bus)
– Wide transfers: 64 bits
– Access to the DRAM channel
– Runs at 1/3 of the CPU frequency
– Big penalty if the bus is busy on the first access attempt
• OCM (On-Chip Memory) interconnect
– Narrower: 32 bits
– No direct access to the DRAM channel
– Runs at the CPU clock frequency (a worked comparison follows)
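
To see what these latencies imply, here is a back-of-envelope sketch comparing a 128-byte burst read over the two interconnects, using the table's figures and the stated widths and clock ratios. Arbitration and contention are ignored, so this is an estimate only:

#include <stdio.h>

/* first access + (beats - 1) pipelined beats */
static int burst_cycles(int first, int pipelined, int beats)
{
    return first + (beats - 1) * pipelined;
}

int main(void)
{
    /* 128 bytes: PLB moves 64 bits/beat (16 beats),
     * OCM moves 32 bits/beat (32 beats). */
    int plb = burst_cycles(21, 3, 16);   /* 21 + 15*3 = 66 bus cycles */
    int ocm = burst_cycles(4, 2, 32);    /* 4 + 31*2 = 66 bus cycles  */

    /* PLB runs at 1/3 the CPU clock, OCM at the CPU clock, so
     * convert bus cycles to CPU cycles before comparing. */
    printf("PLB: %d CPU cycles, OCM: %d CPU cycles\n", plb * 3, ocm);
    return 0;
}

Under these assumptions the OCM wins on latency for modest transfers, while the PLB's value lies in its direct path to the DRAM channel.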
Motion Estimation Design & Experience

[Figure: the motion-estimation accelerator with BRAM on the PLB of the Virtex-II Pro, alongside the PowerPC, the DDR controller (DDR RAM), and the OCM.]

Accelerator action            Cycles     µs
open system call              34,000     113.3
mmap system call              24,000     80.0
Data marshaling + write       90,600     302
Initiation                    180        0.600
Accelerator computation       5,400      18
Read from accelerator         300        1
Total time per call:
  Full Search (HW)            96,480     322
  Diamond Search (SW)         174,911    583
  Full Search (SW)            639,925    2,133

• Significant overhead in the open and mmap calls
– This arrangement can only support accelerators that will be invoked many times
• Note the dramatic reduction in computation time
• Note the large overhead in data marshaling and write
• Full Search gives 10% better compression
– Diamond Search is sequential and not suitable for acceleration
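
A quick check on the table: the hardware Full Search total is the sum of its per-call components, 90,600 + 180 + 5,400 + 300 = 96,480 cycles, and 96,480 cycles against 322 µs implies a 300 MHz processor clock, so the time column is in microseconds. The one-time open and mmap costs (58,000 cycles together) are excluded from the per-call totals and must be amortized across invocations.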
JPEG: An Example

[Figure: the JPEG pipeline. The original image undergoes RGB-to-YUV conversion; the Y, U, and V planes are each downsampled; the 2D Discrete Cosine Transform (DCT) and Quantization (QUANT) stages then execute in parallel on independent blocks and are the accelerator candidates implemented as reconfigurable logic; finally Run-Length Encoding (RLE) and Huffman Coding (HC), the inherently sequential region, emit the compressed image.]
JPEG Accelerator Design & Experience

[Figure: the PowerPC and the DDR controller (DDR RAM) sit on the PLB; inside the Virtex-II Pro, a Switchable Accelerator Interconnect Framework, configured through DCRs, routes control and data from the PLB through a DMA controller and input buffer into the DCT accelerator, on to the Quant accelerator, and back out through an output buffer.]

Action                   Cycles    µs
System call overhead     1,853     6.18
DMA setup                549       1.83
DMA transfers            448       1.49
Accelerator execution    987       3.29
Cache coherence          348       1.16
Data copies              1,060     3.53
Total time               5,244     17.5

• Based on Model (d)
– System call overhead on each invocation
– Better protection
• DCT and Quant are both accelerated
– Data flows directly from DCT to Quant
• The copy of data into the user DMA buffer dominates the cost
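
For scale: system-call overhead plus data copies account for 1,853 + 1,060 = 2,913 of the 5,244 total cycles, about 56%, while the accelerator itself executes for only 987 cycles (roughly 19%). That is why the copy into the user DMA buffer, not the computation, dominates the cost.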
Execution Flow of DCT System Call

Application code:

open("/dev/accel"); /* only once */
…
/* construct macroblocks */
macroblock = …
syscall(&macroblock, num_blocks);
…
/* macroblock now has transformed data */
…

[Figure: timeline of the call across the Application, Operating System, and Hardware. On the PPC, the OS enables accelerator access for the application (via DCR), copies the input data and flushes the cache range, and sets up a DMA transfer (via PLB); the DMA controller streams the macroblocks from memory to the accelerator, which executes while the PPC polls; the OS then sets up the return DMA transfer, the DMA controller writes the results back to memory, and the OS invalidates the cache range and copies the data back to the application's buffer.]
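
A schematic C sketch of the driver-side sequence on this timeline. Every helper below (the accel_* and dma_* functions and the cache routines) is invented for illustration and is not a real kernel API:

#include <stddef.h>

extern void *accel_dma_buffer(void);               /* driver-owned DMA region */
extern void  copy_from_user_buf(void *, const void *, size_t);
extern void  copy_to_user_buf(void *, const void *, size_t);
extern void  flush_dcache_range(void *, size_t);
extern void  invalidate_dcache_range(void *, size_t);
extern void  dma_to_accel(void *, size_t);
extern void  dma_from_accel(void *, size_t);
extern int   accel_done(void);

long accel_dct_syscall(void *user_buf, size_t len)
{
    void *dma_buf = accel_dma_buffer();

    copy_from_user_buf(dma_buf, user_buf, len);    /* data copy in            */
    flush_dcache_range(dma_buf, len);              /* DMA must see fresh data */

    dma_to_accel(dma_buf, len);                    /* setup DMA transfer      */
    while (!accel_done())                          /* poll while HW executes  */
        ;

    dma_from_accel(dma_buf, len);                  /* setup return transfer   */
    invalidate_dcache_range(dma_buf, len);         /* drop stale cache lines  */
    copy_to_user_buf(user_buf, dma_buf, len);      /* data copy out           */
    return 0;
}

Each step here corresponds to a cost line in the table on the previous slide: the copies and cache maintenance are exactly the "Data copies" and "Cache coherence" entries that dominate the accelerator's own execution time.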
Software Versus Hardware Acceleration

[Figure: measured software versus hardware execution times. Overhead is a major issue!]
Device Driver Access Cost