PROCESSOR ARCHITECTURES FOR MULTIMEDIA APPLICATIONS
Oguz Karacuka
What Is Multimedia Processing?
Desktop:
– 3D graphics (games)
– Speech recognition (voice input)
– Video/audio decoding (MPEG/MP3 playback)
Servers:
– Video/audio encoding (video servers, IP telephony)
– Digital libraries and media mining (video servers)
– Computer animation, 3D modeling & rendering (movies)
Embedded:
– 3D graphics (game consoles)
– Video/audio decoding and encoding (set-top boxes, PVRs...)
– Image processing (digital cameras)
– Signal processing (cellular phones)
Characteristics Of Multimedia Apps.
Requirement for real-time response
– “Incorrect” result often preferred to slow result
– Unpredictability can be bad (e.g. dynamic execution)
Narrow data-types
– Typical width of data in memory: 8 to 16 bits
– Typical width of data during computation: 16 to 32 bits
– 64-bit data types rarely needed
– Fixed-point arithmetic often replaces floating-point
Fine-grain (data) parallelism
– Identical operation applied on streams of input data
– Branches have high predictability
– High instruction locality in small loops or kernels
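The narrow data types and fixed-point arithmetic listed above can be illustrated with a minimal sketch. It assumes the Q15 format (16-bit values with 15 fractional bits), which is a common choice but is not named in the slides; the function names are hypothetical.

```python
Q = 15  # fractional bits; Q15 is an assumed format for this sketch

def to_q15(x: float) -> int:
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, saturating at the limits."""
    v = int(round(x * (1 << Q)))
    return max(-32768, min(32767, v))

def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values using a wider (32-bit) intermediate product."""
    prod = a * b              # 16 x 16 -> 32-bit intermediate
    prod += 1 << (Q - 1)      # round to nearest before shifting
    return max(-32768, min(32767, prod >> Q))  # back to 16 bits, with saturation

def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float, for inspection only."""
    return v / (1 << Q)
```

Data is stored in 16 bits but the product needs 32 bits, which matches the widths quoted above for data in memory and data during computation.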
Characteristics Of Multimedia Apps. (cont.)
Coarse-grain parallelism
– Most apps organized as a pipeline of functions
– Multiple threads of execution can be used
Memory requirements
– High bandwidth requirements but can tolerate high latency
– High spatial locality (predictable pattern) but low temporal locality
– Cache bypassing and prefetching can be crucial
Examples of Media Functions
Matrix transpose/multiply (3D graphics)
DCT/FFT (Video, audio, communications)
Motion estimation (Video encoding, deinterlacing)
Gamma correction (3D graphics)
Haar transform (Media mining)
Median filter (Image processing)
Separable convolution (Image processing)
Viterbi decode (Communications, speech)
Bit packing (Communications, cryptography)
…
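As one concrete instance of the kernels listed above, here is a minimal sketch of a single level of the Haar transform. It assumes the unnormalized average/difference form; the slides do not say which normalization is meant, and the function names are hypothetical.

```python
def haar_step(x):
    """One level of an unnormalized Haar transform on an even-length sequence."""
    if len(x) % 2:
        raise ValueError("length must be even")
    # The same two operations are applied to every pair: a data-parallel kernel.
    avg = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    diff = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return avg, diff

def haar_inverse(avg, diff):
    """Rebuild the original sequence from averages and differences."""
    out = []
    for a, d in zip(avg, diff):
        out += [a + d, a - d]
    return out
```

Every pair is processed independently with identical operations, which is the fine-grain data parallelism that the architectures in the following slides try to exploit.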
Approaches to Media Processing
– ASICs/FPGAs (dedicated/function-specific architectures)
– DSPs (flexible programmable architectures)
– VLIW with SIMD extensions (aka mediaprocessors; adapted programmable architectures)
– Vector processors
– General-purpose processors with SIMD extensions
Application Example: MPEG Dec.
MPEG Encoder & Decoder Complexity
Function Specific Architectures
Limited (if any) programmability
DSP or RISC core processor for main control
Special hardware accelerators for the DCT, quantization,
entropy encoding, motion estimation...
High efficiency and speed: typically better than
programmable architectures.
The silicon area optimization achieved by function-specific architectures allows lower production cost.
Function Specific Architectures
Programmable Dedicated Architectures
Increased flexibility: enables the processing of
different tasks under software control.
Higher cost for design and manufacturing:
additional hardware for program control is
required.
Require software development for the application:
parallelization strategies have to be applied
Flexible Programmable Architectures
TI’s Multimedia Video Processor (MVP) TMS320C80
Adapted Programmable Architectures
C-Cube’s VRP – VRP2
VLIW Advanced Architectures
Reduce the number of cycles per instruction required for
execution of highly complex and parallel algorithms
Multiple independent functional units that are directly
controlled by long instruction words.
Inefficient use of silicon: requires a giant routing network
of buses and crossbar switches.
All functional units share a common large register file
Code compaction is typically done by a special compiler,
which can predict branch outcomes by applying an
algorithm known as trace scheduling
Can be combined with SIMD architectures for increased parallelism,
e.g. Mitsubishi D30V and Philips Semiconductors' TriMedia
Philips TriMedia CPU64 Arch.
5-slot VLIW architecture with a 64-bit word size;
27 functional units, offering a choice of operation types
in each slot of the instruction;
Any operation can be guarded
to provide conditional execution without branching;
All functional units provide vector-style subword parallelism
on byte, half-word, or word entities;
Instruction set and functional units optimized with respect to
media processing;
A single multi-ported register file with bypass network,
allowing 1-cycle latency operations;
32 kB, 8-way instruction cache; 16 kB, 8-way, quasi-dual-ported
data cache;
A variable-length (compressed) instruction set design.
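Guarded (predicated) execution, mentioned in the list above, can be sketched as follows. This is only an illustration of the idea in Python, not TriMedia code: both candidate results are computed, and a guard value selects which one is kept, so no branch is needed. The function names are hypothetical.

```python
def select(guard: bool, if_true: int, if_false: int) -> int:
    """Keep one of two already-computed results, chosen by a guard bit."""
    g = int(guard)                     # guard as data: 1 or 0
    return g * if_true + (1 - g) * if_false

def abs_diff_guarded(a: int, b: int) -> int:
    """|a - b| without an if/else: both subtractions are issued, one is kept."""
    d1 = a - b                         # result used when the guard is true
    d2 = b - a                         # result used when the guard is false
    return select(a >= b, d1, d2)
```

On a VLIW machine the two subtractions could occupy different issue slots of the same instruction, so the cost of computing both is small compared with a mispredicted branch.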
Multiple-instruction, multiple-data
(MIMD) architectures
Offer 10 to 100 times more throughput than
existing VLIW and SIMD architectures
Multiple instructions are executed in parallel on
multiple data: a control unit for each data path.
Asynchronous nature increases the complexity of
software development.
SIMD Extensions to General Purp. Processors
Why?
Performance
– A 1.2GHz Athlon can do MPEG-4 encoding at
6.4fps
– One 384Kbps W-CDMA channel requires 6.9 GOPS
Power consumption
– A 1.2GHz Athlon consumes ~60W
– Power consumption increases with clock
frequency and complexity
Cost
– A 1.2GHz Athlon costs ~$62 to manufacture and
has a list price of ~$600 (module) (year 2000)
– Cost increases with complexity
SIMD Extensions to General Purp. Processors
Motivation
– Low media-processing performance of GPPs
– Cost and lack of flexibility of specialized ASICs for graphics/video
– Underutilized datapaths and registers
Basic idea: sub-word parallelism
– The mismatch between wide data paths and the relatively short
data types found in multimedia applications
– Treat a 64-bit register as a vector of two 32-bit, four 16-bit, or eight 8-bit
values (short vectors)
– Partition 64-bit datapaths to handle multiple narrow operations in
parallel
Initial constraints
– No additional architecture state (registers)
– No additional exceptions
– Minimum area overhead
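The sub-word parallelism idea above can be sketched in software. The example below treats a 64-bit integer as four 16-bit lanes and adds them lane by lane with wraparound, using masks so that carries never cross a lane boundary. Hardware does this by partitioning the adder's carry chain; the masking trick and the function names here are illustrative assumptions, not a description of any specific instruction set.

```python
MASK64 = (1 << 64) - 1
HIGH = 0x8000_8000_8000_8000   # top bit of each 16-bit lane
LOW = 0x7FFF_7FFF_7FFF_7FFF    # low 15 bits of each 16-bit lane

def pack16(vals):
    """Pack four 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def unpack16(word):
    """Split a 64-bit word back into four 16-bit values."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

def padd16(a, b):
    """Lane-wise wraparound add of four 16-bit values held in 64-bit words."""
    low = (a & LOW) + (b & LOW)            # add low 15 bits; carry stays inside the lane
    return (low ^ ((a ^ b) & HIGH)) & MASK64  # fold in the top bits without a carry-out
```

A short usage example: `unpack16(padd16(pack16([1, 0xFFFF, 0x8000, 100]), pack16([2, 1, 0x8000, 200])))` adds the four lanes independently, and the overflowing lanes wrap around without disturbing their neighbours.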
Overview of SIMD Extensions
Intel’s MMX Example
Targeted to accelerate multimedia and
communications applications, especially on the
Internet.
MMX extends the basic integer instructions
(add, subtract, multiply, compare, and shift) into
SIMD versions.
Added DCT / IDCT kernels
MPEG-1 video decompression speeds up by
about 80% with MMX, while some other applications, such as
image filtering, speed up by as much as 370%.
Summary of SIMD Instructions
Integer arithmetic
– Addition and subtraction with saturation
– Fixed-point rounding modes for multiply and shift
– Sum of absolute differences
– Multiply-add, multiplication with reduction
– Min, max
Floating-point arithmetic
– Packed floating-point operations
– Square root, reciprocal
– Exception masks
Data communication
– Merge, insert, extract
– Pack, unpack (width conversion)
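Two of the integer operations listed above, saturating addition and sum of absolute differences (SAD), can be sketched element by element. The sketch models a packed register as a list of unsigned 8-bit values; the function names are hypothetical and do not correspond to any particular instruction mnemonic.

```python
def padd_wrap_u8(a, b):
    """Packed add with wraparound: 250 + 10 becomes 4."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def padd_sat_u8(a, b):
    """Packed add with unsigned saturation: 250 + 10 clamps to 255."""
    return [min(x + y, 255) for x, y in zip(a, b)]

def sad_u8(a, b):
    """Sum of absolute differences, the core metric of block motion estimation."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Saturation matters for pixel data: a wrapped result turns a bright pixel dark, whereas a clamped result is slightly wrong but visually acceptable, in line with the earlier remark that an "incorrect" result is often preferred to a slow one.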
Summary of SIMD Instructions
Comparisons
– Integer and FP packed comparison
– Compare absolute values
– Element masks and bit vectors
Memory
– No new load-store instructions for short vector
– No support for strides or indexing
– Short vectors handled with 64b load and store
instructions
– Pack, unpack, shift, rotate, shuffle to handle
alignment of narrow data-types within a wider one
– Prefetch instructions for utilizing temporal
locality
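Because short vectors are moved with ordinary 64-bit loads, data that does not start on an 8-byte boundary has to be realigned with shifts. The sketch below shows one way to build an unaligned 64-bit load from two aligned loads; it assumes little-endian byte order, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def load64_aligned(mem: bytes, addr: int) -> int:
    """Model of a plain 64-bit load; the address must be a multiple of 8."""
    assert addr % 8 == 0
    return int.from_bytes(mem[addr:addr + 8], "little")

def load64_unaligned(mem: bytes, addr: int) -> int:
    """Fetch 8 bytes from any address using two aligned loads plus shifts."""
    base = addr & ~7                   # round down to the aligned word
    shift = 8 * (addr - base)          # misalignment in bits
    lo = load64_aligned(mem, base)
    if shift == 0:
        return lo                      # already aligned, one load is enough
    hi = load64_aligned(mem, base + 8)
    return ((lo >> shift) | (hi << (64 - shift))) & MASK64
```

The extra load and the two shifts are pure overhead, which is one source of the alignment cost noted in the summary slide.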
SIMD Ext. for GPP Summary
Narrow vector extensions for GPPs
– 64b or 128b registers as vectors of 32b, 16b, and 8b
elements
Based on sub-word parallelism and partitioned datapaths
Instructions
– Packed fixed- and floating-point, multiply-add, reductions
– Pack, unpack, permutations
2x to 4x performance improvement over base architecture
– Limited by memory bandwidth
Difficult to use (no compilers)
Overhead of handling alignment and datawidth adjustment
Optimized shared libraries
– Written in assembly, distributed by vendor
– Need well defined API for data format and use
SUMMARY
Computationally intensive multimedia
functions, such as MPEG encoding, HDTV
codecs, 3D processing, and virtual reality,
will still require dedicated processors
We should expect new generations of
general-purpose processors to devote more and
more of the available chip real estate
(transistors) to multimedia support.