Updated PowerPoint Talk - The University of Texas at Austin

“Human beings are great programmers, Computers are poor actors”
VLIW DSP vs. SuperScalar
Implementation of a Baseline H.263
Baseline H.263 Video Encoding
I: Intra frame: Discrete Cosine Transform (DCT) is used to reduce spatial redundancy within a frame.
P: Predicted frame: Motion compensated prediction (MCP) is used to reduce temporal redundancy. DCT is used to reduce spatial redundancy in the prediction error.
[Figure: frame sequence I, P, P, …, I]
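To make the P-frame idea concrete, here is a minimal C sketch of forming the prediction error for one block: the block in the current frame minus the block in the previous frame displaced by a motion vector. The function name, the 8 x 8 block size, and the frame layout (8-bit luminance, row-major, `stride` bytes per row) are assumptions for illustration and are not taken from the UBC codec.

```c
#include <stdint.h>

#define BLOCK 8  /* assumed block size for illustration */

/* Compute the prediction error (residual) for one BLOCK x BLOCK block.
 * cur, ref:   8-bit luminance frames, row-major, 'stride' bytes per row.
 * (x, y):     top-left corner of the block in the current frame.
 * (mvx, mvy): integer-pel motion vector pointing into the reference frame.
 * The caller must ensure the displaced block lies inside the reference frame. */
void prediction_error(const uint8_t *cur, const uint8_t *ref, int stride,
                      int x, int y, int mvx, int mvy,
                      int16_t residual[BLOCK][BLOCK])
{
    for (int r = 0; r < BLOCK; r++) {
        const uint8_t *c = cur + (y + r) * stride + x;
        const uint8_t *p = ref + (y + r + mvy) * stride + (x + mvx);
        for (int k = 0; k < BLOCK; k++)
            residual[r][k] = (int16_t)(c[k] - p[k]);
    }
}
```

When the motion vector matches the true displacement, the residual is close to zero, which is what leaves little for the DCT and quantizer to encode.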
Baseline H.263 Encoder
[Figure: encoder block diagram. Blocks: coding control, subtractor (-), 2-D DCT, quantizer (Q), VLC, inverse quantizer (Q-1), 2-D IDCT, adder (+), MCP, ME. Signals: video input, control info, quantizer index for transform coefficient, motion vectors.]
VLC: Variable Length Coding
ME: Motion Estimation
H.263 Encoder
• Goals: baseline H.263 encoder only
– Evaluate performance of compiled C code on Very Long Instruction Word (VLIW) Digital Signal Processors (DSPs) and superscalar processors
– Hand optimize H.263 video encoder on VLIW DSP
• University of British Columbia (UBC) H.263 Version 2 (H.263+) video codec
– By Prof. Faouzi Kossentini’s group: http://spmg.ece.ubc.ca
– 23,000 lines (720 kbytes) of C code targeted for PCs
– Baseline H.263 and many optional H.263+ modes
– Primarily for research purposes
TMS320C6701 Processor
• Up to eight 32-bit instructions are executed in order in one instruction cycle
• Two 32-bit data paths (A and B), each with 16 32-bit registers, and 16 16-bit data memory banks
[Figure: TMS320C6701 CPU core. Program fetch, instruction dispatch, and instruction decode feed two data paths: register file A with functional units L1, S1, M1, D1 and register file B with functional units L2, S2, M2, D2. Also shown: control registers, control logic, test/emulation, interrupts.]
TMS320C6701 EVM
• TMS320C6701 processor
– 11 to 17 pipeline stages, depending on instruction
• External memory
– 256 kB of 133 MHz synchronous burst static random-access memory (SBSRAM)
– 8 MB of 100 MHz synchronous dynamic RAM (SDRAM) in two 16-bit RAM banks
• 100 MHz clock speed due to SDRAM
• Development environment
– Code Composer: Interactive real-time debugging
– Simulator: Does not report pipeline stalls
SimpleScalar Simulator
• Superscalar processor reorders sequential instructions based on data dependencies for parallel (out-of-order) execution
• SimpleScalar is a configurable superscalar simulator: http://www.simplescalar.org
[Figure: six pipeline stages for out-of-order simulation. Labeled blocks: Fetch, Dispatch, Scheduler, Execute, Memory, Writeback, Commit, instruction cache, data cache, data TLB, virtual memory.]
TLB: Translation lookaside buffer
Comparison of Processors
Processor        TMS320C6701    SimpleScalar
Data path        2 x 32 bits    64 bits
Instruction      8 x 32 bits    64 bits
Program memory   64 kB          256 kB
Data memory      64 kB          8/256 kB (L1/L2)
Transistors      0.3 M          100 M
Power            1.05 W         30 W
Encoder Profile for VLIW DSP
(with level two C optimization only)
1476 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: bar chart, vertical axis 0 to 100. Bars: SAD, Motion estimation, Motion compensated prediction, DCT, Quantization, Interpolation, Reconstruction (Clip_MB), Others]
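Full-search motion estimation evaluates the sum of absolute differences (SAD) at every candidate displacement, which is why SAD and motion estimation figure so prominently in the profile. The following is a minimal C sketch of that search for one 16 x 16 macroblock. The names `sad16`, `full_search`, and `MotionVector`, the integer-pel search, and the caller-supplied `range` are assumptions for illustration, not the UBC codec's actual interface.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define MB 16  /* macroblock size in pixels */

/* Sum of absolute differences between two MB x MB blocks. */
static unsigned sad16(const uint8_t *a, const uint8_t *b, int stride)
{
    unsigned sum = 0;
    for (int r = 0; r < MB; r++)
        for (int c = 0; c < MB; c++)
            sum += (unsigned)abs(a[r * stride + c] - b[r * stride + c]);
    return sum;
}

typedef struct { int dx, dy; unsigned sad; } MotionVector;

/* Full search: try every displacement within +/- range pixels and keep
 * the one with the smallest SAD. Frames are row-major, 'width' bytes per
 * row; (x, y) is the top-left corner of the macroblock in 'cur'. */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height, int x, int y, int range)
{
    MotionVector best = { 0, 0, UINT_MAX };
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = x + dx, ry = y + dy;
            if (rx < 0 || ry < 0 || rx + MB > width || ry + MB > height)
                continue;  /* candidate falls outside the frame */
            unsigned s = sad16(cur + y * width + x,
                               ref + ry * width + rx, width);
            if (s < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = s;
            }
        }
    }
    return best;
}
```

A 128 x 96 frame holds 8 x 6 = 48 macroblocks. With an assumed search range of 15 pixels, each macroblock has up to 31 x 31 = 961 candidates of 256 absolute differences each, so the inner loop of `sad16` dominates the run time.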
Encoder Profile for SuperScalar
(1-way with level two C optimization)
196 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: bar chart, vertical axis 0 to 70. Bars: Motion estimation, DCT, Interpolation, Reconstruction (Clip_MB), Others]
H.263 Encoder Comparison
(with level 2 C optimization only)
• Frame resolution: 128 x 96 (Sub-QCIF)
• Full search motion estimation
• Clock speed: 100 MHz
                         VLIW DSP C6701   SimpleScalar (1-way)
Cycle counts per frame   1476 M           196 M
Frames per second        0.07             0.5
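The frames-per-second figures follow from dividing the clock rate by the cycle count per frame. A small C helper, with a hypothetical name chosen for illustration, makes the arithmetic explicit:

```c
/* Frames per second given the clock rate in Hz and the cycle count
 * needed to encode one frame. */
double frames_per_second(double clock_hz, double cycles_per_frame)
{
    return clock_hz / cycles_per_frame;
}
```

At 100 MHz, `frames_per_second(100e6, 1476e6)` is about 0.068 and `frames_per_second(100e6, 196e6)` is about 0.51, which the table rounds to 0.07 and 0.5.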
VLIW DSP Memory Optimizations
• Internal program memory holds
– Computationally intensive routines
– Commonly used runtime support functions from TI libraries (memcpy, memcmp and memset)
• Internal data memory holds
– Macroblocks and search area for motion estimation
– Macroblocks for DCT, quantization, coding, reconstruction
– Local data for computationally intensive routines
– Stack
• Speedup: 29 times over level two optimization
VLIW DSP Code Optimizations
• Compiler intrinsics gave little improvement
• Wrote assembly routines
– Parallel assembly: SAD, Clip_MB (clips overflowing values)
– Linear assembly: Interpolate, FillMBData (pack copy of pixel data into macroblock structures)
• Rewriting the C code
– Unroll loops and pipeline computations
– Use 32-bit packed data I/O to slower external RAM
– Avoid pipeline stalls due to memory bank conflicts
• Speedup: 4 times over level two C optimization
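Two of the C-level rewrites listed above, loop unrolling and 32-bit packed data access, can be illustrated on one 16-pixel row of a SAD computation. This is a portable C sketch written for this transcript, not the hand-written C6701 assembly from the talk; the function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Straightforward SAD over one row of 16 pixels: 32 byte loads. */
unsigned sad_row_simple(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

/* Absolute differences of the four bytes packed in two 32-bit words.
 * Byte order does not matter because matching positions are compared. */
static unsigned sad_word(uint32_t wa, uint32_t wb)
{
    unsigned sum = 0;
    sum += (unsigned)abs((int)(wa & 0xFF) - (int)(wb & 0xFF));
    sum += (unsigned)abs((int)((wa >> 8) & 0xFF) - (int)((wb >> 8) & 0xFF));
    sum += (unsigned)abs((int)((wa >> 16) & 0xFF) - (int)((wb >> 16) & 0xFF));
    sum += (unsigned)abs((int)(wa >> 24) - (int)(wb >> 24));
    return sum;
}

/* Same result with the loop fully unrolled and each memory access
 * fetching four pixels as one 32-bit word. memcpy keeps the access
 * well defined for unaligned pointers. */
unsigned sad_row_packed(const uint8_t *a, const uint8_t *b)
{
    uint32_t wa[4], wb[4];
    memcpy(wa, a, 16);
    memcpy(wb, b, 16);
    return sad_word(wa[0], wb[0]) + sad_word(wa[1], wb[1])
         + sad_word(wa[2], wb[2]) + sad_word(wa[3], wb[3]);
}
```

The packed version touches memory a quarter as often, which matters most when the pixel data lives in slower external RAM, and the unrolled body gives a VLIW scheduler independent operations to issue in parallel.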
VLIW DSP Optimizations
(assembly routines only)
Name            Speedup after memory optimizations   Speedup after code optimizations   Overall speedup
SAD             54 x                                 4 x                                264 x
Clip_MB         7.5 x                                13 x                               228 x
DCT (from TI)   26 x                                 3 x                                138 x
Interpolate     4 x                                  8 x                                22 x
FillMBData      7 x                                  8 x                                22 x
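Clip_MB is described in the talk only as the reconstruction routine that clips overflowing values. A minimal C sketch of that kind of operation follows, assuming that reconstruction adds a decoded residual to the prediction and clamps the result to the 8-bit pixel range; the function names and signature are hypothetical.

```c
#include <stdint.h>

/* Clamp a reconstructed sample to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Reconstruct n pixels: prediction plus decoded residual, clipped. */
void reconstruct_clip(const uint8_t *pred, const int16_t *residual,
                      uint8_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = clip_u8(pred[i] + residual[i]);
}
```

The two comparisons per pixel are data-dependent branches in plain C, which is one reason a routine like this can benefit from hand scheduling on a VLIW machine.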
VLIW DSP Encoder Profile
(after all C6701 optimizations)
24 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: bar chart, vertical axis 0 to 60. Bars: SAD, Motion estimation, Motion compensated prediction, DCT, Quantization, Interpolation, Reconstruction (Clip_MB), Others]
Superscalar Encoder Profile
(256-way SimpleScalar processor)
28 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: bar chart, vertical axis 0 to 60. Bars: Motion estimation, DCT, Interpolate, Reconstruction (Clip_MB), Others]
Subroutine Comparisons
Cycle counts
Subroutine    Optimized VLIW DSP   256-way SimpleScalar   Ratio
SAD           290                  6,017                  1:20.7
DCT           384                  12,885                 1:33.5
Clip_MB       1,173                6,250                  1:5.3
FillMBData    8,740                3,082                  2.8:1
Interpolate   750,000              425,703                1.8:1
H.263 Encoder Comparison
• Frame resolution: 128 x 96 (Sub-QCIF)
• Full search motion estimation
• Clock speed: 100 MHz
                         VLIW DSP C6701   SimpleScalar (256-way)
Cycle counts per frame   24 M             28 M
Frames per second        4                3.5
Conclusions
• With level 2 optimization only
– One-way superscalar is 7.5x faster than VLIW DSP
– Four-way to one-way issue speedup is 2.88x
– 256-way to four-way speedup is 2.4x
– Variable length coding much faster on superscalar
• VLIW DSP hand optimization produces 61x speedup vs. level two C optimization
– Placement of often-used data and code on-chip
– Hand coded SAD, interpolation, and reconstruction
– 14% faster than 256-way superscalar version
http://www.ece.utexas.edu/~sheikh/h263
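The headline ratios in the conclusions can be checked against the cycle counts per frame reported earlier in the talk (1476 M and 24 M for the C6701, 196 M and 28 M for SimpleScalar). The helper names below are hypothetical and exist only to make the arithmetic explicit.

```c
/* Speedup of 'after' relative to 'before', both in cycles per frame. */
double speedup(double cycles_before, double cycles_after)
{
    return cycles_before / cycles_after;
}

/* Fraction of cycles saved by 'fast' relative to 'slow'. */
double cycles_saved(double cycles_slow, double cycles_fast)
{
    return (cycles_slow - cycles_fast) / cycles_slow;
}
```

`speedup(1476, 196)` is about 7.5 (one-way superscalar vs. compiled VLIW code), `speedup(1476, 24)` is about 61.5 (hand optimization on the VLIW DSP), and `cycles_saved(28, 24)` is about 0.14, matching the 14% figure. Chaining the issue-width speedups, 196 / (2.88 x 2.4) is about 28.4, consistent with the 28 Mcycles/frame reported for the 256-way configuration.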