
Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Leonid Oliker
Future Technologies Group, Computational Research Division, LBNL
www.nersc.gov/~oliker
Sourav Chatterji, Jason Duell, Manikandan Narayanan
Motivation

• Commodity cache-based SMP clusters perform at a small % of peak for memory-intensive problems (especially irregular ones)
• The "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)
• Power and packaging are becoming significant bottlenecks
• Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC
• Alternative architectures allow tighter integration of processor and memory
• Can we build HPC systems with high-end media processor technology?
  – VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit its large bandwidth potential
  – IMAGINE: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters
Motivation

• General-purpose processors are badly suited for data-intensive operations
  – Large caches are not useful
  – Low memory bandwidth
  – Superscalar methods of increasing ILP are inefficient
  – High power consumption
• Application-specific ASICs: good, but expensive and slow to design
• Solution: general-purpose "memory aware" processors
  – Large number of ALUs: to exploit data parallelism
  – Huge memory bandwidth: to keep the ALUs busy
  – Concurrency: overlap memory accesses with computation
VIRAM Overview

• MIPS core (200 MHz)
• Main memory system
  – 8 banks with 13 MB of on-chip DRAM
  – Large 6.4 GB/s on-chip peak bandwidth
• Cache-less vector unit
  – Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  – Single issue, in order
• Low power consumption: 2.0 W
• Peak vector performance
  – 1.6/3.2/6.4 Gops
  – 1.6 GFlop/s (single precision)
• Fabricated by IBM; taped out 02/2003
• To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
• We use a simulator with Cray's vcc compiler
VIRAM Vector Lanes

• Parallel lane design has advantages in performance, design complexity, and scalability
• Each lane has 2 ALUs (1 for FP) and receives identical control signals
• Vector instructions specify 64-way parallelism; hardware executes 8-way (see the strip-mining sketch below)
• 8 KB vector register file partitioned into 32 vector registers
• Variable data widths: 4 lanes for 64-bit data, 8 virtual lanes for 32-bit, 16 for 16-bit
  – When the data width is cut in half, the number of elements per register (and peak performance) doubles
• Limitations: no 64-bit FP, and the compiler does not generate fused MADD
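
As an illustration (a minimal C sketch of generic strip-mining, not actual vcc output; the function name and MVL value of 64, which assumes 32-bit data, are assumptions for this example), this is the transformation a vectorizing compiler applies so that a loop of arbitrary length maps onto vector instructions of at most the maximum vector length:

    #include <stddef.h>

    #define MVL 64  /* assumed maximum vector length: 64 elements at 32-bit width */

    /* Strip-mined scaled vector add: each outer iteration corresponds to one
       set of vector instructions operating on at most MVL elements.  Halving
       the data width would double MVL (and peak rate), as noted above. */
    void saxpy_stripmined(size_t n, float a, const float *x, float *y) {
        for (size_t start = 0; start < n; start += MVL) {
            size_t vl = (n - start < MVL) ? (n - start) : MVL;  /* set vector length */
            for (size_t i = 0; i < vl; i++)      /* one vector multiply + vector add */
                y[start + i] += a * x[start + i];
        }
    }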
VIRAM Power Efficiency
[Chart: MOPS/Watt (log scale, 0.1–1000) on the Transitive, GUPS, SPMV, Hist, and Mesh benchmarks for VIRAM vs. R10K, P-III, P4, Sparc, and EV6]

• Comparable performance at a lower clock rate
• Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Stream Processing

• Example: stereo depth extraction
  – Data and functional parallelism
  – High computation rate
  – Little data reuse
  – Producer-consumer and spatial locality
  – Other examples: multimedia, signal processing, graphics
• Stream: ordered set of records (homogeneous, arbitrary data type)
• Stream programming: data is streams, computation is kernels
  – A kernel loops through all stream elements (in sequential order)
  – It performs a compound (multiword) operation on each stream element
  – In contrast, vectors perform a single arithmetic operation on each vector element (then store the result back in a register); see the sketch below
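
To make the contrast concrete, here is a minimal C sketch (the record type and function names are illustrative, not part of the Imagine tool chain): the stream-style kernel applies one compound operation per record, whereas the vector-style formulation issues a separate single-op pass over each field.

    #include <stddef.h>

    /* Illustrative record type: one element of an input stream. */
    typedef struct { float r, g, b; } Rgb;

    /* Stream style: one compound (multiword) operation applied to every
       record of the stream, in order. */
    static void luma_kernel(const Rgb *in, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = 0.299f * in[i].r + 0.587f * in[i].g + 0.114f * in[i].b;
    }

    /* Vector style: each pass applies a single arithmetic operation across a
       whole vector, storing the intermediate result back in a register/array. */
    static void luma_vector_style(const float *r, const float *g, const float *b,
                                  float *out, size_t n) {
        for (size_t i = 0; i < n; i++) out[i]  = 0.299f * r[i];
        for (size_t i = 0; i < n; i++) out[i] += 0.587f * g[i];
        for (size_t i = 0; i < n; i++) out[i] += 0.114f * b[i];
    }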
Imagine Overview

• Host sends instructions to the stream controller (SC); the SC issues commands to the on-chip modules
• "Vector VLIW" processor
• Coprocessor to an off-chip host processor
• 8 arithmetic clusters controlled in SIMD with VLIW instructions
• Central 128 KB Stream Register File (SRF) @ 32 GB/s
  – SRF can overlap computation with memory (double buffering)
  – SRF can reuse intermediate results (producer-consumer locality)
• Stream-aware memory system with 2.7 GB/s off-chip bandwidth
• 544 GB/s inter-cluster communication
Imagine Arithmetic Clusters

• 400 MHz clock, 8 clusters with 6 functional units each (48 FUs total)
• Reads/writes streams to the SRF
• Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratchpad, and 1 communication unit
• 32-bit architecture: subword operations support 16- and 8-bit data (no 64-bit support)
• Local registers on the functional units hold 16 words each (1.5 KB total)
• Clusters receive VLIW-style instructions broadcast from the microcontroller
VIRAM and Imagine
                    VIRAM        IMAGINE (Memory)   IMAGINE (SRF)
Bandwidth (GB/s)    6.4          2.7                32
Peak Fl (32-bit)    1.6 GF/s     20 GF/s            20 GF/s
Peak Fl/Word        1            30                 2.5
Speed (MHz)         200          400
Chip Area           15x18 mm     12x12 mm
Data widths         64/32/16     32/16/8
Transistors         130 x 10^6   21 x 10^6
Power Consumption   2 Watts      10 Watts

• Imagine has an order of magnitude higher peak performance
• VIRAM has twice the memory bandwidth and lower power consumption
• Note the peak Flop/Word ratios
SQMAT Architectural Probe
3x3 Matrix Multiply

[Chart: % of algorithmic peak vs. vector/stream length L (8–1024) for VIRAM and IMAGINE]

• Sqmat: scalable synthetic probe; controls computational intensity and vector length (see the sketch below)
• Imagine's stream model requires a large number of ops per word to amortize memory references
  – Poor use of the SRF: no producer-consumer locality
  – Long streams help hide memory latency, but only 7% of algorithmic peak is reached
• VIRAM performs well for a low ops/word ratio (40% of peak when L=256)
  – The vector pipeline overlaps computation and memory; on-chip DRAM gives high bandwidth and low latency
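
A minimal C sketch of an Sqmat-like probe (illustrative, not the authors' exact benchmark code): it streams over L small 3x3 matrices and repeatedly squares each one M times, so M controls the ops-per-word ratio while L controls the vector/stream length.

    #include <string.h>

    #define N 3  /* matrix dimension (3x3) */

    /* Square one NxN matrix in place: a = a * a. */
    static void square3x3(double a[N][N]) {
        double c[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * a[k][j];
        memcpy(a, c, sizeof c);
    }

    /* Sqmat-like probe: L independent matrices, each squared M times.
       Larger M raises computational intensity (ops per word moved);
       larger L lengthens the vector/stream presented to the hardware. */
    void sqmat(double mats[][N][N], long L, int M) {
        for (long m = 0; m < L; m++)
            for (int rep = 0; rep < M; rep++)
                square3x3(mats[m]);
    }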
SQMAT: Performance Crossover

[Chart: cycles (0–80,000) and MFlop/s (0–4,500) vs. vector/stream length L (8–1024) for VIRAM and IMAGINE]

• Large number of ops per word (N^10, where N is the 3x3 matrix)
• Crossover point: L=64 (cycles), L=256 (MFlop/s)
• Imagine's power becomes apparent: almost 4x VIRAM at L=1024
  – Codes at this end of the spectrum greatly benefit from the Imagine architecture
VIRAM/Imagine Optimization

• Optimization strategy: speed up the slower of computation or memory
  – Memory waiting for the ALUs: restructure computation for better kernel performance
  – ALUs memory starved: add computation for better memory performance
  – Subtle overlap effects: vector chaining, stream double buffering
• Example optimization: RGB→YIQ conversion from EEMBC
  – Input format: R1G1B1R2G2B2R3G3B3...
  – Required format: R1R2R3... G1G2G3... B1B2B3...
VIRAM RGB→YIQ Optimization

• VIRAM: poor memory performance
  – Strided accesses (~1/2 performance): RGBRGBRGB... strided loads → RRR...GGG...BBB...; only 4 address generators for 8 addresses (sufficient for 64-bit data)
  – Word operations on byte data (1/4 performance)
• Optimization: replace strided with unit-stride accesses, using an in-register shuffle (see the sketch below)
  – Increased computational overhead (packing and unpacking)
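
For reference, a minimal scalar C sketch of the conversion itself over the interleaved input layout, using the standard NTSC RGB→YIQ coefficients (the function name and the use of floating point are illustrative; the EEMBC kernel works on fixed-point integer data). On VIRAM the optimized version loads the interleaved bytes with unit stride and de-interleaves them in registers instead of issuing three strided loads.

    #include <stddef.h>
    #include <stdint.h>

    /* Scalar reference for RGB -> YIQ over an interleaved RGBRGB... buffer. */
    void rgb_to_yiq(const uint8_t *rgb, float *y, float *i, float *q, size_t npix) {
        for (size_t p = 0; p < npix; p++) {
            float r = rgb[3 * p + 0];
            float g = rgb[3 * p + 1];
            float b = rgb[3 * p + 2];
            y[p] = 0.299f * r + 0.587f * g + 0.114f * b;
            i[p] = 0.596f * r - 0.274f * g - 0.322f * b;
            q[p] = 0.211f * r - 0.523f * g + 0.312f * b;
        }
    }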
VIRAM RGB→YIQ Results

[Chart: VIRAM RGB→YIQ integer ops (M/sec), roughly 1,900–2,500, for the original vs. optimized code across small/medium/large data sizes]

VIRAM (chunk size 64)   Kernel (cycles)   Memory (cycles)
Unoptimized             114               95
Optimized               108               17

The optimized version uses functional units instead of the memory system to extract the color components, increasing the computational overhead.
Imagine RGB→YIQ Optimization

• Imagine's bottleneck is computation, due to a poor ALU schedule (left)
  – Unoptimized: 15 cycles per pixel
• Software pipelining makes the VLIW schedule denser (right); a generic C sketch of the transformation follows
  – Optimized: 8 cycles per pixel
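
A minimal C sketch of software pipelining in general (the function names and the stand-in conversion are assumptions, not the Imagine kernel): the load for the next iteration is issued alongside the current iteration's arithmetic, giving a VLIW scheduler independent work to pack into each instruction word.

    /* Stand-in for the per-pixel conversion work. */
    static int do_convert(int x) { return 3 * x + 1; }

    /* Naive loop: the load, conversion, and store of one pixel serialize
       within each iteration. */
    void convert_naive(const int *in, int *out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = do_convert(in[i]);
    }

    /* Software-pipelined loop: iteration i+1's load overlaps iteration i's
       conversion. */
    void convert_pipelined(const int *in, int *out, int n) {
        if (n <= 0) return;
        int cur = in[0];                  /* prologue: first load */
        for (int i = 0; i < n - 1; i++) {
            int next = in[i + 1];         /* next iteration's load ...    */
            out[i] = do_convert(cur);     /* ... overlaps this conversion */
            cur = next;
        }
        out[n - 1] = do_convert(cur);     /* epilogue: last conversion */
    }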
Imagine RGB→YIQ Results

[Chart: Imagine RGB→YIQ integer ops (M/sec), roughly 0–6,000, for the original vs. software-pipelined code across small/medium/large data sizes]

Imagine (chunk size 1024)   Kernel (cycles)   Memory (cycles)
Unoptimized                 2153              1167
Optimized                   1147              1165

The optimized kernel takes only half the cycles per element; memory is now the bottleneck.
EEMBC Benchmarks

Benchmark         Width (VIRAM/Imagine)   Application Area   Remarks
Vector addition   32/32 bits              Microbenchmark     c[i]=a[i]+b[i]
RGB→YIQ           32/32 bits              EEMBC Consumer     Color conversion
RGB→CMYK          16/8 bits               EEMBC Consumer     Color conversion
Gray Filter       16/32 bits              EEMBC Consumer     3x3 convolution
Autocorrelation   16/32 bits              EEMBC Telecom      Dot product

[Charts: integer ops (G/sec) and bandwidth (GB/sec), each 0–6, for VIRAM and Imagine on 64K vector addition, RGB→YIQ, RGB→CMYK, and autocorrelation (speech, pulse)]
• Vec-add: one add per element; performance limited by the memory system
• RGB→(YIQ, CMYK): VIRAM limited by processing (cannot use the available bandwidth)
• Gray filter: difficult to implement efficiently on Imagine (sliding 3x3 window)
• Autocorrelation: uses short streams; Imagine's host latency is high
Scientific Kernels
SPMV Performance

Matrix (Rows / NNZ)         Metric     VIRAM                       Imagine
                                       CRS    SegSum   Ellpack    CRS    Streams   Ellpack
LSHAPE (1008 / 6958)        % Peak     2.8%   7.4%     31%        1.1%   0.8%      1.2%
                            Cycles     67K    24K      5.6K       40K    48K       38K
                            MFlop/s    44     118      496        136    114       149
LARGEDIS (10000 / 117820)   % Peak     3.2%   8.4%     32%        1.5%   0.6%      6.3%
                            Cycles     802K   567K     641K       742K   1840K     754K
                            MFlop/s    91     135      511        192    77        870

• Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle
• LSHAPE: finite element matrix; LARGEDIS: pseudo-random nonzeros
• Imagine lacks irregular access support, so the matrix is reordered before the kernel (see the CRS sketch below)
• VIRAM is better suited for this class of applications (low computation-to-memory ratio)
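
For reference, a minimal C sketch of sparse matrix-vector multiply in the CRS (compressed row storage) format used above (the function name is illustrative); the data-dependent, irregular access to x[col[j]] is exactly the pattern that is awkward for Imagine's stream model.

    #include <stddef.h>

    /* y = A*x with A in CRS: the nonzeros of row i are
       val[rowptr[i] .. rowptr[i+1]-1], with their columns in col[]. */
    void spmv_crs(size_t nrows, const size_t *rowptr, const size_t *col,
                  const double *val, const double *x, double *y) {
        for (size_t i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[col[j]];   /* indexed (gather) access to x */
            y[i] = sum;
        }
    }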
Scientific Kernels
Complex QR Decomposition

Matrix: MITRE RT_STRAP, 192x96 complex

Performance    VIRAM    Imagine
% of Peak      34.1%    65.5%
Total Cycles   5189K    712K
MFlop/s        546      10480

• A = QR, with Q orthogonal and R upper triangular
• Blocked Householder variant, rich in level-3 BLAS operations
• Complex elements increase ops/word and locality (1 complex MUL = 6 ops; see the sketch below)
• VIRAM uses a CLAPACK port (insertion of vector directives)
• Imagine: complex indexing of the matrix stream (each iteration works on a smaller matrix)
• Imagine sustains over 10 GFlop/s (19x VIRAM); this kernel is well suited to its architecture
• Low VIRAM performance is due to strided accesses and compiler limitations
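
A small C sketch of why one complex multiply counts as 6 real operations (4 multiplies + 2 adds), which is what raises the ops-per-word ratio of the complex QR kernel; the type and function names here are illustrative.

    /* One complex multiply expands to 4 real multiplies + 2 real adds,
       i.e. 6 flops per pair of complex input words. */
    typedef struct { double re, im; } cplx;

    static cplx cmul(cplx a, cplx b) {
        cplx c;
        c.re = a.re * b.re - a.im * b.im;  /* 2 mults + 1 add */
        c.im = a.re * b.im + a.im * b.re;  /* 2 mults + 1 add */
        return c;
    }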
Overview

• Significantly different balance of memory organization
• Relative performance depends on computational intensity
• Programming complexity is high for both approaches, although VIRAM is based on established vector technology
• For well-suited applications, the IMAGINE processor can sustain over 10 GFlop/s (simulated results)
• A large amount of homogeneous computation is required to saturate IMAGINE, while VIRAM can operate on small vector sizes
• IMAGINE can take advantage of producer-consumer locality
• Both offer significant reductions in power and space
• Both may be used as coprocessors in future-generation architectures
Next Generation
•CODE: next generation of VIRAM
–More functional units/ faster clock speed
–Local registers per unit instead of single register file.
–Looking more like Imagine…
•Multi VIRAM architecture – network interface issues?
•Brook: new language for Imagine
–Eliminate exposure of hardware details (# of clusters)
• Streaming Supercomputer – multi Imagine configuration
– Streams can be used for functional/data parallelism
•Currently evaluating DIVA architecture