Transcript: pim-workshop04 - Computer Science Division

Latency vs. Bandwidth
Which Matters More?
Katherine Yelick
U.C. Berkeley and LBNL
Joint work with:
Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands (LBNL)
The Berkeley IRAM group: Dave Patterson, Joe Gebis, Dave Judd,
Christoforos Kozyrakis, Sam Williams,…
The Berkeley Bebop group: Jim Demmel, Rich Vuduc, Ben Lee,
Rajesh Nishtala,…
Blame the Memory Bus
 Many scientific applications run at less than 10% of
hardware peak, even on a single processor
 The trend is to blame the memory bus
 Is this accurate?
 Need to understand bottlenecks to
 Design better machines
 Design better algorithms
 Two parts
 Algorithm bottlenecks on microprocessors
 Bottlenecks on a PIM system, VIRAM
[Figure: the processor-memory performance gap, 1980-2000 (log scale). CPU performance grows ~60%/year while DRAM improves ~7%/year, so the gap grows ~50%/year. Note: this is latency, not bandwidth.]
K. Yelick, PIM Software 2004
Memory Intensive Applications
 Poor performance is especially problematic for memory-intensive applications
 Low ratio of arithmetic operations to memory operations
 Irregular memory access patterns
 Example
 Sparse matrix-vector multiply (dominant kernel of NAS CG)
 Many scientific applications do this in some form
 Compute y = y + A*x
 Matrix is stored as two main arrays:
– Column index array (int)
– Value array (floating point)
 For each element y[i] compute
  y[i] += Σj x[index[j]] * value[j], with j running over the nonzeros of row i (a C sketch follows this slide)
 Accesses to x are irregular and not necessarily in cache
 So latency (to x) dominates, right?
K. Yelick, PIM Software 2004
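A minimal C sketch of this kernel in compressed-row form; the array names (row_start, index, value) are illustrative, not taken from a particular code in the talk.

/* y = y + A*x for a sparse matrix stored by rows.
 * row_start has n+1 entries; index and value hold the column indices
 * and nonzero values of each row, packed contiguously. */
void spmv_crs(int n, const int *row_start, const int *index,
              const double *value, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = y[i];
        for (int j = row_start[i]; j < row_start[i + 1]; j++)
            sum += value[j] * x[index[j]];   /* irregular access to x */
        y[i] = sum;
    }
}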
Performance Model is Revealing
 A simple analytical model for the sparse matvec kernel:
  time ≈ (# loads from memory) × (cost of a memory load) + (# loads from cache) × (cost of a cache load) + …
 Two versions:
  Only compulsory misses to the source vector, x
  All accesses to x produce a miss to memory
 Conclusion: cache misses to the source vector (memory latency) are not the dominant cost; PAPI measurements confirm this
 So bandwidth to the matrix dominates, right?
K. Yelick, PIM Software 2004
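A hedged sketch of how such a two-bound model might be coded; the per-access costs and the exact load accounting are illustrative, not the talk's actual model.

/* Modeled sparse-matvec time for nnz nonzeros and an n-element source
 * vector x. If all_x_miss is nonzero, every access to x goes to memory;
 * otherwise only the n compulsory misses do. */
double spmv_model(long nnz, long n, double t_mem, double t_cache,
                  int all_x_miss)
{
    long mem_loads   = 2 * nnz                      /* value + index streams */
                     + (all_x_miss ? nnz : n);      /* misses to x           */
    long cache_loads = all_x_miss ? 0 : (nnz - n);  /* remaining x accesses  */
    return mem_loads * t_mem + cache_loads * t_cache;
}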
Memory Bandwidth Measurements
 Yes, but be careful about how you measure bandwidth
 Not a constant
K. Yelick, PIM Software 2004
An Architectural Probe
 Sqmat is a tunable probe to measure architectures
 Stream of small matrices
 Square each matrix to some power: computational intensity
 The stream may be direct (dense), or indirect (sparse)
 If indirect, how frequently is there a non-unit-stride jump?
 Parameters:
  Matrix size within stream
  Computational intensity
  Indirection (yes/no)
  # unit strides before jump
K. Yelick, PIM Software 2004
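A rough C sketch of a Sqmat-like probe, assuming 3x3 matrices; the function and parameter names are invented for illustration, and the real probe's stride-control details are omitted.

#define M 3

/* Square one small matrix in place. */
static void square(double a[M][M])
{
    double t[M][M] = {{0}};
    for (int i = 0; i < M; i++)
        for (int k = 0; k < M; k++)
            for (int j = 0; j < M; j++)
                t[i][j] += a[i][k] * a[k][j];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = t[i][j];
}

/* Walk a stream of nmat matrices, squaring each npow times (the
 * computational intensity). The stream is either walked directly
 * (dense case) or through an index array (indirect/sparse case),
 * which adds a second load stream. */
void sqmat(double (*stream)[M][M], const int *index, long nmat,
           int npow, int indirect)
{
    for (long s = 0; s < nmat; s++) {
        double (*a)[M] = indirect ? stream[index[s]] : stream[s];
        for (int p = 0; p < npow; p++)
            square(a);
    }
}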
Cost of Indirection
 Adding a second load stream for indexes into stream has a
big effect on some machines
 This is truly a bandwidth issue
[Figure: slowdown (1x to ~6x) from adding the index stream, vs. number of squarings (computational intensity, 1 to 512), for Itanium 2, Opteron, Power3, and Power4.]
K. Yelick, PIM Software 2004
Cost of Irregularity
[Figure: slowdown vs. number of squarings (1 to 64) for irregularity parameter S = 1, 2, 4, 8, 16, 128, on Opteron, Itanium 2, Power3, and Power4.]
 Slowdown is relative to the results on the previous slide
 Even a tiny bit of irregularity (1/S) can have a big effect
K. Yelick, PIM Software 2004
What Does This Have to Do with PIMs?
Mflop/s with varying stream lengths
[Figure: MFLOPS vs. stream length (8 to 1024) on IMAGINE, IRAM, DIVA, and Power3.]
 Performance of Sqmat on PIMs and other machines for 3x3 matrices, squared 10 times (high computational intensity!)
 Imagine is much faster for long streams, slower for short ones
K. Yelick, PIM Software 2004
VIRAM Overview
 Technology: IBM SA-27E
  0.18um CMOS, 6 metal layers
 290 mm2 die area (14.5 mm x 20.0 mm)
  225 mm2 for memory/logic
 Transistor count: ~130M
 13 MB of DRAM
 Power supply: 1.2V for logic, 1.8V for DRAM
 Typical power consumption: 2.0 W
  0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) + 0.3 W (misc)
 MIPS scalar core + 4-lane vector unit
 Peak vector performance
  1.6/3.2/6.4 Gops wo. multiply-add (64b/32b/16b operations)
  3.2/6.4/12.8 Gops w. madd
  1.6 Gflops (single-precision)
K. Yelick, PIM Software 2004
Vector IRAM ISA Summary
 Scalar: MIPS64 scalar instruction set
 Vector ALU: alu op in .v / .vv / .vs / .sv forms, on s.int, u.int, s.fp, d.fp data at 8/16/32/64-bit widths
 Vector Memory: load and store with unit-stride, constant-stride, and indexed addressing, for s.int and u.int data
 91 instructions, 660 opcodes
 ALU operations: integer, floating-point, fixed-point and DSP, convert, logical, vector processing, flag processing
K. Yelick, PIM Software 2004
VIRAM Compiler
Frontends: C, C++, Fortran95
Optimizer: Cray's PDGCS
Code generators: T3D/T3E, C90/T90/X1, SV2/VIRAM
 Based on Cray's production compiler
 Challenges: narrow data types and scalar/vector memory consistency
 Advantages relative to media extensions: powerful addressing modes and an ISA independent of datapath width
K. Yelick, PIM Software 2004
Compiler and OS Enhancements
 Compiler based on Cray PDGCS
 Outer-loop vectorization (see the sketch after this slide)
 Strided and indexed vector loads and stores
 Vectorization of loops with if statements
 Full predicated execution of vector instructions using flag registers
 Vectorization of reductions and FFTs
 Instructions for simple, intra-register permutations
 Automatic for reductions, manual (or StreamIT) for FFTs
 Vectorization of loops with break statements
 Software speculation support for vector loads
 OS development
 MMU-based virtual memory
 OS performance
 Dirty and valid bits for registers to reduce context switch
overhead
K. Yelick, PIM Software 2004
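As a generic illustration of outer-loop vectorization (not code from the talk): when the inner loop is short, the compiler can instead vectorize the outer loop, operating on one inner position of many outer iterations at a time.

/* Each row is short (m small), so vectorizing the j loop would give
 * short vectors; outer-loop vectorization runs the i loop in vector mode. */
void scale_rows(int n, int m, double a[n][m], const double *s)
{
    for (int i = 0; i < n; i++)      /* candidate for outer-loop vectorization */
        for (int j = 0; j < m; j++)
            a[i][j] *= s[i];
}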
HW Resources Visible to Software
[Figure: which hardware resources are visible to vs. transparent to software on Vector IRAM and on the Pentium III.]
• Software (applications/compiler/OS) can control
– Main memory, registers, execution datapaths
K. Yelick, PIM Software 2004
VIRAM Chip Statistics
Technology: IBM SA-27E, 0.18um CMOS, 6 layers of copper; deep trench DRAM cell, full-speed logic
Area: 270 mm2 (65 mm2 logic, 140 mm2 DRAM)
Transistors: ~130 million (7.5M logic, 122.5M DRAM)
Supply: 1.2V logic, 1.8V DRAM, 3.3V I/O
Clock: 200 MHz
Power: 2W (0.5W MIPS core, 1W vector unit, 0.5W DRAM-I/O)
Package: 304-lead quad ceramic package (125 signal I/Os)
Crossbar BW: 12.8 Gbytes/s per direction (load or store, peak)
Peak performance:
  Integer wo. madd: 1.6/3.2/6.4 Gops (64b/32b/16b)
  Integer w. madd: 3.2/6.4/12.8 Gops (64b/32b/16b)
  FP: 1.6 Gflops (32b, wo. madd)
K. Yelick, PIM Software 2004
VIRAM Design Statistics
RTL model: 170K lines of Verilog
Design methodology:
  Synthesized: MIPS core, vector unit control, FP datapath
  Full-custom: vector reg. file, crossbar, integer datapaths
  Macros: DRAM, SRAM for caches
IP sources: UC Berkeley (vector coprocessor, crossbar, I/O); MIPS Technologies (MIPS core); IBM (DRAM/SRAM macros); MIT (FP datapath)
Verification: 566K lines of directed tests (9.8M lines of assembly); 4 months of random testing on 20 Linux workstations
Design team: 5 graduate students
Status: place & route, chip assembly
Tape-out: October 2002
Design time: ~2.5 years
K. Yelick, PIM Software 2004
VIRAM Chip
[Die photo: DRAM banks, MIPS core, 4 64-bit vector lanes, I/O.]
K. Yelick, PIM Software 2004
 Taped out to IBM in October ‘02
 Received wafers in June 2003.
 Chips were thinned, diced, and
packaged.
 Parts were sent to ISI, who
produced test boards.
Demonstration System
 Based on the MIPS Malta
development board
 PCI, Ethernet, AMR, IDE, USB,
CompactFlash, parallel, serial
 VIRAM daughter-card
  Designed at ISI-East
  VIRAM processor
  Galileo GT64120 chipset
  1 DIMM slot for external DRAM
 Software support and OS
 Monitor utility for debugging
 Modified version of MIPS Linux
K. Yelick, PIM Software 2004
Benchmarks for Scientific Problems
 Dense and Sparse Matrix-vector multiplication
 Compare to tuned codes on conventional machines
 Transitive-closure (small & large data set)
 On a dense graph representation
 NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
 Fetch-and-increment a stream of “random” addresses (see the sketch after this slide)
 Sparse matrix-vector product:
 Order 10000, #nonzeros 177820
 Computing a histogram
 Used for image processing of a 16-bit greyscale image: 1536 x 1536
 2 algorithms: 64-element sorting kernel; privatization
 Also used in sorting
 2D unstructured mesh adaptation
 initial grid: 4802 triangles, final grid: 24010
K. Yelick, PIM Software 2004
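A minimal sketch of a GUPS-style fetch-and-increment kernel in C; the table size, array names, and index mapping are illustrative rather than the actual NSA benchmark code.

#include <stdint.h>

/* Fetch-and-increment a stream of "random" addresses into a large table.
 * Each iteration is an irregular read-modify-write with little reuse. */
void gups(uint64_t *table, uint64_t table_size,
          const uint64_t *addr, uint64_t n_updates)
{
    for (uint64_t i = 0; i < n_updates; i++)
        table[addr[i] % table_size] += 1;
}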
Sparse MVM Performance
 Performance is matrix-dependent: lp matrix
 compiled for VIRAM using “independent” pragma
 sparse column layout
 Sparsity-optimized for other machines
 sparse row (or blocked row) layout
[Figure: sparse MVM performance (MFLOPS, 0 to 250) on PowerPC 604e, Alpha 21264, MIPS 10K, Sun Ultra I, VIRAM8, and VIRAM4.]
K. Yelick, PIM Software 2004
Power and Performance on BLAS-2
[Figure: MFLOPS and MFLOPS/Watt on VIRAM, Sun Ultra I, Sun Ultra II, MIPS R12K, Alpha 21264, PowerPC G3, and Power3 630.]
 100x100 matrix vector multiplication (column layout)
 VIRAM result compiled, others hand-coded or Atlas optimized
 VIRAM performance improves with larger matrices
 VIRAM power includes on-chip main memory
 8-lane version of VIRAM nearly doubles MFLOPS
K. Yelick, PIM Software 2004
Performance Comparison
[Figure: MOPS (0 to 1000) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, and Mesh, for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
 IRAM designed for media processing
 Low power was a higher priority than high performance
 IRAM (at 200MHz) is better for apps with sufficient parallelism
K. Yelick, PIM Software 2004
Power Efficiency
[Figure: MOPS/Watt (log scale, 0.1 to 1000) on the same benchmarks, for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
 Same data on a log plot
 Comparison includes both low-power (Mobile PIII) and high-performance processors
 The picture is the same for operations per cycle
K. Yelick, PIM Software 2004
Which Problems are Limited by Bandwidth?
[Figure: MOPS and memory bandwidth (MB/s, up to 6000) for Transitive, GUPS, SPMV (Regular), SPMV (Random), Histogram, and Mesh.]
 What is the bottleneck in each case?
 Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
 SPMV and Mesh are limited by address generation and bank conflicts
 For Histogram there is insufficient parallelism
K. Yelick, PIM Software 2004
Summary of 1-PIM Results
 Programmability advantage
 All vectorized by the VIRAM compiler (Cray vectorizer)
 With restructuring and hints from programmers
 Performance advantage
 Large on applications limited only by bandwidth
 More address generators/sub-banks would help irregular
performance
 Performance/Power advantage
 Over both low power and high performance processors
 Both PIM and data parallelism are key
K. Yelick, PIM Software 2004
Alternative VIRAM Designs
 “VIRAM-4Lane”: 4 lanes, 8 Mbytes, ~190 mm2, 3.2 Gops at 200 MHz
 “VIRAM-2Lanes”: 2 lanes, 4 Mbytes, ~120 mm2, 1.6 Gops at 200 MHz
 “VIRAM-Lite”: 1 lane, 2 Mbytes, ~60 mm2, 0.8 Gops at 200 MHz
K. Yelick, PIM Software 2004
Compiled Multimedia Performance
[Figure: millions of operations per second with 1, 2, and 4 lanes, for floating-point and integer kernels: matmul 64x64, saxpy 4K, fir filter, decrypt, detect, convolve, compose, colorspace.]
 Single executable for multiple implementations
 Linear scaling with the number of lanes
 Remember, this is a 200 MHz, 2 W processor
K. Yelick, PIM Software 2004
Third Party Comparison (I)
ISI Results for SLIIC Kernels (Performance)
[Figure: speedup over a 400 MHz PPC G3 (up to ~25x) on Corner Turn, Coherent Sidelobe Canceller, and Beam Steering, for M32R/D-80MHz, PPC G4-733MHz, Pentium III-733MHz, VIRAM-200MHz, and Imagine-400MHz.]
K. Yelick, PIM Software 2004
Third Party Comparison (II)
ISI-East Results for SLIIC Kernels (Performance/Watt)
[Figure: improvement in performance per watt over the 400 MHz PPC G3 (up to ~40x) on the same kernels and processors.]
K. Yelick, PIM Software 2004
Vectors vs. SIMD or VLIW
 SIMD
 Short, fixed-length, vector extensions
 Require wide issue or ISA change to scale
 They don’t support vector memory accesses
 Difficult to compile for
 Performance wasted for pack/unpack, shifts, rotates…
 VLIW
 Architecture for instruction level parallelism
 Orthogonal to vectors for data parallelism
 Inefficient for data parallelism
 Large code size (3X for IA-64?)
 Extra work for software (scheduling more instructions)
 Extra work for hardware (decode more instructions)
K. Yelick, PIM Software 2004
Vector vs. Wide-Word SIMD: Example
 Vector instruction sets have
 Strided and scatter/gather load/store operations
 SIMD extensions load contiguous memory
 Implementation-independent vector length
 SIMD extensions change the ISA with the hardware bit width
 Simple example: conversion from RGB to YUV
 Thanks to Christoforos Kozyrakis
Y = [( 9798*R + 19235*G + 3736*B) / 32768]
U = [(-4784*R - 9437*G + 4221*B) / 32768] + 128
V = [(20218*R – 16941*G – 3277*B) / 32768] + 128
K. Yelick, PIM Software 2004
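For reference, a scalar C sketch of the same conversion (one pixel per iteration), using the fixed-point formulas above; array names are illustrative, and results are clamped to 0..255 for byte storage.

#include <stdint.h>

static uint8_t clamp8(int x)            /* saturate to the 0..255 byte range */
{
    return (uint8_t)(x < 0 ? 0 : x > 255 ? 255 : x);
}

void rgb_to_yuv(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, uint8_t *u, uint8_t *v, int n)
{
    for (int i = 0; i < n; i++) {
        int R = r[i], G = g[i], B = b[i];
        y[i] = clamp8((  9798*R + 19235*G +  3736*B) / 32768);
        u[i] = clamp8((( -4784*R -  9437*G +  4221*B) / 32768) + 128);
        v[i] = clamp8((( 20218*R - 16941*G -  3277*B) / 32768) + 128);
    }
}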
VIRAM Code
RGBtoYUV:
  vlds.u.b    r_v,  r_addr, stride3, addr_inc    # load R
  vlds.u.b    g_v,  g_addr, stride3, addr_inc    # load G
  vlds.u.b    b_v,  b_addr, stride3, addr_inc    # load B
  xlmul.u.sv  o1_v, t0_s, r_v                    # calculate Y
  xlmadd.u.sv o1_v, t1_s, g_v
  xlmadd.u.sv o1_v, t2_s, b_v
  vsra.vs     o1_v, o1_v, s_s
  xlmul.u.sv  o2_v, t3_s, r_v                    # calculate U
  xlmadd.u.sv o2_v, t4_s, g_v
  xlmadd.u.sv o2_v, t5_s, b_v
  vsra.vs     o2_v, o2_v, s_s
  vadd.sv     o2_v, a_s,  o2_v
  xlmul.u.sv  o3_v, t6_s, r_v                    # calculate V
  xlmadd.u.sv o3_v, t7_s, g_v
  xlmadd.u.sv o3_v, t8_s, b_v
  vsra.vs     o3_v, o3_v, s_s
  vadd.sv     o3_v, a_s,  o3_v
  vsts.b      o1_v, y_addr, stride3, addr_inc    # store Y
  vsts.b      o2_v, u_addr, stride3, addr_inc    # store U
  vsts.b      o3_v, v_addr, stride3, addr_inc    # store V
  subu        pix_s, pix_s, len_s
  bnez        pix_s, RGBtoYUV
K. Yelick, PIM Software 2004
MMX Code (1)
[MMX assembly for the same RGB-to-YUV loop, part 1 of 3: interleaved movq / punpcklbw / pmaddwd / paddd / psrad sequences over registers mm0-mm7.]
K. Yelick, PIM Software 2004
MMX Code (2)
[MMX assembly, part 2 of 3: further pmaddwd / psrad / packssdw work, staging intermediate results in TEMP0, TEMPY, TEMPU, TEMPV.]
K. Yelick, PIM Software 2004
MMX Code (3)
[MMX assembly, part 3 of 3: final packssdw / packuswb packing, stores through ebx / ecx / edx, pointer increments, and the loop branch (dec edi; jnz RGBtoYUV). The contrast in length and complexity with the short VIRAM loop above is the point of these slides.]
K. Yelick, PIM Software 2004
Summary
 Combination of Vectors and PIM
 Simple execution model for hardware – pushes complexity to
compiler
 Low power/footprint/etc.
 PIM provides bandwidth needed by vectors
 Vectors hide latency effectively
 Programmability
 Programmable from “high” level language
 More compact instruction stream
 Works well for:
 Applications with fine-grained data parallelism
 Memory intensive problems
 Both scientific and multimedia applications
K. Yelick, PIM Software 2004
The End
K. Yelick, PIM Software 2004
Algorithm Space
[Figure: algorithm classes arranged by reuse (vertical axis) and regularity (horizontal axis): search, Grobner basis (“symbolic LU”), FFTs, sparse iterative solvers, asynchronous discrete event simulation, sparse direct solvers, sorting, two-sided dense linear algebra, one-sided dense linear algebra.]
K. Yelick, PIM Software 2004
VIRAM Overview
 MIPS core (200 MHz)
  Single-issue, 8 Kbyte I&D caches
 Vector unit (200 MHz)
  32 64b elements per register
  256b datapaths (16b, 32b, 64b ops)
  4 address generation units
 Main memory system
  13 MB of on-chip DRAM in 8 banks
  12.8 GBytes/s peak bandwidth
 Die size: 14.5 mm x 20.0 mm
 Typical power consumption: 2.0 W
 Peak vector performance
  1.6/3.2/6.4 Gops wo. multiply-add
  1.6 Gflops (single-precision)
 Fabrication by IBM
 Tape-out in O(1 month)
K. Yelick, PIM Software 2004
Benchmarks for Scientific Problems
 Dense Matrix-vector multiplication
 Compare to hand-tuned codes on conventional machines
 Transitive-closure (small & large data set)
 On a dense graph representation
 NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
 Fetch-and-increment a stream of “random” addresses
 Sparse matrix-vector product:
 Order 10000, #nonzeros 177820
 Computing a histogram
 Used for image processing of a 16-bit greyscale image: 1536 x 1536
 2 algorithms: 64-element sorting kernel; privatization
 Also used in sorting
 2D unstructured mesh adaptation
 initial grid: 4802 triangles, final grid: 24010
K. Yelick, PIM Software 2004
Power Efficiency
[Figure: MOPS/Watt (linear scale, 0 to 500) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, and Mesh, for VIRAM, R10K, P-III, P4, Sparc, and EV6.]
 Huge power/performance advantage in VIRAM from both
 PIM technology
 Data parallel execution model (compiler-controlled)
K. Yelick, PIM Software 2004
Analysis of a Multi-PIM System
 Machine Parameters
 Floating point performance
 PIM-node dependent
 Application dependent, not theoretical peak
 Amount of memory per processor
 Use 1/10th Algorithm data
 Communication Overhead
 Time processor is busy sending a message
 Cannot be overlapped
 Communication Latency
 Time across the network (can be overlapped)
 Communication Bandwidth
 Single node and bisection
 Back-of-the-envelope calculations!
K. Yelick, PIM Software 2004
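A hedged sketch of the kind of back-of-the-envelope model these slides use; the parameter names, the cost formula, and all numbers are illustrative, not the study's actual figures.

#include <stdio.h>

typedef struct {
    double flops_per_node;   /* achieved flop/s per node, not peak  */
    double overhead_s;       /* CPU-busy time per message (seconds) */
    double latency_s;        /* network latency, assumed overlapped */
    double bandwidth_Bps;    /* per-node bandwidth (bytes/second)   */
} machine;

/* Modeled time for a step that performs `flops` of work and sends
 * `msgs` messages totalling `bytes`; latency is hidden by overlap,
 * so only the overhead and bandwidth terms appear. */
static double step_time(machine m, double flops, double msgs, double bytes)
{
    return flops / m.flops_per_node
         + msgs * m.overhead_s
         + bytes / m.bandwidth_Bps;
}

int main(void)
{
    machine pim = { 16e9, 100e-9, 500e-9, 6.4e9 };   /* made-up values */
    printf("modeled step time: %g s\n",
           step_time(pim, 2.0 * 177820, 8.0, 8.0 * 4096));
    return 0;
}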
Real Data from an Old Machine (T3E)
Sparse Matrix-Vector Multiply (T3E)
[Figure: Mflops vs. number of processors (1 to 32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, and UPC Small.]
 UPC uses a global address space
 Non-blocking remote put/get model
 Does not cache remote data
K. Yelick, PIM Software 2004
Running Sparse MVM on a Pflop PIM
[Figure: modeled ops/sec (10^7 to 10^16) vs. number of nodes, for Put/Get, blocking read/write, synchronous MP, asynchronous MP, and peak.]
 1 GHz * 8 pipes * 8 ALUs/pipe = 64 GFLOPS/node peak
 8 address generators limit performance to 16 Gflops
 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead
 Programmability differences too: packing vs. global address space
K. Yelick, PIM Software 2004
Effect of Memory Size
[Figure: modeled ops/sec (10^7 to 10^16) vs. MB of data per node (roughly 0.3 MB to several GB, log scale), for Put/Get, blocking read/write, synchronous MP, asynchronous MP, and peak.]
 For small memory nodes or smaller problem sizes, low overhead is more important
 For large memory nodes and large problems, packing is better
K. Yelick, PIM Software 2004
Conclusions
 Performance advantage for PIMS depends on application
 Need fine-grained parallelism to utilize on-chip bandwidth
 Data parallelism is one model with the usual trade-offs
 Hardware and programming simplicity
 Limited expressibility
 Largest advantages for PIMS are power and packaging
 Enables Peta-scale machine
 Multiprocessor PIMs should be easier to program
 At least at scale of current machines (Tflops)
 Can we get rid of the current programming model hierarchy?
K. Yelick, PIM Software 2004
Benchmarks
 Kernels
 Designed to stress memory systems
 Some taken from the Data Intensive Systems Stressmarks
 Unit and constant stride memory
 Dense matrix-vector multiplication
 Transitive-closure (see the sketch after this slide)
 Constant stride
 FFT
 Indirect addressing
 NSA Giga-Updates Per Second (GUPS)
 Sparse Matrix Vector multiplication
 Histogram calculation (sorting)
 Frequent branching as well as irregular memory access
 Unstructured mesh adaptation
K. Yelick, PIM Software 2004
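One common dense formulation of the transitive-closure kernel, as a hedged C sketch (Warshall's algorithm on a byte reachability matrix; not necessarily the Stressmark code).

/* reach is an n*n row-major 0/1 matrix; on return, reach[i*n+j] is 1
 * if j is reachable from i. The inner j loop is unit stride, which is
 * why this kernel stresses bandwidth rather than latency. */
void transitive_closure(int n, unsigned char *reach)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            if (reach[i*n + k])
                for (int j = 0; j < n; j++)
                    reach[i*n + j] |= reach[k*n + j];
}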
Conclusions and VIRAM Future Directions
 VIRAM outperforms Pentium III on Scientific problems
 With lower power and clock rate than the Mobile Pentium
 Vectorization techniques developed for the Cray PVPs are applicable.
 PIM technology provides a low-power, low-cost memory system.
 A similar combination is used in the Sony Playstation.
 Small ISA changes can have large impact
 Limited in-register permutations sped up 1K FFT by 5x.
 Memory system can still be a bottleneck
 Indexed/variable stride costly, due to address generation.
 Future work:
  Ongoing investigations into the impact of lanes and subbanks
  Technical paper in preparation – expect completion 09/01
  Run benchmarks on real VIRAM chips
  Examine multiprocessor VIRAM configurations
K. Yelick, PIM Software 2004
Management Plan
 Roles of different groups and PIs
 Senior researchers working on particular classes of benchmarks
  Parry: sorting and histograms
  Sherry: sparse matrices
  Lenny: unstructured mesh adaptation
  Brian: simulation
  Jin and Hyun: specific benchmarks
 Plan to hire an additional postdoc for next year (focus on Imagine)
 Undergrad model used for targeted benchmark efforts
 Plan for using computational resources at NERSC
  Few resources used, except for comparisons
K. Yelick, PIM Software 2004
Future Funding Prospects
 FY2003 and beyond
  DARPA initiated the DIS program
  Related projects are continuing under Polymorphic Computing
  New BAA coming in “High Productivity Systems”
  Interest from other DOE labs (LANL) in the general problem
 General model
  Most architectural research projects need benchmarking
  Work has higher quality if done by people who understand the apps
  Expertise for hardware projects is different: system-level design, circuit design, etc.
  Both the IRAM and Imagine groups have expressed interest
K. Yelick, PIM Software 2004
Long Term Impact
 Potential impact on Computer Science
 Promote research of new architectures and microarchitectures
 Understand future architectures
 Preparation for procurements
 Provide visibility of NERSC in core CS research areas
 Correlate applications: DOE vs. large market problems
 Influence future machines through research
collaborations
K. Yelick, PIM Software 2004
Benchmark Performance on IRAM Simulator
 IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
K. Yelick, PIM Software 2004
Project Goals for FY02 and Beyond
 Use established data-intensive scientific benchmarks with
other emerging architectures:
 IMAGINE (Stanford Univ.)
  Designed for graphics and image/signal processing
  Peak 20 GFLOPS (32-bit FP)
  Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  Preliminary discussions with Bill Dally
 DIVA (DARPA-sponsored: USC/ISI)
  Based on a PIM “smart memory” design, but for multiprocessors
  Move computation to data
  Designed for irregular data structures and dynamic databases
  Discussions with Mary Hall about benchmark comparisons
K. Yelick, PIM Software 2004
Media Benchmarks
[Figure: GOPS (0 to 4) per benchmark: floating-point FFT, matmul, saxpy (two sizes), fir filter, fixed-point FFT, decrypt, detect, convolve, colorspace, composition.]
 FFT uses in-register permutations, generalized reduction
 All others written in C with Cray vectorizing compiler
K. Yelick, PIM Software 2004
Integer Benchmarks
[Figure: MOPS (0 to 7000) with 1, 2, and 4 lanes, for peak and for the integer kernels decrypt, detect, convolve, composition, and colorspace.]
 Strided access important, e.g., RGB
 narrow types limited by address generation
 Outer loop vectorization and unrolling used
 helps avoid short vectors
 spilling can be a problem
K. Yelick, PIM Software 2004
Status of benchmarking software release
[Table: release status (optimized vs. unoptimized, with docs) for the benchmark codes: GUPS (C codes, optimized inner loop), vector histogram (optimized code, code generator), pointer jumping, pointer jumping w/update, conjugate gradient, transitive (matrix), field, neighborhood; plus a standard random number generator, test cases (small and large working sets), and build and test scripts (Makefiles, timing, analysis, ...).]
 Future work:
  • Write more documentation, add better test cases as we find them
  • Incorporate media benchmarks, AMR code, library of frequently-used compiler flags & pragmas
K. Yelick, PIM Software 2004
Status of benchmarking work
 Two performance models:
 simulator (vsim-p), and trace analyzer (vsimII)
 Recent work on vsim-p:
 Refining the performance model for double-precision FP
performance.
 Recent work on vsimII:
 Making the backend modular
 Goal: Model different architectures w/ same ISA.
 Fixing bugs in the memory model of the VIRAM-1 backend.
 Better comments in code for better maintainability.
 Completing a new backend for a new decoupled cluster
architecture.
K. Yelick, PIM Software 2004
Comparison with Mobile Pentium
 GUPS: VIRAM gets 6x more GUPS

Data element width    Mobile Pentium GUPS    VIRAM GUPS
16 bit                .045                   .295
32 bit                .046                   .295
64 bit                .036                   .244

[Figures: total execution time (seconds) vs. problem size for Transitive (matrix sizes 50 to 550), Update (working set sizes tiny through test4), and Pointer (working set sizes tiny through test3), comparing the 4-lane VIRAM with the Mobile P-III.]
 VIRAM is 30-50% faster than the P-III
 Execution time for VIRAM rises much more slowly with data size than for the P-III
K. Yelick, PIM Software 2004
Sparse CG
 Solve Ax = b; Sparse matrix-vector multiplication
dominates.
 Traditional CRS format requires:
 Indexed load/store for X/Y vectors
 Variable vector length, usually short
 Other formats for better vectorization:
 CRS with narrow band (e.g., RCM ordering)
 Smaller strides for X vector
 Segmented-Sum (modified from the old code developed for the Cray PVP)
 Long vector length, of same size
 Unit stride
 ELL format: make all rows the same length by padding zeros
 Long vector length, of same size
 Extra flops
K. Yelick, PIM Software 2004
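A minimal C sketch of the ELL idea (array names are illustrative): every row is padded to the same length K and stored column-by-column, so the loop over rows is long, unit stride, and fully vectorizable, at the cost of extra flops on the zero padding.

/* ELL-format y = y + A*x: ell_val and ell_idx hold n*K entries,
 * laid out so that entry k of every row is contiguous across rows.
 * Padded entries carry value 0.0 and any valid column index. */
void spmv_ell(int n, int K, const int *ell_idx, const double *ell_val,
              const double *x, double *y)
{
    for (int k = 0; k < K; k++)
        for (int i = 0; i < n; i++)
            y[i] += ell_val[k*n + i] * x[ell_idx[k*n + i]];
}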
SMVM Performance
 DIS matrix: N = 10000, M = 177820 (~ 17 nonzeros per
row)
 IRAM results (MFLOPS):

SubBanks               1          2          4          8
CRS                    91         106        109        110
CRS banded             110        110        110        110
SEG-SUM                135        154        163        165
ELL (4.6x more flops)  511 (111)  570 (124)  612 (133)  632 (137)
 Mobile PIII (500 MHz)
 CRS: 35 MFLOPS
K. Yelick, PIM Software 2004
2D Unstructured Mesh Adaptation
 Powerful tool for efficiently solving computational problems with
evolving physical features (shocks, vortices, shear layers, crack
propagation)
 Complicated logic and data structures
 Difficult to achieve high efficiency
 Irregular data access patterns (pointer chasing)
 Many conditionals / integer intensive
 Adaptation is a tool for making the numerical solution cost-effective
 Three types of element subdivision
K. Yelick, PIM Software 2004
Vectorization Strategy and Performance Results
 Color elements based on vertices (not edges)
 Guarantees no conflicts during vector operations
 Vectorize across each subdivision (1:2, 1:3, 1:4) one color at a time
 Difficult: many conditionals, low flops, irregular data access,
dependencies
 Initial grid: 4802 triangles, Final grid 24010 triangles
Time (ms):
Pentium III 500    1 Lane    2 Lanes    4 Lanes
61                 18        14         13
 Preliminary results demonstrate VIRAM 4.5x faster than Mobile
Pentium III 500
 Higher code complexity (requires graph coloring + reordering)
K. Yelick, PIM Software 2004