Transcript sqmat-NSA04

Performance Analysis, Modeling, and Optimization:
Understanding the Memory Wall
Leonid Oliker (LBNL) and Katherine Yelick (UCB and LBNL)
Berkeley Institute for Performance Studies

Joint venture between U.C. Berkeley (Demmel & Yelick)
and LBNL (Oliker, Strohmaier, Bailey, and others)

Three performance techniques:

 Analysis (benchmarking)
 Modeling (prediction)

Investigating Architectural Balance using Adaptable Probes
Kaushik Datta, Parry Husbands, Paul Hargrove, Shoaib Kamil, Leonid Oliker, John Shalf, Katherine Yelick
Overview

Gap between peak and sustained performance is a well-known problem in HPC

Generally attributed to the memory system, but difficult to identify the bottleneck

Application benchmarks are too complex to isolate specific architectural features

Microbenchmarks are too narrow to predict actual code performance

We use adaptable probes to isolate performance limitations:

 Single processor probes
 Give application developers possible optimizations
 Give hardware designers feedback on current and proposed architectures

Sqmat captures regular and irregular memory access patterns (such as dense and sparse linear algebra)

Stencil captures nearest-neighbor computation (work in progress)
Architectures examined:

 Commercial: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4, G5
 Research: Imagine, IRAM, DIVA
Sqmat overview

Sqmat is based on matrix multiplication and linear solvers

A Java program is used to generate optimally unrolled C code

Square a set of matrices M times (use enough matrices to exceed cache)

 Each matrix is of size NxN

 M controls computational intensity (CI) - the ratio between flops and memory accesses
 N controls working set size: 2N^2 registers are required per matrix

Direct storage: Sqmat's matrix entries stored contiguously in memory
Indirect: entries accessed indirectly through a pointer

 Parameter S controls the degree of indirection: S matrix entries are stored contiguously, then a random jump in memory
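The probe's structure can be sketched as follows. This is an illustrative stand-in, not the authors' generated code: the function names and the fixed N are assumptions. Repeatedly squaring each NxN matrix raises flops while keeping memory traffic fixed, which is how M controls CI; the gather through an index array models indirection.

```c
/* Sketch of the Sqmat probe idea (hypothetical code, not the generated
   kernel). Squaring an NxN matrix M times costs roughly 2*N^3*M flops
   against N^2 words of traffic, so CI grows with M. */
#include <string.h>

#define N 3  /* matrix dimension; the real probe varies N */

/* Square matrix 'a' in place M times via a scratch buffer. */
static void square_m_times(double a[N][N], int m) {
    double tmp[N][N];
    for (int it = 0; it < m; it++) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * a[k][j];
                tmp[i][j] = s;
            }
        memcpy(a, tmp, sizeof tmp);
    }
}

/* Indirect storage: entries are reached through an index array.
   In the probe, every run of S indices is contiguous, then the
   stream takes a random jump. */
static void gather(double *dst, const double *src, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```

Varying M, N, and the run length S in the index array is what lets one probe sweep from dense-like (unit stride, high CI) to sparse-like (random, low CI) behavior.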
[Figure: Unit-stride percent of algorithmic peak (0-100%) vs. computational intensity (CI, 1-10000, log scale) for Itanium2, Opteron, Power3, and Power4]

Each curve increases until the memory system is fully utilized, then plateaus when the FPUs saturate

Itanium2 requires a longer time to reach its plateau due to a register spill penalty

The SIMD nature of Opteron's SSE2 inhibits a high algorithmic peak

Power3 effectively hides the latency of cache access

Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs
Slowdown due to Indirection

[Figure: Slowdown (1x-5x) of unit-stride access via indirection (S=1) vs. M (1-512) for Itanium2, Opteron, Power3, and Power4]
Opteron and Power3/4 show less than a 10% penalty once M>8 - demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
Itanium2 shows a high penalty for indirection - the issue is currently under investigation
Cost of Irregularity (1)

[Figure: Slowdown for irregular access vs. M (1-512) on Itanium2 and Opteron, N=4, for irregularity from 100% (S=1) down to 0.78% (S=128); slowdowns stay below roughly 3.5x-4x]
Itanium2 and Opteron perform well for irregular accesses due to:

 Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
 Opteron's low memory latency due to its on-chip memory controller
Cost of Irregularity (2)

[Figure: Slowdown for irregular access vs. M (1-512) on Power3 and Power4, N=4, for irregularity from 100% (S=1) down to 0.39% (S=256) plus fully random accesses; slowdowns reach roughly 21x on Power3 and 11x on Power4]
Power3 and Power4 perform poorly for irregular accesses due to:

 Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
 Power4 requiring 4 cache-line hits to activate prefetching
Tolerating Irregularity

S50

 Start with some M at S=∞ (indirect unit stride)
 For a given M, how large must S be to achieve at least 50% of the original performance?

M50

 Start with M=1, S=∞
 At S=1 (every access random), how large must M be to achieve 50% of the original performance?
Tolerating Irregularity

S50: What % of memory accesses can be random before performance decreases by half?

M50: How much computational intensity is required to hide the penalty of all-random access?
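The S50 search can be sketched as a sweep over S. The performance model below is a made-up stand-in (an assumption, not measured data); a real probe would time the Sqmat kernel at each S instead.

```c
/* Illustrative S50 computation: the smallest contiguous-run length S at
   which performance is at least half of the unit-stride (S=∞) case.
   rel_perf is a toy model, NOT measured probe data. */

/* Toy model: relative performance rises toward 1 as S grows. */
static double rel_perf(int s) {
    return (double)s / (s + 4.0);  /* assumption for illustration */
}

/* Sweep power-of-two S values, return the first meeting 50%. */
static int s50(void) {
    for (int s = 1; s <= 512; s *= 2)
        if (rel_perf(s) >= 0.5)
            return s;
    return -1;  /* never reaches 50% in range */
}
```

M50 would be found the same way, but fixing S=1 and sweeping M until half of the M=1, S=∞ performance is recovered.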
[Figure: S50 - % of indirection (0% to 100%) tolerable before halving performance, for Itanium2, Opteron, Power3, and Power4]

Gather/scatter is expensive on commodity cache-based systems

 Power4 can tolerate only 1.6% indirection (1 in 64)
 Itanium2 is much less sensitive, at 25% (1 in 4)
[Figure: Computational intensity (CI) required to hide all-random access at 50% performance: Itanium2 9.3, Opteron 18.7, Power3 149.3, Power4 74.7]

A huge amount of computation may be required to hide the overhead of irregular data access

 Itanium2 requires a CI of about 9 flops/word
 Power4 requires a CI of almost 75!
Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems
Emerging Architectures

General-purpose processors are badly suited for data-intensive ops

 Large caches are not useful if re-use is low
 Low memory bandwidth, especially for irregular patterns
 Superscalar methods of increasing ILP are inefficient
 Power consumption

Application-specific ASICs

 Good, but expensive/slow to design

Solution: general-purpose "memory-aware" processors

 Large number of ALUs: to exploit data parallelism
 Huge memory bandwidth: to keep the ALUs busy
 Concurrency: overlap memory with computation
VIRAM Overview

 MIPS core (200 MHz)
 Main memory system
  8 banks w/ 13 MB of on-chip DRAM
  Large 6.4 GB/s on-chip peak bandwidth
 Cache-less vector unit
  Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  Single issue, in order
 Low power consumption: 2.0 W
 Peak vector performance
  1.6/3.2/6.4 Gops
  1.6 Gflops (single-precision)

Fabricated by IBM
Deep pipelines mask DRAM latency
Cray's vcc compiler adapted to VIRAM
Simulator used for results
VIRAM Power Efficiency

[Figure: MOPS/Watt (log scale, 0.1-1000) on the Transitive, GUPS, SPMV, Hist, and Mesh benchmarks for VIRAM, R10K, P-III, P4, Sparc, and EV6]

Comparable performance with a lower clock rate

Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Imagine Overview

 Host sends instructions to the stream controller; the SC issues commands to on-chip modules
 "Vector VLIW" processor
 Coprocessor to an off-chip host processor
 8 arithmetic clusters controlled in SIMD w/ VLIW instructions
 Central 128KB Stream Register File @ 32 GB/s
 SRF can overlap computation w/ memory (double buffering)
 SRF can reuse intermediate results (producer-consumer locality)
 Stream-aware memory system with 2.7 GB/s off-chip bandwidth
 544 GB/s intercluster communication
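The SRF's double-buffering idea can be sketched in plain C. This is a structural illustration only: sequential C executes the phases one after another, whereas on Imagine the "prefetch" of the next block would proceed concurrently with the compute on the current one. The function name, block size, and reduction are assumptions.

```c
/* Double-buffering pattern (hypothetical sketch): while computing on
   one block, the memory system fills the other. The comments mark
   which phases would overlap on a stream architecture like Imagine. */
#include <stddef.h>

#define BLK 4  /* illustrative block size */

static double process_stream(const double *src, size_t n) {
    double buf[2][BLK];
    double sum = 0.0;
    size_t nblk = n / BLK;  /* assume n is a multiple of BLK */
    /* prime buffer 0 with the first block */
    for (size_t j = 0; j < BLK; j++)
        buf[0][j] = src[j];
    for (size_t b = 0; b < nblk; b++) {
        size_t cur = b % 2, nxt = (b + 1) % 2;
        /* would overlap with compute: fetch the next block */
        if (b + 1 < nblk)
            for (size_t j = 0; j < BLK; j++)
                buf[nxt][j] = src[(b + 1) * BLK + j];
        /* compute on the current buffer */
        for (size_t j = 0; j < BLK; j++)
            sum += buf[cur][j];
    }
    return sum;
}
```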
VIRAM and Imagine

                      VIRAM          IMAGINE
Bandwidth (GB/s)      6.4            2.7 (memory) / 32 (SRF)
Peak Float (32-bit)   1.6 GF/s       20 GF/s
Peak Float/Word       1              30 (memory) / 2.5 (SRF)
Speed (MHz)           200            400
Chip Area             15x18 mm       12x12 mm
Data widths (bits)    64/32/16       32/16/8
Transistors           130 x 10^6     21 x 10^6
Power Consumption     2 Watts        10 Watts

Imagine has an order of magnitude higher peak performance

VIRAM has more than twice the memory bandwidth and lower power consumption

Notice the peak Flop/Word ratios
What Does This Have to Do with PIMs?

[Figure: Sqmat MFlop/s (0-3500) on IMAGINE, IRAM, DIVA, and Power3 with varying stream lengths (8-1024)]

Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!)

Imagine is much faster for long streams, slower for short ones
SQMAT: Performance Crossover

[Figure: Cycles (0-80000) and MFLOPS (0-4500) for VIRAM and IMAGINE vs. vector/stream length L (8-1024)]

Large number of ops per word (N=3x3 matrices squared M=10 times)

Crossover point: L=64 (cycles), L=256 (MFlops)

Imagine's power becomes apparent at long lengths: almost 4x VIRAM at L=1024
Codes at this end of the spectrum greatly benefit from the Imagine architecture
Stencil Probe

Stencil computations are at the core of a wide range of scientific applications

 Applications include Jacobi solvers, complex multigrid, and block-structured AMR

We are developing an adaptable stencil probe to model a range of such computations

Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput

Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors

 Small blocks inhibit automatic prefetching performance
 Modern large on-chip L2/L3 caches have bandwidth similar to L1

Currently investigating tradeoffs between blocking and prefetching (paper in preparation)

Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
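The kind of kernel the stencil probe models can be sketched as a 2-D Jacobi (5-point) sweep. This is a minimal illustration, not the probe itself; the grid size and boundary handling are assumptions. The unit-stride inner loop over j is the streaming access that engages hardware prefetch, and L1 tiling would break it into the small blocks the findings above warn about.

```c
/* Minimal 5-point Jacobi stencil sweep (illustrative, not the probe).
   The inner loop streams each row at unit stride, the access pattern
   that hardware prefetch engines detect. */
#define NX 8
#define NY 8

static void jacobi_sweep(const double in[NX][NY], double out[NX][NY]) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)   /* unit stride along the row */
            out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j] +
                                in[i][j - 1] + in[i][j + 1]);
}
```

On a constant field the interior average reproduces the constant, which gives a quick correctness check when adapting the sweep.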