CS267: Introduction


Minimizing Communication in
Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Introduction, Technological Trends
Jim Demmel
EECS & Math Departments, UC Berkeley
[email protected]
Outline (of all lectures)
• Technology trends
- Why all computers must be parallel processors
- Arithmetic is cheap; what is expensive is moving data
- between processors, or between levels of the memory hierarchy
• “Direct” Linear Algebra
- Lower bounds on how much data must be moved to solve linear algebra problems like Ax = b, Ax = λx, etc.
- Algorithms that attain these lower bounds
- Not in standard libraries like Sca/LAPACK (yet!)
- Large speed-ups possible
• “Iterative” Linear Algebra (Krylov Subspace Methods)
- Ditto
• Extensions, open problems… (time permitting)
Technology Trends: Microprocessor Capacity
Moore’s Law: 2x transistors per chip every ~1.5 years.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful ever since.
Slide source: Jack Dongarra
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by ~x because wires are shorter
- actually less than x, because of power consumption
• Transistors per unit area go up by x²
• Die size also tends to increase
- typically by another factor of ~x
• Raw computing power of the chip goes up by ~x⁴!
- typically a factor of ~x³ of that is devoted to either on-chip
- parallelism: hidden parallelism such as ILP
- locality: caches
• So most programs run ~x³ times faster, without changing them
But there are limiting forces
Manufacturing costs and yield problems limit the use of density
• Moore’s 2nd law (Rock’s law): fabrication costs go up
[Figure: demo of 0.06 micron CMOS. Source: Forbes Magazine]
• Yield
- What percentage of the chips are usable?
- E.g., the Cell processor (PS3) is sold with 7 out of 8 cores “on” to improve yield
Power Density Limits Serial Performance
Parallelism Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
- Clock speed is not
- Number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Motif/Dwarf: Common Computational Methods
[Table: the 13 motifs below (rows) vs. application areas (columns: Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser), with each cell colored from red (hot: heavily used) to blue (cool: rarely used); the coloring cannot be reproduced in text.]
1. Finite State Mach.
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Prog
9. N-Body
10. MapReduce
11. Backtrack/B&B
12. Graphical Models
13. Unstructured Grid
Understanding Bandwidth and Latency (1/2)
• Bandwidth and latency are metrics for the cost of moving data
• For a freeway between Berkeley and Sacramento
- Bandwidth: how many cars/hour can get from Berkeley to Sacramento?
- #cars/hour = density (#cars/mile) × velocity (miles/hour) × #lanes
- Latency: how long does it take the first car to get from Berkeley to Sacramento?
- #hours = distance (#miles) / velocity (miles/hour)
• For accessing a disk
- Bandwidth: how many bits/second can you read from the disk?
- #bits/sec = density (#bits/cm) × velocity (cm/sec) × #read_heads
- Latency: how long do you wait for the first bit?
- #seconds = distance_to_move_read_head (cm) / velocity (cm/sec) + distance_half_way_around (cm) / velocity (cm/sec)
• For sending data from one processor to another – same idea
• For moving data from DRAM to on-chip cache – same idea …
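To make the freeway formulas concrete, here is a small illustrative sketch; the density, speed, lane count, and distance below are hypothetical numbers chosen only to exercise the two formulas, not measurements.

    #include <stdio.h>

    /* Illustrative only: hypothetical freeway numbers, not measurements. */
    int main(void) {
        double density  = 40.0;   /* cars per mile per lane */
        double velocity = 60.0;   /* miles per hour */
        int    lanes    = 4;
        double distance = 80.0;   /* miles, Berkeley to Sacramento */

        /* Bandwidth: #cars/hour = density * velocity * #lanes */
        double bandwidth = density * velocity * lanes;

        /* Latency: #hours for the first car = distance / velocity */
        double latency = distance / velocity;

        printf("bandwidth = %.0f cars/hour\n", bandwidth);
        printf("latency   = %.2f hours\n", latency);
        return 0;
    }

The same two formulas, with different units, describe the disk, network, and DRAM cases.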
Understanding Bandwidth and Latency (2/2)
• Bandwidth = #bits (or words…) per second that can be communicated
• Latency = #seconds for the first bit (or word) to arrive
• Notation: use #words as the unit
- 1/Bandwidth ≡ β, units of seconds/word
- Latency ≡ α, units of seconds
• Basic timing model for communication
- to move one “message” of n bits (or words) costs α + β·n
• We will model the running time of an algorithm as the sum of 3 terms:
- #flops × time_per_flop
- #words_moved / bandwidth
- #messages × latency
• Which term is largest? That’s the one to minimize
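As a purely illustrative sketch of this three-term model, the function below simply adds up the three terms, using γ for time_per_flop, β for 1/bandwidth, and α for latency as in the notation above. The numeric machine parameters are made-up placeholders, not measurements of any real machine.

    #include <stdio.h>

    /* Three-term timing model: time = #flops*gamma + #words*beta + #messages*alpha.
       The parameter values below are hypothetical placeholders. */
    typedef struct {
        double gamma;  /* seconds per flop                  */
        double beta;   /* seconds per word (= 1/bandwidth)  */
        double alpha;  /* seconds per message (latency)     */
    } machine_t;

    double model_time(double flops, double words, double messages, machine_t m) {
        return flops * m.gamma + words * m.beta + messages * m.alpha;
    }

    int main(void) {
        machine_t m = { 1e-9, 1e-7, 1e-5 };   /* gamma << beta << alpha */
        double t = model_time(2e9, 1e7, 1e3, m);
        printf("modeled time = %.3f seconds\n", t);
        return 0;
    }

Even with only 10^7 words moved and 10^3 messages, the communication terms here are a third of the total; with the real, growing gaps between γ, β, and α they usually dominate.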
Experimental Study of Memory (Membench)
• Microbenchmark for memory system performance
  for array A of length L from 4 KB to 8 MB by 2x
    for stride s from 4 Bytes (1 word) to L/2 by 2x
      time the following loop (repeat many times and average)
        for i from 0 to L by s
          load A[i] from memory (4 Bytes)
(each timed inner loop is 1 experiment)
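A minimal C sketch of this microbenchmark, simplified for clarity: fixed repetition count, wall-clock timing, and none of the warm-up, calibration, or anti-optimization safeguards a production membench would need.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Simplified membench sketch: for each (length, stride) pair, time a strided
       sweep over the array and report the average nanoseconds per load. */
    int main(void) {
        const size_t MAXLEN = 8u << 20;                          /* 8 MB */
        char *A = calloc(MAXLEN, 1);
        if (!A) return 1;

        for (size_t len = 4u << 10; len <= MAXLEN; len *= 2) {   /* 4 KB .. 8 MB */
            for (size_t s = 4; s <= len / 2; s *= 2) {           /* stride, bytes */
                const int reps = 100;                            /* repeat and average */
                volatile char sink = 0;
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (int r = 0; r < reps; r++)
                    for (size_t i = 0; i < len; i += s)
                        sink += A[i];                            /* "load A[i]" */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
                double loads = (double)reps * (double)(len / s);
                printf("len=%zu stride=%zu  %.2f ns/load\n", len, s, ns / loads);
            }
        }
        free(A);
        return 0;
    }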
Memory Hierarchy on a Sun Ultra-2i
[Figure: membench results on a Sun Ultra-2i, 333 MHz: average time per load vs. stride, one curve per array length.]
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
Memory Hierarchy
• Most programs have a high degree of locality in their accesses
- spatial locality: accessing things nearby previous accesses
- temporal locality: reusing an item that was previously accessed
• Memory hierarchy tries to exploit locality
[Diagram: a typical memory hierarchy. The processor (control, datapath, registers, on-chip cache) sits above a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape). Approximate speed and size at each level:]
- registers: ~1 ns, bytes
- second-level cache (SRAM): ~10 ns, KB
- main memory (DRAM): ~100 ns, MB
- secondary storage (disk): ~10 ms, GB
- tertiary storage (disk/tape): ~10 sec, TB
Membench: What to Expect
[Figure: expected membench curve: average cost per access vs. stride s, one curve per array size. Arrays with total size < L1 show only the cache hit time; arrays with size > L1 show the memory time at large strides.]
• Consider the average cost per load
- Plot one line for each array length, time vs. stride
- Small stride is best: if cache line holds 4 words, at most ¼ miss
- If array is smaller than a given cache, all those accesses will hit (after
the first run, which is negligible for large enough runs)
- Picture assumes only one level of cache
- Values have gotten more difficult to measure on modern processors
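A toy single-level-cache model of that reasoning, with hypothetical hit and miss times and ignoring TLB, prefetching, and deeper cache levels:

    #include <stdio.h>

    /* Toy model: predicted average ns per load for one cache level.
       All parameters are illustrative, not measured. */
    double avg_cost_ns(size_t array_bytes, size_t stride_bytes,
                       size_t cache_bytes, size_t line_bytes,
                       double hit_ns, double miss_ns) {
        if (array_bytes <= cache_bytes)
            return hit_ns;        /* whole array fits in cache: every access hits */
        if (stride_bytes >= line_bytes)
            return miss_ns;       /* every access touches a new line: all misses  */
        /* Small stride: one miss per line, the remaining loads on that line hit. */
        double loads_per_line = (double)line_bytes / (double)stride_bytes;
        return (miss_ns + (loads_per_line - 1.0) * hit_ns) / loads_per_line;
    }

    int main(void) {
        /* Example: 16 KB cache, 16 B lines, 6 ns hit, 100 ns miss (illustrative). */
        printf("%.1f ns\n", avg_cost_ns(8u << 10,  4, 16u << 10, 16, 6.0, 100.0)); /* fits in cache */
        printf("%.1f ns\n", avg_cost_ns(1u << 20,  4, 16u << 10, 16, 6.0, 100.0)); /* 1/4 of loads miss */
        printf("%.1f ns\n", avg_cost_ns(1u << 20, 64, 16u << 10, 16, 6.0, 100.0)); /* all loads miss */
        return 0;
    }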
Memory Hierarchy on a Sun Ultra-2i
[Figure: membench results on a Sun Ultra-2i, 333 MHz, annotated with the memory-system parameters they reveal:]
- L1: 16 KB, 16 B line, 2 cycles (6 ns)
- L2: 2 MB, 64 byte line, 12 cycles (36 ns)
- Main memory: 396 ns (132 cycles)
- 8 KB pages, 32 TLB entries
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
Processor-DRAM Gap (latency)
• Memory hierarchies are getting deeper
- Processors get faster more quickly than memory
[Figure: processor vs. DRAM performance, 1980-2000, log scale. Processor performance ("Moore's Law") improves ~60%/year while DRAM improves only ~7%/year, so the processor-memory performance gap grows ~50%/year.]
Why avoiding communication is important (2/2)
• Running time of an algorithm is the sum of 3 terms:
- #flops × time_per_flop
- #words_moved / bandwidth   (communication)
- #messages × latency   (communication)
• time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time
Annual improvements:
- time_per_flop: 60%
- Network: bandwidth 26%, latency 15%
- DRAM: bandwidth 23%, latency 7% (these rates are compounded in the sketch below)
• Goal: reorganize linear algebra to avoid communication
- Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
- Not just hiding communication (overlapping it with arithmetic gives speedup ≤ 2x)
- Arbitrary speedups are possible
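To illustrate "gaps growing exponentially", the sketch below compounds the annual improvement rates from the table above; only the ratios of the rates matter, so no absolute machine parameters are assumed.

    #include <stdio.h>
    #include <math.h>

    /* Compound the annual improvement rates to see how much faster
       time_per_flop improves relative to DRAM bandwidth and latency. */
    int main(void) {
        double flop_rate     = 1.60;   /* time_per_flop improves 60%/year  */
        double dram_bw_rate  = 1.23;   /* DRAM bandwidth improves 23%/year */
        double dram_lat_rate = 1.07;   /* DRAM latency improves 7%/year    */

        for (int years = 0; years <= 10; years += 5) {
            double bw_gap  = pow(flop_rate / dram_bw_rate,  years);
            double lat_gap = pow(flop_rate / dram_lat_rate, years);
            printf("after %2d years: flop vs. DRAM bandwidth gap grows %5.1fx, "
                   "flop vs. DRAM latency gap grows %5.1fx\n", years, bw_gap, lat_gap);
        }
        return 0;
    }

After a decade the flop/latency gap has grown by well over an order of magnitude, which is why the #messages term eventually dominates unless algorithms send fewer, larger messages.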
Memory Hierarchy on a Pentium III
[Figure: membench results for a Katmai (Pentium III) processor on Millennium, 550 MHz, annotated with the parameters they suggest:]
- L1: 64K, 32 byte line?, 5 ns, 4-way?
- L2: 512 KB, 60 ns
Memory Hierarchy on an IBM Power3 (Seaborg)
[Figure: membench results for an IBM Power3 (Seaborg), 375 MHz, annotated:]
- L1: 32 KB, 128 B line, 0.5-2 cycles
- L2: 8 MB, 128 B line, 9 cycles
- Main memory: 396 ns (132 cycles)
Implications so far
• Communication (moving data) is much more expensive than arithmetic, and
getting more so over time
• Communication occurs at many levels in the machine
- Between levels of memory hierarchy (registers ↔ L1 ↔ L2 ↔ DRAM ↔ disk…)
- Between levels of parallelism
- Between cores on a multicore chip
- Between chips in a multisocket board (e.g., CPU and GPU)
- Between boards in a rack
- Between racks in a cluster/supercomputer/”cloud”
- Between cities in a grid
- All of these are expensive compared to arithmetic
• What are we going to do about it?
- Not enough to hope cache policy will deal with it; we need better algorithms
- True not just for linear algebra
• Strategy: deal with two levels, try to apply recursion
Outline of lectures
1. Introduction, technological trends
2. Case study: Matrix Multiplication
3. Communication Lower Bounds for Direct Linear Algebra
4. Optimizing One-sided Factorizations (LU and QR)
5. Multicore and GPU implementations
6. Communication-optimal Eigenvalue and SVD algorithms
7. Optimizing Sparse-Matrix-Vector-Multiplication (SpMV)
8. Communication-optimal Krylov Subspace Methods
9. Further topics, time permitting…
Lots of open problems (i.e. homework…)
Collaborators
• Grey Ballard, UCB EECS
• Ioana Dumitriu, U. Washington
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, UCB EECS
• Olga Holtz, UCB Math & TU Berlin
• Julien Langou, U. Colorado Denver
• Marghoob Mohiyuddin, UCB EECS
• Oded Schwartz, TU Berlin
• Hua Xiang, INRIA
• Kathy Yelick, UCB EECS & NERSC
• BeBOP group at Berkeley
Where to find papers, slides, video, software
• bebop.cs.berkeley.edu
• www.cs.berkeley.edu/~demmel/cs267_Spr10
• parlab.eecs.berkeley.edu
- parlab.cs.berkeley.edu/2009bootcamp
- Sign-up for the 2010 bootcamp is available