CS267: Introduction
Minimizing Communication in
Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Introduction, Technological Trends
Jim Demmel
EECS & Math Departments, UC Berkeley
[email protected]
Outline (of all lectures)
• Technology trends
- Why all computers must be parallel processors
- Arithmetic is cheap, what costs is moving data
- Between processors, or levels of memory hierarchy
• “Direct” Linear Algebra
- Lower bounds on how much data must be moved to
solve linear algebra problems like Ax=b, Ax = λx, etc
- Algorithms that attain these lower bounds
- Not in standard libraries like Sca/LAPACK (yet!)
- Large speed-ups possible
• “Iterative” Linear Algebra (Krylov Subspace Methods)
- Ditto
• Extensions, open problems… (time permitting)
Technology Trends: Microprocessor Capacity
• Moore’s Law: 2X transistors/chip every 1.5 years
• Microprocessors have become smaller, denser, and more powerful.
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x because wires are shorter
  - actually less than x, because of power consumption
• Transistors per unit area go up by x²
• Die size also tends to increase
  - typically another factor of ~x
• Raw computing power of the chip goes up by ~x⁴!
  - typically x³ of this is devoted to either on-chip
    - parallelism: hidden parallelism such as ILP
    - locality: caches
• So most programs run x³ times faster, without changing them
But there are limiting forces
Manufacturing costs and yield problems limit use of density
• Moore’s 2nd law (Rock’s law): costs go up
  [Figure: "Demo of 0.06 micron CMOS" (Source: Forbes Magazine)]
• Yield
  - What percentage of the chips are usable?
  - E.g., Cell processor (PS3) is sold with 7 out of 8 “on” to improve yield
Power Density Limits Serial Performance
Parallelism Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
  - Clock speed is not
  - Number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Motif/Dwarf: Common Computational Methods
(Red Hot → Blue Cool)
[Figure: heat map of how heavily each of the 13 computational motifs is used across application areas: Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser]
The 13 motifs:
1. Finite State Mach.
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Prog
9. N-Body
10. MapReduce
11. Backtrack/B&B
12. Graphical Models
13. Unstructured Grid
Outline (of all lectures)
• Technology trends
- Why all computers must be parallel processors
- Arithmetic is cheap, what costs is moving data
- Between processors, or levels of memory hierarchy
• “Direct” Linear Algebra
- Lower bounds on how much data must be moved to
solve linear algebra problems like Ax=b, Ax = λx, etc
- Algorithms that attain these lower bounds
- Not in standard libraries like Sca/LAPACK (yet!)
- Large speed-ups possible
• “Iterative” Linear Algebra (Krylov Subspace Methods)
- Ditto
• Extensions, open problems… (time permitting)
Understanding Bandwidth and Latency (1/2)
• Bandwidth and Latency are metrics for the cost of moving data
• For a freeway between Berkeley and Sacramento
  - Bandwidth: how many cars/hour can get from Berkeley to Sac?
    #cars/hour = density (#cars/mile) * velocity (miles/hour) * #lanes
  - Latency: how long does it take for first car to get from Berkeley to Sac?
    #hours = distance (#miles) / velocity (miles/hour)
• For accessing a disk
  - Bandwidth: how many bits/second can you read from disk?
    #bits/sec = density (#bits/cm) * velocity (cm/sec) * #read_heads
  - Latency: how long do you wait for first bit?
    #seconds = distance_to_move_read_head (cm) / velocity (cm/sec)
               + distance_half_way_around (cm) / velocity (cm/sec)
• For sending data from one processor to another – same idea
• For moving data from DRAM to on-chip cache – same idea …
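To make the disk analogy concrete, here is a small sketch (not from the slides; the seek time, rotational delay, and bandwidth figures are assumptions for illustration, not measurements) showing how latency dominates small transfers while bandwidth dominates large ones:

    /* Disk analogy with made-up numbers: latency = head seek + half a rotation,
     * then the data streams past at the disk's bandwidth. */
    #include <stdio.h>

    int main(void) {
        double seek_s       = 0.008;   /* assumed: 8 ms to move the read head  */
        double half_rot_s   = 0.004;   /* assumed: 4 ms for half a rotation    */
        double bandwidth_Bs = 100e6;   /* assumed: 100 MB/s sustained transfer */

        double latency_s = seek_s + half_rot_s;
        for (double n = 1e3; n <= 1e9; n *= 1000) {   /* read 1 KB, 1 MB, 1 GB */
            double total_s = latency_s + n / bandwidth_Bs;
            printf("read %10.0f bytes: %8.3f s (latency is %5.1f%% of it)\n",
                   n, total_s, 100.0 * latency_s / total_s);
        }
        return 0;
    }

The same split into a fixed per-access cost plus a per-byte cost is exactly what the next slide formalizes.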
Understanding Bandwidth and Latency (2/2)
• Bandwidth = #bits (or words…)/second that can be communicated
• Latency = #seconds for first bit (or word) to arrive
• Notation: use #words as unit
  - 1/Bandwidth ≡ β, units of seconds/word
  - Latency ≡ α, units of seconds
• Basic timing model for communication
  - to move one “message” of n bits (or words) costs = α + n·β
• We will model running time of an algorithm as sum of 3 terms:
  - # flops * time_per_flop
  - # words moved / bandwidth
  - # messages * latency
• Which term is largest? That’s the one to minimize
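The three-term model can be written directly as code. The sketch below is illustrative only; the values of α, β, and time_per_flop (γ), and the flop/word/message counts in main, are placeholder assumptions rather than numbers from any machine on these slides:

    /* alpha-beta-gamma model: T = #flops*gamma + #words*beta + #messages*alpha.
     * All parameter values are placeholders for illustration. */
    #include <stdio.h>

    /* latency per message (s), inverse bandwidth (s/word), time per flop (s) */
    static const double alpha = 1e-6, beta = 1e-9, gamma_ = 1e-11;

    static double model_time(double flops, double words, double messages) {
        return flops * gamma_ + words * beta + messages * alpha;
    }

    int main(void) {
        /* toy example: n-by-n matrix multiply, assuming each matrix is moved
         * once (3*n^2 words) in one message each */
        double n = 1000.0;
        double flops = 2.0 * n * n * n, words = 3.0 * n * n, messages = 3.0;
        printf("modeled time: %.4f s (flops %.4f s, words %.4f s, messages %.6f s)\n",
               model_time(flops, words, messages),
               flops * gamma_, words * beta, messages * alpha);
        return 0;
    }

Printing the three terms separately makes it easy to see which one dominates, which is the question the slide asks.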
Experimental Study of Memory (Membench)
• Microbenchmark for memory system performance

  for array A of length L from 4 KB to 8 MB by 2x
    for stride s from 4 Bytes (1 word) to L/2 by 2x
      time the following loop (repeat many times and average)   ← 1 experiment
        for i from 0 to L by s
          load A[i] from memory (4 Bytes)
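A runnable approximation of this microbenchmark in C might look like the following (a minimal sketch, not the original Membench code; it assumes 4-byte ints, a POSIX monotonic clock, and a fixed repetition count instead of adaptive timing):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    int main(void) {
        const size_t MAXBYTES = 8 << 20;                 /* 8 MB */
        int *A = malloc(MAXBYTES);
        for (size_t i = 0; i < MAXBYTES / sizeof(int); i++) A[i] = 1;

        for (size_t L = 4 << 10; L <= MAXBYTES; L *= 2) {   /* array length in bytes */
            size_t n = L / sizeof(int);
            for (size_t s = 1; s <= n / 2; s *= 2) {         /* stride in words */
                volatile int sink = 0;
                const int reps = 100;                        /* repeat and average */
                double t0 = now_sec();
                for (int r = 0; r < reps; r++)
                    for (size_t i = 0; i < n; i += s)
                        sink += A[i];                        /* load A[i] */
                double per_access = (now_sec() - t0) / (reps * ((n + s - 1) / s));
                printf("L=%8zu bytes stride=%6zu words: %7.2f ns/access\n",
                       L, s, per_access * 1e9);
                (void)sink;
            }
        }
        free(A);
        return 0;
    }

Plotting ns/access against stride, with one curve per array length, gives figures like the ones on the following slides.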
Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz
[Figure: Membench results - average time per access vs. stride, one curve per array length]
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
Memory Hierarchy
• Most programs have a high degree of locality in their accesses
- spatial locality: accessing things nearby previous accesses
- temporal locality: reusing an item that was previously accessed
• Memory hierarchy tries to exploit locality
  Level                                                     Speed    Size
  processor (control, datapath, registers, on-chip cache)   1 ns     B
  second level cache (SRAM)                                  10 ns    KB
  main memory (DRAM)                                          100 ns   MB
  secondary storage (Disk)                                    10 ms    GB
  tertiary storage (Disk/Tape)                                10 sec   TB
Membench: What to Expect
[Figure: average cost per access vs. stride s - curves sit at the memory time when array size > L1, and at the cache hit time when total size < L1]
• Consider the average cost per load
- Plot one line for each array length, time vs. stride
- Small stride is best: if cache line holds 4 words, at most ¼ miss
- If array is smaller than a given cache, all those accesses will hit (after
the first run, which is negligible for large enough runs)
- Picture assumes only one level of cache
- Values have gotten more difficult to measure on modern procs
Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz
[Figure: Membench results, annotated - average time per access vs. stride, one curve per array length]
- L1: 16 KB, 16 B line, 2 cycles (6 ns)
- L2: 2 MB, 64 byte line, 12 cycles (36 ns)
- Mem: 396 ns (132 cycles)
- 8 K pages, 32 TLB entries
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
Processor-DRAM Gap (latency)
• Memory hierarchies are getting deeper
- Processors get faster more quickly than memory
[Figure: performance vs. year, 1980-2000, log scale - CPU performance (“Moore’s Law”) improves ~60%/yr while DRAM improves ~7%/yr, so the Processor-Memory Performance Gap grows ~50%/yr]
Why avoiding communication is important (2/2)
• Running time of an algorithm is sum of 3 terms:
  - # flops * time_per_flop
  - # words moved / bandwidth      (communication)
  - # messages * latency           (communication)
• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time

  Annual improvements:
    Time_per_flop: 60%
                Bandwidth   Latency
    Network        26%        15%
    DRAM           23%         7%

• Goal: reorganize linear algebra to avoid communication
  - Between all memory hierarchy levels
    - L1 ↔ L2 ↔ DRAM ↔ network, etc.
  - Not just hiding communication (speedup ≤ 2x)
  - Arbitrary speedups possible
Memory Hierarchy on a Pentium III
Katmai processor on Millennium, 550 MHz
[Figure: Membench results - average time per access vs. stride, one curve per array size]
- L1: 64K, 32 byte line?, 5 ns, 4-way?
- L2: 512 KB, 60 ns
Memory Hierarchy on an IBM Power3 (Seaborg)
Power3, 375 MHz
[Figure: Membench results - average time per access vs. stride, one curve per array size]
- L1: 32 KB, 128 B line, .5-2 cycles
- L2: 8 MB, 128 B line, 9 cycles
- Mem: 396 ns (132 cycles)
Implications so far
• Communication (moving data) is much more expensive than arithmetic, and getting more so over time
• Communication occurs at many levels in the machine
  - Between levels of memory hierarchy (registers ↔ L1 ↔ L2 ↔ DRAM ↔ disk…)
  - Between levels of parallelism
    - Between cores on a multicore chip
    - Between chips in a multisocket board (e.g., CPU and GPU)
    - Between boards in a rack
    - Between racks in a cluster/supercomputer/“cloud”
    - Between cities in a grid
  - All of these are expensive compared to arithmetic
• What are we going to do about it?
  - Not enough to hope cache policy will deal with it; we need better algorithms
  - True not just for linear algebra
• Strategy: deal with two levels, try to apply recursion
Outline of lectures
1. Introduction, technological trends
2. Case study: Matrix Multiplication
3. Communication Lower Bounds for Direct Linear Algebra
4. Optimizing One-sided Factorizations (LU and QR)
5. Multicore and GPU implementations
6. Communication-optimal Eigenvalue and SVD algorithms
7. Optimizing Sparse-Matrix-Vector-Multiplication (SpMV)
8. Communication-optimal Krylov Subspace Methods
9. Further topics, time permitting…
Lots of open problems (i.e. homework…)
Collaborators
• Grey Ballard, UCB EECS
• Ioana Dumitriu, U. Washington
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, UCB EECS
• Olga Holtz, UCB Math & TU Berlin
• Julien Langou, U. Colorado Denver
• Marghoob Mohiyuddin, UCB EECS
• Oded Schwartz, TU Berlin
• Hua Xiang, INRIA
• Kathy Yelick, UCB EECS & NERSC
• BeBOP group at Berkeley
Where to find papers, slides, video, software
• bebop.cs.berkeley.edu
• www.cs.berkeley.edu/~demmel/cs267_Spr10
• parlab.eecs.berkeley.edu
- parlab.cs.berkeley.edu/2009bootcamp
- Sign-up for the 2010 bootcamp is available
EXTRA SLIDES