Transcript of a recent talk

Cartesian Computations
Larry Carter
with thanks to
Bowen Alpern, Jeanne Ferrante,
Kang Su Gatlin, Nick Mitchell, Michelle Strout
April, 2011
Cartesian Computations
Outline
• Architectural trends
– Computation is (almost) free
– Data movement and memory aren’t
• Cartesian computations
– Examples
– Static, tidy and lazy implementations
• Analysis of Cartesian computations
– Computation density and temp data ratio
– Some theorems
• Application to examples
2
Cartesian Computations
Technology trends
Everything’s getting smaller
[Scale figure: lengths from m down to Å, marking the chip (~mm), the feature width, the wavelength of light (~µm), and the atom (~Å; feature widths are now in the nm range)]
Energy consumption limits computation speed
Need to keep chip from melting
Electricity costs are non-trivial!
Energy is dominated by cost to move data
In 32 nm technology (available today):
0.1 pJ (picoJoule) to add two 32-bit numbers
2 pJ to move a 32-bit number 1 mm on chip
320 pJ to move a 32-bit number onto or off a chip
These numbers may be reduced by ~10x with heroic methods, e.g. very low power & speed
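To get a feel for these figures, here is a small back-of-the-envelope sketch (not from the talk) that plugs the three numbers above into a toy calculation; the one-billion-operand count and the assumed 10 mm on-chip trip are arbitrary, illustrative choices.

    /* Rough energy totals for one billion 32-bit operands, using the figures
     * above: 0.1 pJ per add, 2 pJ per mm on chip, 320 pJ on/off chip.
     * The 10 mm trip length is an assumed, illustrative distance. */
    #include <stdio.h>

    int main(void) {
        double n = 1e9;                        /* number of 32-bit operands       */
        double add_mj  = 0.1 * n * 1e-9;       /* add them all (pJ -> mJ)         */
        double move_mj = 2.0 * 10 * n * 1e-9;  /* move each one 10 mm on chip     */
        double dram_mj = 320.0 * n * 1e-9;     /* bring each one in from off chip */
        printf("add: %.1f mJ   on-chip move (10 mm): %.1f mJ   off-chip: %.1f mJ\n",
               add_mj, move_mj, dram_mj);
        return 0;
    }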
3
Cartesian Computations
Computation is (almost) free
On a 20 x 20 mm chip, you can fit:
1,000,000 32-bit adders,
… or …
256 MB fast memory (SRAM)
Memory is what takes up area
For 100 Watts of power, you can
Perform 10¹⁵ additions per second (one petaOp), … or …
Read data from DRAM at 128 GB/sec
Memory and moving data is what consumes power
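The first figure is just the 100 W power budget divided by the 0.1 pJ-per-add energy quoted on the previous slide:
\[
\frac{100\ \text{W}}{0.1\ \text{pJ/add}} \;=\; \frac{100\ \text{J/s}}{10^{-13}\ \text{J/add}} \;=\; 10^{15}\ \text{adds/s}.
\]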
4
Cartesian Computations
(Super)computer architecture
Core = CPU + caches (L1, L2)
Chip = bunch of cores
  High-bandwidth interconnect
Node = bunch of chips + DRAM
  Tb/sec bandwidth between chips
  Less bandwidth to memory
System = bunch of nodes
  Switch or grid connection
  OK bandwidth for block transfers
[Figure: processors P, each with private L1 and L2 caches, grouped onto chips; chips plus DRAM form a node]
5
Cartesian Computations
Models of computation
RAM: a single processor connected to a memory
PRAM: many processors sharing one memory
[Figure: RAM = one Proc attached to Memory; PRAM = several Procs attached to a single shared Memory]
6
Cartesian Computations
The point being …
(P)RAM analysis counts arithmetic operations, ignoring
cost of memory accesses and data movement
Real costs, both time and energy, are dominated by
memory accesses and data movement
These costs vary by over 100x depending on where data goes
not even considering disk
Algorithm analysis should focus on data movement!
The rest of this talk will study data movement for a
common class of scientific computation kernels
7
Cartesian Computations
Outline
• Architectural trends
– Computation is (almost) free
– Data movement and memory aren’t
• Cartesian computations
– Examples
– Static, tidy and lazy implementations
• Analysis of Cartesian computations
– Computation density and temp data ratio
– Some theorems
• Application to examples
8
Cartesian Computations
An incomplete taxonomy of kernels
Compute intensive
E.g. well-tuned matrix multiply, seismic migration, …
Runtime is dominated by core speed
Streaming applications
E.g. simple encoding, scanning, one pass of quicksort, …
Runtime dominated by sequential access to memory
Dynamic data structures
E.g. binary search trees, quadtrees, graph algorithms
Runtime dominated by random access to memory
and …
9
Cartesian Computations
Cartesian computation
Given large data structures A and B
A generates temporary data T, which updates B
For each chunk of A and B, there’s a “tile” of computation
[Figure: chunks of A along one axis, chunks of B along the other; each (A chunk, B chunk) pair is a tile]
There may be data dependences, but we’ll assume that reordering is possible.
In a uniform Cartesian computation, for all tiles of a given size, the
amount of computation is roughly the same.
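A minimal sketch of this structure (not from the talk; the chunk sizes and the toy update B[j] += 0.5·A[i] are purely illustrative): every chunk of A meets every chunk of B, and each meeting is one tile of work.

    #include <stdio.h>

    #define NA 8
    #define NB 6
    #define CHUNK_A 4
    #define CHUNK_B 3

    /* One "tile": the given chunk of A generates temp data that updates the
     * given chunk of B.  The update here is just a toy stand-in. */
    static void execute_tile(const double *Achunk, int na, double *Bchunk, int nb) {
        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++)
                Bchunk[j] += 0.5 * Achunk[i];
    }

    int main(void) {
        double A[NA] = {1, 2, 3, 4, 5, 6, 7, 8};
        double B[NB] = {0};
        for (int ia = 0; ia < NA; ia += CHUNK_A)       /* every chunk of A ...       */
            for (int ib = 0; ib < NB; ib += CHUNK_B)   /* ... meets every chunk of B */
                execute_tile(&A[ia], CHUNK_A, &B[ib], CHUNK_B);
        printf("B[0] = %g\n", B[0]);                   /* 0.5 * (1+...+8) = 18       */
        return 0;
    }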
10
Cartesian Computations
A messy detail
For very small tiles, the uniformity assumption fails
How small “very small” is depends on the computation
Most very small tiles will have no temp data or work
If we only execute non-empty very small tiles, there is a savings in
communication costs.
But using very small tiles requires much more communication than
using large tiles
• In this talk, we assume that such a strategy is too expensive!
• We also leave open the problem of non-uniform distributions.
11
Cartesian Computations
Sparse matrix-vector product y = Mx
A = input vector (x), B = output vector (y)
T might be pairs of the form {i, M_ij·x_j}
[Figure: sparse matrix M with scattered nonzeros; x across the top, y down the side]
Matrix M could be:
  known implicitly (e.g. a regular grid), or
  streamed in from local memory, stored in order of use
A tile is a rectangular subset of M
The memory characteristics of y = Mx
  depend on the density & distribution of nonzeros
  depend on the implementation
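A small sketch of one SpMV tile in this view, assuming COO storage and a single tile that covers the whole (toy) matrix; the TempItem type, the tile bounds, and all sizes are illustrative, not from the talk.

    #include <stdio.h>

    typedef struct { int i; double val; } TempItem;   /* temp item {i, M_ij * x_j} */

    int main(void) {
        /* a tiny 3x3 matrix with 3 nonzeros, stored as (row, col, value) triples */
        int    row[]  = {0, 1, 2}, col[] = {1, 2, 0};
        double Mval[] = {2.0, 3.0, 4.0};
        double x[] = {1.0, 1.0, 1.0}, y[3] = {0};
        int r0 = 0, r1 = 3, c0 = 0, c1 = 3, nnz = 3;   /* one tile: rows r0..r1, cols c0..c1 */

        TempItem T[16]; int t = 0;
        for (int k = 0; k < nnz; k++)                  /* generate temp data for the tile */
            if (row[k] >= r0 && row[k] < r1 && col[k] >= c0 && col[k] < c1)
                T[t++] = (TempItem){ row[k], Mval[k] * x[col[k]] };

        for (int k = 0; k < t; k++)                    /* consume temp data: update y */
            y[T[k].i] += T[k].val;

        printf("y = %g %g %g\n", y[0], y[1], y[2]);
        return 0;
    }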
12
Cartesian Computations
A different (?) example
Histogram (from NAS Integer Sort Benchmark)
For i=1 to numkeys
Tally[Key[i]]++
Key is A; Tally is B
Temp data is indices into the tile’s subset of Tally
[Figure: Key array 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 5, 6, 5 being tallied into Tally[0..9]]
Transpose, Bit Reversal (used in FFTs), and other Permutations are
similar
13
Cartesian Computations
Generalization: Map-Reduce
Input A is a set of (key,value) pairs
Distributed by “key” in big computer
“Map” takes each (k,v) to a list {(k_1,v_1), (k_2,v_2), … }
The resulting data is sorted according to the new keys
S_j = {v | (k_j, v) comes from Map}
This is the intermediate data of the Cartesian computation
“Reduce” is applied to each S_j, producing a new pair (k_j, Reduce(S_j))
(or more generally, any number of new pairs)
This is output set B
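A toy sketch of that flow with integer keys and Reduce = sum (this is not any particular framework’s API; the key function v mod NKEYS is just an example choice).

    #include <stdio.h>
    #define NIN 8
    #define NKEYS 4

    int main(void) {
        int value[NIN] = {5, 3, 8, 1, 7, 2, 6, 4};
        int S[NKEYS][NIN];                 /* S_j: all values mapped to key j */
        int count[NKEYS] = {0};

        /* "Map" + shuffle: Map sends v to the pair (v mod NKEYS, v); the pair is
         * routed to the list for its new key.  These pairs are the temp data.  */
        for (int i = 0; i < NIN; i++) {
            int k = value[i] % NKEYS;
            S[k][count[k]++] = value[i];
        }

        /* "Reduce": fold each S_j into one output pair (k_j, Reduce(S_j));
         * the reduce function here is a plain sum.                            */
        for (int k = 0; k < NKEYS; k++) {
            int sum = 0;
            for (int j = 0; j < count[k]; j++)
                sum += S[k][j];
            printf("(%d, %d)\n", k, sum);
        }
        return 0;
    }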
14
Cartesian Computations
In general …
In a uniform Cartesian computation,
– chunks of A are used to generate temporary data
  (may need extra “one-time” data too)
– chunks of B consume the temporary data
  (N.B. the intermediate values of B are not called temp data)
– every chunk of A generates data for every chunk of B
Program for executing Cartesian computations:
for each “architectural unit” (cache, core, chip, node, …), choose
– tile size and shape
– whether to move temp data or to move chunks of A and/or B
– how to use multiple processors and share memory
15
Cartesian Computations
Implementing Cartesian computations
Tidy Implementation
– Temp data is consumed as soon as it is generated
• relevant chunks of both A and B must be in local memory
– A and B chunks are copied or moved so each is together at some time
Temp data doesn’t move; but each bit of A and B may move many times
• particularly if the tiles are small
Lazy Implementation
– Bring bits of A into a processor; generate temp data
– Move temp either:
• To another processor, or
• To some level of cache memory
– Consume temp data where/when the appropriate chunk of B is available
Temp data moves, but A and B may move less than for tidy program
16
Cartesian Computations
Two types of lazy implementation
1. Static partitioning:
– Partition B among the memory of all “architectural units”
  (B must fit in the totality of these memories)
– Stream A in a few bits at a time, generate temp data, and send it to the unit owning the appropriate chunk of B
  A and B are only moved once (+)
  Temp data may need to travel a long distance (−)
  Aside: we could swap the roles of A and B.
2. Bucket tiling:
• Stream A in, generate & store the temp data in local memory
  Store each temp item in a bucket, depending on which chunk of B will consume it
• Later, bring B in a chunk at a time and consume the temp data
• Repeat the above until done
  A and/or B may move in & out many times
(a code sketch of static partitioning follows this slide)
17
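The promised sketch of static partitioning, using the histogram as the running example; the two “units”, the OWNER rule, and the in-memory inboxes stand in for real processors and messages (all names and sizes are illustrative).

    #include <stdio.h>
    #define NKEYS 16
    #define NTALLY 10
    #define NUNITS 2
    #define OWNER(k) ((k) * NUNITS / NTALLY)   /* which unit owns Tally[k] */

    int main(void) {
        int Key[NKEYS] = {3,1,4,1,5,9,2,6,5,3,5,8,9,7,5,6};
        int Tally[NTALLY] = {0};               /* B, partitioned among the units   */
        int inbox[NUNITS][NKEYS], count[NUNITS] = {0};

        for (int i = 0; i < NKEYS; i++) {      /* stream A once; route temp data   */
            int u = OWNER(Key[i]);
            inbox[u][count[u]++] = Key[i];     /* "send" to the owner of Tally[k]  */
        }
        for (int u = 0; u < NUNITS; u++)       /* each unit consumes its inbox     */
            for (int i = 0; i < count[u]; i++)
                Tally[inbox[u][i]]++;

        for (int k = 0; k < NTALLY; k++) printf("%d ", Tally[k]);
        printf("\n");
        return 0;
    }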
.
Cartesian Computations
Bucket Tiling Example: histogram
For i=1 to numkeys
Tally[Key[i]]++
becomes …
For ii = 0 to numKeys by chunk_size        // Break the set of Keys into chunks
    Bend[*] = 0
    For i = ii to ii + chunk_size - 1      // Put keys into buckets
        k = Key[i] >> 16                   // (bucket indexed by high bits)
        Temp[k][Bend[k]++] = Key[i] & mask
    For k = 0 to numBuckets - 1            // Process each bucket
        For i = 0 to Bend[k] - 1
            Tally[(k << 16) + Temp[k][i]]++
Choose chunk_size so that Temp fits in L2 cache;
Choose “16” so random accesses to Tally are in L1 cache
Code ran 3x faster (very old experiment)
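A concrete C rendering of the pseudocode above, scaled down (a shift of 4 rather than 16, tiny arrays) so it is self-contained; the synthetic keys and sizes are illustrative only.

    #include <stdio.h>
    #include <string.h>
    #define NKEYS 32
    #define SHIFT 4
    #define MASK ((1 << SHIFT) - 1)
    #define NBUCKETS 4                 /* keys here are < NBUCKETS << SHIFT = 64 */
    #define CHUNK 8

    int main(void) {
        int Key[NKEYS], Tally[NBUCKETS << SHIFT] = {0};
        int Temp[NBUCKETS][CHUNK], Bend[NBUCKETS];
        for (int i = 0; i < NKEYS; i++)                   /* synthetic keys */
            Key[i] = (i * 37) % (NBUCKETS << SHIFT);

        for (int ii = 0; ii < NKEYS; ii += CHUNK) {       /* break Keys into chunks */
            memset(Bend, 0, sizeof Bend);
            for (int i = ii; i < ii + CHUNK; i++) {       /* put keys into buckets  */
                int k = Key[i] >> SHIFT;                  /* bucket = high bits     */
                Temp[k][Bend[k]++] = Key[i] & MASK;       /* keep only the low bits */
            }
            for (int k = 0; k < NBUCKETS; k++)            /* process each bucket    */
                for (int i = 0; i < Bend[k]; i++)
                    Tally[(k << SHIFT) + Temp[k][i]]++;
        }
        printf("Tally[0] = %d\n", Tally[0]);
        return 0;
    }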
18
Cartesian Computations
Hybrid implementations
Many variants are possible:
Different strategies at different levels of granularity
E.g. tidy implementation within node; lazy within chip
If a chunk of A (or B) is needed in several places, it can either move from place to place or there can be duplicate copies
If B has copies, they need to be combined at the end.
Different strategies for different regions
E.g., if (non-uniform) problem has dense regions, use static
partitioning but duplicate denser parts of A and/or B to reduce
communication
19
Cartesian Computations
Outline
• Architectural trends
– Computation is (almost) free
– Data movement and memory aren’t
• Cartesian computations
– Examples
– Static, tidy and lazy implementations
• Analysis of Cartesian computations
– Computation density and temp data ratio
– Some theorems
• Application to examples
20
Cartesian Computations
Data movement in Cartesian computations
Focus on moving of A, B, and temp data …
ignore other data
… into an architectural component
  (cache, core, chip, node, DRAM)
Same analysis for sequential or parallel execution.
Useful metrics:
Bits moved into component
gives approximate energy or power requirement (power = energy/time)
ignores difference between moving from DRAM vs. another chip
ignores difference between near and distant network moves
Computation/communication ratio
= ops/sec ÷ bits/sec = speed ÷ bandwidth
21
Cartesian Computations
Quantifying data movement
Measure chunks of A and B in bits
Tile area is measured in “square bits”
In a uniform problem, tile area is proportional to work
Define compute density ρ = work/bit² (e.g. floatOps/bit²)
Measure temp data in bits
Suppose an a × b tile generates t bits of temp data.
Define (temp) data density d = t/ab (d is “bits/bit²”)
In a uniform problem, d is independent of the tile choice
Suppose all temp data is moved off-chip (or off-core or off-node):
  (almost what happens in static partitioning – occasionally B is local)
  bits moved = abd
  compute speed / bandwidth ≤ ρ/d
22
Cartesian Computations
y = Mx example
Suppose x and y are 64-bit floats.
Suppose 1/400 of the matrix entries are non-zero
  (similar to the NAS cg class B benchmark)
ρ = 2 floatOps / (400 × 64 bits × 64 bits) ≈ 1.2 × 10⁻⁶ floatOp/bit²
Suppose temp data is {64-bit float, 32-bit index}
d = 96 bits / (400 × 64 × 64 bit²) ≈ 58 × 10⁻⁶ bit⁻¹
Note: ρ/d = 1 floatOp / 48 bits
  ≈ compute/communication ratio for static partitioning
A typical supercomputer has only 4 bits of DRAM bandwidth per floatOp
23
Cartesian Computations
Lower bound on tidy implementation
Recall: in a tidy implementation, the intermediate data is consumed immediately, in the same component where it is produced.
And every square bit of A × B (each pair of one bit of A and one bit of B) co-resides in memory at some time
Theorem: If a machine component with memory capacity c executes an a × b tile tidily, with a > c and b > c, then at least ab/c bits must move into the component during the computation, even if the memory contained data at the start.
Note: This is better than abd (static partitioning) if 1/c < d
Corollary: computation speed ≤ ρ × memory capacity × bandwidth
24
Cartesian Computations
Proof (if memory is initially empty):
Assign a penny to each square bit (i,j) in tile
total of ab pennies
The first time i and j are in the same core, give penny to
whichever load operation brought i or j in later
no load can get more than c pennies
Thus, there must be at least ab/c load operations.
Proof (general case):
Messy.
25
Cartesian Computations
Upper bound on tidy implementations
Tiling an a x b Cartesian problem:
For each chunk Bi of B of size c(1-ε):
  Move Bi into local memory
  Stream through all nibbles of A of size εc
  Execute the tile nibble × Bi
B moves into the component once ⇒ b bits of B are moved
At most b/c(1-ε) + 1 chunks of B ⇒ ab/c(1-ε) + a bits of A are moved
Temp data is always local.
Total: ab/c(1-ε) + a + b bits
For large a and b and small ε, this is about ab/c
  matching the lower bound
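A sketch of this schedule with the histogram as the running example; BCHUNK plays the role of c(1-ε), and the εc nibbles of A are simply the streaming loop (sizes and names are illustrative).

    #include <stdio.h>
    #define NKEYS 16
    #define NTALLY 12
    #define BCHUNK 4                       /* plays the role of c(1 - eps) */

    int main(void) {
        int Key[NKEYS] = {3,1,4,1,5,9,2,6,5,3,5,8,9,7,5,11};
        int Tally[NTALLY] = {0};

        for (int jb = 0; jb < NTALLY; jb += BCHUNK)    /* load one chunk of B      */
            for (int i = 0; i < NKEYS; i++)            /* stream all of A past it  */
                if (Key[i] >= jb && Key[i] < jb + BCHUNK)
                    Tally[Key[i]]++;                   /* temp data consumed here, */
                                                       /* in place (tidy)          */
        for (int k = 0; k < NTALLY; k++) printf("%d ", Tally[k]);
        printf("\n");
        return 0;
    }

Note that each chunk of Tally (B) is loaded once while all of Key (A) streams past it, so A is re-read NTALLY/BCHUNK times, the ab/c behavior described above.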
26
Cartesian Computations
Hierarchical tiling example
[Figure: the A × B plane cut into DRAM-sized strips of A and cache-sized chunks of B]
Partition A among the DRAM of the nodes
Each cache-sized chunk of B makes a tour of all the nodes
When at a node,
  the chunk resides in the cache of one core
  the local part of A streams into that core
  the relevant tile is executed
Data movement of B between nodes is ab / DRAM size
  (corresponds to the horizontal lines in the picture)
Movement of A into cores is ab / cache size
  (the vertical lines in the picture)
27
Cartesian Computations
Lower bound for bucket tiling
Recall: Bucket tiling has only two operations:
move bits of A into component, generate temp data
move bits of B into component, consume temp data
Theorem: A bucket-tiled uniform Cartesian computation, where the temp data is stored in a component with memory capacity c, moves at least ab√(2d/c) bits of A and B into the component.
Proof: See Carter & Gatlin, FOCS ’98
But √(d/c) is the geometric mean of d and 1/c, so this is worse than abd (static partitioning) or ab/c (tidy), or both.
Possible advantages over static partitioning:
  If B doesn’t fit in the sum of the local memories, static partitioning isn’t possible
  Bucket tiling’s communication is to a nearest neighbor
28
Cartesian Computations
Upper bound on bucket tiling
Partition space into m × n tiles
Entire a × b problem needs ab/mn tiles
A tile generates mnd bits of temp data; must be ≤ c
For each tile:
  Move m bits of A into the node and create temp data from it
  Consume temp data by moving and updating n bits of B
m + n bits of A and B moved per tile (temp data stays in cache)
Data movement is minimized if m = n = √(c/d)
so total communication = ab/(c/d) × 2√(c/d) = 2ab√(d/c)
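One way to see the m = n choice, under the assumptions above (an m × n tile whose temp data mnd must fit in c):
\[
\text{bits moved} \;=\; \frac{ab}{mn}\,(m+n) \;=\; ab\left(\frac{1}{m}+\frac{1}{n}\right), \qquad mn \le \frac{c}{d}.
\]
For a fixed product mn, the sum 1/m + 1/n is smallest when m = n, so take m = n = √(c/d), giving 2ab√(d/c).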
Example: a 6×6 tile needs 12 bits moved (3 points per bit), while a 4×9 tile needs 13 bits moved
29
Cartesian Computations
(Theoretically) better implementation
Form a triangular tile with side m
m(m-1)d/2 bits of temp data; must be ≤ c, so m ≈ √(2c/d)
Slide the tile along by deleting one column and placing (generating) one row
[Figure: triangular tile sliding along the diagonal; the trailing column is deleted, a new row is generated]
communication = ab√(2d/c)
  √2 times better (4 vs. 3 points per bit moved in the example)
  matches the lower bound
30
Cartesian Computations
Which implementation is best?
Each core has a cache of size c1
Each node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = temp data ratio

Method              | Bits of cache to shared memory (DRAM) | Bits to nearest-neighbor node (network) | Bits to random node (network)
Static partitioning | (relatively small)                    | none                                    | ~abd
Bucket tiling       | abd                                   | 2ab√(d/c2)                              | none
Tidy                | ab/c1                                 | ab/c2                                   | none
31
Cartesian Computations
Sparse Matrix-Vector example
Each core has a cache of size c1
Each node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = 58 × 10⁻⁶ bit⁻¹ ≈ (2 KB)⁻¹

Method              | Bits of cache to shared memory | Bits to nearest-neighbor node | Bits to random node
Static partitioning | (relatively small)             | none                          | abd = ab/2KB   Worst!
Bucket tiling       | abd                            | 2ab√(d/c2) ≈ 0.015·ab/√c2     | none
Tidy                | ab/c1   (c1 > 2 KB)            | ab/c2   (c2 >> 2 KB)          | none
32
Cartesian Computations
Histogram example
Suppose A is 64-bit integers and B is 16-bit counters
There’s one increment operation per row
An a × b tile (a and b are in bits) represents a/64 rows
In a uniform problem, the tile covers a fraction b/|B| of the indices
  so the tile has (a/64) × (b/|B|) = ab/64|B| “hits”
  (|B| is the size of B in bits)
ρ = work/bit² = (ab/64|B|)/ab = 1/(64|B|) adds/bit²
Suppose we need 32-bit indices for the temp data
  (fewer bits would do, since the index is limited within a tile)
d = t/ab = (ab/64|B|) × 32/ab = 1/(2|B|)
Note: ρ/d = 1 add / 32 bits (not at all surprising)
  ≈ compute/communication ratio for static partitioning
33
Cartesian Computations
Histogram example
Each core has a cache of size c1
Each node has memory of size c2 >> c1
a = |A| > c2, b = |B| > c2
d = 1/(2b)

Method              | Bits of cache to shared memory | Bits to nearest-neighbor node | Bits to random node
Static partitioning | (relatively small)             | none                          | abd = a/2   BEST!
Bucket tiling       | abd = a/2                      | 2ab√(d/c2) = a√(2b/c2) > a    | none
Tidy                | ab/c1   (b/c1 >> 1)            | ab/c2   (b/c2 > 1)            | none
34
Cartesian Computations
Conclusions, etc
Computation is free
A theory for data movement (analogous to a theory of computation)
is needed for today’s architectures
For Cartesian computations, we can relate the three orthogonal
aspects of computers:
computation speed
bandwidth
memory
and choose algorithms accordingly
Results are needed for other classes of computations!
And for architectural features (e.g. sequential access)!
35
Cartesian Computations
Backup slides
36
Cartesian Computations
Implementing Cartesian computations
(put some words here)
[Figure: two schedules on processors P1–P4, with A and B distributed across the processors. One is labeled “Static partitioning”, with temp-data transfers 1→2, 1→3, 1→4, 2→1, 2→3, etc. between processors; the other is labeled “Tidy (tiling)”.]
37
Cartesian Computations
Algorithmic vs. memory analysis
• MergeSort or QuickSort
– O(N lg N) operations
– Sequential memory access
• BucketSort or CountSort
– O(N) operations
– Random memory access
38
Cartesian Computations
Algorithmic vs. memory analysis
• Sandia study of Linear Algebra algorithms
Millions of floating point ops needed for sample problem
– 1950: 1,000,000,000,000 Mops (Cramer’s rule)
– 1965: 10,000,000 Mops (Gaussian Elimination)
– 1975: 300 Mops (Gauss-Seidel)
– 1985: 8 Mops (Conjugate Gradient)
• Each algorithmic improvement results in less locality in
memory references.
39
Cartesian Computations
Integer Sort Example
• Loop from NAS IS benchmark
For i=1 to numkeys
    Tally[Key[i]]++
  Random access to Tally (BAD)
• Bucket-tiling*
For i=1 to numkeys                       // Count keys per bucket
    Bcount[Key[i]>>12]++
  Bcount in cache (GOOD); Key is sequential (OK)
For k=1 to maxKey>>12                    // Find bucket starting points
    Bend[k] = Bend[k-1] + Bcount[k-1]
  Bend fits in cache (GOOD)
For i=1 to numkeys                       // Put keys into buckets
    Temp[Bend[Key[i]>>12]++] = Key[i]
  Bend in cache; Key and Temp are sequential (OK)
For i=1 to numkeys                       // Count keys
    Tally[Temp[i]]++
  Active parts of Tally stay in cache (GOOD); Temp OK
Bottom Line: Code is over 2 times faster (on IBM RS6000).
*Alpern & Carter ‘94, “Towards a Model for Portable Parallel Performance”
40
Cartesian Computations
Inspector – Executor paradigm
• Introduced by the Chaos group at U. Md. (led by Joel Saltz), e.g. their 1995 TPDS paper
• “Inspector” is runtime code that:
– Iterates through loops without executing, records information
about data access order
– Reorders data in arrays or chooses new execution order
– Generates new index arrays
• “Executor” is (modified) code
– Does the actual computation
– Uses reordered data and index arrays
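A small sketch of the idea (illustrative only, not the Chaos library’s API): the “inspector” looks at the index array of the loop y[i] += x[idx[i]] and chooses a new iteration order that groups accesses to x; the “executor” then runs the unchanged computation in that order.

    #include <stdio.h>
    #include <stdlib.h>

    static const int *g_idx;
    static int cmp(const void *a, const void *b) {
        return g_idx[*(const int *)a] - g_idx[*(const int *)b];
    }

    int main(void) {
        int idx[8] = {7, 2, 5, 2, 0, 7, 3, 5};
        double x[8] = {1,1,1,1,1,1,1,1}, y[8] = {0};
        int order[8];

        /* Inspector: record the access order and choose a new iteration order */
        for (int i = 0; i < 8; i++) order[i] = i;
        g_idx = idx;
        qsort(order, 8, sizeof(int), cmp);      /* iterations grouped by idx[i] */

        /* Executor: the original computation, run in the reordered schedule */
        for (int k = 0; k < 8; k++) {
            int i = order[k];
            y[i] += x[idx[i]];                  /* accesses to x are now grouped */
        }
        printf("y[0] = %g\n", y[0]);
        return 0;
    }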
41