Scaling the Cray MTA
Burton Smith
Cray Inc.
1
Overview
The MTA is a uniform shared memory multiprocessor with
  latency tolerance using fine-grain multithreading
  no data caches or other hierarchy
  no memory bandwidth bottleneck, absent hot spots
Every 64-bit memory word also has a full/empty bit
  Load and store can act as receive and send, respectively
  The bit can also implement locks and atomic updates
Every processor has 16 protection domains
  One is used by the operating system
  The rest can be used to multiprogram the processor
  We limit the number of “big” jobs per processor
2
Multithreading on one processor
[Figure: programs running in parallel (serial code, subproblems A and B, loop iterations i = 0 … n) decompose into concurrent threads of computation; the threads are mapped onto the processor's 128 hardware streams, some unused, whose ready instruction pool feeds the pipeline of executing instructions]
3
Multithreading on multiple processors
[Figure: the same decomposition of parallel programs into concurrent threads, now multithreaded across multiple processors]
4
Typical MTA processor utilization
[Plot: percent utilization (vertical axis, 0 to 100) versus number of streams (horizontal axis, 0 to 90)]
5
Processor features
Multithreaded VLIW with three operations per 64-bit word
  The operations are named M(emory), A(rithmetic), and C(ontrol)
31 general-purpose 64-bit registers per stream
Paged program address space (4KB pages)
Segmented data address space (8KB–256MB segments)
  Privilege and interleaving are specified in the segment descriptor
Data addressability to the byte level
Explicit-dependence lookahead
Multiple orthogonally generated condition codes
Explicit branch target registers
Speculative loads
Unprivileged traps
  and no interrupts at all
6
Supported data types
8-, 16-, 32-, and 64-bit signed and unsigned integer
64-bit IEEE and 128-bit “doubled precision” floating point
  conversion to and from 32-bit IEEE is supported
Bit vectors and matrices of arbitrary shape
64-bit pointer with 16 tag bits and 48 address bits
64-bit stream status word (SSW)
64-bit exception register
7
Bit vector and matrix operations
The usual logical operations and shifts are available in both A and C versions
t=tera_bit_tally(u)
t=tera_bit_odd_{and,nimp,or,xor}(u,v)
x=tera_bit_{left,right}_{ones,zeros}(y,z)
t=tera_shift_pair_{left,right}(t,u,v,w)
t=tera_bit_merge(u,v,w)
t=tera_bit_mat_{exor,or}(u,v)
t=tera_bit_mat_transpose(u)
t=tera_bit_{pack,unpack}(u,v)
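As an illustration of one of these, a tally counts the one bits in a word. The sketch below gives portable reference semantics, assuming tera_bit_tally(u) is a 64-bit population count; bit_tally is an illustrative name, not the MTA intrinsic itself.

```c
#include <stdint.h>

/* Reference semantics for a 64-bit population count (assumed to match
   tera_bit_tally).  Clears the lowest set bit until none remain, so the
   loop runs once per one bit. */
uint64_t bit_tally(uint64_t u) {
    uint64_t count = 0;
    while (u) {
        u &= u - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```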
8
The memory subsystem
Program addresses are hashed to logical addresses
  We use an invertible matrix over GF(2)
  The result is no stride sensitivity at all
Logical addresses are then interleaved among physical memory unit numbers and offsets
The number of memory units can be a power of 2 times any factor of 315 = 5·7·9
1, 2, or 4 GB of memory per processor
The memory units support 1 memory reference per cycle per processor
  plus instruction fetches to the local processor’s L2 cache
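A minimal sketch of such a hash, assuming the matrix is stored as 64 row words and output bit i is the parity of (address AND row i); the actual MTA matrix and bit layout are not shown here:

```c
#include <stdint.h>

/* Multiply a 64-bit address by a 64x64 bit matrix over GF(2).
   Row M[i] selects which address bits feed output bit i; the output
   bit is their parity (XOR).  For the hash to be a bijection, M must
   be invertible over GF(2). */
uint64_t gf2_hash(uint64_t addr, const uint64_t M[64]) {
    uint64_t out = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t x = addr & M[i];
        /* fold x down to its parity */
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
        out |= (x & 1ULL) << i;
    }
    return out;
}
```

With the identity matrix (row i = 1&lt;&lt;i) the hash is the identity, which is a handy sanity check on any candidate matrix.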
9
Memory word organization
64-bit words
  with 8 more bits for SECDED
Big-endian partial word order
  addressing halfwords, quarterwords, and bytes
4 tag bits per word
  with four more SECDED bits
The memory implements a 64-bit fetch-and-add operation
[Figure: word layout: the tag bits (full/empty, forward, trap 1, trap 0) alongside the 64-bit data value, bits 63..0]
10
Synchronized memory operations
Each word of memory has an associated full/empty bit
Normal loads ignore this bit, and normal stores set it full
Sync memory operations are available via data declarations
  Sync loads atomically wait for full, then load and set empty
  Sync stores atomically wait for empty, then store and set full
Waiting is autonomous, consuming no processor issue cycles
  After a while, a trap occurs and the thread state is saved
Sync and normal memory operations usually take the same time because of this “optimistic” approach
In any event, synchronization latency is tolerated
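The state transitions can be modeled in C as below. This is an illustrative single-threaded model of the full/empty protocol, not the hardware: on the MTA the retrying is done autonomously by the stream, with no issue cycles consumed.

```c
#include <stdint.h>
#include <stdbool.h>

/* A memory word together with its full/empty tag bit (model only). */
typedef struct { uint64_t value; bool full; } mta_word;

/* Sync load: succeeds only when the word is full; loads and sets empty.
   Hardware would retry autonomously instead of returning false. */
bool sync_load(mta_word *w, uint64_t *out) {
    if (!w->full) return false;
    *out = w->value;
    w->full = false;
    return true;
}

/* Sync store: succeeds only when the word is empty; stores and sets full. */
bool sync_store(mta_word *w, uint64_t v) {
    if (w->full) return false;
    w->value = v;
    w->full = true;
    return true;
}

/* A normal store would always write and set the bit full; a normal
   load ignores the bit entirely. */
```

Pairing the two operations this way makes a memory word behave like a one-element queue, which is how loads and stores can act as receive and send.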
11
I/O Processor (IOP)
There are as many IOPs as there are processors
An IOP program describes a sequence of unit-stride block transfers to or from anywhere in memory
Each IOP drives a 100MB/s (32-bit) HIPPI channel
  both directions can be driven simultaneously
  memory-to-memory copies are also possible
We soon expect to be leveraging off-the-shelf buses and microprocessors as outboard devices
12
The memory network
The current MTA memory network is a 3-D toroidal mesh with pure deflection (“hot potato”) routing
It must deliver one random memory reference per processor per cycle
  When this condition is met, the topology is transparent
The most expensive part of the system is its wires
  This is a general property of high-bandwidth systems
Larger systems will need more sophisticated topologies
  Surprisingly, network topology is not a dead subject
Unlike wires, transistors keep getting faster and cheaper
  We should use transistors aggressively to save wires
13
Our problem is bandwidth, not latency
In any memory network, concurrency = latency × bandwidth
Multithreading supplies ample memory network concurrency
Bandwidth (not latency) limits practical MTA system size
  and large MTA systems will have expensive memory networks
In the future, systems will be differentiated by their bandwidths
  even to the point of implementing uniform shared memory
System purchasers will buy the class of bandwidth they need
System vendors will make sure their bandwidth scales properly
The issue is the total cost of a given amount of bandwidth
How much bandwidth is enough?
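The concurrency relation is Little's law. With illustrative round numbers (assumed here, not Cray-published figures), a memory latency of about 150 cycles at one reference per cycle needs about 150 outstanding references per processor, which is why a processor carries many hardware streams of lookahead:

```c
/* Little's law: outstanding requests = latency x issue bandwidth.
   The figures used in the test are illustrative assumptions. */
unsigned required_concurrency(unsigned latency_cycles, unsigned refs_per_cycle) {
    return latency_cycles * refs_per_cycle;
}
```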
14
Reducing the number and cost of wires
Use on-wafer and on-board wires whenever possible
Use the highest possible bandwidth per wire
Use optics (or superconductors) for long-distance interconnect to avoid skin effect
Use direct interconnection network topologies
  Indirect networks waste wires
Use symmetric (bidirectional) links for fault tolerance
  Disabling an entire cycle preserves balance
Leverage technologies from other markets
  DWDM is not quite economical enough yet
Base networks on low-diameter graphs of low degree
  bandwidth per node ∝ degree / average distance
15
Graph symmetries
Suppose G = (v, e) is a graph with vertex set v and directed edge set e ⊆ v × v
G is called bidirectional when (x,y) ∈ e implies (y,x) ∈ e
  Bidirectional links are helpful for fault reconfiguration
An automorphism of G is a mapping σ: v → v such that (x,y) is in e if and only if (σ(x), σ(y)) is also
G is vertex-symmetric when for any pair of vertices there is an automorphism mapping one vertex to the other
G is edge-symmetric when for any pair of edges there is an automorphism mapping one edge to the other
Edge and vertex symmetries help in balancing network load
16
Specific bandwidth
Consider an n-node edge-symmetric bidirectional network with (out-)degree k and link bandwidth ℓ
Let message destinations be uniformly distributed among the nodes
  hashing memory addresses helps guarantee this
Let d be the average distance (in hops) between nodes
Assume every node generates messages at bandwidth b
  so the total aggregate link bandwidth available is n·k·ℓ
  then n·b·d ≤ n·k·ℓ and therefore b/ℓ ≤ k/d
The ratio k/d of degree to average distance limits the ratio b/ℓ of injection bandwidth to link bandwidth
We call k/d the specific bandwidth of the network
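The bound can be stated as a one-line calculation. Writing k for the out-degree, d for the average distance in hops, and ℓ for the per-link bandwidth, the per-node injection bandwidth b can be at most ℓ·k/d; the numbers in the test are illustrative only.

```c
/* Upper bound on per-node injection bandwidth implied by
   n*b*d <= n*k*l, i.e. b <= l * (k/d); k/d is the network's
   specific bandwidth. */
double max_injection_bw(double link_bw, double degree, double avg_distance) {
    return link_bw * degree / avg_distance;
}
```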
17
Graphs with average distance ≈ degree

Degree   Diameter = Degree   Diameter = Degree+1
  3              20                    38
  4              95                   364
  5             532                 2,734
  6           7,817                13,056
  7          35,154                93,744
  8         234,360               820,260
  9       3,019,632            15,686,400
Source: Bermond, Delorme, and Quisquater, JPDC 3 (1986), p. 433
18
Cayley graphs
Groups are a good source of low-diameter graphs
The vertices of a Cayley graph are the group elements
The edges leaving a vertex are generators of the group
Cayley graphs are always vertex-symmetric
  Generator g goes from node x to node x·g
  Premultiplication by y·x⁻¹ is an automorphism taking x to y
A Cayley graph is edge-symmetric if and only if every pair of generators is related by a group automorphism
Example: the k-ary n-cube is a Cayley graph of (Z_k)^n
  (Z_k)^n is the n-fold direct product of the integers modulo k
  The 2n generators are (1,0,…,0), (-1,0,…,0), …, (0,…,0,-1)
  This graph is clearly edge-symmetric
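To make the k-ary n-cube example concrete, here is a small sketch (with k and n fixed as compile-time constants, and the function name neighbors chosen for illustration) that applies the 2n generators of (Z_k)^n to a vertex, enumerating its neighbors:

```c
/* Neighbors of a vertex in the k-ary n-cube, viewed as the Cayley
   graph of (Z_k)^n.  A vertex is an array of N digits mod K; the
   generator pair for coordinate j adds +1 or -1 (mod K) there. */
#define N 3
#define K 4

void neighbors(const int v[N], int out[2 * N][N]) {
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            out[2 * j][i]     = v[i];
            out[2 * j + 1][i] = v[i];
        }
        out[2 * j][j]     = (v[j] + 1) % K;      /* generator +1 in slot j */
        out[2 * j + 1][j] = (v[j] + K - 1) % K;  /* its inverse, -1 mod K */
    }
}
```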
19
Another example: the Star graph
The Star graph is an edge-symmetric Cayley graph of the group Sn of permutations on n symbols
The generators are the exchanges of the rightmost symbol with every other symbol position
It therefore has n! vertices and degree n-1
For moderate n, the specific bandwidth is close to 1

Degree   Hypercube   8-ary n-cube   Star graph
  3           8            16             24
  4          16            64            120
  5          32           128            720
  6          64           512          5,040
  7         128         1,024         40,320
  8         256         4,096        362,880
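The generators can also be sketched directly. This illustrative routine (n fixed at 4, the size of the graph on the next slide) produces the n-1 Star-graph neighbors of a permutation by exchanging the rightmost symbol with each other position:

```c
/* Star-graph neighbors of a permutation of SN symbols.  Generator j
   swaps the last position with position j, so each vertex has
   degree SN-1. */
#define SN 4

void star_neighbors(const int p[SN], int out[SN - 1][SN]) {
    for (int j = 0; j < SN - 1; j++) {
        for (int i = 0; i < SN; i++) out[j][i] = p[i];
        int t = out[j][SN - 1];        /* exchange rightmost symbol */
        out[j][SN - 1] = out[j][j];    /* with the symbol in slot j */
        out[j][j] = t;
    }
}
```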
20
The Star graph of size 4! = 24
[Figure: the Star graph on the 24 permutations of 0123, vertices labeled 0123 through 3210]
21
Conclusions
The Cray MTA is a new kind of high performance system
  scalar multithreaded processors
  uniform shared memory
  fine-grain synchronization
  simple programming
It will scale to 64 processors in 2001 and 256 in 2002
  future versions will have thousands of processors
It extends the capabilities of supercomputers
  scalar parallelism, e.g. database
  fine-grain synchronization, e.g. sparse linear systems
28