Scaling the Cray MTA
Burton Smith
Cray Inc.
1
Overview
The MTA is a uniform shared memory multiprocessor with
  latency tolerance using fine-grain multithreading
  no data caches or other hierarchy
  no memory bandwidth bottleneck, absent hot spots
Every 64-bit memory word also has a full/empty bit
  Load and store can act as receive and send, respectively
  The bit can also implement locks and atomic updates
Every processor has 16 protection domains
  One is used by the operating system
  The rest can be used to multiprogram the processor
  We limit the number of “big” jobs per processor
2
Multithreading on one processor
[Figure: programs running in parallel (serial code, subproblems A and B, loop iterations i = 0 … n) decompose into concurrent threads of computation; the threads are mapped onto the processor's 128 hardware streams, some unused, whose ready instruction pool feeds the pipeline of executing instructions]
3
Multithreading on multiple processors
[Figure: the same decomposition of parallel programs into concurrent threads, now multithreaded across multiple processors]
4
Typical MTA processor utilization
[Plot: percent utilization (vertical axis, 0 to 100) versus number of streams (horizontal axis, 0 to 90)]
5
Processor features
Multithreaded VLIW with three operations per 64-bit word
  The operations are named M(emory), A(rithmetic), and C(ontrol)
31 general-purpose 64-bit registers per stream
Paged program address space (4KB pages)
Segmented data address space (8KB–256MB segments)
  Privilege and interleaving are specified in the segment descriptor
Data addressability to the byte level
Explicit-dependence lookahead
Multiple orthogonally generated condition codes
Explicit branch target registers
Speculative loads
Unprivileged traps
  and no interrupts at all
6
Supported data types
8-, 16-, 32-, and 64-bit signed and unsigned integer
64-bit IEEE and 128-bit “doubled precision” floating point
  conversion to and from 32-bit IEEE is supported
Bit vectors and matrices of arbitrary shape
64-bit pointer with 16 tag bits and 48 address bits
64-bit stream status word (SSW)
64-bit exception register
7
Bit vector and matrix operations
The usual logical operations and shifts are available in both A and C versions
t=tera_bit_tally(u)
t=tera_bit_odd_{and,nimp,or,xor}(u,v)
x=tera_bit_{left,right}_{ones,zeros}(y,z)
t=tera_shift_pair_{left,right}(t,u,v,w)
t=tera_bit_merge(u,v,w)
t=tera_bit_mat_{exor,or}(u,v)
t=tera_bit_mat_transpose(u)
t=tera_bit_{pack,unpack}(u,v)
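As an illustration of one of these, a tally counts the one bits in a word. The sketch below gives portable reference semantics, assuming tera_bit_tally(u) is a 64-bit population count; bit_tally is an illustrative name, not the MTA intrinsic itself.

```c
#include <stdint.h>

/* Reference semantics for a 64-bit population count (assumed to match
   tera_bit_tally).  Clears the lowest set bit until none remain, so the
   loop runs once per one bit. */
uint64_t bit_tally(uint64_t u) {
    uint64_t count = 0;
    while (u) {
        u &= u - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```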
8
The memory subsystem
Program addresses are hashed to logical addresses
  We use an invertible matrix over GF(2)
  The result is no stride sensitivity at all
Logical addresses are then interleaved among physical memory unit numbers and offsets
The number of memory units can be a power of 2 times any factor of 315 = 5·7·9
1, 2, or 4 GB of memory per processor
The memory units support 1 memory reference per cycle per processor
  plus instruction fetches to the local processor’s L2 cache
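A minimal sketch of such a hash, assuming the matrix is stored as 64 row words and output bit i is the parity of (address AND row i); the actual MTA matrix and bit layout are not shown here:

```c
#include <stdint.h>

/* Multiply a 64-bit address by a 64x64 bit matrix over GF(2).
   Row M[i] selects which address bits feed output bit i; the output
   bit is their parity (XOR).  For the hash to be a bijection, M must
   be invertible over GF(2). */
uint64_t gf2_hash(uint64_t addr, const uint64_t M[64]) {
    uint64_t out = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t x = addr & M[i];
        /* fold x down to its parity */
        x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
        x ^= x >> 4;  x ^= x >> 2;  x ^= x >> 1;
        out |= (x & 1ULL) << i;
    }
    return out;
}
```

With the identity matrix (row i = 1&lt;&lt;i) the hash is the identity, which is a handy sanity check on any candidate matrix.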
9
Memory word organization
64-bit words
  with 8 more bits for SECDED
Big-endian partial word order
  addressing halfwords, quarterwords, and bytes
4 tag bits per word
  with four more SECDED bits
The memory implements a 64-bit fetch-and-add operation
[Figure: word layout: the tag bits (full/empty, forward, trap 1, trap 0) alongside the 64-bit data value, bits 63..0]
10
Synchronized memory operations
Each word of memory has an associated full/empty bit
Normal loads ignore this bit, and normal stores set it full
Sync memory operations are available via data declarations
  Sync loads atomically wait for full, then load and set empty
  Sync stores atomically wait for empty, then store and set full
Waiting is autonomous, consuming no processor issue cycles
  After a while, a trap occurs and the thread state is saved
Sync and normal memory operations usually take the same time because of this “optimistic” approach
In any event, synchronization latency is tolerated
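The state transitions can be modeled in C as below. This is an illustrative single-threaded model of the full/empty protocol, not the hardware: on the MTA the retrying is done autonomously by the stream, with no issue cycles consumed.

```c
#include <stdint.h>
#include <stdbool.h>

/* A memory word together with its full/empty tag bit (model only). */
typedef struct { uint64_t value; bool full; } mta_word;

/* Sync load: succeeds only when the word is full; loads and sets empty.
   Hardware would retry autonomously instead of returning false. */
bool sync_load(mta_word *w, uint64_t *out) {
    if (!w->full) return false;
    *out = w->value;
    w->full = false;
    return true;
}

/* Sync store: succeeds only when the word is empty; stores and sets full. */
bool sync_store(mta_word *w, uint64_t v) {
    if (w->full) return false;
    w->value = v;
    w->full = true;
    return true;
}

/* A normal store would always write and set the bit full; a normal
   load ignores the bit entirely. */
```

Pairing the two operations this way makes a memory word behave like a one-element queue, which is how loads and stores can act as receive and send.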
11
I/O Processor (IOP)
There are as many IOPs as there are processors
An IOP program describes a sequence of unit-stride block transfers to or from anywhere in memory
Each IOP drives a 100MB/s (32-bit) HIPPI channel
  both directions can be driven simultaneously
  memory-to-memory copies are also possible
We soon expect to be leveraging off-the-shelf buses and microprocessors as outboard devices
12
The memory network
The current MTA memory network is a 3-D toroidal mesh with pure deflection (“hot potato”) routing
It must deliver one random memory reference per processor per cycle
  When this condition is met, the topology is transparent
The most expensive part of the system is its wires
  This is a general property of high-bandwidth systems
Larger systems will need more sophisticated topologies
  Surprisingly, network topology is not a dead subject
Unlike wires, transistors keep getting faster and cheaper
  We should use transistors aggressively to save wires
13
Our problem is bandwidth, not latency
In any memory network, concurrency = latency × bandwidth
Multithreading supplies ample memory network concurrency
Bandwidth (not latency) limits practical MTA system size
  and large MTA systems will have expensive memory networks
In the future, systems will be differentiated by their bandwidths
  even to the point of implementing uniform shared memory
System purchasers will buy the class of bandwidth they need
System vendors will make sure their bandwidth scales properly
The issue is the total cost of a given amount of bandwidth
How much bandwidth is enough?
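The concurrency relation is Little's law. With illustrative round numbers (assumed here, not Cray-published figures), a memory latency of about 150 cycles at one reference per cycle needs about 150 outstanding references per processor, which is why a processor carries many hardware streams of lookahead:

```c
/* Little's law: outstanding requests = latency x issue bandwidth.
   The figures used in the test are illustrative assumptions. */
unsigned required_concurrency(unsigned latency_cycles, unsigned refs_per_cycle) {
    return latency_cycles * refs_per_cycle;
}
```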
14
Reducing the number and cost of wires
Use on-wafer and on-board wires whenever possible
Use the highest possible bandwidth per wire
Use optics (or superconductors) for long-distance interconnect to avoid skin effect
Use direct interconnection network topologies
  Indirect networks waste wires
Use symmetric (bidirectional) links for fault tolerance
  Disabling an entire cycle preserves balance
Leverage technologies from other markets
  DWDM is not quite economical enough yet
Base networks on low-diameter graphs of low degree
  bandwidth per node ∝ degree / average distance
15
Graph symmetries
Suppose G = (v, e) is a graph with vertex set v and directed edge set e ⊆ v × v
G is called bidirectional when (x,y) ∈ e implies (y,x) ∈ e
  Bidirectional links are helpful for fault reconfiguration
An automorphism of G is a mapping σ: v → v such that (x,y) is in e if and only if (σ(x), σ(y)) is also
G is vertex-symmetric when for any pair of vertices there is an automorphism mapping one vertex to the other
G is edge-symmetric when for any pair of edges there is an automorphism mapping one edge to the other
Edge and vertex symmetries help in balancing network load
16
Specific bandwidth
Consider an n-node edge-symmetric bidirectional network with (out-)degree k and link bandwidth ℓ
Let message destinations be uniformly distributed among the nodes
  hashing memory addresses helps guarantee this
Let d be the average distance (in hops) between nodes
Assume every node generates messages at bandwidth b
  so the total aggregate link bandwidth available is n·k·ℓ
  then n·b·d ≤ n·k·ℓ and therefore b/ℓ ≤ k/d
The ratio k/d of degree to average distance limits the ratio b/ℓ of injection bandwidth to link bandwidth
We call k/d the specific bandwidth of the network
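The bound can be stated as a one-line calculation. Writing k for the out-degree, d for the average distance in hops, and ℓ for the per-link bandwidth, the per-node injection bandwidth b can be at most ℓ·k/d; the numbers in the test are illustrative only.

```c
/* Upper bound on per-node injection bandwidth implied by
   n*b*d <= n*k*l, i.e. b <= l * (k/d); k/d is the network's
   specific bandwidth. */
double max_injection_bw(double link_bw, double degree, double avg_distance) {
    return link_bw * degree / avg_distance;
}
```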
17
Graphs with average distance ≈ degree

Degree   Diameter = Degree   Diameter = Degree+1
  3              20                    38
  4              95                   364
  5             532                 2,734
  6           7,817                13,056
  7          35,154                93,744
  8         234,360               820,260
  9       3,019,632            15,686,400
Source: Bermond, Delorme, and Quisquater, JPDC 3 (1986), p. 433
18
Cayley graphs
Groups are a good source of low-diameter graphs
The vertices of a Cayley graph are the group elements
The edges leaving a vertex are generators of the group
Cayley graphs are always vertex-symmetric
  Generator g goes from node x to node x·g
  Premultiplication by y·x⁻¹ is an automorphism taking x to y
A Cayley graph is edge-symmetric if and only if every pair of generators is related by a group automorphism
Example: the k-ary n-cube is a Cayley graph of (Z_k)^n
  (Z_k)^n is the n-fold direct product of the integers modulo k
  The 2n generators are (1,0,…,0), (-1,0,…,0), …, (0,…,0,-1)
  This graph is clearly edge-symmetric
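To make the k-ary n-cube example concrete, here is a small sketch (with k and n fixed as compile-time constants, and the function name neighbors chosen for illustration) that applies the 2n generators of (Z_k)^n to a vertex, enumerating its neighbors:

```c
/* Neighbors of a vertex in the k-ary n-cube, viewed as the Cayley
   graph of (Z_k)^n.  A vertex is an array of N digits mod K; the
   generator pair for coordinate j adds +1 or -1 (mod K) there. */
#define N 3
#define K 4

void neighbors(const int v[N], int out[2 * N][N]) {
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            out[2 * j][i]     = v[i];
            out[2 * j + 1][i] = v[i];
        }
        out[2 * j][j]     = (v[j] + 1) % K;      /* generator +1 in slot j */
        out[2 * j + 1][j] = (v[j] + K - 1) % K;  /* its inverse, -1 mod K */
    }
}
```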
19
Another example: the Star graph
The Star graph is an edge-symmetric Cayley graph of the group Sn of permutations on n symbols
The generators are the exchanges of the rightmost symbol with every other symbol position
It therefore has n! vertices and degree n-1
For moderate n, the specific bandwidth is close to 1

Degree   Hypercube   8-ary n-cube   Star graph
  3           8            16             24
  4          16            64            120
  5          32           128            720
  6          64           512          5,040
  7         128         1,024         40,320
  8         256         4,096        362,880
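The generators can also be sketched directly. This illustrative routine (n fixed at 4, the size of the graph on the next slide) produces the n-1 Star-graph neighbors of a permutation by exchanging the rightmost symbol with each other position:

```c
/* Star-graph neighbors of a permutation of SN symbols.  Generator j
   swaps the last position with position j, so each vertex has
   degree SN-1. */
#define SN 4

void star_neighbors(const int p[SN], int out[SN - 1][SN]) {
    for (int j = 0; j < SN - 1; j++) {
        for (int i = 0; i < SN; i++) out[j][i] = p[i];
        int t = out[j][SN - 1];        /* exchange rightmost symbol */
        out[j][SN - 1] = out[j][j];    /* with the symbol in slot j */
        out[j][j] = t;
    }
}
```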
20
The Star graph of size 4! = 24
[Figure: the Star graph on the 24 permutations of 0123, vertices labeled 0123 through 3210]
21
Conclusions
The Cray MTA is a new kind of high performance system
  scalar multithreaded processors
  uniform shared memory
  fine-grain synchronization
  simple programming
It will scale to 64 processors in 2001 and 256 in 2002
  future versions will have thousands of processors
It extends the capabilities of supercomputers
  scalar parallelism, e.g. database
  fine-grain synchronization, e.g. sparse linear systems
28