Alverson_Gemini_2010..
Bob Alverson, Duncan Roweth,
Larry Kaplan
Cray Inc.
Cray Inc. Hot Interconnects
Overview
Network Interface
Router
Reliability, Availability, and Serviceability Features
Software Stack
Performance
Integrated NIC and Router
External HSS Monitoring
Supports 2 Nodes per ASIC
Advanced Resiliency Features
Hardware Global Address Support
Advanced NIC designed to efficiently support
MPI
One-sided MPI
Shmem
UPC, Coarray FORTRAN
[Figure: diagram with axes labeled X, Y, and Z]
Fast Memory Access (FMA) – fine grain remote PUT/GET
Block Transfer Engine (BTE) – offload for long transfers
Completion Queue (CQ) – client notification
Atomic Memory Op (AMO) – fetch&add, etc.
[Figure: NIC block diagram. Labeled blocks: FMA, BTE, AMO, CQ, SSID, ORB, NPT, HARB, TARB, NAT, RMT, RAT, CLM, LM, NL, LB Ring, HT3 Cave, Router Tiles. Labeled paths: net req, net rsp, net rsp headers, ht p req, ht np req, ht p ireq, ht np ireq, ht irsp, ht trsp, ht treq p, ht treq np, vc0, vc1.]
Single-sided
Processor stores become remote PUT or GET
FMA descriptors hold state to help determine destination node and memory location
FMA PUT for short messages
Uncached processor store to Gemini window translated directly to network packet
FMA GET allows reverse direction data transfer of 1 to 64 bytes
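The role of the FMA descriptor can be illustrated with a small model. This is only a sketch under stated assumptions: the names `FmaDescriptor`, `PutPacket`, and `store_to_window` are hypothetical, the descriptor is reduced to a destination node and a remote base address, and the 1 to 64 byte limit is borrowed from the packet sizes quoted in this deck. It is not the actual hardware interface.

```python
from dataclasses import dataclass


@dataclass
class FmaDescriptor:
    """Hypothetical per-window state: where stores through the window land."""
    dest_node: int    # remote node that receives the PUT
    remote_base: int  # base address of the target region on that node


@dataclass
class PutPacket:
    """Hypothetical, heavily simplified view of one network PUT request."""
    dest_node: int
    remote_addr: int
    payload: bytes


def store_to_window(desc: FmaDescriptor, window_offset: int, data: bytes) -> PutPacket:
    """Model one processor store into the window as one network PUT packet."""
    if not 1 <= len(data) <= 64:
        raise ValueError("this sketch handles 1 to 64 bytes per packet")
    # The descriptor supplies the destination; the store supplies offset and data.
    return PutPacket(desc.dest_node, desc.remote_base + window_offset, data)


desc = FmaDescriptor(dest_node=7, remote_base=0x1000)
pkt = store_to_window(desc, window_offset=0x40, data=b"\x01" * 8)
```

The point of the model is that the store itself carries only an offset and data; everything needed to address the remote node was set up once in the descriptor.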
Driver managed
BTE PUT for long messages
DMA transfer to offload data movement from processor
BTE SEND for IP traffic, etc.
Send message to remote node
Single receive queue for all sources
Upper level protocol covers lost messages
BTE GET support for simplified data transfers
In lieu of involving remote side for PUT
Hardware remote atomic memory operations in the NIC
Add, Compare & Swap, Logical Operations
Executed at the node with the memory
AMO cache for hot locations
Up to 64 locations with AMOs in process
Global operations support
Barriers
Counters
Collectives (reductions, global sum)
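The semantics of the atomic operations listed above can be sketched in software. This toy model assumes a single owner node that applies each operation to its own memory and returns the previous value; the class `AmoNode` and its method names are hypothetical and say nothing about how the NIC implements them.

```python
class AmoNode:
    """Toy model of the node that owns the memory and applies AMOs locally."""

    def __init__(self):
        self.memory = {}  # address -> integer value, missing means 0

    def fetch_add(self, addr, value):
        """Add value to the word at addr and return the previous contents."""
        old = self.memory.get(addr, 0)
        self.memory[addr] = old + value
        return old

    def compare_swap(self, addr, expected, new):
        """Store new only if the word equals expected; return the previous contents."""
        old = self.memory.get(addr, 0)
        if old == expected:
            self.memory[addr] = new
        return old


owner = AmoNode()
# Four requesters each take a ticket from a shared counter.
tickets = [owner.fetch_add(0x100, 1) for _ in range(4)]
# A fifth requester resets the counter if it still holds 4.
previous = owner.compare_swap(0x100, 4, 0)
```

Because every operation runs at the node holding the data, requesters never see a partially updated value, which is what makes counters and barriers built on these operations work.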
6x8 tile matrix
Input queue to one of 6 subswitches
Route to one of 8 output buffers
Hashed routing preserves order to cachelines
Adaptive routing
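One way to see why hashing can preserve order within a cacheline: if the path is chosen as a pure function of the cacheline address, every packet for that line takes the same path and cannot be overtaken by another packet for the same line. The sketch below assumes a 64-byte cacheline and a trivial modulo hash; both are illustrative assumptions, and the function `pick_path` is hypothetical.

```python
CACHELINE_BYTES = 64  # assumption for this sketch


def pick_path(address: int, num_paths: int) -> int:
    """Choose a path from the cacheline index, so one line always maps to one path."""
    line = address // CACHELINE_BYTES
    return line % num_paths  # placeholder hash, not the hardware's function


same_line = (pick_path(0x1000, 4), pick_path(0x103F, 4))
next_line = pick_path(0x1040, 4)
```

Different cachelines are still spread across paths, so load is distributed without giving up per-line ordering.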
Route around stalled or down links
If a link goes down, adaptive routing mask updated in hardware to exclude it
OS traffic uses adaptive routing only, recovers from finite loss of packets
Quiesce and re-route to repair deterministic routes
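The mask update described above can be pictured as clearing one bit in a bitmask of usable links. This is a minimal sketch that assumes one bit per candidate link; the function names and the 4-link example are hypothetical.

```python
def mark_link_down(mask: int, link: int) -> int:
    """Clear the bit for a failed link so adaptive routing stops choosing it."""
    return mask & ~(1 << link)


def candidate_links(mask: int, num_links: int) -> list:
    """List the links that adaptive routing may still use."""
    return [i for i in range(num_links) if mask & (1 << i)]


mask = 0b1111                    # four links, all usable
mask = mark_link_down(mask, 2)   # link 2 fails
usable = candidate_links(mask, 4)
```

Deterministic routes have no such mask to fall back on, which is consistent with the slide's note that they need a quiesce and re-route to repair.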
Congestion feedback to allow routing around bottlenecks
Potential for improved performance on difficult traffic patterns such as transpose
Packets reordered in receive buffer (DRAM)
Separate notification (completion event) when all stored
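Because adaptively routed packets may arrive in any order, the receiver places each one at its own offset and signals completion only once everything has been stored. The following sketch assumes each packet carries its destination offset and that no packet is delivered twice; the class `ReceiveTracker` is hypothetical.

```python
class ReceiveTracker:
    """Toy receiver: store packets by offset, report completion when all arrived."""

    def __init__(self, total_bytes: int):
        self.buffer = bytearray(total_bytes)
        self.remaining = total_bytes
        self.complete = False

    def on_packet(self, offset: int, payload: bytes) -> None:
        # Arrival order does not matter: each packet knows where it belongs.
        self.buffer[offset:offset + len(payload)] = payload
        self.remaining -= len(payload)  # assumes no duplicate deliveries
        if self.remaining == 0:
            self.complete = True        # stands in for the completion event


tracker = ReceiveTracker(8)
tracker.on_packet(4, b"EFGH")           # second half arrives first
done_after_first = tracker.complete
tracker.on_packet(0, b"ABCD")
done_after_second = tracker.complete
```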
General Network Packet Format
24 bit flit
Maximum size packet is 7+24+1=32 flit Put request of 64 bytes
Minimum is 2 flit Put response
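The 7+24+1 figure can be checked with a few lines of arithmetic. The sketch assumes 7 header units, 1 CRC unit, and 3 payload units per 64-bit word (the data payload layout in this deck splits each word into data[19:0], data[41:20], and data[63:42]). Rounding partial words up is an assumption of this sketch, and the function name is hypothetical.

```python
HEADER_PHITS = 7          # request header, from the 7+24+1 breakdown
CRC_PHITS = 1             # trailing CRC-16
PHITS_PER_WORD = 3        # one 64-bit word occupies three 24-bit units
BITS_PER_PHIT = 24


def put_request_phits(payload_bytes: int) -> int:
    """Total units on the wire for a Put request carrying payload_bytes."""
    words = -(-payload_bytes // 8)  # ceiling division to whole 64-bit words
    return HEADER_PHITS + words * PHITS_PER_WORD + CRC_PHITS


max_packet = put_request_phits(64)
wire_bits = max_packet * BITS_PER_PHIT
payload_fraction = (64 * 8) / wire_bits
```

For the maximum-size request this gives 32 units, or 768 bits on the wire for 512 bits of user data, so about two thirds of the transmitted bits are payload.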
[Figure: general packet layout, 24 bits per phit (bit columns 23 to 0). phit 0: vc, destination[15:0], h, a, r=0, v. phit 1: payload, optional hash bits. phit 2 onward: payload. Last phit: CRC-16, ok. Fields labeled p and c appear in each phit.]
Network Request Packet Format
[Figure: request header layout, phit 0 to phit 6, 24 bits per phit. Labeled fields: vc, destination[15:0], h, a, r=0, v=0, address[23:6], ptag[7:0], F, ca, rmt, b, source[15:0], address[37:24], MDH[11:0], addr[39:38], dt, pt, mask[15:0], addr[45:40], cmd[5:0], BTEvc, DstID, SrcID, vm, ra, SSID[7:0], size, packetID[11:0], reserved. Fields labeled p and c appear in each phit.]
Data Payload (up to 24 phits)
[Figure: payload layout, 24 bits per phit. phit n: data[19:0]. phit n+1: data[41:20]. phit n+2: data[63:42]. Fields labeled p and c appear in the payload phits.]
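The payload layout splits each 64-bit word into slices of 20, 22, and 22 bits. A short sketch of that split and its inverse follows; it models only the data bits and ignores the remaining bits of each 24-bit phit, and the function names are hypothetical.

```python
def split_word(word: int) -> tuple:
    """Split a 64-bit word into data[19:0], data[41:20], data[63:42]."""
    if not 0 <= word < (1 << 64):
        raise ValueError("word must fit in 64 bits")
    low = word & 0xFFFFF             # 20 bits
    mid = (word >> 20) & 0x3FFFFF    # 22 bits
    high = (word >> 42) & 0x3FFFFF   # 22 bits
    return low, mid, high


def join_word(low: int, mid: int, high: int) -> int:
    """Reassemble the 64-bit word from its three slices."""
    return low | (mid << 20) | (high << 42)


original = 0x0123456789ABCDEF
slices = split_word(original)
restored = join_word(*slices)
```

Since 20 + 22 + 22 = 64, three phits carry exactly one word, and eight words (64 bytes) fill the 24 payload phits.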
Automatic link-level retries
HT3 support including automatic retries and improved CRC
Most internal data structures are at least parity protected
The longer the occupancy of data at a location, the stronger the protection
Errors reported as precisely as possible
Payload errors reported directly to user
Control errors often cannot be associated with a particular transaction
In all cases OS or HSS can be notified of the error
Router errors included
Reported at the point of error
Endpoint(s) (user) see a timeout
MPICH
MPICH2
SHMEM
PGAS
DMAPP
User level Gemini Network Interface (uGNI)
MRT-size page support
Registration Cache support
Kernel level GNI (kGNI)
Direct Access
Cray COW solution
Lustre Network Driver (LND)
Direct Access
GART Resource Management (GRM)
IOCTL or System Call
Linux Core
GNI Core
IP over Gemini Fabric (IPoGIF)
Gemini Hardware Abstraction Layer (GHAL)
Latency
Bandwidth
Atomic operations
Gemini expanded to HT3 at up to 5.2 GT/s
Expect to sustain greater than 6 GB/s user data injection
Network bandwidth is limited by XT packaging
Link speed from 3.125 to 6.25 Gbit/sec
In some cases, double wide X & Z links also offer increased bandwidth
Gemini relies on user level threads
MPI processing limits to 2M messages/sec per thread
Scales beyond 10M msg/sec per NIC
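The two message-rate figures imply a minimum thread count. As a back-of-the-envelope sketch, assuming the per-thread limit holds for every thread and that rates add linearly (both simplifications):

```python
import math

PER_THREAD_MSGS_PER_SEC = 2_000_000   # MPI processing limit per thread
NIC_MSGS_PER_SEC = 10_000_000         # rate the NIC is said to scale beyond

# Fewest threads whose combined rate reaches the quoted NIC figure.
threads_needed = math.ceil(NIC_MSGS_PER_SEC / PER_THREAD_MSGS_PER_SEC)
```

Under these assumptions at least five threads must be issuing messages before the NIC, rather than per-thread MPI processing, becomes the limit.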
One way PUT in 750 ns
Waiting for Ack in only 1.1 us
Remote GET increases to 1.4 us
[Figure: latency versus transfer size. Y axis: Time (microsecs), 0.0 to 2.5. X axis: Size (bytes), 8 to 1024. Series: PUT, ping-pong; PUT, at source; GET.]
Peak bandwidth reached with small transfers
Multiple threads reach peak with smaller, still, transfers
[Figure: bandwidth versus transfer size. Y axis: Bandwidth (Mbytes/sec), 0 to 7000. X axis: Size (bytes), 8 to 64K. Series: PPN=1, PPN=2, PPN=4.]
Hot location reaches 100 Mupdates/sec
Random locations (GUPS) still over 45 Mupdates/sec
[Figure: AMO rate versus process count. Y axis: AMO rate (millions), 0 to 120. X axis: Number of processes, 0 to 1024. Series: 1 AMO, 8192 AMOs.]
Gemini provides low latency and performance for fine-grain operations
Gemini has features to scale in performance and reliability to large system sizes
Questions?