Bob Alverson, Duncan Roweth, Larry Kaplan
Cray Inc.
Cray Inc. Hot Interconnects
 Overview
 Network Interface
 Router
 Reliability, Availability, and Serviceability Features
 Software Stack
 Performance
 Integrated NIC and Router
 External HSS Monitoring
 Supports 2 Nodes per ASIC
 Advanced Resiliency Features
 Hardware Global Address Support
 Advanced NIC designed to efficiently support
   MPI
   One-sided MPI
   SHMEM
   UPC, Coarray Fortran
[Figure: network links labeled by dimension (X, Y, Z)]
 Fast Memory Access (FMA) – fine grain remote PUT/GET
 Block Transfer Engine (BTE) – offload for long transfers
 Completion Queue (CQ) – client notification
 Atomic Memory Op (AMO) – fetch&add, etc.
[Figure: NIC block diagram. Labeled blocks: FMA, BTE, AMO, CQ, NAT, RAT, RMT, NPT, SSID, ORB, HARB, TARB, CLM, LM, NL, LB Ring, HT3 Cave, Router Tiles. Labeled paths: net req, net rsp (vc0, vc1), ht p req, ht np req, ht treq p, ht treq np, ht trsp, ht irsp, ht p ireq, ht np ireq.]
 Single-sided
 Processor stores become remote PUT or GET
 FMA descriptors hold state to help determine destination node and memory location
 FMA PUT for short messages
 Uncached processor store to Gemini window translated directly to network packet
 FMA GET allows reverse direction data transfer of 1 to 64 bytes
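
As a rough illustration of the FMA idea above, the following Python sketch models a descriptor that turns a store into a PUT request and a load into a GET request. The class name, field names, and the dictionary "packet" are all hypothetical; they are not the Gemini register layout or packet format, only a way to show how descriptor state plus a window offset could determine the destination node and memory location.

```python
from dataclasses import dataclass


@dataclass
class FmaDescriptor:
    """Hypothetical descriptor state (names are illustrative, not Gemini's)."""
    dest_node: int    # node that owns the target memory
    remote_base: int  # base address of the target region on that node


def store_to_put(desc: FmaDescriptor, window_offset: int, data: bytes) -> dict:
    """Model a processor store into the window as a single PUT request."""
    if not 1 <= len(data) <= 64:
        raise ValueError("this sketch models 1 to 64 bytes per request")
    return {
        "cmd": "PUT",
        "dest": desc.dest_node,
        "addr": desc.remote_base + window_offset,
        "data": bytes(data),
    }


def load_to_get(desc: FmaDescriptor, window_offset: int, nbytes: int) -> dict:
    """Model the reverse direction: a GET request for 1 to 64 bytes."""
    if not 1 <= nbytes <= 64:
        raise ValueError("FMA GET transfers 1 to 64 bytes")
    return {
        "cmd": "GET",
        "dest": desc.dest_node,
        "addr": desc.remote_base + window_offset,
        "size": nbytes,
    }
```

The point of the sketch is that the processor only issues an ordinary store or load; everything needed to address the remote node comes from state set up in the descriptor beforehand.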
 Driver managed
 BTE PUT for long messages
 DMA transfer to offload data movement from processor
 BTE SEND for IP traffic, etc.
 Send message to remote node
 Single receive queue for all sources
 Upper level protocol covers lost messages
 BTE GET support for simplified data transfers
 In lieu of involving remote side for PUT
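
The split between FMA (short messages) and BTE (long messages) can be sketched as a simple size-based dispatch. The function name and the crossover value below are assumptions made for illustration; the slides do not state where the crossover lies.

```python
def choose_engine(nbytes: int, crossover: int = 4096) -> str:
    """Pick a transfer engine by message size.

    crossover is a hypothetical tuning value, not a figure from the slides.
    """
    if nbytes < 0:
        raise ValueError("message size must be non-negative")
    # Short messages: processor-driven FMA. Long messages: offloaded BTE DMA.
    return "FMA" if nbytes < crossover else "BTE"
```

In practice the best crossover would depend on how much processor time the application can spare for data movement.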
 Hardware remote atomic memory operations in the NIC
 Add, Compare & Swap, Logical Operations
 Executed at the node with the memory
 AMO cache for hot locations
 Up to 64 locations with AMOs in process
 Global operations support
 Barriers
 Counters
 Collectives (reductions, global sum)
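
To make the AMO semantics concrete, here is a minimal Python model of atomic operations executed at the node that owns the memory, followed by a counter-style barrier built on fetch-and-add. The class and method names are hypothetical, and the model is single-threaded; it only illustrates the return-old-value semantics, not the hardware or the AMO cache.

```python
MASK64 = (1 << 64) - 1


class HomeNodeMemory:
    """Models the memory-owning node, where each AMO is applied."""

    def __init__(self):
        self.words = {}  # address -> 64-bit value

    def fetch_add(self, addr: int, value: int) -> int:
        """Add value to the word at addr and return the previous contents."""
        old = self.words.get(addr, 0)
        self.words[addr] = (old + value) & MASK64
        return old

    def compare_swap(self, addr: int, expected: int, new: int) -> int:
        """Store new only if the word equals expected; return the old value."""
        old = self.words.get(addr, 0)
        if old == expected:
            self.words[addr] = new & MASK64
        return old


def barrier_arrive(mem: HomeNodeMemory, counter_addr: int, nprocs: int) -> bool:
    """Return True for the last process to arrive at a counting barrier."""
    return mem.fetch_add(counter_addr, 1) == nprocs - 1
```

Because every update to a hot counter is applied at one place, callers never need to read, modify, and write the value across the network themselves.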
 6x8 tile matrix
 Input queue to one of 6 subswitches
 Route to one of 8 output buffers
 Hashed routing preserves the order of packets sent to the same cacheline
 Adaptive routing
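
One way to picture hashed routing: derive the route choice from the cacheline address, so every packet for the same cacheline follows the same path and cannot be reordered relative to the others. The sketch below assumes 64-byte cachelines and uses an arbitrary multiplicative hash; neither the hash function nor the function names come from the slides.

```python
MASK64 = (1 << 64) - 1
CACHELINE_SHIFT = 6  # assumption: 64-byte cachelines


def cacheline_index(address: int) -> int:
    """Drop the byte-within-line bits of an address."""
    return address >> CACHELINE_SHIFT


def pick_route(address: int, num_routes: int) -> int:
    """Choose one of num_routes equivalent paths from the cacheline index."""
    line = cacheline_index(address)
    # Illustrative multiplicative hash; the hardware hash is not described.
    hashed = (line * 0x9E3779B97F4A7C15) & MASK64
    return (hashed >> 32) % num_routes
```

Different cachelines may spread across different routes, while any two addresses inside one cacheline always map to the same route.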
 Route around stalled or down links
 If a link goes down, adaptive routing mask updated in hardware to exclude it
 OS traffic uses adaptive routing only, recovers from finite loss of packets
 Quiesce and re-route to repair deterministic routes
 Congestion feedback to allow routing around bottlenecks
 Potential for improved performance on difficult traffic patterns such as transpose
 Packets reordered in receive buffer (DRAM)
 Separate notification (completion event) when all stored
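
The combination of a routing mask and congestion feedback can be sketched as follows: a bitmask records which output links are usable, a failed link is cleared from the mask, and among the remaining links the least congested one is chosen. The function names and the use of queue depth as the congestion signal are assumptions for illustration.

```python
def mask_out_link(mask: int, link: int) -> int:
    """Clear the bit for a link that has gone down."""
    return mask & ~(1 << link)


def pick_output(mask: int, queue_depths: list) -> int:
    """Choose the least-loaded link among those still enabled in the mask."""
    candidates = [i for i in range(len(queue_depths)) if (mask >> i) & 1]
    if not candidates:
        raise RuntimeError("no usable output link")
    return min(candidates, key=lambda i: queue_depths[i])
```

For example, with four links and queue depths `[5, 1, 3, 2]`, link 1 is chosen; if link 1 then goes down and is masked out, traffic moves to link 3.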
General Network Packet Format
 24-bit flit
 Maximum size packet is the 32-flit Put request of 64 bytes (7 header + 24 data + 1 CRC = 32)
 Minimum is the 2-flit Put response
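
The packet-size arithmetic can be checked with a short Python sketch. It takes the figures from the slide (24-bit phits, 7 header phits, 1 CRC phit, and each 64-bit data word split into fields data[19:0], data[41:20], data[63:42] across three phits). The function names are hypothetical, and applying the same 7-phit header to Put requests smaller than 64 bytes is an assumption.

```python
PHIT_BITS = 24
HEADER_PHITS = 7    # request header, phits 0 to 6
CRC_PHITS = 1       # CRC-16 in the last phit
PHITS_PER_WORD = 3  # one 64-bit data word spans three phits


def put_request_phits(payload_bytes: int) -> int:
    """Phits in a Put request carrying payload_bytes (multiple of 8, up to 64)."""
    if payload_bytes % 8 or not 0 < payload_bytes <= 64:
        raise ValueError("payload must be 8 to 64 bytes in 8-byte words")
    words = payload_bytes // 8
    return HEADER_PHITS + words * PHITS_PER_WORD + CRC_PHITS


def split_word(word: int) -> tuple:
    """Split a 64-bit word into the 20-, 22-, and 22-bit phit fields."""
    return (word & 0xFFFFF, (word >> 20) & 0x3FFFFF, (word >> 42) & 0x3FFFFF)


def join_word(low: int, mid: int, high: int) -> int:
    """Reassemble the 64-bit word from its three phit fields."""
    return low | (mid << 20) | (high << 42)


def payload_efficiency(payload_bytes: int) -> float:
    """Fraction of transmitted bits that are user data."""
    return payload_bytes * 8 / (put_request_phits(payload_bytes) * PHIT_BITS)
```

For a full 64-byte Put this gives 7 + 24 + 1 = 32 phits, or 768 bits on the wire for 512 bits of data, so about two thirds of the transmitted bits are payload.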
[Figure: general packet layout in 24-bit phits (bits 23 to 0). Fields shown: vc, destination[15:0], optional hash bits, payload phits, and CRC-16 in the last phit.]
Network Request Packet Format
[Figure: request header, phits 0 to 6. Fields shown: vc, destination[15:0], address[45:6] (split across several phits), MDH[11:0], source[15:0], ptag[7:0], SSID[7:0], DstID, SrcID, cmd[5:0], mask[15:0], size, BTEvc, packetID[11:0].]
Data Payload (up to 24 phits)
[Figure: each 64-bit data word spans three phits: data[19:0] in phit n, data[41:20] in phit n+1, data[63:42] in phit n+2.]
 Automatic link-level retries
 HT3 support including automatic retries and improved CRC
 Most internal data structures are at least parity protected
 The longer the occupancy of data at a location, the stronger the protection
 Errors reported as precisely as possible
 Payload errors reported directly to user
 Control errors often cannot be associated with a particular transaction
 In all cases OS or HSS can be notified of the error
 Router errors included
 Reported at the point of error
 Endpoint(s) (user) see a timeout
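
Link-level retry can be illustrated with a small Python sketch: the sender appends a CRC-16, the receiver checks it, and the frame is resent when the check fails. The CRC polynomial used here (CRC-CCITT via `binascii.crc_hqx`) is an assumption chosen because it is in the standard library; the slides do not say which CRC-16 Gemini uses, and the function names and retry limit are hypothetical.

```python
import binascii


def add_crc(payload: bytes) -> bytes:
    """Append a 16-bit CRC (CRC-CCITT, an assumed polynomial) to the payload."""
    crc = binascii.crc_hqx(payload, 0xFFFF)
    return payload + crc.to_bytes(2, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, trailer = frame[:-2], frame[-2:]
    return binascii.crc_hqx(payload, 0xFFFF).to_bytes(2, "big") == trailer


def send_with_retry(payload: bytes, channel, max_tries: int = 4):
    """Send over channel (a callable) until the CRC checks out.

    Returns the delivered payload and the number of attempts used.
    """
    frame = add_crc(payload)
    for attempt in range(1, max_tries + 1):
        received = channel(frame)
        if check_crc(received):
            return received[:-2], attempt
    raise RuntimeError("link retry limit reached")
```

A transient bit flip on the first attempt is hidden from the endpoint; only when retries are exhausted does the failure surface, which corresponds to the timeout case described above.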
[Figure: software stack. Components shown:]
 MPICH, MPICH2, SHMEM, PGAS
 DMAPP
 User level Gemini Network Interface (uGNI)
 MRT-size page support
 Registration Cache support
 Kernel level GNI (kGNI)
 Direct Access
 Cray COW solution
 Lustre Network Driver (LND)
 GART Resource Management (GRM)
 IOCTL or System Call
 Linux Core
 GNI Core
 IP over Gemini Fabric (IPoGIF)
 Gemini Hardware Abstraction Layer (GHAL)
 Latency
 Bandwidth
 Atomic operations
 Gemini expanded to HT3 at up to 5.2 GT/s
 Expect to sustain greater than 6 GB/s user data injection
 Network bandwidth is limited by XT packaging
 Link speed from 3.125 to 6.25 Gbit/sec
 In some cases, double wide X & Z links also offer increased bandwidth
 Gemini relies on user level threads
 MPI processing limits to 2M messages/sec per thread
 Scales beyond 10M msg/sec per NIC
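
The message-rate and link-speed figures above imply some simple ratios, worked out in the sketch below. The function names are hypothetical, and the calculation assumes threads scale linearly, which is an idealization rather than a measured result.

```python
import math


def threads_to_reach(nic_rate: float, per_thread_rate: float) -> int:
    """Threads needed to reach nic_rate if each sustains per_thread_rate."""
    return math.ceil(nic_rate / per_thread_rate)


def link_speedup(old_gbit: float, new_gbit: float) -> float:
    """Ratio of new to old per-lane link speed."""
    return new_gbit / old_gbit


# 2M messages/sec per thread versus more than 10M messages/sec per NIC
min_threads = threads_to_reach(10e6, 2e6)
# 3.125 Gbit/sec versus 6.25 Gbit/sec link speed
speedup = link_speedup(3.125, 6.25)
```

Under these assumptions, at least five threads are needed to pass 10M messages/sec on one NIC, and the faster link setting doubles the per-lane rate.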
 One-way PUT in 750 ns
 Waiting for Ack in only 1.1 us
 Remote GET increases to 1.4 us
[Figure: latency plot. Time (microseconds, 0.0 to 2.5) versus Size (bytes, 8 to 1024). Series: PUT, ping-pong; PUT, at source; GET.]
 Peak bandwidth reached with small transfers
 Multiple threads reach peak with still smaller transfers
[Figure: bandwidth plot. Bandwidth (Mbytes/sec, 0 to 7000) versus Size (bytes, 8 to 64K). Series: PPN=1, PPN=2, PPN=4.]
 Hot location reaches 100 Mupdates/sec
 Random locations (GUPS) still over 45 Mupdates/sec
[Figure: AMO rate plot. AMO rate (millions, 0 to 120) versus Number of processes (0 to 1024). Series: 1 AMO; 8192 AMOs.]
 Gemini provides low latency and performance for fine-grained operations
 Gemini has features to scale in performance and reliability to large system sizes
 Questions?