Chapter 7 (excl. 7.9): Scalable Multiprocessors



Chapter 7 (excl. 7.9):
Scalable Multiprocessors
EECS 570: Fall 2003 -- rev3
1
Outline
Scalability
– bus doesn't hack it: need scalable interconnect (network)
Realizing Programming Models on Scalable Systems
– network transactions
– protocols
• shared address space
• message passing (synchronous, asynchronous, etc.)
• active messages
– safety: buffer overflow, fetch deadlock
Communication Architecture Design Space
– where does network attach to processor node?
– how much hardware interpretation of the network transaction?
– impact on cost & performance
EECS 570: Fall 2003 -- rev3
2
Scalability
How do bandwidth, latency, cost, and packaging scale with P?
• Ideal:
– latency, per-processor bandwidth, per-processor cost are constants
– packaging does not create upper bound, does not exacerbate others
• Bus:
– per-processor BW scales as 1/P
– latency increases with P:
• queuing delays for fixed bus length (linear at saturation?)
• as bus length is increased to accommodate more CPUs, clock must slow
• Reality:
– “scalable” may just mean sub-linear dependence (e.g., logarithmic)
– practical limits ($/customer), sweet spot + growth path
– switched interconnect (network)
EECS 570: Fall 2003 -- rev3
3
Aside on Cost-Effective Computing
Traditional view: efficiency = speedup(P)/P
Efficiency < 1 → parallelism not worthwhile?
But much of a computer's cost is NOT in the processor (memory, packaging, interconnect, etc.)
• [Wood & Hill, IEEE Computer 2/95]
Let Costup(P) = Cost(P)/Cost(1)
Parallel computing cost-effective:
Speedup(P) > Costup(P)
E.g. for SGI PowerChallenge w/500MB:
Costup(32) = 86
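As a toy calculation of this test (the cost and speedup figures below are made-up assumptions, not numbers from the paper), a run with efficiency well under 1 can still be cost-effective:

/* Toy Wood & Hill cost-effectiveness check.
 * Cost and speedup figures are illustrative assumptions only. */
#include <stdio.h>

int main(void) {
    double cost_base    = 40000.0;  /* assumed cost of a 1-CPU system w/ large memory */
    double cost_per_cpu =  4000.0;  /* assumed incremental cost per added CPU */
    int    P            = 32;
    double speedup      = 20.0;     /* assumed speedup on P CPUs (efficiency ~0.6) */

    double costup = (cost_base + cost_per_cpu * (P - 1)) / cost_base;   /* = 4.1 */
    printf("Speedup(%d)=%.1f  Costup(%d)=%.1f  -> %s\n", P, speedup, P, costup,
           speedup > costup ? "cost-effective" : "not cost-effective");
    return 0;
}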
EECS 570: Fall 2003 -- rev3
4
Network Transaction Primitive
[Figure: a serialized message travels from the source node's output buffer across the communication network to the destination node's input buffer]
one-way transfer of information from source to destination
causes some action at the destination
• process info and/or deposit in buffer
• state change (e.g., set flag, interrupt program)
• maybe initiate reply (separate network transaction)
EECS 570: Fall 2003 -- rev3
5
Network Transaction Correctness Issues
• protection
– User/user, user/system -- what if VM doesn't apply?
– Fault containment (large machine component MTBF may be low)
• format
– Variable length? Header info?
– Affects efficiency, ability to handle in HW
• buffering/flow control
– Finite buffering in network itself
– Messages show up unannounced (e.g., many-to-one pattern)
• deadlock avoidance
– If you're not careful -- details later
• action
– What happens on delivery? How many options are provided?
• system guarantees
– Delivery, ordering
EECS 570: Fall 2003 -- rev3
6
Performance Issues
Key parameters: latency, overhead, bandwidth
LogP: lat/over/gap(BW) as function of P
[Figure: message path from source CPU through source NI, across the network, to destination NI and destination CPU]
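As a back-of-the-envelope reading of the model (standard LogP, not course-specific data): with send/receive overheads o_send and o_recv, latency L, and gap g, a small one-way transaction costs roughly

    T_one-way ≈ o_send + L + o_recv

and each processor can issue at most one message per max(g, o), which bounds per-processor bandwidth.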
EECS 570: Fall 2003 -- rev3
7
Programming Models
What is user's view of network transaction?
– Depends on system architecture
– Remember layered approach: OS/compiler/library may implement alternative model on top of HW-provided model
We'll look at three basic ones:
– Active Messages: “assembly language” for msg-passing
– Message Passing: MPI-style interface, as seen by appl. programmers
– Shared Address Space: ignoring cache coherence for
now
EECS 570: Fall 2003 -- rev3
8
Active Messages
[Figure: a request message invokes a request handler at the receiver; that handler may issue a reply, which invokes a reply handler back at the sender]
User-level analog of network transaction
– invoke handler function at receiver to extract packet from network
– grew out of attempts to do dataflow on msg-passing machines & remote
procedure calls
– handler may send reply, but no other messages
– Event notification: interrupts, polling, events?
– May also perform memory-to-memory transfer
Flexible (can do almost any action on msg reception), but requires tight
cooperation between CPU and network for high performance
– May be better to have HW do a few things faster
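A minimal sketch of the idea, with the “network” faked as an in-process queue so the example is self-contained; all function names here are invented for illustration, not any real AM library's API:

#include <stdio.h>
#include <string.h>

typedef void (*am_handler_t)(int src_node, const void *args, int nbytes);

typedef struct { int src, dst, n; am_handler_t h; char args[64]; } am_msg_t;
static am_msg_t q[256];
static int head, tail;

static void am_send(int src, int dst, am_handler_t h, const void *args, int n) {
    am_msg_t *m = &q[tail++ % 256];
    m->src = src; m->dst = dst; m->h = h; m->n = n;
    if (n) memcpy(m->args, args, (size_t)n);
}

/* Poll the "network": run the handler of each pending message. */
static void am_poll(void) {
    while (head < tail) { am_msg_t *m = &q[head++ % 256]; m->h(m->src, m->args, m->n); }
}

/* Example: remote increment.  The request handler updates local state and
 * replies; by the AM rules a handler may reply but not issue new requests. */
static long counter;
static int  acked;
static void ack_handler(int src, const void *a, int n) { (void)src; (void)a; (void)n; acked = 1; }
static void inc_handler(int src, const void *a, int n) {
    (void)n;
    counter += *(const long *)a;
    am_send(1, src, ack_handler, NULL, 0);   /* reply from "node 1" back to the requester */
}

int main(void) {
    long delta = 5;
    am_send(0, 1, inc_handler, &delta, sizeof delta);   /* request: node 0 -> node 1 */
    am_poll();
    printf("counter=%ld acked=%d\n", counter, acked);
    return 0;
}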
EECS 570: Fall 2003 -- rev3
9
Message Passing
Basic idea:
– Send(dest, tag, buffer) -- tag is arbitrary integer
– Recv(src, tag, buffer ) -- src/tag may be wildcard (“any”)
Completion semantics:
– Receive completes after data transfer complete from matching
send
– Synchronous send completes after matching receive and data
sent
– Asynchronous send completes after send buffer may be reused
• msg may simply be copied into alternate buffer, on src or dest node
Blocking vs. non-blocking:
– does function wait for “completion” before returning
– non-blocking: extra function calls to check for completion
– assume blocking for now
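The Send/Recv interface above maps directly onto MPI's blocking calls; a minimal two-process sketch (assumes an MPI installation; compile with mpicc, run with mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* blocking send: returns once buf may be reused (asynchronous completion) */
        MPI_Send(buf, 4, MPI_INT, /*dest=*/1, /*tag=*/99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        /* src and tag may be wildcards (MPI_ANY_SOURCE / MPI_ANY_TAG) */
        MPI_Recv(buf, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        printf("rank 1 got tag %d from rank %d\n", st.MPI_TAG, st.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}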
EECS 570: Fall 2003 -- rev3
10
Synchronous Message Passing
Three-phase operation: ready-to-send, ready-to-receive, transfer
– Can skip 1st phase if receiver initiates & specifies source
Overhead, latency tend to be high
Transfer can achieve high bandwidth w/sufficient msg length
Programmer must avoid deadlock (e.g. pairwise exchange)
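The pairwise-exchange pitfall and one standard fix, sketched with MPI (MPI_Ssend forces synchronous completion; MPI_Sendrecv lets the library pair up the two transfers):

#include <mpi.h>

int main(int argc, char **argv) {
    int rank, sendval, recvval;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int partner = rank ^ 1;              /* run with 2 ranks: 0 and 1 exchange */
    sendval = rank;

    /* Deadlocks if both ranks send first with synchronous semantics:
     *   MPI_Ssend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
     *   MPI_Recv (&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     * Safe alternative: */
    MPI_Sendrecv(&sendval, 1, MPI_INT, partner, 0,
                 &recvval, 1, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}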
EECS 570: Fall 2003 -- rev3
11
Asynch. Msg Passing: Conservative
Same as synchronous, except msg can be buffered on sender
– Allows computation to continue sooner
– Deadlock still (not so much) an issue -- buffering is finite
EECS 570: Fall 2003 -- rev3
12
Asynch. Message Passing: Optimistic
Sender just ships data, hopes receiver can handle it
Benefit: lower latency
Problems:
– receive was posted: need fast lookup while data streams in
– receive not posted: buffer? nack? discard?
EECS 570: Fall 2003 -- rev3
13
Key Features of Msg Passing Abstraction
Source knows send data address, dest. knows
receive data address
– after handshake they both know both
Arbitrary storage for asynchronous protocol
– may post many sends before any receives
Fundamentally a 3-phase transaction
– can use optimistic 1-phase in limited cases
Latency, overhead tend to be higher than SAS:
high BW easier
Hardware support?
– DMA: physical or virtual (better/harder)
EECS 570: Fall 2003 -- rev3
14
Shared Address Space Abstraction
Two-way request/response protocol
– reads require data response
– writes have acknowledgment (for consistency)
Issues
– virtual or physical address on net? (where does
translation happen)
– coherence, consistency, etc. (later)
EECS 570: Fall 2003 -- rev3
15
Key Properties of SAS Abstraction
Data addresses are specified by the source of the request
– no dynamic buffer allocation
– protection achieved through virtual memory translation
Low overhead initiation: one instruction (load or store)
High bandwidth more challenging
– may require prefetching, separate “block transfer engine”
Synchronization less straightforward (no explicit event
notification)
Simple request-response pairs
– few fixed message types
– practical to implement in hardware w/o remote CPU involvement
Input buffering / flow control issue
– what if request rate exceeds local memory bandwidth?
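In the SAS abstraction a remote access really is just a load or store to an address whose translation points at another node's memory; the hardware turns it into a request/response transaction. A minimal sketch, assuming remote_base points into such a globally mapped region (how that mapping is established is system-specific, e.g. a SHMEM-style or NUMA allocator):

#include <stdint.h>

double read_remote(volatile double *remote_base, long index) {
    /* one load instruction; the communication assist issues the read request
       and the processor waits for the data response */
    return remote_base[index];
}

void write_remote(volatile double *remote_base, long index, double v) {
    /* one store; the write is acknowledged for ordering/consistency */
    remote_base[index] = v;
}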
EECS 570: Fall 2003 -- rev3
16
Challenge 1: Input Buffer Overflow
Options
• refuse input when full
– creates “back pressure” (in reliable network)
– to avoid deadlock:
• low-level ack/nack
– assumes dedicated network path for ack/nack (common in rings)
– retry on nack
• drop packets
– retry on timeout
• avoid by reserving space per source (“credit-based”)
– when available for reuse?
– scalability?
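A minimal sketch of the credit-based option (all names and the per-source constant are illustrative; real systems piggyback credit returns on other traffic):

#define NODES   64
#define CREDITS  4                      /* input-buffer slots reserved per source */

static int credits_to[NODES];           /* sender side: credits remaining per destination */

void credits_init(void) {
    for (int i = 0; i < NODES; i++) credits_to[i] = CREDITS;
}

/* Sender: a message may be injected only if a reserved slot is available. */
int try_send(int dest /*, payload omitted */) {
    if (credits_to[dest] == 0) return 0;    /* no credit: retry later, no overflow possible */
    credits_to[dest]--;
    /* net_inject(dest, ...);                  actual injection is system-specific */
    return 1;
}

/* Sender: a credit returned by 'dest' (after it drains a slot) restores one slot. */
void on_credit_returned(int dest) {
    credits_to[dest]++;
}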
EECS 570: Fall 2003 -- rev3
17
Challenge 2: Fetch Deadlock
Processing a message may require sending a reply; what if
reply can't be sent due to input buffer overflow?
– step 1: guarantee that replies can be sunk @ destination
• requester reserves buffer space for reply
– step 2: guarantee that replies can be sent into network
• backpressure: logically independent request/reply networks
– physical networks or virtual channels
– credit-based: bound outgoing requests to K per node
• buffer space for K(P-1) requests + K responses at each node
– low-level ack/nack, packet dropping
• guarantee that replies will never be nacked or dropped
For cache coherence protocols, some requests may
require more: forward request to other node, send
multiple invalidations
– must extend techniques or nack these requests up front
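For a sense of scale (illustrative numbers, not from the text): with P = 64 nodes, K = 4 outstanding requests per node, and 64 B buffer slots, each node reserves

    (K(P-1) + K) x 64 B = (4*63 + 4) x 64 B = 16 KB

and since this grows linearly in P, the reservation becomes the scalability concern on very large machines.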
EECS 570: Fall 2003 -- rev3
18
Outline
Scalability [7.1 - read]
Realizing Programming Models
– network transactions
– protocols: SAS, MP, Active Messages
– safety: buffer overflow, fetch deadlock
Communication Architecture Design Space
– where does hardware fit into node architecture?
– how much hardware interpretation of the network transaction?
– how much gap between hardware and user semantics?
• remainder must be done in software
• increased flexibility, increased latency & overhead
– main CPU or dedicated/specialized processor?
EECS 570: Fall 2003 -- rev3
19
Massively Parallel Processor (MPP) Architectures
• Network interface typically close to processor
– Memory bus:
• locked to specific processor architecture/bus protocol
– Registers/cache:
• only in research machines
• Time-to-market is long
– processor already available, or work closely with processor designers
• Maximize performance and cost
[Figure: MPP node -- processor and cache on the memory bus together with the network interface, main memory, and an I/O bridge; disk controller and disks hang off the I/O bus; the network interface connects to the network]
EECS 570: Fall 2003 -- rev3
20
Network of Workstations
Network interface on I/O bus
Standards (e.g., PCI) => longer life, faster to market
Slow (microseconds) to access network interface
“System Area Network” (SAN): between LAN & MPP
[Figure: workstation node -- processor (with interrupts) and cache connect through the core chip set to main memory and the I/O bus; disk controller, graphics controller, and the network interface sit on the I/O bus, with the network interface attached to the network]
EECS 570: Fall 2003 -- rev3
21
Transaction Interpretation
Simple: HW doesn't interpret much if anything
– DMA from/to buffer, interrupt or set flag on completion
– nCUBE, conventional LAN
– Requires OS for address translation, often a user/kernel copy
User-level messaging: get the OS out of the way
– HW does protection checks to allow direct user access to
network
– may have minimal interpretation otherwise
– May be on I/O bus (Myrinet), memory bus (CM-5), or in regs (J-Machine, *T)
– May require CPU involvement in all data transfers (explicit
memory-to-network copy)
EECS 570: Fall 2003 -- rev3
22
Transaction Interpretation (cont'd)
Virtual DMA: get the CPU out of the way (maybe)
– basic protection plus address translation: user-level bulk DMA
– Usually to limited region of addr space (pinned)
– Can be done in hardware (VIA, Meiko CS-2) or software (some
Myrinet, Intel Paragon)
Reflective memory
– DEC memory channel, Princeton SHRIMP
Global physical address space (NUMA): everything in
hardware
– complexity increases, but performance does too (if done right)
Cache coherence: even more so
– stay tuned
EECS 570: Fall 2003 -- rev3
23
Net Transactions: Physical DMA
[Figure: physical DMA -- the sending processor issues a Cmd; its DMA channel takes Addr, Length, Rdy and streams the data (with Dest header) from memory into the network; the receiving DMA channel deposits the message into memory and raises status/interrupt]
• Physical addresses: OS must initiate transfers
– system call per message on both ends: ouch
• Sending OS copies data to kernel buffer w/ header/trailer
– can avoid copy if interface does scatter/gather
• Receiver copies packet into OS buffer, then interprets
– user message then copied (or mapped) into user space
EECS 570: Fall 2003 -- rev3
24
nCUBE/2 Network Interface
independent DMA channel per link direction
segmented messages: can inspect header to direct
remainder of DMA directly to user buffer
– avoids copy at expense of extra interrupt + DMA setup cost
– can't let buffer be paged out (did nCUBE have VM?)
EECS 570: Fall 2003 -- rev3
25
Conventional LAN Network Interface
[Figure: conventional LAN NIC -- host memory holds linked descriptor rings (Addr, Len, Status, Next) for TX and RX; the NIC controller DMAs packet data over the I/O bus between host memory and the network, while the processor sits on the memory bus]
EECS 570: Fall 2003 -- rev3
26
User Level Messaging
[Figure: user-level messaging -- the sending processor writes data with a Dest header (tagged user/system) from its memory directly into the network; at the destination it lands in memory and raises status/interrupt]
• map network hardware into user’s address space
– talk directly to network via loads & stores
• user-to-user communication without OS intervention: low latency
• protection: user/user & user/system
• DMA hard… CPU involvement (copying) becomes bottleneck
EECS 570: Fall 2003 -- rev3
27
User Level Network Ports
[Figure: user-level network ports -- net output and net input ports appear in the process's virtual address space alongside the processor's status, registers, and program counter]
Appears to user as logical message queues plus status
EECS 570: Fall 2003 -- rev3
28
Example: CM-5
• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch
[Figure: CM-5 organization -- diagnostics, control, and data networks connect processing partitions, control processors, and an I/O partition; each node holds a SPARC CPU, FPU, caches, SRAM, an NI on the MBUS, vector units, and DRAM controllers with DRAM]
EECS 570: Fall 2003 -- rev3
29
User Level Handlers
[Figure: user-level handlers -- a message carries a user/system tag, data, an address, and a dest; it moves between the memory and processor of the two nodes, and delivery dispatches to the address carried in the message]
• Hardware support to vector to address specified in message
– message ports in registers
– alternate register set for handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
EECS 570: Fall 2003 -- rev3
30
J-Machine
• Each node a small message-driven processor
• HW support to queue msgs
and dispatch to msg handler
task
EECS 570: Fall 2003 -- rev3
31
Dedicated Message Processing Without Specialized Hardware
[Figure: each node pairs a user compute processor P with a system-level message processor MP; both share Mem and the NI, which connects to the network]
• General purpose processor performs arbitrary output processing (at system
level)
• General purpose processor interprets incoming network transactions (in
system)
• User Processor <–> Msg Processor share memory
• Msg Processor <–> Msg Processor via system network transaction
EECS 570: Fall 2003 -- rev3
32
Levels of Network Transaction
[Figure: nodes with Mem, NI, compute processor P, and message processor MP, exchanging transactions over the network]
• User Processor stores cmd / msg / data into shared output
queue
– must still check for output queue full (or grow dynamically)
• Communication assists make transaction happen
– checking, translation, scheduling, transport, interpretation
• Avoid system call overhead
• Multiple bus crossings likely bottleneck
EECS 570: Fall 2003 -- rev3
33
Example: Intel Paragon
[Figure: Intel Paragon node -- two i860XP processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) on a 400 MB/s memory bus act as compute processor P and message processor MP; the NI, with 2048 B buffering and sDMA/rDMA engines, connects to the 175 MB/s duplex network (messages carry rte, MP handler, variable data, EOP); I/O nodes attach devices]
EECS 570: Fall 2003 -- rev3
34
EECS 570: Fall 2003 -- rev3
35
Dedicated MP w/specialized NI:
Meiko CS-2
• Integrate message processor into network
interface
– active messages-like capability
– dedicated threads for DMA, reply handling, simple
remote memory access
– supports user-level virtual DMA
• own page table
• can take a page fault, signal OS, restart
– meanwhile, nack other node
• Problem: processor is slow, time-slices threads
– fundamental issue with building your own CPU
EECS 570: Fall 2003 -- rev3
36
Myricom Myrinet (Berkeley NOW)
• Programmable network interface on I/O Bus (Sun SBUS or
PCI)
– embedded custom CPU (“Lanai”, ~40 MHz RISC CPU)
– 256KB SRAM
– 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
– includes source-based routing protocol
• SRAM pages can be mapped into user space
– separate pages for separate processes
– firmware can define status words, queues, etc.
• data for short messages or pointers for long ones
• firmware can do address translation too… w/OS help
– poll to check for sends from user
• Bottom line: I/O bus still bottleneck, CPU could be faster
EECS 570: Fall 2003 -- rev3
37
Shared Physical Address Space
• Implement SAS model in hardware w/o caching
– actual caching must be done by copying from remote memory to local
– programming paradigm looks more like message passing than Pthreads
• yet, low latency & low overhead transfers thanks to HW interpretation; high
bandwidth too if done right
• result: great platform for MPI & compiled data-parallel codes
• Implementation:
– “pseudo-memory” acts as memory controller for remote mem, converts
accesses to network transaction (request)
– “pseudo-CPU” on remote node receives requests, performs on local
memory, sends reply
– split-transaction or retry-capable bus required (or dual-ported mem)
EECS 570: Fall 2003 -- rev3
38
Example: Cray T3D
• Up to 2,048 Alpha 21064s
– no off-chip L2 to avoid inherent latency
• In addition to remote mem ops, includes:
– prefetch buffer (hide remote latency)
– DMA engine (requires OS trap)
– synchronization operations (swap, fetch&inc, global AND/OR)
– message queue (requires OS trap on receiver)
• Big problem: physical address space
– 21064 supports only 32 bits
– 2K-node machine limited to 2M per node
– external “DTB annex” provides segment-like registers for extended
addressing, but management is expensive & ugly
EECS 570: Fall 2003 -- rev3
39
Cray T3E
• Similar to T3D, uses Alpha 21164 instead of 21064 (on-chip L2)
– still has physical address space problems
• E-registers for remote communication and synchronization
– 512 user, 128 system; 64 bits each
– replace/unify DTB Annex, prefetch queue, block transfer engine, and
remote load / store, message queue
– Address specifies source or destination E-register and command
– Data contains pointer to block of 4 E-regs and index for centrifuge
• Centrifuge
– supports data distributions used in data-parallel languages (HPF)
– 4 E-regs for global memory operation: mask, base, two arguments
• Get & Put Operations
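The get/put style lives on as (Open)SHMEM, which descends from the T3D/T3E interface. A minimal sketch, assuming an OpenSHMEM implementation (e.g. oshcc/oshrun); the variable names are illustrative:

/* Each PE deposits a value directly into its neighbor's memory with a
 * one-sided put; no receive is posted on the target. */
#include <shmem.h>
#include <stdio.h>

static long remote_val = -1;        /* static => symmetric: exists on every PE */

int main(void) {
    shmem_init();
    int me = shmem_my_pe(), npes = shmem_n_pes();

    shmem_long_p(&remote_val, 10L * me, (me + 1) % npes);  /* put to the next PE */
    shmem_barrier_all();                                   /* ensure all puts are visible */
    printf("PE %d sees remote_val = %ld\n", me, remote_val);

    shmem_finalize();
    return 0;
}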
EECS 570: Fall 2003 -- rev3
40
T3E (continued)
• Atomic Memory operations
– E-registers & centrifuge used
– F&I, F&Add, Compare&Swap, Masked_Swap
• Messaging
– arbitrary number of queues (user or system)
– 64-byte messages
– create msg queue by storing message control word to memory
location
• Msg Send
– construct data in aligned block of 8 E-regs
– send like put, but dest must be message control word
– processor is responsible for queue space (buffer management)
• Barrier and Eureka synchronization
EECS 570: Fall 2003 -- rev3
41
DEC Memory Channel (Princeton SHRIMP)
[Figure: reflective memory -- a send region in the sender's virtual address space maps through physical memory to a receive region in the receiver's physical memory and virtual address space]
• Reflective Memory
• Writes on Sender appear in Receiver’s memory
– send & receive regions
– page control table
• Receive region is pinned in memory
• Requires duplicate writes, really just message buffers
EECS 570: Fall 2003 -- rev3
42
Performance of Distributed Memory
Machines
• Microbenchmarking
• One-way latency of small (five-word)
message
– echo test
– round-trip divided by 2
• Shared Memory remote read
• Message Passing Operations
– see text
EECS 570: Fall 2003 -- rev3
43
Network Transaction Performance
[Figure 7.31: network transaction time in microseconds, broken into Os, Or, L, and Gap, for CM-5, Paragon, CS-2, NOW, and T3D]
EECS 570: Fall 2003 -- rev3
44
Remote Read Performance
[Figure 7.32: remote read time in microseconds, broken into Issue, L, and Gap, for CM-5, Paragon, CS-2, NOW, and T3D]
EECS 570: Fall 2003 -- rev3
45
Summary of Distributed Memory
Machines
• Convergence of architectures
– everything “looks basically the same”
– processor, cache, memory, communication assist
• Communication Assist
– where is it? (I/O bus, memory bus, processor registers)
– what does it know?
• does it just move bytes, or does it perform some functions?
– is it programmable?
– does it run user code?
• Network transaction
– input & output buffering
– action on remote node
EECS 570: Fall 2003 -- rev3
46