ECE 259 / CPS 221
Advanced Computer Architecture II
(Parallel Computer Architecture)
Machine Organizations
Copyright 2004 Daniel J. Sorin
Duke University
Slides are derived from work by
Sarita Adve (Illinois), Babak Falsafi (CMU),
Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve
Reinhardt (Michigan), and J. P. Singh (Princeton).
Thanks!
Outline
• System Organization Taxonomy
• Distributed Memory Systems (w/o cache coherence)
• Clusters and Networks of Workstations
• Presentations of Research Papers
A Taxonomy of System Organizations
• Flynn’s Taxonomy of Organizations
– SISD = Single Instruction, Single Data = uniprocessor
– SIMD = Single Instruction, Multiple Data
– MISD = Multiple Instruction, Single Data = doesn’t exist
– MIMD = Multiple Instruction, Multiple Data
• This course addresses SIMD and MIMD machines
SIMD
• Single Instruction, Multiple Data
• E.g., every processor executes an Add on diff data
• Matches certain programming models
– E.g., High performance Fortran’s “Forall”
• For data parallel algorithms, SIMD is very effective
– Else, it doesn’t help much (Amdahl’s Law!)
• Note: SPMD is programming model, not organization
• SIMD hardware implementations
– Vector machines
– Instruction set extensions, like Intel’s MMX
– Digital Signal Processors (DSPs)
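As a minimal sketch of the instruction-set-extension style of SIMD (using SSE2, a successor to the MMX extensions above), one instruction adds four pairs of 32-bit integers:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* One Add instruction, four data elements (SIMD). */
    void simd_add4(const int a[4], const int b[4], int out[4])
    {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vc = _mm_add_epi32(va, vb);     /* 4 adds in one instruction */
        _mm_storeu_si128((__m128i *)out, vc);
    }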
MIMD
• Multiple Instruction, Multiple Data
• Completely general multiprocessor system model
– Lots of ways to design MIMD machines
• Every processor executes own program w/own data
• MIMD machines can handle many program models
– Shared memory
– Message passing
– SPMD (using shared memory or message passing)
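As a sketch of the SPMD model on a message-passing MIMD machine, here is a minimal MPI program: every processor runs the same executable but branches on its rank:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, n;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &n);
        if (rank == 0) {                        /* same program, own data */
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 of %d received %d\n", n, msg);
        }
        MPI_Finalize();
        return 0;
    }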
Outline
• System Organization Taxonomy
• Distributed Memory Systems (w/o cache coherence)
• Clusters and Networks of Workstations
• Presentations of Research Papers
Distributed Memory Machines
• Multiple nodes connected by network:
– Node = processor, caches, memory, communication assist
• Design issue #1: Communication assist
– Where is it? (I/O bus, memory bus, processor registers)
– What does it know?
» Does it just move bytes, or does it perform some functions?
– Is it programmable?
– Does it run user code?
• Design issue #2: Network transactions
– Input & output buffering
– Action on remote node
These two issues are not independent!
Ex 1: Massively Parallel Processor (MPP)
• Network interface typically close to processor
– Connected to memory bus:
» Locked to specific processor architecture/bus protocol
– Connected to registers/cache:
» Only in research machines
• Time-to-market is long
– Processor already available, or must work closely with processor designers
• Maximize performance (cost is no object!)
[Figure: node with processor, cache, network interface, and main memory on the memory bus; an I/O bridge connects to the I/O bus with its disk controller and disks; the network interface attaches directly to the network]
Ex 2: Network of Workstations
• Network interface on I/O bus
• Standards (e.g., PCI) => longer life, faster to market
• Slow (microseconds) to access network interface
• “System Area Network” (SAN): between LAN & MPP
[Figure: node with processor, cache, core chip set, and main memory; the disk controller, graphics controller, and network interface sit on the I/O bus, and the network interface interrupts the processor]
Spectrum of Communication Assist Designs
1) None: physical bit stream
– Physical DMA using OS (nCUBE, iPSC, ...)
2) User-level communication
– User-level port (CM-5, *T)
– User-level handler (J-Machine, Monsoon, ...)
3) Communication co-processor
– Processing, translation (Paragon, Meiko CS-2, Myrinet)
– Reflective memory (Memory Channel, SHRIMP)
4) Global physical address
– Proc + Memory controller (RP3, BBN, T3D, T3E)
5) Cache-to-cache (later)
– Cache controller (most current MPs: Sun, HP)
Network Transaction Interpretation
1) Simplest MP: communication assist doesn’t
interpret much if anything
– DMA from/to buffer, then interrupt or set flag on completion
2) User-level messaging: get the OS out of the way
– Assist does protection checks to allow direct user access to
network (e.g., via memory mapped I/O)
3) Dedicated communication co-processor: get the
CPU out of the way (if possible)
– Basic protection plus address translation: user-level bulk DMA
4) Global physical address space (NUMA): everything
in hardware
– Complexity increases, but performance does too (if done right)
5) Cache coherence: even more so
– Stay tuned
(1) No Comm Assist: Physical DMA (C&S 7.3)
[Figure: two nodes (processor + memory) connected by the network; each node has DMA channels with Addr/Length/Rdy registers; the sender issues a Cmd with Data and Dest, and the receiver gets Status/interrupt on arrival]
• Physical addresses: OS must initiate transfers
– System call per message on both ends: ouch!
• Sending OS copies data to OS buffer w/ header/trailer
• Receiver copies packet into OS buffer, then interprets
– User message then copied (or mapped) into user space
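A sketch of the send path implied above; the helper names (kalloc_pkt, dma_start, etc.) are hypothetical, and the point is the system call plus the extra copy:

    /* Hypothetical OS send path: one syscall per message, and the data
     * is copied into an OS buffer before DMA can begin. */
    int sys_net_send(int dest, const void *user_buf, unsigned len)
    {
        struct pkt *p = kalloc_pkt(len);            /* hypothetical allocator */
        fill_header(p, dest, len);                  /* OS adds header/trailer */
        copy_from_user(p->payload, user_buf, len);  /* copy: user -> OS buffer */
        dma_start(p);                               /* DMA engine moves the bytes */
        return wait_for_completion(p);              /* interrupt/flag on done */
    }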
(2) User Level Communication (C&S 7.4)
• Problem with raw DMA: too much OS activity
• Solution: let user-to-user communication skip the OS
– Can’t skip OS if communication is user-system or system-user
• Two flavors:
– User-level ports => map network into user space
– User-level handlers => message reception triggers user handler
(2a) User Level Network Ports (C&S 7.4.1)
[Figure: two nodes (processor + memory); a message with Data and Dest crosses the user/system boundary directly between network ports, with Status/interrupt on arrival]
• Map network hardware into user’s address space
– Talk directly to network via loads & stores
• User-to-user communication without OS intervention
– Low latency
• Communication assist must protect users & system
• DMA is hard … CPU involvement (copying) becomes the bottleneck
(2a) User Level Network Ports
[Figure: net output port and net input port appear in the user’s virtual address space, alongside the processor’s status registers and program counter]
• Appears to user as logical message queues plus
status register
• Sender and receiver must coordinate
– What happens if receiver doesn’t pop messages?
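A sketch of user code against such a port; the mapped addresses and status bits are made up, but the mechanism (plain loads and stores, no OS on the path) is the one described above:

    /* Hypothetical memory-mapped NI registers. */
    volatile unsigned *tx_fifo = (unsigned *)0x40000000; /* net output port */
    volatile unsigned *rx_fifo = (unsigned *)0x40000004; /* net input port  */
    volatile unsigned *status  = (unsigned *)0x40000008; /* status register */
    #define TX_FULL  0x1
    #define RX_EMPTY 0x2

    void send_word(unsigned dest, unsigned w)
    {
        while (*status & TX_FULL) ;  /* blocks if receiver never pops! */
        *tx_fifo = dest;             /* plain stores into the output queue */
        *tx_fifo = w;
    }

    unsigned recv_word(void)
    {
        while (*status & RX_EMPTY) ; /* poll for arrival */
        return *rx_fifo;             /* plain load pops the input queue */
    }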
Example: Thinking Machines CM-5 (C&S 7.4.2)
• Input and output FIFO for each network
• Two networks
– Request & Response
• Save/restore network buffers on context switch
• More later on CM-5!
[Figure: machine divided into processing partitions, control processors, and an I/O partition, joined by data, control, and diagnostics networks; each node has a SPARC processor, FPU, cache and SRAM, an NI on the MBUS, and DRAM banks with vector units]
(2b) User Level Handlers (C&S 7.4.3)
• Like user-level ports, but tighter coupling between
port and processor
• Ports are mapped to processor registers, instead of
to special memory regions
• Hardware support to vector to address specified in
message: incoming message directs execution
• Examples: J-Machine, Monsoon, *T (MIT), iWARP
(CMU)
Example: J-Machine (MIT)
• Each node is a small message-driven processor
• HW support to queue msgs and dispatch to msg handler task
(Brief) Example: Active Messages
• Messages are “active” – they contain instructions for
what the receiving processor should do
• Optional reading assignment on Active Messages
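A sketch of the idea (illustrative, not the actual Active Messages API): each message names the handler the receiver should run, which works because SPMD gives every node the same code image:

    /* Illustrative active-message format: the message carries its handler. */
    typedef void (*am_handler_t)(void *arg, unsigned len);

    struct active_msg {
        am_handler_t handler;    /* valid remotely under SPMD: same code image */
        unsigned     len;
        char         arg[64];    /* small inline payload */
    };

    /* Receiver does no interpretation beyond dispatching to the handler. */
    void am_dispatch(struct active_msg *m)
    {
        m->handler(m->arg, m->len);
    }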
(3) Dedicated Communication Co-Processor
[Figure: nodes on the network, each with memory, NI, a compute processor P running user code, and a message processor MP running system code, all sharing the memory bus]
• Use a programmable message processor (MP)
– MP runs software that does communication assist work
• MP and CPU communicate via shared memory
(3) Dedicated Communication Co-Processor
[Figure: same organization as above; the user processor P deposits a message (with dest) in memory shared with the MP, which moves it through the NI onto the network]
• User processor stores cmd / msg / data into shared output queue
– Must still check for output queue full (or have it grow dynamically)
• Message processors make transaction happen
– Checking, translation, scheduling, transport, interpretation
• Avoid system call overhead
• Multiple bus crossings likely bottleneck
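A sketch of the user side of that shared output queue (layout and names are assumptions; real code also needs memory barriers between the slot write and the head update):

    struct cmd { int dest; void *buf; unsigned len; };

    struct out_queue {                /* in memory shared by P and MP */
        volatile unsigned head;       /* advanced by user processor */
        volatile unsigned tail;       /* advanced by message processor */
        struct cmd slot[256];
    };

    /* Returns 0 if full: the MP hasn't drained it yet, caller must retry. */
    int post_send(struct out_queue *q, int dest, void *buf, unsigned len)
    {
        unsigned h = q->head;
        if (h - q->tail == 256)
            return 0;                           /* output queue full */
        q->slot[h % 256] = (struct cmd){ dest, buf, len };
        q->head = h + 1;                        /* MP picks it up from here */
        return 1;
    }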
Example: Intel Paragon
[Figure: Paragon node: an i860xp compute processor (50 MHz, 16 KB 4-way cache, 32 B blocks, MESI) and a second i860xp as message processor (MP) share a 400 MB/s memory bus with memory and the NI; send and receive DMA engines (sDMA, rDMA) and 2048 B buffers feed 175 MB/s duplex network links; service and I/O nodes with devices attach to the same network]
Dedicated MP w/specialized NI:
Meiko CS-2
• Integrate message processor into network interface
– Active messages-like capability
– Dedicated threads for DMA, reply handling, simple remote memory
access
– Supports user-level virtual DMA
» Own page table
» Can take a page fault, signal OS, restart
• Meanwhile, nack other node
• Problem: processor is slow, time-slices threads
– Fundamental issue with building your own CPU
(4) Shared Physical Address Space
• Implement shared address space model in hardware
w/o caching
– Actual caching must be done by copying from remote memory to
local
– Programming paradigm looks more like message passing than
Pthreads
» Nevertheless, low latency & low overhead transfers thanks to
HW interpretation; high bandwidth too if done right
» Result: great platform for MPI & compiled data-parallel codes
• Implementation:
– “Pseudo-memory” acts as memory controller for remote mem,
converts accesses to network transactions (requests)
– “Pseudo-CPU” on remote node receives requests, performs on local
memory, sends reply
– Split-transaction or retry-capable bus required (or dual-ported mem)
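A sketch of what this buys the programmer, with a made-up address encoding (high bits name the node): a remote access is just a load, and “caching” is an explicit copy to local memory:

    /* Hypothetical global physical address: high bits select the node. */
    #define GLOBAL(node, off) ((volatile long *)(((long)(node) << 40) | (off)))

    long remote_read(int node, long off)
    {
        /* Pseudo-memory turns this load into a network request; the
         * pseudo-CPU on 'node' performs it locally and sends the reply. */
        return *GLOBAL(node, off);
    }

    /* No hardware caching: copy remote data to a local buffer by hand. */
    void fetch_block(int node, long off, long *local, int n)
    {
        for (int i = 0; i < n; i++)
            local[i] = GLOBAL(node, off)[i];
    }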
Example: Cray T3D
• Up to 2,048 Alpha 21064 microprocessors
– No off-chip L2, to avoid the latency it adds
• In addition to remote memory ops, includes:
– Prefetch buffer (hide remote latency)
– DMA engine (requires OS trap)
– Synchronization operations (swap, fetch&inc, global AND/OR)
– Global synch operations (barrier, eureka)
– Message queue (requires OS trap on receiver)
• Big problem: physical address space
– 21064 supports only 32 bits
– 2K-node machine limited to 2M per node
– External “DTB annex” provides segment-like registers for extended
addressing, but management is expensive & ugly
Example: Cray T3E
• Similar to T3D, uses Alpha 21164 instead of 21064
(on-chip L2)
– Still has physical address space problems
• E-registers for remote communication and
synchronization
– 512 user, 128 system; 64 bits each
– Replace/unify the DTB annex, prefetch queue, block transfer engine,
remote load/store, and message queue
– Address specifies source or destination E-register and command
– Data contains pointer to block of 4 E-regs and index for centrifuge
• Centrifuge
– Supports data distributions used in data-parallel languages (HPF)
– 4 E-regs for global memory operation: mask, base, two arguments
Cray T3E (continued)
• Atomic Memory operations
– E-registers & centrifuge used
– Fetch&Increment, Fetch&Add, Compare&Swap, Masked_Swap
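The semantics of one of these AMOs, sketched with C11 atomics on ordinary memory (on the T3E the operation happens at the remote memory controller via E-registers, not as a local instruction):

    #include <stdatomic.h>

    /* Local-memory analogy for the T3E's remote Fetch&Add. */
    long fetch_and_add(_Atomic long *addr, long v)
    {
        return atomic_fetch_add(addr, v);   /* returns old value, adds atomically */
    }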
• Messaging
– Arbitrary number of queues (user or system)
– 64-byte messages
– Create message queue by storing message control word to memory
location
• Message Send
– Construct data in aligned block of 8 E-regs
– Send like put, but destination must be message control word
– Processor is responsible for queue space (buffer management)
• Barrier and Eureka synchronization
• DISCUSSION OF PAPER
(5) Cache Coherent Shared Memory
• Hardware management of caching shared memory
– Less burden on programmer to manage caches
– But tough to do on very big systems (e.g., Cray T3E)
• Cache coherent systems are focus of this course
Summary of Distributed Memory Machines
• Convergence of architectures
– Everything “looks basically the same”
– Processor, cache, memory, communication assist
• Communication Assist
– Where is it? (I/O bus, memory bus, processor registers)
– What does it know?
» Does it just move bytes, or does it perform some functions?
– Is it programmable?
– Does it run user code?
• Network transaction
– Input & output buffering
– Action on remote node
Outline
• System Organization Taxonomy
• Distributed Memory Systems (w/o cache coherence)
• Clusters and Networks of Workstations
• Presentations of Research Papers
Clusters & Networks of Workstations (C&S 7.7)
• Connect a bunch of commodity machines together
• Options:
– Low perf comm, high throughput: Condor-style job management
– High perf comm: fast SAN and LAN technology
• Nothing fundamentally different from what we’ve
studied already
• In general, though, the network interface is further from the
processor (i.e., on the I/O bus)
– Less integration, more software support
Network of Workstations (NOW)
• Network interface on I/O bus
• Standards (e.g., PCI) => longer life, faster to market
• Slow (microseconds) to access network interface
• “System Area Network” (SAN): between LAN & MPP
[Figure: same node organization as in “Ex 2: Network of Workstations”: processor, cache, core chip set, and main memory, with disk controller, graphics controller, and network interface on the I/O bus]
Myricom Myrinet (Berkeley NOW)
• Programmable network interface on I/O Bus (Sun SBUS
or PCI)
– Embedded custom CPU (“Lanai”, ~40 MHz RISC CPU)
– 256KB SRAM
– 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
– Includes source-based routing protocol
• SRAM pages can be mapped into user space
– Separate pages for separate processes
– Firmware can define status words, queues, etc.
» Data for short messages or pointers for long ones
» Firmware can do address translation too … w/OS help
– Poll to check for sends from user
• Bottom line: I/O bus still bottleneck, CPU could be faster
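A sketch of how a process might use its mapped SRAM page; the status word and layout are assumptions, not the actual Myrinet firmware interface:

    /* Hypothetical per-process page in Lanai SRAM. */
    struct ni_page {
        volatile unsigned rx_len;   /* firmware sets to length when a msg lands */
        volatile unsigned tx_go;    /* host sets; firmware polls for sends */
        char msg[2048];             /* short messages inline, long ones by pointer */
    };

    unsigned poll_recv(struct ni_page *pg, char *out, unsigned max)
    {
        unsigned n = pg->rx_len;
        if (n == 0) return 0;       /* nothing yet: keep polling */
        if (n > max) n = max;
        for (unsigned i = 0; i < n; i++)
            out[i] = pg->msg[i];    /* loads from NI SRAM across the I/O bus */
        pg->rx_len = 0;             /* hand the buffer back to the firmware */
        return n;
    }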
Example: DEC Memory Channel
(Princeton SHRIMP)
[Figure: sender’s virtual page maps to a physical transmit region; hardware reflects writes into a pinned physical receive region mapped into the receiver’s virtual address space]
• Reflective Memory
• Writes on sender appear in receiver’s memory
– Must specify send & receive regions (not arbitrary locations)
• Receive region is pinned in memory
• Requires duplicate writes; regions are really just message buffers
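A sketch of the duplicate-write pattern, with hypothetical mapped regions: the sender stores once for its own copy and once into the transmit region that hardware reflects into the receiver’s pinned memory:

    #define QLEN 512
    volatile long *tx_region;   /* mapped send region (assume mapped at startup):
                                   writes go to the wire */
    long local_copy[QLEN];      /* sender's own readable copy of the data */

    void reflected_write(int i, long v)
    {
        local_copy[i] = v;      /* duplicate write #1: local data */
        tx_region[i]  = v;      /* duplicate write #2: appears in receiver's
                                   pinned receive region */
    }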
Outline
• System Organization Taxonomy
• Distributed Memory Systems (w/o cache coherence)
• Clusters and Networks of Workstations
• Presentations of Research Papers
Thinking Machines CM-5
• PRESENTATION