CS 519 -- Operating Systems -


CS519: Lecture 6
Communication in tightly coupled systems (parallel computing)
Why Parallel Computing? Performance!
CS 519
2
Operating System Theory
Processor Performance
[Figure: LINPACK performance (MFLOPS, log scale from 1 to 10,000) vs. year, 1975-2000, for problem sizes n = 100 and n = 1,000. CRAY supercomputers (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) are compared with microprocessor-based systems (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, HP9000/735, DEC Alpha, MIPS R4400, IBM Power2/990, DEC 8200); microprocessor performance steadily closes the gap with the supercomputers.]
Operating System Theory
But not just Performance
• At some point, we're willing to trade some performance for:
  - Ease of programming
  - Portability
  - Cost
• Ease of programming & portability: parallel programming for the masses; leverage new or faster hardware as soon as possible
• Cost: high-end parallel machines are expensive resources
Amdahl's Law
• If a fraction s of a computation is not parallelizable, then the best achievable speedup is S = 1/s
[Figure: speedup vs. number of processors (1 to 100) for s = 0.01, 0.025, 0.05, 0.1, and 0.2; each curve flattens out as it approaches its 1/s bound.]
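A quick way to build intuition for these curves is to evaluate the bound directly. The sketch below (Python, not from the slides) computes the speedup on p processors, 1/(s + (1-s)/p), and its limit 1/s as p grows.

```python
def speedup(s, p):
    """Amdahl's law: speedup with sequential fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

def max_speedup(s):
    """Best achievable speedup as p grows without bound: 1/s."""
    return 1.0 / s

# With 10% sequential work, even 100 processors give less than 10x:
print(round(speedup(0.1, 100), 2))   # 9.17
print(max_speedup(0.1))              # 10.0
```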
Pictorial Depiction of Amdahl's Law
[Figure: execution time on 1 processor vs. p processors; the parallel fraction's time shrinks by a factor of p while the sequential fraction's time stays fixed.]
Parallel Applications
• Scientific computing is not the only class of parallel applications
• Examples of non-scientific parallel applications:
  - Data mining
  - Real-time rendering
  - Distributed servers
Centralized Memory Multiprocessors
[Figure: several CPUs, each with its own cache, share a single memory over a common memory bus; an I/O bus connects the disk and the network interface.]
Distributed Shared-Memory (NUMA) Multiprocessors
[Figure: each node has a CPU with cache, a local memory on its own memory bus, and an I/O bus with a disk and network interface; the nodes' network interfaces are connected by a network, and the memories form a single shared address space.]
Multicomputers
[Figure: the same per-node hardware as the NUMA case (CPU with cache, memory, memory bus, I/O bus, disk, and network interface, with nodes connected by a network), but without a shared address space.]
Inter-processor communication in multicomputers is effected through message passing.
Basic Message Passing
[Figure: process P0 on node N0 issues a Send, process P1 on node N1 issues a Receive, and the message crosses the communication fabric connecting the nodes.]
Terminology
• Basic message passing:
  - Send: analogous to mailing a letter
  - Receive: analogous to picking up a letter from the mailbox
  - Scatter-gather: the ability to "scatter" data items in a message into multiple memory locations and "gather" data items from multiple memory locations into one message
• Network performance:
  - Latency: the time from when a Send is initiated until the first byte is received by a Receive
  - Bandwidth: the rate at which a sender is able to send data to a receiver
Scatter-Gather
[Figure: gather (send) collects data items from multiple memory locations into one message; scatter (receive) distributes the items of an incoming message to multiple memory locations.]
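As a concrete illustration (not any real messaging API), gather can be modeled as concatenating several buffers into one message, and scatter as splitting the message back across per-destination buffers:

```python
def gather(buffers):
    """Gather (send): pack data from multiple memory locations into one message."""
    return b"".join(bytes(b) for b in buffers)

def scatter(message, buffers):
    """Scatter (receive): unpack one message into multiple memory locations."""
    offset = 0
    for buf in buffers:              # each buf is a writable bytearray
        n = len(buf)
        buf[:] = message[offset:offset + n]
        offset += n

msg = gather([b"abc", b"de"])        # one message from two locations
dst = [bytearray(3), bytearray(2)]
scatter(msg, dst)                    # back into two locations
print(msg)  # b'abcde'
```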
Basic Message Passing Issues
• Naming: how to specify the receiver?
• Buffering: what if the out port is not available? What if the receiver is not ready to receive the message?
• Reliability: what if the message is lost in transit? What if the message is corrupted in transit?
• Blocking: what if the receiver is ready to receive before the sender is ready to send?
Traditional Message Passing Implementation
• Kernel-based message passing: unnecessary data copying and traps into the kernel
[Figure: the message M is copied between sender S and receiver R through a chain of kernel buffers.]
Reliability
• Reliability problems:
  - Message loss: most common approach is to resend if no reply/ack arrives within some time interval
  - Message corruption: most common approach is to send additional information (e.g., an error correction code) so the receiver can reconstruct the data, or at least detect corruption, if part of the message is lost or damaged. If reconstruction is not possible, discard the corrupted message and treat it as lost
  - Lack of buffer space: most common approach is to control the flow and size of messages to avoid running out of buffer space
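The timeout-and-resend approach for message loss can be sketched as follows (illustrative Python; `send_fn`, `recv_ack_fn`, and the timeout values are invented for the example, not an API from the lecture):

```python
import time

def reliable_send(send_fn, recv_ack_fn, msg, timeout=0.5, max_retries=5):
    """Resend msg until an acknowledgment arrives or retries are exhausted."""
    for attempt in range(max_retries):
        send_fn(msg)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if recv_ack_fn():          # poll for the acknowledgment
                return attempt + 1     # number of transmissions used
        # timed out: fall through and resend
    raise TimeoutError("no acknowledgment after %d attempts" % max_retries)

# Simulate a link that loses the first transmission:
sent = []
print(reliable_send(sent.append, lambda: len(sent) >= 2, "hello", timeout=0.01))  # 2
```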
Reliability
• Reliability is a very hard problem in large-scale networks such as the Internet
  - The network is unreliable
  - Message loss can greatly impact performance
  - Mechanisms to address reliability can be costly even when there is no message loss
• Reliability is not as hard for parallel machines
  - The underlying network hardware is much more reliable
  - Less prone to buffer overflow, since there is often hardware flow control
• We address reliability later, for loosely coupled systems
Computation vs. Communication Cost
• 200 MHz clock → 5 ns instruction cycle
• Memory access:
  - L1: ~2-4 cycles → 10-20 ns
  - L2: ~5-10 cycles → 25-50 ns
  - Memory: ~50-200 cycles → 250-1000 ns
• Message roundtrip latency: ~20 μs
• Suppose a 75% hit ratio in L1, no L2, 10 ns L1 access time, and 500 ns memory access time → average memory access time = 0.75 × 10 + 0.25 × 500 = 132.5 ns
• 1 message roundtrip latency ≈ 151 memory accesses
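The arithmetic behind the slide's 151× figure in two lines:

```python
# Average memory access time under a 75% L1 hit ratio (no L2).
hit_ratio, l1_ns, mem_ns = 0.75, 10, 500
avg_ns = hit_ratio * l1_ns + (1 - hit_ratio) * mem_ns

roundtrip_ns = 20_000  # ~20 microsecond message roundtrip
print(avg_ns, round(roundtrip_ns / avg_ns))  # 132.5 151
```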
Performance … Always Performance!
• So … obviously, when we talk about message passing, we want to know how to optimize for performance
• But … which aspects of message passing should we optimize?
  - We could try to optimize everything
  - Optimizing the wrong thing wastes precious resources; e.g., optimizing how you leave mail out for the mail carrier does not significantly increase the overall "speed" of mail delivery
• Subject of Martin et al., "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," ISCA 1997
Martin et al.: LogP Model
Sensitivity to LogGP Parameters
• LogGP parameters:
  - L = delay incurred in passing a short message from source to destination
  - o = processor overhead involved in sending or receiving a message
  - g = minimum time between message transmissions or receptions (message bandwidth)
  - G = bulk gap = time per byte transferred for long transfers (byte bandwidth)
• Workstations connected with a Myrinet network and the Generic Active Messages layer
• Delay insertion technique
• Applications written in Split-C, but they perform their own data caching
Sensitivity to Overhead
[Figure: sensitivity of application performance to overhead o; fixed parameters L ≈ 5.0 μs, g ≈ 5.8 μs, P = 16.]
Sensitivity to Gap
[Figure: sensitivity of application performance to gap g; fixed parameters L ≈ 5.0 μs, o ≈ 2.9 μs, P = 16.]
Sensitivity to Latency
[Figure: sensitivity of application performance to latency L; fixed parameters o ≈ 2.9 μs, g ≈ 5.8 μs, P = 16.]
Sensitivity to Bulk Gap
[Figure: sensitivity of application performance to bulk gap G; fixed parameters L ≈ 5.0 μs, o ≈ 2.9 μs, g ≈ 5.8 μs, P = 16.]
Summary
• Runtime is strongly dependent on overhead and gap
• Strong dependence on gap because of the burstiness of communication
• Not so sensitive to latency → computation and communication can be effectively overlapped with non-blocking reads (writes usually do not stall the processor)
• Not sensitive to bulk gap → there is more bandwidth than we know what to do with
What's the Point?
• What can we take away from Martin et al.'s study?
  - It is extremely important to reduce overhead, because it may affect both "o" and "g"
  - All the "action" is currently in the OS and the Network Interface Card (NIC)
• Subject of von Eicken et al., "Active Messages: a Mechanism for Integrated Communication and Computation," ISCA 1992
An Efficient Low-Level Message Passing Interface
• von Eicken et al., "Active Messages: a Mechanism for Integrated Communication and Computation," ISCA 1992
• von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," SOSP 1995
• Santos, Bianchini, and Amorim, "A Survey of Messaging Software Issues and Systems for Myrinet-Based Clusters," PDCP 1999
von Eicken et al.: Active Messages
• Design challenges for a large-scale multiprocessor:
  - Minimize communication overhead
  - Allow computation to overlap communication
  - Coordinate the above two without sacrificing processor cost/performance
• Problems with traditional message passing:
  - Send/receive are usually synchronous; no overlap between communication and computation
  - If not synchronous, buffering is needed (inside the kernel) on the receive side
• Active Messages approach:
  - Asynchronous communication model (send and continue)
  - The message specifies a handler that integrates the message into the ongoing computation on the receiving side
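A toy model of the Active Messages idea, using the handler-ID indirection table the lecture discusses later: each message carries a handler ID, and the receiver dispatches to the registered handler instead of buffering the message. All names here are illustrative, not the AM API.

```python
handlers = {}  # handler ID -> function: the receiver-side indirection table

def register_handler(hid, fn):
    handlers[hid] = fn

def am_send(dest_queue, hid, *args):
    """Asynchronous send: enqueue (handler ID, args) and continue computing."""
    dest_queue.append((hid, args))

def am_poll(queue, state):
    """Receiver: run each message's handler to fold it into the computation."""
    while queue:
        hid, args = queue.pop(0)
        handlers[hid](state, *args)

# Handler 0 accumulates incoming values into the ongoing computation's state.
register_handler(0, lambda state, x: state.__setitem__("total", state["total"] + x))

queue, state = [], {"total": 0}
am_send(queue, 0, 5)     # sender continues immediately after each send
am_send(queue, 0, 7)
am_poll(queue, state)    # receiver integrates both messages
print(state["total"])    # 12
```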
Buffering
• Remember the buffering problem: what to do if the receiver is not ready to receive?
  - Drop the message: typically very costly because of recovery costs
  - Leave the message in the NIC: reduces network utilization and can result in deadlocks
  - Wait until the receiver is ready: synchronous or 3-phase protocol
  - Copy to an OS buffer and later copy to the user buffer
3-phase Protocol
[Figure: the sender first issues a request, and the data are transferred only after the receiver indicates it is ready to receive.]
Copying
[Figure: an incoming message is first placed in message buffers in the OS address space and then copied into the process address space.]
Copying - Don't Do It!
[Figure: the cost of copying; from Hennessy and Patterson, 1996.]
Overhead of Many Native MIs Too High
• Recall that overhead is critical to application performance
• Asynchronous send and receive overheads on many platforms (back in 1991): Ts = time to start a message; Tb = time per byte; Tfb = time per flop (for comparison)
Message Latency on Two Different LAN
Technologies
von Eicken et al.: Active Receive
• The key idea is really to optimize receive; buffer management is more complex on the receiver
[Figure: the message carries a handler reference together with the message data.]
Active Receive More Efficient
[Figure: with Active Messages, data flows directly between P0 and P1; with copying, the OS on each side sits in the path of the transfer.]
Active Message Performance

Machine    Send instructions   Send time (μs)   Receive instructions   Receive time (μs)
NCUBE2            21               11.0                 34                   15.0
CM-5               -                1.6                  -                    1.7

The main difference between these AM implementations is that the CM-5 allows direct, user-level access to the network interface. More on this in a minute!
Any Drawback to Active Messages?
• Active Messages → SPMD (Single Program Multiple Data)
• This is because the sender must know the address of the handler on the receiver
• Not absolutely necessary, however
  - Can use indirection, i.e., a table mapping handler IDs to addresses on the receiver. The mapping has a performance cost, though.
User-Level Access to NIC
• Basic idea: allow protected user access to the NIC for implementing communication protocols at user level
User-level Communication
• Basic idea: remove the kernel from the critical path of sending and receiving messages
  - User-memory to user-memory: zero copy
  - Permission is checked once, when the mapping is established
  - Buffer management is left to the application
• Advantages:
  - Low communication latency
  - Low processor overhead
  - Approach the raw latency and bandwidth provided by the network
• One approach: U-Net
U-Net Abstraction
U-Net Endpoints
U-Net Basics
• Protection is provided by endpoints and communication channels
  - Endpoints, communication segments, and message queues are accessible only by the owning process (all allocated in user memory)
  - Outgoing messages are tagged with the originating endpoint address, and incoming messages are demultiplexed and delivered only to the correct endpoints
• For ideal performance, firmware at the NIC should implement the actual messaging and NI multiplexing (including tag checking). Protection must be implemented by the OS by validating requests for the creation of endpoints. Channel registration should also be implemented by the OS.
• Message queues can be placed in different memories to optimize polling
  - Receive queue allocated in host memory
  - Send and free queues allocated in NIC memory
U-Net Performance on ATM
U-Net UDP Performance
U-Net TCP Performance
U-Net Latency
Virtual Memory-Mapped Communication
• The receiver exports the receive buffers
• The sender must import a receive buffer before sending
• The sender's permission to write into the receive buffer is checked once, when the export/import handshake is performed (usually at the beginning of the program)
• The sender can communicate directly with the network interface to send data into imported buffers without kernel intervention
• At the receiver, the network interface stores the received data directly into the exported receive buffer with no kernel intervention
Virtual-to-physical address

receiver:
    int rec_buffer[1024];
    exp_id = export(rec_buffer, sender);
    recv(exp_id);

sender:
    int send_buffer[1024];
    recv_id = import(receiver, exp_id);
    send(recv_id, send_buffer);

• In order to store data directly into the application address space (exported buffers), the NI must know the virtual-to-physical translations
• What to do?
Software TLB in Network Interface
• The network interface must incorporate a TLB (NI-TLB), which is kept consistent with the virtual memory system
• When a message arrives, the NI attempts a virtual-to-physical translation using the NI-TLB
• If a translation is found, the NI transfers the data to the physical address in the NI-TLB entry
• If a translation is missing from the NI-TLB, the processor is interrupted to provide the translation. If the page is not currently in memory, the processor brings the page in. In either case, the kernel increments the reference count for that page to prevent it from being swapped out
• When a page entry is evicted from the NI-TLB, the kernel is informed so that it can decrement the reference count
• Swapping is prevented while a DMA is in progress
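The NI-TLB logic above can be sketched as follows (illustrative Python; the `kernel_translate` callback stands in for interrupting the processor on a miss, and the capacity and page-table interfaces are invented for the example):

```python
class NiTlb:
    """Tiny model of a network-interface TLB with kernel-assisted misses."""
    def __init__(self, kernel_translate, capacity=4):
        self.kernel_translate = kernel_translate  # models the miss interrupt
        self.capacity = capacity
        self.entries = {}       # virtual page -> physical page
        self.refcount = {}      # pages pinned against swapping

    def translate(self, vpage):
        if vpage not in self.entries:              # NI-TLB miss
            if len(self.entries) >= self.capacity:
                evicted = next(iter(self.entries)) # evict an entry
                del self.entries[evicted]
                self.refcount[evicted] -= 1        # kernel may now swap it
            self.entries[vpage] = self.kernel_translate(vpage)
            self.refcount[vpage] = self.refcount.get(vpage, 0) + 1  # pin page
        return self.entries[vpage]

# Fake kernel translation: physical page = virtual page + 1000.
tlb = NiTlb(kernel_translate=lambda v: v + 1000, capacity=2)
print(tlb.translate(1), tlb.translate(2), tlb.translate(1))  # 1001 1002 1001
```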
Introduction to Collective Communication
Collective Communication
• More than two processes are involved in the communication
  - Barrier
  - Broadcast (one-to-all), multicast (one-to-many)
  - All-to-all
  - Reduction (all-to-one)
Barrier
[Figure: four processes each compute, then wait at the barrier; all of them resume computing only after every process has arrived.]
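Python's standard library has a direct analogue of this primitive; the sketch below shows four workers that must all finish phase 1 before any of them starts phase 2.

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
log = []
lock = threading.Lock()

def worker(i):
    with lock:
        log.append(("phase1", i))   # compute
    barrier.wait()                  # wait until every worker reaches this point
    with lock:
        log.append(("phase2", i))   # compute again

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()

# The barrier guarantees every phase-1 entry precedes every phase-2 entry:
print(all(p == "phase1" for p, _ in log[:NUM_WORKERS]))  # True
```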
Broadcast and Multicast
[Figure: in a broadcast, P0 sends the message to all other processes (P1, P2, P3); in a multicast, P0 sends it only to a subset of them.]
All-to-All
[Figure: every process among P0-P3 sends a message to every other process.]
Reduction

sum ← 0
for i ← 1 to p do
    sum ← sum + A[i]

[Figure: tree reduction across four processes: P1 sends A[1] to P0 and P3 sends A[3] to P2, producing the partial sums A[0] + A[1] and A[2] + A[3]; P2 then sends its partial sum to P0, which forms A[0] + A[1] + A[2] + A[3].]
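The tree halves the number of active processes each round, so the reduction finishes in O(log p) communication steps rather than the loop's O(p). A sequential Python model of the same pairwise pattern (element i standing in for process i's local value):

```python
def tree_reduce(A):
    """Pairwise tree reduction: ceil(log2(p)) combining rounds."""
    vals = list(A)
    step = 1
    while step < len(vals):
        # In each round, the process at i receives its partner's partial sum.
        for i in range(0, len(vals), 2 * step):
            if i + step < len(vals):
                vals[i] += vals[i + step]
        step *= 2
    return vals[0]           # process 0 holds the final result

print(tree_reduce([1, 2, 3, 4]))  # 10
```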