CS 519 -- Operating Systems -- Fall 2000

Download Report

Transcript CS 519 -- Operating Systems -- Fall 2000

Communication in Tightly Coupled
Systems
CS 519: Operating System Theory
Computer Science, Rutgers University
Instructor: Thu D. Nguyen
TA: Xiaoyan Li
Spring 2002
Why Parallel Computing? Performance!
Computer Science, Rutgers
2
CS 519: Operating System Theory
Processor Performance
10,000
CRAY
 CRAY
 Micro
Micro
n = 1,000
n = 100
n = 1,000
n = 100

1,000

T 94

LINPACK (MFLOPS)
C90



100




DEC 8200

Ymp

Xmp/416



Xmp/14se





IBM Power2/990
MIPS R4400
DEC Alpha

 HP9000/735
 DEC Alpha AXP
 HP 9000/750
 CRAY 1s
 IBM RS6000/540
10

MIPS M/2000


MIPS M/120

Sun 4/260
1
1975
Computer Science, Rutgers


1980
1985
1990
3
1995
2000
CS 519: Operating System Theory
But not just Performance
At some point, we’re willing to trade some performance
for:
Ease of programming
Portability
Cost
Ease of programming & Portability
Parallel programming for the masses
Leverage new or faster hardware asap
Cost
High-end parallel machines are expensive resources
Computer Science, Rutgers
4
CS 519: Operating System Theory
Amdahl’s Law
If a fraction s of a
computation is not
parallelizable, then
the best achievable
speedup is
Speedup for computations with fraction s of
sequential work
100
S
1
s
Speedup
80
0
0.01
0.025
0.05
0.1
0.2
60
40
20
0
1
10
20
30
40
50
60
70
80
90
100
Number of processors
Computer Science, Rutgers
5
CS 519: Operating System Theory
Pictorial Depiction of Amdahl’s Law
1
p
1
Time
Computer Science, Rutgers
6
CS 519: Operating System Theory
Parallel Applications
Scientific computing not the only class of parallel
applications
Examples of non-scientific parallel applications:
Data mining
Real-time rendering
Distributed servers
Computer Science, Rutgers
7
CS 519: Operating System Theory
Centralized Memory Multiprocessors
CP
U
cache
CP
U
cache
Memory
memory bus
I/O bus
disk
Computer Science, Rutgers
Net interface
8
CS 519: Operating System Theory
Distributed Shared-Memory (NUMA)
Multiprocessors
CP
U
cache
CP
U
cache
Memory
Memory
memory bus
memory bus
I/O bus
I/O bus
network
disk
Computer Science, Rutgers
Net interface
Net interface
9
CS 519: Operating System Theory
disk
Multicomputers
CP
U
cache
CP
U
cache
Memory
Memory
memory bus
memory bus
I/O bus
I/O bus
network
disk
Net interface
Net interface
Inter-processor communication in multicomputers
is effected through message passing
Computer Science, Rutgers
10
CS 519: Operating System Theory
disk
Basic Message Passing
Send
Receive
Send
Receive
P0
P1
P0
P1
N0
N1
N0
Communication Fabric
Computer Science, Rutgers
11
CS 519: Operating System Theory
Terminology
Basic Message Passing:
Send: Analogous to mailing a letter
Receive: Analogous to picking up a letter from the mailbox
Scatter-gather: Ability to “scatter” data items in a message
into multiple memory locations and “gather” data items from
multiple memory locations into one message
Network performance:
Latency: The time from when a Send is initiated until the first
byte is received by a Receive.
Bandwidth: The rate at which a sender is able to send data to a
receiver.
Computer Science, Rutgers
12
CS 519: Operating System Theory
Scatter-Gather
Scatter (Receive)
Gather (Send)
…
…
Message
Message
Memory
Computer Science, Rutgers
Memory
13
CS 519: Operating System Theory
Basic Message Passing: Easy, Right?
What can be easier than this, right?
Well, think of the post office: to send a letter
Computer Science, Rutgers
14
CS 519: Operating System Theory
Basic Message Passing: Not So Easy
Why is it so complicated to send a letter if basic
message passing is so easy?
Well, it’s really not easy! Issues include:
Naming: How to specify the receiver?
Routing: How to forward the message to the correct receiver
through intermediaries?
Buffering: What if the out port is not available? What if the
receiver is not ready to receive the message?
Reliability: What if the message is lost in transit? What if the
message is corrupted in transit?
Blocking: What if the receiver is ready to receive before the
sender is ready to send?
Computer Science, Rutgers
15
CS 519: Operating System Theory
Traditional Message Passing
Implementation
Kernel-based message passing: unnecessary data
copying and traps into the kernel
S
R
M
M
M S R
M S R
M S R
Computer Science, Rutgers
16
CS 519: Operating System Theory
Reliability
Reliability problems:
Message loss
Most common approach: If don’t get a reply/ack msg within some
time interval, resend
Message corruption
Most common approach: Send additional information (e.g., error
correction code) so receiver can reconstruct data or simply detect
corruption, if part of msg is lost or damaged. If reconstruction is not
possible, throw away corrupted msg and pretend it was lost
Lack of buffer space
Most common approach: Control the flow and size of messages to
avoid running out of buffer space
Computer Science, Rutgers
17
CS 519: Operating System Theory
Reliability
Reliability is indeed a very hard problem in large-scale
networks such as the Internet
Network is unreliable
Message loss can greatly impact performance
Mechanisms to address reliability can be costly even when
there’s no message loss
Reliability is not as hard for parallel machines
Underlying network hardware is much more reliable
Less prone to buffer overflow, cause often have hardware
flow-control
Address reliability later, for loosely coupled systems
Computer Science, Rutgers
18
CS 519: Operating System Theory
Computation vs. Communication Cost
200 MHz clock  5 ns instruction cycle
Memory access:
L1: ~2-4 cycles  10-20 ns
L2: ~5-10 cycles  25-50 ns
Memory: ~50-200 cycles  250-1000 ns
Message roundtrip latency:
~20 s
Suppose 75% hit ratio in L1, no L2, 10 ns L1 access time, 500 ns
memory access time  average memory access time 132.5 ns
1 message roundtrip latency = 151 memory accesses
Computer Science, Rutgers
19
CS 519: Operating System Theory
Performance … Always Performance!
So … obviously, when we talk about message passing, we
want to know how to optimize for performance
But … which aspects of message passing should we
optimize?
We could try to optimize everything
Optimizing the wrong thing wastes precious resources, e.g.,
optimizing leaving mail for the mail-person does not increase overall
“speed” of mail delivery significantly
Subject of Martin et al., “Effects of Communication
Latency, Overhead, and Bandwidth in a Cluster
Architecture,” ISCA 1997.
Computer Science, Rutgers
20
CS 519: Operating System Theory
Martin et al.: LogP Model
Computer Science, Rutgers
21
CS 519: Operating System Theory
Sensitivity to LogGP Parameters
LogGP parameters:
L = delay incurred in passing a short msg from source to dest
o = processor overhead involved in sending or receiving a msg
g = min time between msg transmissions or receptions (msg bandwidth)
G = bulk gap = time per byte transferred for long transfers (byte
bandwidth)
Workstations connected with Myrinet network and Generic Active
Messages layer
Delay insertion technique
Applications written in Split-C but perform their own data caching
Computer Science, Rutgers
22
CS 519: Operating System Theory
Sensitivity to Overhead
L  5.0s
g  5.8s
P  16
Computer Science, Rutgers
23
CS 519: Operating System Theory
Sensitivity to Gap
L  5.0s
o  2.9s
P  16
Computer Science, Rutgers
24
CS 519: Operating System Theory
Sensitivity to Latency
o  2.9s
g  5.8s
P  16
Computer Science, Rutgers
25
CS 519: Operating System Theory
Sensitivity to Bulk Gap
L  5.0s
o  2.9s
g  5.8s
P  16
Computer Science, Rutgers
26
CS 519: Operating System Theory
Summary
Runtime strongly dependent on overhead and gap
Strong dependence on gap because of burstiness of
communication
Not so sensitive to latency  can effectively overlap
computation and communication with non-blocking reads
(writes usually do not stall the processor)
Not sensitive to bulk gap  got more bandwidth than
we know what to do with
Computer Science, Rutgers
27
CS 519: Operating System Theory
What’s the Point?
What can we take away from Martin et al.’s study?
It’s extremely important to reduce overhead because it may
affect both “o” and “g”
All the “action” is currently in the OS and the Network
Interface Card (NIC)
Subject of von Eicken et al., “Active Message: a
Mechanism for Integrated Communication and
Computation,” ISCA 1992.
Computer Science, Rutgers
28
CS 519: Operating System Theory
An Efficient Low-Level Message
Passing Interface
von Eicken et al., “Active Messages: a Mechanism for Integrated
Communication and Computation,” ISCA 1992
von Eicken et al., “U-Net: A User-Level Network Interface for Parallel
and Distributed Computing,” SOSP 1995
Santos, Bianchini, and Amorim, “A Survey of Messaging Software
Issues and Systems for Myrinet-Based Clusters”, PDCP 1999
von Eicken et al.: Active Messages
Design challenge for large-scale multiprocessor:
Minimize communication overhead
Allow computation to overlap communication
Coordinate the above two without sacrificing processor
cost/performance
Problems with traditional message passing:
Send/receive are usually synchronous; no overlap between
communication and computation
If not synchronous, needs buffering (inside the kernel) on the receive
side
Active Messages approach:
Asynchronous communication model (send and continue)
Message specifies handler that integrates msg into on-going
computation on the receiving side
Computer Science, Rutgers
30
CS 519: Operating System Theory
Buffering
Remember buffering problem: what to do if receiver
not ready to receive?
Drop the message
This is typically very costly because of recovery costs
Leave the message in the NIC
Reduce network utilization
Can result in deadlocks
Wait until receiver is ready – synchronous or 3-phase protocol
Copy to OS buffer and later copy to user buffer
Computer Science, Rutgers
31
CS 519: Operating System Theory
3-phase Protocol
Computer Science, Rutgers
32
CS 519: Operating System Theory
Copying
Process
Address Space
Incoming Message
Message
Buffers
OS Address Space
Computer Science, Rutgers
33
CS 519: Operating System Theory
Copying - Don’t Do It!
Hennessy
and
Patterson,
1996
Computer Science, Rutgers
34
CS 519: Operating System Theory
Overhead of Many Native MIs Too High
Recall that overhead is critical to appl performance
Asynchronous send and receive overheads on many
platforms (back in 1991): Ts = time to start a message;
Tb = time/byte; Tfb = time/flop (for comparison)
Computer Science, Rutgers
35
CS 519: Operating System Theory
Message Latency on Two Different LAN
Technologies
Computer Science, Rutgers
36
CS 519: Operating System Theory
von Eicken et al.: Active Receive
Key idea is really to optimize receive - Buffer
management is more complex on receiver
Handler
Message Data
Computer Science, Rutgers
37
CS 519: Operating System Theory
Active Receive More Efficient
P1
P0
Active Message
P0
P1
Copying
P0
P1
OS
OS
Computer Science, Rutgers
38
CS 519: Operating System Theory
Active Message Performance
Send
Instructions
NCUBE2
CM-5
Receive
Time (s) Instructions Time (s)
21
11.0
1.6
34
15.0
1.7
Main difference between these AM implementations is that the
CM-5 allows direct, user-level access to the network interface.
More on this in a minute!
Computer Science, Rutgers
39
CS 519: Operating System Theory
Any Drawback To Active Message?
Active message  SPMD
SPMD: Single Program Multiple Data
This is because sender must know address of handler on
receiver
Not absolutely necessary, however
Can use indirection, i.e. have a table mapping handler Ids to
addresses on receiver. Mapping has a performance cost,
though.
Computer Science, Rutgers
40
CS 519: Operating System Theory
User-Level Access to NIC
Basic idea: allow
protected user
access to NIC for
implementing comm.
protocols at userlevel
Computer Science, Rutgers
41
CS 519: Operating System Theory
User-level Communication
Basic idea: remove the kernel from the critical path of
sending and receiving messages
user-memory to user-memory: zero copy
permission is checked once when the mapping is established
buffer management left to the application
Advantages
low communication latency
low processor overhead
approach raw latency and bandwidth provided by the network
One approach: U-Net
Computer Science, Rutgers
42
CS 519: Operating System Theory
U-Net Abstraction
Computer Science, Rutgers
43
CS 519: Operating System Theory
U-Net Endpoints
Computer Science, Rutgers
44
CS 519: Operating System Theory
U-Net Basics
Protection provided by endpoints and communication channels
Endpoints, communication segments, and message queues are only
accessible by the owning process (all allocated in user memory)
Outgoing messages are tagged with the originating endpoint address
and incoming messages are demultiplexed and only delivered to the
correct endpoints
For ideal performance, firmware at NIC should implement the
actual messaging and NI multiplexing (including tag checking).
Protection must be implemented by the OS by validating requests
for the creation of endpoints. Channel registration should also be
implemented by the OS.
Message queues can be placed at different memories to optimize
polling
Receive queue allocated in host memory
Send and free queues allocated in NIC memory
Computer Science, Rutgers
45
CS 519: Operating System Theory
U-Net Performance on ATM
Computer Science, Rutgers
46
CS 519: Operating System Theory
U-Net UDP Performance
Computer Science, Rutgers
47
CS 519: Operating System Theory
U-Net TCP Performance
Computer Science, Rutgers
48
CS 519: Operating System Theory
U-Net Latency
Computer Science, Rutgers
49
CS 519: Operating System Theory
Virtual Memory-Mapped Communication
Receiver exports the receive buffers
Sender must import a receive buffer before sending
The permission of sender to write into the receive buffer is
checked once, when the export/import handshake is performed
(usually at the beginning of the program)
Sender can directly communicate with the network interface to
send data into imported buffers without kernel intervention
At the receiver, the network interface stores the received data
directly into the exported receive buffer with no kernel
intervention
Computer Science, Rutgers
50
CS 519: Operating System Theory
Virtual-to-physical address
receiver
sender
int send_buffer[1024];
recv_id = import(receiver, exp_id);
int rec_buffer[1024];
exp_id = export(rec_buffer, sender);
send(recv_id, send_buffer);
recv(exp_id);
In order to store data directly into the application address space
(exported buffers), the NI must know the virtual to physical
translations
What to do?
Computer Science, Rutgers
51
CS 519: Operating System Theory
Software TLB in Network Interface
The network interface must incorporate a TLB (NI-TLB) which is
kept consistent with the virtual memory system
When a message arrives, NI attempts a virtual to physical
translation using the NI-TLB
If a translation is found, NI transfers the data to the physical
address in the NI-TLB entry
If a translation is missing in the NI-TLB, the processor is
interrupted to provide the translation. If the page is not
currently in memory, the processor will bring the page in. In any
case, the kernel increments the reference count for that page to
avoid swapping
When a page entry is evicted from the NI-TLB, the kernel is
informed to decrement the reference count
Swapping prevented while DMA in progress
Computer Science, Rutgers
52
CS 519: Operating System Theory
Introduction to Collective Communication
Computer Science, Rutgers
53
CS 519: Operating System Theory
Collective Communication
More than two processes involved in communication
Barrier
Broadcast (one-to-all), multicast (one-to-many)
All-to-all
Reduction (all-to-one)
Computer Science, Rutgers
54
CS 519: Operating System Theory
Barrier
Compute
Compute
Compute
Compute
Barrier
Compute
Computer Science, Rutgers
Compute
55
Compute
Compute
CS 519: Operating System Theory
Broadcast and Multicast
Broadcast
Multicast
P1
P1
Message
P0
Message
P2
P0
P3
Computer Science, Rutgers
P2
P3
56
CS 519: Operating System Theory
All-to-All
Computer Science, Rutgers
Message
Message
P0
P2
P1
P3
Message
Message
57
CS 519: Operating System Theory
Reduction
sum  0
for i  1 to p do
sum  sum + A[i]
A[1]
A[0] + A[1]
A[0] + A[1] + A[2] + A[3]
A[2] + A[3]
P1
A[1]
P1
A[2] + A[3]
A[2]
P0
A[0]
A[1]
A[2]
A[3]
P0
P2
P2
A[3]
A[3]
P3
Computer Science, Rutgers
P3
58
CS 519: Operating System Theory