
SCTP versus TCP for MPI
Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science, University of British Columbia
Distributed Research Group
SC-2005, Nov 16
What is SCTP?

Stream Control Transmission Protocol (SCTP)
 General purpose unicast transport protocol for IP network data communications
 Recently standardized by the IETF
 Can be used anywhere TCP is used

Question
Can we take advantage of SCTP features to better support parallel applications using MPI?

Communicating MPI Processes
TCP is often used as transport protocol for MPI.
[Figure: two communicating MPI processes, each with the stack MPI Process -> MPI API -> SCTP or TCP -> IP.]
Overview of SCTP

SCTP Key Features
 Reliable in-order delivery, flow control, full duplex transfer
 TCP-like congestion control
 Selective ACK is built into the protocol
SCTP Key Features
 Message oriented
 Use of associations
 Multihoming
 Multiple streams within an association (illustrated in the socket sketch below)
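
As a point of reference, these features are exposed through the SCTP sockets extensions; the following is a minimal sketch (lksctp-style API assumed; the port and stream counts are illustrative) of creating a one-to-many socket that requests multiple streams per association:

/* Minimal sketch: create a one-to-many SCTP socket and request
 * multiple streams per association (lksctp-style API assumed). */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

int main(void)
{
    int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);  /* one-to-many style */
    if (sd < 0) { perror("socket"); return 1; }

    struct sctp_initmsg init;
    memset(&init, 0, sizeof(init));
    init.sinit_num_ostreams  = 10;   /* ask for 10 outbound streams */
    init.sinit_max_instreams = 10;   /* accept up to 10 inbound streams */
    if (setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG, &init, sizeof(init)) < 0)
        perror("setsockopt(SCTP_INITMSG)");

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);          /* illustrative port */
    addr.sin_addr.s_addr = INADDR_ANY;    /* bind all local NICs: multihoming */
    if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("bind");
    listen(sd, 5);                        /* allow peers to set up associations */
    return 0;
}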
Associations and Multihoming
[Figure: Endpoint X (NIC1, NIC2) and Endpoint Y (NIC3, NIC4) joined by a single association that spans two networks, 207.10.x.x and 168.1.x.x. Endpoint X has addresses 207.10.3.20 and 168.1.10.30; Endpoint Y has 207.10.40.1 and 168.1.140.10.]
Logical View of Multiple Streams in an Association
[Figure: Endpoint X and Endpoint Y each send on three outbound streams and receive on three inbound streams (Stream 1-3) within a single association.]
Partially Ordered User Messages Sent on Different Streams

[Figure (animation): Endpoint X sends five messages to Endpoint Y over one association; Msg A, Msg C and Msg D go on stream 1, Msg B on stream 2, and Msg E on stream 3.]
Send order: A, B, C, D, E.
The messages can be received in the same order as they were sent (as TCP requires), or in an alternative order such as E, A, B, C, D.
Delivery constraints: only per-stream order is enforced, so A must be delivered before C, and C must be delivered before D.
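
To make the figure concrete, here is a hedged sketch (lksctp-style sctp_sendmsg() assumed; the payloads and the pre-existing socket and peer address are illustrative) of sending messages A-E on three streams of one association; SCTP preserves order within each stream only:

/* Sketch: send messages on different SCTP streams of one association.
 * Order is preserved per stream, not across streams. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

/* 'sd' is a one-to-many SCTP socket, 'to' the peer address. */
static void send_on_stream(int sd, struct sockaddr_in *to,
                           const char *msg, uint16_t stream_no)
{
    sctp_sendmsg(sd, msg, strlen(msg),
                 (struct sockaddr *)to, sizeof(*to),
                 0 /* ppid */, 0 /* flags */,
                 stream_no, 0 /* ttl */, 0 /* context */);
}

static void send_example(int sd, struct sockaddr_in *to)
{
    send_on_stream(sd, to, "Msg A", 0);  /* stream 1 in the figure */
    send_on_stream(sd, to, "Msg B", 1);  /* stream 2 */
    send_on_stream(sd, to, "Msg C", 0);  /* stream 1: delivered after A */
    send_on_stream(sd, to, "Msg D", 0);  /* stream 1: delivered after C */
    send_on_stream(sd, to, "Msg E", 2);  /* stream 3: unordered w.r.t. A-D */
}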
MPI Point-to-Point Overview

MPI Point-to-Point
MPI_Send(msg,count,type,dest-rank,tag,context)
MPI_Recv(msg,count,type,source-rank,tag,context)
 Message matching is done based on Tag, Rank and Context (TRC).
 Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.
 Use of wildcards for receive.

Format of MPI Message: Envelope (Context, Rank, Tag) + Payload
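
As a small illustration (not part of the original slides) of how the envelope fields drive matching, the following MPI program sends a tagged message from rank 1 to rank 0; the receive matches on tag, source rank and communicator (the context):

/* Minimal MPI point-to-point example: matching on (tag, rank, context). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* tag 7, destination rank 0, communicator supplies the context */
        MPI_Send(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Status status;
        MPI_Recv(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, &status);
        printf("rank 0 got %d (tag %d from rank %d)\n",
               value, status.MPI_TAG, status.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}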
MPI Messages Using Same Context, Two Processes

Process X                         Process Y
MPI_Send(Msg_1,Tag_A)             MPI_Irecv(..ANY_TAG..)
MPI_Send(Msg_2,Tag_B)
MPI_Send(Msg_3,Tag_A)

[Figure (animation): the three messages travel from Process X to Process Y; in the final frame Msg_3 overtakes Msg_1, which carries the same tag (Tag_A).]
Out-of-order delivery of messages with the same tag violates MPI semantics.
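
A hedged sketch of the wildcard case above: the receiver posts MPI_Irecv with MPI_ANY_TAG, and MPI's non-overtaking rule requires that the two Tag_A messages be matched in their send order:

/* Sketch: wildcard receives. Messages that could match the same receive
 * (same source and context) are matched in the order they were sent. */
#include <mpi.h>
#include <stdio.h>

void receive_three_from(int src)   /* called on Process Y */
{
    int buf[3];
    MPI_Request req[3];
    MPI_Status  st[3];

    for (int i = 0; i < 3; i++)
        MPI_Irecv(&buf[i], 1, MPI_INT, src, MPI_ANY_TAG, MPI_COMM_WORLD, &req[i]);

    MPI_Waitall(3, req, st);
    for (int i = 0; i < 3; i++)
        printf("matched message with tag %d\n", st[i].MPI_TAG);
    /* If the sender used Tag_A, Tag_B, Tag_A, the two Tag_A messages
     * may not overtake each other. */
}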
Using SCTP for MPI
 Striking similarities between SCTP and MPI (a mapping sketch follows the table):

    SCTP                  MPI
    One-to-Many Socket    Context
    Association           Rank / Source
    Streams               Message Tags
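
A minimal sketch of one way such a mapping could be realized (the function name and the modulo scheme are illustrative placeholders, not necessarily what LAM-SCTP does): the context and rank select the one-to-many socket and the association, while the tag picks a stream.

/* Illustrative mapping of MPI envelope fields onto SCTP:
 *   context -> one-to-many socket, rank -> association, tag -> stream.
 * The hashing scheme below is a hypothetical placeholder. */
#include <stdint.h>

#define NUM_STREAMS 10   /* streams negotiated per association */

static uint16_t tag_to_stream(int tag)
{
    /* Same tag always maps to the same stream, preserving per-tag order;
     * different tags may land on different streams and be reordered. */
    if (tag < 0)                 /* e.g. internal traffic */
        return 0;
    return (uint16_t)(tag % NUM_STREAMS);
}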
SCTP-based MPI

MPI over SCTP: Design and Implementation
 LAM (Local Area Multi-computer) is an open-source implementation of the MPI library.
 We redesigned the LAM TCP RPI module to use SCTP.
 The RPI module is responsible for maintaining state information for all requests.
Implementation Issues
 Maintaining State Information
   Maintain state appropriately for each request function to work with the one-to-many style.
 Message Demultiplexing
   Extend RPI initialization to map associations to ranks.
   Demultiplex each incoming message to direct it to the proper receive function (a sketch follows this list).
 Concurrency and SCTP Streams
   Consistently map MPI tag-rank-context to SCTP streams, maintaining proper MPI semantics.
 Resource Management
   Make the RPI more message-driven.
   Eliminate the use of the select() system call, making the implementation more scalable.
   Eliminate the need to maintain a large number of socket descriptors.
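
As a rough illustration of the demultiplexing step (a sketch only: the lksctp-style sctp_recvmsg() call is real, but lookup_rank() and deliver() are hypothetical helpers standing in for the RPI's tables):

/* Sketch: demultiplex incoming messages on a one-to-many SCTP socket.
 * lookup_rank() is a hypothetical table built during RPI initialization. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

extern int  lookup_rank(sctp_assoc_t assoc_id);         /* assoc id -> MPI rank */
extern void deliver(int rank, int stream, char *buf, int len);

void poll_socket(int sd)
{
    char buf[65536];
    struct sctp_sndrcvinfo info;
    struct sockaddr_in from;
    socklen_t fromlen = sizeof(from);
    int flags = 0;

    int n = sctp_recvmsg(sd, buf, sizeof(buf),
                         (struct sockaddr *)&from, &fromlen, &info, &flags);
    if (n > 0 && (flags & MSG_NOTIFICATION) == 0) {
        int rank = lookup_rank(info.sinfo_assoc_id); /* which association */
        deliver(rank, info.sinfo_stream, buf, n);    /* which stream      */
    }
}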

Implementation Issues
 Eliminating Race Conditions
   Finding solutions for race conditions due to the added concurrency.
   Use of a barrier after the association setup phase.
 Reliability
   Modify the out-of-band daemons and the request progression interface (RPI) to use a common transport layer protocol, allowing all components of LAM to multihome successfully.
 Support for Large Messages
   Devised a long-message protocol to handle messages larger than the socket send buffer.
   Experiments with different SCTP stacks.
Features of Design
 Head-of-Line Blocking Avoidance
 Scalability (1 socket per process)
 Multihoming
 Added Security
Head-of-Line Blocking
[Figure: Process X sends Msg_A (Tag_A) and Msg_B (Tag_B) to Process Y, which has posted matching MPI_Irecvs. With TCP, Msg_B is blocked behind Msg_A in the single byte stream; with SCTP, Msg_B is delivered on its own stream even though Msg_A is delayed.]
P0                                    P1
MPI_Irecv(P1, MPI_ANY_TAG)            MPI_Send(Msg-A, P0, tag-A)
MPI_Irecv(P1, MPI_ANY_TAG)            MPI_Send(Msg-B, P0, tag-B)
MPI_Waitany()                         ---
Compute()
MPI_Waitall()
---

Execution time on P0 with TCP:
[Timeline: Msg-B arrives first but stays in the socket buffer behind the delayed Msg-A, so MPI_Waitany cannot complete until Msg-A arrives; Compute and MPI_Waitall only run afterwards.]

Execution time on P0 with SCTP:
[Timeline: Msg-B is delivered on its own stream as soon as it arrives, MPI_Waitany completes with it, and Compute overlaps the late arrival of Msg-A before MPI_Waitall; total time on P0 is shorter than with TCP.]
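
The pseudocode above corresponds roughly to the following runnable MPI sketch (the tags, message sizes and compute step are placeholders). Under TCP both messages share one connection, so a delayed Msg-A also delays Msg-B; with one SCTP stream per tag, Msg-B can complete the MPI_Waitany early and overlap Compute with Msg-A's arrival.

/* Sketch of the head-of-line blocking scenario from the slides. */
#include <mpi.h>

#define TAG_A 1
#define TAG_B 2

void compute(void) { /* placeholder for useful work */ }

int main(int argc, char **argv)
{
    int rank, a = 0, b = 0, idx;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* P0 */
        MPI_Request req[2];
        MPI_Irecv(&a, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&b, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &req[1]);
        MPI_Waitany(2, req, &idx, MPI_STATUS_IGNORE);  /* any message lets us start */
        compute();                                     /* overlap with the late one */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {                /* P1 */
        MPI_Send(&a, 1, MPI_INT, 0, TAG_A, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 0, TAG_B, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}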
Performance

SCTP Performance
 The SCTP stack is in its early stages and will improve over time.
 Performance is stack dependent (Linux lksctp stack << FreeBSD KAME stack)
   - SCTP bundles messages together, so it might not always be able to pack a full MTU
   - Comprehensive CRC32c checksum; offload to the NIC is not yet commonly available
Experiments
 MPBench Ping-pong comparison
 NAS Parallel benchmarks
 Task Farm Program

8 nodes, Dummynet, fair comparison:
same socket buffer sizes, Nagle disabled, SACK ON, no multihoming, CRC32c OFF.
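
For reference, the socket-level part of this setup can be applied roughly as follows (a sketch: SO_SNDBUF/SO_RCVBUF, TCP_NODELAY and SCTP_NODELAY are standard options, but the buffer size shown is illustrative rather than the value used in the experiments):

/* Sketch: apply the experiment's socket settings to both transports. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netinet/sctp.h>

void configure_socket(int sd, int is_sctp)
{
    int one = 1;
    int bufsize = 64 * 1024;   /* illustrative; same size for both transports */

    setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
    setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));

    if (is_sctp)   /* disable Nagle-style bundling delay on SCTP */
        setsockopt(sd, IPPROTO_SCTP, SCTP_NODELAY, &one, sizeof(one));
    else           /* disable Nagle on TCP */
        setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}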
Experiments: Ping-pong
[Chart: MPBench ping-pong test under no loss. Throughput of LAM_SCTP normalized to LAM_TCP, plotted against message size (bytes) up to 131069.]
Experiments: NAS
Experiments: Task Farm
 Non-blocking communication
 Overlap computation with communication
 Use of multiple tags
Task Farm - Short Messages
LAM_SCTP versus LAM_TCP for Farm Program (Message Size: Short, Fanout: 10)
Total run time (seconds):

    Loss Rate   LAM_SCTP   LAM_TCP
    0%          8.7        6.2
    1%          11.7       88.1
    2%          16.0       154.7
Task Farm - Head-of-line Blocking
LAM_SCTP 10-Streams versus LAM_SCTP 1-Stream for Farm Program (Message Size: Short, Fanout: 10)
Total run time (seconds):

    Loss Rate   10 Streams   1 Stream
    0%          8.7          9.3
    1%          11.7         11.0
    2%          16.0         21.6
Conclusions
 SCTP is a better match for MPI
   Avoids unnecessary head-of-line blocking due to use of streams
   Increased fault tolerance in presence of multihomed hosts
   Built-in security features
   Improved congestion control
 SCTP may enable more MPI programs to execute in LAN and WAN environments.
Future Work
 Release our LAM SCTP RPI module
 Modify real applications to use tags as streams
 Continue to look for opportunities to take advantage of standard IP transport protocols for MPI
Thank you!
More information about our work is at:
http://www.cs.ubc.ca/labs/dsg/mpi-sctp/
Or Google “sctp mpi”
Extra Slides
Associations and Multihoming
[Figure: same multihoming diagram as earlier: Endpoint X (NIC1, NIC2) and Endpoint Y (NIC3, NIC4) connected through networks 207.10.x.x and 168.1.x.x, with addresses 207.10.3.20 / 168.1.10.30 and 207.10.40.1 / 168.1.140.10.]
MPI over SCTP: Design and Implementation
 Challenges:
   Lack of documentation
   Code examination
    • Our document is linked off the LAM/MPI website
   Extensive instrumentation
    • Diagnostic traces
   Identification of problems in the SCTP protocol
MPI API Implementation

[Figure: layered view of the implementation. A receive request issued by the Application Layer enters the Receive Request Queue of the request progression layer in the MPI implementation/runtime; an incoming message received on the socket at the SCTP Layer is matched against posted requests or placed on the Unexpected Message Queue. The runtime distinguishes short messages from long messages.]
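
A simplified, illustrative sketch of the queue logic in the figure (the structures and names below are hypothetical, not LAM's actual data structures): an incoming message either completes a posted receive request or is parked on the unexpected-message queue.

/* Illustrative request-progression sketch (not LAM's actual code). */
#include <stddef.h>

struct envelope { int context, rank, tag; };

struct request {                 /* a posted receive (e.g. MPI_Irecv) */
    struct envelope env;
    int complete;
    struct request *next;
};

struct message {                 /* a message handed up by the SCTP layer */
    struct envelope env;
    struct message *next;
};

struct request *posted_q;        /* receive request queue     */
struct message *unexpected_q;    /* unexpected message queue  */

static int matches(const struct envelope *r, const struct envelope *m)
{
    return r->context == m->context &&
           (r->rank == m->rank /* or wildcard */) &&
           (r->tag  == m->tag  /* or wildcard */);
}

void on_incoming(struct message *m)
{
    for (struct request *r = posted_q; r != NULL; r = r->next) {
        if (!r->complete && matches(&r->env, &m->env)) {
            r->complete = 1;     /* receive request satisfied */
            return;
        }
    }
    m->next = unexpected_q;      /* no match: park on unexpected queue */
    unexpected_q = m;
}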
Partially Ordered User Messages Sent on Different Streams

[Figure: user messages, each carrying a stream number (e.g. 2, 2, 1, 2, 0), enter the SCTP layer, where they are fragmented into data chunks; the control chunk and data chunk queues are bundled into SCTP packets and passed to the IP layer.]
Added Security
[Figure: SCTP's four-way handshake between P0 and P1: INIT, INIT-ACK, COOKIE-ECHO, COOKIE-ACK. User data can be piggy-backed on the third and fourth legs. Caption: SCTP's use of a signed cookie.]
Added Security
 32-bit Verification Tag protects against reset attacks
 Autoclose feature (see the sketch below)
 No half-closed state
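
Of these, the autoclose feature is exposed as a socket option; a minimal sketch (lksctp-style API assumed; the 60-second idle period is illustrative):

/* Sketch: automatically close idle associations on a one-to-many socket. */
#include <sys/socket.h>
#include <netinet/sctp.h>

void enable_autoclose(int sd)
{
    int idle_seconds = 60;   /* illustrative idle period; 0 disables autoclose */
    setsockopt(sd, IPPROTO_SCTP, SCTP_AUTOCLOSE,
               &idle_seconds, sizeof(idle_seconds));
}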
NAS Benchmarks
 The NAS benchmarks approximate real world parallel scientific applications.
 We experimented with a suite of 7 benchmarks and 4 data set sizes.
 SCTP performance was comparable to TCP for large datasets.
Farm Program - Long Messages
LAM_SCTP versus LAM_TCP for Farm Program (Message Size: Long, Fanout: 10)
Total run time (seconds):

    Loss Rate   LAM_SCTP   LAM_TCP
    0%          79         129
    1%          786        3103
    2%          1585       6414
Head-of-line Blocking - Long Messages
LAM_SCTP 10-Streams versus LAM_SCTP 1-Stream for Farm Program (Message Size: Long, Fanout: 10)
Total run time (seconds):

    Loss Rate   10 Streams   1 Stream
    0%          79           79
    1%          786          1000
    2%          1585         1942
Experiments: Benchmarks
 SCTP outperformed TCP under loss for the ping-pong test.
[Chart: throughput (bytes/second) of the ping-pong test with 30K messages, SCTP versus TCP, at loss rates of 1% and 2%; y-axis up to 60000 bytes/second.]
Experiments: Benchmarks
 SCTP outperformed TCP under loss for the ping-pong test.
[Chart: throughput (bytes/second) of the ping-pong test with 300K messages, SCTP versus TCP, at loss rates of 1% and 2%; y-axis up to 6000 bytes/second.]