THE THIRD TRANSPORT

Andrey Shomer & Daniel Yudelevich

SCTP = Stream Control Transmission Protocol

SCTP is a new IETF standard transport
protocol (RFC2960)

An Alternative to TCP and UDP

UDP: bare minimum
 just port numbers, and an optional checksum
 no flow control, no congestion control, no
reliability or ordering

TCP: a package deal
 flow control, congestion control, byte-stream
orientation
 total ordering and total reliability

Almost everything you can do with TCP and
UDP

Plus the following features NOT available in UDP or
TCP
– Multi-homing
– Multi-streaming
– Message boundaries
– Improved SYN-flood protection
– Tunable parameters (Timeout, Retrans, etc.)
– A range of reliability and order (full to partial to none)
along with congestion control

Multi-homing improved robustness to failures
In TCP, connections made between <IP addr,port> and <IP addr, port>
If a host is multi-homed, you have to choose ONE IP Addr only, at
each end
If that interface goes down, so does the connection
With SCTP, you can list as many IP addresses per endpoint as you like
If host is still reachable through ANY of those addresses,
connection stays up!

Multi-streaming reduced delay
A.k.a. partial ordering. Eliminates Head of Line (HOL) blocking
In TCP, all data must be sent in order;
loss at head of line delays delivery of subsequent data
In SCTP, you can send over up to 64K independent streams,
each ordered independently
A loss on one stream does not delay the delivery on other streams
i.e. multi-streaming eliminates HOL blocking

Message boundaries preserved easier coding
TCP repacketizes data any old way it sees fit
(message boundaries not preserved)
SCTP preserves message boundaries
Application protocols easier to write, and application code simpler.

Improved SYN-flood protection more secure
TCP vulnerable to SYN flood;
(techniques to combat are "bags on the side")
Protection against SYN floods is built-in with SCTP
(Four way handshake (4WHS) vs 3WHS in TCP)
Listening sockets don't set up state until a connection is validated
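
The "no state until validated" idea can be sketched as follows. This is a minimal illustration only: the cookie layout (timestamp, port, address, HMAC), the lifetime value, and the names `make_cookie` and `check_cookie` are assumptions made for the example, not the cookie format of any real SCTP stack.

```python
import hashlib
import hmac
import os
import struct

SECRET = os.urandom(32)   # server-side key (assumption: never sent on the wire)
COOKIE_LIFETIME = 60.0    # seconds; illustrative value
MAC_LEN = 32              # SHA-256 digest size
FIXED_LEN = 10            # 8-byte timestamp + 2-byte port


def make_cookie(peer_ip: str, peer_port: int, now: float) -> bytes:
    """Build the signed cookie carried in INIT-ACK; the server stores nothing."""
    body = struct.pack("!dH", now, peer_port) + peer_ip.encode()
    return hmac.new(SECRET, body, hashlib.sha256).digest() + body


def check_cookie(cookie: bytes, peer_ip: str, peer_port: int, now: float) -> bool:
    """Validate the cookie returned in COOKIE-ECHO before allocating any state."""
    if len(cookie) < MAC_LEN + FIXED_LEN:
        return False
    mac, body = cookie[:MAC_LEN], cookie[MAC_LEN:]
    expected = hmac.new(SECRET, body, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return False  # forged or corrupted cookie
    issued, port = struct.unpack("!dH", body[:FIXED_LEN])
    same_peer = port == peer_port and body[FIXED_LEN:] == peer_ip.encode()
    return same_peer and 0 <= now - issued <= COOKIE_LIFETIME
```

Only after `check_cookie` succeeds would a listener allocate association state, which is why a flood of INITs from spoofed addresses costs it no memory.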

Tunable parameters (Timeout, Retrans, etc.) more flexibility
Tuning TCP parameters requires system admin privs, kernel changes,
kernel hacking
SCTP parameters can be tuned on a socket by socket basis

Congestion controlled unreliable/unordered data more flexibility
TCP has congestion control, but can't do unreliable/unordered
delivery
UDP can do unreliable/unordered delivery, but not congestion
controlled
SCTP is always congestion controlled, and offers a range of services
from full reliability to none, and full ordering to none.
With SCTP, reliable and unreliable data can be multiplexed over same
connection.

What SCTP does not offer:

byte-stream oriented communication

avoiding congestion control
 UDP lets you blast away, but SCTP won't let you

true on-the-wire connectionless communication
 connection setup required

[Figure: SCTP packet layout. Datalink Header (e.g. Ethernet, 802.11, PPP), IP Header, SCTP Common Header, one or more "chunks" (Chunk 1 ... Chunk N), Datalink Trailer (e.g. Ethernet, 802.11, PPP)]

[Figure: SCTP Common Header. Source Port, Destination Port, Verification Tag, CRC-32 Checksum]

Source and Destination Port: 16-bit port values
Verification Tag: 32-bit random value selected by
each endpoint in an association during setup
 Discriminates between two successive associations
 Protection mechanism against blind attackers
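
The common header fields listed above add up to 12 bytes (two 16-bit ports, a 32-bit tag, a 32-bit checksum). A minimal parsing sketch, assuming network byte order; the function name and the returned dictionary keys are invented for the example:

```python
import struct

# Source port, destination port, verification tag, checksum.
COMMON_HEADER = struct.Struct("!HHII")


def parse_common_header(packet: bytes) -> dict:
    """Split the first 12 bytes of an SCTP packet into its four fields."""
    if len(packet) < COMMON_HEADER.size:
        raise ValueError("packet shorter than the 12-byte common header")
    src, dst, vtag, checksum = COMMON_HEADER.unpack_from(packet)
    return {
        "source_port": src,
        "destination_port": dst,
        "verification_tag": vtag,
        "checksum": checksum,
    }
```

The checksum is only extracted here, not verified.
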

[Figure: chunk layout. Chunk Type, Chunk Flags, Chunk Length, Chunk Data]

Chunk Type: 8-bit value indicating the type of chunk
Chunk Flags: 8-bit flags, defined on per chunk type
basis
 Chunk Length: 16-bit length in bytes, including the
chunk type, chunk flags, and chunk length fields
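
Walking the chunks that follow the common header can be sketched like this. The 4-byte alignment of chunks (padding not counted in Chunk Length) comes from the SCTP specification rather than from the text above; `iter_chunks` is a name made up for the example.

```python
import struct

CHUNK_HEADER = struct.Struct("!BBH")  # chunk type, chunk flags, chunk length


def iter_chunks(payload: bytes):
    """Yield (type, flags, value) for every chunk in the bytes after the common header."""
    offset = 0
    while offset + CHUNK_HEADER.size <= len(payload):
        ctype, flags, length = CHUNK_HEADER.unpack_from(payload, offset)
        # Chunk Length covers the 4-byte chunk header plus the value.
        if length < CHUNK_HEADER.size or offset + length > len(payload):
            raise ValueError("malformed chunk length")
        yield ctype, flags, payload[offset + CHUNK_HEADER.size:offset + length]
        # Skip to the next 4-byte boundary; padding is not part of Chunk Length.
        offset += (length + 3) & ~3
```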



There are more than 20 chunk types currently
defined in SCTP
 (1) DATA (0x00)
 (2) INITIATION [INIT] (0x01)
 (3) INITIATION-ACKNOWLEDGMENT [INIT-ACK]
(0x02)
 (4) SELECTIVE-ACKNOWLEDGMENT [SACK]
(0x03)
 (5) HEARTBEAT (0x04)
 ... etc...


In TCP, the communication relationship
between two endpoints is called a connection
In SCTP, we would call this an association
 Socket pair: { <Local IP addr, port>, <Remote IP
addr, port> }

An SCTP association can be represented as a
pair of SCTP endpoints
 assoc = { [10.1.61.11 : 2223],
[161.10.8.221, 120.1.1.5 : 80] }
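
[Figure: association setup between Endpoint A and Endpoint Z. Messages exchanged: INIT, INIT-ACK, COOKIE-ECHO, COOKIE-ACK; each endpoint is marked "Association Is Up" at the end of the exchange]

[Figure: DATA chunk layout. Type=0x00, Flags=UBE, Length=variable, TSN Value, Stream Identifier, Stream Sequence Num, Payload Protocol Identifier, Variable Length User Data]
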
Flag Bits: U – Unordered Data, B – Begin, E – End (B and E are used for fragmentation)
TSN: Transmission Sequence Number, used for ordering, reassembly and
retransmission
Stream Identifier: the stream number for this DATA
Stream Sequence Number: orders this DATA chunk within the stream
Payload Protocol Identifier: opaque value used by the endpoints
User Data: the user message (or a portion of it)
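
A minimal sketch of decoding the DATA chunk fields described above. The field widths (32-bit TSN, 16-bit stream identifier, 16-bit stream sequence number, 32-bit payload protocol identifier) and the flag bit positions are taken from the SCTP specification, not from the text above; the class and function names are invented for the example.

```python
import struct
from dataclasses import dataclass

DATA_FIELDS = struct.Struct("!IHHI")    # TSN, stream id, stream seq num, PPID
FLAG_E, FLAG_B, FLAG_U = 0x01, 0x02, 0x04


@dataclass
class DataChunk:
    tsn: int
    stream_id: int
    stream_seq: int
    ppid: int
    unordered: bool   # U bit
    begin: bool       # B bit: first fragment of a user message
    end: bool         # E bit: last fragment of a user message
    user_data: bytes


def parse_data_chunk(flags: int, value: bytes) -> DataChunk:
    """Decode the value part of a DATA chunk (everything after the chunk header)."""
    tsn, stream_id, stream_seq, ppid = DATA_FIELDS.unpack_from(value)
    return DataChunk(
        tsn, stream_id, stream_seq, ppid,
        bool(flags & FLAG_U), bool(flags & FLAG_B), bool(flags & FLAG_E),
        value[DATA_FIELDS.size:],
    )
```

An unfragmented message carries both B and E set, so a receiver can tell from the flags alone whether reassembly is needed.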

When data is transferred in TCP, the user gets a stream of bytes
(not to be confused with SCTP streams)

Users must “frame” their own messages if they are not transferring
a stream of bytes (ftp might be considered an application that
sends a stream of bytes)

An SCTP user will send and receive messages. All message
boundaries are preserved

A user will always read either ALL of a message or in some cases
part of a message
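
To make the framing burden concrete, here is a minimal sketch of what a TCP application typically has to write itself: a 4-byte length prefix and a buffer that reassembles messages from arbitrary byte-stream pieces. The length-prefix scheme and the names `frame` and `Deframer` are assumptions for the example; with SCTP the transport does this job.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find its end."""
    return struct.pack("!I", len(message)) + message


class Deframer:
    """Rebuild whole messages from arbitrarily split byte-stream reads."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Add newly read bytes and return every message that is now complete."""
        self._buf += data
        messages = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack_from("!I", self._buf)
            if len(self._buf) < 4 + length:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return messages
```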

Streams are a powerful mechanism that
allows multiple ordered flows of messages
within a single association.

Messages are sent in their respective streams
and if a message in one stream is lost, it will
not hold up delivery of a message in the other
streams

The application specifies, through the API, the stream number
on which each message is sent
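
The effect of independent streams on delivery can be sketched with a toy receiver that keeps one reorder buffer per stream. This is an illustration under simplifying assumptions (no 16-bit sequence number wraparound, no unordered data); `StreamReceiver` is a name invented for the example.

```python
from collections import defaultdict


class StreamReceiver:
    """Deliver messages in order within each stream, independently of other streams."""

    def __init__(self):
        self._next_ssn = defaultdict(int)   # stream id -> next expected sequence number
        self._pending = defaultdict(dict)   # stream id -> {sequence number: message}

    def receive(self, stream_id: int, ssn: int, message: str) -> list:
        """Store an arriving message and return whatever is now deliverable."""
        pending = self._pending[stream_id]
        pending[ssn] = message
        delivered = []
        # Release the longest in-order run for this stream only.
        while self._next_ssn[stream_id] in pending:
            delivered.append(pending.pop(self._next_ssn[stream_id]))
            self._next_ssn[stream_id] += 1
        return delivered
```

If message 0 of stream 0 is lost, later messages on stream 0 wait for its retransmission, while messages on stream 1 are delivered as they arrive.
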

[Figure: two multi-homed endpoints, Endpoint-1 and Endpoint-2, each with network interfaces NI-1 and NI-2, connected through two IP networks]

When a peer is multi-homed, a “primary destination address” is
selected by the SCTP endpoint
 By default, all data is sent to this primary address.
 When the primary address fails, the sender selects an alternate
address and uses it until the primary is restored or the user changes
the primary address.
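
The selection rule above can be sketched as a small function. The function name, its arguments, and the "first reachable alternate" policy are assumptions for the example; real implementations may choose among alternates differently.

```python
def choose_destination(primary, addresses, reachable):
    """Return the address to send to, or None if the peer is unreachable.

    primary   -- the current primary destination address
    addresses -- all addresses of the peer, in preference order
    reachable -- dict mapping address -> True/False (current path state)
    """
    if reachable.get(primary, False):
        return primary  # default case: all data goes to the primary
    for addr in addresses:
        if addr != primary and reachable.get(addr, False):
            return addr  # primary failed: fall back to an alternate
    return None  # no address is reachable


peer = ["161.10.8.221", "120.1.1.5"]
print(choose_destination("161.10.8.221", peer,
                         {"161.10.8.221": False, "120.1.1.5": True}))
```

Because the function is re-evaluated for each send, traffic returns to the primary as soon as it is marked reachable again.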


SCTP has two methods of detecting faults:
 Heartbeats
 Data retransmission thresholds

Two types of faults can be discovered:
 An unreachable address
 An unreachable peer

[Figure: three diagrams of Endpoint-1 and Endpoint-2, each with interfaces NI-1 and NI-2 connected through two IP networks: one with a single failure marked X, one with two failures marked X, and one with no failure]


A HEARTBEAT is sent to any destination address that
has been idle for longer than the heartbeat period

A destination address is idle if no chunks that can be
used for RTT updates have been sent to it
 e.g. usually DATA and HEARTBEAT

The heartbeat period timer is reset any time a DATA
or HEARTBEAT is sent

The peer responds with a HEARTBEAT-ACK

Just after association setup, Heartbeats will occur at a
faster rate to “confirm” addresses


Many implementations of the SCTP protocol exist. Worth mentioning are
the kernel implementations in FreeBSD (by Randall Stewart and
Peter Lei of Cisco) and in Linux (kernels 2.6+, sponsored by
Motorola, Nokia, and IBM). Both support a standard
TCP/UDP socket interface as well as a linkable library that provides
more advanced SCTP features.
An alternative also exists: SCTPLIB is a non-commercial project
(released under GNU) that runs on FreeBSD, Linux, Mac OS X,
Solaris and even Windows. However, being a user-level
implementation, it requires the SCTP server to have special
privileges, and all SCTP applications must register with it and
communicate through the interprocess communication (IPC) mechanism
implemented in its source code, which may affect performance
in certain high-load scenarios.



SCTP load sharing can potentially provide
significant increases in transport protocol
performance and network efficiency.
The article compares two existing load
sharing options and introduces a new
one
Note: the comparison covers transport-level
load sharing only





Suggested by Ahmed Abd El Al, Tarek N. Saadawi and Myung J. Lee in
ComCom 27-11 (2004)
Adds new SCTP chunk types and additional association setup
parameters for their load sharing extension to SCTP.
Adds path-related sequence numbers and time stamps in
new SCTP data chunks and acknowledgments
Simplifies the handling of SCTP packets when load sharing is
used; however, it is not compatible with current SCTP
implementations
Overhead is bigger, while the same data could be extracted from
sender information and correct interpretation of the selective ACKs
received by the sender.






Suggested by Janardhan R. Iyengar, Armando L. Caro, Jr., Paul D. Amer, Gerard J. Heinz
(University of Delaware) and Randall R. Stewart (Cisco, the same person who implemented the
FreeBSD kernel SCTP)
When the cwnd of the current link does not allow any more data, the sender switches links and
sends data via the 2nd link (until all cwnds are fully used).
A problem can occur when the receiver notices gaps in the data and lets the sender know:
this can trigger multiple unnecessary fast retransmissions and
needlessly reduce the cwnd size (which significantly affects the throughput).
CMT deals with this issue by increasing the gap counter only when the data chunk is
reported missing by an incoming SACK and higher TSNs are already acknowledged for
the path on which that chunk was sent.
Allows faster congestion window growth using a new field: CTSNA (stores the highest TSN
which was transmitted on the path without discontinuity)
Delays selective ACKs appropriately so that unnecessary SACKs are not transmitted; flags
are used to ensure retransmissions are triggered in time




Published by Andreas Jungmaier and Erwin P. Rathgeb
in 2006 (Telecommun Syst 31)
Proposes Path Based SACKs (PB-SACK), meaning that
instead of keeping one SACK counter per association
(as in CMT) there is one for every link.
The SCTP receiver sends one SACK for every two
PB-SACKs; it may still send 2 SACKs in the
situation described for CMT, yet a single SACK per two
data chunks on average is still better.
Could be described as an improvement to the CMT
protocol










For all paths d(i), set the flag d(i).saw_new_path_sack = FALSE.
For all paths d(i), set the flag d(i).new_pCTSNA = FALSE.
For any path d(i) for which a data chunk has been newly acknowledged, set the flag
d(i).saw_new_path_sack = TRUE.
For any d(i) for which d(i).saw_new_path_sack = TRUE, find the highest TSN newly
acknowledged. Store this value in d(i).highest_path_tsn_acked.
For any d(i) for which d(i).saw_new_path_sack = TRUE, store the number of bytes newly
acknowledged in d(i).newly_acked_bytes.
For any d(i) for which d(i).saw_new_path_sack = TRUE, find the corresponding d(i).pCTSNA.
If d(i).pCTSNA was advanced by the SACK that is being processed, set the flag
d(i).new_pCTSNA = TRUE.
For any d(i) for which d(i).new_pCTSNA = TRUE, and for which the number of outstanding bytes is
higher than the congestion window d(i).cwnd, increase the congestion window as required in
sections 7.2.1 and 7.2.2 of RFC 2960, e.g., in slow start, if the number of outstanding bytes on d(i)
exceeds d(i).cwnd, the congestion window is increased by
d(i).cwnd += min(d(i).newly_acked_bytes, d(i).pMTU).
If the SACK chunk contains gap reports, check for any data chunk t remaining in the
retransmission queue that is reported missing and was sent to path d_t, whether
d_t.saw_new_path_sack = TRUE and t < d_t.highest_path_tsn_acked. If so, increase the gap counter
for t.
If this counter reaches the threshold (e.g., 4), perform a fast retransmission as per Section 7.2.4 of
RFC 2960.
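
The gap-counter part of these steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: pCTSNA tracking and congestion window growth are left out, a SACK is modelled as the set of all TSNs it acknowledges, and the class and function names are invented for the example.

```python
from dataclasses import dataclass

FAST_RTX_THRESHOLD = 4  # the example threshold from the last step


@dataclass
class Chunk:
    tsn: int
    path: str            # path the chunk was sent on
    size: int            # bytes
    acked: bool = False
    gap_count: int = 0


@dataclass
class Path:
    saw_new_path_sack: bool = False
    highest_path_tsn_acked: int = -1
    newly_acked_bytes: int = 0


def process_sack(paths: dict, queue: list, acked_tsns: set) -> list:
    """Apply one SACK and return the TSNs that should be fast-retransmitted."""
    for p in paths.values():                      # reset per-SACK state
        p.saw_new_path_sack = False
        p.highest_path_tsn_acked = -1
        p.newly_acked_bytes = 0
    for c in queue:                               # record new acks per path
        if not c.acked and c.tsn in acked_tsns:
            c.acked = True
            p = paths[c.path]
            p.saw_new_path_sack = True
            p.highest_path_tsn_acked = max(p.highest_path_tsn_acked, c.tsn)
            p.newly_acked_bytes += c.size
    fast_rtx = []
    for c in queue:                               # count gaps per path only
        p = paths[c.path]
        if not c.acked and p.saw_new_path_sack and c.tsn < p.highest_path_tsn_acked:
            c.gap_count += 1
            if c.gap_count == FAST_RTX_THRESHOLD:
                fast_rtx.append(c.tsn)
    return fast_rtx
```

A chunk still in flight on a slow path is not counted as missing merely because chunks with higher TSNs arrived over a faster path; only new acknowledgments on its own path raise its gap counter.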



Simulated in OPNET
Two 34 Mbit/s links (as in E3)
Path 1: 10 ms delay; Path 2: variable delay
between 10 ms and 200 ms

Note: ICMP is taken into account


Define the parameter rd as delay2/delay1.
Throughput: For rd<9 we benefit from multi-homing
over the two otherwise identical links (notice that PB-SACK is
still better than CMT throughout the graph). Beyond that point
it is counter-productive to use multi-homing: both algorithms reach
a state where they can no longer guarantee an effective load
distribution. At this point both algorithms rely almost
exclusively on path 1, and the throughput is limited
almost exclusively by the receiver window, causing an
increased message delay.
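
As a quick worked example of the ratio, the sketch below converts the rd threshold into a Path 2 delay for the simulated setup. The 10 ms Path 1 delay and the value 9 are taken from the text above; the function and variable names are invented for the example.

```python
def delay_ratio(delay1_ms: float, delay2_ms: float) -> float:
    """rd as defined above: delay of path 2 divided by delay of path 1."""
    return delay2_ms / delay1_ms


DELAY1_MS = 10.0      # Path 1 delay in the simulation setup
RD_BREAK_EVEN = 9.0   # rd below which multi-homing still pays off

# Path 2 delay that corresponds to the break-even ratio in this setup.
break_even_delay2_ms = RD_BREAK_EVEN * DELAY1_MS

for delay2_ms in (10.0, 50.0, 90.0, 200.0):
    rd = delay_ratio(DELAY1_MS, delay2_ms)
    print(f"delay2={delay2_ms:.0f} ms  rd={rd:.1f}  load sharing helps: {rd < RD_BREAK_EVEN}")
```

With Path 1 fixed at 10 ms, the simulated range of Path 2 delays covers rd from 1 to 20, and the break-even ratio corresponds to a Path 2 delay of 90 ms.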



CWND: As expected, the total congestion window is higher for PB-SACK than for CMT
(since the throughput is higher at the same delay); however, this is not the case for rd>10.
At rd=15 PB-SACK's CWND is drastically lower, yet its throughput is
higher by almost 500KB/s. This shows that the CMT approach was not
strictly correct and that the aggregate CWND is not the biggest
factor in throughput. The more the links differ, the
clearer it becomes that per-path ACKs give better
results than just optimizing the aggregate CWND size.
To summarize: on similar links CMT shows great results (and
LS-SCTP should not be considered, since it includes incompatible
protocol extensions), while PB-SACK proves superior on
inhomogeneous links.