Make Protocol Ready for Gigabit (Protocol Dependent)



Make Protocol Ready for Gigabit
Scopes
• In this presentation, we present various
protocol design and implementation techniques
that allow a protocol to function correctly at
Gbps speeds and to deliver Gbps performance to the user
application or the system output.
• (In the previous presentation, we presented
operating system design and implementation
techniques for supporting Gbps networks.)
Protect Against Wrapped
Sequence Number
Sequence Number Wrapping Around
• TCP uses sequence numbers to help it detect packet loss and
duplicate packets.
• TCP’s sequence number is in bytes, rather than in packets.
The length of the sequence number field is 32 bits.
• On a gigabit network, it takes only about 34 seconds to
wrap around the sequence number! The TTL field in the IP
header only limits the maximum number of hops a packet
can traverse. It does not limit the maximum amount of time
a packet can stay in the network.
• Wrapping around a sequence number can make comparisons
of the freshness of two sequence numbers give the wrong
result. This can have very bad effects.
Problem 1: Sequence Number
Drops to Zero
• Suppose that the sequence number field is n bits
long. Then, as the sequence number grows from
0, 1, … up to 2^n − 1, the next increment wraps it
back to zero.
• As a result, when we compare two sequence
numbers a and b, where b is the more recent one,
the comparison result may be wrong.
• The effect of the wrong comparison is that a more
recent packet carrying b will be rejected and
discarded, because it is considered older than the
packet (carrying a) that has already been received.
Sequence Number Wheel to
Avoid Problem 1
• To avoid the comparison problem, we can use a
sequence number wheel scheme.
• The sequence number space (N = 2^n) is divided
into two parts, each of which is N/2 large.
• The division line is not fixed. It floats with
the sequence number (e.g., a) to be compared against.
• One part represents all the sequence numbers that
are considered larger than a:
– a < b if (|a − b| < N/2 and a < b), or (|a − b| > N/2 and a > b)
• The other part represents all the sequence numbers
that are considered smaller than a:
– Otherwise.
Sequence Number Wheel
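The half-space comparison described above reduces to a single signed subtraction modulo 2^32, which is how BSD's SEQ_LT()/SEQ_GT() macros implement it. A minimal sketch (seq_lt is our own name for the helper):

```c
#include <stdint.h>

/* Sequence number wheel comparison: a is "older" than b when the
 * difference (a - b), computed modulo 2^32 and interpreted as a
 * signed 32-bit value, is negative.  This splits the 2^32 space
 * into two N/2 halves centered on a, exactly as described above. */
static int seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

With this test, a sequence number such as 0xFFFFFFF0 correctly compares as older than 5, even though 5 is numerically smaller.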
Problem 2: Sequence Number
Wraps and Grows Up
• On a Gbps network, in just about 34 seconds, a sequence
number can wrap and grow back to almost the same
value (e.g., a -> 0 -> a − 1).
• This means that an outdated packet (carrying a) that
has stayed in the network for a long time (e.g., 34
seconds) may look exactly like the next packet that
the receiver expects to receive (because the last
packet received carries a − 1).
• This problem may result in a corrupted received file.
• This problem cannot be solved by the sequence
number wheel scheme.
PAWS Used to Detect Problem 2
• PAWS (Protect Against Wrapped Sequence numbers:
RFC 1323) is a scheme used in tcp_input() to detect
problem 2.
• PAWS is based on the premise that, even on a
high-speed network, the 32-bit timestamp value wraps
around at a much lower frequency than the 32-bit
sequence number.
– The TCP timestamp option is assumed to be present in the
TCP header.
– In FreeBSD 4.x, one timestamp tick represents 1 ms.
Therefore, it takes about 24.8 days (roughly 596 hours)
for the timestamp to change its sign bit.
PAWS Used to Detect Problem 2
• Therefore, when tcp_input() receives a packet, it
first checks whether the new packet’s timestamp is
older (smaller) than the timestamp of the most
recently received packet. If it is, the packet is dropped.
• There is one exception: if the TCP connection has been
idle for more than 24 days, the timestamp may have
legitimately wrapped around, which would make the
comparison wrong.
– In this case, the packet is not dropped.
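The check above can be sketched as follows. This is a simplified illustration, not FreeBSD's actual code: the function and parameter names are ours, and the idle-time argument stands in for the bookkeeping the real tcp_input() does.

```c
#include <stdint.h>

/* Idle time after which the 1 ms timestamp clock may have wrapped
 * (about 24 days, per the discussion above). */
#define PAWS_IDLE_MAX_MS (24LL * 24 * 60 * 60 * 1000)

/* Return 1 if the segment should be rejected by the PAWS test:
 * its timestamp ts_val is strictly older than ts_recent (the last
 * valid timestamp seen on this connection) and the connection has
 * not been idle long enough for the timestamp clock to have wrapped. */
static int paws_reject(uint32_t ts_val, uint32_t ts_recent, int64_t idle_ms)
{
    if ((int32_t)(ts_val - ts_recent) < 0 && idle_ms < PAWS_IDLE_MAX_MS)
        return 1;   /* older timestamp on a live connection: drop it */
    return 0;       /* fresh timestamp, or possible wrap after a long idle */
}
```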
TCP Checksum Offloading
Turn off Checksum Computation
• Computing checksum is very expensive.
– Every byte of a packet needs to be read from memory
into the CPU, added, and then written from the CPU
back to memory.
– CPU cycles are wasted.
– The bandwidth of the CPU-memory bus or the memory
system (depending on which one is smaller) is wasted.
– Therefore, on Gbps networks, how to avoid or reduce
the checksum cost becomes an important topic.
• Solution 1: Do not calculate checksum at all
– E.g., right now on FreeBSD 4.x, you can turn off UDP
packets’ checksum computation.
Checksum Offloading
• Solution 2: Let the network interface card do the
checksum computation. (IP header checksum and TCP
data payload checksum)
• Nowadays almost every Gigabit Ethernet NIC supports
computing checksum on the NIC. For example, the
3COM if_ti.c driver.
• To take advantage of this hardware function, the NIC
device driver needs to communicate with TCP code so
that tcp_output() and tcp_input() know whether they
should compute checksums.
• Checksum offloading carries a risk: errors that occur
between the TCP layer and the device driver cannot be
detected by the receiver!
Free Checksum Computation on
some RISC Processors
• Sometimes, on a computer system with a RISC
processor, the checksum computation can be
performed without any cost.
• Some researchers observed that most RISC
processors can perform two instructions per clock
cycle, of which only one operation can be loading
data from memory or storing data to memory.
• Thus, there is room for two more instructions in the
following copy-loop instructions:
– Load  [%r0], %r2
– Store %r2, [%r1]
Free Checksum Computation on
some RISC Processors
• As a result, if we add two instructions that calculate
checksum in between these two load and store
instructions, the checksum can be calculated for free.
(No other work can be done in between the load and
store instructions anyway)
– Load  [%r0], %r2
– Add   %r2, %r5    ! Add to running sum in r5
– Addc  #0, %r5     ! Add carry into r5
– Store %r2, [%r1]
• This example also shows that programmed I/O is not
always worse than DMA.
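The same copy-and-checksum idea can be sketched in C: fold the 16-bit one's-complement Internet checksum into the copy pass so each word crosses the CPU only once. This is an illustrative sketch (the function name is ours; real stacks use hand-tuned assembly, and a complete checksum routine also handles odd-length buffers and byte order):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 16-bit words from src to dst while accumulating the
 * one's-complement sum, then fold the carries and return the
 * complemented checksum -- the "Load / Add / Addc / Store" loop in C. */
static uint16_t copy_and_cksum(uint16_t *dst, const uint16_t *src, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint16_t w = src[i];
        dst[i] = w;      /* the copy (Store) */
        sum += w;        /* the checksum (Add) */
    }
    while (sum >> 16)    /* fold carries back in (Addc) */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```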
TCP Header Prediction for
Gigabit Networks
TCP Implementation is Complicated
• In FreeBSD 4.2, tcp_input.c has 2797 lines of C code
and tcp_output.c has 939 lines. There are 339 lines of
“if” statements in tcp_input.c and 126 lines in
tcp_output.c.
• These numbers show:
– TCP processing is complicated.
– TCP input processing is more complicated than TCP output
processing.
• Previously, we presented that locality is important for
good cache performance.
• Also, because conditional branches can greatly hurt the
performance of a pipelined CPU, their bad effect
should be minimized.
TCP Header Prediction
• Header prediction looks for packets that fit the profile
of the packets that the receiver expects to receive next.
• If a packet meets the header prediction condition, it
will be handled in just a few instructions. Otherwise,
it will be handled by a general-processing code.
• Actually, this is an old design principle – optimize for
the common case!
• TCP header prediction scheme can improve TCP
transfer throughput because it improves instruction
locality, which can improve cache performance.
Two Common Cases
• If TCP is sending data, the next expected segment for
this connection is an ACK for outstanding data.
• If TCP is receiving data, the next expected segment for
this connection is the next in-sequence data segment.
• On a LAN, where packet losses are rare, header
prediction succeeds between 97% and 100% of the time.
On a WAN, where packet losses are more likely, the
success rate drops to between 83% and 99%.
• The code for processing these two common cases is
placed at the beginning of tcp_input(). This results in a
better cache performance.
– There is no information about how well this scheme can
improve TCP transfer throughput though.
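The receive-side fast-path test can be sketched as follows. This is a simplified illustration: the real predicate in tcp_input() checks more conditions (connection state, no urgent data, an empty reassembly queue, and so on), and the struct and function names here are ours.

```c
#include <stdint.h>

#define FL_ACK 0x10   /* TCP ACK flag bit */

/* Minimal view of the header fields the prediction looks at. */
struct seg {
    uint32_t seq;     /* sequence number */
    uint16_t flags;   /* TCP flags */
    uint16_t win;     /* advertised window */
};

/* A segment fits the receive-side fast path when it carries only the
 * ACK flag, is exactly the next in-sequence segment expected
 * (seq == rcv_nxt), and does not change the advertised window. */
static int hdr_predict(const struct seg *s, uint32_t rcv_nxt, uint16_t last_win)
{
    return s->flags == FL_ACK && s->seq == rcv_nxt && s->win == last_win;
}
```

Any segment failing this cheap test simply falls through to the general-processing code, so the common case stays on a short, cache-friendly path.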
Measuring RTTs on Gigabit
Networks
Measuring RTTs Is Important
• To more efficiently retransmit lost packets, the
RTT of a connection should be correctly and
precisely measured.
– When a packet is lost, we do not want to wait
unnecessarily long before resending it.
• To increase the accuracy of measurements, the
first step is to use a high-resolution clock.
– Before FreeBSD 3.0, the clock resolution for TCP RTT
measurement was 500 ms.
– Now it is 1 ms.
• The second step is to use more RTT measurement
samples to calculate the average RTT.
Timer Management
• Timers are extensively used to measure RTTs.
– When a packet is sent, a timer is started. When the
corresponding ACK returns, the timer is stopped. The elapsed
time of the timer represents one RTT sample.
• The above approach works on a low-speed network such as
10 Mbps. On a gigabit network, where a 1500-byte packet is sent
every 12 microseconds, this approach is infeasible.
• To reduce the high frequency of timer setup/cancel operations, the
original solution is to take one RTT sample per RTT, rather than
one RTT sample per sent packet.
– Simple to implement: just do not set up another timer until the
previous timer has been cancelled.
– However, the accuracy suffers.
TCP Timestamp Option
• To get a RTT sample for every sent packet while
avoiding the need to setup/cancel timers at a high
frequency, TCP uses a timestamp option (RFC 1323).
• The sender places a timestamp in every sent segment.
The receiver sends the timestamp back in the ACK.
This allows the sender to calculate the difference and
use it as the RTT sample.
– This option must be supported by both the TCP sender
and the receiver.
– (The original timer-based design, in contrast, needed
no support from the receiver.)
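With the option in place, one RTT sample per ACK costs a single subtraction. A sketch assuming 1 ms timestamp ticks, as in FreeBSD 4.x (the function name is ours):

```c
#include <stdint.h>

/* One RTT sample: the current timestamp-clock value minus the
 * timestamp echoed back in the ACK.  Unsigned 32-bit arithmetic
 * gives the right answer even if the timestamp clock wraps
 * between sending the segment and receiving the ACK. */
static uint32_t rtt_sample_ms(uint32_t now_ms, uint32_t echoed_ts)
{
    return now_ms - echoed_ts;
}
```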
TCP Window Scale Option for
Gigabit Networks
TCP Maximum Throughput
• A TCP connection’s maximum achievable
throughput is limited by the minimum of the TCP
sender’s socket send buffer and the TCP receiver’s
socket receive buffer.
– Min(socket send buffer on the sender, socket receive
buffer on the receiver) / RTT.
• Although we can use setsockopt() to enlarge the
socket send and receive buffer to a big value, the
advertised window field in the TCP header is only
16 bits, which means a maximum window size of
64 KB only.
– On gigabit networks, clearly this is not enough.
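To see how limiting this is, plug numbers into the formula above. The 64 KB window is from the slides; the 50 ms RTT is our illustrative figure:

```c
/* Maximum achievable TCP throughput in bits per second: at most one
 * window of data can be outstanding per round trip, i.e. window * 8 / RTT. */
static double max_throughput_bps(double window_bytes, double rtt_s)
{
    return window_bytes * 8.0 / rtt_s;
}
```

For a 65535-byte window and a 50 ms RTT this gives roughly 10.5 Mbit/s, about 1% of a gigabit link.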
TCP Window Scale Option
• In this scheme, the definition of the TCP window is
enlarged from 16 to 32 bits.
• The window field in the header still uses 16 bits, but
an option is defined that applies a scaling operation to
the 16-bit value.
– During the 3-way handshaking phase, this option is carried
in the SYN and SYN+ACK packets to indicate whether
the option is supported.
• In TCP implementation, the real window size is
internally maintained as a 32-bit value.
• The shift field in the option is 1 byte long.
– 0 means no scaling is performed.
– 14 is the maximum, allowing windows of up to
64 KB * 2^14 (about 1 GB).
TCP Window Scale Option
• This option can only appear in a SYN packet. Therefore, the scale
factor is fixed in each direction when the connection is established.
• The shift count is automatically calculated by TCP, based on the
size of the socket receive buffer.
– Keep dividing the buffer size by 2 until the resulting number is
less than 64 KB.
• Each host thus maintains two shift counts – S for sending and R for
receiving.
– Every 16-bit advertised window that is received from the other
end is left-shifted by R bits to obtain the real advertised window.
– When we need to send a window advertisement to the other end,
the real window size is right-shifted by S bits, and the resulting
16-bit value is placed in the TCP header.
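The shift-count calculation and the two scaling directions described above can be sketched as follows (the function names are ours; the cap of 14 comes from RFC 1323):

```c
#include <stdint.h>

/* Halve the socket receive buffer size until it fits in the 16-bit
 * window field, counting the halvings; cap the count at 14. */
static int wscale_shift(uint32_t rcv_buf)
{
    int shift = 0;
    while (shift < 14 && (rcv_buf >> shift) > 65535)
        shift++;
    return shift;
}

/* A 16-bit window received from the peer is left-shifted by R bits
 * to recover the real advertised window... */
static uint32_t win_from_wire(uint16_t hdr_win, int r)
{
    return (uint32_t)hdr_win << r;
}

/* ...and a real window is right-shifted by S bits before being
 * placed in the 16-bit header field. */
static uint16_t win_to_wire(uint32_t real_win, int s)
{
    return (uint16_t)(real_win >> s);
}
```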
Congestion Control on Gigabit
Networks
Congestion Control
• Congestion control is more difficult on Gigabit
networks.
– Although the absolute control delay (a connection’s RTT)
may remain about the same regardless of the link bandwidth
(10, 100, or 1000 Mbps), the cost of that control delay
becomes higher on higher-bandwidth networks.
– Why? During the same RTT, a larger amount of data has been
injected into the network, before the control packet arrives at
the traffic source to reduce its sending rate.
– For example, in Gigabit Ethernet 802.3x PAUSE flow control
scheme, one RTT is required for the pause packet to take
effect. Therefore, as the bandwidth increases, more data will
be sent before the congestion control starts to take effect.
– The result is that congestion control becomes less and less
effective on high-bandwidth networks. (No solution!)
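The cost of one RTT of control delay can be quantified as the bandwidth-delay product: the amount of data already in flight before a pause or congestion signal takes effect. A small illustration (the 1 ms RTT is our example figure):

```c
/* Bytes injected into the network during one control delay:
 * bandwidth (bits/s) / 8 * RTT (s).  At a 1 ms RTT this is about
 * 1.25 KB at 10 Mbps but about 125 KB at 1 Gbps -- the same delay
 * costs 100 times more data on the faster link. */
static double bytes_in_flight(double bandwidth_bps, double rtt_s)
{
    return bandwidth_bps / 8.0 * rtt_s;
}
```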