TCP/IP and Other Transports for
High Bandwidth Applications
Back to Basics
Richard Hughes-Jones
The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Brasov”
Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester
Structure of the Talks
The aim is to give you a picture of how researchers are using high performance
networks to support their work.
- Back to Basics
  - Simple Introduction to Networking
- TCP/IP on High Bandwidth Long Distance Networks
  - But TCP/IP works!
  - The effect of packet loss
  - Advanced TCP Stacks
  - Fairness
- Real Applications on Real Networks
  - Disk-2-disk applications on real networks
  - Memory-2-memory tests
  - Transatlantic disk-2-disk at Gigabit speeds
- Remote Computing Farms
  - The effect of distance
- Radio Astronomy e-VLBI
Thanks for allowing me to use their slides to:
Sylvain Ravot CERN, Les Cottrell SLAC, Brian Tierney LBL, Robin Tasker DL
Simple Introduction to Networking
What is a Protocol Stack?
- The ISO OSI (Open Systems Interconnection) Seven Layer Model defines a
  framework allowing development of real network protocols
- A layer:
  - performs unique and specific tasks
  - only has knowledge of those layers immediately above and below
  - uses services of the layer below, and provides services to the layer above
  - the services defined by a layer are implementation independent:
    it’s a definition of how things work
- Each layer conceptually communicates with its peer in the remote system
The Layering Principle
- Encapsulation:
  - Each protocol layer N adds a Header to the data unit from layer N+1
  - Header contains control information

Layer 7: Application (user processes)
    App data
Layer 6: Presentation (data interpretation, code transformation)
    PH | App data
Layer 5: Session (connection, negotiation control)
    SH | PH | App data
Layer 4: Transport (end-2-end data transfer & integrity; packet sequencing, flow control)
    Segment: TH | SH | PH | App data
Layer 3: Network (addressing, routing; packet sequencing, flow control)
    Packet: NH | TH | SH | PH | App data
Layer 2: Data Link (packet assembly/disassembly; transmission control, error checking)
    Frame: DH | NH | TH | SH | PH | App data | FCS
Layer 1: Physical (electrical, optical, mechanical)
Bits on the “wire”
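The encapsulation idea can be sketched in a few lines of Python. This is only an illustration: the two-letter byte strings stand in for real headers, and the function name `encapsulate` is made up for this example.

```python
# Illustrative only: each "header" is a placeholder byte string,
# not a real protocol header.
def encapsulate(app_data: bytes) -> bytes:
    """Wrap application data layer by layer, as in the layering figure."""
    pdu = b"PH" + app_data            # Layer 6 adds the presentation header
    pdu = b"SH" + pdu                 # Layer 5 adds the session header
    segment = b"TH" + pdu             # Layer 4 adds the transport header
    packet = b"NH" + segment          # Layer 3 adds the network header
    frame = b"DH" + packet + b"FCS"   # Layer 2 adds a header and a trailer
    return frame


print(encapsulate(b"hello"))  # b'DHNHTHSHPHhelloFCS'
```

Each layer treats everything handed down from above as opaque payload; only the peer layer at the far end interprets the matching header.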
What do the Layers do?
- Transport Layer: acts as a go-between for the user and network
  - Provides end-to-end data movement & control
  - Gives the level of reliability/integrity needed by the application
  - Can ensure a reliable service (which the network layer cannot),
    e.g. assigns sequence numbers to identify “lost” packets
- Network Layer: deals with logical addressing & the transmission of
  packets, and provides the mechanism for routing.
- Data Link Layer: provides the synchronization and error checking for
  the data transmitted over a single physical link
  (may ensure correct delivery of frames)
  - Going down: fits packets from the network layer above into frames.
  - Going up: groups bits from the physical layer into frames.
- Physical Layer: concerned with the transmission of individual bits.
How do the “IP” Protocols fit together?
[Figure: the “IP” protocols arranged by layer]
Application (Presentation, Session):
    File Transfer Protocol (FTP) RFC 959; TELNET RFC 854;
    Simple Mail Transfer Protocol (SMTP) RFC 821; POP3/IMAP; HTTP; ssh; DNS;
    SNMP RFC 1157; TFTP RFC 783; NFS RFC 1014, 1057 and 1094; ping; traceroute
Transport:
    Transmission Control Protocol (TCP) RFC 793; User Datagram Protocol (UDP) RFC 768
Network:
    Internet Protocol (IP) RFC 791; Internet Control Message Protocol (ICMP) RFC 792;
    Routing: OSPF, BGP; Address Resolution Protocols: ARP RFC 826, RARP RFC 903
Data Link:
    Network Interface Cards: Ethernet, Token Ring, ISDN, FDDI, SMDS, ATM, SDH/SONET, xDSL
Physical (Transmission Mode):
    TP Copper, Fibre Optic, Satellite, Microwave, DWDM, CWDM etc.
Some of the “IP” Protocols
- Transmission Control Protocol. TCP provides application programs
  access to the network using a reliable, connection-oriented
  transport layer service.
- User Datagram Protocol. UDP provides an unreliable, connectionless
  delivery service using the IP protocol to transport messages
  between machines. It adds the ability to distinguish among multiple
  destinations on a single host computer.
- Internet Protocol. IP receives datagrams from the upper-layer
  software and transmits them to the destination host using a
  best-effort, connectionless delivery service.
- Internet Control Message Protocol. ICMP allows internet routers to
  transmit error messages and test messages.
- Internet Group Management Protocol. IGMP is used with multicast:
  hosts use it to join and leave the groups to which multicast
  (typically UDP) datagrams are sent.
- Address Resolution Protocol. ARP translates a 32 bit IP
  address into a 48 bit LAN address.
- Reverse Address Resolution Protocol. RARP translates a
  48 bit LAN address into a 32 bit IP address.
The Physical Layer 1: Ethernet
The Link Layer 2: Ethernet Frame
[Figure: Ethernet frame: Frame header | IP Datagram | FCS, with a 12 byte Inter Frame Gap between frames]
- Preamble: 56 bits of alternating 1s and 0s. The preamble gives all the
  nodes on the network a signal against which to synchronize.
- Start Frame Delimiter: marks the start of a frame. It is 8 bits long
  with the pattern 10101011.
- Media Access Control (MAC) Address: every Ethernet network card has, built
  into its hardware, a unique six-octet (48-bit) number, usually written in
  hexadecimal, that differentiates it from all other Ethernet cards in the
  universe. The destination address (DA) and source address (SA) define the
  path across the link.
- Length/Type field: two octets long.
  - If the value <= 1500 (0x05DC) it indicates the length of the data
  - If the value >= 1536 (0x0600) it indicates the network-layer protocol: “Ethernet Types”
- Data: the reason the frame exists. Its size is limited by the
  MTU (Maximum Transmission Unit).
- Frame Check Sequence: protects the frame contents.
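As a rough sketch, the header fields described above can be pulled out of a captured frame with Python's `struct` module. The function name and the sample bytes are assumptions for illustration; the preamble, start frame delimiter and FCS are assumed to have been stripped already, as they normally are by the NIC.

```python
import struct


def parse_ethernet_header(frame: bytes):
    """Decode the 14-byte header: destination, source, Length/Type."""
    dst, src, length_or_type = struct.unpack("!6s6sH", frame[:14])
    if length_or_type <= 1500:
        kind = "length"        # 802.3 style: payload length in bytes
    elif length_or_type >= 0x0600:
        kind = "ethertype"     # e.g. 0x0800 for IPv4
    else:
        kind = "undefined"     # values between the two ranges are not used

    def fmt(mac: bytes) -> str:
        return ":".join(f"{b:02x}" for b in mac)

    return fmt(dst), fmt(src), kind, length_or_type


# Made-up sample: broadcast destination, Type 0x0800 (IPv4)
sample = bytes.fromhex("ffffffffffff" "001122334455" "0800")
print(parse_ethernet_header(sample))
```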
The Link Layer: Ethernet VLANs
VLANs are logical networks built over the same physical cable plant.
A VLAN header distinguishes the Ethernet frames belonging to each logical network.
A VLAN tag is indicated by the value 0x8100 in the Type field location.
The next two octets are composed of the following three fields:
- User Priority field: 3 bits long. It defines the priority of the Ethernet
  frame and is used to define and deliver a class of service.
- Canonical Format Indicator: 1 bit long. Just don’t ask!
- VLAN Identifier field: 12 bits long. It contains the VLAN identifier (VID)
  of this frame.
The original Length/Type field then follows the inserted VLAN tag.
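The three fields can be separated with shifts and masks. The sketch below assumes the four tag bytes (0x8100 plus the two octets described above) have already been cut out of the frame; the function name is illustrative.

```python
import struct


def parse_vlan_tag(tag: bytes):
    """Decode a 4-byte VLAN tag: 0x8100, then priority / CFI / VID."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != 0x8100:
        raise ValueError("not a VLAN-tagged frame")
    priority = tci >> 13        # top 3 bits: user priority
    cfi = (tci >> 12) & 0x1     # next bit: canonical format indicator
    vid = tci & 0x0FFF          # low 12 bits: VLAN identifier
    return priority, cfi, vid


print(parse_vlan_tag(bytes.fromhex("8100a064")))  # (5, 0, 100)
```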
The Network Layer 3: IP
- IP Layer properties:
  - Provides best effort delivery
  - It is unreliable
    - Packets may be lost
    - Duplicated
    - Out of order
  - Connectionless
  - Provides logical addresses
  - Provides routing
  - Demultiplexes data by protocol number
The Internet datagram
[Figure: Frame header | IP header | Transport | FCS]
IP header layout (32-bit rows, bit positions 0 to 31; field widths in bits):
    Vers (4) | Hlen (4) | Type of serv. (8) | Total length (16)
    Identification (16) | Flags (3) | Fragment offset (13)
    TTL (8) | Protocol (8) | Header Checksum (16)
    Source IP address (32)
    Destination IP address (32)
    IP Options (if any) | Padding
The first five rows make up the fixed 20 Bytes of the header.
IP Datagram Format (cont.)
- Type of Service (TOS): now being used for QoS
- Total length: length of datagram in bytes, includes header and data
- Time to live (TTL): specifies how long the datagram is
  allowed to remain in the internet
  - Routers decrement by 1
  - When TTL = 0 the router discards the datagram
  - Prevents infinite loops
- Protocol: specifies the format of the data area
  - Protocol numbers are administered by a central authority to guarantee
    agreement, e.g. ICMP=1, TCP=6, UDP=17 …
- Source & destination IP address: (32 bits each) contain the
  IP address of the sender and the intended recipient
- Options: (variable length) mainly used to record a route,
  or timestamps, or specify routing
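A minimal sketch of decoding the fixed 20-byte header with `struct`, assuming no IP options are present. The function name and the sample bytes are illustrative and not taken from the talk.

```python
import socket
import struct


def parse_ipv4_header(data: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header."""
    (ver_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": ver_hlen >> 4,
        "header_bytes": (ver_hlen & 0x0F) * 4,   # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_len,
        "ttl": ttl,
        "protocol": {1: "ICMP", 6: "TCP", 17: "UDP"}.get(proto, proto),
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


# Made-up sample header carrying a UDP datagram
sample = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
print(parse_ipv4_header(sample))
```

The Protocol field is what lets the receiving host hand the data area to ICMP, TCP or UDP.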
Internet Class-based addresses
- An address looks like 192.168.22.123
- Class A: large number of hosts, few networks
  - 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
  - 7 network bits (0 and 127 reserved, so 126 networks),
    24 host bits (> 16M hosts/net)
  - Initial byte 1-126 (decimal)
- Class B: medium number of hosts and networks
  - 10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
  - 16,384 class B networks, 65,534 hosts/network
  - Initial byte 128-191 (decimal)
- Class C: large number of small networks
  - 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
  - 2,097,152 networks, 254 hosts/network
  - Initial byte 192-223 (decimal)
- Class D: Multicast (See RFC 1112)
  - 1110 followed by a 28-bit multicast group address
  - Initial byte 224-239 (decimal)
- Class E: Reserved
  - Initial byte 240-255 (decimal)
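The class follows directly from the leading bits of the first byte, so it can be read off with a few comparisons. A small sketch (the function name is made up for this example):

```python
def address_class(addr: str) -> str:
    """Return the class (A to E) of a dotted-decimal IPv4 address."""
    first = int(addr.split(".")[0])
    if first < 128:      # leading bit 0
        return "A"
    if first < 192:      # leading bits 10
        return "B"
    if first < 224:      # leading bits 110
        return "C"
    if first < 240:      # leading bits 1110
        return "D"
    return "E"           # leading bits 1111


print(address_class("192.168.22.123"))  # C
```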
The Transport Layer 4: UDP
- UDP provides:
  - Connectionless service over IP
    - No setup or teardown
    - One packet at a time
  - Minimal overhead, high performance
  - Best effort delivery
- It is unreliable:
  - Packets may be lost
  - Duplicated
  - Out of order
- Application is responsible for
  - Data reliability
  - Flow control
  - Error handling
UDP Datagram format
[Figure: Frame header | IP header | UDP header | Application data | FCS]
UDP header layout (32-bit rows, bit positions 0 to 31; 8 Bytes in total):
    Source port (16) | Destination port (16)
    UDP message len (16) | Checksum (opt.) (16)
- Source/destination port: port numbers identify the sending & receiving processes
  - Port number & IP address allow any application on the Internet to be uniquely identified
  - Ports can be static or dynamic
    - Static (< 1024): assigned centrally, known as well known ports
    - Dynamic
- Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
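A hedged sketch of building the 8-byte header for a given payload. The function name is illustrative, and the checksum is simply left at zero, which UDP over IPv4 treats as "no checksum computed".

```python
import struct


def build_udp_header(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Pack the four 16-bit UDP header fields in network byte order."""
    length = 8 + len(payload)   # message length covers header plus data
    checksum = 0                # optional over IPv4; 0 means not computed
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)


header = build_udp_header(5000, 53, b"abcd")
print(struct.unpack("!HHHH", header))  # (5000, 53, 12, 0)
```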
The Transport Layer 4: TCP
- TCP (RFC 793, RFC 1122) provides:
  - Connection-oriented service over IP
    - During setup the two ends agree on details
    - Explicit teardown
    - Multiple connections allowed
  - Reliable end-to-end byte stream delivery over an unreliable network
    - It takes care of:
      - Lost packets
      - Duplicated packets
      - Out of order packets
- TCP provides
  - Data buffering
  - Flow control
  - Error detection & handling
  - Limits on network congestion
The TCP Segment Format
[Figure: Frame header | IP header | TCP header | Application data | FCS]
TCP header layout (32-bit rows, bit positions 0 to 31; field widths in bits):
    Source port (16) | Destination port (16)
    Sequence number (32)
    Acknowledgement number (32)
    Hlen (4) | Resv (6) | Code (6) | Window (16)
    Checksum (16) | Urgent ptr (16)
    Options (if any) | Padding
The first five rows make up the fixed 20 Bytes of the header.
TCP Segment Format – cont.
- Source/Dest port: TCP port numbers to ID applications
  at both ends of the connection
- Sequence number: first byte in the segment from the sender’s
  byte stream
- Acknowledgement: identifies the number of the byte the
  sender of this segment expects to receive next
- Code: used to determine segment purpose, e.g. SYN,
  ACK, FIN, URG
- Window: advertises how much data this station is willing
  to accept. Can depend on buffer space remaining.
- Options: used for window scaling, SACK, timestamps,
  maximum segment size etc.
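A rough sketch of decoding the fixed 20-byte TCP header, assuming the bytes start at the Source port field. The function name and flag table are illustrative; options, if present, are ignored.

```python
import struct

# Code bits, least significant first
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_header(data: bytes) -> dict:
    """Decode the fixed part of a TCP header (options are not parsed)."""
    (src, dst, seq, ack, off_resv, code,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", data[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_bytes": (off_resv >> 4) * 4,   # Hlen counts 32-bit words
        "flags": [n for i, n in enumerate(FLAG_NAMES) if code & (1 << i)],
        "window": window,
    }


# Made-up SYN+ACK segment header
sample = struct.pack("!HHIIBBHHH", 1234, 80, 1024, 2048, 5 << 4, 0x12, 65535, 0, 0)
print(parse_tcp_header(sample))
```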
TCP – providing reliability
- Positive acknowledgement (ACK) of each received segment
  - Sender keeps a record of each segment sent
  - Sender awaits an ACK: “I am ready to receive byte 2048 and beyond”
  - Sender starts a timer when it sends a segment, so it can re-transmit
[Figure: time sequence between Sender and Receiver]
    Sender to Receiver: Segment n (Sequence 1024, Length 1024)
    Receiver to Sender: ACK of Segment n (Ack 2048), one RTT after Segment n was sent
    Sender to Receiver: Segment n+1 (Sequence 2048, Length 1024)
    Receiver to Sender: ACK of Segment n+1 (Ack 3072), one RTT after Segment n+1 was sent
- Inefficient: the sender has to wait
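To see how inefficient one segment per round trip is, the achievable rate can be estimated as segment size divided by RTT. The sketch below ignores the time the segment spends on the wire, and the 100 ms RTT is an assumed example value.

```python
def stop_and_wait_rate_bps(segment_bytes: int, rtt_s: float) -> float:
    """Rate in bit/s when only one segment may be outstanding per RTT."""
    return segment_bytes * 8 / rtt_s


# 1024-byte segments as in the figure, assumed RTT of 100 ms
print(stop_and_wait_rate_bps(1024, 0.1))  # roughly 82 kbit/s
```

The result does not depend on the link speed at all, which is why a window of several segments in flight is needed.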
Flow Control: Sender – Congestion Window
- Uses the congestion window, cwnd, a sliding window, to control the data flow
  - Byte count giving the highest byte that can be sent without an ACK
  - Transmit buffer size and advertised receive buffer size are important.
  - An ACK gives the next sequence number to receive AND
    the available space in the receive buffer
  - Timer kept for each packet
[Figure: the TCP cwnd sliding over the send buffer, in sequence-number order]
    Data sent and ACKed | Sent data, buffered waiting for ACK | Unsent data, may be transmitted immediately | Data to be sent, waiting for window to open
    - A received ACK advances the trailing edge
    - The sending host advances the marker as data is transmitted
    - The receiver’s advertised window advances the leading edge
    - The application writes at the end of the data to be sent
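A minimal sketch of how much more data the sender may transmit at a given moment, assuming the usual rule that unacknowledged data is limited by the smaller of cwnd and the receiver's advertised window. Function and parameter names are illustrative.

```python
def usable_window(cwnd: int, advertised_window: int, bytes_in_flight: int) -> int:
    """Bytes that may still be sent before another ACK must arrive."""
    limit = min(cwnd, advertised_window)   # whichever window is tighter
    return max(0, limit - bytes_in_flight)


print(usable_window(cwnd=8192, advertised_window=4096, bytes_in_flight=1024))  # 3072
```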
Flow Control: Receiver – Lost Data
[Figure: the receive buffer as a sliding window, in sequence-number order]
    Data given to application | ACKed but not given to user | Lost data | Received but not ACKed
    - The application reads at the end of the data given to the application
    - The last ACK given marks the next byte expected (expected sequence no.)
    - The window slides as the receiver’s advertised window advances the leading edge
- If new data is received with a sequence number ≠ next byte expected,
  a duplicate ACK is sent with the expected sequence number
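The receiver's behaviour can be sketched as a small simulation that returns the ACK number sent for each arriving segment. It assumes segments never partially overlap, and all names are illustrative.

```python
def receiver_acks(arrivals, next_expected):
    """Return the ACK number generated for each (sequence, length) arrival."""
    buffered = {}
    acks = []
    for seq, length in arrivals:
        if seq >= next_expected:
            buffered[seq] = length             # hold it, possibly out of order
        while next_expected in buffered:       # deliver any contiguous data
            next_expected += buffered.pop(next_expected)
        acks.append(next_expected)             # always ACK the next byte wanted
    return acks


# The segment starting at byte 2048 is delayed behind three later segments
arrivals = [(1024, 1024), (3072, 1024), (4096, 1024), (5120, 1024), (2048, 1024)]
print(receiver_acks(arrivals, next_expected=1024))
# [2048, 2048, 2048, 2048, 6144]
```

The three repeated ACKs for byte 2048 are the duplicate ACKs; once the missing segment arrives, the ACK jumps past all the buffered data.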
How it works: TCP Slowstart
- Probe the network: get a rough estimate of the optimal congestion window size
  - The larger the window size, the higher the throughput
  - Throughput = Window size / Round-trip Time
- Exponentially increase the congestion window size until a packet is lost
  - cwnd is initially 1 segment (MSS), then increased by 1 segment for each ACK received
    - Send 1st packet, get 1 ACK, increase cwnd to 2
    - Send 2 packets, get 2 ACKs, increase cwnd to 4
  - Time to reach cwnd size W = RTT*log2(W)
  - Rate doubles each RTT
[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, timeout, and retransmit (slow start again)]
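The doubling can be checked with a short loop that counts round trips until cwnd reaches a target, assuming no loss and cwnd measured in segments. The function name is illustrative.

```python
import math


def slow_start_rtts(target_segments: int) -> int:
    """Round trips needed for cwnd to grow from 1 to the target size."""
    cwnd = 1
    rtts = 0
    while cwnd < target_segments:
        cwnd *= 2      # one extra segment per ACK doubles cwnd every RTT
        rtts += 1
    return rtts


print(slow_start_rtts(1024), math.log2(1024))  # 10 10.0
```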
How it works: TCP Congestion Avoidance
- Additive increase: starting from the rough estimate, linearly
  increase the congestion window size to probe for additional
  available bandwidth
  - cwnd is increased by 1/cwnd segments for each ACK, i.e. about 1 segment per RTT: a linear increase in rate
  - TCP takes packet loss as an indication of congestion!
- Multiplicative decrease: cut the congestion window size
  aggressively if a packet is lost
  - Standard TCP multiplies cwnd by 0.5, i.e. halves it
  - The transition from slow start to congestion avoidance is determined by ssthresh
[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, timeout, and retransmit (slow start again)]
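Additive increase and multiplicative decrease together give the familiar sawtooth. A simplified sketch, working in whole RTTs with cwnd in segments; the loss times are supplied by hand and all names are illustrative.

```python
def aimd(cwnd: float, rtts: int, loss_at: set) -> list:
    """Track cwnd over a number of RTTs, halving on the RTTs listed in loss_at."""
    history = []
    for rtt in range(rtts):
        if rtt in loss_at:
            cwnd = max(1.0, cwnd / 2)   # multiplicative decrease
        else:
            cwnd += 1.0                 # additive increase: 1 segment per RTT
        history.append(cwnd)
    return history


print(aimd(10, 5, {2}))  # [11.0, 12.0, 6.0, 7.0, 8.0]
```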
TCP Fast Retransmit & Recovery
- Duplicate ACKs are due to lost segments or segments out of order.
- Fast Retransmit: if the sender receives 3 duplicate ACKs
  (i.e. the receiver got 3 additional segments without getting the one expected)
  - Send the missing segment
  - Set ssthresh to 0.5*cwnd, so enter the congestion avoidance phase
  - Set cwnd = (0.5*cwnd + 3), where the 3 accounts for the 3 dup ACKs
  - Increase cwnd by 1 segment for each further duplicate ACK
  - Keep sending new data if allowed by cwnd
  - Set cwnd to half the original value on a new ACK
  - No need to go into “slow start” again
- At steady state, CWND oscillates around the optimal window size
- With a retransmission timeout, slow start is triggered again
[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, timeout, and retransmit (slow start again)]
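The window arithmetic in the steps above can be traced with a small function. It works in whole segments and only follows cwnd and ssthresh; the retransmission itself is not modelled, and the names are illustrative.

```python
def fast_recovery(cwnd: int, extra_dup_acks: int):
    """Follow cwnd through fast retransmit and recovery.

    Returns (ssthresh, inflated cwnd during recovery, cwnd after the new ACK).
    """
    ssthresh = max(cwnd // 2, 2)   # half the window in use when loss was detected
    cwnd = ssthresh + 3            # the 3 dup ACKs mean 3 segments left the network
    cwnd += extra_dup_acks         # 1 segment more for each further dup ACK
    inflated = cwnd
    cwnd = ssthresh                # new ACK arrives: deflate to half the original
    return ssthresh, inflated, cwnd


print(fast_recovery(cwnd=20, extra_dup_acks=4))  # (10, 17, 10)
```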
TCP: Simple Tuning - Filling the Pipe
- Remember, TCP has to hold a copy of data in flight
- Optimal (TCP buffer) window size depends on:
  - Bandwidth end to end, i.e. min(BWlinks), AKA the bottleneck bandwidth
  - Round Trip Time (RTT)
- The number of bytes in flight to fill the entire path:
  - Bandwidth*Delay Product: BDP = RTT*BW
  - Setting the window to the BDP can increase throughput by
    orders of magnitude
- Windows are also used for flow control
[Figure: time sequence between Sender and Receiver showing segments, the returning ACK and the RTT; segment time on wire = bits in segment/BW]
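A worked example of the two formulas, using an assumed path with a 1 Gbit/s bottleneck and a 100 ms RTT. The numbers and function names are illustrative only.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_rate_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput = Window size / Round-trip Time, in bit/s."""
    return window_bytes * 8 / rtt_s


print(bdp_bytes(1e9, 0.1))                  # about 12.5 million bytes
print(window_limited_rate_bps(65535, 0.1))  # about 5.2 Mbit/s
```

With a window of roughly 64 kbytes the assumed path delivers about 5 Mbit/s; raising the window to the BDP allows the full 1 Gbit/s, a gain of more than two orders of magnitude.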
Congestion control: ACK clocking
More Information
- Lectures, tutorials etc. on TCP/IP:
  - www.nv.cc.va.us/home/joney/tcp_ip.htm
  - www.cs.pdx.edu/~jrb/tcpip.lectures.html
  - www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  - www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  - www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  - www.jbmelectronics.com/tcp.htm
- Encyclopaedia
  - http://www.freesoft.org/CIE/index.htm
- TCP/IP Resources
  - www.private.org.il/tcpip_rl.html
- Understanding IP addresses
  - http://www.3com.com/solutions/en_US/ncs/501302.html
- Configuring TCP (RFC 1122)
  - ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
- Assigned protocols, ports etc (RFC 1010)
  - http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols
Any Questions?
Backup Slides
More Information: Some URLs
- UKLight web site: http://www.uklight.ac.uk
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + writeup:
  http://www.hep.man.ac.uk/~rich/net
- Motherboard and NIC Tests:
  http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt
  & http://datatag.web.cern.ch/datatag/pfldnet2003/
  “Performance of 1 and 10 Gigabit Ethernet Cards with Server
  Quality Motherboards” FGCS Special issue 2004
  http://www.hep.man.ac.uk/~rich/
- TCP tuning information may be found at:
  http://www.ncne.nlanr.net/documentation/faq/performance.html
  & http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons:
  “Evaluation of Advanced TCP Stacks on Fast Long-Distance
  Production Networks” Journal of Grid Computing 2004
- PFLDnet http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT http://www.geant2.net/server/show/nav.00d00h002
tcpdump / tcptrace
- tcpdump: dump all TCP header information for a specified
  source/destination
  - ftp://ftp.ee.lbl.gov/
- tcptrace: format tcpdump output for analysis using xplot
  - http://www.tcptrace.org/
- NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools
  - http://www.ncne.nlanr.net/TCP/testrig/
- Sample use:
    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl
tcptrace and xplot
- X axis is time
- Y axis is sequence number
- The slope of this curve gives the throughput over time.
- The xplot tool makes it easy to zoom in
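Reading throughput off the plot amounts to taking the slope between two points on the sequence-number curve. A small sketch with made-up sample points; the function name is illustrative and is not part of tcptrace or xplot.

```python
def throughput_bps(t0: float, seq0: int, t1: float, seq1: int) -> float:
    """Slope of the time/sequence curve between two points, in bit/s."""
    return (seq1 - seq0) * 8 / (t1 - t0)


# 125,000 bytes of sequence space covered in 1 second
print(throughput_bps(0.0, 0, 1.0, 125000))  # 1000000.0
```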
Zoomed In View
- Green Line: ACK values received from the receiver
- Yellow Line: tracks the receive window advertised by the receiver
- Green Ticks: track the duplicate ACKs received
- Yellow Ticks: track the window advertisements that were the same as the
  last advertisement
- White Arrows: represent segments sent
- Red Arrows (R): represent retransmitted segments
TCP Slow Start