Transcript ppt - LIFL

P2P Overlay Network for TCP
Programming with UDP Hole Punching
Takayuki Okamoto, Taisuke Boku,
Mitsuhisa Sato, Osamu Tatebe
Graduate School of Systems and Information Engineering,
University of Tsukuba
2nd NEGST workshop
1
Abstract

Large amount of idle PCs in the world
Behind NAT and firewall
 Special programming is required to communicate
with each other



Relay server, NAT traversal
We are developing a P2P communication library
to ease to use PCs behind NAT and firewall
UDP hole punching
 Original reliable communication library on UDP/IP
 User level management

We use the term of “NAT” for both NAT boxes and firewalls hereafter
2nd NEGST workshop
2
Outline

Motivation and objective


Proposal of a scalable communication framework
based on NAT traversal


Design and implementation of communication
library
Evaluation of communication performance


P2P computing
Performance for UDP with our reliable
communication library
Works in France
2nd NEGST workshop
3
Motivation & background

NAT problem




Most of computing nodes are behind firewalls or NAT (Network
Address Translation) boxes
These nodes can’t communicate with each other directly
With relay transfer, the bandwidth of relay-nodes becomes a
bottleneck
NAT traversal techniques


With several negotiation procedures, the nodes can communicate
directly through intermediate NATs
Complicated negotiation is required on each application program
Internet
Cluster
Firewall
2nd NEGST workshop
Broadband PC
router (NAT)
4
Objective

Goal: providing a communication framework for efficient
and easily programmable HPC-P2P computing




Easy to use nodes behind NATs
High scalability
High throughput
High portability for a large variety of environments
Internet
Cluster
Firewall
Broadband
router
(NAT)
PC
2nd NEGST workshop
New communication
environment
5
Requirement specification




Direct communication based on NAT traversal
Name space independent from the physical one
Fully distributed management system
User-level implementation
Internet
Cluster
Firewall
Broadband
router
(NAT)
PC
2nd NEGST workshop
New communication
environment
6
Overlay networks

Virtual networks constructed on application layer

Generally defined as “a routing (relay) system
among involved nodes”




Independent from the physical network
Relay nodes may become bottlenecks
Applications neglect the network topology
Our system



Name space and communication methods between
any pair of nodes without packet-relay
Applications can be designed for effective
communication on physical network
Supporting both applications and frameworks
2nd NEGST workshop
7
Design concept of our system

Two different types of communication


Managements and controls in our system
Data transfer on applications
Name & negotiation
Data transfer
(management system) (communication library)
Requirement
Consistency,
quick response
High throughput
Style
Server-client, DHT
P2P
Topology
Tree
Direct
Feature
Using super-node with Without relay-node
very few traffic
High scalability
2nd NEGST workshop
8
Design of communication library

Socket API compatible with TCP/IP
Easy porting of existing applications written in
TCP/IP
 Easy programming with large flexibility - not
limited to “master-slave” style


Communication method is automatically selected
Pure (direct) TCP/IP is the best
 UPnP is supported by wide class of home-use
NATs
 UDP hole punching is mostly available on NATs
⇒ for TCP-programming, reliable streaming
communication feature must be provided by
software

2nd NEGST workshop
9
Reliable communication on UDP/IP

RI2N/UDP




Developed by JST-CREST “Mega-Scale Computing”
Project
Basically designed for fault-tolerant communication
on PC cluster with Ethernet
Based on UDP/IP, but provides TCP-like streaming
communication, retransmission and simple
congestion control algorithm
Porting to our communication layer for P2P
computing
⇒ SoU (Stream on UDP) library
2nd NEGST workshop
10
Preliminary performance evaluation


Performance evaluation on SoU library
 Throughput
 Latency
Environment
 Two client nodes in two houses
under different ISPs over the
Internet
 The server node in University of
Tsukuba
 Home-use “broadband router” to
be used



BBR-4HG : max 92Mbps
BLR3-TX4 : max 90Mbps
Four connection methods
(1) TCP DMZ
(2) SoU DMZ
(3) TCP relay
(4) SoU + UDP hole punching
2nd NEGST workshop
server
Router-1 : BBR-4HG
Router-2 : BLR3-TX4
Node-A : Pentium M 1.2GHz
Node-B : Pentium M 2.26GHz
server : Xeon 3.0GHz
ISP1(So-net)
University
SINET(MEXT)
Internet
ISP2(BB.Excite)
Router-1
(NAT)
Router-2
(NAT)
Node-A
Node-B
11
Connection methods (1) and (2)


TCP DMZ
Method (1): TCP/IP with DMZ function of NAT
SoU DMZ
Method (2): SoU with “UDP” DMZ function of NAT
 DMZ function: port forwarding function to transfer all
inbound packets on NAT to a node behind NAT
TCP/IP or UDP/IP
global network
Node-A
NAT-1
(port 1000)
anywhere
→NAT-1:1000
→Node-A:1000
NAT-2
setting
manually
2nd NEGST workshop
Node-B
(port 2000)
12
Connection method (3)

TCP/IP packet relay through Server



TCP relay
Each node makes a TCP/IP channel with the server
The server relays packets from one side to the other side
through TCP/IP channel
Two times of transmission is required to send a packet
server
TCP/IP
global network
Node-A
NAT-1
(port 1000)
NAT-2
Node-B
(port 2000)
2nd NEGST workshop
13
Connection method (4)

SoU + UDP hole
punching
SoU over UDP hole punching



All nodes share the information of IP addresses and ports by the
server through the management channel with TCP/IP
Two client nodes establish a direct communication channel with
UDP/IP by UDP hole punching
Over this UDP channel, SoU is used for streaming and reliable
communication between Node-A and Node-B
server
Information =
address + port
Data
transfer
global network
Node-A
(port 1000)
NAT-1
SoU connection
UDP hole punching
2nd NEGST workshop
NAT-2
Node-B
(port 2000)
14
Throughput

3
throughput [MB/s]
2.41

2
1.67
1.43
1.44
1
TCP DMZ vs. SoU + UDP hole
punching
 Simple vs. complex
 Different only 15%
 Realizing P2P direct
communication without NAT
problem
TCP DMZ vs. TCP relay
 Direct vs. indirect
 TCP relay is 45% higher
 Communication path between
ISPs


0
TCP DMZ
SoU DMZ
TCP relay SoU + UDP
hole
punching
Throughput depends on
bandwidth between ISPs
University has a strong
connection with both ISPs
TCP relay makes a bottleneck on
scalable system
SoU + UDP hole punching is the best
way for P2P computing


Single-sided burst transfer
2nd NEGST workshop
15
Latency

Three methods

30
Very small difference

24.6

20
latency [ms]

15.4
14.6
14.3

TCP relay
10

The largest


0
TCP DMZ
SoU DMZ
TCP relay
SoU + UDP
hole
punching
Average time for 1 byte message transfer
2nd NEGST workshop
Physical latency is large
Difference among protocols
is relatively small
Same hop-count ≈ same
latency
Double time hop-count
Latency depends on the
number of hops in WAN

Throughput depends on
absolute bandwidth
16
Works in France (1)

Porting UDP hole punching in Private Virtual
Cluster (tun version)

PVC provides IP level virtualization


Reliability is not required
Throughput on LAN achieves 90 Mbps on
100BASE-TX with tuning of MTU
daemon
Application
UDP hole
punching
TCP
UDP
IP
tun
device
2nd NEGST workshop
Real NIC
17
Works in France (2)

Making arrangements for performance
evaluation between France and Japan
Nodes in Grid5000 can be used only with their
self
 2 nodes in France and 4 nodes in Japan are
available

PCs
in Univ. Tsukuba
PCs
in home
PC
in my home
In Japan
2nd NEGST workshop
In France
18
Future works

Performance improvement of SoU library


Implementing more sophisticated algorithms of
flow control
Performance evaluation between France and
Japan
Comparing SoU with TCP
 Upgrading SoU for throughput with large latency

2nd NEGST workshop
19
2nd NEGST workshop
20
2nd NEGST workshop
21
2nd NEGST workshop
22
The Procedure of UDP hole punching
Sharing the
Information of IP
address and port
Server
to NAT-2:2000
×?
to NAT-1:1000
×?
global network
Node-A
NAT-1
NAT-2
(port 1000)
Node-B
(port 2000)
NAT-2:2000
→NAT-1:1000
→Node-A:1000
Created by
outbound packets
NAT-1:1000
→NAT-2:2000
→Node-B:2000
This method is available with “Cone NATs”
2nd NEGST workshop
23
Motivation & background

P2P (Peer-to-peer) computing and its potential power


Utilize a great potential computation power provided by a
number of PCs
Public Resource Computing : Aggregating the computation
power of idling PCs in home and office in P2P manner


Volunteer computing (BOINC, etc)
Supporting only master-worker style applications
Clusters
in university
PCs
in home
PCs
in office
2nd NEGST workshop
24
Conclusion

We proposed a communication framework for
P2P computing for HPC applications with high
scalability
Easily programmable even through NATs
 Scalable for a number of nodes without relayserver bottleneck


Performance evaluation on WAN environment
SoU library provides an acceptable performance
 Relatively large cost to establish a connection, but
negligible for long-term HPC applications


Our system has acceptable performance and
scalability for HPC-P2P
2nd NEGST workshop
25
Related work


Generic studies : JXTA, NAT BLASTER, STUNT, OCALA and Skype
A2A API …
NAT traversal techniques
 Wide-Area Communication for Grids: An Integrated
Solution to Connectivity, Performance and Security
Problems [Alexandre et at al. HPDC’04]




Simultaneous TCP : Another TCP connection establishment
procedure on RFC793
User-level implementation
Usable under more particular condition than UDP hole punching
Overlay network without relays
 Private Virtual Cluster: Infrastructure and Protocol for
Instants Grids. [Ala et at al. Europar’06]


High application portability with TUN/TAP
Installation needs root authority
2nd NEGST workshop
26
NAT traversal techniques

Techniques to allow a direct communication among nodes
behind NATs

UDP hole punching



UPnP (Universal Plug and Play)




The most widely used method and easy to implement on userlevel
Communication is limited to UDP/IP
To configure hardware devices temporally through the network
UDP/IP and TCP/IP are available
Each NAT box must support the feature explicitly
They are used mainly in multimedia applications




VoIP (Skype, Google Talk, etc.)
Constant throughput is required for long period
Several amount of packet-loss is allowed without the
retransmission for UDP/IP
For wider variety of applications, we need more concrete and
easy to control communication methods
2nd NEGST workshop
27
Cost to establish a connection


Most preliminary result
TCP DMZ, SoU DMZ and TCP relay


Same as round-trip time
SoU + UDP hole punching



Negotiation, UDP hole punching and SoU are required
Similar to 7 times of round-trip time
For HPC, this is a little overhead
TCP DMZ
SoU DMZ
TCP relay
SoU + UDP hole
punching
28.9 ms
28.5 ms
23.3 ms
199.4 ms
The shortest time to establish a connection
2nd NEGST workshop
28
time to establish a connection [ms]
Cost to establish a connection
Acceptable for HPC
applications as a little overhead
199.4
200
100
•
•
•
•
RDUP+UDP hole punching requires
7 times transmissions on WAN:
1 time on DNS resolution
4 times on sharing of address information
1 time on UDP hole punching
1 time on SoU connection establishment
28.9
28.5
23.3
TCP DMZ
RUDP DMZ
TCP relay
0
RUDP + UDP
hole punching
The shortest time to establish a connection
2nd NEGST workshop
29
Design of management system

Distributed “super-nodes”
to manage the system



Name space
management based on
DHT (Distributed Hash
Table)
Helps the negotiation
among NATs for UDP
hole punching
Relays packet only
when it is necessary
Server nodes
(DHT)
Router(NAT, firewall)
direct Connection
(TCP, UPnP, UDP
hole punching )
Client nodes
2nd NEGST workshop
30
Structure of Management System
Server
Many super-node
and many common nodes
(DHT)
Router(NAT, firewall)
Client
Super node
A server and
many clients
direct Connection
(TCP, UPnP, UDP
hole punching )
2nd NEGST workshop
Router(NAT, firewall)
common node
31
System design overview
A Framework of Public Resource Computing etc.
Monitoring the
overlapping Management
of
System
the names
Name
Management
Node
Management
Communication Library
NAT Traversal
TCP
TCP/IP
Holding TCP
connections with all
client nodes
UPnP
our system
UDP hole punching
UDP/IP
Providing direct
communication for data
through NATs
DHT (Distributed Hash Table) is used for
consistent and scalable management
2nd NEGST workshop
32
System design overview
Various frameworks for Public Resource Computing etc.
Name resolution
from virtual name
Management System
to real IP address
Name
Management
Node
Management
Communication Library
NAT Traversal
TCP
TCP/IP
Node pair
rendezvous for
NAT traversal
UPnP
our system
UDP hole punching
UDP/IP
Providing direct
communication for data
through NATs
2nd NEGST workshop
33
Latency
11ms
10ms
server
Router-1
(NAT)
Internet
Node-A
Router-2
(NAT)
Node-B
15ms
2nd NEGST workshop
34
Cost to establish a connection


Most preliminary result
TCP DMZ, SoU DMZ, TCP relay

Request and replay on TCP or SoU
= round-trip time

SoU + UDP hole punching

Negotiation, UDP hole punching and SoU’s establishment
= round-trip time x 7
TCP DMZ
SoU DMZ
TCP relay
SoU + UDP hole
punching
28.9
28.5
23.3
199.4
2nd NEGST workshop
35
Information transfer
The Procedure
of UDP hole punching
through a server
Server
Reachable using a
mapping information
to NAT-2:2000
to NAT-2:2000
to NAT-1:1000
Reachable
to Node-B
×
global network
Node-A
NAT-1
NAT-2
(port 1000)
Node-B
(port 2000)
NAT-2:2000
→NAT-1:1000
→Node-A:1000
Automatically
created
NAT-1:1000
→NAT-2:2000
→Node-B:2000
This method is available with “Cone NATs”
2nd NEGST workshop
36
Reliable communication on UDP/IP

RI2N/UDP




Developed by JST-CREST “Mega-Scale Computing” Project
Basically designed for fault-tolerant communication on PC cluster
with Ethernet
Based on UDP/IP, but provides TCP-like streaming
communication, retransmission and simple congestion control
algorithm
Porting to our communication layer for P2P computing
⇒ RUDP (Reliable UDP) library
All RI2N channels share
type
window size
source port
only one UDP port
(padding)
destination port
for selective acknowledgements
sequence number
ack bitmap
option
to share the failure information
link status (fail or not)
data
2nd NEGST workshop
37