End-To-End Provisioned Optical Network Testbed for
Large-Scale eScience Applications
Nagi Rao, Bill Wing
Computer Science and Mathematics Division
Oak Ridge National Laboratory
[email protected], [email protected]
Tony Mezzacappa
Physics Division
Oak Ridge National Laboratory
[email protected]
Nov 12, 2003
Project Kick-off Meeting, University of Virginia
Sponsored by
NSF Experimental Infrastructure Networks Program
Outline of Presentation
1. Project details
2. TSI network and application interface requirements
3. Transport for dedicated channels
   a. dynamics of shared streams
   b. channel stabilization
4. Work Plan
ORNL Project Details
Principal Investigators:
Nagi Rao – Computer Scientist/Engineer
Bill Wing – Network Engineer/Scientist
Tony Mezzacappa – Astrophysicist
Technical Staff:
Qishi Wu – Post-Doctoral Fellow
Mengxia Zhu – PhD Student
Steven Carter – Systems and Network Support
Budget: $850K ($364K in year 1)
TSI Computations: Networking Support
Networking Activities:
Data transfers: archive and supply massive amounts of data (terabytes/day)
Interactive visualizations: visualize archival or on-line data
Remote steering and control: steer computations and visualizations into regions of interest
Coordinated operations: collaborative visualization and steering
(Figure: visualization, data, and control streams between the computation site and remote users.)
Types of Networking Channels
High Bandwidth Data Channels:
Off-line Transfers: Terabyte datasets
Supercomputers – high performance storage systems
Storage – host nodes and visualization servers
On-line Transfers:
Supercomputers – visualization nodes
Control and Steering Channels:
Interactive visualization – human response time
Computational steering – respond to “inertia” of computation
Coordinated Channels:
Coordinated visualization, steering, and archival
Multiple visualization and steering nodes
On the Internet, these channels can be supported only in a limited way:
– It is difficult to sustain large data rates in a fair manner
– Unpredictability of transport dynamics makes it very difficult to achieve stability
Data Transfers Over Dedicated Channels
Several Candidate Protocols (to be tested):
UDP-based data transport:
UDT (SABUL), Tsunami, Hurricane, RBUDP, IQ-RUDP, and others
Advantages: application-level implementations and conceptually simple methods
Disadvantages: unstable code and hard-to-configure parameters
Tuned TCP methods:
net100: tune flow windows large enough to avoid self-created losses
Advantages: known mechanisms and tested kernel code
Disadvantages: physical losses are problematic
– TCP interprets physical losses as congestion and reduces throughput
Host Issues for 1-10 Gbps Rates: impedance-match issues
– Buffering in NIC, kernel, and application; disk speeds (a buffer-sizing sketch follows below)
– Zero-copy kernel patch and ST
– OS bypass, RDMA
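As a minimal illustration of the flow-window side of this tuning (a sketch only, not the net100 kernel instrumentation itself), an application can request socket buffers sized to the bandwidth-delay product so that the window, not the buffer, limits throughput. The 1 Gbps x 60 ms path below is an assumed example (Python):

    import socket

    # Bandwidth-delay product for an assumed 1 Gbps x 60 ms path:
    # 1e9 bits/s * 0.060 s / 8 bits/byte = 7.5 MB of in-flight data.
    BDP_BYTES = int(1e9 * 0.060 / 8)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request send/receive buffers sized to the BDP; the kernel may
    # clamp them to its configured maxima (e.g., net.core.wmem_max).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BDP_BYTES)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BDP_BYTES)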
Multiple Streams Over Dedicated Channels
Example:
• Monitor computation through a visualization channel
• Interactive visualization – rotate, project different subspaces
• Computational steering – specify parameters on the fly
• Archive/load the data – store the interesting data
(Figure: visualization stream, visualization control, steering, and data streams between the hosts and high-performance storage.)
Option 1: Dedicated channel for each stream
• 4 NICs – 4 MSPP slots
Option 2: Share one dedicated channel
• single NIC and MSPP slot
• realize sharing at the protocol or application level
Option 3: Visualization streams on one channel; data and steering streams on another
• two NICs and MSPP slots
• realize sharing at the protocol or application level
Terminology Review
Connection: logical – host site to host site
Circuit (or Channel or Bandwidth Pipe): physical – NIC-to-NIC connection
Stream: logical – application to application (e.g., visualization, data, and control streams)
Dedicated NIC-NIC Channels
Advantages: No other traffic on the channel
• Simpler protocols:
– Rate controllers with loss-recovery mechanisms would suffice for
  • data transfers and
  • control channels for host-host connections
• Coordination between the streams can be handled at the application/middleware level
Disadvantages:
• Scaling problems:
  – a single connection requires 4 NIC-NIC pairs and 4 channels in the example
  – a main computation site supporting 5 users requires
    • a host with 20 NICs and 20 channels
    • an MSPP with at least 20 slots (e.g., 5 blades, each with 4 GigE ports)
• Utilization problems:
  – Even a small control stream needs an entire channel (the minimum provisioning granularity)
    • E.g., a 10 Mbps control stream on a GigE channel
Multiple Streams on Single NIC-NIC Channel
Streams interact and affect each other:
• Packets may be “pooled” at the source and destination nodes:
– NIC – interrupt coalescing and buffer clearing
– NIC-kernel transfers through buffers
– Kernel-application transfers
• Processor load determines interrupt response time at finer levels
Two important consequences
– Protocols or applications need to “share” the channel
• Need protocols that allow for appropriate bandwidth sharing (a direct allocation sketch follows after this list)
• TCP-like paradigm but a more structured problem
– Total bandwidth is known
– Competing traffic is host generated
– Protocol interaction could generate complicated dynamics
• Need protocols that stabilize the dynamics for control channels
• Very few protocols exist that protect against “underflow”
• Need a combination of existing and newer protocols
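Because the total channel bandwidth is known and every competing stream is host-generated, a fair share can be computed directly rather than probed for as TCP must do. A minimal max-min (progressive-filling) allocation sketch in Python; the stream names and demands are illustrative assumptions:

    def allocate(channel_bps, demands):
        # Streams demanding less than an equal share keep their demand;
        # the surplus is re-divided among the remaining streams.
        alloc, remaining, budget = {}, dict(demands), channel_bps
        while remaining:
            share = budget / len(remaining)
            done = {s: d for s, d in remaining.items() if d <= share}
            if not done:                 # all want more: split evenly
                for s in remaining:
                    alloc[s] = share
                break
            for s, d in done.items():
                alloc[s] = d
                budget -= d
                del remaining[s]
        return alloc

    # Example: a GigE channel shared by the three TSI stream types.
    print(allocate(1e9, {"control": 10e6, "visualization": 400e6,
                         "data": 900e6}))

With these demands the small control stream keeps its 10 Mbps, and the surplus is divided between the larger visualization and data streams.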
TSI Application interfaces and networking modules
(Figure: layered architecture – application modules 1-3 on top; application interfaces for data transfers, computational steering dynamics, and visualization; middleware protocols comprising bulk transport, stabilization, control, and streaming modules; all running over dedicated provisioned channels.)
Interfacing with visualization modules
Overall Approach: Separate the steering and display components:
– Steering module – connect it to the visualization control channel
– Display module:
  – separate rendering and display sub-modules and locate them at hosts
  – connect sub-modules over data channels
• Candidates under consideration – all need hooks to use dedicated channels:
– OpenGL, VTK codes
  • code needs to be modified with appropriate calls – non-trivial
– EnSight
  • can operate across IP networks without firewalls
  • high cost and no access to source code
– ParaView
  • stability problems and hard to use
– Aspect (?)
  • developed at ORNL
  • functionality similar to ParaView, with additional analysis modules
  • developers are willing to incorporate CHEETAH modules
    – on-line streaming
    – large datasets
Optimizing visualization pipeline on a network
(Figure: visualization pipeline – data storage → geometry computation → rendering → display, with stages mapped onto host nodes.)
Decomposition of visualization pipeline:
– “links” have different bandwidths
• Geometry could be larger than data
• Display bandwidth can be much smaller – human consumption
– tasks require different computational power
• Large datasets require a cluster to compute the geometry
• Rendering can be done on graphics-enabled machines
• Display can be transferred to X-enabled machine
The pipeline can be realized over the network and the display forwarded to the user's host (a mapping sketch follows below).
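One way to pose this mapping as a computation: enumerate placements of the stages onto hosts and pick the one minimizing compute time plus transfer time over the connecting links. A sketch in Python; every stage size, host speed, and link rate below is an illustrative assumption:

    from itertools import product

    stages = ["geometry", "rendering", "display"]
    out_bytes = [8e9, 2e8, 0]       # geometry output > rendered frames
    hosts = {"geometry": ["cluster"],
             "rendering": ["cluster", "gfx"],
             "display": ["user"]}
    compute_s = {("geometry", "cluster"): 120,
                 ("rendering", "cluster"): 300,
                 ("rendering", "gfx"): 45,
                 ("display", "user"): 1}
    link_Bps = {("cluster", "gfx"): 125e6,    # GigE
                ("cluster", "user"): 12.5e6,  # 100 Mbps to the user
                ("gfx", "user"): 12.5e6}

    def total_time(placement):
        t = sum(compute_s[(s, h)] for s, h in zip(stages, placement))
        for i in range(len(stages) - 1):
            a, b = placement[i], placement[i + 1]
            if a != b:              # crossing a link costs transfer time
                t += out_bytes[i] / link_Bps[tuple(sorted((a, b)))]
        return t

    best = min(product(*(hosts[s] for s in stages)), key=total_time)
    print(best, total_time(best))   # ('cluster', 'gfx', 'user'), 246.0

With these numbers, rendering moves to the graphics host even though the geometry transfer is large, because the final hop to the user then carries only the small display stream.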
Protocols for dedicated channels – multiple data streams
The problem is simpler than on the Internet:
Total available channel bandwidth is known
All traffic is generated by the nodes and is “known”
Fairness issues are simpler – nodes can allocate bandwidth among streams
TCP addresses these problems over the Internet:
slow-start to figure out available bandwidth
packet loss and time-out to conclude traffic levels
AIMD to adjust the flow rate
Bandwidth partitioning among data streams may require closed-loop control:
Simple (open-loop) control of data rates at the application level does not always work (a pacing sketch follows below):
Example: the NIC has higher capacity than the provisioned channel:
1. packets may be combined and sent out at a higher rate by the NIC, causing losses at the MSPP
2. packets may be coalesced at the receiver NIC, resulting in rates different from the sending rate
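A minimal sketch of such open-loop pacing at the application level (Python; the destination address and rate are assumptions). It spaces datagrams by size/rate and has exactly the blind spot noted above: no feedback path to detect NIC-induced bursts or losses at the MSPP.

    import socket, time

    def paced_send(data, dest, rate_bps, mtu=1400):
        # Emit one datagram every mtu*8/rate_bps seconds.  No feedback
        # from the receiver: if the NIC coalesces the queue and bursts
        # at line rate, the resulting losses go unnoticed here.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        interval = mtu * 8 / rate_bps
        next_t = time.monotonic()
        for off in range(0, len(data), mtu):
            sock.sendto(data[off:off + mtu], dest)
            next_t += interval
            time.sleep(max(0.0, next_t - time.monotonic()))

    # e.g., paced_send(payload, ("10.0.0.2", 5001), rate_bps=400e6)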
Protocols for dedicated channels – multiple data and control streams
The problem is to maintain "steady" dynamics for the control streams between applications
Not just between NICs or at the line
Complicated end-to-end dynamics can be caused by various factors:
Channel losses:
Physical losses
Losses due to sum of streams exceeding the capacity
Impedance mismatch between
NIC and line
NIC and kernel
kernel and application
On the Internet:
Only a probabilistic solution is possible because of complicated cross-traffic dynamics – our solutions are based on stochastic approximation
TCP does not solve the problem:
Multiple TCP/UDP streams generate chaos-like dynamics
A single TCP stream on a dedicated channel has the underflow problem
Even with the flow window tuned to the desired level and AIMD adjusted not to kick in, a burst of losses can kill the stream – TCP interprets it as congestion
This problem is still simpler than on the Internet:
Here cross-traffic is generated by the nodes and is "known"
Channels must be explicitly stabilized using application-level closed-loop control
Complicated Dynamics of Interacting Streams
Simulation Results: TCP-AIMD exhibits chaos-like trajectories
TCP streams competing with each other on a dedicated link (Veres and Boda 2000)
TCP competing with UDP on a dedicated link (Rao and Chua 2002)
Analytical Results (Rao and Chua 2002): TCP-AIMD has chaotic regimes
Competing with steady UDP streams on a dedicated link
State-space analysis and Poincaré maps
Internet Measurements (2003, last few weeks): TCP-AIMD traces are a complicated mixture of stochastic and chaotic components
Note: on dedicated links we expect less or no chaotic component
Internet Measurements – Joint work with Jianbo Gao
Question: How relevant are the simulation and analytical results on chaotic trajectories?
Answer: Only partially.
Internet (net100) traces show that TCP-AIMD dynamics are a complicated mixture of chaotic and stochastic regimes:
– Chaotic – TCP-AIMD dynamics
– Stochastic – TCP response to network traffic
Basic point: TCP traces collected on all Internet connections showed complicated dynamics
– the classical "saw-tooth" profile was not seen even once (the idealized saw-tooth is sketched below for reference)
– this is not a criticism of TCP; it was not designed for smooth dynamics
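For reference, the idealized saw-tooth follows from the AIMD rule itself: grow the window by one segment per round trip, halve it on loss. A toy simulation (Python; the independent-loss model is an assumption) generates the profile the measured traces lack:

    import random

    def aimd_trace(rtts=500, loss_p=0.02, seed=1):
        # Idealized TCP-AIMD congestion window: +1 segment per RTT,
        # halve on loss.  Independent random losses yield the classic
        # saw-tooth; the measured cwnd series look nothing like this.
        random.seed(seed)
        cwnd, trace = 1.0, []
        for _ in range(rtts):
            cwnd = cwnd / 2 if random.random() < loss_p else cwnd + 1
            trace.append(cwnd)
        return trace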
Cwnd time series for ORNL-LSU connection
Connection: OC192 to Atlanta-Sox; Internet2 to Houston; LAnet to LSU
Both Stochastic and Chaotic Parts are dominant
(Figure: time-series comparison – the Lorenz (chaotic) series shows a common envelope; a uniform random series is spread out.)
TCP traces have both: a common envelope, yet spread out at certain scales
Characterized as Anomalous Diffusions
(Figure: log-log displacement curves.)
Large exponent: typical of chaotic systems with injected noise
End-to-End Delay Dynamics Control: End Filtering
(Figure: two paths from ORNL (source) to U. Oklahoma (destination) – one direct Internet connection, one relayed through a daemon at Old Dominion Univ.; a filter runs at the destination. Plots: end-to-end delay (sec) vs. message size (bytes) for ORNL-OU and ORNL-ODU-OU.)
Objective: achieve smooth end-to-end delay
Solution:
1. Reduce end-to-end delay using two paths via daemons
2. Filter the output at the destination
Throughput Stabilization – Joint work with Qishi Wu
• Niche application requirement: provide stable throughput at a target rate – typically much below peak bandwidth
– Commands for computational steering and visualization
– Control loops for remote instrumentation
• TCP AIMD is not suited for stable throughput
– Complicated dynamics
– Underflows with sustained traffic
Measurements: ORNL-LSU
ORNL-LSU old connection: ESnet peering with Abilene in New York
Both hosts have 10 Mbps NICs
Throughput stabilized within seconds at the target rate and remained stable under:
• large and small FTP transfers at the hosts and on the LAN
• web browsing
Stochastic Approximation: UDP window-based method
(Figure: transport control loop – source node S sends data packets at transmission rate r(t); destination node D returns acknowledgements; goodput g_S(t) is measured at the source and g_D(t) at the destination.)
Objective: adjust the source rate to achieve (almost) fixed goodput at the destination
Difficulty: data packets and acks are subject to random processes
Approach: rely on statistical properties of the data paths
Throughput and loss rates vs. window size and cycle time
(Figure: measurements on a typical day and on Christmas day.)
Objective: adjust source rate to yield the desired throughput at destination
Adaptation of source rate
• Adjust the window size (cycle time fixed): with target goodput $g^*$, noisy goodput estimate $g_n$, and decaying gain $a_n$,
$$W_{c,n+1} = W_{c,n} - a_n\,T_{s,n}\,(g_n - g^*)$$
• Adjust the cycle time (window fixed):
$$T_{s,n+1} = \frac{T_{s,n}}{1.0 - \frac{a_n}{W_c}\,(g_n - g^*)\,T_{s,n}}$$
• Both are special cases of the classical Robbins-Monro method:
$$r_{n+1} = r_n - \alpha_n\left(\hat{g}(r_n) - g^*\right), \qquad \alpha_n > 0, \quad \alpha_n \to 0, \quad \sum_n \alpha_n = \infty$$
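A minimal closed-loop sketch of the window variant (Python). The send_window and measured_goodput hooks are assumed wrappers around the UDP transport, and the gain a/n**alpha is taken to absorb the window-to-goodput unit conversion; a = 0.8 and exponent 0.8 mirror the experimental settings on the later slides.

    def stabilize(g_target, send_window, measured_goodput,
                  a=0.8, alpha=0.8, w0=64.0, cycles=1000):
        # Robbins-Monro-style goodput stabilization, window variant:
        # each cycle sends a window of w datagrams, measures the
        # acknowledged goodput g_n (a noisy estimate), and nudges the
        # window against the error with the decaying gain a / n**alpha.
        w = w0
        for n in range(1, cycles + 1):
            g_n = measured_goodput(send_window(w))
            w = max(1.0, w - (a / n ** alpha) * (g_n - g_target))
        return w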
Performance Guarantees
• Summary: stabilization is achieved with high probability using a very simple estimate of the source rate
• Basic result: for a target rate that drifts as $\mu_{n+1} - \mu_n = O(n^{-\beta})$, use the general update
$$r_{n+1} = r_n - \frac{a}{n^{\alpha}}\,(g_n - g^*), \qquad a > 0, \quad \frac{1}{2} < \alpha \le \min(1, \beta)$$
• We have
$$E\left[(r_n - \mu_n)^2\right] = \begin{cases} O\!\left(\dfrac{1}{n^{2\alpha-1}}\right) & \text{if } \beta \ge \dfrac{3\alpha}{2}, \\[2mm] O\!\left(\dfrac{1}{n^{2(\beta-\alpha)}}\right) & \text{if } \beta < \dfrac{3\alpha}{2} \end{cases}$$
Internet Measurements
• ORNL-LSU connection (before recent upgrade)
  – Hosts with 10 Mbps NICs
  – 2000-mile network distance
    • ORNL-NYC – ESnet
    • NYC-DC-Hou – Abilene
    • Hou-LSU – local network
• ORNL-GaTech connection
  – Hosts with GigE NICs
  – ORNL – Juniper router: 1 Gig link
  – Juniper – ATL SOX: OC192 (1 Gig link)
  – SOX – GaTech: 1 Gig link
ORNL-LSU Connection
(Figure: connection path – ORNL via ESnet and a local network to LSU.)
Goodput Stabilization: ORNL-LSU
Experimental Results
• Case 1: target goodput = 1.0 Mbps, rate control through congestion window, a = 0.8
• Case 2: target goodput = 2.0 Mbps, rate control through congestion window, a = 0.8
(Figures: datagram acknowledging time (s) vs. source rate (Mbps) and goodput (Mbps) for each case.)
Goodput Stabilization: ORNL-LSU
Experimental Results
• Case 3: target goodput = 3.0 Mbps, rate control through congestion window, a = 0.8, α = 0.8
(Figure: datagram acknowledging time (s) vs. source rate (Mbps) and goodput (Mbps).)
Goodput Stabilization: ORNL-LSU
Experimental Results
• Case 4: target goodput = 2.0 Mbps, rate control through sleep time, a = 0.8
• Case 5: target goodput = 2.0 Mbps, rate control through sleep time, a = 0.9
(Figures: datagram acknowledging time (s) vs. source rate (Mbps) and goodput (Mbps).)
Throughput Stabilization: ORNL-GaTech
• Desired goodput level = 20.0 Mbps, a = 0.8, α = 0.8; adjustment made on congestion window
• Desired goodput level = 2.0 Mbps, a = 0.8, α = 0.8; adjustment made on sleep time
Experiments with tsunami
firebird.ccs.ornl.gov – ccil.cc.gatech.edu
• Network transport control settings:
– NIC speed and path bandwidth: 1 Gbps
– Transferred file size: 204,800,000 bytes
– Using default_block_size: 32768 bytes
• Transmission statistics from Tsunami:
  – Avg. sending rate: 296.05 Mbps
  – Loss rate: 64.32%
  – Transfer time: 17.51 sec
  – Throughput: 93.6 Mbps
  – Sending time & receiving time vs. block sequence number (figure on next slide)
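As a consistency check on these statistics: 204,800,000 bytes x 8 bits/byte / 17.51 s ≈ 93.6 Mbps, matching the reported throughput; the gap between the 296.05 Mbps average sending rate and the 93.6 Mbps delivered reflects the 64.32% loss and the resulting retransmissions.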
Tsunami measurements
ozy4.csm.ornl.gov – resource.rrl.lsu.edu
• Path bandwidth: 10 Mbps
• Datagram size: 1400 bytes (the default one doesn't work)
• File size: 10,240,000 bytes
• Case 1: Only Tsunami running
  – Throughput: 9.47 Mbps (receiver, client)
  – Goodput: 4.20 Mbps (sender, server)
  – Sending time & receiving time vs. datagram sequence number (figure at right)
• Case 2: Only ONTCOU (throughput maximization SA) running
  – Source goodput: 3.5 Mbps
  – Sending time & acknowledging time vs. datagram sequence number (figure)
  – Sending rate vs. source goodput (figure)
• Case 3: Tsunami and ONTCOU running simultaneously with the same datagram size
  – Tsunami: transmission not completed
  – ONTCOU: transmission completed; throughput 0.533 Mbps
  – Sending time & acknowledging time vs. datagram sequence number (figure on next slide)
ORNL Year 1 Tasks
Design and test transport protocols for dedicated channels
1. Single data streams – collaboration with UVa
2. One data and two control streams
Testing on ORNL-ATL-ORNL GigE-SONET link
(Figure: testbed – two Linux hosts at ORNL connected to a Juniper M160 router at ORNL, over OC192 to the SOX router in Atlanta.)
Interfaces with visualization software:
Simple supernova computation at ORNL hosts on dedicated link
Developing interfaces to Aspect visualization modules and testing
Test ParaView and EnSight
ORNL Year 2 Tasks
Design and test transport protocols for dedicated channels
Multiple data, visualization and control streams
Testing on CHEETAH testbed
Interface with visualization:
Interfacing supernova visualization modules over CHEETAH
Developing interfaces to Aspect visualization modules with the TSI dataset
ORNL Year 3 Tasks
Design and test transport protocols for dedicated channels
Coordinating multiple data, visualization, and control streams
Testing on CHEETAH testbed
Interface with visualization:
Interfacing supernova visualization and computation modules over CHEETAH
Developing interfaces to Aspect visualization modules with TSI on-line computations
Optimizing mapping of visualization pipeline
Feedback and Corrections
Interfacing with steering modules
Dynamics of visualization control and steering streams must be stabilized from application to application
– Not enough to stabilize lower transport levels
– NIC-to-line transfers may not be smooth
– Application-to-kernel transfers depend on the processor load
• Provide a user interface for steering and connect it to the transport modules