Storage System Integration with High Performance Networks

Jon Bakken and Don Petravick
FNAL
Overview
• Review of some salient characteristics of
wide-area networks.
• Describe initial investigations at Fermilab into optimizing wide-area file transfers
– integrated with production WAN/LAN and storage systems.
Wide Area Characteristics
• Most prominent characteristic, compared to
LAN, is the very large bandwidth*delay product.
• Underlying structure – it’s a packet world!
• Possible to use pipes (dedicated circuits) between specific sites
• These circuits can be either static or dynamic
• Both IP and non-IP (for example, Fibre Channel over SONET)
– FNAL has proposed investigations and has just begun
studies with its storage systems to optimize WAN file
transfers using pipes.
Bandwidth*Delay
• At least bandwidth*delay bytes must be kept in
flight on the network to maintain bandwidth.
– This fact is independent of protocol.
– Current practice uses more than this lower limit.
For example, US CMS used ~2x for their DC04.
• CERN <-> FNAL has a measured ~60 ms delay
• Using the 2x factor, a 120 ms delay gives:
– 30 MB/sec → ~3-4 MB “in flight”
– 1000 MB/sec → ~120 MB “in flight”
Bandwidth*Delay and IP
• Given a single lost packet and a standard
MTU size of 1500 bytes, the host will
receive many out-of-order packets before
receiving the retransmitted missing packet.
– Must incur at least 2 “delays worth”
• FNAL <-> CERN (2*60 ms delay):
• 30 MB/sec: more than 2,400 packets
• 1000 MB/sec: more than 80,000 packets
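A quick back-of-envelope check of these figures; a minimal sketch in Python, using the 2 x 60 ms = 120 ms delay and 1500-byte MTU quoted above:

    # Bandwidth*delay product: bytes (and 1500-byte packets) kept in flight
    # on the CERN <-> FNAL path at a given transfer rate.
    MTU = 1500          # bytes
    DELAY = 0.120       # seconds (2 * 60 ms)

    for bw_mb_per_s in (30, 1000):
        in_flight_bytes = bw_mb_per_s * 1e6 * DELAY
        in_flight_packets = in_flight_bytes / MTU
        print(f"{bw_mb_per_s} MB/sec: ~{in_flight_bytes / 1e6:.1f} MB in flight, "
              f"~{in_flight_packets:,.0f} packets")

    # 30 MB/sec: ~3.6 MB in flight, ~2,400 packets
    # 1000 MB/sec: ~120.0 MB in flight, ~80,000 packets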
Knee-Cliff-Collapse Model
• When load on a segment approaches a threshold, a modest increase in throughput is accompanied by a great increase in delay.
• Even more throughput results in congestion collapse.
• A network cannot be loaded arbitrarily.
• TCP tries to avoid collapse, but its solution has problems at large bandwidth*delay.
Bandwidth and Delay and TCP
• Stream model of TCP implies packet buffering is
in kernel - this leads to kernel efficiency issues.
• Vanilla TCP behaves as if all packet loss is
caused by congestion.
– TCP’s solution is to back off throughput, AIMD-fashion, to avoid congestion collapse:
• Lost packet? Cut packets in flight by ½
• Success? Open the window by one more packet on the next round trip
– This leads to a very large recovery time at high bandwidth*delay:
• 30 MB/sec drops to 15 MB/sec with just 1 lost packet
– Recovery time is 15 MB / 1500-byte MTU = 10,000 round trips × 120 ms
– Recovery time is 1200 sec = 20 minutes!
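The recovery figure above can be reproduced with the slide’s own back-of-envelope model, in which the window reopens by one MTU-sized packet per round trip; a minimal sketch, assuming the 15 MB window deficit, 1500-byte MTU, and 120 ms round trip quoted above:

    # AIMD recovery estimate: after a loss, the congestion window reopens
    # by roughly one MTU-sized packet per round trip.
    def aimd_recovery_seconds(window_deficit_bytes, mtu_bytes, rtt_ms):
        round_trips = window_deficit_bytes / mtu_bytes   # packets to win back
        return round_trips * (rtt_ms / 1000.0)

    # Figures from the slide: 15 MB to win back, 1500-byte MTU, 120 ms round trip
    t = aimd_recovery_seconds(15_000_000, 1500, 120)
    print(f"recovery ~ {t:.0f} sec (~{t / 60:.0f} minutes)")
    # recovery ~ 1200 sec (~20 minutes)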
Strategies
• Smaller, lower-bandwidth TCP streams in parallel (see the sketch after this list)
– Examples of these are GridFTP and bbFTP
• Tweak AIMD algorithm
– Logic is in the sender’s kernel stack only (congestion window)
– FAST, and others – USCMS used an FNAL kernel mod in DC04
• May not be “fair” to others using shared network resources
• Break the stream model, use UDP and ‘cleverness’,
especially for file transfers. But:
– You have to be careful and avoid congestion collapse.
– You need to be fair to other traffic, and be very certain of it
– Isolate strategy by confining transfer to a “pipe”
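A minimal sketch of why parallel streams help: splitting the same aggregate rate over N streams shrinks each stream’s bandwidth*delay window, so a single lost packet halves only that stream’s share of the rate rather than the whole transfer. The 30 MB/sec aggregate and 120 ms delay are the figures used earlier; the stream counts are illustrative:

    # Splitting one 30 MB/sec transfer over N parallel TCP streams.
    RTT = 0.120         # seconds (2 * 60 ms)
    AGGREGATE = 30e6    # bytes/sec, total rate across all streams

    for n_streams in (1, 4, 16):
        per_stream_bw = AGGREGATE / n_streams
        window = per_stream_bw * RTT      # bytes in flight per stream
        loss_cost = per_stream_bw / 2     # rate lost when one stream's window halves
        print(f"{n_streams:2d} streams: {window / 1e6:.2f} MB in flight per stream, "
              f"one lost packet costs ~{loss_cost / 1e6:.2f} MB/sec")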
Pipes and File Transfer Primitives
• Tell network the bandwidth of your stream using
RSVP, Resource Reservation Protocol
• Network will forward the packets/sec you reserved and drop the rest (QoS) – see the sketch after this list
• Network will not oversubscribe the total bandwidth.
• Network leaves some bandwidth out of the QoS
for others.
• Unused bandwidth is not available to others at
high QoS.
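The “forward what you reserved, drop the rest” behavior is essentially rate policing. A minimal token-bucket sketch of that idea; the reserved rate, burst allowance, and class name are illustrative assumptions, not anything from the talk:

    import time

    class TokenBucketPolicer:
        """Forward packets that fit within a reserved byte rate; drop the excess."""

        def __init__(self, rate_bytes_per_s, bucket_bytes):
            self.rate = rate_bytes_per_s      # reserved bandwidth
            self.capacity = bucket_bytes      # allowed burst
            self.tokens = bucket_bytes
            self.last = time.monotonic()

        def admit(self, packet_bytes):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if packet_bytes <= self.tokens:
                self.tokens -= packet_bytes
                return True                   # forwarded: within the reservation
            return False                      # dropped: exceeds the reservation

    # Example: a 30 MB/sec reservation with an 8-packet burst allowance
    policer = TokenBucketPolicer(rate_bytes_per_s=30_000_000, bucket_bytes=8 * 1500)
    print(policer.admit(1500))                # True while within the reserved rate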
Storage Element
[Diagram: a Storage Element with a set of file servers. On the Grid side (WAN), files are staged in and staged out through the file servers; on the Worker Node side (POSIX-style I/O over the LAN), many worker nodes access the same file servers.]
Storage System and Bandwidth
• The Storage Element does not know the bandwidth of an individual stream very well at all
– For example, a disk may have many simultaneous accessors, or the file may be in memory cache and transferred immediately
– Bandwidth depends on the fileserver’s disk and your disk.
• Requested bandwidth too small?
– If QoS tosses a packet, AIMD will drastically affect transfer rate
• Requested bandwidth too high?
– Bandwidth at QoS level wasted, overall experimental rate suffers
• The Storage Element may know the aggregate bandwidth better than individual stream bandwidths.
– The Storage Element therefore needs to aggregate flows onto a pipe between sites, not deal with QoS on a single flow.
– This means the local network will be involved in aggregation.
FNAL investigations
Investigate support of static and dynamic pipes by
storage systems in WAN transfers.
– Fiber to Starlight optical exchange at Northwestern
University.
– Local improvements to forward traffic flows onto the
pipe from our LAN
– Local improvements to admit traffic flows onto our
LAN from the pipe
– Need changes to Storage System to exploit the WAN
changes.
Fiber to Starlight
• FNAL’s fiber pair has the potential for 33 channels
between FNAL and Starlight (3 to be activated soon)
• Starlight provides FNAL’s access to Research and
Education Networks:
– ESnet
– DOE Science Ultranet
– Abilene
– LHCnet (DOE-funded link to CERN)
– SurfNet
– UKLight
– CA*Net
– National Lambda Rail
LAN – Pipe investigation
• Starlight path bypasses
FNAL border router
• Aggregation of many
flows to fill a (dynamic)
pipe.
• We believe that pipes will
be ‘owned’ by a VO.
• Forwarding to the pipe is
done on a per flow basis
• Starlight path ties directly
to production LAN and
production Storage
Element (no dual NICs).
Forwarding Server
[Diagram: a file server and a forwarding server attached to the site router and core network, with paths out to ESnet and to Starlight.]
Flow-by-flow Strategy
• Storage element identifies flows to the forwarding server
by using layer 5 information
– Host IP, Dest IP, Host Port, Dest Port and Transfer Protocol
– And VO information
• Forwarding server informs peer site to allow admission
• Forwarding server configures local router to forward flow
over DWDM link or the flow takes the default route
– A ~1 GB/sec pipe is about 30 flows at 30 MB/sec.
– If flows are 1 GB files, this yields about 1 flow change/sec (see the sketch after this list)
• Forwarding server allows flows to take alternate path
when dynamic path is torn down.
– Firewalls may have issues with this.
• Incoming flows are analogous
• The flow-by-flow solution seems to suit the problem well, but there are plenty of implementation issues.
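A minimal sketch of the flow-count arithmetic referenced above, using the pipe rate, per-flow rate, and file size quoted in the list:

    # Flow-count arithmetic for filling a site-to-site pipe.
    PIPE_RATE = 1000e6      # bytes/sec (~1 GB/sec pipe)
    FLOW_RATE = 30e6        # bytes/sec per flow
    FILE_SIZE = 1000e6      # bytes (1 GB files)

    concurrent_flows = PIPE_RATE / FLOW_RATE              # ~33 flows fill the pipe
    seconds_per_file = FILE_SIZE / FLOW_RATE              # ~33 sec per file
    flow_changes_per_sec = concurrent_flows / seconds_per_file   # ~1 change/sec

    print(f"~{concurrent_flows:.0f} concurrent flows, "
          f"~{flow_changes_per_sec:.0f} flow change(s)/sec")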
Changes to Storage Element to
exploit dynamic pipes
• Build semantics into bulk copy interfaces that
allow for batching transfers to use bandwidth
when available.
• Based on bandwidth availability, dynamically change the number of files transferred in parallel (see the sketch after this list)
• Based on bandwidth availability, change the
layer-5 (FTP) protocols used
– Switch from FTP to a UDP blaster (SABUL), for example.
– Or change the parameters used to tune layer-5 protocols, for example parallelism within FTP.
• Deal with flows which have not completed when
dynamic pipe is de-allocated.
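A minimal sketch of the “pick parallelism from available bandwidth” idea; the helper name, the 30 MB/sec per-stream figure, and the cap are illustrative assumptions, not an interface from the talk:

    # Pick the number of files to transfer in parallel from the bandwidth
    # currently granted to this site pair (names and figures are illustrative).
    PER_STREAM_RATE = 30e6   # bytes/sec a single stream is assumed to sustain
    MAX_PARALLEL = 50        # cap to avoid overloading file servers

    def choose_parallelism(available_bytes_per_s):
        """Number of files to move in parallel for the granted bandwidth."""
        if available_bytes_per_s <= 0:
            return 0                          # pipe torn down: queue the transfers
        wanted = int(available_bytes_per_s // PER_STREAM_RATE)
        return max(1, min(MAX_PARALLEL, wanted))

    print(choose_parallelism(1000e6))   # ~33 parallel files on a ~1 GB/sec pipe
    print(choose_parallelism(100e6))    # 3 parallel files on a smaller allocation
    print(choose_parallelism(0))        # 0: wait until bandwidth is available again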
Summary
• There are conventional and research
approaches to wide area networks.
• The interactions in the wide area are interesting and important to grid-based data systems
• FNAL now has the facilities in place to
investigate a number of these issues.
• Storage Elements are important parts of the investigation and require changes to achieve high throughput and reliable transfers over the WAN