Advances in Integrated Storage,
Transfer and Network Management
<x>
Email: <x>@<y>
On behalf of the Ultralight Collaboration
Networking and HEP
• The network is a key element that HEP needs to work on, as much as its Tier1 and Tier2 facilities.
• Network bandwidth is going to be a scarce, strategic resource; the ability to co-schedule network and other resources will be crucial
  - US CMS Tier1/Tier2 experience has shown this (tip of the iceberg)
• HEP needs to work with US LHCNet network providers/operators/managers, and with experts on the upcoming circuit-oriented management paradigm
  - Joint planning is needed, for a consistent outcome
• The provision of managed circuits is also a direction for ESnet and Internet2, but LHC networking will require it first
• We also need to exploit now-widely-available technologies:
  - Reliable data transfers, in-depth monitoring, end-to-end workflow information (including every network flow, end-system state and load)
• Risk: lack of efficient workflow and readiness leads to lack of competitiveness
  - Lack of engagement has made this a DOE-funding issue as well; even the planned bandwidth may not be there
Transatlantic Networking:
Bandwidth Management with Channels
• The planned bandwidth for 2008-2010 is already well below the capabilities of current end-systems, appropriately tuned
  - Already evident in intra-US production flows (Nebraska, UCSD, Caltech)
  - The "current capability" limit of nodes is still an order of magnitude greater than the best results seen in production so far
• This can be extended to transatlantic flows, using a series of straightforward technical steps: configuration, diagnosis, monitoring
  - Involving the end-hosts and storage systems, as well as networks
• The current lack of transfer volume from non-US Tier1 sites is not, and cannot be, an indicator of future performance
• Upcoming paradigm: managed bandwidth channels
  - Scheduled: multiple classes of "work", with a settable number of simultaneous flows per class, like computing facility queues (see the sketch below)
  - Monitored and managed: BW will be matched to throughput capability [schedule coherently; free up excess BW resources as needed]
  - Respond to: errors; schedule changes; progress-rate deviations
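For illustration only (the class names, limits, and the use of Java semaphores below are assumptions, not part of the talk), a settable number of simultaneous flows per class can be enforced the same way a batch system limits running jobs:

```java
import java.util.Map;
import java.util.concurrent.Semaphore;

// Illustrative sketch: "multiple classes of work, settable number of
// simultaneous flows per class", like computing facility queues.
// The class names and limits are assumed, for illustration only.
public class ChannelClassLimits {
    static final Map<String, Semaphore> slots = Map.of(
        "production",    new Semaphore(8),
        "analysis",      new Semaphore(4),
        "opportunistic", new Semaphore(2));

    static void runFlow(String cls, Runnable transfer) throws InterruptedException {
        Semaphore s = slots.get(cls);
        s.acquire();                 // wait for a free flow slot in this class
        try { transfer.run(); }
        finally { s.release(); }     // free the slot for the next queued flow
    }
}
```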
Computing, Offline and CSA07 (CMS)
US Tier1 – Tier2 Flows Reach 10 Gbps (1/2007 snapshot)
[Plot: one well-configured site (Nebraska) reaching 10 Gbps. What happens if there are 10 such sites?]
Internet2’s New Backbone
• Level(3) footprint; Infinera 10 x 10G core; CIENA optical muxes
• Initial deployment: 10 x 10 Gbps wavelengths over the footprint
• First round maximum capacity: 80 x 10 Gbps wavelengths; expandable
• Scalability: potential migration to 40 Gbps or 100 Gbps capability
• Transition to the NewNet is underway now (since 10/2006), until the end of 2007
Network Services for Robust Data Transfers: Motivation
• Today's situation:
  - End-hosts and applications are unaware of the network topology, the available paths, and the bandwidth available (with other flows) on each
  - The network is unaware of transfer schedules and end-host capabilities
• High performance networks are not a ubiquitous resource
  - Applications might get full BW only on the access part of the network
  - Usually only to the next router
  - Mostly not beyond the network under the site's control
• Better utilization needs joint management; work together
  - To schedule and allocate network resources along with storage and CPU: using queues, and quotas where needed
Data transfer applications can no longer treat the network as a passive resource. They can only be efficient if they are "network-aware": communicating with, and delegating many of their functions to, network-resident services.
Data Transfer Experience
• Transfers are generally much slower than expected, or slow down, or even stop. Many potential causes remain undiagnosed:
  - Configuration problem? Loading? Queuing? Errors? etc.
  - End-host problem, network problem, or application failure?
• The application (blindly) retries, and perhaps sends emails through hypernews, to request "expert" troubleshooting
  - Insufficient information, too slow to diagnose and correlate at the time the error occurs
• Result: lower transfer rates, and people spending more and more time troubleshooting
• Can a manual, non-integrated approach scale to a collaboration the size of CMS/ATLAS when data transfer needs increase, and when we will compete for (network) resources with other experiments (and sciences)?
• PhEDEx (CMS) and FTS are not the only end-user applications that face transfer challenges (ask ATLAS, for example)
Managed Network Services Integration
• The network alone is not enough
  - Proper integration and interaction are needed with the end-hosts and storage systems
• Multiple applications will want to use the network resource
  - ATLAS, ALICE and LHCb
  - Private transfer tools; analysis applications streaming data
  - Real-time streams (e.g. videoconferencing)
• There is no need for different domains to develop their own monitoring and troubleshooting tools for a resource that they share
  - Manpower intensive; duplication of effort; "reinvention"
  - Lack of sufficient operational expertise, lack of access to sufficient information, and lack of control
  - So these systems will (and do) lack the needed functionality, performance, and reliability
Managed Network Services Strategy (1)
[Sidebar: analogous to real-time operation, with internal monitoring and diagnosis]
Network Services Integrated with End Systems
• Need to use a real-time, end-to-end view of the network and end-systems, based on in-depth end-to-end monitoring
• Need to extract and correlate information (e.g. network state, end-host state, transfer-queue state)
• Solution: the network and end-hosts exchange information via real-time services and "end-host agents" (EHAs)
  - Provide sufficient information for decision support
• EHAs and network services can cooperate to automate some operational decisions, based on accumulated experience
  - Especially where they become reliable enough
  - Automation will become essential as the usage and number of users scale up, and competition increases
Managed Network Services Strategy (2)
[Sidebar: real-time control and adaptation: an optimizable system]
[Semi-]Deterministic Scheduling
• Receive request: "Transfer N bytes from site A to B with throughput R1"; authenticate/authorize/prioritize
• Verify end-host (HW, config.) capability (R2); schedule bandwidth B > R2; estimate the time to complete T(0)
• Schedule a path, with priorities P(i) on segments S(i)
• Check periodically: compare the rate R(t) to R2; compare and update the time to complete T(i) against T(i-1) (sketched below)
• Triggers: error (e.g. segment failure); variance beyond thresholds (e.g. progress too slow; channel underutilized; wait in queue too long); state change (e.g. new high priority transfer submitted)
• Dynamic actions: build an alternative path; change the channel size; create a new channel and squeeze others in the class; etc.
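The control loop above can be summarized in a short sketch. This is a hypothetical Java illustration (the class name, fields, and the 10% variance threshold are assumptions, not the UltraLight code): compare the observed rate R(t) to the verified capability R2, update the estimated time to completion T(i), and raise a trigger on deviations.

```java
// Hypothetical sketch of the [semi-]deterministic scheduling check loop.
// Names and thresholds are illustrative; not from the UltraLight code base.
public class TransferScheduleCheck {

    static final double VARIANCE_THRESHOLD = 0.10;  // assumed: 10% tolerated deviation

    long totalBytes;        // N: bytes requested
    long bytesDone;         // bytes transferred so far
    double r2BytesPerSec;   // R2: verified end-host capability
    double etcSeconds;      // T(i): current estimated time to completion

    TransferScheduleCheck(long totalBytes, double r2BytesPerSec) {
        this.totalBytes = totalBytes;
        this.r2BytesPerSec = r2BytesPerSec;
        this.etcSeconds = totalBytes / r2BytesPerSec;   // T(0)
    }

    /** Periodic check: compare R(t) to R2 and update T(i); return a trigger if any. */
    String check(double observedRate, long newBytesDone) {
        bytesDone = newBytesDone;
        double previousEtc = etcSeconds;                               // T(i-1)
        etcSeconds = (totalBytes - bytesDone) / Math.max(observedRate, 1.0);  // T(i)

        if (observedRate < r2BytesPerSec * (1.0 - VARIANCE_THRESHOLD)) {
            return "PROGRESS_TOO_SLOW";      // e.g. reroute, or grow the channel
        }
        if (observedRate > r2BytesPerSec * (1.0 + VARIANCE_THRESHOLD)) {
            return "CHANNEL_UNDERUTILIZED";  // free excess bandwidth for other classes
        }
        if (etcSeconds > previousEtc) {
            return "ETC_GREW";               // T(i) > T(i-1): re-examine the schedule
        }
        return null;                          // on track
    }
}
```

A real scheduler would map each trigger to the dynamic actions listed above (reroute, resize a channel, or create a new one and squeeze others in the class).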
Network Aware Scenario: Link-Segment Outage
NMA: Network Monitoring Agent
EHA: End-Host Agent
DTSS: Dataset Transfer & Scheduling Service (fully distributed)
[Diagram: a data transfer between a source and a target end-system, each running an EHA; the NMA monitors the path and notifies the DTSS, which can reroute / set up circuits and notify the EHAs]
(1) The NMA detects a problem (e.g. a link-segment is down)
(2) The NMA notifies the DTSS
(3) The DTSS re-routes the traffic if bandwidth is available on a different route
  - Transparent to the end-systems
  - Rerouting takes the transfer priority into account
(4) Otherwise, if there is no alternative, the DTSS notifies the EHAs and puts the transfer on hold
End hosts and applications must interact with network services to achieve this
Network Aware Scenario: End-Host Problem
NMA: Network Monitoring Agent
EHA: End-Host Agent
DTSS: Dataset Transfer & Scheduling Service (fully distributed)
[Diagram: a data transfer between a source and a target end-system; the source EHA notifies the DTSS, which adjusts the circuits; the NMA confirms the change]
(1) The EHA detects a problem with the end-host (e.g. throughput on the NIC lower than expected) and notifies the DTSS
(2) The DTSS squeezes the corresponding circuit, giving more bandwidth to the next highest priority transfer sharing (part of) the path
(3) The NMA senses the increased throughput
(4) The NMA notifies the DTSS
Network AND end-host monitoring allow root cause analysis!
[Plot: bandwidth used by the source end-system drops when the NIC problem occurs, and returns to nominal once the problem is resolved, as detected by network monitoring]
End Host Agent (EHA)
• Authorization; service discovery; system configuration
• Complete local profiling of host hardware and software
• Complete monitoring of host processes, activity, and load
• End-to-end performance measurements
• Acts as an active listener to "events" generated by local apps
[Diagram: the EHA monitors CPU, memory, disk, disk I/O, system and network, annotated "Minimize!", i.e. keep the agent's own overhead low]
• The EHA implements secure APIs that applications can use for requests and interactions with network services (an illustrative interface is sketched below)
• The EHA continuously receives estimations of the time-to-complete for its requests
• Where possible, the EHA will help provide correct system & network configuration to the end-hosts
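As a purely illustrative sketch of such an API surface (all method and type names below are assumptions, not the actual LISA/EHA interface):

```java
// Hypothetical interface for an End Host Agent (EHA); the method names are
// illustrative assumptions derived from the roles listed on this slide.
public interface EndHostAgent {

    /** Authenticate/authorize a client and discover available network services. */
    void authorizeAndDiscoverServices(String credentials);

    /** Snapshot of host state: CPU, memory, disk I/O, network load. */
    HostProfile profileHost();

    /** Run an end-to-end performance measurement toward a remote host. */
    double measureEndToEndThroughput(String remoteHost);

    /** Submit a transfer request to the network services on behalf of an application. */
    String submitTransferRequest(String source, String target,
                                 long bytes, int priority, long deadlineSeconds);

    /** Latest estimate of the time-to-complete for a submitted request. */
    double estimatedTimeToComplete(String requestId);

    /** Record of host hardware and current load, as profiled by the agent. */
    record HostProfile(int cpuCores, double loadAverage,
                       long memBytesFree, double diskReadMBps,
                       double diskWriteMBps, double nicGbps) {}
}
```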
End Host/User Agent is a Reality in EVO (also FDT, VINCI, etc.)
Key characteristic: a coherent real-time architecture of inter-communicating agents and services with correlated actions
• The EHA monitors various system & network parameters
• It enables (suggested) automatic actions based on anomalies, to ensure proper quality (e.g. automatically limit video if bandwidth is limited)
Specialized Agents Supporting Robust Data Transfer Operations
Example agent services supporting transfer operations & management:
• Build a real-time, dynamic map of the inter-connection topology among the services, & the channels on each link-segment, including:
  - How the channels are allocated (bandwidth, priority) and used
  - Information on the total traffic and major flows on each segment
• Locate network resources & services which have agreements with a given VO
• List the routers and switches along each path that are "manageable"
• Determine which path-options exist between source and destination
• Among N replicas of a data source, and M possible network paths, "discover" (with monitoring) the least estimated time to completion of a transfer to a given destination; take the reliability record into account (sketched below)
• Detect problems in real time, such as:
  - Loss or impairment of connectivity; CRC errors; asymmetric routing
  - A large rate of lost packets, or retransmissions (above a threshold)
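The replica/path discovery service above can be sketched as a simple scoring loop. This is a hypothetical illustration: the penalty form (dividing the estimated time by the path's historical success rate) is an assumption, not the UltraLight algorithm.

```java
import java.util.List;

// Hypothetical sketch: among N replicas and M candidate paths, choose the
// combination with the least estimated time to completion (ETC), penalized
// by the path's reliability record. The penalty form is illustrative.
public class ReplicaPathSelector {

    /** reliability in (0,1]: fraction of past transfers on this path that completed cleanly. */
    public record Option(String replica, String path,
                         double monitoredBandwidthMBps, double reliability) {}

    public static Option bestOption(long transferSizeMB, List<Option> options) {
        Option best = null;
        double bestEtc = Double.MAX_VALUE;
        for (Option o : options) {
            double etc = transferSizeMB / o.monitoredBandwidthMBps();
            double penalizedEtc = etc / o.reliability();  // unreliable paths look "slower"
            if (penalizedEtc < bestEtc) {
                bestEtc = penalizedEtc;
                best = o;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up monitoring numbers: a fast but flaky path loses to a steadier one.
        List<Option> options = List.of(
            new Option("replicaA", "path1", 500.0, 0.99),
            new Option("replicaA", "path2", 700.0, 0.60),
            new Option("replicaB", "path3", 450.0, 0.98));
        System.out.println(bestOption(1_200_000, options));  // picks replicaA / path1
    }
}
```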
b Implemented as a set of
The Scheduler
collaborating agents:
No single point of failure
 Independent agents control different
segments or channels
 Negotiate for end-to-end connection, using cost functions to find
“lowest-cost” path (on the topological graph): based on VO, priorities,
pre-reservations, etc.
 An agent offers a “lease” for use of a segment or channel to its peers
 Periodic lease renewals are used by all agents
 Allows a flexible response of the agents to task completion,
application failure or network errors.
 If a segment failure or impairment is detected, DTSS automatically
provides path-restoration based on the discovered network topology
 Resource reallocation may be applied to preserve high priority tasks
 The supervising agents then release all other channels or segments
along a path:
An alternative path may be set up: rapidly enough to avoid a TCP
timeout, and thus to allow the transfer to continue uninterrupted.
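The "lowest-cost path on the topological graph" negotiation can be illustrated with a standard shortest-path search over per-segment costs. The cost function below (load-weighted, discounted by priority) is an illustrative assumption, not the scheduler's actual cost model.

```java
import java.util.*;

// Illustrative sketch of lowest-cost end-to-end path selection on the
// topological graph: Dijkstra with a per-segment cost function. How VO
// and priority enter the cost is an assumption made for this example.
public class LowestCostPath {

    record Edge(String to, double baseCost, double load) {}

    /** Cost function (assumed): loaded segments cost more; priority discounts the load term. */
    static double cost(Edge e, int priority) {
        return e.baseCost() * (1.0 + e.load() / priority);
    }

    static List<String> lowestCostPath(Map<String, List<Edge>> graph,
                                       String src, String dst, int priority) {
        record QE(String node, double d) {}
        Map<String, Double> dist = new HashMap<>();
        Map<String, String> prev = new HashMap<>();
        PriorityQueue<QE> pq = new PriorityQueue<>(Comparator.comparingDouble(QE::d));
        dist.put(src, 0.0);
        pq.add(new QE(src, 0.0));
        while (!pq.isEmpty()) {
            QE qe = pq.poll();
            if (qe.d() > dist.getOrDefault(qe.node(), Double.MAX_VALUE)) continue; // stale entry
            if (qe.node().equals(dst)) break;   // cheapest path to the destination settled
            for (Edge e : graph.getOrDefault(qe.node(), List.of())) {
                double alt = qe.d() + cost(e, priority);
                if (alt < dist.getOrDefault(e.to(), Double.MAX_VALUE)) {
                    dist.put(e.to(), alt);
                    prev.put(e.to(), qe.node());
                    pq.add(new QE(e.to(), alt));
                }
            }
        }
        // Reconstruct the path by walking predecessors back from the destination.
        LinkedList<String> path = new LinkedList<>();
        for (String n = dst; n != null; n = prev.get(n)) path.addFirst(n);
        return path.getFirst().equals(src) ? path : List.of();  // empty if unreachable
    }
}
```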
Dynamic Path Provisioning & Queueing
• Channel allocation based on VO/priority [+ wait time, etc.]
• Create an end-to-end path or channel on demand & configure the end-hosts
• Automatic recovery (rerouting) in case of errors
• Dynamic reallocation of throughput per channel: to manage priorities and control time to completion, where needed
• Reallocate resources that were requested but not used
[Diagram: a user request flows through the End Host Agents to scheduling, control and monitoring, with real-time feedback to the user]
[Plot and screenshot: dynamic end-to-end optical path creation & usage, with an FDT transfer sustaining 200+ MBytes/sec from a 1U node across 4 fiber-cut emulations; and a troubleshooting email ("Minimize emails like this!")]
FDT Disk-to-Disk WAN Transfers Between Two End-Hosts
• 1U nodes with 4 disks (New York – Geneva): reads and writes on 4 SATA disks in parallel on each server; mean traffic ~210 MB/s (~0.75 TB per hour)
• 4U disk servers with 24 disks (CERN – Caltech): reads and writes on two 12-port RAID controllers in parallel on each server; mean traffic ~545 MB/s (~2 TB per hour)
Working on integrating FDT with dCache
• Expose system information that cannot be handled automatically
  - Drill-down, with simple auto-diagnostics, using ML/ApMon in AliEn
• RSS feeds and email alerts when anomalies occur whose correction is not automated
• Incrementally try to automate these anomaly corrections (if possible), based on accumulated experience
System Architecture: the Main Services
[Diagram: applications and End User Agents sit on top of a service stack providing Authentication, Authorization and Accounting; System Evaluation & Optimization; Scheduling & Queues; Dynamic Path & Channel Building and Allocation; Failure Detection; Prediction; Learning; Topology Discovery; and Control Path Provisioning, over GMPLS, MPLS, OS and SNMP monitoring]
Several services are already operational and field-tested
System Architecture: Scenario View
[Diagram: scenario view of the main services in operation]
Interoperation with Other Network Domains
[Diagram]
Benefits to LHC
It is important to establish a smooth transition from the current situation to the one envisioned, without interruption of currently available services.
• Higher performance transfers
• Better utilization of our network resources
• Applications interface with network services & monitoring, instead of creating their own, so less manpower is required from CMS
• Decreased email traffic
  - Less time spent on support; less manpower from CMS
• End-to-end monitoring and alerts are provided
  - More transparency for troubleshooting (where it is not automated)
• Incremental work on automating anomaly-handling where possible, based on accumulated experience
• Synergy with US-funded projects
  - Manpower is available to support the transition
• A distributed system, with pervasive monitoring and rapid reaction time
  - Decreases single points of failure
  - Provides information and diagnosis that is not otherwise possible
• Several services are already operational and field-tested
We intend progressive development and deployment (not "all at once"), but there are strong benefits in early deployment: cost, efficiency, readiness.
Resources: For More Information
• End Host Agent (LISA): http://monalisa.caltech.edu/monalisa__Interactive_Clients__LISA.html
• ALICE monitoring and alerts: http://pcalimonitor.cern.ch/
• Fast Data Transfer: http://monalisa.cern.ch/FDT/
• Network services (VINCI): http://monalisa.caltech.edu/monalisa__Service_Applications__Vinci.html
• EVO with End Host Agent: http://evo.caltech.edu/
• UltraLight: http://ultralight.org
See the various posters on this subject at CHEP
Extra Slides Follow
US LHCNet, ESnet Plan 2007-2010: 30-80 Gbps US-CERN, ESnet MANs
[Map: the ESnet4 topology, with ESnet hubs and new hubs (SEA, SNV, DEN, ALB, ELP, SDG, CHI, NYC, DC, ATL), metropolitan area rings, major DOE Office of Science sites (FNAL, BNL), high-speed cross connects with Internet2/Abilene, and international links to AsiaPac, Japan, Australia and Europe (CERN, GEANT2, SURFnet, IN2P3). Legend: production IP ESnet core (10 Gbps enterprise IP traffic); Science Data Network core (40-60 Gbps circuit transport); lab-supplied, major international, and LHCNet Data Network links]
• ESnet4 SDN core: 30-50 Gbps; ESnet IP core ≥10 Gbps
• US-LHCNet: wavelength quadrangle NY-CHI-GVA-AMS; 2007-10: 30, 40, 60, 80 Gbps
• US-LHCNet network plan: 3 to 8 x 10 Gbps US-CERN
• ESnet MANs to FNAL & BNL; dark fiber to FNAL; peering with GEANT
• NSF/IRNC circuit; GVA-AMS connection via SURFnet or GEANT2
FDT: Fast Data Transport
SC06 Results 11/14 – 11/15/06
Efficient data transfers:
• Stable disk-to-disk flows Tampa-Caltech: stepping up to 10-to-10 and 8-to-8 1U server-pairs, 9 + 7 = 16 Gbps; then solid overnight, using one 10G link
• Reading and writing at disk speed over WANs (with TCP) for the first time
• New capability level: ~70 Gbps per rack of low-cost 1U servers
• SC06 results: 17.7 Gbps on one link; 8.6 Gbps to/from Korea
• In Java: highly portable, runs on all major platforms
• Based on an asynchronous, multithreaded system
• Streams a dataset (a list of files) continuously, from a managed pool of buffers in kernel space, through an open TCP socket (sketched below)
  - Smooth data flow from each disk to/from the network
  - No protocol start-phase between files
I. Legrand
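FDT itself is asynchronous and multithreaded, with a managed pool of buffers in kernel space; the essential idea of streaming a list of files through one open TCP socket with no per-file protocol start-phase can nevertheless be sketched in a few lines of Java NIO. This is a sequential simplification, not FDT's actual code:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Simplified sketch of the core FDT idea: stream a dataset (a list of files)
// continuously through one open TCP socket, with no protocol start-phase
// between files. (FDT's real implementation is asynchronous/multithreaded.)
public class StreamingSender {
    public static void send(List<Path> files, String host, int port) throws IOException {
        try (SocketChannel sock = SocketChannel.open(new InetSocketAddress(host, port))) {
            for (Path file : files) {
                try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
                    long pos = 0, size = in.size();
                    // transferTo can use zero-copy (e.g. sendfile) where the OS
                    // supports it, keeping a smooth flow from disk to the network.
                    while (pos < size) {
                        pos += in.transferTo(pos, size - pos, sock);
                    }
                }
                // The next file is written into the same TCP stream immediately:
                // the connection (and its congestion window) is never restarted.
            }
        }
    }
}
```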
Dynamic Network Path Allocation & Automated Dataset Transfer

>mlcopy A/fileX B/path/
OS path available
Configuring interfaces
Starting Data Transfer

[Diagram: the MonALISA distributed service system provides real-time monitoring of the transfer between hosts A and B; LISA agents on each host set up the net interfaces, TCP stack, kernel and routes ("use eth1.2, ..."), while a MonALISA service and an OS agent control the optical switch along the path]
The system detects errors and automatically recreates the path in less than the TCP timeout (< 1 second)
The Functionality of the VINCI System
[Diagram: MonALISA services and ML proxy services, with ML agents at sites A, B and C operating at three layers: Layer 3 (routers), Layer 2 (Ethernet LAN-PHY or WAN-PHY), and Layer 1 (DWDM fiber), each layer controlled by dedicated agents]
Monitoring Network Topology, Latency, Routers
[Screenshot: real-time topology discovery & display of networks, routers and ASes]
Four Continent Testbed
Building a global, network-aware end-to-end managed real-time Grid
Scenario: Preemption of a Lower Priority Transfer
NMA: Network Monitoring Agent
EHA: End-Host Agent
DTSS: Data Transfer & Scheduling Service
[Diagram: EHAs at end-system 1 and end-system 2 (sources) and at end-system 3 (target), with data transfers sharing a path managed by the DTSS]
(1) At time t0, a high priority request (e.g. a T0 – T1 transfer) is received by the DTSS
(2) The DTSS squeezes the normal priority channel and allocates the necessary bandwidth to the high priority transfer (sketched below)
(3) The DTSS notifies the EHAs
At the end of the high priority transfer (t1), the DTSS restores the originally allocated bandwidth and notifies the EHAs
[Plot: allocated bandwidth vs. time; at t0 the normal priority allocation is squeezed and a new high priority allocation is created; at t1 the original allocation is restored]
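The squeeze-and-restore behaviour in step (2) and at t1 can be sketched as follows (a toy illustration; the capacities, names and numbers are assumptions, not the DTSS implementation):

```java
// Toy sketch of the preemption scenario above: at t0 the DTSS squeezes the
// normal-priority channel to fit a high-priority request on a shared link,
// and at t1 it restores the original allocation. All values are illustrative.
public class PreemptionExample {

    static double linkCapacityGbps = 10.0;
    static double normalChannelGbps = 8.0;       // allocation before t0
    static double savedNormalGbps;

    /** t0: a high-priority request for 'neededGbps' arrives. */
    static double admitHighPriority(double neededGbps) {
        savedNormalGbps = normalChannelGbps;
        double free = linkCapacityGbps - normalChannelGbps;
        if (free < neededGbps) {
            normalChannelGbps -= (neededGbps - free);   // squeeze, then notify EHAs
        }
        return neededGbps;                              // high-priority channel size
    }

    /** t1: the high-priority transfer completes. */
    static void restore() {
        normalChannelGbps = savedNormalGbps;            // restore, then notify EHAs
    }

    public static void main(String[] args) {
        double hp = admitHighPriority(6.0);
        System.out.printf("high=%g normal=%g%n", hp, normalChannelGbps);  // 6 and 4
        restore();
        System.out.printf("normal restored to %g%n", normalChannelGbps);  // 8
    }
}
```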
Advanced Examples
Reservation cancellation by the user:
• A transfer-request cancellation has to be propagated to the DTSS
  - It is not enough to cancel the job on the end-host
• The DTSS will schedule the next transfer in the queue ahead of time
  - The application might not have its data ready (e.g. staging from tape)
  - The DTSS has to communicate with the EHAs that could profit from the early schedule; the end host has to advertise when it can accept the circuit
Reservation extension (transfer not finished on schedule):
• If possible, the DTSS keeps the circuit
  - Same bandwidth, if available
  - Taking the transfer priority into account
  - Possibly with a quota limitation
• Progress tracking:
  - The EHA has to announce the remaining transfer size
  - Recalculate the Estimated Time to Completion, taking the new bandwidth into account (see the sketch below)
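A minimal sketch of that progress-tracking step (the names and the quota rule are illustrative assumptions): the EHA announces the remaining size, and the DTSS recomputes the estimated time to completion at the new bandwidth before deciding whether to keep the circuit.

```java
// Hypothetical sketch of the reservation-extension logic described above.
// The quota rule is an illustrative assumption, not the DTSS policy.
public class ReservationExtension {

    /** EHA announces the remaining size; recompute ETC at the new bandwidth. */
    static double etcSeconds(long remainingBytes, double newBandwidthBytesPerSec) {
        return remainingBytes / newBandwidthBytesPerSec;
    }

    /** Decide whether the DTSS can keep the circuit past the original schedule. */
    static boolean keepCircuit(long remainingBytes, double availableBytesPerSec,
                               double quotaSecondsLeft) {
        // Keep the circuit only if the recomputed ETC fits the remaining quota;
        // a real DTSS would also weigh the transfer's priority here.
        return etcSeconds(remainingBytes, availableBytesPerSec) <= quotaSecondsLeft;
    }
}
```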
Application in 2008 (Pseudo-log)

Node1> ts -v -in mercury.ultralight.org:/data01/big/zmumu05687.root -out venus.ultralight.org:/mstore/events/data -prio 3 -deadline +2:50 -xsum
-TS: Initiating file transfer setup…
-TS: Contact path discovery service (PDS) through EHA for transfer request
-PDS: Path discovery in progress…
-PDS: Path RTT 128.4 ms, best effort path bottleneck is 10 GE
-PDS: Path options found:
-PDS:   Lightpath option exists end-to-end
-PDS:   Virtual pipe option exists (partial)
-PDS:   High-performance protocol capable end-systems exist
-TS: Request 1.2 TB file transfer within 2 hours 50 minutes, priority 3
-TS: Target host confirms available space for [email protected]
-TS: EHA contacted…parameters transferred
-EHA: Priority 3 request allowed for [email protected]
-EHA: request scheduling details
-EHA: Lightpath prior scheduling (higher/same priority) precludes use
-EHA: Virtual pipe sizeable to 3 Gbps available for 1 hour starting in 52.4 minutes
-EHA: request monitoring prediction along path
-EHA: FAST-UL transfer expected to deliver 1.2 Gbps (+0.8/-0.4) averaged over next 2 hours 50 minutes
Hybrid Nets with Circuit-Oriented Services: Lambda Station Example
Also OSCARS, TeraPaths, UltraLight
A network path forwarding service to interface production facilities with advanced research networks:
• Goal is selective forwarding on a per-flow basis & alternate network paths for high impact data movement
• Dynamic path modification, with graceful cutover & fallback
Lambda Station interacts with:
• Host applications & systems
• LAN infrastructure
• Site border infrastructure
• Advanced technology WANs
• Remote Lambda Stations
D. Petravick, P. DeMar
TeraPaths: BNL and Michigan; Partnering with OSCARS (ESnet), Lambda Station (FNAL) and DWMI (SLAC)
• Investigate: integration & use of LAN QoS and MPLS-based differentiated network services in the ATLAS distributed computing environment
• As a way to manage the network as a critical resource
[Diagram: the TeraPaths resource manager sits between data management / data transfer (GridFTP & SRM at the storage element), Grid AA, monitoring, and network usage policy; it handles bandwidth requests & releases and issues MPLS and remote LAN QoS requests toward the LAN/MPLS in(e)gress, the ESnet network (OSCARS) and remote TeraPaths instances; traffic identification uses addresses, port numbers and DSCP bits]
Dantong Yu
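TeraPaths identifies traffic by addresses, port numbers and DSCP bits. As a generic illustration (not TeraPaths code; the host name is hypothetical), a Java application can mark its own flow's packets with a DSCP value through the standard Socket.setTrafficClass call, so that LAN QoS / MPLS policies can classify it:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Generic illustration of DSCP-based traffic identification (not TeraPaths code):
// the application marks its flow so that QoS policies can classify it.
public class DscpMarkedFlow {
    public static void main(String[] args) throws IOException {
        int dscp = 46;                       // Expedited Forwarding (EF), for example
        int trafficClass = dscp << 2;        // DSCP occupies the top 6 bits of the TOS byte
        try (Socket s = new Socket()) {
            s.setTrafficClass(trafficClass); // set before connect (undefined afterwards for TCP)
            s.connect(new InetSocketAddress("dtn.example.org", 5000)); // hypothetical host
            s.getOutputStream().write("hello".getBytes());
        }
    }
}
```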
The Main Components of the Scheduler
[Diagram: a user request, together with the possible path/channel options for the requested priority, feeds the scheduling decision; the decision drives network control & configuration and end-host control & configuration, with real-time progress monitoring back to the user. Inputs to the decision include: priority queues for each segment; reservations & active-transfer lists; the topological graph with traffic information; failure detection; application (data transfers, storage) monitoring via ApMon and dedicated modules; network & end-host monitoring via SNMP, sFlow, TL1, trace, ping, available bandwidth, /proc and TCP configuration; dedicated agents; and learning and prediction]
Learning and Prediction
We must also consider applications that do not use our APIs, and provide effective network services for all applications.
Learning algorithms (such as self-organizing neural networks) will be used to evaluate the traffic created by other applications, to identify major patterns, and to dynamically set up effective connectivity maps based on the monitoring data.
It is very difficult, if not impossible, to predict all possible events in a complex environment like the Grid, and to compile the knowledge about those events in advance. Learning is the only practical approach by which agents can acquire the necessary information to describe their environments.
We approach this multi-agent learning task at two different levels:
• the local level of individual learning agents
• the global level of inter-agent communication
We need to ensure that each agent can be optimized from the local perspective, while the global monitoring mechanism acts as a 'driving force' to evolve the agents collectively, based on previous experience.
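As a concrete, purely illustrative example of the kind of learning named above, a minimal one-dimensional self-organizing map can cluster monitored traffic feature vectors into a few "major pattern" prototypes. The map size, parameters, feature choice and the stand-in random data below are all assumptions:

```java
import java.util.Random;

// Minimal 1-D self-organizing map (SOM) sketch: cluster traffic feature
// vectors (here, hypothetical [rate, burstiness] pairs) into a small set
// of "major pattern" units. Map size and parameters are illustrative.
public class TrafficSom {
    static final int UNITS = 8, DIM = 2;
    static double[][] w = new double[UNITS][DIM];   // unit weight vectors

    static int bestMatchingUnit(double[] x) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int u = 0; u < UNITS; u++) {
            double d = 0;
            for (int i = 0; i < DIM; i++) d += (x[i] - w[u][i]) * (x[i] - w[u][i]);
            if (d < bestD) { bestD = d; best = u; }
        }
        return best;
    }

    /** One online update: pull the winner and its neighbors toward the sample. */
    static void train(double[] x, double learnRate, double sigma) {
        int bmu = bestMatchingUnit(x);
        for (int u = 0; u < UNITS; u++) {
            double h = Math.exp(-((u - bmu) * (u - bmu)) / (2 * sigma * sigma));
            for (int i = 0; i < DIM; i++) w[u][i] += learnRate * h * (x[i] - w[u][i]);
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int u = 0; u < UNITS; u++)
            for (int i = 0; i < DIM; i++) w[u][i] = rnd.nextDouble();
        // Feed monitored samples; decay learning rate and neighborhood over time.
        for (int t = 0; t < 10_000; t++) {
            double[] sample = { rnd.nextGaussian(), rnd.nextGaussian() }; // stand-in data
            train(sample, 0.5 * Math.exp(-t / 2500.0), 3.0 * Math.exp(-t / 2500.0) + 0.5);
        }
        System.out.println("unit 0 prototype: [" + w[0][0] + ", " + w[0][1] + "]");
    }
}
```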