Dynamic Lightpaths


LHCnet: Proposal for LHC
Network infrastructure extending
globally to Tier2 and Tier3 sites
Artur Barczyk, Harvey Newman
California Institute of Technology / US LHCNet
LHCT2S Meeting
CERN, January 13th, 2011
1
THE PROBLEM TO SOLVE
2
LHC Computing Infrastructure
WLCG in brief:
• 1 Tier-0 (CERN)
• 11 Tier-1s; 3 continents
• 164 Tier-2s; 5 (6) continents
Plus O(300) Tier-3s worldwide
3
CMS Data Movements
(All Sites and Tier1-Tier2)
[Charts: CMS transfer throughput, GBytes/s]
• 120 days (June-October): daily average total rates reach over 2 GBytes/s;
daily average T1-T2 rates reach 1-1.8 GBytes/s
• 132 hours (last week, 10/6-10/10): 1-hour averages to 3.5 GBytes/s;
to ~50% during dataset reprocessing & repopulation
• Tier2-Tier2 traffic ~25% of Tier1-Tier2 traffic
4
Worldwide data distribution and analysis (F.Gianotti)
Total throughput of ATLAS data through the Grid, 1st January to November:
[Chart, MB/s per day: 6 GB/s; ~2 GB/s (design); peaks of 10 GB/s reached]
Grid-based analysis in Summer 2010: >1000 different users; >15M analysis jobs
The excellent Grid performance has been crucial for fast release of physics results. E.g.:
ICHEP: the full data sample taken until Monday was shown at the conference on Friday
5
Changing LHC Data Models
• 3 recurring themes:
– Flat(ter) hierarchy: Any site might in the future pull data from any other
site hosting it.
– Data caching: Analysis sites will pull datasets from other sites “on
demand”, including from Tier2s in other regions
• Possibly in combination with strategic pre-placement of data sets
– Remote data access: jobs executing locally, using data cached at a
remote site in quasi-real time
• Possibly in combination with local caching
• Expect variations by experiment
6
Ian Bird, CHEP conference, Oct 2010
7
Remote Data Access and Local
Processing with Xrootd (CMS)
• Useful for smaller sites with less (or even no) data storage
• Only selected objects are read (with object read-ahead);
no transfer of entire data sets
• CMS demonstrator: Omaha diskless Tier3, served data from
Caltech and Nebraska (Xrootd)
• Similar operations in ALICE for years
Strategic Decisions:
Remote Access vs Data Transfers
Brian Bockelman, September 2010
8
Ian Bird, CHEP conference, Oct 2010
9
Requirements summary
(from Kors’ document)
• Bandwidth:
– Ranging from 1 Gbps (Minimal site) to 5-10 Gbps (Nominal) to N x 10
Gbps (Leadership)
– No need for full-mesh @ full-rate, but several full-rate connections
between Leadership sites
– Scalability is important:
• sites are expected to migrate Minimal → Nominal → Leadership
• Bandwidth growth: Minimal = 2x/yr, Nominal & Leadership = 2x/2yr (a small projection sketch follows this list)
• “Staging”:
– Facilitate good connectivity to so far (network-wise) underserved sites
• Flexibility:
– Should be able to include or remove sites at any time
• Budget Neutrality:
– Solution should be cost neutral [or at least affordable, A/N]
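
The doubling rates quoted above make the divergence between site classes easy to see. A minimal projection sketch, assuming purely illustrative starting capacities of 1, 10 and 40 Gbps (these starting values are not taken from the requirements document):

```python
# Illustrative projection of the bandwidth growth rates quoted above.
# Starting capacities (1, 10 and 40 Gbps) are assumptions for the example only.

def project(start_gbps, years, doubling_period_years):
    """Capacity after `years`, doubling every `doubling_period_years` years."""
    return start_gbps * 2 ** (years / doubling_period_years)

for year in range(5):
    minimal    = project(1,  year, 1)   # Minimal: 2x per year
    nominal    = project(10, year, 2)   # Nominal: 2x every 2 years
    leadership = project(40, year, 2)   # Leadership: 2x every 2 years
    print(f"year {year}: Minimal {minimal:6.1f} Gbps, "
          f"Nominal {nominal:6.1f} Gbps, Leadership {leadership:6.1f} Gbps")
```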
10
SOLUTION PROPOSAL
11
Lessons learned
• The LHC OPN has proven itself; we shall learn from it
• Simple architecture
– Point-to-point Layer 2 circuits
– Flexible and scalable topology
• Grew organically
– From star to partial mesh
– Open to several technology choices
• each of which satisfies requirements
• Federated governance model
– Coordination between stakeholders
– No single administrative body required
– Made extensions and funding straightforward
• Remaining challenge: monitoring and reporting
– More of a systems approach
12
Design Inputs
• Given the scale, geographical distribution and diversity of the
sites, as well as the funding, only a federated solution is feasible
• The current LHC OPN is not modified
– OPN will become part of a larger whole
– Some purely Tier2/Tier3 operations
• Architecture has to be Open and Scalable
– Scalability in bandwidth, extent and scope
• Resiliency in the core, allow resilient connections at the edge
• Bandwidth guarantees → determinism
– Reward effective use
– End-to-end systems approach
• Operation at Layer 2 and below
– Advantage in performance, costs, power consumption
13
Design Inputs, cont.
• Most/all R&E networks (technically) can offer Layer 2 services
– Where not, commercial carriers can
– Some advanced ones offer dynamic (user controlled)
allocation
• Leverage existing infrastructures and collaborations as much as
possible
– GLIF, DICE, GLORIAD, …
• Last but not least:
– This would be the perfect occasion to start using IPv6; we should
therefore (at least) encourage IPv6, while still supporting IPv4
• Admittedly the challenge is above Layer 3
14
Design Proposal
• A design satisfying all requirements:
Switched Core with Routed Edge
• Sites interconnected through Lightpaths
– Site-to-site Layer 2 connections, static or dynamic
• Switching is far more robust and cost-effective for high-capacity
interconnects
• Routing (from the end-site viewpoint) is deemed necessary
15
Switched Core
• Strategically placed core exchange points
– E.g. start with 2-3 in Europe, 2 in NA, 1 in SA, 1-2 in Asia
– E.g. existing devices at Tier1s, GOLEs, GEANT nodes, …
• Interconnected through high capacity trunks
– 10-40 Gbps today, soon 100Gbps
• Trunk links can be CBF, multi-domain Layer 1/ Layer 2 links, …
– E.g. Layer 1 circuits with virtualised sub-rate channels,
sub-dividing 100G links in early stages
• Resiliency, where needed, provided at Layer 1/ Layer 2
– E.g. SONET/SDH Automated Protection Switching, Virtual Concatenation
• At later stage, automated Lightpath exchanges will enable a
flexible “stitching” of dynamic circuits
– See demonstration (proof of principle) at last GLIF meeting and SC10
16
One Possible Core Technology:
Carrier Ethernet
• IEEE standard 802.1Qay (PBB-TE)
– Separation of backbone and customer network through MAC-in-MAC (a frame-packing sketch follows this list)
– No flooding, no Spanning Tree
– Scalable to 16 M services
• Provides OAM comparable to SONET/SDH
– 802.1ag, end-to-end service OAM
• Continuity Check Message, loopback, linktrace
– 802.3ah, link OAM
• Remote loopback, loopback control, remote failure indication
• Cost Effective
– e.g. NSP study indicates TCO ~43% lower for COE (PBB-TE) vs MPLS-TE
• 802.1Qay and ITU-T G.8031 Ethernet Linear Protection Standard
provides 1+1 and 1:1 protection switching
– Similar to SONET/SDH APS
– Works by Y.1731 message exchange (ITU-T standard)
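
To make the MAC-in-MAC separation concrete, the sketch below packs a PBB backbone header (B-DA, B-SA, B-TAG with Ethertype 0x88A8, I-TAG with Ethertype 0x88E7 and a 24-bit I-SID) around an opaque customer frame. It is an illustration only: addresses, VLAN and I-SID values are placeholders, PCP/DEI bits are left at zero, and the backbone FCS is omitted since hardware appends it.

```python
# Sketch of PBB (802.1ah / 802.1Qay) MAC-in-MAC encapsulation.
# All field values are placeholders; the B-FCS is left to the hardware.
import struct

def mac(addr):
    """'00:11:22:33:44:55' -> 6 raw bytes."""
    return bytes(int(octet, 16) for octet in addr.split(":"))

def pbb_encapsulate(b_da, b_sa, b_vid, i_sid, customer_frame):
    b_tag = struct.pack("!HH", 0x88A8, b_vid & 0x0FFF)    # B-TAG: Ethertype + 12-bit B-VID
    i_tag = struct.pack("!HI", 0x88E7, i_sid & 0xFFFFFF)  # I-TAG: Ethertype + 24-bit I-SID
    return mac(b_da) + mac(b_sa) + b_tag + i_tag + customer_frame

frame = pbb_encapsulate("00:aa:bb:cc:dd:01", "00:aa:bb:cc:dd:02",
                        b_vid=100, i_sid=0x0A1B2C,
                        customer_frame=b"\x00" * 64)      # dummy customer frame incl. header + FCS
print(f"{len(frame)} bytes on the backbone")              # 6 + 6 + 4 + 6 + 64 = 86
```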
17
Routed Edge
• End sites (might) require Layer 3 connectivity in the LAN
– Otherwise a true Layer 2 solution might be adequate
• Lightpaths terminate on a site’s router
– Site’s border router, or, preferably,
– Router closest to the storage elements
• All IP peerings are p2p, site-to-site
– Reduces convergence time, avoids issues with flapping
links
• Each site decides and negotiates with which remote site it
desires to peer (e.g. based on experiment’s connectivity
design)
• Router (BGP) advertises only the SE subnet(s) through the
configured Lightpath
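
As an illustration of the last point, here is a minimal sketch of the per-Lightpath advertisement policy a site might enforce, announcing only its Storage Element subnet(s) and never the general campus prefixes. All prefixes and peer names are invented for the example.

```python
# Illustrative per-Lightpath BGP advertisement policy: SE subnets only.
# Prefixes and peer names are invented for the example.
from ipaddress import ip_network

CAMPUS_PREFIXES = [ip_network("192.0.2.0/24")]     # general campus, never on Lightpaths
SE_PREFIXES     = [ip_network("198.51.100.0/25")]  # storage element subnet(s)

LIGHTPATH_PEERS = ["Tier1-EU", "Tier2-US-A"]       # sites this site chose to peer with

def advertise(peer):
    """Prefixes announced to a given Lightpath peer: only the SE subnets."""
    return [str(p) for p in SE_PREFIXES]

for peer in LIGHTPATH_PEERS:
    print(f"{peer}: announce {advertise(peer)}, withhold "
          f"{[str(p) for p in CAMPUS_PREFIXES]}")
```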
18
Lightpath termination
• Avoid LAN connectivity issues
when terminating lightpath at
campus edge
• The Lightpath should be terminated as close as possible to the
Storage Elements, but this can be challenging, if not impossible
(support a dedicated border router?)
• Or, provide a “local lightpath”
(e.g. a VLAN with proper
bandwidth, or a dedicated link
where possible); border router
does the “stitching”
19
IP backup
• Foresee IP routed paths as backup
– The end-site’s BR is configured for both default IP connectivity and direct
peering through the Lightpath
– Direct peering takes precedence (a route-selection sketch follows below)
• Works also for dynamic Lightpaths
• For full dynamic Lightpath setup, dynamic end-site configuration through
e.g. LambdaStation or TeraPaths will be used
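
A minimal sketch of the backup behaviour described above: the border router holds both a default routed-IP path and a more specific route learned over the direct Lightpath peering; while the Lightpath is up its more specific prefix wins, and traffic falls back to the routed path the moment it is withdrawn. Prefixes and next-hop labels are invented.

```python
# Longest-prefix selection between a direct Lightpath route and the routed-IP backup.
# Prefixes and next-hop labels are invented for the example.
from ipaddress import ip_address, ip_network

routes = [
    {"prefix": ip_network("0.0.0.0/0"),       "via": "general-IP backup",  "up": True},
    {"prefix": ip_network("198.51.100.0/25"), "via": "Lightpath to TierX", "up": True},
]

def next_hop(dest):
    """Most specific usable route wins, i.e. the Lightpath peering while it is up."""
    usable = [r for r in routes if r["up"] and ip_address(dest) in r["prefix"]]
    return max(usable, key=lambda r: r["prefix"].prefixlen)["via"]

print(next_hop("198.51.100.17"))   # Lightpath to TierX
routes[1]["up"] = False            # Lightpath withdrawn (failure or dynamic teardown)
print(next_hop("198.51.100.17"))   # general-IP backup
```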
20
Resiliency
• Resiliency in the core is provided by protection switching
depending on technology used between core nodes
– SONET/SDH or OTN protection switching (Layer 1)
– MPLS failover
– PBB-TE protection switching
– Ethernet LAG
• Sites can opt for additional resiliency (e.g. where protected
trunk links are not available) by forming transit agreements
with other sites
– akin to the current LHC OPN use of CBF
21
Layer 1 through Layer 3
22
Scalability
• Assuming Layer 2 point-to-point operations, a natural
scalability limitation is the 4k VLAN IDs
• This problem is naturally resolved in
– PBB-TE (802.1Qay), through MAC-in-MAC encapsulation:
Frame layout: B-DA | B-SA | Ethertype 0x88A8 | B-VID | Ethertype 0x88E7 | I-SID | Customer frame (incl. header + FCS) | B-FCS
– dynamic bandwidth allocation with re-use of VLAN IDs
• The only constraint is that no two connections through the same
network element may use the same VLAN ID
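
The VLAN-reuse constraint above amounts to a simple colouring rule: a circuit may take any VLAN ID not already in use on the network elements it traverses, so the 4k ID space is only consumed locally. A minimal greedy sketch, with invented exchange-point names and circuit paths:

```python
# Greedy VLAN ID assignment: IDs may be reused freely, as long as no two
# circuits crossing the same network element carry the same VLAN ID.
# Exchange-point names and circuit paths are invented for the example.

VLAN_IDS = range(2, 4095)          # usable 802.1Q VLAN IDs

def assign_vlans(circuits):
    """circuits: {name: [network elements traversed]} -> {name: VLAN ID}"""
    in_use = {}                    # network element -> set of VLAN IDs used on it
    assignment = {}
    for name, path in circuits.items():
        taken = set().union(*(in_use.setdefault(element, set()) for element in path))
        vid = next(v for v in VLAN_IDS if v not in taken)
        assignment[name] = vid
        for element in path:
            in_use[element].add(vid)
    return assignment

circuits = {
    "T2asia-T1eu": ["HKOP", "StarLight", "NetherLight"],
    "T2asia-T2us": ["HKOP", "StarLight"],
    "T2eu-T2us":   ["NetherLight", "StarLight"],
}
print(assign_vlans(circuits))      # IDs collide only where paths share an element
```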
23
How do End-Sites Connect?
A Simple Example
• A Tier2 in Asia needs 1 Gbps connectivity (each) to 2 sites in
Europe, 2 in US and the ASGC Tier1
• 5 x 1G intercontinental circuits is cost-prohibitive
• The Tier2 could, however, afford a 1-2 Gbps (e.g. EoMPLS)
circuit to the nearest GOLE (e.g. HKOP, KRLight, TaiwanLight, T-LEX)
– Through NREN(s) or commercial circuits
• The GOLE connects to Starlight, NetherLight (trunks) and has a
connection to ASGC (example)
• Static bandwidth allocation (first stage):
– The end-site has a 1 Gbps link, with 5 VLANs, each one terminating at
one of the desired remote sites
– Bandwidth is allocated by the exchange points to fit the needs (see the sketch after this list)
• Dynamic allocation (early adopter + later stage):
– The end-site has a 1Gbps link, with configurable remote end-points and
bandwidth allocation
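
A minimal data-model sketch of the static first stage just described: one 1 Gbps access link carrying five VLANs, each terminating at one remote site, with the exchange points expected to fit the per-VLAN allocations into the access capacity. VLAN IDs, remote-site names and allocations are illustrative assumptions, not figures from the proposal.

```python
# Static first stage: a 1 Gbps access link carrying 5 VLANs, one per remote site.
# VLAN IDs, remote-site names and per-VLAN allocations are illustrative only.

ACCESS_LINK_GBPS = 1.0

vlans = {                          # VLAN ID -> (remote site, allocated Gbps)
    3001: ("ASGC Tier1", 0.3),
    3002: ("Tier2 EU-A", 0.2),
    3003: ("Tier2 EU-B", 0.2),
    3004: ("Tier2 US-A", 0.2),
    3005: ("Tier2 US-B", 0.1),
}

committed = sum(bw for _, bw in vlans.values())
assert committed <= ACCESS_LINK_GBPS, "allocations exceed the access link capacity"

for vid, (site, bw) in vlans.items():
    print(f"VLAN {vid}: {site} at {bw} Gbps")
print(f"committed {committed:.1f} Gbps of {ACCESS_LINK_GBPS} Gbps access capacity")
```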
24
Monitoring and Reporting
• Pervasive monitoring of status and utilisation is a must!
– Robust (100% monitoring up-time)
– Resilient
– Reliable
– Real-time
– End-to-end
• Candidate 1: MonALISA monitoring system, used in US
LHCNet, and at large scale e.g. in the ALICE experiment
– From US LHCNet experience: it has all the components, and is proven to
be scalable to satisfy the requirements
– See e.g. LHC OPN presentation on MonALISA in US LHCNet:
http://indico.cern.ch/getFile.py/access?subContId=1&contribId=15&resId=0&materialId=slides&confId=80755
• Candidate 2: perfSONAR, building on a set of community-developed
tools
25
DYNAMIC LIGHTPATHS
26
Dynamic Lightpaths - Intro
• Kors’ requirements document: “[…] the backbone does not
need to support all possible connections at full speed all the
time. The backbone does need to support several full speed
connections between the leadership Tier2s simultaneously.”
• Dynamic Lightpaths provide temporary bandwidth allocation
on an as-needed basis
– Connection reservation between any pair of sites for the requested
amount of time (only)
• Deployed in several R&E networks (ESnet, Internet2, SURFnet,
US LHCNet)
• Pilots being prepared in others (GEANT + selected NRENs)
• The DYNES instrument, interconnecting ~40 US campuses, will start
deployment in early 2011
27
Dynamic Lightpaths in the
proposed architecture
• Dynamic Network Resource Allocation is a powerful tool to
avoid permanent full-mesh topology, while providing flexible
connectivity and resource guarantees between end-systems
• Requires integration in the experiments’ software stack
• We foresee including dynamic allocation in the final design,
complementing static Lightpaths between Leadership sites (a hypothetical reservation sketch follows this list)
– Starting with early adopters, including DYNES-connected
sites
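
To give a feel for what "integration in the experiments' software stack" could look like, here is a purely hypothetical sketch of a transfer layer reserving and releasing a circuit. The `CircuitService` interface is invented for illustration and does not correspond to the actual OSCARS/ION, DYNES or DICE APIs.

```python
# Hypothetical circuit reservation from an experiment's transfer layer.
# The CircuitService class and its methods are invented for illustration;
# real inter-domain controller (IDC) interfaces differ.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CircuitRequest:
    src_site: str
    dst_site: str
    bandwidth_gbps: float
    start: datetime
    duration: timedelta

class CircuitService:                      # stand-in for an inter-domain controller
    def reserve(self, req: CircuitRequest) -> str:
        # A real IDC would negotiate a multi-domain path and VLAN stitching here.
        print(f"reserve {req.bandwidth_gbps} Gbps {req.src_site} -> {req.dst_site} "
              f"for {req.duration}")
        return "circuit-0001"              # hypothetical reservation handle

    def release(self, circuit_id: str) -> None:
        print(f"release {circuit_id}")

# Reserve just before a bulk dataset transfer, release when it completes.
idc = CircuitService()
circuit = idc.reserve(CircuitRequest("Tier2-Omaha", "Tier1-FNAL", 2.0,
                                     datetime.now(), timedelta(hours=4)))
# ... run the transfer over the reserved Lightpath ...
idc.release(circuit)
```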
28
DYNES Overview
• What is DYNES?
– A U.S.-wide dynamic network “cyber-instrument” spanning ~40 US
universities and ~14 Internet2 connectors
– Extends Internet2’s dynamic network service “ION” into U.S. regional
networks and campuses; Aims to support LHC traffic (also internationally)
– Based on the implementation of the Inter-Domain Circuit protocol developed
by ESnet and Internet2; Cooperative development also with GEANT, GLIF
• Who is it?
– Collaborative team: Internet2, Caltech, Univ. of Michigan, Vanderbilt
– The LHC experiments, astrophysics community, WLCG, OSG, other VOs
– The community of US regional networks and campuses
• What are the goals?
– Support large, long-distance scientific data flows in the LHC, other programs
(e.g. LIGO, Virtual Observatory), & the broader scientific community
– Build a distributed virtual instrument at sites of interest to the LHC but
available to R&E community generally
29
DYNES Team
• Internet2,
Caltech,
Vanderbilt,
Univ. of Michigan
• PI: Eric Boyd
(Internet2)
• Co-PIs:
– Harvey Newman
(Caltech)
– Paul Sheldon
(Vanderbilt)
– Shawn McKee
(Univ. of
Michigan)
http://www.internet2.edu/dynes
30
DYNES System Description
• AIM: extend hybrid & dynamic capabilities to campus & regional networks.
– A DYNES instrument must provide two basic capabilities at the Tier2s, Tier3s
and regional networks:
1. Network resource allocation such as
bandwidth to ensure transfer performance
2. Monitoring of the network and data transfer
performance
• All networks in the path require the ability
to allocate network resources and monitor
the transfer. This capability currently exists
on backbone networks such as Internet2 and
ESnet, but is not widespread at the campus
and regional level.
– In addition, Tier2 and Tier3 sites require:
3. Hardware at the end sites capable of making
optimal use of the available network resources
[Figure: two typical transfers that DYNES supports, one Tier2-Tier3 and
another Tier1-Tier2; the clouds represent the network domains involved
in such a transfer.]
31
DYNES: Regional Network Instrument Design
• Regional networks require
1. An Ethernet switch
2. An Inter-domain Controller (IDC)
• The configuration of the IDC
consists of OSCARS, DRAGON,
and perfSONAR. This allows
the regional network to provision
resources on-demand through
interaction with the other
instruments
• A regional network does not
require a disk array or FDT server
because they are providing
transport for the Tier 2 and Tier 3
data transfers, not initiating them.
At the network level, each regional connects the incoming
campus connection to the Ethernet switch provided.
Optionally, if a regional network already has a qualified switch
compatible with the dynamic software that they prefer, they
may use that instead, or in addition to the provided
equipment. The Ethernet switch provides a VLAN dynamically
allocated by OSCARS & DRAGON. The VLAN has quality of
service (QoS) parameters set to guarantee the bandwidth
requirements of the connection as defined in the VLAN. These
parameters are determined by the original circuit request from
the researcher / application. Through this VLAN, the regional network
provides transit between the campus IDCs connected in the
same region or to the global IDC infrastructure.
32
DYNES: Tier2 and Tier3
Instrument Design
• Each DYNES (sub-)instrument
at a Tier2 or Tier3 site consists
of the following hardware,
combining low cost & high
performance:
1. An Inter-domain Controller (IDC)
2. An Ethernet switch
3. A Fast Data Transfer (FDT)
server. Sites with 10GE
throughput capability will have a
dual-port Myricom 10GE
network interface in the server.
4. An optional attached disk array
with a Serial Attached SCSI
(SAS) controller capable of
several hundred MBytes/sec to
local storage.
The Fast Data Transfer (FDT) server connects to the disk array via
the SAS controller and runs FDT software developed by Caltech.
FDT is an asynchronous multithreaded system that automatically
adjusts I/O and network buffers to achieve maximum network
utilization. The disk array stores datasets to be transferred among
the sites in some cases. The FDT server serves as an aggregator/
throughput optimizer in this case, feeding smooth flows over the
networks directly to the Tier2 or Tier3 clusters. The IDC server
handles the allocation of network resources on the switch, interactions with
other DYNES instruments related to network provisioning, and network
performance monitoring. The IDC creates virtual LANs (VLANs) as needed.
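
As a rough, back-of-the-envelope illustration of how the disk array and the 10GE interface are balanced, the sketch below estimates the time to move a bulk dataset; the dataset size and rates are assumptions for the example, not DYNES specifications.

```python
# Back-of-the-envelope transfer time for a bulk dataset; all numbers are
# illustrative assumptions, not DYNES specifications.
dataset_tb    = 10        # dataset size in terabytes (assumed)
disk_mbytes_s = 400       # "several hundred MBytes/sec" to local storage
wire_gbit_s   = 10        # one 10GE port, ignoring protocol overhead

wire_mbytes_s = wire_gbit_s * 1000 / 8          # 1250 MBytes/s on the wire
bottleneck    = min(disk_mbytes_s, wire_mbytes_s)
hours         = dataset_tb * 1e6 / bottleneck / 3600

print(f"bottleneck {bottleneck:.0f} MBytes/s -> ~{hours:.1f} h for {dataset_tb} TB")
```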
33
How can DYNES be leveraged?
• The Internet2 ION service currently has end-points at two GOLEs in
the US: MANLAN and StarLight
• A static Lightpath from any end-site to one of these two Lightpath
Exchanges can be extended through ION to any of the DYNES sites
(LHC Tier2 or Tier3)
34
MANAGEMENT AND ORGANIZATION
35
Governance structure
• The global scale of the LHC network effectively excludes a single
administrative/management unit
• Needs to be under LHC community’s control
– Capacity planning
– Exchange point placement
• Open, federated governance
– Stakeholders in LHC computing shall be able to participate and
contribute
• LHC computing sites (Tier0/1/2/3) (directly? through WLCG? GDB?)
• R&E networks
– One coordinating body (open participation)
• Meet regularly
• Define and oversee service levels
• Perform planning functions
• MoUs with exchange point operators
36
Funding
• Each site is responsible for securing funding for its own:
– End-site equipment (possibly a router or port costs on
campus BR)
– Layer 2 connection to the next Lightpath exchange point
– Monitoring device
• Core network will necessitate some shared funding
– Centrally organised?
• Defining exchange point placement and core trunk capacities
– On regional basis?
• By end-sites connecting to same exchange point
37
Summary
• We propose a robust, scalable and comparatively low-cost
solution based on a switched core with routed edge
architecture
• The core consists of a sufficient number of strategically placed
exchange points interconnected by properly sized trunk
circuits
– Scaling rapidly with time, as in the requirements document
• IP routing is implemented at the end-sites
• Sites are responsible for securing proper funding for their
connectivity to the core
• Initial deployment will use predominantly static Lightpaths,
later moving predominantly to dynamic resource allocation
• A federated governance model has to be used due to global
geographical extent and diversity of funding sources
38
QUESTIONS?
[email protected]
39