ESnet Update
Joint Techs Meeting, July 19, 2004
William E. Johnston, ESnet Dept. Head and Senior Scientist
R. P. Singh, Federal Project Manager
Michael S. Collins, Stan Kluz,
Joseph Burrescia, and James V. Gagliardi, ESnet Leads
and the ESnet Team
Lawrence Berkeley National Laboratory
ESnet Connects DOE Facilities and Collaborators

[Map: the ESnet core – a Packet over SONET optical ring with hubs and peering points – connecting 42 end user sites: Office of Science sponsored (22), NNSA sponsored (12), jointly sponsored (3), other sponsored (NSF LIGO, NOAA), and laboratory sponsored (6). International peers include CA*net4, KDDI and SInet (Japan), GEANT (Germany, France, Italy, UK, etc.), CERN, TANet2 and ASCC (Taiwan), Singaren, Australia, the Netherlands, MREN, StarTap, and Russia (BINP). Link legend: OC192 (10 Gb/s optical), OC48 (2.5 Gb/s optical), Gigabit Ethernet (1 Gb/s), OC12 ATM (622 Mb/s), OC12, OC3 (155 Mb/s), T3 (45 Mb/s), T1-T3, T1 (1.5 Mb/s); IPv6 runs on the backbone and to numerous peers; international links are high speed.]
ESnet Logical Infrastructure
Connects the DOE Community With its Collaborators
[Diagram: the ESnet peering fabric – hubs and exchange points (SEA hub, PNW-GPOP, PAIX-W, MAE-W, FIX-W, EQX-SJ, CENIC/CalREN2, SDSC, MAX GPOP, MAE-E, PAIX-E, NYC hubs, ATL hub) with their peer counts (from 1 to 39 peers per point, plus the Distributed 6TAP with 19 peers), multiple Abilene peerings, and international peers including CA*net4, GEANT (Germany, France, Italy, UK, etc.), SInet and KDDI (Japan), KEK, CERN, MREN, the Netherlands, Russia (BINP), StarTap, TANet2 and ASCC (Taiwan), Singaren, and Australia. Peerings are labeled as university, international, or commercial.]
ESnet provides complete access to the Internet by managing the full complement of global Internet routes (about 150,000) at 10 general/commercial peering points, plus high-speed peerings with Abilene and the international research networks.
ESnet New Architecture Goal
• MAN rings provide dual site and hub connectivity
• A second backbone ring will multiply connect the MAN rings to protect against hub failure

[Diagram: the existing ESnet core/backbone ring through Sunnyvale (SNV), El Paso (ELP), Atlanta (ATL), Washington, DC (DC), and New York (AOA), with MAN rings at the DOE sites and connections to Asia-Pacific and Europe.]
First Step: SF Bay Area ESnet MAN Ring
• Connects the primary Office of Science Labs in a MAN ring
• Increased reliability and site connection bandwidth
• Phase 1
• Phase 2
  o mini ring
• Ring should not connect directly into the ESnet SNV hub (still working on physical routing for this)
• Have not yet identified both legs of the mini ring

[Diagram: SF Bay Area MAN ring topology – phase 1 – linking LBNL, NERSC, the Joint Genome Institute, SLAC, and LLNL/SNL/UC Merced to the Qwest/ESnet hub, the Level 3 hub, and NLR/UltraScienceNet, with the existing ESnet core ring running toward Seattle and Chicago, LA and San Diego, and El Paso.]
Traffic Growth Continues
• ESnet is currently transporting about 250 terabytes/month
• Annual growth over the past five years has increased from 1.7x annually to just over 2.0x annually (a simple projection is sketched below)

[Chart: ESnet monthly accepted traffic, 1990–2004, in TBytes/month, rising to roughly 250 TBytes/month in 2004.]
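To make the growth figure concrete, here is a minimal sketch (illustrative only: the 250 TB/month starting point and the roughly 2.0x annual factor come from the slide, everything else is an assumption) that projects monthly accepted traffic forward under constant growth.

```python
# Illustrative projection of ESnet monthly accepted traffic, assuming the
# ~250 TB/month figure and ~2.0x annual growth quoted on the slide continue.
def project_traffic(start_tb_per_month: float, annual_factor: float, years: int):
    """Return [(year_offset, projected TB/month)] under constant annual growth."""
    return [(y, start_tb_per_month * annual_factor ** y) for y in range(years + 1)]

if __name__ == "__main__":
    for year, tb in project_traffic(250.0, 2.0, 4):
        print(f"mid-2004 + {year} yr: ~{tb:,.0f} TB/month")
```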
Who Generates Traffic, and Where Does it Go?
ESnet Inter-Sector Traffic Summary, Jan 2003 / Feb 2004 (1.7X overall traffic increase, 1.9X OSC increase)
(the international traffic is increasing due to BaBar at SLAC and the LHC Tier 1 centers at FNAL and BNL)

• DOE is a net supplier of data because DOE facilities are used by universities and commercial entities, as well as by DOE researchers
• Note that more than 90% of the ESnet traffic is OSC traffic
• ESnet Appropriate Use Policy (AUP): all ESnet traffic must originate and/or terminate on an ESnet site (no transit traffic is allowed) – see the sketch below

[Diagram: inter-sector traffic flows between the DOE sites, ESnet, and its peering points, with Jan 2003 / Feb 2004 shares of total ingress or egress traffic – DOE sites 72/68%; R&E, mostly universities (DOE collaborator traffic, incl. data) 53/49%; commercial 9/26%; international 4/6%. Traffic coming into ESnet is shown in green, traffic leaving ESnet in blue, plus traffic between sites.]
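As an illustration of the AUP's no-transit rule, the sketch below (hypothetical prefixes and addresses, not an actual ESnet tool) classifies a flow by whether its source and/or destination fall within ESnet site address space.

```python
from ipaddress import ip_address, ip_network

# Hypothetical ESnet site prefixes, for illustration only.
ESNET_PREFIXES = [ip_network("198.51.100.0/24"), ip_network("203.0.113.0/24")]

def in_esnet(addr: str) -> bool:
    a = ip_address(addr)
    return any(a in net for net in ESNET_PREFIXES)

def classify(src: str, dst: str) -> str:
    """Apply the AUP rule: traffic must originate and/or terminate at an ESnet site."""
    s, d = in_esnet(src), in_esnet(dst)
    if s and d:
        return "site-to-site"
    if s:
        return "egress"       # leaving ESnet
    if d:
        return "ingress"      # entering ESnet
    return "transit (not allowed by the AUP)"

print(classify("198.51.100.10", "192.0.2.7"))   # egress
print(classify("192.0.2.7", "8.8.8.8"))         # transit (not allowed)
```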
ESnet Top 20 Data Flows, 24 hrs., 2004-04-20
• A small number of science users account for a significant fraction of all ESnet traffic

[Chart: the top 20 flows over 24 hours; the largest flows are on the order of 1 terabyte/day.]
Top 50 Traffic Flows Monitoring – 24 hr
• 2 international and 2 commercial peering points
• 10 flows > 100 GBy/day
• More than 50 flows > 10 GBy/day (a toy roll-up of such flow records is sketched below)
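A minimal sketch of the kind of daily roll-up behind these top-N flow reports; the flow records, endpoint names, and thresholds are invented for the example, since the slide does not describe ESnet's actual collection system.

```python
from collections import defaultdict

# Hypothetical (src, dst, bytes) flow records from one 24-hour period.
records = [
    ("slac.example", "in2p3.example", 9.1e11),
    ("fnal.example", "cern.example", 4.5e11),
    ("slac.example", "in2p3.example", 2.0e11),
    ("lbl.example", "ornl.example", 8.0e9),
]

totals = defaultdict(float)
for src, dst, nbytes in records:
    totals[(src, dst)] += nbytes

GBYTE = 1e9
for (src, dst), nbytes in sorted(totals.items(), key=lambda kv: -kv[1]):
    if nbytes > 100 * GBYTE:                  # the "> 100 GBy/day" band on the slide
        print(f"{src} -> {dst}: {nbytes / GBYTE:,.0f} GBy/day")
```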
Disaster Recovery and Stability

[Diagram: remote engineers and partial duplicate infrastructure (including DNS) distributed across the hubs (SNV, ALB, CHI, NYC, DC) and sites such as AMES, BNL, and PPPL. LBNL hosts the engineers, the 24x7 Network Operations Center, and generator-backed power for: Spectrum (net mgmt system); DNS (name – IP address translation); the engineering, load, and config databases; public and private Web; e-mail (server and archive); the PKI certificate repository and revocation lists; and the collaboratory authorization service. Full replication of the NOC databases and servers and the Science Services databases is currently being deployed in the NYC Qwest carrier hub.]

• The network must be kept available even if, e.g., the West Coast is disabled by a massive earthquake
• Reliable operation of the network involves
  o remote Network Operations Centers (3)
  o replicated support infrastructure
  o generator-backed UPS power at all critical network and infrastructure locations
  o high physical security for all equipment
  o a non-interruptible core – the ESnet core has operated without interruption through
    - the N. Calif. power blackout of 2000
    - the 9/11/2001 attacks, and
    - the Sept. 2003 NE States power blackout
Disaster Recovery and Stability
• Duplicate NOC infrastructure to the AoA hub in two phases, complete by end of the year
  o 9 servers – dns, www, www-eng and noc5 (eng. databases), radius, aprisma (net monitoring), tts (trouble tickets), pki-ldp (certificates), mail
Maintaining Science Mission Critical Infrastructure in the Face of Cyberattack
• A Phased Response to Cyberattack is being implemented to protect the network and the ESnet sites
• The phased response ranges from blocking certain site traffic to a complete isolation of the network, which allows the sites to continue communicating among themselves in the face of the most virulent attacks (a schematic of the escalation logic is sketched below)
  o Separates ESnet core routing functionality from external Internet connections by means of a "peering" router that can have a policy different from the core routers
  o Provides a rate-limited path to the external Internet that will ensure site-to-site communication during an external denial of service attack
  o Provides "lifeline" connectivity for downloading patches, exchanging e-mail, and viewing web pages (i.e., e-mail, dns, http, https, ssh, etc.) with the external Internet prior to full isolation of the network
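The sketch below illustrates the escalation logic only; the phase numbering and the lifeline service list are assumptions for the example, and the real router policies are not given in this talk. Each phase narrows what is allowed through the peering router, ending with lifeline services alone.

```python
# Sketch of a phased-response policy at the peering router.
# Phase names and the lifeline port list are illustrative assumptions.
LIFELINE_PORTS = {25, 53, 80, 443, 22}   # e-mail, dns, http, https, ssh

def permit(phase: int, from_external: bool, dst_port: int,
           dst_is_blocked_site: bool) -> bool:
    if phase == 0:                       # normal operation
        return True
    if phase == 1:                       # filters to assist a specific site
        return not dst_is_blocked_site
    if phase == 2:                       # filter traffic from outside of ESnet
        return not from_external
    # phase 3: main peering paths shut; only rate-limited lifeline services cross
    return (not from_external) or dst_port in LIFELINE_PORTS

print(permit(3, from_external=True, dst_port=443, dst_is_blocked_site=False))   # True (lifeline)
print(permit(3, from_external=True, dst_port=6881, dst_is_blocked_site=False))  # False
```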
Phased Response to Cyberattack

[Diagram: attack traffic entering through the ESnet peering router toward a Lab (e.g., LBNL) via its border and gateway routers, with filters ("X") applied at successive points along the path.]

• Lab first response – filter incoming traffic at their ESnet gateway router
• ESnet first response – filters to assist a site
• ESnet second response – filter traffic from outside of ESnet
• ESnet third response – shut down the main peering paths and provide only limited bandwidth paths for specific "lifeline" services

The Sapphire/Slammer worm infection created a Gb/s of traffic on the ESnet core until filters were put in place (both into and out of sites) to damp it out.
Phased Response to Cyberattack
Architecture to allow
• phased response to cybersecurity attacks
• lifeline communications during lockdown conditions

Milestones:
• Design the architecture (software; site, core, and peering router topology; and hardware configuration) – 1Q04
• Design and test lifeline filters (configuration of filters specified) – 4Q04
• Configure and test fail-over and filters (fail-over configuration is successful) – 4Q04
• In production (the backbone and peering routers have a cyberattack defensive configuration) – 1Q05
Grid Middleware Services
• ESnet is the natural provider for some "science services" – services that support the practice of science
  o ESnet is trusted, persistent, and has a large (almost comprehensive within DOE) user base
  o ESnet has the facilities to provide reliable access and high availability through assured network access to replicated services at geographically diverse locations
   However, the service must be scalable in the sense that as its user base grows, ESnet interaction with the users does not grow (otherwise not practical for a small organization like ESnet to operate)
Grid Middleware Requirements (DOE Workshop)
• A DOE workshop examined science-driven requirements for network and middleware and identified twelve high priority middleware services (see www.es.net/#research)
• Some of these services have a central management component and some do not
• Most of the services that have central management fit the criteria for ESnet support. These include, for example
  o Production, federated RADIUS authentication service
  o PKI federation services
  o Virtual Organization Management services to manage organization membership, member attributes, and privileges
  o Long-term PKI key and proxy credential management
  o End-to-end monitoring for Grid / distributed application debugging and tuning
  o Some form of authorization service (e.g. based on RADIUS)
  o Knowledge management services that have the characteristics of an ESnet service are also likely to be important (future)
Science Services: PKI Support for Grids
• Public Key Infrastructure supports cross-site, cross-organization, and international trust relationships that permit sharing computing and data resources and other Grid services
• The DOEGrids Certification Authority service, which provides X.509 identity certificates to support Grid authentication, is an example of this model (a sketch of the enrollment step follows below)
  o The service requires a highly trusted provider, and requires a high degree of availability
  o Federation: ESnet as service provider is a centralized agent for negotiating trust relationships, e.g. with European CAs
  o The service scales by adding site-based or Virtual Organization-based Registration Agents that interact directly with the users
  o See DOEGrids CA (www.doegrids.org)
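To make the enrollment model concrete, here is a minimal sketch of the user-side step: generating a key pair and a certificate signing request that a Registration Agent would vet and the CA would then sign. It uses the Python cryptography package; the subject name components are purely illustrative placeholders, not the actual DOEGrids namespace or workflow.

```python
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# Generate the applicant's key pair (kept private; only the CSR is sent to the RA/CA).
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Build a CSR; the name components below are illustrative placeholders.
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([
        x509.NameAttribute(NameOID.ORGANIZATION_NAME, "Example Grid"),
        x509.NameAttribute(NameOID.ORGANIZATIONAL_UNIT_NAME, "People"),
        x509.NameAttribute(NameOID.COMMON_NAME, "Jane Researcher"),
    ]))
    .sign(key, hashes.SHA256())
)

# PEM-encoded CSR, ready to submit through an enrollment interface.
print(csr.public_bytes(serialization.Encoding.PEM).decode())
```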
ESnet PKI Project
• DOEGrids Project Milestones
  o DOEGrids CA in production June, 2003
  o Retirement of initial DOE Science Grid CA (Jan 2004)
  o "Black rack" installation completed for DOE Grids CA (Mar 2004)
• New Registration Authorities
  o FNAL (Mar 2004)
  o LCG (LHC Computing Grid) catch-all: near completion
  o NCC-EPA: in progress
• Deployment of NERSC "myProxy" CA
• Grid Integrated RADIUS Authentication Fabric pilot
DOEGrids Security

[Diagram: layered protection of the PKI systems – RAs and certificate applicants reach the PKI systems from the Internet through a firewall and Bro intrusion detection; the systems and HSM sit in secure racks inside a secure data center, with a vaulted root CA, building security, and LBNL site security as the outer layers.]
Science Services: Public Key Infrastructure
• The rapidly expanding customer base of this service will soon make it ESnet's largest collaboration service by customer count
• Registration Authorities: ANL, LBNL, ORNL, DOESG (DOE Science Grid), ESG (Climate), FNAL, PPDG (HEP), Fusion Grid, iVDGL (NSF-DOE HEP collab.), NERSC, PNNL
ESnet PKI Project (2)
• New CA initiatives:
  o FusionGrid CA
  o ESnet SSL Server Certificate CA
  o Mozilla browser CA cert distribution
• Script-based enrollment
• Global Grid Forum documents
  o Policy Management Authority Charter
  o OCSP (Online Certificate Status Protocol) Requirements For Grids
  o CA Policy Profiles
Grid Integrated RADIUS Authentication Fabric
• RADIUS routing of authentication requests (see the realm-routing sketch below)
• Support One-Time Password initiatives
  o Gateway Grid and collaborative uses: standard UI and API
  o Provide secure federation point with O(n) agreements
  o Support multiple vendor / site OTP implementations
  o One token per user (SSO-like solution) for OTP
• Collaboration between ESnet, NERSC, and a RADIUS appliance vendor; PNNL and ANL are also involved, and others are welcome
• White paper/report ~ 01 Sep 2004 to support early implementers, then proceed to pilot
• Project pre-proposal: http://www.doegrids.org/CA/Research/GIRAF.pdf
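A toy illustration of "RADIUS routing of authentication requests": picking a home authentication server from the realm portion of the user identity. The realm-to-server table is invented for the example and is not the actual GIRAF configuration.

```python
# Hypothetical realm -> home RADIUS server table (illustrative only).
REALM_ROUTES = {
    "nersc.example": "radius.nersc.example",
    "pnl.example": "radius.pnl.example",
    "anl.example": "radius.anl.example",
}

def route_request(user_identity: str) -> str:
    """Route an authentication request to the home server for the user's realm."""
    try:
        _, realm = user_identity.rsplit("@", 1)
    except ValueError:
        raise ValueError("identity must be of the form user@realm")
    try:
        return REALM_ROUTES[realm]
    except KeyError:
        raise LookupError(f"no federation agreement for realm {realm!r}")

print(route_request("jane@nersc.example"))   # radius.nersc.example
```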
Collaboration Service
• H.323 videoconferencing is showing a dramatic increase in usage
Grid Network Services Requirements (GGF, GHPN)
• Grid High Performance Networking Research Group, "Networking Issues of Grid Infrastructures" (draft-ggf-ghpn-netissues-3) – what networks should provide to Grids
  o High performance transport for bulk data transfer (over 1 Gb/s per flow)
  o Performance controllability to provide ad hoc quality of service and traffic isolation
   Dynamic network resource allocation and reservation
  o High availability when expensive computing or visualization resources have been reserved
  o Security controllability to provide a trusted and efficient communication environment when required
  o Multicast to efficiently distribute data to groups of resources
  o Integrated wireless networks and sensor networks in the Grid environment
Priority Service
• So, practically, what can be done?
• With available tools we can provide a small number of provisioned, bandwidth-guaranteed circuits
  o secure and end-to-end (system to system)
  o various qualities of service possible, including minimum latency
  o a certain amount of route reliability (if redundant paths exist in the network)
  o end systems can manage these circuits as single high bandwidth paths or multiple lower bandwidth paths (with application level shapers)
  o non-interfering with production traffic, so aggressive protocols may be used
Guaranteed Bandwidth as an ESnet Service
• A DOE Network R&D funded project

[Diagram: user system 1 at site A sends through a policer to a resource manager; a bandwidth broker performs authorization, and the site A and site B resource managers deliver the circuit to user system 2 (Phase 1 and Phase 2 configurations shown).]

• Allocation will probably be relatively static and ad hoc
• There will probably be service level agreements among transit networks allowing for a fixed amount of priority traffic – so the resource manager does minimal checking and no authorization
• The resource manager will do policing, but only at the full bandwidth of the service agreement (for self protection) – see the token-bucket sketch below
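A minimal token-bucket sketch of the policing step, in which traffic is only checked against the full bandwidth of the service agreement; the rate and burst numbers are made up for illustration and do not describe the actual implementation.

```python
import time

class TokenBucketPolicer:
    """Police a flow at the service-agreement rate (bytes/s) with a burst allowance."""
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False          # out-of-profile: drop (or mark) the packet

# Example: a 100 Mb/s agreement (12.5 MB/s) with a 1 MB burst allowance.
policer = TokenBucketPolicer(rate_bytes_per_s=12.5e6, burst_bytes=1e6)
print(policer.allow(1500))    # True for an in-profile packet
```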
Network Monitoring System
• Alarms & Data Reduction
  o From June 2003 through April 2004 the total number of NMS up/down alarms was 16,342, or 48.8 per day
  o Path-based outage reporting automatically isolated 1,448 customer-relevant events during this period, an average of 4.3 per day – more than a 10-fold reduction (see the worked numbers below)
  o Based on total outage duration in 2004, approximately 63% of all customer-relevant events have been categorized as either "Planned" or "Unplanned" and one of "ESnet", "Site", "Carrier" or "Peer"
• Gives us a better handle on the availability metric
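The reduction factor quoted above follows directly from the counts on the slide; a quick check, taking June 2003 through April 2004 as roughly 335 days:

```python
# Quick check of the alarm-reduction numbers quoted on the slide.
days = 335                      # June 2003 through April 2004, approximately
alarms = 16342                  # raw NMS up/down alarms
events = 1448                   # customer-relevant events after path-based reduction

print(f"alarms/day : {alarms / days:.1f}")     # ~48.8
print(f"events/day : {events / days:.1f}")     # ~4.3
print(f"reduction  : {alarms / events:.1f}x")  # ~11.3x, i.e. more than 10-fold
```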
2004 Availability by Month
Jan. – June, 2004 – Corrected for Planned Outages
(More from Mike O'Connor)

[Chart: unavailable minutes per site for Jan.–June 2004. Most sites were >99.9% available, with per-site availability ranging from roughly 99.998% at the top down to about 99.7% at the bottom.]
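The chart's two scales are related by a simple conversion; the sketch below shows how unavailable minutes over the six-month window map to an availability percentage (the example outage values are made up).

```python
# Convert unavailable minutes over Jan.-June 2004 into an availability percentage.
MINUTES_JAN_JUNE_2004 = (31 + 29 + 31 + 30 + 31 + 30) * 24 * 60   # 2004 is a leap year

def availability(unavailable_minutes: float) -> float:
    return 100.0 * (1 - unavailable_minutes / MINUTES_JAN_JUNE_2004)

# Illustrative values: the 99.9% line on the chart corresponds to ~262 minutes.
for minutes in (5, 100, 262, 500):
    print(f"{minutes:>4} unavailable minutes -> {availability(minutes):.3f}% available")
```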
ESnet Abilene Measurements
• We want to ensure that the ESnet/Abilene cross connects are serving the needs of users in the science community who are accessing DOE facilities and resources from universities or accessing university facilities from DOE labs.
• Measurement sites in place:
  o 3 ESnet participants: LBL, FERMI, BNL
  o 3 Abilene participants: SDSC, NCSU, OSU
• More from Joe Metzger
OWAMP One-Way Delay Tests Are Highly Sensitive
• An NCSU Metro DWDM reroute adds about 350 microseconds (a toy delay computation follows below)

[Chart: one-way delay in ms (roughly 41.5–42.0 ms) over time, with a visible step at the fiber re-route.]
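One-way delay is simply the difference between a receive timestamp and a synchronized send timestamp, which is why a ~350 microsecond fiber reroute is clearly visible against a ~41.7 ms baseline. A toy computation, with timestamps invented for illustration:

```python
# Toy one-way delay computation in the spirit of OWAMP measurements.
# Timestamps (seconds) are invented; sender and receiver clocks are assumed synchronized.
samples = [
    (1000.000000, 1000.041700),   # before the reroute: ~41.70 ms
    (1001.000000, 1001.041702),
    (1002.000000, 1002.042051),   # after the reroute: ~42.05 ms (+ ~350 us)
    (1003.000000, 1003.042049),
]

delays_ms = [(rx - tx) * 1e3 for tx, rx in samples]
for d in delays_ms:
    print(f"one-way delay: {d:.3f} ms")

shift_us = (delays_ms[-1] - delays_ms[0]) * 1e3
print(f"observed shift: ~{shift_us:.0f} microseconds")
```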
ESnet Trouble Ticket System
• TTS is used to track problem reports for the Network, ECS, DOEGrids, Asset Management, NERSC, and other services.
• Running a Remedy ARSystem server and an Oracle database on a Sun Ultra workstation.
• Total external tickets = 11,750 (1995–2004), approx. 1,300/year
• Total internal tickets = 1,300 (1999–2004), approx. 250/year
Conclusions
• ESnet is an infrastructure that is critical to DOE's science mission and that serves all of DOE
• Focused on the Office of Science Labs
• ESnet is working on providing the DOE mission science networking requirements with several new initiatives and a new architecture
• QoS service is hard – but we believe that we have enough experience to do pilot studies
• Middleware services for large numbers of users are hard – but they can be provided if careful attention is paid to scaling