Slides - TERENA Networking Conference 2008

Enabling Grids for E-sciencE
Network Performance Monitoring for the EGEE Grid
Jeremy Nowell
TNC2008, Bruges
19 May 2008
www.eu-egee.org
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks
Overview
• EGEE Overview
• Why Network Monitoring for Grids?
• Requirements and Challenges
• Strategy and Architecture
• Tools Produced
• Issues Encountered
• Solutions Developed
• Summary
Network Performance Monitoring for the EGEE Grid - Jeremy Nowell - TNC2008
EGEE Overview
• The EGEE project:
– 4-year project, funded by the EU (EGEE, EGEE-II)
– Seamless Grid infrastructure for e-Science, available to scientists 24 hours a day
• EGEE: 1 April 2004 – 31 March 2006
– 71 partners in 27 countries, federated in regional Grids
• EGEE-II: 1 April 2006 – 30 April 2008
– 92 partners in 32 countries grouped into 13 federations
• Objectives
– Large-scale, production-quality infrastructure for e-Science
– Improving and maintaining “gLite” Grid middleware
– Attracting new resources and users from industry as well as
science
EGEE Infrastructure
[Map: countries participating in EGEE, with related Grid infrastructures labelled: Baltic Grid, DEISA, TeraGrid, NAREGI, See-Grid, EUChinaGrid, EUMedGrid, OSG, EUIndiaGrid, EELA]
~ 250 sites in 50 countries
~ 55 000 CPUs
~ 20 PB storage
> 150k jobs/day
> 200 Virtual Organizations
⇨The world’s largest multi-disciplinary Grid
infrastructure
Why NPM?
• For Site and Grid operations
– Help diagnose performance problems between sites
▪ This transfer is slow, what’s broken? – the network, the server, the middleware…
▪ I can’t see site X, has the network gone down or is it just a particular service or machine?
▪ My application’s performance varies with time of day – is there a network bottleneck?
– Help diagnose problems within sites
▪ Most network problems, especially performance issues, are not backbone related, they are in the “last mile”
– Help with planning and provisioning decisions
▪ Is an SLA I’ve arranged being adhered to by my providers?
• For Grid services and middleware
– I want to increase the performance of file transfers between sites
– I want to know which compute site is “closest” to my data to submit a job to it
Why NPM? (2)
• What’s different about networks for the Grid?
– Without the network there is no Grid…
– Large amounts of application data, often continuous
– Multiple connections and streams
– New technology, e.g. provisioned light paths
– End-to-end performance crucial
▪ What’s the use of a 10 Gb/s dedicated connection if your application is only achieving a rate of 10 Mb/s?
Why NPM? (3)
Q: Why don’t we just throw some more bandwidth at the problem? - Upgrade
the links.
A: Bandwidth is bad for you. It’s like a narcotic…
• It’s very addictive. You start off with a little, but that’s not really doing it
for you; it’s not enough. You increase the dose, but it’s never as good as
you thought it would be.
• By analogy you can keep buying more and more bandwidth to make your
network faster but it's never quite as good as you thought it would be.
• Why? Because simple over-provisioning is not sufficient
• Doesn’t address the key issue of end-to-end performance:
– Network backbone in most cases is genuinely not the source of the problem.
– Last mile (campus network → end-user system → your application) is often the cause of the problem: firewall, network wiring, hard disc, application and many more potential culprits.
This can get to be an expensive habit – dedicated high speed fibre is not cheap
Also, if simple over-provisioning were a total solution, there would not be so much other work going on, e.g. protocol research (high speed TCPs)
Network Performance Factors
• End System Issues
– Network Interface Card and Driver and their configuration
– TCP and its configuration
– Operating System and its configuration
– Disk System
– Processor speed
– Bus speed and capability
– Application, e.g. old versions of scp
• Network Infrastructure Issues
– Obsolete network equipment
– Configured bandwidth restrictions
– Topology
– Security restrictions (e.g., firewalls)
– Sub-optimal routing
– Transport Protocols
• Network Capacity and the influence of Others!
– Many, many TCP connections
– Congestion
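One way to see why "TCP and its configuration" sits on this list: a single TCP stream can have at most one window of data in flight per round trip, so its throughput is bounded by window / RTT whatever the link capacity. The sketch below works through that bound. The 64 KiB window (roughly the largest possible without TCP window scaling) and the 50 ms round-trip time are illustrative assumptions, not figures from the talk, and the function names are made up for this example.

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput in Mb/s (one window per round trip)."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds / 1e6


def window_needed_bytes(link_bits_per_second: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: the window needed to keep the link full."""
    return link_bits_per_second * rtt_seconds / 8


# Assumed values for illustration only.
bound = max_tcp_throughput_mbps(window_bytes=64 * 1024, rtt_seconds=0.050)
needed = window_needed_bytes(link_bits_per_second=10e9, rtt_seconds=0.050)
print(f"throughput bound: {bound:.1f} Mb/s")
print(f"window to fill a 10 Gb/s path: {needed / 1e6:.1f} MB")
```

Under these assumptions the bound is about 10.5 Mb/s, while filling a 10 Gb/s path at the same RTT would need a window of roughly 62.5 MB. An untuned end system can therefore be the bottleneck on an otherwise healthy dedicated link.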
How can NPM help?
• Applications and sites can make operational decisions based on
previous network performance.
– Having the ‘right’ metrics available will allow ‘better’ decisions to be
made.
– Can monitor new network technology.
• NPM data let end users see the performance they should expect
from their Grid applications
– Misleading to infer network performance from application performance.
– Seldom the same as what they know (or think they know) about the
specification of their network connections.
• Faults and inefficiencies can be identified and solved if NPM data
are available.
– Of benefit to the whole site, as well as the Grid in general.
– Sometimes the data can show up strange configurations that even site
network admins are not aware of.
– Network admins will likely not investigate application problems without
hard evidence.
NPM User Requirements
Operation Centres
• NOCs and GOCs
– Web-based GUI
– Interface to define alarms
– On-demand & historical data
– Backbone & end-to-end data
• NOCs
– Display which tool gathered the results and how
– Per hop data/ability to zoom in
• GOCs
– High-level statistics
SLA Monitoring
• Premium IP paths for specific applications
– Need to monitor PIP traffic
– Frequent measurements (at least every 10 minutes)
– Thresholds and alarms on monitored metrics
– Need to monitor Total Downtime if a metric crosses threshold
– On demand measurements
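The SLA requirement to track total downtime once a metric crosses its threshold can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Sample` type, the function name, and the rule that each interval counts as "down" when the sample at its start exceeds the threshold are all assumptions, and the sample values are made up. For a metric such as achievable bandwidth the comparison would need to be reversed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    timestamp: float  # seconds since an arbitrary start
    value: float      # metric value, e.g. packet loss in percent


def total_downtime(samples: List[Sample], threshold: float) -> float:
    """Sum the seconds spent above the threshold.

    Each interval between consecutive samples is attributed to the state
    observed at the start of that interval (an assumption of this sketch).
    """
    downtime = 0.0
    for current, following in zip(samples, samples[1:]):
        if current.value > threshold:
            downtime += following.timestamp - current.timestamp
    return downtime


# Made-up packet-loss samples taken every 10 minutes (600 s).
history = [
    Sample(0.0, 0.1),
    Sample(600.0, 2.5),
    Sample(1200.0, 3.0),
    Sample(1800.0, 0.2),
    Sample(2400.0, 0.0),
]
print(total_downtime(history, threshold=1.0))
```

With samples at least every 10 minutes, as the requirement asks, the downtime figure is only accurate to within one sampling interval at each end of an outage.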
NPM Metric Requirements
Metric / Info
Relevant to Group
NOC
GOC
SLAs
TCP Achievable Bandwidth
Yes
Packet-loss
Yes
Yes
Yes
Round-trip time
Yes
Yes
Yes
Round-trip IPDV
Yes
One-way delay
Yes
One-way delay variation
Available bandwidth (path)
Yes
Yes
Yes
Yes
Yes
Yes
Available bandwidth (hop)
Yes
Yes
Yes
Yes
Packet reordering
Yes
Yes
Hop/list network topology
Yes
Yes
Availability
Yes
Yes
Path MTU
Yes
Yes
Yes
QoS Class
Yes
Yes
Yes
On-demand test on all metrics
Yes
Yes
Yes
EGEE Challenges
• Scale and heterogeneity of the EGEE fabric pose a requirement to support diversity of all kinds
– Multitude of ways of collecting monitoring data
▪ Different measurement types
• End-to-end
o Appropriate to experience of user and application, e.g. TCP achievable bandwidth
• Backbone
o Lower level measurements, used to pin-point source of problems
▪ Different measurement tools
▪ Different data formats
– Many administrative domains
– Different user groups
Strategy
• Aim to standardise access to NPM data across different domains and frameworks
– Note: we are not building measurement tools, but rather facilitating access to data collected by them
• Interoperability pursued through use of OGF NM-WG (Open Grid Forum Network Measurements Working Group) schemas
– EGEE NPM should accommodate the independent deployment of NPM frameworks across the diverse EGEE fabric and the associated networks
– Use NM-WG interfaces where they have been adopted; facilitate their use elsewhere.
NPM Architecture
User Interface: Web Interface (JSP)
• Path Selection
• Metric Selection
• Plotting of results
Mediator:
• Single point of contact for clients
• Provides metadata discovery
• Brokers data requests
e2emonit:
• Active end-to-end data
perfSONAR:
• Passive utilisation data from networks such as GÉANT2
All messages using NM-WG schema
[Diagram: three layers linked by NM-WG compliant XML. Clients: the Diagnostic Tool, used by the User, with an Axis2 NM-WG Client. Middleware: the Mediator, with an Axis2 NM-WG Interface towards clients and an Axis2 NM-WG Client towards frameworks. Frameworks: e2emonit (Axis2 NM-WG Interface) and perfSONAR (NM-WG Interface).]
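The mediator's three roles on this slide (single point of contact, metadata discovery, brokering of data requests) can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class and method names are invented for the example, the backend serves made-up values, and plain Python calls stand in for the NM-WG XML messages and Axis2 web services that the real components exchange.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MeasurementFramework(ABC):
    """Hypothetical common interface each backend exposes to the mediator."""

    @abstractmethod
    def available_metrics(self) -> List[str]:
        """Metadata discovery: which metrics this framework can serve."""

    @abstractmethod
    def fetch(self, metric: str, path: str) -> List[float]:
        """Return stored values of one metric for one path."""


class CannedFramework(MeasurementFramework):
    """Stand-in backend holding made-up values (metric -> path -> values)."""

    def __init__(self, data: Dict[str, Dict[str, List[float]]]) -> None:
        self._data = data

    def available_metrics(self) -> List[str]:
        return sorted(self._data)

    def fetch(self, metric: str, path: str) -> List[float]:
        return self._data[metric][path]


class Mediator:
    """Single point of contact: clients never talk to a framework directly."""

    def __init__(self) -> None:
        self._frameworks: Dict[str, MeasurementFramework] = {}

    def register(self, name: str, framework: MeasurementFramework) -> None:
        self._frameworks[name] = framework

    def discover(self) -> Dict[str, List[str]]:
        """Report which metrics each registered framework offers."""
        return {name: fw.available_metrics() for name, fw in self._frameworks.items()}

    def request(self, metric: str, path: str) -> List[float]:
        """Broker a request to the first framework that offers the metric."""
        for framework in self._frameworks.values():
            if metric in framework.available_metrics():
                return framework.fetch(metric, path)
        raise LookupError(f"no registered framework provides {metric!r}")


mediator = Mediator()
mediator.register("active", CannedFramework({"rtt_ms": {"siteA->siteB": [12.0, 13.5]}}))
mediator.register("passive", CannedFramework({"utilisation": {"router1": [0.4]}}))
print(mediator.discover())
print(mediator.request("utilisation", "router1"))
```

Because clients depend only on the mediator, a backend can change its own interface, or a new framework can be registered, without any change on the client side.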
What’s available - Metrics
• Metrics depend on which tools you use!
– Possibility to support access to any relevant data, provided it is
available using an OGF NM-WG compliant interface
• e2emonit
– Provided by NPM team
– ping
▪ Connectivity
• Round trip time, packet loss
– iperf
▪ Real life application performance
• TCP achievable bandwidth
– udpmon
▪ Network health, congestion etc
• UDP achievable bandwidth, one-way delay variation, UDP packet loss
• perfSONAR
– Developed by GÉANT2, Internet2, ESnet and RNP
– Currently accessing utilisation data
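As a rough illustration of two of the metrics listed above, the sketch below derives packet loss and one-way delay variation from a series of per-packet delays. It is not how ping or udpmon compute them internally. The input format (one delay per sent packet, `None` for a lost packet), the function names and the sample values are assumptions. Delay variation is taken here as the difference in delay between consecutive packets, counting only pairs in which both packets arrived.

```python
from typing import List, Optional


def packet_loss_fraction(delays: List[Optional[float]]) -> float:
    """Fraction of sent packets that never arrived (marked as None)."""
    if not delays:
        raise ValueError("no packets were sent")
    lost = sum(1 for delay in delays if delay is None)
    return lost / len(delays)


def delay_variation(delays: List[Optional[float]]) -> List[float]:
    """Delay difference for each pair of consecutive packets that both arrived."""
    return [
        later - earlier
        for earlier, later in zip(delays, delays[1:])
        if earlier is not None and later is not None
    ]


# Made-up one-way delays in milliseconds; the third packet was lost.
observed = [10.0, 12.0, None, 11.0, 11.5]
print(packet_loss_fraction(observed))
print(delay_variation(observed))
```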
NPM Diagnostic Tool
– The Diagnostic Tool is accessed using a standard web browser; users are individually authorised to use it.
– The intended user is a NOC/GOC/ROC operator, but anyone can use it to investigate problems.
– The sites and metrics displayed depend on which measurement tools have been deployed, and where.
NPM Diagnostic Tool (2)
– The parameters used to gather measurements are shown; here they show that the iperf tool was used to gather the achievable bandwidth information.
– These parameters can be useful in interpreting the results.
NPM Diagnostic Tool (3)
– Information from multiple paths may be plotted at the same time.
– Here utilisation data for the GÉANT2/JANET router is plotted for both inbound and outbound traffic over the course of one week, obtained from the GÉANT2 perfSONAR Measurement Archive.
Tools and supported frameworks
• Clients
– Diagnostic Tool
▪ For use by people
• Middleware
– Mediator
▪ Single point of contact for clients
▪ Discovery of metadata
▪ Insulates clients from interface changes
▪ Exposes NM-WG web-service interface
• Measurement Frameworks
– e2emonit
▪ End-to-end metrics
▪ Active measurement tools
– perfSONAR
▪ Passive utilisation data for router interfaces
Deployment Challenges (1)
• The usefulness of NPM depends on the data that is available
– Providing data federation tools is not enough by itself
– Would like to use measurement data that is already collected
▪ Measurement tools are generally not sufficiently deployed across sites
• e2emonit could be an option, but not the only one
▪ Ideally individual federations or VOs make deployment decisions
• E.g. GridPP deployment of gridmon within UK
• Deployment of monitoring tools is not easy
– There has to be a clear benefit to the site before they install tools
– This benefit is not obvious until after an incident has occurred, by which time it is too late…
– Firewall changes may be difficult (e.g. ICMP blocked by default)
– Tools need to be trivial to install and robust when running
▪ Sys-admins very busy
– Need to carefully consider scheduling for end-to-end tests
▪ Overlapping measurements
▪ Network overload
Deployment Challenges (2)
• Different user groups may have widely different
requirements for displaying data
– e.g. site or service admins may just want an alarm that tells them “your network is broken”, and never look at the Diagnostic Tool (DT)
– But network people would not contemplate investigating problems
without clear historical data to look at
• The network is still assumed by many to “just work”
PCP – Probes Control Protocol
• Developed to solve management overhead of running active
measurement probes
– e.g. manual cron jobs
• Token-based mechanism to co-ordinate periodic execution of
monitoring tasks
– But applicable to any kind of task requiring regular scheduling across
administrative domains
• Prevents overlapping measurements
– Probe will not run until token received
• Groups of sites form cliques
• Robust
– Can cope with sites in the clique being unreachable
• Secure
– Only pre-defined activities may be run
– VOMS/X.509 based authentication of users
PCP Operation
[Diagram: timeline of a single token circulating between Site A, Site B and Site C, with times marked from 15:00 to 15:40. On first receipt each site registers the token, pauses for delay seconds, runs pcp_test and locks the job. On each later arrival the site unlocks the job, pauses until (time last run + period), runs pcp_test and locks the job again, and so on.]
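The token mechanism can be sketched as a small simulation. This is a reading of the two PCP slides, not the protocol's actual implementation: the function names and signatures are invented for the example, sites are visited in a fixed order, and an unreachable site is simply skipped. Only the current token holder runs its probe, which is what prevents overlapping measurements, and a site that receives the token early waits until its last run time plus the period.

```python
from typing import Callable, FrozenSet, List


def next_run_time(last_run: float, period: float, token_arrival: float) -> float:
    """A site runs no earlier than (time last run + period), even if the token is early."""
    return max(token_arrival, last_run + period)


def circulate_token(
    clique: List[str],
    run_probe: Callable[[str], None],
    rounds: int,
    unreachable: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Pass one token around the clique and return the order in which probes ran."""
    order: List[str] = []
    for _ in range(rounds):
        for site in clique:
            if site in unreachable:
                continue  # skip the site so the rest of the clique keeps measuring
            run_probe(site)  # only the token holder measures, so runs never overlap
            order.append(site)
    return order


# Hypothetical clique of three sites, with Site B unreachable.
ran = circulate_token(
    ["Site A", "Site B", "Site C"],
    run_probe=lambda site: None,  # placeholder for a real measurement
    rounds=2,
    unreachable=frozenset({"Site B"}),
)
print(ran)

# Token returns after 900 s, but the period is 1800 s, so the site waits.
print(next_run_time(last_run=0.0, period=1800.0, token_arrival=900.0))
```

The second call mirrors the timeline in the diagram, where a site that last ran at 15:00 with a 30-minute period would not run again before 15:30 even if the token came back at 15:15. The 30-minute period is inferred from the times shown, not stated on the slide.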
Information for site admins
• Site or service admins may just want an alarm that tells
them “your network is broken”, and never look at the DT
• Provide access to such information through Nagios
– Widely used for monitoring services and machines
– Single view of all relevant information
• Simple TCP connection test for individual services
– May not be a true indication of network health, but if all services at a site are unavailable then it is a good indication
• Use information from EGEE SA2’s ENOC
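A simple TCP connection test of the kind described above might look like the following sketch, which could be wrapped in a Nagios check. The function names are made up for this example, and the rule in `site_network_suspect` (suspect the network only when every checked service is unreachable) is one interpretation of the slide, not the deployed logic.

```python
import socket
from typing import Dict


def tcp_service_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and name lookup failures
        return False


def site_network_suspect(results: Dict[str, bool]) -> bool:
    """Suspect the network only if at least one service was checked and none answered."""
    return bool(results) and not any(results.values())
```

A single failed service says little about the network, which is why the second function looks at all services at a site together.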
Nagios publishing
Conclusions
• Provision of federated access to network measurement data has
been demonstrated
– Based on OGF NM-WG schema
• Getting access to data itself is much harder
– Deployment challenges
– Need to “sell” to sites the value of having data available
– Differences between metrics provided by network providers and those that can be provided by individual sites
▪ End-to-end active vs. passive utilisation
• Should projects be attempting to do their own monitoring?
– If they don’t, who else will?
– Only they can provide meaningful end-to-end measurements…
– What happens when a site is active in multiple projects?