LHC-OPN Monitoring
Working Group Update
Shawn McKee
LHC-OPN T0-T1 Meeting
Rome, Italy
April 4th, 2006
LHC-OPN Monitoring Overview
 The LHC-OPN exists to share LHC data with, and between, the T1 centers.
 Being able to monitor this network is vital to its success and is required for “operations”.
 Monitoring is important for:
   Fault notification
   Performance tracking
   Problem diagnosis
   Scheduling and prediction
   Security
 See the previous (Amsterdam) talk for an overview and details on all of this…
The LHC-OPN Network
LHC-OPN Monitoring View
 The diagram to the right is a logical representation of the LHC-OPN showing monitoring hosts.
 The LHC-OPN extends to just inside the T1 “edge”.
 Read/query access should be guaranteed on LHC-OPN “owned” equipment.
 We also request RO access to devices along the path to enable quick fault isolation.
Status Update
 During the Amsterdam meeting (Jan 2006) we decided to focus on two areas:
   Important/required metrics
   Prototyping LHC-OPN monitoring
 There is an updated LHC-OPN Monitoring document on the LHC-OPN web page emphasizing this new focus.
This Meeting
What metrics should be required for LHC-OPN?
We need to move forward on prototyping LHC-OPN monitoring services … volunteer sites?
Monitoring Possibilities by Layer
For each “layer” we could monitor a number of
metrics of the LHC-OPN:
Layer-1:
 Optical power levels
Layer-2:
 Packet statistics (e.g., RMON)
Layer-3/4:
 Netflow
All Layers:
 Utilization (bandwidth in use, Mbits/sec)
 Availability (track accessibility of device over time)
 Error Rates
 Capacity
 Topology
LHC-OPN “Paths”; Multiple Layers
 Each T0-T1 “path” has many views.
 Each OSI Layer (1-3) may have different devices involved.
 This diagram is likely simpler than most cases in the LHC-OPN.
Metrics for the LHC-OPN
(EGEE Network Performance Metrics V2)
 For “edge-to-edge” monitoring the list of relevant metrics includes:
   Availability (of T0-T1 path, each hop, T1-T1?)
   Capacity (T0-T1, each hop)
   Utilization (T0-T1, each hop)
   Delays (T0-T1 paths, one-way, RTT, jitter)
   Error Rates (T0-T1, each hop)
   Topology (L3 traceroute, L1?, L2)
   MTU (each path and hop)
 What about Scheduled Downtime and Trouble Tickets?
Availability
 Availability (or “uptime”) measures the amount of time the network is up and running.
 Can be by “hop” or a complete “path”.
 Methodology (see the sketch below):
   Layer 1: Measure power levels/bit rate?
   Layer 2: Utilize SNMP to check interface
   Layer 3: ‘ping’
 Units: Expressed as a percentage
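As a rough illustration of the Layer-3 methodology, the minimal Python sketch below polls a host with ‘ping’ at a fixed interval and reports availability as the percentage of successful probes. The target hostname, probe count, and interval are placeholders, not values specified by the working group.

```python
#!/usr/bin/env python
"""Minimal Layer-3 availability sketch: poll a host with ping and
report uptime as the percentage of successful probes."""
import subprocess
import time

TARGET = "t1-monitor.example.org"   # placeholder LHC-OPN monitoring host
PROBES = 12                          # number of polls in this window
INTERVAL = 5                         # seconds between polls

def host_responds(host):
    """Return True if a single ICMP echo request gets a reply."""
    # -W (reply timeout in seconds) is Linux iputils syntax.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

successes = 0
for _ in range(PROBES):
    if host_responds(TARGET):
        successes += 1
    time.sleep(INTERVAL)

availability = 100.0 * successes / PROBES
print("Availability of %s over %d probes: %.1f%%" % (TARGET, PROBES, availability))
```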
Capacity
 Capacity is the maximum amount of data per unit time a hop or path can transport.
 Can be listed by “hop” or “path”.
 Methodology (see the sketch below):
   Layer 1: Surveyed (operator entry)
   Layer 2: SNMP query on interface
   Layer 3: Minimum of component hops
 Units: Bit rate (bits [K,M,G] per second)
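A minimal sketch of the Layer-2 methodology, assuming SNMP RO access via the net-snmp command-line tools: query IF-MIB::ifHighSpeed for a given interface index and report the nominal capacity. The device name, community string, and interface index are placeholders.

```python
#!/usr/bin/env python
"""Minimal Layer-2 capacity sketch: read the nominal interface speed
(IF-MIB::ifHighSpeed, reported in Mbit/s) via the net-snmp CLI."""
import subprocess

HOST = "border-router.example.org"   # placeholder L2/L3 device
COMMUNITY = "public"                 # placeholder read-only community
IF_INDEX = 3                         # placeholder interface index

# IF-MIB::ifHighSpeed = 1.3.6.1.2.1.31.1.1.1.15 (units of 1,000,000 bits/s)
OID = "1.3.6.1.2.1.31.1.1.1.15.%d" % IF_INDEX

# -Ovq prints only the value, which keeps parsing trivial.
output = subprocess.check_output(
    ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", HOST, OID],
    universal_newlines=True)

capacity_mbps = int(output.strip())
print("Nominal capacity of %s ifIndex %d: %d Mbit/s" % (HOST, IF_INDEX, capacity_mbps))
```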
Utilization
 Utilization is the amount of capacity being consumed on a hop or path.
 Can be listed by “hop” or “path”.
 Methodology (see the sketch below):
   Layer 2: Use of SNMP to query interface stats
   Layer 3: List of utilization along path
 Units: Bits per second
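A minimal sketch of the Layer-2 methodology: sample the inbound octet counter (IF-MIB::ifHCInOctets) twice and convert the difference into an average bit rate. The hostname, community string, interface index, and sampling interval are placeholders, and counter-wrap handling is omitted for brevity.

```python
#!/usr/bin/env python
"""Minimal Layer-2 utilization sketch: sample IF-MIB::ifHCInOctets twice
and convert the counter difference into an average inbound bit rate."""
import subprocess
import time

HOST = "border-router.example.org"   # placeholder device
COMMUNITY = "public"                 # placeholder RO community
IF_INDEX = 3                         # placeholder interface index
INTERVAL = 30                        # sampling interval in seconds

# IF-MIB::ifHCInOctets = 1.3.6.1.2.1.31.1.1.1.6 (64-bit inbound octet counter)
OID = "1.3.6.1.2.1.31.1.1.1.6.%d" % IF_INDEX

def read_octets():
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", HOST, OID],
        universal_newlines=True)
    return int(out.strip())

first = read_octets()
time.sleep(INTERVAL)
second = read_octets()

# Counter wrap between the two samples is ignored here.
bits_per_second = (second - first) * 8.0 / INTERVAL
print("Inbound utilization: %.1f Mbit/s" % (bits_per_second / 1e6))
```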
Delay
 Delay metrics are at Layer 3 (IP) and are defined by RFC 2679, RFC 2681, and the IETF IPPM work.
 Delay-related information comes in three types: one-way delay (OWD), one-way delay variation (jitter), and round-trip time (RTT).
 One-way delay between two observation points is the time between the occurrence of the first bit of the packet at the first point and the last bit of the packet at the second point.
   Methodology: an application (OWAMP) generating defined-size packets with time-stamps to a target end-host application
   Units: Time (seconds)
 Jitter is the one-way delay difference along a given unidirectional path (RFC 3393).
   Methodology: statistical analysis of the OWD application’s results
   Units: Time (positive or negative)
 Round-trip time (RFC 2681) is well defined.
   Methodology: ‘ping’ (see the sketch below)
   Units: Time (min/max/average) or a histogram of times
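The RTT methodology above is straightforward to automate; the sketch below runs ‘ping’ and parses the min/avg/max summary line printed by Linux iputils ping. (One-way delay would instead use OWAMP’s measurement tools and requires synchronized clocks at both end hosts.) The target host and probe count are placeholders.

```python
#!/usr/bin/env python
"""Minimal RTT sketch: run ping and extract the min/avg/max round-trip
times from the summary line (format printed by Linux iputils ping)."""
import re
import subprocess

TARGET = "t1-monitor.example.org"   # placeholder host on the LHC-OPN
COUNT = 10                          # number of echo requests

output = subprocess.check_output(
    ["ping", "-c", str(COUNT), TARGET], universal_newlines=True)

# Summary line looks like: "rtt min/avg/max/mdev = 0.123/0.456/0.789/0.050 ms"
match = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", output)
if match:
    rtt_min, rtt_avg, rtt_max, rtt_mdev = (float(x) for x in match.groups())
    print("RTT to %s: min %.3f ms, avg %.3f ms, max %.3f ms" %
          (TARGET, rtt_min, rtt_avg, rtt_max))
else:
    print("Could not parse ping output for %s" % TARGET)
```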
Error Rates
 Error rates track the bit or packet error rate (depending upon layer).
 Can be listed by “hop” or “path”.
 Methodology (see the sketch below):
   Layer 1: Read (TL1) equipment error rate
   Layer 2: SNMP access to interface error counter
   Layer 3: Checksum errors on packets
 Units: Fraction (erroneous/total for bits or packets)
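A minimal sketch of the Layer-2 methodology: read the interface error and packet counters over SNMP and report errors as a fraction of received packets. The device name, community string, and interface index are placeholders; for a rate over a time window you would sample twice and difference the counters, as in the utilization sketch.

```python
#!/usr/bin/env python
"""Minimal Layer-2 error-rate sketch: read interface error and packet
counters via SNMP and report errors as a fraction of received packets."""
import subprocess

HOST = "border-router.example.org"   # placeholder device
COMMUNITY = "public"                 # placeholder RO community
IF_INDEX = 3                         # placeholder interface index

# IF-MIB counters for the interface:
#   ifInErrors    = 1.3.6.1.2.1.2.2.1.14
#   ifInUcastPkts = 1.3.6.1.2.1.2.2.1.11
OIDS = {
    "errors":  "1.3.6.1.2.1.2.2.1.14.%d" % IF_INDEX,
    "packets": "1.3.6.1.2.1.2.2.1.11.%d" % IF_INDEX,
}

def snmp_get(oid):
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", HOST, oid],
        universal_newlines=True)
    return int(out.strip())

errors = snmp_get(OIDS["errors"])
packets = snmp_get(OIDS["packets"])

if packets > 0:
    print("Error fraction on ifIndex %d: %.2e" % (IF_INDEX, errors / float(packets)))
else:
    print("No packets counted on ifIndex %d" % IF_INDEX)
```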
Topology
 Topology refers to the connectivity between nodes in the network (varies by OSI layer).
 Methodology (see the sketch below):
   Layer 1: Surveyed (input)
   Layer 2: Surveyed (input)…possible L2 discovery?
   Layer 3: Traceroute or equivalent
 Units: Representation should record a vector of node-link pairs representing the described path
 May vary with time (that is what is interesting), but that is probably only “trackable” at L3.
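A minimal sketch of the Layer-3 methodology: run ‘traceroute’ with numeric output and record the result as an ordered list of hops, i.e. the node-link vector mentioned above. The far-end host is a placeholder; hops that do not respond are kept as ‘*’ so the path length is preserved.

```python
#!/usr/bin/env python
"""Minimal Layer-3 topology sketch: run traceroute and keep the path as
an ordered vector of (hop number, router address) pairs."""
import subprocess

TARGET = "t0-monitor.example.org"   # placeholder far-end host

# -n: numeric output (no reverse DNS), which keeps parsing simple
output = subprocess.check_output(
    ["traceroute", "-n", TARGET], universal_newlines=True)

path = []
for line in output.splitlines()[1:]:          # first line is a header
    fields = line.split()
    if not fields or not fields[0].isdigit():
        continue
    hop_number = int(fields[0])
    address = fields[1] if len(fields) > 1 else "*"
    path.append((hop_number, address))

for hop_number, address in path:
    print("hop %2d: %s" % (hop_number, address))
```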
MTU
 The Maximum Transmission Unit is defined as the maximum size of a packet which an interface can transmit without having to fragment it.
 Can be listed by “hop” or “path”.
 Methodology: Use Path MTU Discovery (RFC 1191); see the sketch below
 Units: Bytes
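A minimal sketch of the path methodology, assuming ‘tracepath’ (from Linux iputils) is available on the monitoring host: tracepath performs Path MTU Discovery along the route and reports the resulting PMTU, which the sketch extracts. The target host is a placeholder; an alternative is ‘ping’ with the don’t-fragment flag and varying packet sizes.

```python
#!/usr/bin/env python
"""Minimal path-MTU sketch: run tracepath and extract the discovered
Path MTU (in bytes) from its output."""
import re
import subprocess

TARGET = "t1-monitor.example.org"   # placeholder far-end host

output = subprocess.check_output(
    ["tracepath", "-n", TARGET], universal_newlines=True)

# tracepath prints lines such as " 1?: [LOCALHOST]  pmtu 9000" and a
# final summary like "Resume: pmtu 9000 hops 7 back 7".
pmtu_values = [int(v) for v in re.findall(r"pmtu (\d+)", output)]
if pmtu_values:
    print("Path MTU to %s: %d bytes" % (TARGET, min(pmtu_values)))
else:
    print("No PMTU reported for %s" % TARGET)
```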
LHC-OPN: Which Metrics Are REQUIRED (if any)?
 We should converge on a minimal set of metrics that the LHC-OPN Monitoring needs to provide.
 Example: for each T0-T1 path:
   Availability (is the path “up”?)
   Capacity (path bottleneck bandwidth)
   Utilization (current usage along the path)
   Error rates? (bit errors along the path)
   Delay?
   Topology?
   MTU?
 Do we need/require “hop”-level metrics at various layers?
 How to represent/monitor downtime and trouble tickets? (Is this in scope?)
REMINDER: T0 Site Requests
A robust machine meeting the following specs must be made available:
 Dual Xeon 3 GHz CPUs, dual Opteron 2.2 GHz CPUs, or better
 4 Gigabytes of memory to support monitoring apps and large TCP buffers
 1 or 10 Gigabit network interface on the LHC-OPN
 200 GB of disk space to allow for the LHC-OPN apps & data repository
 A separate disk (200+ GB) to back up the LHC-OPN data repository
 OPTIONAL: An out-of-band link for maintenance/problem diagnosis
 Suitably privileged account(s) for software installation/access
 This machine should NOT be used for other services.
 SNMP RO access for the above machine is required for all L2 and L3 devices, or proxies (in case of security/performance concerns).
 Access to Netflow (or equiv.) LHC-OPN data from the edge device.
 Appropriate RO access (via proxy?) to the optical components (for optical power monitoring) must be allowed from this same host.
 Access (testing/maint.) must be allowed from all LHC-OPN nets.
 The Tier-0 needs a point-of-contact (POC) for LHC-OPN monitoring.
REMINDER: T1 Site Requests
A dedicated LHC-OPN monitoring host must be provided:
 A gigabyte of memory
 2 GHz Xeon or better CPU
 1 Gigabit network interface on the LHC-OPN
 At least 20 GB of disk space allocated for LHC-OPN monitoring apps
 A suitably privileged account for software installation
 OPTIONAL: An out-of-band network link for maintenance/problem diagnosis
 OPTIONAL: This host should only be used for LHC-OPN monitoring
 OPTIONAL: Each Tier-1 site should provide a machine similar to the Tier-0’s.
 SNMP RO access for the above machine is required for all T1 LHC-OPN L2 and L3 devices, or proxies (for security/performance concerns).
 Access to Netflow (or equiv.) LHC-OPN data from the edge device.
 Appropriate RO access, possibly via proxy, to the T1 LHC-OPN optical components (for optical power monitoring) must be allowed from this host.
 Access (testing/maint.) should be allowed from all LHC-OPN networks.
 The Tier-1 needs to provide a point-of-contact (POC) for LHC-OPN monitoring.
REMINDER: NREN Desired Access
We expect that we will be unable to “require” anything of all possible NRENs in the LHC-OPN. However, the following list represents what we would like to have for the LHC-OPN:
 SNMP (read-only) access to LHC-OPN related L2/L3 devices from either a closely associated Tier-1 site or the Tier-0 site. We require associated details about the device(s) involved with the LHC-OPN for this NREN.
 Suitable (read-only) access to the optical components along the LHC-OPN path which are part of this NREN. We require associated details about the devices involved.
 Topology information on how the LHC-OPN maps onto the NREN.
 Information about planned service outages and interruptions, for example URLs containing this information, mailing lists, applications which manage them, etc.
Responsibility for acquiring this information from each NREN should be distributed to the various Tier-1 POCs.

Prototype Deployments
 We would like to begin prototype deployments to at least two Tier-1s and the Tier-0.
 The goal is to prototype the various software which might be used for LHC-OPN monitoring:
   Active measurements (and scheduling?)
   Various applications which can provide LHC-OPN metrics (perhaps in different ways)
   GUI interfaces to LHC-OPN data
   Metric data management/searching for the LHC-OPN
   Alerts and automated problem-handling applications
   Interactions between all of the preceding
 This process should lead to a final LHC-OPN monitoring “system” matched to our needs.
Prototype Deployment Needs
 For sites volunteering to support the LHC-OPN monitoring prototypes we need:
   A suitable host (see requirements)
   Account details (username/password); an SSH public key can be provided as an alternative to a password
   Any constraints or limitations on host usage
   Out-of-band access info (if any)
 Each site should also provide a monitoring point-of-contact.
 VOLUNTEERS? (email [email protected])
Monitoring Site Requirements
 Eventually each LHC-OPN site should provide the following for monitoring:
   Appropriate host(s) (see previous slides)
   Point-of-contact for monitoring
   L1/L2/L3 “map” to the Tier-0 listing relevant nodes and links:
     Responsible for contacting intervening NRENs
     The map is used for topology and capacity information
     Should include node (device) address, description and access information
   Read-only access to LHC-OPN components
   Suitable account(s) on the monitoring host
 Sooner rather than later…dictated by interest
Future Directions / Related Activities
There are a number of existing efforts we anticipate actively
prototyping for LHC-OPN monitoring (alphabetically):
EGEE JRA4/ EGEE-II SA1 Network Performance Monitoring - This project has been
working on an architecture and a series of prototype services intended to provide Grid
operators and middleware with both end-to-end and edge-to-edge performance data.
See http://egee-jra4.web.cern.ch/EGEE-JRA4/ and a demo at
https://egee.epcc.ed.ac.uk:28443/npm-dt/
IEPM -- Internet End-to-end Performance Monitoring. The IEPM effort has its origins in the 1995 WAN monitoring group at SLAC. IEPM-BW was developed to provide an infrastructure more focused on making active end-to-end performance measurements for a few high-performance paths.
MonALISA – Monitoring Agents using a Large-scale Integrated Services Architecture. This
framework has been designed and implemented as a set of autonomous agent-based
dynamic services that collect and analyze real-time information from a wide variety of
sources (grid nodes, network routers and switches, optical switches, running jobs, etc.)
NMWG Schema - The NMWG (Network Measurement Working Group) focuses on
characteristics of interest to grid applications and works in collaboration with other
standards groups such as the IETF IPPM WG and the Internet2 End-to-End
Performance Initiative. The NMWG will determine which of the network characteristics
are relevant to Grid applications, and pursue standardization of the attributes required to
describe these characteristics.
PerfSonar – This project plans to deploy a monitoring infrastructure across Abilene
(Internet2), ESnet, and GEANT. A standard set of measurement applications will
regularly measure these backbones and store their results in the Global Grid Forum
Network Measurement Working Group schema (see above).
Summary and Conclusion
 The LHC-OPN monitoring document has been updated to reflect the new emphasis on:
   Determining the appropriate metrics
   Prototyping possible applications/systems
 All sites should identify an LHC-OPN point-of-contact to help expedite the monitoring effort.
 We have a number of possibilities regarding metrics. Defining which (if any) are required will help direct the prototyping efforts.
 Prototyping is ready to proceed -- we need to identify sites which will host this effort.