perfSONAR Multi-Domain Monitoring Service
Deployment and Support: The LHC-OPN Use Case
Fausto Vetter, Domenico Vicinanza
DANTE
TNC 2010, Vilnius, 2 June 2010
connect • communicate • collaborate
Agenda
Large Hadron Collider Optical Private Network (LHC-OPN)
Multi-Domain monitoring challenge:
perfSONAR
GÉANT Multi Domain Monitoring Service
GÉANT Service Desk
The LHCOPN case:
Deployment
Support
Monitoring
LHC-OPN
Large Hadron Collider – Optical Private Network (LHC-OPN):
Dedicated network to support LHC experiment
Large amount of data in a grid environment
Network architecture is organized in Tiers
– 1 Tier0, 11 Tier1, 140+ Tier2
Primary users are researchers at different institutes
Requirement: Large amount of data being exchanged
Strategy: Keep traffic segregated from Internet
Solution: Optical Private Network (LHC-OPN) among Tier 0/1s
Challenge: monitoring effectively in a multi-domain environment
LHC-OPN Topology
Dual-star topology
10 Gb/s links
Cross-border fibers
resiliency
Multi-domain
Monitoring the LHC-OPN:
The requirements
Focus of monitoring:
Network Layer (IP)
Physical Layer (Links)
Regular Active Point-to-Point Measurements
One-Way Delay, One-Way Delay Variation, Achievable Bandwidth,
Historical Traceroute Changes
Regular Passive Point-to-Point Measurements
Utilization, Input Errors, Packet Discards
End-to-End link monitoring
Managed service
Unified view of the network status and information across all sites
Homogeneous installations and centralized operations
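The active metrics listed above can be illustrated with a small sketch. It assumes each probe packet is timestamped at the sender and at the receiver by clock-synchronized hosts, and it takes delay variation as the difference between consecutive one-way delays (in the style of RFC 3393). The function names and sample timestamps are hypothetical; this is not HADES code.

```python
from typing import List, Tuple


def one_way_delays(samples: List[Tuple[int, int]]) -> List[int]:
    """Return receive-minus-send delay for each (sent_us, received_us) pair."""
    return [received - sent for sent, received in samples]


def delay_variation(delays: List[int]) -> List[int]:
    """Return the difference between each delay and the one before it."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


# Hypothetical probe timestamps in microseconds: (sent, received).
samples = [(0, 10_000), (1_000_000, 1_012_000), (2_000_000, 2_009_000)]

delays = one_way_delays(samples)
variation = delay_variation(delays)
print(delays, variation)
```

Integer microseconds are used so that the arithmetic stays exact; the result is only meaningful if both clocks are synchronized, which is why each site has to provide time sources.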
Monitoring the LHC-OPN:
The solution - perfSONAR
The Tool: perfSONAR
GÉANT multi-domain monitoring (MDM) tool
Based on Open Grid Forum Standard Monitoring Protocol
Customized, fully managed and supported for LHCOPN
Objective:
Identify network problems across multiple domains
– Correctly, efficiently and quickly
– Allowing proactive actions
Strategy:
perform network monitoring actions in different network domains
make the information available thanks to a common protocol
– cross-domain monitoring capability
– access network performance metrics from across multiple
domains
perfSONAR as unifying layer across domains
[Diagram: Domains 1 to 4 each have their own local monitoring and expose it through perfSONAR services; the perfSONAR UI (visualization) and scripts/API sit on top]
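One way to picture the unifying layer: each domain reports the status of its own segment of a cross-domain link, and a client combines the reports into a single end-to-end status. The sketch below assumes a "worst status wins" rule; the status values, the rule and the domain names are illustrative assumptions, not the perfSONAR protocol or the actual E2Emon logic.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    """Hypothetical segment states, ordered from best to worst."""
    UP = 0
    DEGRADED = 1
    UNKNOWN = 2
    DOWN = 3


def end_to_end_status(segments: Dict[str, Status]) -> Status:
    """Combine per-domain segment statuses: the worst one wins."""
    if not segments:
        # No domain reported anything, so the link state cannot be known.
        return Status.UNKNOWN
    return max(segments.values())


# Hypothetical link crossing three domains.
link = {
    "domain-1": Status.UP,
    "domain-2": Status.DEGRADED,
    "domain-3": Status.UP,
}
print(end_to_end_status(link).name)
```

Ranking UNKNOWN above DEGRADED is a design choice in this sketch: a domain that does not report is treated as more worrying than one that reports a degraded segment.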
Monitoring the LHC-OPN:
The benefits
Effective monitoring across the several LHC-OPN domains
perfSONAR enables multi-domain monitoring
– Problems can be tracked through the participating domains
from a single interface
…proactively solving problems across domains
– Effective, distributed monitoring can identify problems even
before users suffer them
… through a customized web portal
– Monitoring portal designed according to LHCOPN needs
– Easy to integrate into the workflows of the NOCs involved
– Fewer disruptions and faster recovery
– Easy to undertake and foster collaborative efforts
Fully managed solution:
Low overhead for the Tier0/1 network operators involved
Configuration, Operation and Support carried out by GÉANT SD
perfSONAR at LHC-OPN
12 sites (1 Tier0, CERN, and 11 Tier1) involved
Several countries across Europe, Asia and America
Access to network measurements data from multiple network domains
Customized version of perfSONAR MDM service for Tier0/1 sites (main
contributor to LHCOPN operations)
Customized visualization tools:
Dedicated web portal
Specific weather maps and further diagnostic tools to visualize measurement results
Monitoring tools, hardware and operating system packaged in monitoring boxes:
To be easily deployed at any location
Remotely accessible by the service desk for operations and support
GÉANT MDM Service Design
for LHCOPN
Two servers installed at each site (Tier0 and Tier1):
Server 1 (HADES):
– one way delay, one way delay variation, achievable bandwidth,
historical traceroute changes
Server 2 (MDM):
– regular passive measurements carried out to collect interface
utilisation, input error and packet discard statistics from the
site's network elements
Each site provided:
Gigabit port on the border router
Switch
Time Sources
DNS Servers
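As an illustration of the passive measurements collected by Server 2, interface utilisation is typically derived from two readings of an interface octet counter. The sketch below is a minimal example under stated assumptions: the function name, the polling interval and the counter values are hypothetical, and it tolerates at most one counter wrap between readings.

```python
def utilisation_percent(
    octets_start: int,
    octets_end: int,
    interval_s: float,
    link_bps: float,
    counter_bits: int = 64,
) -> float:
    """Percentage of link capacity used between two octet-counter readings."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # The modulo handles a single counter wrap between the two readings.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return 100.0 * delta_octets * 8 / (interval_s * link_bps)


# Hypothetical readings taken 300 s apart on a 10 Gb/s link.
busy = utilisation_percent(1_000, 37_500_001_000, 300, 10_000_000_000)

# Hypothetical 32-bit counter that wrapped between the two readings.
wrapped = utilisation_percent(2**32 - 100, 400, 1, 1_000_000, counter_bits=32)
print(busy, wrapped)
```

With these numbers the first call corresponds to 10% of a 10 Gb/s link, and the second shows that a wrapped counter still yields a small positive delta (500 octets) rather than a negative one.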
perfSONAR MDM in LHC-OPN
[Diagram: each site is shown with LHC-1 (HADES) and LHC-2 (MDM) servers, alongside L2 MP elements]
Tier-0: CERN-CH
Tier-1: NDGF-DK, PIC-ES, IN2P3-FR, TRIUMF-CA, CNAF-IT, GRIDKA-DE, FNAL-US, RAL-UK, BNL-US, ASGC-TW, SARA-NL
Other diagram labels: Management Network, RHEL Network, Visualization Network, L2 MP, CNM, HADES Central Server, Application, Monitoring Devices Repositories, CFEngine, System, perfSONAR-UI, Visualization Tools, BWCTL, Pinger, OWAMP
Tier 2: OWAMP, Pinger & BWCTL probes
The result as displayed by the
LHC-OPN Portal
Weather-map E2Emon Link Status
GÉANT Application Service Desk
Deployment carried out by the GÉANT Application Service Desk
Dedicated Staff
Manages the relationship with users
Responsible for Incident Management
Interacts with Problem Management/Product Management to improve products
Acts as a Single Point of Contact:
Usage of Products
Deployment of Products
Debugging Issues on Products
Focus on transition and operations of the services delivered
GÉANT MDM Service Transition
Service deployment: two workflows
Server 1: OS and Software installed and configured by a GÉANT partner
Server 2: OS and Software entirely installed and configured remotely
Phase details:
Pre-Shipment: gather information about deployment details
– Pre-Shipment Form
– Shipment: servers shipped to GÉANT partner and customer
– Receive Boxes: customer and configuration partner receive boxes
Preparation:
– Pre-Deployment Form
– Third party supplier prepares servers
– Physical Installation
Deployment: software installation
Configuration: service configuration
Validation
MDM Service Deployment Agenda
perfSONAR services monitoring
Service Monitoring Infrastructure (based on Nagios+Cacti):
Customised set of testing scripts and health checks
35 Checks per server, covering hardware, software and services
Automatic notification, detailed history
Three layer monitoring:
Hardware layer: CPU, MEM, disk space, network interfaces,
TCP/UDP traffic, temperature
Resource layer: login attempts, Tomcat RTT, eXist RTT, MySQL, NTP
Service layer: perfSONAR services availability and performance
Additional tools:
Syslog server (with MySQL support)
security log auditing (with automatic email report tools)
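A hardware-layer health check of the kind listed above could look like the following sketch. It follows the standard Nagios plugin convention of mapping results to the codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN); the thresholds, function names and message format are assumptions for illustration, not the actual GÉANT check scripts.

```python
import shutil
from typing import Tuple

# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_percent: float, warn: float, crit: float) -> int:
    """Map a usage percentage to a Nagios-style return code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0) -> Tuple[int, str]:
    """Check disk usage on `path`; thresholds are hypothetical defaults."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - no capacity reported for {path}"
    used_percent = 100.0 * usage.used / usage.total
    code = classify(used_percent, warn, crit)
    return code, f"DISK {LABELS[code]} - {used_percent:.1f}% used on {path}"


code, message = check_disk()
print(message)
```

A real plugin would finish by exiting with `code` (for example via `sys.exit(code)`) so that the monitoring server can record the state and trigger notifications.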
GÉANT MDM Service Operations:
the monitoring interfaces
GÉANT MDM Service Operations:
incident management procedures
Well defined procedures for Incident Management:
[Flow diagrams: "Incident Management" and "Third party supplier involved"]
Roles: Reporter, Service Desk Operator (Ticket Owner), Object1 Responsible, Third Party Supplier
Steps shown: Raise Issue, Open Ticket, Notify Reporter, Assign Responsible, Notify Reporter, Handle Incident/Information, Report Problem, Require Information, Solution, Information, Identify Ticket, Document Solution, Handle Ticket, Notify Reporter, Close Ticket, Notify Ticket Closing
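The ticket life cycle in the incident management procedure can be sketched as a small state machine. The state names, the allowed transitions and the points at which the reporter is notified are assumptions read off the diagram labels, not a description of the Service Desk's actual ticketing system.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical states and the transitions allowed between them.
TRANSITIONS = {
    "raised": {"open"},
    "open": {"assigned"},
    "assigned": {"handling"},
    "handling": {"with_supplier", "documented"},
    "with_supplier": {"handling"},  # third party supplier returns a solution
    "documented": {"closed"},
    "closed": set(),
}

# States whose entry triggers a notification to the reporter (assumed).
NOTIFY_REPORTER = {"open", "assigned", "documented", "closed"}


@dataclass
class Ticket:
    reporter: str
    state: str = "raised"
    notifications: List[str] = field(default_factory=list)

    def advance(self, new_state: str) -> None:
        """Move to `new_state` if allowed, notifying the reporter when needed."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        if new_state in NOTIFY_REPORTER:
            self.notifications.append(f"{self.reporter}: ticket is {new_state}")


# Hypothetical incident that needs the third party supplier once.
ticket = Ticket(reporter="tier1-noc")
for step in ["open", "assigned", "handling", "with_supplier",
             "handling", "documented", "closed"]:
    ticket.advance(step)
print(ticket.state, len(ticket.notifications))
```

Encoding the transitions in a table keeps the procedure explicit: a ticket cannot be closed before its solution is documented, and every skipped step raises an error instead of passing silently.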
Conclusions
GÉANT Application Service Desk:
Effective single point of contact in complex deployments
LHC-OPN use case:
great opportunity for service & support infrastructure
Reasons for a successful deployment:
Preparation phase is crucial
Adequate tools for event and incident management
Customer collaboration was the key factor in the deployment
Continuous service improvement
Periodic meetings with involved parties
Quality audits of the deployment
Final Remarks
Thanks to:
perfSONAR community
GÉANT partners
DANTE
perfSONAR development team
CERN and its partners
Thanks for your attention
Any questions and/or comments?