Advanced Monitoring Techniques for the ATLAS TDAQ Data Network


Advanced Monitoring Techniques for
the ATLAS TDAQ Network
Matei Ciobotaru
CERN
University of California, Irvine
“Politehnica” University of Bucharest
on behalf of the ATLAS Networking Group:
B. Martin, A. Al-Shabibi, S. Batraneanu, S. Stancu, L. Leahu, L. Darlea, M. Ivanovici
HEPiX – 9 May 2008
Network monitoring in ATLAS – [email protected]
The ATLAS TDAQ Network – Role

The ATLAS Trigger and Data
Acquisition Network (TDAQ)
handles the data transfers from
the ATLAS detector to the
analysis and storage nodes

Built with Gigabit Ethernet
switches and routers

Sustained rates of 150 Gbit/s

The experiment relies on the
network to function 24/7 with a
minimal number of failures
[Figure labels: ATLAS detector, TDAQ system]
The ATLAS TDAQ Network – Photos
2500 computers installed in 90 racks
2 concentrator switches per rack
5 “big” chassis-based devices at the core

Almost 3000 devices and 5000
network connections…

How to make sure everything is
working correctly?
Inside this talk

Requirements in terms of network management

Commercial software we are using

Tools we developed in-house

Services for users, integration with ATLAS

Plans for the future

The big picture
ATLAS Requirements

Manage a large local area network which has to be very
reliable and which has very high throughput requirements

Installation
– Ease the equipment registration, inventory and verification
– Configure the devices

Operation
– Check the state of health of devices and links
– Monitor traffic conditions, raise alarms when needed
– Assist the user in navigating the realm of information
– Integration with the ATLAS TDAQ software

Diagnostics
– Provide aids to the admin in case something goes wrong
– Be able to suggest solutions to problems
Equipment registration

ATLAS equipment needs to be
registered in four databases

Only some databases support
batch registrations, others require
manual intervention, which may lead
to inconsistencies

Developed a web application to
cope with this situation
– Central place for querying all the
information about a device
– Ability to cross-check the data
across all databases, to detect
incomplete/incorrect registrations
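
The cross-check described above can be sketched as follows. This is a minimal illustration only: the database names, record fields and device names are hypothetical, since the talk does not describe the real schemas, and only two of the four databases are modelled.

```python
from collections import defaultdict


def cross_check(databases, fields):
    """Compare device records across several databases.

    databases: dict mapping database name -> {device name -> record dict}
    fields: names of the fields that are expected to agree everywhere
    Returns {device name -> list of human-readable problems}.
    """
    problems = defaultdict(list)
    all_devices = set()
    for records in databases.values():
        all_devices.update(records)

    for device in sorted(all_devices):
        # Incomplete registration: the device is absent from some database.
        for db_name in sorted(databases):
            if device not in databases[db_name]:
                problems[device].append("missing in " + db_name)
        # Incorrect registration: a field has conflicting values.
        for field in fields:
            values = {}
            for db_name, records in databases.items():
                if device in records and field in records[device]:
                    values[db_name] = records[device][field]
            if len(set(values.values())) > 1:
                problems[device].append(
                    "conflicting %s: %s" % (field, sorted(values.items())))
    return dict(problems)


# Hypothetical sample data (not taken from the ATLAS databases).
databases = {
    "db_a": {"sw-rack-01": {"mac": "00:90:27:8F:94:E3", "rack": "Y.05"}},
    "db_b": {"sw-rack-01": {"mac": "00:90:27:8F:94:E3", "rack": "Y.06"},
             "sw-rack-02": {"mac": "00:90:27:8F:94:E4", "rack": "Y.07"}},
}
report = cross_check(databases, ["mac", "rack"])
for device, issues in sorted(report.items()):
    print(device, issues)
```

Devices with no problems do not appear in the report, so an empty report means all registrations agree.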
Equipment inventory

Network diagrams for ATLAS are made
in Microsoft Visio using the NetDesign
package

We created tools which discover what
really exists in the network (what is
connected where)
[Figure labels: Visio, Network Discovery]

Developed an application which compares the two data
sources (Visio and auto-discovery); mismatches are
detected and corrected in the field if necessary

For the network documentation, we also automatically
generate a printable “report” with all the connectivity
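
A comparison of the documented and the discovered topology can be reduced to a set difference over links. The sketch below assumes each link is a pair of "device:port" strings; that naming scheme and the sample links are hypothetical, and the actual tool's data model is not described in the talk.

```python
def normalize(link):
    """Treat a link as undirected: (A, B) and (B, A) are the same cable."""
    return frozenset(link)


def compare_topologies(documented, discovered):
    """Return the links present in only one of the two data sources."""
    doc = {normalize(link) for link in documented}
    found = {normalize(link) for link in discovered}
    return {
        "only_in_diagram": sorted(tuple(sorted(l)) for l in doc - found),
        "only_in_network": sorted(tuple(sorted(l)) for l in found - doc),
    }


# Hypothetical sample data.
documented = [("core-1:1/1", "rack-01:49"), ("core-1:1/2", "rack-02:49")]
discovered = [("rack-01:49", "core-1:1/1"), ("core-1:1/3", "rack-02:49")]

diff = compare_topologies(documented, discovered)
print(diff)
```

In this example the uplink of rack-02 is plugged into port 1/3 instead of the documented 1/2, so it shows up once in each list.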
Network configuration (1)

In ATLAS we have more than
200 switches
– Different vendors
– Different mechanisms for
configuration and monitoring
(telnet, SNMP, web)

Q: How to access all devices in a
transparent manner?
– A: Bring them all under a common
denominator (common interface)

Q: How to automate network
management tasks?
– A: Write scripts (little programs)
switches + scripting = sw_script
http://cern.ch/ciobota/projects/sw_script/



sw_script = set of Python modules which can be used as
building blocks for network management solutions

Common programming interface to all devices (object-oriented)

“Intelligent” tools for configuration and monitoring can be developed
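
The idea of a common object-oriented interface can be illustrated with an abstract base class. Everything in this sketch (class names, the ports_down helper, the canned data) is hypothetical and is not the real sw_script API; the two subclasses only stand in for devices reached through different mechanisms.

```python
from abc import ABC, abstractmethod


class Switch(ABC):
    """Vendor-neutral interface (hypothetical, not the sw_script classes)."""

    def __init__(self, ip_address):
        self.ip_address = ip_address

    @abstractmethod
    def get_port_names(self):
        """Return the list of port names on the device."""

    @abstractmethod
    def get_link_state(self, port):
        """Return 'up' or 'down' for one port."""

    def ports_down(self):
        # Written once against the common interface, works for any vendor.
        return [p for p in self.get_port_names()
                if self.get_link_state(p) != "up"]


class SnmpSwitch(Switch):
    """Stand-in for a device queried over SNMP; uses canned data here."""

    def __init__(self, ip_address, table):
        super().__init__(ip_address)
        self._table = table

    def get_port_names(self):
        return sorted(self._table)

    def get_link_state(self, port):
        return self._table[port]


class TelnetSwitch(Switch):
    """Stand-in for a device driven through a CLI; parses canned text."""

    def __init__(self, ip_address, cli_output):
        super().__init__(ip_address)
        self._rows = [line.split() for line in cli_output.splitlines()
                      if line.strip()]

    def get_port_names(self):
        return [row[0] for row in self._rows]

    def get_link_state(self, port):
        for name, state in self._rows:
            if name == port:
                return state
        raise KeyError(port)


switches = [
    SnmpSwitch("192.0.2.10", {"1/1": "up", "1/2": "down"}),
    TelnetSwitch("192.0.2.11", "ge0 up\nge1 up\n"),
]
down = {sw.ip_address: sw.ports_down() for sw in switches}
print(down)
```

The calling code never needs to know which access mechanism a given device uses.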
Interactive session with sw_script
# Start the Python interpreter
$ python2.5
# Load the sw_script module
>>> import sw_script
# Create an object associated with the switch (a Cisco device in this case)
>>> sw = sw_script.Cisco_Catalyst_6500_Switch(ip_address = "192.168.100.59")
# List the ports available on this device
>>> sw.get_port_names()
['1/1', '1/2', '1/3', '1/4', ....
# Get all the information available for an interface
>>> sw.get("1/4")
[('rx_packets', 519.0), ('rx_bytes', 127937.0), ('rx_discards', 0.0), ('rx_errors', 0.0),
('tx_packets', 11199.0), ('tx_bytes', 1111661.0), ('tx_discards', 0.0), ('tx_errors', 0.0),
('description', 'GigabitEthernet1/4'), ('link_state', 'up'),
('mac_addr', ['00:90:27:8F:94:E3'])]
# Set the description (ifAlias) of an interface
>>> sw.set_interface_alias("1/4", "Uplink to Core Router")
# Show the serial number of this device
>>> print sw.get_serial_number()
FOC0913U075

sw_script is responsible for more than a half of our network management toolbox

Features
– Supports devices from different vendors
– Network topology auto-discovery
– Can do traffic monitoring in real-time
– Works as a module, can be easily
embedded into other apps
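
Real-time traffic figures are typically derived from byte counters such as rx_bytes by sampling them twice and dividing the difference by the elapsed time. The sketch below shows that arithmetic, assuming a wrapping 32-bit counter by default; how sw_script itself computes rates, and which counters it reads, is not stated in the talk.

```python
def rate_bits_per_second(prev, curr, interval_s, counter_bits=32):
    """Throughput from two readings of a wrapping byte counter.

    prev, curr: counter values at the start and end of the interval
    interval_s: time between the two readings, in seconds
    counter_bits: counter width (an assumption; 64-bit counters also exist)
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    modulus = 2 ** counter_bits
    # The modulo handles one wrap-around of the counter between readings.
    delta_bytes = (curr - prev) % modulus
    return delta_bytes * 8 / interval_s


# 1,250,000 bytes received in 10 seconds.
rate = rate_bits_per_second(127937, 127937 + 1250000, 10)
print(rate)

# The counter wrapped between the two readings: 100 + 400 = 500 bytes.
wrapped = rate_bits_per_second(2 ** 32 - 100, 400, 5)
print(wrapped)
```

If the sampling interval is long enough for the counter to wrap more than once, the result is an underestimate, which is one reason to poll fast links frequently or use wider counters.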
Network configuration (2)

In ATLAS, we have programs which use
sw_script to perform configuration changes
on devices:
– defining VLANs
– enabling protocols: spanning tree, time
synchronization, etc.
– setting interface aliases (descriptions)

We use Python scripts to perform unattended
firmware upgrades

For keeping track of configuration files we
plan to use ZipTie (open-source software)
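
As an illustration of a scripted configuration change, the sketch below sets interface aliases in bulk from a connectivity table. Only the method name set_interface_alias comes from the interactive session shown earlier in the talk; the RecordingSwitch stand-in, the link tuple layout and the alias wording are assumptions made for this example.

```python
class RecordingSwitch:
    """Stand-in that records calls instead of contacting a real device."""

    def __init__(self):
        self.aliases = {}

    def set_interface_alias(self, port, text):
        self.aliases[port] = text


def apply_aliases(switches, links):
    """Set a descriptive alias on the local end of every known link.

    switches: dict mapping device name -> object with set_interface_alias()
    links: iterable of (device, port, remote_device, remote_port)
    Returns the number of aliases that were set.
    """
    changed = 0
    for device, port, remote_device, remote_port in links:
        sw = switches.get(device)
        if sw is None:
            continue  # device is not managed by this script
        sw.set_interface_alias(
            port, "Link to %s %s" % (remote_device, remote_port))
        changed += 1
    return changed


# Hypothetical sample data.
switches = {"rack-01": RecordingSwitch()}
links = [
    ("rack-01", "49", "core-1", "1/1"),
    ("rack-99", "1", "core-1", "1/2"),  # unknown device, skipped
]
changed = apply_aliases(switches, links)
print(changed, switches["rack-01"].aliases)
```

Keeping the device access behind a small object makes it possible to dry-run such a script against stand-ins before touching production switches.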
Basic monitoring

Spectrum from Computer Associates: software
for device health and traffic monitoring (used by the
CERN IT department)

Monitors devices, raises alarms in case of failures
Auto-discovery for almost all network connections
Historical info – Gathers statistics from all devices
– Throughput and error rates saved every 30 seconds

[Figure: Spectrum GUI]
Limitations
– The Spectrum GUI is hard to use
– It is not easy to integrate with 3rd party apps
– Limited support for network performance monitoring
– Basic support for querying historical traffic data
– No support for device configuration
– Virtually no features for diagnostics

We developed software
to fill in the gaps
Navigating in the realm of monitoring data

Spectrum produces 3 plots
for each network interface.
With 5000 ports, that makes
15000 plots to look at…

We developed tools to
browse, query and analyze
the traffic plots.
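
Searching and aggregating can be sketched as selecting the time series whose port name matches a pattern and summing them sample by sample. The "device:port" naming, the shell-style pattern syntax and the sample values are assumptions for this illustration, not a description of the actual tools.

```python
import fnmatch


def aggregate(series_by_port, pattern):
    """Sum time-aligned samples of all ports matching a shell-style pattern.

    series_by_port: dict mapping port name -> list of throughput samples,
                    all lists covering the same time steps
    pattern: shell-style wildcard pattern, e.g. "rack-*:49"
    """
    matching = [samples for name, samples in sorted(series_by_port.items())
                if fnmatch.fnmatchcase(name, pattern)]
    if not matching:
        return []
    # zip(*...) walks the series in parallel, one time step at a time.
    return [sum(step) for step in zip(*matching)]


# Hypothetical sample data (arbitrary units).
series_by_port = {
    "rack-01:49": [1, 2, 3],
    "rack-02:49": [10, 20, 30],
    "core-1:1/1": [100, 200, 300],
}
uplinks = aggregate(series_by_port, "rack-*:49")
print(uplinks)
```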
Network browser
Searching and aggregating plots
Scanning for traffic events
Integration with ATLAS software

Network Panel
– Shows network monitoring
information relevant to an ATLAS
data acquisition run

Alarm Watcher
– Forwards alarms from Spectrum
into the ATLAS “official”
messaging channels

IS Feeder
– Publishes network statistics to the
Information Services, a monitoring
sub-system in ATLAS
[Figure: the Network Panel]
Network visualization – 2D approach





Application which shows a topological map of the network

Colors the connections in real-time according to their state and usage

Overloaded links are easy to detect

Good navigation features (zoom, pan)

Based on GUESS, a Java application for visualizing graphs
– http://graphexploration.cond.org/

We developed a network monitoring
plug-in for GUESS
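
Coloring a connection by state and usage comes down to a small mapping function. The thresholds and color names below are hypothetical, chosen only to show the idea; the values used by the actual GUESS plug-in are not given in the talk.

```python
def link_color(state, utilization):
    """Map a link's state and utilization (0.0 to 1.0) to a display color."""
    if state != "up":
        return "gray"    # link is down or in an unknown state
    if utilization >= 0.9:
        return "red"     # overloaded
    if utilization >= 0.5:
        return "yellow"  # busy
    return "green"       # lightly used


# A 1 Gbit/s link carrying 950 Mbit/s.
color = link_color("up", 950e6 / 1e9)
print(color)
```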
Network visualization – 3D approach (1)






3D model of the network

Racks, switches and computers are represented as furniture in the 3D space

Navigation similar to Google Earth

Each object contains a panel with traffic information (updated in real-time)

Containers (racks, rooms) show aggregate values

Technologies used: X3D, Java and the Octaga Player
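
Aggregate values for containers can be computed with a recursive sum over the containment hierarchy. The nested-dict layout, the key names and the rates below are assumptions made for this sketch; the talk does not describe the data model behind the 3D view.

```python
def aggregate_traffic(node):
    """Annotate every node with the total rate of itself and its children.

    node: dict with 'name', an optional 'rate' (for leaf objects such as
    computers) and an optional 'children' list (for racks and rooms).
    Returns the total for this node and stores it under 'total'.
    """
    children = node.get("children", [])
    total = node.get("rate", 0.0) + sum(aggregate_traffic(c) for c in children)
    node["total"] = total
    return total


# Hypothetical sample hierarchy, rates in Mbit/s.
room = {"name": "room", "children": [
    {"name": "rack-01", "children": [
        {"name": "pc-01", "rate": 100.0},
        {"name": "pc-02", "rate": 250.0},
    ]},
    {"name": "rack-02", "children": [
        {"name": "pc-03", "rate": 50.0},
    ]},
]}
room_total = aggregate_traffic(room)
print(room_total, room["children"][0]["total"])
```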
Network visualization – 3D approach (2)
Real-time traffic monitoring
Real-time global top
(most active connections)
Connections for one switch
(with traffic values)
The ATLAS applications
running now in the network
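
A "global top" view amounts to selecting the N connections with the highest current rate. The sketch below uses heapq.nlargest from the Python standard library; the connection keys and the rates are hypothetical sample data.

```python
import heapq


def top_connections(rates, n=3):
    """Return the n most active connections as (connection, rate) pairs.

    rates: dict mapping a connection identifier -> current rate
    """
    return heapq.nlargest(n, rates.items(), key=lambda item: item[1])


# Hypothetical sample data, rates in Mbit/s.
rates = {
    ("core-1", "rack-01"): 940.0,
    ("core-1", "rack-02"): 120.0,
    ("core-1", "rack-03"): 610.0,
    ("core-2", "rack-01"): 5.0,
}
top = top_connections(rates, n=2)
for connection, rate in top:
    print(connection, rate)
```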
Diagnostics

For immediate response, we look in Spectrum and in the
sw_script web pages

Human inspection of traffic plots (aggregates) – we search
for abnormal patterns and correlations between plots

We have a collection of scripts to test different things
– Checking that machines are configured properly and
connections are ok

For bandwidth-related issues we use iperf

All the network operations are documented in a knowledge
base (wiki)
Plans for the future

Better visualization techniques for traffic plots

Analysis tools for monitoring data. Pattern
detection and recognition (periodic events,
monotonic variations, etc.)

Add support for sFlow, the standard for
statistical sampling – very useful to diagnose
network congestion

Design and implement an expert system which
will help us troubleshoot network issues
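
The two kinds of patterns mentioned above, monotonic variations and periodic events, can be illustrated with two small functions. These are simple stand-in techniques chosen for this sketch (a linear scan for the longest increasing run, and an unnormalized autocorrelation for the period); the talk does not say which methods the authors intend to use.

```python
def longest_increasing_run(samples):
    """Return (start, end) of the longest strictly increasing run.

    end is exclusive, so samples[start:end] is the run itself.
    """
    if not samples:
        return (0, 0)
    best_start, best_len = 0, 1
    start = 0
    for i in range(1, len(samples)):
        if samples[i] <= samples[i - 1]:
            start = i  # the run is broken, a new one starts here
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return (best_start, best_start + best_len)


def dominant_period(samples, max_lag=None):
    """Estimate the period of a series from its raw autocorrelation.

    Returns the lag (in samples) with the highest positive score,
    or None if no lag scores above zero.
    """
    n = len(samples)
    if n < 4:
        return None
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    best_lag, best_score = None, 0.0
    for lag in range(2, (max_lag or n // 2) + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


run = longest_increasing_run([5, 4, 1, 2, 3, 7, 6])
period = dominant_period([0, 0, 0, 10] * 6)  # a spike every 4th sample
print(run, period)
```

Because the autocorrelation is not normalized by the number of overlapping samples, shorter lags are slightly favored, which helps pick the fundamental period rather than one of its multiples.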
The big picture
[Diagram; legend: commercial package, in-house development]
Components: Spectrum; sw_script & co.
Functions shown:
– Browse, search and aggregate
– 2D and 3D network visualization
– Dynamic web-pages
– Historical traffic data
– Real-time traffic info
– Device health monitoring
– ATLAS software – network status and alarms
– Equipment auto-discovery, inventory and registration
– Equipment configuration