Network Monitoring in the BaBar Experiment

S. Luitz, D. Millsom, D. Salomoni
CHEP2000 - Padova, February 2000
Summary
The BaBar Data Acquisition Network
A Typical Scenario...
Traffic Monitoring and Recording
Traffic Dump Analysis Tools
Real-Time Analysis of Traffic
Conclusions and Outlook
The BaBar Data Acquisition Network (1)
• ca. 200 VME single-board computers (VxWorks): 100 Mbit/s full-duplex Ethernet
• 78 Sun Ultra 5 "farm" workstations for Level-3 trigger and fast monitoring: 2 x 100 Mbit/s full-duplex Ethernet each ("dual-homed")
• 5 Sun Ultra 60 application servers (e.g. run control): 100 Mbit/s full-duplex Ethernet
• 15 Sun Ultra 5 display console machines: 10 or 100 Mbit/s Ethernet
The BaBar Data Acquisition Network (2)
• 1 Sun E 450 (4 CPUs, 780 Gbyte RAID) central boot/NFS/database/data buffer server: 2 x 1 Gbit/s Ethernet
• various development and user workstations
• 3 Cisco Catalyst 5500 switches
• 2 VLANs / IP subnets:
  - dedicated real-time DAQ network (35-40 MByte/s)
  - general-purpose / data transfer network
A Typical Scenario
• Problem:
  - Shift crew reports at 23:50: "Run control server problem ca. 45 min ago"
  - A look at the system logs shows NFS timeouts at 23:08 but no network-related events (like spanning tree reconfigurations)
  - Central network monitoring shows "normal" traffic
  - What was going on? Did someone/something overload the NFS server? Database access? ...?
  - Server-based performance monitoring is very poor!
• Wouldn't it be nice to be able to have a close look at the network traffic around 23:05?
Traffic Monitoring and Recording (1)
• We can! Even with free software tools!
• Configure the switch to forward all traffic in the BaBar general-purpose VLAN/subnet to a monitoring port (SPAN)
• Standard protocol analyzers are no good: small buffers, and what to trigger on?
• Sun E 250 with 72 Gbyte disk and Gigabit Ethernet as traffic recorder and protocol analyzer
• Record packet headers into a "circular" disk buffer
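The "circular" disk buffer can be pictured as a sequence of dump files in which the oldest files are discarded once a disk budget is exceeded. The following is a minimal sketch of that idea only; the slides do not say how the buffer was actually implemented, and the bookkeeping structure, the function name `add_dump` and the file names are hypothetical.

```python
from collections import deque

def add_dump(buffer, name, size, budget):
    """Record a new dump file and evict the oldest ones to stay within budget.

    buffer : deque of (name, size_in_bytes) in recording order (hypothetical bookkeeping)
    budget : maximum total number of bytes the buffer may occupy
    Returns the names of the evicted files (a caller would delete them from disk).
    """
    buffer.append((name, size))
    evicted = []
    total = sum(s for _, s in buffer)
    # Drop the oldest files first, but always keep the file just written.
    while total > budget and len(buffer) > 1:
        old_name, old_size = buffer.popleft()
        evicted.append(old_name)
        total -= old_size
    return evicted
```

With a budget of 100 units and files of 40 units each, the third file added pushes the first one out, which is the "circular" behaviour the slide refers to.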
Traffic Monitoring and Recording (2)
• Use tcpdump (ftp://ftp.ee.lbl.gov) to capture packet headers and write them to files
• In our environment:
  - We can't monitor the real-time network: the switch backplane capacity could be exceeded at peak
  - We have 3 switches; however, at present we only monitor the switch to which the file server is connected
  - Typical captured data rate during normal operation: 4 Gbytes / hour
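As a rough check on how far back such a recording can reach, the 72 Gbyte recorder disk and the typical rate of 4 Gbytes / hour can be combined. This sketch assumes the whole disk is available to the buffer, so the result is an upper bound; the function name is illustrative.

```python
def retention_hours(disk_gbytes, rate_gbytes_per_hour):
    """Hours of traffic that fit on the disk at a constant capture rate."""
    return disk_gbytes / rate_gbytes_per_hour

# 72 Gbyte disk, 4 Gbytes / hour typical rate (both figures from the slides)
print(retention_hours(72, 4))  # 18.0
```

At the typical rate, a problem reported 45 minutes after the fact is therefore still well inside the recorded window.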
Analysis Tools (1)
• How to look at gigabytes of recorded network data?
  - Use tcpdump to filter the dump file (e.g. "host bbr-srv02 and host bbr-srv03 and port nfs") into a smaller file
  - Use tcpslice (ftp://ftp.ee.lbl.gov) to isolate time intervals from the dump files
  - Use tcptrace (http://jarok.cs.ohiou.edu/software/tcptrace/tcptrace.html) to automatically analyze TCP connections and plot throughput graphs
  - Look at low-rate events directly with tcpdump
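The first filtering step can be sketched as a small helper that builds the tcpdump command line. It relies only on the standard tcpdump options `-r` (read from a saved file) and `-w` (write raw packets to a file); the input file name `full.dump` and the helper name are hypothetical, and the filter expression is the one quoted on the slide.

```python
import shlex

def tcpdump_filter_cmd(infile, outfile, expression):
    """Build a tcpdump command that copies matching packets to a smaller file."""
    # -r: read packets from a saved dump file; -w: write raw packets to a file
    return ["tcpdump", "-r", infile, "-w", outfile, expression]

cmd = tcpdump_filter_cmd(
    "full.dump",          # hypothetical name of the recorded dump
    "srv02srv03.dump",    # smaller output file
    "host bbr-srv02 and host bbr-srv03 and port nfs",
)
# Print a shell-quoted form; subprocess.run(cmd) could execute it instead.
print(shlex.join(cmd))
```

Keeping the expression as a single list element avoids shell quoting problems when the command is later run with `subprocess.run`.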
Analysis Tools (2)
• Sample tcptrace output for a connection (NFS); much more info is available, edited to fit:

TCP connection 4:
    host g:        BBR-SRV03.SLAC.Stanford.EDU:32769
    host h:        BBR-SRV02.SLAC.Stanford.EDU:2049     <- NFS port on server
    complete conn: yes
    first packet:  Fri Jan 28 23:24:35.019938 2000
    last packet:   Fri Jan 28 23:24:35.027876 2000
    elapsed time:  0:00:00.007938
    total packets: 11
    filename:      srv02srv03.dump

                          g->h:         h->g:
    total packets:        6             5
    ack pkts sent:        5             5
    pure acks sent:       3             2
    unique bytes sent:    44            28
    actual data pkts:     1             1
    actual data bytes:    44            28
    data xmit time:       0.000 secs    0.000 secs
    idletime max:         4.4 ms        4.1 ms
    throughput:           5543 Bps      3527 Bps

Not much happened!
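The throughput figures in the sample are consistent with the unique bytes sent divided by the elapsed time of the connection. The sketch below recomputes them from the timestamps shown; it is a plausibility check on the numbers above, not a description of tcptrace internals, and the function name is illustrative.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S.%f %Y"

def throughput_bps(nbytes, first, last):
    """Bytes per second over the interval between two tcptrace timestamps."""
    t0 = datetime.strptime(first, FMT)
    t1 = datetime.strptime(last, FMT)
    return nbytes / (t1 - t0).total_seconds()

first = "Fri Jan 28 23:24:35.019938 2000"
last = "Fri Jan 28 23:24:35.027876 2000"
print(round(throughput_bps(44, first, last)))  # 5543 (g->h)
print(round(throughput_bps(28, first, last)))  # 3527 (h->g)
```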
Analysis Tools (3)
[Figure: throughput between two hosts. Yellow dots: instantaneous rate; quantization due to time resolution of packet time (GBit!). Red line: averaged rates.]
Analysis Tools (4)
• The network dump can, for example, answer the following questions (and many more):
  - Who (UID, GID) has read the 25 Gbyte data file over NFS?
  - Were NFS timeouts correlated with a high NFS transaction volume/rate?
  - Which hosts were accessing the file server?
  - Do we have hosts/software with configuration problems? (Wrong subnet masks, applications using incorrect subnet broadcast addresses)
• However, the analysis of the files is complicated; we'd like to have better tools!
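The question about incorrect subnet broadcast addresses can be illustrated with a small heuristic: a host that believes it is on a different (here: classful) prefix will send broadcasts to that prefix's broadcast address rather than the real one. This is a sketch under stated assumptions, not the analysis used by the authors; the subnet, the addresses (taken from the 192.0.2.0/24 documentation range) and the candidate prefix lengths are all hypothetical.

```python
import ipaddress

def wrong_broadcasts(correct_subnet, packets, candidate_prefixes=(8, 16, 24)):
    """Flag (src, dst) pairs where dst is the broadcast address src would
    compute with a wrong prefix length. Heuristic: only the given candidate
    prefixes are tried, and a unicast address can coincide by chance."""
    net = ipaddress.ip_network(correct_subnet)
    flagged = []
    for src, dst in packets:
        src_ip = ipaddress.ip_address(src)
        dst_ip = ipaddress.ip_address(dst)
        # Ignore foreign sources and correctly addressed broadcasts.
        if src_ip not in net or dst_ip == net.broadcast_address:
            continue
        for prefix in candidate_prefixes:
            if prefix == net.prefixlen:
                continue
            guess = ipaddress.ip_network(f"{src}/{prefix}", strict=False)
            if dst_ip == guess.broadcast_address:
                flagged.append((src, dst, prefix))
                break
    return flagged

packets = [
    ("192.0.2.10", "192.0.2.127"),  # correct broadcast for a /25
    ("192.0.2.10", "192.0.2.255"),  # broadcast of a host assuming /24
    ("192.0.2.10", "192.0.2.20"),   # ordinary unicast
]
print(wrong_broadcasts("192.0.2.0/25", packets))
```

In practice the (src, dst) pairs would come from the filtered dump files; extracting them is left out here.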
Real-Time Analysis of Traffic
• A very interesting and promising free tool is NTOP (www.ntop.org)
• Captures packets, analyzes the protocol headers in real time and dynamically generates web pages, e.g.:
  - Protocols and their distribution
  - Hosts, host info, data sources and destinations
  - Throughput graphs
  - Traffic matrix
• Still in development, not perfectly stable yet
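A traffic matrix of the kind listed above is simply the number of bytes exchanged per (source, destination) pair. The sketch below shows the accumulation on already-parsed header records; it does not reflect how NTOP is implemented, and the record layout and byte counts are hypothetical.

```python
from collections import defaultdict

def traffic_matrix(packets):
    """Sum bytes per (source, destination) pair.

    packets: iterable of (src, dst, nbytes) tuples (hypothetical record layout)
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for src, dst, nbytes in packets:
        matrix[src][dst] += nbytes
    return matrix

records = [
    ("bbr-srv03", "bbr-srv02", 44),
    ("bbr-srv02", "bbr-srv03", 28),
    ("bbr-srv03", "bbr-srv02", 100),
]
m = traffic_matrix(records)
print(m["bbr-srv03"]["bbr-srv02"])  # 144
```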
Real-Time Monitoring
[Figure: NTOP example]
Conclusions and Outlook
• Network traffic recording and analysis
  - is feasible (with some restrictions) even in high-performance switched network environments; we are looking forward to the next generation of switches and workstations capable of monitoring at gigabit speeds
  - has proven very helpful in understanding host and network performance problems and in troubleshooting the computing infrastructure
• Powerful free software tools are available,
  - but multiple command-line-based programs make the analysis of network traffic log files quite a complicated procedure
• The ultimate tool would be a PAW-like program for networks that allows filtering and plotting with a simple command language