Transcript UPS - VUT

Brno University of Technology
CESNET z.s.p.o
University Campus Network Monitoring in Everyday Life
Tomáš Podermański, [email protected]
Brno University of Technology
• http://www.vutbr.cz
• One of the largest universities in the Czech Republic
• Founded in 1899; the 110th anniversary will be celebrated this year
• 20,000 students and 2,000 employees
• 9 faculties
• 6 other organisational units
• Student dormitories for 6,000 students
VUT FP, FEKT, Kolejní 4
VUT Koleje, Kolejní 2
VUT FCH, FEKT, Purkyňova 118
VUT Koleje, Mánesova 12
VUT FEKT, Technická 8
VUT FIT, Božetechova 2
VUT FSI, Technická 2
AV VFU, Palackého 1/3
VUT TI, Technická 4
VUT Koleje, Purk.
MU CESNET , Botanická 68a
AV ČR UPT
MZLU, Tauferova
VUT, Kounicova 67a
VUT Koleje , Kounicova 46/48
AV ČR UFM
VUT Rektorát, Antonínská 1
VUT FAST, Veveří 95
VUT FaVU, Údolní 19
VUT , Gorkého 13
VUT FEKT, Údolní 53
VUT FA, Poříčí 5
MU, Vinařská 5
VUT FaVU, Rybářská 13
AV ČR, Rybářská 13
Physical Layer
• 24 sites connected to each other
• Each site is connected from at least two directions (by separate cables)
• Over 100 km of optical cables
• Most of the cables are owned by the university
IPv4 layer
• The network cores are based on Hewlett Packard devices
• OSPF-based routing
• PIM-SM and PIM-DM are used for multicast
• Most of the traffic is transported through this network
IPv6 layer
• IPv6 functionality on HP devices is available as a beta release
• Temporary solution based on 3Com devices or PC routers with XORP
• A dedicated IPv6 switch/router sits alongside the main IPv4 switch/router
• VLANs are used for connections between IPv6 routers
• A temporary low-cost solution until the main devices have full IPv6 support
Basic monitoring, active vs. passive
• Active monitoring
– We send probe data and get a response
– A probe of the device, network, etc.
• Passive monitoring
– An observer of the device, network, etc.
[Diagram: monitoring types shown are tomography, statistical, NetFlow, service availability, environment, network condition, service condition (response, …), power status, cooling systems, interface status, log processing]
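The active approach (send probe data, wait for a response) can be illustrated with a minimal sketch. It assumes a plain TCP connection attempt is an acceptable probe for the monitored service; the function name `probe_tcp` and the timeout value are illustrative and do not come from the slides.

```python
import socket
import time


def probe_tcp(host, port, timeout=2.0):
    """Actively probe a TCP service.

    Returns (reachable, elapsed_seconds). The elapsed time is a rough
    stand-in for the "service condition (response, ...)" measurement.
    """
    start = time.monotonic()
    try:
        # create_connection performs the TCP handshake; success means
        # something is listening and answering on that port.
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "down".
        reachable = False
    return reachable, time.monotonic() - start
```

A passive check, by contrast, would only read counters or logs that the device produces anyway, without generating any traffic of its own.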
Components in a Monitoring System
Agent and protocol
• SNMP agent
– Get, Set, Walk, Traps
• NetFlow, sFlow, IPFIX probe
– Accumulated statistics
• For many systems, a specialized protocol based on the main system
• Role of a cache on the agent
• Active monitoring
– We use an appropriate protocol or data depending on the monitored service
– Proxy service (view from another point)
Components in Monitoring System
Manager & Frontend
• The manager collects and processes data from agents
• Data are stored and archived in a datastore
– SQL, RRD, …
• User interface
– Web, application
– Reports, SLA, …
– Configuration
– Historical view
• System of alerts
– Email, SMS, phone call
• The most popular systems
– Zabbix, Nagios, OpenView, nfsen/nfdump, flow-tools, rrdtool, mrtg, cacti, munin, …
Quiz
What causes most of the trouble in IT?
– Power supply of systems
• Overloaded circuits
• Unmanaged UPS units
• Messy electrical installations
• An improper power supply can be a booby trap
– Cooling systems
• Absence of preventive monitoring
• Frozen units
• Units jammed by foliage
• …
Physical infrastructure
LAYER 0,1
Power Supply with 1 + 1 Redundancy
[Diagram: 2x 16A supply, UPS I, UPS II, ATS, PDU I, PDU II]
Power Supply with 1 + 1 Redundancy
Monitored values:
• PDU I, PDU II: load, voltage
• ATS: load, voltage on source 1, voltage on source 2, selected source
• UPS I, UPS II: load, input voltage, output voltage, battery status
Power System with 1 + 1 Redundancy
[Diagram: 2x 16A supply, UPS, ATS]
Monitored values:
• UPS: load, current, input voltage, output voltage, battery status
• ATS: load, current, voltage on source 1, voltage on source 2, selected source
Power System with 1 + 1 Redundancy
• Overloaded circuit, tripped circuit breaker
[Diagram: 2x 16A supply, UPS, ATS]
Power System with 1 + 1 Redundancy
• When the power comes back on...
• In a few minutes the UPS battery is low
• The second circuit is overloaded, tripped circuit breaker
[Diagram: 2x 16A supply, UPS, ATS]
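The failure scenario on these slides (one breaker trips, the whole load moves to the second circuit, and that one trips too) can be checked ahead of time from the monitored currents. The following is a minimal sketch, assuming two circuits with 16 A breakers as labelled in the diagram; the 80 % safety margin and the function name `failover_report` are assumptions made for illustration, not values from the talk.

```python
def failover_report(load_a_amps, load_b_amps, breaker_amps=16.0, margin=0.8):
    """Check whether one circuit could carry the load of both.

    load_a_amps, load_b_amps: currently measured current on each circuit.
    breaker_amps: breaker rating ("2x 16A" on the slides).
    margin: fraction of the rating we are willing to use (assumed value).
    """
    total = load_a_amps + load_b_amps
    limit = breaker_amps * margin
    return {
        "total_amps": total,
        "limit_amps": limit,
        # Safe only if the surviving circuit can take the combined load.
        "safe": total <= limit,
    }


if __name__ == "__main__":
    # Each circuit looks fine on its own (9 A < 16 A),
    # but after a failover the survivor would see 18 A.
    print(failover_report(9.0, 9.0))
```

The point of the check is that each circuit can look healthy in isolation while the pair is not actually redundant.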
Cooling Systems
[Diagram: monitoring system receiving unit status (LonWorks, SNMP) and temperature (SNMP)]
• In many cases the cooling system is part of the building.
• The majority of cooling systems are difficult to monitor.
• Some devices support monitoring, but it costs a lot of money.
– In many cases the monitoring is more expensive than the cooling device itself.
– There is no standard interface (RS485 with a closed protocol).
– Some devices have a binary output (via relay) that indicates both the error and the running state.
• Conversion to SNMP is possible.
• Another, and the easiest, solution: monitor the temperature in the communication room.
• A thermometer with an SNMP output.
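Once room temperature is available as a series of readings, a simple threshold with hysteresis is enough to raise and clear an alarm without flapping. This is a sketch only: the thresholds (30 °C to raise, 27 °C to clear) and the function name are assumed for illustration, and how the readings are fetched from the thermometer is left out.

```python
def temperature_alarms(readings_c, high_c=30.0, clear_c=27.0):
    """Turn a series of temperature readings into alarm events.

    An alarm is raised when a reading reaches high_c and cleared only
    when a later reading drops to clear_c or below (hysteresis).
    Returns a list of (index, state, temperature) tuples.
    """
    alarm = False
    events = []
    for index, temp in enumerate(readings_c):
        if not alarm and temp >= high_c:
            alarm = True
            events.append((index, "ALARM", temp))
        elif alarm and temp <= clear_c:
            alarm = False
            events.append((index, "OK", temp))
    return events


if __name__ == "__main__":
    print(temperature_alarms([24, 29, 31, 30.5, 28, 26.5, 25]))
```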
Monitoring in Data Center Rooms
• More complex electrical installation
• Having a UPS and an ATS in every rack is inefficient
• Devices with 3-phase power
• Circuits are divided into 3 groups (direct, genset, UPS)
• More detailed information about the electricity distribution is very useful.
• It is necessary to monitor whether the phases are balanced
– Otherwise the genset could break down
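Phase balance can be reduced to a single number for alerting. The sketch below uses one common definition (largest deviation from the mean current, as a percentage of the mean); both this definition and the 10 % alert threshold are assumptions, since the slides do not say how balance is evaluated.

```python
def phase_imbalance_percent(l1_amps, l2_amps, l3_amps):
    """Largest deviation of any phase current from the mean, in percent."""
    currents = (l1_amps, l2_amps, l3_amps)
    mean = sum(currents) / 3
    if mean == 0:
        # No load at all: nothing to balance.
        return 0.0
    return max(abs(current - mean) for current in currents) / mean * 100


def is_balanced(l1_amps, l2_amps, l3_amps, threshold_percent=10.0):
    """True if the imbalance stays within the (assumed) threshold."""
    return phase_imbalance_percent(l1_amps, l2_amps, l3_amps) <= threshold_percent
```

For example, currents of 12 A, 10 A and 8 A have a mean of 10 A and a largest deviation of 2 A, which gives an imbalance of 20 %.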
Power in Data Center Rooms
[Diagram: main power, genset, ATS, UPS with bypass, HVAC, and devices in racks, with current (A) and voltage (V) measurement points]
[Graphs: temperature in datacenter]
Server Monitoring
[Diagram: servers report to the monitoring system via SNMP, IPMI, and other interfaces]
• Hardware
– Manufacturers’ software support is required (Dell OpenManage, HP
InsightControl, …)
– Chassis temperature
– Fan condition
– Power status
• Operating system
– CPU, load, memory, utilization, processes
• Disk subsystem
– External disk array with its own management port
– RAID status
– Disk condition (S.M.A.R.T.)
Network Device Monitoring
[Diagram: front panel of a ProCurve Switch 4208vl-72GS (J9030A) with 24p Gig-T (J8768A) and Gig-T/SFP (J9033A) vl modules, polled by the monitoring system via SNMP]
• Hardware
– Chassis temperature
– Fan condition
– Power status
• State of the operating system
– CPU
– Load
– Memory
Network Connection – L1 Monitoring
[Diagram: front panels of two ProCurve Switch 4208vl-72GS (J9030A) chassis with vl modules, connected by a link]
• Port status
– Link UP/DOWN
– Speed
– Errors on interfaces
– Traffic on interfaces
• Remote device status
– LLDP + data from MIB
– Remote interface, remote device, …
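Errors and traffic on interfaces are exposed as ever-increasing counters, so a monitoring system works with the difference between two polls. The sketch below assumes 32-bit counters that wrap around at 2^32 and uses hypothetical dictionary keys (`errors`, `packets`) for the polled values; fetching them over SNMP is not shown.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter samples, tolerating one wrap-around."""
    return (current - previous) % modulus


def error_ratio(previous_sample, current_sample):
    """Share of errored packets between two polls of one interface.

    Each sample is a dict with hypothetical keys:
      "errors"  - packets dropped because of errors
      "packets" - packets delivered successfully
    """
    errors = counter_delta(previous_sample["errors"], current_sample["errors"])
    packets = counter_delta(previous_sample["packets"], current_sample["packets"])
    total = errors + packets
    if total == 0:
        # No traffic in the interval, so no meaningful ratio.
        return 0.0
    return errors / total
```

A port whose error ratio rises between polls is a candidate for a bad cable or transceiver even while the link still reports UP.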
LAYER 2
Network Connection – L2 Monitoring
• L2 monitoring
– An L2 ping could be very useful
– We have to use information obtained from other layers (L1, L3)
– Unfortunately, there is no simple way to check connectivity on a single VLAN
– One option is to obtain some information from the MIB, but it’s not sufficient
• STP/MSTP information, root bridge
• VLANs on interfaces
Network Connection – L3 monitoring
[Diagram: data flowing between 147.229.6.1 and 147.229.6.2]
• L3 monitoring
– ICMP and ping are still the most important tools
– The problem is how to detect broken paths (the routing protocol usually masks any problem)
• Check the routing protocol state
• ICMP using source routing
– Flow-based monitoring
– Multicast monitoring
Network Connection – L3 monitoring
[Diagram: routers in the roles Master, Backup, DR, and BDR]
• L3 monitoring
– Checking that a router has the proper neighbors
– OSPF-MIB (RFC 4750)
• ospfNbrRtrId
– VRRP-MIB (RFC 2787)
• vrrpOperAdminState, vrrpOperState, vrrpOperMasterIpAddr
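The neighbor check itself is a set comparison between the router IDs we expect and the ones the router actually reports (for example the values of ospfNbrRtrId). The sketch below assumes those values have already been collected; the function name and the example addresses (taken from the documentation range 192.0.2.0/24) are illustrative only.

```python
def check_ospf_neighbors(expected_router_ids, observed_router_ids):
    """Compare expected OSPF neighbors with those a router reports.

    Both arguments are iterables of router IDs as dotted strings.
    Returns which expected neighbors are missing and which reported
    neighbors were not expected.
    """
    expected = set(expected_router_ids)
    observed = set(observed_router_ids)
    return {
        "missing": sorted(expected - observed),
        "unexpected": sorted(observed - expected),
    }


if __name__ == "__main__":
    report = check_ospf_neighbors(
        ["192.0.2.1", "192.0.2.2"],   # what the design says
        ["192.0.2.2", "192.0.2.9"],   # what the router reports
    )
    print(report)
```

The same comparison works for VRRP by checking the reported master address against the router that is supposed to be master.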
Multicast Monitoring
• Quite a demanding task
– For each stream the <S,G> path has to be created
– A continuously received and transmitted stream does not necessarily reveal a problem on the RP
– Almost impossible to monitor the local infrastructure
• The only known tool: Multicast Beacon
– Written in Perl
– Dead project
• Last release in 2006
• No VLAN support and no support for multiple interfaces on a single host
• Homepage unavailable
• Our own solution: mcwatch
Multicast Agents
• Data is periodically sent to a server
[Diagram: two multicast agents, each a stack of VLAN, POSIX socket, and application layers; one runs Multicast Beacon, the other mcwatch]
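What an agent on top of a POSIX socket has to do can be sketched in a few lines: join a group on a chosen interface, receive numbered packets, and report how many were lost. This is not the mcwatch implementation, which the slides do not describe; the function names and the idea of sequence-numbered test packets are assumptions for illustration.

```python
import socket
import struct


def membership_request(group, interface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq structure for IP_ADD_MEMBERSHIP.

    Passing a specific interface_ip selects the local interface,
    which is how one host can watch several VLANs.
    """
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface_ip))


def open_receiver(group, port, interface_ip="0.0.0.0"):
    """Open a UDP socket and join the multicast group on one interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, interface_ip))
    return sock


def loss_ratio(sequence_numbers):
    """Fraction of packets missing between the lowest and highest sequence seen."""
    if not sequence_numbers:
        return 1.0  # nothing received at all
    expected = max(sequence_numbers) - min(sequence_numbers) + 1
    received = len(set(sequence_numbers))
    return 1 - received / expected
```

An agent would periodically send the loss ratio per group and interface to the central server.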
NetFlow Monitoring
[Diagram: the university network connected to the CESNET PoP (CRS-1/16) over 10G Ethernet]
• Two NetFlow probes monitor both external connectivity lines
• The NetFlow probes are connected directly to the optical fiber via a TAP
• Wire-speed accelerated probes (FlowMon)
Flow Processing
[Diagram: nfcapd, datastore, aggregated data in SQL, backbone administrator, all administrators]
Flow Processing
• Data are stored on a storage server
– Data are kept for 30 days
– Used for analysis of security incidents and for statistical purposes
– The big challenge: how to select useful data and provide them to the people who need them
• Security matters
– Full data are accessible only to a small, trusted group of administrators
– For other IT staff (faculty administrators, IT managers), summarised data are accessible via a web interface
• Data are processed by common open source tools
– nfdump
– A lot of trouble, but we don’t have a better solution
– We are trying to optimise the current implementations
– Several theses on this topic are in progress
• Commercial tools: the situation is no better
– Usually plenty of nice charts and statistics
– But performance is often terrible (sampling is required)
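The split between full data and summarised data can be illustrated with a small aggregation step: individual flow records stay with the trusted group, while other staff see only totals. The sketch assumes flow records have already been exported into dictionaries with hypothetical keys `src` and `bytes`; it does not reproduce nfdump's own formats or options.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Summarise flow records into the n sources with the most traffic.

    flows: iterable of dicts with hypothetical keys "src" (source
    address) and "bytes" (bytes transferred in that flow).
    Returns a list of (source, total_bytes), largest first.
    """
    totals = Counter()
    for flow in flows:
        totals[flow["src"]] += flow["bytes"]
    return totals.most_common(n)


if __name__ == "__main__":
    sample = [
        {"src": "192.0.2.10", "bytes": 1000},
        {"src": "192.0.2.20", "bytes": 500},
        {"src": "192.0.2.10", "bytes": 250},
    ]
    print(top_talkers(sample, n=2))
```

The summary no longer reveals who talked to whom, which is what makes it suitable for a wider audience.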
Transport, application and the others
LAYER 4-7
Layer 7
• Many custom plugins
– Eduroam/RADIUS monitoring
– DNS
– Database status
– Backup server status
– …
• Collected data are available to administrators at different levels
– Eduroam/RADIUS logs
– Mail logs (DNSBL, spam classification, statistics)
– WiFi/VPN connections
– …
Components in the Monitoring System
[Diagram: components of the monitoring system. Labels: Zabbix, Spinel, xwho, xhis, NetIs, xmon, nfdump, SNMP, ICMP, radius, mysql, netflow, wifilogs, radiuslogs, maillogs, honeypots, incidents, …]
Monitoring: Layers & Technology
• Application: RADIUS, DNS, other services
• Internet: ICMP tests using the source routing option; OSPF and VRRP peers; multicast traffic monitoring
• Link: port statistics, link status, number of errors; LLDP neighbour
• Physical: power, cooling systems, temperature; servers and disk arrays; network devices
• Tools: zabbix, xwho, xhis, NetIs, nfdump
• Technologies: SNMP, zabbix, NetFlow, radius, ICMP, ICMPv6, Spinel, …
Current problems
• SNMP protocol
– No alternative
– Many bugs in various implementations
• Absence of an L2 testing tool
• NetFlow
– We have plenty of data, but nobody knows how to process it effectively
– In some cases more detailed information is required than flows provide
• IPv6 brings some new problems and challenges