Transcript Slide 1

NMS requirement/recommendations
Belgrade, October 21 2009
Vidar Faltinsen, UNINETT
This talk reflects lessons learned up through the years
(19 years) of NMS development at UNINETT and
NTNU (the Norwegian University of Science and Technology)
and other universities around Norway
Lessons learned from a number of committed people
always aiming to improve network operations
2
Our context

The network is complex
  A lot of equipment
  Heaps of traffic around the clock
No system is perfect
  Errors will occur – incidents will hit us
Motto: be proactive and ahead
  The user should not call you – you should be the first to know!
  Keep in mind: If information is good (posted at the right time, kept up to date)… the user is (more) patient!
3
Avoid a monolithic NMS



Not an absolute rule, but be a sceptic
If the system is too massive it tends to set the agenda.
  → You should shape the system, not the other way around.
  → If too many resources must be invested in understanding the system…
  → …then even more resources must be put into adapting the system to your needs
The NMS has no intrinsic value…
  …it should be a useful tool for you
But remember nothing is for free – you must in any case invest in understanding what your tools actually do
4
Not one tool - a set of tools

Special purpose tools with limited scope are good
Example of tool categories:
  inventory systems
  trouble ticket systems
  status monitors
  measurements (and threshold monitors)
  server/services focused
  netflow analysis
  security-focused
  configuration tools
  simulation
Tools should (ideally) not overlap
Have a well defined single authority as source for your data sets, i.e.:
  the set of equipment (with attributes) we manage is defined in one place
  similarly for our locations (with attributes), etc.
Autodetection is good
  But in a controlled environment (be aware of weak SNMPv2 security)
5
Avoid complexity


A given tool should manage your whole domain
Avoid a hierarchy of managers if possible
  SNMP polls can be done in parallel
  Bandwidth is not a bottleneck
  Throw "iron" (CPU, memory, disk I/O, battery backed disk controller) at NMS utilization problems
If necessary segregate the database on a separate system, possibly also the web front end
  …but consider redundancy (more later)
6
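The point that SNMP polls can run in parallel from a single manager can be sketched as below. This is a minimal illustration, not a real poller: `poll_device` is a hypothetical stand-in that returns a dummy value where a real tool would issue SNMP requests, and the host names and worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def poll_device(host):
    """Hypothetical stand-in for one SNMP poll; returns (host, result).

    A real poller would issue SNMP GET requests here via an SNMP library.
    """
    return host, {"sysUpTime": 0}


def poll_all(hosts, poll=poll_device, workers=32):
    """Poll every host concurrently from one manager and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(poll, hosts))


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    print(poll_all(["sw1.example.net", "sw2.example.net"]))
```

Because each poll mostly waits on the network, a thread pool on one well-equipped machine can cover many devices without a hierarchy of managers.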
Place your monitor strategically

A monitor placed in the periphery of your network is more likely to be cut off
  place in a central (network wise) location
  redundant network access (VRRP, HSRP…)
Redundant power, incl. a redundant source of power (UPS/ideally standby generator)
Monitor the monitor!
Use SMS for alarms in addition to email
  Connect the SMS sending device physically to the NMS
7
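"Monitor the monitor" could be as simple as a watchdog that checks when the NMS last reported in and sends an SMS if the report is stale. The sketch below assumes a heartbeat timestamp is available from somewhere; the function names, the 120 second limit and the `send_sms` callback are all hypothetical.

```python
import time


def monitor_is_alive(last_heartbeat, now=None, max_age=120.0):
    """Return True if the NMS reported a heartbeat within max_age seconds."""
    if now is None:
        now = time.time()
    return (now - last_heartbeat) <= max_age


def check_monitor(last_heartbeat, send_sms, now=None, max_age=120.0):
    """Call send_sms (a hypothetical out-of-band notifier) when the heartbeat is stale.

    Returns True if an alarm was sent, False otherwise.
    """
    if monitor_is_alive(last_heartbeat, now, max_age):
        return False
    send_sms("NMS heartbeat missing")
    return True
```

The watchdog should run on a different machine than the NMS it watches, otherwise both fail together.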
Classify your alarms


Think through: What are the most vital alarms? What is less important?
Make sure the most vital alarms actually reach you!
  and do not drown in 10,000 other alarms…
  or stay stuck in an overworked NMS…
Red and green lamps are good
  in large environments in a hierarchical display
8
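One way to picture alarm classification and a hierarchical red/green display is sketched below. The three severity levels, the tuple layout of an alarm and the function names are assumptions made for the illustration, not something the slides prescribe.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Assumed three-level classification; real policies may use more levels."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def vital_first(alarms, limit=None):
    """Sort (severity, message) alarms so the most severe come first."""
    ordered = sorted(alarms, key=lambda alarm: alarm[0], reverse=True)
    return ordered if limit is None else ordered[:limit]


def rollup(statuses):
    """Lamp for a group in a hierarchical display: red if any member is red."""
    return "red" if "red" in statuses else "green"
```

With a rollup per building or region, an operator sees one lamp per group and drills down only where it is red.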
Use a single event/alarm system

The set of tools/monitors you use should all report to one event/alarm system
  e.g. using SNMP traps or email or…
The central event/alarm system should scale
  coping with many events
  make priorities / sort out important alarms
Correlate events – but be realistic
  Detect "in shadow" scenarios (devices unreachable only because an upstream device is down)
  Classify stateful alarms in pairs (down/up)
  Suppress flapping alarms (line going up, down, up, down…)
  Use hysteresis for threshold alarms. Set high and low thresholds.
  Again: keep robustness.
    Rather one alarm too many than missing an important one
Allow a flexible setup for alarm profiles
  every person tends to have his own preferences… (but have a company policy)
  alarms at night/weekend vs daytime
  important alarms vs less important
  alarms within vs outside the person's scope of duty/responsibility
9
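Hysteresis and flap suppression can be sketched in a few lines. The example is a minimal illustration under assumed parameters: the class and function names, the 300 second window and the limit of four transitions are hypothetical choices.

```python
class HysteresisAlarm:
    """Threshold alarm with separate raise (high) and clear (low) levels."""

    def __init__(self, high, low):
        if low >= high:
            raise ValueError("low threshold must be below high threshold")
        self.high = high
        self.low = low
        self.active = False

    def update(self, value):
        """Feed one sample; return 'raise', 'clear' or None."""
        if not self.active and value >= self.high:
            self.active = True
            return "raise"
        if self.active and value <= self.low:
            self.active = False
            return "clear"
        return None


def is_flapping(transition_times, now, window=300.0, max_transitions=4):
    """True if more than max_transitions state changes fall inside the window."""
    recent = [t for t in transition_times if now - t <= window]
    return len(recent) > max_transitions
```

With `high=90` and `low=70`, a value hovering between 85 and 92 raises one alarm instead of one per sample, and the alarm clears only once the value drops to 70 or below.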
Redundant NMS


Single point of failure is never good
Complete redundancy is not realistic
  Too expensive
  Complexity may bite you
Three possible ways to go:
  1. Monitor the monitor. Have a spare machine. Have backup. 24x7 guard on duty. Replace ASAP.
  2. Do continuous live replication of the NMS machine to a hot spare.
     Manually (with few steps) set the hot spare in operation (inherit the NMS IP address)
  3. Use anycast combined with live replication
     Secondary NMS automatically takes over when primary NMS dies
10
Without numbers you are nothing






When an incident occurs – do you have enough data to investigate – and actually pinpoint the cause?
Disk is cheap
Collect heaps of statistical data
Have a scheme for compressing data as time goes by (RRD/Stager method)
Focus on good search tools, reports and visualisation methods to make traffic/statistical anomalies easy to detect
  → Isolation and classification of an error tend to consume most of the recovery time
Autodetection of thresholds and more complex anomaly detection is even better
  → Remember to moderate the total flow of alarms (classify alarms)
11
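Compressing data as it ages usually means keeping recent samples at full resolution and averaging older ones into coarser steps, which is the idea behind round-robin databases. The sketch below shows only that averaging step; it is not RRDtool's or Stager's actual implementation, and the function name is hypothetical.

```python
def consolidate(samples, factor):
    """Average each run of `factor` consecutive samples into one value.

    A trailing partial run is averaged over the samples it contains.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    out = []
    for start in range(0, len(samples), factor):
        chunk = samples[start:start + factor]
        out.append(sum(chunk) / len(chunk))
    return out
```

For example, turning 5-minute samples into hourly averages corresponds to `factor=12`. Averaging hides short peaks, so a scheme that also keeps the maximum per interval may be worth considering.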
Logs are gold, scripts as well

Log, log, log
  Syslog is also a management system
Small (shell) scripts can be gold
  A good idea can be only a few code lines away…
  A culture that motivates creativity and allows continuous implementation of new scripts/add-ons will step by step improve the overall management process!
12
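As an illustration of how small such a script can be, the sketch below counts matching syslog lines per host. It is written in Python rather than shell to keep the examples in one language. It assumes the traditional syslog layout `Mon DD HH:MM:SS host message...`, and the default search string is only an example.

```python
from collections import Counter


def count_by_host(lines, needle="LINK-3-UPDOWN"):
    """Count log lines containing `needle`, grouped by the host field.

    Assumes the layout 'Mon DD HH:MM:SS host message...', so the host is
    the fourth whitespace-separated field. Shorter lines are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if needle in line and len(fields) > 3:
            counts[fields[3]] += 1
    return counts
```

Sorting the result with `counts.most_common()` gives a quick list of the devices with the most link state changes.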
Commit to open source




Open source development works
Sharing ideas and running code widely improves the quality
Distributed contributions can speed up implementation
(Poorly documented) single person projects will eventually die
13
Adopt good naming standards


Do not underestimate the value of sound names for your equipment, rooms and locations
The name of the device should in itself give an idea of what the device is (does) and where it is placed
  Example: mtfs-272-sw (a switch in area "mtfs", wiring closet "272")
Also use a thought-through naming standard for router interfaces and switch ports
14
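A naming standard pays off when tools can parse it. The sketch below assumes the pattern suggested by the example `mtfs-272-sw`, namely area, wiring closet and device type separated by hyphens. The regular expression and the field names are assumptions, since the slides give only the one example.

```python
import re

# Assumed pattern: <area>-<closet>-<type>, e.g. mtfs-272-sw
NAME_RE = re.compile(r"^(?P<area>[a-z0-9]+)-(?P<closet>[a-z0-9]+)-(?P<kind>[a-z]+)$")


def parse_device_name(name):
    """Split a device name into area, wiring closet and device type."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow the naming standard: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_device_name("mtfs-272-sw"))
```

The same check can run when a device is added to the inventory, so that non-conforming names are rejected early.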
NMS Security

Restrict access to NMS to authorized crew only
  both network access and physical access
Isolate management IP addresses of switches and base stations to dedicated subnets
Firmly restrict SNMP access to the network equipment – only from the NMS(es).
  remember SNMP v2 security is weak
Be even more restrictive if you allow/use SNMP Write
  consider SNMP v3 or Netconf
15
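The rule "SNMP only from the NMS(es)" is normally enforced with access lists on the equipment itself. The logic of such a check can be sketched as below, for instance to audit a proposed access list. The addresses are documentation examples (192.0.2.0/24 and 198.51.100.0/24), and the names are hypothetical.

```python
import ipaddress

# Hypothetical NMS address; replace with the real NMS host(s).
NMS_HOSTS = [ipaddress.ip_network("192.0.2.10/32")]


def snmp_allowed(source, allowed=NMS_HOSTS):
    """Return True only if the source address belongs to an allowed NMS network."""
    addr = ipaddress.ip_address(source)
    return any(addr in net for net in allowed)


if __name__ == "__main__":
    print(snmp_allowed("192.0.2.10"))    # the NMS itself
    print(snmp_allowed("198.51.100.7"))  # any other host
```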
MIB requirements

Your network equipment should support:
  RFC 3418: SNMPv2-MIB (system)
  RFC 2863: IF-MIB (interfaces, incl. 64 bit counters)
  RFC 4293: IP-MIB (IP-interfaces and ARP; IPv4 and IPv6)
  RFC 4133: ENTITY MIB (modules, optics, software, serial numbers)
  RFC 4188: BRIDGE-MIB (bridge table)
  RFC 4363: Q-BRIDGE MIB (bridge table per vlan, vlan config)
    Not supported by Cisco
  RFC 3635: Etherlike-MIB (duplex)
  RFC 2668: MAU-MIB (medium)
    Not supported by Juniper
    equipment support seems scarce (HP has support)
Your NMS should whenever possible use standard/IETF MIBs rather than vendor proprietary MIBs
16
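The reason to ask for 64 bit counters in IF-MIB is counter wrap. A 32 bit octet counter such as `ifInOctets` wraps after 2^32 bytes, which takes about 34 seconds at a sustained 1 Gbit/s, while the 64 bit `ifHCInOctets` does not wrap in practice. The sketch below shows the arithmetic; the function names are hypothetical, and only a single wrap between two samples can be corrected.

```python
def seconds_to_wrap(counter_bits, link_bps):
    """Time for an octet counter to wrap at a sustained link rate in bit/s."""
    return (2 ** counter_bits) * 8 / link_bps


def octet_rate_bps(prev, curr, seconds, counter_bits=64):
    """Bits per second between two octet counter samples.

    The modulo handles at most one counter wrap between the samples.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = (curr - prev) % (1 << counter_bits)
    return delta * 8 / seconds
```

If the polling interval is longer than the wrap time, a 32 bit counter can wrap several times unnoticed and the computed rate is silently too low.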
Key points – in summary








Be proactive
Detect important alarms early
Inform the users
Log, log, log (snmp collect)
Use a number of tools
Adopt good naming standards
Value the engineer – small scripts are gold
Educate your crew!
(in both NMS operations and procedures)
17