Scripts - Nagios
Nagios On-Call
Rotation
James Clark
Topics Discussed
About Me / My Monitoring History
Monitoring History at Current Company
Prerequisites
Current Company Setup
Scripts
Nagios Configuration
About Me
Have been in the IT industry since 1988
In 2004 became server group manager
Have been using Nagios since ~2003
Switched to XI ~2010 (And loved every part of it)
Changed jobs in August 2012 and quickly convinced
new company to purchase XI
About Me
Private web page is http://www.bandits-home-on-theweb.com
On that page you will find some of the
Nagios modifications I have done
History of Monitoring and Alerting at new job
Many monitoring applications spread throughout
the IT department
CCSS for iSeries
Foglight for DB
SCOM for Windows
Three separate Nagios Core servers
IBM NetCool
Many departments had no monitoring
All of the applications forward to NetCool and
NetCool then forwards alerts to AlarmPoint (xMatters)
AlarmPoint holds the on-call schedule for the many
different groups in the IT department
History of Monitoring and Alerting at new job
CCSS for iSeries: partial conversion to XI started
Foglight for DB: complete conversion to XI hoped for
SCOM for Windows: may develop a custom script to communicate back and forth; currently testing WMI and hoping to use that instead
Three separate Nagios Core servers: converted to a single XI server
History of Monitoring and Alerting at new job
IBM NetCool: being removed from the company
AlarmPoint: more than likely being removed from the company
One primary XI server currently, with 3 mod_gearman workers
One XI server for monitoring the primary XI server and a few other devices
One XI server in our DR web data center
History of Monitoring and Alerting at new job
Besides AlarmPoint, the on-call schedule is also kept in a
separate MS SharePoint site that DC Operations uses.
There is no full-time administrator for either NetCool or
AlarmPoint.
Once everything has been switched to NagiosXI,
significant savings will be realized.
One of the main hurdles to the switch is on-call
rotation for alerting.
On Call Data - Prerequisites
On-call information stored in some application
On-call information able to be exported from the
application in a specific format
A job scheduler to run the jobs
On Call Data – Our Setup
SharePoint site to store on-call schedule
SharePoint admin created an application to export
the data needed and send the files to an FTP server.
Two files are sent, one for primary and one for
secondary.
We use Control-M to schedule the above program
and the two Linux scripts.
The job runs daily at 8am. Our on-call rotation changes
Mondays at 8am.
If changes to the on-call schedule need to take effect
immediately, the job is run manually. Otherwise, they
wait until the next day at 8am.
On Call Data – Our Setup
Added an ID to the contacts table.
Added a short name to the On-Call Groups table.
Set the SharePoint site to alert me when any changes are made to those two tables so they can be mirrored in Nagios.
The scripts do handle blanks. This will be shown in a later slide.
On Call Data – Example files
Networking,network,smithj
System p Administration,aix_admins,doej
AE Direct,aed_infra,user1
Database,dba,clarks
System i Administration,system_i_admin,walenciejs
Wintel Administration,wintel_admins,hilderbrandr
System i Applications,system_i_apps,brownr
Client Server Applications,client_server,yatesp
DataWarehouse/Enterprise Rpts,datawarehouse,connerys
Store Applications,store_apps,probstj
The first field is what is displayed on the
SharePoint site and is the alias assigned in
Nagios. The second field is the name given to
the contact groups. The third field is of course
the ID of the user.
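As a quick sanity check of the three-field layout described above, the shell sketch below splits one line on commas and reports whether any field is blank. The function name `check_oncall_line` is hypothetical and not part of the original scripts; it only mirrors the blank check that the Perl script performs later.

```shell
#!/bin/sh
# Hypothetical helper: validate one line of the on-call export.
# Prints "ok <short_name> <user_id>" for a complete row, "skip" otherwise.
check_oncall_line() {
    # Split on commas: display name, contact group short name, user ID
    IFS=, read -r group short id <<EOF
$1
EOF
    if [ -n "$group" ] && [ -n "$short" ] && [ -n "$id" ]; then
        echo "ok $short $id"
    else
        echo "skip"
    fi
}

check_oncall_line "Networking,network,smithj"
check_oncall_line "DataWarehouse/Enterprise Rpts,datawarehouse,"
```

The second call has an empty user ID, so it would be reported as a row to skip.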
Scripts:
On Call Data – FTP Script
#!/bin/sh
HOST=xxxxxxx   #This is the FTP server's host name or IP address.
USER=xxxxxxx   #This is the FTP user that has access to the server.
PASS=xxxxxxx   #This is the password for the FTP user.
#Download both on-call files, then delete them from the FTP server
ftp -inv "$HOST" << EOF
user $USER $PASS
cd /nagiosftp
get primaryOnCall.txt
get secondaryOnCall.txt
delete primaryOnCall.txt
delete secondaryOnCall.txt
bye
EOF
exit 0
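Because the FTP script always exits 0, a failed transfer would not be noticed by the scheduler. A minimal sketch of a guard that could run between the download and the data manipulation step, so that the configuration is only rebuilt when both files actually arrived. The function name `have_both_files` and the directory argument are assumptions made for illustration; only the two file names come from the script above.

```shell
#!/bin/sh
# Hypothetical pre-check: succeed only if both export files exist and are non-empty.
# Argument: directory the FTP script downloaded into (defaults to the current directory)
have_both_files() {
    dir=${1:-.}
    [ -s "$dir/primaryOnCall.txt" ] && [ -s "$dir/secondaryOnCall.txt" ]
}

if have_both_files .; then
    echo "both files present, safe to rebuild contact groups"
else
    echo "export incomplete, leaving existing configuration alone"
fi
```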
On Call Data – Data Manipulation Script
#!/usr/bin/perl
#Remove old config files (keep the xi*, esc* and aed_* files)
system ("find /usr/local/nagios/etc/static -type f -not -name 'xi*' -not -name 'esc*' -not -name 'aed_*' | xargs rm");
#Process primary on-call file
open (INFILE, 'primaryOnCall.txt') or die $!;
while (<INFILE>) {
    chomp;
    #Fields: display name, contact group short name, user ID
    ($group, $alias, $id) = split(",");
    #Skip rows that have a blank field
    if (($alias ne '') && ($group ne '') && ($id ne '')) {
        open (OUTFILE, '>/usr/local/nagios/etc/static/' . $alias . '_oncall_pri.cfg') or die $!;
        print OUTFILE "define contactgroup{\n";
        print OUTFILE "contactgroup_name $alias" . "_oncall_pri\n";
        print OUTFILE "alias $group\n";
        print OUTFILE "members $id\n";
        print OUTFILE "}\n";
        close (OUTFILE);
    }
}
close (INFILE);
On Call Data – Data Manipulation Script(cont…)
#Process secondary on-call file
open (INFILE, 'secondaryOnCall.txt') or die $!;
while (<INFILE>) {
    chomp;
    #Fields: display name, contact group short name, user ID
    ($group, $alias, $id) = split(",");
    #Skip rows that have a blank field
    if (($alias ne '') && ($group ne '') && ($id ne '')) {
        open (OUTFILE, '>/usr/local/nagios/etc/static/' . $alias . '_oncall_sec.cfg') or die $!;
        print OUTFILE "define contactgroup{\n";
        print OUTFILE "contactgroup_name $alias" . "_oncall_sec\n";
        print OUTFILE "alias $group\n";
        print OUTFILE "members $id\n";
        print OUTFILE "}\n";
        close (OUTFILE);
    }
}
close (INFILE);
On Call Data – Data Manipulation Script(cont…)
#Change ownership and permissions of config files
system ("sudo /bin/chown apache:nagios /usr/local/nagios/etc/static/*.cfg");
system ("sudo /bin/chmod 777 /usr/local/nagios/etc/static/*.cfg");
#Delete data files
system ("rm primaryOnCall.txt");
system ("rm secondaryOnCall.txt");
#Restart Nagios
system ("sudo su -l nagios -c 'cd /usr/local/nagiosxi/scripts/ && ./reconfigure_nagios.sh'");
#Exit clean
exit 0;
On Call Data – List of Files Created
Because the secondary on-call entry for datawarehouse is blank in the file, only the primary file
for datawarehouse exists.
On Call Data – Files Created – Example Content
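Going by the print statements in the data manipulation script, the file written for a primary on-call line such as `Networking,network,smithj` would contain a single contact group definition. The shell sketch below only reproduces that output for illustration; the function name `emit_contactgroup` and its argument order are assumptions, not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper mirroring the Perl script's print statements.
# Arguments: display name, short name, user ID, suffix (pri or sec)
emit_contactgroup() {
    printf 'define contactgroup{\n'
    printf 'contactgroup_name %s_oncall_%s\n' "$2" "$4"
    printf 'alias %s\n' "$1"
    printf 'members %s\n' "$3"
    printf '}\n'
}

# Corresponds to network_oncall_pri.cfg in the static directory
emit_contactgroup "Networking" "network" "smithj" "pri"
```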
Nagios Configuration:
NagiosXI Configuration
No contacts or contact groups are assigned to the
hosts or services, unless someone should always receive
alerts, i.e. someone who needs to be alerted but is not a
member of the specific on-call group.
Users receive permissions to see hosts and services
by having an escalation for them.
Escalations must be created for both hosts and
services. Services do not inherit escalations the way they
inherit notifications.
NagiosXI Configuration(cont…)
Escalations are created as static config files;
otherwise Nagios would error on the empty
contact groups.
All members of a group go into an ALL group.
This is used to give users permissions.
The group manager goes into a BOSS group.
This is used for alerting the manager after the on-call
individuals fail to acknowledge an issue.
Static Configuration Example - Hosts
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_oncall_pri ; Created by script
    first_notification 1
    last_notification 0
    notification_interval 15
}
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_oncall_sec ; Created by script
    first_notification 2
    last_notification 0
    notification_interval 15
}
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_boss ; Created in XI and manager of group assigned as member
    first_notification 4
    last_notification 0
    notification_interval 15
}
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_all ; Created in XI and all members of group assigned as members
    first_notification 3
    last_notification 0
    notification_interval 15
}
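The four host escalations for a group differ only in the contact group and the first notification number, so they could be generated per group instead of typed by hand. A minimal sketch, assuming every group follows the naming pattern shown on this slide (`<short>_oncall`, `<short>_oncall_pri`, `<short>_oncall_sec`, `<short>_all`, `<short>_boss`). The function name `emit_host_escalations` is hypothetical and this generator is not part of the original setup.

```shell
#!/bin/sh
# Hypothetical generator for the four host escalations of one on-call group.
# Argument: the group's short name, e.g. "network"
emit_host_escalations() {
    short=$1
    # Each pair is <contact group suffix>:<first_notification>
    for pair in oncall_pri:1 oncall_sec:2 all:3 boss:4; do
        suffix=${pair%%:*}
        first=${pair##*:}
        printf 'define hostescalation{\n'
        printf '    hostgroup_name %s_oncall\n' "$short"
        printf '    contact_groups %s_%s\n' "$short" "$suffix"
        printf '    first_notification %s\n' "$first"
        printf '    last_notification 0\n'
        printf '    notification_interval 15\n'
        printf '}\n'
    done
}

emit_host_escalations network
```

The output could be redirected into a file in the static configuration directory; which file name to use is left open here.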
Static Configuration Example - Services
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_oncall_pri
    first_notification 1
    last_notification 0
    notification_interval 15
}
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_oncall_sec
    first_notification 2
    last_notification 0
    notification_interval 15
}
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_all
    first_notification 3
    last_notification 0
    notification_interval 15
}
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_boss
    first_notification 4
    last_notification 0
    notification_interval 15
}
The way we set it up, it uses the same hostgroup used for all the hosts and a wildcard for the service, to include all services.
This could get very complicated if different groups/individuals were needed on different services on the same host.
Static Configuration Example - Services
define serviceescalation{
    host_name *
    servicegroup_name dba_oncall
    contact_groups dba_oncall_pri,dba_oncall_sec
    first_notification 1
    last_notification 0
    notification_interval 15
}
define serviceescalation{
    host_name *
    servicegroup_name dba_oncall
    contact_groups dba
    first_notification 500
    last_notification 0
    notification_interval 15
}
Static Configuration Example - Services
The service escalations can be as simple as those on the last slide, or as complex as you can imagine. The
attached file is a great example of the complexity that is possible.
Questions?
James Clark
Systems Monitoring Administrator